"Design a chat app" (WhatsApp, Messenger, Slack) is the canonical real-time system design question. The crux is maintaining millions of persistent connections and routing each message to the exact server holding the recipient's connection — plus presence, delivery receipts, and durable history.
Here's the full walkthrough with a diagram, covering the WebSocket gateway, connection routing, and delivery guarantees.
1. Clarify the requirements
Functional requirements
- One-to-one and group messaging
- Online/last-seen presence
- Delivery and read receipts
- Message history and sync across devices
- Push notifications when the recipient is offline
Non-functional requirements
- Real-time delivery (low latency)
- Durable — messages are never lost
- Scale to millions of concurrent connections
- Ordered delivery within a conversation
Back-of-the-envelope scale: Assume 50M concurrent connections and billions of messages/day. The hard part isn't throughput — it's holding tens of millions of long-lived connections and finding the right one per message.
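The estimate above can be made concrete with a few lines of arithmetic. This is a rough sketch: only the 50M connection figure comes from the text, while the 2 billion messages/day and the 100,000 sockets per gateway are illustrative assumptions.

```python
# Back-of-the-envelope sizing. Only the 50M connection count comes from the
# text; the other two inputs are illustrative assumptions.
CONCURRENT_CONNECTIONS = 50_000_000
MESSAGES_PER_DAY = 2_000_000_000      # assumption: "billions" taken as 2B
CONNECTIONS_PER_GATEWAY = 100_000     # assumption: per-server socket budget

SECONDS_PER_DAY = 24 * 60 * 60

avg_messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
# Ceiling division: a partially filled server is still a server.
gateways_needed = -(-CONCURRENT_CONNECTIONS // CONNECTIONS_PER_GATEWAY)

print(f"average messages/s: {avg_messages_per_second:,.0f}")
print(f"gateway servers:    {gateways_needed}")
```

Under these assumptions the average write rate is in the low tens of thousands per second, while the connection count alone calls for hundreds of gateway servers, which is why the connection layer dominates the design.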
2. API design
Chat is connection-oriented, not request/response. Clients hold a WebSocket; messages flow as events over it.
# Persistent connection
WS /connect (upgraded, authenticated)
# Events over the socket
-> { type: "send", to: "userB", text: "hi" }
<- { type: "message", from: "userA", text: "hi", id, ts }
<- { type: "receipt", messageId, status: "delivered" | "read" }
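As a minimal sketch of how a server might turn these events into one another, assuming JSON framing over the socket: the helper names `build_message_event` and `build_receipt_event` are hypothetical, and the choice of a UUID for `id` and epoch milliseconds for `ts` is an assumption, since the text does not specify either.

```python
import json
import time
import uuid


def build_message_event(sender: str, send_event: dict) -> dict:
    """Turn a client's "send" event into the "message" event for the recipient."""
    if send_event.get("type") != "send":
        raise ValueError("expected a 'send' event")
    return {
        "type": "message",
        "from": sender,
        "text": send_event["text"],
        "id": str(uuid.uuid4()),        # assumption: server-assigned unique ID
        "ts": int(time.time() * 1000),  # assumption: epoch milliseconds
    }


def build_receipt_event(message_id: str, status: str) -> dict:
    """Build a receipt event; only the two statuses from the API are allowed."""
    if status not in ("delivered", "read"):
        raise ValueError(f"unknown receipt status: {status}")
    return {"type": "receipt", "messageId": message_id, "status": status}


incoming = json.loads('{"type": "send", "to": "userB", "text": "hi"}')
outgoing = build_message_event("userA", incoming)
receipt = build_receipt_event(outgoing["id"], "delivered")
print(json.dumps(outgoing))
print(json.dumps(receipt))
```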
3. High-level architecture
Clients open a WebSocket to a connection gateway. Because connections are long-lived and sticky, the system must know which gateway server currently holds each user's socket — that mapping lives in a fast store (Redis) updated on connect/disconnect.
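The registry described here can be sketched with an in-memory dictionary standing in for the fast store. The class and method names are hypothetical, and a real deployment would keep this mapping in Redis or a similar shared store rather than in process memory.

```python
from typing import Optional


class ConnectionRegistry:
    """In-memory stand-in for the user -> gateway mapping kept in a fast store."""

    def __init__(self) -> None:
        self._gateway_by_user: dict[str, str] = {}

    def on_connect(self, user_id: str, gateway_id: str) -> None:
        # A reconnect through a different gateway simply overwrites the entry.
        self._gateway_by_user[user_id] = gateway_id

    def on_disconnect(self, user_id: str, gateway_id: str) -> None:
        # Only clear the entry if it still points at the gateway reporting the
        # disconnect; otherwise a late disconnect from an old gateway would
        # erase a newer connection made elsewhere.
        if self._gateway_by_user.get(user_id) == gateway_id:
            del self._gateway_by_user[user_id]

    def lookup(self, user_id: str) -> Optional[str]:
        return self._gateway_by_user.get(user_id)


registry = ConnectionRegistry()
registry.on_connect("userB", "gateway-7")
registry.on_connect("userB", "gateway-2")     # userB reconnects elsewhere
registry.on_disconnect("userB", "gateway-7")  # stale disconnect is ignored
print(registry.lookup("userB"))
```

The guarded delete in `on_disconnect` is a design choice worth mentioning in an interview: connect and disconnect notifications can arrive out of order when a client hops between gateways.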
When A sends a message, its gateway forwards it to the chat service, which persists it and looks up B's connection server. It delivers via that server if B is online; if B is offline, it enqueues a push notification. Cross-server delivery uses a pub/sub channel so any gateway can hand a message to any other.
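The send path above can be sketched as a small function: persist, look up, then either hand off or enqueue a push. Everything here is an in-memory stand-in; `ChatService`, its fields, and the return strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChatService:
    registry: dict[str, str]  # user -> gateway holding that user's socket
    stored: list[dict] = field(default_factory=list)
    gateway_inbox: dict[str, list[dict]] = field(default_factory=dict)
    push_queue: list[dict] = field(default_factory=list)

    def handle_send(self, message: dict) -> str:
        # 1. Persist first, so the message survives whatever happens next.
        self.stored.append(message)
        # 2. Find the gateway holding the recipient's socket, if any.
        gateway = self.registry.get(message["to"])
        if gateway is None:
            # 3a. Recipient is offline: queue a push notification instead.
            self.push_queue.append(message)
            return "push_enqueued"
        # 3b. Recipient is online: publish to that gateway's channel.
        self.gateway_inbox.setdefault(gateway, []).append(message)
        return "handed_to_gateway"


service = ChatService(registry={"userB": "gateway-2"})
online_result = service.handle_send(
    {"id": "m1", "from": "userA", "to": "userB", "text": "hi"}
)
offline_result = service.handle_send(
    {"id": "m2", "from": "userA", "to": "userC", "text": "hello"}
)
print(online_result, offline_result)
```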
Presence is tracked with heartbeats: clients ping periodically, and a Redis entry with a TTL marks them online; a missed heartbeat expires the entry. Read/delivery receipts are just small status events flowing back through the same path.
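A minimal sketch of heartbeat-based presence follows, using a dictionary of expiry times in place of Redis keys with a TTL. The `PresenceTracker` name, the 30-second TTL, and the injectable clock (used so the example does not need to sleep) are all assumptions made for illustration.

```python
import time
from typing import Callable


class PresenceTracker:
    """Marks a user online until ttl_seconds pass without a heartbeat."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: dict[str, float] = {}

    def heartbeat(self, user_id: str) -> None:
        # Each ping pushes the expiry forward, like re-setting a key's TTL.
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id: str) -> bool:
        expires_at = self._expires_at.get(user_id)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            # Missed heartbeat: the entry has expired, so drop it.
            del self._expires_at[user_id]
            return False
        return True


now = [0.0]  # fake clock so the example runs instantly
presence = PresenceTracker(ttl_seconds=30, clock=lambda: now[0])
presence.heartbeat("userA")
now[0] = 20.0
online_at_20 = presence.is_online("userA")
now[0] = 31.0
online_at_31 = presence.is_online("userA")
print(online_at_20, online_at_31)
```

The TTL should be longer than the heartbeat interval, so a single delayed ping does not flip a user to offline.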
4. Data model & storage
Messages are stored partitioned by conversation ID and ordered by time, so loading a chat is a single efficient range scan. A wide-column store (Cassandra/HBase) fits this access pattern at scale.
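To illustrate the access pattern, here is a small in-memory sketch in which the conversation ID plays the role of the partition key and rows within a partition stay sorted by timestamp. `MessageStore` and its methods are hypothetical names; a wide-column store would provide the same shape through its partition and clustering keys.

```python
import bisect
from collections import defaultdict

Row = tuple[int, str, str]  # (timestamp, message_id, text)


class MessageStore:
    """Messages grouped by conversation and kept sorted by (timestamp, id)."""

    def __init__(self) -> None:
        self._by_conversation: dict[str, list[Row]] = defaultdict(list)

    def append(self, conversation_id: str, ts: int,
               message_id: str, text: str) -> None:
        # insort keeps the partition ordered even if writes arrive out of order.
        bisect.insort(self._by_conversation[conversation_id],
                      (ts, message_id, text))

    def range_scan(self, conversation_id: str,
                   start_ts: int, end_ts: int) -> list[Row]:
        """Return rows with start_ts <= timestamp < end_ts, in time order."""
        rows = self._by_conversation.get(conversation_id, [])
        lo = bisect.bisect_left(rows, (start_ts,))
        hi = bisect.bisect_left(rows, (end_ts,))
        return rows[lo:hi]


store = MessageStore()
store.append("conv-1", 30, "m3", "see you")
store.append("conv-1", 10, "m1", "hi")
store.append("conv-1", 20, "m2", "hello")
store.append("conv-2", 15, "m9", "other chat")
page = store.range_scan("conv-1", 10, 30)
print(page)
```

Because a scan touches only one partition and reads a contiguous slice, loading a chat never has to filter through other conversations' messages.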
Per-device sync tracks the last message each device has seen, so a phone and laptop both catch up correctly. Group messages fan out to each member's delivery path.
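Per-device sync can be sketched as one cursor per (device, conversation) pair pointing into an ordered message log. The `DeviceSync` class is hypothetical, and for brevity it advances the cursor as soon as messages are fetched; a real client would more likely advance it only after acknowledging receipt.

```python
class DeviceSync:
    """Tracks, per device and conversation, how many messages were synced."""

    def __init__(self) -> None:
        # conversation -> ordered message IDs
        self._log: dict[str, list[str]] = {}
        # (device, conversation) -> number of messages already synced
        self._cursor: dict[tuple[str, str], int] = {}

    def append(self, conversation_id: str, message_id: str) -> None:
        self._log.setdefault(conversation_id, []).append(message_id)

    def catch_up(self, device_id: str, conversation_id: str) -> list[str]:
        log = self._log.get(conversation_id, [])
        key = (device_id, conversation_id)
        seen = self._cursor.get(key, 0)
        pending = log[seen:]
        # Simplification: treat fetched messages as seen right away.
        self._cursor[key] = len(log)
        return pending


sync = DeviceSync()
sync.append("conv-1", "m1")
sync.append("conv-1", "m2")
phone_first = sync.catch_up("phone", "conv-1")
sync.append("conv-1", "m3")
phone_second = sync.catch_up("phone", "conv-1")
laptop_first = sync.catch_up("laptop", "conv-1")
print(phone_first, phone_second, laptop_first)
```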
5. Scaling and bottlenecks
- Many gateway servers behind a load balancer, each holding a slice of connections. They keep no durable state beyond their open sockets, so you scale out by adding servers.
- Connection registry in Redis (user → server, with one entry per device when a user has several connected) so messages route to the right gateway in one lookup.
- Pub/sub backbone (Redis/Kafka) for cross-server message handoff.
- Shard message storage by conversation ID; archive old messages to cheaper storage.
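Sharding by conversation ID can be sketched as a stable hash of the ID taken modulo the shard count. The `shard_for` function and the shard count of 16 are assumptions for illustration; stores such as Cassandra perform this placement internally based on the partition key.

```python
import hashlib


def shard_for(conversation_id: str, shard_count: int) -> int:
    """Map a conversation to a shard index in [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # Use a cryptographic digest rather than Python's built-in hash(), which
    # is randomized per process and so would not be stable across servers.
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


for conversation_id in ("conv-1", "conv-2", "conv-3"):
    print(conversation_id, "->", shard_for(conversation_id, 16))
```

Plain modulo hashing remaps most keys whenever the shard count changes; consistent hashing is the usual way to limit that movement.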
Key trade-offs the interviewer probes
- WebSocket vs long-polling. WebSockets give true low-latency push and are the standard answer; long-polling is a fallback for restricted networks but wastes resources.
- Delivery guarantees. Aim for at-least-once delivery with client-side de-duplication by message ID — exactly-once is impractical, so dedupe instead.
- Group fan-out. Small groups can fan out per-member; very large groups may need a pull model so one message doesn't trigger a huge write burst — the same push/pull tension as the news feed.
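The delivery-guarantee trade-off above comes down to a few lines on the client: remember which message IDs have been processed and ignore repeats. This is a minimal sketch with a hypothetical `DedupingClient`; it keeps every ID forever, whereas a real client would likely bound that set.

```python
class DedupingClient:
    """Applies each message once, even if the server redelivers it."""

    def __init__(self) -> None:
        self._seen_ids: set[str] = set()  # unbounded here for simplicity
        self.displayed: list[str] = []

    def on_message(self, event: dict) -> bool:
        """Return True if the message was new, False if it was a duplicate."""
        if event["id"] in self._seen_ids:
            return False
        self._seen_ids.add(event["id"])
        self.displayed.append(event["text"])
        return True


client = DedupingClient()
first = client.on_message({"id": "m1", "text": "hi"})
retry = client.on_message({"id": "m1", "text": "hi"})  # server retried m1
second = client.on_message({"id": "m2", "text": "there"})
print(client.displayed)
```

With this in place the server is free to retry until it sees a delivery receipt, since a duplicate costs the client only a set lookup.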
Framework reminder: every system design answer follows the same arc — requirements → estimates → API → high-level design → data model → scale → trade-offs. Keep the system design cheat sheet in mind and narrate which stage you're in.
FAQ
Why use WebSockets for a chat app?
WebSockets provide a persistent, bidirectional connection so the server can push messages to clients instantly without polling. That low-latency push is exactly what real-time chat needs; long-polling is only a fallback for networks that block WebSockets.
How do you route a message to the right server?
Maintain a connection registry (user to gateway server) in a fast store like Redis, updated on connect and disconnect. When a message arrives, look up the recipient's server and hand the message off, using a pub/sub backbone so any gateway can deliver to any other.
How do you handle offline users?
Persist every message durably first. If the recipient has no active connection, enqueue a push notification (APNs/FCM) and deliver the message when they reconnect and sync. Persistence-before-delivery is what guarantees no message is lost.
How is presence (online status) implemented?
With heartbeats: clients send a periodic ping, and the server stores an online marker with a TTL in Redis. A missed heartbeat lets the entry expire, flipping the user to offline. It's cheap and self-healing.
How do you guarantee messages aren't lost or duplicated?
Persist messages before attempting delivery, and use at-least-once delivery with client-side de-duplication by message ID. Exactly-once delivery is impractical in distributed systems, so deduplicating on a unique ID is the standard approach.