Design Search Autocomplete (Typeahead): System Design Walkthrough

"Design search autocomplete" — also called typeahead or search-as-you-type — is a favorite system design interview question because it hides a deceptively hard problem behind a simple feature. Every keystroke is a query, the latency budget is brutal, and the right answer hinges on one clever idea: precomputing results instead of computing them on demand. Below is a structured typeahead system design walkthrough you can mirror in a real interview, narrating each stage as you go. Keep the system design interview cheat sheet open alongside for the underlying template.

1. Clarify the requirements

Don't start drawing boxes. Spend the first few minutes scoping the autocomplete system design so you optimize the right thing.

Functional requirements

As a user types a prefix, return the top suggestions (say 5–10) that complete it.
Suggestions are ordered by relevance — popularity is the default signal.
Suggestions update on every keystroke.
The suggestion corpus reflects what people actually search, and trending queries should surface over time.

Non-functional requirements

Extremely low latency: suggestions must appear within ~100ms to keep pace with typing.
Read-heavy: every keystroke is a read; query writes (search logs) are far less frequent than suggestion reads.
High availability: a stale suggestion is fine; a slow or missing one ruins the feature.
Scale: billions of queries per day across a large vocabulary, with a heavy popularity skew.

I'll de-scope spell correction, multi-word semantic ranking, and personalization depth for now — calling that out shows judgment without burning time.

2. Scale estimates (back-of-the-envelope)

Rough numbers justify the architecture; precision isn't the point.

Quantity	Assumption	Result
Searches per day	~5B queries/day	~60K searches/sec
Keystrokes per search	~20 chars, request every keystroke (debounced)	~5–20× the search rate in suggestion calls
Suggestion QPS	~10× amplification, peak 2× average	~600K–1.2M reads/sec at peak
Unique prefixes	large but bounded vocabulary	tens to hundreds of millions of nodes
Read:write ratio	reads dominate massively	~1000:1+

The headline takeaway: this is an aggressively read-heavy system with a tiny latency budget. That single fact pushes us to do all the heavy work offline and serve from precomputed, in-memory structures — which leads straight to the core design decision in section 6. Like the read-heavy reasoning behind designing Twitter's timeline, the design optimizes the read path above all else.

3. API design

The surface is tiny. One read endpoint dominates, and a logging path feeds the offline pipeline.

GET  /v1/suggest?q={prefix}&limit=10&locale=en
       -> { "suggestions": ["...", "...", ...] }   // pre-ranked, top-K

POST /v1/search-log   { "query": "...", "ts": ... }   // fire-and-forget

The client debounces keystrokes (e.g., wait ~50ms of idle typing) and cancels in-flight requests so it never floods the backend or renders stale results. Note /suggest returns an already-sorted list — no ranking happens at request time.

4. The core data structure: the trie

A trie (prefix tree) is the natural fit. Each edge is a character, so the path from the root to any node spells a prefix, and every completion of that prefix lives in the subtree below it.

The naive version walks the prefix to its node, then traverses the entire subtree to gather and rank all completions on every keystroke. That subtree can be enormous, and doing it per keystroke blows the latency budget instantly. The fix is the key insight of the whole problem:

Precompute top-K per prefix. Store, directly on each prefix node, the K highest-ranked completions of that prefix as an already-sorted list. A query then becomes an O(length of prefix) walk down the trie plus a single read of that node's cached list — no subtree traversal at request time. This trades extra memory and offline compute for blazing-fast reads, exactly the trade a read-heavy, latency-bound system wants.

5. Architecture diagram

Two decoupled paths: a fast serving path that only reads precomputed data, and a slow offline path that rebuilds the index from logs.

 SERVING PATH (hot, in-memory, sub-100ms)
 ┌────────┐   q=pre   ┌──────────┐   route    ┌───────────────────────────┐
 │ Client │ ────────► │ Suggest  │ ─────────► │  Trie shard (by prefix)   │
 │ (debounce)         │ service  │ ◄───────── │  node → cached top-K list │
 └────────┘  top-K    └──────────┘   list     └───────────────────────────┘
      │ logs search                                      ▲ publish
      ▼                                                  │
 ┌──────────┐   aggregate   ┌───────────────┐   rebuild trie + top-K
 │  Query   │ ────────────► │  Offline /    │ ──────────────────────────►
 │  log     │   counts      │  streaming    │   (batch or near-real-time)
 │ pipeline │               │  builder      │
 └──────────┘               └───────────────┘
 OFFLINE PATH (cold, eventual, minutes→hours)

Decoupled serving and offline paths. The hot path only reads precomputed top-K lists; the cold path aggregates logs and rebuilds the trie.

6. Ranking and the offline pipeline

Where do the scores come from? The base signal is historical query frequency — how often a query has been searched. Around that you layer recency (weight recent activity higher so trends surface), locale/language, and optionally light personalization. Crucially, all of this is computed offline:

Log & aggregate. The query-log pipeline counts query occurrences over a sliding window (with time decay so old popularity fades).
Score & rebuild. A batch (e.g., MapReduce/Spark) or streaming job computes each query's score, then walks the trie computing the top-K list for every prefix node.
Publish. The rebuilt trie (or just the updated top-K caches) is shipped to the serving tier, which swaps it in atomically.

Because ranking is precomputed, the serving path never scores candidates — it reads an already-sorted list. That's what keeps each keystroke under budget.

7. Updating frequencies and freshness

A natural follow-up: how does a brand-new trending query show up? You explicitly accept eventual consistency. New frequencies don't touch the live trie; they flow through the offline pipeline and appear after the next rebuild. For most autocomplete a delay of minutes to a few hours is perfectly acceptable.

If the interviewer pushes for fresher trends, propose a hybrid: keep the bulk index on a slow daily/hourly rebuild, and run a lightweight streaming job that maintains top-K for a small set of hot, fast-moving prefixes in near real time, merged at serve time. Don't over-engineer this unless asked — naming the trade-off is the high-signal move.

8. Sharding and caching the trie

One trie won't fit in a single machine's memory or serve 1M reads/sec alone, so partition and replicate:

Shard by prefix. Route by the first character or two — everything under "a" to one shard, "b" to another. A request hashes its leading characters to the owning shard. Watch for skew (some prefixes are far hotter) and split hot ranges further.
Replicate each shard. Reads scale horizontally by adding read replicas of each shard behind a load balancer; this also gives availability.
Cache aggressively. The whole point is in-memory serving, but you can also cache hot prefix responses at the edge/CDN with a short TTL, since suggestions change slowly. A debounced client plus edge caching absorbs much of the keystroke amplification.

9. Latency targets, bottlenecks, and trade-offs

~100ms end to end. Budget keystroke debounce + network + an in-memory trie walk + cache read. The design exists to keep the server portion to a few milliseconds.
Precompute vs. memory. Storing top-K on every node costs memory and offline compute but is the only way to hit the latency target at this read volume.
Freshness vs. cost. Faster rebuilds surface trends sooner but cost more; eventual consistency is the right default.
Skew. Both the social-style popularity distribution and prefix sharding create hotspots; replicate hot shards and cache hot prefixes.
Client discipline. Debouncing and request cancellation on the client cut backend load dramatically — mention it; interviewers notice.

Framework reminder: Autocomplete follows the same arc as every system design answer — requirements → estimates → API → data model → high-level design → the core decision → trade-offs. The precompute-and-cache pattern here echoes the read-path optimization in Design Twitter, and the offline-pipeline-feeds-fast-serving shape mirrors how a web crawler separates heavy background work from query-time reads.

Practice typeahead and other system designs with live AI support

CoPilot Interview surfaces a structured design skeleton — requirements, scale, API, data structure, and the core trade-off — in about 4 seconds during real Zoom and Teams calls, so you can practice narrating a clean answer. Free for Windows and macOS. Learn more about the free AI interview assistant or browse the full system design interview guide.

Download free

FAQ

What data structure is used for search autocomplete?

A trie (prefix tree). Each path from the root spells a prefix, so all completions of what the user has typed live in the subtree below that prefix node. To keep lookups fast you precompute and cache the top-K suggestions directly on each prefix node, turning a query into a single O(prefix length) walk plus a cache read instead of a subtree traversal.

How do you rank autocomplete suggestions?

Most systems rank by historical query frequency or popularity as the base signal, then layer on recency, personalization, and locale. The score is precomputed offline, and the top-K results per prefix are materialized so the serving path only reads an already-sorted list rather than scoring candidates at request time.

What latency should search autocomplete target?

Suggestions should appear within roughly 100 milliseconds so they keep pace with typing. That budget is why the top-K per prefix is precomputed and cached in memory: the serving path does an in-memory trie walk and a cache hit, not a database query, on every keystroke.

How do you update the autocomplete index with new query frequencies?

Frequencies update offline, not on the live serving path. A logging pipeline aggregates query counts over a window, a batch or streaming job recomputes the top-K per prefix, and the rebuilt trie or updated cache is published to the serving tier. Autocomplete tolerates eventual consistency, so a delay of minutes to hours before a trending query surfaces is acceptable.

How do you scale autocomplete to a huge vocabulary?

Shard the trie by prefix - for example route everything starting with 'a' to one shard - and replicate each shard for read throughput and availability. A request is hashed on its first character or two to the owning shard. The top-K cache makes each shard cheap to serve, so you scale reads horizontally by adding replicas.

Design Search Autocomplete (Typeahead) — System Design Walkthrough

1. Clarify the requirements

2. Scale estimates (back-of-the-envelope)

3. API design

4. The core data structure: the trie

5. Architecture diagram

6. Ranking and the offline pipeline

7. Updating frequencies and freshness

8. Sharding and caching the trie

9. Latency targets, bottlenecks, and trade-offs

Practice typeahead and other system designs with live AI support

FAQ

Related guides