HomeBlog › Design Search Autocomplete

Design Search Autocomplete (Typeahead) — System Design Walkthrough

A senior-level walkthrough: requirements, scale math, the trie, top-K precomputation per prefix, ranking, sharding, and the sub-100ms latency target the whole design turns on.

"Design search autocomplete" — also called typeahead or search-as-you-type — is a favorite system design interview question because it hides a deceptively hard problem behind a simple feature. Every keystroke is a query, the latency budget is brutal, and the right answer hinges on one clever idea: precomputing results instead of computing them on demand. Below is a structured typeahead system design walkthrough you can mirror in a real interview, narrating each stage as you go. Keep the system design interview cheat sheet open alongside for the underlying template.

1. Clarify the requirements

Don't start drawing boxes. Spend the first few minutes scoping the autocomplete system design so you optimize the right thing.

Functional requirements

Non-functional requirements

I'll de-scope spell correction, multi-word semantic ranking, and personalization depth for now — calling that out shows judgment without burning time.

2. Scale estimates (back-of-the-envelope)

Rough numbers justify the architecture; precision isn't the point.

QuantityAssumptionResult
Searches per day~5B queries/day~60K searches/sec
Keystrokes per search~20 chars, request every keystroke (debounced)~5–20× the search rate in suggestion calls
Suggestion QPS~10× amplification, peak 2× average~600K–1.2M reads/sec at peak
Unique prefixeslarge but bounded vocabularytens to hundreds of millions of nodes
Read:write ratioreads dominate massively~1000:1+

The headline takeaway: this is an aggressively read-heavy system with a tiny latency budget. That single fact pushes us to do all the heavy work offline and serve from precomputed, in-memory structures — which leads straight to the core design decision in section 6. Like the read-heavy reasoning behind designing Twitter's timeline, the design optimizes the read path above all else.

3. API design

The surface is tiny. One read endpoint dominates, and a logging path feeds the offline pipeline.

GET  /v1/suggest?q={prefix}&limit=10&locale=en
       -> { "suggestions": ["...", "...", ...] }   // pre-ranked, top-K

POST /v1/search-log   { "query": "...", "ts": ... }   // fire-and-forget

The client debounces keystrokes (e.g., wait ~50ms of idle typing) and cancels in-flight requests so it never floods the backend or renders stale results. Note /suggest returns an already-sorted list — no ranking happens at request time.

4. The core data structure: the trie

A trie (prefix tree) is the natural fit. Each edge is a character, so the path from the root to any node spells a prefix, and every completion of that prefix lives in the subtree below it.

The naive version walks the prefix to its node, then traverses the entire subtree to gather and rank all completions on every keystroke. That subtree can be enormous, and doing it per keystroke blows the latency budget instantly. The fix is the key insight of the whole problem:

Precompute top-K per prefix. Store, directly on each prefix node, the K highest-ranked completions of that prefix as an already-sorted list. A query then becomes an O(length of prefix) walk down the trie plus a single read of that node's cached list — no subtree traversal at request time. This trades extra memory and offline compute for blazing-fast reads, exactly the trade a read-heavy, latency-bound system wants.

5. Architecture diagram

Two decoupled paths: a fast serving path that only reads precomputed data, and a slow offline path that rebuilds the index from logs.

 SERVING PATH (hot, in-memory, sub-100ms)
 ┌────────┐   q=pre   ┌──────────┐   route    ┌───────────────────────────┐
 │ Client │ ────────► │ Suggest  │ ─────────► │  Trie shard (by prefix)   │
 │ (debounce)         │ service  │ ◄───────── │  node → cached top-K list │
 └────────┘  top-K    └──────────┘   list     └───────────────────────────┘
      │ logs search                                      ▲ publish
      ▼                                                  │
 ┌──────────┐   aggregate   ┌───────────────┐   rebuild trie + top-K
 │  Query   │ ────────────► │  Offline /    │ ──────────────────────────►
 │  log     │   counts      │  streaming    │   (batch or near-real-time)
 │ pipeline │               │  builder      │
 └──────────┘               └───────────────┘
 OFFLINE PATH (cold, eventual, minutes→hours)
Decoupled serving and offline paths. The hot path only reads precomputed top-K lists; the cold path aggregates logs and rebuilds the trie.

6. Ranking and the offline pipeline

Where do the scores come from? The base signal is historical query frequency — how often a query has been searched. Around that you layer recency (weight recent activity higher so trends surface), locale/language, and optionally light personalization. Crucially, all of this is computed offline:

Because ranking is precomputed, the serving path never scores candidates — it reads an already-sorted list. That's what keeps each keystroke under budget.

7. Updating frequencies and freshness

A natural follow-up: how does a brand-new trending query show up? You explicitly accept eventual consistency. New frequencies don't touch the live trie; they flow through the offline pipeline and appear after the next rebuild. For most autocomplete a delay of minutes to a few hours is perfectly acceptable.

If the interviewer pushes for fresher trends, propose a hybrid: keep the bulk index on a slow daily/hourly rebuild, and run a lightweight streaming job that maintains top-K for a small set of hot, fast-moving prefixes in near real time, merged at serve time. Don't over-engineer this unless asked — naming the trade-off is the high-signal move.

8. Sharding and caching the trie

One trie won't fit in a single machine's memory or serve 1M reads/sec alone, so partition and replicate:

9. Latency targets, bottlenecks, and trade-offs

Framework reminder: Autocomplete follows the same arc as every system design answer — requirements → estimates → API → data model → high-level design → the core decision → trade-offs. The precompute-and-cache pattern here echoes the read-path optimization in Design Twitter, and the offline-pipeline-feeds-fast-serving shape mirrors how a web crawler separates heavy background work from query-time reads.

Practice typeahead and other system designs with live AI support

CoPilot Interview surfaces a structured design skeleton — requirements, scale, API, data structure, and the core trade-off — in about 4 seconds during real Zoom and Teams calls, so you can practice narrating a clean answer. Free for Windows and macOS. Learn more about the free AI interview assistant or browse the full system design interview guide.

Download free

FAQ

What data structure is used for search autocomplete?

A trie (prefix tree). Each path from the root spells a prefix, so all completions of what the user has typed live in the subtree below that prefix node. To keep lookups fast you precompute and cache the top-K suggestions directly on each prefix node, turning a query into a single O(prefix length) walk plus a cache read instead of a subtree traversal.

How do you rank autocomplete suggestions?

Most systems rank by historical query frequency or popularity as the base signal, then layer on recency, personalization, and locale. The score is precomputed offline, and the top-K results per prefix are materialized so the serving path only reads an already-sorted list rather than scoring candidates at request time.

What latency should search autocomplete target?

Suggestions should appear within roughly 100 milliseconds so they keep pace with typing. That budget is why the top-K per prefix is precomputed and cached in memory: the serving path does an in-memory trie walk and a cache hit, not a database query, on every keystroke.

How do you update the autocomplete index with new query frequencies?

Frequencies update offline, not on the live serving path. A logging pipeline aggregates query counts over a window, a batch or streaming job recomputes the top-K per prefix, and the rebuilt trie or updated cache is published to the serving tier. Autocomplete tolerates eventual consistency, so a delay of minutes to hours before a trending query surfaces is acceptable.

How do you scale autocomplete to a huge vocabulary?

Shard the trie by prefix - for example route everything starting with 'a' to one shard - and replicate each shard for read throughput and availability. A request is hashed on its first character or two to the owning shard. The top-K cache makes each shard cheap to serve, so you scale reads horizontally by adding replicas.