Design YouTube: System Design Interview Walkthrough

"Design YouTube" (or Netflix, Vimeo, any video-on-demand platform) is a staple system design question because it stresses parts of an architecture most apps never touch — large binary uploads, distributed media processing, and serving petabytes of data to viewers cheaply. The whole answer hinges on two ideas: a transcoding pipeline that prepares each video for streaming, and a CDN that absorbs the read traffic.

Here is the full senior-level walkthrough, following the same arc every system design answer should: requirements, estimates, API, the upload and transcoding pipeline, storage and delivery, metadata and view counts, then bottlenecks and trade-offs.

1. Clarify the requirements

Scope the problem before drawing anything. State the functional surface, then the non-functional constraints that actually drive the design.

Functional requirements

Users can upload a video (with a title, description, and thumbnail)
Users can stream and watch a video on demand, on any device and connection
Users can search for videos by title and metadata
The system tracks view counts per video

Out of scope but worth naming: recommendations, comments, and live streaming. Call these out so the interviewer knows you see them, then set them aside.

Non-functional requirements

Massive read scale — a video is uploaded once and watched millions of times; reads dwarf writes
High availability — playback should rarely fail; the watch path must degrade gracefully, not go down
Durable storage — uploaded videos must never be lost
Low buffering — start fast and adapt to the viewer's bandwidth to avoid stalls

2. Capacity estimates

Back-of-the-envelope numbers justify the expensive parts of the design. Assume a large platform and round aggressively — the interviewer wants reasoning, not precision.

Quantity	Assumption	Result
Uploads/day	~500K videos uploaded daily, avg 300 MB raw	~150 TB/day raw ingest
Storage (with transcodes)	Each video re-encoded to ~5 resolutions, roughly 2× raw	~300 TB/day → ~100 PB/year and growing
Watch traffic	~5B views/day, avg 5 min watched at ~2 Mbps	~375 PB/day of egress; ~35 Tbps on average, on the order of 100 Tbps at peak
Read : write ratio	Views vastly exceed uploads	Optimize the read/delivery path first

The headline: egress bandwidth is the dominant cost. Storage in petabytes is large but cheap per byte; pushing those bytes out to billions of viewers every day is what drives the bill. This single observation is why a CDN is non-negotiable, and naming it early is high-signal.
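The arithmetic behind the table can be sketched in a few lines. The inputs are the rounded assumptions from the table; `PEAK_FACTOR` is an extra assumption introduced here for illustration, and a spikier traffic profile would push the peak higher.

```python
# Back-of-the-envelope arithmetic for the capacity estimates.
# Inputs are the article's rounded assumptions, except PEAK_FACTOR,
# which is a hypothetical peak-to-average ratio added for illustration.

UPLOADS_PER_DAY = 500_000
AVG_RAW_MB = 300
TRANSCODE_MULTIPLIER = 2          # transcoded output is roughly 2x raw
VIEWS_PER_DAY = 5_000_000_000
AVG_WATCH_SECONDS = 5 * 60
AVG_BITRATE_MBPS = 2
PEAK_FACTOR = 3                   # assumed, not from the article
SECONDS_PER_DAY = 86_400

# Ingest and storage (decimal units: 1 TB = 1,000,000 MB)
raw_tb_per_day = UPLOADS_PER_DAY * AVG_RAW_MB / 1_000_000
transcoded_tb_per_day = raw_tb_per_day * TRANSCODE_MULTIPLIER
transcoded_pb_per_year = transcoded_tb_per_day * 365 / 1_000

# Egress: total bits streamed per day, then averaged over the day
egress_bits_per_day = (
    VIEWS_PER_DAY * AVG_WATCH_SECONDS * AVG_BITRATE_MBPS * 1_000_000
)
egress_pb_per_day = egress_bits_per_day / 8 / 1e15
avg_egress_tbps = egress_bits_per_day / SECONDS_PER_DAY / 1e12
peak_egress_tbps = avg_egress_tbps * PEAK_FACTOR

if __name__ == "__main__":
    print(f"raw ingest:      {raw_tb_per_day:.0f} TB/day")
    print(f"transcoded:      {transcoded_pb_per_year:.1f} PB/year")
    print(f"egress volume:   {egress_pb_per_day:.0f} PB/day")
    print(f"egress average:  {avg_egress_tbps:.1f} Tbps")
    print(f"egress peak:     {peak_egress_tbps:.1f} Tbps")
```

Daily egress volume comes out more than a thousand times larger than daily storage growth, which is the quantitative reason to optimize delivery first.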

3. API design

Uploads are large and unreliable, so the upload API is multi-step rather than a single request. Playback returns a manifest, not raw bytes.

POST /api/videos                 -> create video record, return uploadId + presigned URLs
PUT  /api/upload/{uploadId}/part  { partNumber, bytes }   (chunked, resumable)
POST /api/upload/{uploadId}/complete   -> assemble + trigger transcoding
GET  /api/videos/{videoId}        -> metadata (title, status, manifest URL)
GET  /api/videos/{videoId}/manifest.m3u8   -> HLS playlist of segments (a DASH client would request an .mpd manifest instead)
GET  /api/search?q=...            -> ranked list of videos

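Each part is PUT to one of the presigned URLs, so the bytes travel from the client straight to blob storage and large files never pass through application servers. The application only creates the upload record and handles the complete call.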

4. The upload & transcoding pipeline

This is the heart of the design and where senior candidates separate themselves.

Chunked upload. The client splits the file into parts and uploads them in parallel (and resumably) to a blob store. A flaky mobile connection can retry a single failed chunk instead of restarting a 300 MB upload. When complete is called, the parts are assembled into the raw source object.
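A minimal sketch of the client side of a chunked, resumable upload, under stated assumptions: `put_part` is a hypothetical callable standing in for an HTTP PUT to a presigned URL, and the 8 MiB part size is illustrative. Parts are sent sequentially here for brevity; because they are independent, a real client could send them concurrently.

```python
from typing import Callable, Dict, List, Tuple

PART_SIZE = 8 * 1024 * 1024  # 8 MiB per part (assumed, not from the article)


def plan_parts(file_size: int, part_size: int = PART_SIZE) -> List[Tuple[int, int, int]]:
    """Return (part_number, start_offset, end_offset_exclusive) for each part."""
    parts = []
    start = 0
    number = 1
    while start < file_size:
        end = min(start + part_size, file_size)
        parts.append((number, start, end))
        start = end
        number += 1
    return parts


def upload_with_retries(
    data: bytes,
    put_part: Callable[[int, bytes], str],
    part_size: int = PART_SIZE,
    max_attempts: int = 3,
) -> Dict[int, str]:
    """Upload every part, retrying only the part that failed.

    put_part(part_number, chunk) is a hypothetical uploader that returns a
    receipt string on success and raises ConnectionError on a dropped
    connection.
    """
    receipts: Dict[int, str] = {}
    for number, start, end in plan_parts(len(data), part_size):
        for attempt in range(1, max_attempts + 1):
            try:
                receipts[number] = put_part(number, data[start:end])
                break  # this part is done; move on to the next one
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
    return receipts
```

The receipts map is what the complete call would send back so the server can verify and assemble the parts in order.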

Trigger transcoding. Assembly emits an event onto a queue. A pool of transcoding workers picks up the job — decoupling upload from processing so a spike in uploads just lengthens the queue instead of failing requests.
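The decoupling can be illustrated with an in-process queue and a small worker pool. This is a toy sketch: a production system would use a durable message queue and workers on separate machines, and `run_transcode_workers` is a hypothetical name introduced here.

```python
import queue
import threading
from typing import List


def run_transcode_workers(video_ids: List[str], worker_count: int = 4) -> List[str]:
    """Enqueue one job per video and let a fixed pool of workers drain the queue."""
    jobs = queue.Queue()
    processed: List[str] = []
    processed_lock = threading.Lock()

    def worker() -> None:
        while True:
            video_id = jobs.get()
            if video_id is None:       # sentinel: no more work for this worker
                jobs.task_done()
                return
            # Placeholder for the real transcoding work.
            with processed_lock:
                processed.append(video_id)
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()

    # Upload side: enqueueing is cheap and never waits on transcoding.
    for video_id in video_ids:
        jobs.put(video_id)
    for _ in threads:
        jobs.put(None)                 # one sentinel per worker

    jobs.join()                        # wait until every job is acknowledged
    for thread in threads:
        thread.join()
    return processed
```

A burst of uploads only makes the queue longer; adding workers shortens the wait without touching the upload path.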

Transcode to multiple resolutions and bitrates. The raw file is split into short segments, and each segment is encoded in parallel across workers into several quality levels (for example 240p, 480p, 720p, 1080p, 4K) at different bitrates. Parallelizing per-segment is what makes processing a long video fast. The output is many short segments per resolution plus a manifest describing them.
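The fan-out can be sketched as one task per (segment, rendition) pair. `encode_segment` is a stand-in that only returns the object key it would write, and the key layout is an assumption. Threads are used to keep the sketch self-contained; real encoding is CPU-bound and would be spread across processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

RENDITIONS = ["240p", "480p", "720p", "1080p", "4k"]


def encode_segment(video_id: str, segment_index: int, rendition: str) -> str:
    """Stand-in for a real encoder call; returns the object key it would write."""
    return f"{video_id}/{rendition}/seg_{segment_index:05d}.ts"


def transcode(video_id: str, segment_count: int, max_workers: int = 8) -> Dict[str, List[str]]:
    """Encode every segment at every rendition, all tasks independent."""
    tasks = [(index, rendition) for rendition in RENDITIONS for index in range(segment_count)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, so segments stay in
        # playback order within each rendition.
        keys = list(pool.map(lambda task: encode_segment(video_id, task[0], task[1]), tasks))

    output: Dict[str, List[str]] = {rendition: [] for rendition in RENDITIONS}
    for (_, rendition), key in zip(tasks, keys):
        output[rendition].append(key)
    return output
```

For a video with 3 segments this produces 15 independent tasks, so total processing time is bounded by the slowest single segment rather than by the length of the video.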

While transcoding runs, the video's status is processing; once segments and the manifest are written, it flips to ready and becomes watchable.
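The manifest written at the end of transcoding could be an HLS master playlist that lists one variant per rendition. The sketch below builds one as plain text; the dictionary keys and the `<name>/playlist.m3u8` path layout are assumptions made for illustration.

```python
from typing import Dict, List, Union

Rendition = Dict[str, Union[str, int]]


def build_master_playlist(renditions: List[Rendition]) -> str:
    """Return an HLS master playlist with one variant stream per rendition.

    Each rendition dict is expected to carry the hypothetical keys
    'name', 'bandwidth_bps', 'width', and 'height'.
    """
    lines = ["#EXTM3U"]
    for rendition in renditions:
        lines.append(
            "#EXT-X-STREAM-INF:"
            f"BANDWIDTH={rendition['bandwidth_bps']},"
            f"RESOLUTION={rendition['width']}x{rendition['height']}"
        )
        # URI of the per-rendition media playlist, relative to the manifest.
        lines.append(f"{rendition['name']}/playlist.m3u8")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_master_playlist([
        {"name": "480p", "bandwidth_bps": 1_000_000, "width": 854, "height": 480},
        {"name": "720p", "bandwidth_bps": 2_500_000, "width": 1280, "height": 720},
    ]))
```

Writing the manifest last is a natural commit point: the status can flip to ready only after every segment the manifest references exists.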

5. Storage & delivery

Blob/object store. Raw uploads and all transcoded segments live in an object store — durable, replicated, and cheap per byte. This is the source of truth for media.

CDN for delivery. A CDN sits in front of blob storage and caches popular segments at edge locations near viewers. Because most views concentrate on a small fraction of videos, the CDN serves the overwhelming majority of bytes from cache — directly attacking the egress cost from the estimates and cutting latency.
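A toy model of one edge location, assuming a least-recently-used eviction policy (real CDNs use more elaborate policies). `EdgeCache` and `fetch_from_origin` are hypothetical names; the point is that only misses reach the origin.

```python
from collections import OrderedDict
from typing import Callable


class EdgeCache:
    """LRU cache of video segments in front of an origin fetch function."""

    def __init__(self, capacity: int, fetch_from_origin: Callable[[str], bytes]):
        self.capacity = capacity
        self.fetch_from_origin = fetch_from_origin
        self.entries = OrderedDict()   # segment key -> bytes, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, segment_key: str) -> bytes:
        if segment_key in self.entries:
            self.entries.move_to_end(segment_key)   # mark as recently used
            self.hits += 1
            return self.entries[segment_key]

        # Cache miss: this is the only path that costs origin egress.
        self.misses += 1
        data = self.fetch_from_origin(segment_key)
        self.entries[segment_key] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)        # evict least recently used
        return data
```

With a skewed popularity distribution the hot segments stay resident, so the hit count grows with traffic while origin fetches stay roughly flat.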

Adaptive bitrate streaming (HLS/DASH). Because video is stored as segmented quality levels described by a manifest, the player can measure available bandwidth and switch quality up or down between segments on the fly. A viewer on a weak connection drops to 480p and keeps watching instead of buffering; a viewer on fiber gets 4K. This is how you deliver on the low-buffering requirement.
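The player-side decision can be sketched as a throughput-based rule: pick the highest rendition whose bitrate fits under a fraction of the measured bandwidth. The bitrate ladder and the 0.8 safety factor are illustrative assumptions; real players also weigh buffer level and recent history.

```python
from typing import List, Tuple

# (label, bitrate in kbps), sorted from lowest to highest.
# These bitrates are assumptions for illustration.
LADDER: List[Tuple[str, int]] = [
    ("240p", 400),
    ("480p", 1_000),
    ("720p", 2_500),
    ("1080p", 5_000),
    ("4K", 16_000),
]


def pick_rendition(
    measured_kbps: float,
    ladder: List[Tuple[str, int]] = LADDER,
    safety: float = 0.8,
) -> str:
    """Return the highest rendition that fits within the bandwidth budget."""
    budget = measured_kbps * safety    # leave headroom for throughput dips
    chosen = ladder[0][0]              # never go below the lowest rendition
    for label, bitrate in ladder:
        if bitrate <= budget:
            chosen = label
    return chosen
```

Calling this before each segment request is what lets quality change mid-playback: a measurement of 1,500 kbps selects 480p, and the next segment can move up as soon as throughput recovers.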

Why segments matter: storing video as short, independently-fetchable segments per resolution is the design decision that makes adaptive bitrate streaming, CDN caching, and parallel transcoding all possible at once. If you remember one structural choice, remember this one.

6. Metadata & view counts

Metadata DB. Title, description, owner, duration, upload status, and the manifest location live in a metadata database, kept separate from the media bytes. It is read-heavy on the watch path (load video page) and backs search; it is modest in size compared to the media and can be cached aggressively.

View-count handling. View counting is write-heavy — every play is a write — and, crucially, it does not need to be exact. Incrementing a single row per view would create a hot-key write bottleneck on popular videos. Instead, views are emitted as events and aggregated asynchronously (buffered and batched, or run through a streaming aggregation pipeline), trading exact precision for approximate, eventually consistent counts. A count that lags by seconds or is off by a rounding margin is completely acceptable; a write storm on every popular video is not.
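A minimal in-memory sketch of the buffer-and-batch option. `ViewCountAggregator`, its threshold, and the dictionary standing in for the durable counter table are all assumptions; a production pipeline would also flush on a timer and survive process restarts.

```python
from collections import Counter
from typing import Dict


class ViewCountAggregator:
    """Buffer view events in memory and write batched increments."""

    def __init__(self, flush_threshold: int = 1000):
        self.flush_threshold = flush_threshold
        self.pending = Counter()           # video_id -> views not yet written
        self.pending_events = 0
        self.store: Dict[str, int] = {}    # stand-in for the durable counter table
        self.store_writes = 0              # how many row updates were issued

    def record_view(self, video_id: str) -> None:
        self.pending[video_id] += 1
        self.pending_events += 1
        if self.pending_events >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # One write per video per batch, regardless of how many views it got.
        for video_id, delta in self.pending.items():
            self.store[video_id] = self.store.get(video_id, 0) + delta
            self.store_writes += 1
        self.pending.clear()
        self.pending_events = 0
```

Between flushes the stored count lags the true count, which is exactly the eventual consistency the paragraph above accepts in exchange for collapsing many writes on a hot key into one.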

7. Bottlenecks & trade-offs

Egress bandwidth (the dominant cost). Lean on the CDN and cache popular segments at the edge; only cache misses hit the origin blob store.
Transcoding throughput. A processing backlog is absorbed by the queue and a worker pool that scales horizontally; per-segment parallelism keeps any single video fast.
Hot-key writes on view counts. Aggregate asynchronously and accept approximate counts rather than incrementing synchronously per view.
Consistency vs availability. Watching is eventually consistent — a newly uploaded video being unavailable for a minute while it transcodes is fine, and view counts lagging is fine. Prioritize availability of the watch path.
Storage cost. Storing five renditions of every video is expensive; lifecycle policies and on-demand transcoding for rarely-watched resolutions are common refinements.

Framework reminder: every system design answer follows the same arc — requirements → estimates → API → high-level design → data model → scale → trade-offs. Keep the system design cheat sheet in mind and narrate which stage you're in.

FAQ

Why is egress bandwidth the dominant cost in designing YouTube?

A video is uploaded once but watched millions of times, so read traffic dwarfs write traffic. Serving petabytes of video out to viewers every day - the egress - is the single largest cost line, which is why a CDN that caches popular segments close to users is non-negotiable.

How does the transcoding pipeline work?

After a chunked upload assembles the raw file in blob storage, an event triggers a distributed transcoding pipeline. The video is split into segments, and each segment is encoded in parallel into multiple resolutions and bitrates (for example 240p through 4K). The output segments and a manifest are written back to blob storage for delivery.

What is adaptive bitrate streaming?

Adaptive bitrate streaming (HLS or DASH) delivers video as short segments at several quality levels described by a manifest. The player measures available bandwidth and switches segment quality up or down on the fly, reducing buffering on slow connections while using higher quality on fast ones.

How do you handle view counts at YouTube scale?

View counting is write-heavy and does not need to be exact. Counts are typically aggregated asynchronously through a streaming pipeline or buffered and batched, and exact precision is traded for approximate, eventually consistent counts. Strong per-view consistency would not scale.

Where are videos actually stored?

The raw upload and all transcoded segments live in a blob/object store, which is durable and cheap. Metadata - title, owner, duration, manifest location - lives in a separate metadata database. The CDN sits in front of blob storage to cache and serve popular segments from the edge.
