Engineering · Apr 10, 2026

How we achieve sub-100ms TTFT on Llama 3 with vLLM

Time to first token (TTFT) is the single most important latency metric for interactive LLM applications. Users feel it immediately — anything above 300ms starts feeling sluggish. Our goal from day one was sub-100ms TTFT for Llama 3 8B, and we hit it. Here's how.

The baseline problem

Out of the box, a naive vLLM deployment of Llama 3 8B on an A100 80GB gives you roughly 150–250ms TTFT depending on batch size and prompt length. That's acceptable for batch workloads, but it kills the feel of real-time applications like chatbots, copilots, and code assistants.
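For reference, TTFT here means the time between sending a request and receiving the first streamed token. A minimal sketch of measuring it against any token stream is below. The `fake_stream` generator is a hypothetical stand-in for a real streaming client, and the injectable `clock` parameter is an assumption added only to make the function easy to test.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return seconds until the first token arrives, or None for an empty stream."""
    start = clock()
    for _token in stream:
        # Stop at the first token: everything after it is decode time, not TTFT.
        return clock() - start
    return None


def fake_stream(first_token_delay_s: float, n_tokens: int = 3) -> Iterator[str]:
    """Hypothetical streaming client: waits (simulated prefill), then yields tokens."""
    time.sleep(first_token_delay_s)
    for i in range(n_tokens):
        yield f"tok{i}"


ttft = measure_ttft(fake_stream(0.05))
print(f"TTFT: {ttft * 1000:.1f} ms")
```

In a real deployment the stream would be the response iterator of a streaming completion request, and the timer should start just before the request is sent so that network and queueing time are included.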

The problem has three layers: attention and KV cache efficiency, scheduler overhead, and, for larger models, how the model is split across GPUs. We had to attack all three simultaneously.

Layer 1: FlashAttention-2 + PagedAttention

We run vLLM with FlashAttention-2 enabled, which reduces the memory I/O bottleneck in the attention computation by roughly 2x. Combined with vLLM's PagedAttention, which manages the KV cache in fixed-size blocks much like virtual memory pages, this eliminates the primary cause of TTFT spikes under concurrent load: cache fragmentation.

The key insight is that KV cache eviction is the hidden latency killer at scale. When pages get evicted under memory pressure and need to be recomputed, TTFT can spike 3–5x. PagedAttention's paging strategy lets us serve 4–6x more concurrent requests on the same GPU before we hit that ceiling.
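To make the paging idea concrete, here is a toy sketch of block-based KV cache allocation. It is not vLLM's implementation: the class name, the block size of 16 tokens, and the method names are all illustrative assumptions. It only shows why fixed-size blocks avoid fragmentation and where the "ceiling" comes from: a request is refused only when the pool of free blocks runs out.

```python
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class ToyBlockAllocator:
    """Toy paged KV cache: a pool of fixed-size blocks plus per-sequence block tables."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_tokens(self, seq_id: str, n: int) -> bool:
        """Reserve room for n more tokens; return False if the pool is exhausted."""
        table = self.tables.get(seq_id, [])
        new_length = self.lengths.get(seq_id, 0) + n
        blocks_required = -(-new_length // BLOCK_SIZE)  # ceiling division
        needed = blocks_required - len(table)
        if needed > len(self.free):
            return False  # a real engine would preempt or queue here
        # Any free block will do, so there is no need for contiguous memory.
        self.tables[seq_id] = table + [self.free.pop() for _ in range(needed)]
        self.lengths[seq_id] = new_length
        return True

    def release(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


pool = ToyBlockAllocator(num_blocks=4)
print(pool.append_tokens("a", 20))  # fits in 2 blocks
print(pool.append_tokens("b", 40))  # would need 3 blocks, only 2 are free
```

Because a sequence never holds more than one partially filled block, wasted memory per sequence is bounded by the block size rather than by the maximum sequence length.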

Layer 2: Continuous batching with tight scheduler tuning

vLLM's continuous batching scheduler has several knobs that most deployments leave at defaults. We tuned three in particular:

max_num_batched_tokens: We cap this at 8192 for interactive workloads. Higher values improve throughput but increase TTFT variance as longer prefill passes block new requests from being scheduled.
max_num_seqs: We run 128 concurrent sequences rather than the default 256. This reduces scheduler overhead and keeps TTFT consistent under load.
preemption_mode: We use recompute instead of swap for our A100 instances. Swapping to CPU adds 30–80ms of latency on eviction — recompute is faster for short sequences.
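The three settings above could be collected into a launch command along the following lines. This is a sketch under assumptions: the model identifier is a guess at which Llama 3 8B checkpoint is served, the flag names follow vLLM's convention of mirroring engine argument names, and preemption_mode in particular is not available in every vLLM version, so check `vllm serve --help` for your install before relying on it.

```python
from typing import List, Mapping

# Values taken from the tuning described above.
INTERACTIVE_SCHEDULER_SETTINGS = {
    "max_num_batched_tokens": 8192,
    "max_num_seqs": 128,
    "preemption_mode": "recompute",  # assumption: supported by your vLLM version
}

# Assumption: the exact checkpoint is not named in this post.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"


def to_cli_flags(settings: Mapping[str, object]) -> List[str]:
    """Turn engine-argument names into CLI flags (underscores become dashes)."""
    flags: List[str] = []
    for name, value in settings.items():
        flags.append("--" + name.replace("_", "-"))
        flags.append(str(value))
    return flags


command = ["vllm", "serve", MODEL_ID] + to_cli_flags(INTERACTIVE_SCHEDULER_SETTINGS)
print(" ".join(command))
```

Keeping the settings in one dictionary makes it easy to maintain a separate throughput-oriented profile with larger limits for batch workloads.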

Layer 3: Tensor parallelism across 2 A100s for 70B

For Llama 3 70B, a single A100 80GB cannot hold the model in FP16: the weights alone are ~140GB. We use tensor parallelism across 2× A100 80GB GPUs with tensor_parallel_size=2. The inter-GPU communication overhead is only ~3ms on NVLink, which is well within our budget.

For 8B, we run on a single A100 with no tensor parallelism — splitting across GPUs for a model this size adds more communication latency than it saves.
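The memory arithmetic behind this choice is simple enough to check by hand. The sketch below uses nominal parameter counts (70 billion and 8 billion) and 2 bytes per parameter for FP16. It counts weights only, so the real footprint is higher once KV cache and activations are included. The function name is illustrative.

```python
def weight_memory_gb(
    num_params: float,
    bytes_per_param: int = 2,  # FP16
    tensor_parallel_size: int = 1,
) -> float:
    """Approximate per-GPU weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bytes_per_param / tensor_parallel_size / 1e9


print(weight_memory_gb(70e9))                          # 140.0, too large for one 80GB GPU
print(weight_memory_gb(70e9, tensor_parallel_size=2))  # 70.0 per GPU
print(weight_memory_gb(8e9))                           # 16.0, fits on one GPU with room for KV cache
```

With two GPUs, each holds about 70GB of weights, which leaves roughly 10GB per 80GB GPU for KV cache and activations. That is part of why the 70B results are reported at lower concurrency than 8B.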

Results

After applying all three layers:

Llama 3 8B: p50 TTFT = 42ms, p99 = 87ms at 50 concurrent requests
Llama 3 70B: p50 TTFT = 68ms, p99 = 114ms at 20 concurrent requests
Throughput increase: ~3.2x over baseline for the same TTFT budget

The most surprising finding: scheduler tuning (layer 2) had more impact on TTFT consistency than FlashAttention-2 alone. FA2 reduces mean latency; scheduler tuning collapses the tail.
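For readers reproducing these measurements, p50 and p99 can be computed from raw TTFT samples with a nearest-rank percentile, sketched below. The sample data is synthetic (the integers 1 to 100) and is not from our benchmarks. Other percentile definitions that interpolate between samples will give slightly different values on small sample sets.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic TTFT samples in milliseconds, for illustration only.
samples_ms = [float(x) for x in range(1, 101)]
print(percentile(samples_ms, 50))  # 50.0
print(percentile(samples_ms, 99))  # 99.0
```

The gap between p50 and p99 is the "tail" referred to above: a change can leave p50 untouched while moving p99 substantially, which is why both are worth tracking.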

What's next

We're actively testing speculative decoding for the 70B model — early results suggest we can get p50 TTFT down to ~45ms without touching the accuracy profile. We'll share those results in a follow-up post.

If you're building latency-sensitive LLM applications, try Cloudach free. You'll hit these numbers on your first deploy.