vLLM Deep Dive: The Engine — How vLLM Turns a Single GPU into a Serving Machine

Posted Jun 12, 2026 Updated Jun 20, 2026

vLLM Deep Dive Series — Part 1: The Engine

By Amulya Bhatia 11 min read

This article is Part 1 of the vLLM Deep Dive Series. Based on vLLM v0.18+ (June 2026). If you’re new to LLM serving concepts, the Generative AI in Depth series covers the first-principles foundations this series builds on.

vLLM is the most widely deployed open-source LLM inference server. It powers everything from single-GPU side projects to multi-node production clusters serving millions of requests per day. But very few people who use it understand why it’s fast.

This three-part series breaks down every major feature in vLLM — not just what they do, but the first-principles reasoning behind them. Part 1 covers the core engine: the scheduling, memory management, and kernel optimizations that make a single GPU serve 10x more requests than a naive implementation.

PagedAttention: The Foundation

Every LLM generates tokens by attending to all previous tokens. The intermediate results of this attention (the Key and Value vectors) are stored in the KV cache. At long contexts or high concurrency, this cache becomes the largest consumer of GPU memory during inference, and can exceed the model weights themselves.

The naive approach allocates a contiguous block of GPU memory for each request’s maximum possible sequence length. A 70B model serving 128K context allocates ~40GB of KV cache per request — even if the actual sequence only uses 2K tokens. This wastes 98% of the allocated memory.
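To make the arithmetic concrete, here is a minimal sketch of the per-request KV cache calculation. The model shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 values) is an assumed 70B-class configuration chosen for illustration; it is not stated in this article, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence: K and V, per layer, per KV head."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# Assumed 70B-class config (illustrative, not taken from the article):
# 80 layers, 8 KV heads (GQA), head_dim 128, FP16 values.
CONFIG = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)
GIB = 1024 ** 3

reserved = kv_cache_bytes(128 * 1024, **CONFIG)  # contiguous max-length reservation
used = kv_cache_bytes(2 * 1024, **CONFIG)        # what a 2K-token request needs
wasted_fraction = 1 - used / reserved

print(f"reserved: {reserved / GIB:.1f} GiB")   # reserved: 40.0 GiB
print(f"used:     {used / GIB:.3f} GiB")       # used:     0.625 GiB
print(f"wasted:   {wasted_fraction:.1%}")      # wasted:   98.4%
```

Under these assumptions each token costs 320 KiB of KV cache, so the gap between "reserved" and "used" is entirely a function of how far the real sequence falls short of the maximum.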

PagedAttention treats KV cache like virtual memory. Instead of one big contiguous allocation, it divides the cache into fixed-size pages (blocks of 16 tokens). Pages are allocated on demand as the sequence grows, and freed when the request completes. Two requests with a shared prefix can point to the same physical pages.
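The paging idea can be sketched in a few lines of plain Python. This is a toy model, not vLLM's block manager: `BlockAllocator` and `Sequence` are hypothetical names, blocks are just integers, and sharing is limited to full prefix blocks.

```python
BLOCK_SIZE = 16  # tokens per page, as in the text


class BlockAllocator:
    """Toy pool of physical KV blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """Maps logical token positions to physical blocks via a block table."""

    def __init__(self, allocator, shared_prefix=()):
        self.allocator = allocator
        self.block_table = [allocator.share(b) for b in shared_prefix]
        self.num_tokens = len(self.block_table) * BLOCK_SIZE

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:  # last block is full (or none yet)
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []


allocator = BlockAllocator(num_blocks=64)
a = Sequence(allocator)
for _ in range(40):  # 40 tokens -> ceil(40 / 16) = 3 blocks
    a.append_token()

# B reuses A's first two full blocks as a shared prefix, then grows on its own.
b = Sequence(allocator, shared_prefix=a.block_table[:2])
for _ in range(5):
    b.append_token()

blocks_in_use = 64 - len(allocator.free)
print(blocks_in_use)  # 4: three for A, plus one private block for B
```

The block table is the analogue of a page table: the attention kernel follows it to find each token's K/V, so the physical blocks never need to be contiguous.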

The result: vLLM wastes only a few percent of KV cache memory (under 4% in the PagedAttention paper's measurements), compared to 60-80% waste with static allocation. This directly translates to more concurrent requests on the same hardware.

Published as Efficient Memory Management for Large Language Model Serving with PagedAttention at SOSP 2023.

Concept covered in depth: The Memory Math covers why the KV cache is the dominant memory consumer at long contexts, how it scales, and the per-layer arithmetic behind it. LLM Serving in Depth covers PagedAttention, continuous batching, and prefix caching from first principles.

Continuous Batching: No Wasted Slots

With static batching, you group N requests together and run them all to completion before starting the next batch. If request A needs 10 tokens and request B needs 500, A finishes at step 10 — but the GPU can’t accept new requests until B finishes at step 500. The GPU isn’t idle (it’s still generating for B), but A’s batch slot is wasted. The GPU could be processing a new request in that slot but isn’t:

gantt
    title Static Batching — A's slot wastes 490 steps
    dateFormat X
    axisFormat %s
    section Batch 1
    Request A (10 tokens)         :a, 0, 10
    A's slot sits empty           :crit, 10, 500
    Request B (500 tokens)        :b, 0, 500
    section Batch 2
    Request C (starts only after B finishes) :c, 500, 700
    Request D                     :d, 500, 600

With continuous batching, the scheduler checks after every single token generation step whether any request has finished. If A finishes at step 10, a new request C immediately fills A’s slot:

gantt
    title Continuous Batching — slots immediately reused
    dateFormat X
    axisFormat %s
    section Slot 1
    Request A             :a, 0, 10
    Request C (fills A's slot) :c, 10, 60
    Request E             :e, 60, 120
    section Slot 2
    Request B             :b, 0, 500
    Request D             :d, 500, 600

The GPU does the same amount of compute per step, but it’s doing useful work in every batch slot instead of wasting slots on completed requests. This is why vLLM achieves 2-4x higher throughput than static batching.
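A small simulation makes the difference countable. This is a sketch that only counts decode steps with two batch slots; the output lengths for A (10) and B (500) come from the example above, while the other lengths and both function names are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, num_slots):
    """Each batch of num_slots requests runs until its longest member finishes."""
    total = 0
    for i in range(0, len(lengths), num_slots):
        total += max(lengths[i:i + num_slots])
    return total


def continuous_batching_steps(lengths, num_slots):
    """After every step, finished requests leave and waiting ones take their slots."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < num_slots:
            running.append(waiting.popleft())
        steps += 1
        running = [r - 1 for r in running if r > 1]
    return steps


# Output lengths for requests A..E; only A=10 and B=500 come from the text.
lengths = [10, 500, 50, 100, 60]
static_steps = static_batching_steps(lengths, num_slots=2)
continuous_steps = continuous_batching_steps(lengths, num_slots=2)
print(static_steps, continuous_steps)  # 660 500
```

With continuous batching the total is bounded by the longest request (B), because the short requests all cycle through the other slot while B is still generating.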

Chunked Prefill: Don’t Block the Queue

Before generating tokens, the model must process the entire input prompt — this is called prefill. For a 10K token prompt, prefill might take 2 seconds of solid GPU compute. During those 2 seconds, ALL other users’ decode steps (token generation) are blocked. Their stream of tokens just freezes.

Instead of processing all 10K tokens at once, chunked prefill breaks it into chunks (e.g., 512 tokens each):

gantt
    title Without Chunked Prefill — B and C freeze for 2s
    dateFormat X
    axisFormat %s
    section GPU
    Prefill A (10K tokens, 2 seconds) :crit, 0, 2000
    Decode B :b, 2000, 2050
    Decode C :c, 2050, 2100

gantt
    title With Chunked Prefill — interleaved, nobody freezes
    dateFormat X
    axisFormat %s
    section GPU
    Prefill A chunk 1   :a1, 0, 100
    Decode B,C          :b1, 100, 130
    Prefill A chunk 2   :a2, 130, 230
    Decode B,C          :b2, 230, 260
    Prefill A chunk 3   :a3, 260, 360
    Decode B,C          :b3, 360, 390

The prefill for A takes slightly longer overall (overhead of switching), but every other user’s latency stays smooth.
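One way to picture the scheduling is as a per-step token budget. The sketch below assumes a variant in which each step first reserves one token for every request that is already decoding and then spends what is left on the next chunk of the pending prompt, so prefill and decode share a step rather than strictly alternating as in the diagram. The function name and the 512-token budget are illustrative assumptions, not vLLM's actual scheduler code.

```python
def schedule_steps(prompt_len, num_decoding, token_budget):
    """Return per-step (prefill_tokens, decode_tokens) until the prompt is prefilled.

    Each step reserves one token per decoding request, then spends the
    remaining budget on the next chunk of the pending prompt.
    """
    if token_budget <= num_decoding:
        raise ValueError("token budget must leave room for prefill")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        decode_tokens = num_decoding
        chunk = min(remaining, token_budget - decode_tokens)
        steps.append((chunk, decode_tokens))
        remaining -= chunk
    return steps


# 10K-token prompt from the text (taken as 10,000 tokens), with two requests
# (B and C) already decoding. The 512-token budget is an illustrative value.
steps = schedule_steps(prompt_len=10_000, num_decoding=2, token_budget=512)
print(len(steps), steps[0], steps[-1])  # 20 (510, 2) (310, 2)
```

B and C each emit a token on all 20 steps, which is the property that matters: their streams never pause while A's prompt is being processed.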

Can you start decoding A early?

After processing chunk 1 (tokens 0-511), those tokens have valid KV entries. But to generate A’s first output token, the model needs the hidden state of the last input token — and that hidden state depends on attention to ALL input tokens. token_9999’s representation = attention(token_9999, all of tokens 0-9998). You can’t compute that until all chunks are processed.

The KV entries from chunk 1 ARE valid and reused (the model doesn’t recompute them in chunk 2), but they’re insufficient to produce any output. Chunked prefill helps other users during A’s prefill — it doesn’t help A itself start generating sooner.

Automatic Prefix Caching (APC)

When the model processes tokens, it computes KV vectors for each token. APC detects when two requests share the same prefix tokens, and reuses the already-computed KV cache instead of recomputing it.

But KV depends on context, not just one token

Correct — a token’s K and V values at layer L depend on the hidden state at layer L, which depends on all preceding tokens via attention. But in a causal (autoregressive) transformer, token at position i can ONLY attend to tokens at positions 0..i (the causal mask blocks everything after):

flowchart LR
    subgraph "Request 1"
        direction LR
        A1["System prompt"] --> B1["Tool defs"] --> C1["User: refactor auth"]
    end
    subgraph "Request 2"
        direction LR
        A2["System prompt"] --> B2["Tool defs"] --> C2["User: fix login bug"]
    end
    style A1 fill:#4CAF50,color:white
    style B1 fill:#4CAF50,color:white
    style A2 fill:#4CAF50,color:white
    style B2 fill:#4CAF50,color:white
    style C1 fill:#FF9800,color:white
    style C2 fill:#2196F3,color:white

At the prefix tokens (green), the causal mask means token 500 in the system prompt only sees tokens 0-500. It has no knowledge that “refactor auth” or “fix login bug” comes later — those are masked. So the hidden states, and therefore the K/V values, for ALL prefix tokens are mathematically identical across both requests. Only at the divergence point (the user message) do the hidden states differ.
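The claim is easy to check numerically. Below is a minimal sketch of single-head causal attention in pure Python, with identity Q/K/V projections and made-up 2-d "embeddings" standing in for the prefix and the two user messages. It shows that the outputs at the prefix positions are identical for both requests, and since those outputs feed the next layer, the next layer's K/V for the prefix are identical too.

```python
import math


def causal_attention(xs):
    """Single-head causal self-attention with identity Q/K/V projections.

    Output i is a softmax-weighted mix of inputs 0..i only.
    """
    outputs = []
    for i, q in enumerate(xs):
        visible = xs[: i + 1]  # causal mask: nothing after position i
        scores = [sum(a * b for a, b in zip(q, k)) for k in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, visible)) / norm
            for d in range(len(q))
        ])
    return outputs


# Toy 2-d "embeddings"; all values are made up for illustration.
prefix = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.9]]  # system prompt + tool defs
request_1 = prefix + [[1.0, 0.0], [0.2, 0.2]]    # "refactor auth"
request_2 = prefix + [[-1.0, 0.7]]               # "fix login bug"

out_1 = causal_attention(request_1)
out_2 = causal_attention(request_2)
prefix_matches = out_1[: len(prefix)] == out_2[: len(prefix)]
print(prefix_matches)  # True
```

The comparison uses exact equality on purpose: for the prefix positions the two runs perform the same floating-point operations on the same inputs, so the results match bit for bit.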

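vLLM divides the KV cache into fixed-size blocks (e.g., 16 tokens). Each block is hashed by its token content together with the hash of the block before it, so a hash identifies the entire prefix up to and including that block, not just the 16 tokens inside it. When a new request arrives, vLLM hashes its prefix blocks and checks: “have I computed KV for this exact block before?” If yes, skip computation and reuse.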
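The lookup can be sketched as follows. This is an illustration rather than vLLM's implementation: it assumes each block's hash also folds in the hash of the block before it (so identical tokens behind different prefixes do not collide), uses SHA-256 over a `repr` of the token IDs for simplicity, and stores only hashes, not KV tensors. The function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids):
    """Hash each full block together with its parent's hash, so equal hashes
    imply an equal prefix, not just equal tokens inside the block."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # ignore the partial tail
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def count_cached_blocks(token_ids, cache):
    """Count leading blocks already in the cache, then register all blocks."""
    hashes = block_hashes(token_ids)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return hits


cache = set()
shared_prefix = list(range(2000))  # stand-in for system prompt + tool tokens
request_1 = shared_prefix + [9001, 9002, 9003]
request_2 = shared_prefix + [7001, 7002]

hits_1 = count_cached_blocks(request_1, cache)
hits_2 = count_cached_blocks(request_2, cache)
print(hits_1, hits_2)  # 0 125
```

A 2000-token prefix is exactly 125 blocks of 16, so the second request skips prefill for all of them and only computes its own trailing tokens.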

	Without APC	With APC
Request 1	Compute full 2000 prefix + msg → 400ms	Compute full 2000 prefix + msg → 400ms (miss)
Request 2	Compute full 2000 prefix + msg → 400ms	Reuse prefix KV, only compute msg → 20ms (hit!)
Request 3	Compute full 2000 prefix + msg → 400ms	Reuse prefix KV, only compute msg → 20ms (hit!)
Total	1200ms	440ms (2.7x faster)

For agentic workloads, APC is transformative. The system prompt + tool definitions + conversation history is identical across every turn. Only the new user message changes. APC means turn 2+ skips almost all prefill work. Enabled by default in vLLM v1.

Concept covered in depth: LLM Serving in Depth covers why prefix caching works — the causal attention property that makes shared prefix KV mathematically identical — along with continuous batching and the full scheduling pipeline.

CUDA Graph Capture: Eliminating Launch Overhead

A single forward pass through a transformer launches hundreds of GPU kernels — matrix multiplications, attention, layer norms. Each kernel launch has CPU overhead (~10μs). At 200 kernels per step, that’s 2ms of pure overhead per token — which can dominate at small batch sizes.

CUDA Graphs record the entire sequence of kernel launches once, then replay it as a single unit:

Without CUDA Graphs:
  200 kernel launches × ~10μs each = 2ms overhead per token

With CUDA Graphs:
  1 graph replay = ~10μs overhead per token (200x reduction)

vLLM supports two modes:

Full graph capture: Records the entire forward pass as one graph. Fastest replay but can’t handle variable batch sizes without re-capturing.
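Piecewise graph capture: Splits the forward pass into sections and records each as a separate graph. In vLLM the split points are typically the attention operations, which run outside the graphs, while the sections between them (norms, projections, MLP) are captured. Slightly more overhead but handles dynamic shapes by composing pieces.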
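Because a captured graph has fixed shapes, a common workaround is to capture graphs for a handful of batch sizes and pad each real batch up to the nearest one. The sketch below shows that selection step only. The list of capture sizes and the `pick_graph` helper are hypothetical; the sizes vLLM actually captures depend on its configuration.

```python
import bisect

# Hypothetical capture sizes; the real list depends on vLLM's configuration.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def pick_graph(batch_size, capture_sizes=CAPTURE_SIZES):
    """Return (bucket, padding) for replay, or None to fall back to eager mode."""
    idx = bisect.bisect_left(capture_sizes, batch_size)
    if idx == len(capture_sizes):
        return None  # larger than any captured graph
    bucket = capture_sizes[idx]
    return bucket, bucket - batch_size


print(pick_graph(5))    # (8, 3): replay the size-8 graph with 3 padded slots
print(pick_graph(300))  # None: no captured graph is large enough
```

The padding wastes a little compute on dummy slots, which is usually a good trade against paying per-kernel launch overhead on every step.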

HIP Graphs are the AMD ROCm equivalent.

Concept covered in depth: CUDA Kernels and FlashAttention covers CUDA Graph Capture in detail — how the capture/replay mechanism works, why static shapes are required, how vLLM uses batch-size buckets to work around this, and how graph capture interacts with torch.compile.

torch.compile: Fusing the Gaps

PyTorch 2.0 introduced torch.compile() — a JIT compiler that traces operations and generates optimized GPU kernels automatically. Instead of running Python → PyTorch → CUDA kernel by kernel, it fuses multiple operations into single kernels:

flowchart LR
    subgraph "Without torch.compile (3 kernels, 3 memory round-trips)"
        X1["Read x from\nGPU memory"] --> K1["Kernel 1:\nlayer_norm"] --> W1["Write result\nto memory"]
        W1 --> R2["Read result"] --> K2["Kernel 2:\nadd residual"] --> W2["Write result"]
        W2 --> R3["Read result"] --> K3["Kernel 3:\ngelu"] --> W3["Write result"]
    end

flowchart LR
    subgraph "With torch.compile (1 fused kernel, 1 round-trip)"
        X["Read x + residual\nfrom GPU memory"] --> K["Fused kernel:\nnorm + add + gelu"] --> W["Write result"]
    end

Since GPU inference is often memory-bandwidth-bound, cutting memory reads and writes by fusing ops can yield a 10-30% speedup.
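A rough traffic model shows where the saving comes from. This sketch only counts whole-tensor reads and writes for the three-op example in the diagrams (layer_norm, add residual, gelu); it ignores norm weights, caches, and any extra passes a real layer norm kernel might make. The activation shape is an arbitrary illustrative choice.

```python
def unfused_traffic(num_elements, bytes_per_element=2):
    """layer_norm, add residual, gelu as three kernels; each reads its inputs
    from GPU memory and writes its output back."""
    reads = 1 + 2 + 1  # norm reads x; add reads norm output + residual; gelu reads sum
    writes = 3         # one output tensor per kernel
    return (reads + writes) * num_elements * bytes_per_element


def fused_traffic(num_elements, bytes_per_element=2):
    """One kernel reads x and the residual once and writes the final result."""
    return (2 + 1) * num_elements * bytes_per_element


# Illustrative activation shape: 256 tokens x 8192 hidden, FP16.
n = 256 * 8192
ratio = unfused_traffic(n) / fused_traffic(n)
print(f"{ratio:.2f}x less memory traffic when fused")  # 2.33x
```

The per-op traffic reduction is larger than the end-to-end speedup because these elementwise ops are only part of each step; the large matrix multiplications and attention are unaffected by this particular fusion.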

torch.compile does NOT replace FlashAttention. Attention already has hand-written optimized kernels. torch.compile handles the other ~60% — residual connections, normalizations, MLP activations, position encodings — that would otherwise be separate small kernels.

Quantization: Trading Precision for Capacity

vLLM supports one of the widest ranges of quantization methods of any inference server:

Weight Quantization

Method	Bits	Notes
FP8 (E4M3/E5M2)	8-bit float	Best speed/quality on Hopper+. Native support.
MXFP8 / MXFP4	8/4-bit float	Microscaling formats
NVFP4	4-bit float	NVIDIA’s FP4 for Blackwell
INT8 / INT4	8/4-bit int	Per-channel/per-tensor or weight-only
GPTQ	3/4/8-bit	Post-training, widely available models
AWQ	4-bit	Activation-aware, fast inference
GGUF	2-8 bit	Compatible with llama.cpp quantizations
compressed-tensors	Various	Neural Magic’s format
TorchAO / AutoRound	Various	PyTorch-native / Intel PTQ

KV Cache Quantization

Model weights are static, so you quantize them once. But the KV cache grows dynamically and can dominate GPU memory at long contexts. A 70B model’s weights use ~35GB when quantized to 4 bits (~140GB in FP16), but its KV cache for 128K context can use 40GB+ in FP16.

FP8 KV cache — halves KV memory with <1% quality loss. Production-ready.
TurboQuant — pushes to 4-bit or 2-bit KV cache. Still under active research — quality degrades on needle-in-a-haystack tasks at extreme compression, and the optimal configuration is model-dependent.
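The memory effect of each bit width is simple arithmetic, sketched below. The weight calculation ignores quantization scales and zero points, and the KV calculation assumes an illustrative 70B-class shape (80 layers, 8 KV heads, head dimension 128) that is not stated in this article. Among other things it shows which weight precision a ~35GB figure for a 70B model corresponds to.

```python
GB = 1e9
GIB = 1024 ** 3


def weight_gb(num_params, bits):
    """Weight memory in GB, ignoring quantization scales and zero points."""
    return num_params * bits / 8 / GB


def kv_cache_gib(num_tokens, bits, num_layers=80, num_kv_heads=8, head_dim=128):
    """KV cache in GiB for an assumed 70B-class GQA config (illustrative)."""
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens  # K and V
    return values * bits / 8 / GIB


weights = {bits: weight_gb(70e9, bits) for bits in (16, 8, 4)}
kv_128k = {bits: kv_cache_gib(128 * 1024, bits) for bits in (16, 8, 4, 2)}

print(weights)  # {16: 140.0, 8: 70.0, 4: 35.0}
print(kv_128k)  # {16: 40.0, 8: 20.0, 4: 10.0, 2: 5.0}
```

With 4-bit weights and an FP16 cache, a single 128K-context request already needs more memory for KV than for the model, which is why KV cache quantization matters independently of weight quantization.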

Concept covered in depth: A Quantization Primer explains GPTQ, AWQ, FP8, and KV cache quantization from first principles — and how model architecture determines which methods work well.

Attention Kernels: Why There Are Five

The attention mechanism, softmax(Q·K^T/√d)·V, is one of the most compute-intensive operations in a transformer, and its cost grows quadratically with sequence length. A “kernel” is a specific GPU implementation of this math.

The naive implementation materializes a [seq_len × seq_len] attention matrix for every head. For 8K context that is 8192² ≈ 67M scores, or 268MB per head in FP32. FlashAttention avoids this entirely by computing attention in tiles that fit in GPU SRAM (~20MB on-chip). The full matrix never exists in main memory, which makes it 2-4x faster and cuts memory from O(n²) to O(n).
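The trick that makes tiling possible is the online softmax: scores are processed one tile at a time while a running maximum and running denominator are kept, and earlier partial results are rescaled whenever the maximum changes. The sketch below shows this for a single query vector in pure Python. It illustrates the numerical idea only; the real kernels also tile over queries, run in on-chip memory, and are written in CUDA. The input values are made up.

```python
import math


def naive_attention(q, keys, values):
    """Materialize every score for the query, then softmax and mix the values."""
    scale = 1 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, tile_size):
    """Process K/V one tile at a time, keeping only a running max, a running
    softmax denominator, and an unnormalized output accumulator."""
    scale = 1 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # correct earlier partial sums
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_tile))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]


# Made-up inputs: one 4-d query, ten keys, ten 3-d values.
q = [0.3, -1.2, 0.8, 0.5]
keys = [[math.sin(i + j) for j in range(4)] for i in range(10)]
values = [[math.cos(i * j) for j in range(3)] for i in range(10)]

expected = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile_size=4)
max_diff = max(abs(a - b) for a, b in zip(expected, tiled))
```

The tiled version never holds more than `tile_size` scores at once, yet `max_diff` is at the level of floating-point rounding, which is why the full score matrix never needs to exist.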

vLLM has multiple kernel implementations because different situations need different specializations:

Kernel	What Makes It Special
FlashAttention	Best general-purpose. Supports GQA (broadcasting shared K/V to query heads), sliding window (skipping tiles outside the window), and various head dimensions (SRAM tile sizes must match).
FlashInfer	Optimized for serving. Natively understands PagedAttention’s non-contiguous memory layout — no gather/scatter needed.
TRTLLM-GEN	NVIDIA’s hand-tuned assembly for H100/B200. Fastest on NVIDIA hardware, not portable.
FlashMLA	For DeepSeek’s Multi-head Latent Attention (MLA), where K/V are compressed into a latent space with completely different tensor shapes.
Triton	Written in Triton. Runs on both NVIDIA and AMD GPUs. Used as the ROCm fallback.

vLLM auto-selects the best kernel based on your hardware and model.

Concept covered in depth: CUDA Kernels and FlashAttention explains why memory bandwidth — not compute — is the bottleneck, and how FlashAttention’s tiling algorithm eliminates the O(n²) attention matrix.

What’s Next

Part 2 covers the performance multipliers: speculative decoding, all five parallelism strategies, disaggregated serving, and the hardware support matrix. Part 3 covers the 60+ supported model architectures, serving features, and the cutting-edge 2026 additions.


This post is licensed under CC BY 4.0 by the author.