Inference (8)
- vLLM Deep Dive Part 2: Scaling — Speculative Decoding, Parallelism, and Disaggregated Serving
- Which LLM Serving Framework Should You Use? A Practical Comparison
- LLM Serving in Depth: Batching, Scheduling, and Parallelism
- Speculative Decoding: Generating Multiple Tokens Per Step
- Training vs Inference: Why the Same Model Costs 10× More to Train
- The Memory Math: What Fits on a GPU?
- Inside LLM Inference: Every Calculation from Text to Token using Gemma 4 12B
- Chapter 7 - Efficient Inference and Quantization