Generative AI in Depth (15 articles)
- Which LLM Serving Framework Should You Use? A Practical Comparison
- LLM Evaluation in Depth: Benchmarks, Contamination, and What Actually Matters
- LLM Serving in Depth: Batching, Scheduling, and Parallelism
- Context Length Scaling: RoPE, YaRN, Ring Attention, and the Cost of Long Context
- Mixture of Experts: Routing, Sparse Activation, and Why MoE Dominates at Scale
- Speculative Decoding: Generating Multiple Tokens Per Step
- CUDA Kernels and FlashAttention: Why Memory Bandwidth Is the Bottleneck
- A Quantization Primer: Formats, Architecture Sensitivity, and a Gemma 4 Case Study
- Knowledge Distillation: Making Smaller Models That Punch Above Their Weight
- Fine-Tuning and Adaptation: LoRA, QLoRA, RLHF, and DPO in Depth
- Training vs Inference: Why the Same Model Costs 10× More to Train
- The Memory Math: What Fits on a GPU?
- Attention Mechanisms and KV Cache: From First Principles to Gemma 4's Architecture
- Inside LLM Inference: Every Calculation from Text to Token using Gemma 4 12B
- Tokenisation in Depth: BPE, SentencePiece, Vocabularies, and Why Tokens Are Not Words