Chapter 3 - Attention Is All You Need
In 2017, a team at Google Brain released a paper with a cocky title: “Attention Is All You Need.”
It was a rejection of everything we just learned in Chapter 1. The authors argued that Recurrent Neural Networks (RNNs)—the standard for years—were doing it wrong.
RNNs read text like a human reads a book: word by word, left to right. The Transformer (the architecture proposed in the paper) reads text like a human looking at a painting: it sees the entire image at once.
This chapter explains the mechanism that makes this possible: Self-Attention.
The Cocktail Party Effect
Imagine you are at a loud party. You are trying to listen to your friend, but the room is filled with noise.
RNN Approach: You work your way around the room one conversation at a time, carrying a running summary in your head. By the time you reach your friend, the earliest voices have faded.
Transformer Approach: You stand in the center of the room. You hear everyone simultaneously. However, your brain subconsciously tunes out the background noise and tunes in (attends) only to the voice of your friend.
This is Self-Attention. It allows a word at the beginning of a sentence to “look at” and “listen to” a word at the end of the sentence, instantly, without having to trudge through all the words in between.
The Mechanics: Query, Key, and Value
To make this work mathematically, the Transformer breaks every word into three distinct roles. This is often the most confusing part for beginners, so let’s use a Library Analogy.
Imagine a library database.
- Query (Q) Vector: This is what you are looking for. (e.g., “I need a book on ‘Physics’”).
- Key (K) Vector: This is the label on the book’s spine. (e.g., “Physics”, “Cooking”, “Fiction”).
- Value (V) Vector: This is the content inside the book (the actual text you walk away with).
In the Transformer, every token produces its own Query, Key, and Value for each attention head. (We’ll see in the next chapter that there are multiple attention heads.)
- The Query: “I am the word ‘Bank’. I am looking for context to help me define myself.”
- The Key: “I am the word ‘River’. If you are looking for nature words, I’m your guy.”
- The Value: “Here is the meaning of ‘River’.”
The model compares the Query of the current word against the Key of every other word.
```mermaid
---
title: The Library Analogy. The Query checks the Keys. If they match, it extracts the Value.
---
graph LR
    Q[Query: 'Bank'] --> Match{Compatibility<br>Check}
    subgraph "The Library (Other Words)"
        K1[Key: 'River'] --- V1[Value: 'Nature Meaning']
        K2[Key: 'Money'] --- V2[Value: 'Finance Meaning']
        K3[Key: 'Sandwich'] --- V3[Value: 'Food Meaning']
    end
    K1 --> Match
    K2 --> Match
    K3 --> Match
    Match -- "High Match" --> V1
    Match -- "Medium Match" --> V2
    Match -- "Zero Match" --> V3
    V1 --> Output[Absorb Meaning]
    V2 --> Output
    V3 --> Output
    style Match fill:#f9f,stroke:#333
    style Output fill:#ccffcc,stroke:#333
```
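To make the analogy concrete, here is a minimal NumPy sketch of how each token gets its three roles: the model multiplies the token’s embedding by three learned weight matrices (called W_Q, W_K, and W_V in the paper). The dimensions and random numbers below are toy stand-ins, not real learned values.

```python
import numpy as np

np.random.seed(0)

d_model = 8   # toy embedding size (the paper uses 512)
tokens = ["bank", "river", "money", "sandwich"]

# One embedding vector per token (random stand-ins for learned embeddings).
X = np.random.randn(len(tokens), d_model)

# Three learned weight matrices turn each embedding into its three roles.
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = X @ W_q  # "what am I looking for?"
K = X @ W_k  # "what do I offer?"
V = X @ W_v  # "what content do I carry?"

print(Q.shape, K.shape, V.shape)  # (4, 8) each: one Q, K, V per token
```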
The Compatibility Score (Dot Product)
How does the model know if “Bank” and “River” are compatible? It uses a mathematical operation called the Dot Product.
Think of the Dot Product as a Handshake.
- If two vectors point in the same direction, they have a firm handshake (High Score).
- If they point in different directions, they miss each other (Low Score).
The model calculates this handshake for every possible pair of words.
- Bank vs River: High Score (2.0).
- Bank vs Money: Medium Score (0.2).
- Bank vs Sandwich: Low Score (-2.4).
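Here is the handshake in code: a toy NumPy example with made-up 3-dimensional vectors, chosen so that “bank” and “river” point roughly the same way.

```python
import numpy as np

# Toy 3-dimensional vectors; the directions are made up for illustration.
bank     = np.array([0.9, 0.1, 0.3])
river    = np.array([0.8, 0.2, 0.4])    # points roughly the same way as 'bank'
sandwich = np.array([-0.7, 0.9, -0.5])  # points somewhere else entirely

print(np.dot(bank, river))     # 0.86: a firm handshake
print(np.dot(bank, sandwich))  # -0.69: the handshake misses
```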
Softmax: The Percentage Converter
The raw scores (2.0, 0.2, -2.4) are messy. Some might be negative, some might be huge. We need to turn them into percentages that add up to 100%. (In the real Transformer, the scores are first divided by √d_k, the size of the Key vectors, precisely to keep them from getting huge.)
We pass the scores through a function called Softmax.
- Raw: [2.0, 0.2, -2.4]
- Softmax: [0.85, 0.14, 0.01]
This tells the word “Bank”: “Pay 85% of your attention to ‘River’, 14% to ‘Money’, and ignore ‘Sandwich’.”
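Softmax itself is only a few lines of code. This sketch uses the scaled scores from above; subtracting the maximum first is a standard numerical-stability trick, not something special to Transformers.

```python
import numpy as np

def softmax(scores):
    # Subtract the max first: a standard trick to avoid overflow.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([2.0, 0.2, -2.4])  # Bank's scores for River, Money, Sandwich
print(softmax(scores).round(2))      # [0.85 0.14 0.01] -- sums to 1.0
```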
```mermaid
sequenceDiagram
    participant Q as Query (Bank)
    participant K as Keys (River, Money)
    participant V as Values (Meanings)
    participant Out as Output
    Note over Q, K: Step 1: The Handshake
    Q->>K: "Are we related?" (Dot Product)
    K->>Q: Score: 2.0 (High)
    Note over Q: Step 2: Normalize
    Q->>Q: Softmax(Scores) -> 85%
    Note over Q, V: Step 3: Retrieval
    Q->>V: "Give me 85% of your meaning."
    V->>Out: Transfer Information
```
The Output: The Contextual Smoothie
So, what is the result of all this math?
Before Attention, the word “Bank” was just a generic vector that could mean “financial institution” or “river side.” After Attention, the model creates a Weighted Sum. It takes the vector for “Bank” and mixes in 85% of the essence of “River.”
The result is a new vector that is still “Bank,” but now it is flavored by its context. It is now a “River-Bank.”
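The “smoothie” is literally one line of matrix math: multiply the attention percentages by the Value vectors. The Values below are toy one-hot vectors so you can see the blend directly.

```python
import numpy as np

weights = np.array([0.85, 0.14, 0.01])  # attention weights from Softmax

# Toy Value vectors for River, Money, Sandwich (made up for illustration).
V = np.array([
    [1.0, 0.0, 0.0],  # "nature" meaning
    [0.0, 1.0, 0.0],  # "finance" meaning
    [0.0, 0.0, 1.0],  # "food" meaning
])

# The new representation of "Bank" is a weighted blend of the Values.
context = weights @ V
print(context)  # [0.85 0.14 0.01] -- mostly River-flavored
```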
Why this changed the world
In an RNN, if “River” was 50 words away from “Bank”, the flavor would have washed away by the time the model got there. In Attention, the distance doesn’t matter. The “flavor” is teleported instantly.
Masked Attention: The “No Spoilers” Rule
There is one catch.
When we train a model like GPT (Generative Pre-trained Transformer), we want it to predict the next word.
- Input: “The quick brown”
- Target: “fox”
If we let the model use standard Attention, it can “see” the whole sentence at once. It would see the word “fox” sitting right there in the training data and cheat. It wouldn’t learn anything; it would just copy the answer.
To fix this, we use a Mask.
Think of this like a piece of cardboard covering the right side of the page. The model is allowed to use Attention to look at past words, but it is strictly forbidden from attending to future words.
Mathematically, we set the Attention Score for all future words to Negative Infinity. When we run Softmax on negative infinity, the result is Zero.
```mermaid
---
title: The Mask. The grid shows which words can see each other. The red cells are the future—they are blocked so the model cannot cheat.
---
graph TD
    subgraph MaskMatrix ["Mask Matrix"]
        R1[Word 1] --- C1[See] --- C2[Block] --- C3[Block]
        R2[Word 2] --- C1b[See] --- C2b[See] --- C3b[Block]
        R3[Word 3] --- C1c[See] --- C2c[See] --- C3c[See]
    end
    style C2 fill:#ffaaaa
    style C3 fill:#ffaaaa
    style C3b fill:#ffaaaa
```
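Putting the whole chapter together, here is a minimal sketch of masked self-attention in NumPy: scaled scores, the negative-infinity mask, Softmax, and the weighted sum. As a simplification, it skips the learned W_Q/W_K/W_V projections and uses the raw embeddings for all three roles, to keep the focus on the mask.

```python
import numpy as np

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # the handshake, scaled by sqrt(d_k)

    # Causal mask: position i may only attend to positions <= i.
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[future] = -np.inf          # future words -> negative infinity

    # Softmax row by row; exp(-inf) = 0, so future words get 0% attention.
    exps = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exps / exps.sum(axis=-1, keepdims=True)
    return weights @ V, weights

np.random.seed(0)
X = np.random.randn(4, 8)             # 4 tokens, toy embeddings as Q = K = V
out, w = masked_attention(X, X, X)
print(w.round(2))                     # upper triangle is all zeros: no spoilers
```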
Summary
- RNNs struggle because they read sequentially.
- Transformers read everything at once using Self-Attention.
- Queries, Keys, and Values act like a library retrieval system to find relevant context.
- Softmax determines how much focus to put on each word to extract the meaning.
- Masking ensures the model doesn’t cheat by reading the end of the book before it writes the beginning.
Now that we have a mechanism to understand context, we need to stack these layers on top of each other to build a deep “brain.” That structure is the Transformer Block.
