Chapter 1 - The Pre-Generative Era
This article is part of my book Generative AI Handbook. For any issues around this book, or if you'd like the PDF/EPUB version, contact me on LinkedIn.
Before the rise of the Transformer and the Large Language Models (LLMs) that define the current era of AI, the field of Natural Language Processing (NLP) struggled with a fundamental problem: Memory.
To generate coherent text, a system must understand context. It needs to remember that a sentence starting with “The heavy rain…” might end with “…caused a flood,” even if those words are separated by a dozen adjectives.
For decades, AI was like a person with severe short-term amnesia. It could understand the word it was looking at right now, but it struggled to connect it to what it read ten seconds ago. This chapter explores the evolution of “teaching machines to read,” from simple statistical guessing games to the neural networks that ruled the 2010s, and explains exactly why they eventually hit a wall.
The Statistical Guessing Game (N-Grams)
In the early days, language generation wasn’t about “reasoning”; it was a game of probability. The goal was simple: look at the last few words and guess what comes next based on how often that sequence appears in books or articles.
This approach is called the N-Gram model (where N is the number of words the computer looks at).
- Unigram (1 word): The computer ignores context entirely and just plays the overall odds. Ask it for the next word and it picks something like “the” or “is” simply because those are the most common words in English.
- Bigram (2 words): It looks at the one word before the cursor. If you type “New,” it guesses “York.”
- Trigram (3 words): It looks at the two words before the cursor. If you type “The United,” it guesses “States.”
```mermaid
---
title: N-Gram Context Windows. As 'N' increases, the model gets smarter because it sees more history, but it requires exponentially more data.
---
graph LR
subgraph Trigram ["Trigram Model (N=3)"]
direction LR
T1(The) --> T2(quick)
T2 --> T3{???}
T3 --"Prob: High"--> T4(brown)
T3 --"Prob: Low"--> T5(red)
end
subgraph Bigram ["Bigram Model (N=2)"]
direction LR
B1(quick) --> B2{???}
B2 --"Maybe?"--> B3(brown)
B2 --"Maybe?"--> B4(run)
end
```
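To make this concrete, here is a minimal sketch of a trigram predictor in plain Python. The tiny corpus and the `predict_next` helper are invented purely for illustration; a real system would count over millions of sentences.

```python
from collections import Counter, defaultdict

# Toy corpus -- in practice you would count over an enormous amount of text.
corpus = "the quick brown fox jumps over the lazy dog . the quick brown cat sleeps .".split()

# Count every (word1, word2) -> word3 occurrence (a trigram model).
trigram_counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2)][w3] += 1

def predict_next(w1, w2):
    """Return the most frequent word that followed (w1, w2) in the corpus."""
    followers = trigram_counts.get((w1, w2))
    if not followers:
        return None  # the sparsity trap: unseen context means no prediction at all
    return followers.most_common(1)[0][0]

print(predict_next("the", "quick"))   # 'brown' -- this sequence was seen twice
print(predict_next("the", "purple"))  # None   -- never seen, so its probability is zero
```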
The “Sparsity” Trap
You might ask: “Why not just make ‘N’ bigger? Let’s make a 50-Gram model so it can remember a whole paragraph!”
The problem is the Curse of Dimensionality. Language is incredibly creative. If you look for a specific sequence of 50 words, you will likely find zero examples of it in human history.
The Curse of Dimensionality
Sparsity is the killer of N-Grams. If you increase N to capture more context, the number of possible word combinations explodes, and most of them will never appear in your training data. When the model encounters a sequence it has never seen before, its probability estimate hits zero and it has no idea what to write next. Smoothing and back-off tricks can paper over some of the gaps, but they cannot conjure up statistics that were never observed.
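The scale of the problem is easy to check with back-of-the-envelope arithmetic. The vocabulary and corpus sizes below are assumed round numbers, not measurements from any real dataset.

```python
# How many distinct n-grams could exist vs. how many we could ever observe?
vocab_size = 50_000              # assumed vocabulary size
corpus_tokens = 1_000_000_000    # assumed one-billion-word training corpus

for n in (2, 3, 5, 10):
    possible = vocab_size ** n            # every combination the model might need
    observable = corpus_tokens - n + 1    # upper bound on n-grams the corpus contains
    print(f"{n}-grams: {possible:.1e} possible, at most {observable:.1e} observable")
```

Already at N=5 there are roughly 3 × 10^23 possible sequences but at most a billion observations, so almost every grammatical 5-gram has a count of exactly zero.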
Recurrent Neural Networks (RNNs)
To solve the sparsity problem, researchers moved away from counting exact word matches and invented the Recurrent Neural Network (RNN).
Imagine reading a book. You don’t read each word in isolation; you keep a running “mental summary” of the plot in your head as you move from sentence to sentence. The RNN was designed to mimic this specific behavior.
Unlike a standard feed-forward network, which processes each input in a single straight pass, an RNN loops back on itself. It maintains a Hidden State, effectively a "Short-Term Memory" vector.
- It reads Word A.
- It updates its memory.
- It passes that updated memory to help it understand Word B.
```mermaid
sequenceDiagram
participant Input as Inputs (x)
participant Hidden as Hidden State (h)
participant Output as Outputs (y)
Note over Input, Output: Time Step 1
Input->>Hidden: Read "The"
Hidden->>Hidden: Update Memory
Hidden->>Output: Predict "quick"
Note over Input, Output: Time Step 2
Input->>Hidden: Read "quick"
Hidden-->>Hidden: Recall "The"
Hidden->>Hidden: Update Memory
Hidden->>Output: Predict "brown"
Note over Input, Output: Time Step 3
Input->>Hidden: Read "brown"
Hidden-->>Hidden: Recall "The quick"
Hidden->>Hidden: Update Memory
Hidden->>Output: Predict "fox"
```
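The loop in the diagram is only a few lines of code. Below is a minimal sketch of a vanilla RNN step in NumPy; the sizes and random weights are placeholders standing in for parameters a real model would learn from data.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, embed_size = 8, 4                                  # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_size))    # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # hidden -> hidden (the loop)
b_h  = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: mix the current word vector with the previous memory."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a three-word "sentence" of random embeddings.
h = np.zeros(hidden_size)                 # empty memory before the first word
for x_t in rng.normal(size=(3, embed_size)):
    h = rnn_step(x_t, h)                  # the same weights are reused at every step
print(h)                                  # final hidden state: a summary of the whole sequence
```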
The “Game of Telephone” Problem
While RNNs were a huge leap forward, they suffered from a critical flaw known as the Vanishing Gradient.
Think of this like the children’s game “Telephone” (or Chinese Whispers). You whisper a complex story to the first person, who whispers it to the second, and so on. By the time the story reaches the 20th person, the details are gone.
Something similar happens in an RNN, in two ways. As the "memory" is passed from word to word, it is overwritten a little at every step, so the signal from the beginning of the sentence gets weaker and weaker; by the time the model processes the 50th word, it has largely forgotten the subject of the sentence. Worse, the same fading happens in reverse during training: the error signal shrinks at every step as it is propagated backwards, so the model never learns how to preserve that early context in the first place.
```mermaid
---
title: The Vanishing Gradient Cliff. As the AI tries to learn from its errors, the signal fades out before it can correct the start of the sequence.
---
graph LR
subgraph "Forward Pass (Reading)"
F1((Start)) --> F2((Middle)) --> F3((End))
end
subgraph "Backward Pass (Learning)"
B4[Error Found] == "Strong Signal" ==> B3((End))
B3 == "Weak Signal" ==> B2((Middle))
B2 == "Signal Lost" ==> B1((Start))
B1 -.-> B0[Start remains uncorrected]
end
F3 -.-> B4
style B0 fill:#ffcccc,stroke:#333
style B4 fill:#ccffcc,stroke:#333
```
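You can watch the signal fade with a short NumPy experiment. Backpropagation through time multiplies the error signal by (roughly) the same recurrent weight matrix at every step; the matrix below is random and rescaled purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 8

# Random recurrent weights, rescaled so their spectral radius is 0.5 (below 1).
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hh *= 0.5 / max(abs(np.linalg.eigvals(W_hh)))

grad = np.ones(hidden_size)        # error signal arriving at the final time step
for step in range(1, 51):
    grad = W_hh.T @ grad           # one step of backpropagation through time
    if step % 10 == 0:
        print(f"step {step:2d}: |gradient| = {np.linalg.norm(grad):.2e}")

# The norm collapses towards zero, so the earliest words receive almost no learning signal.
# (In a real RNN the tanh derivatives, which are at most 1, shrink it even further.)
```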
Long Short-Term Memory (LSTM)
In 1997, Sepp Hochreiter and Jürgen Schmidhuber introduced the Long Short-Term Memory (LSTM) network to solve the "Telephone" problem.
If the standard RNN was a person trying to remember everything at once, the LSTM is like a CEO with a highly efficient Executive Assistant. The Assistant (the architecture) has a specific job: Decide what is important enough to write down, and what is safe to delete.
The LSTM introduced a “Superhighway” for memory (technically called the Cell State) that runs through the entire sequence. It uses mechanisms called Gates to manage this highway:
- Forget Gate: “The subject changed from ‘He’ to ‘They’. Forget the singular verb form.”
- Input Gate: “Here is a new name. Write it down.”
- Output Gate: “Based on what we know, predict the next word.”
```mermaid
---
title: Inside an LSTM Cell. The 'X' nodes apply the gates (element-wise multiplication) and the 'Add' node carefully merges new information into the long-term memory.
---
flowchart LR
subgraph LSTM_Cell [LSTM Cell at Time t]
direction TB
%% Inputs
Xt((Input Word))
Ht_prev((Short Term Mem))
Ct_prev((Long Term Mem))
%% Gates Logic
Concat[Combine Info]
Xt --> Concat
Ht_prev --> Concat
%% Forget Gate
Concat --"Trash?"--> SigForget[Forget Gate]
SigForget --> MultForget((X))
Ct_prev --> MultForget
%% Input Gate
Concat --"Save?"--> SigInput[Input Gate]
Concat --> TanhInput[New Candidate Info]
SigInput --> MultInput((X))
TanhInput --> MultInput
%% Update Cell State
MultForget --> AddNew((Add))
MultInput --> AddNew
AddNew --> Ct_new((New Long Term Mem))
%% Output Gate
Concat --"Output?"--> SigOutput[Output Gate]
AddNew --> TanhOutput[Tanh]
SigOutput --> MultOutput((X))
TanhOutput --> MultOutput
MultOutput --> Ht_new((New Short Term Mem))
end
style Ct_prev fill:#f9f,stroke:#333,stroke-width:2px
style Ct_new fill:#f9f,stroke:#333,stroke-width:2px
style Ht_new fill:#bbf,stroke:#333,stroke-width:2px
```
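Written out in code, the diagram above reduces to a handful of vector equations. The following is a sketch of the standard LSTM cell update in NumPy, with random placeholder weights and made-up sizes rather than anything learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 8, 4                       # illustrative sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on [previous short-term memory, current word].
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(hidden)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])       # "Combine Info" in the diagram
    f = sigmoid(W_f @ z + b_f)              # Forget gate: what to erase from long-term memory
    i = sigmoid(W_i @ z + b_i)              # Input gate: how much new info to store
    c_tilde = np.tanh(W_c @ z + b_c)        # Candidate information to store
    c_new = f * c_prev + i * c_tilde        # The "superhighway": mostly additive update
    o = sigmoid(W_o @ z + b_o)              # Output gate: what to reveal right now
    h_new = o * np.tanh(c_new)              # New short-term memory / output
    return h_new, c_new

h = c = np.zeros(hidden)
for x_t in rng.normal(size=(3, inputs)):    # a three-word "sentence" of random embeddings
    h, c = lstm_step(x_t, h, c)
```

The crucial line is `c_new = f * c_prev + i * c_tilde`: the long-term memory is updated by gating and adding, not by being squashed through a fresh matrix multiply at every step.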
Why LSTMs worked
The key innovation of the LSTM is the cell state's largely additive update: because new information is added to the "superhighway" rather than the whole memory being squashed and rewritten at every step, gradients (learning signals) can flow backwards through the network without vanishing. This allowed models to finally remember context over long paragraphs, powering a few generations of systems like Google Translate and Siri.
The Encoder-Decoder Bottleneck
With LSTMs keeping memories alive over longer spans, researchers built Seq2Seq (Sequence-to-Sequence) models. This is the architecture that powered neural machine translation (e.g., English to Spanish).
- Encoder: Reads the English sentence one word at a time and compresses it into a single "Context Vector."
- Decoder: Takes that Context Vector and unpacks it to write the Spanish sentence.
However, this approach hit a hard limit.
The “Index Card” Problem
Imagine I ask you to read a complex chapter of a textbook. However, I tell you: “You cannot keep the book. You must summarize the entire chapter—every fact, date, and name—onto a single 3x5 index card.”
Then, I take the book away, give that index card to a friend, and say: “Rewrite the chapter perfectly using only this card.”
No matter how smart you are, you cannot compress that much information into a fixed-size card without losing details. This is exactly what the Encoder-Decoder did. It tried to shove the meaning of a long sentence into a single, fixed-size vector.
```mermaid
---
title: The Seq2Seq Bottleneck. The entire meaning of the input sentence must be compressed into the single fixed-size Context Vector (Z). This causes data loss.
---
graph LR
subgraph Encoder
E1(Input: 'How') --> H1[Hidden 1]
E2(Input: 'are') --> H2[Hidden 2]
H1 --> H2
E3(Input: 'you') --> H3[Hidden 3]
H2 --> H3
end
H3 == "The Index Card" ==> Bottleneck((Z))
subgraph Decoder
Bottleneck --> D1[Hidden 1]
D1 --> O1(Output: 'Como')
D1 --> D2[Hidden 2]
D2 --> O2(Output: 'estas')
end
style Bottleneck fill:#ff0000,stroke:#333,stroke-width:4px,color:#fff
```
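In code, the bottleneck is simply that the decoder's only view of the source sentence is the encoder's final hidden state. The sketch below uses a plain RNN encoder with random weights; the names and sizes are illustrative, not a real translation model.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, embed = 8, 4                        # illustrative sizes

W_enc_x = rng.normal(scale=0.1, size=(hidden, embed))
W_enc_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_dec_h = rng.normal(scale=0.1, size=(hidden, hidden))

def encode(source_embeddings):
    """Compress the WHOLE source sentence into one fixed-size vector (the 'index card')."""
    h = np.zeros(hidden)
    for x_t in source_embeddings:
        h = np.tanh(W_enc_x @ x_t + W_enc_h @ h)
    return h                                # same size whether the input had 3 words or 300

def decode(context, steps):
    """Generate `steps` output states using only the fixed-size context vector."""
    h, outputs = context, []
    for _ in range(steps):
        h = np.tanh(W_dec_h @ h)            # no way to look back at individual source words
        outputs.append(h)
    return outputs

short_sentence = rng.normal(size=(3, embed))
long_sentence  = rng.normal(size=(300, embed))
print(encode(short_sentence).shape, encode(long_sentence).shape)  # both (8,) -- the bottleneck
translation_states = decode(encode(long_sentence), steps=5)
```

Whether the input has 3 words or 300, everything the decoder will ever know about it has to fit into those eight numbers: the "index card."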
The Speed Limit
Furthermore, RNNs and LSTMs are Sequential. To understand the 100th word, you must first compute the hidden states for the 99 words before it, one by one. You can process many separate sentences in a batch, but within a single sentence the steps cannot be parallelized. It's like a single-lane road: no matter how much hardware you buy, the cars must go through one after another.
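The contrast with position-independent work is easy to see in code. In this illustrative NumPy snippet, projecting every word is one big parallel matrix multiply, while the recurrent update is a loop whose step t cannot begin until step t-1 has finished.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1_000, 64
X = rng.normal(size=(T, d))                  # a sequence of 1,000 word vectors
W = rng.normal(scale=0.05, size=(d, d))

# Position-independent work parallelizes: one matrix multiply handles all 1,000 positions.
projected = X @ W

# Recurrent work does not: each step depends on the previous step's result.
h = np.zeros(d)
for t in range(T):
    h = np.tanh(W @ X[t] + W @ h)            # must wait for h from step t-1
```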
This pair of limitations, the inability to "look back" at specific words in the original sentence and the inability to process the sequence in parallel, set the stage for the greatest breakthrough in modern AI: Attention. But before we go there, it's important to understand Tokenization and Embeddings, which we will look at next.
