LLM Algorithm Evolution
From N-gram to Transformer, from GPT-3 to o1 and R1 — a time-ordered map of LLM algorithms: each chapter has an intro, questions to answer, and its key models, papers, techniques, and tools.
Pre-Deep-Learning Era
Before deep learning, the default language-modeling recipe was N-gram + smoothing (Kneser-Ney, Good-Turing): use the previous n-1 words to predict the next one. Fast and interpretable, but dogged by two hard problems — the curse of dimensionality (the event space grows as vocabulary^n) and a short context window (beyond n=4 the statistics become extremely sparse). In 2003 Bengio's Neural Probabilistic Language Model mapped words into a continuous vector space. In 2013 Mikolov's word2vec engineered that idea into a standard tool and made word embeddings a baseline NLP component. The central idea from this era — distributed representation — is the bedrock for everything that followed.
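The n-gram recipe fits in a few lines. Below is a minimal sketch of the bigram (n=2) case with add-k (Laplace) smoothing on a toy corpus — real systems use Kneser-Ney and billions of tokens, and the corpus here is purely illustrative:

```python
from collections import Counter

# Toy corpus; real n-gram models are trained on billions of tokens.
corpus = "the cat sat on the mat . the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size (7 distinct tokens here)

def p_add_k(word, prev, k=1.0):
    """P(word | prev) with add-k (Laplace when k=1) smoothing."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

# "cat" follows "the" in 2 of the 3 occurrences of "the":
print(p_add_k("cat", "the"))  # (2 + 1) / (3 + 7) = 0.3
```

Laplace smoothing spreads probability mass uniformly over the vocabulary, which is exactly why Kneser-Ney — absolute discounting weighted by continuation counts — outperforms it in practice.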
After this chapter you should be able to answer
- How is N-gram probability computed? Why is Kneser-Ney smoothing more widely used than Laplace?
- What's the essential innovation in Bengio 2003's Neural Probabilistic Language Model vs N-gram?
- How do CBOW and Skip-gram differ as training objectives for word2vec?
- What computational problems do negative sampling and hierarchical softmax each address?
- What philosophical difference separates GloVe from word2vec?
- Why does the linear structure (king − man + woman ≈ queen) emerge?
- What's the fundamental limit of static word embeddings — why can 'bank' not represent both a river bank and a bank account simultaneously?
- What did FastText add over word2vec, and why is it friendlier to OOV tokens?
- Whose quote is the distributional hypothesis "You shall know a word by the company it keeps"?
- What role do older methods like TF-IDF, LSA, and LDA still play today?
Key techniques
- N-gram language models
The classic statistical language model. Kneser-Ney smoothing was the industry default for 30 years.
- Distributed representations
Mapping words into dense vector spaces — the origin point of modern NLP.
- word2vec (CBOW / Skip-gram)
Mikolov's two simplified models (CBOW, Skip-gram) made word embeddings trainable at scale.
- Negative sampling
Replaces softmax with binary classification, reducing Skip-gram training from O(V) to O(k).
- GloVe
Global co-occurrence matrix factorization — a complementary philosophy to word2vec.
- FastText
Subword embeddings via character n-grams. Strong for morphologically rich languages and OOV.
- TF-IDF / LSA / LDA
Older text representations that still have a role in information retrieval and document clustering.
Key papers
- A Neural Probabilistic Language Model (Bengio et al., 2003)
The foundational paper of neural language modeling — it established the recipe of learning word vectors jointly with the model.
- Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013)
First word2vec paper with CBOW and Skip-gram.
- Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013)
Second word2vec paper, introducing negative sampling, hierarchical softmax, and subsampling.
- GloVe: Global Vectors for Word Representation (Pennington et al., 2014)
The Stanford GloVe paper.
- Enriching Word Vectors with Subword Information (Bojanowski et al., 2017)
The FastText paper.
Models & tools
- word2vec (Google's C release)
The original C implementation — still fast on large corpora.
- Gensim
The easiest Python library for word2vec, LDA, and LSI.
- spaCy
Industrial NLP pipelines with pretrained vectors and reliable tooling.
- NLTK
Teaching-oriented NLP toolbox — N-gram, HMM, and PCFG are all there.
Further reading
- Speech and Language Processing (Jurafsky & Martin)
The standard NLP textbook. The first few chapters cover N-grams and word vectors best.
- Stanford CS224N
Stanford's NLP course dedicates its first three lectures to word vectors.
-
A single post that captures the core intuition of distributed representations.
The RNN Era
Word embeddings handled lexical representation, but language is a sequence problem — word meaning depends on context. Recurrent Neural Networks (RNNs) unroll the same weights across time to process variable-length sequences, but vanishing gradients made long-range dependencies difficult. LSTM (Hochreiter & Schmidhuber 1997) introduced gating to carry information across long sequences; GRU (Cho 2014) is a lighter variant. In 2014 Sutskever's Seq2Seq brought the encoder-decoder architecture to machine translation, and Bahdanau's attention mechanism let the decoder peek back at any input position — a direct precursor to the Transformer.
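To make the gating concrete, here is one LSTM time step in NumPy under the standard formulation (stacking the four gate blocks into one matrix is an implementation convention, not part of the original paper's notation; all sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W, U, b each stack the four gate blocks
    (input i, forget f, candidate g, output o) along the first axis."""
    H = h.shape[0]
    z = W @ x + U @ h + b              # all four pre-activations at once
    i = sigmoid(z[0*H:1*H])            # input gate: how much new info to write
    f = sigmoid(z[1*H:2*H])            # forget gate: how much old cell state to keep
    g = np.tanh(z[2*H:3*H])            # candidate cell content
    o = sigmoid(z[3*H:4*H])            # output gate: how much cell state to expose
    c_new = f * c + i * g              # additive update: gradients survive long spans
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 4, 3                            # toy input and hidden sizes
W = rng.normal(size=(4*H, D))
U = rng.normal(size=(4*H, H))
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):      # unroll the same weights over a 5-step sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (3,)
```

The additive update `c_new = f * c + i * g` is the key: the cell state gives gradients a mostly unimpeded path backward through time, which is what lets LSTMs bridge long spans where a vanilla RNN's gradients vanish.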
After this chapter you should be able to answer
- What's the root cause of vanishing/exploding gradients in RNNs? How is BPTT performed?
- What do the three LSTM gates (forget, input, output) actually control?
- What does GRU leave out compared to LSTM? When is the performance difference negligible?
- What translation problem does the Seq2Seq encoder-decoder solve?
- How do Bahdanau and Luong attention differ?
- Why are RNN-style models hard to parallelize? What did that mean for production deployment?
- What's the relationship between teacher forcing and exposure bias?
- Where did ConvS2S (Facebook 2017) beat RNN Seq2Seq, and why did it never become mainstream?
- How much does beam search width matter for translation quality?
- Why is ELMo often called a "transitional form"? What architecture does it use?
Key models
- Vanilla RNN
The simplest recurrent structure, held back in practice by vanishing gradients.
- LSTM
Hochreiter & Schmidhuber 1997. Gates + cell state are the core insight.
- GRU
Cho 2014. Merges the three LSTM gates into two — fewer parameters, faster training.
- Seq2Seq
Sutskever 2014's encoder-decoder framework — the skeleton for modern generative models.
- Attention (Bahdanau)
Lets the decoder re-attend to arbitrary encoder positions. Direct precursor to the Transformer.
- ConvS2S
Facebook 2017 used CNNs for Seq2Seq — easier to parallelize than RNNs, quickly overtaken by the Transformer.
- ELMo
Bi-LSTM contextual embeddings — an early industrial demonstration of pretraining in NLP.
Key papers
- Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)
The original paper on gated networks to combat vanishing gradients.
- Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014)
Google applied the encoder-decoder to machine translation successfully.
- Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2014)
The original attention paper.
- Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)
Global vs local attention, compared against the Bahdanau style.
- Deep Contextualized Word Representations (Peters et al., 2018)
The ELMo paper — the first glimpse of large-scale pretraining in NLP.
Tools
- PyTorch / TensorFlow
Early RNN implementations lived here and remain the teaching starting point.
- OpenNMT
Open-source neural MT framework, the industry choice during the RNN era.
- fairseq
Facebook's Seq2Seq toolkit, later extended to Transformers.
- torchtext / AllenNLP
Data loading and modeling components around the PyTorch ecosystem.
Further reading
- The Unreasonable Effectiveness of Recurrent Neural Networks (Karpathy)
The post that got millions of readers hooked on character-level RNNs.
- Understanding LSTM Networks (Chris Olah)
The clearest visual explanation of LSTM gates.
- Stanford CS224N lecture notes
Stanford's NLP course notes on RNNs, LSTM, and attention.
Transformer & Pretraining
Google's 2017 paper 'Attention Is All You Need' introduced the Transformer: pure self-attention replacing RNNs/CNNs, fully parallel end-to-end. Core innovations: multi-head self-attention, positional encoding, residual connections + LayerNorm. In 2018 GPT-1 and BERT appeared almost simultaneously: GPT is autoregressive generation (decoder-only) and BERT is masked fill-in (encoder-only, MLM). T5 (2019) reframed every NLP task as text-to-text. The era's unifying trend is 'pretrain + fine-tune' — data and compute replaced task-specific architecture.
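The core computation is compact enough to write out. A minimal single-head self-attention sketch in NumPy — multi-head attention just runs several of these with smaller head dimensions and concatenates the results; the sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every position attends to every other."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (n, n) score matrix: the quadratic term
    return softmax(scores) @ V         # weighted mix of value vectors

rng = np.random.default_rng(0)
n, d = 6, 8                            # 6 tokens, model dim 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 8)
```

The (n, n) score matrix is where the quadratic cost lives — the same term that the efficiency chapter's FlashAttention and the frontier chapter's sub-quadratic architectures attack.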
After this chapter you should be able to answer
- What roles do Q, K, and V play in self-attention? Is complexity O(n²) or O(n·d)?
- What do multi-head attention heads buy over a single head? Why 8 / 16 / 32?
- What are the motivations behind absolute (sinusoidal), relative, and RoPE positional encodings?
- Why the 80/10/10 mask/replace/unchanged mix in BERT's MLM?
- Where does GPT's next-token prediction beat BERT's MLM, and where does it lose?
- When do encoder-only, decoder-only, and encoder-decoder architectures each shine?
- Why was T5's 'everything is text-to-text' framing so influential?
- What are the differences between BPE, WordPiece, and SentencePiece tokenizers?
- How much does LayerNorm placement (Post-LN vs Pre-LN) affect training stability?
- Why does the Transformer also dominate vision (ViT) and speech (Whisper)?
Key architectures
- Transformer
Vaswani 2017 — the common ancestor of every modern LLM.
- BERT
Google 2018 encoder-only pretrained model with MLM.
- GPT-1
OpenAI 2018. Established the decoder-only pretrain + fine-tune recipe.
- T5
Google 2019 text-to-text framework unifying all NLP tasks as generation.
- RoBERTa
FAIR optimized BERT training: more data, longer training, dropping NSP.
- XLNet
Permutation LM combining AR and AE — elegant but engineering-heavy.
- BART
FAIR's encoder-decoder pretrained model with a corrupt-then-denoise objective.
Core mechanisms
- Multi-head self-attention
Attention under several 'views' computed in parallel — the Transformer's compute core.
- Positional encoding
Sinusoidal (original), learned, relative, RoPE — the main ways to express sequence order.
- LayerNorm + residual connections
Critical for stable Transformer training. Pre-LN vs Post-LN matters.
- Masked Language Modeling
BERT's pretraining objective — randomly mask and predict.
- Next-token prediction
GPT's pretraining objective and the foundation of every modern generative LLM.
- BPE / WordPiece / SentencePiece
The three mainstream subword algorithms. They determine vocabulary and encoding efficiency.
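The BPE idea behind all three tokenizers is simple to sketch: count adjacent symbol pairs, merge the most frequent pair everywhere, repeat. A toy trainer — illustrative only, since real tokenizers add byte-level fallbacks, pre-tokenization, and explicit tie-breaking:

```python
from collections import Counter

def merge_pair(syms, pair):
    """Rewrite one word (a tuple of symbols), fusing every occurrence of `pair`."""
    out, i = [], 0
    while i < len(syms):
        if i + 1 < len(syms) and (syms[i], syms[i + 1]) == pair:
            out.append(syms[i] + syms[i + 1]); i += 2
        else:
            out.append(syms[i]); i += 1
    return tuple(out)

def bpe_train(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = {tuple(w): f for w, f in Counter(words).items()}  # word -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, f in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += f
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(syms, best): f for syms, f in vocab.items()}
    return merges

print(bpe_train(["low", "low", "lower", "newest", "newest", "widest"], 3))
```

Frequent words collapse into single tokens while rare words stay decomposable into subwords — the property that keeps vocabularies bounded without an OOV token.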
Tools
- Hugging Face Transformers
The de facto pretrained model library with nearly every Transformer ported.
- fairseq
Meta's Seq2Seq/Transformer training framework, popular in academia.
- Tensor2Tensor
Reference implementation of the original Transformer, now mostly a historical archive.
- minGPT / nanoGPT (Karpathy)
~300 lines of PyTorch GPT. The best way to internalize Transformer details.
Further reading
- The Illustrated Transformer (Jay Alammar)
The most-read visual Transformer explainer.
- The Illustrated BERT / The Illustrated GPT-2 (Jay Alammar)
BERT and GPT versions of the same series.
- The Transformer Family (Lilian Weng)
Lilian Weng's overview of Transformer variants.
- Stanford CS25: Transformers United
Stanford's lectures on Transformer architecture.
Scaling Laws: GPT-2 to GPT-3
GPT-2 (2019) scaled the Transformer to 1.5B parameters and showed zero-shot ability: the model could follow natural-language prompts without fine-tuning. GPT-3 (2020) scaled to 175B and revealed few-shot in-context learning — a handful of examples and the model picks up new tasks. In 2020, Kaplan's 'Scaling Laws for Neural Language Models' showed loss falls as a power law in compute, data, and parameters. In 2022 DeepMind's Chinchilla corrected Kaplan's allocation: at fixed compute, data should grow in equal proportion with parameters (roughly 20 tokens per parameter), rather than spending most new compute on a bigger model. This framework has decided the resource allocation for every subsequent large model.
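The Chinchilla allocation can be worked out from the usual approximation C ≈ 6·N·D for training FLOPs (N parameters, D tokens) plus the ~20 tokens-per-parameter ratio. The numbers below are a back-of-envelope sketch, not the paper's full parametric fit:

```python
import math

# C ≈ 6·N·D, with compute-optimal D ≈ 20·N.
# Substituting: C = 6·N·(20·N) = 120·N², so N = sqrt(C / 120).
def compute_optimal(C, tokens_per_param=20.0):
    N = math.sqrt(C / (6.0 * tokens_per_param))
    return N, tokens_per_param * N  # (params, tokens)

N, D = compute_optimal(5.76e23)  # roughly Chinchilla's training budget in FLOPs
print(f"params ≈ {N/1e9:.0f}B, tokens ≈ {D/1e12:.1f}T")
```

Plugging in a Chinchilla-scale budget recovers roughly the 70B-parameter, 1.4T-token configuration DeepMind actually trained.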
After this chapter you should be able to answer
- What's the capability jump between GPT-2 (1.5B) and GPT-3 (175B)?
- What are the emergent abilities, and at what scales do they appear?
- How are Kaplan's three power laws (loss vs params, data, compute) derived?
- Which assumption did Chinchilla correct? What new ratio does it recommend?
- What's the approximate training cost (GPU-hours, electricity) of a 175B GPT-3?
- Why does in-context learning work? How does it differ from true fine-tuning?
- When did prompt engineering become a craft?
- What are the different ecosystem trajectories for open-source GPT-2 vs closed GPT-3?
- Why did Chinchilla's finding enable LLaMA-class 'small-but-data-heavy' models?
- How has compute as a meta-resource shaped the AI industry?
Representative models
- GPT-2 (1.5B, OpenAI 2019)
Zero-shot showed the power of pretraining and caused the 'too dangerous to release' debate.
- Megatron-LM (NVIDIA 2019)
NVIDIA's large-model training systems paper. Foundation of tensor parallelism.
- T5-11B (Google 2019)
The largest model trained in the T5 paper.
- Turing-NLG (17B, Microsoft 2020)
Microsoft's then-largest LLM, trained with DeepSpeed.
- GPT-3 (175B, OpenAI 2020)
Few-shot in-context learning arrived and shocked the NLP world.
- Jurassic-1 (178B, AI21 2021)
AI21's contemporary to GPT-3.
- PaLM (540B, Google 2022)
Google's 540B Pathways model — briefly the largest.
- Chinchilla (70B, DeepMind 2022)
DeepMind's compute-optimal experiment: a 70B model beat 280B Gopher at matched compute.
Key papers
- Language Models are Unsupervised Multitask Learners (GPT-2, 2019)
First systematic demonstration of zero-shot.
- Scaling Laws for Neural Language Models (Kaplan et al., 2020)
Original paper on compute/data/parameter power laws.
- Language Models are Few-Shot Learners (GPT-3, 2020)
Classic paper on few-shot in-context learning.
- Training Compute-Optimal Large Language Models (Chinchilla, 2022)
Corrected Kaplan and introduced the new 20 tokens/param ratio.
- Emergent Abilities of Large Language Models (Wei et al., 2022)
Systematic record of capabilities that appear suddenly past a scale threshold.
Training systems
- Megatron-LM
NVIDIA's large-model training toolkit, reference implementation of tensor parallelism.
- DeepSpeed
Microsoft's training optimization library and primary ZeRO implementation.
- GPT-Neo / GPT-J / GPT-NeoX
EleutherAI's open-source replications — the first 'community' big models.
- Hugging Face Hub
The de facto hub for models and checkpoints.
Further reading
-
OpenAI's introduction and early use-case showcase.
- AI and Compute (OpenAI)
The original 'compute doubles every 3.4 months' analysis.
- The Scaling Hypothesis (Gwern)
The most thorough external treatment of scaling — long but worth it.
- Dwarkesh Podcast
Dwarkesh Patel's interviews with AI researchers on scaling.
Alignment: RLHF & Instruction Tuning
Pretrained models can complete sentences but aren't necessarily 'helpful' or 'safe'. In 2022, InstructGPT industrialized RLHF (Reinforcement Learning from Human Feedback) in three steps: SFT on human demonstrations, train a reward model on pairwise preferences, then fine-tune via PPO. In November 2022 ChatGPT let the public feel the difference an aligned model makes. Since 2023, cheaper alternatives emerged: Anthropic's Constitutional AI had models self-critique; DPO (Direct Preference Optimization) bypassed the reward model entirely; SimPO, KTO, and IPO kept simplifying. By 2024, RLAIF (AI feedback instead of human labels) broke the human-annotation bottleneck.
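To see what it means for DPO to bypass the reward model, here is the per-pair loss in plain Python. The sequence log-probs are hypothetical stand-ins for real model outputs:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO on one preference pair: a logistic loss on how much more the policy
    prefers the chosen answer than the frozen reference model does."""
    pi_margin = logp_chosen - logp_rejected     # policy log P(chosen)/P(rejected)
    ref_margin = ref_chosen - ref_rejected      # same margin under the reference
    # -log sigmoid(beta * (policy margin - reference margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * (pi_margin - ref_margin))))

# Hypothetical log-probs: the policy prefers the chosen answer more strongly
# than the reference does, so the loss falls below log(2), its value at zero margin.
loss = dpo_loss(-12.0, -14.0, -12.5, -13.0, beta=0.1)
print(loss)
```

No reward model and no sampling loop: the policy is trained to widen its chosen-vs-rejected log-prob margin relative to a frozen reference, with β controlling how far it may drift from that reference.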
After this chapter you should be able to answer
- What does each step of RLHF actually do? What does PPO contribute in the third step?
- How are reward-model training pairwise preferences collected? Who are the annotators?
- Why is RLHF more effective than pure supervised fine-tuning? What is the ceiling of SFT alone?
- Which dimension did InstructGPT improve the most over raw GPT-3?
- What is Constitutional AI's self-critique flow?
- What does DPO simplify vs PPO, and how much does performance differ?
- What does reward hacking look like in RLHF? How is it mitigated?
- What are typical data volumes for SFT, RLHF, and DPO?
- What's the central innovation of RLAIF over RLHF?
- Where do ReST, SimPO, KTO, and IPO sit in the 2024 alignment landscape?
Core techniques
- Supervised Fine-Tuning (SFT)
Fine-tuning on instruction-response pairs. The first step of every subsequent alignment pipeline.
- Reward Model (RM)
Takes answer pairs, emits preference scores. The referee in RLHF.
- PPO (Proximal Policy Optimization)
The RL algorithm InstructGPT used. Core idea: a clipped objective to bound policy updates.
- DPO (Direct Preference Optimization)
Replaces RL with a closed-form cross-entropy loss.
- Constitutional AI / RLAIF
Anthropic's self-critique framework with AI feedback replacing most human labels.
- SimPO / KTO / IPO
2024 DPO variants that further simplify or correct the loss.
Key papers
- Training Language Models to Follow Instructions with Human Feedback (InstructGPT, 2022)
The landmark paper for industrial RLHF.
- Constitutional AI: Harmlessness from AI Feedback (Anthropic, 2022)
Anthropic's alignment methodology with explicit value principles.
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023)
Replaced RL with a supervised loss — 2023's most important alignment paper.
- SimPO: Simple Preference Optimization with a Reference-Free Reward (2024)
Simplifies DPO further by removing the reference model.
- KTO: Model Alignment as Prospect Theoretic Optimization (2024)
Preference learning grounded in Kahneman-Tversky prospect theory.
Representative aligned models
- ChatGPT (OpenAI 2022)
The turning point that turned GPT-3.5 into a conversational partner.
- GPT-4 / GPT-4o (OpenAI)
OpenAI's flagship — multimodal with sharply improved alignment.
- Claude (Anthropic)
Constitutional AI plus RLHF — helpful, harmless, honest as design goals.
- Llama 2 / Llama 3 (Meta)
Meta's open aligned models, with the most detailed RLHF recipes in public papers.
- DeepSeek-V3 / R1
China's open-source aligned flagship. R1 pushed RL-on-reasoning to the open-source frontier.
Further reading
-
OpenAI's alignment research portal.
-
First-hand source on Anthropic's alignment and interpretability work.
- Illustrating RLHF (Hugging Face)
The most widely read RLHF explainer post.
-
Lil'Log's alignment coverage with the most complete technical detail.
Efficiency: MoE, Quantization, Long Context
Bigger models are smarter, but training and inference costs climb with them. Three paths compress those costs: (1) sparse activation (MoE) so each token uses only a fraction of parameters; (2) quantization (INT8 / INT4 / FP8) to shrink storage and compute; (3) attention engineering to make long contexts tractable. Mixtral 8x7B (Mistral 2023) showed MoE works in open source; GPT-4 is widely believed to be MoE; Gemini 1.5 Pro pushed context to 1M tokens with ring attention; DeepSeek-V3 combines MoE + MLA + FP8 — currently the most systematic efficiency case study.
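Path (1), sparse activation, comes down to a router. A minimal top-k MoE routing sketch in NumPy for a single token — the 'experts' here are stand-in linear layers, and real systems route whole batches and add load-balancing losses:

```python
import numpy as np

def moe_layer(x, W_gate, experts, k=2):
    """Top-k MoE routing for one token: score all experts, keep the k best,
    renormalize their gate weights, and mix only those experts' outputs."""
    logits = x @ W_gate                      # (num_experts,) router scores
    top = np.argsort(logits)[-k:]            # indices of the k highest-scoring experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                     # softmax over just the selected experts
    return sum(p * experts[i](x) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
d, n_exp = 8, 4
x = rng.normal(size=d)
W_gate = rng.normal(size=(d, n_exp))
# Each "expert" is a tiny linear map; only 2 of the 4 run for this token.
weights = [rng.normal(size=(d, d)) for _ in range(n_exp)]
experts = [lambda v, W=W: v @ W for W in weights]
print(moe_layer(x, W_gate, experts).shape)  # (8,)
```

Only k experts execute per token, so total parameter count and per-token compute decouple — the property Mixtral and DeepSeek-V3 exploit.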
After this chapter you should be able to answer
- How does top-k routing pick experts in MoE, and how is load imbalance handled?
- How do GShard, Switch Transformer, and Mixtral differ in MoE implementation?
- How much accuracy do INT8/INT4 cost? What are the different ideas behind AWQ, GPTQ, and SmoothQuant?
- How much speedup does FP8 training bring over FP16/BF16 on H100/Blackwell?
- Why does training only low-rank matrices in LoRA work? How does QLoRA add quantization?
- How does FlashAttention reduce attention HBM traffic — does it change the algorithm?
- What's the main benefit of PagedAttention (vLLM) managing KV cache like OS virtual memory?
- How do ring attention, Infini-attention, and Mamba each break the context-length barrier?
- What is the speedup ceiling of Medusa, EAGLE, and Lookahead in speculative decoding?
- How does MLA (Multi-head Latent Attention) shrink KV cache in DeepSeek-V2/V3?
Core techniques
- Mixture-of-Experts (MoE)
Shazeer 2017's sparse gated expert layer — the foundation of modern large-model efficiency.
- Switch Transformer
Google's simplified MoE, scaling to trillions of parameters.
- Quantization (INT8 / INT4 / FP8)
Storing weights in 8-bit or 4-bit can cut inference cost by an order of magnitude.
- LoRA / QLoRA
Low-rank adaptation lets fine-tuning touch <1% of parameters.
- FlashAttention
Tri Dao's IO-aware exact attention kernel: the math is unchanged, but tiling keeps HBM traffic far below the naive implementation's.
- PagedAttention (vLLM)
KV cache as paged virtual memory — memory utilization from 20-40% up to 90%+.
- MQA / GQA
Cut attention-head count or share KV heads to reduce cache and bandwidth pressure.
- Speculative decoding
Small model drafts, large model verifies — typically a 2-4× inference speedup.
Key papers
- Outrageously Large Neural Networks (Shazeer et al., 2017)
The first major MoE paper.
- Switch Transformers (Fedus et al., 2021)
Google pushed MoE to 1.6T parameters in 2021.
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
The low-rank adaptation paper.
- FlashAttention (Dao et al., 2022)
The landmark attention-kernel optimization paper.
- Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023)
The KV-cache management revolution.
- Mixtral of Experts (Mistral, 2024)
Mistral's 8x7B — the open-source MoE milestone.
- DeepSeek-V3 Technical Report (2024)
A complete case of MoE + MLA + FP8 done together.
Representative models
- Switch-C (1.6T, Google 2021)
The first trillion-parameter MoE.
- Mixtral 8x7B (Mistral 2023)
Mistral's open-source MoE — ~13B active parameters, inference efficiency near Llama-70B.
- DeepSeek-V2 / V3
The Chinese open-source flagship. MLA + MoE reportedly cut training cost to a tenth of industry peers.
- Gemini 1.5 Pro (Google 2024)
Google's flagship pushing context to 1M tokens.
- Qwen / Qwen-MoE (Alibaba)
Alibaba's open + closed MoE series with a strong efficiency focus.
Further reading
- Tri Dao's blog
The FlashAttention author's technical blog.
- vLLM blog & docs
The authoritative reference for PagedAttention and continuous batching.
- DeepSpeed blog
ZeRO, MoE, and inference engineering content.
-
Efficiency-related chapters capture many engineering trade-offs.
Multimodality & Reasoning
Two directions ran in parallel through 2023-2024: multimodality and reasoning. Multimodality began with CLIP (2021) aligning images and text; then Flamingo (2022) mixing image + text input; GPT-4V (2023) native vision; Gemini (2023) handling text, image, audio, and video together. On reasoning, Chain-of-Thought prompting (Wei 2022) first showed that 'step-by-step thinking' lifts accuracy; in September 2024 OpenAI o1 made thinking itself a training target (RL on reasoning) with major gains in math and code; January 2025's DeepSeek R1 proved the approach can be reproduced in open source. The current frontier is fusing reasoning and multimodality.
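GRPO, the algorithm behind DeepSeek R1, replaces PPO's learned value baseline with a group-relative one: sample several answers per prompt, score them, and standardize each reward against the group. A sketch of that advantage computation (whether to use population or sample std is an implementation detail here):

```python
import statistics

def grpo_advantages(rewards):
    """GRPO's core simplification of PPO: no learned value network.
    Each sampled answer's advantage is its reward standardized
    against the mean and std of its own group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against a zero-spread group
    return [(r - mu) / sigma for r in rewards]

# Hypothetical verifiable-reward setup: 1.0 if the final answer checks out, else 0.0.
group_rewards = [1.0, 0.0, 0.0, 1.0, 0.0]
print(grpo_advantages(group_rewards))
```

Correct answers in a mostly-wrong group get large positive advantages and vice versa, so the policy gradient needs no critic network — a major memory and engineering saving at R1 scale.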
After this chapter you should be able to answer
- How does CLIP's contrastive learning align images and text? How does zero-shot classification work?
- How do Flamingo, BLIP-2, and LLaVA differ in fusing vision and language?
- What are the engineering differences between native vision (GPT-4V) and bolted-on vision encoders?
- How does Whisper tokenize audio?
- Why does Chain-of-Thought prompting boost reasoning so much? Does it only help at large scale?
- What does self-consistency actually gain over greedy decoding?
- How do Tree of Thoughts and Graph of Thoughts relate to CoT?
- What signal does OpenAI o1's 'RL on reasoning' actually train on?
- How does DeepSeek R1's GRPO simplify PPO?
- How is test-time compute scaling fundamentally different from pretrain scaling?
Multimodal models
- CLIP
Image-text contrastive learning opened the multimodal era and set the zero-shot classification baseline.
- Flamingo
Inserts vision tokens into a frozen LLM for few-shot VQA.
- BLIP-2
A Q-Former bridges the vision encoder and the LLM at low training cost.
- GPT-4V / GPT-4o
OpenAI's multimodal flagship — GPT-4o delivers native text, audio, and vision.
- Gemini
Google's family of native multimodal models.
- LLaVA
The Visual Instruction Tuning reference point for the open community.
- Qwen-VL
Alibaba's open multimodal series chasing GPT-4V.
Reasoning models
- Chain-of-Thought (CoT)
A single "let's think step by step" significantly boosts reasoning performance.
- Self-consistency
Sampling plus majority voting — a stronger CoT.
- Tree of Thoughts (ToT)
Expands CoT into a searchable reasoning tree.
- OpenAI o1 / o3
Models trained with RL-on-reasoning. Expert-level on AIME and Codeforces.
- DeepSeek R1
The first open model to reproduce o1-class reasoning. GRPO became a new RL baseline.
- Claude extended thinking
Anthropic's explicit-thinking capability on Claude 3.7 / 4.
Key papers
- Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021)
The seminal multimodal learning paper.
- Flamingo: a Visual Language Model for Few-Shot Learning (2022)
The reference few-shot visual language model.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
The first systematic study of step-by-step reasoning.
- Tree of Thoughts (Yao et al., 2023)
The CoT → ToT expansion.
- DeepSeek-R1 (2025)
The open-source canonical example of GRPO and multi-stage RL training.
- Learning to Reason with LLMs (OpenAI, 2024)
OpenAI's scaling laws for inference-time compute.
Further reading
- OpenAI o1 System Card
The official capability, safety, and benchmark report for o1.
- DeepSeek-R1 technical report
All the details of GRPO and multi-stage RL training.
-
Ties CoT, ToT, and Agent together clearly.
-
Daily multimodal and reasoning evaluations from an engineering perspective.
Frontier: Agents, SSM, World Models
From 2024 on, LLM research has split into 'beyond Transformer' and 'become an Agent.' On architecture, Mamba / State Space Models explore linear-complexity sequence modeling; RetNet, Hyena, RWKV, and Jamba all try to replace quadratic attention. On agents, Anthropic's Computer Use (2024), OpenAI Operator (2025), and Claude Agents (2025) let models drive browsers and operating systems; tool use, long-term memory, and multi-agent setups have become product themes. World models (Genie, Sora, Cosmos) pull video and physics simulation into the LLM frame. The field is more uncertain and more volatile in 2025-2026 than at any point before.
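The agent loop these products share is small; what is hard is reliability. Below is a skeletal ReAct-style loop with a canned stand-in for the model and a hypothetical `search` tool — none of this is any specific framework's API:

```python
import re

def llm(prompt):
    # Stand-in for a model call; a real agent would query an actual LLM here.
    if "Observation" not in prompt:
        return "Thought: I should look this up.\nAction: search[Mamba SSM]"
    return "Thought: I have enough.\nFinal Answer: Mamba is a selective SSM."

# Tool registry: name -> callable. Real agents expose schemas, not bare lambdas.
TOOLS = {"search": lambda q: f"(top result for {q!r})"}

def react(question, max_steps=5):
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(prompt)
        prompt += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[1].strip()
        m = re.search(r"Action: (\w+)\[(.*)\]", reply)
        if m:  # run the chosen tool and feed its result back as an observation
            obs = TOOLS[m.group(1)](m.group(2))
            prompt += f"Observation: {obs}\n"
    return None

print(react("What is Mamba?"))  # Mamba is a selective SSM.
```

Real agents differ mainly in what fills the prompt (tool schemas, memory, retries), not in this reason-act-observe control flow.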
After this chapter you should be able to answer
- Why is Mamba/SSM linear-complexity? What does its 'selective' mechanism mean?
- Can Mamba scale to GPT-4 levels? What is the largest SSM to date?
- How do RWKV, Hyena, RetNet, and Jamba position themselves relative to each other?
- How does the ReAct framework interleave reasoning and acting? Why is it the starting point for agents?
- How does MCP (Model Context Protocol) relate to function calling and tool use?
- What are the core engineering challenges of computer-use / browser-use agents?
- When do multi-agent systems (AutoGen, CrewAI) beat a single agent?
- What is the core mechanism of world models like Sora, Genie, and Cosmos?
- What can mechanistic interpretability explain about LLM behavior today?
- How does 'data wall' (public-data exhaustion) affect scaling? Can synthetic data bridge it?
New architectures
- Mamba
Selective SSM with linear complexity and near-Transformer quality — the strongest current challenger.
- RWKV
Trains in parallel like a Transformer, runs as an RNN at inference; a community-driven open project.
- RetNet
Microsoft's Retentive Network — another attention alternative.
- Hyena
Implicit long convolutions replace attention — part of the sub-quadratic wave.
- Jamba
AI21's Transformer + Mamba hybrid with an edge at long contexts.
Agent frameworks
- ReAct
Reasoning + acting interleaved — the skeleton behind most agent designs.
- Toolformer
The paper that taught models to decide when to call APIs on their own.
- AutoGPT / BabyAGI
Community projects that ignited agent imagination in 2023.
- MCP (Model Context Protocol)
Anthropic-led open protocol to standardize tool connectivity.
- AutoGen / LangGraph
Microsoft's and LangChain's multi-agent orchestration frameworks.
- OpenAI Operator / Claude Computer Use
The products that turned 'models driving a computer' into real offerings.
World models & video generation
- Sora (OpenAI)
Diffusion Transformer pushing minute-length video to convincing quality.
- Genie (Google DeepMind)
Generates interactive 2D / 3D worlds from a single image.
- Cosmos (NVIDIA)
NVIDIA's World Foundation Model aimed at robotics and autonomous driving.
- Veo (Google)
Google's high-quality video generation model.
-
Leading commercial video-generation products.
Frontier research directions
- Mechanistic interpretability
Chris Olah-led Transformer Circuits work — opening up LLM internals.
- Test-time compute scaling
A new scaling curve where 'thinking longer' replaces 'bigger model.'
- Self-play / Synthetic Data
As public data approaches exhaustion, self-generated training data is a hot alternative.
- Multi-agent Systems
Multiple specialized models / agents collaborating — already effective in code, research, and negotiation.
- Embodied / Robotics Foundation Models
RT-2, OpenVLA, and Gemini Robotics bring LLMs into robotics.
Further reading
-
Systems engineering detail behind Mamba.
- Transformer Circuits
The main site for mechanistic interpretability.
-
The most diligent blog tracking frontier products and models.
- Import AI (Jack Clark)
Anthropic co-founder's weekly newsletter with both policy and technical framing.
- State of AI Report
Annual reviews of AI industry evolution from major investors.