LLM Algorithm Evolution

From N-gram to Transformer, from GPT-3 to o1 and R1 — a time-ordered map of LLM algorithms: each chapter has an intro, questions to answer, and its key models, papers, techniques, and tools.

01

Pre-Deep-Learning Era

1948 – 2013

Before deep learning, the default language-modeling recipe was N-gram + smoothing (Kneser-Ney, Good-Turing): use the previous n-1 words to predict the next one. Fast and interpretable, but it ran into two hard problems: the curse of dimensionality (a vocabulary^n event space) and a short context window (beyond n=4 the statistics become extremely sparse). In 2003 Bengio's Neural Probabilistic Language Model mapped words into a continuous vector space. In 2013 Mikolov's word2vec engineered that idea into a standard tool and made word embeddings a baseline NLP component. The central idea from this era — distributed representation — is the bedrock for everything that followed.
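As a concrete instance of the recipe above, here is a minimal bigram model with add-one (Laplace) smoothing; Kneser-Ney would replace the flat `+alpha` term with continuation counts. A toy sketch, not production code:

```python
from collections import Counter

def bigram_prob(corpus, prev, word, alpha=1.0):
    """P(word | prev) with add-alpha (Laplace) smoothing over a toy corpus."""
    tokens = corpus.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

corpus = "the cat sat on the mat the cat ran"
p_seen = bigram_prob(corpus, "the", "cat")    # "the cat" occurs 2 of 3 times
p_unseen = bigram_prob(corpus, "the", "ran")  # zero count, smoothing keeps it > 0
```

The unseen bigram gets a small but nonzero probability, which is the whole point of smoothing; without it, one unseen pair zeroes out an entire sentence probability.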

After this chapter you should be able to answer

  1. How is N-gram probability computed? Why is Kneser-Ney smoothing more widely used than Laplace?
  2. What's the essential innovation in Bengio 2003's Neural Probabilistic Language Model vs N-gram?
  3. How do CBOW and Skip-gram differ as training objectives for word2vec?
  4. What computational problems do negative sampling and hierarchical softmax each address?
  5. What philosophical difference separates GloVe from word2vec?
  6. Why does the linear structure (king − man + woman ≈ queen) emerge?
  7. What's the fundamental limit of static word embeddings — why can 'bank' not represent both a river bank and a bank account simultaneously?
  8. What did FastText add over word2vec, and why is it friendlier to OOV tokens?
More questions (2)
  1. Who is the source of the distributional-hypothesis quote "You shall know a word by the company it keeps"?
  2. What role do older methods like TF-IDF, LSA, and LDA still play today?

Key techniques

  • N-gram language model

    The classic statistical language model. Kneser-Ney smoothing was the industry default for 30 years.

  • Word embeddings

    Mapping words into dense vector spaces — the origin point of modern NLP.

  • word2vec

    Mikolov's two simplified models (CBOW, Skip-gram) made word embeddings trainable at scale.

  • Negative sampling

    Replaces softmax with binary classification, reducing Skip-gram training from O(V) to O(k).

  • GloVe

    Global co-occurrence matrix factorization — a complementary philosophy to word2vec.

  • FastText

    Subword embeddings via character n-grams. Strong for morphologically rich languages and OOV.

  • TF-IDF / LSA / LDA

    Older text representations that still have a role in information retrieval and document clustering.
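The negative-sampling technique in the list above reduces to a handful of lines: instead of a softmax over the full vocabulary, score one true (center, context) pair against k sampled negatives with a sigmoid. A NumPy sketch with illustrative shapes and names of my own:

```python
import numpy as np

def sgns_loss(center, context, negatives):
    """Skip-gram negative-sampling loss for one (center, context) pair.
    center/context are (d,) embedding vectors; negatives has shape (k, d).
    Replaces a full-vocab softmax (O(V)) with 1 positive + k negatives (O(k))."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = -np.log(sigmoid(center @ context))               # pull the true pair together
    neg = -np.sum(np.log(sigmoid(-(negatives @ center))))  # push the k samples apart
    return pos + neg

rng = np.random.default_rng(0)
d, k = 8, 5
loss = sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d)))
```

In real word2vec the negatives are drawn from the unigram distribution raised to the 0.75 power, which upweights rare words relative to pure frequency sampling.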

Key papers

Models & tools

  • word2vec

    The original C implementation — still fast on large corpora.

  • Gensim

    The easiest Python library for word2vec, LDA, LSI.

  • spaCy

    Industrial NLP pipelines with pretrained vectors and reliable tooling.

  • NLTK

    Teaching-oriented NLP toolbox — N-gram, HMM, PCFG are all there.

Further reading

02

The RNN Era

1997 – 2017

Word embeddings handled lexical representation, but language is a sequence problem — word meaning depends on context. Recurrent Neural Networks (RNNs) unroll the same weights across time to process variable-length sequences, but vanishing gradients made long-range dependencies difficult. LSTM (Hochreiter 1997) introduced gating to carry information across long sequences; GRU (2014) is a lighter variant. In 2014 Sutskever's Seq2Seq brought the encoder-decoder architecture to machine translation, and Bahdanau's attention mechanism let the decoder peek back at any input position — a direct precursor to the Transformer.
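The gating idea that lets LSTM carry information across long sequences fits in one step function; shapes and initialization here are illustrative, not a reference implementation:

```python
import numpy as np

def lstm_cell(x, h, c, W, b):
    """One LSTM step. W maps the concatenated [x; h] to the four gate
    pre-activations; with hidden size d, W is (4d, input+d) and b is (4d,)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = W @ np.concatenate([x, h]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget / input / output gates
    c_new = f * c + i * np.tanh(g)   # additive cell-state path eases gradient flow
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(1)
dx, dh = 3, 4
W = rng.normal(size=(4 * dh, dx + dh))
b = np.zeros(4 * dh)
h, c = np.zeros(dh), np.zeros(dh)
h, c = lstm_cell(rng.normal(size=dx), h, c, W, b)
```

The additive update `f * c + i * tanh(g)` is the key: gradients can flow through the cell state largely unattenuated, which is what vanilla RNNs lack.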

After this chapter you should be able to answer

  1. What's the root cause of vanishing/exploding gradients in RNNs? How is BPTT performed?
  2. What do the three LSTM gates (forget, input, output) actually control?
  3. What does GRU leave out compared to LSTM? When is the performance difference negligible?
  4. What translation problem does the Seq2Seq encoder-decoder solve?
  5. How do Bahdanau and Luong attention differ?
  6. Why are RNN-style models hard to parallelize? What did that mean for production deployment?
  7. What's the relationship between teacher forcing and exposure bias?
  8. Where did ConvS2S (Facebook 2017) beat RNN Seq2Seq, and why did it never become mainstream?
More questions (2)
  1. How much does beam search width matter for translation quality?
  2. Why is ELMo often called a "transitional form"? What architecture does it use?

Key models

  • Vanilla RNN

    The simplest recurrent structure, held back in practice by vanishing gradients.

  • LSTM

    Hochreiter & Schmidhuber 1997. Gates + cell state are the core insight.

  • GRU

    Cho 2014. Merges the three LSTM gates into two — fewer parameters, faster training.

  • Seq2Seq

    Sutskever 2014's encoder-decoder framework — the skeleton for modern generative models.

  • Bahdanau attention

    Lets the decoder re-attend to arbitrary encoder positions. Direct precursor to Transformer.

  • ConvS2S

    Facebook 2017 used CNNs for Seq2Seq — easier to parallelize than RNNs, quickly overtaken by Transformer.

  • ELMo

    Bi-LSTM contextual embeddings — an early industrial demonstration of pretraining in NLP.

Key papers

Tools

  • PyTorch / TensorFlow

    Early RNN implementations lived here and remain the teaching starting point.

  • OpenNMT

    Open-source neural MT framework, the industry choice during the RNN era.

  • fairseq

    Facebook's Seq2Seq toolkit, later extended to Transformers.

  • torchtext / AllenNLP

    Data loading and modeling components around the PyTorch ecosystem.

Further reading

03

Transformer & Pretraining

2017 – 2020

Google's 2017 paper 'Attention Is All You Need' introduced the Transformer: pure self-attention replacing RNNs/CNNs, fully parallel end-to-end. Core innovations: multi-head self-attention, positional encoding, residual connections + LayerNorm. In 2018 GPT-1 and BERT appeared almost simultaneously: GPT is autoregressive generation (decoder-only) and BERT is masked fill-in (encoder-only, MLM). T5 (2019) reframed every NLP task as text-to-text. The era's unifying trend is 'pretrain + fine-tune' — data and compute replaced task-specific architecture.
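A minimal single-head version of the self-attention at the Transformer's core (NumPy sketch; real implementations add causal masking, multiple heads, and output projections):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape
    (n, d). The (n, n) score matrix is the O(n^2) term in the complexity."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scale stabilizes the softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
out, w = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
```

Every row of `w` is a probability distribution over all positions, computed for all rows at once — that all-pairs, all-at-once structure is what makes the Transformer parallel where an RNN is sequential.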

After this chapter you should be able to answer

  1. What roles do Q, K, and V play in self-attention? Is complexity O(n²) or O(n·d)?
  2. What do multi-head attention heads buy over a single head? Why 8 / 16 / 32?
  3. What are the motivations behind absolute (sinusoidal), relative, and RoPE positional encodings?
  4. Why the 80/10/10 mask/replace/unchanged mix in BERT's MLM?
  5. Where does GPT's next-token prediction beat BERT's MLM, and where does it lose?
  6. When do encoder-only, decoder-only, and encoder-decoder architectures each shine?
  7. Why was T5's 'everything is text-to-text' framing so influential?
  8. What are the differences between BPE, WordPiece, and SentencePiece tokenizers?
More questions (2)
  1. How much does LayerNorm placement (Post-LN vs Pre-LN) affect training stability?
  2. Why does the Transformer also dominate vision (ViT) and speech (Whisper)?

Key architectures

  • Transformer

    Vaswani 2017 — the common ancestor of every modern LLM.

  • BERT

    Google 2018 encoder-only pretrained model with MLM.

  • GPT-1

    OpenAI 2018. Established the decoder-only pretrain + fine-tune recipe.

  • T5

    Google 2019 text-to-text framework unifying all NLP tasks as generation.

  • RoBERTa

    FAIR optimized BERT training: more data, longer, dropping NSP.

  • XLNet

    Permutation LM combining AR and AE — elegant but engineering-heavy.

  • BART

    FAIR's encoder-decoder pretrained model with noise → denoising objective.

Core mechanisms

  • Multi-head self-attention

    Attention under several 'views' computed in parallel — the Transformer's compute core.

  • Positional encoding

    Sinusoidal (original), learned, relative, RoPE — the main ways to express sequence order.

  • Residual connections + LayerNorm

    Critical for stable Transformer training. Pre-LN vs Post-LN matters.

  • Masked Language Modeling

    BERT's pretraining objective — randomly mask and predict.

  • Next-token prediction

    GPT's pretraining objective and the foundation of every modern generative LLM.

  • BPE / WordPiece / SentencePiece

    The three mainstream subword algorithms. Determines vocab and encoding efficiency.
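Of the mechanisms listed, the original sinusoidal positional encoding can be written down directly from the Vaswani formula — even dimensions take sin, odd take cos, with geometrically spaced wavelengths. A sketch:

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """Original Transformer positional encoding, shape (n_pos, d_model).
    Dimension pair 2i/2i+1 uses sin/cos at frequency 1 / 10000^(2i/d_model)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(50, 16)
```

Because sin/cos of shifted positions are linear combinations of the originals, the model can in principle express relative offsets — the motivation that relative encodings and RoPE later made explicit.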

Tools

Further reading

04

Scaling Laws: GPT-2 to GPT-3

2019 – 2022

GPT-2 (2019) scaled the Transformer to 1.5B parameters and showed zero-shot transfer: the model could follow natural-language prompts without fine-tuning. GPT-3 (2020) scaled to 175B and revealed few-shot in-context learning — a handful of examples in the prompt and the model picks up new tasks. In 2020, Kaplan's 'Scaling Laws for Neural Language Models' showed loss falls as a power law in compute, data, and parameters. In 2022 DeepMind's Chinchilla corrected Kaplan's allocation: at a fixed compute budget, data should scale in lockstep with parameters, rather than concentrating almost all growth in parameters as Kaplan's fit suggested. This framework has guided the resource allocation of every subsequent large model.
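Chinchilla's correction is often summarized by the rule of thumb of roughly 20 training tokens per parameter, with training FLOPs approximated as C ≈ 6·N·D. Under those two assumptions the compute-optimal split has a closed form; the constants here are the popular rule of thumb, not the paper's exact fitted values:

```python
def chinchilla_optimal(compute_flops):
    """Compute-optimal (params, tokens) under two rules of thumb:
    training FLOPs C ≈ 6·N·D, and D ≈ 20·N tokens per parameter.
    Substituting gives C = 120·N^2, so N = sqrt(C / 120) and D = 20·N."""
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# e.g. a 1e23-FLOP budget suggests a ~29B-parameter model on ~580B tokens
n, d = chinchilla_optimal(1e23)
```

Run against GPT-3's actual shape (175B params, ~300B tokens), the rule says GPT-3 was heavily under-trained for its size — exactly the gap LLaMA-class models exploited.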

After this chapter you should be able to answer

  1. What's the capability jump between GPT-2 (1.5B) and GPT-3 (175B)?
  2. What are the emergent abilities, and at what scales do they appear?
  3. How are Kaplan's three power laws (loss vs params, data, compute) derived?
  4. Which assumption did Chinchilla correct? What new ratio does it recommend?
  5. What's the approximate training cost (GPU-hours, electricity) of a 175B GPT-3?
  6. Why does in-context learning work? How does it differ from true fine-tuning?
  7. When did prompt engineering become a craft?
  8. What are the different ecosystem trajectories for open-source GPT-2 vs closed GPT-3?
More questions (2)
  1. Why did Chinchilla's finding enable LLaMA-class 'small-but-data-heavy' models?
  2. How has compute as a meta-resource shaped the AI industry?

Representative models

Key papers

Training systems

Further reading

05

Alignment: RLHF & Instruction Tuning

2022 – 2024

Pretrained models can complete sentences but aren't necessarily 'helpful' or 'safe'. In 2022, InstructGPT industrialized RLHF (Reinforcement Learning from Human Feedback) in three steps: SFT on human demonstrations, train a reward model on pairwise preferences, then fine-tune via PPO. In November 2022 ChatGPT let the public feel the difference an aligned model makes. Since 2023, cheaper alternatives emerged: Anthropic's Constitutional AI had models self-critique; DPO (Direct Preference Optimization) bypassed the reward model entirely; SimPO, KTO, and IPO kept simplifying. By 2024, RLAIF (AI feedback instead of human labels) broke the human-annotation bottleneck.
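The DPO simplification mentioned above fits in one formula: a logistic loss on the difference of implicit rewards β·(log π_θ − log π_ref) between the chosen and rejected answers, with no reward model or RL loop. A single-pair sketch with made-up log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (w = chosen, l = rejected).
    margin = beta * difference of implicit rewards; loss = -log sigmoid(margin)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# loss falls as the policy raises the chosen answer relative to the rejected one
better = dpo_loss(logp_w=-9.0, logp_l=-12.0, ref_logp_w=-10.0, ref_logp_l=-10.0)
worse = dpo_loss(logp_w=-11.0, logp_l=-9.0, ref_logp_w=-10.0, ref_logp_l=-10.0)
```

The reference model acts as the KL anchor that PPO enforces explicitly: moving the chosen answer's likelihood only counts insofar as it moves relative to π_ref.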

After this chapter you should be able to answer

  1. What does each step of RLHF actually do? What does PPO contribute in the third step?
  2. How are reward-model training pairwise preferences collected? Who are the annotators?
  3. Why is RLHF more effective than pure supervised fine-tuning? What is the ceiling of SFT alone?
  4. Which dimension did InstructGPT improve the most over raw GPT-3?
  5. What is Constitutional AI's self-critique flow?
  6. What does DPO simplify vs PPO, and how much does performance differ?
  7. What does reward hacking look like in RLHF? How is it mitigated?
  8. What are typical data volumes for SFT, RLHF, and DPO?
More questions (2)
  1. What's the central innovation of RLAIF over RLHF?
  2. Where do ReST, SimPO, KTO, and IPO sit in the 2024 alignment landscape?

Core techniques

  • Supervised Fine-Tuning (SFT)

    Fine-tuning on instruction-response pairs. The first step of every subsequent alignment pipeline.

  • Reward Model (RM)

    Takes answer pairs, emits preference scores. The referee in RLHF.

  • PPO (Proximal Policy Optimization)

    The RL algorithm InstructGPT used. Core idea: clipped objective to bound updates.

  • DPO (Direct Preference Optimization)

    Replaces RL with a closed-form cross-entropy loss.

  • Constitutional AI

    Anthropic's self-critique framework with AI feedback replacing most human labels.

  • SimPO / KTO / IPO

    2024 DPO variants that further simplify or correct the loss.
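The clipped objective named in the PPO bullet can be shown in isolation. This is a per-sample sketch of the surrogate to be maximized; real PPO averages it over a batch and adds value-function and entropy terms:

```python
def ppo_clip(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(r·A, clip(r, 1-eps, 1+eps)·A), where r is the
    new/old policy probability ratio. Clipping removes any incentive to push
    the ratio outside [1-eps, 1+eps], bounding each policy update."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# beyond the clip range, a larger ratio no longer increases the objective
plateau = ppo_clip(1.5, advantage=1.0)
edge = ppo_clip(1.2, advantage=1.0)
```

With a positive advantage, pushing the ratio past 1+eps yields no extra objective, so gradient ascent stops moving it; that bounded step is what keeps RLHF fine-tuning from running away from the SFT policy.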

Key papers

Representative aligned models

Further reading

06

Efficiency: MoE, Quantization, Long Context

2021 – 2025

Bigger models are smarter, but training and inference costs climb with them. Three paths compress those costs: (1) sparse activation (MoE) so each token uses only a fraction of parameters; (2) quantization (INT8 / INT4 / FP8) to shrink storage and compute; (3) attention engineering to make long contexts tractable. Mixtral 8x7B (Mistral 2023) showed MoE works in open source; GPT-4 is widely believed to be MoE; Gemini 1.5 Pro pushed context to 1M tokens with ring attention; DeepSeek-V3 combines MoE + MLA + FP8 — currently the most systematic efficiency case study.
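The sparse-activation idea (1) comes down to a router: score all experts, keep the top k, renormalize. A minimal single-token sketch with illustrative shapes (real systems add load-balancing losses and capacity limits):

```python
import numpy as np

def top_k_route(x, gate_W, k=2):
    """Top-k MoE routing for one token x of shape (d,). gate_W is (n_experts, d).
    Only the k selected experts run their FFN on this token."""
    logits = gate_W @ x
    top = np.argsort(logits)[-k:]                 # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())   # softmax over the survivors
    return top, w / w.sum()

rng = np.random.default_rng(0)
n_experts, d = 8, 16
experts, weights = top_k_route(rng.normal(size=d), rng.normal(size=(n_experts, d)), k=2)
```

With 8 experts and k=2, a Mixtral-style layer stores 8 expert FFNs but each token pays the compute of only 2 — parameters scale while per-token FLOPs barely move.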

After this chapter you should be able to answer

  1. How does top-k routing pick experts in MoE, and how is load imbalance handled?
  2. How do GShard, Switch Transformer, and Mixtral differ in MoE implementation?
  3. How much accuracy do INT8/INT4 cost? What are the different ideas behind AWQ, GPTQ, and SmoothQuant?
  4. How much speedup does FP8 training bring over FP16/BF16 on H100/Blackwell?
  5. Why does training only low-rank matrices in LoRA work? How does QLoRA add quantization?
  6. How does FlashAttention reduce attention HBM traffic — does it change the algorithm?
  7. What's the main benefit of PagedAttention (vLLM) managing KV cache like OS virtual memory?
  8. How do ring attention, Infini-attention, and Mamba each break the context-length barrier?
More questions (2)
  1. What is the speedup ceiling of Medusa, EAGLE, and Lookahead in speculative decoding?
  2. How does MLA (Multi-head Latent Attention) shrink KV cache in DeepSeek-V2/V3?

Core techniques

Key papers

Representative models

Further reading

07

Multimodality & Reasoning

2021 – 2025

Two directions ran in parallel through 2023-2024: multimodality and reasoning. Multimodality began with CLIP (2021) aligning images and text; then Flamingo (2022) mixing image + text input; GPT-4V (2023) native vision; Gemini (2023) handling text, image, audio, and video together. On reasoning, Chain-of-Thought prompting (Wei 2022) first showed that 'step-by-step thinking' lifts accuracy; in September 2024 OpenAI o1 made thinking itself a training target (RL on reasoning) with major gains in math and code; January 2025's DeepSeek R1 proved the approach can be reproduced in open source. The current frontier is fusing reasoning and multimodality.
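CLIP's image-text alignment is a symmetric contrastive (InfoNCE) loss: in a batch of paired embeddings, matched pairs sit on the diagonal of the similarity matrix and everything off-diagonal is a negative. A NumPy sketch with random stand-in embeddings:

```python
import numpy as np

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of image/text embeddings, shape (n, d).
    Each row-wise softmax should put its mass on the diagonal (matched pair)."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent(l):  # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))  # image→text and text→image

rng = np.random.default_rng(0)
n, d = 4, 32
loss = clip_loss(rng.normal(size=(n, d)), rng.normal(size=(n, d)))
```

Zero-shot classification then falls out for free: embed the class names as text ("a photo of a dog", ...) and pick the caption with the highest cosine similarity to the image.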

After this chapter you should be able to answer

  1. How does CLIP's contrastive learning align images and text? How does zero-shot classification work?
  2. How do Flamingo, BLIP-2, and LLaVA differ in fusing vision and language?
  3. What are the engineering differences between native vision (GPT-4V) and bolted-on vision encoders?
  4. How does Whisper tokenize audio?
  5. Why does Chain-of-Thought prompting boost reasoning so much? Does it only help at large scale?
  6. What does self-consistency actually gain over greedy decoding?
  7. How do Tree of Thoughts and Graph of Thoughts relate to CoT?
  8. What signal does OpenAI o1's 'RL on reasoning' actually train on?
More questions (2)
  1. How does DeepSeek R1's GRPO simplify PPO?
  2. How is test-time compute scaling fundamentally different from pretrain scaling?

Multimodal models

Reasoning models

Key papers

Further reading

08

Frontier: Agents, SSM, World Models

2024 –

From 2024 on, LLM research has split into 'beyond Transformer' and 'become an Agent.' On architecture, Mamba / State Space Models explore linear-complexity sequence modeling; RetNet, Hyena, RWKV, and Jamba all try to replace quadratic attention. On agents, Anthropic's Computer Use (2024), OpenAI Operator (2025), and Claude Agents (2025) let models drive browsers and operating systems; tool use, long-term memory, and multi-agent setups have become product themes. World models (Genie, Sora, Cosmos) pull video and physics simulation into the LLM frame. The field is more uncertain and more volatile in 2025-2026 than at any point before.
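The linear-complexity claim for SSMs comes from their recurrence: one fixed-size state update per step, so sequence length n costs O(n) rather than attention's O(n²). A plain (non-selective) sketch; Mamba's 'selective' twist additionally makes A and B functions of the input:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Linear state-space recurrence over a scalar input sequence u:
    x_t = A·x_{t-1} + B·u_t,  y_t = C·x_t. State size is fixed, so cost
    is linear in sequence length."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)

rng = np.random.default_rng(0)
d_state, n = 4, 10
A = 0.9 * np.eye(d_state)  # a stable, decaying state for the sketch
y = ssm_scan(rng.normal(size=n), A, rng.normal(size=d_state), rng.normal(size=d_state))
```

The trade-off versus attention is visible here too: all history is squeezed through the fixed-size state x, whereas attention can look back at any position exactly.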

After this chapter you should be able to answer

  1. Why is Mamba/SSM linear-complexity? What does its 'selective' mechanism mean?
  2. Can Mamba scale to GPT-4 levels? What is the largest SSM to date?
  3. How do RWKV, Hyena, RetNet, and Jamba position themselves relative to each other?
  4. How does the ReAct framework interleave reasoning and acting? Why is it the starting point for agents?
  5. How does MCP (Model Context Protocol) relate to function calling and tool use?
  6. What are the core engineering challenges of computer-use / browser-use agents?
  7. When do multi-agent systems (AutoGen, CrewAI) beat a single agent?
  8. What is the core mechanism of world models like Sora, Genie, and Cosmos?
More questions (2)
  1. What can mechanistic interpretability explain about LLM behavior today?
  2. How does 'data wall' (public-data exhaustion) affect scaling? Can synthetic data bridge it?

New architectures

Agent frameworks

World models & video generation

Frontier research directions

  • Mechanistic interpretability

    Chris Olah-led Transformer Circuits work — opening up LLM internals.

  • Test-time compute scaling

    A new scaling curve where 'thinking longer' replaces 'bigger model.'

  • Self-play / Synthetic Data

    As public data approaches exhaustion, self-generated training data is a hot alternative.

  • Multi-agent Systems

    Multiple specialized models / agents collaborating — already effective in code, research, and negotiation.

  • Embodied / Robotics Foundation Models

    RT-2, OpenVLA, and Gemini Robotics bring LLMs into robotics.

Further reading