LLM Algorithm Evolution

From N-gram to Transformer, from GPT-3 to o1 and R1 — a time-ordered map of LLM algorithms: each chapter has an intro, questions to answer, and its key models, papers, techniques, and tools.

01

Pre-Deep-Learning Era

1948 – 2013

Before deep learning, the default language-modeling recipe was N-gram + smoothing (Kneser-Ney, Good-Turing): use the previous n-1 words to predict the next one. Fast and interpretable, but it ran into two hard problems: the curse of dimensionality (a vocabulary^n event space) and a short context window (beyond n=4 the statistics become extremely sparse). In 2003 Bengio's Neural Probabilistic Language Model mapped words into a continuous vector space. In 2013 Mikolov's word2vec engineered that idea into a standard tool and made word embeddings a baseline NLP component. The central idea from this era — distributed representation — is the bedrock for everything that followed.
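As a concrete instance of the recipe above, here is a minimal bigram model with add-one (Laplace) smoothing; Kneser-Ney would replace the flat `+alpha` term with continuation counts. A toy sketch, not production code:

```python
from collections import Counter

def bigram_prob(corpus, prev, word, alpha=1.0):
    """P(word | prev) with add-alpha (Laplace) smoothing over a toy corpus."""
    tokens = corpus.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

corpus = "the cat sat on the mat the cat ran"
p_seen = bigram_prob(corpus, "the", "cat")    # "the cat" occurs 2 of 3 times
p_unseen = bigram_prob(corpus, "the", "ran")  # zero count, smoothing keeps it > 0
```

The unseen bigram gets a small but nonzero probability, which is the whole point of smoothing; without it, one unseen pair zeroes out an entire sentence probability.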

After this chapter you should be able to answer

  1. How is N-gram probability computed? Why is Kneser-Ney smoothing more widely used than Laplace?
  2. What's the essential innovation in Bengio 2003's Neural Probabilistic Language Model vs N-gram?
  3. How do CBOW and Skip-gram differ as training objectives for word2vec?
  4. What computational problems do negative sampling and hierarchical softmax each address?
  5. What philosophical difference separates GloVe from word2vec?
  6. Why does the linear structure (king − man + woman ≈ queen) emerge?
  7. What's the fundamental limit of static word embeddings — why can 'bank' not represent both a river bank and a bank account simultaneously?
  8. What did FastText add over word2vec, and why is it friendlier to OOV tokens?
More questions (2)
  1. Who is the source of the distributional-hypothesis quote "You shall know a word by the company it keeps"?
  2. What role do older methods like TF-IDF, LSA, and LDA still play today?

Key techniques

  • N-gram language model

    The classic statistical language model. Kneser-Ney smoothing was the industry default for 30 years.

  • Word embeddings

    Mapping words into dense vector spaces — the origin point of modern NLP.

  • word2vec

    Mikolov's two simplified models (CBOW, Skip-gram) made word embeddings trainable at scale.

  • Negative sampling

    Replaces softmax with binary classification, reducing Skip-gram training from O(V) to O(k).

  • GloVe

    Global co-occurrence matrix factorization — a complementary philosophy to word2vec.

  • FastText

    Subword embeddings via character n-grams. Strong for morphologically rich languages and OOV.

  • TF-IDF / LSA / LDA

    Older text representations that still have a role in information retrieval and document clustering.
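The negative-sampling technique in the list above reduces to a handful of lines: instead of a softmax over the full vocabulary, score one true (center, context) pair against k sampled negatives with a sigmoid. A NumPy sketch with illustrative shapes and names of my own:

```python
import numpy as np

def sgns_loss(center, context, negatives):
    """Skip-gram negative-sampling loss for one (center, context) pair.
    center/context are (d,) embedding vectors; negatives has shape (k, d).
    Replaces a full-vocab softmax (O(V)) with 1 positive + k negatives (O(k))."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = -np.log(sigmoid(center @ context))               # pull the true pair together
    neg = -np.sum(np.log(sigmoid(-(negatives @ center))))  # push the k samples apart
    return pos + neg

rng = np.random.default_rng(0)
d, k = 8, 5
loss = sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d)))
```

In real word2vec the negatives are drawn from the unigram distribution raised to the 0.75 power, which upweights rare words relative to pure frequency sampling.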

Key papers

Models & tools

  • word2vec

    The original C implementation — still fast on large corpora.

  • Gensim

    The easiest Python library for word2vec, LDA, LSI.

  • spaCy

    Industrial NLP pipelines with pretrained vectors and reliable tooling.

  • NLTK

    Teaching-oriented NLP toolbox — N-gram, HMM, PCFG are all there.

Further reading

02

The RNN Era

1997 – 2017

Word embeddings handled lexical representation, but language is a sequence problem — word meaning depends on context. Recurrent Neural Networks (RNNs) unroll the same weights across time to process variable-length sequences, but vanishing gradients made long-range dependencies difficult. LSTM (Hochreiter 1997) introduced gating to carry information across long sequences; GRU (2014) is a lighter variant. In 2014 Sutskever's Seq2Seq brought the encoder-decoder architecture to machine translation, and Bahdanau's attention mechanism let the decoder peek back at any input position — a direct precursor to the Transformer.
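The gating idea that lets LSTM carry information across long sequences fits in one step function; shapes and initialization here are illustrative, not a reference implementation:

```python
import numpy as np

def lstm_cell(x, h, c, W, b):
    """One LSTM step. W maps the concatenated [x; h] to the four gate
    pre-activations; with hidden size d, W is (4d, input+d) and b is (4d,)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = W @ np.concatenate([x, h]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget / input / output gates
    c_new = f * c + i * np.tanh(g)   # additive cell-state path eases gradient flow
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(1)
dx, dh = 3, 4
W = rng.normal(size=(4 * dh, dx + dh))
b = np.zeros(4 * dh)
h, c = np.zeros(dh), np.zeros(dh)
h, c = lstm_cell(rng.normal(size=dx), h, c, W, b)
```

The additive update `f * c + i * tanh(g)` is the key: gradients can flow through the cell state largely unattenuated, which is what vanilla RNNs lack.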

After this chapter you should be able to answer

  1. What's the root cause of vanishing/exploding gradients in RNNs? How is BPTT performed?
  2. What do the three LSTM gates (forget, input, output) actually control?
  3. What does GRU leave out compared to LSTM? When is the performance difference negligible?
  4. What translation problem does the Seq2Seq encoder-decoder solve?
  5. How do Bahdanau and Luong attention differ?
  6. Why are RNN-style models hard to parallelize? What did that mean for production deployment?
  7. What's the relationship between teacher forcing and exposure bias?
  8. Where did ConvS2S (Facebook 2017) beat RNN Seq2Seq, and why did it never become mainstream?
More questions (2)
  1. How much does beam search width matter for translation quality?
  2. Why is ELMo often called a "transitional form"? What architecture does it use?

Key models

  • Vanilla RNN

    The simplest recurrent structure, held back in practice by vanishing gradients.

  • LSTM

    Hochreiter & Schmidhuber 1997. Gates + cell state are the core insight.

  • GRU

    Cho 2014. Merges the three LSTM gates into two — fewer parameters, faster training.

  • Seq2Seq

    Sutskever 2014's encoder-decoder framework — the skeleton for modern generative models.

  • Bahdanau attention

    Lets the decoder re-attend to arbitrary encoder positions. Direct precursor to Transformer.

  • ConvS2S

    Facebook 2017 used CNNs for Seq2Seq — easier to parallelize than RNNs, quickly overtaken by Transformer.

  • ELMo

    Bi-LSTM contextual embeddings — an early industrial demonstration of pretraining in NLP.

Key papers

Tools

  • PyTorch / TensorFlow

    Early RNN implementations lived here and remain the teaching starting point.

  • OpenNMT

    Open-source neural MT framework, the industry choice during the RNN era.

  • fairseq

    Facebook's Seq2Seq toolkit, later extended to Transformers.

  • torchtext / AllenNLP

    Data loading and modeling components around the PyTorch ecosystem.

Further reading

03

Transformer & Pretraining

2017 – 2020

Google's 2017 paper 'Attention Is All You Need' introduced the Transformer: pure self-attention replacing RNNs/CNNs, fully parallel end-to-end. Core innovations: multi-head self-attention, positional encoding, residual connections + LayerNorm. In 2018 GPT-1 and BERT appeared almost simultaneously: GPT is autoregressive generation (decoder-only) and BERT is masked fill-in (encoder-only, MLM). T5 (2019) reframed every NLP task as text-to-text. The era's unifying trend is 'pretrain + fine-tune' — data and compute replaced task-specific architecture.
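A minimal single-head version of the self-attention at the Transformer's core (NumPy sketch; real implementations add causal masking, multiple heads, and output projections):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape
    (n, d). The (n, n) score matrix is the O(n^2) term in the complexity."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scale stabilizes the softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
out, w = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
```

Every row of `w` is a probability distribution over all positions, computed for all rows at once — that all-pairs, all-at-once structure is what makes the Transformer parallel where an RNN is sequential.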

After this chapter you should be able to answer

  1. What roles do Q, K, and V play in self-attention? Is complexity O(n²) or O(n·d)?
  2. What do multi-head attention heads buy over a single head? Why 8 / 16 / 32?
  3. What are the motivations behind absolute (sinusoidal), relative, and RoPE positional encodings?
  4. Why the 80/10/10 mask/replace/unchanged mix in BERT's MLM?
  5. Where does GPT's next-token prediction beat BERT's MLM, and where does it lose?
  6. When do encoder-only, decoder-only, and encoder-decoder architectures each shine?
  7. Why was T5's 'everything is text-to-text' framing so influential?
  8. What are the differences between BPE, WordPiece, and SentencePiece tokenizers?
More questions (2)
  1. How much does LayerNorm placement (Post-LN vs Pre-LN) affect training stability?
  2. Why does the Transformer also dominate vision (ViT) and speech (Whisper)?

Key architectures

  • Transformer

    Vaswani 2017 — the common ancestor of every modern LLM.

  • BERT

    Google 2018 encoder-only pretrained model with MLM.

  • GPT-1

    OpenAI 2018. Established the decoder-only pretrain + fine-tune recipe.

  • T5

    Google 2019 text-to-text framework unifying all NLP tasks as generation.

  • RoBERTa

    FAIR optimized BERT training: more data, longer, dropping NSP.

  • XLNet

    Permutation LM combining AR and AE — elegant but engineering-heavy.

  • BART

    FAIR's encoder-decoder pretrained model with noise → denoising objective.

Core mechanisms

  • Multi-head self-attention

    Attention under several 'views' computed in parallel — the Transformer's compute core.

  • Positional encoding

    Sinusoidal (original), learned, relative, RoPE — the main ways to express sequence order.

  • Residual connections + LayerNorm

    Critical for stable Transformer training. Pre-LN vs Post-LN matters.

  • Masked Language Modeling

    BERT's pretraining objective — randomly mask and predict.

  • Next-token prediction

    GPT's pretraining objective and the foundation of every modern generative LLM.

  • BPE / WordPiece / SentencePiece

    The three mainstream subword algorithms. Determines vocab and encoding efficiency.
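Of the mechanisms listed, the original sinusoidal positional encoding can be written down directly from the Vaswani formula — even dimensions take sin, odd take cos, with geometrically spaced wavelengths. A sketch:

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """Original Transformer positional encoding, shape (n_pos, d_model).
    Dimension pair 2i/2i+1 uses sin/cos at frequency 1 / 10000^(2i/d_model)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(50, 16)
```

Because sin/cos of shifted positions are linear combinations of the originals, the model can in principle express relative offsets — the motivation that relative encodings and RoPE later made explicit.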

Tools

Further reading

04

Scaling Laws: GPT-2 to GPT-3

2019 – 2022

GPT-2 (2019) scaled the Transformer to 1.5B parameters and showed zero-shot transfer: the model could follow natural-language prompts without fine-tuning. GPT-3 (2020) scaled to 175B and revealed few-shot in-context learning — a handful of examples in the prompt and the model picks up new tasks. In 2020, Kaplan's 'Scaling Laws for Neural Language Models' showed loss falls as a power law in compute, data, and parameters. In 2022 DeepMind's Chinchilla corrected Kaplan's allocation: at a fixed compute budget, data should scale in lockstep with parameters, rather than concentrating almost all growth in parameters as Kaplan's fit suggested. This framework has guided the resource allocation of every subsequent large model.
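Chinchilla's correction is often summarized by the rule of thumb of roughly 20 training tokens per parameter, with training FLOPs approximated as C ≈ 6·N·D. Under those two assumptions the compute-optimal split has a closed form; the constants here are the popular rule of thumb, not the paper's exact fitted values:

```python
def chinchilla_optimal(compute_flops):
    """Compute-optimal (params, tokens) under two rules of thumb:
    training FLOPs C ≈ 6·N·D, and D ≈ 20·N tokens per parameter.
    Substituting gives C = 120·N^2, so N = sqrt(C / 120) and D = 20·N."""
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# e.g. a 1e23-FLOP budget suggests a ~29B-parameter model on ~580B tokens
n, d = chinchilla_optimal(1e23)
```

Run against GPT-3's actual shape (175B params, ~300B tokens), the rule says GPT-3 was heavily under-trained for its size — exactly the gap LLaMA-class models exploited.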

After this chapter you should be able to answer

  1. What's the capability jump between GPT-2 (1.5B) and GPT-3 (175B)?
  2. What are the emergent abilities, and at what scales do they appear?
  3. How are Kaplan's three power laws (loss vs params, data, compute) derived?
  4. Which assumption did Chinchilla correct? What new ratio does it recommend?
  5. What's the approximate training cost (GPU-hours, electricity) of a 175B GPT-3?
  6. Why does in-context learning work? How does it differ from true fine-tuning?
  7. When did prompt engineering become a craft?
  8. What are the different ecosystem trajectories for open-source GPT-2 vs closed GPT-3?
More questions (2)
  1. Why did Chinchilla's finding enable LLaMA-class 'small-but-data-heavy' models?
  2. How has compute as a meta-resource shaped the AI industry?

Representative models

Key papers

Training systems

Further reading

05

Alignment: RLHF & Instruction Tuning

2022 – 2024

Pretrained models can complete sentences but aren't necessarily 'helpful' or 'safe'. In 2022, InstructGPT industrialized RLHF (Reinforcement Learning from Human Feedback) in three steps: SFT on human demonstrations, train a reward model on pairwise preferences, then fine-tune via PPO. In November 2022 ChatGPT let the public feel the difference an aligned model makes. Since 2023, cheaper alternatives emerged: Anthropic's Constitutional AI had models self-critique; DPO (Direct Preference Optimization) bypassed the reward model entirely; SimPO, KTO, and IPO kept simplifying. By 2024, RLAIF (AI feedback instead of human labels) broke the human-annotation bottleneck.
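The DPO simplification mentioned above fits in one formula: a logistic loss on the difference of implicit rewards β·(log π_θ − log π_ref) between the chosen and rejected answers, with no reward model or RL loop. A single-pair sketch with made-up log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (w = chosen, l = rejected).
    margin = beta * difference of implicit rewards; loss = -log sigmoid(margin)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# loss falls as the policy raises the chosen answer relative to the rejected one
better = dpo_loss(logp_w=-9.0, logp_l=-12.0, ref_logp_w=-10.0, ref_logp_l=-10.0)
worse = dpo_loss(logp_w=-11.0, logp_l=-9.0, ref_logp_w=-10.0, ref_logp_l=-10.0)
```

The reference model acts as the KL anchor that PPO enforces explicitly: moving the chosen answer's likelihood only counts insofar as it moves relative to π_ref.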

After this chapter you should be able to answer

  1. What does each step of RLHF actually do? What does PPO contribute in the third step?
  2. How are reward-model training pairwise preferences collected? Who are the annotators?
  3. Why is RLHF more effective than pure supervised fine-tuning? What is the ceiling of SFT alone?
  4. Which dimension did InstructGPT improve the most over raw GPT-3?
  5. What is Constitutional AI's self-critique flow?
  6. What does DPO simplify vs PPO, and how much does performance differ?
  7. What does reward hacking look like in RLHF? How is it mitigated?
  8. What are typical data volumes for SFT, RLHF, and DPO?
More questions (2)
  1. What's the central innovation of RLAIF over RLHF?
  2. Where do ReST, SimPO, KTO, and IPO sit in the 2024 alignment landscape?

Core techniques

  • Supervised Fine-Tuning (SFT)

    Fine-tuning on instruction-response pairs. The first step of every subsequent alignment pipeline.

  • Reward Model (RM)

    Takes answer pairs, emits preference scores. The referee in RLHF.

  • PPO (Proximal Policy Optimization)

    The RL algorithm InstructGPT used. Core idea: clipped objective to bound updates.

  • DPO (Direct Preference Optimization)

    Replaces RL with a closed-form cross-entropy loss.

  • Constitutional AI

    Anthropic's self-critique framework with AI feedback replacing most human labels.

  • SimPO / KTO / IPO

    2024 DPO variants that further simplify or correct the loss.
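The clipped objective named in the PPO bullet can be shown in isolation. This is a per-sample sketch of the surrogate to be maximized; real PPO averages it over a batch and adds value-function and entropy terms:

```python
def ppo_clip(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(r·A, clip(r, 1-eps, 1+eps)·A), where r is the
    new/old policy probability ratio. Clipping removes any incentive to push
    the ratio outside [1-eps, 1+eps], bounding each policy update."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# beyond the clip range, a larger ratio no longer increases the objective
plateau = ppo_clip(1.5, advantage=1.0)
edge = ppo_clip(1.2, advantage=1.0)
```

With a positive advantage, pushing the ratio past 1+eps yields no extra objective, so gradient ascent stops moving it; that bounded step is what keeps RLHF fine-tuning from running away from the SFT policy.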

Key papers

Representative aligned models

Further reading

06

Efficiency: MoE, Quantization, Long Context

2021 – 2025

Bigger models are smarter, but training and inference costs climb with them. Three paths compress those costs: (1) sparse activation (MoE) so each token uses only a fraction of parameters; (2) quantization (INT8 / INT4 / FP8) to shrink storage and compute; (3) attention engineering to make long contexts tractable. Mixtral 8x7B (Mistral 2023) showed MoE works in open source; GPT-4 is widely believed to be MoE; Gemini 1.5 Pro pushed context to 1M tokens with ring attention; DeepSeek-V3 combines MoE + MLA + FP8 — currently the most systematic efficiency case study.
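The sparse-activation idea (1) comes down to a router: score all experts, keep the top k, renormalize. A minimal single-token sketch with illustrative shapes (real systems add load-balancing losses and capacity limits):

```python
import numpy as np

def top_k_route(x, gate_W, k=2):
    """Top-k MoE routing for one token x of shape (d,). gate_W is (n_experts, d).
    Only the k selected experts run their FFN on this token."""
    logits = gate_W @ x
    top = np.argsort(logits)[-k:]                 # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())   # softmax over the survivors
    return top, w / w.sum()

rng = np.random.default_rng(0)
n_experts, d = 8, 16
experts, weights = top_k_route(rng.normal(size=d), rng.normal(size=(n_experts, d)), k=2)
```

With 8 experts and k=2, a Mixtral-style layer stores 8 expert FFNs but each token pays the compute of only 2 — parameters scale while per-token FLOPs barely move.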

After this chapter you should be able to answer

  1. How does top-k routing pick experts in MoE, and how is load imbalance handled?
  2. How do GShard, Switch Transformer, and Mixtral differ in MoE implementation?
  3. How much accuracy do INT8/INT4 cost? What are the different ideas behind AWQ, GPTQ, and SmoothQuant?
  4. How much speedup does FP8 training bring over FP16/BF16 on H100/Blackwell?
  5. Why does training only low-rank matrices in LoRA work? How does QLoRA add quantization?
  6. How does FlashAttention reduce attention HBM traffic — does it change the algorithm?
  7. What's the main benefit of PagedAttention (vLLM) managing KV cache like OS virtual memory?
  8. How do ring attention, Infini-attention, and Mamba each break the context-length barrier?
More questions (2)
  1. What is the speedup ceiling of Medusa, EAGLE, and Lookahead in speculative decoding?
  2. How does MLA (Multi-head Latent Attention) shrink KV cache in DeepSeek-V2/V3?

Core techniques

Key papers

Representative models

Further reading

07

Multimodality & Reasoning

2021 – 2025

Two directions ran in parallel through 2023-2024: multimodality and reasoning. Multimodality began with CLIP (2021) aligning images and text; then Flamingo (2022) mixing image + text input; GPT-4V (2023) native vision; Gemini (2023) handling text, image, audio, and video together. On reasoning, Chain-of-Thought prompting (Wei 2022) first showed that 'step-by-step thinking' lifts accuracy; in September 2024 OpenAI o1 made thinking itself a training target (RL on reasoning) with major gains in math and code; January 2025's DeepSeek R1 proved the approach can be reproduced in open source. The current frontier is fusing reasoning and multimodality.
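CLIP's image-text alignment is a symmetric contrastive (InfoNCE) loss: in a batch of paired embeddings, matched pairs sit on the diagonal of the similarity matrix and everything off-diagonal is a negative. A NumPy sketch with random stand-in embeddings:

```python
import numpy as np

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of image/text embeddings, shape (n, d).
    Each row-wise softmax should put its mass on the diagonal (matched pair)."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent(l):  # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))  # image→text and text→image

rng = np.random.default_rng(0)
n, d = 4, 32
loss = clip_loss(rng.normal(size=(n, d)), rng.normal(size=(n, d)))
```

Zero-shot classification then falls out for free: embed the class names as text ("a photo of a dog", ...) and pick the caption with the highest cosine similarity to the image.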

After this chapter you should be able to answer

  1. How does CLIP's contrastive learning align images and text? How does zero-shot classification work?
  2. How do Flamingo, BLIP-2, and LLaVA differ in fusing vision and language?
  3. What are the engineering differences between native vision (GPT-4V) and bolted-on vision encoders?
  4. How does Whisper tokenize audio?
  5. Why does Chain-of-Thought prompting boost reasoning so much? Does it only help at large scale?
  6. What does self-consistency actually gain over greedy decoding?
  7. How do Tree of Thoughts and Graph of Thoughts relate to CoT?
  8. What signal does OpenAI o1's 'RL on reasoning' actually train on?
More questions (2)
  1. How does DeepSeek R1's GRPO simplify PPO?
  2. How is test-time compute scaling fundamentally different from pretrain scaling?

Multimodal models

Reasoning models

Key papers

Further reading

08

Frontier: Agents, SSM, World Models

2024 –

From 2024 on, LLM research has split into 'beyond Transformer' and 'become an Agent.' On architecture, Mamba / State Space Models explore linear-complexity sequence modeling; RetNet, Hyena, RWKV, and Jamba all try to replace quadratic attention. On agents, Anthropic's Computer Use (2024), OpenAI Operator (2025), and Claude Agents (2025) let models drive browsers and operating systems; tool use, long-term memory, and multi-agent setups have become product themes. World models (Genie, Sora, Cosmos) pull video and physics simulation into the LLM frame. The field is more uncertain and more volatile in 2025-2026 than at any point before.
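The linear-complexity claim for SSMs comes from their recurrence: one fixed-size state update per step, so sequence length n costs O(n) rather than attention's O(n²). A plain (non-selective) sketch; Mamba's 'selective' twist additionally makes A and B functions of the input:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Linear state-space recurrence over a scalar input sequence u:
    x_t = A·x_{t-1} + B·u_t,  y_t = C·x_t. State size is fixed, so cost
    is linear in sequence length."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)

rng = np.random.default_rng(0)
d_state, n = 4, 10
A = 0.9 * np.eye(d_state)  # a stable, decaying state for the sketch
y = ssm_scan(rng.normal(size=n), A, rng.normal(size=d_state), rng.normal(size=d_state))
```

The trade-off versus attention is visible here too: all history is squeezed through the fixed-size state x, whereas attention can look back at any position exactly.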

After this chapter you should be able to answer

  1. Why is Mamba/SSM linear-complexity? What does its 'selective' mechanism mean?
  2. Can Mamba scale to GPT-4 levels? What is the largest SSM to date?
  3. How do RWKV, Hyena, RetNet, and Jamba position themselves relative to each other?
  4. How does the ReAct framework interleave reasoning and acting? Why is it the starting point for agents?
  5. How does MCP (Model Context Protocol) relate to function calling and tool use?
  6. What are the core engineering challenges of computer-use / browser-use agents?
  7. When do multi-agent systems (AutoGen, CrewAI) beat a single agent?
  8. What is the core mechanism of world models like Sora, Genie, and Cosmos?
More questions (2)
  1. What can mechanistic interpretability explain about LLM behavior today?
  2. How does 'data wall' (public-data exhaustion) affect scaling? Can synthetic data bridge it?

New architectures

Agent frameworks

World models & video generation

Frontier research directions

  • Mechanistic interpretability

    Chris Olah-led Transformer Circuits work — opening up LLM internals.

  • Test-time compute scaling

    A new scaling curve where 'thinking longer' replaces 'bigger model.'

  • Self-play / Synthetic Data

    As public data approaches exhaustion, self-generated training data is a hot alternative.

  • Multi-agent Systems

    Multiple specialized models / agents collaborating — already effective in code, research, and negotiation.

  • Embodied / Robotics Foundation Models

    RT-2, OpenVLA, and Gemini Robotics bring LLMs into robotics.

Further reading