The purpose of this curriculum is to help new Elicit employees build a background in machine learning, with a focus on language models. I’ve tried to strike a balance between papers that are relevant for deploying ML in production and techniques that matter for longer-term scalability.
If you don’t work at Elicit yet, we’re hiring ML and software engineers.
Recommended reading order:
- Read “Tier 1” for all topics
- Read “Tier 2” for all topics
- And so on for the remaining tiers
✨ = Added after 2025/11/26
- Fundamentals
- Reasoning and runtime strategies
- Applications
- ML in practice
- Advanced topics
- The big picture
- Maintainer
Tier 1
- A short introduction to machine learning
- But what is a neural network?
- Gradient descent, how neural networks learn
Tier 2
- An intuitive understanding of backpropagation
- What is backpropagation really doing?
- An introduction to deep reinforcement learning
Tier 3
- The spelled-out intro to neural networks and backpropagation: building micrograd (Karpathy)
- Backpropagation calculus
Tier 1
- ✨ Intro to Large Language Models (Karpathy)
- But what is a GPT? Visual intro to transformers
- Attention in transformers, visually explained
- Attention? Attention!
- The Illustrated Transformer
Tier 2
- ✨ Deep Dive into LLMs like ChatGPT (Karpathy)
- Let's build the GPT Tokenizer (Karpathy)
- The Illustrated GPT-2 (Visualizing Transformer Language Models)
- Neural Machine Translation by Jointly Learning to Align and Translate
- Attention Is All You Need
Tier 3
- ✨ The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
- The Annotated Transformer
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
- A Mathematical Framework for Transformer Circuits
Tier 4+
Tier 1
- Language Models are Unsupervised Multitask Learners (GPT-2)
- Language Models are Few-Shot Learners (GPT-3)
Tier 2
- ✨ DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-R1)
- ✨ DeepSeek-V3 Technical Report (DeepSeek-V3)
- ✨ The Llama 3 Herd of Models (Llama 3)
- LLaMA: Open and Efficient Foundation Language Models (LLaMA)
- Training language models to follow instructions with human feedback (OpenAI Instruct)
Tier 3
- ✨ LLaMA 2: Open Foundation and Fine-Tuned Chat Models (LLaMA 2)
- ✨ Qwen2.5 Technical Report (Qwen2.5)
- ✨ Titans: Learning to Memorize at Test Time
- ✨ Byte Latent Transformer
- ✨ Phi-4 Technical Report (phi-4)
Tier 4+
- Evaluating Large Language Models Trained on Code (OpenAI Codex)
- Mistral 7B (Mistral)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
- Gemini: A Family of Highly Capable Multimodal Models (Gemini)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Mamba)
- Scaling Instruction-Finetuned Language Models (Flan)
- Efficiently Modeling Long Sequences with Structured State Spaces (video) (S4)
- Consistency Models
- Model Card and Evaluations for Claude Models (Claude 2)
- OLMo: Accelerating the Science of Language Models
- PaLM 2 Technical Report (PaLM 2)
- Textbooks Are All You Need II: phi-1.5 technical report (phi-1.5)
- Visual Instruction Tuning (LLaVA)
- A General Language Assistant as a Laboratory for Alignment
- Finetuned Language Models Are Zero-Shot Learners (Google Instruct)
- Galactica: A Large Language Model for Science
- LaMDA: Language Models for Dialog Applications (Google Dialog)
- OPT: Open Pre-trained Transformer Language Models (Meta GPT-3)
- PaLM: Scaling Language Modeling with Pathways (PaLM)
- Program Synthesis with Large Language Models (Google Codex)
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Gopher)
- Solving Quantitative Reasoning Problems with Language Models (Minerva)
- UL2: Unifying Language Learning Paradigms (UL2)
Tier 2
- ✨ Deep Learning Tuning Playbook
- Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
- Learning to summarize from human feedback
- Training Verifiers to Solve Math Word Problems
Tier 3
- ✨ Better & Faster Large Language Models via Multi-token Prediction
- ✨ LoRA vs Full Fine-tuning: An Illusion of Equivalence
- ✨ QLoRA: Efficient Finetuning of Quantized LLMs
- Pretraining Language Models with Human Preferences
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Tier 4+
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
- Improving Code Generation by Training with Natural Language Feedback
- Language Modeling Is Compression
- LIMA: Less Is More for Alignment
- Learning to Compress Prompts with Gist Tokens
- Lost in the Middle: How Language Models Use Long Contexts
- LoRA: Low-Rank Adaptation of Large Language Models
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- Reinforced Self-Training (ReST) for Language Modeling
- Solving olympiad geometry without human demonstrations
- Tell, don't show: Declarative facts influence how LLMs generalize
- Textbooks Are All You Need
- TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
- Training Language Models with Language Feedback at Scale
- Turing Complete Transformers: Two Transformers Are More Powerful Than One
- ByT5: Towards a token-free future with pre-trained byte-to-byte models
- Data Distributional Properties Drive Emergent In-Context Learning in Transformers
- Diffusion-LM Improves Controllable Text Generation
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
- Efficient Training of Language Models to Fill in the Middle
- ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
- Prefix-Tuning: Optimizing Continuous Prompts for Generation
- Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
- True Few-Shot Learning with Prompts -- A Real-World Perspective
Tier 2
- ✨ Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Chain of Thought Prompting Elicits Reasoning in Large Language Models
- Large Language Models are Zero-Shot Reasoners (Let's think step by step)
- Self-Consistency Improves Chain of Thought Reasoning in Language Models
Tier 3
- ✨ s1: Simple test-time scaling
- ✨ Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
- ✨ The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
- ✨ Large Language Models Cannot Self-Correct Reasoning Yet
- Chain-of-Thought Reasoning Without Prompting
Tier 4+
- Why think step-by-step? Reasoning emerges from the locality of experience
- Baldur: Whole-Proof Generation and Repair with Large Language Models
- Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
- Certified Reasoning with Language Models
- Hypothesis Search: Inductive Reasoning with Language Models
- LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations
- Stream of Search (SoS): Learning to Search in Language
- Training Chain-of-Thought via Latent-Variable Inference
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
- Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right
Tier 1
Tier 2
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- Factored cognition
- Iterated Distillation and Amplification
- Recursively Summarizing Books with Human Feedback
- Solving math word problems with process-based and outcome-based feedback
Tier 3
- Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers
- Faithful Reasoning Using Large Language Models
- Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes
- Language Model Cascades
Tier 4+
- Decontextualization: Making Sentences Stand-Alone
- Factored Cognition Primer
- Graph of Thoughts: Solving Elaborate Problems with Large Language Models
- Parsel: A Unified Natural Language Framework for Algorithmic Reasoning
- AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts
- Challenging BIG-Bench tasks and whether chain-of-thought can solve them
- Evaluating Arguments One Step at a Time
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
- Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations
- Measuring and narrowing the compositionality gap in language models
- PAL: Program-aided Language Models
- ReAct: Synergizing Reasoning and Acting in Language Models
- Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning
- Show Your Work: Scratchpads for Intermediate Computation with Language Models
- Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents
- ThinkSum: Probabilistic reasoning over sets using large language models
Tier 2
Tier 3
- ✨ Avoiding Obfuscation with Prover-Estimator Debate
- ✨ Improving Factuality and Reasoning in Language Models through Multiagent Debate
- ✨ Prover-Verifier Games Improve Legibility of LLM Outputs
- Debate Helps Supervise Unreliable Experts
Tier 4+
Tier 2
- Measuring the impact of post-training enhancements
- WebGPT: Browser-assisted question-answering with human feedback
Tier 3
- ✨ Executable Code Actions Elicit Better LLM Agents
- ✨ GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
- ✨ TextGrad: Automatic "Differentiation" via Text
- AI capabilities can be significantly improved without expensive retraining
- Automated Statistical Model Discovery with Language Models
Tier 4+
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
- Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
- Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- ReGAL: Refactoring Programs to Discover Generalizable Abstractions
Tier 2
Tier 3
- What Evidence Do Language Models Find Convincing?
- How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Tier 4+
Tier 2
- ✨ AlphaEvolve: A coding agent for scientific and algorithmic discovery
- ✨ AlphaFold 3: Accurate structure prediction of biomolecular interactions
Tier 3
- ✨ Towards an AI Co-Scientist
- Can large language models provide useful feedback on research papers? A large-scale empirical analysis
- Large Language Models Encode Clinical Knowledge
- The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
- A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
Tier 4+
- Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
- Nougat: Neural Optical Understanding for Academic Documents
- Scim: Intelligent Skimming Support for Scientific Papers
- SynerGPT: In-Context Learning for Personalized Drug Synergy Prediction and Drug Design
- Towards Accurate Differential Diagnosis with Large Language Models
- Towards a Benchmark for Scientific Understanding in Humans and Machines
- A Search Engine for Discovery of Scientific Challenges and Directions
- A full systematic review was completed in 2 weeks using automation tools: a case study
- Fact or Fiction: Verifying Scientific Claims
- Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles
- PEER: A Collaborative Language Model
- PubMedQA: A Dataset for Biomedical Research Question Answering
- SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts
- SciTail: A Textual Entailment Dataset from Science Question Answering
Tier 3
- ✨ Consistency Checks for Language Model Forecasters
- ✨ LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language
- AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy
- Approaching Human-Level Forecasting with Language Models
- Forecasting Future World Events with Neural Networks
Tier 2
- Learning Dense Representations of Phrases at Scale
- Text and Code Embeddings by Contrastive Pre-Training (OpenAI embeddings)
Tier 3
- Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
- Not All Vector Databases Are Made Equal
- REALM: Retrieval-Augmented Language Model Pre-Training
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Task-aware Retrieval with Instructions
Tier 4+
- RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
- Some Common Mistakes In IR Evaluation, And How They Can Be Avoided
- Boosting Search Engines with Interactive Agents
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking
- UnifiedQA: Crossing Format Boundaries With a Single QA System
Tier 1
- Machine Learning in Python: Main developments and technology trends in data science, machine learning, and AI
- Machine Learning: The High Interest Credit Card of Technical Debt
Tier 2
Tier 2
- ✨ GAIA: a benchmark for General AI Assistants
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- TruthfulQA: Measuring How Models Mimic Human Falsehoods
Tier 3
- ✨ RE-Bench: Evaluating Frontier AI R&D Capabilities
- ✨ SimpleQA: Measuring Short-Form Factuality
- ✨ ARC Prize 2024: Technical Report
- ✨ FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
- Measuring Massive Multitask Language Understanding
Tier 4+
- FLEX: Unifying Evaluation for Few-Shot NLP
- Holistic Evaluation of Language Models (HELM)
- True Few-Shot Learning with Language Models
- ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers
- Measuring Mathematical Problem Solving With the MATH Dataset
- QuALITY: Question Answering with Long Input Texts, Yes!
- SCROLLS: Standardized CompaRison Over Long Language Sequences
- What Will it Take to Fix Benchmarking in Natural Language Understanding?
Tier 2
Tier 3
- ✨ FineWeb: Decanting the Web for the Finest Text Data at Scale
- Dialog Inpainting: Turning Documents into Dialogs
- MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
- Microsoft Academic Graph
Tier 3
- Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
- From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
- Language Models Represent Space and Time
Tier 4+
- Amortizing intractable inference in large language models
- CLADDER: Assessing Causal Reasoning in Language Models
- Causal Bayesian Optimization
- Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
- Generative Agents: Interactive Simulacra of Human Behavior
- Passive learning of active causal strategies in agents and language models
Tier 4+
Tier 2
- Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs
- A Simple Baseline for Bayesian Uncertainty in Deep Learning
- Plex: Towards Reliability using Pretrained Large Model Extensions
Tier 3
- ✨ Textual Bayes: Quantifying Uncertainty in LLM-Based Systems
- Active Preference Inference using Language Models and Probabilistic Reasoning
- Eliciting Human Preferences with Language Models
- Describing Differences between Text Distributions with Natural Language
- Teaching Models to Express Their Uncertainty in Words
Tier 4+
- Active Learning by Acquiring Contrastive Examples
- Doing Experiments and Revising Rules with Natural Language and Probabilistic Reasoning
- STaR-GATE: Teaching Language Models to Ask Clarifying Questions
- Active Testing: Sample-Efficient Model Evaluation
- Uncertainty Estimation for Language Reward Models
Tier 2
- ✨ Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- ✨ Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
- Discovering Latent Knowledge in Language Models Without Supervision
Tier 3
- ✨ Scaling and Evaluating Sparse Autoencoders
- ✨ Opening the AI black box: program synthesis via mechanistic interpretability
- Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
- Representation Engineering: A Top-Down Approach to AI Transparency
- Studying Large Language Model Generalization with Influence Functions
Tier 4+
- Codebook Features: Sparse and Discrete Interpretability for Neural Networks
- Eliciting Latent Predictions from Transformers with the Tuned Lens
- How do Language Models Bind Entities in Context?
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- Uncovering mesa-optimization algorithms in Transformers
- Fast Model Editing at Scale
- Git Re-Basin: Merging Models modulo Permutation Symmetries
- Locating and Editing Factual Associations in GPT
- Mass-Editing Memory in a Transformer
Tier 2
- ✨ DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Reflexion: Language Agents with Verbal Reinforcement Learning
- Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (AlphaZero)
- MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
Tier 3
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
- AlphaStar: mastering the real-time strategy game StarCraft II
- Decision Transformer
- Mastering Atari Games with Limited Data (EfficientZero)
- Mastering Stratego, the classic game of imperfect information (DeepNash)
Tier 4+
- AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning
- Bayesian Reinforcement Learning with Limited Cognitive Load
- Contrastive Preference Learning: Learning from Human Feedback without RL
- Grandmaster-Level Chess Without Search
- A data-driven approach for learning to control computers
- Acquisition of Chess Knowledge in AlphaZero
- Player of Games
- Retrieval-Augmented Reinforcement Learning
Tier 1
Tier 2
- AI and compute
- Scaling Laws for Transfer
- Training Compute-Optimal Large Language Models (Chinchilla)
Tier 3
- ✨ Pre-training under Infinite Compute
- Emergent Abilities of Large Language Models
- Transcending Scaling Laws with 0.1% Extra Compute (U-PaLM)
Tier 4+
- Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
- Power Law Trends in Speedrunning and Machine Learning
- Scaling laws for single-agent reinforcement learning
- Beyond neural scaling laws: beating power law scaling via data pruning
- Scaling Scaling Laws with Board Games
Tier 1
- Three impacts of machine intelligence
- What failure looks like
- Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Tier 2
- An Overview of Catastrophic AI Risks
- Clarifying “What failure looks like” (part 1)
- Deep RL from human preferences
- The alignment problem from a deep learning perspective
Tier 3
- ✨ Alignment Faking in Large Language Models
- ✨ Constitutional Classifiers: Defending against Universal Jailbreaks
- ✨ Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
- ✨ Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
- Scheming AIs: Will AIs fake alignment during training in order to get power?
Tier 4+
- Towards a Law of Iterated Expectations for Heuristic Estimators
- Measuring Progress on Scalable Oversight for Large Language Models
- Scalable agent alignment via reward modelling
- AI Deception: A Survey of Examples, Risks, and Potential Solutions
- Benchmarks for Detecting Measurement Tampering
- Chess as a Testing Grounds for the Oracle Approach to AI Safety
- Close the Gates to an Inhuman Future: How and why we should choose to not develop superhuman general-purpose artificial intelligence
- Model evaluation for extreme risks
- Responsible Reporting for Frontier AI Development
- Safety Cases: How to Justify the Safety of Advanced AI Systems
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure
- Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
- Tools for Verifying Neural Models' Training Data
- Towards a Cautious Scientist AI with Convergent Safety Bounds
- Alignment of Language Agents
- Eliciting Latent Knowledge
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- Red Teaming Language Models with Language Models
- Risks from Learned Optimization in Advanced Machine Learning Systems
- Unsolved Problems in ML Safety
Tier 2
- ✨ AI 2027
- ✨ Situational Awareness (Aschenbrenner)
Tier 3
- Explosive growth from AI automation: A review of the arguments
- Language Models Can Reduce Asymmetry in Information Markets
Tier 4+
- Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero
- Foundation Models and Fair Use
- GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models
- Levels of AGI: Operationalizing Progress on the Path to AGI
- Opportunities and Risks of LLMs for Scalable Deliberation with Polis
- On the Opportunities and Risks of Foundation Models
Tier 2