AIRE
a curriculum for the AI research engineer
The landscape of artificial intelligence has shifted dramatically from the era of isolated academic curiosity to an industrial arms race driven by "frontier labs"—organizations like OpenAI, Google DeepMind, and Anthropic that operate at the cutting edge of capability. For a university student aspiring to join these ranks, the standard "data science" curriculum is no longer sufficient. The modern Research Engineer (RE) at a frontier lab is a hybrid archetype: part mathematician, part systems engineer, and part experimental scientist. They must possess the theoretical intuition to diagnose why a loss curve is diverging and the systems-level expertise to implement a fix across a cluster of thousands of GPUs.
This report outlines a rigorous, exhaustive, and self-directed curriculum designed to bridge the gap between a student with basic software knowledge and a candidate capable of contributing to the development of next-generation foundation models. This is not a path of least resistance; it is a path of maximum depth. The curriculum prioritizes "first-principles" understanding over high-level API usage. While it is possible to train a classifier in three lines of Python, doing so without understanding the underlying calculus, optimization dynamics, and memory hierarchy renders one incapable of pushing the boundary of what is possible. As identified in industry analyses, the most successful research engineers often hold advanced degrees or equivalent self-study depth in computer science, mathematics, and physics.
The curriculum is structured into distinct phases, each building upon the previous. It begins with the bedrock of rigorous mathematics—linear algebra, calculus, and probability—treated not as prerequisites to be rushed through, but as the primary language of the field. It progresses through the fundamentals of computer science, essential for the non-CS major to write efficient production code. It then traverses classical machine learning theory, deep learning architectures, and finally, the specialized systems engineering and frontier research topics (LLMs, Generative AI) that define the current era.
Curriculum Overview
| Phase | Primary Focus | Key Competency Goal |
|---|---|---|
| 1. The Mathematical Substrate | Rigorous Proofs & Geometry | Ability to derive gradients and visualize high-dimensional spaces. |
| 2. CS Fundamentals & Systems | Algorithms & Systems | Optimization of compute and memory; writing efficient kernels. |
| 3. Classical ML Theory | Statistical Learning | Understanding bias, variance, and generalization bounds. |
| 4. Deep Learning | Architectures (Transformers) | Intuitive grasp of modern layers, attention, and normalization. |
| 5. Frontier Systems | Distributed Training (CUDA) | Training models beyond single-GPU limits; scaling laws. |
| 6. Frontier Research Topics | LLMs, Generative AI | Understanding scaling laws, alignment (RLHF, DPO), and diffusion-based generation. |
| 7. Research & Portfolio | Reproduction & Innovation | Implementing papers from scratch; contributing to open science. |
Phase 1: The Mathematical Substrate
Objective: The barrier to entry for reading frontier research papers—such as those detailing diffusion probabilistic models or geometric deep learning—is almost universally mathematical maturity. A superficial familiarity with matrix multiplication is insufficient. To debug a neural network that fails to learn, one must understand the geometry of the loss landscape (calculus), the transformation of data manifolds (linear algebra), and the uncertainty of the underlying process (probability).
Key Checkpoint: Linear Algebra Done Right (Axler)
2.1 Linear Algebra: The Geometry of Data
Linear algebra is the "assembly language" of deep learning. Neural networks are fundamentally compositions of linear transformations interspersed with non-linear activation functions. A robust understanding of vector spaces allows a researcher to conceptualize how data moves through these transformations and how high-dimensional features are represented.
Coordinate-Free vs. Matrix-Centric
In selecting a text, a critical distinction exists between the "matrix-centric" approach (common in engineering) and the "coordinate-free" approach (common in pure mathematics). For a frontier researcher, the coordinate-free approach is increasingly vital for developing intuition about high-dimensional latent spaces where specific bases are arbitrary.
Resources
- primary Linear Algebra Done Right by Sheldon Axler
Axler's text is renowned for its decision to banish determinants to the end of the book. This forces the student to understand linear maps, eigenvalues, and inner product spaces based on their geometric properties rather than algebraic formulas. This "operator-centric" view aligns perfectly with modern deep learning, where layers are viewed as operators acting on function spaces. It builds the mental models necessary to understand concepts like Low-Rank Adaptation (LoRA) and the spectral properties of weight matrices, which are crucial for understanding model stability and compression.
- secondary Introduction to Linear Algebra by Gilbert Strang
While Axler provides rigor, Strang provides the connection to computation. His focus on the "Four Fundamental Subspaces" provides a concrete mental image of how matrices manipulate data.
Curriculum & Key Concepts
- Vector Spaces and Subspaces
Understanding linear independence and dimension is critical for dimensionality reduction techniques.
Application: The "Manifold Hypothesis" in AI suggests that real-world data lies on or near low-dimensional manifolds within high-dimensional ambient spaces.
- Linear Maps and the Rank-Nullity Theorem
Application: Deep networks often map inputs to lower-dimensional embeddings. The null space of such a map contains the input directions it discards, which is crucial for understanding autoencoders.
- Eigenvalues, Eigenvectors, and Diagonalization
Application: Eigenvalues govern the stability of recurrent neural networks (RNNs). Roughly, if the spectral radius of the recurrent weight matrix is greater than 1, gradients can explode; if it is less than 1, they tend to vanish.
- Inner Product Spaces and Orthogonality
Application: Attention mechanisms in Transformers rely on the dot product (inner product) to measure similarity between Query and Key vectors.
- The Spectral Theorem and Singular Value Decomposition (SVD)
Deep Dive: SVD is the cornerstone of many compression techniques. It decomposes a matrix into interpretable components: orthonormal directions weighted by singular values. Understanding SVD and low-rank structure is essential for implementing techniques like LoRA (Low-Rank Adaptation) for fine-tuning Large Language Models efficiently.
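The low-rank idea behind SVD-based compression can be sketched in a few lines of NumPy. This is a minimal illustration, not a LoRA implementation: the matrix `weights`, its shape, and the noise level are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 64x32 "weight matrix": exactly rank 4, plus a little noise.
left = rng.standard_normal((64, 4))
right = rng.standard_normal((4, 32))
weights = left @ right + 0.01 * rng.standard_normal((64, 32))

# Thin SVD: weights = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(weights, full_matrices=False)

def truncate(rank):
    """Keep the top-`rank` singular triplets (best rank-`rank` fit in Frobenius norm)."""
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

def relative_error(rank):
    return np.linalg.norm(weights - truncate(rank)) / np.linalg.norm(weights)

rank4_error = relative_error(4)
full_params = weights.size          # 64 * 32 entries
low_rank_params = 4 * (64 + 32)     # two thin factors of rank 4

print("top singular values:", S[:6].round(2))
print("relative error at rank 4:", rank4_error)
```

The singular values drop sharply after the fourth one, so four components reconstruct the matrix almost exactly with far fewer stored numbers than the full matrix.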
2.2 Multivariate Calculus: The Engine of Optimization
Deep learning training is fundamentally an optimization process on a high-dimensional, non-convex surface. To navigate this surface, one must master the calculus of many variables. The standard undergraduate "Calc 3" is often insufficient because it lacks rigor regarding differentiability in higher dimensions.
Resources
- primary Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach by John H. Hubbard, Barbara Burke Hubbard
This book is legendary among mathematics enthusiasts for treating the derivative not just as a number or a vector, but as a linear transformation (the Jacobian matrix) that best approximates a function near a point. This viewpoint matches how automatic differentiation engines (like PyTorch's autograd) operate: reverse mode computes vector-Jacobian products instead of materializing full Jacobians. It integrates linear algebra and calculus seamlessly, which is how they appear in machine learning. It provides proofs that allow a researcher to understand *when* optimization might fail (e.g., non-differentiable points like ReLU at 0, saddle points).
- alternative Calculus on Manifolds by Michael Spivak
A concise, dense classic. While elegant, Hubbard & Hubbard is generally preferred for self-study due to its more explanatory nature and unified approach.
Curriculum & Key Concepts
- The Total Derivative & The Jacobian Matrix
Application: In backpropagation, the "gradient" is passed backward. For vector-valued functions (like a layer in a neural net), this gradient is technically a Jacobian matrix. Understanding the shape and properties of the Jacobian is vital for debugging tensor mismatches.
- Taylor's Theorem in Multivariable Calculus
Application: Second-order optimization methods (like Newton's method) and trust-region methods rely on the quadratic approximation of the loss function, provided by the Hessian matrix (second derivatives).
- The Inverse and Implicit Function Theorems
Research Insight: These theorems underpin modern research in "Implicit Layers" (Deep Equilibrium Models), where the output of a layer is defined as the fixed point of an equation rather than an explicit computation.
- Lagrange Multipliers and Constrained Optimization
Application: Essential for understanding Support Vector Machines (SVMs) and regularization constraints (e.g., ensuring weights do not grow too large).
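To make the Jacobian concrete, the sketch below compares an analytic Jacobian of a single `tanh` layer against central finite differences, then forms the vector-Jacobian product that reverse-mode differentiation would compute. The layer, its shapes, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # hypothetical layer mapping R^4 -> R^3
x = rng.standard_normal(4)

def layer(x):
    return np.tanh(W @ x)

def analytic_jacobian(x):
    # Chain rule: d tanh(z)/dz = 1 - tanh(z)^2, scaling each row of W.
    y = layer(x)
    return (1.0 - y**2)[:, None] * W            # shape (3, 4)

def numerical_jacobian(f, x, eps=1e-6):
    # Central differences, perturbing one input coordinate at a time.
    columns = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        columns.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(columns, axis=1)            # shape (3, 4)

J = analytic_jacobian(x)
J_num = numerical_jacobian(layer, x)

# Reverse mode never needs J itself: given an upstream gradient v (shape of the
# output), it computes v^T J, which has the shape of the input.
upstream = rng.standard_normal(3)
y = layer(x)
vjp = W.T @ (upstream * (1.0 - y**2))           # shape (4,)

print("max |J - J_num|:", np.abs(J - J_num).max())
```

The shapes are the debugging aid: the Jacobian is (outputs x inputs), and the vector-Jacobian product always has the shape of the layer's input.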
2.3 Probability and Statistics: The Language of Uncertainty
Machine learning is essentially statistical inference at scale. A neural network is a probabilistic model parameterized by weights. To work at the frontier, one must transition from a "deterministic" view of code to a "probabilistic" view of functions.
Resources
- primary Introduction to Probability by Joseph Blitzstein, Jessica Hwang
Based on the famous Harvard Stat 110 course. This book is unrivaled in building *intuition*. It emphasizes "story proofs"—understanding *why* a formula works through narrative logic rather than algebraic manipulation.
- reference All of Statistics: A Concise Course in Statistical Inference by Larry Wasserman
This book covers a massive amount of ground—from basic probability to VC dimension and bootstrapping—very quickly. It is an excellent bridge to the "Elements of Statistical Learning."
Curriculum & Key Concepts
- Probability Spaces and Conditional Probability
Bayes' Theorem is the foundation of generative modeling and inference.
- Random Variables and Expectations
Application: The "Linearity of Expectation" is used constantly to derive gradients for loss functions.
- Distributions (Discrete and Continuous)
Deep Dive: Understanding conjugacy (e.g., Beta-Binomial) is useful for Bayesian neural networks.
- Limit Theorems (LLN and CLT)
Application: A pre-activation in a wide layer is a sum of many roughly independent terms, so by the Central Limit Theorem it is approximately Gaussian, with a variance that grows with the number of inputs. Initialization schemes (like Xavier/Glorot initialization) scale the weights so that the variance of activations remains stable as data propagates through deep networks, which helps prevent vanishing/exploding gradients.
- Information Theory (Entropy, KL Divergence)
Application: Minimizing "Cross-Entropy Loss," the standard for training classifiers and LLMs, is mathematically equivalent to minimizing the KL Divergence between the true distribution and the predicted distribution, because the two quantities differ only by the entropy of the true distribution, which does not depend on the model. Understanding this link allows researchers to design custom loss functions for novel tasks.
- Markov Chains
Research Insight: Markov chains are the mathematical foundation of Diffusion Models, which generate images by reversing a Markovian noise process.
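The link between cross-entropy and KL divergence noted above can be checked numerically. The identity is H(p, q) = H(p) + KL(p || q); the three distributions below are invented purely for illustration.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical "true" label distribution and two candidate model predictions.
true_dist = [0.7, 0.2, 0.1]
model_a = [0.5, 0.3, 0.2]
model_b = [0.6, 0.25, 0.15]

# Cross-entropy minus KL is the same constant, H(true_dist), for every model.
gap_a = cross_entropy(true_dist, model_a) - kl_divergence(true_dist, model_a)
gap_b = cross_entropy(true_dist, model_b) - kl_divergence(true_dist, model_b)

print("H(p):", entropy(true_dist))
print("gap for model_a:", gap_a)
print("gap for model_b:", gap_b)
```

Because the gap is a constant that does not depend on the model, ranking models by cross-entropy and ranking them by KL divergence give the same order.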
Phase 2: CS Fundamentals & Systems
Objective: A Research Engineer at a lab like Anthropic is not just a mathematician; they are a software engineer building systems that must run for months on thousands of GPUs. The "self-taught" path often neglects the rigorous CS theory that enables efficient code. Since the student is assumed to have only basic programming knowledge, this phase focuses on elevating that to a professional systems-level understanding.
Key Checkpoint: The Algorithm Design Manual (Skiena)
3.1 Algorithms and Data Structures
Efficient data loading, tokenization, and graph traversal require a solid grasp of algorithmic complexity.
Resources
- primary The Algorithm Design Manual by Steven Skiena
Unlike the standard *Introduction to Algorithms* (CLRS), which is encyclopedic and theoretical, Skiena's book focuses on the *design* process and practical "war stories." It teaches you how to recognize a problem type and select the right tool, which is critical for research interviews and actual engineering work.
Curriculum & Key Concepts
- Big O Notation and Complexity Analysis
Distinguishing between O(n), O(n log n), and O(n^2) is critical when dealing with sequence lengths in Transformers (where attention is quadratic).
- Hashing and Hash Tables
Essential for efficient tokenization and looking up embeddings.
- Trees and Graphs
Application: Computational graphs in frameworks like PyTorch and TensorFlow are Directed Acyclic Graphs (DAGs). Understanding topological sort is necessary to understand how autograd engines execute operations.
- Dynamic Programming
Application: The basis for the Viterbi algorithm, and useful background for search procedures like Beam Search (a heuristic used in decoding LLM outputs).
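As a small illustration of why topological sort matters for computational graphs, here is a sketch of Kahn's algorithm applied to a hypothetical graph for `loss = (x * w + b) ** 2`. The node names and the graph encoding (node mapped to its consumers) are assumptions made for the example, not any framework's internal format.

```python
from collections import deque

def topological_order(graph):
    """Kahn's algorithm. `graph` maps each node to the nodes that consume its output."""
    indegree = {node: 0 for node in graph}
    for consumers in graph.values():
        for consumer in consumers:
            indegree[consumer] = indegree.get(consumer, 0) + 1

    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for consumer in graph.get(node, ()):
            indegree[consumer] -= 1
            if indegree[consumer] == 0:   # all inputs computed, safe to run
                ready.append(consumer)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")
    return order

# Hypothetical computation graph for loss = (x * w + b) ** 2.
graph = {
    "x": ["mul"], "w": ["mul"], "b": ["add"],
    "mul": ["add"], "add": ["square"], "square": [],
}

forward_order = topological_order(graph)
backward_order = forward_order[::-1]   # gradients flow in the reverse order

print(forward_order)
print(backward_order)
```

Every operation appears after all of its inputs in `forward_order`, and reversing that order gives a valid schedule for the backward pass.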
3.2 Systems Programming and Architecture
The bottleneck in modern AI is often not compute, but memory bandwidth. Understanding how data moves from disk to RAM to GPU VRAM is essential.
Resources
- primary Computer Systems: A Programmer's Perspective by Randal E. Bryant, David R. O'Hallaron
This is the standard text for understanding how software interacts with hardware.
Curriculum & Key Concepts
- Memory Hierarchy
Registers → L1/L2/L3 Cache → RAM → Disk.
Research Insight: "FlashAttention," a breakthrough in Transformer efficiency, works entirely by optimizing memory access patterns to keep data in the fast GPU SRAM (cache) rather than slow HBM (VRAM).
- Pointers and Memory Management (C++)
While Python is the interface, PyTorch is written in C++. To read and modify the source code of operations, C++ literacy is mandatory.
- Concurrency and Parallelism
Threads, processes, and locks. Essential for understanding data loaders that prepare batches in parallel with GPU computation.
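The idea of preparing batches in parallel with computation can be sketched with a background thread and a bounded queue from the standard library. This is a toy under stated assumptions: `make_batch` is a hypothetical stand-in for real loading work, and production loaders typically use multiple worker processes instead of a single thread.

```python
import queue
import threading

def prefetching_batches(make_batch, num_batches, buffer_size=2):
    """Yield batches while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)   # bounded: producer blocks when full
    sentinel = object()                         # marks the end of the stream

    def producer():
        for index in range(num_batches):
            buffer.put(make_batch(index))
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()                     # blocks until a batch is ready
        if item is sentinel:
            return
        yield item

def make_batch(index):
    # Hypothetical stand-in for disk reads, decoding, and tokenization.
    return [index * 4 + offset for offset in range(4)]

batches = list(prefetching_batches(make_batch, num_batches=3))
print(batches)
```

The bounded queue is the key design choice: it lets the producer run ahead of the consumer by a fixed number of batches without filling memory. In CPython, threads overlap well when the loading work is I/O-bound or releases the global interpreter lock; CPU-heavy preprocessing usually calls for processes.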
Phase 3: Classical ML Theory
Objective: Before training billion-parameter models, one must master the fundamentals of learning from data. "Deep Learning" is a subset of Machine Learning, and many "new" ideas are adaptations of classical concepts.
Key Checkpoint: Pattern Recognition & ML (Bishop)
4.1 Theoretical Frameworks
Resources
- primary Pattern Recognition and Machine Learning by Christopher Bishop
This book is the gold standard for the **Bayesian** perspective. It explains regularization not just as a heuristic, but as a prior belief on the model parameters.
- alternative The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman
This text is more "frequentist" and statistical, excellent for understanding the bias-variance tradeoff and decision trees.
- reference Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz, Shai Ben-David
This book is mathematically dense and focuses on **PAC Learning** (Probably Approximately Correct). It answers the fundamental question: "Under what conditions is learning even possible?"
Curriculum & Key Concepts
- The Bias-Variance Tradeoff
The fundamental tension in all modeling.
Research Insight: Deep learning often operates in the "double descent" regime, where massive over-parameterization can reduce test error, challenging classical bias-variance intuition.
- Linear Models (Regression & Classification)
Maximum Likelihood Estimation (MLE) vs. Maximum A Posteriori (MAP).
- Kernel Methods and SVMs
The "Kernel Trick" allows linear models to learn non-linear boundaries by implicitly mapping data to infinite-dimensional spaces.
- Ensemble Methods
Research Insight: For tabular data, Gradient Boosted Trees often still outperform Deep Learning. Understanding why (handling heterogeneous features, decision boundaries) is a mark of a mature researcher.
Deep Dive: Random Forests and Gradient Boosting (XGBoost).
- Unsupervised Learning
PCA (Principal Component Analysis) and K-Means. Connecting PCA to the Singular Value Decomposition (SVD).
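The connection between PCA and SVD can be verified directly: the eigenvalues of the sample covariance matrix equal the squared singular values of the centered data divided by n - 1, and the principal directions are the right singular vectors. The synthetic data below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples in 3-D, stretched mostly along the first axis.
data = rng.standard_normal((200, 3)) * np.array([5.0, 1.0, 0.2])
centered = data - data.mean(axis=0)
n = len(centered)

# Route 1: eigendecomposition of the sample covariance matrix.
covariance = centered.T @ centered / (n - 1)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # ascending order
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

# Route 2: SVD of the centered data matrix itself.
_, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
variances_from_svd = singular_values**2 / (n - 1)

# Project onto the top two principal components.
projected = centered @ Vt[:2].T

print("explained variances:", eigenvalues.round(3))
```

Both routes give the same variances and the same directions (up to sign). The SVD route avoids forming the covariance matrix, which is often preferred numerically.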
4.2 Implementation Projects (From Scratch)
To verify understanding, the student must implement algorithms without using high-level libraries like Scikit-Learn.
Implementation Projects
- Linear Regression from Scratch
Implement Linear Regression using (a) the closed-form Normal Equation and (b) Stochastic Gradient Descent (SGD) in pure NumPy. Compare convergence speed.
- Gaussian Mixture Model (GMM)
Implement a Gaussian Mixture Model (GMM) using the Expectation-Maximization (EM) algorithm. This builds intuition for latent variable models.
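As a starting point for the first project, the sketch below fits the same synthetic regression problem with the normal equation and with mini-batch SGD in NumPy. The data, learning rate, batch size, and epoch count are all illustrative assumptions rather than recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic problem: y = X @ true_w + noise.
n, d = 500, 3
X = rng.standard_normal((n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(n)

# (a) Closed form: solve (X^T X) w = X^T y instead of inverting X^T X.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# (b) Mini-batch SGD on the mean squared error.
w_sgd = np.zeros(d)
learning_rate, batch_size, epochs = 0.05, 32, 50
for _ in range(epochs):
    order = rng.permutation(n)                   # reshuffle every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w_sgd - y[idx]
        grad = 2.0 * X[idx].T @ residual / len(idx)
        w_sgd -= learning_rate * grad

print("closed form:", w_closed.round(3))
print("SGD:        ", w_sgd.round(3))
```

The closed form gives the exact least-squares solution in one step, while SGD approaches it iteratively and hovers around it with noise proportional to the learning rate. Logging the loss per step for both is a natural way to carry out the convergence comparison the project asks for.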
Phase 4: Deep Learning
Objective: This phase marks the transition to modern AI. The goal is to demystify neural networks—stripping away the "magic" to reveal the linear algebra and calculus underneath.
Key Checkpoint: Understanding Deep Learning (Prince)
5.1 Foundations of Neural Networks
Resources
- primary Understanding Deep Learning by Simon Prince
While Goodfellow's *Deep Learning* (2016) is a classic, it predates the Transformer revolution. Prince's book is modern, visually intuitive, and covers Transformers, Diffusion, and Generative AI. It is the superior choice for a student starting in 2025.
-
For a gentle introduction to backpropagation.
Curriculum & Key Concepts
- Multilayer Perceptrons (MLPs)
The Universal Approximation Theorem.
- Backpropagation
Research Insight: Exercise: Derive the gradients for a 2-layer network by hand on paper. Then implement "MicroGrad" following Andrej Karpathy's tutorial to build a tiny autograd engine.
Deep Dive: The Chain Rule applied to computation graphs.
- Optimization
Research Insight: SGD, Momentum, RMSProp, and Adam. Understanding AdamW (Adam with decoupled weight decay) is critical, as it is the standard optimizer for training LLMs.
- Regularization & Normalization
Deep Dive: Batch Normalization vs. Layer Normalization: Transformers use LayerNorm. Why? (Independence from batch size, suitability for sequence data).
Dropout: Interpreted as training an ensemble of subnetworks.
- Convolutional Neural Networks (CNNs)
While less central to LLMs, concepts like translation invariance, pooling, and strides are foundational.
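To show what "the chain rule applied to computation graphs" looks like in code, here is a minimal scalar reverse-mode autograd sketch in the spirit of micrograd. It is not Karpathy's code: the `Value` class below supports only addition, multiplication, and `tanh`, and its structure is an assumption made for illustration.

```python
import math

class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # pushes self.grad into the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def backward():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

# One "neuron": y = tanh(x * w + b)
x, w, b = Value(0.5), Value(-1.5), Value(0.25)
y = (x * w + b).tanh()
y.backward()
print(x.grad, w.grad, b.grad)
```

Gradients are accumulated with `+=` so that a value used more than once (for example `a * a`) receives the sum of contributions from every path, which is the multivariable chain rule.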
5.2 Sequence Modeling and The Transformer
The Transformer is the architecture of the current AI boom. It must be understood at the tensor level.
Curriculum & Key Concepts
- Tokenization
Byte-Pair Encoding (BPE). How text is converted into integers.
- Embeddings
Converting integers to dense vectors.
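- Positional Encodings
Research Insight: Rotary Positional Embeddings (RoPE). This is the modern standard (used in LLaMA, PaLM), which encodes position by rotating pairs of query/key dimensions, equivalently viewed as rotations in the complex plane.
Deep Dive: Since self-attention on its own is permutation-equivariant (reordering the input tokens simply reorders the outputs), order must be injected.
- Self-Attention Mechanism
Application: Intuition: A differentiable key-value store. The dot product measures similarity (relevance) between the query and the key.
Deep Dive: Formula: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of the key vectors.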
- Multi-Head Attention
Allowing the model to attend to information from different representation subspaces (e.g., one head tracks grammar, another tracks factual consistency).
- The Feed-Forward Network (FFN)
Application: Often acts as a "key-value memory" storing facts, while attention moves information between tokens.
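Understanding attention "at the tensor level" is easier with a concrete single-head implementation. The NumPy sketch below computes scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, with an optional causal mask. The shapes and random inputs are assumptions for illustration, and batching, multiple heads, and learned projections are omitted.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Single-head scaled dot-product attention for one sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) similarities
    if causal:
        # Position i may not look at positions j > i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)                        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 8                          # hypothetical sizes
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_v))

out, weights = attention(Q, K, V)
out_causal, weights_causal = attention(Q, K, V, causal=True)
print(weights_causal.round(2))
```

Each output row is a weighted average of the value vectors, which is the "differentiable key-value store" intuition. With the causal mask, the first token can only attend to itself, so its output equals its own value vector.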
Phase 5: Frontier Systems
Objective: This phase differentiates the data scientist from the **Research Engineer**. Research at frontier labs involves training models that do not fit on a single GPU. It requires engineering at the limits of hardware.
Key Checkpoint: CMU 10-714 (Needle)
6.1 Deep Learning Systems and Compilers
Resources
- CMU 10-714: Deep Learning Systems
This is arguably the most valuable course for an aspiring RE. You build a deep learning library (called "Needle") from scratch.
Curriculum & Key Concepts
- Automatic Differentiation (Reverse Mode)
Implement automatic differentiation (reverse mode).
- GPU Kernels for Matrix Multiplication
Write efficient GPU kernels for matrix multiplication.
- Optimizers and Data Loaders
Implement optimizers and data loaders.
- Transformer from Scratch
Build a Transformer from your own library.
6.2 GPU Programming (CUDA)
To make training faster, REs often write custom "kernels" (functions that run on the GPU).
Resources
-
Practical, modern GPU optimization. Community-driven resource with lectures, reading groups, and an extensive collection of CUDA/GPU programming materials.
- secondary Programming Massively Parallel Processors by David B. Kirk, Wen-mei W. Hwu
Curriculum & Key Concepts
- GPU Architecture
Threads, Warps, Blocks, Streaming Multiprocessors (SMs).
- Memory Model
Global Memory (slow) vs. Shared Memory (fast).
- Tiling
The fundamental technique for optimizing matrix multiplication by loading data into Shared Memory in chunks.
- Triton
Research Insight: A language from OpenAI that simplifies writing high-performance GPU kernels. Learning Triton is a high-leverage skill in 2025.
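The loop structure of tiling can be illustrated on the CPU before writing any GPU code. The NumPy sketch below multiplies matrices one block at a time. It is only an analogy for the access pattern: on a GPU each tile would be staged in shared memory by a thread block, and the tile size of 4 is an arbitrary choice for the example.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Blocked matrix multiply: C = A @ B, computed one output tile at a time."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.result_type(A, B))

    for i in range(0, m, tile):              # output tile row
        for j in range(0, n, tile):          # output tile column
            rows, cols = min(tile, m - i), min(tile, n - j)
            acc = np.zeros((rows, cols), dtype=C.dtype)
            for p in range(0, k, tile):      # march along the shared dimension
                a_tile = A[i:i + tile, p:p + tile]   # GPU analogue: shared memory
                b_tile = B[p:p + tile, j:j + tile]
                acc += a_tile @ b_tile       # reuse both tiles for many products
            C[i:i + tile, j:j + tile] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 7))             # sizes deliberately not multiples of 4
B = rng.standard_normal((7, 9))
C = tiled_matmul(A, B)
print(np.abs(C - A @ B).max())
```

The payoff of tiling is data reuse: each loaded tile participates in many multiply-adds, so slow global memory is read far less often than in a naive element-by-element kernel.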
6.3 Distributed Training
Curriculum & Key Concepts
- Data Parallelism (DDP)
Replicating the model across GPUs and averaging gradients.
- Tensor Parallelism (TP)
Splitting a single large matrix multiplication across multiple GPUs (intra-layer parallelism).
- Pipeline Parallelism (PP)
Placing different layers on different GPUs (inter-layer parallelism).
- Sharding (ZeRO)
Partitioning optimizer states, gradients, and parameters to save memory.
- Mixed Precision Training
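Using FP16 (half-precision) or BF16 (Brain Floating Point) to increase throughput and reduce memory usage. FP16 has a narrow exponent range, so it is typically paired with loss scaling to keep small gradients from underflowing; BF16 keeps the FP32 exponent range and usually does not need it.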
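Why loss scaling helps in FP16 can be seen with a single number. In this sketch the gradient value and the scale factor of 2^16 are hypothetical. Real mixed-precision training applies the scale to the loss before the backward pass and often adjusts it dynamically.

```python
import numpy as np

true_grad = 1e-8                 # hypothetical small-but-meaningful gradient
loss_scale = 2.0 ** 16           # hypothetical static scale factor

# Cast directly to FP16: the value is too small to represent and becomes zero.
naive_fp16 = np.float16(true_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# which lifts this value into FP16's representable range.
scaled_fp16 = np.float16(true_grad * loss_scale)

# Unscale in FP32 before the optimizer update.
recovered = np.float32(scaled_fp16) / np.float32(loss_scale)

fp16_max = float(np.finfo(np.float16).max)

print("naive FP16 gradient:", naive_fp16)
print("recovered gradient: ", recovered)
print("largest FP16 value: ", fp16_max)
```

The same limited range cuts the other way: if the scale is too large, gradients overflow past the largest FP16 value, which is why dynamic schemes lower the scale when they detect infinities or NaNs.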
Phase 6: Frontier Research Topics
Objective: With the foundations laid, the curriculum turns to the specific technologies driving the current AI wave.
Key Checkpoint: Stanford CS324 (LLMs)
7.1 Large Language Models (LLMs)
Curriculum & Key Concepts
- Scaling Laws
Research Insight: Read Kaplan et al. (2020) and Hoffmann et al. (Chinchilla, 2022). Understand the power-law relationship between compute, dataset size, and performance. This is the economic engine of modern AI.
- Alignment & RLHF
Deep Dive: RLHF (Reinforcement Learning from Human Feedback): How to steer models to be helpful and harmless.
PPO (Proximal Policy Optimization): The standard RL algorithm for fine-tuning.
DPO (Direct Preference Optimization): A more recent, stable method that optimizes the language model directly on preference data without a separate reward model.
- Efficient Fine-Tuning (PEFT)
LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA).
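To make "optimizing directly on preference data" concrete, here is a sketch of the DPO loss for a single preference pair. It assumes the inputs are summed token log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative choices.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # Implicit rewards: how much more the policy likes each response than the reference.
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities (more negative means less likely).
at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy == reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)        # policy favors chosen more
regressed = dpo_loss(-12.0, -10.0, -10.0, -12.0)      # policy favors rejected more

print(at_reference, improved, regressed)
```

When the policy matches the reference the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response relative to the rejected one, measured against the reference, and no separate reward model appears anywhere in the computation.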
7.2 Reinforcement Learning (RL)
RL is crucial not just for robotics, but for the "Agentic" future of LLMs (e.g., reasoning chains, tool use).
Resources
- primary Reinforcement Learning: An Introduction by Richard S. Sutton, Andrew G. Barto
The foundational text of the field.
-
While the original repo is older, forks and modern implementations (CleanRL) are the best way to learn PPO, DQN, and SAC.
7.3 Generative Models (Diffusion)
Diffusion models (like Stable Diffusion, Sora) have largely displaced GANs as the dominant approach to image and video generation.
Curriculum & Key Concepts
- DDPM (Denoising Diffusion Probabilistic Models)
Learning to reverse a gradual noise-addition process.
- Score-Based Generative Modeling
Viewing generation as solving a Stochastic Differential Equation (SDE).
- Flow Matching
The modern generalization of diffusion used in newer models.
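The "gradual noise-addition process" in DDPM has a closed form that lets any timestep be sampled directly: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. The sketch below assumes a linear beta schedule from 1e-4 to 0.02 over 1000 steps, a common choice, and uses random arrays in place of images.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)        # fraction of signal variance left at step t

def q_sample(x0, t, noise):
    """Sample x_t from q(x_t | x_0) in one shot, without looping over steps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = rng.standard_normal((4, 8))            # hypothetical stand-in for image data
noise = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 10, noise)           # still almost entirely signal
x_late = q_sample(x0, T - 1, noise)         # almost entirely noise

print("alpha_bar at t=10: ", alpha_bars[10])
print("alpha_bar at t=999:", alpha_bars[-1])
```

Training then amounts to showing the network `x_t` and `t` and asking it to predict `noise`. Generation runs the learned denoiser backward from pure noise.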
Phase 7: Research & Portfolio
Objective: To work at a frontier lab, you must demonstrate the ability to do the work. This is proven through a portfolio of reproduced papers and novel experiments.
Key Checkpoint: Reproduction Project
8.1 The Art of Reading Papers
You cannot read every paper. You must filter and read strategically.
Curriculum & Key Concepts
- The 3-Pass Approach
Pass 1 (Scan): Title, Abstract, Figures, Conclusion. Decide if it's relevant.
Pass 2 (Grasp): Read intro and methods. Ignore proofs. Grasp the core idea.
Pass 3 (Deep Dive): Re-derive the math. Implement the code.
- Verification
Always ask, "What is the baseline?" and "Is the improvement statistically significant?"
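One simple way to probe whether an improvement is more than seed noise is a bootstrap confidence interval on the paired difference across random seeds. The sketch below uses invented accuracy numbers for five seeds. With so few runs the interval is only a rough guide, and the function name and parameters are assumptions.

```python
import random
import statistics

def bootstrap_mean_diff_ci(baseline, proposed, num_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean paired difference (proposed - baseline)."""
    diffs = [p - b for p, b in zip(proposed, baseline)]
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))   # resample with replacement
        for _ in range(num_resamples)
    )
    lower = means[int((1 - level) / 2 * num_resamples)]
    upper = means[int((1 + level) / 2 * num_resamples) - 1]
    return statistics.fmean(diffs), lower, upper

# Hypothetical test accuracies (%) from five seeds, paired by seed.
baseline = [71.2, 70.8, 71.5, 70.9, 71.1]
consistent = [71.9, 71.6, 72.3, 71.4, 72.0]   # better on every seed
noisy = [71.5, 70.5, 71.6, 70.6, 71.4]        # better on some, worse on others

mean_c, low_c, high_c = bootstrap_mean_diff_ci(baseline, consistent)
mean_n, low_n, high_n = bootstrap_mean_diff_ci(baseline, noisy)

print(f"consistent: {mean_c:.2f} [{low_c:.2f}, {high_c:.2f}]")
print(f"noisy:      {mean_n:.2f} [{low_n:.2f}, {high_n:.2f}]")
```

An interval that excludes zero suggests the gain is not explained by seed variance alone. An interval that straddles zero is a signal to run more seeds before believing a results table.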
8.2 Reproducibility Checklist
When reproducing a paper for your portfolio, adhere to rigorous standards:
Curriculum & Key Concepts
- Code Standards
Is the model architecture exactly as described?
- Hyperparameters
Are learning rates, batch sizes, and initialization seeds documented?
- Data Integrity
Is the train/test split clean? (Avoid data leakage).
- Compute Tracking
Report the GPU hours required.
8.3 Portfolio Projects
Build 2-3 significant projects. "Toy" projects (e.g., MNIST) are disregarded.
Implementation Projects
- LLM Pre-training Run
Train a 100M+ parameter model on a dataset like *TinyStories*. Implement the tokenizer, data loader, and training loop (with DDP) from scratch. Log metrics to Weights & Biases.
- Custom Kernel Implementation
Write a fused attention kernel in Triton or CUDA. Benchmark its speed against standard PyTorch.
- Paper Reproduction
Select a recent paper (e.g., from NeurIPS or ICLR). Re-implement it. Reproduce the main results table. Write a blog post explaining the implementation challenges.