1 Phase 1: The Mathematical Substrate

Dec 11, 2025 – Mar 15, 2026
1.1

The Operator-Theoretic View of Neural Networks

Not Started
Mathematical Proofs & Conceptual Synthesis Start: Jan 10, 2026 Deadline: Feb 7, 2026
Checkpoint: Mastery of Chapters 1–7 of Linear Algebra Done Right (Axler). You must be comfortable with the concept of a linear map independent of its matrix representation.

Context & Rationale

The curriculum emphasizes a "coordinate-free" approach to linear algebra. In modern research (e.g., LoRA, Geometric Deep Learning), we treat layers as operators acting on function spaces. This assignment tests your ability to think about properties like "rank" and "eigenvalues" as intrinsic geometric properties rather than artifacts of a specific basis.

Problems & Tasks

The Geometry of Vanishing Gradients (Spectral Radius)

Let $V$ be a finite-dimensional vector space over $\mathbb{F}$ ($\mathbb{R}$ or $\mathbb{C}$), and let $T \in \mathcal{L}(V)$ be a linear operator (representing a recurrent weight matrix).

  • Proof: Prove that if the spectral radius $\rho(T) < 1$, then $T^n v \to 0$ as $n \to \infty$ for all $v \in V$, regardless of the norm chosen.
  • Application: Consider a Recurrent Neural Network (RNN) with a linear activation function, so that $h_{t+1} = W h_t$. Using the result above, formally explain why the state vector $h_t$ decays to zero if all eigenvalues of $W$ lie inside the unit circle. (A short simulation sketch follows this list.)
  • Extension: Why does the Spectral Theorem (Axler Ch 7) imply that for symmetric (Hermitian) weight matrices, the operator norm equals the largest eigenvalue magnitude, $\|W\|_{\text{op}} = \max_i |\lambda_i|$? How does this simplify the analysis of gradient explosion?
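A minimal numeric illustration of the Application bullet, assuming a random weight matrix rescaled so that $\rho(W) = 0.9$; the dimension, rescaling target, and step count are arbitrary choices, not part of the assignment:

```python
# Hypothetical illustration: linear RNN state decay when the spectral radius is < 1.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d)) / np.sqrt(d)
W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # rescale so that rho(W) = 0.9 < 1

h = rng.normal(size=d)                      # initial hidden state
norms = []
for t in range(100):
    h = W @ h                               # linear recurrence h_{t+1} = W h_t
    norms.append(np.linalg.norm(h))

print(norms[0], norms[-1])                  # the norm shrinks geometrically, roughly like 0.9**t
```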
Singular Value Decomposition (SVD) and Information Compression

The "Manifold Hypothesis" suggests data lies on low-dimensional subspaces. SVD is the tool to find them.

  • Derivation: Starting from the Spectral Theorem for the positive operator $T^{*}T$, rigorously derive the Singular Value Decomposition for an arbitrary operator $T$.
  • Low-Rank Approximation: Prove the Eckart–Young–Mirsky theorem for the Frobenius norm: the best rank-$k$ approximation of a matrix $A$ is given by truncating its SVD to the top $k$ singular values.
Blog Post: "The Compression Instinct"
  • Implement SVD from scratch in NumPy (using np.linalg.eig on $A^{\top}A$ as a primitive, but assembling $U$, $\Sigma$, and $V$ yourself). A minimal sketch follows this list.
  • Take a high-dimensional weight matrix from a pre-trained open-source model (e.g., a layer from BERT-tiny or a small ResNet). Compute its singular value spectrum. Plot the cumulative energy of the singular values.
  • Write a post discussing "Intrinsic Dimensionality." If 90% of the variance is captured by the top 10% of singular values, what does this imply about the redundancy of neural networks? Relate this to the "Low-Rank Adaptation" (LoRA) technique.
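A minimal sketch of the from-scratch SVD and the cumulative-energy plot. It uses a random matrix as a stand-in for the pre-trained weights, uses np.linalg.eigh (the symmetric-friendly variant of the eig primitive named above), and assumes all singular values are strictly positive:

```python
# Minimal sketch (not a reference implementation): assemble the SVD of A from the
# eigendecomposition of A^T A, then plot the cumulative "energy" of the singular values.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
A = rng.normal(size=(128, 64))                # placeholder for a real weight matrix

# Eigendecomposition of the positive (semi-)definite operator A^T A.
eigvals, V = np.linalg.eigh(A.T @ A)          # symmetric eigensolver, ascending order
order = np.argsort(eigvals)[::-1]             # re-sort descending
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(np.clip(eigvals, 0, None))    # singular values
U = (A @ V) / sigma                           # u_i = A v_i / sigma_i (assumes sigma_i > 0)

# Sanity check: U @ diag(sigma) @ V^T should reconstruct A.
assert np.allclose(U @ np.diag(sigma) @ V.T, A, atol=1e-6)

energy = np.cumsum(sigma**2) / np.sum(sigma**2)
plt.plot(energy)
plt.xlabel("number of singular values kept")
plt.ylabel("cumulative energy")
plt.savefig("cumulative_energy.png")
```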
The Geometry of High Dimensions
  • Volume Concentration: Derive the formula for the volume of a $d$-dimensional hypersphere of radius $r$. Prove that as $d \to \infty$, the volume of the unit sphere concentrates at the equator (or that the ratio of the sphere's volume to the enclosing cube's volume goes to zero).
  • Implication: In a short essay, explain how this "curse of dimensionality" affects nearest-neighbor search in vector databases (a key component of RAG systems). Why does Euclidean distance lose meaning in 1000-dimensional spaces? (A small numeric demonstration follows this list.)
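A small numeric demonstration of both effects, under assumed settings (the unit ball inside the cube $[-1, 1]^d$, and 500 random Gaussian points for the distance-concentration check):

```python
# Hypothetical demo of two high-dimensional effects: (1) the unit ball occupies a vanishing
# fraction of the enclosing cube [-1, 1]^d, and (2) pairwise Euclidean distances between
# random points concentrate (their spread shrinks relative to their mean).
import math
import numpy as np

def ball_to_cube_ratio(d):
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)   # volume of the unit d-ball
    return ball / 2 ** d                                # cube [-1, 1]^d has volume 2^d

for d in (2, 10, 50, 100):
    print(d, ball_to_cube_ratio(d))                     # ratio collapses toward 0

rng = np.random.default_rng(0)
for d in (2, 100, 1000):
    x = rng.normal(size=(500, d))
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T        # squared pairwise distances
    dists = np.sqrt(np.clip(d2[np.triu_indices(500, k=1)], 0, None))
    print(d, dists.std() / dists.mean())                # relative spread shrinks as d grows
```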

Deliverables

  • A LaTeX-typeset problem set (PDF)
  • A companion blog post with visualizations
1.2

The Calculus of Optimization (Autodiff Foundations)

Not Started
Mathematical Derivation & Implementation Start: Feb 1, 2026 Deadline: Mar 15, 2026
Checkpoint: Completion of Hubbard & Hubbard (Chapters on Derivatives as Linear Maps/Jacobians).

Context & Rationale

Deep learning training is optimization on a high-dimensional, non-convex manifold. The "gradient" is not just a vector of numbers; it is a linear map (the Jacobian) that best approximates the function locally. You must understand this to debug tensor shape mismatches in PyTorch.

Problems & Tasks

The Jacobian of the Softmax

The Softmax function is central to the Transformer's attention mechanism.

  • Derivation: Derive the Jacobian matrix $J_{ij} = \partial s_i / \partial z_j$ where $s_i = \operatorname{softmax}(z)_i = e^{z_i} / \sum_k e^{z_k}$.
  • Expression: Show that $\partial s_i / \partial z_j = s_i(\delta_{ij} - s_j)$, or in vector notation, $J = \operatorname{diag}(s) - s s^{\top}$.
  • Properties: Prove that this Jacobian is symmetric. What does the $s s^{\top}$ term subtract from the diagonal? (Hint: relate this to probability mass conservation.) A numeric check against finite differences is sketched after this list.
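A minimal numeric check of the closed-form Jacobian $J = \operatorname{diag}(s) - s s^{\top}$ against central finite differences; the input dimension and step size are arbitrary:

```python
# Compare the closed-form softmax Jacobian with a finite-difference approximation.
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)
s = softmax(z)

J_closed = np.diag(s) - np.outer(s, s)

eps = 1e-5
J_numeric = np.zeros((5, 5))
for j in range(5):
    dz = np.zeros(5)
    dz[j] = eps
    J_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J_closed - J_numeric)))   # should be tiny (~1e-10)
print(J_closed.sum(axis=0))                   # columns sum to 0: probability mass is conserved
```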
Taylor Expansion in $N$ Dimensions
  • Hessian Analysis: For a loss function $L : \mathbb{R}^n \to \mathbb{R}$, write the second-order Taylor expansion around a point $\theta_0$: $L(\theta) \approx L(\theta_0) + \nabla L(\theta_0)^{\top}(\theta - \theta_0) + \tfrac{1}{2}(\theta - \theta_0)^{\top} H (\theta - \theta_0)$.
  • Saddle Points: Explain mathematically why a critical point where the Hessian has both positive and negative eigenvalues is a saddle point.
  • Implementation: Create a synthetic 2D surface with a saddle point (e.g., $f(x, y) = x^2 - y^2$). Implement Newton's Method optimization, $\theta \leftarrow \theta - H^{-1} \nabla f(\theta)$. Show via simulation how Newton's method fails or behaves chaotically near the saddle point compared to standard Gradient Descent. (A minimal simulation sketch follows this list.)
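A minimal simulation under the assumed surface $f(x, y) = x^2 - y^2$. For this quadratic the Hessian is constant, so Newton's method jumps to the stationary point in a single step and is therefore attracted to the saddle, while gradient descent escapes along the negative-curvature direction:

```python
# Newton's method vs. gradient descent near the saddle of f(x, y) = x^2 - y^2.
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

H = np.array([[2.0, 0.0], [0.0, -2.0]])       # constant Hessian, one negative eigenvalue

p_newton = np.array([0.5, 0.01])
p_gd = np.array([0.5, 0.01])
lr = 0.1

for step in range(20):
    p_newton = p_newton - np.linalg.solve(H, grad(p_newton))   # Newton update
    p_gd = p_gd - lr * grad(p_gd)                              # gradient descent update

print("Newton:", p_newton)   # lands exactly on (0, 0), the saddle point, and stays there
print("GD:    ", p_gd)       # x shrinks toward 0, |y| grows: GD escapes the saddle
```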

Deliverables

  • Python Library (minigrad_tensor.py)
  • Derivations

2 Phase 2: CS Fundamentals & Systems

Mar 16, 2026 – May 30, 2026
2.1

The Computational Graph Optimizer

Not Started
Algorithmic Implementation (C++ preferred, Python accepted) Start: Apr 1, 2026 Deadline: Apr 30, 2026
Checkpoint: The Algorithm Design Manual (Skiena), Graph Algorithms chapters.

Context & Rationale

Deep learning frameworks (PyTorch/TensorFlow) represent models as Directed Acyclic Graphs (DAGs). To run these efficiently, the framework must "schedule" the operations. This is a classic graph theory problem.

Scenario: You are building the scheduler for a new inference engine.

Problems & Tasks

Topological Sort Implementation

Implement Kahn's Algorithm to determine a valid execution order.
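A minimal sketch of Kahn's algorithm over a DAG stored as a successor-list dictionary; the toy diamond graph at the end is a hypothetical example, not part of the assignment spec:

```python
# Kahn's algorithm: repeatedly execute any node whose inputs are all available.
from collections import deque

def kahn_topological_sort(edges):
    """edges: dict mapping node -> list of successor nodes."""
    # Make sure every node appears as a key, then count in-degrees.
    graph = {u: list(vs) for u, vs in edges.items()}
    for vs in edges.values():
        for v in vs:
            graph.setdefault(v, [])
    indegree = {u: 0 for u in graph}
    for vs in graph.values():
        for v in vs:
            indegree[v] += 1

    ready = deque(u for u, d in indegree.items() if d == 0)   # nodes with no pending inputs
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in graph[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    if len(order) != len(graph):
        raise ValueError("graph has a cycle; no valid execution order exists")
    return order

# Toy example: a diamond-shaped compute graph.
print(kahn_topological_sort({"input": ["matmul", "bias"], "matmul": ["add"], "bias": ["add"], "add": []}))
```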

Memory Optimization
  • The "Live Interval" Problem: The "Live Interval" of a tensor starts when it is produced and ends when its last consumer finishes.
  • Goal: Find a topological ordering that minimizes the maximum total memory active at any instant.
  • Algorithm: This is an NP-hard problem. Implement a Greedy Heuristic: At each step, if multiple nodes are ready to execute, choose the one that frees the most memory (or increases live memory the least).
Analysis and Benchmarking
  • Testing: Generate random DAGs (N=100, N=1000). Compare the peak memory usage of your optimized schedule versus a random valid schedule.
  • Blog Component: Explain how topological sorting determines the order of CUDA kernel launches and how "operator fusion" (merging nodes) could further reduce memory overhead.

Deliverables

  • A CLI tool for graph scheduling
  • Blog Output: "Graph Theory in the VRAM."
2.2

The "Strided Memory" Challenge (Systems)

Not Started
Systems Programming (C/C++) Start: May 1, 2026 Deadline: May 30, 2026
Checkpoint: CS:APP (Bryant & O'Hallaron), chapters on Memory Hierarchy/Cache.

Context & Rationale

As noted in the curriculum, the bottleneck in AI is often memory bandwidth. "FlashAttention" works by optimizing memory access patterns to keep data in fast SRAM (cache) rather than slow HBM (VRAM). You must demonstrate you understand "locality of reference."

Problems & Tasks

Naive Implementation

Implement the standard triple-loop matrix multiplication algorithm ($C = AB$).
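For orientation only, here is the loop structure of the naive triple loop next to the tiled variant of the following task, written in Python so the access pattern is explicit; the assignment itself targets C/C++, and the block size of 32 is an arbitrary placeholder:

```python
# Loop-structure sketch: naive triple-loop matmul vs. a blocked/tiled version.
import numpy as np

def matmul_naive(A, Bmat):
    n, k = A.shape
    _, m = Bmat.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * Bmat[p, j]   # streams a full row/column for every output
    return C

def matmul_blocked(A, Bmat, B=32):
    n, k = A.shape
    _, m = Bmat.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, B):
        for j0 in range(0, m, B):
            for p0 in range(0, k, B):
                # Each (B x B) tile is reused many times while it is still cache-hot.
                C[i0:i0+B, j0:j0+B] += A[i0:i0+B, p0:p0+B] @ Bmat[p0:p0+B, j0:j0+B]
    return C

A = np.random.rand(64, 64)
Bm = np.random.rand(64, 64)
assert np.allclose(matmul_naive(A, Bm), matmul_blocked(A, Bm))
```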

Cache-Aware Optimization
  • Tiled Implementation: Implement "Block Matrix Multiplication."
  • Block Size Determination: Determine the optimal block size based on your CPU's L1 cache size (e.g., 32 KB).
  • Analysis: Explain why tiling reduces cache misses (Theoretical analysis of arithmetic intensity).
SIMD Vectorization

Use AVX2 (x86) or NEON (ARM) intrinsics to process multiple floats per instruction in the inner loop (8 single-precision floats with 256-bit AVX2, 4 with 128-bit NEON).

Benchmarking
  • Performance Testing: Run benchmarks across a range of matrix sizes, from matrices that fit in cache up to ones much larger than the last-level cache.
  • Metrics: Plot GFLOPS (Giga-Floating Point Operations Per Second) for: Naive vs. Tiled vs. Tiled+SIMD vs. NumPy (Reference).
  • Cache Analysis: Use a tool like perf (Linux) or Instruments (Mac) to measure "L1-dcache-load-misses."

Deliverables

  • Benchmarking Report
  • GitHub Repository

3 Phase 3: Classical ML Theory

Jun 1, 2026 – Aug 15, 2026
3.1

Kernel Methods and Dual Representations

Not Started
Theoretical Derivation & Implementation Start: Jun 15, 2026 Deadline: Jul 15, 2026
Checkpoint: Pattern Recognition and Machine Learning (Bishop) Ch 6 (Kernel Methods).

Context & Rationale

Understanding kernel methods and dual representations is crucial for modern machine learning. The dual formulation allows us to work in high-dimensional feature spaces without explicitly computing coordinates, enabling non-linear classification and regression.

Problems & Tasks

The Dual Formulation
  • Consider the Ridge Regression objective: $L(w) = \sum_{i=1}^{N} (y_i - w^{\top} x_i)^2 + \lambda \|w\|^2$.
  • Dual Solution: Prove that the optimal weights lie in the span of the data points, $w^{*} = \sum_{i=1}^{N} \alpha_i x_i$. Derive the "Dual Solution" $\alpha = (K + \lambda I)^{-1} y$ involving the Gram matrix $K = X X^{\top}$, with $K_{ij} = x_i^{\top} x_j$.
  • Kernel Insight: Explain why this allows us to use infinite-dimensional feature spaces via the "Kernel Trick" without ever computing the coordinates explicitly.
Implementation
  • Kernel Ridge Regressor: Implement a Kernel Ridge Regressor in Python using only NumPy. (A minimal sketch follows this list.)
  • RBF Kernel: Use the RBF (Gaussian) Kernel: $k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$.
  • Experiment: Train on a non-linear dataset (e.g., a sine wave with noise). Show how changing the bandwidth $\sigma$ shifts the model from High Bias (underfitting) to High Variance (overfitting).
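A minimal NumPy sketch under the assumptions above (1-D sine-wave data; the bandwidth sigma and ridge strength lam are arbitrary starting values to sweep):

```python
# Kernel Ridge Regression with an RBF kernel, fit via the dual solution.
import numpy as np

def rbf_kernel(X1, X2, sigma):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

class KernelRidge:
    def __init__(self, sigma=0.5, lam=1e-2):
        self.sigma, self.lam = sigma, lam

    def fit(self, X, y):
        self.X_train = X
        K = rbf_kernel(X, X, self.sigma)
        # Dual solution: alpha = (K + lam * I)^{-1} y
        self.alpha = np.linalg.solve(K + self.lam * np.eye(len(X)), y)
        return self

    def predict(self, X):
        return rbf_kernel(X, self.X_train, self.sigma) @ self.alpha

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(80, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=80)

model = KernelRidge(sigma=0.5, lam=1e-2).fit(X, y)
X_test = np.linspace(0, 2 * np.pi, 200)[:, None]
y_hat = model.predict(X_test)   # sweep sigma to move between under- and over-fitting
```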

Deliverables

  • Theoretical derivations of the dual formulation
  • Python implementation of Kernel Ridge Regression
3.2

Latent Variable Models (GMMs)

Not Started
Algorithm Implementation (from scratch) Start: Jul 15, 2026 Deadline: Aug 15, 2026
Checkpoint: Bishop Ch 9 (Mixture Models & EM).

Context & Rationale

The Expectation-Maximization (EM) algorithm is the archetype for training models with "hidden" (latent) variables. This is the intellectual ancestor of Variational Autoencoders (VAEs) and Diffusion models.

Problems & Tasks

Data Preparation
  • Synthetic Data: Generate synthetic data from 3 overlapping Gaussian distributions.
  • Visualization Setup: Prepare for creating an animation (GIF) showing the Gaussian contours shifting to fit the data over time.
EM Algorithm Implementation
  • E-Step: Calculate the "responsibilities" $\gamma_{nk}$ (the posterior probability that point $x_n$ belongs to cluster $k$).
  • M-Step: Re-estimate means, covariances, and mixing coefficients using the responsibilities.
  • Convergence: Plot the Log-Likelihood of the data over iterations. It must be monotonically non-decreasing (prove this property in your write-up). A minimal E-step/M-step sketch follows this list.
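A minimal E-step/M-step sketch for full-covariance Gaussians. For brevity it delegates the Gaussian density to scipy.stats (the assignment proper asks for a fully from-scratch version), and initialization plus the outer convergence loop are omitted:

```python
# One EM iteration for a Gaussian Mixture Model; X has shape (N, D), K components.
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, covs):
    # responsibilities[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
    K = len(pis)
    dens = np.stack([pis[k] * multivariate_normal.pdf(X, mus[k], covs[k]) for k in range(K)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(X, resp):
    N, D = X.shape
    Nk = resp.sum(axis=0)                        # effective number of points per cluster
    pis = Nk / N
    mus = (resp.T @ X) / Nk[:, None]
    covs = []
    for k in range(resp.shape[1]):
        diff = X - mus[k]
        covs.append((resp[:, k, None] * diff).T @ diff / Nk[k])
    return pis, mus, np.array(covs)

def log_likelihood(X, pis, mus, covs):
    dens = np.stack([pis[k] * multivariate_normal.pdf(X, mus[k], covs[k]) for k in range(len(pis))], axis=1)
    return np.log(dens.sum(axis=1)).sum()        # must not decrease across EM iterations
```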
Results and Communication

Create an animation (GIF) showing the Gaussian contours shifting to fit the data over time. This is a perfect asset for your technical blog.

Deliverables

  • Blog Post: "The Expectation-Maximization Algorithm Visualized."

4 Phase 4: Deep Learning

Aug 16, 2026 – Nov 30, 2026
4.1

"MicroGrad++" – The Autograd Engine

Not Started
Software Engineering / Library Building Start: Aug 16, 2026 Deadline: Oct 15, 2026
Checkpoint: Understanding Deep Learning (Prince) Ch 7; Karpathy's "Micrograd".

Context & Rationale

As stated in the curriculum, "Understanding backpropagation requires building a computation graph engine." This assignment bridges Phase I (Calculus) and Phase II (Graphs).

Problems & Tasks

Tensor Class with Operations

Wrap NumPy arrays. Implement __add__, __mul__, matmul, relu, transpose.

Computation Graph

As operations are performed, dynamically build a graph of Tensor objects.

Backpropagation Engine
  • backward() Method: Implement backward() which performs a topological sort and propagates gradients using the Chain Rule.
  • Broadcasting: Handle broadcasting correctly in the backward pass (summing out gradients along broadcasted dimensions). This is the hardest part; a minimal sketch of the gradient reduction follows this list.
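A minimal sketch of the broadcast-gradient reduction, using a hypothetical helper name unbroadcast: the upstream gradient is summed over every axis that broadcasting expanded until it matches the original input's shape:

```python
# Reduce an upstream gradient back to the shape of the (broadcasted) input.
import numpy as np

def unbroadcast(grad, shape):
    # Sum over the leading axes that broadcasting added.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum over axes that were size 1 in the input but expanded in the output.
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

# Example: a (3,) bias added to a (4, 3) matrix receives a (4, 3) upstream gradient,
# which must be reduced back to shape (3,).
print(unbroadcast(np.ones((4, 3)), (3,)).shape)     # -> (3,)
print(unbroadcast(np.ones((4, 3)), (4, 1)).shape)   # -> (4, 1)
```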
Verification
  • MNIST Training: Train a 3-layer MLP on the MNIST digits dataset using your engine.
  • Accuracy Goal: Achieve >90% accuracy.
Extension

Implement the Adam optimizer from scratch within your framework.
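A minimal NumPy sketch of the Adam update rule with the paper's default hyperparameters; params and grads are assumed to be parallel lists of arrays produced by your engine:

```python
# Adam: per-parameter first/second moment estimates with bias correction.
import numpy as np

class Adam:
    def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params, self.lr = params, lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [np.zeros_like(p) for p in params]   # first-moment estimates
        self.v = [np.zeros_like(p) for p in params]   # second-moment estimates
        self.t = 0

    def step(self, grads):
        self.t += 1
        for i, (p, g) in enumerate(zip(self.params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)   # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)   # in-place parameter update
```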

Deliverables

  • A Python library on GitHub
4.2

The Transformer from Scratch

Not Started
Model Implementation Start: Oct 16, 2026 Deadline: Nov 30, 2026
Checkpoint: Stanford CS25, "Attention Is All You Need" paper.

Context & Rationale

The Transformer is the architecture of the modern era. You must understand the tensor shapes at every stage, particularly in Multi-Head Attention.

Constraints:
  • You may NOT use nn.Transformer or nn.MultiheadAttention. You must write the scaled_dot_product_attention logic manually.

Problems & Tasks

Rotary Positional Embeddings

Implement RoPE (as used in LLaMA) instead of absolute sinusoidal embeddings. This requires complex number manipulation of query/key vectors.
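A minimal RoPE sketch via complex multiplication, assuming inputs of shape (batch, seq_len, n_heads, head_dim) with an even head_dim and the usual base of 10000; the function names here are illustrative, not prescribed by the assignment:

```python
# Rotary positional embeddings: rotate consecutive channel pairs by position-dependent angles.
import torch

def rope_frequencies(head_dim, seq_len, base=10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]   # (seq_len, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)                   # complex e^{i * theta}

def apply_rope(x, freqs_cis):
    # View consecutive pairs of channels as complex numbers and rotate them.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs_cis = freqs_cis[None, :, None, :]            # broadcast over batch and heads
    x_rotated = torch.view_as_real(x_complex * freqs_cis).flatten(-2)
    return x_rotated.type_as(x)

q = torch.randn(2, 16, 4, 64)                          # (batch, seq, heads, head_dim)
freqs = rope_frequencies(head_dim=64, seq_len=16)
q_rope = apply_rope(q, freqs)                          # same shape, positions now encoded
```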

Causal Masking

Ensure your attention mechanism correctly masks future tokens (the lower-triangular mask).
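A minimal sketch of causal scaled dot-product attention for a single head (no dropout), assuming inputs of shape (batch, seq_len, head_dim):

```python
# Causal attention: mask out scores for future positions before the softmax.
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)        # (batch, seq, seq)
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))             # block attention to future tokens
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                            # (batch, seq, head_dim)

q = k = v = torch.randn(2, 8, 64)
out = causal_attention(q, k, v)   # row t only mixes value vectors from positions <= t
```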

Forward Pass Architecture

Input Indices → Embedding → N Layers (Attn + FFN + Norm) → Logits.

Training
  • Model Size: Train a "baby GPT" (~10M parameters).
  • Dataset: Use the "TinyShakespeare" dataset.
  • Implementation: Implement the tokenizer, data loader, and training loop from scratch.
Generation

Implement a sampling loop with "Temperature" and "Top-k" sampling.
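A minimal sketch of one decoding step with temperature and top-k filtering; the vocabulary size of 65 is only an assumption matching a character-level TinyShakespeare tokenizer:

```python
# Sample the next token from the logits of the last position.
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50):
    logits = logits / temperature                       # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        top_vals, _ = torch.topk(logits, top_k)
        logits = logits.masked_fill(logits < top_vals[-1], float("-inf"))  # keep only the top k
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

next_token = sample_next_token(torch.randn(65))         # 65 = assumed character vocabulary size
```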

Analysis

Visualize the attention weights of your trained model. Show specific examples where a head attends to the previous word vs. the beginning of a sentence.

Deliverables

  • PyTorch Implementation and Training Report

5 Phase 5: Frontier Systems

Dec 1, 2026 – Feb 28, 2027
5.1

High-Performance Kernels (CUDA/Triton)

Not Started
GPU Programming Start: Dec 1, 2026 Deadline: Jan 30, 2027
Checkpoint: GPU Mode videos; Programming Massively Parallel Processors.

Context & Rationale

PyTorch is fast, but custom kernels are faster. This assignment introduces OpenAI's Triton language, which is becoming a standard for writing high-performance GPU kernels.

Problems & Tasks

Fused Softmax Kernel
  • The Problem: Standard Softmax reads data, computes exponentials, writes to memory, reads back, sums, divides, writes back. This is memory-bandwidth inefficient: the row makes several round trips to global memory.
  • The Fusion: Write a kernel that keeps the row data in SRAM (Shared Memory), computes the exponential and the sum in one go, and writes only the final result. (A rough Triton sketch follows this list.)
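A rough sketch of the fused kernel, adapted from the row-per-program pattern in Triton's softmax tutorial; it assumes a contiguous 2D input whose row length fits in one block and requires a CUDA device, so treat it as a starting point rather than a tuned implementation:

```python
# Fused softmax: each program loads one row once, reduces it in registers/SRAM, stores once.
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_cols
    # One read of the row...
    x = tl.load(in_ptr + row * n_cols + offsets, mask=mask, other=float("-inf"))
    # ...then max, exp, sum and divide happen without touching global memory again.
    x = x - tl.max(x, axis=0)
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + offsets, out, mask=mask)

def fused_softmax(x):
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](out, x, n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out

x = torch.randn(1024, 512, device="cuda")
torch.testing.assert_close(fused_softmax(x), torch.softmax(x, dim=-1))
```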
Benchmarking

Compare the throughput (GB/s) of your Triton kernel against torch.softmax for varying tensor sizes.

Advanced Extension

Implement a simplified "FlashAttention" forward pass in Triton (tiling the matrices).

Deliverables

  • Triton kernel implementation
  • Benchmarking report comparing performance against PyTorch
5.2

Distributed Training Simulation

Not Started
Systems Simulation Start: Feb 1, 2027 Deadline: Feb 28, 2027
Checkpoint: CMU 10-714; Papers on ZeRO/Megatron.

Context & Rationale

To scale beyond single-GPU limits, understanding distributed training paradigms is essential. This assignment simulates Data Parallelism (DDP) to demonstrate how gradients are synchronized across multiple GPUs.

Problems & Tasks

Data Parallelism Implementation
  • Simulation: Use torch.multiprocessing to spawn processes (simulating 2 GPUs).
  • Model Replication: Initialize the same model on both processes.
  • Data Splitting: Split a data batch in half.
Gradient Synchronization
  • All-Reduce Operation: Use dist.all_reduce to sum the gradients across processes. (A minimal spawn/all-reduce sketch follows this list.)
  • Weight Update: Update weights after synchronization.
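A minimal sketch of the spawn/all-reduce pattern using the CPU-friendly gloo backend; the toy linear model, port number, and step count are arbitrary choices for illustration:

```python
# Two processes simulate two GPUs: same init, different data shards, averaged gradients.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    torch.manual_seed(0)                       # identical initialization on every rank
    model = torch.nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(100):
        x = torch.randn(8, 10) + rank          # each rank sees a different shard of data
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum gradients across ranks
            p.grad /= world_size                            # then average them
        opt.step()

    print(rank, model.weight.sum().item())     # should match exactly across ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```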
Verification and Analysis
  • Correctness Check: Show that the weights on Rank 0 and Rank 1 remain identical after 100 steps.
  • Communication Analysis: Write a report on the communication overhead. If the model size is $M$ bytes and the network bandwidth is $B$ bytes/s, what is the theoretical latency of an All-Reduce step?

Deliverables

  • Distributed training simulation code
  • Report on communication overhead analysis

6 Phase 6: Frontier Research Topics

Mar 1, 2027 – May 15, 2027
6.1

Alignment (RLHF / DPO)

Not Started
Model Fine-Tuning Start: Mar 1, 2027 Deadline: Apr 15, 2027
Checkpoint: Stanford CS324; Spinning Up in Deep RL.

Context & Rationale

Pre-training creates the base capability; alignment directs it. You will implement Direct Preference Optimization (DPO), the modern stable alternative to PPO.

Problems & Tasks

Dataset Preparation

Create (or download) a small dataset of preference pairs $(x, y_w, y_l)$, where the winner $y_w$ is polite and the loser $y_l$ is rude.

DPO Implementation
  • Loss Function: Implement the DPO loss: $\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$. (A minimal sketch in terms of log-probabilities follows this list.)
  • Reference Model: Manage the memory to keep a frozen copy of the reference model for probability computation.
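A minimal sketch of the loss in terms of summed sequence log-probabilities; the four tensor arguments and the value beta = 0.1 are assumptions about how your training loop packages the data:

```python
# DPO loss given log pi(y|x) for chosen (y_w) and rejected (y_l) responses
# under the trainable policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards are the log-ratios between the policy and the reference model.
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    # -log sigmoid(beta * (chosen - rejected)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
```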
Results Analysis

Qualitative analysis. Show prompts where the base model is rude and the DPO model is polite.

Deliverables

  • DPO implementation code
  • Qualitative analysis of aligned model behavior
6.2

Diffusion Models (Generative AI)

Not Started
Implementation Start: Apr 16, 2027 Deadline: May 15, 2027
Checkpoint: Understanding Deep Learning Ch 18.

Context & Rationale

Diffusion models (like Stable Diffusion, Sora) have replaced GANs as the state-of-the-art in generative AI. Understanding the denoising diffusion probabilistic model (DDPM) framework is essential for modern generative modeling.

Problems & Tasks

Noise Schedule

Define the linear variance schedule $\beta_1, \dots, \beta_T$ (e.g., increasing linearly from $10^{-4}$ to $0.02$, as in the original DDPM paper) and the derived quantities $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$.

Training

Train a U-Net $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$ added to an image at timestep $t$.
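A minimal sketch of the forward-noising step and the epsilon-prediction training loss, assuming $T = 1000$, the linear schedule above, and a unet(x_t, t) callable that you supply:

```python
# DDPM training target: add noise in closed form, then regress the U-Net onto that noise.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # linear variance schedule beta_t
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)          # \bar{alpha}_t = prod_{s<=t} alpha_s

def training_loss(unet, x0):
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                 # random timestep per image
    eps = torch.randn_like(x0)                    # the noise the network must recover
    ab = alpha_bar[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # closed-form forward noising q(x_t | x_0)
    return F.mse_loss(unet(x_t, t), eps)          # predict the added noise
```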

Sampling

Implement the reverse process to generate digits from pure Gaussian noise.

Visualization

Create a grid visualization showing the "reverse diffusion" process step-by-step, turning static into a number.

Deliverables

  • Blog Post

7 Phase 7: Research & Portfolio

May 16, 2027 – Jun 30, 2027
7

The Paper Reproduction

Not Started
Research Implementation Start: May 16, 2027 Deadline: Jun 30, 2027
Checkpoint: Mastery of the "3-Pass Reading Approach".

Context & Rationale

To work at a frontier lab, you must demonstrate the ability to do the work. This capstone assignment requires reproducing a recent research paper from scratch, showing both engineering and research capabilities.

Problems & Tasks

Paper Selection and Analysis

Select a research paper from the last 12 months (NeurIPS/ICLR).

Reproducibility Standards
  • Code: Is the model architecture exactly as described?
  • Hyperparameters: Are learning rates, batch sizes, and initialization seeds documented?
  • Data: Is the train/test split clean? (Avoid data leakage).
  • Compute: Report the GPU hours required.
Implementation and Ablation
  • Reproduction: Re-implement the paper from scratch.
  • Results: Reproduce the main results table.
  • Ablation: Change one thing (e.g., activation function, initialization) and measure the effect.
Communication

Write a high-quality blog post explaining the intuition, the math, and the implementation hurdles.

Deliverables

  • Clean, documented repository
  • Ablation study
  • High-quality blog post