Programme

Locations of conference venue (Zeeman) and dinner venue (Radcliffe)

All talks are held in person in MS.01 in the Zeeman Building. Registration and coffee breaks are held outside MS.01.

To contact the organisers, you may email probai-scaling-26@googlegroups.com.

Schedule

More details will be made available closer to the workshop.

Day	Time	Activity
22 Jun (Mon)	13:00 - 13:45	☕ Registration and coffee
	13:45 - 14:00	Opening remarks & ProbAI Hub address
	14:00 - 15:30	Tutorial: The Proportional Depth-Width Scaling Limit of Neural Networks (Mufan Li) Notes, Video. Abstract: We study the scaling limit of neural networks without skip connections, where the depth d and width n approach infinity at a constant ratio d/n. In this limiting regime, we can view each layer of the neural network as a time discretization, and derive a limiting SDE for the feature covariance matrix.
	15:30 - 16:00	☕ Coffee and social
	16:00 - 17:30	Tutorial: Dynamical Mean Field Theory, Random Matrices and Learning in High Dimensions (Blake Bordelon) Notes, Video. Abstract: In this tutorial, I will introduce Dynamical Mean Field Theory (DMFT), a powerful framework that enables exact asymptotic descriptions of high-dimensional disordered dynamical systems. First, I will introduce a few classic historical applications of DMFT in statistical physics and computational neuroscience. Second, I will examine linear dynamical systems and describe connections between DMFT correlation and response functions and objects that arise naturally in random matrix theory. Both cavity and Martin-Siggia-Rose path integral formalisms will be introduced and worked out in simple cases. One relevant application of this setting that I will describe is test and train loss dynamics for gradient flow in a random feature model. Lastly, after discussing how these ideas relate to infinite width feature learning neural networks trained from random initialization, I will introduce a recently developed method for computing spectral outliers for spiked matrix ensembles where spikes depend on the random initial bulk. This formalism generalizes the classical BBP phase transition to a setting that can describe weights in trained neural networks.

	*Cancelled as the speaker fell ill unexpectedly*	Tutorial: Scaling Limits with Tensor Programs (Leena C Vankadara) Abstract: Tensor Programs is a rigorous framework for deriving the limiting behaviour, as the dimension $n\to\infty$, of high-dimensional vectors generated from random matrices. These vectors are produced by a fixed set of allowable operations, such as matrix multiplication and coordinate-wise nonlinearities. It can be viewed as a compositional, nonlinear random matrix theory. This tutorial builds the machinery from the ground up, with elementary proofs. We start from a primer: a single random matrix applied to an independent vector. We then prove a Master Theorem for the simplest Tensor Program, the Krylov sequence $v, Wv, W^2v, \dots$ obtained by reusing a single matrix. Then, we show how admitting the transpose $W^\top$ introduces the Onsager correction, which is the key structural change the general theorem requires. We close by illustrating the framework with three examples: deriving the infinite-width limits of neural networks (both at initialisation and during training) for a wide class of gradient-based optimisers including SGD and Adam; recovering classical random-matrix laws such as the semicircle and Marchenko–Pastur laws; and computing the limiting singular value distribution of the input–output Jacobian of a randomly initialised network.


23 Jun (Tue)	09:10 - 09:40	☕ Coffee
	09:40 - 10:50	Research talk: Training Dynamics in Large Networks: From Super-Wide to the Scaling Law Regime (Blake Bordelon) Slides, Video. Abstract: In this talk, we discuss stable scaling of deep learning models by identifying stable, feature-learning infinite width and depth limits of neural networks. The asymptotic description of randomly initialized networks in this regime will take the form of a dynamical mean field theory (DMFT). We will discuss how adoption of scaling strategies that admit such limits yields better hyperparameter transfer, where optimal hyperparameters in small models remain optimal in large models. We will provide examples of these results for multi-layer perceptrons, convolutional networks, self-attention blocks, and mixture-of-experts transformers. These exact limits assume parameters are much larger than the training data, and thus fail to capture the behavior of models in the scaling law regime where models of different sizes achieve different performance. To address this, we will introduce simplified, analytically tractable models which enable analysis of training dynamics far from the infinite limit. We show that early time deviations in model performance are universal, while late time deviations are architecture and data dependent. We use this toy model to analyze hyperparameter transfer across training horizons T through an optimal control paradigm, showing that the optimal strategy is highly dependent on feature structure and SGD noise statistics.
	10:50 - 12:00	Tutorial: From Two-Layer Perceptrons to High-Dimensional Limits: Quantitative Mean-Field and Cavity Methods for Neural Networks (Louis-Pierre Chaintron) Slides, Video. Abstract: Understanding the behavior of large neural networks requires identifying the correct scaling regimes and deriving effective descriptions of their training dynamics. This tutorial introduces a quantitative approach to these questions, starting from the simplest setting of two-layer perceptrons. We first discuss how architectural hyperparameters should be chosen as the network width and data dimension grow, and how different scaling choices lead to distinct learning regimes. We then derive quantitative large-width limits, emphasizing rates of convergence and finite-size errors. Building on these results, we turn to high-dimensional limits, where the ambient dimension itself becomes large and new phenomena emerge. A central theme of the tutorial is the cavity method from statistical physics. We explain how this powerful heuristic can be turned into a rigorous mathematical tool, yielding quantitative estimates for high-dimensional neural network dynamics. Along the way, we introduce the key probabilistic ingredients (including concentration phenomena, propagation of chaos, and functional fixed-point equations) that underlie modern analyses of large-scale learning systems. The goal is to provide participants with a unified framework for understanding infinite-width and high-dimensional limits, as well as the mathematical techniques needed to make the corresponding predictions rigorous and quantitative.
	12:00 - 13:20	🥄 Lunch provided at venue
	13:20 - 13:30	CRiSM & Warwick Statistics address
	13:30 - 15:00	Research talk: ResNets of All Shapes and Sizes: Quantitative Large-Scale Theory of Training Dynamics (joint work with Lénaïc Chizat and Javier Maass) (Louis-Pierre Chaintron) Slides, Video. Abstract: Residual neural networks exhibit rich interactions between depth, width, and feature dimension, making their large-scale behavior considerably more intricate than that of shallow architectures. In this talk, I will present a quantitative theory describing the training dynamics of ResNets in the joint infinite depth–width limit. We consider ResNets built from two-layer perceptron blocks with depth L, hidden width M, and embedding dimension D, under the residual scaling O(√D/√(LM)), which has recently been identified as the natural regime for local feature learning. We show that, over a bounded training horizon, the discrepancy between the finite network and its infinite-size limit is controlled by O(1/L + √D/√(LM) + 1/√D). Numerical experiments indicate that this scaling accurately captures the finite-size effects observed during the early stages of training. From a probabilistic perspective, the limit D→∞ gives rise to a novel mean-field theory over the embedding coordinates, featuring interaction terms that scale as 1/√D rather than the classical 1/D. The resulting limit can be interpreted as a rigorous and quantitative version of Dynamical Mean Field Theory (DMFT) for deep neural networks. The proof combines propagation-of-chaos techniques with a functional implementation of the cavity method, providing a mathematically controlled framework for deriving effective large-scale descriptions of deep learning dynamics. The talk aims to illustrate how ideas from statistical physics, probability, and machine learning can be combined to obtain sharp quantitative results for modern deep neural networks.
	15:00 - 15:30	📸 Group photo & ☕ Coffee break
	15:30 - 16:40	Research talk: How to train an LLM (Sam Smith) Video: an earlier version of the same talk. Abstract: Drawing on the experience of designing and scaling Griffin (https://arxiv.org/abs/2402.19427) and RecurrentGemma, I will introduce some of the key practical concepts behind training large language models. The talk is likely to include: a brief introduction to Transformers, including why MLPs, not attention, usually dominate computation; a simple mental model of the computational bottlenecks on TPUs and GPUs; how to train models too large to fit in memory on a single device; scaling laws and hyper-parameter tuning; and a detailed discussion of LLM inference. If time permits, I will discuss how to design recurrent models competitive with transformers, along with their advantages and drawbacks.
	16:40 - 18:30	Break
	18:30 - 20:30	🥄 On-campus dinner for attendees (registration required) at Radcliffe

	*Cancelled as the speaker fell ill unexpectedly*	Research talk: Towards a theory of scaling in deep learning (Leena C Vankadara) Abstract: Scale plays a central role in modern deep learning, where increasing model size, data, and compute often produces regular improvements in performance, but can also bring models into qualitatively different regimes. Yet, these gains are not determined by size alone: whether larger models perform well depends crucially on how models and training procedures are scaled. In this talk, I will discuss scaling theory as a principled framework for studying these questions. I will focus on scaling limits as a natural lens for large-scale learning, showing how they help us derive principled scaling rules while shedding light on the empirical behavior of practical, finite-width networks.

24 Jun (Wed)	09:10 - 09:40	☕ Coffee
	09:40 - 10:50	Research talk: Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning (Yatin Dandi) Slides, Video. Abstract: Understanding how deep neural networks learn useful internal representations from data remains a central open problem in the theory of deep learning. We introduce Neural Low-Degree Filtering (Neural LoFi), a stylized limit of gradient-based training in which hierarchical feature learning becomes an explicit iterative spectral procedure. In this limit, the dynamics at each layer decouple: given the current representation, the next layer selects directions with maximal accessible low-degree correlation to the label. This yields a tractable surrogate mechanism for deep learning, together with a natural kernel-space interpretation. Neural LoFi provides a mathematically explicit framework for studying multi-layer feature learning beyond the lazy regime. It predicts how representations are selected layer by layer, explains how concepts emerge at a given sample complexity, and gives a concrete mechanism by which depth progressively constructs new features from old ones through low-degree compositionality. We complement the theory with mechanistic experiments on fully connected and convolutional architectures, showing that Neural LoFi improves over lazy random-feature baselines, recovers meaningful structured filters, and predicts representations aligned with early gradient-descent feature discovery on real datasets.
	10:50 - 11:20	☕ Coffee break
	11:20 - 12:30	Research talk: Feature Learning in the Proportional Depth-Width Limit (Mufan Li) Slides, Video. Abstract: Neural network scaling limits have provided a powerful framework for understanding feature learning and the empirical phenomenon of hyperparameter transfer, whereby hyperparameters tuned at small scale remain highly predictive of the optimal hyperparameters at large scale. In this talk, we consider an infinite-depth-and-width limit of MLPs in which depth and width grow proportionally, and the activation function is scaled with depth. In this limit, we characterize the feature covariance kernels through a system of forward-backward SDEs. We further show that this identifies the unique scaling of the learning rate, prefactors, and activation that achieves maximal feature learning in the relevant sense. However, both this notion of maximal feature learning and the associated prescriptions differ significantly from the well-known $\mu$P scaling. Finally, preliminary experiments indicate that this scaling regime indeed exhibits hyperparameter transfer.
	12:30 - 14:00	🥄 Workshop closure & lunch provided at venue