

The seminars will take place on Fridays 1 pm UK time in room MB0.08 in a hybrid format.

2022-2023 Organisers: Alice Corbella and Krzysztof Łatuszyński

If you would like to speak, or you want to be included in any emails, please contact one of the organisers.


Term 3:
Date Speaker Title F2F Slides Video
28/04 Flávio Gonçalves

Exact Bayesian inference for level-set Cox processes with piecewise constant intensity function


This work proposes a new methodology to perform Bayesian inference for a class of multidimensional Cox processes in which the intensity function is piecewise constant. Poisson processes with piecewise constant intensity functions are believed to be suitable to model a variety of point process phenomena and, given its simpler structure, are expected to provide more precise inference when compared to processes with non-parametric and continuously varying intensity functions. The partition of the space domain is flexibly determined by a level-set function of a latent Gaussian process. Despite the intractability of the likelihood function and the infinite dimensionality of the parameter space, inference is performed exactly, in the sense that no space discretization approximation is used and MCMC error is the only source of inaccuracy. That is achieved by using retrospective sampling techniques and devising a pseudo-marginal infinite-dimensional MCMC algorithm that converges to the exact target posterior distribution. Computational efficiency is favored by considering a nearest neighbor Gaussian process, allowing for the analysis of large datasets. An extension to consider spatiotemporal models is also proposed. The efficiency of the proposed methodology is investigated in simulated examples and its applicability is illustrated in the analysis of some real point process datasets.

05/05 Renato Paes Leme

Multiparameter Bernoulli Factories


We consider the problem of computing with many coins of unknown bias. We are given sample access to n coins with unknown biases p1,…,pn and are asked to sample from a coin with bias f(p1,…,pn) for a given function f : [0,1]^n → [0,1]. We give a complete characterization of the functions f for which this is possible. As a consequence, we show how to extend various combinatorial sampling procedures (most notably, the classic Sampford Sampling for k-subsets) to the boundary of the hypercube.
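The paper's full characterization is beyond a short sketch, but the flavour of a multiparameter Bernoulli factory can be illustrated with two classic constructions; the function names and the Monte Carlo check below are illustrative, not from the talk.

```python
import random

def flip(p, rng):
    """Draw one sample from a coin whose bias p is hidden from the factory."""
    return rng.random() < p

def product_factory(coins, rng):
    """Simulate a coin with bias f(p1,...,pn) = p1*...*pn:
    flip every coin once and report heads only if all flips are heads."""
    return all(flip(p, rng) for p in coins)

def mean_factory(coins, rng):
    """Simulate a coin with bias f(p1,...,pn) = (p1+...+pn)/n:
    pick one coin uniformly at random and flip it once."""
    return flip(rng.choice(coins), rng)

rng = random.Random(0)
coins = [0.8, 0.5]
est = sum(product_factory(coins, rng) for _ in range(100_000)) / 100_000
# est is a Monte Carlo estimate of 0.8 * 0.5 = 0.4
```

Both factories only ever observe coin flips, never the biases themselves, which is the defining constraint of the Bernoulli factory problem.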

12/05 Anna Korba

Sampling with Mollified Interaction Energy Descent

Sampling from a target measure whose density is only known up to a normalization constant is a fundamental problem in computational statistics and machine learning. In this talk, I will present a new optimization-based method for sampling called mollified interaction energy descent (MIED). MIED minimizes a new class of energies on probability measures called mollified interaction energies (MIEs). These energies rely on mollifier functions—smooth approximations of the Dirac delta originating in PDE theory. We show that as the mollifier approaches the Dirac delta, the MIE converges to the chi-square divergence with respect to the target measure and the gradient flow of the MIE agrees with that of the chi-square divergence. Optimizing this energy with proper discretization yields a practical first-order particle-based algorithm for sampling in both unconstrained and constrained domains. We show experimentally that for unconstrained sampling problems our algorithm performs on par with existing particle-based algorithms like Stein Variational Gradient Descent (SVGD), while for constrained sampling problems our method readily incorporates constrained optimization techniques to handle more flexible constraints with strong performance compared to alternatives.
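MIED itself requires mollifier-specific gradients, so as a hedged point of reference here is a minimal version of the SVGD baseline mentioned in the abstract, for a one-dimensional standard normal target; the step size, bandwidth, and particle initialization are our illustrative choices.

```python
import numpy as np

def svgd_step(x, score, h=1.0, eps=0.05):
    """One SVGD update for 1-d particles x; score(x) = grad log target."""
    diff = x[:, None] - x[None, :]              # x_i - x_j
    k = np.exp(-diff**2 / (2 * h))              # RBF kernel matrix
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) score(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (k @ score(x) + (diff * k / h).sum(axis=1)) / len(x)
    return x + eps * phi

score = lambda x: -x                            # score of N(0, 1)
x = np.linspace(-3.0, 3.0, 50)                  # initial particles
for _ in range(1000):
    x = svgd_step(x, score)
# the particles now roughly follow N(0, 1)
```

The attractive term transports particles toward high-density regions while the kernel-gradient term repels them from one another, which is the same attraction/repulsion trade-off that MIEs encode through the mollifier.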
19/05 Ye He

High-dimensional scaling limits and fluctuations of online least-squares SGD

We derive high-dimensional scaling limits and fluctuations for the online least-squares Stochastic Gradient Descent (SGD) algorithm by taking the properties of the data-generating model explicitly into consideration. Our approach treats the SGD iterates as an interacting particle system, where the expected interaction is characterized by the covariance structure of the input. Assuming smoothness conditions on moments up to order eight, and without explicitly assuming Gaussianity, we establish the high-dimensional scaling limits and fluctuations in the form of infinite-dimensional Ordinary Differential Equations (ODEs) or Stochastic Differential Equations (SDEs). Our results reveal a precise three-step phase transition of the iterates: they go from being ballistic, to diffusive, and finally to purely random behavior as the noise variance goes from low, to moderate, and finally to very high. In the low-noise setting, we further characterize the precise fluctuations of the (scaled) iterates as infinite-dimensional SDEs. We also show the existence and uniqueness of solutions to the derived limiting ODEs and SDEs. Our results have several applications, including characterization of the limiting mean-square estimation or prediction errors and their fluctuations, which can be obtained by analytically or numerically solving the limiting equations.
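The recursion being analysed is simple to state; here is a bare-bones version of online least-squares SGD, with one fresh data point consumed per iterate. The data model, dimension, and step size are our illustrative choices, not the talk's scaling regime.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
theta_true = rng.standard_normal(d)     # ground-truth regression coefficients
theta = np.zeros(d)                     # SGD iterate
eta = 0.05                              # constant step size (illustrative)

for _ in range(20_000):
    # one streaming observation: y = <x, theta_true> + noise
    x = rng.standard_normal(d)
    y = x @ theta_true + 0.1 * rng.standard_normal()
    # online least-squares SGD step on the instantaneous loss (y - <x, theta>)^2 / 2
    theta += eta * (y - x @ theta) * x
```

The talk studies what happens to (suitably scaled versions of) this trajectory as the dimension grows.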
26/05 Daniel Paulin

Contraction rate estimates of kinetic Langevin dynamics integrators


We provide a framework to prove convergence rates for discretizations of kinetic Langevin dynamics for strongly log-concave densities. Our approach provides convergence rates of O(m/M), with explicit stepsize restrictions which are of the same order as the stability threshold for Gaussian targets and are valid for a large interval of the friction parameter. We apply this methodology to various integration methods which are popular in the molecular dynamics and machine learning communities. Finally we study underdamped Langevin schemes that converge to overdamped dynamics in the high friction limit and which have step size restrictions that are independent of the friction parameter; we show that this property is not generic.

This is joint work with Peter Whalley and Ben Leimkuhler.
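As a concrete instance of the family of integrators studied, here is a BAOAB-type splitting scheme for kinetic Langevin dynamics with a standard Gaussian target. The stepsize and friction below are arbitrary illustrative values, not the paper's recommendations, and the scheme shown is one popular member of the class rather than the paper's framework itself.

```python
import numpy as np

def baoab(grad_U, x, v, h, gamma, n_steps, rng):
    """BAOAB splitting for dX = V dt, dV = -grad_U(X) dt - gamma V dt + sqrt(2 gamma) dW."""
    c1 = np.exp(-gamma * h)             # OU damping over one step
    c2 = np.sqrt(1.0 - c1**2)           # matching noise scale
    xs = np.empty(n_steps)
    for t in range(n_steps):
        v -= 0.5 * h * grad_U(x)                    # B: half kick
        x += 0.5 * h * v                            # A: half drift
        v = c1 * v + c2 * rng.standard_normal()     # O: exact Ornstein-Uhlenbeck step
        x += 0.5 * h * v                            # A: half drift
        v -= 0.5 * h * grad_U(x)                    # B: half kick
        xs[t] = x
    return xs

rng = np.random.default_rng(1)
# U(x) = x^2 / 2, so the stationary x-marginal is N(0, 1)
xs = baoab(lambda x: x, x=0.0, v=0.0, h=0.2, gamma=1.0, n_steps=200_000, rng=rng)
```

For a Gaussian target the empirical moments of the configuration variable should closely match the target, which is the kind of discretization accuracy the contraction-rate framework quantifies.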

02/06 Marina Riabiz
Optimal Thinning of MCMC Output
The use of heuristics to assess the convergence and compress the output of Markov chain Monte Carlo (MCMC) can be sub-optimal in terms of the empirical approximations that are produced. Typically, a number of the initial states are attributed to “burn in” and removed, whilst the remainder of the chain is “thinned” if compression is also required. In this talk, I consider the problem of retrospectively selecting a subset of states, of fixed cardinality, from the sample path such that the approximation provided by their empirical distribution is close to optimal. A novel class of methods is proposed, based on minimisation of a kernel Stein discrepancy (KSD), that is suitable when the gradient of the log-target can be evaluated and an approximation using a small number of states is required. To minimize the KSD, we consider greedily scanning the entire MCMC output to select one point at the time, as well as selecting more than one point at a time (making the algorithm non-myopic), and mini-batching the candidate set (making the algorithm non-greedy). Theoretical results guarantee consistency of these methods and their effectiveness is demonstrated in the challenging context of parameter inference for ordinary differential equations.
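A minimal one-dimensional sketch of the greedy KSD-minimising selection described above, using a Langevin-Stein kernel built from an inverse multiquadric base kernel. The kernel choice, candidate set, and target are our illustrative assumptions, not the talk's experimental setup.

```python
import numpy as np

def stein_kernel(x, y, score):
    """Langevin-Stein kernel k_p in 1-d, with IMQ base kernel k = (1 + (x-y)^2)^(-1/2)."""
    d = x - y
    q = 1.0 + d**2
    k = q**-0.5
    dkx = -d * q**-1.5                        # d/dx k
    dky = d * q**-1.5                         # d/dy k
    dkxy = q**-1.5 - 3.0 * d**2 * q**-2.5     # d^2/(dx dy) k
    return dkxy + dkx * score(y) + dky * score(x) + k * score(x) * score(y)

def greedy_thin(candidates, score, m):
    """Greedily pick m points, one at a time, to keep the KSD of the selection small."""
    sel = []
    running = np.zeros(len(candidates))       # sum over selected x_j of k_p(x, x_j)
    diag = stein_kernel(candidates, candidates, score)
    for _ in range(m):
        obj = 0.5 * diag + running            # greedy KSD objective
        i = int(np.argmin(obj))
        sel.append(candidates[i])
        running += stein_kernel(candidates, candidates[i], score)
    return np.array(sel)

score = lambda x: -x                          # score of an N(0, 1) target
pts = greedy_thin(np.linspace(-4.0, 4.0, 201), score, m=20)
```

In the talk the candidate set is MCMC output rather than a grid, and the non-myopic and mini-batched variants modify exactly the greedy loop shown here.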
09/06 George Deligiannidis

Quantitative Uniform Stability of the Iterative Proportional Fitting Procedure


We establish the uniform-in-time stability, w.r.t. the marginals, of the Iterative Proportional Fitting Procedure, also known as the Sinkhorn algorithm, used to solve entropy-regularised Optimal Transport problems. Our result is quantitative and stated in terms of the 1-Wasserstein metric. As a corollary we establish a quantitative stability result for Schrödinger bridges.

This is joint work with V. de Bortoli and A. Doucet.

16/06 No Seminar

23/06 Jeffrey Rosenthal

Equivalences of Geometric Ergodicity

Geometric ergodicity is an important and widely-studied property indicating fast convergence of Markov chains to their stationary distributions. There are various results in the literature about conditions which imply, or are implied by, geometric ergodicity. In this talk, we combine and expand upon these results to come up with a list of 34 different conditions equivalent to geometric ergodicity (27 for general chains plus 7 for reversible chains), in terms of such notions as convergence bounds, drift conditions, spectral properties, and so on, with different assumptions about the distance metric used, finiteness of function moments, initial distribution, uniformity of bounds, and more.
30/06 No Seminar

Term 2:
Date Speaker Title F2F Slides Video
13/01 Austin Brown Lower bounds on the rate of convergence for accept-reject-based Markov chains Y
To avoid poor empirical performance in Metropolis-Hastings and other accept-reject-based algorithms, practitioners often tune these algorithms by trial and error. Lower bounds on the geometric convergence rate are developed in both total variation and Wasserstein distances in order to identify how the simulations will fail, so these settings can be avoided, providing guidance on tuning. The theory is applied in several settings. For example, if the target density concentrates with a parameter n (e.g. posterior concentration), it is demonstrated that the convergence rate of a Metropolis-Hastings chain can tend to 1 exponentially fast if the tuning parameters do not depend carefully on n. This is demonstrated for Bayesian logistic regression with Zellner's g-prior and with a flat prior.
20/01 Jeremie Houssineau Trading model selection for robustness with possibility theory Y
  Robustness is a desirable property of inference in applications where the amount of data is too large to be cleaned by hand and/or where the inference is online. In this talk, I will explain how the evidence, or marginal likelihood, can be used to obtain robust Bayesian inference in the context of possibility theory. This robustness replaces the ability to do model selection in general, and we will discuss the underlying reasons for this difference in meaning of the evidence. We will consider the limitations of the proposed approach and how to compensate for these by using tricks that are specific to possibility theory. For better or worse, I will mostly use the board for this talk.
27/01 Sebastiano Grazzi

PDMP samplers with boundary conditions


In this talk, I will formally introduce piecewise deterministic Markov processes (PDMPs) endowed with "sticky floors", "soft/hard walls" and "teleportation portals", which can be used for Monte Carlo simulation and allow one to efficiently target a rich class of measures arising in Bayesian inference.

I will motivate and illustrate the framework with three challenging applications: Bayesian variable selection, sampling the latent space of infection times with unknown infected population size in the SIR model with notifications, and efficiently sampling the invariant measure in hard-sphere models.

The class of processes presented here extends [1] and is joint work with J. Bierkens, G. Roberts, and M. Schauer.

03/02 Raiha Browning

Discrete-time Hawkes processes and their applications


A Hawkes process is a temporal, self-exciting point process, such that events in the process may trigger future events. Hawkes processes have been studied widely due to their applicability to many real-world phenomena, such as seismic activity and financial markets. The key feature of these processes is the function that controls how previous events in the process impact the current intensity of events, referred to as the triggering kernel. The most common approach for modelling the triggering kernel is to assume it takes the form of simple parametric functions, such as exponential or power law decay. However, this is often too restrictive to capture more complex patterns of excitation. Furthermore, Hawkes processes were introduced as a continuous-time point process. This does not align with common data collection strategies since data is often not available at a continuous time resolution. Instead, event data is often aggregated and provided in the form of event counts over regular, discrete time intervals whereby the exact timing of these events is unknown. This talk develops and applies new statistical methods for modelling multivariate, discrete-time Hawkes processes using flexible Bayesian representations of the triggering kernel. Spatiotemporal variants of discrete-time Hawkes processes are also considered. These models are applied to several substantive case studies; to describe the spread of the novel coronavirus COVID-19, and to monitor the risk of conflict events (riots, protests etc.), in South Asia.
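A hedged sketch of the discrete-time, count-valued Hawkes model described above: event counts over regular intervals are conditionally Poisson given the recent past, with a finite-support triggering kernel. The function name, kernel, and parameter values are ours, not the talk's.

```python
import numpy as np

def simulate_dt_hawkes(T, mu, kernel, rng):
    """Counts y_t ~ Poisson(mu + sum_d kernel[d-1] * y_{t-d}).

    mu is the background rate; kernel[d-1] is the excitation contributed
    by each event d steps in the past (finite-support triggering kernel)."""
    y = np.zeros(T, dtype=int)
    for t in range(T):
        lam = mu
        for d in range(1, min(t, len(kernel)) + 1):
            lam += kernel[d - 1] * y[t - d]
        y[t] = rng.poisson(lam)
    return y

rng = np.random.default_rng(0)
y = simulate_dt_hawkes(20_000, mu=2.0, kernel=[0.3, 0.2], rng=rng)
# branching ratio 0.3 + 0.2 = 0.5 < 1, so the stationary mean is mu / (1 - 0.5) = 4
```

The talk replaces the short parametric kernel above with flexible Bayesian representations, and extends the model to multivariate and spatiotemporal settings.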

10/02 Max Hird

Preconditioning for MCMC


Linear transformation of the state variable (linear preconditioning) is a common technique that often drastically improves the practical performance of a Markov chain Monte Carlo algorithm. Despite this, quantifying the benefits of linear preconditioning is not well studied theoretically, and rigorous guidelines for choosing preconditioners are not always readily available. Mixing time bounds for various samplers (HMC, MALA, Unadjusted HMC, Unadjusted Langevin) have been produced in recent works for the class of strongly log-concave and Lipschitz target distributions and depend strongly on a quantity known as the condition number. We study linear preconditioning for this class of distributions, and under appropriate assumptions we provide bounds on the condition number after using a given linear preconditioner. Nonlinear transformations are less common but potentially far more powerful, and are often known by other names, such as Riemannian manifold methods and the Mirror Langevin. We propose nonlinear preconditioning as a unifying framework in which to conceive of, and compare, these techniques.
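A small numerical illustration of linear preconditioning, with a toy target and preconditioner of our choosing: whitening by a Cholesky factor of the target covariance brings the condition number down to 1, and feeding the same factor into a random-walk proposal preconditions the sampler.

```python
import numpy as np

Sigma = np.array([[100.0, 0.0], [0.0, 1.0]])     # ill-conditioned target covariance
L = np.linalg.cholesky(Sigma)
Sigma_inv = np.linalg.inv(Sigma)

# condition number before and after the whitening map x -> L^{-1} x
kappa_before = np.linalg.cond(Sigma)             # 100
Linv = np.linalg.inv(L)
kappa_after = np.linalg.cond(Linv @ Sigma @ Linv.T)  # 1

def precond_rwm(n, step, rng):
    """Random-walk Metropolis for N(0, Sigma) with proposal covariance step^2 * Sigma."""
    x = np.zeros(2)
    logpi = lambda z: -0.5 * z @ Sigma_inv @ z
    lp, out = logpi(x), np.empty((n, 2))
    for t in range(n):
        prop = x + step * (L @ rng.standard_normal(2))  # preconditioned proposal
        lpp = logpi(prop)
        if np.log(rng.random()) < lpp - lp:
            x, lp = prop, lpp
        out[t] = x
    return out

samples = precond_rwm(50_000, step=1.5, rng=np.random.default_rng(0))
```

With the identity preconditioner instead of L, the proposal scale would have to shrink to accommodate the narrow coordinate, slowing exploration of the wide one; this is the condition-number effect the talk's bounds quantify.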

17/02 Tamas Papp
Dimensional scaling of coupled MCMC algorithms

Couplings can be used to remove the bias of MCMC estimators. However, it can be challenging to design practically implementable couplings which scale well to high dimensions. I will address this question for the Random Walk Metropolis by connecting coupled MCMC sampling with the optimal scaling literature.

In particular, I will introduce a coupling which is optimal in an asymptotic sense and overcomes the shortcomings of previous proposals. Our main technical tools will be high-dimensional ODE limits for Gaussian targets.

This talk is based on joint work with Chris Sherlock (arXiv:2211.12585).
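A hedged one-dimensional illustration of the kind of coupled chains involved: two Random Walk Metropolis chains with maximally coupled proposals and a shared acceptance uniform, which meet exactly after a random number of iterations. This is the standard maximal-coupling construction from the unbiased-MCMC literature, not the asymptotically optimal coupling of the talk.

```python
import math, random

def npdf(x, mu, s):
    """Normal density N(mu, s^2) at x."""
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def coupled_proposals(x, y, s, rng):
    """Maximal coupling of N(x, s^2) and N(y, s^2): returns equal proposals
    with the largest possible probability."""
    xp = rng.gauss(x, s)
    if rng.random() * npdf(xp, x, s) <= npdf(xp, y, s):
        return xp, xp
    while True:
        yp = rng.gauss(y, s)
        if rng.random() * npdf(yp, y, s) > npdf(yp, x, s):
            return xp, yp

def coupled_rwm(x, y, logpi, s, n, rng):
    """Two RWM chains with coupled proposals and a shared acceptance uniform;
    returns the first meeting time, or None if the chains never meet."""
    for t in range(1, n + 1):
        xp, yp = coupled_proposals(x, y, s, rng)
        u = math.log(rng.random())
        if u < logpi(xp) - logpi(x):
            x = xp
        if u < logpi(yp) - logpi(y):
            y = yp
        if x == y:          # identical proposals + shared uniform => chains stay merged
            return t
    return None

rng = random.Random(42)
tau = coupled_rwm(-5.0, 5.0, lambda z: -0.5 * z * z, s=1.0, n=10_000, rng=rng)
```

The distribution of the meeting time tau is exactly the quantity whose dimensional scaling the talk analyses.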

24/02 Andi Wang Explicit convergence bounds for Metropolis Markov chains Y

Metropolis Markov chains, such as Random Walk Metropolis, are a class of fundamental MCMC algorithms used to perform Bayesian inference. In practice, running and tuning MCMC algorithms can be a delicate task, and theoretical properties regarding the convergence rates of the chain in question can be useful for providing guidance for tuning algorithms, deciding burn-in periods, and giving bounds on the asymptotic variance. In this talk I will present the first explicit bounds for the spectral gap of Random Walk Metropolis, for any value of the proposal variance. When this variance is appropriately scaled, the bounds recover the correct $d^{-1}$ scaling with dimension for suitably regular invariant distributions. I will outline the techniques used to obtain these bounds, which rely on notions such as isoperimetry and conductance.

Joint work with Christophe Andrieu, Anthony Lee and Sam Power.

03/03 Luke Kelly Lagged couplings for Markov chain Monte Carlo phylogenetic inference Y
Phylogenetic inference attempts to reconstruct the ancestry of a set of observed taxa and is an intractable statistical problem on a complex, high-dimensional space. The likelihood function is an integral over unobserved evolutionary events on a tree and is frequently multimodal. MCMC methods are the primary tool for Bayesian phylogenetic inference, but constructing sampling schemes to efficiently explore the associated posterior distributions and diagnosing their performance are difficult tasks.
Couplings have recently been used to construct unbiased MCMC estimators and estimate useful bounds on the convergence of chains to their stationary distribution. We describe a procedure to couple a pair of Markov chains targeting a posterior distribution over a space of phylogenetic tree topologies, branch lengths, scalar parameters and latent variables such that the chains meet exactly after a random, finite number of iterations and remain coupled. We use the meeting times to diagnose convergence in total variation distance jointly across all components of the model on trees with up to 200 leaves.
10/03 Qian Qin Spectral telescope: Convergence rate bounds for random-scan Gibbs samplers based on a hierarchical structure Y
Random-scan Gibbs samplers possess a natural hierarchical structure. The structure connects Gibbs samplers targeting higher dimensional distributions to those targeting lower dimensional ones. This leads to a quasi-telescoping property of their spectral gaps. Based on this property, we derive three new bounds on the spectral gaps and convergence rates of Gibbs samplers on general domains. The three bounds relate a chain’s spectral gap to, respectively, the correlation structure of the target distribution, a class of random walk chains, and a collection of influence matrices. Notably, one of our results generalizes the technique of spectral independence, which has received considerable attention for its success on finite domains, to general state spaces. We illustrate our methods through a sampler targeting the uniform distribution on a corner of an n-cube.
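For readers unfamiliar with the sampler being analysed, here is a minimal random-scan Gibbs sampler on a bivariate Gaussian; the target is an illustrative stand-in for the general setting of the talk.

```python
import numpy as np

def random_scan_gibbs(n, rho, rng):
    """Random-scan Gibbs for N(0, [[1, rho], [rho, 1]]): at each step update
    one coordinate, chosen uniformly at random, from its full conditional."""
    x = np.zeros(2)
    out = np.empty((n, 2))
    sd = np.sqrt(1.0 - rho**2)          # conditional standard deviation
    for t in range(n):
        i = rng.integers(2)             # the random scan: pick a coordinate
        # full conditional: x_i | x_j ~ N(rho * x_j, 1 - rho^2)
        x[i] = rho * x[1 - i] + sd * rng.standard_normal()
        out[t] = x
    return out

s = random_scan_gibbs(50_000, rho=0.5, rng=np.random.default_rng(0))
```

The talk's hierarchical/telescoping argument relates the spectral gap of samplers like this one, in higher dimensions, to gaps of lower-dimensional Gibbs samplers.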
Term 1:
Date Speaker Title F2F Slides Video
07/10 Francesca Crucinio Optimal scaling of proximal MCMC Y

Proximal MCMC is a recently proposed class of MCMC methods which uses proximity maps instead of gradients to build proposal mechanisms that can be employed for both differentiable and non-differentiable targets.

These methods have been shown to be stable for a wide class of targets, making them a valuable alternative to Metropolis-adjusted Langevin algorithms (MALA), and have found wide application in imaging contexts.

The wider stability properties are obtained by building the Moreau-Yosida envelope for the target of interest, which depends on a parameter lambda.

In this work we investigate the optimal scaling problem for proximal MCMC, show that MALA is a special case of this class of algorithms, and provide practical guidelines for the selection of the scale parameter and the parameter lambda.
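A hedged one-dimensional sketch of the idea: for a Laplace target, whose log-density is not differentiable at zero, replace the gradient in a MALA proposal by the gradient of the Moreau-Yosida envelope, computed through the proximity map (here, soft thresholding). The values of lambda and the stepsize are arbitrary illustrative choices, not the optimal ones derived in the talk.

```python
import math, random

def prox_abs(x, lam):
    """Proximity map of lam * |.|, i.e. soft thresholding."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def grad_envelope(x, lam):
    """Gradient of the Moreau-Yosida envelope of |.|: (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam

def proximal_mala(n, h, lam, rng):
    """MALA targeting pi(x) ∝ exp(-|x|): the drift uses the envelope gradient,
    but the accept/reject step uses the true (non-smooth) target."""
    logpi = lambda z: -abs(z)
    logq = lambda a, b: -(a - (b - 0.5 * h * grad_envelope(b, lam))) ** 2 / (2 * h)
    x, out = 0.0, []
    for _ in range(n):
        prop = x - 0.5 * h * grad_envelope(x, lam) + math.sqrt(h) * rng.gauss(0.0, 1.0)
        log_alpha = logpi(prop) - logpi(x) + logq(x, prop) - logq(prop, x)
        if math.log(rng.random()) < log_alpha:
            x = prop
        out.append(x)
    return out

xs = proximal_mala(100_000, h=0.5, lam=0.5, rng=random.Random(0))
# the Laplace(1) target has mean 0 and variance 2
```

Because the Metropolis correction uses the exact target, the chain remains exact despite the smoothed drift; as lambda shrinks toward zero for a differentiable target, the proposal recovers plain MALA.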


Richard Everitt

Paul Jenkins

Gareth Roberts

Massimiliano Tamborrino

Short talks
21/10 Juan Kuntz-Nussio Scalable particle-based alternatives to EM Y
Neal and Hinton (1998) recast the problem tackled by EM as the minimization of a free energy functional F on an infinite-dimensional space, and EM itself as coordinate descent applied to F. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with F and show that their limits coincide with F's stationary points. By discretizing the flows, we obtain three practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale well to high-dimensional settings and outperform existing state-of-the-art methods in experiments.

Jeremie Houssineau

Giovanni Montana

Simon Spencer

Nicholas Tawn

Short talks
04/11 Aki Nishimura Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large n & large p" Bayesian sparse regression Y

In a modern observational study based on healthcare databases, the number of observations and the number of predictors typically range over 10^5 ~ 10^6 and 10^4 ~ 10^5, respectively. Despite the large sample size, data rarely provide sufficient information to reliably estimate such a large number of parameters. Sparse regression techniques provide potential solutions, one notable approach being the Bayesian methods based on shrinkage priors. In the "large n & large p" setting, however, posterior computation encounters a major bottleneck at repeated sampling from a high-dimensional Gaussian distribution, whose precision matrix $\Phi$ is expensive to compute and factorize. In this talk, I will present a novel algorithm to speed up this bottleneck based on the following observation: we can cheaply generate a random vector $b$ such that the solution to the linear system $\Phi \beta = b$ has the desired Gaussian distribution. We can then solve the linear system by the conjugate gradient (CG) algorithm through matrix-vector multiplications by $\Phi$; this involves no explicit factorization or calculation of $\Phi$ itself. Rapid convergence of CG in this context is guaranteed by the theory of prior-preconditioning we develop. We apply our algorithm to a clinically relevant large-scale observational study with n = 72,489 patients and p = 22,175 clinical covariates, designed to assess the relative risk of adverse events from two alternative blood anti-coagulants. Our algorithm demonstrates an order of magnitude speed-up in posterior inference, in our case cutting the computation time from two weeks to less than a day.
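A hedged toy version of the core trick, at tiny dimensions and with a hand-rolled CG: generate b with mean X^T y and covariance Phi = X^T X + D^{-1}, then solve Phi beta = b by conjugate gradients, giving an exact draw beta ~ N(Phi^{-1} X^T y, Phi^{-1}) without ever factorising Phi. The regression design and prior scales below are made up for illustration.

```python
import numpy as np

def cg_solve(matvec, b, tol=1e-12, maxiter=200):
    """Plain conjugate gradients for a symmetric positive-definite system."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)
d = np.array([1.0, 1.0, 1.0])                 # prior variances (illustrative)

def draw_beta():
    """One exact draw from N(Phi^{-1} X^T y, Phi^{-1}), Phi = X^T X + diag(1/d):
    b = X^T (y + eps1) + eps2 / sqrt(d) has mean X^T y and covariance Phi."""
    b = X.T @ (y + rng.standard_normal(n)) + rng.standard_normal(p) / np.sqrt(d)
    return cg_solve(lambda v: X.T @ (X @ v) + v / d, b)

draws = np.array([draw_beta() for _ in range(4000)])
```

At the scale of the talk (p in the tens of thousands), only the matrix-vector products with Phi are affordable, and the prior-preconditioning theory governs how many CG iterations each draw needs.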


Jon Forster

Adam Johansen

Martyn Plummer

Mark Steel

Ioannis Kosmidis

Short talks
18/11 Lucas Castillo Martì Using Random Generation to reverse-engineer the brain's sampling algorithm Y

Over the past two decades, a wave of Bayesian explanations has swept through cognitive science, explaining behaviour in domains from intuitive physics and causal learning, to perception, motor control and language. Yet people produce stunningly incorrect answers in response to even the simplest questions about probabilities. How can a supposedly Bayesian brain paradoxically reason so poorly with probabilities?

We propose that the brain approximates Bayesian inference through local sampling, approximating the solutions to inferential problems in similar ways to how MCMC algorithms operate. In a recent experiment, we explicitly asked people to generate a sequence of random samples, and found that human sequences are different to iid sequences in very predictable ways. We compared human performance to that of multiple MCMC algorithms, and found that people are most similar to an MCMC algorithm with autocorrelated proposals.

25/11 Lyudmila Grigoryeva Reservoir kernels and Volterra series Y

A universal kernel whose sections approximate any causal and time-invariant filter in the fading memory category with inputs and outputs in a finite-dimensional Euclidean space is constructed. This kernel is built using the reservoir functional associated with a state-space representation of the Volterra series expansion available for any analytic fading memory filter. We call it the Volterra reservoir kernel. Even though the state-space representation and the corresponding reservoir feature map are defined on an infinite dimensional tensor algebra space, the kernel map is characterized by explicit recursions that are readily computable for specific datasets when employed in estimation problems using the representer theorem. We showcase the application of the reservoir kernel in two empirical exercises, namely, Mackey-Glass time series forecasting and bitcoin price prediction. We compare the performance of the reservoir kernel to the sigkernel in Salvi et al. and other kernel benchmarks therein.

02/12 Frank van der Meulen Backward Filtering Forward Guiding Y

Consider a stochastic process evolving on a directed acyclic graph. Observations are at the leaf vertices and interest lies in smoothing and parameter estimation.
I will discuss the Backward Filtering Forward Guiding (BFFG) algorithm. In this algorithm one first computes the Backward Information Filter (BIF) for a simplified version of the forward dynamics. The output of the BIF can subsequently be used to adjust the dynamics of the forward process such that it ends up in the observations at the leaf vertices. This adjustment is an instance of Doob's h-transform. We call the adjusted forward process a "guided process", as it guides the process to the observations. One can deduce closed-form expressions for the likelihood ratio between the true conditioned process and the guided process. This in turn can be exploited for designing Bayesian computational methods. In particular I will show that likelihood-based inference can be done in a unified way for seemingly different models by defining two key operations and their composition.

This concerns joint work with Moritz Schauer (Chalmers University of Technology - University of Gothenburg). The BFFG algorithm for diffusions is discussed in the paper Mider, Schauer & Van der Meulen (2021), while generalisations to graphical models and other stochastic processes are treated in Van der Meulen & Schauer (2022). An accessible introduction to that paper is given in Van der Meulen (2022).

M. Mider, M. Schauer and F.H. van der Meulen (2021), Continuous-discrete smoothing of diffusions, Electronic Journal of Statistics 15, 4295-4342.

F.H. van der Meulen and M. Schauer (2021), Automatic Backward Filtering Forward Guiding for Markov processes and graphical models, arXiv:2010.03509.

F.H. van der Meulen (2022), Introduction to Automatic Backward Filtering Forward Guiding, arXiv:2203.04155.

09/12 Ritabrata Dutta and Pablo Ramses Alonso Martin Efficient Bayesian model averaging for Neural Networks via Gibbs Boomerang Y
We explore the strength of Markov chain Monte Carlo (MCMC) methods based on piecewise deterministic Markov processes (PDMPs) for the uncertainty quantification of neural-network-based models. Our sampling machinery builds upon a preconditioned PDMP sampler called the Boomerang sampler, which achieves superior mixing due to non-reversibility and can sample exactly while using only sub-samples of the training data, to deal with the high dimensionality of the parameter space and the big-data bottleneck. To solve the identifiability problem we propose a framework based on Bayesian model averaging using global-local shrinkage priors, more specifically horseshoe priors.
We illustrate that in this setting most variational and standard MCMC techniques fail due to the exploding behaviour of the gradients of the posterior distribution. Borrowing ideas from the literature on Gibbs PDMPs, we propose the Gibbs Boomerang sampler, which does not suffer from exploding gradients and further removes the need to estimate the reference measure used for preconditioning. The only approximation in our method arises when sampling event times from the associated non-homogeneous Poisson process for intractable likelihoods.
Previous Years:

2021/2022

2020/2021