Illustrative research internship projects

Dr Richard Everitt
Approximate Bayesian computation for individual based models
Individual based models (IBMs) are used in ecology to model population-level behaviour by modelling the behaviour of individual organisms. These computer models are sometimes deterministic, sometimes stochastic, and usually have unobserved parameters. These parameters must be chosen appropriately (“calibrated”) using observed data, so that the model can be an accurate representation of the real world. This calibration can, in theory, be carried out using a technique known as approximate Bayesian computation (ABC). ABC works by simulating from the model for different parameters, and choosing parameters whose simulations are good matches to the observed data. However, in practice, the large computational cost of simulating from an IBM limits the use of standard ABC techniques. This project will investigate the use of recent approaches, such as Bayesian optimisation ABC, for calibrating IBMs.
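As a rough illustration of the basic ABC idea described above, the following minimal sketch runs rejection ABC on a toy model. The toy model (independent survival of individuals), the uniform prior, the tolerance and all function names are assumptions made for illustration; a real IBM would be far more expensive to simulate, which is exactly the difficulty the project addresses.

```python
import random


def simulate_survivors(theta, n_individuals, rng):
    """Toy stand-in for an IBM (an assumption, not a real ecological model):
    each individual survives independently with probability theta."""
    return sum(rng.random() < theta for _ in range(n_individuals))


def abc_rejection(observed, n_individuals, n_draws, tolerance, rng):
    """Keep the parameter draws whose simulated data lie close to the observed data."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.random()  # draw from an assumed Uniform(0, 1) prior
        simulated = simulate_survivors(theta, n_individuals, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(1)
posterior_sample = abc_rejection(
    observed=30, n_individuals=100, n_draws=5000, tolerance=2, rng=rng
)
posterior_mean = sum(posterior_sample) / len(posterior_sample)
print(len(posterior_sample), round(posterior_mean, 3))
```

Only a small fraction of the draws is accepted, so most simulations are wasted; this is tolerable for a cheap toy model but not for a costly IBM.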
Estimating bowler ability in cricket
The bowling average (number of runs conceded divided by number of wickets) is the most commonly used method for evaluating the effectiveness of bowlers in cricket. However, there are a number of variables that affect this average - e.g. the quality of the opposition and the country in which the matches are played. Therefore it is difficult to compare the quality of bowlers from different countries, and from different eras. This project will adapt recent work on estimating the quality of batsmen in Test cricket, and instead examine the quality of bowlers in Test cricket.
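For concreteness, the bowling average defined above can be computed as in the short sketch below. The function name and the figures in the example are hypothetical and are not taken from any real bowler's record.

```python
def bowling_average(runs_conceded, wickets):
    """Runs conceded divided by wickets taken; undefined when no wickets are taken."""
    if wickets == 0:
        return None  # the average is undefined without any wickets
    return runs_conceded / wickets


# Hypothetical career figures, for illustration only.
example_average = bowling_average(runs_conceded=2500, wickets=100)
print(example_average)  # 25.0
```

A lower value indicates a more effective bowler, but the calculation takes no account of the opposition or playing conditions, which is the limitation the project sets out to address.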
Active subspaces using sequential Monte Carlo
Many models for complex processes have the issue that their parameters cannot be fully identified from the data. When performing Bayesian inference on such a model, it can simplify the problem to explore only the “active subspace” of the parameters: a particular kind of low-dimensional structure that can be exploited in inference. This project will introduce the use of an active subspace into a sequential Monte Carlo framework.
Dr Jere Koskela
Simulation of jump diffusions in genetics

The evolution of allele frequencies in randomly reproducing populations subject to occasional large scale events, such as population bottlenecks or selective sweeps, is typically modelled by so-called Lambda-Fleming-Viot jump diffusions. An important step in using a model for practical inference is to be able to simulate from it. Trajectories of jump diffusions usually cannot be simulated exactly. A range of approximation schemes are available, but most are only applicable to jump diffusions taking values on the whole real line. The Lambda-Fleming-Viot model describes the frequency of an allele in a population, and hence is constrained to take values in the unit interval [0,1]. This project aims to identify a range of jump diffusion approximation schemes, implement them, and compare their performance on the Lambda-Fleming-Viot model empirically.

Time series inference in genetics

DNA sequence data is now available from multiple generations, and there are standard models (most notably, the Wright-Fisher model) which describe the changes in allele frequencies in a population across time. However, the advent of multi-generation data is recent enough that many practical questions remain unanswered. This project is about finding an optimal balance between how many generations to sequence, and how many individuals to sequence in each generation, when the goal is to minimise the mean square error of estimators of standard genetic quantities of interest, such as the rate at which mutations arise in the population.
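To make the setting concrete, here is a minimal sketch of a neutral Wright-Fisher simulation with no mutation or selection, followed by sequencing a fixed number of individuals in each generation. The function names, population size, and sample size are assumptions chosen for illustration; the project itself would need a model with mutation to study mutation-rate estimators.

```python
import random


def wright_fisher_path(pop_size, initial_freq, n_generations, rng):
    """Neutral Wright-Fisher model: each generation is formed by binomial
    resampling of pop_size gene copies from the previous generation."""
    count = round(initial_freq * pop_size)
    freqs = [count / pop_size]
    for _ in range(n_generations):
        p = count / pop_size
        count = sum(rng.random() < p for _ in range(pop_size))
        freqs.append(count / pop_size)
    return freqs


def sample_generations(freqs, sample_size, rng):
    """Number of sampled gene copies carrying the allele in each generation."""
    return [sum(rng.random() < p for _ in range(sample_size)) for p in freqs]


rng = random.Random(42)
path = wright_fisher_path(pop_size=200, initial_freq=0.3, n_generations=10, rng=rng)
counts = sample_generations(path, sample_size=20, rng=rng)
print(path)
print(counts)
```

Varying `n_generations` and `sample_size` under a fixed total sequencing budget is one way to explore the trade-off the project describes.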

Prof. Ioannis Kosmidis
High-dimensional logistic regression
Logistic regression is one of the most frequently used models in statistical practice, both for inferring covariate effects on probabilities and for constructing classifiers. One of the reasons for its widespread use is that likelihood-based inference and prediction are easy to carry out, and are asymptotically valid under assumptions such as the number of model parameters p being fixed relative to the number of observations n.
Nevertheless, that validity is known to be lost under more realistic assumptions like p/n → κ ∈ (0, 1). An increasing amount of research is now focusing on developing methodology that can recover the performance of estimation and inference from logistic regression. This project will initially compare, through simulation experiments, some recent proposals for improved estimation and inference in logistic regression under assumptions that match the expectations that modern practice sets. Then, we will attempt to derive new computationally-attractive estimators and test statistics that work well in cases like p/n → κ ∈ (0, 1).
Item-response theory models and politics: How liberal are the members of the US House?
Item-response theory (IRT) models are a core tool in psychometric studies. They can be used to learn about the difficulty of tests and the ability or, more generally, attitude of individuals towards those tests. In a parliamentary setting, we could think of the matters that are presented for vote as tests, and the votes of the members as responses to those tests. Then we can start asking questions like i) how "liberal" each of the members is; ii) how does the liberality of each member change over time; iii) how do the various social and economic matters cross or respect party lines and when; and so on.
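As an illustration of this analogy, the sketch below uses the standard two-parameter logistic IRT form, in which the probability of a "yes" vote depends on the gap between a member's position and a matter's cut point. The member names, positions, and matter parameters are hypothetical, and the dynamic models developed in the project may take a different form.

```python
import math
import random


def vote_probability(position, cut_point, discrimination):
    """Two-parameter logistic IRT: probability of a 'yes' vote."""
    return 1.0 / (1.0 + math.exp(-discrimination * (position - cut_point)))


rng = random.Random(0)

# Hypothetical members with latent positions on a single scale.
positions = {"member_a": -1.0, "member_b": 0.0, "member_c": 1.5}

# Hypothetical matters, each given as (cut_point, discrimination).
matters = [(-0.5, 1.0), (0.5, 2.0)]

# Simulate one vote (1 = yes, 0 = no) per member per matter.
votes = {
    name: [int(rng.random() < vote_probability(pos, c, d)) for c, d in matters]
    for name, pos in positions.items()
}
print(votes)
```

In practice the positions and matter parameters are unknown and must be estimated from the observed votes.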
This project aims to develop dynamic extensions of IRT models that are suitable for answering such questions. The models will be fit to more than half a century's worth of voting records in the US House of Representatives using cutting-edge Bayesian and frequentist computational frameworks for statistical modelling, like STAN and TMB.
The ultimate goal is to be able to define "liberality scales" that can be used to visualise the members' movements towards more liberal or more conservative positions over time; predict such moves; and identify groups of similar matters and members.
Dr Simon Spencer
Simulations for the whole family (of quasi-stationary distributions)

If a Markov process has an absorbing state (reached with probability one in finite time) then the stationary distribution is boring: all the mass falls on the absorbing state. However, if we condition on the process not having reached the absorbing state yet then a so-called quasi-stationary distribution may exist. In fact, there can be infinitely many such quasi-stationary distributions for the same process. The birth-death process is a relatively simple model that has an infinite family of quasi-stationary distributions. One is straightforward: a so-called “low energy” distribution with finite mean, and all others are more exotic, “high energy” distributions with infinite mean. In this project we will look to find ways of simulating from the quasi-stationary distributions of the birth-death process, and from the “high energy” distributions in particular. Then, we will look to apply these simulation techniques to more complex models in which the family of quasi-stationary distributions is currently unknown. This project will involve programming in the statistical programming language R.
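The idea of conditioning on non-absorption can be illustrated with a naive sketch: simulate many paths of a linear birth-death process up to a fixed time and keep only those that have not yet hit zero. The sketch is written in Python for illustration, although the project itself uses R. The rates, time horizon, and function names are assumptions, and this approach says nothing about the “high energy” distributions that the project targets.

```python
import random


def simulate_linear_birth_death(birth_rate, death_rate, initial_state, end_time, rng):
    """Simulate a linear birth-death process (per-individual rates) until
    end_time or absorption at zero, and return the final state."""
    state, time = initial_state, 0.0
    while state > 0:
        total_rate = state * (birth_rate + death_rate)
        time += rng.expovariate(total_rate)  # waiting time to the next event
        if time > end_time:
            break
        if rng.random() < birth_rate / (birth_rate + death_rate):
            state += 1  # birth
        else:
            state -= 1  # death
    return state


rng = random.Random(7)
final_states = [
    simulate_linear_birth_death(1.0, 2.0, initial_state=1, end_time=3.0, rng=rng)
    for _ in range(20000)
]

# Condition on non-absorption: keep only the paths still alive at end_time.
survivors = [s for s in final_states if s > 0]
survivor_mean = sum(survivors) / len(survivors)
print(len(survivors), round(survivor_mean, 3))
```

Most paths are absorbed before the end time and are discarded, so this rejection approach becomes very inefficient as the time horizon grows.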

Key reference: Adam Griffin, Paul A. Jenkins, Gareth O. Roberts and Simon E.F. Spencer (2017). Simulation from quasi-stationary distributions on reducible state spaces. Advances in Applied Probability, 49 (3), 960-980.