
Illustrative research internship projects

Dr Jere Koskela
Does high fecundity mask population growth in DNA sequence data?

DNA sequence data is increasingly commonplace, and can be used to answer numerous questions about the history of the population from which it was sampled. There are plenty of successful methods for detecting historical population size changes in mammal-like populations, in which family sizes are small and many offspring survive to adulthood. Many organisms, particularly marine ones, reproduce rather differently: each female can produce millions of eggs, but typically only a tiny fraction survive to adulthood. This kind of high-fecundity reproduction is less well studied than the mammalian regime, and in particular, methods for detecting historical population size changes are not in widespread use. The purpose of this project is to investigate whether population growth can be detected in the presence of high fecundity, using standard models of mathematical population genetics and simulated data.
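The contrast between the two reproduction regimes can be sketched in a few lines of simulation. This is a toy illustration only, not the project's model: the population size, the Pareto tail index and the multinomial survival step are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000  # population size (illustrative)

# Mammal-like reproduction: each of N offspring picks a parent uniformly,
# so family sizes are Multinomial(N, 1/N) and all families are small.
wf_families = rng.multinomial(N, np.ones(N) / N)

# High-fecundity ("sweepstakes") reproduction: parents produce heavy-tailed
# numbers of juveniles, of which only N survive to adulthood.
juveniles = rng.pareto(1.5, size=N) + 1.0
hf_families = rng.multinomial(N, juveniles / juveniles.sum())

print(wf_families.max(), hf_families.max())  # largest family in each regime
```

Under the heavy-tailed regime a single parent can account for a substantial share of the next generation, which is exactly the feature that standard inference methods do not account for.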

Time series inference in genetics

DNA sequence data is now available from multiple generations, and there are standard models (most notably, the Wright-Fisher model) which describe the changes in allele frequencies in a population across time. However, the advent of multi-generation data is recent enough that many practical questions remain unanswered. This project is about finding an optimal balance between how many generations to sequence and how many individuals to sequence in each generation, when the goal is to minimise the mean square error of estimators of standard genetic quantities of interest, such as the rate at which mutations arise in the population.
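A minimal simulation of the kind of data involved might look as follows; the population size, number of generations and per-generation sample size are illustrative choices, not values from the project.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500      # population size (illustrative)
T = 50       # number of generations tracked
n = 30       # individuals sequenced per generation (illustrative)

# Neutral Wright-Fisher dynamics: the allele count in generation t+1 is
# Binomial(N, x_t), where x_t is the current allele frequency.
freq = np.empty(T)
x = 0.5
for t in range(T):
    freq[t] = x
    x = rng.binomial(N, x) / N

# Sequencing only n individuals per generation gives noisy observations
# of the underlying frequency trajectory.
observed = rng.binomial(n, freq) / n
print(observed[:5])
```

The trade-off the project studies is visible here: for a fixed sequencing budget, increasing `n` reduces per-generation noise while increasing `T` captures more of the drift process.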

Prof. Vassili Kolokoltsov
Probabilistic methods for fractional calculus
Projects include the analysis of generalized nonlinear fractional equations (for instance, the fractional Hamilton-Jacobi-Bellman equation with application to control, and fractional McKean-Vlasov equations with applications to interacting particles and/or mean-field games), numerical solutions via path integral representations and the Monte Carlo technique. For the state-of-the-art see this survey.
Stochastic games with a large number of players, including mean-field games
The topic boasts a variety of applications, especially in socio-economic modelling. Projects include investigating theoretical questions on games with countably many states, and there are many interesting applied problems to explore.
For an introduction to this vast topic see this monograph.
Quantum games

This is a genuinely 21st-century development at the crossroads of control, games and quantum technologies. The project will involve dynamic quantum games, an exciting new topic that is waiting to be properly explored. For an introduction, see the review "Quantum games: a survey for mathematicians".

Dr Richard Everitt
Approximate Bayesian computation for individual based models
Individual based models (IBMs) are used in ecology for modelling behaviour at a population level via modelling the behaviour of individual organisms. These computer models are sometimes deterministic, sometimes stochastic, and usually have unobserved parameters. These parameters must be chosen appropriately (“calibrated”) using observed data, so that the model can be an accurate representation of the real world. This calibration can, in theory, be carried out using a technique known as approximate Bayesian computation (ABC). ABC works by simulating from the model for different parameters, and choosing parameters whose simulations are good matches to the observed data. However, in practice, the large computational cost of simulating from an IBM places a limitation on the use of standard ABC techniques. This project will investigate the use of recent approaches, such as Bayesian optimisation ABC, for calibrating IBMs.
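The basic rejection-ABC idea can be sketched with a stand-in simulator; the cheap Poisson model, the prior range and the tolerance below are all assumptions for illustration, and a real IBM would be far more expensive to run, which is precisely the motivation for the smarter methods the project considers.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(theta, n=50):
    # Stand-in for an expensive IBM run: here just n Poisson counts.
    return rng.poisson(theta, size=n)

y_obs = simulate(3.0)          # synthetic "observed" data
s_obs = y_obs.mean()           # summary statistic

# Rejection ABC: draw theta from the prior, simulate, and keep draws
# whose simulated summary lies within a tolerance of the observed one.
prior_draws = rng.uniform(0.0, 10.0, size=5000)
accepted = [th for th in prior_draws
            if abs(simulate(th).mean() - s_obs) < 0.3]

print(len(accepted), np.mean(accepted))
```

Every candidate parameter costs one full model run, so when each run takes minutes or hours this brute-force loop becomes infeasible, and methods such as Bayesian optimisation ABC aim to spend those runs more carefully.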
Estimating bowler ability in cricket
The bowling average (number of runs conceded divided by number of wickets) is the most commonly used method for evaluating the effectiveness of bowlers in cricket. However, a number of variables affect this average, e.g. the quality of the opposition and the country in which the matches are played. This makes it difficult to compare the quality of bowlers from different countries, and from different eras. This project will adapt recent work on estimating the quality of batsmen in Test cricket, and instead examine the quality of bowlers in Test cricket.
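The raw statistic itself is simple arithmetic; the career figures below are made up for illustration, and the point of the project is precisely that this number alone does not adjust for context.

```python
# Bowling average = runs conceded / wickets taken (lower is better).
bowlers = {"Bowler A": (2847, 112),   # illustrative career totals
           "Bowler B": (3125, 98)}

averages = {name: runs / wkts for name, (runs, wkts) in bowlers.items()}
print(averages)
```

Bowler A's lower average here could still reflect easier conditions rather than greater skill, which is what a model-based approach would try to disentangle.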
Active subspaces using sequential Monte Carlo
Many models for complex processes have the issue that their parameters cannot be fully identified from the data. When performing Bayesian inference on such a model, it can simplify the problem to explore only the “active subspace” of the parameters: the subspace is a particular kind of low-dimensional structure that can be exploited in inference. This project will introduce the use of an active subspace into a sequential Monte Carlo framework.
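A standard way to estimate an active subspace is to eigendecompose the average outer product of the model's gradients; the leading eigenvectors span the directions along which the model actually varies. The sketch below uses a deliberately one-dimensional toy function, and the function, dimensions and sample sizes are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # parameter dimension (illustrative)

# Toy model f(theta) = sin(w @ theta): it depends on theta only through
# w @ theta, so the active subspace is one-dimensional, spanned by w.
w = np.array([1.0, 2.0, 0.0, 0.0, -1.0])
grad = lambda theta: np.cos(w @ theta) * w

# Estimate C = E[grad f grad f^T] by Monte Carlo, then eigendecompose.
thetas = rng.standard_normal((2000, d))
grads = np.array([grad(t) for t in thetas])
C = grads.T @ grads / len(thetas)
eigvals, eigvecs = np.linalg.eigh(C)   # ascending eigenvalue order

print(eigvals[::-1][:2])  # one dominant eigenvalue, the rest near zero
```

Inference can then be run in the low-dimensional coordinates defined by the dominant eigenvectors, which is the structure a sequential Monte Carlo sampler could exploit.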
Dr Ioannis Kosmidis
High-dimensional logistic regression
Logistic regression is one of the most frequently used models in statistical practice, both for inferring covariate effects on probabilities and for constructing classifiers. One of the reasons for its widespread use is that likelihood-based inference and prediction are easy to carry out, and are asymptotically valid under assumptions such as the number of model parameters p remaining fixed as the number of observations n grows.
Nevertheless, that validity is known to be lost under more realistic assumptions like p/n → κ ∈ (0, 1). An increasing amount of research is now focusing on developing methodology that can recover the performance of estimation and inference from logistic regression. This project will initially compare, through simulation experiments, some recent proposals for improved estimation and inference in logistic regression under assumptions that match the expectations that modern practice sets. Then, we will attempt to derive new computationally-attractive estimators and test statistics that work well in cases like p/n → κ ∈ (0, 1).
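The phenomenon is straightforward to reproduce by simulation. The sketch below fits a logistic regression by Newton-Raphson with p/n = 0.2; the dimensions, signal strengths and random design are all illustrative assumptions, and the tiny ridge term is a numerical convenience rather than part of the statistical model.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 200               # p/n = 0.2: the "proportional" regime
beta = np.zeros(p)
beta[:10] = 0.5                # a few genuine signals (illustrative)

X = rng.standard_normal((n, p))
prob = 1.0 / (1.0 + np.exp(-X @ beta))
y = rng.binomial(1, prob)

# Maximum likelihood via Newton-Raphson (IRLS); a tiny ridge keeps the
# Hessian invertible and the clip avoids overflow in exp.
b = np.zeros(p)
for _ in range(30):
    mu = 1.0 / (1.0 + np.exp(-np.clip(X @ b, -30, 30)))
    H = X.T @ (X * (mu * (1.0 - mu))[:, None]) + 1e-6 * np.eye(p)
    b += np.linalg.solve(H, X.T @ (y - mu))

# Classical fixed-p theory says b is nearly unbiased; in the p/n -> kappa
# regime the signal coordinates are systematically inflated instead.
print(np.mean(b[:10]) / 0.5)
```

Comparing the average estimated signal to its true value across repeated simulations is exactly the kind of experiment the first phase of the project would run, before turning to corrected estimators.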
Item-response theory models and politics: How liberal are the members of the US House?
Item-response theory (IRT) models are a core tool in psychometric studies. They can be used to learn about the difficulty of tests and the ability or, more generally, attitude of individuals towards those tests. In a parliamentary setting, we could think of the matters that are presented for vote as tests, and the votes of the members as responses to those tests. Then we can start asking questions like i) how "liberal" each of the members is; ii) how does the liberality of each member change over time; iii) how do the various social and economic matters cross or respect party lines and when; and so on.
This project aims to develop dynamic extensions of IRT models that are suitable for answering such questions. The models will be fit to more than half a century's worth of voting records in the US House of Representatives using cutting-edge Bayesian and frequentist computational frameworks for statistical modelling, such as Stan and TMB.
The ultimate goal is to be able to define "liberality scales" that can be used to visualise the members' movements towards more liberal or more conservative positions over time; predict such moves; and identify groups of similar matters and members.
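A static (non-dynamic) version of the kind of model involved can be sketched as a simple ideal-point data generator: each member has a position on a latent scale, each bill has a discrimination and a difficulty, and votes are Bernoulli draws. Everything below, including the sizes and the parameterisation, is an illustrative assumption rather than the model this project would develop.

```python
import numpy as np

rng = np.random.default_rng(7)
members, bills = 40, 100  # illustrative sizes

# Ideal-point (IRT-style) model: member i votes "yea" on bill j with
# probability sigmoid(a_j * x_i - b_j).
x = rng.standard_normal(members)   # member ideal points ("liberality")
a = rng.standard_normal(bills)     # bill discrimination
b = rng.standard_normal(bills)     # bill difficulty

p = 1.0 / (1.0 + np.exp(-(np.outer(x, a) - b)))
votes = rng.binomial(1, p)         # members x bills matrix of 0/1 votes
print(votes.shape)
```

The dynamic extension the project targets would let each `x_i` evolve over time, so that a member's estimated position can drift across congresses.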
Dr Simon Spencer
Simulations for the whole family (of quasi-stationary distributions)

If a Markov process has an absorbing state (reached with probability one in finite time) then the stationary distribution is boring – all of the mass falls on the absorbing state. However, if we condition on the process not yet having reached the absorbing state, then a so-called quasi-stationary distribution may exist. In fact, there can be infinitely many quasi-stationary distributions for the same process. The birth-death process is a relatively simple model with an infinite family of quasi-stationary distributions. One of them is straightforward: a so-called “low energy” distribution with finite mean; all of the others are more exotic “high energy” distributions with infinite mean. In this project we will look for ways of simulating from the quasi-stationary distributions of the birth-death process, and from the “high energy” distributions in particular. Then we will look to apply these simulation techniques to more complex models in which the family of quasi-stationary distributions is currently unknown. This project will involve programming in the statistical programming language R.

Key reference: Adam Griffin, Paul A. Jenkins, Gareth O. Roberts and Simon E.F. Spencer (2017). Simulation from quasi-stationary distributions on reducible state spaces. Advances in Applied Probability, 49 (3), 960-980.
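A naive way to approximate the “low energy” quasi-stationary distribution is to run many copies of the chain and keep only those that have not yet been absorbed. The sketch below does this for a discrete-time birth-death chain, written in Python purely for illustration (the project itself would use R); the rates, horizon and starting state are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def bd_step(n, lam=0.9, mu=1.0):
    # Subcritical birth-death chain on {0, 1, 2, ...}; 0 is absorbing.
    # From n > 0, jump up with probability lam / (lam + mu), else down.
    if n == 0:
        return 0
    return n + 1 if rng.random() < lam / (lam + mu) else n - 1

# Run many chains for T steps and keep only the unabsorbed ones; their
# empirical distribution approximates the "low energy" quasi-stationary
# distribution (the "high energy" ones need cleverer schemes).
T, chains = 200, 5000
state = np.full(chains, 5)
for _ in range(T):
    state = np.array([bd_step(int(n)) for n in state])

survivors = state[state > 0]
print(len(survivors), survivors.mean())
```

This brute-force conditioning wastes every absorbed trajectory, which is one reason the project looks for dedicated simulation techniques, particularly for the infinite-mean “high energy” members of the family.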