Bayesian nonparametric inference of time series data in population genetics
Allele frequencies in population genetics are typically modelled by individual-based, discrete particle systems. As the population size is sent to infinity, these converge under appropriate rescalings to (jump) diffusion limits that are more tractable than the prelimiting models. A data set can then be viewed as a time series of observations from a path of the underlying diffusion. Recent work has established that the true dynamics of the diffusion can be consistently learned from such a time series. However, since real populations are finite, it would be more appropriate to investigate the inference problem associated with the sequence of prelimiting models, and to study the (suitably rescaled) infinite population limit of the inference problems themselves. A suitable tool is the machinery for continuously observed (jump) diffusions.
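To make the setting concrete, the following is a minimal sketch of one possible prelimiting model: a neutral, two-allele Wright-Fisher population of finite size, observed at a few time points. The choice of model, the function names `wright_fisher_path` and `observe`, and all parameter values are assumptions made for illustration and are not taken from the project description. Time is reported in units of `pop_size` generations, which is the rescaling under which this particular model is known to approach a diffusion.

```python
import random


def wright_fisher_path(pop_size, initial_freq, generations, seed=0):
    """Simulate allele frequencies in a neutral two-allele Wright-Fisher model."""
    rng = random.Random(seed)
    count = round(initial_freq * pop_size)
    path = [count / pop_size]
    for _ in range(generations):
        freq = count / pop_size
        # Each of the pop_size offspring picks its parent uniformly at random,
        # so the new allele count is Binomial(pop_size, freq).
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        path.append(count / pop_size)
    return path


def observe(path, pop_size, every):
    """Return (rescaled time, frequency) pairs, one every `every` generations."""
    return [(g / pop_size, path[g]) for g in range(0, len(path), every)]


# Hypothetical example: 200 individuals followed for 400 generations,
# observed every 50 generations (rescaled times 0, 0.25, ..., 2).
path = wright_fisher_path(pop_size=200, initial_freq=0.5, generations=400)
series = observe(path, pop_size=200, every=50)
```

The list `series` plays the role of the time series of observations; an inference method would only see these pairs, not the full path.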
Genealogies of sequential Monte Carlo algorithms
Sequential Monte Carlo methods are a widely used class of algorithms for evaluating intractable likelihood functions (among other things). They consist of propagating a weighted particle system across a number of time steps. Crucial to their success is a resampling operation, in which "successful" particles are replicated while "unsuccessful" particles are discarded. This gives rise to an ancestral tree within the particle system. Characterising the size and shape of the tree in the limit of large particle numbers is a relatively open problem, with only preliminary results in toy settings having been established. Possible projects include extending the scope of these results to more realistic scenarios, and to more esoteric models where the ancestry is formed by a network as opposed to a tree.
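The resampling step and the ancestral tree it induces can be sketched as follows. This is only an illustrative bootstrap particle filter with multinomial resampling for a toy linear Gaussian state-space model; the model, the coefficient 0.9, the observations and the function names `bootstrap_filter` and `surviving_roots` are all assumptions chosen for the example.

```python
import math
import random


def bootstrap_filter(observations, num_particles, seed=0):
    """Bootstrap particle filter that records the parent of every particle."""
    rng = random.Random(seed)
    # Assumed toy model: x_0 ~ N(0, 1), x_t = 0.9 x_{t-1} + N(0, 1), y_t ~ N(x_t, 1).
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    ancestors = []  # ancestors[t][i] = parent index of particle i after step t
    log_likelihood = 0.0
    for y in observations:
        # Weight each particle by the observation density N(y; x, 1).
        weights = [
            math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2.0 * math.pi)
            for x in particles
        ]
        log_likelihood += math.log(sum(weights) / num_particles)
        # Multinomial resampling: heavy particles are replicated, light ones die.
        parents = rng.choices(range(num_particles), weights=weights, k=num_particles)
        ancestors.append(parents)
        particles = [0.9 * particles[a] + rng.gauss(0.0, 1.0) for a in parents]
    return log_likelihood, ancestors


def surviving_roots(ancestors, num_particles):
    """Indices of initial particles that are ancestors of the final generation."""
    lineage = set(range(num_particles))
    for parents in reversed(ancestors):
        lineage = {parents[i] for i in lineage}
    return lineage


observations = [0.3, -0.1, 0.5, 0.2, 0.0, -0.4]  # hypothetical data
log_lik, ancestors = bootstrap_filter(observations, num_particles=100)
roots = surviving_roots(ancestors, num_particles=100)
```

Tracing the parent indices backwards, as `surviving_roots` does, shows how many initial particles still have descendants at the final step; the size of `roots` is one crude measure of how quickly the ancestral tree coalesces.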
Markov chain Monte Carlo on trees
Population genetic inference is often concerned with augmenting an observed set of DNA sequence data with its unobserved common ancestry, most often in order to integrate out the ancestry to evaluate the marginal likelihood of the data. This can be done (among other ways) by running a Metropolis-Hastings algorithm on the space of ancestral trees that are consistent with the observed data. Such algorithms have been used for over 20 years, but almost no theoretical performance results are available. In real-valued settings, diffusion limits of random walks make it easy to distinguish between efficient and inefficient variants of Metropolis-Hastings algorithms, which has revolutionised the utility of MCMC as a practical algorithm in many applied fields. There is good reason to think that the same could be done in genetics by working out the correct analogue of random walks and diffusion limits on spaces of trees.
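For orientation, the real-valued setting mentioned above can be illustrated with a random-walk Metropolis sampler whose proposal step is scaled with the dimension. The sketch below assumes a standard normal target, a step size of `scale / sqrt(dim)`, and the hypothetical function name `random_walk_metropolis`; it is not an algorithm on trees, only the Euclidean baseline whose tree analogue is the open question.

```python
import math
import random


def random_walk_metropolis(dim, scale, num_steps, seed=0):
    """Run random-walk Metropolis on N(0, I) and return the acceptance rate."""
    rng = random.Random(seed)
    state = [0.0] * dim
    log_target = 0.0  # log density at the origin, up to an additive constant
    step = scale / math.sqrt(dim)  # dimension-dependent scaling of the proposal
    accepted = 0
    for _ in range(num_steps):
        proposal = [x + step * rng.gauss(0.0, 1.0) for x in state]
        log_target_prop = -0.5 * sum(x * x for x in proposal)
        # Symmetric proposal, so the acceptance probability is min(1, ratio).
        accept_prob = math.exp(min(0.0, log_target_prop - log_target))
        if rng.random() < accept_prob:
            state, log_target = proposal, log_target_prop
            accepted += 1
    return accepted / num_steps


# Hypothetical comparison of a timid, a moderate and an aggressive proposal.
rates = {
    scale: random_walk_metropolis(dim=10, scale=scale, num_steps=2000)
    for scale in (0.5, 2.4, 5.0)
}
```

Very small steps are accepted often but move slowly, while very large steps are rarely accepted; diffusion-limit arguments make this trade-off quantitative for certain product-form targets, where an acceptance rate of roughly 0.234 is asymptotically optimal.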
Statistical tools for distinguishing evolutionary scenarios
Patterns in DNA sequence data carry information about the evolutionary history of the population. Observed patterns in a data set can thus be used to evaluate the plausibility of various possible histories. Two scenarios - population growth, and so-called high fecundity events in which the offspring of a single family can account for a sizable fraction of the whole population - are known to result in similar patterns in sequence data. Nevertheless, recent work has shown that these scenarios can be distinguished by relying on standard coalescent models of population genetics, and a surprisingly simple summary statistic. However, the success of the method depends on prior knowledge of the spatial structure of the population. Hence it is of interest to investigate whether standard methods for inferring population structure are accurate without knowing which of the above scenarios is correct. If so, then a two-step procedure, in which one first estimates the population structure without knowing the evolutionary scenario, and then the scenario given the structure, can obtain accurate results with minimal prior assumptions. Many other extensions of the method to more complex scenarios are also of interest.
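As an illustration of what a simple summary of sequence patterns can look like, the sketch below computes the site frequency spectrum of a small binary alignment. This is a standard summary in population genetics, but it is an assumption that it resembles the statistic referred to above; the data, the function name `site_frequency_spectrum` and the derived quantity `singleton_fraction` are hypothetical.

```python
def site_frequency_spectrum(haplotypes):
    """Count the sites at which exactly k of n sequences carry the derived allele.

    `haplotypes` is a list of n equal-length 0/1 lists (1 = derived allele).
    Entry k - 1 of the result is the count for k = 1, ..., n - 1.
    """
    n = len(haplotypes)
    spectrum = [0] * (n - 1)
    for site in zip(*haplotypes):
        derived = sum(site)
        if 0 < derived < n:  # skip sites that are not polymorphic in the sample
            spectrum[derived - 1] += 1
    return spectrum


# Hypothetical toy alignment: 4 sequences, 5 sites.
toy = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
sfs = site_frequency_spectrum(toy)          # [3, 0, 1]
singleton_fraction = sfs[0] / sum(sfs)      # 0.75
```

A scalar such as `singleton_fraction` could then be compared with its distribution under simulated data from each candidate scenario.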