
Research internship projects

The projects below illustrate the type of project you might work on under either the URSS scheme or the Warwick Statistics Internship. This list is a small subset of the projects on offer. If you are interested in a project below, please contact the relevant supervisor. For other projects, please browse our list of academic staff and take a look through their research interests. Once you have identified potential supervisors, please contact them to discuss projects. URSS applicants must apply with a specific project and supervisor. This is not essential for applicants to the Warwick Statistics Internship, but it is advantageous to have discussed projects (informally) with a prospective supervisor before you apply.

Dr Richard Everitt
Approximate Bayesian computation for individual based models
Individual based models (IBMs) are used in ecology to model population-level behaviour via the behaviour of individual organisms. These computer models are sometimes deterministic, sometimes stochastic, and usually have unobserved parameters. These parameters must be chosen appropriately (“calibrated”) using observed data, so that the model can accurately represent the real world. This calibration can, in theory, be carried out using a technique known as approximate Bayesian computation (ABC). ABC works by simulating from the model for different parameters, and choosing the parameters whose simulations are good matches to the observed data. In practice, however, the large computational cost of simulating from an IBM limits the use of standard ABC techniques. This project will investigate the use of recent approaches, such as Bayesian optimisation ABC, for calibrating IBMs.
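
To make the idea concrete, here is a minimal rejection-ABC sketch in R under toy assumptions: simulate_ibm is a hypothetical stand-in for an expensive simulator, the sample mean serves as the summary statistic, and the prior and tolerance are arbitrary illustrative choices.

    # Rejection ABC on a toy model (illustrative sketch only).
    set.seed(1)
    simulate_ibm <- function(theta) rpois(50, lambda = theta)  # stand-in for a costly IBM
    summary_stat <- function(y) mean(y)

    y_obs <- simulate_ibm(4)          # pretend observed data
    s_obs <- summary_stat(y_obs)
    eps   <- 0.2                      # tolerance: how close a match must be

    theta    <- runif(10000, 0, 10)   # draws from a uniform prior
    accepted <- vapply(theta, function(t) {
      abs(summary_stat(simulate_ibm(t)) - s_obs) < eps
    }, logical(1))
    posterior_sample <- theta[accepted]  # approximate posterior draws

In Bayesian optimisation ABC the expensive simulator is called far fewer times, with a surrogate model guiding where to evaluate next; the rejection scheme above is only the baseline it improves upon.
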
Estimating bowler ability in cricket
The bowling average (number of runs conceded divided by number of wickets) is the most commonly used method for evaluating the effectiveness of bowlers in cricket. However, a number of variables affect this average, such as the quality of the opposition and the country in which the matches are played. This makes it difficult to compare the quality of bowlers from different countries and from different eras. This project will adapt recent work on estimating the quality of batsmen in Test cricket, and instead examine the quality of bowlers in Test cricket.
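
As a worked example of the basic statistic (with made-up career figures), the calculation in R is just:

    # Bowling average = runs conceded / wickets taken (lower is better).
    runs    <- c(A = 2660, B = 2400)   # hypothetical career runs conceded
    wickets <- c(A = 100,  B = 80)     # hypothetical career wickets
    runs / wickets                     # A: 26.6, B: 30.0

The raw average makes no adjustment for opposition quality or playing conditions, which is exactly the gap the project addresses.
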
Active subspaces using sequential Monte Carlo
Many models of complex processes have parameters that cannot be fully identified from the data. When performing Bayesian inference on such a model, it can simplify the problem to explore only the “active subspace” of the parameters: a low-dimensional subspace along which the model output varies most, a structure that can be exploited in inference. This project will introduce the use of an active subspace into a sequential Monte Carlo framework.
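
As a rough illustration of the underlying idea (separate from the sequential Monte Carlo machinery), an active subspace is often estimated from the eigendecomposition of a Monte Carlo average of gradient outer products. The toy R sketch below uses a function that genuinely varies in only one direction.

    # Estimating an active subspace for a toy 5-dimensional function
    # (illustrative only; the project would embed this idea in SMC).
    set.seed(1)
    d <- 5
    f_grad <- function(x) {
      # f(x) = exp(0.7*x1 + 0.3*x2) varies only in a one-dimensional subspace
      c(0.7, 0.3, 0, 0, 0) * exp(0.7 * x[1] + 0.3 * x[2])
    }
    X <- matrix(rnorm(2000 * d), 2000, d)   # parameter draws
    G <- t(apply(X, 1, f_grad))             # gradient at each draw
    C <- crossprod(G) / nrow(G)             # estimate of E[grad f grad f^T]
    eig <- eigen(C, symmetric = TRUE)
    eig$values                              # sharp drop after the first eigenvalue
    W1  <- eig$vectors[, 1]                 # basis for the active subspace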

Prof. Ioannis Kosmidis
High-dimensional logistic regression
Logistic regression is one of the most frequently used models in statistical practice, both for inferring covariate effects on probabilities and for constructing classifiers. One reason for its widespread use is that likelihood-based inference and prediction are easy to carry out, and are asymptotically valid under assumptions such as the number of model parameters p being fixed relative to the number of observations n.
Nevertheless, that validity is known to be lost under more realistic assumptions such as p/n → κ ∈ (0, 1). An increasing amount of research now focuses on developing methodology that recovers valid estimation and inference for logistic regression in such regimes. This project will initially compare, through simulation experiments, some recent proposals for improved estimation and inference in logistic regression under assumptions that better reflect modern practice. We will then attempt to derive new, computationally attractive estimators and test statistics that work well in cases such as p/n → κ ∈ (0, 1).
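
A small simulation makes the failure easy to see. The sketch below (with arbitrary sample size, κ and signal strength) fits the ordinary logistic MLE in a p/n → κ regime, where estimates of the non-null coefficients are systematically inflated.

    # Logistic MLE when p/n = kappa (a quick sketch, not a rigorous study).
    set.seed(1)
    n <- 1000; kappa <- 0.2; p <- n * kappa
    beta <- c(rep(2, p / 2), rep(0, p / 2))            # true coefficients
    X <- matrix(rnorm(n * p, sd = 1 / sqrt(n)), n, p)  # scaling common in this literature
    y <- rbinom(n, 1, plogis(X %*% beta))
    fit <- glm(y ~ X - 1, family = binomial)
    mean(coef(fit)[1:(p / 2)])   # tends to noticeably exceed the true value 2
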
Item-response theory models and politics: How liberal are the members of the US House?
Item-response theory (IRT) models are a core tool in psychometric studies. They can be used to learn about the difficulty of tests and the ability or, more generally, the attitude of individuals towards those tests. In a parliamentary setting, we can think of the matters presented for a vote as tests, and the votes of the members as responses to those tests. We can then ask questions such as: i) how "liberal" each member is; ii) how each member's liberality changes over time; iii) which social and economic matters cross or respect party lines, and when; and so on.
This project aims to develop dynamic extensions of IRT models suitable for answering such questions. The models will be fitted to more than half a century's worth of voting records from the US House of Representatives using cutting-edge Bayesian and frequentist computational frameworks for statistical modelling, such as Stan (https://mc-stan.org) and TMB (http://tmb-project.org). The ultimate goal is to define "liberality scales" that can be used to visualise members' movements towards more liberal or more conservative positions over time; to predict such moves; and to identify groups of similar matters and members.
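
As a hedged sketch of the static (non-dynamic) starting point, a one-dimensional ideal-point IRT model says that member i votes "yes" on matter j with probability plogis(alpha_j + beta_j * x_i), where x_i is the member's latent position. The R fragment below simulates such data and writes down the log-likelihood that a framework like Stan or TMB would work with; the names and dimensions are illustrative only.

    # One-dimensional ideal-point IRT model on simulated toy data.
    set.seed(1)
    n_members <- 100; n_votes <- 50
    x     <- rnorm(n_members)     # latent "liberality" of each member
    alpha <- rnorm(n_votes)       # intercept of each matter
    beta  <- rnorm(n_votes)       # discrimination of each matter
    eta   <- outer(x, beta) + matrix(alpha, n_members, n_votes, byrow = TRUE)
    votes <- matrix(rbinom(n_members * n_votes, 1, plogis(eta)),
                    n_members, n_votes)

    # Log-likelihood in the latent positions and item parameters:
    loglik <- function(x, alpha, beta, votes) {
      eta <- outer(x, beta) + matrix(alpha, nrow(votes), ncol(votes), byrow = TRUE)
      sum(dbinom(votes, 1, plogis(eta), log = TRUE))
    }

The dynamic extension would let each x_i evolve over time, which is where the project's modelling work begins.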

Dr Martyn Parker
Learning analytics and natural language processing

Learning analytics aims to use data about learning and its contexts to optimise education and the spaces in which it occurs. As a field, it intersects data science, statistics, human-centred design, and educational research. Typical methodologies include:

  1. Descriptive analytics. Use data aggregation and mining to evaluate trends and metrics over time (a minimal sketch appears below).
  2. Diagnostic analytics. Provide insight into why outcomes occurred.
  3. Prescriptive analytics. Provide recommendations based on those outcomes.

Data sources can be both quantitative and qualitative.
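
As a minimal sketch of the descriptive step, using a synthetic data set of the illustrative kind this project works with, one might aggregate assessment scores by week to expose a trend over time:

    # Descriptive analytics on synthetic (non-"live") data: a weekly trend.
    set.seed(1)
    d <- data.frame(
      week  = rep(1:10, each = 30),
      score = pmin(100, pmax(0, rnorm(300, mean = 55 + rep(1:10, each = 30), sd = 12)))
    )
    aggregate(score ~ week, data = d, FUN = mean)   # mean score per week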

The challenges are complex. Current tools often lack theoretical rigour and focus on data presentation rather than applying formal data analysis principles to support evidence-based decision-making; diagnostic analysis in particular needs to be stronger. Furthermore, qualitative data analysis should take advantage of recent advances in natural language processing.

This project offers the opportunity to:

  1. Critically review current learning analytics tools and methodologies.
  2. Identify gaps in current diagnostic analysis approaches and tools.
  3. Apply rigorous statistical and data analysis techniques to create tools that provide sound diagnostic insight and justifiable prescriptive analytics.
  4. Apply natural language processing techniques to create new descriptive and diagnostic analytics.

Required skills:

  • Ability to analyse both quantitative and qualitative data sets using a range of statistical tools.
  • Knowledge of, or a desire to learn, natural language processing techniques and to apply them to a developing area.
  • Strong interest in the application of statistical and data analysis methods to learning.
  • Strong statistical programming skills, with attention to human-centred interaction, for example usability.

A key output will be a theoretically justified learning analytics tool. Since learning analytics data sources contain sensitive information, this project uses illustrative data sets. These data sets illustrate typical inputs into the learning analytics process but are not derived from and do not contain “live” data. Consequently, there are no GDPR or ethical considerations.

References:

K G, S., Kurni, M. (2021). Introduction to Learning Analytics. In: A Beginner’s Guide to Learning Analytics. Advances in Analytics for Learning and Teaching. Springer, Cham. https://doi.org/10.1007/978-3-030-70258-8_1

Guzmán-Valenzuela, C., Gómez-González, C., Rojas-Murphy Tagle, A. et al. (2021). Learning analytics in higher education: a preponderance of analytics but very little learning? Int J Educ Technol High Educ 18, 23. https://doi.org/10.1186/s41239-021-00258-x

Clow, D. (2012). The learning analytics cycle: closing the loop effectively. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge. ACM.

Siemens, G., & Baker, R. S. J. d. (2012, April). Learning analytics and educational data mining: towards communication and collaboration. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge (pp. 252-254). ACM.

Shinji Watanabe and Jen-Tzung Chien (2015). Bayesian Speech and Language Processing. Cambridge University Press.

Dan Jurafsky and James H. Martin (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall.

Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, Walter Kintsch (Eds) (2007). Handbook of Latent Semantic Analysis. Taylor & Francis: New York. https://doi.org/10.4324/9780203936399

CRAN (2021), CRAN Task View: Natural Language Processing, Maintainer: Fridolin Wild, Version 20th October 2021. Available online: https://cran.r-project.org/view=NaturalLanguageProcessing 

Rahil Shaiky, (2018) Gentle Start to Natural Language Processing using Python, Published in Towards Data Science, available online: https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3

Natural language processing: theory and practice

Natural language processing (NLP) combines elements of statistical theory and artificial intelligence (machine learning and deep learning) to understand the contents of documents, including the nuances of language. This project investigates both the theoretical background and the practical implementation of text-based analysis. The theory will cover data pre-processing and the different statistical models that can be used to develop language insights. Initial approaches will examine how to extract statistical insights from language data. Time permitting, there is the opportunity to look at probabilistic models used to predict the next word in a sequence, given the preceding words. Practical implementation may include developing tools that apply the theory to data (for example, web pages or downloaded data sources) to organise, analyse, and categorise the information contained in these documents. In the wider context, businesses might apply NLP to customer reviews or feedback to understand contextual elements, provide insights, detect fake reviews, and help develop solutions to key customer-facing issues.
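
As a toy illustration of the next-word idea (real work would use a large corpus, a proper tokenisation pipeline, and smoothing), a bigram model in R can be built from raw counts:

    # A toy bigram next-word model (illustrative only).
    corpus <- c("the cat sat on the mat", "the cat ate the fish")
    tokens <- unlist(strsplit(tolower(paste(corpus, collapse = " ")), "\\s+"))
    bigrams <- data.frame(w1 = head(tokens, -1), w2 = tail(tokens, -1))

    predict_next <- function(word) {
      nxt <- bigrams$w2[bigrams$w1 == word]
      if (length(nxt) == 0) return(NA_character_)
      names(sort(table(nxt), decreasing = TRUE))[1]  # most frequent continuation
    }
    predict_next("the")   # "cat" (the most common word following "the" here)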

Key outcomes may include:

  1. Review, synthesise, evaluate and report on the theory of natural language processing. This would include areas such as tokenisation, lemmatisation, parsing, segmentation, stemming, entity recognition, classification, and semantic analysis.
  2. Develop a tool that accepts as input a range of natural language data sources, processes these inputs by applying the theory from part 1, and provides flexible ways to interpret them.

The tool may also accept quantitative data alongside qualitative data that contextualises it.

Required background. You must be able to develop tools and visualisations using, for example, Python or R. You will need to know, or be able to learn, the statistical theory underpinning the modelling techniques employed.

References

Rahil Shaiky, (2018) Gentle Start to Natural Language Processing using Python, Published in Towards Data Science, available online: https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3

CRAN. (2021) CRAN Task View: Natural Language Processing, Maintainer: Fridolin Wild, Version 20th October 2021. Available online: https://cran.r-project.org/view=NaturalLanguageProcessing

Luiz Felipe de Almeida Brino. (2019) Natural Language Processing with R. Version 3rd June 2019. Available online: https://rpubs.com/LuizFelipeBrito/NLP_Text_Mining_001

Francois Chollet. (2022) Deep Learning with Python (2nd ed.). Manning

Kenneth Church (2014). Statistical Models for Natural Language Processing. In Ruslan Mitkov (ed.), The Oxford Handbook of Computational Linguistics, 2nd edn. Oxford Academic. https://doi.org/10.1093/oxfordhb/9780199573691.013.54

Shinji Watanabe and Jen-Tzung Chien (2015). Bayesian Speech and Language Processing. Cambridge University Press.

Dan Jurafsky and James H. Martin (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall.


Dr Simon Spencer
Simulations for the whole family (of quasi-stationary distributions)

If a Markov process has an absorbing state (reached with probability one in finite time) then the stationary distribution is boring: all the mass falls on the absorbing state. However, if we condition on the process not yet having reached the absorbing state, then a so-called quasi-stationary distribution may exist. In fact, there can be infinitely many quasi-stationary distributions for the same process. The birth-death process is a relatively simple model with an infinite family of quasi-stationary distributions. One is straightforward: a so-called “low energy” distribution with finite mean; all the others are more exotic “high energy” distributions with infinite mean. In this project we will look for ways of simulating from the quasi-stationary distributions of the birth-death process, and from the “high energy” distributions in particular. Then we will look to apply these simulation techniques to more complex models for which the family of quasi-stationary distributions is currently unknown. This project will involve programming in the statistical programming language R.
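
As a starting point, the “low energy” distribution can be approximated by brute force: run many copies of the process and keep only those that have not yet been absorbed. A minimal R sketch (with arbitrary rates and time horizon) follows; naive conditioning of this kind targets the minimal (“low energy”) limit, which is why the “high energy” distributions call for new techniques.

    # Naive rejection sketch for the low-energy QSD of a birth-death process.
    set.seed(1)
    lambda <- 0.8; mu <- 1.0   # birth/death rates; subcritical, so absorption at 0 is certain
    t_end  <- 15               # time at which we observe the surviving copies

    sim_bd <- function(x0) {
      x <- x0; t <- 0
      while (x > 0) {
        t <- t + rexp(1, (lambda + mu) * x)   # time to the next event
        if (t >= t_end) break                 # next event falls after the horizon
        x <- x + if (runif(1) < lambda / (lambda + mu)) 1 else -1
      }
      x
    }

    end_states <- replicate(20000, sim_bd(5))
    qsd_sample <- end_states[end_states > 0]   # condition on not yet absorbed
    table(qsd_sample) / length(qsd_sample)     # approximates the low-energy QSD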

Key reference: Adam Griffin, Paul A. Jenkins, Gareth O. Roberts and Simon E.F. Spencer (2017). Simulation from quasi-stationary distributions on reducible state spaces. Advances in Applied Probability, 49(3), 960-980.