Integrated Masters dissertation
During the 4th year, MMathStat, MMORSE, and MSci Data Science students work on a research project with the support of a member of the lecturing staff. This allows students to synthesise, apply and extend the knowledge and skills developed during the taught component of the course and to demonstrate mastery of some elements of Statistics and Data Science. The wide range of topics reflects the broad spectrum of interests of academics and, in some cases, suggestions brought forward by students or external collaborators.
What students say
"During the project, not only you can explore the area of your own interest, but also it feels great to build something upon the knowledge and skills you gained from the last three years. Besides, it is really a rewarding experience to meet regularly with your project supervisor and peers. You can learn a lot from them. Moreover, although I was nervous before the presentation, I found it quite exciting when I finished. It is indeed a challenging process where you can really improve yourself. Overall, doing the project was one of the most valuable experiences during my university time."
Anbang Du, MMORSE 2021
Examples of past student projects
Diversity of a set of items consumed by users with applications to streamed music
Ollie Rosoman (4th year Integrated Masters Dissertation, BSc MMORSE)
Extending concepts such as entropy to establish mathematical measures for diversity in Spotify playlists
The popularity of music playlists among listeners has been associated with levels of diversity with respect to mood, genre and other characteristics. Past approaches to the measurement of diversity of a set of items focussed on one or two of three important aspects: the cardinality, the evenness of the distribution, and the similarity between the items.
This report studies the advantages and disadvantages of key existing diversity measures, including Richness, entropy-based measures (Shannon and Rényi), Gini-Simpson index, Hill numbers, Sharma-Mittal generalisation, and Generalist-specialist score. It further proposes a number of novel approaches which prove very compelling.
One of our proposed scores, the trio diversity measure, provides a unification of the three key aspects of diversity, performing excellently on a large music activity dataset when compared to popular existing measures.
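As a rough illustration of some of the existing measures named above, the following sketch computes richness, Shannon entropy, the Gini-Simpson index and Hill numbers for a list of category labels. The genre labels and playlist names are hypothetical, and the project's own trio diversity measure is not reproduced here.

```python
import math
from collections import Counter


def proportions(items):
    """Relative frequency of each distinct category in `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def richness(items):
    """Number of distinct categories (cardinality only)."""
    return len(set(items))


def shannon_entropy(items):
    """Shannon entropy in nats; sensitive to cardinality and evenness."""
    return -sum(p * math.log(p) for p in proportions(items))


def gini_simpson(items):
    """Probability that two draws with replacement differ in category."""
    return 1.0 - sum(p * p for p in proportions(items))


def hill_number(items, q):
    """Hill number of order q (the 'effective number' of categories)."""
    if abs(q - 1.0) < 1e-12:
        # The order-1 Hill number is the limit q -> 1: exp(Shannon entropy).
        return math.exp(shannon_entropy(items))
    return sum(p ** q for p in proportions(items)) ** (1.0 / (1.0 - q))


# Hypothetical genre labels for two 12-song playlists.
even_playlist = ["rock", "jazz", "pop", "folk"] * 3
skewed_playlist = ["rock"] * 9 + ["jazz", "pop", "folk"]

for name, playlist in [("even", even_playlist), ("skewed", skewed_playlist)]:
    print(name, richness(playlist), round(shannon_entropy(playlist), 3),
          round(gini_simpson(playlist), 3), round(hill_number(playlist, 2), 3))
```

Both playlists have the same richness, so only the measures that account for evenness tell them apart. None of these measures uses similarity between items, the third aspect mentioned above.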
The quantification of diversity is a difficult challenge, as individuals will often disagree about which factors are the most relevant and often have very different objectives. It was deemed important, then, that the context be given careful consideration, which was partly why implementation on real music streaming data was carried out. The Taste Profile subset provides real music streaming data from an undisclosed source and contains over 1 million unique users and around 385,000 unique songs.
 Anderson et al (2020), Algorithmic Effects on the Diversity of Consumption on Spotify, WWW’20: Proceedings of The Web Conference 2020: 2155–2165,  Chao et al (2014), Ecological monographs 84, 45–67,  Bertin-Mahieux et al (2011), The Million Song Dataset, Proceedings of the 12th International Conference on Music, Information.
Bias correction in citizen science data repositories
Vidoushee Jogarah (4th year Integrated Masters Dissertation, BSc MMORSE)
Developing statistical models to account for sampling biases due to factors such as weather and animal behaviour in UK butterfly data
Ecological data collected by volunteers, also known as “citizen science” data, has become an important tool for scientific research and environmental policing globally. However, without a fixed sample design, the data collected can be subject to recording biases. This, in turn, affects the conclusions drawn from analysis of such data and hinders the ability to obtain robust observations regarding important trends in ecological data.
This project builds upon the finding that the occupancy detection model, a Bayesian hierarchical model described by the flowchart, is robust to four types of sampling biases: (i) uneven recording intensity over time, (ii) bias in visits across sites, (iii) uneven recording effort, and (iv) uneven probability of detecting a species.
The aim of the project is to build a model that is as robust as the occupancy detection model but uses abundance rather than presence-absence data to detect trends in occupancy.
To do so, a large real-life citizen science dataset containing butterfly count data over 43 years across more than 3000 sites in the UK was studied. Data was then simulated to resemble this dataset, with non-random adjustments to simulate the different scenarios of recording bias as well as a 30% decline in abundance. The model constructed is evaluated by its ability to detect this decline from observed abundance in each biased scenario.
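The simulation idea can be sketched in a few lines. The example below is a toy version under strong assumptions that are not taken from the project: counts are Poisson, the decline is linear, all sites share one mean, and the only bias scenario shown is recording intensity that falls over time. All function names and parameter values are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's method (fine for small means)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_counts(n_sites=200, n_years=10, base_mean=20.0, decline=0.30,
                    visit_prob=lambda year: 1.0, seed=1):
    """Return (site, year, count) records for the visits that were made."""
    rng = random.Random(seed)
    records = []
    for year in range(n_years):
        # Mean abundance falls linearly, reaching `decline` in the last year.
        mean = base_mean * (1.0 - decline * year / (n_years - 1))
        for site in range(n_sites):
            # A site only yields a record if it is visited that year.
            if rng.random() < visit_prob(year):
                records.append((site, year, poisson(rng, mean)))
    return records


def yearly_total(records, year):
    return sum(c for _, y, c in records if y == year)


def yearly_mean(records, year):
    counts = [c for _, y, c in records if y == year]
    return sum(counts) / len(counts)


unbiased = simulate_counts()
# Assumed bias scenario: visits become rarer as the years go by.
fewer_visits = simulate_counts(visit_prob=lambda year: 1.0 - 0.05 * year)

for label, recs in [("unbiased", unbiased), ("fewer visits", fewer_visits)]:
    print(label,
          round(yearly_total(recs, 9) / yearly_total(recs, 0), 2),
          round(yearly_mean(recs, 9) / yearly_mean(recs, 0), 2))
```

In this toy setting the raw yearly totals overstate the decline when visits become rarer, while the mean count per visit does not, because the missing visits are random with respect to abundance. Non-random adjustments of the kind described above are harder to correct for, which is the motivation for a model-based approach.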
Isaac et al (2014), Statistics for citizen science: extracting signals of change from noisy ecological data, Methods in Ecology and Evolution 5(10), 1052–1060, Botham et al (2020), United Kingdom Butterfly Monitoring Scheme: site indices 2019.
Ranking test match bowlers
Gabriel Musker (4th year Integrated Masters Dissertation, BSc MMathStat)
Using Bayesian computational methods to compare bowlers’ "true abilities" on a level playing field across eras
Test cricket is a sport that lends itself to statistical analysis better than almost any other, and recent advances in data collection have led to a boom in in-depth, micro-analyses of players and teams. However, with the sport going through a huge number of changes in its 145-year history, a great debate still rages on: is it fair to compare players across eras, and if so, what adjustments need to be made?
This project applies Bayesian computational models to historical Test match data to effectively evaluate the difficulty of conditions surrounding bowlers’ performances throughout history, creating a baseline on which to compare bowlers from different eras and answer that very question.
A model for re-contextualising a batsman’s average was first proposed by Boys et al (2018). Modelling “true ability”, they considered several factors, notably the quality of the opposition and the difficulty of the decade of play (which has varied significantly due to changes in the laws of the game and pitch preparation methods, amongst other things).
This paper builds on that work, modelling bowlers’ performances by considering both runs conceded and wickets taken, as well as the aforementioned contextual variables and others. Factors influencing a bowling performance were evaluated using a Bayesian hierarchical model, with posterior distributions for bowlers’ “true ability” parameters estimated using Stan, a software package based on the Hamiltonian Monte Carlo algorithm.
Boys et al (2018), On the Ranking of Test Match Batsmen, Journal of the Royal Statistical Society, Series C, vol. 68, no. 1, pp. 161-179, Stan Development Team (2021), Stan Modeling Language Users Guide and Reference Manual, V2.27.
Ergodicity of limit order book (LOB) Markov chains
Harry Mantelos Sapranidis (4th year Integrated Masters Dissertation, BSc MMathStat)
Using SDEs to derive generators of Markov processes modeling LOB evolution and prove their ergodicity
A Limit Order Book (LOB) contains all the orders and essential characteristics submitted by traders and serves to connect buyers and sellers. We consider a market with three types of orders: limit orders (price at which a trader is willing to buy/sell a specific number of shares any time) and cancellation orders of these, and market orders (immediately buy/sell a certain quantity of shares). The stochastic evolution of the LOB can be modelled by continuous-time Markov processes with all the different order submissions given by Poisson processes [1, 2].
In this project, we also studied the embedded Markov chain of this process (in discrete time), only dealing with the times at which there is a change in the shape of the order book (as a result of incoming orders), as well as some interesting variants of the model, e.g. changing assumptions on the intensities of the Poisson processes for the different orders. We describe the evolution of the shape of the LOB by a stochastic differential equation driven by Poisson processes and use this to derive the generator/transition operator, which infinitesimally characterises the movement of the Markov process/embedded Markov chain.
Using stochastic stability theory (cf. Meyn et al), we use our previous results to prove a very important property of the models: ergodicity. This ensures that our process will eventually have a particular distribution, regardless of the initial distribution. What is more, we prove that this stabilization happens exponentially fast.
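What ergodicity means in practice can be illustrated numerically on a much simpler chain than the one studied in the project. The sketch below assumes a single price level whose queue length grows with limit orders (rate lam) and shrinks with market orders (rate mu) and cancellations (rate theta per resting order), truncated at n_max and turned into a discrete-time chain by uniformisation. All of these modelling choices, names and parameter values are assumptions made for illustration, and the code demonstrates convergence rather than proving it.

```python
def transition_matrix(lam, mu, theta, n_max):
    """Uniformised transition matrix for the queue length at one price level."""
    total = lam + mu + theta * n_max  # upper bound on the total event rate
    size = n_max + 1
    P = [[0.0] * size for _ in range(size)]
    for n in range(size):
        up = lam / total if n < n_max else 0.0            # limit order arrives
        down = (mu + theta * n) / total if n > 0 else 0.0  # market order or cancellation
        if n < n_max:
            P[n][n + 1] = up
        if n > 0:
            P[n][n - 1] = down
        P[n][n] = 1.0 - up - down  # no change in the queue at this step
    return P


def step(dist, P):
    """One step of the chain applied to a distribution (row vector times P)."""
    size = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tv_trajectory(P, p, q, n_steps):
    """Total variation distance between two evolving distributions."""
    distances = [total_variation(p, q)]
    for _ in range(n_steps):
        p, q = step(p, P), step(q, P)
        distances.append(total_variation(p, q))
    return distances


n_max = 30
P = transition_matrix(lam=1.0, mu=0.5, theta=0.2, n_max=n_max)
empty_book = [1.0] + [0.0] * n_max  # start with no resting orders
full_book = [0.0] * n_max + [1.0]   # start with the queue at its cap
distances = tv_trajectory(P, empty_book, full_book, 1000)
print([f"{d:.2e}" for d in distances[::250]])
```

The two starting points are as far apart as possible, yet the distance between the resulting distributions shrinks towards zero, which is the behaviour that ergodicity guarantees; the roughly geometric rate of shrinkage corresponds to the exponential speed mentioned above.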
Abergel et al, Limit Order Books, Physics of Society: Econophysics, CUP, 2016, Cont et al, A stochastic model for order book dynamics, Operations Research, 58:549-563, 06 2010, Meyn et al, Markov Chains and Stochastic Stability, Cambridge Math Library, CUP, 2009.
Shuffling algorithms and users’ perceptions of randomness with application to streamed music
Anbang Du (4th year Integrated Masters Dissertation, BSc MMORSE)
Defining randomness for playlists using concepts from card shuffling and applying runs tests to Spotify data
The shuffle play option in music streaming apps like Spotify is meant to provide a feeling of randomness to users. If the order of a list of songs feels random to a user, are the songs in a truly random order? It is known that the subjective perception of randomness does not necessarily correspond to the mathematical concept of randomness. The first part of this project focusses on the meaning of true randomness in the sense of probability theory. We embed this in a discussion of card shuffling theory, including the top-in-at-random shuffle and the riffle shuffle.
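The two shuffles mentioned above can be written down directly. The sketch below assumes the standard definitions: one step of top-in-at-random moves the top card to a uniformly chosen position, and the riffle follows the Gilbert-Shannon-Reeds model (binomial cut, then cards dropped from each packet with probability proportional to its size). The function names are illustrative.

```python
import random


def top_in_at_random(deck, rng):
    """Remove the top card and reinsert it at a uniformly chosen position."""
    deck = list(deck)
    top = deck.pop(0)
    deck.insert(rng.randrange(len(deck) + 1), top)
    return deck


def riffle_shuffle(deck, rng):
    """One riffle under the Gilbert-Shannon-Reeds model."""
    n = len(deck)
    # Cut the deck at a Binomial(n, 1/2) position.
    cut = sum(rng.random() < 0.5 for _ in range(n))
    left, right = list(deck[:cut]), list(deck[cut:])
    shuffled = []
    while left or right:
        # Drop the next card from a packet with probability
        # proportional to the number of cards it still holds.
        if rng.random() < len(left) / (len(left) + len(right)):
            shuffled.append(left.pop(0))
        else:
            shuffled.append(right.pop(0))
    return shuffled


rng = random.Random(2021)
deck = list(range(52))
once = riffle_shuffle(deck, rng)
seven_times = deck
for _ in range(7):
    seven_times = riffle_shuffle(seven_times, rng)
moved_top = top_in_at_random(deck, rng)
```

A single riffle only interleaves two packets, so the deck is still far from uniformly random after one pass; how many repetitions are needed is the kind of question that card shuffling theory answers.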
The second part of this project incorporates the idea of randomness in the analysis of the Million Playlist Dataset introduced in the RecSys Challenge 2018. The collective distributions of music features of the first several songs in all playlists do not exhibit any particular pattern, which raises the question of whether they may occur at random. This motivates the usage of the runs test to investigate dependencies between items in individual playlists. The theories and applications of runs distribution in the binary case and the k-category extension of runs test are discussed in this section.
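For the binary case, the runs test can be sketched as follows. This is the Wald-Wolfowitz runs test with a normal approximation, which is an assumption about the variant used; the k-category extension is not shown. The encoding of a playlist as a string of "H" and "L" symbols (for example, each song's tempo above or below the playlist median) is hypothetical.

```python
import math


def runs_test(sequence):
    """Two-sided Wald-Wolfowitz runs test for a two-valued sequence.

    Returns (number of runs, z statistic, p-value), using the normal
    approximation to the distribution of the number of runs.
    """
    values = sorted(set(sequence))
    if len(values) != 2:
        raise ValueError("sequence must contain exactly two distinct values")
    n1 = sum(1 for x in sequence if x == values[0])
    n2 = len(sequence) - n1
    n = n1 + n2
    # A new run starts wherever two neighbouring items differ.
    runs = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
    mean = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    z = (runs - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return runs, z, p_value


alternating = "HL" * 10            # too many runs to look random
clustered = "H" * 10 + "L" * 10    # too few runs to look random
print(runs_test(alternating))
print(runs_test(clustered))
```

Both example sequences are rejected at conventional levels, one for having too many runs and the other for having too few. For short playlists the normal approximation is rough and the exact runs distribution would be preferable.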
In the final part of this project we address the multiple testing problem caused by simultaneous application of the runs test to nearly 1000 playlists. Correction methods include the Bonferroni correction and control of the False Discovery Rate (FDR). We prove that the Benjamini-Hochberg procedure controls the FDR.
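The two corrections can be sketched in a few lines. The p-values below are made up for illustration and are not results from the project; the function names are likewise illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the FDR at level alpha
    for independent tests)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank
    # Reject the hypotheses with the `cutoff` smallest p-values.
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Made-up p-values standing in for runs tests on ten playlists.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(p_values)), sum(benjamini_hochberg(p_values)))
```

On these values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, reflecting the usual trade-off: controlling the FDR is a weaker guarantee than controlling the family-wise error rate, and so retains more power when many tests are run at once.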
 Bar-Hillel and Wagenaar, The perception of randomness, Advances in applied mathematics 12.4 (1991),  Aldous and Diaconis, Shuffling cards and stopping times, in The Amer. Mathem. Monthly 93.5 (1986),  Chen et al, Recsys Challenge 2018: automatic music playlist continuation, Proc. of the 12th ACM Conf. on Recommender Systems, Vancouver, BC, Canada: Assoc. for Computing Machinery, 2018, https://doi.org/10.1145/3240323.3240342.
Supervised machine learning of game-styles in tennis
Nicolai Williams (4th year Integrated Masters Dissertation, BSc MMORSE)
Comparing several clustering techniques for characterising and classifying game-style in tennis
Game-styles in tennis are characteristics that are evident to both players and tennis professionals. Players have different strengths and utilise them in customised strategies. Coaches' attempts to classify players have resulted in lists of player game-styles, but so far these have been based on expert judgement rather than statistical methods.
This project made use of LTA point-by-point match data to characterise players using summary metrics, which were evaluated using discrimination and stability meta-metrics originally designed for NBA player data, ensuring sufficient discriminatory power. The refined summary metrics were firstly implemented and validated as part of an unsupervised learning step based on K-means (shown in figure), K-medoids and Hierarchical Clustering algorithms to see what natural groupings would form and how many clusters would best split the data. Secondly, they served to train a classifier to see whether it is possible to accurately classify the game-style of a new tennis player not used to train the classifier.
We showed that game-styles defined by the LTA are somewhat validated by our statistical analysis. Patterns in the clustering analysis and the numbers of clusters identified show a clear resemblance to the LTA game-styles. In particular, the All Court Player game-style characterises a cluster under all three clustering algorithms, indicating that this game-style is not only extremely appropriate, but also well allocated.
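To give a flavour of the clustering step, here is a minimal K-means sketch (Lloyd's algorithm) on made-up data. The two summary metrics, the player values and the deterministic farthest-first initialisation are all assumptions for illustration; they are not the project's metrics or settings, and in practice the metrics would usually be standardised first.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centres(points, k):
    """Deterministic start: take the first point, then repeatedly add the
    point that is farthest from the centres chosen so far."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centres)))
    return centres


def k_means(points, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centre updates."""
    centres = farthest_first_centres(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres


# Hypothetical players: (net-approach rate, mean rally length in shots).
players = [(0.10, 6.1), (0.12, 5.8), (0.09, 6.5),
           (0.45, 3.0), (0.50, 2.7), (0.48, 3.2)]
labels, centres = k_means(players, k=2)
print(labels)
```

Choosing the number of clusters, and comparing the result with K-medoids and hierarchical clustering, would be done on top of a routine like this one.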
 Franks, Alexander et al, Meta-Analytics: Tools for Understanding the Statistical Properties of Sports Metrics, 2016.