
Student projects

Integrated Masters dissertation

During the 4th year, MMathStat, MMORSE, and MSci Data Science students work on a research project with the support of a member of the lecturing staff. This allows students to synthesise, apply and extend the knowledge and skills developed during the taught component of the course and to demonstrate mastery of some elements of Statistics and Data Science. The wide range of topics reflects the broad spectrum of interests of academics and, in some cases, suggestions brought forward by students or external collaborators.

What students say

"During the project, not only you can explore the area of your own interest, but also it feels great to build something upon the knowledge and skills you gained from the last three years. Besides, it is really a rewarding experience to meet regularly with your project supervisor and peers. You can learn a lot from them. Moreover, although I was nervous before the presentation, I found it quite exciting when I finished. It is indeed a challenging process where you can really improve yourself. Overall, doing the project was one of the most valuable experiences during my university time."

Anbang Du, MMORSE 2021

Third year data science projects

This is an extended, individual piece of work which forms a core element of the BSc and the MSci in Data Science. Students have the opportunity to apply practical and analytical skills in an innovative and/or creative way, to synthesise information, ideas and practices to provide a quality solution, and to evaluate that solution. They develop the ability to self-manage a significant piece of work. Students choose a topic and supervisor from the Statistics or the Computer Science department.

The project includes a progress report, an oral presentation, and an extended final project report. Students not only learn about the research topic itself, but also experience the different phases of project organisation and delivery. The work extends over about seven months, which introduces students to the management of a longer piece of work and the time management and task planning associated with this.

What students say

" Unlike most other final year modules, this project gives you the opportunity to own a piece of work that you have truly chosen yourself. I spoke to several supervisors before deciding on one and would recommend that you take the time to do this too, as this may be one of the longest pieces of work you complete during your university career! Throughout the project, I loved the freedom and flexibility with which I could structure my time. I also found the work I did for my dissertation to always be a refreshing change from most module formats. It provided ample opportunity for you to tailor the methods you use to match your desired technical skillsets. I learnt a lot working with my supervisor and found the process of working with a more experienced and knowledgeable staff member to be invaluable. I would recommend making use of their expertise and experience, as they are there to support you and want you to succeed. "

Mai-An Dang, BSc Data Science 2021

Examples of past student projects

Animal movement modelling with signatures
Callum Ellard (3rd year BSc Data Science project)
Modelling animal GPS data with stochastic analysis to feed machine learning algorithms for classifying behaviour

Figure: animal movement

Human development and expansion are causing the destruction of animal habitats and the rapid loss of species. Some of the most effective ways to protect these animals can be determined from examining their behaviour in their ecosystems.

This work explores the use of animal Global Positioning System data with the signature method [1] for a continuous d-dimensional path X, recalled below. The method has been used in other applications to summarise complex paths with little a priori knowledge about data characteristics. Machine learning tools including random forest, stochastic gradient boosting, and extreme gradient boosting were then used to classify animal behaviour based on outputs from the signature method. Model performance was evaluated using average accuracy, average kappa, and average AUC. This approach to classifying animal behaviour produced some effective models, and the results indicated that the signature method could be a valuable addition to existing methods, such as hidden Markov models.
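For reference, the signature of a continuous d-dimensional path X on an interval [a, b], as defined in [1], is the collection of iterated integrals

S(X)^{i_1,\dots,i_k}_{a,b} = \int_{a < t_1 < \cdots < t_k < b} \mathrm{d}X^{i_1}_{t_1} \cdots \mathrm{d}X^{i_k}_{t_k}, \qquad S(X)_{a,b} = \big(1,\ S(X)^{1}_{a,b}, \dots, S(X)^{d}_{a,b},\ S(X)^{1,1}_{a,b},\ S(X)^{1,2}_{a,b}, \dots\big).

Truncating the signature at a chosen level yields a finite feature vector that can be fed to the machine learning algorithms above.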

One of the case studies is Isabela, a female Galapagos tortoise of the species Chelonoidis vandenburghi on Isabela Island, with a dataset containing roughly 39,000 datapoints. The figure shows the entirety of her dataset, from September 2010 to November 2017, with the colour indicating the number of hours elapsed since the first datapoint. The travel is migratory with various stopping points, where other types of movement occur. The tortoise and vulture datasets used in the project were obtained from the openly accessible online repository for animal movement data [2].

[1] Chevyrev et al, A Primer on the Signature Method in Machine Learning, 2016, https://arxiv.org/pdf/1603.03788.pdf, [2] www.movebank.org.

Performance indicators in asymptomatic SARS-CoV-2 detection
Yingning Shen (3rd year BSc Data Science project)
Calculating COVID-19 screening scheme testing errors using iterated conditional probabilities and implementation in a web app

Figure: performance indicators for asymptomatic testing

Suitably designed screening programs can help limit infection rates in the current global COVID-19 pandemic before sufficiently many people have been vaccinated. Such schemes detect otherwise missed asymptomatic cases, allowing positively tested people to self-isolate rather than infect others. The government or employers may require participation in testing schemes involving rapid Lateral Flow Tests (LFTs) or laboratory-based PCR tests. Students may be asked to take two LFTs when returning to campus for face-to-face learning, followed by two LFTs per week. However, the potentially high testing error rates have triggered an ongoing controversy, among experts and in the public discourse, about the ethics, effectiveness, and efficiency of such schemes [1, 2].

The first task was to derive probabilistic formulas for the error rates of a few alternative testing schemes. The schemes differ by the types of tests involved and by the mechanisms through which the results of individual tests are combined. Based on these formulas, performance indicators of the schemes were evaluated across different sets of assumptions about specificity, sensitivity and infection incidence in the screened population.
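As a minimal sketch of the kind of calculation involved (illustrative only: the "positive if either of two independent LFTs is positive" rule and the parameter values are assumptions, not the schemes studied in the project), the performance indicators follow from conditional probabilities:

scheme_indicators <- function(sens, spec, prev) {
  sens2 <- 1 - (1 - sens)^2                  # scheme misses only if both tests miss
  spec2 <- spec^2                            # scheme negative only if both tests are negative
  p_pos <- sens2 * prev + (1 - spec2) * (1 - prev)
  c(sensitivity = sens2,
    specificity = spec2,
    PPV = sens2 * prev / p_pos,              # P(infected | scheme positive)
    NPV = spec2 * (1 - prev) / (1 - p_pos))  # P(not infected | scheme negative)
}
scheme_indicators(sens = 0.7, spec = 0.998, prev = 0.005)  # illustrative parameter values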

The theoretical results were used to build a web app. It can be used to compare the outcomes of different testing schemes by running simulations under a battery of alternative assumptions about the above-mentioned parameters. Such a tool can support the evaluation, design and implementation of screening schemes tailored to a given setting (e.g. an educational institution, workplace, or event).

[1] Gill et al, Mass testing for covid-19 in the UK, BMJ 2020;371:m4436, [2] Mina et al, Rethinking covid-19 test sensitivity — a strategy for containment, NEJM, vol 383, no 22, 2020

Diversity of a set of items consumed by users with applications to streamed music
Ollie Rosoman (4th year Integrated Masters Dissertation, BSc MMORSE)
Extending concepts such as entropy to establish mathematical measures for diversity in Spotify playlists


The popularity of music playlists among listeners has been associated with levels of diversity with respect to mood, genre and other characteristics. Past approaches to the measurement of diversity of a set of items focussed on one or two of three important aspects: the cardinality, the evenness of the distribution, and the similarity between the items.

This report studies the advantages and disadvantages of key existing diversity measures [2], including richness, entropy-based measures (Shannon and Rényi), the Gini-Simpson index, Hill numbers, the Sharma-Mittal generalisation, and the generalist-specialist score. It further proposes a number of novel approaches which prove very compelling.
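For orientation, two of the standard measures listed above can be written, for item proportions p_1, ..., p_S, as

H = -\sum_{i=1}^{S} p_i \log p_i \quad \text{(Shannon entropy)}, \qquad {}^{q}D = \Big(\sum_{i=1}^{S} p_i^{\,q}\Big)^{1/(1-q)} \quad \text{(Hill number of order } q \neq 1\text{)},

with richness recovered at q = 0, {}^{1}D = \exp(H) obtained as the limit q \to 1, and the Gini-Simpson index given by 1 - 1/{}^{2}D.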

One of our proposed scores, the trio diversity measure, unifies the three key aspects of diversity and performs excellently on a large music activity dataset when compared to popular existing measures.

The quantification of diversity is a difficult challenge, as individuals often disagree about which factors are the most relevant and often have very different objectives. It was therefore deemed important that the context be given careful consideration, which is partly why the measures were implemented on real music streaming data. The Taste Profile subset [3] provides listening data from an undisclosed music streaming source and contains over 1 million unique users and around 385,000 unique songs.

[1] Anderson et al (2020), Algorithmic Effects on the Diversity of Consumption on Spotify, WWW’20: Proceedings of The Web Conference 2020: 2155–2165, [2] Chao et al (2014), Ecological Monographs 84, 45–67, [3] Bertin-Mahieux et al (2011), The Million Song Dataset, Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR).

Bias correction in citizen science data repositories
Vidoushee Jogarah (4th year Integrated Masters Dissertation, BSc MMORSE)
Developing statistical models to account for sampling biases due to factors such as weather and animal behaviour in UK butterfly data


Ecological data collected by volunteers, also known as “citizen science” data, has become an important tool for scientific research and environmental policy globally. However, without a fixed sampling design, the data collected can be subject to recording biases. This, in turn, affects the conclusions drawn from analyses of such data and hinders the ability to obtain robust observations regarding important trends in ecological data.

This project builds upon the findings that the occupancy detection model, a Bayesian hierarchical model described by the flowchart, is robust to four types of sampling biases: (i) uneven recording intensity over time, (ii) bias in visits across sites, (iii) uneven recording effort, and (iv) uneven probability of detecting a species [1].
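In its simplest generic form (a sketch of the standard occupancy detection model, not the exact specification used in [1] or in this project), true occupancy and imperfect detection are separated as

z_i \sim \mathrm{Bernoulli}(\psi_i), \qquad y_{it} \mid z_i \sim \mathrm{Bernoulli}(z_i\, p_{it}),

where z_i indicates whether site i is occupied, \psi_i is the occupancy probability, y_{it} records whether the species is detected on visit t, and p_{it} is the detection probability; trends are read off the occupancy probabilities over time.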

The aim of the project is to build a model which is as robust as the occupancy detection model but which uses abundance rather than presence-absence data to detect trends in occupancy.

To do so, a large real-life citizen science dataset [2] was studied containing butterfly counting data over 43 years across more than 3000 sites in the UK. Data was then simulated to resemble this dataset, with non-random adjustments to simulate the different scenarios of recording bias as well as a 30% decline in abundance. The model constructed is evaluated by its ability to detect this decline from observed abundance in each biased scenario.

[1] Isaac et al (2014), Statistics for citizen science: extracting signals of change from noisy ecological data, Methods in Ecology and Evolution 5(10), 1052–1060, [2] Botham et al (2020), United kingdom butterfly monitoring scheme: site indices 2019.

Ranking test match bowlers
Gabriel Musker (4th year Integrated Masters Dissertation, BSc MMathStat)
Using Bayesian computational methods to compare bowlers’ "true abilities" on a level playing field across eras.


Test cricket is a sport that lends itself to statistical analysis better than almost any other, and recent advances in data collection have led to a boom in in-depth, micro-analyses of players and teams. However, with the sport going through a huge number of changes in its 145-year history, a great debate still rages on: is it fair to compare players across eras, and if so, what adjustments need to be made?

This project applies Bayesian computational models to historical Test match data to effectively evaluate the difficulty of conditions surrounding bowlers’ performances throughout history, creating a baseline on which to compare bowlers from different eras and answer that very question.

A model for re-contextualising a batsman’s average was first proposed by [1]. Modelling “true ability”, they considered several factors, notably the quality of the opposition and the difficulty of the decade of play (which has varied significantly due to changes in the laws of the game and pitch preparation methods, amongst other things).

This project builds on that work, modelling bowlers’ performances by considering both runs conceded and wickets taken, as well as the aforementioned contextual variables and others. Factors influencing a bowling performance were evaluated using a Bayesian hierarchical model, with posterior distributions for bowlers’ “true ability” parameters estimated using the Hamiltonian Monte Carlo algorithm implemented in the software package Stan [2].
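As a rough indication of the hierarchical structure involved (a sketch with assumed notation, not the dissertation’s exact model), the wickets W_{jm} taken by bowler j in match m might be modelled as

W_{jm} \sim \mathrm{Poisson}(\lambda_{jm}), \qquad \log \lambda_{jm} = \theta_j + \delta_{d(m)} + \beta^{\top} x_{jm}, \qquad \theta_j \sim \mathcal{N}(\mu, \sigma^2),

where \theta_j is the bowler’s “true ability”, \delta_{d(m)} is a decade effect and x_{jm} collects contextual covariates such as opposition quality; the posterior distributions of the \theta_j are then compared across eras.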

[1] Boys et al (2018), On the Ranking of Test Match Batsmen, Journal of the Royal Statistical Society, Series C, vol. 68, no. 1, pp. 161-179, [2] Stan Development Team (2021), Stan Modeling Language Users Guide and Reference Manual, v2.27.

Shuffling algorithms and users’ perceptions of randomness with application to streamed music
Anbang Du (4th year Integrated Masters Dissertation, BSc MMORSE)
Defining randomness for playlists using concepts from card shuffling and applying runs tests to Spotify data

Figure: characteristics of music playlists

The shuffle play option in music streaming apps like Spotify is meant to provide a feeling of randomness to users. If the order of a list of songs feels random to a user, are the songs in a truly random order? It is known that the subjective perception of randomness does not necessarily correspond to the mathematical concept of randomness [1]. The first part of this project focusses on the meaning of true randomness in the sense of probability theory. We embed this in a discussion of card shuffling theory, including the top-in-at-random shuffle and the riffle shuffle [2].

The second part of this project incorporates the idea of randomness into the analysis of the Million Playlist Dataset introduced in the RecSys Challenge 2018 [3]. The collective distributions of the music features of the first several songs in all playlists do not exhibit any particular pattern, which raises the question of whether they may occur at random. This motivates the use of the runs test to investigate dependencies between items in individual playlists. The theory and applications of the runs distribution in the binary case and the k-category extension of the runs test are discussed in this section.

In the final part of this project we address the multiple testing problem caused by the simultaneous application of the runs test to nearly 1000 playlists. Correction methods include the Bonferroni correction and the False Discovery Rate (FDR). We prove how the Benjamini-Hochberg procedure controls the FDR.
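As a small illustration of the correction step (with simulated placeholder p-values rather than the playlist data itself), the base R function p.adjust applies both methods:

set.seed(1)
pvals  <- runif(1000)                      # placeholder p-values, one runs test per playlist
p_bonf <- p.adjust(pvals, method = "bonferroni")
p_bh   <- p.adjust(pvals, method = "BH")   # Benjamini-Hochberg procedure, controls the FDR
c(bonferroni = sum(p_bonf < 0.05), BH = sum(p_bh < 0.05))  # playlists flagged under each correction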

[1] Bar-Hillel and Wagenaar, The perception of randomness, Advances in applied mathematics 12.4 (1991), [2] Aldous and Diaconis, Shuffling cards and stopping times, in The Amer. Mathem. Monthly 93.5 (1986), [3] Chen et al, Recsys Challenge 2018: automatic music playlist continuation, Proc. of the 12th ACM Conf. on Recommender Systems, Vancouver, BC, Canada: Assoc. for Computing Machinery, 2018, https://doi.org/10.1145/3240323.3240342.

Decision models on beet yield and their agri-environmental consequences
Elizabeth Potter (3rd year BSc Data Science project)
Data visualisation and multi-agent multi-step decision approaches to longitudinal agricultural data


In the recent past, biodiversity levels in the UK have been dropping as more land is being converted to agriculture. How can agricultural decision making achieve high crop yield while maintaining environmental sustainability? We are interested in the impact of weed control on both yield and biodiversity [1, 2] with a particular focus on wild pollinators and their crucial role for the ecosystem.

We develop visualisation tools that show the temporal evolution of key events in crop growth in parallel for multiple farms against a backdrop coloured by meteorological factors such as temperature (top figure) or rainfall. The project utilised the raw data from the beet experiments in the 2003 Farm-Scale Evaluations (FSE) [3] repository. Data quality assessment and interpretation are conducted keeping in mind the overall sparse availability of high-quality data in this domain.

Exploratory and correlation analysis of the variables involved is followed by the construction of multi-step decision models. Utility functions can be used to govern the optimisation of decision tasks. They can, for example, be used to quantify environmental sustainability as a function of pollinator count (bottom figure).
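A hypothetical utility of this kind (illustrative only: the weight w, the normalisation and the variable names are assumptions rather than the project’s specification) could take the form:

# Trade-off between crop yield and pollinator count; w sets the weight
# placed on environmental sustainability.
utility <- function(yield, pollinators, w = 0.5) {
  (1 - w) * yield / max(yield) + w * pollinators / max(pollinators)
}
utility(yield = c(60, 75, 80), pollinators = c(120, 60, 20))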

We build an interactive web app for users to create temporal plots according to their own specifications. Our visualisation tool can also be used for data from different domains, e.g. showing the numbers of GP appointments against temperature.

[1] Maddie Moate, “What would happen if bees went extinct?”, BBC Future, May 2014, [2] Food and Agriculture Organization of the United Nations, “Why bees matter”, 2018, [3] Woiwod et al, Farm scale evaluations of herbicide tolerant genetically modified crops, J. Applied Ecology, 40(1):2–16, 2003.

Tools for managing risks and uncertainties in big project management
Shiyu Chen (3rd year project, BSc Data Science)
Modelling and simulating the contribution of extreme risks inherent to subtasks of large projects to the overall project's value at risk


Project management is a challenging domain that project managers and related personnel struggle to master. The uniqueness of projects and the unpredictability of risks make their management difficult, and complexity and size increase the difficulties [1]. Despite in-depth study of risk control from various angles such as finance, technology and human resources, a statistical perspective and a data-driven approach can add further insight and offer potential improvement of project outcomes.

This project analyses how the probability distributions of the duration (time) and the budget (cost) of sub-tasks contribute to those of the project as a whole, as defined by a work breakdown structure (WBS). Alongside normally and exponentially distributed values for common situations, we also look at the case of extreme risks represented by the Weibull family. Combining this with the concept of Value at Risk allows the modelling of losses in project management under a wide range of scenarios [2]. For the calculations, Monte Carlo simulation is preferable to the variance-covariance method because it allows more flexibility in the distributions of the factors involved. The non-uniqueness of the critical path also needs to be considered.
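The sketch below (a toy work breakdown structure with assumed distributions and parameters, not the project’s implementation) illustrates the Monte Carlo approach: simulate sub-task durations, take the longer of two competing paths, and read off a high quantile as the value at risk.

set.seed(42)
n <- 10000
A <- rnorm(n, mean = 10, sd = 2)           # task A: normally distributed duration
B <- rexp(n, rate = 1/5)                   # task B: exponentially distributed duration
C <- rweibull(n, shape = 1.5, scale = 18)  # task C: Weibull duration, capturing extreme risk
total <- pmax(A + B, C)                    # the project ends when the longer path (A then B, or C) finishes
quantile(total, 0.95)                      # 95% value at risk of the project duration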

The methods are implemented in a web application developed with R Shiny and the extRemes package [3]. It produces a PERT chart, finds possible critical paths, and generates simulations for the project. The analysis graphs provided can be used to calculate the value at risk at a given confidence level.

[1] Locatelli et al, The Successful Delivery of Megaprojects: A Novel Research Method, Project Management Journal, 48(5), 2017, [2] Pilar et al, A comprehensive review of Value at Risk methodologies, The Spanish Review of Financial Economics, Issue 12, 2014, [3] Gilleland, CRAN - Package extRemes, https://cran.r-project.org/web/packages/extRemes/index.html

Supervised machine learning of game-styles in tennis
Nicolai Williams (4th year Integrated Masters Dissertation, BSc MMORSE)
Comparing several clustering techniques for characterising and classifying game-style in tennis


Game-styles in tennis are characteristics that are evident to both players and tennis professionals. Players have different strengths and utilise them in customised strategies. Coaches’ attempts to classify players have resulted in lists of player game-styles, but so far these have been based on expert judgement rather than statistical methods.

This project made use of LTA point-by-point match data to characterise players using summary metrics, which were evaluated with discrimination and stability meta-metrics originally designed for NBA player data [1] to ensure sufficient discriminatory power. The refined summary metrics were first used with K-means (shown in the figure), K-medoids and hierarchical clustering algorithms to see what natural groupings would form and how many clusters would best split the data. Secondly, they served to train a classifier to see whether it is possible to accurately classify the game-style of a new tennis player not used to train the classifier.
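A minimal sketch of the clustering step (with made-up player metrics; the metric names and values are placeholders, not the LTA variables):

set.seed(3)
metrics <- data.frame(serve_points_won  = runif(50, 55, 80),  # hypothetical summary metrics
                      net_approaches    = runif(50, 5, 40),
                      mean_rally_length = runif(50, 3, 8))
km <- kmeans(scale(metrics), centers = 4, nstart = 25)        # standardise, then K-means with 4 clusters
table(km$cluster)                                             # cluster sizes, to compare against LTA game-styles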

We showed that the game-styles defined by the LTA are somewhat validated by our statistical analysis. Patterns in the clustering analysis and the numbers of clusters identified show a clear resemblance to the LTA game-styles. In particular, the All Court Player characterises a cluster under all 3 clustering algorithms, indicating that this game-style is not only extremely appropriate, but also well allocated.

[1] Franks et al, Meta-Analytics: Tools for Understanding the Statistical Properties of Sports Metrics, 2016.

Ergodicity of limit order book (LOB) Markov chains
Harry Mantelos Sapranidis (4th year Integrated Masters Dissertation, BSc MMathStat)
Using SDEs to derive generators of Markov processes modeling LOB evolution and prove their ergodicity

Figure: limit order book

A Limit Order Book (LOB) contains all outstanding orders, with their essential characteristics, submitted by traders, and serves to connect buyers and sellers. We consider a market with three types of orders: limit orders (a price at which a trader is willing to buy/sell a specific number of shares at any time), cancellations of these, and market orders (which immediately buy/sell a certain quantity of shares). The stochastic evolution of the LOB can be modelled by continuous-time Markov processes, with the different order submissions given by Poisson processes [1, 2].

In this project, we also studied the embedded Markov chain of this process (in discrete time), which only records the times at which the shape of the order book changes (as a result of incoming orders), as well as some interesting variants of the model, e.g. with changed assumptions on the intensities of the Poisson processes for the different orders. We describe the evolution of the shape of the LOB by a stochastic differential equation driven by Poisson processes and use this to derive the generator/transition operator which infinitesimally characterises the movement of the Markov process/embedded Markov chain.
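Schematically (a simplified sketch with assumed notation, not the dissertation’s full derivation), if the state q records the queue sizes at the price levels and each order event e (limit order, cancellation, market order) occurs with Poisson intensity \lambda_e(q) and changes the book by \Delta_e, then the generator of the resulting pure-jump Markov process acts on a test function f as

(\mathcal{A}f)(q) = \sum_{e} \lambda_e(q)\,\big[f(q + \Delta_e) - f(q)\big],

and ergodicity amounts to showing that, under suitable conditions on the intensities, the law of the process converges to a unique stationary distribution from any initial state.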

Using stochastic stability theory (cf. [3]), we build on these results to prove a very important property of the models: ergodicity. This ensures that the process eventually settles into a particular distribution, regardless of the initial distribution. What is more, we prove that this stabilisation happens exponentially fast.

[1] Abergel et al, Limit Order Books, Physics of Society: Econophysics, CUP, 2016, [2] Cont et al, A stochastic model for order book dynamics, Operations Research, 58:549-563, 06 2010, [3] Meyn et al, Markov Chains and Stochastic Stability, Cambridge Math Library, CUP, 2009.

Option pricing and hedging with execution costs and market impact
Nikolaos Constantinou (4th year Integrated Masters Dissertation, BSc MMORSE)
Comparing alternative stochastic models for stock price evolution and numerical solutions of PDE obtained via a splitting technique


Option pricing theory is well developed in terms of the credit risk of an option contract, with the celebrated Black-Scholes model for pricing vanilla options being a leading example. The treatment of liquidity risk, on the other hand, is not as advanced, perhaps partially due to the various practicalities one can face in intraday trading situations, which constantly densify, and partially because of the lack of a formal definition of what liquidity really is.

Our study aims to price vanilla options, with a given nominal, by incorporating frictions in a delicate manner. In particular, the modelling of imperfections revolves around the continuous-time model of [2], a recent contribution to the literature on option pricing. Their framework is based on [1], whose work on optimal execution is concerned with another broad area of financial mathematics. However, the Almgren-Chriss framework deviates from the conventional log-normal stock price dynamics in that it works with a Bachelier model, where the stock price is instead assumed to have a Gaussian evolution; this gives rise to the second, and inevitably more complex, model in our study.
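For orientation (standard dynamics, not the dissertation’s exact parametrisation), the two stock price models being compared are the arithmetic (Bachelier) and geometric (Black-Scholes) Brownian motions,

\mathrm{d}S_t = \mu\,\mathrm{d}t + \sigma\,\mathrm{d}W_t \quad \text{(Bachelier)}, \qquad \mathrm{d}S_t = \mu S_t\,\mathrm{d}t + \sigma S_t\,\mathrm{d}W_t \quad \text{(geometric Brownian motion)},

with execution costs and market impact entering as frictions in the corresponding hedging problems.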

Treating each of the two problems from scratch, we were able to make some inferences about the effect of various parameters of interest on prices and strategies, and to compare the arithmetic and geometric Brownian motion stock price problems. Most notably, in both modelling approaches we indeed observe the departure from the perfect market setting in terms of both option prices and hedging strategies (see figure), via a numerical example whose derivation relies on numerically solving the relevant PDEs with a splitting technique.

[1] Almgren and Chriss, Optimal execution of portfolio transactions. Journal of Risk, 3:5–39, 2001, [2] Guéant and Pu, Option pricing and hedging with execution costs and market impact, Math. Finance, 27:803–831, 2017.

Detecting, visualising and analysing dependencies in big project management
Mai-An Dang (3rd year project, BSc Data Science)
Modularisation of big projects into small tasks with a network view and critical path distribution

Figure: project charts

Around 9 out of 10 large infrastructural projects go over budget [1]. Unfortunately, megaprojects – projects with budgets over $1 billion [2] – have persistently experienced poor performance over time and across the globe [3]. The motivation for this project is to provide better insight into complex project structures. The objective is to achieve this by analysing dependencies between variables in large projects from a statistical point of view.

This work challenges existing project management processes and proposes a stochastic approach to project management network scheduling techniques. In particular, this work extends the notion of a critical path from the existing critical path method by considering the concept of criticality, a measure to rank non-critical paths within work packages.

This work first considers defining projects as stochastic processes such as Markov processes and decision processes. The focus then moves towards network scheduling techniques. A large proportion of this work was spent on manually implementing a stochastic critical path method in the statistical programming language R.

Finally, several recommendations for a criticality measure are made. All of the defined and proposed criticality measures for non-critical paths are based on task floats or the distribution of entire paths as random variables.
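As a toy illustration of a distribution-based criticality measure (an assumed two-path network with assumed parameters, not the project’s data), the criticality of a path can be estimated as the Monte Carlo probability that it is the longest:

set.seed(7)
n  <- 10000
p1 <- rnorm(n, mean = 12, sd = 3) + rexp(n, rate = 1/4)  # duration of path 1 (two stochastic tasks)
p2 <- rnorm(n, mean = 15, sd = 1)                        # duration of path 2 (one task)
mean(p1 > p2)                                            # estimated criticality index of path 1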

[1] B Flyvbjerg, Megaproject Policy and Planning: Problems, Causes, Cures. Summary of Dissertation for Higher Doctorate in Science (Dr. Scient.), Aalborg University, 2007. URL https://ssrn.com/abstract=2278265, [2] C Fiori and M Kovaka, Defining megaprojects: Learning from construction at the edge of experience. In Construction Research Congress 2005, American Society of Civil Engineers, 2005. URL https://ascelibrary.org/doi/pdf/10.1061/40754(183)70, [3] B Flyvbjerg, What you should know about megaprojects and why: An overview. Project Management Journal, 45(2):6–19, April 2014. URL https://arxiv.org/ftp/arxiv/papers/1409/1409.0003.pdf.