ST420 Statistical Learning and Big Data

Lecturer(s): Dr Richard Everitt

Commitment: 3 lectures per week.

Pre-requisite(s):

  • Statistics UG students: ST218 Mathematical Statistics A, ST219 Mathematical Statistics B and ST221 Linear Statistical Modelling.
  • MSc in Statistics students: ST903 Statistical Methods and ST952 Introduction to Statistical Practice.
  • Master’s in Financial Mathematics students: MA907 Simulation and Machine Learning.
  • External UG students: ST220 Introduction to Mathematical Statistics and ST221 Linear Statistical Modelling.

Aims: This module will introduce students to modern applications of statistics in challenging data-analysis contexts and provide them with the theoretical underpinnings to apply these methods.

Learning Outcomes: On successful completion of the module students will be able to

  • explain, critically discuss and apply fundamental concepts and analytic tools in Statistical Learning;
  • analyse and discuss issues and fundamental tools in the analysis of Big Data and Big Models;
  • implement and assess methods for prediction based on partitioning data;
  • apply fundamental tools based on sparsity, regularisation and the control of error rates to analyse large data sets.

Syllabus:

Statistical Learning – an introduction to statistical learning theory, using simple machine-learning methods to illustrate the key ideas:

  • From over-fitting to apparently complex methods that can nonetheless perform well; complexity measures such as VC dimension and shattering.
  • PAC bounds. Loss functions. Risk (in the learning theoretic sense) and posterior expected risk. Generalisation error.
  • Supervised, unsupervised and semi-supervised learning.
  • The use of distinct training, test and validation sets, particularly in the context of prediction problems.
  • The Bootstrap revisited. Bags of Little Bootstraps. Bootstrap aggregation (bagging). Boosting. Data splitting and bagging are sketched in code after this list.
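As an illustration of data splitting and bootstrap aggregation, here is a minimal Python sketch using scikit-learn (an assumed toolchain; the module does not prescribe software). The dataset and all parameter values are arbitrary illustrative choices:

    # Minimal sketch: hold-out splitting plus bagging. Dataset, model and
    # all parameter values are illustrative, not prescribed by the module.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic classification problem.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Hold out a test set; the training portion could be split again to
    # give a validation set for tuning.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Bagging: each base tree is fit on a bootstrap resample of the
    # training data and predictions are aggregated by majority vote.
    model = BaggingClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Generalisation performance is estimated on the untouched test set.
    print("test accuracy:", model.score(X_test, y_test))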

Big Data and Big Models – issues and (partial) solutions:

  • The “curse of dimensionality”. Multiple testing; voodoo correlations, false-discovery rate and family-wise error rate. Corrections: Bonferroni, Benjamini-Hochberg (both are sketched in code after this list).
  • Sparsity and Regularisation. Variable selection; regression. Spike-and-slab priors. Ridge Regression. The Lasso. The Dantzig Selector. A ridge/lasso comparison also follows this list.
  • Concentration of measure and related inferential issues.
  • MCMC in high dimensions – preconditioned Crank–Nicolson (pCN); the Metropolis-adjusted Langevin algorithm (MALA), Hamiltonian Monte Carlo (HMC). Preconditioning. Rates of convergence. A pCN sketch closes this list.
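To make the multiple-testing corrections concrete, here is a minimal NumPy sketch of the Bonferroni and Benjamini-Hochberg procedures; the simulated p-values and the significance level are illustrative assumptions:

    # Minimal sketch of two multiple-testing corrections on a vector of
    # p-values. Pure NumPy; the simulated p-values are illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)
    # 95 null p-values (uniform) plus 5 small p-values mimicking signals.
    pvals = np.concatenate([rng.uniform(size=95),
                            rng.uniform(0, 1e-4, size=5)])
    m, alpha = len(pvals), 0.05

    # Bonferroni controls the family-wise error rate:
    # reject when p <= alpha / m.
    bonferroni_reject = pvals <= alpha / m

    # Benjamini-Hochberg controls the false-discovery rate: sort the
    # p-values, find the largest k with p_(k) <= k * alpha / m, and
    # reject the k smallest.
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    below = sorted_p <= (np.arange(1, m + 1) * alpha / m)
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    bh_reject = np.zeros(m, dtype=bool)
    bh_reject[order[:k]] = True

    print("Bonferroni rejections:", bonferroni_reject.sum())
    print("Benjamini-Hochberg rejections:", bh_reject.sum())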
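Similarly, a minimal sketch contrasting ridge (L2) and lasso (L1) regularisation, again assuming scikit-learn; the synthetic sparse regression problem and the penalty strengths are illustrative choices:

    # Minimal sketch contrasting ridge and lasso regularisation on a
    # sparse regression problem. Data and penalties are illustrative.
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    n, p = 100, 50
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:5] = 2.0                      # sparse truth: 5 active coefficients
    y = X @ beta + rng.standard_normal(n)

    ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks, rarely zeroes
    lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: shrinks and selects

    print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))
    print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))
    # In practice the penalty strength alpha would be chosen by
    # cross-validation (e.g. LassoCV) on a validation set.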
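Finally, a sketch of the preconditioned Crank–Nicolson proposal for a target with a standard Gaussian prior; the likelihood, dimension and step-size parameter are placeholder assumptions:

    # Minimal sketch of the preconditioned Crank-Nicolson (pCN) proposal
    # for a posterior with a N(0, I) prior in d dimensions. The Gaussian
    # likelihood and the step parameter beta are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    d, beta, n_iter = 100, 0.2, 5000

    def log_likelihood(x):
        # Placeholder likelihood; substitute the problem's own.
        return -0.5 * np.sum((x - 1.0) ** 2)

    x = np.zeros(d)
    accepted = 0
    for _ in range(n_iter):
        # pCN proposal: reversible with respect to the prior, so the
        # acceptance ratio involves only the likelihood, and the
        # acceptance rate does not collapse as d grows.
        prop = np.sqrt(1 - beta ** 2) * x + beta * rng.standard_normal(d)
        if np.log(rng.uniform()) < log_likelihood(prop) - log_likelihood(x):
            x = prop
            accepted += 1

    print("acceptance rate:", accepted / n_iter)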

Assessment: 100% by 2-hour examination in April.

Books: