Module leader: Y Yu
Please see the full Module Specifications for background information relating to all of the APTS modules, including how to interpret the information below.
Aims: Remarkable developments in computing power and other technology now allow datasets of immense size and complexity to be collected routinely. One common feature of many of these modern datasets is that the number of variables measured can be very large, and even exceed the number of observations. In these challenging high-dimensional settings, classical statistical methods often perform very poorly or do not work at all. In this course we will look at some of the current methods for handling such data and try to understand when and why they work well.
Learning outcomes: After taking this module, students should be able to use analogues of many of the tools from classical statistics to analyse high-dimensional datasets. They should also be better placed to study and contribute to the growing literature on high-dimensional statistics.
Prerequisites: Preparation for this module should establish:
- Standard matrix algebra (not beyond that covered in the Statistical Computing module);
- Basic knowledge of real analysis and norms;
- Undergraduate level probability (no measure theory required) and statistics (e.g. maximum likelihood, the normal linear model, hypothesis tests and p-values);
- Thorough understanding of the normal linear model;
- Some basic elements of optimisation and convex analysis that will be covered in the preliminary material.
Reading:
- Hastie et al. (2001). The Elements of Statistical Learning, Springer. You may wish to look at chapters 3 and 17, up to the end of 17.3.2. It is slightly less mathematical than this course but great for gaining some intuition.
- Bühlmann and van de Geer (2011). Statistics for High-dimensional Data, Springer. This gives a more in-depth treatment of parts of our course. You may wish to look initially at chapter 2. Chapters 6, 10, 11 and 13 cover the material of the course, but are much more advanced.
Topics:
- Ridge regression;
- The Lasso and extensions;
- Graphical modelling including neighbourhood selection and the graphical Lasso;
- Multiple testing including the false discovery rate and the Benjamini-Hochberg procedure.
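As a flavour of the computational side of the last topic, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. It is an illustration only, not the module's own code: the function name `benjamini_hochberg` and the default level `alpha=0.05` are assumptions made for this example. Given m p-values, it finds the largest rank k such that the k-th smallest p-value is at most k * alpha / m, and rejects the k hypotheses with the smallest p-values.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Illustrative sketch; the name and default level are assumptions.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank k (1-based) whose ordered p-value satisfies
    # p_(k) <= k * alpha / m; k_max stays 0 if no rank qualifies.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank

    # Step-up rule: reject the k_max hypotheses with the smallest p-values,
    # even if some of them individually exceed their own threshold.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Note the step-up behaviour: a hypothesis whose p-value misses its own threshold is still rejected if a larger ordered p-value meets its threshold.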
Assessment: Exercises with both a theoretical and a computational component.