# ST343 Topics in Data Science

###### Lecturer(s): Dr Sebastian Vollmer, Dr Teresa Brunsdon and Professor Yulan He

###### Please note that the topics covered in this module may change from year to year.

**Prerequisite(s):** Either ST219 Mathematical Statistics B, ST220 Introduction to Mathematical Statistics or CS260 Algorithms.

**Commitment:** 3 lectures per week for 10 weeks. This module runs in Term 2.

**Content:** Three self-contained sets of ten lectures in Term 2.

**Assessment:** 100% by 2-hour examination.

*Examination period*: Summer

**Title:** Deep Learning for Natural Language Processing

**Lecturer:** Prof. Yulan He

**Aims:** Deep learning has attracted significant interest in recent years owing to groundbreaking results in areas including natural language processing, computer vision and speech recognition. Deep learning approaches employ many-layered neural networks (NNs) that automatically learn hidden semantic structures at different levels of abstraction from raw data such as text, images, video and audio signals. This topic provides an introduction to the theory and practice of deep NNs, with a focus on applications in natural language processing (NLP). We will cover neural network architectures and techniques such as convolutional NNs, recurrent NNs, attention mechanisms, transformers, sequence-to-sequence learning and (time permitting) generative adversarial networks and variational autoencoders.

**Objectives:** By the end of the course students should be able to (1) explain the key concepts of artificial NNs, such as activation functions, layers, weights and gradient descent for fitting an NN; (2) describe and implement NN architectures such as CNNs and RNNs for NLP applications; (3) demonstrate an understanding of more advanced NN techniques such as attention mechanisms, transformers and sequence-to-sequence learning.

**References:**

Goldberg, Y., 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1), pp.1-309.

Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep learning. MIT Press.

**Examination period:** Summer

**Title:** Decision Trees and Random Forests

**Lecturer:** Dr. Teresa Brunsdon

**Aims:** Decision trees are a popular alternative to statistical predictive models in data analytics. They have the advantage of being very simple for a non-specialist to understand, with algorithms available for both categorical dependent variables (classification trees) and continuous dependent variables (regression trees). In addition, they handle both interactions and missing data extremely well. A variety of algorithms exist, such as CHAID, C4.5, C5.0, ID3 and CART. However, there are many pitfalls in building a decision tree, e.g. overfitting and instability. Methods to combat these issues include the use of training and validation data, bagging and boosting, and assessment criteria. One particular way to improve stability is to use a random forest, which in effect builds several decision trees and combines the results. This section will give an introduction to the theory and practice of decision trees and random forests, with practical examples and details of some of the algorithms available. We will cover some of the options available within these algorithms and examine the advantages and disadvantages of decision trees and random forests relative to each other and to other predictive modelling techniques.
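As an informal illustration of "builds several decision trees and combines the results", the sketch below bags one-split trees (stumps) on bootstrap resamples and combines them by majority vote. It is a toy under stated assumptions, not one of the algorithms named above: the function names `fit_stump` and `fit_bagged_stumps` and the data are hypothetical, and because the data have a single feature there is no random feature selection, so this is plain bagging rather than a full random forest.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-split tree on 1-D data by minimising training errors."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Each leaf predicts its majority class; an empty right leaf
        # falls back to the left leaf's label.
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit stumps on bootstrap resamples and combine them by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw n indices with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))

    def predict(x):
        votes = Counter(tree(x) for tree in trees)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical data: class 0 below 10, class 1 from 10 upwards, two noisy labels.
xs = list(range(20))
ys = [0 if x < 10 else 1 for x in xs]
ys[3], ys[15] = 1, 0

predict = fit_bagged_stumps(xs, ys)
accuracy = sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)
print(f"training accuracy of the bagged stumps: {accuracy:.2f}")
```

Each individual stump depends on which points its bootstrap sample happens to contain; the vote across many stumps is what dampens that instability.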

**Objectives:** By the end of the course students should be able to (1) explain the key concepts and terminology of decision trees and random forests, such as nodes, branches, depth and weights, and be familiar with the main algorithms; (2) describe and implement various algorithms for decision trees and random forests and assess their effectiveness; (3) explain the advantages and disadvantages of decision trees and random forests relative to each other and to other algorithms.

**References:**

Rokach, L. and Maimon, O. Z., 2014. Data Mining with Decision Trees: Theory and Applications (2nd edition). Series in Machine Perception and Artificial Intelligence, World Scientific Publishing Company.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed., corrected 9th printing 2017). Springer.

Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. *Applied Statistics, 29*(2), 119–127.

Larose, D. T., & Larose, C. D. (2015). *Data Mining and Predictive Analytics* (2nd ed.). New Jersey: Wiley.

**Title:**

**Lecturer:** Dr Sebastian Vollmer

**Outline:** Model comparison and selection

- Scientific validity in the context of the data analytics workflow
- Basic model-agnostic assessment of supervised/predictive models
- Performance quantification of regression and classification models
- Predictive model validation in R/mlr, Python/sklearn and Julia/MLJ
- Full statistical formulation of the supervised learning setting
- Bias-variance trade-off, cross-validation and re-sampling estimators
- Estimators of the generalization loss and the loss's variance
- Hypothesis testing for pairwise and portmanteau model comparison
- Meta-strategies for automated model improvement
- Interaction of model tuning and model validation workflows
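
To make the cross-validation and generalization-loss items in the outline concrete, here is a minimal sketch in plain Python rather than in the mlr, sklearn or MLJ toolkits named above. It assumes a toy "model" that predicts the training mean and uses squared-error loss; the helper names (`k_fold_indices`, `fit_mean`, `cross_validate`) and the simulated data are hypothetical and chosen only for illustration.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(y_train):
    """Toy model: predict the training mean for every observation."""
    return statistics.fmean(y_train)


def cross_validate(y, k=5, seed=0):
    """Return the mean squared error on each held-out fold."""
    fold_losses = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        y_train = [v for i, v in enumerate(y) if i not in held_out]
        prediction = fit_mean(y_train)
        mse = statistics.fmean([(y[i] - prediction) ** 2 for i in test_idx])
        fold_losses.append(mse)
    return fold_losses


# Simulated responses (assumption): Gaussian noise around 10 with sd 2.
rng = random.Random(42)
y = [rng.gauss(10.0, 2.0) for _ in range(100)]

losses = cross_validate(y, k=5)
cv_estimate = statistics.fmean(losses)
# Naive standard error: treats the folds as independent, which they are not
# (training sets overlap), so it tends to understate the uncertainty.
cv_std_error = statistics.stdev(losses) / len(losses) ** 0.5
print(f"CV estimate of MSE: {cv_estimate:.3f} +/- {cv_std_error:.3f}")
```

The averaged fold loss estimates the generalization loss; the spread across folds gives only a rough indication of its variance, since the overlapping training sets make the fold losses dependent.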

**References:**