The aim of this module is to give students the foundational skills of data analytics: preparing and working with data; abstracting and modeling an analytic question; and using tools from statistics, machine learning, and data mining to address these questions. Students will study techniques for going from raw data to a deeper understanding of the patterns and structures within the data, in order to support prediction and decision making.
The course will cover a number of topics, including:
- Introduction to analytics and case studies - examples of successful analytics work from companies such as Google, Facebook, Kaggle, and Netflix;
- Basic tools - including Unix/Linux command line tools for data manipulation (sorting, counting, reformatting, aggregating, joining); tools such as gnuplot for displaying and visualizing data; and programming languages such as Perl and Python for more powerful data manipulation;
- Statistics - tools from statistics for understanding distributions and probability (mean, variance, tail bounds). Hypothesis testing for determining the significance of an observation, and the R system for working with statistical data;
- Databases - including problems found in realistic data: errors, missing values, lack of consistency, and techniques for addressing them. The relational data model, and the SQL language for expressing queries. The NoSQL movement, and the systems evolving around it;
- Regression - predicting new data values via regression models. Simple linear regression over low-dimensional data, regression for higher-dimensional data via least squares optimization, logistic regression for categorical data;
- Matrices - Matrices to represent relations between data, and the necessary linear algebraic operations on matrices. Approximately representing matrices by decompositions (Singular Value Decomposition and Principal Components Analysis). Application to the Netflix Prize;
- Clustering - Finding clusters in data via different approaches. Choosing distance metrics. Different clustering approaches: hierarchical agglomerative clustering, k-means (Lloyd's algorithm), k-center approximations. Relative merits of each method;
- Classification - Building models to classify new data instances. Decision tree approaches and Naive Bayes classifiers. The Support Vector Machine model and the use of kernels to produce separable data and non-linear classification boundaries. The Weka toolkit;
- Data Structures - Data structures to scale analytics to big data and data streams. The Bloom filter to represent large sets. Sketch data structures for more complex data analysis, and other summary data structures;
- Data Sharing - The ethics and risks of sharing data on individuals. Technologies for anonymising data: k-anonymity, and differential privacy;
- Graphs - Graph representations of data, with applications to social network data. Measurements of centrality and importance. Recommendations in social networks, and inference via relational learning.
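
To give a flavor of the summary data structures listed under Data Structures, here is a minimal Python sketch of a Bloom filter. It is an illustration only, not course material: the class name, the default sizes (1024 bits, 3 hash functions), and the choice of salted SHA-256 as the hash family are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit for readability; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Present only if every position for this item has been set.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for name in ["alice", "bob", "carol"]:
    bf.add(name)

print("alice" in bf)    # added items are always reported as present
print("mallory" in bf)  # usually absent, but a false positive is possible
```

The filter stores a fixed number of bits regardless of how large the items are, which is why it scales to big data and streams; the price is that a membership query can occasionally answer "present" for an item that was never added.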