Data Engineering with Python

Introduction

An essential component of data science and artificial intelligence systems is the ability to process, engineer and manage the flow of data from source systems into the algorithms. In many cases, the efficiency and efficacy of these pipelines are as important to the success of these systems as the algorithms themselves. This module, using the industry-standard Python language, aims to provide students with the skills and competencies needed to implement efficient and reliable code, and to employ best practices in data management, algorithm development and computational statistics.

The module also introduces students to many of the advanced statistical and data engineering techniques made possible by innovations in computing and modern processing power. These include:

  • clustering
  • dimension reduction
  • feature engineering
  • natural language processing
  • high performance computing
  • analysis of algorithms and computational complexity.

Objectives

Upon successful completion, participants will be able to:

  • Write original, non-trivial Python applications and algorithms.
  • Demonstrate a sound understanding of modern data engineering approaches and their application to a variety of datasets.
  • Automate dimension reduction and clustering techniques on a variety of complex datasets and critically evaluate the results.
  • Evaluate and optimise algorithms for better computational performance.

Syllabus

Programming with Python
- Programming fundamentals
- Introduction to Python
- Data structures
- Functions
- Packages

Pandas and data management
- Data cleaning
- Joining and merging datasets
- Feature engineering
- Automation

Computational complexity and analysis of algorithms
- Big O notation
- Compilation
- Vectorisation
- Best practices for programming

Natural language processing
- Working with text data
- NLP
- Sentiment analysis
- Topic models

Clustering and dimension reduction
- Clustering
- Dimension reduction

Assessment

  • Python programming test (20%)
  • Data engineering presentation (10%)
  • 4,000-word post-module assignment (70%)

Duration

1 week, including 15 hours of lectures and 15 hours of practical classes