COM6012 Scalable Machine Learning
This module focuses on technologies and algorithms that can be applied to data at a very large scale (e.g. population level). From a theoretical perspective, it covers the parallelization of algorithms and algorithmic approaches such as stochastic gradient descent. There is also a significant practical element on deploying scalable ML in practice: frameworks such as Spark, programming languages such as Python and Scala, and deployment on high-performance computing platforms and clusters.
- Formal examination
Dr Mauricio Alvarez & Dr Haiping Lu
Unconfirmed practical marks when available
This unit aims to provide a deeper understanding of the fundamental technologies underlying data analytics at scale. In particular, it will provide an advanced understanding of:
- parallelization of algorithms and algorithmic approaches such as stochastic gradient descent
- practical skills relating to the deployment of scalable ML
By the end of the unit, a student will be able to:
- understand the theoretical issues and wider context relating to ML at scale;
- understand the practical parallelization of algorithms and algorithmic approaches, using techniques such as stochastic gradient descent;
- deploy a practical implementation of ML at scale, using Spark and programming languages such as Python/Scala;
- deploy onto high-performance computing platforms/clusters.
- Spark overview
- Scala programming
Spark & HPC
- Spark DataFrame/dataset
- Machine learning pipeline
- High performance computing
Parallelization & optimization in Spark
Scalable matrix factorization for collaborative filtering & applications
Scalable KMeans clustering & applications
Scalable PCA for dimensionality reduction & applications
Scalable decision trees & applications
Scalable logistic regression & applications
Scalable GLM & applications
Scalable neural networks
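As a rough illustration of what the parallelization of stochastic gradient descent in the topics above can look like, the following is a minimal pure-Python sketch of data-parallel mini-batch SGD for logistic regression. It is not taken from the module materials: the toy dataset, the function names (`partial_gradient`, `add_vectors`, `train`, `make_toy_data`, `accuracy`) and the hyperparameters are all assumptions chosen for illustration. Each partition computes a partial gradient (a "map" step) and the partial gradients are summed (a "reduce" step); on a cluster framework such as Spark, the per-partition step would typically run on separate workers rather than in a single process.

```python
import math
import random
from functools import reduce


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def partial_gradient(batch, w):
    """Map step: sum of logistic-loss gradients over one partition's mini-batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def add_vectors(a, b):
    """Reduce step: element-wise sum of two gradient vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


def train(partitions, dim, steps=200, lr=0.5, batch_size=16, seed=0):
    """Data-parallel mini-batch SGD (hypothetical hyperparameters)."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(steps):
        # Each partition draws its own mini-batch.
        batches = [rng.sample(p, min(batch_size, len(p))) for p in partitions]
        n = sum(len(b) for b in batches)
        # Map over partitions, then reduce the partial gradients to one vector.
        total = reduce(add_vectors, map(lambda b: partial_gradient(b, w), batches))
        w = [wi - lr * gi / n for wi, gi in zip(w, total)]
    return w


def make_toy_data(n=400, seed=1):
    """Synthetic, linearly separable data: label is 1 when x1 + x2 > 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(([1.0, x1, x2], 1.0 if x1 + x2 > 0 else 0.0))  # 1.0 is a bias feature
    return data


def accuracy(data, w):
    hits = sum(
        (sigmoid(sum(wi * xi for wi, xi in zip(w, x))) > 0.5) == (y == 1.0)
        for x, y in data
    )
    return hits / len(data)


data = make_toy_data()
partitions = [data[i::4] for i in range(4)]  # four simulated data partitions
w = train(partitions, dim=3)
print("training accuracy:", accuracy(data, w))
```

The sketch keeps the model weights in one place and distributes only the gradient computation, which is one common way to parallelize this kind of optimization; other schemes exist and may be what a given library actually uses.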
Lectures and laboratory classes.
Immediately for exercises in laboratory classes; after each coursework stage, through a debriefing lecture and individual marking.
- Apache Spark https://en.wikipedia.org/wiki/Apache_Spark
- Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2020.
- Shah, Chirag. A Hands-On Introduction to Data Science. Cambridge University Press, 2020.
- Karau et al., "Learning Spark: Lightning-Fast Big Data Analysis", O'Reilly Media, 2015
- Learning Apache Spark with Python, available at https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf
- B. Chambers & M. Zaharia, Spark: The Definitive Guide