The University of Sheffield
Department of Computer Science

COM6012 Scalable Machine Learning

Summary This module will focus on technologies and algorithms that can be applied to data at a very large scale (e.g. population level). From a theoretical perspective, it will focus on parallelization of algorithms and on algorithmic approaches such as stochastic gradient descent. There will also be a significant practical element that will focus on approaches to deploying scalable ML in practice, such as Spark, programming languages such as Python/Scala, and deployment on high performance computing platforms/clusters.
Session Spring 2020/21
Credits 15
Assessment Coursework and Blackboard quizzes
Lecturer(s) Dr Mauricio Alvarez & Dr Haiping Lu
Resources Unconfirmed practical marks when available

This unit aims to provide a deeper understanding of the fundamental technologies underlying data analytics at scale. In particular, it will provide an advanced understanding of

  • parallelization of algorithms and algorithmic approaches such as stochastic gradient descent
  • practical skills relating to the deployment of scalable ML

By the end of the unit, a student will be able to

  • understand the theoretical issues and wider context relating to ML at scale;
  • understand practical parallelization of algorithms and algorithmic approaches using such techniques as stochastic gradient descent;
  • deploy a practical implementation of ML at scale, using Spark and programming languages such as Python/Scala;
  • deploy onto high performance computing platforms/clusters.
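
As an illustration of the stochastic gradient descent technique named in the outcomes above, the following is a minimal single-machine sketch in plain Python, not material taken from the module itself. The function name, the toy data, and the learning-rate, epoch, and batch-size values are all assumptions chosen for the example. Each update uses only a small batch of examples, which is the property that makes the method a natural candidate for large and partitioned datasets.

```python
import random


def sgd_linear_regression(data, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y ~ w*x + b by mini-batch stochastic gradient descent.

    All names and default values here are hypothetical, for illustration only.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # Visit the examples in a fresh random order each epoch
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical noise-free toy data generated from y = 3x + 1
points = [(k / 10, 3.0 * (k / 10) + 1.0) for k in range(20)]
w, b = sgd_linear_regression(points)
```

On this toy data the fitted w and b should approach the generating values 3 and 1. In a cluster setting the per-batch gradient sums would typically be computed across data partitions and then combined, which this sketch does not attempt.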


  • Spark overview
  • Scala programming

Spark & HPC

  • Spark DataFrame/Dataset
  • Machine learning pipeline
  • High performance computing

Parallelization & optimization in Spark

  • Parallelization
  • Optimization

Scalable matrix factorization for collaborative filtering & applications
Scalable KMeans clustering & applications
Scalable PCA for dimensionality reduction & applications
Scalable decision trees & applications
Scalable logistic regression & applications
Scalable GLM & applications
Scalable neural networks
Other topics

Teaching Method Lectures, laboratory classes.
Feedback Immediately for exercises in laboratory classes. After each coursework stage through debriefing lecture and individual marking.
Recommended Reading
  • Apache Spark