The University of Sheffield
School of Computer Science

Matthew Burman Undergraduate Dissertation 2017/18

A data architecture for parallel stream clustering

Supervised by F. Ciravegna

Abstract

The rate of global data production is growing rapidly: data now arrives from more sources, with increasing heterogeneity, at ever larger scales. Conventional data architectures increasingly show signs of fragility under this load, making way for modern architectures built to handle this scale.

This project aimed to investigate architectural and algorithmic solutions to these problems of scale. It focused specifically on summarising mobility data via location clustering, using the Active10 dataset.

An architecture consisting of Apache Kafka and parallel Python consumers was deployed to investigate the time and space costs of executing algorithms on a 16-core machine. The results indicate that Apache Kafka substantially benefits production workloads, offering flexibility and redundancy that conventional architectures do not match. Further, experimentation suggests the levels of scale at which conventional ETL processes begin to break down due to fragility.
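
As a rough illustration of the consumer side of such an architecture, the sketch below processes several partitions of location points concurrently and merges the partial summaries. It rests on several assumptions that are not taken from the dissertation: in-memory lists stand in for Kafka topic partitions, threads stand in for separate consumer processes, and a simple grid-cell count stands in for the clustering algorithms, which the abstract does not specify. The names `grid_cell`, `consume` and `cluster_in_parallel`, and the cell size, are hypothetical.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to an integer grid cell (assumed cell size in degrees)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def consume(partition):
    """One consumer: count the points that fall in each grid cell of its partition."""
    counts = Counter()
    for lat, lon in partition:
        counts[grid_cell(lat, lon)] += 1
    return counts


def cluster_in_parallel(partitions, workers=4):
    """Run one consumer per partition and merge the partial counts.

    Threads are used here only to keep the sketch self-contained; a real
    deployment would read each partition from Kafka in its own process.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(consume, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    # Hypothetical sample points, split across two partitions.
    demo_partitions = [
        [(53.3811, -1.4701), (53.4051, -1.5051)],
        [(53.3812, -1.4702)],
    ]
    for cell, count in cluster_in_parallel(demo_partitions).most_common():
        print(cell, count)
```

Because each consumer produces an independent partial count that is merged only at the end, partitions can be added or reassigned without changing the per-consumer logic, which is the property that makes this style of processing straightforward to scale out.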