The University of Sheffield
Department of Computer Science

COM4521 Parallel Computing with Graphical Processing Units (GPUs)

Summary Accelerator architectures are discrete processing units which supplement a base processor with the objective of delivering higher performance at lower energy cost. Performance is gained by a design which favours a high number of parallel compute cores, at the expense of imposing significant software challenges. This module looks at accelerated computing from multi-core CPUs to GPU accelerators with many TFLOPS of theoretical performance. The module will give insight into how to write high-performance code, with specific emphasis on GPU programming using NVIDIA CUDA. A key aspect of the module will be understanding the implications of program code for the underlying hardware so that the code can be optimised.
Session Spring 2017/18
Credits 15
Assessment

80% for a project assignment, made up of 40% for the delivered coursework (code) and 40% for a written report describing methods and techniques.

Two multiple-choice quizzes, each worth 10%.

Lecturer(s) Dr Paul Richmond
Resources
Aims
  • To introduce modern accelerator architectures, explain the difference between data and task parallelism, and raise awareness of how the practical and theoretical performance of architectures differs.
  • To give practical knowledge of how GPU programs operate and how they can be utilised for high performance applications.
  • To develop an understanding of the importance of benchmarking and profiling in order to recognise factors limiting performance and to address these through optimisation.
Objectives

By the end of this course students will be able to:

  • Understand how to write C programs and manage memory allocation manually.
  • Utilise OpenMP to write programs for multi-core architectures to improve code performance.
  • Describe and discuss performance techniques for multi-core processors.
  • Program GPUs for general purpose use with the CUDA language.
  • Appreciate how GPU program performance can be improved through intelligent caching.
  • Appreciate the scope, potential and limitations of accelerators for improving code performance.
  • Benchmark and profile GPU programs.
  • Identify limiting factors to code performance and address these through architecture specific optimisation techniques.
  • Recognise and understand the importance of parallel primitives (such as scan and reduce) and understand how these can be implemented with data parallelism.
  • Understand how to display GPU data graphically by integrating CUDA programs with OpenGL.
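
As a rough illustration of the parallel primitives named in the objectives above, the following plain C sketch writes a sum reduction and an inclusive scan in a data-parallel style: within each pass, every iteration of the inner loop works on different elements, so each iteration could in principle be handed to a separate thread. The function names reduce_sum and inclusive_scan are hypothetical and are not taken from the module materials; the code runs sequentially and only models the access pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tree-style sum reduction, performed in place (data is overwritten).
 * The stride doubles on every pass: 1, 2, 4, ...
 * Within one pass, iteration i touches only data[i] and data[i + stride],
 * so the inner-loop iterations are independent of each other. */
int reduce_sum(int *data, size_t n)
{
    assert(n > 0);
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i + stride < n; i += 2 * stride) {
            data[i] += data[i + stride];
        }
    }
    return data[0];
}

/* Inclusive prefix sum using the Hillis-Steele pattern.
 * Each pass reads only from out and writes only to tmp, so the
 * inner-loop iterations are independent; tmp is then copied back.
 * The caller supplies tmp as a scratch buffer of n elements. */
void inclusive_scan(const int *in, int *out, int *tmp, size_t n)
{
    memcpy(out, in, n * sizeof *in);
    for (size_t offset = 1; offset < n; offset *= 2) {
        for (size_t i = 0; i < n; i++) {
            tmp[i] = out[i] + (i >= offset ? out[i - offset] : 0);
        }
        memcpy(out, tmp, n * sizeof *out);
    }
}
```

For example, calling inclusive_scan on the array {1, 2, 3, 4, 5} fills out with {1, 3, 6, 10, 15}, and reduce_sum on the same input returns 15. On a GPU the inner loops would typically be replaced by one thread per element with a synchronisation between passes; that mapping is an assumption here rather than something this description specifies.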
Content
  • Introduction to accelerated computing
  • Introduction to programming in C
  • Pointers and Memory
  • Optimising C programs
  • Multi-core programming with OpenMP
  • Introduction to CUDA
  • GPU memory systems
  • Caching and Shared Memory
  • Synchronisation and Atomics
  • Parallel Primitives
  • Asynchronous programming
  • Profiling and Optimisation of GPU programs
  • Graphics Interoperability
Restrictions This module is only open to Computer Science students.
Teaching Method

Weekly lectures will introduce students to the background on CPU and GPU architectures and programming techniques. Lectures will highlight key design principles for parallel and GPU programming, giving students the insight needed to analyse problems constructively and understand the implications of parallel computing.

Lab sessions will facilitate hands-on learning of practical skills through targeted exercises.

Feedback Students will receive continuous feedback from lab sessions and Google discussion groups. Feedback will also be given on marked quiz assignments and for the main assignment.
Recommended Reading
  • Jason Sanders, Edward Kandrot, “CUDA by Example: An Introduction to General-Purpose GPU Programming”, Addison-Wesley 2010.
  • Brian Kernighan, Dennis Ritchie, “The C Programming Language (2nd Edition)”, Prentice Hall 1988.