The University of Sheffield
School of Computer Science

Danny Heard Undergraduate Dissertation 2017/18

Sound Event Tagging Using Audio and Visual Features

Supervised by J. Barker

Abstract

Sound event tagging systems have typically been designed to use audio data as the sole input. However, large new data sources that contain both audio and visual data, such as YouTube, are now available for audio tagging and sound event detection tasks.

In this project, multi-modal techniques are explored with the aim of enhancing a state-of-the-art audio tagging system by leveraging the visual information in a YouTube-based dataset.

These techniques show that a multi-modal approach to audio tagging is feasible, and they improved performance, relative to the baseline system, on sound events that have visual object representations. However, further research is needed to address the limitations of the presented techniques.