The University of Sheffield
Department of Computer Science

Alexander Burley Undergraduate Dissertation 2015/16

Fixing Twitter

Supervised by L. Specia

Abstract

Every day, hundreds of millions of people open their Twitter application and share their thoughts with the world. From teenagers in school to famous celebrities, people of all backgrounds can interact with each other across the internet by sending short messages or simply indicating approval at the press of a button. Twitter has become a central data source for areas of research such as sentiment analysis and machine translation. However, Twitter has no unified language, and each 'tweet' has a character limit that pushes users towards brevity. The problem is that tweets are full of useful information, yet that information is hidden in a mass of non-standard terms created by web culture. Emoticons, slang and abbreviations, all easily understood by a human reader, can cause serious problems for machine translation systems. This project's objective is to examine this irregular use of the English language, specifically in the context of tweets, and to implement various methods that bring each tweet closer to its true representation in standard English. Using regular expressions, we categorize each component of a tweet and then apply the appropriate method to 'normalize' that component into standard language. Through this, we aim to improve the ability of a public machine translation system to translate irregular tweets into a foreign language.
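As a rough illustration of the approach described above, the following Python sketch categorizes each whitespace-separated component of a tweet with regular expressions and then normalizes it according to its category. The patterns, the category names, the `SLANG` lookup table and the choice to drop emoticons are all assumptions made for this example; they are not the rules used in the project itself.

```python
import re

# Hypothetical categories and patterns, chosen only for illustration.
# Order matters: the first pattern that matches the whole token wins.
TOKEN_PATTERNS = [
    ("url", re.compile(r"https?://\S+")),
    ("mention", re.compile(r"@\w+")),
    ("hashtag", re.compile(r"#\w+")),
    ("emoticon", re.compile(r"[:;=][-']?[)(DPp/\\|]")),
    ("word", re.compile(r"[A-Za-z0-9']+")),
]

# Hypothetical slang and abbreviation lookup (a real system would need
# a far larger resource).
SLANG = {"u": "you", "r": "are", "gr8": "great", "thx": "thanks"}


def categorize(token):
    """Return the name of the first category whose pattern matches the token."""
    for name, pattern in TOKEN_PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "other"


def normalize(tweet):
    """Normalize a tweet token by token, based on each token's category."""
    normalized = []
    for token in tweet.split():
        category = categorize(token)
        if category == "word":
            # Replace known slang; leave unknown words untouched.
            normalized.append(SLANG.get(token.lower(), token))
        elif category == "emoticon":
            # Assumption: emoticons are dropped rather than translated.
            continue
        else:
            # URLs, mentions, hashtags and unrecognized tokens pass through.
            normalized.append(token)
    return " ".join(normalized)


print(normalize("thx @bob u r gr8 :)"))
```

Splitting on whitespace keeps the sketch short, but it means a token such as "gr8!" falls into the "other" category and is left unchanged; handling attached punctuation would need a finer tokenizer.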