David Meakin Undergraduate Dissertation 2000/01

School of Computer Science

"Statistical Word Sense Disambiguation"

Supervised by Y. Wilks

Abstract

In natural language, many words have multiple meanings, or senses. This project attempts to assign the correct sense to a word automatically, given its context. This is not an easy task, and many complications must be considered before Word Sense Disambiguation (WSD) can be carried out successfully.

The purpose of this project is to investigate word sense disambiguation, and in particular a statistical method for carrying it out. A theoretical background to WSD is given, examining the difficulties and successes encountered by previous NLP scholars, together with a history of the various approaches that have been used. One recent approach is that of David Yarowsky, who attempted to build statistical models of the likely category or categories of a word in Roget's Thesaurus, using information about the sorts of words that occur around the given word in a large corpus. To investigate how WSD can be carried out, and how successfully, this project will reimplement his algorithm using the British National Corpus (BNC) to provide training data. This will include the design, implementation and testing of algorithms for building the statistical models and for applying them to new words. The results can then be compared with Yarowsky's original results and with the results claimed for other WSD methodologies. Unlike Yarowsky's work, this project will include verbs in the WSD process. Resources used will include electronic versions of Roget's Thesaurus and the BNC, and the Perl programming language.
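
The category-model idea described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's Perl implementation. It makes several loudly stated assumptions: the four toy sentences stand in for the BNC, the two hypothetical categories ANIMAL and MACHINE stand in for Roget's categories, the context window is limited to a single sentence, add-alpha smoothing is used, and all categories are treated as equally likely a priori. Each context word receives a weight log(P(word | category) / P(word)), and an ambiguous word is assigned the category whose weights sum highest over its context.

```python
import math
from collections import Counter, defaultdict

# Toy stand-ins (assumptions): a real run would use the BNC and Roget's categories.
SENTENCES = [
    "the heron waded in the river".split(),
    "a duck swam on the water of the river".split(),
    "the bulldozer moved steel on the site".split(),
    "the drill cut steel and concrete".split(),
]
CATEGORIES = {
    "ANIMAL": {"heron", "duck"},
    "MACHINE": {"bulldozer", "drill"},
}

def train(sentences, categories, window=5, alpha=0.5):
    """Return weights[cat][word] = log(P(word | cat) / P(word))."""
    global_counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(global_counts.values())
    vocab_size = len(global_counts)

    # Count the words seen within `window` tokens of any member of each category.
    context_counts = defaultdict(Counter)
    for sent in sentences:
        for i, token in enumerate(sent):
            for cat, members in categories.items():
                if token in members:
                    lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                    for j in range(lo, hi):
                        if j != i:
                            context_counts[cat][sent[j]] += 1

    weights = {}
    for cat in categories:
        cat_total = sum(context_counts[cat].values())
        weights[cat] = {}
        for word, n in global_counts.items():
            # Add-alpha smoothing keeps unseen context words from giving log(0).
            p_word_given_cat = (context_counts[cat][word] + alpha) / (
                cat_total + alpha * vocab_size
            )
            p_word = n / total
            weights[cat][word] = math.log(p_word_given_cat / p_word)
    return weights

def disambiguate(context, weights):
    """Pick the category with the highest summed weight over the context words."""
    scores = {
        cat: sum(w.get(word, 0.0) for word in context)  # unknown words add 0
        for cat, w in weights.items()
    }
    return max(scores, key=scores.get)

weights = train(SENTENCES, CATEGORIES)
print(disambiguate("the crane stood in the water".split(), weights))
print(disambiguate("the crane lifted steel and concrete".split(), weights))
```

On this toy data the first context for "crane" is assigned to ANIMAL and the second to MACHINE, because words such as "water" were seen only near animal terms and "steel" only near machine terms. A corpus-scale version would need a much wider window and a prior over categories, both of which are omitted here.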