The University of Sheffield
Department of Computer Science

Ngai Tang MSc Dissertation 2000/01

"Text Categorisation using Support Vector Machines"

Supervised by M. Hepple

Abstract

Text categorisation is becoming increasingly important given the large volume of online text available through the World Wide Web (WWW), electronic mail, corporate databases, medical patient records and digital libraries. The goal of text categorisation is to classify documents into a fixed number of predefined categories. Evidence shows that Support Vector Machines (SVMs) are well suited to text categorisation, outperforming other well-known methods such as k-nearest neighbour (kNN), neural networks (NNet) and Naïve Bayes (NB). Support vector learning induces a decision rule for categorising examples of different concepts by generalising from a set of training examples.

Linear support vector machines, trained with the SVMlight program, showed very good overall categorisation accuracy on the Reuters-21578 collection. Good classification accuracy (around 90%) required a training set of at least around 300 examples (150 positive and 150 negative instances). In terms of document pre-processing, omitting stemming, retaining stop-words or omitting IDF weights did not significantly affect the classifier's performance. The effect of feature selection was also tested using document frequency (DF) thresholding. Removing rare terms reduces the dimensionality of the feature space and can improve text categorisation performance; however, excessive term removal is not recommended. The transductive learning approach shows great promise when few training examples are available, but training is exceptionally slow.
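
As an illustration of the document frequency thresholding and IDF weighting mentioned above, the following is a minimal Python sketch. It is not the pipeline used in the dissertation: the function names, the toy documents, the threshold value and the particular IDF formula, log(N / df), are all assumptions made for the example.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def df_threshold(docs, min_df):
    """Keep only terms that occur in at least min_df documents."""
    df = document_frequencies(docs)
    return {term for term, count in df.items() if count >= min_df}


def tfidf_vector(tokens, vocabulary, df, n_docs):
    """Sparse TF-IDF vector (as a dict) over the retained vocabulary.

    Assumes idf = log(N / df); other variants are equally plausible.
    """
    tf = Counter(t for t in tokens if t in vocabulary)
    return {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}


# Hypothetical, already tokenised documents.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "fall"],
    ["oil", "price", "rise"],
]

df = document_frequencies(docs)
vocab = df_threshold(docs, min_df=2)   # drops terms seen in a single document
vec = tfidf_vector(docs[1], vocab, df, len(docs))
```

With a threshold of 2, the terms that occur in only one document are dropped, so the second document is represented by a single weighted feature. Raising the threshold further shrinks the feature space, which is the trade-off behind the caution against excessive term removal.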