The University of Sheffield
Department of Computer Science

Stephen Levin Undergraduate Dissertation 2000/01

"Email Filtering Tool"

Supervised by M. Hepple

Abstract

A vast number of people now use e-mail as a communication medium, for both business and personal purposes. This has in turn led to an increase in the amount of unsolicited 'junk mail' being sent. It has therefore become highly desirable to create an effective e-mail filtering tool, capable of automatically removing unwanted e-mail from a user's system.

In this report, existing approaches to information retrieval and text categorisation are reviewed, with particular emphasis on the recent emergence of statistical approaches. This leads to the analysis of a relatively new statistical approach to text categorisation that uses the chi-square test. The possible application of the chi-square approach in the context of e-mail filtering is explored, and a functional e-mail filtering tool is subsequently developed.

The final system is thoroughly evaluated, using both e-mail and news article corpora. The results suggest that the chi-square approach can be used to filter e-mail effectively. Extensions that could further improve the system's accuracy and functionality are also proposed.