The University of Sheffield
Department of Computer Science

Feng Bian Undergraduate Dissertation 2005/06

"Automatic Corpus Construction from the Web"

Supervised by Dr RM Stevenson

Abstract

Language researchers often work with large bodies of text as raw data; such a body of text is called a corpus. With the development of information technology, the World Wide Web has become an indispensable source for researchers, as well as a source of useful everyday information. Powerful search engines such as Google and Yahoo! play an important role as the Yellow Pages of the Internet. However, the texts retrieved by a conventional search through these engines are largely unordered, vary in quality, and are mixed with irrelevant information and redundant documents. As a result, there is a growing demand for an off-line search engine that retrieves only the relevant information from the Internet.

The aim of this project is to build a Corpus Generator that can automatically generate a large corpus using available web search technology, such as the Yahoo! API and the Google API. It accepts a set of web pages or text files as input, summarizes their contents, generates an expanded query, and finally retrieves further relevant information for the user.

The Corpus Generator achieves this by implementing several search optimization features. First, it generates an expanded query from the initial documents provided; the techniques involved in this step are statistical term weighting, Pseudo Relevance Feedback, and morphological analysis of each word. Second, it performs further retrieval with the generated query through the Google API and the Yahoo! API. Third, it applies a document similarity measure and a filtering algorithm to the retrieved information before returning the final search result, the corpus, to the user. Other features implemented in this project include an HTML parser (used to remove JavaScript and HTML tags), a MySQL database, an embedded IE browser, and a number of further search optimization options that are described in this report.
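
The query expansion and filtering steps can be illustrated with a minimal sketch. It is not the implementation described in this report: it assumes plain term frequency as the term weight and cosine similarity as the document similarity measure, it leaves out Pseudo Relevance Feedback, morphological analysis, and the calls to the search APIs, and all function names and the stop word list are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; a real system would use a fuller one).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is",
              "for", "on", "with", "as", "are"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens, minus stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def term_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(tokenize(text))


def expand_query(seed_docs, k=5):
    """Return the k most frequent terms across the seed documents."""
    totals = Counter()
    for doc in seed_docs:
        totals.update(term_vector(doc))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_retrieved(seed_docs, retrieved_docs, threshold=0.2):
    """Keep retrieved documents whose similarity to the seed set meets the threshold."""
    seed_vector = Counter()
    for doc in seed_docs:
        seed_vector.update(term_vector(doc))
    return [doc for doc in retrieved_docs
            if cosine_similarity(seed_vector, term_vector(doc)) >= threshold]


seeds = ["Corpus construction from the web",
         "Building a web corpus with search engines"]
retrieved = ["A web corpus for language research",
             "Recipe for chocolate cake"]
query_terms = expand_query(seeds, k=2)          # ['corpus', 'web']
corpus = filter_retrieved(seeds, retrieved)     # keeps only the first document
```

In this sketch the threshold of 0.2 is arbitrary; in practice it would be tuned against the quality of the documents the search engines return.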