The University of Sheffield
Department of Computer Science

Carlos Morales Garduno MSc Dissertation 2015/16

Textual Alignment of News and Blogs

Supervised by M. Stevenson

Abstract

The number of blog posts published each day is estimated to be above 3 million, and more people are using the blogosphere as their source of information. This change poses a challenge for traditional media, which need to leverage the new opportunities offered by the medium. A key step for news researchers, the automated identification of links between news and blogs, would enable further analytical processing. Tracking public response, identifying new leads, following stories after print, and developing new forms of public engagement are among the motivations for this work. Experiments were conducted using a subset of the New York Times Annotated Corpus 05 and the TREC BLOGS06 collection.

This project explores different techniques to identify these links. A baseline was defined using cosine distance. Then, a procedure was developed to extrapolate document-to-document comparisons for the exploration of topical relations. Later, an LDA implementation was explored to improve on the baseline. A further improvement was introduced by performing a second comparison using Kullback-Leibler divergence. Precision and recall were used to compare performance. Additionally, a gold standard was created to support further research in this area.
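
As an illustration of the two comparison measures named above, the following is a minimal sketch and not the implementation used in this project. It assumes a naive whitespace tokeniser and raw term frequencies for the cosine measure, and it applies Kullback-Leibler divergence to two hand-written topic distributions standing in for LDA output. The function names (term_counts, cosine_similarity, kl_divergence) are hypothetical; cosine distance is taken as 1 minus cosine similarity.

```python
import math
from collections import Counter


def term_counts(text):
    """Term frequencies from a naive lowercase, whitespace tokeniser (assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions of equal length.

    q must be non-zero wherever p is non-zero; terms with p_i = 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical example texts, not taken from either corpus.
news = term_counts("senate passes budget bill after long debate")
blog = term_counts("my thoughts on the budget bill debate in the senate")
cosine_distance = 1.0 - cosine_similarity(news, blog)

# Hypothetical topic distributions, standing in for per-document LDA output.
news_topics = [0.7, 0.2, 0.1]
blog_topics = [0.6, 0.3, 0.1]
divergence = kl_divergence(news_topics, blog_topics)
```

Kullback-Leibler divergence is not symmetric, so a sketch like this has to fix which document plays the role of p; the choice above (news as p, blog as q) is arbitrary.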