Wim Peters

 Personal Homepage




I am a Research Associate/Fellow within the Sheffield Natural Language Processing group.

I am working within the GATE (General Architecture for Text Engineering) group on a number of recent and current projects within various areas:


Digital Preservation

Coordination of the ARCOMEM project (January 2011 - December 2013)
ARCOMEM (FP7-IST-270239) - From Collect-All ARchives to COmmunity MEMories - was an FP7 integrated project focusing on memory institutions like archives, museums and libraries in the age of the social web.
Social media are becoming more and more pervasive in all areas of life, and archiving institutions need to include these in their preservation activities.
ARCOMEM’s aim was to help to transform archives into collective memories that are more tightly integrated with their community of users and to exploit Web 2.0 and the wisdom of crowds to make web archiving a more selective and meaning-based process.
For this purpose methods and tools have been developed based on novel socially-aware and socially-driven preservation models that:
(a) leverage the Wisdom of the Crowds reflected in the rich context and reflective information in the Social Web for driving innovative, concise and socially-aware content appraisal and selection processes for preservation,
taking events, entities and topics as seeds, and by encapsulating this functionality into an adaptive decision support tool for the archivist, and
(b) use Social Web contextualization as well as extracted information on events, topics, and entities for creating richer and socially contextualized digital archives.


Human Computing

  Principal Investigator of the UCOMP project (Embedded Human Computation for Knowledge Extraction and Evaluation) (Nov 2012-Nov 2015).

The rapid growth and fragmented character of social media (Twitter, Facebook, blogs, etc.) and publicly available structured data (e.g., Linked Open Data) have led to challenges of how to extract knowledge from such noisy, multilingual data in a robust, scalable and accurate manner. Existing approaches often fail when encountering such unpredictable input, a shortcoming that is amplified by the lack of suitable training and evaluation gold standards.
The goal of this inter-disciplinary project is to address these challenges by developing new methods arising from the Human Computation (HC) paradigm, which aims to merge collective human intelligence with automated knowledge extraction methods.
Embedding HC in the emerging discipline of Web Science, however, is far from trivial, especially when aiming to extract knowledge from heterogeneous, noisy, and multilingual data:
Which knowledge artefacts are best suited for HC-based acquisition?
How can complex knowledge extraction be broken down into a series of simple HC tasks?
How can noisy input be used to train a knowledge extraction algorithm?
The project builds on the emerging field of Human Computation (HC) in the tradition of games with a purpose and crowdsourcing marketplaces. It will advance the field of Web Science by developing a scalable and generic HC framework for knowledge extraction and evaluation, delegating the most challenging tasks to large communities of users and continuously learning from their feedback to optimise automated methods as part of an iterative process.


Digital Humanities

Main technical partner for TALH (Text Analysis for the Humanities) (October 2013-June 2014)
The council registers of Aberdeen, Scotland are the earliest and most complete body of surviving records of any Scottish town, running nearly continuously from 1398 to the present day.
Few cities in the United Kingdom or indeed Western Europe rival Aberdeen's burgh registers in historical depth and completeness.
In July 2013, work started on the analysis of a number of transcribed manuscripts from Council Register 13, covering 1530-1531.
These volumes, written in Latin and Middle Scots, offer a detailed view into one of Scotland’s principal burghs, casting light on administrative, legal, and commercial activities as well as daily life.
The registers include the elections of office bearers, property transfers, regulations of trade and prices, references to crimes and subsequent punishment, matters of public health, credit and debt, cargoes of foreign vessels, tax and rental of burgh lands, woods and fishings.
In their entirety, the records for the City of Aberdeen range from 1398 to the present day in the form of manuscripts and printed text.
Aims of the TALH project:
1. Support legal historical research. (Legal) historians are now able to formulate their research questions and translate these into correlating corpus search patterns whose results may provide indications towards answers.
2. Implement and advocate the methodology of applying text analytic tools, in order to enrich historical documents with metadata and make the contained information accessible.
3. Showcase the rich informational structure and the enhanced accessibility of the archival material by means of a metadata layer instantiated by text annotations. The project's main output is a digital version of manuscripts, enriched with metadata in the form of annotations, which provide fine-grained access to content, and enable semantic search at a previously unattainable level of detail. The corpus content and its access/querying facilities herald a new phase in ICT-based analysis of historical resources.


Natural language processing for legal applications is a growing area. Given the fact that legal texts mostly consist of unstructured text, NLP allows either the automatic filtering of regular legal text fragments or the automatic support for close reading of legal text. I have been involved in various activities in this area. One example: Case based reasoning is a crucial aspect of common law practice, where lawyers select precedent cases which they use to argue for or against a decision in a current case. To select the precedents, the relevant facts (the case factors) of precedent cases must be identified; the factors predispose the case decision for one side or the other. As the factors of cases are linguistically expressed, it is useful to provide a means to automate the identification of candidate passages. Factor analysis from unstructured linguistic information is a complex, time-consuming, error-prone, and knowledge intensive task; it is a difficult aspect of the "knowledge bottleneck" in legal information processing. Techniques which could facilitate factor analysis would support a task that is essential to lawyers: finding relevant cases. In addition, by using Semantic Web technologies such as XML and ontologies, novel methods could be developed to analyse the law, make it more available to the general public, and to support automated reasoning. Nonetheless, the development of such technologies depends on making legal cases structured and informative for machine processing.

Adam Wyner and Wim Peters. Lexical semantics and expert legal knowledge towards the identification of legal case factors.
In Radboud Winkels, editor, Proceedings of Legal Knowledge and Information Systems (JURIX 2010), pages 127-136. IOS Press, 2010.

We apply natural language information extraction techniques to a sample body of cases, which are unstructured text, in order to automatically identify and annotate the factors.  Annotated factors can then be extracted for further processing and interpretation.
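As a rough illustration of what identifying candidate passages for case factors can look like, the following is a minimal sketch in Python. It is not the method of the paper cited above: the factor names, the cue terms, and the helper functions are all hypothetical, and a real pipeline would rely on linguistic annotation rather than plain string matching.

```python
import re

# Hypothetical factor lexicon: factor names and cue terms are illustrative only.
FACTOR_CUES = {
    "DisclosureToOutsiders": ["disclosed", "revealed", "third party"],
    "SecurityMeasures": ["confidentiality agreement", "password", "locked"],
}


def split_sentences(text):
    """Naive splitter: breaks after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def annotate_factors(text):
    """Return (sentence, factor, cue) triples for every cue found in a sentence."""
    annotations = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for factor, cues in FACTOR_CUES.items():
            for cue in cues:
                # Whole-word match so that e.g. "locked" does not match "blocked".
                if re.search(r"\b" + re.escape(cue) + r"\b", lowered):
                    annotations.append((sentence, factor, cue))
    return annotations


if __name__ == "__main__":
    case_text = (
        "The plaintiff disclosed the formula to a supplier. "
        "Employees signed a confidentiality agreement. "
        "The weather was fine."
    )
    for sentence, factor, cue in annotate_factors(case_text):
        print(f"{factor} (cue: {cue!r}): {sentence}")
```

Each returned triple marks a sentence as a candidate passage for one factor; deciding whether the factor actually applies would remain a task for the legal expert.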


Co-PI on the Argumentation Workbench project (August 2014-April 2015).
It is very difficult to make coherent sense of arguments for and against issues raised in articles or comments. While argument visualisation tools such as DebateGraph help people to structure and understand media-derived arguments, the visualisations are manually reconstructed, and are consequently expensive to produce in terms of time, money, and knowledge.
The argumentation workbench will make use of automatic text mining techniques, e.g. sentiment analysis and named entity/term/relation extraction, together with automated discourse and argumentation marking, in order to select textual material deemed important for the argumentation structure of the text under consideration. This material with its automated annotations will then assist an argumentation engineer in her manual reconstruction and visualization of the arguments in DebateGraph.
This pilot development will create a semi-automated, interactive, integrated, modular tool set to extract, reconstruct, and visualise arguments. It will integrate well-developed, published, state-of-the-art tools in information retrieval and extraction, visualisation, and computational approaches to abstract and instantiate argumentation.


Other work

1. Development of GATE tools such as TermRaider: a hybrid term extractor of single-word and multiword nominal term candidates, using lexico-syntactic patterns and a number of statistical termhood scorers.
TermRaider is now an integral part of the GATE architecture.

2. Consultancy: bespoke information extraction for clients
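
The hybrid idea behind a term extractor such as TermRaider (item 1 above), combining a lexico-syntactic pattern with a statistical termhood score, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and does not reproduce TermRaider's actual patterns or scorers: the input is assumed to be already part-of-speech tagged with ADJ and NOUN labels, the pattern is simply (ADJ|NOUN)* NOUN, the score is a tf-idf variant, and both function names are hypothetical.

```python
import math
from collections import Counter


def candidate_terms(tagged_tokens):
    """Yield maximal runs matching (ADJ|NOUN)* NOUN as lower-cased strings."""
    run = []
    # A sentinel token flushes a run that reaches the end of the input.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((word.lower(), tag))
            continue
        # Trim trailing adjectives so that the candidate ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            yield " ".join(w for w, _ in run)
        run = []


def tfidf_scores(tagged_docs):
    """Score each candidate: total frequency * log(1 + N / document frequency)."""
    per_doc = [Counter(candidate_terms(doc)) for doc in tagged_docs]
    n_docs = len(per_doc)
    doc_freq = Counter(term for counts in per_doc for term in counts)
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    return {t: total[t] * math.log(1 + n_docs / doc_freq[t]) for t in total}


if __name__ == "__main__":
    docs = [
        [("The", "DET"), ("burgh", "NOUN"), ("register", "NOUN"),
         ("records", "VERB"), ("legal", "ADJ"), ("disputes", "NOUN")],
        [("A", "DET"), ("burgh", "NOUN"), ("register", "NOUN"),
         ("is", "VERB"), ("old", "ADJ")],
    ]
    for term, score in sorted(tfidf_scores(docs).items(), key=lambda kv: -kv[1]):
        print(f"{score:.3f}  {term}")
```

Only maximal runs are proposed here, so a single noun inside a longer multiword candidate is not listed separately; a fuller extractor would typically generate and score nested candidates as well.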

Previous projects

Selected publications

My CV for anyone interested in my wanderings.


The nature of the linguistic information contained in language resources

Member of ISO TC37/SC4 standardization effort for terminological and linguistic resources

Member of W3C OntoLex standardization community for multilingual lexical information

Regular Polysemy and Computer Understanding




Department of Computer Science
Regent Court
211 Portobello Street
Sheffield S1 4DP
Tel: +44-114-2221902

Fax: +44-114-2221810

Email: w.peters at sheffield.ac.uk

The Department of Computer Science