Angus Roberts

My research interests began with Clinical Information Systems: their architecture, and the use of controlled vocabularies within them. These interests stemmed from real questions arising when writing real applications as a system developer in the NHS. I then worked for a few years as a Research Associate for Alan Rector in the Medical Informatics Group at Manchester University. Some of my work was providing tool support for medical knowledge engineers. This developed my interests in knowledge representation and ontologies.

Some of my work in Manchester was also concerned with ways of getting clinical information into computers. This led me to ask: how far is it possible to extract clinical information from natural language? I now work as a research associate in the Natural Language Processing Group at Sheffield University. I am also studying for a PhD with Rob Gaizauskas. My topic is the extraction of part-whole relations (known as meronymy) from clinical records and other biomedical texts. You can download a copy of my first year report and an A1 poster introducing the topic from the links below. You can also read the following brief explanation.

Extracting Parts and Wholes from Biomedical Text

Meronymy

We are used to seeing words, or lexemes, listed alphabetically in dictionaries. In terms of the meaning of the lexemes, this ordering has little relevance beyond shared word roots. In the OED, jam is sandwiched between jalpaite (a sulphide) and jama (a cotton gown). It is a long way from bread, conserve, and raspberry (Oxford English Dictionary, Second Edition, 1989). Vocabularies, however, do have a natural structure: one that we rely on for language understanding. As Cruse says,

Natural vocabularies are not random assemblages of points in semantic space: there are quite strong regularising and structuring tendencies [5, p146].

Consider this example (examples in this report are not from real patient records, but are indicative of the type of material seen in patient records),

...he has noticeable shortness of breath. I am sure that multiple pulmonary emboli are the cause of his dyspnoea.

Our understanding of the text depends on knowing that shortness of breath and dyspnoea are related in some way. Indeed, the intuition is so strong that someone who does not know the meaning of dyspnoea might hazard a guess that it is the same as shortness of breath. There is a lexical relation, or sense relation, between the two lexemes -- in this case the relation of synonymy. There are many other kinds of lexical relations. The next example shows hyponymy, equivalent to the conceptual relation of isA or kindOf, as found in taxonomies:

...I feel that a second line hormone would be of most benefit. Arimidex has therefore been prescribed ...

This report looks at a third such lexical relation, meronymy. Meronymy relates the lexeme for a part to that for a whole. It is equivalent to the conceptual relation of partOf. The example below shows a typical meronym. When we read the text, we understand that the frontal lobes being discussed are not some new entity unrelated to what has gone before: they are a part of the previously mentioned brain.

MRI sections were taken through the brain. Frontal lobe shrinkage suggests a generalised cerebral atrophy.

Reasoning about meronymy sometimes requires knowledge of more than a single part-whole relation. Meronymy, like some other lexical relations, appears to be transitive. Consider this example:

His chest X-Ray shows fibrosis of the left upper lobe.

Left upper lobe is not directly part of chest, but is part of lung, which is itself part of (or at least contained within) chest. The nature of meronymy's transitivity has been the subject of debate, and is of some importance.
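The chain of reasoning in the chest X-ray example can be sketched in a few lines of Python. This is only an illustration, not the method used in the research: the fact list and the function name `wholes_of` are hypothetical, and the sketch assumes that partOf is unconditionally transitive, which is exactly the point under debate.

```python
from collections import defaultdict

# Direct part-whole facts as (part, whole) pairs.
# Illustrative only: taken from the examples in the text, not from a real resource.
PART_OF = [
    ("left upper lobe", "lung"),
    ("lung", "chest"),
    ("frontal lobe", "brain"),
]


def wholes_of(part, facts):
    """Return every whole reachable from `part` by following partOf links.

    Assumes partOf is transitive, which is a simplification.
    """
    direct = defaultdict(set)
    for p, w in facts:
        direct[p].add(w)

    seen = set()
    stack = [part]
    while stack:
        current = stack.pop()
        for whole in direct.get(current, ()):
            if whole not in seen:
                seen.add(whole)
                stack.append(whole)
    return seen


if __name__ == "__main__":
    # The left upper lobe is linked to the chest only through the lung.
    print(sorted(wholes_of("left upper lobe", PART_OF)))
```

A resolver that held only the direct facts would fail to connect left upper lobe with chest; following the links through lung is what makes the connection.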

A Language Engineering Motivation

In all of the above examples, the underlined lexemes refer to objects and concepts in the world. Within each example, the pairs of underlined lexemes refer either to the same object or to related objects: they co-refer. Any model of natural language must account for this co-reference, and the way in which co-references are resolved. This is especially so if a model of natural language is to be used for practical, real-world problems, i.e. for language engineering. In Information Extraction (IE), for example, co-reference resolution is generally seen as a core component [4,7]. As Cowie and Lehnert say, referring to three problems in discourse analysis (noun-phrase analysis, co-reference resolution and relational link recognition),

Of the three problems, co-reference resolution is by far the most challenging. Important entities are likely to be mentioned many times and may never be described twice by the same noun phrase.

If we are to resolve the references between different lexemes, it is obvious that some knowledge beyond lexical and syntactic structure is required. As Meyer and Dale argue, referring to such co-reference as associative anaphora,

...the absence of surface level cues makes associative anaphora difficult to handle using the sort of shallow processing techniques that have become dominant over the past decade [9].

Berland and Charniak back up this claim by pointing to the widespread usage of general lexical resources such as WordNet, which encodes lexical relations [2]. It has, however, become a truism in Natural Language Processing (NLP), and in Artificial Intelligence in general, that the manual construction of such knowledge resources is untenable. Such large-scale knowledge engineering programmes are often greeted with cynicism. It is especially difficult to imagine the scale of programme needed to build and maintain a knowledge resource in a continually developing subject domain such as biomedicine. And if built, what would this resource reflect? Many such resources are built with abstract notions of application independence in mind. As Bateman pointed out, the world reflected in them is divorced from language [1].

For these reasons, a significant line of research has investigated the automated building of such resources from corpora, both as explicitly encoded relations, and as implicit knowledge (for example, as statistical co-occurrences). Automatic extraction of lexical knowledge is clearly preferable to manual techniques when adapting to a new domain, and is more likely to reflect the reality of language in use. To date, most work has concentrated on taxonomy extraction. Extraction of other relations, such as meronymy, has been less studied, and where it has been studied, several shortcomings are apparent.

An End-user Motivation

In addition to the above linguistic motivation for learning parts and wholes, there is a second set of needs and uses focused on the end user of information extracted from text. This report and the planned research are set in the context of the Medical Research Council (MRC) CLEF project: a Clinical E-Science Framework [11]. CLEF seeks to extract structured information from the dictated text which makes up the bulk of current and past UK medical records, thus making it amenable to machine processing, for use in the support of basic health-care and clinical research [6]. Structured information will be extracted using Information Extraction technologies [4,7], and stored in a database.

Queries across this database may refer to extracted entities, classes of entities, and the relations between them, such as partOf. The value of information in the IE database comes not only from this ability to query specific extracted entities and relations, but also from the integration of other domain knowledge, allowing constraints to be placed on queries. A user could pose the query shown in Example 1 below (examples in this section are adapted from trials registered with the European Organisation for Research and Treatment of Cancer protocols database, http://www.eortc.be/). The result should include patients with cancers of parts of the pancreas and duodenum, such as cancer of the Head of the Pancreas and cancer of the Ampulla of Vater. Example 2 should retrieve patients with ACC sited in the salivary glands, but not in the breast. Similarly, Example 3 should retrieve patients with sarcomas located in the chest and pelvis, but not in the limbs.

  1. For all patients with cancer of the pancreas or duodenum, and who have had a pancreaticoduodenectomy, retrieve numbers of patients by treatment and survival time.
  2. Retrieve patients on Gemcitabine with recurrent or metastatic Adenoid Cystic Carcinoma (ACC) of the head and neck.
  3. Retrieve adult patients on Gemcitabine with advanced soft tissue sarcomas located in the trunk of the body.

The examples all show the need for some external knowledge resource encoding the location, containment and spatial relationship of parts of the body -- primarily through the partOf relation. This is, of course, not the only relational knowledge that will be needed by such a query engine. It is, however, seen as having a special place in medical concept representation, second only to isA. Medicine is grounded in the physical objects of anatomy and the processes of physiology, both of which have complex partonomic structures. partOf is therefore critical in concept representation [3,10,12,13].
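To make the role of partOf in such queries concrete, here is a minimal sketch of site-based retrieval over a toy partonomy. Everything in it is an assumption made for illustration: the `PARENT` table, the patient records and the function names are hypothetical, the table collapses partOf and containment into a single link, and it is not a clinically authoritative anatomy or the CLEF query engine.

```python
# Toy partonomy: each site maps to the whole that contains it.
# Hypothetical and deliberately tiny; sites absent from the table
# (e.g. "breast", "leg") simply have no recorded whole.
PARENT = {
    "head of pancreas": "pancreas",
    "ampulla of Vater": "duodenum",
    "salivary gland": "head and neck",
    "chest": "trunk",
    "pelvis": "trunk",
}

# Hypothetical extracted records: (patient id, tumour site).
RECORDS = [
    ("p1", "head of pancreas"),
    ("p2", "ampulla of Vater"),
    ("p3", "salivary gland"),
    ("p4", "breast"),
    ("p5", "pelvis"),
    ("p6", "leg"),
]


def is_within(site, region, parent=PARENT):
    """True if `site` is `region` itself or a (transitive) part of it."""
    while site is not None:
        if site == region:
            return True
        site = parent.get(site)
    return False


def patients_with_site_in(records, regions):
    """Return ids of patients whose site falls within any of `regions`."""
    return [
        pid
        for pid, site in records
        if any(is_within(site, region) for region in regions)
    ]


if __name__ == "__main__":
    print(patients_with_site_in(RECORDS, ["pancreas", "duodenum"]))
    print(patients_with_site_in(RECORDS, ["head and neck"]))
    print(patients_with_site_in(RECORDS, ["trunk"]))
```

Without the partonomy, a query for "pancreas" would match none of these records, since no patient is recorded with that exact site. The quality of the answers therefore depends directly on how complete the part-whole resource is.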

Learning Parts and Wholes from Biomedical Text

The above sections illustrate the need for a resource encoding meronymy, for use both in the NLP methodologies at the heart of Information Extraction, and in the end-user interaction with extracted information. A specific example was given of biomedicine, and it was argued that meronymy has a special place in biomedical text.

This importance is apparent through the inclusion of meronomies in many existing biomedical knowledge resources [14,8,12,13]. These resources, however, suffer from three drawbacks: they were not designed with language processing in mind; they are incomplete; they do not keep pace with their rapidly changing subject domain. It was also mentioned in the above sections that research has been carried out on the extraction of taxonomies, together with a small amount of work on meronymy extraction.

This leads to the central questions to be addressed by the planned research. Can the corpus techniques used to construct taxonomies be used to construct meronomies? Can problems with the existing work on meronymy extraction be overcome? How would a corpus-based meronomy compare to existing, manually constructed meronomies, and how would it perform in comparison to them?

Bibliography

1
John A. Bateman.
The theoretical status of ontologies in natural language processing.
In Susanne Preuß and Birte Schmitz, editors, Text Representation and Domain Modelling - ideas from linguistics and AI, pages 50-99, Berlin, Germany, May 1992. KIT-Report 97, Technische Universität Berlin.
(Papers from KIT-FAST Workshop, Technical University Berlin, October 9th - 11th 1991). Also available from the Computation and Language E-print archive: cmp-lg/9704010.

2
Matthew Berland and Eugene Charniak.
Finding parts in very large corpora.
In Robert Dale and Ken Church, editors, Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 57-64. Association for Computational Linguistics, Morgan Kaufmann, 1999.

3
Jochen Bernauer.
Analysis of part-whole relation and subsumption in the medical domain.
Data and Knowledge Engineering, 20(3):405-415, November 1996.

4
Jim Cowie and Wendy Lehnert.
Information extraction.
Communications of the ACM special issue on Natural Language Processing, 39(1):80-91, January 1996.

5
David Alan Cruse.
Meaning in Language: An Introduction to Semantics and Pragmatics.
Oxford Textbooks in Linguistics. Oxford University Press, Oxford, 2000.

6
Robert Gaizauskas, Mark Hepple, Neil Davis, Yikun Guo, Henk Harkema, Angus Roberts, and Ian Roberts.
AMBIT: Acquiring Medical and Biological Information from Text.
In Simon Cox, editor, Proceedings of UK e-Science All Hands Meeting 2003, pages 370-373, Nottingham, UK, September 2003.

7
Robert Gaizauskas and Yorick Wilks.
Information extraction: Beyond document retrieval.
Journal of Documentation, 54(1), 1997.

8
Donald A. Lindberg, Betsy L. Humphreys, and Alexa T. McCray.
The unified medical language system.
Methods of Information in Medicine, 32(4):281-291, August 1993.

9
Josef Meyer and Robert Dale.
Using the WordNet Hierarchy for Associative Anaphora Resolution.
In Proceedings of SemaNet'02: Building and Using Semantic Networks, Taipei, Taiwan, August 2002.

10
Alan Rector.
Medical Informatics.
In Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter Patel-Schneider, editors, Description Logics Handbook, chapter 13, pages 406-426. Cambridge University Press, Cambridge, 2003.

11
Alan Rector, Jeremy Rogers, Adel Taweel, David Ingram, Dipak Kalra, Jo Milan, Peter Singleton, Robert Gaizauskas, Mark Hepple, Donia Scott, and Richard Power.
CLEF -- Joining up Healthcare with Clinical and Post-Genomic Research.
In Simon Cox, editor, Proceedings of UK e-Science All Hands Meeting 2003, pages 264-267, Nottingham, UK, September 2003.

12
Jeremy E. Rogers and Alan L. Rector.
GALEN's model of parts and wholes: Experience and comparisons.
In J. Marc Overhage, editor, Proceedings of the 2000 American Medical Informatics Association Annual Symposium (AMIA 2000), pages 714-718, Philadelphia PA, 2000. American Medical Informatics Association, Hanley and Belfus Inc.

13
Cornelius Rosse, José L. Mejino, Bharath R. Modayur, Rex Jakobovits, Kevin P. Hinshaw, and James F. Brinkley.
Motivation and organizational principles for anatomical knowledge representation: The digital anatomist symbolic knowledge base.
Journal of the American Medical Informatics Association, 5(1):17-40, January-February 1998.

14
The Gene Ontology Consortium.
Gene ontology: tool for the unification of biology.
Nature Genetics, 25:25-29, May 2000.

About this document ...

The translation was initiated by Angus Roberts on 2003-11-17


Last modified: Mon 8 Jun 2009 11:37:32