Angus Roberts
My research interests began with Clinical Information Systems: their architecture, and the use of controlled vocabularies within them. These interests stemmed from real questions arising when writing real applications as a system developer in the NHS. I then worked for a few years as a Research Associate for Alan Rector in the Medical Informatics Group at Manchester University. Some of my work was providing tool support for medical knowledge engineers. This developed my interests in knowledge representation and ontologies.
Some of my work in Manchester was also concerned with ways of getting clinical information into computers. This led me to ask: how far is it possible to extract clinical information from natural language? I now work as a research associate in the Natural Language Processing Group at Sheffield University. I am also studying for a PhD with Rob Gaizauskas. My topic is the extraction of part-whole relations (known as meronymy) from clinical records and other biomedical texts. You can download a copy of my first year report and an A1 poster introducing the topic from the links below. You can also read the following brief explanation.
We are used to seeing words, or lexemes, listed alphabetically in dictionaries. In terms of the meaning of the lexemes, this ordering has little relevance beyond shared word roots. In the OED, jam is sandwiched between jalpaite (a sulphide) and jama (a cotton gown). It is a long way from bread, conserve, and raspberry (Oxford English Dictionary, Second Edition, 1989). Vocabularies, however, do have a natural structure: one that we rely on for language understanding. As Cruse says,
Natural vocabularies are not random assemblages of points in semantic space: there are quite strong regularising and structuring tendencies [5, p146].
Consider this example (examples in this report are not from real patient records, but are indicative of the type of material seen in patient records),
...he has noticeable shortness of breath. I am sure that multiple pulmonary emboli are the cause of his dyspnoea.
Our understanding of the text depends on knowing that shortness of breath and dyspnoea are related in some way. Indeed, the intuition is so strong that someone who does not know the meaning of dyspnoea might hazard a guess that it is the same as shortness of breath. There is a lexical relation, or sense relation, between the two lexemes -- in this case the relation of synonymy. There are many other kinds of lexical relations. The next example shows hyponymy, equivalent to the conceptual relation of isA or kindOf, as found in taxonomies:
...I feel that a second line hormone would be of most benefit. Arimidex has therefore been prescribed ...
This report looks at a third such lexical relation, meronymy. Meronymy relates the lexeme for a part to that for a whole. It is equivalent to the conceptual relation of partOf. The example below shows a typical meronym. When we read the text, we understand that the frontal lobes being discussed are not some new entity unrelated to what has gone before: they are a part of the previously mentioned brain.
MRI sections were taken through the brain. Frontal lobe shrinkage suggests a generalised cerebral atrophy.
Reasoning about meronymy sometimes requires knowledge of more than a single part-whole relation. Meronymy, like some other lexical relations, appears to be transitive. Consider this example:
His chest X-Ray shows fibrosis of the left upper lobe.
Here, left upper lobe is not part of chest, but is part of lung, which is itself part of (or at least contained within) chest. The nature of meronymy's transitivity has been the subject of debate, and is of some importance.
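The chain of reasoning in the chest X-ray example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table of direct part-whole links and the function name `is_part_of` are invented here, and the sketch treats partOf as unrestrictedly transitive, which is exactly the point that remains under debate.

```python
# Hypothetical direct partOf links, invented for illustration only.
# They mirror the examples in the text, not any real anatomical resource.
DIRECT_PART_OF = {
    "left upper lobe": {"lung"},
    "lung": {"chest"},
    "frontal lobe": {"brain"},
}


def is_part_of(part, whole, direct=DIRECT_PART_OF):
    """Return True if `part` reaches `whole` through one or more partOf links.

    Assumes partOf is fully transitive, which is a simplification.
    """
    seen = set()
    stack = [part]
    while stack:
        current = stack.pop()
        for parent in direct.get(current, ()):
            if parent == whole:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


if __name__ == "__main__":
    # No direct link exists, but the chain lobe -> lung -> chest does.
    print(is_part_of("left upper lobe", "chest"))
    print(is_part_of("left upper lobe", "brain"))
```

A model that restricts transitivity (for example, by distinguishing kinds of part-whole relation) would need a check at each step of the chain rather than the blanket traversal used here.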
In all of the above examples, the underlined lexemes refer to objects and concepts in the world. Within each example, the pairs of underlined lexemes refer either to the same object or to related objects: they co-refer. Any model of natural language must account for this co-reference, and the way in which co-references are resolved. This is especially so if a model of natural language is to be used for practical, real-world problems, i.e. for language engineering. In Information Extraction (IE), for example, co-reference resolution is generally seen as a core component [4,7]. As Cowie and Lehnert say, referring to three problems in discourse analysis (noun-phrase analysis, co-reference resolution and relational link recognition),
Of the three problems, co-reference resolution is by far the most challenging. Important entities are likely to be mentioned many times and may never be described twice by the same noun phrase.
If we are to resolve the references between different lexemes, it is obvious that some knowledge beyond lexical and syntactic structure is required. As Meyer and Dale argue, referring to such co-reference as associative anaphora,
...the absence of surface level cues makes associative anaphora difficult to handle using the sort of shallow processing techniques that have become dominant over the past decade [9].
Berland and Charniak back up this claim by pointing to the widespread usage of general lexical resources such as WordNet, which encodes lexical relations [2]. It has, however, become a truism in Natural Language Processing (NLP), and in Artificial Intelligence in general, that the manual construction of such knowledge resources is untenable. Such large-scale knowledge engineering programmes are often greeted with cynicism. It is especially difficult to imagine the scale of programme needed to build and maintain a knowledge resource in a continually developing subject domain such as biomedicine. And if built, what would this resource reflect? Many such resources are built with abstract notions of application independence in mind. As Bateman pointed out, the world reflected in them is divorced from language [1].
For these reasons, a significant line of research has investigated the automated building of such resources from corpora, both as explicitly encoded relations, and as implicit knowledge (for example, as statistical co-occurrences). Automatic extraction of lexical knowledge is clearly preferable to manual techniques when adapting to a new domain, and is more likely to reflect the reality of language in use. To date, most work has concentrated on taxonomy extraction. Extraction of other relations, such as meronymy, has been less studied, and where it has been studied, several shortcomings are apparent.
In addition to the above linguistic motivation for learning parts and wholes, there is a second set of needs and uses focused on the end user of information extracted from text. This report and the planned research are set in the context of the Medical Research Council (MRC) CLEF project: a Clinical E-Science Framework [11]. CLEF seeks to extract structured information from the dictated text which makes up the bulk of current and past UK medical records, thus making it amenable to machine processing, for use in the support of basic health-care and clinical research [6]. Structured information will be extracted using Information Extraction technologies [4,7], and stored in a database.
Queries across this database may refer to extracted entities, classes of entities, and the relations between them, such as partOf. The value of information in the IE database comes not only from this ability to query specific extracted entities and relations, but also through the integration of other domain knowledge, allowing constraints to be placed on queries. The user could ask the query shown in example 1 below (examples in this section are adapted from trials registered with the European Organisation for Research and Treatment of Cancer protocols database, http://www.eortc.be/). The result should include patients with cancers of parts of the pancreas and duodenum, such as cancer of the Head of the Pancreas and cancer of the Ampulla of Vater. Example 2 should retrieve patients with ACC sited in the salivary glands, but not in the breast. Similarly, Example 3 should retrieve patients with sarcomas located in the chest and pelvis, but not in the limbs.
The examples all show the need for some external knowledge resource encoding the location, containment and spatial relationship of parts of the body -- primarily through the partOf relation. This is, of course, not the only relational knowledge that will be needed by such a query engine. It is, however, seen as having a special place in medical concept representation, second only to isA. Medicine is grounded in the physical objects of anatomy and the processes of physiology, both of which have complex partonomic structures. partOf is therefore critical in concept representation [3,10,12,13].
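The kind of query constraint described above can be sketched as a simple expansion of a body site to all of its parts. The following is a minimal sketch, not the CLEF design: the partOf pairs, the patient records, and the function names `parts_of` and `patients_with_cancer_in` are all hypothetical, and the assignment of the Ampulla of Vater to the duodenum is an illustrative simplification.

```python
from collections import defaultdict

# Hypothetical (part, whole) pairs, invented for illustration only.
PART_OF = [
    ("head of pancreas", "pancreas"),
    ("ampulla of Vater", "duodenum"),  # simplifying assumption
    ("upper limb", "limbs"),
]

# Hypothetical extracted records: (patient id, cancer site).
RECORDS = [
    ("p1", "head of pancreas"),
    ("p2", "ampulla of Vater"),
    ("p3", "breast"),
    ("p4", "pancreas"),
]


def parts_of(whole, part_of_pairs):
    """Return `whole` plus every site reachable from it via partOf links."""
    children = defaultdict(set)
    for part, parent in part_of_pairs:
        children[parent].add(part)
    result, stack = {whole}, [whole]
    while stack:
        for child in children[stack.pop()]:
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


def patients_with_cancer_in(sites, records, part_of_pairs):
    """Return sorted ids of patients whose cancer site is a queried site or a part of one."""
    wanted = set()
    for site in sites:
        wanted |= parts_of(site, part_of_pairs)
    return sorted(pid for pid, site in records if site in wanted)


if __name__ == "__main__":
    print(patients_with_cancer_in(["pancreas", "duodenum"], RECORDS, PART_OF))
```

Without the partOf pairs, a query for pancreas would match only the record that names the pancreas itself. The expansion step is what brings in the records for its parts.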
The above sections illustrate the need for a resource encoding meronymy, for use both in the NLP methodologies at the heart of Information Extraction, and in the end-user interaction with extracted information. A specific example was given of biomedicine, and it was argued that meronymy has a special place in biomedical text.
This importance is apparent through the inclusion of meronomies in many existing biomedical knowledge resources [14,8,12,13]. These resources, however, suffer from three drawbacks: they were not designed with language processing in mind; they are incomplete; they do not keep pace with their rapidly changing subject domain. It was also mentioned in the above sections that research has been carried out on the extraction of taxonomies, together with a small amount of work on meronymy extraction.
This leads to the central questions to be addressed by the planned research. Can the corpus techniques used to construct taxonomies be used to construct meronomies? Can problems with the existing work on meronymy extraction be overcome? How would a corpus-based meronomy compare to existing, manually constructed meronomies, and how would it perform in comparison to them?
This document was generated using the LaTeX2HTML translator Version 99.2beta8 (1.43)
The translation was initiated by Angus Roberts on 2003-11-17