Fabio Ciravegna
Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
email: F.Ciravegna@dcs.shef.ac.uk
www: www.dcs.shef.ac.uk/~fabio/
Knowledge management
is the key source for competitive advantage. The success or failure of
a company can depend on the ability to find the right information at the
right time and to correctly integrate new information with existing structured
knowledge, in order to facilitate communication and knowledge sharing and
to support knowledge-based organisations. The vast majority of information
is textual, so tools for structuring textual data on the basis of its
content represent one of the fundamental steps in successfully managing
information. The Web explosion (and the increasing use of Internet and intranet
technologies as a core channel for communication) shifts the focus towards
Web-based documents and texts.
Amilcare
is a system for Information
Extraction from Web documents for Knowledge Management that provides both
accuracy and easy user customisation. It implements the (LP)²
algorithm [Ciravegna 2000, Ciravegna 2001a, Ciravegna 2001b]. It retains
most of the characteristics of previous implementations of (LP)²,
such as Learning Pinocchio,
in terms of ease of use, but it includes a number of new features that:
Reduce the application development
time (by reducing both the learning time and the number of training sessions
needed for application tuning)
Support users in the whole application
development process, from the initial task definition to the application
delivery and use.
Supporting Users
Amilcare comprehensively supports
the user in the whole application development cycle, from design to delivery
and even during post-marketing assistance via its unique set of tools.
Human-computer interaction experts and information extraction experts have
worked together in the design of tools for user support.
Application development
is divided into the following steps:
-
Application design: the
goal of this step is to define a template, i.e., a kind of form that the
system must fill with the extracted information. Amilcare provides
a set of tools that help the user identify the correct application
settings: a
graphical interface that allows information to be highlighted in example
texts, coupled
with a set of methods for the semi-automatic organization of information
into templates and (in future releases)
unsupervised methods that help identify the information present
in the relevant documents. Because choosing a representative
set of texts may be difficult, a number of statistical tools are
provided for checking the representativeness of the corpus selected
by the user, so as to avoid the (not infrequent) problem of poor example
selection.
-
System training: in this
phase the system learns how to extract information for a particular
application by analysing a number of user-defined examples (i.e. a
set of documents annotated with the information to be extracted).
A simple
graphical interface is provided that allows information to be highlighted
with the mouse. Because providing examples can be tedious,
Amilcare provides facilities for reducing the quantity of texts to
be tagged via active learning, a strategy that may reduce the
need for training examples by up to 80%.
-
Result validation:
a fundamental step in application development is the tuning
of results according to the specific application needs. Given that
a 100% accurate information extraction process is beyond the reach of
current technology, it is necessary to balance the ability
to find information (recall) against the precision of information identification,
so as to identify the right mix of precision and recall. Amilcare provides
a set of tools for result monitoring, both from a qualitative point
of view (inspecting the system results on a set of test texts with
error highlighting) and from a statistical point of view (accuracy, precision,
recall).
Amilcare’s tuning
interface is
designed to bridge the user’s qualitative view
(“you are not capturing enough information”)
and the numerical concepts the system is able to manipulate
(e.g. moving error thresholds in order to obtain higher recall).
The CPU time needed for retuning is 1/10 of the initial learning time.
-
Application delivery: once
the system performance has been tuned to the application needs, the
information extraction engine can be delivered as a black-box module
to be integrated into the user environment. A powerful API allows text
feeding and result extraction.
-
Post-marketing monitoring:
Amilcare provides tools that are fundamental once the application
has been delivered to the final user. They make it possible to statistically compare
both the corpus received for analysis with the training corpus, and the results obtained at
training/testing time with those
obtained on the corpus received. This is fundamental because the kind
of texts received can change over time (e.g. initially only very short
texts are received but then long texts start to appear) and the user
must be sure that such a change (which may not be noticed by the system
administrator) does not affect the system performance. Moreover, Amilcare
is also able to statistically monitor its accuracy on new texts by
measuring the statistical distribution of identified information across
texts and issuing a warning if that distribution differs radically
from the one observed on the training corpus.
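The monitoring idea in the last step can be illustrated with a small sketch. This is not Amilcare's actual implementation: the function names, the slot names, the use of total variation distance and the 0.3 threshold are all assumptions made for the example. It compares how often each kind of information was identified on the training corpus with how often it is identified on newly received texts, and flags the case where the two distributions differ radically.

```python
from collections import Counter


def slot_distribution(extractions):
    """Relative frequency of each slot name in a list of extracted slot names."""
    counts = Counter(extractions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {slot: n / total for slot, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    slots = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in slots)


def check_drift(training_slots, incoming_slots, threshold=0.3):
    """Return (distance, drifted); the threshold is an assumed, tunable value."""
    distance = total_variation(slot_distribution(training_slots),
                               slot_distribution(incoming_slots))
    return distance, distance > threshold


# Hypothetical slot names for a seminar-announcement template.
training = ["speaker", "location", "start_time", "end_time"] * 25
incoming = ["speaker"] * 70 + ["location"] * 30

distance, drifted = check_drift(training, incoming)
if drifted:
    print(f"WARNING: extraction distribution differs from training "
          f"(distance {distance:.2f})")
```

A real monitor would probably use a proper statistical test and per-document counts; the distance above is only meant to make the comparison concrete.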
The application development
cycle is shown in the next figure.
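The balance between recall and precision described under result validation can be made concrete with a short sketch. It is an illustration only: the candidate items, their confidence scores and the two threshold values are invented, and a single confidence threshold is used here as a stand-in for the error thresholds that Amilcare actually manipulates. Lowering the threshold lets more candidates through, which raises recall at the cost of precision.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of predicted items against a gold set.

    By convention, precision is 1.0 when nothing is predicted and
    recall is 1.0 when there is nothing to find.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


def extract(candidates, threshold):
    """Keep the candidates whose confidence score reaches the threshold."""
    return {item for item, score in candidates if score >= threshold}


# Hypothetical candidates: (extracted item, confidence score).
candidates = [("Dr. Smith", 0.95), ("Room 101", 0.80), ("3pm", 0.60), ("Monday", 0.40)]
# Hypothetical gold standard: the items a human annotator marked as correct.
gold = {"Dr. Smith", "Room 101", "3pm"}

strict = precision_recall(extract(candidates, 0.75), gold)   # fewer, safer extractions
lenient = precision_recall(extract(candidates, 0.30), gold)  # more extractions, more errors

print("strict  (precision, recall):", strict)
print("lenient (precision, recall):", lenient)
```

With these invented numbers the strict setting misses one correct item, while the lenient setting finds every correct item but also lets one wrong item through. That is the trade-off a user expresses qualitatively as "you are not capturing enough information".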
Acknowledgement
Amilcare’s development
is supported under the Advanced Knowledge Technologies (AKT) Interdisciplinary
Research Collaboration (IRC), which is sponsored by the UK Engineering
and Physical Sciences Research Council under grant number GR/N15764/01.
The views and conclusions contained herein are those of the authors and
should not be interpreted as necessarily representing official policies or
endorsements, either express or implied, of the EPSRC or any other member
of the AKT IRC.
Amilcare is based on
Gate,
a tool for architectures for language engineering developed
at the University of Sheffield. Gate is used for preprocessing texts, i.e.
for tokenization, sentence identification, part-of-speech tagging and gazetteer
lookup.
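To give a rough idea of what these preprocessing steps produce, the sketch below imitates tokenization, sentence identification and gazetteer lookup in plain Python. It does not use Gate or its API: the regular expressions, the function names and the tiny gazetteer are assumptions chosen for illustration, and part-of-speech tagging is left out because it needs a trained tagger.

```python
import re

# Hypothetical gazetteer: lower-cased surface form -> semantic class.
GAZETTEER = {"sheffield": "location", "monday": "day", "fabio": "first_name"}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def gazetteer_lookup(tokens):
    """Pair each token with its gazetteer class, or None if it is not listed."""
    return [(tok, GAZETTEER.get(tok.lower())) for tok in tokens]


text = "Fabio works in Sheffield. The seminar is on Monday."
sentences = split_sentences(text)
annotated = [gazetteer_lookup(tokenize(s)) for s in sentences]

for sentence in annotated:
    print(sentence)
```

The learning algorithm then works on annotated tokens of this kind rather than on raw character strings.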