Science, engineering, innovation and scholarship are increasingly dependent on computer-based representations of complex structures, whether derived by data capture from the real world, or created from scratch, or a mixture of the two. We introduce the term 'digital objects' to refer to these structures.
Examples of these objects are
The mission of the centre we propose will be to 'understand' digital objects: their design, creation, analysis, maintenance, manipulation and application.
Studies based on digital objects require active interaction rather than passive absorption, for instance:
We request JIF funding to found the Sheffield Centre for the Understanding of Digital Objects (CUDO). CUDO will be housed in a single building which will contain state-of-the-art digitisation, computing and output rendering equipment covering the whole life-cycle of digital objects. The facilities will allow academics in the Faculties of Engineering, Science and Architecture to work together in research based on rich interactions with digital objects. In this collaboration the Department of Computer Science will be responsible for developing new types of software and improving existing tools in collaboration with researchers from the other departments. CUDO will thus enable the exploration of new ways to carry out fundamental research in the creation, analysis and manipulation of digital objects in key application domains utilising the latest technology. The aim is to reinforce Sheffield as a premier site in what will become a major new research direction: the understanding of complex digital objects representing space, sound and text.
Suitable space has been identified in the Amy Johnson Building.
This research direction has been identified as a major opportunity for the applications of new concepts in software science, and the continuing dramatic development of computer technology will create many exciting research opportunities.
[Ref. The Cornell Report [http://www.cs.cornell.edu/cis-dean/Task%20Force%20Final%20Report.htm]]
In the following sections we demonstrate that Sheffield is well placed to exploit digital object technology, by building on its strengths in science and engineering, and in application-driven computer science and artificial intelligence. In addition, many eminent scholars in the Arts and Humanities in Sheffield will be associated with the Centre and will be able to provide further challenging and exciting applications for the technologies and software which will be developed. The cognitive basis of many higher order human activities such as creativity and understanding will be the subject of fundamental scientific enquiry in order to develop new methods of creating and understanding digital objects.
The principal CUDO themes are:
(i) Developing digital object repositories or corpora and information retrieval mechanisms for them based on content and semantics. This contrasts with traditional database classification methods, which are expensive, inflexible and generally unsuitable for emerging applications in the efficient use and analysis of digital objects. Here we will build on key research carried out with the British Broadcasting Corporation in which automatic speech recognition has been used successfully to access BBC sound archive material without recourse to human transcription.
(ii) Capitalising on the availability of repositories of digital objects to build new ways of understanding the properties and uses of digital objects in fields as diverse as:
speech and music - increasing the understanding of the production and perception of speech and music sounds and interacting with such digital objects to identify new and revealing insights into their properties.
Summary of the scientific rationale given in the case for support
CUDO will address fundamental research problems from Computer Science, Architecture, Archaeology, Music, Bioinformatics and Molecular Biology involving the representation, recognition and processing of sounds, images and multi-dimensional digital objects and their use in creating, understanding and applying complex artefacts and computational models.
Areas of the digital arts and humanities which involve new ways of using intelligent software technology will also be involved, including researchers from English Language, English Literature, French, History and other arts subjects.
Software technologies of interest include: information retrieval; natural language processing; speech recognition; electronic music; machine learning; computer graphics; virtual reality; intelligent agent technology; multi-media information retrieval; robotics and simulation.
Key research issues
Many areas of scholarship and research increasingly use digital objects to explore and understand their domain. These objects can be of considerable complexity and they require sophisticated computers and software to enable them to be exploited to the full. In some cases these objects represent an opportunity for many researchers to investigate rare and delicate historical artefacts in ways that could not be supported without the digitisation process. In other cases researchers can investigate how complex constructions such as buildings and musical compositions can be created, altered and refined using suitable computer representations in order to better understand the properties and capabilities of the created artefact.
Underlying both of these endeavours is the need to organise and manage the repositories of these digital objects in efficient and cost-effective ways. Standard database technology can be inflexible and expensive to maintain and incapable of supporting non-standard queries. Typically, researchers are analysing and annotating their objects in many and varied ways. They also need to be able to retrieve information and to search their repositories in ways which are based on the content of the objects rather than on some arbitrary and fixed classification in the database model.
For instance, speech researchers in the Department of Computer Science have implemented a novel method of dealing with the sound archives of the BBC's radio programmes. Instead of relying on a traditional database with the associated expense of classifying the contents down to a fine level of detail the method involves using speech recognition software to automatically translate the spoken words into text, which can then be searched and exploited using natural language processing techniques (another strength of Sheffield).
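The retrieval approach described above can be illustrated with a minimal sketch: an inverted index built over recognised transcripts, searchable by content rather than by a fixed classification. The programme identifiers and transcript text below are invented for illustration, not drawn from the BBC archive work.

```python
from collections import defaultdict

# Hypothetical miniature transcripts, standing in for automatic speech
# recognition output (programme id -> recognised text).
transcripts = {
    "prog_001": "interview about steel manufacturing in sheffield",
    "prog_002": "radio drama set in a sheffield cutlery workshop",
    "prog_003": "news item on archaeological excavation at creswell",
}

# Build an inverted index: word -> set of programme ids containing it.
index = defaultdict(set)
for prog_id, text in transcripts.items():
    for word in text.split():
        index[word].add(prog_id)

def search(query):
    """Return programmes whose transcript contains every query word."""
    words = query.lower().split()
    if not words:
        return set()
    results = index[words[0]].copy()
    for w in words[1:]:
        results &= index[w]
    return results

print(sorted(search("sheffield cutlery")))  # -> ['prog_002']
```

No human classification effort is needed: the index is derived entirely from the recognised speech, which is the point of the approach described above.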
Other repositories of digital objects, such as three-dimensional images, will be important to researchers in the Centre, and it is here that some interesting research will be carried out into using suitable feature extraction software to capture essential information about the image and represent it in a suitable dialect of XML.
Researchers can then further annotate these XML representations associated with the digital object to provide added value, perhaps identifying important properties such as the provenance of an archaeological discovery, or some acoustic property of a surface in a building design. Because XML is extensible and flexible it is a good candidate for marking up digital objects in this way. Natural language technology could then provide extensive opportunities to extract information from such an electronic archive; in addition, XQL is emerging as an extensible query language and may provide further benefits. Already XMLaec has been standardised for use in the architecture and construction industry and is being used to represent the core three-dimensional structure at the heart of major construction projects.
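As a concrete illustration of this kind of mark-up and incremental annotation, the following sketch builds and queries a small XML record for a digitised artefact. All element and attribute names here are invented for illustration; they are not drawn from XMLaec or any other standard.

```python
import xml.etree.ElementTree as ET

# A sketch of how a digitised artefact might be marked up; the element
# and attribute names are hypothetical, chosen only for this example.
artefact = ET.Element("artefact", id="hawley-0042")
ET.SubElement(artefact, "type").text = "flat-bladed knife"
ET.SubElement(artefact, "scan", format="mesh").text = "hawley-0042.obj"

# A researcher later adds provenance as a further annotation, without
# disturbing the existing structure -- XML's extensibility at work.
note = ET.SubElement(artefact, "annotation", author="researcher-1")
ET.SubElement(note, "provenance").text = "Sheffield, c. 1900"

# The record remains searchable by content: retrieve the annotation.
found = artefact.findtext("annotation/provenance")
print(found)  # -> Sheffield, c. 1900
```

Because annotations are simply additional elements, different researchers can layer independent observations onto the same object record over time.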
The philosophy of the centre is to provide the resources and the expertise for the digitisation of a variety of artefacts: visual, aural and textual. These can then be manipulated and analysed as digital objects, stored and retrieved efficiently from a suitable repository and shared, where appropriate, with other researchers using the medium of the internet.
Along with the processes of analysing and exploring given digital objects we will also be looking at how to create objects as representations of real solutions to specific design problems such as virtual buildings. The issue of exploiting the resources of computers to assist composers to create musical compositions and performances will also be an important focus.
The issue of creativity in the design of buildings, artefacts, musical compositions etc., is also an important area. Having access to digital objects representing the various stages in the design or composition of such an artistic product will enable us to represent more rigorously the design process as being a series of simpler processes and evaluations which lead to a finished object.
It will also be interesting to see how we could capture some of the simpler rules which might enable the creative process to be imitated, albeit crudely, by some computational device such as a robot. This is, of course, rather speculative, but the availability of such a resource and the artistic and creative individuals associated with such activities could lead to a rich and rewarding interaction with computer scientists.
Research topics will include:
information retrieval from Multi-Media Archives - aural, visual and textual;
extracting and analysing information from electronic texts;
manipulating and understanding electronic images and animations;
generating and exploring electronic artefacts from various sources.
building multi-dimensional models of molecular species;
creating sophisticated behavioural models of cellular systems;
experimenting with computational models of artistic appreciation and creativity;
building models of architectural and archaeological built environments;
acoustical modelling of buildings (auditory VR);
acoustic modelling of buildings and urban environment (auralization).
Perception and understanding
virtual music instruments;
historical and archaeological reconstructions;
interactive evaluation of multi-media information retrieval systems.
analysis of music performance and composition;
study of creativity in architectural design;
design and fabrication routes to found archaeological relics;
creativity in literature.
The University of Sheffield has a strong research reputation in a wide variety of subjects within Engineering, Science, the Arts and Humanities. Of the departments involved in CUDO, Archaeology/Prehistory and Information Studies are RAE grade 5*, Automatic Control and Systems Engineering, Music and Architecture Grade 5, and Computer Science is Grade 4. The University has a strong international presence in novel applications of computer science, in particular in areas relating to language, speech, graphics, artificial intelligence and software development. The Department of Computer Science has recently been awarded an Interdisciplinary Research Centre (IRC) in Information Retrieval by EPSRC with the title Accessing Knowledge Technologies.
Associated with the Centre will be The Humanities Research Institute, which has an international reputation and has recently been awarded a Queen's Anniversary Prize, as well as many of the Departments of the Faculty of Arts (RAE grades 5 and 4).
This proposal brings together researchers from all these areas with a common interest in the exploitation of digital technology and the application of state of the art computer software to interesting research problems in the understanding of new types of digital objects that are appearing in science, engineering, the arts and humanities. This proposal capitalises on technology that was unavailable during previous JIF calls.
Many authorities have identified that there will be an explosion in opportunities for creating and exploiting digital media in many areas of life. The rate of progress of software technology will enable activities that were undreamt of until recently, and will create many new opportunities.
CUDO will foster multi-disciplinary research collaborations based around common understandings of the possibilities of new technology. It will allow for the development of new computational tools which will be stimulated by problems in creating, manipulating and analysing digital objects of every type, in a variety of areas. The following detailed case can only describe some of the projects that the Centre can support in the 10 pages available.
The aim here is to investigate a novel design workspace for architects that will support their exploratory design process and facilitate more effective communication of the proposed design concepts to their clients. [42 - 54] The scenario presented to the architects will be the creation of their design using a component library and virtual modelling tools on a Responsive Workbench using a two-handed glove-based interface. The users will be able to grab virtual construction parts such as walls, doors, columns and beams using a dataglove-based interface and position them in 3D space or join them with other construction components to generate their design concepts. The Responsive Workbench will provide a much more natural visual interface for the architects to carry out the design on a "virtual 3D drafting table". This will offer the opportunity for gesture-based interaction of a kind comparable with freehand sketching. However, this suggests that a gesturing language and direct manipulation techniques need to be developed which are capable of providing a satisfactory set of operations on a known collection of appropriate architectural objects. The inter-locking or accurate positioning of construction parts in the 3D design space will be supported by an interactive constraint-based modeller developed by Salford University. A set of virtual modelling tools will be developed to support the creation of new construction parts or to change existing parts.
Architectural design is not simply the result of assembling totally pre-formed elements in space. Whilst some architectural components such as windows or roof trusses may be considered in this way, others such as floors or walls most certainly cannot, since they are surfaces which may need stretching, reshaping and even dividing during the design process. It will therefore be necessary to establish the full range of manipulations which architects will wish to carry out in a virtual world and to provide a gesturing language to support these naturally. For example, the architects may wish to create a doorway by cutting a portion off a wall using a virtual cutting tool. Alternatively the door object may be able to carry a certain degree of information telling the system how it should interact with a wall, both at the time it is inserted and later as further manipulations of the door and wall take place. These manipulative tools will be developed by combining constraint-based geometric modelling and 3D direct manipulation techniques. Once the design is constructed on the Responsive Workbench, the designers can go into the virtual building using a CAVE or a panoramic display to inspect the design or to perform further design changes. Such a novel approach will allow the designers to explore their creativity in 3D and absorb suggestions from the clients interactively. The clients will be able to interact with the proposed design and suggest design changes to the model more easily using the proposed virtual environment interface. This will allow designers to capture the client's needs and to ensure compatibility between the client's vision of the project and the resulting product.
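The idea that a door object might carry its own rules for interacting with a wall can be sketched very simply. The classes and dimensions below are hypothetical, reduced to one dimension for brevity; a real constraint-based modeller would of course work with full 3D geometry.

```python
class Wall:
    """A wall as a 1-D span along its base line, with cut-out openings.

    This is a toy illustration, not part of any real modelling system.
    """
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.openings = []  # list of (start, end) spans removed

    def solid_length(self):
        return (self.end - self.start) - sum(e - s for s, e in self.openings)

class Door:
    """A door that carries its own rule for interacting with a wall."""
    def __init__(self, width):
        self.width = width

    def insert_into(self, wall, position):
        # The door cuts its own opening when placed in a wall,
        # rather than relying on a separate cutting operation.
        if position < wall.start or position + self.width > wall.end:
            raise ValueError("door does not fit at this position")
        wall.openings.append((position, position + self.width))

wall = Wall(0.0, 10.0)
Door(0.9).insert_into(wall, 4.0)
print(round(wall.solid_length(), 3))  # -> 9.1
```

The design choice illustrated is the second alternative in the text: the interaction rule lives with the door, so later manipulations of the door can maintain the opening automatically.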
The project "Sheffield 1900" was launched in 1998. Using old maps and material from the City Archives, Local Studies and the Hawley Collection, a physical model of the city centre as it stood in 1900 has been built, at a scale of 1:500. [60, 61, 62] The study has four aims: to demonstrate the importance of understanding how a place has evolved before contributing a new design; to show the changes that have taken place in cities during the last century; to teach students how to undertake historical research; and to build up a database about the history of Sheffield for future reference. The project is intended to continue, taking a different aspect of the city each year. Until now the Sheffield 1900 database has taken the form of a paper file on each grid square and the model has been entirely physical. The use of Information Technology offers great advantages both in storing the material and in making it both indexable and accessible through many routes. Since the ultimate aim is to provide a physical history of every site in the city to inform new architectural interventions, it is highly desirable that the data on the three-dimensional positions of lost buildings and roads be electronically reproducible. The proposed research aims to produce a hypermedia databank dedicated to the on-going research on historical buildings and places at Sheffield. Similar urban simulation projects have been undertaken before, such as the Bath Model constructed by CASA, University of Bath, and the Glasgow Model by ABACUS, University of Strathclyde.
The proposed generic modelling methodology and the hypermedia databank for the Sheffield Study will differ in three aspects: (1) the model will be built to reveal evolution of city development by structuring and displaying reconstruction across different periods of time rather than a fixed one in the past or at present; (2) the databank will contain documents hyper-linked with 3D rendered CAD models to include texts, drawings, still/animated images, and field interviews recorded in various digital mediums; (3) the databank will be searchable through an appropriate mechanism of indexing so that contents can be retrieved efficiently according to end-users' interests. The objectives of the research are:
(iii) to demonstrate that a hypermedia databank dedicated to reconstructing historical architecture and places of a region can facilitate collaborative design and research regarding the contextual issues of that region;
Planned activities include applications to soil micromorphology, dental microwear, automated identification and measurement of botanical remains and 3-D recording of landscapes and buildings. In all these areas accurate recording, best accomplished through digitisation of the individual artefact or relic, is the first stage before identification and appropriate scientific analysis.
One challenge is the creation of digitised virtual reference collections of stone age tools. Virtual typologies - i.e. the range of stone artefacts made at any one period - would provide accurate references against which new material could be compared. Another exciting - and novel - development would be to digitise experimental collections that can show stone tool production and modification: the waste by-products as well as the intended object. This too would aid teaching, and the identification of material from excavated sites.
Pilot applications to virtual stone tool typologies could feasibly be based upon material from sites in the Sheffield region. A particularly appropriate example is Creswell Crags, the best known palaeolithic site in northern England, and with which the department has a long association. The Department has its own experimental collections of stone tool manufacture and modification which are immediately available for digitisation.
Geographic Information Systems are currently used in landscape archaeology and research focuses on the history of spatial and temporal organisation of human activity, the physical modification of the landscape and the changing ways in which landscapes have been used. Specific projects include departmental research projects into late Neolithic landscapes of Derbyshire and multi-period landscapes (neolithic to 19th Century) in South Uist, Hebrides. 
Research into archaeological materials science includes the analysis and reconstruction of early Roman glass, late Neolithic ceramics and Bronze Age weapons. Important issues include the selection, procurement and working of raw materials and artefact production. [16, 17, 19]
A key aspect for many of these areas will be the use of automatic feature recognition software to build up a basic inventory of properties and associated information which can be augmented using suitable investigative techniques. This is recorded in a suitable tagging dialect which is extensible and searchable. The repository of suitable digital objects will provide a powerful basis for further analysis and, in the case of the creation of a digital (virtual) environment to represent the possible scenarios within which the objects might have existed, a resource for populating this virtual world with realistic objects of interest. Incomplete or damaged specimens can be "completed" according to particular hypotheses and their digital representation refined and studied to establish likely properties.
By scanning all of their material and archival holdings, museums can provide virtual access over the internet to all of their collections for almost any scholar. Here in Sheffield we have the Hawley Collection - a unique archive and museum of the Sheffield tool making industry: there is much interest in the Hawley Collection from social historians and historical archaeologists from all over the world, and this would be an excellent opportunity to test-bed the creation of a virtual museum. [This development has also been flagged by JISC (Joint Information Systems Committee), who are funding collaborative research between HE institutions and museums with the aim of setting up virtual museums as part of the National Grid for Learning.]
Constructing a representative and well documented and analysed resource of digital objects of these tools together with information about their function and use will be the basis of an important historical collection that can be made widely available to scholars. This collection is unique and internationally important in that it combines artefacts with published catalogues, archival material, pictures, photographs, tapes and films, all recording the development of many of Sheffield's manufacturing processes and products. The crafts and trades which are covered in this collection are: cutlery, flatware and holloware - knives, forks and spoons, silver tableware, provision and specialist knives, and silversmithing. The Collection's importance is in the number of part-finished items and the tools, which are valuable in understanding work and manufacturing processes. These artefacts, together with the trade catalogues, the archival material, photographs and ephemera, all make this Collection an integrated whole. The Collection includes some commercially-made videos of craftsmen at work - a markmaker, an acid etcher, a razormaker - to name a few. There are almost 2,000 photographs of people at work and the original photographs of tools for catalogue illustrations, many of which are glass negatives. The Collection has some 16mm film of manufacturing processes and about 100 cassette tapes of interviews with craftsmen.
CUDO will provide facilities to underpin multidisciplinary studies in the world of sound: Acoustics, Audition, Music, Speech Perception and Speech Technology. Sheffield has an established and growing presence in these fields, but needs a common equipment base to enable a range of studies, based on audio digital objects, which cannot at present be resourced. The research groups involved are in the departments of Architecture, Computer Science, Music, Human Communication Science and Automatic Control and Systems Engineering.
With acoustic simulation software and auralization tools, architectural designers and their clients can experience the acoustic environment inside and outside buildings at the design stage. This would be very useful for avoiding possible acoustic problems and for improving the design.
(ii) The calculation speed of current software is rather slow, and the acoustic simulation is not in real-time. When designers and clients are walking in a virtual building, real-time sound performance is often an essential requirement. Also, if the sound performance is not satisfactory, designers may want to change some building components, such as adding sound absorbers or double-glazed windows, and then listen to the sound performance again. In this case, a real-time simulation would be very important.
(iii) For an efficient application of auralization in architectural design, an integrative combination between acoustic simulation programs and CAAD software is vital. For example, it should be convenient to assign acoustic characteristics of building materials using the CAAD component library. In this aspect considerable work is still required.
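Acoustic simulation tools of the kind discussed above are commonly built on image-source or ray-tracing methods. As a deliberately simplified sketch of the image-source idea, reduced to one dimension (two parallel walls), the following computes the delay and amplitude of each echo path; all parameter values are illustrative, and real auralization software works in 3-D with frequency-dependent absorption.

```python
SPEED_OF_SOUND = 343.0  # m/s

def rir_taps(src, mic, room_len, refl, n_images):
    """1-D image-source sketch: (delay_s, amplitude) pairs per echo path.

    Walls sit at x = 0 and x = room_len; mirror images of the source lie
    at 2*n*room_len + src (an even number of reflections, |2n|) and at
    2*n*room_len - src (an odd number, |2n - 1|).
    """
    taps = []
    for n in range(-n_images, n_images + 1):
        for image, hits in ((2 * n * room_len + src, abs(2 * n)),
                            (2 * n * room_len - src, abs(2 * n - 1))):
            dist = abs(image - mic)
            if dist == 0.0:
                continue  # source and microphone coincide on this path
            # Amplitude: wall reflection loss times 1/distance spreading.
            taps.append((dist / SPEED_OF_SOUND, (refl ** hits) / dist))
    return sorted(taps)

# Illustrative values: 10 m room, source at 3 m, microphone at 7 m.
taps = rir_taps(src=3.0, mic=7.0, room_len=10.0, refl=0.8, n_images=2)
print(taps[0][1])  # direct path, amplitude 1/4 m -> 0.25
```

Summing such taps into a filter and convolving it with anechoic sound is the essence of auralization; the real-time requirement discussed in (ii) comes from having to redo this whenever the geometry or materials change.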
Listeners are remarkably adept in `Auditory Scene Analysis' - the perceptual separation of evidence from individual sound sources in the melange which reaches the ears. With EPSRC support, Cooke and Brown have pioneered computational modelling of Auditory Scene Analysis with some success. The ability to identify those time-frequency regions which carry reliable speech evidence may be the key to automatic speech recognition in noisy conditions and has led us to develop the `missing data' approach to robust ASR [91, 92]. This is a key topic in two current EC grants, SPHEAR (TMR Network) and RESPITE (ESPRIT Long Term Research). Further studies in this area require the ability to construct controlled auditory scenes for experimentation. There are strong links here with the acoustic simulation work above and the sound spatialisation work below.
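The `missing data' idea can be sketched as follows: spectro-temporal cells are classified as reliable (speech-dominated) or unreliable (noise-dominated), and match scores are computed over the reliable cells only. The toy arrays below stand in for log-energy spectrograms; this is an illustration of the principle, not the published method.

```python
import numpy as np

# Toy log-energy "spectrograms" (frequency x time), invented values.
clean = np.array([[3.0, 1.0, 4.0],
                  [1.0, 5.0, 2.0]])
noise = np.array([[1.0, 6.0, 2.0],
                  [7.0, 1.0, 8.0]])
noisy = np.maximum(clean, noise)   # crude log-max approximation of mixing
mask = clean > noise               # True where speech dominates (reliable)

def masked_distance(template, observed, mask):
    """Distance over reliable cells only -- a crude marginalisation."""
    return float(np.sum(((template - observed) ** 2)[mask]) / mask.sum())

# On the reliable cells the clean template matches the noisy observation
# exactly, even though the full spectrograms differ substantially.
print(masked_distance(clean, noisy, mask))        # -> 0.0
print(float(np.sum((clean - noisy) ** 2)) > 0.0)  # -> True
```

The hard part in practice, and the reason controlled auditory scenes are needed, is estimating the reliability mask when the clean speech is not known.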
(i) Auditory spatial attention: The role of attention in Auditory Scene Analysis is unclear - does attention have a `passive' role, in which sound components that are grouped by other processes are simply promoted to the attentional foreground, or does attention play an `active' role in grouping? Recent evidence suggests that attention may be required for grouping tone sequences. We intend to further investigate this issue by considering the perception of spatially distributed sources; this work will require the acoustic chamber and multiple-speaker sound delivery system. Computational models will be derived from our experimental data.
(ii) Corpora for comparing CASA and blind source separation: currently, there is considerable interest in two approaches to sound source separation: computational auditory scene analysis (CASA) and blind source separation. Comparison of the two methods is frustrated by the lack of corpora which are appropriate for the evaluation of both techniques. In particular, the standard corpora used for CASA either lack the microphone array recordings required for blind separation, or use a simple linear mixing model. The latter present little challenge for blind separation approaches. Using the acoustic chamber, we will record a multi-source, multi-sensor corpus so that a meaningful comparison of the two approaches can be made.
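The contrast drawn in (ii) can be made concrete. Under the simple instantaneous linear mixing model, each microphone observes a weighted sum of the sources, x = A s; the sketch below (invented signals and mixing matrix) shows why such mixtures present little challenge once the model holds. Real room recordings are convolutive, with each source filtered and delayed on its way to each sensor, which is what the proposed corpus would capture.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 1000

# Two toy sources: a sinusoid and white noise (illustrative signals).
sources = np.vstack([np.sin(0.05 * np.arange(n_samples)),
                     rng.standard_normal(n_samples)])

A = np.array([[1.0, 0.5],   # mixing matrix: 2 microphones x 2 sources
              [0.3, 1.0]])
mics = A @ sources          # instantaneous mixture at the two microphones

# With the instantaneous model the mixture is a fixed invertible linear
# map, so separation reduces to estimating one 2x2 matrix -- one reason
# such corpora "present little challenge" for blind separation methods.
recovered = np.linalg.inv(A) @ mics
print(np.allclose(recovered, sources))  # -> True
```

A convolutive mixture replaces each entry of A with a filter, so nothing this simple applies; hence the need for genuine multi-source, multi-sensor room recordings.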
The problem of equipping the contemporary artist with the required tools to complete an artistic task is best approached through a collaborative, multidisciplinary campaign. Artists who require specialist tools can work with scientists to shape their construction. Scientists can create new paradigms of working. Together, an understanding of the cultural differences and logistics of multidisciplinary work can be achieved.
(i) Sound spatialisation tools: A multi-channel system is being developed at Sheffield University Sound Studios. This currently offers an integrated graphic interface to enable the complex motion of sounds in space and to automate certain parameters. The CUDO facility would enable the development of this software in an acoustically controlled environment, affording optimum listening conditions.
(ii) Sound synthesis by chaotic algorithms: There are few, if any, reliable tools that take the sound synthesis or analysis-resynthesis paradigms to their creative extremes. Current research applies chaotic and fractal algorithms to the raw materials of sound in an attempt to manipulate materials at higher structural levels. Whilst at a very early stage, this work is yielding positive (and musical) results. Of critical importance again is the necessity to mediate between the theoretical model and the ear's perception. The optimum listening environment is again stressed as important in this research.
(iii) Interactive music, which incorporates Artificial Intelligence and robotic technologies in the redesign of the Human Computer Interface, still lies in the limited world of `triggering'. This is partly due to the simple tools on offer that allow the composer to be both the performer and the engineer. The high-powered computing supplied and maintained in the bid will enable the full potential of this research to trickle down to the composition studios of the Music Department and beyond.
(iv) Analysis of compositional method is broadened through the development of complex databases. The applicants are currently engaged in European Union funded projects to analyse the creative process. Numerous internationally recognised music psychology research groups based within the Music Department are offering valuable advice. Where musical creation takes place in this facility, it will be carefully monitored.
(v) Computer music composition tools, including analysis of methods for music orchestration, musical sound understanding and timbre modelling, and symbolic recognition of musical patterns.
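Item (ii) above describes applying chaotic algorithms to the raw materials of sound. A minimal sketch of the idea is to iterate the logistic map at audio rate and rescale it to the audio range; the parameter values below are illustrative and not drawn from the research described.

```python
# Sound synthesis from a chaotic algorithm: iterate the logistic map
# x -> r * x * (1 - x) once per sample.  r = 3.9 lies in the chaotic
# regime; r and x0 here are illustrative choices.
def logistic_samples(r, x0, n):
    samples = []
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        samples.append(2.0 * x - 1.0)   # map [0, 1] to audio range [-1, 1]
    return samples

audio = logistic_samples(3.9, 0.5, 44100)   # one second at 44.1 kHz
print(min(audio) >= -1.0 and max(audio) <= 1.0)  # -> True
```

Varying r sweeps the map between periodic and chaotic behaviour, which is one way such algorithms can be used to shape sound at higher structural levels rather than sample by sample.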
A major thrust of Sheffield speech research is in the storage and content-based retrieval of spoken corpora. Because this work has more general importance for multi-media repositories, we present it in section 2.6.2.
This large degradation in speech recognition performance for spontaneous speech suggests that the relatively simple models that underpin state-of-the-art speech recognition systems are extremely ill-matched to casually-spoken, unplanned speech. We propose to develop generative statistical models of the speech signal much more strongly rooted in the physics of speech production. This is an ambitious project and will require the collection of new digital object corpora relating, by simultaneous recording, articulatory and acoustic data from which statistical models may be built. At Sheffield, Renals and Carriera-Perpinan have developed a latent variable model relating acoustic and electropalatogram data. This is a rich model, but is essentially static. We propose to develop dynamic latent variable models. Research aimed at developing better models of the speech production process requires accurate equipment to measure the dynamics of the vocal system during speech production. We therefore intend to create a facility for instrumental speech research. In this proposal we request funding which will enable us to make a start in building this facility: we will seek further provision from other sources, including EPSRC's multi-project scheme and industrial collaborators such as Nokia. The infrastructure we request here will help to facilitate accurate real-time measurements of the vocal tract using Electromagnetic (Articulograph), Inductance (Laryngograph) and Conductance (Electro-Palatograph) data, synchronous with the speech that has been produced. From other sources we will fund Ultrasound and Video equipment, the latter for lip movement measurements. Data of this sort are crucial for gaining a better understanding of vocal tract dynamics. For example, synchronous recordings of speech and vision can help us build lip-reading systems via the fusion of complementary information from speech and vision.
Such challenges can only be addressed by the development of more powerful machine learning algorithms such as probabilistic graphical models. In the lip-reading example, current mathematical models are inadequate for integrating such complementary cues.
(i) Lip reading: The integration of acoustic and visual cues to enhance speech perception is something humans perform seemingly effortlessly. There are several difficulties in performing this automatically in machines. If we regard the variation in the acoustic signal and the corresponding lip movements as parallel Markov processes, we obtain a coupled model in which a lower-dimensional common latent space links the two streams. Training such a coupled model, and making statistical inferences from it, are not straightforward. A rich dataset of speech and vision is required to validate such approaches and pursue them further.
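A minimal sketch of the coupling idea: a single hidden Markov chain whose states emit both an acoustic and a visual symbol, so the two streams are fused through the shared hidden state (all probabilities below are invented for illustration):

```python
import numpy as np

# Shared hidden state chain with two synchronous observation streams.
trans = np.array([[0.9, 0.1], [0.2, 0.8]])    # state transition matrix
init = np.array([0.5, 0.5])                   # initial state distribution
p_audio = np.array([[0.8, 0.2], [0.3, 0.7]])  # p(audio symbol | state)
p_video = np.array([[0.7, 0.3], [0.4, 0.6]])  # p(video symbol | state)

def forward(audio, video):
    """Forward algorithm fusing both observation streams; returns the
    joint likelihood of the two synchronous symbol sequences."""
    alpha = init * p_audio[:, audio[0]] * p_video[:, video[0]]
    for a, v in zip(audio[1:], video[1:]):
        alpha = (alpha @ trans) * p_audio[:, a] * p_video[:, v]
    return float(alpha.sum())

print(forward([0, 0, 1, 1], [0, 0, 1, 1]))
```

A fully coupled model would give each modality its own chain linked through a common latent space; the single shared chain above is the simplest instance of fusion through hidden state.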
(ii) Multimodal identification of persons: Automatic identification and/or verification of persons is an important topic for a range of security, access control and forensic applications. Techniques that can integrate multiple modalities are of interest in this context. Some work along these lines has been undertaken in a number of laboratories, including IDIAP (Switzerland), Matra Communications (France) and Surrey. Many of the techniques developed in these studies have been evaluated only on small databases. Further research will involve directions such as pattern processing algorithms for optimal sensor fusion of multimodal data, and feature selection in the presence of variable costs of misclassification. Rich databases with synchronous speech and visual information are a fundamental requirement for advances in these directions.
(iii) Modelling coarticulation: Coarticulation is one of the most fundamental aspects of the dynamics of speech patterns. It is widely acknowledged that the state-of-the-art statistical pattern recognition techniques used in building speech recognition systems do a very poor job of modelling the effects of coarticulation. By increasing the power of our parametric models and developing robust ways of parameter estimation we have had some success in coping with the context-dependent variability induced by coarticulation. We believe better instrumentation of the vocal tract is the best way of providing the framework for developing better models of the effects of coarticulation. There is some recent evidence to this effect: for example, work which used a small database of X-ray measurements of the vocal tract (the Wisconsin microbeam data) to learn a nonlinear mapping from acoustic data to the corresponding articulatory movements. Work to estimate a latent space model that can represent articulatory movements has been carried out by Richards and Bridle in their hidden dynamic model. Such attempts could benefit greatly if a rich set of measurements of vocal tract dynamics were available. The articulograph measurements can provide precisely this sort of information.
The explosive growth of digital text in all areas of human endeavour (science, commerce and the arts) has led to urgent demands for more sophisticated automated support for retrieving and extracting information from text. [23, 24, 37] Sheffield's NLP Group has developed a state-of-the-art Information Extraction (IE) system, with support from the EPSRC, which allows structured information repositories to be built from free text sources. This system has been deployed in a number of research council and industrially funded research projects, including:
(i) the US DARPA-sponsored Message Understanding Conferences, the premier international evaluative framework for IE research, in which information is extracted from financial newswires -- e.g. information about corporate joint ventures or management succession events -- and quantitatively compared with humans performing the same task;
(ii) the Enzyme and Metabolic Pathways Information Extraction (EMPathIE) project, a bioinformatics research project funded by GlaxoWellcome plc and Elsevier Science, carried out in collaboration between the Departments of Computer Science and Information Studies at the University of Sheffield. EMPathIE aims to apply Information Extraction technology to create a database of enzyme and metabolic pathway data from academic journal papers to support drug discovery;
(iii) the Protein Active Site Template Acquisition (PASTA) project, a BBSRC/EPSRC BioInformatics Initiative project involving the Departments of Computer Science, of Molecular Biology and Biotechnology and of Information Studies at Sheffield University. It aims to create a database of protein active site data from academic journal papers and abstracts to support molecular biologists in protein structure analysis;
(iv) the Text Retrieval, Extraction and Summarisation for Large Enterprises (TRESTLE) project sponsored by Glaxo Wellcome Research and involving a collaboration between the Departments of Computer Science and Information Studies at the University of Sheffield. TRESTLE aims to assist GlaxoWellcome in deploying the IE technology developed in EMPathIE in a practical way in the company, as well as extending the technology to other areas of interest such as the extraction of information about clinical trials from pharmaceutical company newsletters.
In addition to these projects, other IE projects are under way, ranging from multinational collaborative projects within the European Commission's Framework programmes, on extracting and fusing information in multiple languages contained in police reports, to work on extracting information from biographical dictionaries to support historical research.
IE systems must be made more adaptable, so that moving to new domains is more straightforward and ultimately becomes something manageable by an end user. This challenge involves investigating how domain-specific lexicons and grammars can be acquired from corpora, as well as issues of usability engineering and HCI. Using the MicroULab and associated facilities we will be able to investigate, in a rigorous manner, the ways that users interact with IE systems and identify ways to improve their effectiveness.[83-86]
The accuracy of IE systems and their depth of "understanding" must be improved. This involves gaining a deeper understanding of the mechanisms of human language and of ways of modelling the world knowledge (for example, knowledge of molecular biology) that underpins human language understanding.
Given a pair of texts, is there an algorithm that can decide, with an acceptably high level of probability, whether one was derived from the other? In a world in which electronic text passes freely, often independent of its original authors and subject to constant reuse, a measure of that systematic reuse is becoming essential, since the phenomenon is having profound social and legal consequences. The Measuring Text Reuse (MeTeR) project, funded by an EPSRC ROPA award and involving collaboration between the Departments of Computer Science and Journalism at the University of Sheffield and the Press Association (PA), is addressing this question in the context of investigating to what extent UK daily newspapers reuse PA-supplied copy. [23, 24, 25]
The scientific challenge here is to see if algorithms can be developed to detect text derivation relations. This work is pushing the state-of-the-art in statistical language processing techniques and will lead to insights into the nature of individual and social text creation.
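The simplest instance of such an algorithm is word n-gram containment: the fraction of one text's n-grams that also occur in the other. The texts below are invented for illustration; the statistical techniques used in MeTeR are considerably richer:

```python
# Word n-gram containment as a crude derivation-detection measure.

def ngrams(text, n=3):
    """Set of word n-grams in a text (case-folded)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(candidate, source, n=3):
    """Fraction of the candidate's word n-grams found in the source."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngrams(source, n)) / len(cand)

pa_copy = "the minister announced new funding for northern hospitals today"
rewrite = "new funding for northern hospitals was announced by the minister"
unrelated = "the match ended in a goalless draw at the city ground"

print(containment(rewrite, pa_copy))    # substantial overlap: possibly derived
print(containment(unrelated, pa_copy))  # zero overlap: unrelated
```

A derived text retains long runs of the source's wording even after light editing, which is exactly what shared n-grams detect; choosing n trades sensitivity against false matches on common phrases.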
The NLP group also has a strong interest in the field of pragmatics and in particular the analysis and computer representation of intentions, beliefs, plans and goals in natural language dialogue. More specifically, we are currently researching the following topics: the effect of speech act sequences in intention recognition, models of belief representation and ascription, argumentation, processing of conversational implicature, empirical models of dialogue co-reference resolution and multimedia dialogue systems. This research has led to a paradigm called ViewGen which has been embodied in a sequence of Prolog programs. The basic notion is that of nested belief environments corresponding to agents who compute each other's belief spaces by default in order to understand and act. ViewGen computes belief nestings by a process of ascription. This research has also led to a technique for deriving a dialogue grammar empirically from corpora, which resulted in a dialogue system (Converse) that won the Loebner Prize in New York in 1997.
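The empirical dialogue-grammar idea can be illustrated by estimating speech-act transition probabilities from an annotated corpus (the act labels and dialogues below are invented; the Converse system is far more elaborate):

```python
from collections import Counter

# Toy annotated corpus: each dialogue is a sequence of speech-act labels.
corpus = [
    ["greet", "greet", "question", "answer", "thank", "bye"],
    ["greet", "question", "answer", "question", "answer", "bye"],
    ["greet", "question", "clarify", "answer", "thank", "bye"],
]

bigrams, unigrams = Counter(), Counter()
for dialogue in corpus:
    for prev, nxt in zip(dialogue, dialogue[1:]):
        bigrams[(prev, nxt)] += 1
        unigrams[prev] += 1

def p_next(prev, nxt):
    """P(next act | previous act), maximum-likelihood estimate."""
    return bigrams[(prev, nxt)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_next("question", "answer"))  # -> 0.75: answers usually follow questions
```

Such transition estimates give a dialogue system both a predictive model of what the user is likely to do next and a basis for choosing its own next act.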
Can dialogue be modelled in a machine sufficiently well to allow a machine to converse usefully with a human? As old as the Turing test, but ever more relevant in the era of call centres and ubiquitous machine-held information sources, this question will continue to drive dialogue processing research, and to yield payoffs not only in the form of deeper understanding, but also in the production of usable, albeit limited, dialogue systems which are starting to appear in information system interfaces.
The Sheffield NLP group has pioneered the development of platforms for carrying out text engineering research, through the development of the General Architecture for Text Engineering (GATE) with support from the EPSRC. GATE provides a platform for the development, integration and reuse of text processing components and systems for a wide range of research and application purposes. GATE has been downloaded by over 500 researchers worldwide, has been used in a variety of UK and European language engineering projects, and was chosen by the US DARPA-sponsored TIPSTER project in 1998 as a final demonstration platform for their programme results. GATE is now being broadened to cope with multimedia data (sound and images) through the EC-funded MUMIS project, and to cope with non-Western languages via the Enabling Minority Language Engineering (EMILLE) project, an EPSRC funded project in collaboration with the University of Lancaster, which will produce corpora and text processing tools for a variety of subcontinental (Indian) languages. [76 - 81]
Constructing a large-scale, robust platform like GATE raises many challenging software engineering issues concerning software reuse, integration, distribution, and portability. As a repository for very sophisticated meta-data about text, sound and images, GATE is also at the forefront of investigating how best to represent, store and manipulate complex meta-data.
Important progress has been made in the modelling of complex molecules and their role in metabolic processes within cells. The major recent advances in genome science and bioinformatics have resulted in an avalanche of scientific information about the sequence and structure (both secondary and tertiary) of many important genes and proteins.
The challenge for bioinformatics and biology now is to turn all this raw data into intelligible and useful understandings of the function and behaviour of these molecules. Digitising this complex, multi-dimensional information in a way that allows it to be usefully analysed and applied is an important challenge. Examples of the work in the Department of Computer Science in this area include the World Wide Web Cytokine Exploratory, developed in collaboration with UCL, and the public domain Java version of RASMOL developed within the University's Academic Java Centre funded by Sun Microsystems (this can be downloaded from http://www.dcs.shef.ac.uk/aajc/jmvs/).
As a result of recent advances in experimental techniques such as automated sequencing of DNA molecules, determination of molecular structures by crystallographic and magnetic resonance methods, and the ability to visualise the activity profiles of thousands of genes in a single microarray hybridisation experiment, modern molecular biology is a source of a very large amount of rich data. Effective manipulation of such data will be a key determinant of our ability to use knowledge of molecular biology in healthcare and to make progress in our understanding of biological function. Efficient storage, content-based retrieval, the use of advanced statistical pattern processing techniques in the form of data mining, and the development of tools for visualisation of three-dimensional structures to highlight functional similarities retrieved from databases are recognised as key topics which could fill the current gap between having so much experimental data and the ability to make effective use of it. [21, 22, 41, 65]
Our interests in this area include the development and critical evaluation of modern machine learning algorithms in a number of specific biological inference problems. We are working on protein secondary structure prediction using neural-network techniques, and on the automatic classification of proteins into structurally similar classes.
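The standard set-up for such predictors classifies each residue from a sliding window of flanking residues; a sketch of the input encoding, with an invented sequence, is:

```python
import numpy as np

# Sliding-window input encoding for secondary structure prediction:
# each residue is represented by a one-hot-encoded window of its
# neighbours, the usual input to a neural network predictor.
AMINO = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AMINO)}

def encode_windows(sequence, half_width=3):
    """One-hot encode a window around each residue, with zero padding
    at the chain ends."""
    w = 2 * half_width + 1
    out = np.zeros((len(sequence), w * len(AMINO)))
    for i in range(len(sequence)):
        for j in range(-half_width, half_width + 1):
            k = i + j
            if 0 <= k < len(sequence):
                out[i, (j + half_width) * len(AMINO) + IDX[sequence[k]]] = 1.0
    return out

X = encode_windows("MKTAYIAKQR")  # invented 10-residue sequence
print(X.shape)  # one 140-dimensional feature vector per residue
```

A classifier (e.g. a feed-forward network) then maps each window vector to one of the structural classes (helix, strand, coil).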
Recent advances in the computational modelling of cells and tissues using hybrid machines and an accompanying formal logic have created an opportunity to build massive computational models which can reflect what is known about the detailed behaviour of the participants in the many different pathways involved, and particularly about the interaction of the many different subsystems involved. A model of the T2 immune response system has recently been developed and simulated [32, 34]; this work has demonstrated that computational models using semantically accurate digital representations of cellular and molecular processes can be validated against experimental data.
It will then be possible to construct digital representations of complex systems of tissues which exploit the behavioural properties of the digital molecules described above, and hence to build more complete representations of complex systems such as the immune system. All this requires a coherent basis of data created to represent the sequential, structural and behavioural properties of realistic digital models of molecules and organelles. The current BBSRC/EPSRC Initiative in Bioinformatics identifies amongst its priority areas "Understanding, integrating and modelling cellular processes" and "Novel computer science and IT techniques which can be applied to important areas of Biology", both of which relate directly to this work. [11, 33, 35]
CUDO will develop quality software using state-of-the-art software engineering methods combined with the benefits of new insights in artificial intelligence. The laboratory will also be able to build practical software applications using existing technology, through computer science research students and through masters projects (the EPSRC-funded Advanced MSc and MRes).
One important issue for this emerging application area is that the traditional approaches to software engineering, originally based around the design of information systems and real-time control systems, do not seem to be the most effective. Their methodologies and procedures are extremely heavyweight and do not translate well into research-oriented software applications. Holcombe and co-workers have developed a more lightweight and more usable approach based around conceptual modelling and extensive early testing. This is related to the emerging discipline called Extreme Programming, which has proved effective in domains where the requirements are changing rapidly.
Based on the formalisms and practical approach to software design pioneered by the proposer [36, 38, 39, 40] and used in at least 30 industrial software projects, CUDO will further develop the tools and methods for rapid quality software development. The approach is based around the identification of low-level requirements, the construction of functional tests for these, and direct coding which is extensively and continuously tested against the test sets. This enables the inevitable changes in the requirements to be managed efficiently, and the use of X-machine design methods allows for the rapid integration and testing of the complete application. Too often the shaping of modern software engineering methodologies and techniques has been driven by traditional applications in information systems and control applications developed in large software houses. The resulting methods are often poorly suited to applications where the requirements are highly dynamic and the programming resources are limited. New, quality-oriented but practical approaches are needed which can be demonstrated to be effective and underpinned by a strong computational foundation.
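The test-first, state-machine style can be illustrated with a toy example (an invented counter machine, not one of the industrial projects): a simple machine model with states and a memory, against which functional tests derived from the low-level requirements are run continuously:

```python
# Toy X-machine-like model: a counter with states and a memory value.

class CounterMachine:
    """States: 'counting' and 'full'; memory: the current count."""

    def __init__(self, cap):
        self.state, self.count, self.cap = "counting", 0, cap

    def step(self, symbol):
        """Process one input symbol and return the resulting state."""
        if self.state == "counting" and symbol == "inc":
            self.count += 1
            if self.count >= self.cap:
                self.state = "full"
        elif symbol == "reset":
            self.state, self.count = "counting", 0
        return self.state

# Functional tests written directly from the requirements, kept running
# as the code evolves so requirement changes are caught immediately.
m = CounterMachine(cap=2)
assert m.step("inc") == "counting"   # below the cap: keep counting
assert m.step("inc") == "full"       # cap reached: machine is full
assert m.step("reset") == "counting" and m.count == 0
print("all requirement tests pass")
```

Because the model separates control states from memory, test sets can target state transitions and memory updates independently, which is what makes integration testing of larger compositions tractable.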
Repositories of digital objects, whether representations of text, sound, three-dimensional images or film, will need to be developed so that objects can be retrieved using human-centric mechanisms based on their content and semantics, rather than being catalogued and classified using abstract and arbitrary methods. Thus the retrieval basis will be the natural content of the digital artefact. This is a major area for Computer Science research and one that will have an enormous impact in the future. For example, we are using applied speech recognition technology in the ESPRIT project THISL and the EPSRC project STOBS to handle archives of audio and audio-visual material without the need for human transcription into text. The BBC is a partner in THISL. In this context the motivation would be to provide a service for historians and other scholars wishing to build and use large repositories of spoken material.
It has been estimated that a large proportion of human-generated information is spoken, and that much of this is in the form of television and radio broadcasts. While the content-based navigation and retrieval of textual data is commonplace, it is still an outstanding research problem to perform such operations on archives of spoken data. Researchers in Computer Science at Sheffield have a strong recent track record in this area. The THISL project has developed a large-scale system to search and retrieve TV and radio news stories from an archive currently containing over 2,000 hours of broadcast material (and growing at 5 hours/day). STOBS has been concerned with structuring the automatically generated transcriptions of such archives through automatic detection of sentence boundaries and identification of entities such as proper names, numbers, and monetary expressions. The techniques developed in these projects have been evaluated in competitive international programmes such as TREC and the DARPA-sponsored IE-ER evaluation, and the statistical approaches employed at Sheffield have been validated as state-of-the-art. Our work in information access to multimedia data is currently based on a two-stage procedure: first translate the material to text (e.g., by using a speech recognizer), then apply statistical text processing operations. This has proven very successful for tasks such as spoken document retrieval and the identification of names and other entities in spoken audio, and we have developed new approaches (in particular, based on statistical finite state machines) to the problem. However, transcripts do not capture the richness of spoken data, as characterised by features such as intonation and duration. We have carried out some pilot studies on using such prosodic features for information access, and (in the context of sentence boundary identification) have had promising results.
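The second stage of this two-stage procedure can be sketched as follows, assuming the recogniser has already produced transcripts. The documents and tf-idf scoring below are an invented, simplified stand-in for the statistical retrieval models actually used in THISL:

```python
import math
from collections import Counter

# Hypothetical automatic transcripts of three broadcast news stories.
transcripts = {
    "news-01": "the prime minister answered questions in parliament today",
    "news-02": "heavy rain caused flooding across the north of england",
    "news-03": "the minister for health opened a new hospital in leeds",
}

docs = {d: Counter(t.split()) for d, t in transcripts.items()}
N = len(docs)

def idf(term):
    """Smoothed inverse document frequency of a term."""
    df = sum(1 for c in docs.values() if term in c)
    return math.log((N + 1) / (df + 1)) + 1

def score(query, doc):
    """tf-idf relevance score of a transcript for a query."""
    c = docs[doc]
    return sum(c[t] * idf(t) for t in query.split())

query = "health minister"
ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
print(ranked[0])
```

Recognition errors in the transcripts degrade but do not destroy this kind of term-statistics matching, which is one reason the two-stage approach works well in practice for spoken document retrieval.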
We plan to concentrate further on these areas to provide both a structured representation with a greater richness and better information access facilities from spoken data.
Furthermore, we propose to broaden the theme of integration from multimedia databases by developing statistical models for the integration of audio, video and textual data. The development of XML-based repositories of digital objects and associated searching algorithms and querying interfaces will provide a specific framework for this and related work. This work will relate directly to work in Digital Architecture - the representation and manipulation of three-dimensional structures defined in XMLaec - and in Digital Archaeology, where the development of annotated collections of digital artefacts will also be a major theme.
Research into human cognitive processes has often focused on building computational models of these processes. Robots represent a physical instantiation of a computational model. The exploration of perception, appreciation and creativity in robots takes research in robots and what they can do in new directions. Moreover, because they are physical devices, robots have their own aesthetic appeal both in appearance and in movement.
The availability of sophisticated digital objects representing many dimensions (physical, temporal and behavioural) of designed or manufactured artefacts such as buildings, archaeological relics, musical compositions and spoken sentences provides us with an opportunity to examine how these might have been created. The transformation, refinement and composition of partially complete digital artefacts can be simulated within the electronic environment and the creative design processes examined in detail. Computational models of these processes can be postulated and explored with reference to the ways in which the digital objects manifest themselves, and thus a greater understanding of the creative processes achieved. Building robots to carry out simple artistic and creative activities will further enhance the understanding of how creation can take place subject to specific and well-defined physical constraints. We hope to base this on theoretical models of communicating systems, where interacting agents, specified in a rigorous but straightforward way, can be designed and programmed directly from the specification model. We have developed a communications management technique which allows for autonomous behaviour subject to some clearly specified rules, and we will be able to explore the consequences of refining these rules in the context of real robot behaviour. This approach will be contrasted with more conventional training methods based around neural nets and related paradigms.
The breadth of interest in the University in the use and exploration of research issues using digital objects is very considerable. There are plans to establish a laboratory for Digital Arts alongside these activities, to explore areas of common interest and to share expertise and resources. Examples of the sort of projects under way in the University are described below:
T. Balaneascu, M. Holcombe, A. J. Cowling, H. Gheorgescu, M. Gheorghe, C. Vertan, "Communicating Stream X-machines Systems are no more than X-machines". Journal of Universal Computer Science, Volume 5, no. 9, 494-507, 1999.
H. Cunningham, R. Gaizauskas, K. Humphreys and Y. Wilks, "Experience with a Language Engineering Architecture: 3 Years of GATE", Proceedings of the AISB'99 Workshop on Reference Architectures and Data Standards for NLP, Edinburgh, April, 1999.
M. Carreira-Perpinan and S. Renals, "A latent-variable modelling approach to the acoustic-to-articulatory mapping problem" [http://www.dcs.shef.ac.uk/~sjr/pubs/1999/icphs99.html], Proc. 14th Int. Congress of Phonetic Sciences, San Francisco, 2013-2016, 1999.
Y. Gotoh and S. Renals, "Information Extraction from Broadcast News", Philosophical Transactions of the Royal Society of London, Series A, 2000, In press.
M. Carreira-Perpinan and S. Renals, "Dimensionality reduction of electropalatographic data using latent variable models" [http://www.dcs.shef.ac.uk/~sjr/pubs/1998/spcom98.html], Speech Communication, 26, 259--28, 1998.
R.W. Dennell, "Grasslands, tool-making and the earliest colonization of south Asia: a reconsideration". In Early Human Behavior in Global Context: The Rise and Diversity of the Lower Palaeolithic Record, 280-303, edited by M. Petraglia and R. Korisettar. London: Routledge. 1998.
R.W. Dennell, "The TD6 horizon of Atapuerca and the earliest colonisation of Europe: an Asian perspective". In Los primeros pobladores de Europa/The First Europeans, ed. E. Carbonell, Bermudez de Castro, J.M., Arsuaga, J.L. and Rodriguez, X.P. (1999). Burgos, Spain, 75-97. 1999.
R.W. Dennell, "The Palaeolithic and Pleistocene potential of the Indus drainage system: a review of recent work". In The Indus River: Biodiversity, Resources, Humankind, ed. A. and P. Meadows, 306-319. Linnaean Society, London/Karachi: Oxford University Press. 1999.
J.F.G. de Freitas, M. Niranjan, A.H. Gee & A. Doucet, "Sequential Monte Carlo Methods for Optimisation of Neural Network Models" 1999, Neural Computation, (In Press). Also available as CUED/F-INFENG-TR 328, via http://www-svr.eng.cam.ac.uk/~jfgf.
R. Gaizauskas and K. Humphreys, "A Combined IR/NLP Approach to Question Answering Against Large Text Collections", To appear in Proceedings of RIAO 2000: Content-Based Multimedia Information Access, Paris, April, 2000.
M. Harman, M. Holcombe, R. Hierons, B. Jones, S. Reid, M. Roper, M. Woodward, "Towards a Maturity Model for Empirical Studies of Software Testing", Fifth Workshop on Empirical Studies of Software Maintenance (WESS'99), 1999.
M. Holcombe, A. Bell. "Computational models of cellular processing." in Computation in Cellular and Molecular Biological Systems. M. Holcombe, R. Paton and R. Cuthbertson (Eds.), World Scientific Press, Singapore,167-18, 1996.
K. Humphreys, G. Demetriou, R. Gaizauskas, "Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures", Proceedings of the Pacific Symposium on Biocomputing (PSB-2000), Hawaii, pp. 505-516, 2000.
D. R. Lovell, B. Rosario, M. Niranjan, R. W. Prager, K.J. Dalton, R. Derom & J. Chalmers, "The QAMC Project: A case study of neural networks for medical risk prediction". Australian Journal of Intelligent Information Processing Systems, Vol 5, No 1, pp 24-28, 1998.
B. R. Lawson, "The quest for the parrot on the shoulder: knowledge about emerging design solutions and its representation in a CAD system". In J. J. Connor,S. Hermandez,T. K. S. Murthy, & H. Power (Eds.), Vizualization and Intelligent Design in Engineering and Architecture (pp. 421-430). London: Elsevier, 1993.
A. Moore, first performance of "Soundbodies" (55-minute multimedia collaboration) at the Crucible Theatre, 17th November 1998. From this, a short clip "Soundbodies: Bodypart" has appeared on the compilation CD Presence II, published by the Canadian Electroacoustic Community, 1998.
C. Peng and P. Blundell Jones, "Hypermedia Authoring and Contextual Modeling in Architecture and Urban Design: Collaborative Reconstructing Historical Sheffield", Proceedings of ACADIA'99, pages 116-127, Salt Lake City, US, 1999.
A.J.C. Sharkey & N. E. Sharkey, "Diversity, Selection and Ensembles of Artificial Neural Nets", Proceedings of the Third International Conference on Neural Networks and their Applications, IUSPIM, University of Aix-Marseille III, Marseilles, France, March 1997, 205-212.
M. Sanderson & D. Lawrie, "Building, Testing, and Applying Concept Hierarchies", In Advances in Information Retrieval: Recent Research from the CIIR, W. Bruce Croft, ed., Kluwer Academic Publishers, 2000.