PRESENCE (PREdictive SENsorimotor Control and Emulation)
PRESENCE is a new architecture for speech-based interaction. It is founded on the premise
that future progress depends not on "bridging the gap" between speech science and speech
technology, but on both communities assimilating wider research findings on the behaviour
of living systems in general and the cognitive abilities of human beings in particular.
The PRESENCE architecture is inspired by relatively old ideas such as perceptual
control theory [Powers, W. T. (1973). Behavior: The Control of Perception. Hawthorne,
NY: Aldine] together with relatively new discoveries such as mirror neurons [Rizzolatti,
G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience,
27, 169-192] coupled with contemporary theories of cortical functionality such as
hierarchical temporal memory [Hawkins, J. (2004). On Intelligence. Times Books] and
emulation mechanisms [Wilson, M., & Knoblich, G. (2005). The case for motor involvement
in perceiving conspecifics. Psychological Bulletin, 131(3), 460-473].
PRESENCE intentionally blurs the distinction between the core components of a traditional
spoken language dialogue system and, as a result, cooperative and communicative behaviour
emerges as a by-product of an architecture that is founded on a model of co-action
in which the system has in mind the needs and intentions of a user, and a user has
in mind the needs and intentions of the system.
The PRESENCE architecture is organized into four layers. The top layer is the main
path for motor behaviour such as speaking. A system's needs S:n, modulated by motivation,
cause the selection of a communicative intention S:i that would satisfy those needs.
The selection mechanism can be implemented as a search process, and this is indicated
by the diagonal arrow running through the S:i module. The selected intention drives
both actual motor behaviour S:m and an emulation of possible motor behaviour S:E(S:m)
on the second layer. Sensory input feeds back into this second layer, providing
a check as to whether the desired intention has been met. If there is a mismatch
between intended behaviour and the perceived outcome, then the resulting error signal
will cause the system to alter its behaviour appropriately.
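As a rough illustration of the loop described above, the following Python sketch pairs a search-based selection of an intention with an error-driven adjustment of motor output. It is not taken from any PRESENCE implementation: the names `Intention`, `select_intention` and `control_loop`, the weighted-sum scoring, the proportional `gain` and the toy `perceive` function are all assumptions made for the example, and the emulation S:E(S:m) is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Intention:
    """A candidate communicative intention (hypothetical S:i)."""
    name: str
    satisfies: dict  # need name -> how well this intention satisfies it


def select_intention(needs, candidates):
    """Search the candidates for the intention that best meets the needs.

    `needs` maps each need (hypothetical S:n) to a motivation weight.
    """
    def score(intention):
        return sum(weight * intention.satisfies.get(need, 0.0)
                   for need, weight in needs.items())
    return max(candidates, key=score)


def control_loop(target, perceive, effort=0.0, gain=0.5,
                 tolerance=0.01, max_steps=100):
    """Adjust motor effort until the perceived outcome matches the target.

    The mismatch between intended and perceived outcome is the error
    signal; here it simply nudges the effort in proportion to `gain`.
    Returns the final effort and the number of adjustments made.
    """
    for step in range(max_steps):
        error = target - perceive(effort)
        if abs(error) <= tolerance:
            return effort, step
        effort += gain * error
    return effort, max_steps


if __name__ == "__main__":
    needs = {"inform": 1.0, "be_brief": 0.2}
    candidates = [
        Intention("explain", {"inform": 0.9, "be_brief": 0.1}),
        Intention("hint", {"inform": 0.4, "be_brief": 0.9}),
    ]
    chosen = select_intention(needs, candidates)

    # Assumed environment: only 80% of the effort reaches the listener.
    effort, steps = control_loop(target=1.0, perceive=lambda e: 0.8 * e)
    print(chosen.name, round(effort, 3), steps)
```

The point of the sketch is only that the system controls the perceived outcome rather than the output itself: the effort settles wherever the perceived result matches the intention, whatever the environment does to the signal.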
The third layer of the model captures the empathetic relationship between the system
as a speaker and the user as a listener that conditions the speaking behaviour of
the system. U:E(S:i) represents the emulation by the user of the intentions of the
system, and S:E(U:E(S:i)) represents the emulation of that function by the system.
A similar arrangement applies to S:E(U:E(S:m)) - the system's emulation of the user's
emulation of the system's motor output. The fourth layer represents the system's
means for interpreting the needs, intentions and behaviour of a user through a process
of emulating the user's needs S:E(U:n), intentions S:E(U:i) and behaviour S:E(U:m).
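The nested notation used for the four layers can be made concrete with a small data structure. The Python sketch below is one possible way to write the terms down. The class names `State` and `Emulation` and the `depth` helper are hypothetical and serve only to show how terms such as S:E(U:E(S:i)) are built by nesting.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class State:
    """A need (n), intention (i) or motor behaviour (m) of an agent."""
    agent: str  # "S" for the system, "U" for the user
    kind: str   # "n", "i" or "m"

    def __str__(self):
        return f"{self.agent}:{self.kind}"


@dataclass(frozen=True)
class Emulation:
    """An agent's emulation of a state, or of another emulation."""
    agent: str
    target: Union[State, "Emulation"]

    def __str__(self):
        return f"{self.agent}:E({self.target})"


def depth(term):
    """Number of emulation levels wrapped around the innermost state."""
    return 0 if isinstance(term, State) else 1 + depth(term.target)


# The terms named in the text, grouped by layer.
layers = {
    1: [State("S", "n"), State("S", "i"), State("S", "m")],
    2: [Emulation("S", State("S", "m"))],
    3: [Emulation("S", Emulation("U", State("S", "i"))),
        Emulation("S", Emulation("U", State("S", "m")))],
    4: [Emulation("S", State("U", "n")),
        Emulation("S", State("U", "i")),
        Emulation("S", State("U", "m"))],
}

if __name__ == "__main__":
    for number, terms in layers.items():
        print(number, ", ".join(str(term) for term in terms))
```

Written this way, the third layer is the only one whose terms have an emulation depth of two: the system modelling the user's model of the system.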
The second, third and fourth layers are able to exploit the information embedded
in the previous layers, and this is indicated by the large block arrows. This process
is equivalent to parameter sharing between the different models and thus represents
not only an efficient use of information but also offers a mechanism for learning.
In fact such a process may be bi-directional, and the potential flow of information
in the opposite direction is indicated by the small block arrows.
The basic communicative loop in the PRESENCE architecture contains system components
that are themselves realized using similarly-structured building blocks. The PRESENCE
architecture is thus inherently recursive and hence hierarchical in structure.
As a result, further refinements in behaviour arise from the operation of the nested
Figure: Overview of the PRESENCE architecture
PRESENCE is based on the premise that there are three fundamental factors that ultimately
determine an organism's fitness to survive in an evolutionary framework:
a need to manage energy (facilitating efficient behaviour)
a need to manage entropy (facilitating efficient communications)
a need to manage time (facilitating efficient planning)
These constraints, coupled with an integrated and recursive processing architecture,
pave the way to a new approach to spoken language technology in which high-level
interactive behaviours such as prosody and emotion emerge as fundamental aspects
of a communicative system rather than as processing afterthoughts.
A new model of speech generation that …
selects its characteristics appropriate to the needs of the listener
monitors the effect of its own output
modifies its behaviour according to its internal model of the listener
A new model of speech recognition that …
uses a forward/generative model based on an internal emulation of the communicative intentions of the speaker
adapts its forward/generative model to the voice of the speaker based on knowledge of its own voice
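One way to picture recognition with a forward/generative model is as analysis-by-synthesis: for each candidate intention, the recogniser generates what it would expect to hear and keeps the candidate whose prediction is closest to the observation. The Python sketch below is a toy version under strong assumptions. The feature vectors, the single `speaker_scale` parameter standing in for speaker adaptation, and the function names `synthesise`, `recognise` and `adapt_scale` are all invented for illustration and do not describe the actual PRESENCE models.

```python
def synthesise(intention, own_voice, speaker_scale):
    """Forward model: predict the observation for an intention.

    `own_voice` maps each intention to the system's own feature template;
    `speaker_scale` is an assumed one-parameter adaptation to the speaker.
    """
    return [speaker_scale * x for x in own_voice[intention]]


def recognise(observed, own_voice, speaker_scale=1.0):
    """Pick the intention whose predicted observation has the least error."""
    def squared_error(intention):
        predicted = synthesise(intention, own_voice, speaker_scale)
        return sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return min(own_voice, key=squared_error)


def adapt_scale(observed, intention, own_voice):
    """Least-squares scale that maps the own-voice template to the observation."""
    template = own_voice[intention]
    return (sum(o * t for o, t in zip(observed, template))
            / sum(t * t for t in template))


if __name__ == "__main__":
    own_voice = {"yes": [1.0, 2.0, 1.0], "no": [2.0, 1.0, 2.0]}
    observed = [1.2, 2.4, 1.2]  # assumed features from another speaker

    first_guess = recognise(observed, own_voice)
    scale = adapt_scale(observed, first_guess, own_voice)
    print(first_guess, scale, recognise(observed, own_voice, scale))
```

The adaptation step starts from knowledge of the system's own voice and adjusts only the mapping to the speaker, which is the sense in which the forward model is adapted rather than learned from scratch.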
Machines might talk with humans by putting themselves in our shoes, PhysOrg.com.
Nicolao, M., Tesser, F., & Moore, R. K. (2013). A phonetic-contrast motivated adaptation
to control the degree-of-articulation on Italian HMM-based synthetic voices. 8th
ISCA Speech Synthesis Workshop (SSW8). Barcelona, Spain.
Crook, N., Smith, C., Cavazza, M., Pulman, S., Moore, R. K., & Boye, J. (2010). Handling
user interruptions in an embodied conversational agent. AAMAS 2010: 9th International
Conference on Autonomous Agents and Multiagent Systems. Toronto.
Worgan, S., & Moore, R. K. (2008). Enabling reinforcement learning for open dialogue
systems through speech stress detection. Fourth International Workshop on Human-Computer
Conversation. Bellagio, Italy.
Hofe, R., & Moore, R. K. (2008). AnTon: Using an animatronic tongue and vocal tract
model to investigate human language learning from an energetics point of view, Epigenetic
Moore, R. K. (2007). Towards speech-based human-robot interaction. Proc. Symposium on Language
and Robotics, Aveiro, Portugal, 10-12 Dec.