by Jussi Karlgren, Ivan Bretan, Niklas Frost and Lars Jonsson
Virtual reality interfaces are sometimes thought of as embodying a return to a natural way of interaction - the way we interact with the real world. However, one aspect of real-world interaction is lacking in today's virtual reality interfaces: language, notably speech. Language is a versatile tool with which to manipulate both the world ("Paint the table red and make it round") and oneself ("Take me to the moon"); like virtual worlds, utterances do not need to obey the laws of the physical world. Language has its natural place in interaction with other people, but it can also be used to control one's environment.
We have built a prototype system to test the synergy effects obtained when introducing speech into an immersive interface. Among other things, we are currently investigating the effects of immersive interaction on linguistic problems such as reference resolution, and the effects of the choice and design of interaction metaphor on interaction style.
Our system DIVERSE (DIVE Robust Speech Enhancement) is a speech interface to a generic virtual environment based on the virtual environment DIVE (Distributed Interactive Virtual Environment). DIVE can be used with complex worlds modelled in a variety of formats. DIVERSE allows a user to create, remove, select, and manipulate objects in the world and move about in it using spoken English.
DIVERSE is implemented as a cascaded sequence of components. Speech recognition is done by means of HTK, a Hidden Markov Model system which has been trained for the domain. Text processing is performed by ENGCG, a general-purpose surface-syntactic processor which identifies syntactic roles and dependencies in the text. The resulting dependency graph is translated to a logical representation, which in turn is inspected for references to entities and objects and matched to the set of conceivable and possible actions. The resulting queries or commands are then sent to DIVE, which manipulates or queries the world accordingly.
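The cascade described above can be sketched as a chain of functions. This is a drastically simplified toy illustration: all component names, data shapes, and the two-word grammar are assumptions for exposition, not the actual HTK, ENGCG, or DIVE interfaces.

```python
# Toy sketch of a cascaded speech-interface pipeline (illustrative only).

def recognize(audio):
    # Stand-in for the HTK recognizer: pretend audio already comes as text.
    return audio

def parse(text):
    # Stand-in for ENGCG: a trivial verb-object "dependency graph".
    verb, _, obj = text.partition(" ")
    return {"head": verb, "object": obj}

def to_logical_form(graph):
    # Dependency graph -> simple predicate-argument representation.
    return (graph["head"], graph["object"])

def match_action(logical_form, world):
    # Match the representation against the set of possible actions.
    action, referent = logical_form
    if referent in world["objects"] and action in world["actions"]:
        return {"action": action, "target": referent}
    return None

def process(audio, world):
    # The full cascade: recognize -> parse -> logical form -> action match.
    return match_action(to_logical_form(parse(recognize(audio))), world)

world = {"objects": {"cube", "ball"}, "actions": {"select", "remove"}}
print(process("select cube", world))  # {'action': 'select', 'target': 'cube'}
```

In the real system each stage is of course far richer, but the key architectural point survives: each component consumes the previous component's output, so components can be developed and replaced independently.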
In a speech-controlled virtual environment there is no obvious counterpart for the user to engage in dialog with. There are several conceivable interaction models; we have chosen an agent-based interaction model for our implementation of DIVERSE. This is necessary to be able to integrate visual and spoken feedback naturally; with no feedback or interlocutor, the interaction situation would most likely be very unfamiliar and difficult to make use of. A consequence of giving the user a single interlocutor is that the system's linguistic competence can be modelled in this agent through its visual characteristics, its gestures, its language, and so on -- this will encourage users' language to converge toward what the system understands. Accordingly, the DIVERSE agent has been provided with a simple vocabulary and a small set of gestures.
One of the most challenging problems of language understanding is that of reference resolution: tracking which referents referential expressions such as "it", "the ball", "that cube", "a house", and other noun phrases refer to. When language use is strongly situated, as in an immersive interface such as DIVERSE, many of the traditional problems of reference resolution change in character. To find the referent of a referential expression, we give each accessible object in the world a focus grade, based on three sorts of factors.
The set of candidate referents is constrained by focus grade, and the candidate with the highest focus grade is chosen as the referent. We are currently conducting empirical studies to determine exactly what relative weighting these respective factors should have. The interactive design of the DIVERSE interface is related to recent trends in natural language interface research, where the underlying problem of interactive interfaces, especially natural language interfaces, is today identified as a low degree of interactivity, or "one-shot" interaction: users believe - regardless of system competence - that systems expect them to pose their queries in one go.
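The focus-grade mechanism can be sketched as a weighted score over candidate objects, with the highest-scoring candidate chosen as referent. The factor names and weights below are purely illustrative assumptions (the actual factors and their weighting are the subject of the empirical studies mentioned above):

```python
# Illustrative sketch of focus-grade reference resolution: each accessible
# object gets a focus grade from weighted factors, and the candidate with
# the highest grade is chosen. Factors and weights are hypothetical.

def focus_grade(obj, weights):
    # Weighted sum over the object's per-factor scores.
    return sum(w * obj["scores"].get(factor, 0.0)
               for factor, w in weights.items())

def resolve_reference(description, objects, weights):
    # Constrain candidates by the descriptive content of the expression
    # (e.g. "the ball" only matches balls; a bare pronoun matches anything),
    # then pick the candidate with the highest focus grade.
    candidates = [o for o in objects
                  if description == "it" or o["type"] == description]
    if not candidates:
        return None
    return max(candidates, key=lambda o: focus_grade(o, weights))

weights = {"recently_mentioned": 0.5, "in_view": 0.3, "near_user": 0.2}
objects = [
    {"name": "ball-1", "type": "ball",
     "scores": {"in_view": 1.0, "near_user": 0.2}},   # grade 0.34
    {"name": "ball-2", "type": "ball",
     "scores": {"recently_mentioned": 1.0, "in_view": 0.5}},  # grade 0.65
]
print(resolve_reference("the ball"[4:], objects, weights)["name"])  # ball-2
```

Note how the immersive setting supplies factors a text-only interface lacks: whether an object is currently in the user's view, or near the user, can contribute to its focus grade alongside recency of mention.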
In DIVERSE we make use of what we call the errors-do-not-matter principle: we do not worry about the system misinterpreting the occasional user utterance, since as long as the interface is interactive we do not expect misinterpretations to be too crucial a problem. More important than error handling is a broad acceptance of user utterances: every utterance should produce some effect. The representation of the utterance is matched to representations of possible actions in the domain. If no good match is found, any referents that have been identified in the utterance are highlighted anyway, making it easy for users to continue the discourse rather than start again from square one. This is similar to recent ideas about how to design natural language interfaces in general.
Our prototype system has already surprised us by the extent of its effects on interaction with the speechless virtual environment it is built on, and it has shed new light on how to think about some central problems in natural language understanding. We are now investigating these synergy effects more systematically. Several other projects with the ambition of integrating speech and virtual reality are currently in progress at research centers in Europe, the US and Japan.
As a consequence of this line of research, we expect to see significant results in several different subfields of both interactive systems and natural language understanding in the near future.