Multimodal human-robot interaction
Updated: 19 December 2014 - 12:01pm by M.A. Salichs
We have developed an autonomous personal robot, Maggie, designed to interact with the user on a peer-to-peer basis. This implies multimodality, personality, adaptivity, autonomy, learning ability, cooperativeness, reactivity and proactiveness.

To address the human-robot interaction problem, we believe the following issues have to be resolved:

Meaning Generation

The robot has to be able to understand its context, which involves detecting and identifying objects and humans. The ability to give meaning to objects in the environment will considerably improve the robot's interaction with them.

The essence of this problem is formulation: how to represent the meaning of something. It is a knowledge representation problem, and one that philosophers, psychologists and other scientists have addressed throughout human history. Many different and interesting approaches have emerged in recent years, but applying these ideas to robotics is not trivial.

For interaction with humans, we are developing a user model that aims to include the user's mental models, emotions, beliefs, desires and intentions.

Human-Human Interaction

In this field we use the division established by Morris in "Foundations of the Theory of Signs", which divides human communication into three areas:

  1. Syntax. It covers information theory: coding, channels, capacity, noise, redundancy and other probabilistic properties of language.

  2. Semantics. Meaning is the central concern of semantics. In the communication process, emitter and receiver have to agree on the meaning of a message.

  3. Pragmatics. The pragmatic domain studies the effects of communication on the behaviour of both the emitter and the receiver.

This brief schema establishes the framework of our human communication research. The model that we are developing will be implemented by means of the Automatic-Deliberated Architecture. This architecture, created by Ramón Barber, is a hybrid control architecture. The Syntax level above corresponds to the low level of the architecture, the Automatic Level. The Semantic level corresponds to the high level of the architecture, the Deliberated Level.

Human-Robot Interaction

The main goal of the above issues is to build a model that can be implemented in a computer system. We want to give the user the sensation of interacting efficiently with the personal robot. There are two ways to approach the problem: the inner approach and the outer approach. In the inner approach, we develop a human-human interaction model in the robot and then adjust the model at the pragmatic level so that the interaction dynamic works correctly. In the outer approach, we are more interested in developing a model that satisfies the interaction dynamic directly. This model does not have to be human-based.

By interaction dynamic we mean the process over time in which the robot does things to the user and detects things that the user does. This dynamic process has special time parameters (for example silence time, the duration of a question and waiting times), special movements (blinking, movements of assent, etc.), and the detection of special user gestures and movements such as "user is speaking" or "user is very close".

Human-robot interaction is defined as the study of humans, robots, and the ways they influence each other. This interaction can be social if the robots are able to interact with humans as partners, if not peers. In this case, there is a need to provide humans and robots with models of each other. Sheridan argues that the ideal would be analogous to two people who know each other well and who can pick up subtle cues from one another (e.g., musicians playing a duet).
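
The time parameters of the interaction dynamic can be illustrated with a minimal sketch. It is not the implementation used on Maggie: the class `InteractionTiming`, the function `next_action`, the action names and every numeric value are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class InteractionTiming:
    """Hypothetical timing parameters, in seconds (placeholder values)."""
    max_silence: float = 4.0       # silence tolerated before the robot reacts
    question_timeout: float = 8.0  # how long to wait for an answer
    blink_period: float = 5.0      # interval between two blinks


def next_action(timing, now, last_user_speech, last_blink, waiting_for_answer):
    """Choose one robot action from elapsed times (all times in seconds)."""
    silence = now - last_user_speech
    if waiting_for_answer and silence >= timing.question_timeout:
        return "repeat_question"
    if not waiting_for_answer and silence >= timing.max_silence:
        return "prompt_user"
    if now - last_blink >= timing.blink_period:
        return "blink"
    return "idle"
```

Keeping the timing values in one structure means the dynamic could be tuned, as in the outer approach, without touching the decision logic.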

A social robot has attitudes or behaviours that take the interests, intentions or needs of the humans into account. Bartneck and Forlizzi define a social robot as "an autonomous or semiautonomous robot that interacts and communicates with humans by following the behavioral norms expected by the people with whom the robot is intended to interact". The term sociable robot has been coined by Breazeal in order to distinguish an anthropomorphic style of human-robot interaction from insect-inspired interaction behaviours. In this context, sociable robots can be considered as a distinct subclass of social robots. She defines sociable robots as socially participative creatures with their own internal goals and motivations.


Multimodality allows humans to move seamlessly between different modes of interaction, from visual to voice to touch, according to changes in context or user preference. A social robot must provide multimodal interfaces, which try to integrate speech, written text, body language, gestures, eye or lip movements and other forms of communication in order to better understand the human and to communicate more effectively and naturally.

The modalities in HRI can be classified into two types: perception modes and expression modes. The modes work separately, that is, they do not communicate with each other directly. A higher-level entity, called the Communication Act Skill, synchronizes them globally. Our multimodality model for robot interaction is based on these modes:

  • Visual: gesture expression and recognition.

  • Tactile: tactile sensor and tactile screen perception.

  • Voice: text-to-speech and automatic speech recognition.

  • Audiovisual: sound and visual expression.

  • Remote: web-2.0 interaction.
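
The idea that the modes never talk to each other directly, and that a higher-level entity coordinates them, can be sketched as a simple mediator. This is a hedged illustration, not the actual Communication Act Skill: the class body, the method names and the touch-to-greeting rule are all assumptions, and real synchronization in time is not modelled.

```python
class CommunicationActSkill:
    """Hypothetical coordinator: modes register here and never call each other."""

    def __init__(self):
        self._expression_modes = {}  # mode name -> callable(payload)
        self.log = []                # perception events received so far

    def register_expression(self, name, handler):
        self._expression_modes[name] = handler

    def on_perception(self, mode, event):
        """Called by a perception mode; decides which expression modes react."""
        self.log.append((mode, event))
        # Illustrative rule only: a touch triggers a spoken and a visual reply.
        if mode == "tactile" and event == "touch":
            self.express({"voice": "Hello!", "visual": "wave"})

    def express(self, plan):
        """Send each part of a multimodal plan to its registered mode."""
        for name, payload in plan.items():
            handler = self._expression_modes.get(name)
            if handler is not None:
                handler(payload)
```

A perception mode only calls `on_perception`, and an expression mode only receives payloads through its handler, so adding a new mode does not require changes to the existing ones.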

Visual Interactive Mode: Gesture Expression Model

The Visual Mode includes all visible expressive acts. Traditionally, it is divided into kinesics (body gestures) and proxemics (body placement in the communication system). We treat the audiovisual mode, explained later, as a separate interactive mode. The importance of body movements in the communication act is well established, because they carry a lot of information that flows very quickly. Birdwhistell argues that 65% of the information in a human-human interaction is non-verbal. Visual gestures show human thoughts and mood, and they reply to, complement, accentuate and adjust verbal information. Several problems arise when we want to build a model of human gestures that can be implemented in a robot. We distinguish two directions: gesture expression and gesture recognition. At the moment only the former is being addressed.

A discrete set of atomic gestures has been implemented. An atomic gesture lasts less than approximately five seconds. Each atomic gesture can be interrupted in real time by another atomic gesture to compose the final dynamic expression. With regard to their lifetime, gestures are divided into acquired and non-acquired (innate) gestures. When the robot first becomes active it has a set of innate gestures, which may or may not be kept throughout its life. The robot can also learn more gestures from the user. With regard to gesture dynamics, we distinguish gestures that do or do not have a definite ending, and gestures that must or need not start from a required initial position. Each atomic gesture has intensity and velocity parameters that modulate it. With regard to how each gesture can be interpreted, we consider:

  • Emblems: that replace words and sentences.
  • Illustrators: that reinforce verbal messages.
  • Affective gestures: that show emotions and express affect.
  • Adjusting or control gestures: that regulate the flow and manner of communication. They are among the most culturally determined gestures.
  • Adaptors: that release emotional and physical tension. They are performed with a low level of awareness.
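
The properties of an atomic gesture described above can be gathered into a small data structure. The following is only a sketch under stated assumptions: the names `GestureKind`, `AtomicGesture` and `GesturePlayer` are hypothetical, the hard 5.0 second limit stands in for "approximately five seconds", and the value ranges of intensity and velocity are not given in the text.

```python
from dataclasses import dataclass
from enum import Enum


class GestureKind(Enum):
    EMBLEM = "replaces words and sentences"
    REINFORCING = "reinforces verbal messages"
    AFFECTIVE = "shows emotions and affect"
    CONTROL = "regulates the flow of communication"
    ADAPTOR = "releases emotional and physical tension"


@dataclass
class AtomicGesture:
    name: str
    kind: GestureKind
    duration: float                   # seconds
    intensity: float = 1.0            # assumed scale factor for amplitude
    velocity: float = 1.0             # assumed scale factor for speed
    acquired: bool = False            # False = innate, True = learned from the user
    has_ending: bool = True           # whether the gesture has a definite ending
    needs_initial_pose: bool = False  # whether it must start from a given position


class GesturePlayer:
    """Plays one atomic gesture at a time; a new gesture interrupts the old one."""

    MAX_DURATION = 5.0  # assumed hard limit, in seconds

    def __init__(self):
        self.current = None
        self.interrupted = []  # names of gestures cut short

    def play(self, gesture):
        if gesture.duration >= self.MAX_DURATION:
            raise ValueError("atomic gestures are assumed to be shorter than 5 s")
        if self.current is not None:
            self.interrupted.append(self.current.name)
        self.current = gesture

    def finish(self):
        """Mark the current gesture as completed normally."""
        self.current = None
```

Chaining calls to `play` is what composes the final dynamic expression from atomic gestures in this sketch.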

Tactile Mode

Two different kinds of tactile mode can be distinguished: tactile skin sensing and tactile screen sensing. The former is analogous to human skin sensing. The latter is exclusive to robotics. Depending on the hardware, the robot can detect that something is touching it, where the contact occurs, and how much force is applied. The tactile screen allows the robot to perceive ink-gesture data entered by the user. Since the tactile screen also displays an image, the ink-gesture data has to be interpreted against that image. The ability to display an image on a tactile screen is explained later under the audiovisual interactive mode.
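
One simple way to interpret ink-gesture data against the displayed image is a hit test between the stroke and the regions currently drawn on the screen. The sketch below assumes rectangular regions and a majority vote over stroke points. Both choices, and the function name `interpret_stroke`, are illustrative and not taken from the robot's software.

```python
def interpret_stroke(stroke, regions):
    """Return the label of the on-screen region containing most stroke points.

    stroke:  list of (x, y) pixel coordinates of the ink gesture.
    regions: dict mapping a label to a rectangle (x0, y0, x1, y1).
    Returns None when no point falls inside any region.
    """
    counts = {}
    for x, y in stroke:
        for label, (x0, y0, x1, y1) in regions.items():
            if x0 <= x < x1 and y0 <= y < y1:
                counts[label] = counts.get(label, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)
```

The same stroke can therefore mean different things depending on which image, and so which set of regions, is on the screen at that moment.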

Voice Mode

This mode is in charge of verbal human-robot communication.

Verbal Perception.

The verbal signal can be interpreted for speech recognition, but it also provides prosodic information about the user and the user's location. Our automatic speech recognition model is based on a dynamic system of ASR grammars running on an ASR engine. The set of active grammars can be changed in real time. The set of ASR grammars is built a priori according to what information is useful for the robot. Each grammar is related to a Speech Act, so speech recognition works as a speech act trigger. No ontological information is considered at the moment.
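
The relation between active grammars and speech acts can be sketched with a toy model in which each grammar is reduced to a set of accepted phrases. This is a simplification under stated assumptions: a real ASR engine works on audio and on proper grammar formats, and the class `GrammarSet` and its methods are hypothetical names.

```python
class GrammarSet:
    """Toy stand-in for an ASR engine: each grammar is a set of phrases."""

    def __init__(self):
        self._grammars = {}  # grammar name -> (set of phrases, speech act)
        self._active = set()

    def add(self, name, phrases, speech_act):
        """Define a grammar a priori and tie it to one speech act."""
        self._grammars[name] = ({p.lower() for p in phrases}, speech_act)

    def activate(self, *names):
        """Replace the set of active grammars at run time."""
        unknown = set(names).difference(self._grammars)
        if unknown:
            raise KeyError(f"unknown grammars: {sorted(unknown)}")
        self._active = set(names)

    def recognize(self, utterance):
        """Return the speech act triggered by the utterance, or None."""
        text = utterance.strip().lower()
        for name in sorted(self._active):
            phrases, speech_act = self._grammars[name]
            if text in phrases:
                return speech_act
        return None
```

Only active grammars can trigger a speech act, so switching the active set changes what the robot is able to hear at a given point in the dialogue.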

Verbal Expression.

The speech system is based on two types of sentences: fixed sentences and variable sentences. The former are designed a priori; they are sentences related to constant episodes that always occur in a common conversation. Variable sentences are built using a fixed grammar. When the speech skill decides to use a variable sentence, it first chooses a grammar with slots. Then the slots are filled with words appropriate to the context.
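
The difference between fixed and variable sentences can be illustrated with Python's standard `string.Template`. Using a template in place of a grammar with slots is a simplification, and the sentence texts, dictionary names and function names below are made up for the example.

```python
import string

# Fixed sentences: written a priori for recurring conversation episodes.
FIXED = {"greeting": "Hello, nice to see you."}

# Variable sentences: a fixed pattern with slots marked by $name.
TEMPLATES = {"weather": "Today in $city it will be $forecast."}


def fixed_sentence(key):
    """Return a sentence exactly as it was designed."""
    return FIXED[key]


def variable_sentence(key, context):
    """Fill the slots of the chosen pattern with words from the context.

    Raises KeyError if the context leaves a slot unfilled.
    """
    return string.Template(TEMPLATES[key]).substitute(context)
```

For example, `variable_sentence("weather", {"city": "Madrid", "forecast": "sunny"})` yields "Today in Madrid it will be sunny."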

Audiovisual Mode

A personal robot incorporates, and works by means of, one or more computers. The range of possible communication channels can therefore be extended from the emulation of human communication to other possibilities that a computer offers, for example electronic sound synthesis. The sound mode can be used for:
  • Mood, affect or emotion expression associated with long term states: happy/sad or angry/calm
  • Interjection expression associated with short term states: fright, scare, laughter, crying, etc.
  • Notification sounds, to get the user's attention, signal interaction prompts, etc.
  • Singing skill using synthesized instruments.
  • Sound imitation: siren sounds, dog barks and other nature sounds, ...
We are studying the communicative side of sound in music, extracting a set of parameters for sound synthesis and relating these parameters to the kind of message or intention that the robot wants to communicate. We are implementing a Sound Synthesis System that takes internal robot state parameters as inputs and synthesizes sounds for expression. These internal parameters include, but are not limited to, emotional state, emotional magnitude and mood energy.

Audiovisual mode refers to the expression of synchronized images, video and sound, music, or voice. Moreover, audiovisual expressions (sound, video and computer-generated graphics) can also be triggered and presented to the user as feedback or in response to the robot's initiatives.
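
As a rough illustration of how internal state parameters might drive sound synthesis, the sketch below maps three inputs to pitch, tempo and volume. Everything in it is an assumption: the emotional state is reduced to a single valence value, the mapping is a simple formula picked for readability, and the constants have no empirical basis.

```python
from dataclasses import dataclass


@dataclass
class SynthParams:
    pitch_hz: float
    tempo_bpm: float
    volume: float  # 0 .. 1


def clamp(x, lo, hi):
    """Limit x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def state_to_sound(valence, magnitude, energy):
    """Map internal state to synthesis parameters.

    valence:   -1 (sad) .. 1 (happy), a stand-in for the emotional state.
    magnitude:  0 .. 1, the strength of the emotion.
    energy:     0 .. 1, the mood energy.
    """
    valence = clamp(valence, -1.0, 1.0)
    magnitude = clamp(magnitude, 0.0, 1.0)
    energy = clamp(energy, 0.0, 1.0)
    # Shift pitch by up to half an octave around 440 Hz.
    pitch = 440.0 * 2 ** (valence * magnitude / 2)
    tempo = 60.0 + 80.0 * energy
    volume = 0.3 + 0.7 * magnitude
    return SynthParams(pitch, tempo, volume)
```

With this mapping a strongly happy state produces a higher pitch than a strongly sad one, and a more energetic mood produces a faster tempo.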

Remote Mode

This is the most robot-specific interactive mode. As the core of a robot is its computer, the robot can use all the capabilities that the computer offers. One of the most important things a computer can do is connect to the Internet and access remote information. At the same time, the Internet is growing so much that network protocols are shifting toward more machine-oriented protocols. In this sense, Web 2.0 and the so-called semantic web offer inter-computer communication as never before. In this way, the Internet works as a large sensor for the robot, which can access weather reports, news, e-mail, bus timetables, etc. The robot can also receive orders from a remote user and interact with that user through chat, video conferencing, etc.
