ELVA: AN EMBODIED CONVERSATIONAL AGENT IN
AN INTERACTIVE VIRTUAL WORLD
YUAN XIANG
(B.Comp.(Hons), NUS)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2004
ACKNOWLEDGEMENT
First of all, I wish to thank my supervisor A/P Chee Yam San for his guidance,
encouragement and patience over the years. The many discussions we had, in which he
showed his enthusiasm towards the topic, kept me on the right track.
I would like to show my gratitude to the head of NUS Museums, Ms. Angela Sim, for
granting us the permission to use the Ng Eng Teng Gallery as our subject matter, and
making the project possible.
I would like to thank Dr. Sabapathy for reviewing the virtual tour, and providing his
insights on Ng Eng Teng’s art. Also thanks to Suhaimi, curator of the Ng Eng Teng
Gallery, who has contributed his expert knowledge about how to guide a tour. Special
thanks to those who lent their hands during the gallery-scanning sessions: Mr. Chong,
Beng Huat, Chao Chun, Ting, and Donald.
I am also grateful to the members of LELS Lab: Chao Chun, Jonathan, Lai Kuan, Liu Yi,
and Si Chao. It was an enjoyable and memorable experience studying in this lab.
Finally, it is time to thank my family. Their support and patience have accompanied me
throughout the whole period.
Yuan Xiang
TABLE OF CONTENTS
TABLE OF CONTENTS ..................................................................................................III
LIST OF TABLES.............................................................................................................V
LIST OF FIGURES......................................................................................................... VI
SUMMARY .....................................................................................................................VII
CHAPTER 1 INTRODUCTION.......................................................................................1
1.1 EMBODIED CONVERSATIONAL AGENTS: A CHALLENGE FOR VIRTUAL INTERACTION .1
1.2 RESEARCH OBJECTIVES ...............................................................................................3
1.3 ELVA, AN EMBODIED TOUR GUIDE IN AN INTERACTIVE ART GALLERY.......................4
1.4 STRUCTURE OF THESIS .................................................................................................6
CHAPTER 2 RESEARCH BACKGROUND ..................................................................8
2.1 ECA AS A MULTI-DISCIPLINARY RESEARCH TOPIC .....................................................8
2.2 ARCHITECTURAL REQUIREMENT .................................................................................8
2.3 DISCOURSE MODEL .....................................................................................................9
2.4 MULTIMODAL COMMUNICATIONS .............................................................................10
2.5 SUMMARY ..................................................................................................................12
CHAPTER 3 AGENT ARCHITECTURE........................................................................13
3.1 OVERVIEW .................................................................................................................13
3.2 PERCEPTION MODULE ................................................................................................14
3.3 ACTUATION MODULE ................................................................................................14
3.4 INTERPRETATION MODULE ........................................................................................15
3.4.1 Reflexive Layer ..................................................................................................15
3.4.2 Reactive Layer ...................................................................................................15
3.4.3 Deliberative Layer.............................................................................................15
3.4.4 Behavior Generation .........................................................................................16
3.5 KNOWLEDGE BASE ....................................................................................................16
CHAPTER 4 AGENT’S VERBAL BEHAVIORS ........................................................17
4.1 SCHEMA-BASED DISCOURSE FRAMEWORK ................................................................17
4.2 DISCOURSE MODELING ..............................................................................................19
4.2.1 Narrative Modeling ...........................................................................................19
4.2.2 Dialog Modeling................................................................................................20
4.3 REASONING ................................................................................................................21
4.3.1 Analysis Phase ...................................................................................................21
4.3.2 Retrieval Phase..................................................................................................27
4.4 PLANNING ..................................................................................................................29
4.5 DIALOG COORDINATION ............................................................................................31
4.6 SUMMARY ..................................................................................................................32
CHAPTER 5 AGENT’S NONVERBAL BEHAVIORS ...............................................33
5.1 TYPES OF NONVERBAL BEHAVIORS IN OUR DESIGN ..................................................33
5.1.1 Interactional Behaviors .....................................................................................33
5.1.2 Deictic Behaviors ..............................................................................................34
5.2 NONVERBAL BEHAVIOR GENERATION .......................................................................35
5.3 NONVERBAL BEHAVIOR EXPRESSION ........................................................................36
5.3.1 Building Blocks of Multimodality ......................................................................37
5.3.2 Multimodality Instantiation ...............................................................................38
5.3.3 Multimodality Regulation ..................................................................................40
5.4 SUMMARY ..................................................................................................................41
CHAPTER 6 ILLUSTRATED AGENT-HUMAN INTERACTION ..........................42
6.1 AN INTERACTIVE ART GALLERY ...............................................................................42
6.2 VERBAL COMMUNICATION ........................................................................................43
6.2.1 Elva presents information..................................................................................44
6.2.2 Elva probes the user in a dialogue ....................................................................45
6.2.3 Elva answers questions......................................................................................46
6.3 NONVERBAL COMMUNICATION .................................................................................47
6.3.1 Deictic Gestures ................................................................................................47
6.3.2 Facial Display ...................................................................................................49
6.3.3 Locomotion .........................................................................................50
CHAPTER 7 EVALUATION..........................................................................................51
7.1 METHODOLOGIES.......................................................................................................51
7.1.1 Qualitative Analysis...........................................................................................51
7.1.2 Quantitative Analysis.........................................................................................52
7.2 SUBJECTS, TASK AND PROCEDURE.............................................................................52
7.3 EVALUATION RESULTS ..............................................................................................55
7.3.1 User Experience ................................................................................................55
7.3.2 Agent Believability.............................................................................................57
7.3.2.1 Evaluation results on Elva’s Verbal Behaviors ..........................................58
7.3.2.2 Evaluation results on Elva’s Non-verbal Behaviors...................................61
7.4 DISCUSSION ...............................................................................................................62
7.5 SUMMARY ..................................................................................................................63
CHAPTER 8 CONCLUSION..........................................................................................64
8.1 SUMMARY ..................................................................................................................64
8.2 CONTRIBUTIONS OF THE THESIS.................................................................................65
8.3 FUTURE WORK ..........................................................................................................66
REFERENCES .................................................................................................................68
LIST OF TABLES
Table 1: List of speech acts ................................................................................................23
Table 2: Layout of a sample tour plan................................................................................30
Table 3: Elva’s interactional behaviors (portion only) .......................................................34
Table 4: Elva's multimodal behavior library ......................................................................38
Table 5: Evaluation results related to user satisfaction in interaction with Elva................55
Table 6: Users' responses to Elva's personality .................................................................57
Table 7: Evaluation results related to Elva's Verbal Behaviors..........................................58
Table 8: Evaluation results related to nonverbal behaviors................................................61
LIST OF FIGURES
Figure 1: Examples of embodied conversational agents ......................................................2
Figure 2: Steve's deictic gestures........................................................................................11
Figure 3: Greta's facial display: neutral, sorry-for, and surprise. .......................................11
Figure 4: Three-layered architecture for the agent’s mental processing ............................13
Figure 5: Schema-based discourse space (portion only) ....................................................18
Figure 6: Narrative modeling using a schema ....................................................................20
Figure 7: Dialog modeling using a schema ........................................................................21
Figure 8: Transform raw user input to utterance meaning .................................................22
Figure 9: State transition graph for QUESTION_MATERIAL .........................................25
Figure 10: Speech act classification using STG .................................................................26
Figure 11: Elva's tour plan..................................................................................................29
Figure 12: Conversational Modalities ................................................................................37
Figure 13: Synchronized modalities in an animation timeline ...........................................39
Figure 14: Elva invites the user to start the tour.................................................................42
Figure 15: Elva's basic facial expressions ..........................................................................49
Figure 16: Elva's locomotion..............................................................................................50
Figure 17: Procedure of user study on Elva .......................................................................53
SUMMARY
The technology of Embodied Conversational Agents (ECAs) offers great promise for
natural and realistic human-computer interaction. However, interaction design in this
domain needs to be handled sensitively in order for verbal and nonverbal signals
conveyed by the agent to be understood by human users.
This thesis describes an integrative approach for building an ECA that is able to engage
conversationally with human users, and that is capable of behaving according to social
norms in terms of facial display and gestures. The research focuses on the attributes of the
agent’s autonomy and believability, not its truthfulness. To achieve autonomy, we present
a three-layered architectural design to ensure appropriate coupling between the agent’s
perception and action, and to hide the internal mechanism from users. In regard to agent
believability, we utilize the notion of “schema” to support structured and coherent verbal
behaviors. We also present a layered approach to generate and coordinate the agent’s
nonverbal behaviors, so as to establish social interactions within a virtual world.
Using the above approach, we developed an ECA called Elva. Elva is an embodied tour
guide that inhabits an interactive art gallery to offer guidance to human users. A user
study was performed to assess user satisfaction and agent believability when interacting
with Elva. The user feedback was generally positive. Most users interacted successfully
with Elva and enjoyed the tour. A majority agreed that Elva’s behaviors, both verbal and
nonverbal, were comprehensible and appropriate. The user study also revealed that
emotive behaviors should be integrated into Elva to achieve a higher level of believability.
Future work will address the areas of affective computing, multiparty interaction support,
and user modeling.
CHAPTER 1 INTRODUCTION
1.1 Embodied Conversational Agents: A Challenge for Virtual
Interaction
People are spending more time interacting with computers, which raises the question of how to design natural human-computer interfaces. What is the most natural human-computer interface? Speech is often seen as the most natural interface. However, it is only part of the answer, as it is the ability to engage in conversation that allows for the most natural interactions [5,6]. Conversation involves more than speech: a wide variety of signals such as gaze, gesture, and body posture are used naturally in conversation.
One path towards achieving realistic conversational interfaces has been the creation of
embodied conversational agents or ECAs. An ECA is a life-like computer character that
can engage in a conversation with humans using a naturally spoken dialogue. It also has a
“body” and knows how to use it for effective communication.
In recent years, systems containing ECAs have been deployed in a number of domains.
There are pedagogical agents that educate and motivate students in E-learning systems
(see Figure 1: A, B), virtual actors or actresses that are developed for entertainment or
therapy (see Figure 1: C), sales agents that demonstrate products in e-commerce applications (see Figure 1: D), and web agents that help the user surf the web pages of
a company. In the future, ECAs are likely to become long-term companions to many
people and share much of their daily activity [20].
Figure 1: Examples of embodied conversational agents: (A) Steve, (B) Cosmo, (C) Greta, (D) Rea
While embodied conversational agents hold the promise of conducting effective and
comfortable conversation, the related research is still in its infancy. Conversation with
ECAs presents particular theoretical and practical problems that warrant further
investigation. Some of the most intensively pursued research questions include:
• What are the appropriate computational models for building autonomous and intelligent ECAs?
• How can we draw on existing methods in research and effectively put the pieces together to design characters that are compelling and helpful?
• What is the set of communicative behaviors that are meaningful and useful for humans, and are practical to implement?
It is evident that the field of embodied conversational agents remains open and requires
systematic research.
1.2 Research Objectives
This research project aims to develop an integrative approach for building an ECA that is
able to engage conversationally with human users, and capable of behaving according to
social norms in terms of facial display and gestures.
Our research into embodied conversational agents focuses on the attributes of
autonomy and believability, not truthfulness. To achieve autonomy, the autonomous agent
we developed shall be capable of relating itself to the virtual world in a sensible manner.
In view of this, the architectural design aims to couple the agent’s perception and action
appropriately, and to hide its internal mechanism from the system users. To achieve
believability, the agent must demonstrate sufficient social skills in terms of speech and
animation. Towards this goal, we pay careful attention to the design of verbal and
nonverbal interaction and to the attainment of flows of behavior that help to establish
social facts within the virtual world.
In practice, we work towards assembling a coherent set of ECA technologies that help to
attain the attributes as mentioned above, and are feasible to implement given the
constraints of the project timeline and resources. Our work has emphasized the
development of ECA technologies as described below:
• Layered architectures: In our design, the agent is divided into three layers to support its autonomous behaviors. The agent is able to sense the environment and react to an actually detected situation. The agent can also act deliberately, i.e., plan a course of actions over time in pursuit of its own goal.
• Discourse model: We aim to provide a unified discourse framework where conversation about individual topics can be modeled. Under this framework, the agent will be able to narrate about a specific topic, as well as to respond to user queries in an appropriate manner.
• Generation and coordination of multimodal behaviors: Multimodal behaviors such as deictic gestures, turn taking, attention, and nonverbal feedback can be instantiated and expressed. We also need to devise an approach to coordinate the modalities that are generated.
We adopt this integrative approach in the design and development of an agent called Elva.
Elva is an embodied tour guide that inhabits an interactive art gallery to offer guidance to
the human users. A user study has been performed to measure user satisfaction and agent
believability. Human-computer interaction evaluation methods are adapted to cover this novel embodied conversational setting.
1.3 Elva, an Embodied Tour Guide in an Interactive Art Gallery
Elva is an embodied conversational agent developed in the Learning Science and
Learning Environment Lab, National University of Singapore, for this project. Her job is
to guide a user through an interactive art gallery, and to engage conversationally with the user about gallery artifacts.
In conventional tours, the guide decides where to take people and the exact timing of the
itinerary. In a virtual tour, however, we expect the system users
to behave more actively. In the interest of greater user freedom and participation, the
design especially caters to mixed-initiative interaction where both parties can direct the
navigation and conversation. The following paragraphs describe Elva's three
categories of behaviors: navigation, speech, and nonverbal behaviors.
• Navigation: A tour can be divided into a sequence of locations. Through
planning of the itinerary, the guide is able to intentionally direct the user from
one artifact to the next. When the user navigates independently, Elva can track
the user.
• Speech: Elva's verbal communication can be classified into two modes:
narrative and dialog. In the narrative mode, the guide presents information
centering on the artifacts, artists, or the gallery being visited. In the dialog mode,
two parties engage in conversation about a specific topic.
• Nonverbal Behaviors: The presence of body language and facial display helps to convey communicative functions: to manage turn taking and feedback, to refer to an artifact, and to create interesting effects.
We will use Elva as an example to illustrate various aspects of our approach throughout
this thesis.
1.4 Structure of Thesis
This thesis is divided into eight chapters, with each particular chapter detailing a specific
area of focus:
Chapter 1, Introduction, gives an overview of the motivations and objectives that this
thesis aims to achieve.
Chapter 2, Research Background, discusses the various research areas that ground this multi-disciplinary project. Existing methodologies are reviewed from the perspectives of
architectural requirements, discourse models, and multimodality.
Chapter 3, Agent Architecture, presents a generic three-layered architecture which forms
the backbone of our ECA. We then describe the construction of such an architecture and the
interaction among the system components.
Chapter 4, Agent’s Verbal Behaviors, introduces the notion of schema, a discourse
template from which the narrative or dialog about a specific topic can be dynamically
generated. It then describes how structured and coherent verbal behaviors are supported
under the schema-based knowledge framework.
Chapter 5, Agent’s Nonverbal Behavior, begins by highlighting the importance of
interactional behaviors and deictic gestures in enhancing agent-user communication. It
then presents our methods to generate and express the multimodal behaviors.
Chapter 6, Illustrated Agent-human Interaction, illustrates how Elva establishes verbal
and nonverbal communication with a system user, and guides the user in a gallery tour.
Chapter 7, Evaluation, describes the evaluation methodology and observed results of the
user study performed on the virtual tour with Elva.
Chapter 8, Conclusion, summarizes the thesis and documents our achievements and
contribution. Possible future work is also discussed.
CHAPTER 2 RESEARCH BACKGROUND
2.1 ECA as a Multi-disciplinary Research Topic
Research on embodied conversational agents embraces multiple dimensions: ECA inherits
the research problems from its supertype, the autonomous agent, which belongs to the field of Artificial
Intelligence; the verbal communication aspect may require technologies from Natural
Language Processing (NLP) and Information Retrieval (IR); the embodiment and
communicative behaviors aspects are intrinsically social, and form a dimension of Human
Computer Interaction (HCI) research.
We identified the key aspects of ECA technology through a breadth-wise study of the field. We then concentrated our research on the development of the ECA from the aspects of agent architecture, conversation model, and generation and coordination of multimodal behaviors. The following sections discuss previous research in these areas.
2.2 Architectural Requirement
Architectural requirements have been widely discussed in the research literature [9,21,23].
Sloman described architectural layers that comprise mechanisms for reactive, deliberative,
and meta-management controls [23]:
• The Reactive mechanism acts in response to an actually detected internal or external event. Reactive systems may be very fast because they use highly parallel implementations. However, such systems may lead to uncoordinated behaviors due to their inability to contemplate, evaluate, and select possible future courses of action.
• The Deliberative mechanism provides more sophisticated planning and decision-making mechanisms. Therefore, it allows for more coordination and global control.
• The Meta-management mechanism allows the deliberative processes to be monitored and improved. It is an ambitious step towards a human-like architecture. However, due to its extraordinary complexity, it is not yet practically viable for implementation [23].
Most agent architectures conform to the action selection paradigm [1,10,11,18]. FSA
(Finite State Automaton) is a simple architecture based on the reactive mechanism
described above. The behavior state in FSA is controlled by events generated based on the
values of the internal mood variables [18]. BLUMBERG is another reactive architecture,
which employs a hierarchical behavioral representation where behaviors compete among
themselves according to their “levels of interest.” [18]. JAM is a traditional BDI (Belief
Desire Intention) architecture operating with explicit goals and plans. JAM agents can
deliberately trigger the search for the best plan from a plan library [11].
2.3 Discourse Model
Classical chatterbots typically employ simple pattern matching techniques and have a
limited model of discourse. ALICE is based on Artificial Intelligence Markup Language,
an XML-compliant language which contains tags to define pattern and response pairs [10].
ALICE’s conversational capabilities rely on Case-Based Reasoning and pattern-matching.
Another chatterbot, JULIA, employs an Activation Network to model possible
conversation on a specific topic [10]. When an input pattern is matched, the node containing the pattern has its activation level raised, and the node with the highest activation level is then selected. This allows more structured and flowing conversations than simple pattern matching. It is interesting to note that NLP is seen as less important in chatterbots, since simple implementations are often sufficient for creating believable conversation in a limited context.
More recently, researchers began to exploit the notion of speech acts to analyze the
intentions of speakers [8,15]. Based on Searle’s speech act theory [3], an utterance can be
classified into a speech act, such as stating, questioning, commanding, and promising. The
categories of speech act can be domain-independent, as described in the TRAINS and DAMSL annotation schemes [8], or they can be defined so as to be relevant to a specific domain of conversation. When the intention of the speaker is captured as a speech act, the response given by the agent is likely to be more accurate than when the query is expressed with simple keywords alone.
2.4 Multimodal Communications
Embodiment allows multimodality, thereby making interaction more natural and robust.
Research on multimodal communications has concentrated on the question of generating
understandable nonverbal modalities. In what follows, we review some previous research
in this field.
Steve was developed by the Center for Advanced Research in Technology for Education
(CARTE) [16]. As shown in Figure 2, Steve can demonstrate actions, use gaze and deictic
gestures to direct the students’ attention, and he can guide the students around with
locomotion. In order to do this, Steve exploits its knowledge of the positions of objects in the world, its relative location with respect to these objects, as well as its prior explanations to create deictic gestures, motions, and utterances that are both natural and unambiguous.
Figure 2: Steve's deictic gestures
Greta, embodied in a 3D talking head (see Figure 3), shows rich expressiveness during natural conversation with the user. Greta manifests emotion by employing a Belief Network that links facial communicative functions to facial signals that are consistent with the discourse content [19,20]. The facial communicative functions are those typically used in human-human dialogs, for instance: syntactic, dialogic, meta-cognitive, performative, deictic, adjectival, and belief relation functions.
Figure 3: Greta's facial display: neutral, sorry-for, and surprise.
REA [6] is an embodied real-estate agent that is able to describe features of a house using a combination of speech utterances and gestures. Rea's speech and gesture output is generated in real time from the knowledge base and the description of communicative goals, using the SPUD ("Sentence Planning Using Description") engine.
2.5 Summary
We have reviewed some of the previous research in the areas of architectural
requirements, discourse models, and multimodality. In the next chapters, we proceed to
present our integrative approaches.
CHAPTER 3 AGENT ARCHITECTURE
In this chapter, we present a generic three-layered architecture which forms the backbone
of our ECA. We then describe the construction of such an architecture and the interaction
among the system components.
3.1 Overview
In our design, an embodied conversational agent interfaces with the virtual world via a
perception module and an actuation module. The perception module provides the agent
with high-level sensory information. The actuation module enables the ECA to walk and
to perform gestures and facial expressions. The gap between perception and action is
bridged by mental processing in the Interpretation Module, a knowledge-based inference
engine. We model this module using a combination of reactive and deliberative mechanisms in order to cope with mixed-initiative situations. The architecture (see Figure 4) comprises three layers: a reflexive layer (section 3.4.1), a reactive layer (section 3.4.2), and a deliberative layer (section 3.4.3).

Figure 4: Three-layered architecture for the agent's mental processing
3.2 Perception Module
The Perception Module endows an ECA with the ability to “see” and “hear” in the virtual
world. It supplies sensory information about self, system users (as avatars), environment,
and the relationships among them. For example, “user arrives at artifact2,” “user is facing
artifact2,” “I am at artifact2,” “artifact2 is on my left hand side,” “user clicks on the
artifact2,” and “User said to me.”
When a state change in the world is detected, the Perception Module will abstract the
information, and propagate this event to the reactive layer for appraisal. Simultaneously,
raw perceptual data, such as the coordinates of the user's current position, are fed to the reflexive layer for quick processing.
3.3 Actuation Module
The Actuation Module drives the character animation and generates synthesized speech.
The Interpretation Module produces behavior scripts which specify multimodal behaviors
over a span of time, and sends them to the event queue of the Actuation Module. The
Actuation Module sequences and coordinates the multimodal signals in different channels
(described in section 5.4) and plays them back. If necessary, the Actuation Module will
update the agent's goal state and conversational state upon the completion of actuation.
3.4 Interpretation Module
The Interpretation Module is an inference engine that bridges the agent's perception and actions. It comprises three layers, i.e., a reflexive layer, a reactive layer, and a deliberative
layer.
3.4.1 Reflexive Layer
The reflexive layer implements a series of reflex behaviors that are quick and inflexible
(Q&I). For example, the agent is able to track the user with head movements in response to the user's subtle location changes, and the agent glances at an artifact when the user clicks on it.
3.4.2 Reactive Layer
The reactive layer handles the user’s utterances as they arise. The Utterance Analyzer
performs natural language understanding tasks to determine the meaning of the utterance
so that it can be recognized by the agent. Based on the meaning, the Response Retriever
queries the Knowledge Base for the matched schema node that contains an appropriate
response. The details about each component are described in chapter 4.
3.4.3 Deliberative Layer
The deliberative layer provides planning and decision-making mechanisms. The ECA in
our environment is equipped with information-delivery goals and plans that accomplish
the goals. The Planner selects an appropriate plan from the plan library based on the agent
goal. During the interaction, the Planner may adopt new plans as the goal changes.
The Scheduler instantiates and executes the selected plan. With its help, the agent is able to advance the conversation at regular time intervals. In chapter 4, we will go through the details.
3.4.4 Behavior Generation
The utterances produced by the reactive layer and deliberative layer flow into a Dialog
Coordinator where turn taking is regulated. Finally, the Enricher generates appropriate
nonverbal behaviors to accompany the speech before the behavior is actuated. Details are described in chapters 4 and 5.
3.5 Knowledge Base
The Knowledge Base (KB) at the heart of the system consists of schema nodes that
encode the domain knowledge as condition-action pairs. FOCUS refers to the schema
node that constitutes the agent’s current focus of attention during the conversation.
FOCUS serves as a reference point for both reactive and deliberative processes, and it is
updated every time a new behavior has been actuated. The details will be covered in
section 4.1.
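To make the control flow concrete, the following is a minimal sketch of the dispatch described in this chapter. The function names, data layout, and toy knowledge base are our own illustration (they are not the class names of the actual implementation): raw percepts are short-circuited by the reflexive layer, abstracted events reach the reactive layer, and goal changes reach the deliberative layer.

def reflexive_layer(percept):
    # Quick-and-inflexible reflexes (section 3.4.1): no reasoning, just a direct mapping.
    if percept["type"] == "user_moved":
        return {"behavior": "track_user_with_head", "target": percept["position"]}
    if percept["type"] == "artifact_clicked":
        return {"behavior": "glance_at", "target": percept["artifact"]}
    return None

def reactive_layer(utterance, focus, knowledge_base):
    # Placeholder for utterance analysis and response retrieval (sections 4.3.1 and 4.3.2).
    meaning = {"speech_act": "QUESTION_MATERIAL", "keywords": ["material"]}  # stub result
    node = knowledge_base.get(focus, {"action": "Sorry, I do not understand."})
    return node["action"], meaning

def deliberative_layer(goal, plan_library):
    # Pick a discourse plan (a list of schema names) that fulfils the current goal (section 4.4).
    return plan_library.get(goal, [])

# Example use with toy data:
kb = {"DESCRIBE_MATERIAL_OF_ARTIFACT1": {"action": "It is made of Ciment-fondu."}}
plans = {("NARRATE", "ARTIFACT1"): ["DESCRIBE_METAPHOR_OF_ARTIFACT1",
                                    "DESCRIBE_MATERIAL_OF_ARTIFACT1"]}
print(reflexive_layer({"type": "user_moved", "position": (3.0, 1.5)}))
print(reactive_layer("what is it made of?", "DESCRIBE_MATERIAL_OF_ARTIFACT1", kb))
print(deliberative_layer(("NARRATE", "ARTIFACT1"), plans))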
CHAPTER 4 AGENT’S VERBAL BEHAVIORS
This chapter introduces the notion of schema, a discourse template from which the
narrative and dialog about a specific topic can be dynamically generated. It then describes
how structured and coherent verbal behaviors are supported under schema-base
knowledge framework.
4.1 Schema-based Discourse Framework
A schema defines a template from which narrative or dialog about a specific topic can be
dynamically generated. Each schema is related to a discourse type and a topic. For
example, the schema named ELABORATE_BIODATA_OF_ARTIST1 indicates that the
topic on BIODATA_OF_ARTIST1 is elaborated using this schema. A schema is modeled
as a network of utterances that contribute to its information-delivery goal. When a schema,
e.g., ELABORATE_BIODATA_OF_ARTIST1, is instantiated, the agent has to fulfill a
goal ELABORATE (ARTIST, BIODATA) by producing a sequence of utterances from
the template.
A pictorial view of the discourse space is shown in Figure 5: a network of utterances
(black dots) forms a schema (dark grey ovals). In turn, a few schemas are grouped into a
domain entity (light grey oval), e.g., an artist, an artifact, or an art-related concept.
Transiting from one schema to another simulates shifting the topic or referring to another concept in a conversation.
Figure 5: Schema-based discourse space (portion only)
An utterance is encapsulated as a schema node in the knowledge base. Each schema node
can have multiple conditions, an action, and several links. A condition describes a pattern
of user input, in terms of a speech act [3], and a keyword list, which activates the node.
The action contains a list of interchangeable responses. The link specifies three types of
relationships between two adjacent nodes. (1) A sequential link defines a pair of nodes in
the order of narration. (2) A dialog link connects the agent’s adjacent turns in a dialog.
The link contains a pattern which is expected from the user’s turn. (3) A reference link
bridges two schemas, which is analogous to a hyperlink.
We employ an XML tree structure to represent the schema nodes. The definition is shown
below:
<!ATTLIST schema schema_id ID #REQUIRED>
<!ATTLIST schema goal CDATA #REQUIRED>
<!ATTLIST condition type (dependant|independent) "independent">
<!ATTLIST node id ID #REQUIRED>
<!ATTLIST node schema_id IDREF #REQUIRED>
<!ATTLIST action affective_type CDATA #IMPLIED>
<!ATTLIST link relationship (sequential|dialog|reference) "sequential">
<!ATTLIST link node_id IDREF #REQUIRED>
Using this definition, a schema node can be encoded as follows:

<node id="..." schema_id="...">
  <condition>QUESTION_DESCRIPTION nonrepresentative,figure</condition>
  <action>Non-representative figures are very abstract human figures. The artist purposely abandons the details like gender and age, so that the figure becomes anonymous.</action>
</node>
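Equivalently, a schema node can be held in memory as a small record of conditions, an action, and typed links. The sketch below is our own illustration in Python; the class names and the example identifiers are hypothetical, but the fields mirror the structure just described.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Condition:
    speech_act: str          # e.g. "QUESTION_DESCRIPTION"
    keywords: List[str]      # e.g. ["nonrepresentative", "figure"]

@dataclass
class Link:
    relationship: str        # "sequential", "dialog", or "reference"
    node_id: str             # id of the adjacent schema node

@dataclass
class SchemaNode:
    node_id: str
    schema_id: str           # the schema (topic) this node belongs to
    conditions: List[Condition] = field(default_factory=list)
    responses: List[str] = field(default_factory=list)   # interchangeable responses
    links: List[Link] = field(default_factory=list)

node = SchemaNode(
    node_id="n1",            # hypothetical id, for illustration only
    schema_id="DESCRIBE_METAPHOR_OF_ARTIFACT1",
    conditions=[Condition("QUESTION_DESCRIPTION", ["nonrepresentative", "figure"])],
    responses=["Non-representative figures are very abstract human figures."],
)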
4.2 Discourse Modeling
In this section, we describe how narrative and dialogs can be modeled using the notion of
schema.
4.2.1 Narrative Modeling
In the narrative mode, the agent presents information centering on a topic. It is similar to
the situation in a traditional guided tour: a guide initiates the topic and delivers a
monologue unless interrupted by the visitor. The schema used to model narrative is
usually organized in a linear form. As depicted in Figure 6, each schema node
encapsulates an utterance. The nodes must be well ordered so as to generate a clear
narrative about a specific topic.
Figure 6: Narrative modeling using a schema (a linear sequence of utterances about the sculpture "Love and Hate")
4.2.2 Dialog Modeling
In the dialog mode, the agent and the user are engaged in the discussion of a topic. Turns
are frequently taken. A sample schema for dialog modeling is shown in Figure 7. Clearly,
the organization of schema nodes in a dialog is more dynamic and complicated than in
narrative. Some schema nodes branch out into a few dialog links, which describe the
user’s possible responses.
Figure 7: Dialog modeling using a schema (the agent asks the user to guess what the sculpture looks like and branches on the user's possible replies)
4.3 Reasoning
This section introduces the agent's reasoning process. We utilize the concepts of case-based reasoning and pattern matching. As a key area of improvement, the user's query
patterns are represented using the extracted meaning of the utterance, instead of simple
keywords.
4.3.1 Analysis Phase
The analysis phase attempts to “understand” the user utterance and transform it to an
internally recognizable meaning. In our system, the meaning of an utterance is represented
by the speaker’s intention (speech act) and semantics (keywords). The transformation
steps are depicted in Figure 8.
Figure 8: Transform raw user input to utterance meaning (query pattern)
• Preparation Steps

We examine the utterance to correct ill-formed inputs, resolve references (reference resolution aims to determine which noun phrases refer to each real-world entity mentioned in a document or discourse), tokenize the utterance into a linked word list, and finally apply the PORTER stemming algorithm [22] to transform the words to their stem form.

• Speech Act Classification

The next crucial step is to determine the user's intention via speech act classification.
Inspired by Searle’s speech act theory [3], the system defines over thirty speech
acts (see Table 1) to cover the user’s common queries and statements. Some
speech acts are domain independent, while the rest are related to our application,
i.e., virtual tour guide in an art gallery.
Illocution        Matters
QUESTION          WHY, WHEN, WHO, WHERE, QUANTITY, CONSTRUCT, MATERIAL, COMMENT, EXAMPLE, TYPE, DESCRIPTION, COMPARISON, MORE
REQUEST           MOVE, TURN, POINT, EXIT, REPEAT
STATE             FACT, REASON, EXAMPLE, PREFERENCE, COMPARE, LOOKLIKE, POSITIVECOMMENT, NEGATIVECOMMENT
COMMUNICATE       GREET, THANK, INTERRUPT, BYE, ACKNOWLEDGE, YES, NO, REPAIR

Table 1: List of speech acts
A speech act is described by its illocutionary force and the object. For example,
QUESTION_MATERIAL relates to an inquiry about the material.
Can you tell me what is this sculpture made of?
It is made of Ciment-fondu, an industrial material.
STATE_POSITIVECOMMENT reveals a user’s positive comment.
I like this sculpture very much.
I am glad to hear that.
Our early classification approach was based on the list of acceptable patterns for a
speech act. For example, “do you like *” is an acceptable pattern for
QUESTION_COMMENT. However, this approach results in relatively low
accuracy. Consider the following two utterances:
A: do you like this sculpture?
B: do you like to go for supper?
Utterance B is classified wrongly.
This problem was resolved by adding a list of rejected patterns for each
speech act. In this case, “do you like to *” shall be added to the reject patterns for
QUESTION_COMMENT.
We also encountered the overlapping problem in speech act classification. As it is
impossible to enumerate all possible patterns for a speech act, the classification is
not entirely clean-and-dry. In some cases, one utterance can elicit more than one
speech act. For example, the following utterance can be classified into both
QUESTION_CONSTRUCT and QUESTION_DESCRIBE.
Can you describe how to make ceramics?
Further investigation reveals that some speech acts, e.g., QUESTION_CONSTRUCT, are in fact specialized instances of another speech act (QUESTION_DESCRIBE). Therefore, their patterns may overlap. To get around this problem, we assess the level of specialization for each speech act. Priority is given to a specialized speech act when several speech acts are co-present.

We developed a speech act classifier based on a data model called the state transition graph (STG), as shown in Figure 9. An STG encapsulates the acceptable and rejected patterns for a speech act. A path from the start state to a terminal state indicates an acceptable or a rejected pattern.
Figure 9: State transition graph for QUESTION_MATERIAL
During classification, an utterance is validated against every STG defined for the agent's set of speech acts. For each STG, the classifier performs a word-by-word traversal of the utterance through the STG. If the traversal runs into a state from which there is no
escape, or terminates in the REJECT state, this indicates that validation has failed.
Otherwise, the utterance will terminate in the ACCEPT state, which means the
utterance matches with the corresponding speech act. For example, for an
utterance, “Do you know what this sculpture is made of,” there is a valid path
"do → you → know → what → make → of" in QUESTION_MATERIAL (see Figure 10).
Figure 10: Speech act classification using STG
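As a minimal sketch of the traversal just described, an STG can be stored as a dictionary mapping (state, word) pairs to successor states; the encoding, the tiny sample graph, and the skipping of words without an outgoing edge are our own simplification, not the data structure of the actual classifier.

# Tiny excerpt of an STG for QUESTION_MATERIAL: only one accepting path is encoded.
STG_QUESTION_MATERIAL = {
    ("START", "do"): "S1",
    ("S1", "you"): "S2",
    ("S2", "know"): "S3",
    ("S3", "what"): "S4",
    ("S4", "make"): "S5",
    ("S5", "of"): "ACCEPT",
}

def matches_speech_act(stems, stg):
    # Word-by-word traversal; words with no outgoing edge are skipped.
    # Returns True if the utterance reaches ACCEPT, False if it reaches REJECT
    # or never reaches a terminal state.
    state = "START"
    for word in stems:
        nxt = stg.get((state, word))
        if nxt is not None:
            state = nxt
        if state == "REJECT":
            return False
        if state == "ACCEPT":
            return True
    return False

# "Do you know what this sculpture is made of", after stemming:
print(matches_speech_act(["do", "you", "know", "what", "this", "sculptur",
                          "is", "make", "of"], STG_QUESTION_MATERIAL))   # True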
• Keyword Extraction

We proceed to extract semantics, in terms of keywords, from an utterance. Our approach is based on the Information Retrieval (IR) techniques of stop word removal (stop words are commonly used words which are usually ignored by a search engine in response to a search query) and synonym replacement.

Synonym replacement is a desired feature for embodied conversational agents. It allows a single set of keywords to be scripted in the Knowledge Base, and a list of
its synonyms to be specified in a lexical database (e.g. WordNet [24]). The feature
will help reduce the number of similar patterns we have to script to cover the various synonyms of a word, and it will allow us to adapt the agent to new vocabulary easily.

After keywords are extracted, they are combined with the speech act to form the utterance meaning. The utterance meaning is then escalated to the next level of processing.
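To make the whole analysis phase concrete, the following is a minimal sketch under assumed simplifications: the toy stop-word list, the synonym table, the trivial stemmer, and the hard-coded classifier below merely stand in for the PORTER stemmer, WordNet, and the STG-based classifier of the actual system.

STOP_WORDS = {"do", "you", "what", "is", "this", "the", "a", "an", "of", "me", "tell", "can"}
SYNONYMS = {"statue": "sculpture", "made": "material"}   # canonical keyword forms

def stem(word):
    # Placeholder standing in for the PORTER stemming algorithm [22].
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def analyze(utterance, classify_speech_act):
    words = utterance.lower().split()
    speech_act = classify_speech_act(words)                       # e.g. the STG classifier above
    content = [w for w in words if w not in STOP_WORDS]           # stop-word removal
    keywords = [SYNONYMS.get(stem(w), stem(w)) for w in content]  # stemming + synonym replacement
    # The utterance meaning = intention (speech act) + semantics (keywords) + the raw utterance.
    return {"speech_act": speech_act, "keywords": keywords, "raw": utterance}

print(analyze("What is this statue made of",
              classify_speech_act=lambda words: "QUESTION_MATERIAL"))
# {'speech_act': 'QUESTION_MATERIAL', 'keywords': ['sculpture', 'material'], 'raw': ...}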
4.3.2 Retrieval Phase
In the retrieval phase, the Response Retriever (refer to Figure 4) searches for the most
appropriate schema node that matches the user’s utterance meaning.
We have developed a heuristic searching approach, named locality-prioritized search,
which is sensitive to the locality of the schema nodes. The heuristic rules are based on the characteristics of the schema-based discourse framework that we have proposed. Under such a framework, the basic assumption is that conversation around a topic is modeled as a schema. In other words, the schema captures the relevant and coherent information about the topic. Given this assumption, the idea of locality-prioritized search is to use locality as a clue to information relevance, so as to perform an effective search. The steps are described as follows:
1. The search starts from the successive nodes of FOCUS (recall that FOCUS is the agent's attentional node during the conversation, i.e., "what we were talking about in this topic"; refer to Figure 4). A locality weight w1 is assigned to a successive node which is connected via a dialog link. If a node is connected to FOCUS via a sequential link, a relatively lower weight w2 will be assigned.
2. Scan the nodes within the current schema, assigning a weight w3 that is lower than
w2.
3. If necessary, the search space is expanded to the whole knowledge base and the lowest weight w4 is assigned.
For each schema node, the similarity between its ith condition and the user’s utterance
meaning is measured using the sum of both the hits in their speech acts Si and the hits in
the keyword lists Ki. We then employ a matching function f to compute the matching
score by multiplying the maximum similarity value with the locality weight w of the node.
The node with the highest matching score is accepted if the score exceeds a predefined
threshold.
f = max_{i=1..n} (α·Si + β·Ki) · w
The formulation of the matching function reveals our design rationale: first, speech acts
have been integrated as part of the pattern so as to enhance the robustness of pattern
matching; second, to avoid flattened retrieval, i.e., treating every node equally, we have introduced the locality weight, which gives priority to those potentially more related schema
nodes.
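The scoring can be sketched directly from the formula. The weights, threshold, and candidate nodes below are illustrative values of our own choosing, not those of the evaluated system; the candidate list is assumed to be ordered and weighted by the three search steps above.

ALPHA, BETA = 1.0, 0.5        # illustrative weights for speech-act hits and keyword hits
THRESHOLD = 1.0               # illustrative acceptance threshold

def match_score(node, meaning, locality_weight):
    # f = max over the node's conditions of (alpha*Si + beta*Ki) * w
    best = 0.0
    for cond in node["conditions"]:
        s_hit = 1.0 if cond["speech_act"] == meaning["speech_act"] else 0.0
        k_hit = len(set(cond["keywords"]) & set(meaning["keywords"]))
        best = max(best, ALPHA * s_hit + BETA * k_hit)
    return best * locality_weight

def retrieve(candidates, meaning):
    # candidates: list of (node, locality_weight) pairs collected by search steps 1-3.
    scored = [(match_score(node, meaning, w), node) for node, w in candidates]
    score, node = max(scored, key=lambda pair: pair[0])
    return node if score >= THRESHOLD else None

# Example with two candidate nodes, where w1 > w4:
meaning = {"speech_act": "QUESTION_MATERIAL", "keywords": ["material"]}
n_dialog = {"conditions": [{"speech_act": "QUESTION_MATERIAL", "keywords": ["material"]}],
            "action": "It is made of Ciment-fondu."}
n_other = {"conditions": [{"speech_act": "QUESTION_WHY", "keywords": ["texture"]}],
           "action": "..."}
print(retrieve([(n_dialog, 1.0), (n_other, 0.6)], meaning)["action"])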
4.4 Planning
The implemented planner relies on a repository of plans that are manually scripted by
content experts. There are two levels of plans: an activity plan constitutes a series of
activities to be carried out by the agent; a discourse plan lays out several topics to be
covered for an activity. In our context, Elva’s activity plan outlines the skeleton of a tour.
It includes: (1) welcome remarks; (2) an “appetizer”, i.e., a short and sweet introduction
before the tour; (3) the sequence of “must-sees”, i.e., artifacts to be visited along the
itinerary; (4) a "dessert", i.e., a summarization. A sample plan is shown in Figure 11. An
itinerary is planned at the level of domain entities (light grey ovals), whereas discourse
planning is done at schema level (dark grey ovals). A path through the discourse space
indicates the sequence of schemas, i.e. the topics, to be instantiated.
Figure 11: Elva's tour plan
Tour Plan          Discourse Plan
welcome remarks    WELCOME
                   INTRO_PARTICULAR_OF_MYSELF
                   INTRO_THEME_OF_EXHIBITION
appetizer          INTRO_BACKGROUND_OF_GALLERY
                   INTRO_PARTICULAR_OF_ARTIST
                   BRIEF_PLAN_OF_TOUR
artifact#1         PROBE_METAPHOR_OF_ARTIFACT1
                   DESCRIBE_METAPHOR_OF_ARTIFACT1
                   DESCRIBE_MATERIAL_OF_ARTIFACT1
                   IDENTITY_ARTISTINTENTION_OF_ARTIFACT1
artifact#3         DESCRIBE_METAPHOR_OF_ARTIFACT3
                   DESCRIBE_TEXTURE_OF_ARTIFACT3
                   IDENTITY_ARTISTINTENTION_OF_ARTIFACT3
...                ...
dessert            SUMMARIZE_TOUR
                   BYE

Table 2: Layout of a sample tour plan
The plan library serves as a repository of activity plans and discourse plans. Each plan
specifies the goal it intends to achieve. At run time, the Planner picks up a plan to fulfill
the agent goal. The goal can be modified by a user’s navigation or query. For instance,
when a user takes navigational initiative and stops at an artifact ARTIFACT1, the Planner
will be notified about the new goal NARRATE (ARTIFACT1). A discourse plan can
fulfill this goal by decomposing it into sub-goals. Say a discourse plan,
DESCRIBE_METAPHOR_OF_ARTIFACT1 → DESCRIBE_TEXTURE_OF_ARTIFACT1 → DESCRIBE_MATERIAL_OF_ARTIFACT1, is selected. The goal NARRATE
(ARTIFACT1) is then achieved through the accomplishment of the sub-goals DESCRIBE
(ARTIFACT1, METAPHOR), DESCRIBE (ARTIFACT1, TEXTURE) and DESCRIBE
(ARTIFACT1, MATERIAL).
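A minimal sketch of this decomposition is given below. The plan library keyed by goal and the naming convention used to split a schema name into a sub-goal are our own rendering; in the actual system the plans are scripted by content experts.

PLAN_LIBRARY = {
    ("NARRATE", "ARTIFACT1"): [
        "DESCRIBE_METAPHOR_OF_ARTIFACT1",
        "DESCRIBE_TEXTURE_OF_ARTIFACT1",
        "DESCRIBE_MATERIAL_OF_ARTIFACT1",
    ],
}

def decompose(goal):
    # Map a goal such as NARRATE(ARTIFACT1) to the sub-goals of a selected discourse plan.
    schemas = PLAN_LIBRARY.get(goal, [])
    sub_goals = []
    for schema in schemas:
        verb, rest = schema.split("_", 1)              # e.g. "DESCRIBE", "METAPHOR_OF_ARTIFACT1"
        aspect, _, entity = rest.partition("_OF_")     # e.g. "METAPHOR", "ARTIFACT1"
        sub_goals.append((verb, entity, aspect))
    return sub_goals

print(decompose(("NARRATE", "ARTIFACT1")))
# [('DESCRIBE', 'ARTIFACT1', 'METAPHOR'), ('DESCRIBE', 'ARTIFACT1', 'TEXTURE'),
#  ('DESCRIBE', 'ARTIFACT1', 'MATERIAL')]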
In the present development, the Planner provides Elva with a means to generate a random
tour plan, and localize the discourse content for individual sculptures when visited. The
choices of plans are limited to the available plans in the plan library. To some extent, the
Planner functions in a deterministic manner. While this seems adequate for the domain of
the intended application, a museum where a guide plans a tour of “must-sees” in a predefined sequence, the future development of the planner should favor a more flexible way
to generate plans that can cater to a dynamic world situation. For example, it is desirable
if a guide could customize a tour plan in accordance with the user’s preference.
The Scheduler parses and executes a plan. Before the execution of a plan, the agent’s
attentional state, i.e., FOCUS, will be positioned at a starting node of the targeted schema.
The scheduling task is performed at regular intervals. At each interval, the Scheduler has
to decide “what to say next” by selecting one of the successive nodes of FOCUS (this
selected node will become the next FOCUS). During the conversation, FOCUS keeps
advancing until the Scheduler establishes a dynamic sequence of utterances to span all the
schemas in the plan.
4.5 Dialog Coordination
The Dialog Coordinator (recall Figure 4) is responsible for turn taking management. It
keeps a conversation state diagram that contains possible states: NARRATE, DIALOG,
MAKE_REFERENCE, SHIFT_TOPIC, and WITNESS. The Dialog Coordinator examines the
inputs from the reactive layer, the deliberative layer, and the agent’s perceptual data to
determine the next appropriate state. For example, the agent transits the conversation state
to DIALOG if dialog links are frequently traversed. It changes to WITNESS state when
the user starts to manipulate an artifact or suddenly moves away. The treatment of turn
taking is designed carefully for each conversation state. For example, in DIALOG state,
the agent temporarily suspends the execution of the scheduler to wait for a user turn. In
WITNESS state, the waiting period can be even longer.
Another functionality of the Dialog Coordinator is to manage conversation topics using a topic stack. When Elva makes a reference to a concept, the related topic (represented by its schema ID) will be pushed onto the topic stack. When the reference is
completed, the topic will be cleared from the top of the stack, so that the agent can
proceed with its original topic.
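Sketched in code, the topic-stack bookkeeping might look roughly like this; the state names are those listed above, but the class itself is an illustration rather than the actual implementation.

class DialogCoordinator:
    STATES = {"NARRATE", "DIALOG", "MAKE_REFERENCE", "SHIFT_TOPIC", "WITNESS"}

    def __init__(self, initial_topic):
        self.state = "NARRATE"
        self.topic_stack = [initial_topic]      # schema IDs; top of the stack = current topic

    def start_reference(self, referenced_schema_id):
        # Elva makes a reference to another concept: push it and change state.
        self.state = "MAKE_REFERENCE"
        self.topic_stack.append(referenced_schema_id)

    def finish_reference(self):
        # The reference is completed: pop it so that the original topic resumes.
        if len(self.topic_stack) > 1:
            self.topic_stack.pop()
        self.state = "NARRATE"

    def current_topic(self):
        return self.topic_stack[-1]

dc = DialogCoordinator("DESCRIBE_METAPHOR_OF_ARTIFACT1")
dc.start_reference("DEFINE_CONCEPT_OF_SCULPTART")
dc.finish_reference()
print(dc.current_topic())    # back to DESCRIBE_METAPHOR_OF_ARTIFACT1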
4.6 Summary
We utilized the notion of “schema” to support structured and coherent verbal behaviors.
Under a unified discourse framework, the ECA is able to respond appropriately to user enquiries, as well as to generate plans to fulfill its information delivery goals. In addition,
we briefly described how turn taking was coordinated based on the state of conversation.
CHAPTER 5 AGENT’S NONVERBAL BEHAVIORS
This chapter begins by highlighting the importance of interactional behaviors and deictic
gestures in enhancing agent-user communication. It then presents our methods to generate
and express the multimodal behaviors.
5.1 Types of Nonverbal Behaviors in Our Design
In interpersonal conversations, people make complex representational gestures with their
hands, gaze away from as well as towards one another, and tilt their heads once in a while.
Conversation exploits almost all the affordances of the human body. However, a
successful model of nonverbal behaviors in ECA does not necessarily resemble
interpersonal conversations in all respects. As Cassell et al [4] argued, the ECA should
“pick out those facets of human-human conversation that are feasible to implement, and
without which the implementation of an ECA would make no sense.”
In light of this, we implemented two vital types of nonverbal behaviors. They are
interactional behaviors that help to regulate turn taking and feedback in conversation, and
deictic behaviors that help to refer to an object in the virtual world.
5.1.1 Interactional Behaviors
Our primary interest lies in designing interactional behaviors for turn taking and feedback functions. The interactional behaviors convey significant
communicative information during conversation [6]. Gaze can be used to regulate turn
taking in mixed-initiative dialogue. For example, Elva gives a turn by looking at the user
and raising the eyebrows. And she seeks a turn by raising her hands in the gesture space.
Head nods and facial expression can provide unobtrusive feedback to the user’s utterances
and actions without unnecessarily disrupting the user’s train of thought [6,13]. For
example, Elva gives feedback by looking at the user and nodding her head. She requests
feedback by looking at the user and raising her eyebrows.
The following table lists Elva’s interactional behaviors.
Communicative Function      Interactional Behavior
Welcome and bye
  React to user arrival / React to user exit:  Look at user, nod head, wave
Turn taking
  Give turn:                 Look at user, raise eyebrows, (silence)
  Want turn:                 Raise hands in gesture space
  Take turn:                 Glance away, (start speaking)
Feedback
  Give feedback:             Look at user, nod head
  Request feedback:          Look at user, raise eyebrows
Navigation
  Invite for navigation:     Look at user, raise eyebrows, show the way
  Follow navigation:         Look at user, nod head, start walking
Others
  Witness when a user manipulates an object:   Look at object, glance at user periodically
  Short glance at user

Table 3: Elva's interactional behaviors (portion only)
5.1.2 Deictic Behaviors
Deictic behavior is another important aspect of our design. With deictic behaviors, the
agent can direct the users’ attention to the entities that exist in the virtual space [5,16]. For
example, Elva points at an artifact before she starts to narrate, and she points in a direction when she invites the user to go to the next artifact. Moreover, Elva is also able to differentiate left and right directions. For example, when narrating in front of two artifacts, Elva is able to point to the correct artifact based on the discourse content.
5.2 Nonverbal Behavior Generation
Interactional behaviors usually occur in between two turns or during the user turn. In our
design, we adopt a goal-based model to generate interactional behaviors. The Dialog
Coordinator (refer to Figure 4) reports the changes of the conversational states and events
of turn taking to the Enricher module. Based on these clues, the Enricher determines the
communicative goal (such as take turn, give turn, seek turn, give feedback, request
feedback) to be achieved. Then the communicative goal is mapped to a behavior script
which corresponds to an interactional behavior.
In our design, deictic behaviors occur just before the speech. For example, Elva points at
artifact12 and then starts to narrate. Deictic behaviors are propositional and therefore require
behavior planning. In order to integrate deictic behaviors, an utterance in the schema node
is annotated with additional information. For example, “#point(this_artifact) you are now
looking at ‘Love and Hate’” denotes that a deictic behavior of pointing at the sculpture
“Love and Hate” will be acted out before the utterance is spoken. Based on the annotation,
the Enricher first scans the behavior script that is generated. If there is no existing
propositional gesture introduced by the communicative goal, the Enricher will insert the pointing gesture into the script and act it out just before the speech.
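A sketch of this enrichment step is shown below, assuming the "#point(...)" annotation syntax illustrated above; the parsing code and the dictionary-based script format are our own illustration, not the script language of the actual system.

import re

ANNOTATION = re.compile(r"^#point\((?P<target>[^)]+)\)\s*")

def enrich(utterance, script):
    # If the utterance carries a deictic annotation and the script has no
    # propositional gesture yet, insert a pointing gesture before the speech.
    match = ANNOTATION.match(utterance)
    if match:
        text = utterance[match.end():]
        has_propositional = any(seg["channel"] == "GESTURE" for seg in script)
        if not has_propositional:
            script.insert(0, {"channel": "GESTURE",
                              "segment": "point_at_object",
                              "target": match.group("target")})
    else:
        text = utterance
    script.append({"channel": "FACE", "segment": "speak", "utterance": text})
    return script

print(enrich("#point(this_artifact) you are now looking at 'Love and Hate'", []))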
In addition to these, a series of reflex behaviors have been implemented in the agent. For
example, Elva is able to track the user with head movement in response to the user’s
subtle location change; Elva glances at artifact when the user selects an artifact by
clicking on it. In implementation, the Q&I module (see Figure 4) processes stimuli from the virtual world, such as the user's change of position. The stimuli are quickly mapped to a
reflex behavior. Then the reflex behaviors are fed to the Actuator Module for actuation.
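The fast path from stimulus to reflex can be pictured as a simple lookup, as in the sketch below; the stimulus names and the actuator interface are assumptions made for illustration only.

REFLEX_MAP = {
    # stimulus -> function producing (channel, segment) pairs
    "USER_MOVED":       lambda ctx: [("TURN HEAD", "look_at_user")],
    "ARTIFACT_CLICKED": lambda ctx: [("TURN HEAD", f"look_at_object({ctx['object']})")],
}

def handle_stimulus(event, ctx, actuator):
    """Q&I-style fast path: map a virtual-world stimulus directly to a
    reflex behavior and hand it to the actuator, bypassing the planner."""
    for channel, segment in REFLEX_MAP.get(event, lambda c: [])(ctx):
        actuator.play(channel, segment)   # hypothetical actuator interface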
Notice that nonverbal behaviors can be generated simultaneously from two components of
the agent architecture: interactional behaviors and deictic behaviors are generated by the
Enricher, while reflex behaviors are generated by the Q&I module. In section 5.3.3, we
describe how these behaviors are regulated.
5.3 Nonverbal Behavior Expression
A common way to generate agent animation is to define a collection of animation
primitives, and then use a sequencing engine to string primitives together. However, this
approach is not sufficiently flexible to generate dynamic and complicated behaviors. For
example, it is not possible to implement a head-tracking behavior, because the animations
are pre-defined. In view of this, we propose that nonverbal behaviors be expressed as
multimodal signals across different channels. In this section, we present our methods for
instantiating and regulating more realistic multimodal behaviors in real time.
5.3.1 Building Blocks of Multimodality
As shown in Figure 12, Elva's nonverbal behavior is realized as animations distributed
across several conversational modalities: facial display, head movement, gesture, and
locomotion.
Figure 12: Conversational Modalities
For each modality, a set of behavior segments is defined (Table 4). A behavior segment is
basically a small building block of real-time animation. Some behavior segments can be
realized using pre-recorded animation, e.g., smile, puzzled look, glance_away. However,
others may require real-time information to create a dynamic effect. For example, in order
to play back look_at_user, the turning angle needs to be calculated from the user's
location, the agent's location, body orientation, and head orientation. As another example,
point_at_object requires information about the object, which helps to determine which
hand the agent reaches out.
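As a rough illustration of the real-time information such segments need, the following sketch derives a head-turn angle for look_at_user from the user's and agent's positions; the 2D ground-plane geometry and the clamping limit are assumptions rather than the system's actual animation code.

import math

def head_turn_angle(agent_pos, agent_heading_deg, head_yaw_deg, user_pos,
                    max_turn_deg=80.0):
    """Yaw (in degrees) the head must rotate so that the agent looks at the
    user. Positions are (x, z) pairs on the ground plane; headings in degrees."""
    dx = user_pos[0] - agent_pos[0]
    dz = user_pos[1] - agent_pos[1]
    bearing = math.degrees(math.atan2(dx, dz))            # world-frame direction to user
    current = agent_heading_deg + head_yaw_deg            # where the head points now
    delta = (bearing - current + 180.0) % 360.0 - 180.0   # shortest signed rotation
    return max(-max_turn_deg, min(max_turn_deg, delta))   # assumed comfort limit

# e.g. agent at the origin facing +z, head straight, user slightly to the right
angle = head_turn_angle((0.0, 0.0), 0.0, 0.0, (1.0, 2.0))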
Modality          Behavior Segments
Facial Display    speak ($affective_type, $utterance), glance_away, raise_eyebrow, smile, puzzled, normal
Head Movement     look_at_user, look_at_object ($object), nod, shake, tilt
Gesture           point_at_object ($object), unpoint, show_the_way, batonic, wave_hand, raise_hand ($whichhand), clap_hands
Locomotion        go_to_user, go_to_artifact, turn_to_user, turn_to_object ($object), half_turn_to_object ($object)

Table 4: Elva's multimodal behavior library
5.3.2 Multimodality Instantiation
A behavior script defines a complex nonverbal behavior that is linked to a specific
communicative goal. The script prescribes the animations along four animation timelines,
i.e., FACE, TURN HEAD, GESTURE, and LOCOMOTION. On each timeline is a
sequence of behavior segments that forms a seamless animation.
The problem with instantiating multimodality is that animations can run out of
synchronization if there is no proper control over the different channels. To address this,
a lock mechanism has been developed to ensure the synchronization of animations. In the
behavior script, we specify a control statement "WAIT(lock#)", which locks its
successive behavior segment. The statement "RELEASE(lock#)" serves to free the lock
and resume that behavior segment. In this way, a time dependency can be enforced on two
behavior segments even when they are in different channels.
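A minimal sketch of how such WAIT/RELEASE locks could be realized, assuming each of the four timelines runs on its own thread; the class and method names are illustrative, not taken from the actual system.

import threading

class ScriptLocks:
    """Named locks shared by the timeline threads (FACE, TURN HEAD, GESTURE,
    LOCOMOTION): wait() blocks a timeline until another timeline calls release()."""
    def __init__(self):
        self._events = {}
        self._guard = threading.Lock()

    def _event(self, name):
        with self._guard:
            return self._events.setdefault(name, threading.Event())

    def wait(self, name):        # corresponds to WAIT(lock#) in the script
        self._event(name).wait()

    def release(self, name):     # corresponds to RELEASE(lock#)
        self._event(name).set()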
Figure 13: Synchronized modalities along an animation timeline. FACE: neutral → raise eyebrow → speak → neutral; TURN HEAD: look at artifact → look at user → look at artifact → look at user; GESTURE: point at artifact → unpoint; LOCOMOTION: half turn to artifact → turn to user.
Below is a behavior script that prescribes the animation in Figure 13.
SCRIPT START
name=point_and_talk
communitiveGoal=START_NARRATIVE
type=MIXED
preemptive=false
FACE START
WAIT(lock2)
raise_eyebrow
speak($affective_type, $utterance)
RELEASE(lock3)
FACE END
TURNHEAD START
WAIT(lock1)
look_at_artifact($artifact)
WAIT(1)
look_at_user
RELEASE(lock2)
look_at_artifact($artifact)
WAIT(lock3)
look_at_user
TURNHEAD END
POSE START
WAIT(lock1)
point_at_artifact($artifact)
WAIT(lock3)
unpoint
RELEASE(lock4)
POSE END
LOCOMOTION START
half_turn_to_artifact($artifact)
RELEASE(lock1)
WAIT(lock4)
turn_to_user
LOCOMOTION END
SCRIPT END
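For illustration, a rough interpreter for such a script could run each timeline block on its own thread and share a lock object offering wait/release (for example, the ScriptLocks sketch above). The line-by-line parsing and the reading of WAIT(1) as a one-second pause are assumptions, not the actual script engine.

import threading
import time

def run_timeline(lines, locks, play):
    """Execute one timeline block of a behavior script. `locks` is any object
    with wait(name)/release(name); `play` hands a segment to the animation engine."""
    for line in lines:
        if line.startswith("WAIT(lock"):
            locks.wait(line[5:-1])          # block until another timeline releases it
        elif line.startswith("WAIT("):
            time.sleep(float(line[5:-1]))   # timed pause, e.g. WAIT(1), assumed seconds
        elif line.startswith("RELEASE("):
            locks.release(line[8:-1])
        else:
            play(line)                      # e.g. "point_at_artifact($artifact)"

def run_script(timelines, locks, play):
    """timelines: dict mapping a channel name to its list of script lines."""
    threads = [threading.Thread(target=run_timeline, args=(lines, locks, play))
               for lines in timelines.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()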
5.3.3 Multimodality Regulation
Nonverbal behaviors are generated from two layers of the agent architecture. Behavior
scripts are generated from the Enricher (refer to Figure 4), and reflex behaviors are generated
from the Q&I component. It is possible that behaviors from the two layers attempt to use the
same animation channel, e.g., head movement, at the same time. To resolve such conflicts,
we first define a Boolean variable, preemptive, in the behavior script. If the preemptive value
is true, a reflexive behavior is allowed to overlap with the behavior script under certain
conditions. The regulation is based on the rules described as follows:
Given a behavior script and a conflicting reflexive behavior:
Rule 1: If the behavior script is not preemptive, the reflexive behavior is prohibited.
Otherwise, use Rule 2.
Rule 2: If the channel requested by the reflexive behavior is not vacant, as specified in
the behavior script, then the reflexive behavior is prohibited during the execution of
the behavior script. Otherwise, act out the reflexive behavior.
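These two rules translate directly into a small predicate, sketched below; script.preemptive and script.busy_channels() are hypothetical accessors standing in for the actual script representation.

def allow_reflex(script, reflex_channel):
    """Decide whether a conflicting reflex behavior may be acted out while a
    behavior script is running. `script.preemptive` is the Boolean defined in
    the script; `script.busy_channels()` is a hypothetical helper returning
    the channels the script currently occupies."""
    if not script.preemptive:
        return False            # Rule 1: a non-preemptive script always wins
    if reflex_channel in script.busy_channels():
        return False            # Rule 2: the requested channel is not vacant
    return True                 # otherwise, act out the reflex behavior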
5.4 Summary
To sum up, we have introduced a layered approach to generating appropriate nonverbal
behaviors. We also presented mechanisms to tackle the issues of instantiating and
regulating multimodality.
CHAPTER 6 ILLUSTRATED AGENT-HUMAN INTERACTION
This chapter illustrates how Elva communicates verbally and nonverbally with a system
user, and guides the user in a gallery tour.
6.1 An Interactive Art Gallery
A virtual art gallery (Figure 14) has been developed using the design framework of the
C-VISions (Collaborative Virtual Interactive Simulations) system [7]. The C-VISions
browser allows the user to interface with the virtual world: navigating and acting upon
objects. A chatterbox allows the user to carry out conversation. The virtual guide, named
Elva, appears in front of the user as an animated female character and talks with the
system user through a speech synthesizer.

Figure 14: Elva invites the user to start the tour
At present, the gallery houses a virtual exhibition “Configuring the Body: Form and
Tenor,” which utilizes the existing content in the Ng Eng Teng Gallery of NUS Museums.
Ng Eng Teng was Singapore’s foremost sculptor. He produced a body of consistent and
distinct artwork over a period of forty-five years. Altogether thirteen pieces of Ng Eng
Teng’s artworks were carefully selected for this virtual exhibition.
In the tour, Elva begins by highlighting the "body configuration" concept. Then Elva
presents some of the artist’s extraordinary works to the user. If the user decides to browse
the artwork at his/her own pace, Elva will track the user. In the next sections, we will use
a few episodes to illustrate the verbal and nonverbal communication between Elva and a
system user during a virtual tour.
6.2 Verbal Communication
The following sections illustrate Elva’s two modes of conversations, i.e., narrative
(section 6.2.1) and dialogue (section 6.2.2).
6.2.1 Elva presents information
The following episode illustrates a narrative situation. Elva provides an overview of the
artist, Ng Eng Teng, and then gives a short briefing on the tour.
Elva started to narrate at Ng Eng Teng’s head sketch.
1. Right here you are looking at the head portrait
of the artist Ng Eng Teng.
2. Ng Eng Teng is one of the most important
sculptors and ceramic artists in Singapore.
3. Over a period of forty-five years, he produced
a body of work which is consistent and distinct.
4. His work also marks milestones in Singapore's
art history, especially in the development of modern
sculpture and ceramics.
5. In this tour, I will highlight the "body
configuration" concept using a few artworks.
6. After that, if you want to browse the artwork
at your own pace, please feel free to do so.
7. Would you like to go to the next piece?
8. yes
9. Please follow me
6.2.2 Elva probes the user in a dialogue
The following episode illustrates a dialog situation. Elva invites the user to guess the
meaning of a sculpture called "untitled."
1. This sculpture is "untitled". Can you make a
guess what it looks like?
There was a short silence.
2. Any guess, Raymond?
3. is it a bone?
4. Bone. That is interesting. More?
5. i am not sure.
6. Ok, let me give you a hint!
7. Can you see it as a human body?
8. not really.
9. Well, Raymond. Try to rotate the sculpture to
have a better view.
User clicked on sculpture. The sculpture started to
rotate.
Elva stood by and witnessed.
Silence lasted for 8 seconds.
10. To me, it is a female body with the head and
limbs omitted, namely a "torso".
Elva continued to elaborate on the sculpture.
6.2.3 Elva answers questions
The following episode illustrates another dialog situation, in which the user took the
initiative and asked a few questions.
1. What is the material used for this sculpture?
2. The material used here is ciment-fondu.
3. Can you tell me what is ciment-fondu?
4. Ciment fondu is originally an industrial
material. It is also used in sculpting as well as
refractoring.
5. Tell me more about it.
6. OK. I can share with you some basics. Ciment fondu is a rapid-hardening material. Normally it
takes one day to become solid. Therefore it is ideal
for rapid strength development like sculpting.
7. Where can I buy ciment-fondu?
8. Sorry but I am not sure where the place is.
6.3 Nonverbal Communication
6.3.1 Deictic Gestures
The following episode depicts how Elva uses deictic gestures to direct a user’s attention.
Elva stands at the sculpture "Oh my bump". She is
going to explain the figures on the wall.
1. Elva says: Now Dr. Tan, please look at the wall
beside this sculpture.
2. These figures are extracted from a sketch book,
where Eng Teng experimented on the "torso-to-face"
concept.
Elva walks towards the figures on the wall, and
stands just in front of the wall.
3. Elva says: As you can see on the right most, a human
body is displayed.
As seen in the Figure above, Elva raised her hand
and pointed to the rightmost of the wall-figures.
4. Elva says: The body gradually transforms to a face,
as you can see on the left most.
As seen in the Figure above, Elva gradually moved
her hand towards the left most of the wall-figures.
5. Elva says: Notice that from right to left, the
interpretation of the human body becomes more and
more abstract.
6. Elva says: Interestingly, the idea of "torso-to-face"
comes from Eng Teng's life drawing experience.
As seen in the Figure above, after Elva narrated
about the wall-figures, she turned to the user.
6.3.2 Facial Display
The following figure displays Elva’s basic facial expressions.
Figure 15: Elva's basic facial expressions (neutral, puzzled, raise eyebrow, smile)
6.3.3 Locomotion
The following figure depicts Elva’s navigation around the gallery.
Figure 16: Elva's locomotion
CHAPTER 7 EVALUATION
This chapter describes the evaluation methodology and observed results of the user study
performed on the virtual tour with Elva.
7.1 Methodologies
The user study aims to evaluate the present development of the agent Elva. We focused on
measurements of agent believability and user satisfaction. An agent is considered
believable if it allows users to suspend their disbelief and become cognitively and
emotionally engaged in the interaction with the character [17]. User satisfaction refers to
the engagement and fun experienced in the interaction. Our evaluation method consists of
both qualitative and quantitative analysis.
7.1.1 Qualitative Analysis
For qualitative analysis, each user or subject was interviewed about their experience
after the virtual tour. Agent believability is measured through the subjects' post-usage
descriptions of their interaction with Elva [11,14].
- If subjects used an emotionally rich vocabulary and described Elva's personality
  without hesitation, this would be an indication of believability.
- If subjects hesitated when describing Elva's personality, or found her 'strange'
  or 'incomprehensible', this would indicate low believability.
- If subjects noticed nothing peculiar about Elva's verbal and non-verbal behaviors,
  this would indicate that her behaviors were consistent with their expectations, and
  therefore it would be a sign of believability.
To measure user satisfaction, we observed the subjects’ emotional responses during their
interaction with Elva. Did they respond positively, neutrally, or negatively? In addition to
observation, direct questions about their engagement and enjoyment were asked during
the post-usage interview.
7.1.2 Quantitative Analysis
In the quantitative part of the user study, we collected user feedback using an evaluation
form (see Appendix). The form consists of three parts: user experience, feedback on
Elva's verbal behaviors, and feedback on Elva's nonverbal behaviors. The subjects rated
their agreement or disagreement with each statement on a 5-point scale.
7.2 Subjects, Task and Procedure
A total of ten university students (six males and four females) participated in the user
study. The subjects comprised six computer science students, two information system
students, one chemistry student, and one engineering student. Among them, five subjects
had prior experience chatting with text-based chatterbots like ELIZA and ALICE, but
none had experience using an embodied conversational agent.
The user study was conducted on an individual basis. Each subject went through five
sessions, as shown in Figure 17: briefing, training, virtual tour, post-usage interview, and
evaluation form filling.
Figure 17: Procedure of user study on Elva (1. briefing, 2. training, 3. virtual tour, 4. post-usage interview, 5. evaluation form filling)
1. Briefing: The experimenter briefed the subject on the procedure of the user study.
Then the subject was asked to read through an instruction sheet carefully. The
sheet offered an introduction about Elva and the exhibition. It also contained
several useful tips for the subject to get started. These were:
a. Show some politeness to Elva.
b. Be patient at the start of the tour. You may be anxious, but you also do not
want to miss out on the important concepts.
c. Do not hesitate to ask questions.
d. Do not always follow. You can lead the way when you become confident.
e. Try to use simple and standard English.
2. Training: The subject was trained on how to use the C-VISions browser. Through
training, the subject became familiar with the VR user interface.
3. Virtual tour: The subject started the virtual tour and traveled through the gallery
with Elva. The whole process was videotaped.
4. Post-usage interview: After the tour, the subject was interviewed about his/her
experience. The interview was un-cued and had an open structure where each
subject was asked to freely describe his/her interaction with Elva. At the end of the
interview, more direct questions about believability and user satisfaction were
asked.
5. Evaluation form filling: The subject was asked to complete the questions in an
evaluation form, and to give suggestions on how to improve the present virtual
guide.
Note that the knowledge base used for the evaluation defines a total of 65 schemas, or 225
schema nodes. Among these schema nodes, 120 are applicable for matching against users'
utterances.
7.3 Evaluation Results
The observed results are derived from video records, chat logs, post-usage interviews, and
evaluation forms.
7.3.1 User Experience
Table 5 depicts the summarized results related to user experience. In the right-most
column, the average scores are displayed.
1 = strongly disagree; 2 = disagree; 3 = neutral; 4 = agree; 5 = strongly agree

Statement                                                                    1   2   3   4   5   avg
I enjoyed the virtual gallery experience.                                    0   0   0   5   5   4.5
Elva triggered my interest when exploring the artworks in the exhibition.    0   0   4   2   4   4
She helped me understand concepts in the exhibition.                         0   0   3   4   3   4
I found the interaction with her was engaging.                               0   1   1   5   3   4
I found the interaction with her was enjoyable.                              0   0   0   4   6   4.6

Table 5: Evaluation results related to user satisfaction in interaction with Elva
All subjects enjoyed the virtual tour with the presence of the virtual guide. During the
interview, some subjects used positive terms such as “enjoyable,” “amazing,”
“interesting” when describing the experience. From the video recording, the facial
expression of subjects also clearly showed signs of enjoyment.
At the beginning of the tour, at least half of the subjects were quiet. As Elva walked up
and said hello to the subjects, unexpectedly, only three subjects greeted back immediately.
One subject, instead of typing "hello", typed "it is cool!" When asked about the reason in
the interview, the subject stated that it was because his attention was captured by Elva's
appearance and the environment. For the quiet subjects, the common reason was that they
had no idea how to start interacting with a robot.
Elva’s probing questions were effective in encouraging participation from initially reticent
subjects. When Elva invited the visitor to guess the meaning of an untitled sculpture (as
illustrated in section 6.2.2), eight subjects, out of ten, followed up and made some guesses.
We observed that, in general, subjects talked more often after a few successful
interactions with Elva. As seen in Table 5, a majority agreed that Elva triggered their
interest when exploring the artworks in the exhibition.
The overall responses to Elva’s guidance were positive. Subjects generally agreed that
Elva helped them to understand the concept of the exhibition. According to them, Elva’s
narratives about the artifacts were relatively easy to follow, and she was able to answer
some of the most basic questions.
A majority of subjects found the interaction with Elva engaging. From the video records,
we observed that most subjects exhibited positive emotions and focused their attention
on the computer screen throughout the virtual tour. One student stated that the interaction
was not engaging, and he attributed this to three factors: the look of the virtual
environment, the richness of character animation, and the choices of actions that users can
take.
All subjects found the interaction with Elva enjoyable, with an average score of 4.6 out of 5.
7.3.2 Agent Believability
In the qualitative analysis, we derived some useful findings from the subjects' post-usage
descriptions. Table 6 lists the subjects' responses to Elva's personality. We observed that
three subjects described Elva's personality without hesitation, using emotional terms
such as "agreeable," "lovely," "pleasant," "black humorous," and "polite." Half of the
subjects described Elva's personality without hesitation, but used neutral terms
such as "professional," "polite," and "neutral"; one of these subjects remarked that a neutral
personality suited a guide. Two subjects believed that Elva did not possess a personality
and attributed this to Elva's "limited response to questions" and "lack of sufficient
emotions." Overall, Elva's behaviors were consistent with the subjects' expectations. The
qualitative analysis indicated an intermediate-to-high level of believability.
Subject     Description of Elva's personality
Jeremiah    She is pleasant and helpful. Sometime blur.
Gary        Patient and pleasant. Sometime she is "sassy", (because she) asked me to go when I was waiting for her answer.
LHP         She is very professional. (She is) not very personal, because she does not smile.
LY          Agreeable, lovely, and appealing.
Melvyn      Not yet. (It is due to her) limited response to questions, and (she is) lack of sufficient emotions.
Pasar       She is neutral. A guide should be neutral, so she is good.
TCT         She is kind and polite. (She is) patient as well.
TSF         Straight-forward and polite.
Cloud       She is elegant, helpful, (and) black humorous.
Snow        She is professional and polite.

Table 6: Users' responses to Elva's personality
The quantitative analysis zoomed in on aspects of Elva's verbal and nonverbal behaviors
in order to assess her believability. We elaborate on the findings in the next two sections.
7.3.2.1 Evaluation results on Elva’s Verbal Behaviors
Table 7 presents the results from the evaluation related to Elva’s verbal behaviors.
1 = strongly disagree; 2 = disagree; 3 = neutral; 4 = agree; 5 = strongly agree

Statement                                                                            1   2   3   4   5   avg
Her speech was well organized.                                                       0   0   1   8   1   4
Her speech was coherent.                                                             0   0   1   8   1   4
Her speech was well based on the location and the subject of the conversation.       0   0   1   4   5   4.4
The amount of information about the exhibits was appropriate, namely not lengthy,
nor over simple.                                                                     0   1   2   6   1   3.7
The pace at which she presented information was appropriate.                         0   2   5   2   1   3.2
She reacted to my request⁴ properly.                                                 0   2   3   4   1   3.4
She answered my enquiries⁵ properly.                                                 0   2   3   5   0   3.3

Table 7: Evaluation results related to Elva's verbal behaviors
Elva’s narrative skills received relatively high scores. The first three entries in Table 7
indicated that the underlying discourse framework had been well utilized to generate
structured, coherent, and situated speech.
⁴ To request is to ask somebody to do something. For example, "Can you please come here?"
⁵ To enquire is to ask somebody for specific information. For example, "What is this?"
The ratings on Elva’s question-answering capability produced mixed results. As seen in
Table 7, for enquiries, 50% of the subjects agreed that Elva answered properly, 20% of
the subjects disagreed, and 30% of the subjects neither agreed nor disagreed. For requests,
the distribution of the scores was similar.
Further investigation into the chat logs showed that among a total of 95 enquiries and
requests, excluding small talk, Elva was able to detect the correct speech act for 68
questions, or 69% of the total, indicating an acceptable performance in utterance analysis.
The study confirmed that incorporating speech acts into the traditional keyword-based
pattern-matching approach yields improved performance. The enhanced approach helps to
capture the essential form of an utterance in addition to its semantics. On one hand,
speech act classification demonstrated its power in recognizing the same query asked in
different forms. In practice, an effective, but not necessarily complete, set of patterns
was often sufficient to classify an utterance correctly. On the other hand, keywords
capture the semantics of the utterance and therefore help to handle queries asked in
different contexts. Keyword extraction proved effective for queries asked using complete
sentences.
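A toy sketch of this combined matching is given below: the utterance is classified into a speech act with a small pattern set, its content keywords are extracted, and candidate schema nodes are scored on both. The pattern lists, stop-word set, and scoring weights are illustrative assumptions, not Elva's actual rules.

import re

SPEECH_ACT_PATTERNS = {                      # illustrative, not the real pattern set
    "ENQUIRY": [r"^what\b", r"^who\b", r"^where\b", r"^why\b", r"\bcan you tell me\b"],
    "REQUEST": [r"^please\b", r"\bcan you (show|take|bring)\b"],
}
STOP_WORDS = {"the", "is", "a", "an", "of", "for", "this", "it", "me", "you", "what"}

def classify_speech_act(utterance):
    text = utterance.lower()
    for act, patterns in SPEECH_ACT_PATTERNS.items():
        if any(re.search(p, text) for p in patterns):
            return act
    return "STATEMENT"                       # default when no pattern fires

def keywords(utterance):
    return {w for w in re.findall(r"[a-z\-]+", utterance.lower())
            if w not in STOP_WORDS}

def score(node, act, kws):
    """node: dict with 'act' and 'keywords'; a higher score is a better match."""
    return (2 if node["act"] == act else 0) + len(kws & set(node["keywords"]))

# usage
act = classify_speech_act("What is the material used for this sculpture?")
kws = keywords("What is the material used for this sculpture?")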
With regard to Elva’s capability in response retrieval, she was only able to answers
correctly for 35 questions or 37% of the total. This indicated a relatively low performance.
The most well answered questions turned out to be the most basic questions, such as
“what is this sculpture?” “What is the material used,” “Who created this sculpture,” and
“what does this sculpture mean.”
Overall, if we use a 5-point scale to quantify Elva’s question-answering capability, the
capability should be around 3, an intermediate level.
Another finding from the user study confirmed that, with speech act classification, Elva
was able to show a relatively good understanding of users' questions without necessarily
knowing the answer. Consider the following dialogue.
so where did he learn this stuff?
Sorry but I am not sure where the place is.
It is ok.
... ...
why is it brown in color?
In fact I am also interested in finding out why, Melvyn.
In the above cases, Elva uttered a default response for each speech act, as she could not find
an appropriate answer in the knowledge base. Most subjects found the default responses
tolerable, because they thought "she seemed to understand (the question)," and "it is
sensible to answer it that way."
In comparison, Elva tended to give seemingly peculiar answers when no suitable speech
act was found. Consider the following dialog.
Over a period of forty-five years, he produced a body of work
which is consistent and distinct.
what kind of works?
Why do you ask?
i am interested
I see, please continue.
Answer my question!
In this case, the subject felt offended and complained that "she is blur," and "she doesn't
know what I'm talking about."
Through the user study, we collected a fruitful set of "real data," i.e., around 200 patterns
of user input. These can be utilized to refine Elva's speech acts and to expand her
knowledge base accordingly.
7.3.2.2 Evaluation results on Elva’s Non-verbal Behaviors
Table 8 depicts the evaluation results related to Elva's nonverbal behaviors.
1 = strongly disagree; 2 = disagree; 3 = neutral; 4 = agree; 5 = strongly agree

Statement                                                                            1   2   3   4   5   avg
Her eye gaze was natural.                                                            0   1   3   3   3   3.8
Her head movement was natural.                                                       0   1   1   5   3   4
Her facial expression was natural.                                                   0   2   2   2   4   3.8
Her gestures coupled well with her speech.                                           0   1   1   5   3   4
Her pointing gestures effectively helped me to address the correct artifact.         0   0   2   4   3   4.1
During conversation, she took turns (started to speak) and gave turns (went silent
and listened to user) at right time.                                                 0   0   1   6   2   4.1
She took moves at right time.                                                        1   0   2   7   0   3.5
I was able to interpret her nonverbal behaviors easily.                              0   1   0   5   4   4.1
Overall her nonverbal behaviors were meaningful and effective.                       0   1   0   7   2   4

Table 8: Evaluation results related to nonverbal behaviors
Elva’s gaze and head movement can be comprehended by most subjects. They remarked
that “she is quite attentive,” and “when she nods, it looks as if she is listening.” One
61
subject, who disagreed, commented that Elva’s eye gaze was not obvious and her head
movement “somehow looked mechanical”. Another subject remarked that “she often
stares at me” and suggested that Elva should glance away more frequently, especially
while she was speaking. The rating on Elva’s facial expression produced mixed results. A
common opinion was that Elva did not show sufficient facial expressions. About Elva’s
locomotion, a major agreed that she took move at right time. But one subject strongly
disagreed, and complained that it often took a long time for Elva to arrive at him. The
result reveals that a majority of the subjects agreed that Elva’s nonverbal behaviors were
easy to interpret. Overall, the user study indicated that Elva’s nonverbal behaviors were
meaningful and effective.
7.4 Discussion
First, there is a pressing need to improve Elva's capability in answering questions.
Improvement can be made on Elva's natural language understanding and domain
knowledge. With regard to natural language understanding, the study revealed that a more
clear-cut and comprehensive set of speech acts should be defined to increase the accuracy of
classification. Moreover, a considerable amount of effort is required to expand the
knowledge base. In addition to the most basic questions, Elva should be able to answer
some advanced questions in this domain, so that users will have not only an enjoyable but
also an enriching experience.
Second, emotive behaviors should be integrated into Elva to achieve a higher level of
believability. Elva is embodied in a human form; therefore, users expect to see not only
human intelligence, but also human-like emotions. This is not to say that a virtual guide
should laugh or cry all the time. Rather, Elva should exhibit appropriate emotive behaviors
that are consistent with users' expectations, so that users can suspend their disbelief and
become more engaged in the interaction.
7.5 Summary
This chapter describes a user study to evaluate the present development of Elva. Most
users interacted successfully with Elva and enjoyed the tour. Regarding agent
believability, the qualitative analysis indicated an intermediate-to-high level of
believability. From the quantitative analysis, Elva's narrative skills received relatively
high scores. However, the ratings on Elva's question-answering capability produced mixed
results. A majority agreed that Elva's behaviors, both verbal and nonverbal, were
comprehensible and appropriate.
Through the user study, we collected a fruitful set of "real data," which is valuable for
improving the agent's question-answering capability. The user study also revealed
that emotive behaviors should be integrated into Elva to achieve a higher level of
believability.
CHAPTER 8 CONCLUSION
8.1 Summary
This research proposed an integrative approach for building a novel type of conversational
interface, an embodied conversational agent (ECA). ECA is still an immature field which
warrants further investigation in various areas. In general, our research focused on the
attributes of the agent’s autonomy and believability.
The agent’s three-layered architectural design ensures appropriate coupling between the
agent’s perception and action, and to hide the internal mechanism from users. We utilized
the notion of “schema” to support structured and coherent verbal behaviors. For the agent
to effectively take part in conversation with humans, we modeled domain knowledge as
individual schemas. The agent’s narrative and question-answering are well supported
using this unified discourse framework. The agent's reasoning approach is derived
from case-based reasoning. The use of speech acts allows the agent to capture the user's
intention embedded in the speech content. The response given by the agent is more
accurate when the user's intention is captured, as compared to simply expressing queries
as keywords [15]. The agent's planning relies on manually authored plans that form the
plan library. We also presented approaches to generate and coordinate the agent's
nonverbal behaviors. In particular, we implemented two vital types of nonverbal
behaviors: interactional behaviors, which help to regulate turn taking and feedback in
conversation, and deictic behaviors, which help to refer to objects in the virtual world.
Our evaluation method assessed user satisfaction and agent believability. The study
consisted of both qualitative and quantitative analysis. The results reveal that a majority of
the subjects enjoyed the tour and interacted successfully with Elva. However, both types
of analysis produced mixed results in agent believability. There are two major
implications for the further development of Elva. First, Elva's language understanding
skills and knowledge base should be improved and enlarged. Second, emotive behaviors
should be integrated into Elva to achieve a higher level of believability.
8.2 Contributions of the Thesis
For the ECA research community, Elva provides a case study of a specific application, a
virtual guide. In this domain, we have studied a virtual guide's behavior dimensions, such
as narrating, probing, question-answering, and mixed-initiative navigation. We have also
shown, by example, how a museum's contents can be digitized and how knowledge in the
art domain can be modeled. In the future, Elva can serve as a test bed for various
empirical studies in this domain.
As the first iteration of ECA research in the Learning Science and Learning Environment
Lab (LELS), this research work benefits ongoing and future agent projects in several
ways. First, we have provided a generic computational model that can be further studied
or extended in order to build ECAs in other domains. Second, several key technologies
introduced in this research are reusable, e.g., the three-layered agent architecture, the
schema-based discourse framework, speech act classification, and multimodal behavior
generation and coordination. Third, the lessons we learnt and the experiences we acquired
are themselves of value. To date, an honours year project, Multi-agent Virtual Movie
Gallery, has extended the current ECA model to support conversations among three agents
and a single user. Another project has been proposed to extend Elva's ability to guide a
group of visitors. More projects have been proposed to perform in-depth research on
different dimensions.
8.3 Future Work
Multiparty interaction support is a desirable feature for the further development of Elva. It
would be exciting if Elva were able to guide a group of visitors. A multiparty scenario not
only extends the questions from the single-party scenario, such as mixed initiative,
conversation modeling, and coordination of multimodal behaviors, but also presents
entirely new challenges posed by the larger number of users. For example, Elva may need
to track several topics simultaneously and to relate them to individual users.
The next phase of ECA research can also address the area of affective computing.
Currently, Elva exhibits a static personality style that can be classified as pleasant, polite,
and patient. One approach is to build an emotion engine capable of generating emotions
that evolve dynamically with the situation of the conversation. The agent's expression
of emotional state would be moderated by a set of parameters, such as social variables, the
agent's linguistic style, and personality. In this way, the agent can be elevated to the level
of a virtual world actor.
Another open research direction is towards modeling a user's cognitive map. A cognitive
map is a personal mind map of what a person is thinking and expecting to see.
Currently, Elva is able to capture a user's mouse actions and navigation in the virtual
world, and she is also able to analyze a user's intention via speech act classification.
Given these observations, it is possible to construct a cognitive map of the user.
Researchers examining cognitive mapping in virtual environments have proposed some
measurements of a person's cognitive map [2]. These measurements should be carefully
examined to determine whether it is feasible to apply them to our application domain. By
understanding a user's cognitive map, Elva can generate proper plans or decisions to
accomplish her tasks, and provide customized services to users.
REFERENCES
[1] Andersen, P.B. and Callesen, J. Agents as actors. In Qvortrup, L. (ed.), Virtual Interaction: Interaction in Virtual Inhabited 3D Worlds. Springer, London, 2001, pp. 182-208.

[2] Billinghurst, M. and Weghorst, S. The use of sketch maps to measure cognitive maps of virtual environments. In Proceedings of the Virtual Reality Annual International Symposium, IEEE Computer Society, Triangle Park, NC, 1995, pp. 40-47.

[3] Burkhardt, A. Speech Act, Meaning and Intentions: Critical Approach to the Philosophy of John R. Searle. Walter de Gruyter, New York, 1990.

[4] Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. Embodied Conversational Agents. MIT Press, Cambridge, 2000.

[5] Cassell, J. Nudge nudge wink wink: elements of face-to-face conversation for embodied conversational agents. In Cassell, J., Prevost, S., Sullivan, J., and Churchill, E. (eds.), Embodied Conversational Agents. MIT Press, Cambridge, 2000, pp. 1-27.

[6] Cassell, J., Bickmore, T., Campbell, L., Vilhjalmsson, H., and Yan, H. The human conversation as a system framework: designing embodied conversational agents. In Cassell, J., Prevost, S., Sullivan, J., and Churchill, E. (eds.), Embodied Conversational Agents. MIT Press, Cambridge, 2000, pp. 29-63.

[7] Chee, Y.S. and Hooi, C.M. C-VISions: socialized learning through collaborative, virtual, interactive simulations. In Proceedings of CSCL 2002: Conference on Computer Support for Collaborative Learning, Boulder, CO, USA, 2002, pp. 687-696.

[8] Core, M. and Allen, J. Coding dialogs with the DAMSL annotation scheme. In Proceedings of the AAAI Fall Symposium on Communicative Action in Humans and Machines, Boston, MA, November 1997, pp. 28-35.

[9] Davis, D.N. Multiple level representations of emotion in computational agents. In AISB'01 Symposium on Emotion, Cognition and Affective Computing. University of York, 2001.

[10] Foner, L.N. Are we having fun yet? Using social agents in social domains. In Dautenhahn, K. (ed.), Human Cognition and Social Agent Technology. John Benjamins, Amsterdam, 2000, pp. 323-348.

[11] Höök, K., Persson, P., and Sjölinder, M. Evaluating user's experience of a character-enhanced information space. AI Communications, 13, 2000, pp. 195-21.

[12] Huber, M.J. JAM: a BDI-theoretic mobile agent architecture. In Proceedings of the Third Annual Conference on Autonomous Agents, Seattle, 1999. ACM Press, New York, 1999, pp. 236-243.

[13] Johnson, W.L. and Rickel, J.W. Animated pedagogical agents: face-to-face interaction in interactive learning environments. International Journal of AI in Education, November 2000, pp. 47-78.

[14] Laaksolahti, J., Persson, P., and Palo, C. Evaluating believability in an interactive narrative. In Proceedings of the Second International Conference on Intelligent Agent Technology, Maebashi City, Japan, 2001, pp. 30-35.

[15] Lee, S.I. and Cho, S.B. An intelligent agent with structure pattern matching for a virtual representative. In Zhong, N., Liu, J., Ohsuga, S., and Bradshaw, J. (eds.), Intelligent Agent Technology: Research and Development. World Scientific, 2001, pp. 305-309.

[16] Lester, J.C., Towns, S., Callaway, C.B., Voerman, J.L., and FitzGerald, P.J. Deictic and emotive communication in animated pedagogical agents. In Cassell, J., Prevost, S., Sullivan, J., and Churchill, E. (eds.), Embodied Conversational Agents. MIT Press, Cambridge, MA, 2000, pp. 123-154.

[17] Loyall, A.B. Believable Agents: Building Interactive Personalities. Ph.D. Thesis, Technical Report CMU-CS-97-123, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 1997.

[18] Madsen, C.B. and Granum, E. Aspects of interactive autonomy and perception. In Qvortrup, L. (ed.), Virtual Interaction: Interaction in Virtual Inhabited 3D Worlds. Springer, London, 2001, pp. 182-208.

[19] Pelachaud, C., Carofiglio, V., De Carolis, B., de Rosis, F., and Poggi, I. Embodied contextual agent in information delivering application. In Proceedings of AAMAS 2002, ACM Press, New York, 2002, pp. 758-765.

[20] Poggi, I. and Pelachaud, C. Performative facial expressions in animated faces. In Cassell, J., Prevost, S., Sullivan, J., and Churchill, E. (eds.), Embodied Conversational Agents. MIT Press, Cambridge, MA, 2000, pp. 155-188.

[21] Rooney, C.F.B., O'Donoghue, R.P.S., Duffy, B.R., O'Hare, G.M.P., and Collier, R.W. The social robot architecture: towards sociality in a real world domain. In Proceedings of Towards Intelligent Mobile Robots 99, Bristol, UK, 1999.

[22] Porter, M. The Porter Stemming Algorithm. URL: http://www.tartarus.org/~martin/PorterStemmer, accessed on 20/03/2003.

[23] Sloman, A. Architectural requirements for human-like agents both natural and artificial. In Dautenhahn, K. (ed.), Human Cognition and Social Agent Technology. John Benjamins, Amsterdam, 2000, pp. 163-195.

[24] The WordNet official homepage. URL: http://www.cogsci.princeton.edu/~wn/, accessed on 12/02/2003.
User study on Elva
Appendix I. Instruction Sheet
This user study aims to evaluate the social robot Elva that has been developed in
the Learning Environment and Learning Science Lab, NUS. Elva mimics a tour guide
in a virtual art gallery: Ng Eng Teng Art Gallery. When a user logs in, she will guide
the user through the gallery, present information on the gallery exhibits, and attend
to the user's enquiries.
In this study, you are about to visit the gallery and interact with Elva. The study consists of five sessions
(figure below). In session 1, read through the rest of the sheet carefully to understand the basic facts
about the artist and the gallery, as well as some useful tips. In session 2, you will be trained to walk in
the virtual world and to play with the exhibits. You will then spend 20 – 30 minutes on the gallery tour
(session 3). After the tour, you will do a story-telling about your experience (session 4) and fill in an
evaluation form (session 5).
1. Learn about facts and tips
2. Be trained on the user interface
3. Enjoy the gallery tour
4. Be interviewed on your experience
5. Complete an evaluation form
> Basic Facts
The Ng Eng Teng’s gallery is situated at NUS
Museums in the central heart of the National
University of Singapore. The gallery houses
the most comprehensive collection of works
by Singapore's foremost sculptor, Ng Eng
Teng. It was established in July 1997 from a
generous donation of 760 works by the artist
himself.
In general, we encourage users to create their very own experiences freely. Notwithstanding, if you want to achieve a smoother interaction, these tips may be useful.
The exhibition, titled "Configuring the Body:
Form and Tenor," features works selected from
the artist's third donation. Ng's main source of
inspiration has always been the human
figure. As you will see, even his abstract
works are experiments based on elements of
the human form.
> Tips [...]