ELVA: AN EMBODIED CONVERSATIONAL AGENT IN
AN INTERACTIVE VIRTUAL WORLD
YUAN XIANG
(B.Comp.(Hons), NUS)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2004
ACKNOWLEDGEMENT
First of all, I wish to thank my supervisor A/P Chee Yam San for his guidance,
encouragement and patience over the years. The many discussions we had, in which he
showed his enthusiasm towards the topic, kept me on the right track.
I would like to show my gratitude to the head of NUS Museums, Ms. Angela Sim, for
granting us the permission to use the Ng Eng Teng Gallery as our subject matter, and
making the project possible.
I would like to thank Dr. Sabapathy for reviewing the virtual tour, and providing his
insights on Ng Eng Teng’s art. Also thanks to Suhaimi, curator of the Ng Eng Teng
Gallery, who has contributed his expert knowledge about how to guide a tour. Special
thanks to those who lent their hands during the gallery-scanning sessions: Mr. Chong,
Beng Huat, Chao Chun, Ting, and Donald.
I am also grateful to the members of LELS Lab: Chao Chun, Jonathan, Lai Kuan, Liu Yi,
and Si Chao. It was an enjoyable and memorable experience studying in this lab.
Finally, it is time to thank my family. Their support and patience have accompanied me
throughout the whole period.
Yuan Xiang
TABLE OF CONTENTS
TABLE OF CONTENTS ..................................................................................................III
LIST OF TABLES.............................................................................................................V
LIST OF FIGURES......................................................................................................... VI
SUMMARY .....................................................................................................................VII
CHAPTER 1 INTRODUCTION.......................................................................................1
1.1 EMBODIED CONVERSATIONAL AGENTS: A CHALLENGE FOR VIRTUAL INTERACTION .1
1.2 RESEARCH OBJECTIVES ...............................................................................................3
1.3 ELVA, AN EMBODIED TOUR GUIDE IN AN INTERACTIVE ART GALLERY.......................4
1.4 STRUCTURE OF THESIS .................................................................................................6
CHAPTER 2 RESEARCH BACKGROUND ..................................................................8
2.1 ECA AS A MULTI-DISCIPLINARY RESEARCH TOPIC .....................................................8
2.2 ARCHITECTURAL REQUIREMENT .................................................................................8
2.3 DISCOURSE MODEL .....................................................................................................9
2.4 MULTIMODAL COMMUNICATIONS .............................................................................10
2.5 SUMMARY ..................................................................................................................12
CHAPTER 3 AGENT ARCHITECTURE........................................................................13
3.1 OVERVIEW .................................................................................................................13
3.2 PERCEPTION MODULE ................................................................................................14
3.3 ACTUATION MODULE ................................................................................................14
3.4 INTERPRETATION MODULE ........................................................................................15
3.4.1 Reflexive Layer ..................................................................................................15
3.4.2 Reactive Layer ...................................................................................................15
3.4.3 Deliberative Layer.............................................................................................15
3.4.4 Behavior Generation .........................................................................................16
3.5 KNOWLEDGE BASE ....................................................................................................16
CHAPTER 4 AGENT’S VERBAL BEHAVIORS ........................................................17
4.1 SCHEMA-BASED DISCOURSE FRAMEWORK ................................................................17
4.2 DISCOURSE MODELING ..............................................................................................19
4.2.1 Narrative Modeling ...........................................................................................19
4.2.2 Dialog Modeling................................................................................................20
4.3 REASONING ................................................................................................................21
4.3.1 Analysis Phase ...................................................................................................21
4.3.2 Retrieval Phase..................................................................................................27
4.4 PLANNING ..................................................................................................................29
4.5 DIALOG COORDINATION ............................................................................................31
4.6 SUMMARY ..................................................................................................................32
CHAPTER 5 AGENT’S NONVERBAL BEHAVIORS ...............................................33
5.1 TYPES OF NONVERBAL BEHAVIORS IN OUR DESIGN ..................................................33
5.1.1 Interactional Behaviors .....................................................................................33
5.1.2 Deictic Behaviors ..............................................................................................34
5.2 NONVERBAL BEHAVIOR GENERATION .......................................................................35
5.3 NONVERBAL BEHAVIOR EXPRESSION ........................................................................36
5.3.1 Building Blocks of Multimodality ......................................................................37
5.3.2 Multimodality Instantiation ...............................................................................38
5.3.3 Multimodality Regulation ..................................................................................40
5.4 SUMMARY ..................................................................................................................41
CHAPTER 6 ILLUSTRATED AGENT-HUMAN INTERACTION ..........................42
6.1 AN INTERACTIVE ART GALLERY ...............................................................................42
6.2 VERBAL COMMUNICATION ........................................................................................43
6.2.1 Elva presents information..................................................................................44
6.2.2 Elva probes the user in a dialogue ....................................................................45
6.2.3 Elva answers questions......................................................................................46
6.3 NONVERBAL COMMUNICATION .................................................................................47
6.3.1 Deictic Gestures ................................................................................................47
6.3.2 Facial Display ...................................................................................................49
6.3.3 Locomotion .........................................................................................50
CHAPTER 7 EVALUATION..........................................................................................51
7.1 METHODOLOGIES.......................................................................................................51
7.1.1 Qualitative Analysis...........................................................................................51
7.1.2 Quantitative Analysis.........................................................................................52
7.2 SUBJECTS, TASK AND PROCEDURE.............................................................................52
7.3 EVALUATION RESULTS ..............................................................................................55
7.3.1 User Experience ................................................................................................55
7.3.2 Agent Believability.............................................................................................57
7.3.2.1 Evaluation results on Elva’s Verbal Behaviors ..........................................58
7.3.2.2 Evaluation results on Elva’s Non-verbal Behaviors...................................61
7.4 DISCUSSION ...............................................................................................................62
7.5 SUMMARY ..................................................................................................................63
CHAPTER 8 CONCLUSION..........................................................................................64
8.1 SUMMARY ..................................................................................................................64
8.2 CONTRIBUTIONS OF THE THESIS.................................................................................65
8.3 FUTURE WORK ..........................................................................................................66
REFERENCES .................................................................................................................68
LIST OF TABLES
Table 1: List of speech acts ................................................................................................23
Table 2: Layout of a sample tour plan................................................................................30
Table 3: Elva’s interactional behaviors (portion only) .......................................................34
Table 4: Elva's multimodal behavior library ......................................................................38
Table 5: Evaluation results related to user satisfaction in interaction with Elva................55
Table 6: Users' responses to Elva's personality .................................................................57
Table 7: Evaluation results related to Elva's Verbal Behaviors..........................................58
Table 8: Evaluation results related to nonverbal behaviors................................................61
LIST OF FIGURES
Figure 1: Examples of embodied conversational agents ......................................................2
Figure 2: Steve's deictic gestures........................................................................................11
Figure 3: Greta's facial display: neutral, sorry-for, and surprise. .......................................11
Figure 4: Three-layered architecture for the agent’s mental processing ............................13
Figure 5: Schema-based discourse space (portion only) ....................................................18
Figure 6: Narrative modeling using a schema ....................................................................20
Figure 7: Dialog modeling using a schema ........................................................................21
Figure 8: Transform raw user input to utterance meaning .................................................22
Figure 9: State transition graph for QUESTION_MATERIAL .........................................25
Figure 10: Speech act classification using STG .................................................................26
Figure 11: Elva's tour plan..................................................................................................29
Figure 12: Conversational Modalities ................................................................................37
Figure 13: Synchronized modalities in an animation timeline ...........................................39
Figure 14: Elva invites the user to start the tour.................................................................42
Figure 15: Elva's basic facial expressions ..........................................................................49
Figure 16: Elva's locomotion..............................................................................................50
Figure 17: Procedure of user study on Elva .......................................................................53
SUMMARY
The technology of Embodied Conversational Agents (ECAs) offers great promise for
natural and realistic human-computer interaction. However, interaction design in this
domain needs to be handled sensitively in order for verbal and nonverbal signals
conveyed by the agent to be understood by human users.
This thesis describes an integrative approach for building an ECA that is able to engage
conversationally with human users, and that is capable of behaving according to social
norms in terms of facial display and gestures. The research focuses on the attributes of the
agent’s autonomy and believability, not its truthfulness. To achieve autonomy, we present
a three-layered architectural design to ensure appropriate coupling between the agent’s
perception and action, and to hide the internal mechanism from users. In regard to agent
believability, we utilize the notion of “schema” to support structured and coherent verbal
behaviors. We also present a layered approach to generate and coordinate the agent’s
nonverbal behaviors, so as to establish social interactions within a virtual world.
Using the above approach, we developed an ECA called Elva. Elva is an embodied tour
guide that inhabits an interactive art gallery to offer guidance to human users. A user
study was performed to assess user satisfaction and agent believability when interacting
with Elva. The user feedback was generally positive. Most users interacted successfully
with Elva and enjoyed the tour. A majority agreed that Elva’s behaviors, both verbal and
nonverbal, were comprehensible and appropriate. The user study also revealed that
emotive behaviors should be integrated into Elva to achieve a higher level of believability.
Future work will address the areas of affective computing, multiparty interaction support,
and user modeling.
CHAPTER 1 INTRODUCTION
1.1 Embodied Conversational Agents: A Challenge for Virtual
Interaction
People are spending more time interacting with computers, which raises the question of how to design natural human-computer interfaces. What is the most natural human-computer interface? Speech is often seen as the most natural interface. However, it is only part of the answer, as it is the ability to engage in conversation that allows for the most natural interactions [5,6]. Conversation involves more than speech: a wide variety of signals such as gaze, gesture, and body posture are used naturally in conversation.
One path towards achieving realistic conversational interfaces has been the creation of
embodied conversational agents or ECAs. An ECA is a life-like computer character that
can engage in a conversation with humans using a naturally spoken dialogue. It also has a
“body” and knows how to use it for effective communication.
In recent years, systems containing ECAs have been deployed in a number of domains.
There are pedagogical agents that educate and motivate students in E-learning systems
(see Figure 1: A, B), virtual actors or actresses that are developed for entertainment or
therapy (see Figure 1: C), sales agents that demonstrate products in e-commerce applications (see Figure 1: D), and web agents that help the user surf the web pages of
a company. In the future, ECAs are likely to become long-term companions to many
people and share much of their daily activity [20].
Figure 1: Examples of embodied conversational agents: (A) Steve, (B) Cosmo, (C) Greta, (D) Rea
While embodied conversational agents hold the promise of conducting effective and
comfortable conversation, the related research is still in its infancy. Conversation with
ECAs presents particular theoretical and practical problems that warrant further
investigation. Some of the most intensively pursued research questions include:
• What are the appropriate computational models for building autonomous and intelligent ECAs?
• How can we draw on existing methods in research and effectively put the pieces together to design characters that are compelling and helpful?
• What is the set of communicative behaviors that are meaningful and useful for humans, and are practical to implement?
It is evident that the field of embodied conversational agents remains open and requires
systematic research.
1.2 Research Objectives
This research project aims to develop an integrative approach for building an ECA that is
able to engage conversationally with human users, and capable of behaving according to
social norms in terms of facial display and gestures.
Our research into embodied conversational agents focuses on the attributes of
autonomy and believability, not truthfulness. To achieve autonomy, the autonomous agent
we developed shall be capable of relating itself to the virtual world in a sensible manner.
In view of this, the architectural design aims to couple the agent’s perception and action
appropriately, and to hide its internal mechanism from the system users. To achieve
believability, the agent must demonstrate sufficient social skills in terms of speech and
animation. Towards this goal, we pay careful attention to the design of verbal and
nonverbal interaction and to the attainment of flows of behavior that help to establish
social facts within the virtual world.
In practice, we work towards assembling a coherent set of ECA technologies that help to
attain the attributes as mentioned above, and are feasible to implement given the
constraints of the project timeline and resources. Our work has emphasized the
development of ECA technologies as described below:
• Layered architectures: In our design, the agent is divided into three layers to support its autonomous behaviors. The agent is able to sense the environment and react to an actually detected situation. The agent can also act deliberately, i.e., plan a course of actions over time in pursuit of its own goal.
• Discourse model: We aim to provide a unified discourse framework where conversation about individual topics can be modeled. Under this framework, the agent will be able to narrate about a specific topic, as well as to respond to user queries in an appropriate manner.
• Generation and coordination of multimodal behaviors: Multimodal behaviors such as deictic gestures, turn taking, attention, and nonverbal feedback can be instantiated and expressed. We also need to devise an approach to coordinate the modalities that are generated.
We adopt this integrative approach in the design and development of an agent called Elva.
Elva is an embodied tour guide that inhabits an interactive art gallery to offer guidance to
the human users. A user study has been performed to measure user satisfaction and agent
believability. Human-computer interaction evaluation methods are adapted to cover this novel embodied conversational setting.
1.3 Elva, an Embodied Tour Guide in an Interactive Art Gallery
Elva is an embodied conversational agent developed in the Learning Science and
Learning Environment Lab, National University of Singapore, for this project. Her job is
to guide a user through an interactive art gallery, and to engage conversationally with the user about gallery artifacts.
In conventional tours, the guide decides where to take people and the exact timing of the
itinerary. In a virtual tour, however, we expect the system users
to behave more actively. In the interest of greater user freedom and participation, the
design especially caters to mixed-initiative interaction where both parties can direct the
navigation and conversation. The following paragraphs describe Elva's three
categories of behaviors: navigation, speech, and nonverbal behaviors.
• Navigation: A tour can be divided into a sequence of locations. Through
planning of the itinerary, the guide is able to intentionally direct the user from
one artifact to the next. When the user navigates independently, Elva can track
the user.
• Speech: Elva's verbal communication can be classified into two modes:
narrative and dialog. In the narrative mode, the guide presents information
centering on the artifacts, artists, or the gallery being visited. In the dialog mode,
two parties engage in conversation about a specific topic.
• Nonverbal Behaviors: The presence of body language and facial display helps to convey communicative functions: to manage turn taking and feedback, to refer to an artifact, and to create interesting effects.
We will use Elva as an example to illustrate various aspects of our approach throughout
this thesis.
1.4 Structure of Thesis
This thesis is divided into eight chapters, with each particular chapter detailing a specific
area of focus:
Chapter 1, Introduction, gives an overview of the motivations and objectives that this
thesis aims to achieve.
Chapter 2, Research Background, discusses the various research areas that ground this multi-disciplinary project. Existing methodologies are reviewed from the perspectives of
architectural requirements, discourse models, and multimodality.
Chapter 3, Agent Architecture, presents a generic three-layered architecture which forms
the backbone of our ECA. We then describe the construction of such an architecture and the
interaction among the system components.
Chapter 4, Agent’s Verbal Behaviors, introduces the notion of schema, a discourse
template from which the narrative or dialog about a specific topic can be dynamically
generated. It then describes how structured and coherent verbal behaviors are supported
under the schema-based knowledge framework.
Chapter 5, Agent’s Nonverbal Behavior, begins by highlighting the importance of
interactional behaviors and deictic gestures in enhancing agent-user communication. It
then presents our methods to generate and express the multimodal behaviors.
Chapter 6, Illustrated Agent-human Interaction, illustrates how Elva establishes verbal
and nonverbal communication with a system user, and guides the user in a gallery tour.
Chapter 7, Evaluation, describes the evaluation methodology and observed results of the
user study performed on the virtual tour with Elva.
Chapter 8, Conclusion, summarizes the thesis and documents our achievements and
contribution. Possible future work is also discussed.
CHAPTER 2 RESEARCH BACKGROUND
2.1 ECA as a Multi-disciplinary Research Topic
Research on embodied conversational agents embraces multiple dimensions: ECA inherits
the research problems from its supertype, the autonomous agent, which belongs to the field of Artificial
Intelligence; the verbal communication aspect may require technologies from Natural
Language Processing (NLP) and Information Retrieval (IR); the embodiment and
communicative behaviors aspects are intrinsically social, and form a dimension of Human
Computer Interaction (HCI) research.
We identified the key aspects of ECA technology through a breadth-wise study of the field. We then concentrated our research on the development of the ECA from the aspects of agent architecture, conversation model, and generation and coordination of multimodal behaviors. The following sections discuss previous research in these areas.
2.2 Architectural Requirement
Architectural requirements have been widely discussed in the research literature [9,21,23].
Sloman described architectural layers that comprise mechanisms for reactive, deliberative,
and meta-management controls [23]:
• The Reactive mechanism acts in response to an actually detected internal or external event. Reactive systems may be very fast because they use highly parallel implementations. However, such systems may lead to uncoordinated behaviors due to their inability to contemplate, evaluate, and select possible future courses of action.
• The Deliberative mechanism provides more sophisticated planning and decision-making mechanisms. Therefore, it allows for more coordination and global control.
• The Meta-management mechanism allows the deliberative processes to be monitored and improved. It is an ambitious step towards a human-like architecture. However, due to its extraordinary complexity, it is not yet practically viable for implementation [23].
Most agent architectures conform to the action selection paradigm [1,10,11,18]. FSA
(Finite State Automaton) is a simple architecture based on the reactive mechanism
described above. The behavior state in FSA is controlled by events generated based on the
values of the internal mood variables [18]. BLUMBERG is another reactive architecture,
which employs a hierarchical behavioral representation where behaviors compete among
themselves according to their “levels of interest.” [18]. JAM is a traditional BDI (Belief
Desire Intention) architecture operating with explicit goals and plans. JAM agents can
deliberately trigger the search for the best plan from a plan library [11].
2.3 Discourse Model
Classical chatterbots typically employ simple pattern matching techniques and have a
limited model of discourse. ALICE is based on Artificial Intelligence Markup Language,
an XML-compliant language which contains tags to define pattern and response pairs [10].
ALICE’s conversational capabilities rely on Case-Based Reasoning and pattern-matching.
Another chatterbot, JULIA, employs an Activation Network to model possible
conversation on a specific topic [10]. When an input pattern is matched, the node containing the pattern has its activation level raised, and the node with the highest activation level is then selected. This allows more structured and flowing conversations than simple pattern matching. It is interesting to note that NLP is seen as less important in chatterbots, since simple implementations are often sufficient for creating believable conversation in a limited context.
More recently, researchers began to exploit the notion of speech acts to analyze the
intentions of speakers [8,15]. Based on Searle’s speech act theory [3], an utterance can be
classified into a speech act, such as stating, questioning, commanding, and promising. The
categories of speech act can be domain-independent, as described in the TRAINS and DAMSL annotation schemes [8], or they can be defined so as to be relevant to a specific domain of conversation. When the intention of the speaker is captured as a speech act, the response given by the agent is likely to be more accurate than when the query is expressed with simple keywords alone.
2.4 Multimodal Communications
Embodiment allows multimodality, thereby making interaction more natural and robust.
Research on multimodal communications has concentrated on the question of generating
understandable nonverbal modalities. In what follows, we review some previous research
in this field.
Steve was developed by the Center for Advanced Research in Technology for Education
(CARTE) [16]. As shown in Figure 2, Steve can demonstrate actions, use gaze and deictic
gestures to direct the students’ attention, and he can guide the students around with
locomotion. In order to do this, Steve exploits its knowledge of the positions of objects in the world, its relative location with respect to these objects, as well as its prior explanations to create deictic gestures, motions, and utterances that are both natural and unambiguous.
Figure 2: Steve's deictic gestures
Greta, embodied in a 3D talking head (see Figure 3), shows rich expressiveness during natural conversation with the user. Greta manifests emotion by employing a Belief Network that links facial communicative functions to facial signals that are consistent with the discourse content [19,20]. The facial communicative functions are those typically used in human-human dialogs, for instance: syntactic, dialogic, meta-cognitive, performative, deictic, adjectival, and belief relation functions.
Figure 3: Greta's facial display: neutral, sorry-for, and surprise.
REA [6] is an embodied real-estate agent that is able to describe features of a house using a combination of speech utterances and gestures. Rea's speech and gesture output is generated in real time from the knowledge base and the description of communicative goals, using the SPUD ("Sentence Planning Using Description") engine.
2.5 Summary
We have reviewed some of the previous research in the areas of architectural
requirements, discourse models, and multimodality. In the next chapters, we proceed to
present our integrative approaches.
CHAPTER 3 AGENT ARCHITECTURE
In this chapter, we present a generic three-layered architecture which forms the backbone
of our ECA. We then describe the construction of such an architecture and the interaction
among the system components.
3.1 Overview
In our design, an embodied conversational agent interfaces with the virtual world via a
perception module and an actuation module. The perception module provides the agent
with high-level sensory information. The actuation module enables the ECA to walk and
to perform gestures and facial expressions. The gap between perception and action is
bridged by mental processing in the Interpretation Module, a knowledge-based inference
engine. We model this module using a combination of reactive and deliberative mechanisms in order to cope with mixed-initiative situations. The architecture (see Figure 4) comprises three layers: a reflexive layer (section 3.4.1), a reactive layer (section 3.4.2), and a deliberative layer (section 3.4.3).

Figure 4: Three-layered architecture for the agent's mental processing
3.2 Perception Module
The Perception Module endows an ECA with the ability to “see” and “hear” in the virtual
world. It supplies sensory information about self, system users (as avatars), environment,
and the relationships among them. For example, “user arrives at artifact2,” “user is facing
artifact2,” “I am at artifact2,” “artifact2 is on my left hand side,” “user clicks on the
artifact2,” and “User said to me.”
When a state change in the world is detected, the Perception Module will abstract the
information, and propagate this event to the reactive layer for appraisal. Simultaneously,
raw perceptual data, such as the coordinates of the user's current position, are fed to the reflexive layer for quick processing.
3.3 Actuation Module
The Actuation Module drives the character animation and generates synthesized speech.
The Interpretation Module produces behavior scripts which specify multimodal behaviors
over a span of time, and sends them to the event queue of the Actuation Module. The
Actuation Module sequences and coordinates the multimodal signals in different channels
(described in section 5.4) and plays them back. If necessary, the Actuation Module will
update the agent's goal state and conversational state upon the completion of actuation.
3.4 Interpretation Module
The Interpretation Module is an inference engine that bridges the agent's perception and actions. It comprises three layers, i.e., a reflexive layer, a reactive layer, and a deliberative
layer.
3.4.1 Reflexive Layer
The reflexive layer implements a series of reflex behaviors that are quick and inflexible
(Q&I). For example, the agent is able to track the user with head movements in response to the user's subtle location changes, and the agent glances at an artifact when the user clicks on it.
3.4.2 Reactive Layer
The reactive layer handles the user’s utterances as they arise. The Utterance Analyzer
performs natural language understanding tasks to determine the meaning of the utterance
so that it can be recognized by the agent. Based on the meaning, the Response Retriever
queries the Knowledge Base for the matched schema node that contains an appropriate
response. The details about each component are described in chapter 4.
3.4.3 Deliberative Layer
The deliberative layer provides planning and decision-making mechanisms. The ECA in
our environment is equipped with information-delivery goals and plans that accomplish
the goals. The Planner selects an appropriate plan from the plan library based on the agent
goal. During the interaction, the Planner may adopt new plans as the goal changes.
The Scheduler instantiates and executes the selected plan. With its help, the agent is able to advance the conversation at regular time intervals. In chapter 4, we will go through the details.
3.4.4 Behavior Generation
The utterances produced by the reactive layer and deliberative layer flow into a Dialog
Coordinator where turn taking is regulated. Finally, the Enricher generates appropriate
nonverbal behaviors to accompany the speech before the behavior is actuated. Details are described in chapters 4 and 5.
3.5 Knowledge Base
The Knowledge Base (KB) at the heart of the system consists of schema nodes that
encode the domain knowledge as condition-action pairs. FOCUS refers to the schema
node that constitutes the agent’s current focus of attention during the conversation.
FOCUS serves as a reference point for both reactive and deliberative processes, and it is
updated every time a new behavior has been actuated. The details will be covered in
section 4.1.
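To make the control flow concrete, the following is a minimal sketch of the dispatch described in this chapter. The function names, data layout, and toy knowledge base are our own illustration (they are not the class names of the actual implementation): raw percepts are short-circuited by the reflexive layer, abstracted events reach the reactive layer, and goal changes reach the deliberative layer.

def reflexive_layer(percept):
    # Quick-and-inflexible reflexes (section 3.4.1): no reasoning, just a direct mapping.
    if percept["type"] == "user_moved":
        return {"behavior": "track_user_with_head", "target": percept["position"]}
    if percept["type"] == "artifact_clicked":
        return {"behavior": "glance_at", "target": percept["artifact"]}
    return None

def reactive_layer(utterance, focus, knowledge_base):
    # Placeholder for utterance analysis and response retrieval (sections 4.3.1 and 4.3.2).
    meaning = {"speech_act": "QUESTION_MATERIAL", "keywords": ["material"]}  # stub result
    node = knowledge_base.get(focus, {"action": "Sorry, I do not understand."})
    return node["action"], meaning

def deliberative_layer(goal, plan_library):
    # Pick a discourse plan (a list of schema names) that fulfils the current goal (section 4.4).
    return plan_library.get(goal, [])

# Example use with toy data:
kb = {"DESCRIBE_MATERIAL_OF_ARTIFACT1": {"action": "It is made of Ciment-fondu."}}
plans = {("NARRATE", "ARTIFACT1"): ["DESCRIBE_METAPHOR_OF_ARTIFACT1",
                                    "DESCRIBE_MATERIAL_OF_ARTIFACT1"]}
print(reflexive_layer({"type": "user_moved", "position": (3.0, 1.5)}))
print(reactive_layer("what is it made of?", "DESCRIBE_MATERIAL_OF_ARTIFACT1", kb))
print(deliberative_layer(("NARRATE", "ARTIFACT1"), plans))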
CHAPTER 4 AGENT’S VERBAL BEHAVIORS
This chapter introduces the notion of schema, a discourse template from which the
narrative and dialog about a specific topic can be dynamically generated. It then describes
how structured and coherent verbal behaviors are supported under schema-base
knowledge framework.
4.1 Schema-based Discourse Framework
A schema defines a template from which narrative or dialog about a specific topic can be
dynamically generated. Each schema is related to a discourse type and a topic. For
example, the schema named ELABORATE_BIODATA_OF_ARTIST1 indicates that the
topic on BIODATA_OF_ARTIST1 is elaborated using this schema. A schema is modeled
as a network of utterances that contribute to its information-delivery goal. When a schema,
e.g., ELABORATE_BIODATA_OF_ARTIST1, is instantiated, the agent has to fulfill a
goal ELABORATE (ARTIST, BIODATA) by producing a sequence of utterances from
the template.
A pictorial view of the discourse space is shown in Figure 5: a network of utterances
(black dots) forms a schema (dark grey ovals). In turn, a few schemas are grouped into a
domain entity (light grey oval), e.g., an artist, an artifact, or an art-related concept.
Transiting from one schema to another simulates shifting the topic or referring to another concept in a conversation.
Figure 5: Schema-based discourse space (portion only)
An utterance is encapsulated as a schema node in the knowledge base. Each schema node
can have multiple conditions, an action, and several links. A condition describes a pattern
of user input, in terms of a speech act [3], and a keyword list, which activates the node.
The action contains a list of interchangeable responses. The link specifies three types of
relationships between two adjacent nodes. (1) A sequential link defines a pair of nodes in
the order of narration. (2) A dialog link connects the agent’s adjacent turns in a dialog.
The link contains a pattern which is expected from the user’s turn. (3) A reference link
bridges two schemas, which is analogous to a hyperlink.
We employ an XML tree structure to represent the schema nodes. The definition is shown
below:
<!ATTLIST schema schema_id ID #REQUIRED>
<!ATTLIST schema goal CDATA #REQUIRED>
<!ATTLIST condition type (dependant|independent) "independent">
<!ATTLIST node id ID #REQUIRED>
<!ATTLIST node schema_id IDREF #REQUIRED>
<!ATTLIST action affective_type CDATA #IMPLIED>
<!ATTLIST link relationship (sequential|dialog|reference) "sequential">
<!ATTLIST link node_id IDREF #REQUIRED>
Using this definition, a schema node can be encoded as follows:

<node id="..." schema_id="...">
  <condition>QUESTION_DESCRIPTION nonrepresentative,figure</condition>
  <action>Non-representative figures are very abstract human figures. The artist purposely abandons the details like gender and age, so that the figure becomes anonymous.</action>
</node>
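Equivalently, a schema node can be held in memory as a small record of conditions, an action, and typed links. The sketch below is our own illustration in Python; the class names and the example identifiers are hypothetical, but the fields mirror the structure just described.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Condition:
    speech_act: str          # e.g. "QUESTION_DESCRIPTION"
    keywords: List[str]      # e.g. ["nonrepresentative", "figure"]

@dataclass
class Link:
    relationship: str        # "sequential", "dialog", or "reference"
    node_id: str             # id of the adjacent schema node

@dataclass
class SchemaNode:
    node_id: str
    schema_id: str           # the schema (topic) this node belongs to
    conditions: List[Condition] = field(default_factory=list)
    responses: List[str] = field(default_factory=list)   # interchangeable responses
    links: List[Link] = field(default_factory=list)

node = SchemaNode(
    node_id="n1",            # hypothetical id, for illustration only
    schema_id="DESCRIBE_METAPHOR_OF_ARTIFACT1",
    conditions=[Condition("QUESTION_DESCRIPTION", ["nonrepresentative", "figure"])],
    responses=["Non-representative figures are very abstract human figures."],
)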
4.2 Discourse Modeling
In this section, we describe how narrative and dialogs can be modeled using the notion of
schema.
4.2.1 Narrative Modeling
In the narrative mode, the agent presents information centering on a topic. It is similar to
the situation in a traditional guided tour: a guide initiates the topic and delivers a
monologue unless interrupted by the visitor. The schema used to model narrative is
usually organized in a linear form. As depicted in Figure 6, each schema node
encapsulates an utterance. The nodes must be well ordered so as to generate a clear
narrative about a specific topic.
Figure 6: Narrative modeling using a schema (a linear sequence of utterances about the sculpture "Love and Hate")
4.2.2 Dialog Modeling
In the dialog mode, the agent and the user are engaged in the discussion of a topic. Turns
are frequently taken. A sample schema for dialog modeling is shown in Figure 7. Clearly,
the organization of schema nodes in a dialog is more dynamic and complicated than in
narrative. Some schema nodes branch out into a few dialog links, which describe the
user’s possible responses.
Figure 7: Dialog modeling using a schema (the agent asks the user to guess what the sculpture looks like and branches on the user's possible replies)
4.3 Reasoning
This section introduces the agent's reasoning process. We utilize the concepts of case-based reasoning and pattern matching. As a key area of improvement, the user's query
patterns are represented using the extracted meaning of the utterance, instead of simple
keywords.
4.3.1 Analysis Phase
The analysis phase attempts to “understand” the user utterance and transform it to an
internally recognizable meaning. In our system, the meaning of an utterance is represented
by the speaker’s intention (speech act) and semantics (keywords). The transformation
steps are depicted in Figure 8.
Figure 8: Transform raw user input to utterance meaning (query pattern)
• Preparation Steps

We examine the utterance to correct ill-formed inputs, resolve references (reference resolution aims to determine which noun phrases refer to each real-world entity mentioned in a document or discourse), tokenize the utterance into a linked word list, and finally apply the PORTER stemming algorithm [22] to transform the words to their stem form.

• Speech Act Classification

The next crucial step is to determine the user's intention via speech act classification.
Inspired by Searle’s speech act theory [3], the system defines over thirty speech
acts (see Table 1) to cover the user’s common queries and statements. Some
speech acts are domain independent, while the rest are related to our application,
i.e., virtual tour guide in an art gallery.
Illocution        Matters
QUESTION          WHY, WHEN, WHO, WHERE, QUANTITY, CONSTRUCT, MATERIAL, COMMENT, EXAMPLE, TYPE, DESCRIPTION, COMPARISON, MORE
REQUEST           MOVE, TURN, POINT, EXIT, REPEAT
STATE             FACT, REASON, EXAMPLE, PREFERENCE, COMPARE, LOOKLIKE, POSITIVECOMMENT, NEGATIVECOMMENT
COMMUNICATE       GREET, THANK, INTERRUPT, BYE, ACKNOWLEDGE, YES, NO, REPAIR

Table 1: List of speech acts
A speech act is described by its illocutionary force and the object. For example,
QUESTION_MATERIAL relates to an inquiry about the material.
Can you tell me what is this sculpture made of?
It is made of Ciment-fondu, an industrial material.
STATE_POSITIVECOMMENT reveals a user’s positive comment.
I like this sculpture very much.
I am glad to hear that.
Our early classification approach was based on the list of acceptable patterns for a
speech act. For example, “do you like *” is an acceptable pattern for
QUESTION_COMMENT. However, this approach results in relatively low
accuracy. Consider the following two utterances:
A: do you like this sculpture?
B: do you like to go for supper?
Utterance B is classified wrongly.
This problem was resolved by adding a list of rejected patterns for each
speech act. In this case, “do you like to *” shall be added to the reject patterns for
QUESTION_COMMENT.
We also encountered the overlapping problem in speech act classification. As it is
impossible to enumerate all possible patterns for a speech act, the classification is
not entirely clean-and-dry. In some cases, one utterance can elicit more than one
speech act. For example, the following utterance can be classified into both
QUESTION_CONSTRUCT and QUESTION_DESCRIBE.
Can you describe how to make ceramics?
Further investigation reveals that some speech acts, e.g., QUESTION_CONSTRUCT, are in fact specialized instances of another speech act (QUESTION_DESCRIBE). Therefore, their patterns may overlap. To get around this problem, we assess the level of specialization for each speech act. Priority is given to a specialized speech act when several speech acts are co-present.

We developed a speech act classifier based on a data model called the state transition graph (STG), as shown in Figure 9. An STG encapsulates the acceptable and rejected patterns for a speech act. A path from the start state to a terminal state indicates an acceptable or a rejected pattern.
Figure 9: State transition graph for QUESTION_MATERIAL
During classification, an utterance is validated against every STG defined for the agent's set of speech acts. For each STG, the classifier performs a word-by-word traversal of the utterance through the STG. If the traversal runs into a state from which there is no
escape, or terminates in the REJECT state, this indicates that validation has failed.
Otherwise, the utterance will terminate in the ACCEPT state, which means the
utterance matches with the corresponding speech act. For example, for an
utterance, “Do you know what this sculpture is made of,” there is a valid path
"do → you → know → what → make → of" in QUESTION_MATERIAL (see Figure 10).
Figure 10: Speech act classification using STG
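As a minimal sketch of the traversal just described, an STG can be stored as a dictionary mapping (state, word) pairs to successor states; the encoding, the tiny sample graph, and the skipping of words without an outgoing edge are our own simplification, not the data structure of the actual classifier.

# Tiny excerpt of an STG for QUESTION_MATERIAL: only one accepting path is encoded.
STG_QUESTION_MATERIAL = {
    ("START", "do"): "S1",
    ("S1", "you"): "S2",
    ("S2", "know"): "S3",
    ("S3", "what"): "S4",
    ("S4", "make"): "S5",
    ("S5", "of"): "ACCEPT",
}

def matches_speech_act(stems, stg):
    # Word-by-word traversal; words with no outgoing edge are skipped.
    # Returns True if the utterance reaches ACCEPT, False if it reaches REJECT
    # or never reaches a terminal state.
    state = "START"
    for word in stems:
        nxt = stg.get((state, word))
        if nxt is not None:
            state = nxt
        if state == "REJECT":
            return False
        if state == "ACCEPT":
            return True
    return False

# "Do you know what this sculpture is made of", after stemming:
print(matches_speech_act(["do", "you", "know", "what", "this", "sculptur",
                          "is", "make", "of"], STG_QUESTION_MATERIAL))   # True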
• Keyword Extraction

We proceed to extract semantics, in terms of keywords, from an utterance. Our approach is based on the Information Retrieval (IR) techniques of stop word removal (stop words are commonly used words which are usually ignored by a search engine in response to a search query) and synonym replacement.

Synonym replacement is a desired feature for embodied conversational agents. It allows a single set of keywords to be scripted in the Knowledge Base, and a list of
its synonyms to be specified in a lexical database (e.g. WordNet [24]). The feature
will help reduce the number of similar patterns we have to script to cover the various synonyms of a word, and it will allow us to adapt the agent to new vocabulary easily.

After keywords are extracted, they are combined with the speech act to form the utterance meaning. The utterance meaning is then escalated to the next level of processing.
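To make the whole analysis phase concrete, the following is a minimal sketch under assumed simplifications: the toy stop-word list, the synonym table, the trivial stemmer, and the hard-coded classifier below merely stand in for the PORTER stemmer, WordNet, and the STG-based classifier of the actual system.

STOP_WORDS = {"do", "you", "what", "is", "this", "the", "a", "an", "of", "me", "tell", "can"}
SYNONYMS = {"statue": "sculpture", "made": "material"}   # canonical keyword forms

def stem(word):
    # Placeholder standing in for the PORTER stemming algorithm [22].
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def analyze(utterance, classify_speech_act):
    words = utterance.lower().split()
    speech_act = classify_speech_act(words)                       # e.g. the STG classifier above
    content = [w for w in words if w not in STOP_WORDS]           # stop-word removal
    keywords = [SYNONYMS.get(stem(w), stem(w)) for w in content]  # stemming + synonym replacement
    # The utterance meaning = intention (speech act) + semantics (keywords) + the raw utterance.
    return {"speech_act": speech_act, "keywords": keywords, "raw": utterance}

print(analyze("What is this statue made of",
              classify_speech_act=lambda words: "QUESTION_MATERIAL"))
# {'speech_act': 'QUESTION_MATERIAL', 'keywords': ['sculpture', 'material'], 'raw': ...}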
4.3.2 Retrieval Phase
In the retrieval phase, the Response Retriever (refer to Figure 4) searches for the most
appropriate schema node that matches the user’s utterance meaning.
We have developed a heuristic searching approach, named locality-prioritized search,
which is sensitive to the locality of the schema nodes. The heuristic rules are based on the characteristics of the schema-based discourse framework that we have proposed. Under such a framework, the basic assumption is that conversation around a topic is modeled as a schema. In other words, the schema captures the relevant and coherent information about the topic. Given this assumption, the idea of locality-prioritized search is to use locality as a clue to information relevance, so as to perform an effective search. The steps are described as follows:
1. The search starts from the successive nodes of FOCUS (recall that FOCUS is the agent's attentional node during the conversation, i.e., "what we were talking about in this topic"; refer to Figure 4). A locality weight w1 is assigned to a successive node which is connected via a dialog link. If a node is connected to FOCUS via a sequential link, a relatively lower weight w2 will be assigned.
2. Scan the nodes within the current schema, assigning a weight w3 that is lower than
w2.
3. If necessary, the search space is expanded to the whole knowledge base and the lowest weight w4 is assigned.
For each schema node, the similarity between its ith condition and the user’s utterance
meaning is measured using the sum of both the hits in their speech acts Si and the hits in
the keyword lists Ki. We then employ a matching function f to compute the matching
score by multiplying the maximum similarity value with the locality weight w of the node.
The node with the highest matching score is accepted if the score exceeds a predefined
threshold.
f = max_{i=1..n} (α·Si + β·Ki) · w
The formulation of the matching function reveals our design rationale: first, speech acts
have been integrated as part of the pattern so as to enhance the robustness of pattern
matching; second, to avoid flattened retrieval, i.e., treating every node equally, we have introduced the locality weight, which gives priority to those potentially more related schema
nodes.
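The scoring can be sketched directly from the formula. The weights, threshold, and candidate nodes below are illustrative values of our own choosing, not those of the evaluated system; the candidate list is assumed to be ordered and weighted by the three search steps above.

ALPHA, BETA = 1.0, 0.5        # illustrative weights for speech-act hits and keyword hits
THRESHOLD = 1.0               # illustrative acceptance threshold

def match_score(node, meaning, locality_weight):
    # f = max over the node's conditions of (alpha*Si + beta*Ki) * w
    best = 0.0
    for cond in node["conditions"]:
        s_hit = 1.0 if cond["speech_act"] == meaning["speech_act"] else 0.0
        k_hit = len(set(cond["keywords"]) & set(meaning["keywords"]))
        best = max(best, ALPHA * s_hit + BETA * k_hit)
    return best * locality_weight

def retrieve(candidates, meaning):
    # candidates: list of (node, locality_weight) pairs collected by search steps 1-3.
    scored = [(match_score(node, meaning, w), node) for node, w in candidates]
    score, node = max(scored, key=lambda pair: pair[0])
    return node if score >= THRESHOLD else None

# Example with two candidate nodes, where w1 > w4:
meaning = {"speech_act": "QUESTION_MATERIAL", "keywords": ["material"]}
n_dialog = {"conditions": [{"speech_act": "QUESTION_MATERIAL", "keywords": ["material"]}],
            "action": "It is made of Ciment-fondu."}
n_other = {"conditions": [{"speech_act": "QUESTION_WHY", "keywords": ["texture"]}],
           "action": "..."}
print(retrieve([(n_dialog, 1.0), (n_other, 0.6)], meaning)["action"])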
4.4 Planning
The implemented planner relies on a repository of plans that are manually scripted by
content experts. There are two levels of plans: an activity plan constitutes a series of
activities to be carried out by the agent; a discourse plan lays out several topics to be
covered for an activity. In our context, Elva’s activity plan outlines the skeleton of a tour.
It includes: (1) welcome remarks; (2) an “appetizer”, i.e., a short and sweet introduction
before the tour; (3) the sequence of “must-sees”, i.e., artifacts to be visited along the
itinerary; (4) a "dessert", i.e., a summarization. A sample plan is shown in Figure 11. An
itinerary is planned at the level of domain entities (light grey ovals), whereas discourse
planning is done at schema level (dark grey ovals). A path through the discourse space
indicates the sequence of schemas, i.e. the topics, to be instantiated.
Figure 11: Elva's tour plan
Tour Plan          Discourse Plan
welcome remarks    WELCOME
                   INTRO_PARTICULAR_OF_MYSELF
                   INTRO_THEME_OF_EXHIBITION
appetizer          INTRO_BACKGROUND_OF_GALLERY
                   INTRO_PARTICULAR_OF_ARTIST
                   BRIEF_PLAN_OF_TOUR
artifact#1         PROBE_METAPHOR_OF_ARTIFACT1
                   DESCRIBE_METAPHOR_OF_ARTIFACT1
                   DESCRIBE_MATERIAL_OF_ARTIFACT1
                   IDENTITY_ARTISTINTENTION_OF_ARTIFACT1
artifact#3         DESCRIBE_METAPHOR_OF_ARTIFACT3
                   DESCRIBE_TEXTURE_OF_ARTIFACT3
                   IDENTITY_ARTISTINTENTION_OF_ARTIFACT3
...                ...
dessert            SUMMARIZE_TOUR
                   BYE

Table 2: Layout of a sample tour plan
The plan library serves as a repository of activity plans and discourse plans. Each plan
specifies the goal it intends to achieve. At run time, the Planner picks up a plan to fulfill
the agent goal. The goal can be modified by a user’s navigation or query. For instance,
when a user takes navigational initiative and stops at an artifact ARTIFACT1, the Planner
will be notified about the new goal NARRATE (ARTIFACT1). A discourse plan can
fulfill this goal by decomposing it into sub-goals. Say a discourse plan,
DESCRIBE_METAPHOR_OF_ARTIFACT1 → DESCRIBE_TEXTURE_OF_ARTIFACT1 → DESCRIBE_MATERIAL_OF_ARTIFACT1, is selected. The goal NARRATE
(ARTIFACT1) is then achieved through the accomplishment of the sub-goals DESCRIBE
(ARTIFACT1, METAPHOR), DESCRIBE (ARTIFACT1, TEXTURE) and DESCRIBE
(ARTIFACT1, MATERIAL).
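A minimal sketch of this decomposition is given below. The plan library keyed by goal and the naming convention used to split a schema name into a sub-goal are our own rendering; in the actual system the plans are scripted by content experts.

PLAN_LIBRARY = {
    ("NARRATE", "ARTIFACT1"): [
        "DESCRIBE_METAPHOR_OF_ARTIFACT1",
        "DESCRIBE_TEXTURE_OF_ARTIFACT1",
        "DESCRIBE_MATERIAL_OF_ARTIFACT1",
    ],
}

def decompose(goal):
    # Map a goal such as NARRATE(ARTIFACT1) to the sub-goals of a selected discourse plan.
    schemas = PLAN_LIBRARY.get(goal, [])
    sub_goals = []
    for schema in schemas:
        verb, rest = schema.split("_", 1)              # e.g. "DESCRIBE", "METAPHOR_OF_ARTIFACT1"
        aspect, _, entity = rest.partition("_OF_")     # e.g. "METAPHOR", "ARTIFACT1"
        sub_goals.append((verb, entity, aspect))
    return sub_goals

print(decompose(("NARRATE", "ARTIFACT1")))
# [('DESCRIBE', 'ARTIFACT1', 'METAPHOR'), ('DESCRIBE', 'ARTIFACT1', 'TEXTURE'),
#  ('DESCRIBE', 'ARTIFACT1', 'MATERIAL')]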
In the present development, the Planner provides Elva with a means to generate a random
tour plan, and localize the discourse content for individual sculptures when visited. The
choices of plans are limited to the available plans in the plan library. To some extent, the
Planner functions in a deterministic manner. While this seems adequate for the domain of
the intended application, a museum where a guide plans a tour of “must-sees” in a predefined sequence, the future development of the planner should favor a more flexible way
to generate plans that can cater to a dynamic world situation. For example, it is desirable
if a guide could customize a tour plan in accordance with the user’s preference.
The Scheduler parses and executes a plan. Before the execution of a plan, the agent’s
attentional state, i.e., FOCUS, will be positioned at a starting node of the targeted schema.
The scheduling task is performed at regular intervals. At each interval, the Scheduler has
to decide “what to say next” by selecting one of the successive nodes of FOCUS (this
selected node will become the next FOCUS). During the conversation, FOCUS keeps
advancing until the Scheduler establishes a dynamic sequence of utterances to span all the
schemas in the plan.
4.5 Dialog Coordination
The Dialog Coordinator (recall Figure 4) is responsible for turn taking management. It
keeps a conversation state diagram that contains possible states: NARRATE, DIALOG,
MAKE_REFERENCE, SHIFT_TOPIC, and WITNESS. The Dialog Coordinator examines the
inputs from the reactive layer, the deliberative layer, and the agent’s perceptual data to
determine the next appropriate state. For example, the agent transits the conversation state
to DIALOG if dialog links are frequently traversed. It changes to WITNESS state when
the user starts to manipulate an artifact or suddenly moves away. The treatment of turn
taking is designed carefully for each conversation state. For example, in DIALOG state,
the agent temporarily suspends the execution of the scheduler to wait for a user turn. In
WITNESS state, the waiting period can be even longer.
Another functionality of the Dialog Coordinator is to manage conversation topics using a topic stack. When Elva makes a reference to a concept, the related topic (represented by its schema ID) will be pushed onto the topic stack. When the reference is
completed, the topic will be cleared from the top of the stack, so that the agent can
proceed with its original topic.
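Sketched in code, the topic-stack bookkeeping might look roughly like this; the state names are those listed above, but the class itself is an illustration rather than the actual implementation.

class DialogCoordinator:
    STATES = {"NARRATE", "DIALOG", "MAKE_REFERENCE", "SHIFT_TOPIC", "WITNESS"}

    def __init__(self, initial_topic):
        self.state = "NARRATE"
        self.topic_stack = [initial_topic]      # schema IDs; top of the stack = current topic

    def start_reference(self, referenced_schema_id):
        # Elva makes a reference to another concept: push it and change state.
        self.state = "MAKE_REFERENCE"
        self.topic_stack.append(referenced_schema_id)

    def finish_reference(self):
        # The reference is completed: pop it so that the original topic resumes.
        if len(self.topic_stack) > 1:
            self.topic_stack.pop()
        self.state = "NARRATE"

    def current_topic(self):
        return self.topic_stack[-1]

dc = DialogCoordinator("DESCRIBE_METAPHOR_OF_ARTIFACT1")
dc.start_reference("DEFINE_CONCEPT_OF_SCULPTART")
dc.finish_reference()
print(dc.current_topic())    # back to DESCRIBE_METAPHOR_OF_ARTIFACT1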
4.6 Summary
We utilized the notion of “schema” to support structured and coherent verbal behaviors.
Under a unified discourse framework, the ECA is able to respond appropriately to user enquiries, as well as to generate plans to fulfill its information delivery goals. In addition,
we briefly described how turn taking was coordinated based on the state of conversation.
CHAPTER 5 AGENT’S NONVERBAL BEHAVIORS
This chapter begins by highlighting the importance of interactional behaviors and deictic
gestures in enhancing agent-user communication. It then presents our methods to generate
and express the multimodal behaviors.
5.1 Types of Nonverbal Behaviors in Our Design
In interpersonal conversations, people make complex representational gestures with their
hands, gaze away from as well as towards one another, and tilt their heads once in a while.
Conversation exploits almost all the affordances of the human body. However, a
successful model of nonverbal behaviors in ECA does not necessarily resemble
interpersonal conversations in all respects. As Cassell et al [4] argued, the ECA should
“pick out those facets of human-human conversation that are feasible to implement, and
without which the implementation of an ECA would make no sense.”
In light of this, we implemented two vital types of nonverbal behaviors. They are
interactional behaviors that help to regulate turn taking and feedback in conversation, and
deictic behaviors that help to refer to an object in the virtual world.
5.1.1 Interactional Behaviors
Our primary interest lies in designing interactional behaviors for turn taking and feedback functions. The interactional behaviors convey significant
communicative information during conversation [6]. Gaze can be used to regulate turn
taking in mixed-initiative dialogue. For example, Elva gives a turn by looking at the user
and raising the eyebrows. And she seeks a turn by raising her hands in the gesture space.
Head nods and facial expression can provide unobtrusive feedback to the user’s utterances
and actions without unnecessarily disrupting the user’s train of thought [6,13]. For
example, Elva gives feedback by looking at the user and nodding her head. She requests
feedback by looking at the user and raising her eyebrows.
The following table lists Elva’s interactional behaviors.
Communicative Function      Interactional Behavior
Welcome and bye
  React to user arrival / React to user exit:  Look at user, nod head, wave
Turn taking
  Give turn:                 Look at user, raise eyebrows, (silence)
  Want turn:                 Raise hands in gesture space
  Take turn:                 Glance away, (start speaking)
Feedback
  Give feedback:             Look at user, nod head
  Request feedback:          Look at user, raise eyebrows
Navigation
  Invite for navigation:     Look at user, raise eyebrows, show the way
  Follow navigation:         Look at user, nod head, start walking
Others
  Witness when a user manipulates an object:   Look at object, glance at user periodically
  Short glance at user

Table 3: Elva's interactional behaviors (portion only)
5.1.2 Deictic Behaviors
Deictic behavior is another important aspect of our design. With deictic behaviors, the
agent can direct the users’ attention to the entities that exist in the virtual space [5,16]. For
example, Elva points at an artifact before she starts to narrate, and she points in a direction when she invites the user to go to the next artifact. Moreover, Elva is also able to differentiate left and right directions. For example, when narrating in front of two artifacts, Elva is able to point to the correct artifact based on the discourse content.
5.2 Nonverbal Behavior Generation
Interactional behaviors usually occur in between two turns or during the user turn. In our
design, we adopt a goal-based model to generate interactional behaviors. The Dialog
Coordinator (refer to Figure 4) reports the changes of the conversational states and events
of turn taking to the Enricher module. Based on these clues, the Enricher determines the
communicative goal (such as take turn, give turn, seek turn, give feedback, request
feedback) to be achieved. Then the communicative goal is mapped to a behavior script
which corresponds to an interactional behavior.
In our design, deictic behaviors occur just before the speech. For example, Elva points at
artifact12 and then starts to narrate. Deictic behaviors are propositional and therefore require
behavior planning. In order to integrate deictic behaviors, an utterance in the schema node
is annotated with additional information. For example, “#point(this_artifact) you are now
looking at ‘Love and Hate’” denotes that a deictic behavior of pointing at the sculpture
“Love and Hate” will be acted out before the utterance is spoken. Based on the annotation,
the Enricher first scans the behavior script that is generated. If there is no existing
propositional gesture introduced by the communicative goal, the Enricher will insert the pointing gesture into the script and act it out just before the speech.
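A sketch of this enrichment step is shown below, assuming the "#point(...)" annotation syntax illustrated above; the parsing code and the dictionary-based script format are our own illustration, not the script language of the actual system.

import re

ANNOTATION = re.compile(r"^#point\((?P<target>[^)]+)\)\s*")

def enrich(utterance, script):
    # If the utterance carries a deictic annotation and the script has no
    # propositional gesture yet, insert a pointing gesture before the speech.
    match = ANNOTATION.match(utterance)
    if match:
        text = utterance[match.end():]
        has_propositional = any(seg["channel"] == "GESTURE" for seg in script)
        if not has_propositional:
            script.insert(0, {"channel": "GESTURE",
                              "segment": "point_at_object",
                              "target": match.group("target")})
    else:
        text = utterance
    script.append({"channel": "FACE", "segment": "speak", "utterance": text})
    return script

print(enrich("#point(this_artifact) you are now looking at 'Love and Hate'", []))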
In addition to these, a series of reflex behaviors have been implemented in the agent. For
example, Elva is able to track the user with head movement in response to the user’s
subtle location change; Elva glances at artifact when the user selects an artifact by
clicking on it. In implementation, the Q&I module (see Figure 4) processes stimuli from the virtual world, such as the user's change of position. The stimuli are quickly mapped to a
reflex behavior. Then the reflex behaviors are fed to the Actuator Module for actuation.
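The fast path from stimulus to reflex can be pictured as a simple lookup, as in the sketch below; the stimulus names and the actuator interface are assumptions made for illustration only.

REFLEX_MAP = {
    # stimulus -> function producing (channel, segment) pairs
    "USER_MOVED":       lambda ctx: [("TURN HEAD", "look_at_user")],
    "ARTIFACT_CLICKED": lambda ctx: [("TURN HEAD", f"look_at_object({ctx['object']})")],
}

def handle_stimulus(event, ctx, actuator):
    """Q&I-style fast path: map a virtual-world stimulus directly to a
    reflex behavior and hand it to the actuator, bypassing the planner."""
    for channel, segment in REFLEX_MAP.get(event, lambda c: [])(ctx):
        actuator.play(channel, segment)   # hypothetical actuator interface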
Notice that nonverbal behaviors can be generated simultaneously from two components of
the agent architecture: interactional behaviors and deictic behaviors are generated by the
Enricher, while reflex behaviors are generated by the Q&I module. In section 5.3.3, we
describe how these behaviors are regulated.
5.3 Nonverbal Behavior Expression
A common way to generate agent animation is to define a collection of animation
primitives, and then use a sequencing engine to string primitives together. However, this
approach is not sufficiently flexible to generate dynamic and complicated behaviors. For
example, it is not possible to implement a head-tracking behavior, because the animations
are pre-defined. In view of this, we propose that nonverbal behaviors be expressed as
multimodal signals across different channels. In this section, we present our methods for
instantiating and regulating more realistic multimodal behaviors in real time.
5.3.1 Building Blocks of Multimodality
As shown in Figure 12, Elva's nonverbal behavior is realized as animations distributed
across several conversational modalities: facial display, head movement, gesture, and
locomotion.
Figure 12: Conversational Modalities
For each modality, a set of behavior segments is defined (Table 4). A behavior segment is
basically a small building block of real-time animation. Some behavior segments can be
realized using pre-recorded animation, e.g., smile, puzzled look, glance_away. However,
others may require real-time information to create a dynamic effect. For example, in order
to play back look_at_user, the turning angle needs to be calculated from the user's
location, the agent's location, body orientation, and head orientation. As another example,
point_at_object requires information about the object, which helps to determine which
hand the agent reaches out.
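As a rough illustration of the real-time information such segments need, the following sketch derives a head-turn angle for look_at_user from the user's and agent's positions; the 2D ground-plane geometry and the clamping limit are assumptions rather than the system's actual animation code.

import math

def head_turn_angle(agent_pos, agent_heading_deg, head_yaw_deg, user_pos,
                    max_turn_deg=80.0):
    """Yaw (in degrees) the head must rotate so that the agent looks at the
    user. Positions are (x, z) pairs on the ground plane; headings in degrees."""
    dx = user_pos[0] - agent_pos[0]
    dz = user_pos[1] - agent_pos[1]
    bearing = math.degrees(math.atan2(dx, dz))            # world-frame direction to user
    current = agent_heading_deg + head_yaw_deg            # where the head points now
    delta = (bearing - current + 180.0) % 360.0 - 180.0   # shortest signed rotation
    return max(-max_turn_deg, min(max_turn_deg, delta))   # assumed comfort limit

# e.g. agent at the origin facing +z, head straight, user slightly to the right
angle = head_turn_angle((0.0, 0.0), 0.0, 0.0, (1.0, 2.0))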
Modality          Behavior Segments
Facial Display    speak ($affective_type, $utterance), glance_away, raise_eyebrow, smile, puzzled, normal
Head Movement     look_at_user, look_at_object ($object), nod, shake, tilt
Gesture           point_at_object ($object), unpoint, show_the_way, batonic, wave_hand, raise_hand ($whichhand), clap_hands
Locomotion        go_to_user, go_to_artifact, turn_to_user, turn_to_object ($object), half_turn_to_object ($object)

Table 4: Elva's multimodal behavior library
5.3.2 Multimodality Instantiation
A behavior script defines a complex nonverbal behavior that is linked to a specific
communicative goal. The script prescribes the animations along four animation timelines,
i.e., FACE, TURN HEAD, GESTURE, and LOCOMOTION. On each timeline is a
sequence of behavior segments that forms a seamless animation.
The problem with instantiating multimodality is that animations can run out of
synchronization if there is no proper control over the different channels. To address this,
a lock mechanism has been developed to ensure the synchronization of animations. In the
behavior script, we specify a control statement "WAIT(lock#)", which locks its
successive behavior segment. The statement "RELEASE(lock#)" serves to free the lock
and resume that behavior segment. In this way, a time dependency can be enforced on two
behavior segments even when they are in different channels.
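A minimal sketch of how such WAIT/RELEASE locks could be realized, assuming each of the four timelines runs on its own thread; the class and method names are illustrative, not taken from the actual system.

import threading

class ScriptLocks:
    """Named locks shared by the timeline threads (FACE, TURN HEAD, GESTURE,
    LOCOMOTION): wait() blocks a timeline until another timeline calls release()."""
    def __init__(self):
        self._events = {}
        self._guard = threading.Lock()

    def _event(self, name):
        with self._guard:
            return self._events.setdefault(name, threading.Event())

    def wait(self, name):        # corresponds to WAIT(lock#) in the script
        self._event(name).wait()

    def release(self, name):     # corresponds to RELEASE(lock#)
        self._event(name).set()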
Figure 13: Synchronized modalities along an animation timeline. FACE: neutral → raise eyebrow → speak → neutral; TURN HEAD: look at artifact → look at user → look at artifact → look at user; GESTURE: point at artifact → unpoint; LOCOMOTION: half turn to artifact → turn to user.
Below is a behavior script that prescribes the animation in Figure 13.
SCRIPT START
name=point_and_talk
communitiveGoal=START_NARRATIVE
type=MIXED
preemptive=false
FACE START
WAIT(lock2)
raise_eyebrow
speak($affective_type, $utterance)
RELEASE(lock3)
FACE END
TURNHEAD START
WAIT(lock1)
look_at_artifact($artifact)
WAIT(1)
look_at_user
RELEASE(lock2)
look_at_artifact($artifact)
WAIT(lock3)
look_at_user
TURNHEAD END
POSE START
WAIT(lock1)
point_at_artifact($artifact)
WAIT(lock3)
unpoint
RELEASE(lock4)
POSE END
LOCOMOTION START
half_turn_to_artifact($artifact)
RELEASE(lock1)
WAIT(lock4)
turn_to_user
LOCOMOTION END
SCRIPT END
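For illustration, a rough interpreter for such a script could run each timeline block on its own thread and share a lock object offering wait/release (for example, the ScriptLocks sketch above). The line-by-line parsing and the reading of WAIT(1) as a one-second pause are assumptions, not the actual script engine.

import threading
import time

def run_timeline(lines, locks, play):
    """Execute one timeline block of a behavior script. `locks` is any object
    with wait(name)/release(name); `play` hands a segment to the animation engine."""
    for line in lines:
        if line.startswith("WAIT(lock"):
            locks.wait(line[5:-1])          # block until another timeline releases it
        elif line.startswith("WAIT("):
            time.sleep(float(line[5:-1]))   # timed pause, e.g. WAIT(1), assumed seconds
        elif line.startswith("RELEASE("):
            locks.release(line[8:-1])
        else:
            play(line)                      # e.g. "point_at_artifact($artifact)"

def run_script(timelines, locks, play):
    """timelines: dict mapping a channel name to its list of script lines."""
    threads = [threading.Thread(target=run_timeline, args=(lines, locks, play))
               for lines in timelines.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()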
5.3.3 Multimodality Regulation
Nonverbal behaviors are generated from two layers of the agent architecture. Behavior
scripts are generated from the Enricher (refer to Figure 4), and reflex behaviors are generated
from the Q&I component. It is possible that behaviors from the two layers attempt to use the
same animation channel, e.g., head movement, at the same time. To resolve such conflicts,
we first define a Boolean variable, preemptive, in the behavior script. If the preemptive value
is true, a reflexive behavior is allowed to overlap with the behavior script under certain
conditions. The regulation is based on the rules described as follows:
Given a behavior script and a conflicting reflexive behavior:
Rule 1: If the behavior script is not preemptive, the reflexive behavior is prohibited.
Otherwise, use Rule 2.
Rule 2: If the channel requested by the reflexive behavior is not vacant, as specified in
the behavior script, then the reflexive behavior is prohibited during the execution of
the behavior script. Otherwise, act out the reflexive behavior.
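These two rules translate directly into a small predicate, sketched below; script.preemptive and script.busy_channels() are hypothetical accessors standing in for the actual script representation.

def allow_reflex(script, reflex_channel):
    """Decide whether a conflicting reflex behavior may be acted out while a
    behavior script is running. `script.preemptive` is the Boolean defined in
    the script; `script.busy_channels()` is a hypothetical helper returning
    the channels the script currently occupies."""
    if not script.preemptive:
        return False            # Rule 1: a non-preemptive script always wins
    if reflex_channel in script.busy_channels():
        return False            # Rule 2: the requested channel is not vacant
    return True                 # otherwise, act out the reflex behavior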
5.4 Summary
To sum up, we have introduced a layered approach to generating appropriate nonverbal
behaviors. We also presented mechanisms to tackle the issues of instantiating and
regulating multimodality.
CHAPTER 6 ILLUSTRATED AGENT-HUMAN INTERACTION
This chapter illustrates how Elva communicates verbally and nonverbally with a system
user, and guides the user in a gallery tour.
6.1 An Interactive Art Gallery
A virtual art gallery (Figure 14) has been developed using the design framework of the
C-VISions (Collaborative Virtual Interactive Simulations) system [7]. The C-VISions
browser allows the user to interface with the virtual world: navigating and acting upon
objects. A chatterbox allows the user to carry out conversation. The virtual guide, named
Elva, appears in front of the user as an animated female character and talks with the
system user through a speech synthesizer.

Figure 14: Elva invites the user to start the tour
At present, the gallery houses a virtual exhibition “Configuring the Body: Form and
Tenor,” which utilizes the existing content in the Ng Eng Teng Gallery of NUS Museums.
Ng Eng Teng was Singapore’s foremost sculptor. He produced a body of consistent and
distinct artwork over a period of forty-five years. Altogether thirteen pieces of Ng Eng
Teng’s artworks were carefully selected for this virtual exhibition.
In the tour, Elva begins by highlighting the "body configuration" concept. Then Elva
presents some of the artist’s extraordinary works to the user. If the user decides to browse
the artwork at his/her own pace, Elva will track the user. In the next sections, we will use
a few episodes to illustrate the verbal and nonverbal communication between Elva and a
system user during a virtual tour.
6.2 Verbal Communication
The following sections illustrate Elva’s two modes of conversations, i.e., narrative
(section 6.2.1) and dialogue (section 6.2.2).
6.2.1 Elva presents information
The following episode illustrates a narrative situation. Elva provides an overview of the
artist, Ng Eng Teng, and then gives a short briefing on the tour.
Elva started to narrate at Ng Eng Teng’s head sketch.
1. Right here you are looking at the head portrait
of the artist Ng Eng Teng.
2. Ng Eng Teng is one of the most important
sculptors and ceramic artists in Singapore.
3. Over a period of forty-five years, he produced
a body of work which is consistent and distinct.
4. His work also marks milestones in Singapore's
art history, especially in the development of modern
sculpture and ceramics.
5. In this tour, I will highlight the "body
configuration" concept using a few artworks.
6. After that, if you want to browse the artwork
at your own pace, please feel free to do so.
7. Would you like to go to the next piece?
8. yes
9. Please follow me
6.2.2 Elva probes the user in a dialogue
The following episode illustrates a dialog situation. Elva invites the user to guess the
meaning of a sculpture called "untitled."
1. This sculpture is "untitled". Can you make a
guess what it looks like?
There was a short silence.
2. Any guess, Raymond?
3. is it a bone?
4. Bone. That is interesting. More?
5. i am not sure.
6. Ok, let me give you a hint!
7. Can you see it as a human body?
8. not really.
9. Well, Raymond. Try to rotate the sculpture to
have a better view.
User clicked on sculpture. The sculpture started to
rotate.
Elva stood by and witnessed.
Silence lasted for 8 seconds.
10. To me, it is a female body with the head and
limbs omitted, namely a "torso".
Elva continued to elaborate on the sculpture.
6.2.3 Elva answers questions
The following episode illustrates another dialog situation, in which the user took the
initiative and asked a few questions.
1. What is the material used for this sculpture?
2. The material used here is ciment-fondu.
3. Can you tell me what is ciment-fondu?
4. Ciment fondu is originally an industrial
material. It is also used in sculpting as well as
refractoring.
5. Tell me more about it.
6. OK. I can share with you some basics. Ciment fondu is a rapid-hardening material. Normally it
takes one day to become solid. Therefore it is ideal
for rapid strength development like sculpting.
7. Where can I buy ciment-fondu?
8. Sorry but I am not sure where the place is.
6.3 Nonverbal Communication
6.3.1 Deictic Gestures
The following episode depicts how Elva uses deictic gestures to direct a user’s attention.
Elva stands at the sculpture "Oh my bump". She is
going to explain the figures on the wall.
1. Elva says: Now Dr. Tan, please look at the wall
beside this sculpture.
2. These figures are extracted from a sketch book,
where Eng Teng experimented on the "torso-to-face"
concept.
Elva walks towards the figures on the wall, and
stands just in front of the wall.
3. Elva says: As you can see on the right most, a human
body is displayed.
As seen in the Figure above, Elva raised her hand
and pointed to the rightmost of the wall-figures.
4. Elva says: The body gradually transforms to a face,
as you can see on the left most.
As seen in the Figure above, Elva gradually moved
her hand towards the left most of the wall-figures.
5. Elva says: Notice that from right to left, the
interpretation of the human body becomes more and
more abstract.
6. Elva says: Interestingly, the idea of "torso-to-face"
comes from Eng Teng's life drawing experience.
As seen in the Figure above, after Elva narrated
about the wall-figures, she turned to the user.
6.3.2 Facial Display
The following figure displays Elva’s basic facial expressions.
Figure 15: Elva's basic facial expressions (neutral, puzzled, raise eyebrow, smile)
6.3.3 Locomotion
The following figure depicts Elva’s navigation around the gallery.
Figure 16: Elva's locomotion
CHAPTER 7 EVALUATION
This chapter describes the evaluation methodology and observed results of the user study
performed on the virtual tour with Elva.
7.1 Methodologies
The user study aims to evaluate the present development of the agent Elva. We focused on
measurements of agent believability and user satisfaction. An agent is considered
believable if it allows users to suspend their disbelief and become cognitively and
emotionally engaged in the interaction with the character [17]. User satisfaction refers to
the engagement and fun experienced in the interaction. Our evaluation method consists of
both qualitative and quantitative analysis.
7.1.1 Qualitative Analysis
For qualitative analysis, each user or subject was interviewed about their experience
after the virtual tour. Agent believability is measured through the subjects' post-usage
descriptions of their interaction with Elva [11,14].
- If subjects used an emotionally rich vocabulary and described Elva's personality
  without hesitation, this would be an indication of believability.
- If subjects hesitated when describing Elva's personality, or found her 'strange'
  or 'incomprehensible', this would indicate low believability.
- If subjects noticed nothing peculiar about Elva's verbal and non-verbal behaviors,
  this would indicate that her behaviors were consistent with their expectations, and
  therefore it would be a sign of believability.
To measure user satisfaction, we observed the subjects’ emotional responses during their
interaction with Elva. Did they respond positively, neutrally, or negatively? In addition to
observation, direct questions about their engagement and enjoyment were asked during
the post-usage interview.
7.1.2 Quantitative Analysis
In the quantitative part of the user study, we collected user feedback using an evaluation
form (see Appendix). The form consists of three parts: user experience, feedback on
Elva's verbal behaviors, and feedback on Elva's nonverbal behaviors. The subjects rated
their agreement or disagreement with each statement on a 5-point scale.
7.2 Subjects, Task and Procedure
A total of ten university students (six males and four females) participated in the user
study. The subjects comprised six computer science students, two information system
students, one chemistry student, and one engineering student. Among them, five subjects
had prior experience chatting with text-based chatterbots like ELIZA and ALICE, but
none had experience using an embodied conversational agent.
The user study was conducted on an individual basis. Each subject went through five
sessions, as shown in Figure 17: briefing, training, virtual tour, post-usage interview, and
evaluation form filling.
Figure 17: Procedure of user study on Elva (1. briefing, 2. training, 3. virtual tour, 4. post-usage interview, 5. evaluation form filling)
1. Briefing: The experimenter briefed the subject on the procedure of the user study.
Then the subject was asked to read through an instruction sheet carefully. The
sheet offered an introduction about Elva and the exhibition. It also contained
several useful tips for the subject to get started. These were:
a. Show some politeness to Elva.
b. Be patient at the start of the tour. You may be anxious, but you also do not
want to miss out on the important concepts.
c. Do not hesitate to ask questions.
d. Do not always follow. You can lead the way when you become confident.
e. Try to use simple and standard English.
2. Training: The subject was trained on how to use the C-VISions browser. Through
training, the subject became familiar with the VR user interface.
3. Virtual tour: The subject started the virtual tour and traveled through the gallery
with Elva. The whole process was videotaped.
4. Post-usage interview: After the tour, the subject was interviewed about his/her
experience. The interview was un-cued and had an open structure where each
subject was asked to freely describe his/her interaction with Elva. At the end of the
interview, more direct questions about believability and user satisfaction were
asked.
5. Evaluation form filling: The subject was asked to complete the questions in an
evaluation form, and to give suggestions on how to improve the present virtual
guide.
Note that the knowledge base used for the evaluation defines a total of 65 schemas, or 225
schema nodes. Among these schema nodes, 120 are applicable for matching against users'
utterances.
7.3 Evaluation Results
The observed results are derived from video records, chat logs, post-usage interviews, and
evaluation forms.
7.3.1 User Experience
Table 5 depicts the summarized results related to user experience. In the right-most
column, the average scores are displayed.
1 = strongly disagree; 2 = disagree; 3 = neutral; 4 = agree; 5 = strongly agree

Statement                                                                    1   2   3   4   5   avg
I enjoyed the virtual gallery experience.                                    0   0   0   5   5   4.5
Elva triggered my interest when exploring the artworks in the exhibition.    0   0   4   2   4   4
She helped me understand concepts in the exhibition.                         0   0   3   4   3   4
I found the interaction with her was engaging.                               0   1   1   5   3   4
I found the interaction with her was enjoyable.                              0   0   0   4   6   4.6

Table 5: Evaluation results related to user satisfaction in interaction with Elva
All subjects enjoyed the virtual tour with the presence of the virtual guide. During the
interview, some subjects used positive terms such as “enjoyable,” “amazing,”
“interesting” when describing the experience. From the video recording, the facial
expression of subjects also clearly showed signs of enjoyment.
At the beginning of the tour, at least half of the subjects were quiet. As Elva walked up
and said hello to the subjects, unexpectedly, only three subjects greeted back immediately.
One subject, instead of typing "hello", typed "it is cool!" When asked about the reason in
the interview, the subject stated that it was because his attention was captured by Elva's
appearance and the environment. For the quiet subjects, the common reason was that they
had no idea how to start interacting with a robot.
Elva’s probing questions were effective in encouraging participation from initially reticent
subjects. When Elva invited the visitor to guess the meaning of an untitled sculpture (as
illustrated in section 6.2.2), eight subjects, out of ten, followed up and made some guesses.
We observed that, in general, subjects talked more often after a few successful
interactions with Elva. As seen in Table 5, a majority agreed that Elva triggered their
interest when exploring the artworks in the exhibition.
The overall responses to Elva’s guidance were positive. Subjects generally agreed that
Elva helped them to understand the concept of the exhibition. According to them, Elva’s
narratives about the artifacts were relatively easy to follow, and she was able to answer
some of the most basic questions.
A majority of subjects found the interaction with Elva engaging. From the video records,
we observed that most subjects exhibited positive emotions and focused their attention
on the computer screen throughout the virtual tour. One student stated that the interaction
was not engaging, and he attributed this to three factors: the look of the virtual
environment, the richness of character animation, and the choices of actions that users can
take.
All subjects found the interaction with Elva enjoyable, with an average score of 4.6 out of 5.
7.3.2 Agent Believability
In the qualitative analysis, we derived some useful findings from the subjects' post-usage
descriptions. Table 6 lists the subjects' responses to Elva's personality. We observed that
three subjects described Elva's personality without hesitation, using emotional terms
such as "agreeable," "lovely," "pleasant," "black humorous," and "polite." Half of the
subjects described Elva's personality without hesitation, but used neutral terms
such as "professional," "polite," and "neutral"; one of these subjects remarked that a neutral
personality suited a guide. Two subjects believed that Elva did not possess a personality
and attributed this to Elva's "limited response to questions" and "lack of sufficient
emotions." Overall, Elva's behaviors were consistent with the subjects' expectations. The
qualitative analysis indicated an intermediate-to-high level of believability.
Subject     Description of Elva's personality
Jeremiah    She is pleasant and helpful. Sometime blur.
Gary        Patient and pleasant. Sometime she is "sassy", (because she) asked me to go when I was waiting for her answer.
LHP         She is very professional. (She is) not very personal, because she does not smile.
LY          Agreeable, lovely, and appealing.
Melvyn      Not yet. (It is due to her) limited response to questions, and (she is) lack of sufficient emotions.
Pasar       She is neutral. A guide should be neutral, so she is good.
TCT         She is kind and polite. (She is) patient as well.
TSF         Straight-forward and polite.
Cloud       She is elegant, helpful, (and) black humorous.
Snow        She is professional and polite.

Table 6: Users' responses to Elva's personality
The quantitative analysis zoomed in on aspects of Elva's verbal and nonverbal behaviors
in order to assess her believability. We elaborate on the findings in the next two sections.
7.3.2.1 Evaluation results on Elva’s Verbal Behaviors
Table 7 presents the results from the evaluation related to Elva’s verbal behaviors.
1 = strongly disagree; 2 = disagree; 3 = neutral; 4 = agree; 5 = strongly agree

Statement                                                                            1   2   3   4   5   avg
Her speech was well organized.                                                       0   0   1   8   1   4
Her speech was coherent.                                                             0   0   1   8   1   4
Her speech was well based on the location and the subject of the conversation.       0   0   1   4   5   4.4
The amount of information about the exhibits was appropriate, namely not lengthy,
nor over simple.                                                                     0   1   2   6   1   3.7
The pace at which she presented information was appropriate.                         0   2   5   2   1   3.2
She reacted to my request⁴ properly.                                                 0   2   3   4   1   3.4
She answered my enquiries⁵ properly.                                                 0   2   3   5   0   3.3

Table 7: Evaluation results related to Elva's verbal behaviors
Elva’s narrative skills received relatively high scores. The first three entries in Table 7
indicated that the underlying discourse framework had been well utilized to generate
structured, coherent, and situated speech.
⁴ To request is to ask somebody to do something. For example, "Can you please come here?"
⁵ To enquire is to ask somebody for specific information. For example, "What is this?"
The ratings on Elva’s question-answering capability produced mixed results. As seen in
Table 7, for enquiries, 50% of the subjects agreed that Elva answered properly, 20% of
the subjects disagreed, and 30% of the subjects neither agreed nor disagreed. For requests,
the distribution of the scores was similar.
Further investigation into the chat logs showed that among a total of 95 enquiries and
requests, excluding small talk, Elva was able to detect the correct speech act for 68
questions, or 69% of the total, indicating an acceptable performance in utterance analysis.
The study confirmed that incorporating speech acts into the traditional keyword-based
pattern-matching approach yields improved performance. The enhanced approach helps to
capture the essential form of an utterance in addition to its semantics. On one hand,
speech act classification demonstrated its power in recognizing the same query asked in
different forms. In practice, an effective, but not necessarily complete, set of patterns
was often sufficient to classify an utterance correctly. On the other hand, keywords
capture the semantics of the utterance and therefore help to handle queries asked in
different contexts. Keyword extraction proved effective for queries asked using complete
sentences.
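A toy sketch of this combined matching is given below: the utterance is classified into a speech act with a small pattern set, its content keywords are extracted, and candidate schema nodes are scored on both. The pattern lists, stop-word set, and scoring weights are illustrative assumptions, not Elva's actual rules.

import re

SPEECH_ACT_PATTERNS = {                      # illustrative, not the real pattern set
    "ENQUIRY": [r"^what\b", r"^who\b", r"^where\b", r"^why\b", r"\bcan you tell me\b"],
    "REQUEST": [r"^please\b", r"\bcan you (show|take|bring)\b"],
}
STOP_WORDS = {"the", "is", "a", "an", "of", "for", "this", "it", "me", "you", "what"}

def classify_speech_act(utterance):
    text = utterance.lower()
    for act, patterns in SPEECH_ACT_PATTERNS.items():
        if any(re.search(p, text) for p in patterns):
            return act
    return "STATEMENT"                       # default when no pattern fires

def keywords(utterance):
    return {w for w in re.findall(r"[a-z\-]+", utterance.lower())
            if w not in STOP_WORDS}

def score(node, act, kws):
    """node: dict with 'act' and 'keywords'; a higher score is a better match."""
    return (2 if node["act"] == act else 0) + len(kws & set(node["keywords"]))

# usage
act = classify_speech_act("What is the material used for this sculpture?")
kws = keywords("What is the material used for this sculpture?")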
With regard to Elva’s capability in response retrieval, she was only able to answers
correctly for 35 questions or 37% of the total. This indicated a relatively low performance.
The most well answered questions turned out to be the most basic questions, such as
“what is this sculpture?” “What is the material used,” “Who created this sculpture,” and
“what does this sculpture mean.”
Overall, if we use a 5-point scale to quantify Elva’s question-answering capability, the
capability should be around 3, an intermediate level.
Another finding from the user study confirmed that, with speech act classification, Elva
was able to show a relatively good understanding of users' questions without necessarily
knowing the answer. Consider the following dialogue.
so where did he learn this stuff?
Sorry but I am not sure where the place is.
It is ok.
... ...
why is it brown in color?
In fact I am also interested in finding out why, Melvyn.
In the above cases, Elva uttered a default response for each speech act, as she could not find
an appropriate answer in the knowledge base. Most subjects found the default responses
tolerable, because they thought "she seemed to understand (the question)," and "it is
sensible to answer it that way."
In comparison, Elva tended to give seemingly peculiar answers when no suitable speech
act was found. Consider the following dialog.
Over a period of forty-five years, he produced a body of work
which is consistent and distinct.
what kind of works?
Why do you ask?
i am interested
I see, please continue.
Answer my question!
In this case, the subject felt offended and complained that "she is blur," and "she doesn't
know what I'm talking about."
Through the user study, we collected a fruitful set of "real data," i.e., around 200 patterns
of user input. These can be utilized to refine Elva's speech acts and to expand her
knowledge base accordingly.
7.3.2.2 Evaluation results on Elva’s Non-verbal Behaviors
Table 8 depicts the evaluation results related to Elva's nonverbal behaviors.
1 = strongly disagree; 2 = disagree; 3 = neutral; 4 = agree; 5 = strongly agree

Statement                                                                            1   2   3   4   5   avg
Her eye gaze was natural.                                                            0   1   3   3   3   3.8
Her head movement was natural.                                                       0   1   1   5   3   4
Her facial expression was natural.                                                   0   2   2   2   4   3.8
Her gestures coupled well with her speech.                                           0   1   1   5   3   4
Her pointing gestures effectively helped me to address the correct artifact.         0   0   2   4   3   4.1
During conversation, she took turns (started to speak) and gave turns (went silent
and listened to user) at right time.                                                 0   0   1   6   2   4.1
She took moves at right time.                                                        1   0   2   7   0   3.5
I was able to interpret her nonverbal behaviors easily.                              0   1   0   5   4   4.1
Overall her nonverbal behaviors were meaningful and effective.                       0   1   0   7   2   4

Table 8: Evaluation results related to nonverbal behaviors
Elva’s gaze and head movement can be comprehended by most subjects. They remarked
that “she is quite attentive,” and “when she nods, it looks as if she is listening.” One
61
subject, who disagreed, commented that Elva’s eye gaze was not obvious and her head
movement “somehow looked mechanical”. Another subject remarked that “she often
stares at me” and suggested that Elva should glance away more frequently, especially
while she was speaking. The rating on Elva’s facial expression produced mixed results. A
common opinion was that Elva did not show sufficient facial expressions. About Elva’s
locomotion, a major agreed that she took move at right time. But one subject strongly
disagreed, and complained that it often took a long time for Elva to arrive at him. The
result reveals that a majority of the subjects agreed that Elva’s nonverbal behaviors were
easy to interpret. Overall, the user study indicated that Elva’s nonverbal behaviors were
meaningful and effective.
7.4 Discussion
First, there is a pressing need to improve Elva's capability in answering questions.
Improvement can be made on Elva's natural language understanding and domain
knowledge. With regard to natural language understanding, the study revealed that a more
clear-cut and comprehensive set of speech acts should be defined to increase the accuracy of
classification. Moreover, a considerable amount of effort is required to expand the
knowledge base. In addition to the most basic questions, Elva should be able to answer
some advanced questions in this domain, so that users will have not only an enjoyable but
also an enriching experience.
Second, emotive behaviors should be integrated into Elva to achieve a higher level of
believability. Elva is embodied in a human form; therefore, users expect to see not only
human intelligence, but also human-like emotions. This is not to say that a virtual guide
should laugh or cry all the time. Rather, Elva should exhibit appropriate emotive behaviors
that are consistent with users' expectations, so that users can suspend their disbelief and
become more engaged in the interaction.
7.5 Summary
This chapter describes a user study to evaluate the present development of Elva. Most
users interacted successfully with Elva and enjoyed the tour. Regarding agent
believability, the qualitative analysis indicated an intermediate-to-high level of
believability. From the quantitative analysis, Elva's narrative skills received relatively
high scores. However, the ratings on Elva's question-answering capability produced mixed
results. A majority agreed that Elva's behaviors, both verbal and nonverbal, were
comprehensible and appropriate.
Through the user study, we collected a fruitful set of "real data," which is valuable for
improving the agent's question-answering capability. The user study also revealed
that emotive behaviors should be integrated into Elva to achieve a higher level of
believability.
CHAPTER 8 CONCLUSION
8.1 Summary
This research proposed an integrative approach for building a novel type of conversational
interface, an embodied conversational agent (ECA). ECA is still an immature field which
warrants further investigation in various areas. In general, our research focused on the
attributes of the agent’s autonomy and believability.
The agent’s three-layered architectural design ensures appropriate coupling between the
agent’s perception and action, and to hide the internal mechanism from users. We utilized
the notion of “schema” to support structured and coherent verbal behaviors. For the agent
to effectively take part in conversation with humans, we modeled domain knowledge as
individual schemas. The agent’s narrative and question-answering are well supported
using this unified discourse framework. The agent's reasoning approach is derived
from case-based reasoning. The use of speech acts allows the agent to capture the user's
intention embedded in the speech content. The response given by the agent is more
accurate when the user's intention is captured, as compared to simply expressing queries
as keywords [15]. The agent's planning relies on manually authored plans that form the
plan library. We also presented approaches to generate and coordinate the agent's
nonverbal behaviors. In particular, we implemented two vital types of nonverbal
behaviors: interactional behaviors, which help to regulate turn taking and feedback in
conversation, and deictic behaviors, which help to refer to objects in the virtual world.
Our evaluation method assessed user satisfaction and agent believability. The study
consisted of both qualitative and quantitative analysis. The results reveal that a majority of
the subjects enjoyed the tour and interacted successfully with Elva. However, both types
of analysis produced mixed results in agent believability. There are two major
implications for the further development of Elva. First, Elva's language understanding
skills and knowledge base should be improved and enlarged. Second, emotive behaviors
should be integrated into Elva to achieve a higher level of believability.
8.2 Contributions of the Thesis
For the ECA research community, Elva provides a case study of a specific application, a
virtual guide. In this domain, we have studied a virtual guide's behavior dimensions, such
as narrating, probing, question-answering, and mixed-initiative navigation. We have also
shown, by example, how a museum's contents can be digitized and how knowledge in the
art domain can be modeled. In the future, Elva can serve as a test bed for various
empirical studies in this domain.
As the first iteration of ECA research in the Learning Science and Learning Environment
Lab (LELS), this research work benefits ongoing and future agent projects in several
ways. First, we have provided a generic computational model that can be further studied
or extended in order to build ECAs in other domains. Second, several key technologies
introduced in this research are reusable, e.g., the three-layered agent architecture, the
schema-based discourse framework, speech act classification, and multimodal behavior
generation and coordination. Third, the lessons we learnt and the experiences we acquired
are themselves of value. To date, an honours year project, Multi-agent Virtual Movie
Gallery, has extended the current ECA model to support conversations among three agents
and a single user. Another project has been proposed to extend Elva's ability to guide a
group of visitors. More projects have been proposed to perform in-depth research on
different dimensions.
8.3 Future Work
Multiparty interaction support is a desirable feature for the further development of Elva. It
would be exciting if Elva were able to guide a group of visitors. A multiparty scenario not
only extends the questions from the single-party scenario, such as mixed initiative,
conversation modeling, and coordination of multimodal behaviors, but also presents
entirely new challenges posed by the larger number of users. For example, Elva may need
to track several topics simultaneously and to relate them to individual users.
The next phase of ECA research can also address the area of affective computing.
Currently, Elva exhibits a static personality style that can be classified as pleasant, polite,
and patient. One approach is to build an emotion engine capable of generating emotions
that evolve dynamically with the situation of the conversation. The agent's expression
of emotional state would be moderated by a set of parameters, such as social variables, the
agent's linguistic style, and personality. In this way, the agent can be elevated to the level
of a virtual world actor.
Another open research direction is towards modeling a user's cognitive map. A cognitive
map is a personal mind map of what a person is thinking and expecting to see.
Currently, Elva is able to capture a user's mouse actions and navigation in the virtual
world, and she is also able to analyze a user's intention via speech act classification.
Given these observations, it is possible to construct a cognitive map of the user.
Researchers examining cognitive mapping in virtual environments have proposed some
measurements of a person's cognitive map [2]. These measurements should be carefully
examined to determine whether it is feasible to apply them to our application domain. By
understanding a user's cognitive map, Elva can generate proper plans or decisions to
accomplish her tasks, and provide customized services to users.
REFERENCES
[1] Andersen, P.B. and Callesen, J. Agents as actors. In Qvortrup, L. (ed.), Virtual Interaction: Interaction in Virtual Inhabited 3D Worlds. Springer, London, 2001, pp. 182-208.

[2] Billinghurst, M. and Weghorst, S. The use of sketch maps to measure cognitive maps of virtual environments. In Proceedings of the Virtual Reality Annual International Symposium, IEEE Computer Society, Triangle Park, NC, 1995, pp. 40-47.

[3] Burkhardt, A. Speech Act, Meaning and Intentions: Critical Approach to the Philosophy of John R. Searle. Walter de Gruyter, New York, 1990.

[4] Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. Embodied Conversational Agents. MIT Press, Cambridge, 2000.

[5] Cassell, J. Nudge nudge wink wink: elements of face-to-face conversation for embodied conversational agents. In Cassell, J., Prevost, S., Sullivan, J., and Churchill, E. (eds.), Embodied Conversational Agents. MIT Press, Cambridge, 2000, pp. 1-27.

[6] Cassell, J., Bickmore, T., Campbell, L., Vilhjalmsson, H., and Yan, H. The human conversation as a system framework: designing embodied conversational agents. In Cassell, J., Prevost, S., Sullivan, J., and Churchill, E. (eds.), Embodied Conversational Agents. MIT Press, Cambridge, 2000, pp. 29-63.

[7] Chee, Y.S. and Hooi, C.M. C-VISions: socialized learning through collaborative, virtual, interactive simulations. In Proceedings of CSCL 2002: Conference on Computer Support for Collaborative Learning, Boulder, CO, USA, 2002, pp. 687-696.

[8] Core, M. and Allen, J. Coding dialogs with the DAMSL annotation scheme. In Proceedings of the AAAI Fall Symposium on Communicative Action in Humans and Machines, Boston, MA, November 1997, pp. 28-35.

[9] Davis, D.N. Multiple level representations of emotion in computational agents. In AISB'01 Symposium on Emotion, Cognition and Affective Computing. University of York, 2001.

[10] Foner, L.N. Are we having fun yet? Using social agents in social domains. In Dautenhahn, K. (ed.), Human Cognition and Social Agent Technology. John Benjamins, Amsterdam, 2000, pp. 323-348.

[11] Höök, K., Persson, P., and Sjölinder, M. Evaluating user's experience of a character-enhanced information space. AI Communications, 13, 2000, pp. 195-21.

[12] Huber, M.J. JAM: a BDI-theoretic mobile agent architecture. In Proceedings of the Third Annual Conference on Autonomous Agents, Seattle, 1999. ACM Press, New York, 1999, pp. 236-243.

[13] Johnson, W.L. and Rickel, J.W. Animated pedagogical agents: face-to-face interaction in interactive learning environments. International Journal of AI in Education, November 2000, pp. 47-78.

[14] Laaksolahti, J., Persson, P., and Palo, C. Evaluating believability in an interactive narrative. In Proceedings of the Second International Conference on Intelligent Agent Technology, Maebashi City, Japan, 2001, pp. 30-35.

[15] Lee, S.I. and Cho, S.B. An intelligent agent with structure pattern matching for a virtual representative. In Zhong, N., Liu, J., Ohsuga, S., and Bradshaw, J. (eds.), Intelligent Agent Technology: Research and Development. World Scientific, 2001, pp. 305-309.

[16] Lester, J.C., Towns, S., Callaway, C.B., Voerman, J.L., and FitzGerald, P.J. Deictic and emotive communication in animated pedagogical agents. In Cassell, J., Prevost, S., Sullivan, J., and Churchill, E. (eds.), Embodied Conversational Agents. MIT Press, Cambridge, MA, 2000, pp. 123-154.

[17] Loyall, A.B. Believable Agents: Building Interactive Personalities. Ph.D. Thesis, Technical Report CMU-CS-97-123, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 1997.

[18] Madsen, C.B. and Granum, E. Aspects of interactive autonomy and perception. In Qvortrup, L. (ed.), Virtual Interaction: Interaction in Virtual Inhabited 3D Worlds. Springer, London, 2001, pp. 182-208.

[19] Pelachaud, C., Carofiglio, V., De Carolis, B., de Rosis, F., and Poggi, I. Embodied contextual agent in information delivering application. In Proceedings of AAMAS 2002, ACM Press, New York, 2002, pp. 758-765.

[20] Poggi, I. and Pelachaud, C. Performative facial expressions in animated faces. In Cassell, J., Prevost, S., Sullivan, J., and Churchill, E. (eds.), Embodied Conversational Agents. MIT Press, Cambridge, MA, 2000, pp. 155-188.

[21] Rooney, C.F.B., O'Donoghue, R.P.S., Duffy, B.R., O'Hare, G.M.P., and Collier, R.W. The social robot architecture: towards sociality in a real world domain. In Proceedings of Towards Intelligent Mobile Robots 99, Bristol, UK, 1999.

[22] Porter, M. The Porter Stemming Algorithm. URL: http://www.tartarus.org/~martin/PorterStemmer, accessed on 20/03/2003.

[23] Sloman, A. Architectural requirements for human-like agents both natural and artificial. In Dautenhahn, K. (ed.), Human Cognition and Social Agent Technology. John Benjamins, Amsterdam, 2000, pp. 163-195.

[24] The WordNet official homepage. URL: http://www.cogsci.princeton.edu/~wn/, accessed on 12/02/2003.
User study on Elva
Appendix I. Instruction Sheet
This user study aims to evaluate the social robot Elva that has been developed in
the Learning Environment and Learning Science Lab, NUS. Elva mimics a tour guide
in a virtual art gallery: Ng Eng Teng Art Gallery. When a user logs in, she will guide
the user through the gallery, present information on the gallery exhibits, and attend
to the user's enquiries.
In this study, you are about to visit the gallery and interact with Elva. The study consists of five sessions
(figure below). In session 1, read through the rest of the sheet carefully to understand the basic facts
about the artist and the gallery, as well as some useful tips. In session 2, you will be trained to walk in
the virtual world and to play with the exhibits. You will then spend 20 – 30 minutes on the gallery tour
(session 3). After the tour, you will do a story-telling about your experience (session 4) and fill in an
evaluation form (session 5).
1. Learn about facts and tips
2. Be trained on the user interface
3. Enjoy the gallery tour
4. Be interviewed on your experience
5. Complete an evaluation form
> Basic Facts
The Ng Eng Teng’s gallery is situated at NUS
Museums in the central heart of the National
University of Singapore. The gallery houses
the most comprehensive collection of works
by Singapore's foremost sculptor, Ng Eng
Teng. It was established in July 1997 from a
generous donation of 760 works by the artist
himself.
In general, we encourage users to create their very own experiences freely. Notwithstanding, if you want to achieve a smoother interaction, these tips may be useful.
The exhibition, titled "Configuring the Body:
Form and Tenor," features works selected from
the artist's third donation. Ng's main source of
inspiration has always been the human
figure. As you will see, even his abstract
works are experiments based on elements of
the human form.
> Tips [...]