Free ebooks ==> www.Ebook777.com Signals and Communication Technology Alexander Rudnicky Antoine Raux Ian Lane Teruhisa Misu Editors Situated Dialog in SpeechBased HumanComputer Interaction www.Ebook777.com Free ebooks ==> www.Ebook777.com Signals and Communication Technology www.Ebook777.com More information about this series at http://www.springer.com/series/4748 Alexander Rudnicky Antoine Raux Ian Lane Teruhisa Misu • • Editors Situated Dialog in Speech-Based Human-Computer Interaction 123 Free ebooks ==> www.Ebook777.com Editors Alexander Rudnicky School of Computer Science Carnegie Mellon University Pittsburgh, PA USA Ian Lane Carnegie Mellon University Silicon Valley Moffett Field, CA USA Antoine Raux Cupertino, CA USA Teruhisa Misu Mountain View, CA USA ISSN 1860-4862 ISSN 1860-4870 (electronic) Signals and Communication Technology ISBN 978-3-319-21833-5 ISBN 978-3-319-21834-2 (eBook) DOI 10.1007/978-3-319-21834-2 Library of Congress Control Number: 2015949507 Springer Cham Heidelberg New York Dordrecht London © Springer International Publishing Switzerland 2016 This work is subject to copyright All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed The use of general descriptive names, registered names, trademarks, service marks, etc in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made Printed on acid-free paper Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com) www.Ebook777.com Contents Part I Dialog Management and Spoken Language Processing Evaluation of Statistical POMDP-Based Dialogue Systems in Noisy Environments Steve Young, Catherine Breslin, Milica Gašić, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis and Eli Tzirkel Hancock Syntactic Filtering and Content-Based Retrieval of Twitter Sentences for the Generation of System Utterances in Dialogue Systems Ryuichiro Higashinaka, Nozomi Kobayashi, Toru Hirano, Chiaki Miyazaki, Toyomi Meguro, Toshiro Makino and Yoshihiro Matsuo 15 Knowledge-Guided Interpretation and Generation of Task-Oriented Dialogue Alfredo Gabaldon, Pat Langley, Ben Meadows and Ted Selker 27 Justification and Transparency Explanations in Dialogue Systems to Maintain Human-Computer Trust Florian Nothdurft and Wolfgang Minker 41 Dialogue Management for User-Centered Adaptive Dialogue Stefan Ultes, Hüseyin Dikme and Wolfgang Minker 51 Chat-Like Conversational System Based on Selection of Reply Generating Module with Reinforcement Learning Tomohide Shibata, Yusuke Egashira and Sadao Kurohashi 63 Investigating Critical Speech Recognition Errors in Spoken Short Messages Aasish Pappu, Teruhisa Misu and Rakesh Gupta 71 
v vi Part II Contents Human Interaction with Dialog Systems The HRI-CMU Corpus of Situated In-Car Interactions David Cohen, Akshay Chandrashekaran, Ian Lane and Antoine Raux Detecting ‘Request Alternatives’ User Dialog Acts from Dialog Context Yi Ma and Eric Fosler-Lussier 85 97 Emotion and Its Triggers in Human Spoken Dialogue: Recognition and Analysis 103 Nurul Lubis, Sakriani Sakti, Graham Neubig, Tomoki Toda, Ayu Purwarianti and Satoshi Nakamura Evaluation of In-Car SDS Notification Concepts for Incoming Proactive Events 111 Hansjörg Hofmann, Mario Hermanutz, Vanessa Tobisch, Ute Ehrlich, André Berton and Wolfgang Minker Construction and Analysis of a Persuasive Dialogue Corpus 125 Takuya Hiraoka, Graham Neubig, Sakriani Sakti, Tomoki Toda and Satoshi Nakamura Evaluating Model that Predicts When People Will Speak to a Humanoid Robot and Handling Variations of Individuals and Instructions 139 Takaaki Sugiyama, Kazunori Komatani and Satoshi Sato Entrainment in Pedestrian Direction Giving: How Many Kinds of Entrainment? 151 Zhichao Hu, Gabrielle Halberg, Carolynn R Jimenez and Marilyn A Walker Situated Interaction in a Multilingual Spoken Information Access Framework 165 Niklas Laxström, Kristiina Jokinen and Graham Wilcock Part III Speech Recognition and Core Technologies A Turbo-Decoding Weighted Forward-Backward Algorithm for Multimodal Speech Recognition 179 Simon Receveur, David Scheler and Tim Fingscheidt Engine-Independent ASR Error Management for Dialog Systems 193 Junhwi Choi, Donghyeon Lee, Seounghan Ryu, Kyusong Lee, Kyungduk Kim, Hyungjong Noh and Gary Geunbae Lee Contents vii Restoring Incorrectly Segmented Keywords and Turn-Taking Caused by Short Pauses 205 Kazunori Komatani, Naoki Hotta and Satoshi Sato A Semi-automated Evaluation Metric for Dialogue Model Coherence 217 Sudeep Gandhe and David Traum Part I Dialog Management and Spoken Language Processing Free ebooks ==> www.Ebook777.com Evaluation of Statistical POMDP-Based Dialogue Systems in Noisy Environments Steve Young, Catherine Breslin, Milica Gaši´c, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis and Eli Tzirkel Hancock Abstract Compared to conventional hand-crafted rule-based dialogue management systems, statistical POMDP-based dialogue managers offer the promise of increased robustness, reduced development and maintenance costs, and scaleability to large open-domains As a consequence, there has been considerable research activity in approaches to statistical spoken dialogue systems over recent years However, building and deploying a real-time spoken dialogue system is expensive, and even when operational, it is hard to recruit sufficient users to get statistically significant results Instead, researchers have tended to evaluate using user simulators or by reprocessing existing corpora, both of which are unconvincing predictors of actual real world performance This paper describes the deployment of a real-world restaurant information system and its evaluation in a motor car using subjects recruited locally and by remote users recruited using Amazon Mechanical Turk The paper explores three key questions: are statistical dialogue systems more robust than conventional hand-crafted systems; how does the performance of a system evaluated on a user simulator compare to performance with real users; and can performance of a system tested over the telephone network be used to predict performance in more hostile environments such as a motor car? 
The results show that the statistical approach is indeed more robust, but results from a simulator significantly over-estimate performance, both absolute and relative. Finally, by matching WER rates, performance results obtained over the telephone can provide useful predictors of performance in noisier environments such as the motor car, but again they tend to over-estimate performance.

S. Young (B) · C. Breslin · M. Gašić · M. Henderson · D. Kim · M. Szummer · B. Thomson · P. Tsiakoulis
Cambridge University Engineering Department, Cambridge, UK
e-mail: sjy@eng.cam.ac.uk

E.T. Hancock
General Motors Advanced Technical Center, Herzliya, Israel
e-mail: eli.tzirkel@gm.com

© Springer International Publishing Switzerland 2016
A. Rudnicky et al. (eds.), Situated Dialog in Speech-Based Human-Computer Interaction, Signals and Communication Technology, DOI 10.1007/978-3-319-21834-2_1

Restoring Incorrectly Segmented Keywords and Turn-Taking Caused by Short Pauses — Kazunori Komatani, Naoki Hotta and Satoshi Sato

Fig. FST used to describe dialogue management rules

… the system moves to the "combining and ASR" state when it receives the "combination starts" message, which is also depicted in the figure. The second kind enables the system to terminate a response on the basis of the ASR result for the first fragment of the user utterance. The system stays in this "combining and ASR" state until the ASR result for the combined wav files is obtained, i.e., the "ASR ends" message is received. (In the current implementation of our plug-in, the combination and ASR processes start only after the second fragment is obtained, so a delay of a few seconds, the duration of the combined wav file itself, occurs here.) Because the system cannot make a response based on the ASR results during this period, it produces fillers such as "Well…". This prevents unnatural silences.

5 Evaluation

5.1 Target Data

We used data previously collected for our restaurant-search system. Four dialogue sessions were collected from each of 30 participants: 120 dialogue sessions in total. We obtained VAD and ASR results after performing ASR on the wav files recorded for each dialogue session. The VAD silence margins at the start and end of speech segments were set to 300 and 240 ms (specified by -headmargin and -tailmargin), and the fixed level threshold for speech input detection was set to 500 in the 16-bit range (-lv in Julius). We used statistical language models of the restaurant domain, whose vocabulary size was 39,900. We obtained 6,615 utterances (VAD results), of which 1,564 were short noise segments.

Our target data should be utterance pairs needing restoration due to incorrect segmentation, because the proposed method does nothing for utterances that need no restoration. We first selected 376 utterance pairs (175 were originally single utterances and 201 were not) that satisfied the following two conditions and thus possibly required restoration:

• Pairs of VAD results that were close in time
• Pairs including two user utterances (not just noise segments)

First, we selected pairs of VAD results with utterance intervals shorter than 2000 ms. That is, we assumed that restoration was not required for pairs with intervals longer than 2000 ms because there was almost no possibility that the pairs were originally a single utterance. Next, we excluded the pairs of VAD results in which either or both were shorter than 800 ms. Most such VAD results were noise. This condition reflects the fact that we had rejected VAD results shorter than 800 ms when the data were collected (the -rejectshort option of Julius was used for this purpose).
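As a concrete illustration of the selection criteria above, the following sketch (ours, not the authors' code) filters adjacent VAD segments by the 2000 ms interval and 800 ms length constraints; the (start_ms, end_ms) representation and the function name are assumptions made for the example.

```python
# Sketch of the candidate-pair selection in Sect. 5.1 (illustrative, not the authors' code).
# Each VAD result is assumed to be a (start_ms, end_ms) tuple, listed in temporal order.

def select_candidate_pairs(vad_segments, max_interval_ms=2000, min_length_ms=800):
    """Return adjacent segment pairs that possibly require restoration."""
    candidates = []
    for prev, curr in zip(vad_segments, vad_segments[1:]):
        interval = curr[0] - prev[1]          # utterance interval between the two fragments
        prev_len = prev[1] - prev[0]
        curr_len = curr[1] - curr[0]
        if interval < max_interval_ms and prev_len >= min_length_ms and curr_len >= min_length_ms:
            candidates.append((prev, curr))   # possibly a single utterance split by a short pause
    return candidates

# Example: a 1.2 s fragment and a 0.9 s fragment separated by a 400 ms gap are kept.
print(select_candidate_pairs([(0, 1200), (1600, 2500)]))  # -> [((0, 1200), (1600, 2500))]
```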
5.2 Determining Utterance Interval Threshold

We analyzed the utterance intervals of the 376 utterance pairs to determine the threshold for the interval by which two utterances are regarded as a single utterance. We manually annotated whether each pair was originally a single utterance after listening to the wav files. The results are shown in Fig. 5; the vertical axis shows the frequency of utterance pairs, and the horizontal axis shows the millisecond groupings.

Fig. 5 Results of manual determination

The figure shows that the shorter interval groupings included more pairs that were originally a single utterance, while the longer interval groupings included more independent utterances and noise segments. We set the threshold from the results shown in Fig. 5; it was set to minimize decision errors when we regard pairs with intervals shorter than the threshold as being originally a single utterance. As a result, the threshold was set to 900 ms, and 11 (6 %) of the 175 pairs that were originally single utterances were wrongly excluded from our restoration target.

5.3 Evaluation of ASR Restoration

We evaluated our ASR restoration method by using only utterances that seemed to require ASR restoration, because our method does not affect other utterances. In summary, 153 of the 376 utterance pairs were used as the evaluation target data because they satisfied three conditions:

• The two VAD results seemed to originally be a single utterance because the interval between them was shorter than 900 ms, as determined in the previous section
• Both utterances were longer than 800 ms
• The transcription of the original utterance contained at least one keyword

The third condition derives from the fact that we use an interpretation accuracy defined in terms of keywords, so utterances without keywords do not affect the accuracy. More specifically, we regarded an utterance pair as correctly interpreted when the keywords in its manual transcription were correctly contained in its ASR result. The keyword set contained names of places, stores, and stations that are important in this domain, that is, for searching restaurant information. The number of keywords was 2,789, which we set manually.

We evaluated performance under five conditions. Under the first two, incorrectly segmented utterances are not integrated; instead, a result from one or the other fragment is selected. Under the remaining three, the fragments are integrated using different approaches.

Cond 1: Use the ASR result for the first fragment
Cond 2: Use the ASR result for the second fragment
Cond 3: Simply connect the two segmented ASR results
Cond 4: Combine the two wav files and then perform ASR again on the combined file
Cond 5: Select either the Cond 3 or the Cond 4 result on the basis of the CM

Cond 1 corresponds to a system that ignores a user's barge-in; that is, the second fragment is ignored because it occurs during the system response for the first fragment. Cond 2 corresponds to a system that simply accepts a user's barge-in; that is, the system response for the first fragment is immediately terminated when the second fragment is detected, so only the system response for the second fragment is shown to the user. Conds 3 and 4 are the two integration methods explained in Sect. 4.1. Cond 5 selects a result from those for Conds 3 and 4 on the basis of the confidence measure (CM); the threshold of the CM was set to 0.55 after a preliminary experiment.
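The Cond 5 decision can be pictured roughly as follows. The excerpt does not spell out exactly how the CM is applied, so this sketch is one plausible reading rather than the authors' implementation: the re-recognized combined file is kept when its confidence reaches the 0.55 threshold, and the simple concatenation is used otherwise; the hypothesis structure is an assumption.

```python
# One plausible reading of the Cond 5 selection (illustrative; the chapter excerpt does not
# give the exact rule). If the confidence measure (CM) of the ASR result for the combined
# wav file is high enough, use it (Cond 4); otherwise keep the concatenation of the two
# fragment results (Cond 3).

CM_THRESHOLD = 0.55  # value reported in the text

def select_restored_hypothesis(connected_hyp, combined_hyp, combined_cm):
    """connected_hyp: Cond 3 result (two fragment ASR results joined as text).
    combined_hyp:  Cond 4 result (ASR re-run on the combined audio).
    combined_cm:   confidence measure of combined_hyp."""
    if combined_cm >= CM_THRESHOLD:
        return combined_hyp
    return connected_hyp

# Example usage with made-up hypotheses and confidence.
print(select_restored_hypothesis("shin osaka", "shin-osaka station", 0.72))
```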
The results are shown in the following table.

Table  Interpretation accuracy for five conditions

  Interpretation method                          Accuracy
  No integration   Use only first fragment        43/153 (28 %)
                   Use only second fragment       31/153 (20 %)
  Integration      Connect two ASR results       103/153 (67 %)
                   Combine two wav files         114/153 (75 %)
                   Proposed                      121/153 (79 %)

First, we focus on whether the pairs of ASR results should be integrated or not. For example, the accuracy for the integration conditions (Conds 3–5) was higher than that for the no-integration conditions (Conds 1 and 2). This result may be self-evident in a sense, because the selected target data contained many utterance pairs that were originally a single utterance. Nevertheless, this difference shows that a certain number of utterance pairs exists for which the ASR results became correct after integration. An example of this case is shown in the accompanying figure.

Next, we focus on the differences among the integration methods for the two segmented utterances. Cond 4 outperformed Cond 3, which simply connects the two ASR results, by 8 points. This indicates that combining the two wav files and performing ASR again is effective. A comparison of Cond 5 (the proposed method) with Conds 3 and 4 shows that our approach of selecting results from multiple integration methods is promising.

We compared the results for Conds 3 and 4 to classify the effect of combining two wav files in more detail. The ASR results for 30 utterances were correct with Cond 4 while they were incorrect with Cond 3; that is, the ASR results for these utterances became correct by combining their wav files. Two typical cases were found in these results. First, a user stammered or breathed when saying a long keyword, creating a pause within it. In such cases, the two fragments were difficult to recognize correctly because they were not in the dictionary. Second, some utterances were correctly recognized once they were combined and became longer, because the appropriate language constraint was applied to them. In contrast, the ASR results for 19 utterances were incorrect with Cond 4 although they were correct with Cond 3. This was caused by a side effect of the combination: ASR errors newly occurred due to removing necessary silence margins. Moreover, the language model probability might change by combining wav files. This part needs further investigation for improving accuracy.

Conclusion

We have developed a solution to two problems caused by incorrectly segmented utterances, ASR errors and erroneous turn-taking, and have implemented it as a plug-in for MMDAgent [11]. However, much remains for future work. First, evaluation from the viewpoint of turn-taking is needed; so far we have only evaluated ASR restoration. Second, several parts of the ASR restoration process need to be improved. The ASR restoration should be invoked by considering more features that have been shown to be effective in other studies. Whether two fragments should be integrated or not will be cast as a machine learning problem using more features than the ASR confidence. Since more sophisticated VAD and end-pointing methods would also be helpful, as explained in Sect. 2.2, integration with such methods should also be investigated. Third, we have not yet considered "repaired utterances." There are several studies that treat such repairs, including repetition, correction, and so on [4, 6, 7, 12]; these findings can be used to determine how the segmented utterances should be handled. Finally, the implementation of our plug-in can be improved. Our current plug-in executes ASR for the combined file only after the second fragment ends. The delay would be substantially reduced if it were implemented as an online module.
Acknowledgments This research was partly supported by the JST PRESTO Program and the Naito Science & Engineering Foundation References Baumann T, Schlangen D (2011) Predicting the micro-timing of user input for an incremental spoken dialogue system that completes a user’s ongoing turn In: Proceedings of the annual meeting of the special interest group in discourse and dialogue (SIGDIAL), pp 120–129 Bell L, Boye J, Gustafson J (2001) Real-time handling of fragmented utterances In: Proceedings of the NAACL workshop on adaption in dialogue systems, pp 2–8 Benyassine A, Shlomot E, Yu Su H, Massaloux D, Lamblin C, Petit JP (1997) ITU-T recommendation G.729 annex B: a silence compression scheme for use with g.729 optimized for v.70 digital simultaneous voice and data applications IEEE Commun Mag 35(9):64–73 Core MG, Schubert LK (1999) A syntactic framework for speech repairs and other disruptions In: Proceedings of the annual meeting of the association for computational linguistics (ACL), pp 413–420 http://dx.doi.org/10.3115/1034678.1034742 Edlund J, Heldner M, Gustafson J (2005) Utterance segmentation and turn-taking in spoken dialogue systems In: Computer studies in language and speech, pp 576–587 Georgila K, Wang N, Gratch J (2010) Cross-domain speech disfluency detection In: Proceedings of the annual meeting of the special interest group in discourse and dialogue (SIGDIAL), pp 237–240 Heeman PA, Allen JF (1999) Speech repairs, intonational phrases and discourse markers: modeling speakers’ utterances in spoken dialogue Comput Linguist 25:527–571 Jan EE, Maison B, Mangu L, Zweig G (2003) Automatic construction of unique signatures and confusable sets for natural language directory assistance application In: Proceedings of the European conference speech communication and technology (EUROSPEECH), pp 1249–1252 Katsumaru M, Komatani K, Ogata T, Okuno HG (2009) Adjusting occurrence probabilities of automatically-generated abbreviated words in spoken dialogue systems In: Next-generation applied intelligence Lecture notes in computer science, vol 5579 Springer, Berlin, pp 481–490 http://dx.doi.org/10.1007/978-3-642-02568-6_49 10 Lee A, Kawahara T (2009) Recent development of open-source speech recognition engine Julius In: Proceedings of the APSIPA ASC: Asia-Pacific signal and information processing association, annual summit and conference, pp 131–137 11 Lee A, Oura K, Tokuda K (2013) MMDAgent—a fully open-source toolkit for voice interaction systems In: Proceedings of the IEEE international conference on acoustic, speech and signal processing (ICASSP), pp 8382–8385 12 Liu Y, Shriberg E, Stolcke A, Hillard D, Ostendorf M, Harper M (2006) Enriching speech recognition with automatic detection of sentence boundaries and disfluencies IEEE Trans Audio Speech Lang Process 14(5):1526–1540 http://dx.doi.org/10.1109/TASL.2006.878255 13 Nakano M, Miyazaki N, Ichi Hirasawa J, Dohsaka K, Kawabata T (1999) Understanding unsegmented user utterances in real-time spoken dialogue systems In: Proceedings of the annual meeting of the association for computational linguistics (ACL), pp 200–207 216 K Komatani et al 14 Raux A, Eskenazi M (2008) Optimizing endpointing thresholds using dialogue features in a spoken dialogue system In: Proceedings of the SIGdial workshop on discourse and dialogue, pp 1–10 15 Raux A, Eskenazi M (2009) A finite-state turn-taking model for spoken dialog systems In: Proceedings of the human language technologies: annual conference of the North American chapter of the association for 
computational linguistics (HLT NAACL), pp 629–637 16 Sato R, Higashinaka R, Tamoto M, Nakano M, Aikawa K (2002) Learning decision trees to determine turn-taking by spoken dialogue systems In: Proceedings of the international conference on spoken language processing (ICSLP), pp 861–864 17 Selfridge E, Arizmendi I, Heeman PA, Williams JD (2011) Stability and accuracy in incremental speech recognition In: Proceedings of the annual meeting of the special interest group in discourse and dialogue (SIGDIAL), pp 110–119 18 Shneiderman B (1997) Designing the user interface, 3rd edn Addison-Wesley, New York 19 Singh R, Seltzer ML, Raj B, Stern RM (2001) Speech in noisy environments: robust automatic segmentation, feature extraction, and hypothesis combination In: Proceedings of the IEEE international conferenceon acoustic, speech and signal processing (ICASSP), vol 1, pp 273–276 20 Skantze G, Hjalmarsson A (2010) Towards incremental speech generation in dialogue systems In: Proceedings of the annual meeting of the special interest group in discourse and dialogue (SIGDIAL), pp 1–8 21 Traum D, DeVault D, Lee J, Wang Z, Marsella S (2012) Incremental dialogue understanding and feedback for multiparty, multimodal conversation In: Intelligent virtual agents Lecture notes in computer science, vol 7502 Springer, Berlin, pp 275–288 http://dx.doi.org/10.1007/ 978-3-642-33197-8_29 A Semi-automated Evaluation Metric for Dialogue Model Coherence Sudeep Gandhe and David Traum Abstract We propose a new metric, Voted Appropriateness, which can be used to automatically evaluate dialogue policy decisions, once some wizard data has been collected We show that this metric outperforms a previously proposed metric Weak agreement We also present a taxonomy for dialogue model evaluation schemas, and orient our new metric within this taxonomy Introduction There has been a lot of work in end-to-end evaluation of dialogue systems, but much less so on the dialogue modelling component itself The key task here is: given a context of prior utterances in the dialogue, choose the next system utterance There are many possible ways of evaluating this decision, including whether it replicates an original dialogue move, how close it is to that move (e.g., [4]), and human evaluations of quality or coherence In Sect we provide a taxonomy that organizes types of evaluation along a series of dimensions regarding evaluation metric, evaluator and evaluation context For the purposes of using machine learning for improving dialogue policies, it is critical to have a high-quality automatic evaluation method MDP [8] and POMPD [17] dialogue models are generally evaluated with respect to a reward function, however these reward functions typically function at the level of whole dialogues and not specific choices (even though reinforcement learning models estimate the contribution of individual moves) There is still much work needed in picking good reward functions, and this task is much harder, when the metric of importance concerns dialogue coherence rather than task success S Gandhe (B) · D Traum Institute for Creative Technologies, University of Southern California, Los Angeles, CA 90094, USA e-mail: srgandhe@gmail.com D Traum e-mail: traum@ict.usc.edu © Springer International Publishing Switzerland 2016 A Rudnicky et al (eds.), Situated Dialog in Speech-Based Human-Computer Interaction, Signals and Communication Technology, DOI 10.1007/978-3-319-21834-2_19 217 218 S Gandhe and D Traum We propose a semi-automated evaluation paradigm, similar to 
BLEU used in machine-translation [10] or ROUGE, used in summarization [9], and improving on the previously proposed metric weak-agreement [2] In this paradigm, a set of human “wizards” make the same decisions that the system will have to make, and this data is used to evaluate a broader set of system decisions This approach is particularly appropriate in a selection paradigm for producing system utterances, where the system (or wizard) selects from a corpus of previous dialogue utterances rather than generating a novel utterance The work described in this paper is done within the scope of Virtual Human Dialogue Systems Virtual Humans are autonomous agents who can play the role of humans in simulations [14] Virtual Human characters have proved useful in many fields; some have been used in simulations for training negotiation skills [13] or tactical questioning skills [12]; some virtual humans are used in settings where a face-to-face conversation can have a stronger impact in presenting some information (e.g., a Virtual Nurse used for counseling hospital patients who have inadequate health literacy at the time of discharge [1], Museum Docents promoting science and technology interests in middle school students [11]); some virtual humans are used as non-playing characters in interactive games (e.g., [7]) Although different virtual humans may have different sets of goals, one common requirement for all of them is the ability to take part in natural language conversations Evaluation Schema for Conversational Dialogue Models Evaluating a dialogue model requires making a series of decisions Figure shows a schematic representation of such decisions for evaluation of dialogue models The first decision is which evaluation metric to use This is dependent on the goals of the dialogue system In case of a task-oriented dialogue system, some suitable Fig A schematic representation of various decision factors in evaluating dialogue models for virtual humans A Semi-automated Evaluation Metric for Dialogue Model Coherence 219 choices for an evaluation metric are user satisfaction, task success rate, task efficiency, etc [16] For tutoring dialogue systems, some suitable evaluation metrics can be user satisfaction or learning gain as measured by differences between post-test and pretest scores [3] Since the goal for virtual humans is to be as human-like as possible, a suitable evaluation metric for virtual human dialogue systems is how appropriate or human-like the responses are for a given dialogue context These evaluation metrics can be subjective or objective and can measured at different levels of granularity such as utterance-level, dialogue-level, user-level, etc The next decision is who evaluates the dialogue models The dialogue models we need to evaluate are designed to be part of a virtual human who will engage human users in natural language conversations Judging appropriateness of a response utterance given a dialogue context in such conversations is not an easy process and may require human-level intelligence This is why human judges are a natural choice for such a subjective evaluation metric Although humans are best suited to evaluate appropriateness of responses, using humans as judges is costly and time-consuming For these and other reasons, automatic evaluation becomes an attractive alternative The next decision criterion is how the dialogue model to be evaluated is used in the process of generating response utterances and the corresponding dialogue contexts There are two possible settings 
Dynamic Context and Static Context Figure shows a schematic representations for these different settings Dynamic Context In dynamic context evaluation, the dialogue model is used for generating the response utterances as well as the dialogue contexts with respect to which the subsequent responses are evaluated In this case, we build a dialogue system using the dialogue model that needs to be evaluated A human user interacts with this dialogue system The system’s response is the top-ranked response utterance for the given dialogue context as ranked by the dialogue model (a) (b) (c) Fig Schematic representation of dynamic context and static context evaluation settings a Original human-human dialogue, b dynamic context setting, c static context setting 220 S Gandhe and D Traum Figure 2b shows first two stages of the dynamic context evaluation process At first, the user produces an utterance P1 Based on the context P1 , the dialogue model being evaluated produces the response utterance S2 This response may be different from utterance S2 , which was the response in original human-human dialogue (Fig 2a) The user continues the dialogue and responds to the system’s response with utterance P3 The next response from the system produced by the dialogue model being evaluated is based on the context P1 , S2 , P3 This context is dependent on the dialogue model being evaluated Thus during dynamic context evaluation the resulting dialogue (and the intermediate dialogue contexts) are generated through an interactive process between a human user and a dialogue model If an inappropriate response is chosen by the dialogue model then it becomes part of the context used to select the next response Thus the dialogue model has the potential to recover from its errors or to build on them System’s responses are evaluated for appropriateness with respect to the same contexts that were used to generate them Static Context In static context evaluation the dialogue model is used for generating only the response utterances The dialogue contexts are not affected by the specific dialogue model being evaluated These dialogue contexts are extracted from actual in-domain human-human dialogues For every turn whose role is to be played by the system, we predict the most appropriate response in place of that turn given the dialogue context Figure 2c shows first two stages of the static context evaluation process The first system response is generated based on the context P1 and is S2 , the same as in the case of dynamic context But for the second response from the system, the context is reset to P1 , S2 , P3 the same as the original human-human dialogue and does not depend on the dialogue model being evaluated The system’s response then is S4 , which can be different from both S4 (human-human) and S4 (dynamic context) Again, the system’s responses are evaluated for appropriateness with respect to the same contexts that were used to generate them The next decision criterion in evaluating dialogue models is whether the evaluator takes part in the conversation If we require that the evaluator participates in the dialogue then each dialogue can be evaluated by only one evaluator—the participant himself This evaluation scheme assumes that the conversational participant is in the best position to judge the appropriateness of the response The Turing test [15] calls for such a dynamic context evaluation by the participant where instead of appropriateness, the evaluation metric is whether the conversational participant is human or 
machine Although evaluation by a dialogue participant is the most faithful evaluation possible, it is costly As only one evaluator can judge a dialogue, we need to create a large enough test corpus by conducting conversations with the system Moreover, volunteers may find playing two roles (dialogue participant and evaluator) difficult In such cases, evaluation by a bystander (overhearer) can be a suitable alternative In this type of evaluation the evaluator does not actively participate in the conversation and more than one evaluator can judge a dialogue for appropriateness of responses Free ebooks ==> www.Ebook777.com A Semi-automated Evaluation Metric for Dialogue Model Coherence 221 In case of multiple judges, the average of their judgments is used as a final rating for appropriateness For static context evaluation, the evaluator is always a bystander if s/he doesn’t take part in creating the original human-human dialogue Automatic Static Context Evaluation Recently we evaluated dialogue models for a Virtual Human Dialogue System We used the negotiation scenario where a human trainee tries to convince a virtual doctor to move his clinic [13] We conducted a Static Context evaluation of response appropriateness using human judges [5] We evaluated computer dialogue models and wizard dialogue models as upper human-level baselines For wizard dialogue models, we collected data from four wizards as to which utterances are appropriate responses for given dialogue contexts using the tool described in [6] The data collected from wizards is used to build two models: Wizard Max Voted model, which returns the response with the maximum number of votes from the four wizards; and Wizard Random model, which returns a random utterance from the list of all utterances marked as appropriate by any of the wizards We also collected ratings for appropriateness of responses from different dialogue models on a scale of 1–5 (1 being very inappropriate response and perfectly appropriate) The ratings were provided by four human judges for the same dialogues as used in wizard data collection.1 This results in a collection of appropriateness ratings for a total of 397 unique pairs of u t , contextt , where u t is a response utterance for a dialogue context contextt We use this data for proposing and evaluating automatic evaluation measures in static context setting 3.1 Weak Agreement DeVault et al [2] used an automatic evaluation measure based on wizard data collection for evaluating various dialogue models in a static context setting The dialogue models evaluated in that study operate at the dialogue act level and consequently the wizard data collection is also done at the dialogue act level Their proposed automatic evaluation, weak agreement, judges the response dialogue act for a given context as appropriate if any one of the wizards has chosen that dialogue act as an appropriate response In their study DeVault et al not correlate this automatic measure with human judgments of appropriateness Let R(u t , contextt ) denote the average appropriateness of the response utterance u t for the dialogue context contextt as judged by the four human judges Also let W (contextt ) be the union of set of responses judged appropriate for the dialogue context contextt by the four wizards Then following [2], an automatic evaluation for response appropriateness along the lines of weak agreement can be defined as, Two of the judges also performed the role of the wizards, but the wizard data collection and the evaluation tasks were 
separated by a period of several months.

Rweak(u_t, context_t) = { Appropriate response,    if u_t ∈ W(context_t)
                          Inappropriate response,  if u_t ∉ W(context_t)        (1)

In order to test the validity of this automatic evaluation metric (Rweak), we correlate it with human judgments (R). This correlation can be computed either at the level of an individual response (i.e., for every unique pair ⟨u_t, context_t⟩) or at the system level (i.e., by aggregating the ratings over each dialogue model). The Pearson's correlation between Rweak and R is 0.485 (p < 0.001, n = 397) at the individual response level and 0.803 (p < 0.05, n = 7) at the system level. Although we report both correlation values, we are primarily interested in comparing dialogue models with each other, so we focus on the system-level correlation. Weak agreement (Rweak) turns out to be a good evaluation understudy for judging the appropriateness of responses given a dialogue context, especially at the system level.

3.2 Voted Appropriateness

We made an observation regarding Rweak which may lead to an improvement. According to weak agreement, we should expect the Wizard Max Voted and Wizard Random models to have the same appropriateness rating (by definition in (1)). Instead, we observe that the Wizard Max Voted model receives significantly higher appropriateness ratings than Wizard Random. This indicates that not all responses chosen by wizards are judged as highly appropriate by other judges. It also suggests that more votes from wizards for a response utterance are likely to result in higher appropriateness ratings. Based on these observations, we propose an evaluation understudy, Voted Appropriateness (Rvoted). Let V(u_t, context_t) be the number of wizards who chose the utterance u_t as an appropriate response to the dialogue context context_t. Following PARADISE [16], which models user satisfaction as a linear regression of observable dialogue features, we model Rvoted as a linear regression based on V:

Rvoted(u_t, context_t) = α_0 + α_1 · V(u_t, context_t)        (2)

The accompanying figure shows the appropriateness rating (R) as judged by human judges for response utterances as a function of the number of wizard votes (V) received by those response utterances. For this analysis we use only distinct pairs ⟨u_t, context_t⟩ (n = 397). We fit a linear regression model to this data. The number of votes received, V, is a significant factor in estimating R (p < 0.001). The final linear model estimated from all available data is Rvoted = 3.549 + 0.449·V. The fraction of variance explained by the model is 0.238.

Fig. Appropriateness of responses (R) as judged by human judges, plotted against the number of wizard votes (V) received by those responses. The dashed line indicates a fitted linear model. A small amount of jitter is added to V for visualization.

To verify whether a simple linear regression model can be used as an automatic evaluation for the static context setting, we perform a fivefold cross-validation analysis. During each fold, we hold out the data corresponding to one of the dialogues and train a linear model on the rest of the data. We use this trained model to compute voted appropriateness (Rvoted) for the held-out data and then correlate it with the actual observed value of the appropriateness rating (R) as judged by humans.
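A minimal sketch of how the two metrics could be computed from wizard annotations, assuming the wizard data is stored as a set of approved utterances per context plus per-utterance vote counts; this is illustrative code, not the authors' implementation, and it fits equation (2) by ordinary least squares.

```python
# Sketch of the two automatic metrics (illustrative; data structures are assumptions).
# wizard_sets:  dict mapping context_id -> set of utterances marked appropriate by any wizard.
# wizard_votes: dict mapping (utterance, context_id) -> number of wizards who chose it.

def r_weak(utterance, context_id, wizard_sets):
    """Weak agreement: appropriate iff at least one wizard chose this response."""
    return utterance in wizard_sets.get(context_id, set())

def fit_voted_appropriateness(vote_counts, human_ratings):
    """Fit Rvoted = a0 + a1 * V by ordinary least squares on (votes, rating) pairs."""
    n = len(vote_counts)
    mean_v = sum(vote_counts) / n
    mean_r = sum(human_ratings) / n
    cov = sum((v - mean_v) * (r - mean_r) for v, r in zip(vote_counts, human_ratings))
    var = sum((v - mean_v) ** 2 for v in vote_counts)
    a1 = cov / var
    a0 = mean_r - a1 * mean_v
    return a0, a1          # the chapter reports a0 = 3.549, a1 = 0.449 on its data

def r_voted(utterance, context_id, wizard_votes, a0, a1):
    """Voted appropriateness: a linear function of the number of wizard votes V."""
    v = wizard_votes.get((utterance, context_id), 0)
    return a0 + a1 * v
```

With the coefficients reported above, a response chosen by two wizards would be scored at 3.549 + 0.449 × 2 ≈ 4.45 on the 1–5 scale.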
The Pearson's correlation between Rvoted and R is 0.479 (p < 0.001, n = 397) at the individual response level. At the system level, the Pearson's correlation between Rvoted and R is 0.893 (p < 0.01, n = 7). At the system level, Rvoted is thus a better evaluation understudy than Rweak. The accompanying figure shows a comparison between these two possible evaluation measures for automatic evaluation of appropriateness in the static context setting.

Fig. Comparison between two automatic evaluation understudy measures at the system level in the static context setting

3.3 Discussion

Different resources are required to build different automatic evaluation measures. For Rweak, we need to collect wizard data. When this data is collected at the surface text level, we need a substantial number of wizards (four or more), each selecting a large number of appropriate responses for each context. For the automatic evaluation measure Rvoted, in addition to the wizard data we need resources to estimate the linear regression model: as training data, we need human evaluators' appropriateness ratings for responses given the dialogue contexts.

Automatic evaluation for the static context setting thus involves human effort for collecting wizard data and appropriateness ratings. But since the resources are collected at the surface text level, non-experts can accomplish this task. An appropriate tool which can ensure a wide variety of appropriate responses proves useful here. Moreover, since the static context setting uses a fixed set of contexts, wizard data collection needs to be performed only once. The resulting automatic evaluation metrics can then be used to compare different dialogue models.

When using the Voted Appropriateness evaluation method, the training data used for linear regression should represent all possible responses adequately. The data used to fit our model includes relatively well-performing models, which results in a rather high intercept value of 3.549. For any model producing responses that are not judged appropriate by any of the wizards, our model would predict an appropriateness value of 3.549, which seems rather high.
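To make the individual-level versus system-level distinction concrete, the sketch below (our illustration, with an assumed record layout) correlates predictions and human ratings either over all response pairs or after averaging them per dialogue model.

```python
# Sketch of individual-level vs. system-level correlation (illustrative bookkeeping).
# records: list of (model_name, predicted_score, human_rating) triples.

from collections import defaultdict
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def individual_level_correlation(records):
    """Correlate over every (utterance, context) pair directly (n = 397 in the chapter)."""
    return pearson([p for _, p, _ in records], [r for _, _, r in records])

def system_level_correlation(records):
    """Aggregate ratings per dialogue model, then correlate the aggregates (n = 7 models)."""
    by_model = defaultdict(lambda: ([], []))
    for model, predicted, rating in records:
        by_model[model][0].append(predicted)
        by_model[model][1].append(rating)
    pred_means = [mean(p) for p, _ in by_model.values()]
    human_means = [mean(r) for _, r in by_model.values()]
    return pearson(pred_means, human_means)
```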
ebooks ==> www.Ebook777.com A Semi-automated Evaluation Metric for Dialogue Model Coherence 10 11 12 13 14 15 16 17 225 of the main conference on human language technology conference of the North American chapter of the association of computational linguistics, HLT-NAACL ’06 Association for Computational Linguistics, Stroudsburg, PA, USA, pp 264–271 http://dx.doi.org/10.3115/ 1220835.1220869 Gandhe S, Traum D (2008) Evaluation understudy for dialogue coherence models In: Proceedings of the 9th SIGdial workshop on discourse and dialogue Association for Computational Linguistics, Columbus, Ohio, pp 172–181 http://www.aclweb.org/anthology/W/W08/W080127 Gandhe S, Traum D (2013) Surface text based dialogue models for virtual humans In: Proceedings of the SIGDIAL 2013 conference Association for Computational Linguistics, Metz, France, pp 251–260 http://www.aclweb.org/anthology/W/W13/W13-4039 Gandhe S, Traum D (2014) SAWDUST: a semi-automated wizard dialogue utterance selection tool for domain-independent large-domain dialogue In: Proceedings of the 15th annual meeting of the special interest group on discourse and dialogue (SIGDIAL) Association for Computational Linguistics, Philadelphia, PA, USA, pp 251–253 http://www.aclweb.org/anthology/ W14-4333 Gustafson J, Bell L, Boye J, Lindström A, Wirén M (2004) The nice fairy-tale game system In: Strube M, Sidner C (eds) Proceedings of the 5th SIGdial workshop on discourse and dialogue Association for Computational Linguistics, Cambridge, Massachusetts, USA, pp 23–26 Levin E, Pieraccini R, Eckert W (1997) Learning dialogue strategies within the Markov decision process framework In: Proceedings of the 1997 IEEE workshop on automatic speech recognition and understanding, pp 72–79 doi:10.1109/ASRU.1997.658989 Lin CY, Hovy E (2003) Automatic evaluation of summaries using n-gram co-occurrence statistics In: NAACL ’03: Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology Association for Computational Linguistics, Morristown, NJ, USA, pp 71–78 http://dx.doi.org/10.3115/ 1073445.1073465 Papineni KA, Roukos S, Ward T, Zhu WJ (2001) Bleu: a method for automatic evaluation of machine translation In: Technical report RC22176 (W0109-022), IBM Research Division http://citeseer.ist.psu.edu/papineni02bleu.html Swartout W, Traum D, Artstein R, Noren D, Debevec P, Bronnenkant K, Williams J, Leuski A, Narayanan S, Piepol D, Lane C, Morie J, Aggarwal P, Liewer M, Chiang JY, Gerten J, Chu S, White K (2010) Ada and grace: toward realistic and engaging virtual museum guides In: Proceedings of the 10th international conference on Intelligent virtual agents, IVA’10 Springer, Berlin, pp 286–300 http://dl.acm.org/citation.cfm?id=1889075.1889110 Traum D, Leuksi A, Roque A, Gandhe S, DeVault D, Gerten J, Robinson S, Martinovski B (2008) Natural language dialogue architectures for tactical questioning characters In: Proceedings of 26th army science conference Traum D, Swartout W, Gratch J, Marsella S (2005) Virtual humans for non-team interaction training In: AAMAS-05 workshop on creating bonds with humanoids Traum D, Swartout W, Gratch J, Marsella S (2008) A virtual human dialogue model for nonteam interaction Text, speech and language technology, vol 39 Springer, New York, pp 45–67 doi:10.1007/978-1-4020-6821-8 Turing AM (1950) Computing machinery and intelligence Mind 59:433–460 http://cogprints org/499/ Walker M, Kamm C, Litman D (2000) Towards developing general models of usability with 
paradise Natural language engineering: special issue on best practice in spoken dialogue systems http://citeseer.ist.psu.edu/article/walker00towards.html Williams JD, Young S (2007) Partially observable Markov decision processes for spoken dialog systems Comput Speech Lang 21:393–422 www.Ebook777.com ... decrease human- computer trust [7] In human- human interaction moments of unclear, not reasonable decisions by one party are often clarified by explaining the process of reasoning (i.e., increasing... explanations on the human- computer trust relationship Human- computer trust has shown to be very important in keeping the user motivated and cooperative in a human- computer interaction Especially... of advanced dialogue managers RavenClaw [2] separates from the domain level some domain-independent aspects of dialogue management, including turn taking, timing, and error handling In contrast,