STOCHASTIC NATURAL LANGUAGE GENERATION FOR SPOKEN DIALOG SYSTEMS

Alice H. Oh, Alexander I. Rudnicky
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Contact Information
Alice Oh (aoh@ai.mit.edu), Artificial Intelligence Laboratory, 200 Technology Square, Rm 812, Cambridge, MA 02139, USA
Alex Rudnicky (air+@cs.cmu.edu), School of Computer Science, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA

Abstract

N-gram language models have been found useful for many automatic speech recognition applications. Although it is not clear whether simple n-grams can adequately model human language, we show an application of this ubiquitous modeling technique to the task of natural language generation (NLG). This work shows that it is possible to employ a purely corpus-based approach to NLG within a spoken dialog system. In this paper, we discuss applying this corpus-based stochastic language generation at two levels: content selection and sentence planning/realization. At the content selection level, the utterances are modeled by bigrams, and the appropriate attributes are chosen using the bigram statistics. For sentence planning and realization, the utterances in the corpus are modeled by n-grams of varying length, and each new utterance is generated stochastically. This paper presents the details of the implementation in the CMU Communicator, some preliminary evaluation results, and the potential contribution to the general problem of natural language generation for spoken dialog, written dialog, and text generation.

Introduction

Natural language generation (NLG) is the process of generating text from a meaning representation. It can be thought of as the reverse of natural language understanding (NLU) (see Figure 1). While it is clear that NLG is an important part of natural language processing, there has been considerably less research activity in NLG than in NLU. This can be partly explained by the reality that NLU, at least until now, has had more application potential,
due to the enormous amount of text present in the world (Mellish and Dale, 1998). In contrast, it is unclear what the input to NLG should be, and other than some systems in which input to NLG is automatically created by another module, all input must be created somehow just for this purpose.

(Insert Figure 1 about here)

Nevertheless, NLG plays a critical role in applications such as text summarization, machine translation, and dialog systems. Presently, NLG researchers are actively advancing the field of text generation, and as a result, many useful technologies are being developed. However, there is still a critical shortage of adequate NLG technologies for spoken dialog systems, where NLG plays an especially important role. The current work, along with other related efforts such as Baptist and Seneff (2000) and Ratnaparkhi (2000), strives to highlight the importance of NLG research in spoken dialog systems and to initiate progress in it.

In developing and maintaining a natural language generation (NLG) module for a spoken dialog system, we recognized the limitations of the current NLG technologies for our purposes. While several general-purpose rule-based generation systems have been developed (cf. Elhadad and Robin, 1996), they are often quite difficult to adapt to small, task-oriented applications because of their generality. To overcome this problem, several researchers have proposed different solutions. Bateman and Henschel (1999) have described a lower-cost and more efficient generation system for a specific application using an automatically customized subgrammar. Busemann and Horacek (1998) describe a system that mixes templates and rule-based generation. This approach takes advantage of templates and rule-based generation as needed by specific sentences or utterances. Stent (1999) has also proposed a similar approach for a spoken dialog system. However, for all of these, there is still the burden of writing grammar rules and acquiring the appropriate lexicon. Because comparatively
less effort is needed, many current dialog systems use template-based generation. But there is one obvious disadvantage to templates: the quality of the output depends entirely on the set of templates. Even in a relatively simple domain, such as travel reservations, the number of templates necessary for reasonable quality can become quite large (on the order of hundreds), such that maintenance becomes a serious problem. There is an unavoidable tradeoff between the amount of time and effort spent creating and maintaining templates and the variety and quality of the output utterances. This will become clear when we present the details of our previous template system in Section 4.2.1.

Recognizing these limitations of the rule-based and template-based techniques, we developed a novel approach to natural language generation. It features a simple yet effective corpus-based technique that uses statistical models of task-oriented language spoken by domain experts to generate system utterances. We have applied this technique to sentence realization and content planning, and have incorporated the resulting generation component into a working spoken dialog system. In our evaluation experiments, this technique performed well for our spoken dialog system. This shows that the corpus-based approach is a promising avenue to explore further.

Outline

This article is structured as follows. Section 2 starts with a high-level view of the general role of NLG in various language technology applications, and then continues with a discussion of the specific role that NLG plays in spoken dialog systems. Section 3 is a survey of some of the popular techniques in NLG. This section will also serve as an introduction to the terminology used throughout the rest of the paper. Section 4 briefly describes the Carnegie Mellon Communicator, how our stochastic NLG fits into the system, and how the NLG module has changed over time. Section 5 gives a detailed description of the corpora we used, and how we prepared them before
building the statistical models. Sections 6 and 7 discuss our implementation at the content planning and surface realization levels, respectively. Section 8 describes our evaluation methodology and presents the results. Section 9 is the conclusion of this paper.

Natural Language Generation for Spoken Dialog Systems

Natural Language Generation (NLG) and Spoken Dialog Systems are two distinct and non-overlapping research fields within language technologies. Most researchers in these two fields have not worked together until very recently. However, since every spoken dialog system needs an NLG component, spoken dialog researchers have started to look at what technologies the NLG community can provide them. While that may be a good approach, it is also possible for the spoken dialog community to take the initiative and contribute to the NLG community. This work is one of the first attempts at such a contribution, along with Stent (1999), Ratnaparkhi (2000), and Baptist and Seneff (2000). It is not merely an application of well-developed NLG techniques to a spoken dialog system; it introduces a novel generation technique potentially well suited to many applications. In this section, we compare the different roles that NLG must play in text-based applications and in spoken dialog systems.

2.1 Natural Language Generation

In this section, we give a general introduction to NLG systems for text-based applications, such as machine translation. By doing so, we hope to set the stage for describing how the needs of spoken dialog systems differ from those of the applications for which most NLG technologies were developed.

One of the most important applications of NLG is machine translation (MT). In an MT system where interlingua is used as an intermediate representation between the source language and the target language, NLG is an essential component that maps interlingua to text in the target language (see Figure 2). One example of such a system is Nitrogen (Langkilde and Knight, 1998), the
generation component used in Gazelle, a broad-coverage machine translation system.

(Insert Figure 2 about here)

As one can imagine, building an MT system for written documents is a difficult task. The machine-generated output must be coherent, grammatical, and correct. Even if the source language is mapped correctly to a representation in the interlingua, it is the generation system's responsibility to plan the structure of the final text, choose the right words, and put the words in the correct order according to the grammar of the target language.

One of the reasons that building an NLG system is so difficult is that the generation grammar requires the input to NLG to be rich with many linguistic details. As an example, we can look at the Nitrogen system, which is one of the more flexible systems, but still requires a fair amount of linguistic detail (e.g., θ-roles, such as agent and patient; see Figure 3) that can be daunting for some applications. While it may not pose much of a problem for MT systems, where the linguistic features can be extracted from the source language, there are other applications of NLG where this is a major problem.

(Insert Figure 3 about here)

Some other applications in which NLG plays an essential role include personalized letter writing (Reiter et al., 2000) and report generation systems, including multilingual report generation (Goldberg et al., 1994). In all of these text-based applications, the focus of the NLG component is on automatically generating well-structured, grammatical, and well-written text. In the next section, we discuss some of the different issues that arise when designing an NLG component for spoken dialog systems.

2.2 Spoken Dialog Systems

A spoken dialog system enables human-computer interaction via spoken natural language. A task-oriented spoken dialog system speaks as well as understands natural language to complete a well-defined task. This is a relatively new research area, but many task-oriented spoken dialog systems are already
fairly advanced. Examples include a complex travel planning system (Rudnicky et al., 1999), a publicly available worldwide weather information system (Zue et al., 2000), and an automatic call routing system (Gorin et al., 1997).

Building an NLG component for a spoken dialog system differs from building the more traditional NLG systems for generating documents, but it is a very interesting problem that can provide a novel way of looking at NLG. The following are some characteristics of task-oriented spoken dialog systems that raise unconventional issues for the developer of the NLG component.

- The language used in spoken dialog is different from the language used in written text. Spoken utterances are generally shorter than sentences in written text. Spoken utterances follow grammatical rules, but much less strictly than written text. Also, the syntactic structures used tend to be much simpler and less varied than those in written text. We expect this is due to the limited cognitive capacity of humans. In a spoken dialog, simple and short utterances may be easier to say and understand, thus enabling more effective communication.[1]
- The language used in task-oriented dialogs is largely domain-specific. The domains for the current dialog systems are fairly narrow (e.g., flight reservation, weather information). Hence, the lexicon for a given spoken dialog system is small and domain-specific.
- NLG is usually not the main focus in building and maintaining these systems. Yet the NLG module is critical to system performance and user satisfaction. In a telephone-based dialog system, NLG and text-to-speech (TTS) synthesis are the only modules users will experience directly, but with limited development resources, NLG has traditionally been overlooked by spoken dialog system developers.

[1] Note that the unit utterance is different from the unit turn; a turn may consist of several utterances.

Taking these characteristics into account, NLG for task-oriented spoken dialog systems must
be able to
- generate language appropriate for spoken interaction; system utterances must not be too lengthy or too syntactically complex
- generate domain-specific language; the lexicon must contain appropriate words for the domain
- enable fast prototyping; development of the NLG module should not be the bottleneck in developing the whole dialog system

Also, to satisfy the overall goal of the task-oriented spoken dialog system, the NLG component must be able to carry out a natural conversation, elicit appropriate responses from the user, prevent user confusion, and guide the user in cases of confusion.

Survey of NLG techniques

In this section, we take a step back and give a general survey of the techniques used in NLG. Before doing so, let us take a look at the components of NLG. Although terms such as “surface realization” and “lexical selection” are used throughout the NLG literature (e.g., this paper), there is little consensus among NLG researchers as to where the boundaries are, and what each of the sub-tasks spans. (In response to this problem, the RAGS project is aiming to establish a reference architecture of NLG; see Cahill et al., 1999.) Table illustrates three different segmentations. Although we tried to align similar sub-tasks, there can be many differences among the three segmentations about what each sub-task actually does.

(Insert Table about here)

Using the most general segmentation in the first column, content planning entails taking the overall communicative goal of the document, breaking it down into smaller sub-goals, and laying out the ordering of those sub-goals in a coherent manner. In spoken dialog systems, the dialog manager is responsible for content planning. In sentence planning, the appropriate syntactic structure and meaning words are chosen, as well as sentence boundaries. In syntactic realization, the meaning words are connected to produce well-formed strings. In the rest of this section (and in our NLG system), we chose not to cover
content/text planning in detail and concentrate on sentence planning and syntactic realization. This is because content planning is not as relevant in the current spoken dialog systems. Also, for convenience, we will group sentence planning and syntactic realization under the name “surface realization”. The term “surface realization” here therefore also includes some aspects of sentence planning. This conflicts with the use of the term in Mellish and Dale (1998), but for template generation and NLG in spoken dialog systems, combining sentence planning into surface realization seems more natural.

3.1 Surface Realization

A definition of surface realization given in Mellish and Dale (1998) is as follows:

“Determining how the underlying content of a text should be mapped into a sequence of grammatically correct sentences … An NLG system has to decide which syntactic form to use, and it has to ensure that the resulting text is syntactically and morphologically correct.”

One technique for surface realization is using templates, which is widely used in simple text generation applications. At the other end of the spectrum is a technique based on generation grammar rules, which most research systems have focused on. Recently, there has been work on hybrid systems and corpus-based methods. The next section describes the existing

[...]

then there is a separate model for the distribution of those words in the class. Similarly, in our model, we replace the words in a word class with the tag, treating them as a single unique word in the n-gram model.

(Insert Figure about here)

A separate set of class models is not needed, however, since the word classes represent the attributes, for which the values are passed in the input frame from the dialog manager. This is similar to having slots in templates, and then filling the slots with the values in the input.

Much of the tagging effort can be automated. The dialog system already has a list of words for most of the word classes: list of city names, list of
airports, list of airlines, etc. However, as Figure 7 shows, some of the word classes are more detailed than that. For example, the class “city” should be divided into departure city (depart_city) and arrival city (arrive_city). Tagging first with the more general class, and then applying simple rules such as “from {city}” becomes “from {depart_city}”, fixes this problem. The same principle can be applied to tagging utterance classes, but that requires more manual tagging effort. Figure is an excerpt from our corpus after tagging.

(Insert Figure about here)

5.3 Using models of human-human interaction for human-computer interaction

Several issues arise when using computational models of human-human interaction for spoken dialog systems. First, there are some inherent differences between human-human conversations and human-computer conversations. As Boyce and Gorin (1996) point out, there are some user and system utterances in human-computer interactions that do not occur in normal human-human interactions. These include more frequent confirmations, error messages, and help messages. Similarly, there are some utterances that occur frequently in human-human conversations but not in human-computer conversations, the most obvious being backchanneling (e.g., “uh-huh”, “okay”).

The second issue is that the quality of the output depends very much on the speaker whose language is modeled. This means the selection process for the speaker is crucial for system performance. In the case of a travel reservation system, simply selecting an experienced travel agent is sufficient, but in other domains, it may not be so simple.

Another issue is that while the models of human-human interaction may result in natural dialogs, they may not lead to the most efficient dialogs. That is, it may be possible to design a computer agent that can carry out a more efficient task-oriented dialog than a human expert. We leave these questions unanswered for deeper studies of human computer
interaction.

Content Planning

Content planning is a process where the system decides which attributes (represented as word classes, see Figure 7) should be included in an utterance. In a task-oriented dialog, the number of attributes with specified values generally increases during the course of the dialog, as the user specifies his/her constraints. Therefore, as the dialog progresses, the system needs to decide which ones to include at each system turn. If the system includes all of them every time (indirect echoing, see Hayes and Reddy, 1983), the utterances become overly lengthy (see Dialog in Figure 9), but if we remove all system confirmations, the user may get confused (see Dialog in Figure 9). With a fairly high recognition error rate, this becomes an even more important issue.

(Insert Figure about here)

The problem, then, is to find a compromise between the two. We compared two ways to systematically generate system utterances with only selected attributes, such that the user hears repetition of some of the constraints he/she has specified, at appropriate points in the dialog, without sacrificing naturalness and efficiency. The specific problems, then, are deciding what should be repeated, and when. We first describe a simple heuristic of old versus new information. Then we present a statistical approach, based on bigram models.

6.1 Using Heuristics

As a simple solution, we can use the previous dialog history, by tagging the attribute-value pairs as old (previously said by the system) information or new (not said by the system yet) information. The generation module would select only new information to be included in the system utterances. Consequently, information given by the user is repeated only once in the dialog, usually in the utterance immediately following the user utterance in which the new information was given. When the system utterance uses a template that does not contain the slots for the new information given in the previous user utterance, then that
new information will be confirmed in the next available system utterance in which the template contains those slots.

Although this approach seems to work fairly well, echoing the user’s constraints only once may not be the right thing to do. Looking at human-human dialogs, we observe that this is not very natural for a conversation; humans often repeat mutually known information, and they also often do not repeat some information at all. Also, this model does not capture the close relationship between two consecutive utterances within a dialog. The second approach tries to address these issues.

6.2 Statistical Approach

For this approach, we built a two-stage statistical model of human-human dialogs using the CMU corpus. The model first predicts the number of attributes in the system utterance given the utterance class, then predicts the attributes given the attributes in the previous user utterance.

6.2.1 Models

The number of attributes model

The first model predicts the number of attributes in a system utterance given the utterance class. The model is the probability distribution $P(n_k) = P(n_k \mid c_k)$, where $n_k$ is the number of attributes and $c_k$ is the utterance class for system utterance $k$.

The bigram model of the attributes

This model predicts which attributes to use in a system utterance. Using a statistical model, what we need to do is find the set of attributes $A^* = \{a_1, a_2, \ldots, a_n\}$ such that

$$A^* = \arg\max_A P(a_1, a_2, \ldots, a_n)$$

We assume that the distributions of the $a_i$’s are dependent on the attributes in the previous utterances. As a simple model, we look only at the utterance immediately preceding the current utterance and build a bigram model of the attributes. In other words, $A^* = \arg\max_A P(A \mid B)$, where $B = \{b_1, b_2, \ldots, b_m\}$ is the set of $m$ attributes in the preceding user utterance. If we took the above model and tried to apply it directly, we would run into a serious data sparseness problem, so we make two independence assumptions. The first assumption is that the attributes in the
user utterance contribute independently to the probabilities of the attributes in the system utterance following it. Applying this assumption to the model above, we get the following:

$$A^* = \arg\max_A \sum_{k=1}^{m} P(b_k)\, P(A \mid b_k)$$

The second independence assumption is that the attributes in the system utterance are independent of each other. This gives the final model that we used for selecting the attributes:

$$A^* = \arg\max_A \sum_{k=1}^{m} P(b_k) \prod_{i=1}^{n} P(a_i \mid b_k)$$

Although this independence assumption is an oversimplification, this simple model is a good starting point for our initial implementation of this approach.

Surface Realization

7.1 Corpus-based Approach

If a natural human-computer dialog is one that closely resembles a human-human conversation, the best method for generating natural system utterances would be to mimic human utterances. In our case, where the system is acting as a travel agent, the solution would be to use a human travel agent’s utterances (see Section 5 for details about the training data). The computational model we chose to use is the simple n-gram model used in speech recognition.

7.1.1 Implementation

We have implemented a hybrid NLG module of three different techniques: canned expressions (e.g., “Welcome to the CMU Communicator.”), templates (e.g., “Hello Alice.”), and corpus-based stochastic generation. For example, at the beginning of the dialog, a system greeting can be simply generated by a “canned” expression. Other short and simple utterances can be generated efficiently by templates. Then, for the remaining utterances where there is a good match between human-human interaction and human-computer interaction, we use the statistical language models.

There are four aspects to our stochastic surface realizer: building language models, generating candidate utterances, scoring the utterances, and filling in the slots. We explain each of these below.

Building Language Models

Using the tagged utterances as described in the introduction, we built an unsmoothed
n-gram language model for each utterance class. We selected 5 as the n in n-gram to introduce some variability in the output utterances while preventing nonsense utterances.

Note that language models are not used here in the same way as in speech recognition. In speech recognition, the language model probability acts as a ‘prior’ in determining the most probable sequence of words given the acoustics. In other words,

$$W^* = \arg\max_W P(W \mid A) = \arg\max_W P(A \mid W)\, P(W)$$

where $W$ is the string of words $w_1, \ldots, w_n$, and $A$ is the acoustic evidence (Jelinek, 1998).

Although we use the same statistical tool, we compute and use this generative language model probability directly to predict the next word. In other words, the most likely utterance is $W^* = \arg\max_W P(W \mid u)$, where $u$ is the utterance class. We do not, however, look for the most likely hypothesis, but rather generate each word randomly according to the distribution, as illustrated in the next section.

Generating Utterances

The input to NLG from the dialog manager is a frame of attribute-value pairs. The first two attribute-value pairs specify the utterance class. The rest of the frame contains word classes and their values. See Figure for an example of an input frame to NLG.

The generation engine uses the appropriate language model for the utterance class and generates word sequences randomly according to the language model distributions. As in speech recognition, the probability of a word using the n-gram language model is

$$P(w_i) = P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-(n-1)}, u)$$

where $u$ is the utterance class. Since we have built separate models for each of the utterance classes, we can ignore $u$, and say that

$$P(w_i) = P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-(n-1)})$$

using the language model for $u$.

Since we use unsmoothed 5-grams, we will not generate any unseen 5-grams. This precludes generation of nonsense sequences, at least within the 5-word window. Using a smoothed n-gram would result in more randomness, but with the conventional back-off methods (Jelinek, 1998), the
probability mass assigned to unseen 5-grams would be very small, and those rare occurrences of unseen n-grams may not make sense anyway. There is the problem, as in speech recognition using n-gram language models, that long-distance dependency cannot be captured.

Scoring Utterances

For each randomly generated utterance, we compute a penalty score. The score is based on heuristics we have selected empirically. Various penalty scores are assigned for the following:

- The utterance is too short or too long (determined by utterance-class dependent thresholds).
- The utterance contains repetitions of any of the slots.
- The utterance contains slots for which there is no valid value in the frame.
- The utterance does not have some of the required slots (see Section 6 for deciding which slots are required).

The generation engine generates a candidate utterance and scores it, keeping only the best-scored utterance up to that point. It stops and returns the best utterance when it finds an utterance with a zero penalty score, or when it reaches the limit of 50 iterations. The average generation time for the longest utterance class (10-20 words long) is about 200 milliseconds.

Filling Slots

The last step is filling slots with the appropriate values. For example, the utterance “What time would you like to leave {depart_city}?” becomes “What time would you like to leave New York?”

Evaluation and Discussions

It is generally difficult to evaluate an NLG system, and although more attention has been given to evaluation in recent years, there are several questions the NLG community has yet to answer regarding evaluation of NLG systems (see Mellish and Dale, 1998). In the context of spoken dialog systems, evaluation becomes even more difficult. One reason is simply that there has been little effort in building sophisticated generation engines for spoken dialog systems, and much less in evaluating the generation engines. Another reason is that it is difficult to separate the NLG module from the rest of the
dialog system, especially its very close neighbor, text-to-speech synthesis (TTS). This section briefly discusses some previous evaluation methods in NLG and spoken dialog systems, presents a set of experiments we have conducted for our stochastic generator, and summarizes some issues in evaluating the NLG component of a spoken dialog system.

8.1 Methods used in text generation

Several techniques have been used to evaluate text generation systems, but finding a good evaluation method is still a difficult problem. Whereas natural language understanding (NLU) researchers can use resources such as the Penn Treebank (Marcus et al., 1993) to compare the system’s answer to the “correct parse”, a similar resource is not appropriate for automatic evaluation of NLG because there is no one “correct sentence” for text generation. Hence, many evaluation methods rely on human judgment of accuracy, fluency, and readability of the generated text. The methodologies used include user surveys, measuring comprehension/response times, and comparison of human-generated output and machine-generated output. A good survey of evaluation methods is presented in Mellish and Dale (1998).

8.2 Methods used in spoken dialog systems

NLG in spoken dialog systems can be evaluated on several dimensions. The first is an analysis of the technique, which may include grammar coverage, completeness of lexicon, variety of generated utterances, and other analytical measures of the technique. This dimension plays a major part in building the NLG module, and is an important evaluation criterion. However, we consider it more as a part of system building and maintenance, so we will focus this section on the other dimensions. The second dimension is the quality of the generated output within the embedded system. This is the major concern for any evaluation metric, and will be the major topic of this section. The last dimension is the effect of the output utterances on the user's responses. Not only does the system output need to be
comprehensible, it must elicit a proper response from the user.

8.3 Comparative Evaluation

As a simple solution, we have conducted a comparative evaluation by running two identical systems varying only the generation component. In this section we present results from two preliminary evaluations of our generation algorithms described in the previous sections.

8.3.1 Content Planning: Experiment

For the content planning part of the generation system, we conducted a comparative evaluation of the two different generation algorithms: old/new and bigrams. Twelve subjects had two dialogs each, one with the old/new generation system, and another with the bigrams generation system (in counterbalanced order); all other modules were held fixed. Afterwards, each subject answered eleven questions on a usability survey (see Figure 10). The first seven questions asked the subjects to compare the two systems, and the remaining four questions asked for the user’s information and other comments. Immediately after filling out the survey, each subject was given transcribed logs of his/her dialogs and asked to rate each system utterance on a scale of 1 to 3 (1 = good; 2 = okay; 3 = bad).

(Insert Figure 10 about here)

8.3.2 Content Planning: Results

Table shows the results from the usability survey. The results seem to indicate subjects’ preference for the old/new system, but as the large variances show, the difference is not statistically significant. However, it is worth noting that six out of the twelve subjects chose the bigram system in response to the question “During the session, which system’s responses were easier to understand?” compared to three subjects choosing the old/new system. We can make an interesting conjecture that since we used the statistical models of human-human conversations to generate the content of the system utterances, users found those utterances easier to understand, but confirming this conjecture would require extensive studies beyond the scope of this paper.
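
For reference, the bigram attribute-selection model compared in this experiment (the final model of Section 6.2.1) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the CMU Communicator: the function name `select_attributes`, the dictionary layout, and every probability value below are hypothetical, and the number of attributes n is assumed to have been chosen already by the first-stage model P(n_k | c_k).

```python
from itertools import combinations
from math import prod


def select_attributes(candidates, n, prev_attrs, p_prev, p_cond):
    """Pick the n-attribute set A maximizing sum_k P(b_k) * prod_i P(a_i | b_k).

    candidates: attributes the system could mention
    n:          number of attributes (assumed given by the first-stage model)
    prev_attrs: attributes b_1..b_m in the preceding user utterance
    p_prev:     dict b -> P(b)              (hypothetical layout)
    p_cond:     dict (a, b) -> P(a | b)     (hypothetical layout)
    Unseen events get probability 0, mirroring an unsmoothed model.
    """
    def score(attr_set):
        return sum(
            p_prev.get(b, 0.0) * prod(p_cond.get((a, b), 0.0) for a in attr_set)
            for b in prev_attrs
        )

    # Exhaustive search is fine here because attribute sets are small.
    return max(combinations(sorted(candidates), n), key=score)


# Toy numbers, invented purely for illustration.
p_prev = {"depart_city": 0.6, "depart_date": 0.4}
p_cond = {
    ("depart_city", "depart_city"): 0.7,
    ("arrive_city", "depart_city"): 0.5,
    ("depart_time", "depart_city"): 0.1,
    ("depart_city", "depart_date"): 0.2,
    ("arrive_city", "depart_date"): 0.1,
    ("depart_time", "depart_date"): 0.6,
}
best = select_attributes(
    candidates=["depart_city", "arrive_city", "depart_time"],
    n=2,
    prev_attrs=["depart_city", "depart_date"],
    p_prev=p_prev,
    p_cond=p_cond,
)
print(best)
```

With these toy values the highest-scoring pair is the one that echoes the departure city and asks about the arrival city, which is the kind of selective repetition the bigram system is meant to produce.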
(Insert Table about here)

8.3.3 Surface Realization: Experiment 1

For surface realization, we conducted a batch-mode evaluation. We picked six calls to our system and ran two generation algorithms (template-based generation and stochastic generation) on the input frames. We then presented to seven subjects the transcripts of the generated dialogues, consisting of decoder output of the user utterances and the corresponding system responses, for each of the two generation algorithms. For each utterance that differed between the two systems, subjects selected the output they preferred. The results (see Table 4) show a trend toward subjects preferring stochastic generation to template-based generation, but a t-test across all subjects shows no significant difference (p = 0.18).

(Insert Table about here)

These results are inconclusive. While they may be acceptable evidence that our simple stochastic generator does at least as well as a carefully hand-crafted template system, we wished to conduct another experiment with more subjects. Another problem with this experimental setup was that subjects read written transcripts of the dialogs rather than listening to the dialogs and evaluating them. We set it up that way to simplify the experiment, but our concern after looking at the results was that subjects’ judgments were tainted by the different modality.

8.3.4 Surface Realization: Experiment 2

We conducted this experiment via the web to make it easier for more subjects to participate. As in the first experiment, we picked three dialogs and ran them through the two different generation engines. In this second experiment, we also wished to use audio files of the dialogs. To evaluate our experimental methodology, we also asked the subjects to judge the transcripts of the same dialogs. The set-up was as follows: The subject receives an email message containing a link to the webpage. The first webpage describes the experiment and gives instructions. It also
contains a link to the experiment. The experiment page has links to the audio files (.wav files), which the subjects can click to listen to the dialogs. After listening to two different versions of each of the dialogs (3 pairs of dialogs were used), the subject answers a set of questions (see Figure 11). After rating all three pairs of dialogs, the subject clicks to go on to the next webpage. That webpage contains transcripts of the dialogs the subject just heard. This time, the subject reads the transcripts and picks, for each dialog pair, which one is better.

(Insert Figure 11 about here)

For both the audio and transcript evaluations, the results were again inconclusive. Table shows the results of the questions after listening to the audio files. If we look at the scores for subjects individually, 11 subjects gave better scores for stochastic, for templates, and neutral, but of those, only were statistically significant (3 for stochastic, for templates). Looking at the results from the transcript evaluation, 11 subjects preferred stochastic, preferred templates, and were neutral. Of those, only 2 (both for stochastic) were statistically significant. There appeared to be no correlation between the judgments in the two modalities (r = 0.009). Taken together, the results from all of the experiments indicate that the stochastic generator does as well as the template system.

(Insert Table about here)

9 Conclusion

We felt that the popular techniques in NLG are not adequate for spoken dialog systems, and explored an alternative approach. The aim was to create a high-quality generation system without requiring an expert grammar writer or the cumbersome maintenance of templates. We leveraged the data collection effort done for other parts of the dialog system, applied some heuristics to tag the data, and built stochastic models of human-human interaction. With this corpus-based approach, we were able to capture the domain expert’s knowledge in simple n-gram models. By augmenting our
template system with this stochastic method, we built a hybrid system that produces high-quality output with no complex grammar rules and minimally specified input.

For both content planning and surface realization, our evaluations indicate that the stochastic method performs as well as the carefully crafted template system. Subjects also appeared to prefer the stochastic system, but the preference was not statistically significant. We conducted experiments to evaluate our system, and in doing so we discovered various elements one must be careful about when designing these experiments. Teasing apart NLG from the dialog system is difficult, but with our comparative evaluation scheme it is possible to do so. We believe this will open the door to further progress on the evaluation of NLG in spoken dialog systems.