Quality of Spoken Dialogue Systems

be informative in the data pre-analysis. This set includes DD, STD, UTD, SRD, URD, # TURNS, WPST, WPUT, # BARGE-INS, # SYSTEM ERROR MESSAGES, # SYSTEM QUESTIONS, # USER QUESTIONS, AN:CO, AN:PA, AN:FA, PA:CO, PA:PA, PA:FA, SCR, UCR, CA:AP, CA:IA, IR, IC, UA, WA, WER, [...]

In both sets, the turn-related parameters have been normalized to the overall number of turns (or, for the AN parameters, to the number of user questions), as described in Section 6.2.1.

Input parameters reflecting task success: Although the basic formula of the PARADISE model contains κ as a mandatory input parameter, it has often been replaced by the user judgment on task completion, COMP, in the practical application of the model. This COMP parameter roughly corresponds to a binary version of the judgment on question B1. For the analysis of the experiment 6.3 data, the following options for describing task success have been chosen:
- Task success calculated on the basis of the AVM, either on a per-dialogue level or on a per-configuration level
- Task success measures based on the overall solution
- User judgment on question B1
- A binary version of B1, calculated by assigning a value of 0 for a rating [...] and a value of 1 for a rating B1 > 3.0

It should be noted that B1 and [...] are calculated on the basis of user judgments. Thus, using one of these parameters as an input to a prediction model is not in line with the general idea of quality prediction, namely to become independent of direct user ratings.

Apart from the input and output parameters, the choice of the regression approach influences the resulting model. A linear multivariate analysis like the one used in PARADISE has been chosen here. The choice of parameters which are included in the regression function depends on the number of available parameters. For set 1, a forced inclusion of all four parameters has been chosen.
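A linear regression with forced inclusion of all input parameters can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for experiment 6.3: the function names are hypothetical, the z-score normalization follows the general PARADISE idea of normalizing inputs before regression, and the binarization assumes that every rating not above the threshold is mapped to 0.

```python
import numpy as np


def zscore(a):
    """Normalize each column (or a 1-D vector) to zero mean and unit variance."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0)


def binarize_b1(b1, threshold=3.0):
    """Return 1 for ratings above the threshold and 0 otherwise.

    Assumption: the 0 class is simply the complement of B1 > threshold.
    """
    return (np.asarray(b1, dtype=float) > threshold).astype(int)


def fit_forced_inclusion(X, y):
    """Least-squares fit that uses every column of X (forced inclusion).

    Returns the weights of the normalized inputs and the share of
    variance in the normalized target covered by the fit (R^2).
    Columns with zero variance are not handled here.
    """
    Xn = zscore(X)
    yn = zscore(y)
    A = np.column_stack([np.ones(len(yn)), Xn])  # intercept + inputs
    w, *_ = np.linalg.lstsq(A, yn, rcond=None)
    residual = yn - A @ w
    centered = yn - yn.mean()
    r2 = 1.0 - (residual @ residual) / (centered @ centered)
    return w[1:], r2
```

In such a sketch, the columns of `X` would hold the four "dialogue cost" parameters plus one task-success-related parameter per dialogue, and `y` the chosen user-satisfaction target.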
For set 2, a stepwise inclusion method is more appropriate because of the large number of input parameters. The stepwise method sequentially includes the variables with the highest partial correlation values with the target variable (forward step), and then excludes variables with the lowest partial correlation (backward step). In case of missing values, the corresponding cases have been excluded from the analysis for the set 1 data (listwise exclusion). For set 2, such an exclusion would lead to a relatively low number of valid cases; therefore, the missing values have been replaced by the corresponding mean value instead.

In Table 6.35, the amount of variance covered by models with set 1 input parameters is listed for different target variables and task-success-related input parameters. When the κ coefficient is used for describing task success, the amount of covered variance is very low. Unfortunately, the authors of PARADISE do not provide example models which include the κ coefficient. Instead, their models rely on the COMP measure. Making use of the binary version of B1 (which is similar to COMP) increases the covered variance to between 0.24 and 0.45, which is in the range of values given in the literature. The model's performance can be further increased by using the non-simplified judgment on question B1 for describing task success. In this case, the covered variance reaches 0.52, a value which is among the best of Table 6.34. Task success measures which are based on the overall solution ([...] and the modified version [...]) provide slightly better estimations than [...] and [...], but they are not competitive with the subject-derived measures B1 and [...]. Apparently, the PARADISE model performs best when it does not rely completely on interaction parameters, but when subjective estimations of task success are included in the estimation function. This finding is in line with comparative experiments described by Bonneau-Maynard et al. (2000).
When using subjective judgments of task success instead of the κ coefficient, the amount of predicted variance rose from 0.41 to 0.48.

Comparing the performance for the different target variables, [...] seems to be least predictable. The amount of covered variance is significantly lower than in the experiments described by Walker. The relatively low number of input parameters in set 1 may be a major reason for this finding. Prediction accuracy rises significantly when B0 or B23 are taken as the target parameter, and with B1 or [...] describing task success. A further improvement is observed when the target parameter is calculated as a mean over several ratings, namely as MEAN(B0, B23) or MEAN(B). The model's performance is equally high in these cases. Apparently, the smoothing of individual judgments which is inherent to the calculation of the mean has a positive effect on the model's prediction accuracy.

Table 6.36 shows the significant predictors for different models determined using the set 1 "dialogue cost" parameters and different task-success-related parameters as the input. Target variables are either the [...] or the MEAN(B) parameter. For [...], the most significant dialogue cost contributions come from # TURNS (with a negative sign), and partly also from the # BARGE-INS parameter (negative sign). DD and IC only play a subordinate role in predicting [...]. For the task-success-related parameters, a clear order can be observed: B1 and [...] have a dominant effect on [...] (both with a positive sign), [...] only a moderate one (the first with a positive and the second with a negative sign), and [...] and [...] are nearly irrelevant in predicting [...]. For MEAN(B) as the target, the situation is very similar. Once again, # TURNS is a persistent predictor (always with a negative sign), and DD, IC and # BARGE-INS only have minor importance.
The task-success-related input parameters show the same significance order in predicting MEAN(B): B1 and [...] have a strong effect (positive sign), [...] a moderate one (also positive sign), and [...] and [...] are not important predictors. Apparently, the PARADISE model is strongly dependent on the type of the input parameter describing task success.

The prediction results for different target variables are depicted in Table 6.37, both for the expert-derived parameter and for the user-derived parameter describing task success. The most important contributors for the prediction of [...] are # TURNS (negative sign) and the task-success-related parameter. For predicting B0, DD and # BARGE-INS (both negative sign) also play a certain role. B23 and MEAN(B0, B23) seem to be better predicted from DD and the task-success-related parameter; here, the # TURNS parameter is relatively irrelevant. For predicting MEAN(B), the most significant contributions come from # TURNS and [...]. As may be expected, the different target parameters related to user satisfaction require different input parameters for an adequate prediction. Thus, the models established by the multivariate regression analysis are only capable of predicting different indicators of user satisfaction to a limited extent.

The number of input parameters in set 1 is very restricted (four "dialogue cost" parameters and one task-success-related parameter). Taking the set 2 parameters as an input, it can be expected that more general aspects of quality are covered by the resulting models. An overview of the achievable variance coverage is given in Table 6.38. In general, the coverage is much better than was observed for set 1. Using the interaction parameters [...] or [...] for describing task success, the covered variance rises to between 0.28 and 0.47, depending on the target parameter.
With B1 or [...], an even better coverage can be reached. As was observed for the set 1 data, it seems to be important to include subject-derived estimations of task success in the prediction function. Expert-derived parameters like [...] or [...] are far less efficient in predicting indicators of user satisfaction. Interestingly, the [...] and [...] parameters are never selected by the stepwise inclusion algorithm. Thus, the low importance of these parameters in the prediction function (see Table 6.36) is confirmed for the augmented set of input parameters. Overall, the prediction functions include a relatively large number of input parameters. However, the amount of variance covered by the function does not seem to be strictly related to the number of input parameters, as the results in the final row or column of Table 6.38 show.

Taking subject-derived estimations of task success as an input, the best prediction results can be obtained for B0 and MEAN(B). Prediction functions with a good data coverage can be obtained especially in the latter case. The values in these cases exceed the best results reported by Walker et al. (2000a), see Table 6.34. However, it has to be noted that a larger number of input parameters is used in these models. For the set 2 data, no clear tendency towards better results for the prediction of smoothed arithmetic mean values (MEAN(B), MEAN(B0, B23)) can be observed. In summary, the augmented data set leads to far better prediction results, with a wider coverage of the resulting prediction functions.

Table 6.39 shows the resulting prediction functions for different task-success-related input parameters. The following parameters seem to be stable contributors to the respective targets:
- Measures of communication efficiency: Most models include either the WPST and SRD parameters (positive sign), STD (negative sign), or # TURNS (negative sign).
The latter two parameters seem to indicate a preference for shorter interactions, whereas the positive sign for the WPST parameter indicates the opposite, namely that a talkative system would be preferred. A higher value for SRD is in principle linked to longer user utterances which require an increased processing time from the system/wizard. No conclusive explanation can be drawn with respect to the communication efficiency measures.
- Measures of appropriateness of system utterances: All prediction functions contain the CA:AP parameter with a positive sign. Two models of Table 6.39 also contain CA:IA (positive sign), which seems to rule out a part of the high effect of CA:AP in these functions. In any case, dialogue cooperativity proves to be a significant contributor to user satisfaction.
- Measures of task success: The task-success-related parameters do not always provide an important contribution to the target parameter, except for B1, which is in both cases a significant contributor. In the model estimated from the first four input parameter sets (identical model), task success is completely omitted.
- Measures of initiative: Most models contain the # SYSTEM QUESTIONS parameter, with a positive sign. Apparently, the user likes systems which take a considerable part of the initiative. Only one model contains the # USER QUESTIONS parameter.
- Measures of meta-communication: Two parameters are frequently selected in the models. The PA:PA parameter (positive sign) indicates that partial system understanding seems to be a relevant factor for user satisfaction. The SCR parameter is an indicator for corrected misunderstandings. It is always used with a positive sign.

The prediction functions differ for the mentioned target parameters, see Table 6.40.
Apart from the parameters listed above, new contributors are the dialogue duration (negative sign), the # BARGE-INS parameter (negative sign), and in two cases the word accuracy as well. Whereas the first parameter underlines the significant influence of communication efficiency, the latter introduces speech input quality as a new quality aspect in the prediction function. Two models differ significantly from the others, namely the ones for predicting B23 and MEAN(B0, B23) on the basis of [...] and the set 2 input parameters. The models are very simple (only two input parameters), but reach a relatively high amount of covered variance. The relatively high correlation between B1 and B23 may be responsible for this result.

The values given so far reflect the amount of variance in the training data covered by the respective model. However, the aim of a model is to allow for predictions of new, unseen data. Experiments have been carried out to train a model on 90% of the available data, and to test it on the remaining 10% of data. The sets of training and test data can be chosen either in a purely randomized way, i.e. selecting a randomized 10% of the dialogues for testing (comparable to the results reported in Table 6.34), or in a per-subject way, i.e. selecting a randomized set of 4 of the 40 test subjects for testing. The latter way is slightly more independent, as it prevents within-subject extrapolation. Both analyses have been applied ten times, and the amount of variance covered by the training and test data sets (R² values) is reported in Tables 6.41 and 6.42.

It turns out that the models show a significantly lower predictive power for the test data than for the training data. The performance on the training data is comparable to the one observed in Table 6.40, namely [...] using [...] and [...] using [...] as the input parameter related to task success.
For a purely randomized set of unseen test data, the mean amount of covered variance drops to 0.263 with [...] and to 0.305 with [...]. The situation is similar when within-subject extrapolation is excluded: here, the mean drops to 0.198 with [...] and to 0.360 with [...]. In contrast to what has been reported by Walker et al. (see Table 6.34), the model predictions are more limited to the training data.

Several reasons may be responsible for this finding. Firstly, the differences between system versions seem to be larger in experiment 6.3 than in Walker et al. (2000a). Although different functionalities are offered by the systems at AT&T, it is to be expected that the components for speech input and output were identical for all systems. Secondly, the amount of available training data is considerably lower for each system version of experiment 6.3. Walker et al. showed saturation from about 200 dialogues onwards, but these 200 dialogues only reflected three instead of ten different system versions. Finally, several of the parameters used in the original PARADISE version only have limited predictive power for experiment 6.3, e.g. the # BARGE-INS, # ASR REJECTIONS and # HELP REQUESTS parameters, see Section 6.2.1. It can be expected that a linear regression analysis on parameters which differ from zero in only a few cases will not lead to an optimally fitting curve.

The interaction parameters and user judgments which form the model input have been collected with different system versions. In order to capture the resulting differences in perceived quality, it is possible to build separate prediction models for each system configuration. In this way, model functions for different system versions can be compared, as well as the amount of variance which is covered in each case. Table 6.43 shows models derived for each of the ten system versions of experiment 6.3, as well as the overall model derived [...]...
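The difference between the purely randomized and the per-subject choice of test data, as described for the 90%/10% experiments in this section, can be sketched in a few lines. This is an illustrative sketch only: all function names are hypothetical, a plain least-squares fit stands in for the regression models of the text, and R² on the held-out data is computed here relative to the mean of the held-out targets, which may differ from the convention used for Tables 6.41 and 6.42.

```python
import numpy as np


def r2_score(y, y_hat):
    """Share of variance in y covered by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_linear(X, y):
    """Least-squares weights for an intercept plus all columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w


def split_random(n_dialogues, test_fraction, rng):
    """Boolean test mask over dialogues, chosen purely at random."""
    idx = rng.permutation(n_dialogues)
    n_test = max(1, int(round(test_fraction * n_dialogues)))
    mask = np.zeros(n_dialogues, dtype=bool)
    mask[idx[:n_test]] = True
    return mask


def split_by_subject(subjects, n_test_subjects, rng):
    """Boolean test mask holding out all dialogues of randomly chosen subjects."""
    held_out = rng.choice(np.unique(subjects), size=n_test_subjects, replace=False)
    return np.isin(subjects, held_out)


def evaluate(X, y, test_mask):
    """Fit on the training part; return (R^2 on training, R^2 on test)."""
    w = fit_linear(X[~test_mask], y[~test_mask])
    r2_train = r2_score(y[~test_mask], predict(w, X[~test_mask]))
    r2_test = r2_score(y[test_mask], predict(w, X[test_mask]))
    return r2_train, r2_test
```

Repeating `evaluate` ten times with fresh masks and averaging the two R² values would mirror the repeated analysis described in the text. Holding out whole subjects keeps every dialogue of a test subject out of the training data, which is what prevents within-subject extrapolation.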
Bernsen et al. (1998) provided a very valuable theory in this respect, which can be directly applied to SDS development and optimization. This theory forms a basis for the detailed analysis of quality aspects governing the interaction with telephone-based spoken dialogue systems which was presented here. Applying a recent definition of quality to this type of service (Jekosch, 2000), the quality perceived...

features and physical or algorithmic quality elements is not always obvious and straightforward, because spoken dialogue systems are particularly complex and consist of a number of interconnected components. Two fundamental ways of addressing their quality have been discussed. Technology-centered quality assessment and evaluation tries to capture the performance or quality of individual system components,...

absence of speech-output-related interaction parameters. Only the speech input aspect of the category is covered by the interaction parameters of set 3. However, these parameters were not identified as relevant predictors by the algorithm. This finding underlines the fact that information may be lost when different quality aspects are mixed as a target variable of the...

renouncing quality judgments obtained from the test subjects, and replacing them by collected interaction parameters. Such an approach would inevitably lead to a loss of information, because the interaction parameters do not reflect the quality perceived by the test subjects. Via a multidimensional analysis, it is possible to identify and to describe the quality dimensions...
potential quality dimensions. This disadvantage may be overcome with the help of the QoS taxonomy defined in Chapter 2. On the basis of the quality aspects and categories identified in the taxonomy, predictors of individual quality aspects have been derived. Target variables defined for each quality category can be predicted with a similar degree of precision as was observed for the global quality aspects...

used to describe the performance or quality of spoken dialogue systems and their components mainly address the effects of environmental and agent factors, and only to a limited extent the effects of task factors. Their aim is to estimate a target variable related to performance (e.g. word error rate or task success) or quality (e.g. overall user satisfaction) on the basis of instrumentally or expert-derived...

value, forced inclusion of all input parameters. # UQ: # USER QUESTIONS; # BI: # BARGE-INS; # SEM: # SYSTEM ERROR MESSAGES; # SQ: # SYSTEM QUESTIONS.

[...] for B0, [...] for B23, [...] for MEAN(B0, B23), and [...] for MEAN(B). These values show that the amount of variance which can be covered by a regression model strongly depends on the choice of available input parameters...

perceived quality dimensions (quality features) which are somehow linked to the quality elements, i.e. the physical or algorithmic characteristics of the system and the interaction scenario, including the transmission channel. A new taxonomy of quality of service aspects has been proposed, and provides a generic structure for the relevant quality elements (grouped into factors) and quality features (grouped...
cooperativity, dialogue symmetry and speech input/output quality are taken as input variables, together with additional interaction parameters which have been assigned to these categories. In this way, the interdependence of quality categories displayed in the QoS taxonomy is reflected in the model structure. On the basis of the predictions for all six quality categories, estimations of global quality aspects...

or quality of system components in the overall setting, and is the final reference for describing the quality of the system and the offered service as a whole. Both types of evaluation should go hand in hand, because they provide complementary information to the system developer. In Chapter 3, an overview of the respective assessment and evaluation methods has been presented. Two system components are of...

of questions in both categories, see Table 6.47. The speech input/output quality category cannot be well predicted. This is mainly due to the absence of speech-output-...