Perception of Synthetic Speech Generated by Rule

DAVID B. PISONI, HOWARD C. NUSBAUM, AND BETH G. GREENE

Invited Paper

Manuscript received January 15, 1985; revised July 3, 1985. This research was supported, in part, under NIH Grant NS-12179 and, in part, under Contract AF-F 33615-83-K-0501 with the Air Force Systems Command, AFOSR, through the Aerospace Medical Research Laboratory, Wright-Patterson AFB, Ohio. Requests for reprints should be sent to the authors at the address below. The authors are with the Speech Research Laboratory, Department of Psychology, Indiana University, Bloomington, IN 47405, USA.

As the use of voice response systems employing synthetic speech becomes more widespread in consumer products, industrial and military applications, and aids for the handicapped, it will be necessary to develop reliable methods of comparing different synthesis systems and of assessing how human observers perceive and respond to the speech generated by these systems. The selection of a specific voice response system for a particular application depends on a wide variety of factors, only one of which is the inherent intelligibility of the speech generated by the synthesis routines. In this paper, we describe the results of several studies that applied measures of phoneme intelligibility, word recognition, and comprehension to assess the perception of synthetic speech. Several techniques were used to compare performance of different synthesis systems with natural speech and to learn more about how humans perceive synthetic speech generated by rule. Our findings suggest that the perception of synthetic speech depends on an interaction of several factors, including the acoustic-phonetic properties of the speech signal, the requirements of the perceptual task, and the previous experience of the listener. Differences in perception between natural speech and high-quality synthetic speech appear to be related to the redundancy of the acoustic-phonetic information encoded in the speech signal.

I. INTRODUCTION

In the not too distant past, voice output systems could be classified into two broad categories depending on the nature of the synthesis process. Speech-coding systems used a fixed set of parameters to reproduce a relatively limited vocabulary of utterances. These systems produced intelligible and acceptable speech [58], [59] at the cost of flexibility in terms of the range of utterances that could be produced. In contrast,
synthetic speech produced by rule provided less intelligible and less natural sounding speech, but these systems had the capability of automatically converting unrestricted text in ASCII format into speech [2], [3]. Over the last few years, significant improvements in text-to-speech systems have begun to eliminate the advantages of simple coded-speech voice response systems over text-to-speech systems. Extensive research on improving the letter-to-sound rules and phonetic implementation rules used by these systems, as well as the techniques of diphone and demisyllable synthesis in text-to-speech systems, suggests that, in the near future, unrestricted text-to-speech voice response devices may produce highly intelligible and very natural sounding synthetic speech [23].

As the quality of the speech generated by text-to-speech systems improves, it becomes necessary to be able to evaluate and compare the performance of different synthesis systems. The need for a systematic and reliable assessment of the capabilities of voice response devices becomes even more critical as the complexity of these systems increases. This is especially true when considering the advanced features that are now being offered by some of the newest systems such as DECtalk, Prose-2000, and Infovox, which provide capabilities for synthesis of different languages and generation of several different synthetic voices. It is also important, in its own right, to learn more about how human listeners perceive and understand synthetic speech and how performance with synthetic speech differs from natural speech.

If there existed a set of algorithms or a set of acoustic criteria that could be applied automatically to measure the quality of synthetic speech, there would be no question about describing the performance of a particular system or the effectiveness of new rules or synthesis techniques. Standards could be developed fairly easily and applied uniformly. Unfortunately, there is no existing method for automating the assessment of synthetic speech quality. The ultimate test of synthetic speech involves assessment and perceptual response by the human listener. Thus it is necessary to employ perceptual tests of synthetic speech under the conditions in which synthetic speech will be used. The perception of speech depends on the human listener as much as it does on the attributes of the acoustic signal itself and the system of rules used to generate the signal [42].

Although it is clear that the performance of systems that generate synthetic speech must be evaluated using objective perceptual tests, there have been only a handful of studies dealing with the intelligibility of synthetic speech over the years (e.g., [19], [38], [44]). And there have been even fewer discussions of the technical issues that surround the problem of measuring the intelligibility of synthetic speech (see [35], [40], [42]). Furthermore, it is important to specify precisely which aspects of synthetic speech are being evaluated.
On the one hand, the perception and comprehension of synthetic speech can be measured using a variety of objective behavioral tests that provide precise and statistically reliable estimates of the performance of a particular voice response system in a specific condition. These tests investigate the transmission of linguistic information from the speech signal to the listener and address specific questions such as: 1) how accurately are synthetic phonemes and words recognized, 2) how well is the meaning of a synthetic utterance understood, and 3) how easy is it to perceive and understand synthetic speech. On the other hand, an equally important issue concerns the acceptability and naturalness of synthetic speech, and whether the listener prefers one type of speech output over another. Questions of listener preference cannot be addressed directly using objective performance measures such as the proportion of words correctly recognized or response latencies, but instead must be investigated more indirectly by asking the listener for his or her subjective impressions of the quality of synthetic speech, using questions designed to assess different dimensions of naturalness, acceptability, and preference [36].

In the Speech Research Laboratory at Indiana University, we have carried out a large number of studies over the last five years to learn more about the perception of synthetic speech generated automatically by rule using several text-to-speech systems (see [34], [37], [42], [44]). Strictly speaking, this work is not human factors research; that is, it is not designed to answer specific questions regarding the development and use of specific products or techniques. Rather, the goal of our research has been to provide more basic knowledge about the perception of synthetic speech under well-defined laboratory conditions. These research findings can then serve as a benchmark for subsequent human factors studies that may be motivated by more specific problems of using voice response systems for a particular application. In general, our research has focused on measuring the performance of human listeners who are required to perceive and respond to synthetic speech under a variety of task demands and experimental conditions. During the course of this work, we have also carried out several comparisons of the performance of human listeners on standardized perceptual tasks using synthetic speech generated by rule with a number of text-to-speech systems.

II. CONSTRAINTS ON PERFORMANCE OF HUMAN OBSERVERS

To provide a framework for interpreting the results of our research, we first consider a number of factors that are known to affect an observer's performance: 1) the specific demands imposed by a particular task, 2) the inherent limitations of the human information processing system, 3) the experience and training of the human listener, 4) the linguistic structure of the message set, and 5) the structure and quality of the speech signal.

A. Task Complexity

The first factor that constrains performance concerns the complexity of the tasks that engage an observer during the perception of speech. In some tasks, the response demands are relatively simple, such as deciding which of two known words was said. Other tasks are extremely complex, such as trying to recognize an unknown utterance from a virtually unlimited number of response alternatives while engaging in an activity that already requires attention. There is a substantial amount of research in the cognitive psychology and human factors literature demonstrating the powerful effects of perceptual set, instructions, subjective expectancies, cognitive load, and response set on performance in a variety of perceptual and cognitive tasks [63]. The amount of context and the degree of uncertainty in the task also strongly affect an observer's performance in substantial ways [22]. Thus it is necessary to understand the requirements and demands of a particular task before drawing any strong inferences about an observer's behavior or performance.
B. Limitations on the Observer

The second factor influencing recognition of synthetic speech concerns the structural limitations on the human information processing system's ability to perceive, encode, store, and retrieve information. Because the nervous system cannot maintain all aspects of sensory stimulation (and therefore must integrate acoustic energy over time), very severe processing limitations have been found in the human observer's capacity to encode and store raw sensory data in memory. To overcome these capacity limitations, the listener must rapidly transform sensory input into more abstract neural codes for more stable storage in memory and subsequent processing operations. The bulk of the research in perception and cognitive processes over the last 25 years has identified human short-term memory (STM) as a major limitation on processing sensory input [50]. The amount of information that can be processed in and out of STM is severely limited by the listener's attentional state, past experience, and the quality of the original sensory input.

C. Experience and Training

The third factor concerns the ability of human observers to quickly learn effective cognitive and perceptual strategies to improve performance in almost any task. When given appropriate feedback and training, subjects can learn to classify novel stimuli, remember complex pattern sequences, and respond to rapidly changing stimulus patterns in different sensory modalities. Clearly, the flexibility of subjects in adapting to the specific demands of a task is an important constraint that must be considered and controlled in any attempt to evaluate the perception of synthetic speech by the human observer.

D. Message Set

The fourth factor relates to the structure of the message set; that is, the constraints on the number of possible messages and the organization and linguistic properties of the message set. A message set may consist of words that are distinguished only by a single phoneme or may consist of words and phrases with very different lengths, stress patterns, and phonotactic structures. Use of this constraint by listeners depends on linguistic knowledge [27]. The choice and arrangement of speech sounds into words is constrained by the phonological rules of language; the arrangement of words in sentences is constrained by syntax; and finally, the meaning of individual words and the overall meaning of sentences in a text is constrained by the semantics and pragmatics of language. The contribution of these various levels of linguistic structure to perception will vary substantially from isolated words, to sentences, to passages of fluent continuous speech.

E. Signal Characteristics

The fifth factor refers to the acoustic-phonetic and prosodic structure of a synthetic utterance. This constraint refers to the veridicality of the acoustic properties of the synthetic speech signal compared to naturally produced speech. Speech signals may be thought of as the physical realization of a complex and hierarchically organized system of linguistic rules that map sounds onto meanings and meanings back onto sounds.
At the lowest level in the system, the distinctive properties of the speech signal are constrained in substantial ways by vocal tract acoustics and articulation. The acoustic-phonetic structure of natural speech reflects these physical and contextual constraints; synthetic speech is an impoverished signal representing phonetic distinctions with only a limited subset of the acoustic properties used to convey phonetic information in natural speech. Furthermore, the acoustic properties used to represent segmental structure in synthetic speech are highly stylized and are insensitive to phonetic context when compared to natural speech.

III. PERCEPTUAL EVALUATION OF SYNTHETIC SPEECH

There are basically three areas in which a text-to-speech system could produce errors that would impact the overall intelligibility of the speech: 1) the spelling-to-sound rules, 2) the computation and production of suprasegmental information, and 3) the phonetic implementation rules that convert the internal representation of phonemes and/or allophones into a speech waveform [2], [4]. In our previous research, we have found that phonetic implementation rules are a major factor in determining the segmental intelligibility of a voice response system [33]. In the perceptual studies described below, we have focused most of our attention on measures of segmental intelligibility, assuming that the letter-to-sound rules used by a particular text-to-speech system were applied correctly.

A. Phoneme Intelligibility

The task that has been used most often in previous studies evaluating synthetic speech, and is now accepted as the de facto standard measure of the segmental intelligibility of synthetic speech, is the Modified Rhyme Test ([14], [38]; however, see [40] for a different opinion). In the Modified Rhyme Test (MRT), subjects are required to identify a single word by choosing one of six alternative response words differing by a single phoneme in either initial or final position [18]. All the stimuli in the MRT are consonant-vowel-consonant (CVC) monosyllabic words; on half the trials, the responses share the vowel-consonant portion of the stimulus, and on the other half, the responses share the consonant-vowel portion. Thus the MRT provides a measure of the performance of listeners in identifying either the initial or final phoneme of a set of spoken words.

To date, we have evaluated natural speech and synthetic speech produced by five different text-to-speech systems: the Votrax Type-'N'-Talk, the Speech Plus Prose-2000, the MITalk-79 research system, Infovox, and DECtalk (see [15]). The major findings are summarized in Table 1.

Table 1. Percent Correct Performance Obtained for Modified Rhyme Test (MRT) Experiments Conducted at the Speech Research Laboratory (1979-1984)

    System Tested (date)                      MRT Closed    MRT Open
    Natural Speech* (11/79)                       99.4        97.2
    MITalk-79 Research System* (6/79-9/79)        93.1        75.4
    Prototype Prose-2000** (12/79)                87.6        66.2
    Votrax Type-'N'-Talk*** (3/82)                67.2        --
    DECtalk Paul v1.8+ (3/84)                     96.7        86.7
    DECtalk Betty v1.8++ (8/84)                   94.4        82.5
    Current Working Prose+++ (9/84)               --          --
    Prose-2000 v3.0 (3/85)                        94.3        --
    Infovox (3/85)                                87.4        --

    *Pisoni and Hunnicutt [44]. **Bernstein and Pisoni [5]. ***Pisoni and Koen [45]. +Greene, Manous, and Pisoni [15]. ++Greene, Manous, and Pisoni, 1984 (final report). +++Manous, Greene, and Pisoni, 1984.

Performance in the MRT for natural speech was the best at 99.4 percent correct.
For DECtalk v1.8, we evaluated speech produced by "Paul" and "Betty," two of DECtalk's nine voices, and found different levels of performance on these voices: 96.7 percent of the words spoken by the Paul voice were identified correctly, while only 94.4 percent of Betty's words were identified correctly. The level of performance observed for the Paul voice comes the closest to natural speech and is considerably higher than performance for any of the other text-to-speech systems we have studied. Performance on MITalk-produced speech was somewhat lower than either of the DECtalk v1.8 voices at 93.1 percent correct word identification. The prototype of the Prose-2000 produced speech that was identified at 87.6 percent correct; version 3.0 of the Prose-2000 has improved, with performance at 94.3 percent correct. The Infovox multilingual system produced English speech that was identified at 87.4 percent correct. Finally, the least intelligible synthetic speech was produced by the Votrax Type-'N'-Talk, with only 67.2 percent correct identification. These results, obtained under closely matched testing conditions in the same laboratory environment, show a wide range of variation among currently available text-to-speech systems. In our view, these differences in performance directly reflect the amount of basic research that was carried out to develop the phonetic implementation rules of these different voice response systems.

In addition to the standard closed-response MRT, we have also used an open-response format version of the MRT. In this procedure, listeners are instructed to write down the word that they heard on each trial. This open-response format provides a measure of performance when constraints on the response set are minimized (all CVC words known to the listener, compared to the six alternative responses in the closed-response version). This procedure also provides information about the intelligibility of vowels that is not available in the closed-response set version of the MRT. A comparison of the closed- and open-response versions of the MRT for synthetic speech produced by different text-to-speech systems with natural speech indicates the degree to which listeners rely on response-set constraints. Some representative findings using the open-response MRT format are also shown in Table 1.

Performance on the open-response set MRT for natural speech was at 97.2 percent correct exact word identification, compared to 99.4 percent correct in the closed-response set task. Even when there are no strong constraints on the number of alternative responses for natural speech, performance is still better than for any of the text-to-speech systems with a constrained set of responses. For the MITalk-79 research system, performance in the open-set MRT task is, however, considerably worse at 75.4 percent correct. Similarly, DECtalk's Paul voice was identified at the 86.7 percent level; correct word identification for "Betty" was 82.5 percent correct. These results show a large interaction between intelligibility measured in the closed-response format MRT and the open-response format MRT. Although the rank ordering of intelligibility remains the same across the two forms of the MRT, it is clear that as speech becomes less intelligible, listeners rely more heavily on response-set constraints to aid performance.
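Scoring in both response formats reduces to the proportion of trials on which the listener's response matches the stimulus word; what differs is the size of the response set and, with it, the chance level (one in six for the closed format, effectively zero for the open format). The following minimal Python sketch illustrates this bookkeeping; the stimulus-response pairs are invented for illustration and are not items from the actual MRT lists.

    def percent_correct(trials):
        # trials: iterable of (stimulus_word, response_word) pairs
        trials = list(trials)
        hits = sum(1 for stimulus, response in trials if stimulus == response)
        return 100.0 * hits / len(trials)

    # Closed-response format: six rhyming alternatives per trial, so
    # guessing alone yields about 16.7 percent correct (1 in 6).
    closed = [("bat", "bat"), ("meat", "beat"), ("sing", "sing"), ("pin", "pin")]

    # Open-response format: the listener writes any word, so vowel errors
    # can surface and chance performance is effectively zero.
    open_set = [("bat", "bat"), ("meat", "mitt"), ("sing", "sin"), ("pin", "pin")]

    print(percent_correct(closed))    # 75.0
    print(percent_correct(open_set))  # 50.0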
B. Word Recognition in Sentences

To examine the contribution of several linguistic constraints on performance, we compared word recognition in two types of sentence contexts. The first type of sentence context was syntactically correct and meaningful: the Harvard psychoacoustic sentences [13]. An example is given in (1) below:

    Add salt before you fry the egg. (1)

The second type of sentence context was syntactically correct, but these sentences were semantically anomalous: the Haskins syntactic sentences [38]. These test sentences had the syntactic form of normal sentences, but they were nonsense. An example of this type of nonsense sentence is given in (2) below:

    The old farm cost the blood. (2)

By comparing word recognition performance for these two classes of sentences, it was possible to determine the influence of sentence meaning and linguistic constraints on word recognition [15]. Table 2 shows percent correct word identification in meaningful and semantically anomalous sentences for natural speech, and for synthetic speech produced by MITalk-79, the Speech Plus Prose-2000 prototype, and DECtalk's Paul and Betty voices (v1.8).

Table 2. Percent Correct Word Recognition for Meaningful and Semantically Anomalous Sentence Contexts

    Type of Speech                  Meaningful (%)    Anomalous (%)
    Natural                             99.2              97.7
    MITalk-79                           93.3              78.7
    Prototype Prose-2000                83.7              64.5
    DEC Paul v1.8                       95.3              86.8
    DEC Betty v1.8                      90.5              75.1
    Current Working Prose (9/84)        --                --

For natural and synthetic speech, word recognition was much better in meaningful sentences than in the semantically anomalous sentences. Furthermore, a comparison of correct word identification in these sentences reveals an interaction in performance suggesting that semantic constraints are relied on by listeners much more when the speech becomes progressively less intelligible.
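Transcription tests of this kind are commonly scored by counting the keywords of each sentence that appear in the listener's written response. The sketch below illustrates one such scoring scheme under a simple exact-match assumption; the keyword sets and transcriptions are invented, and the actual studies may have used different scoring criteria.

    def keyword_score(keywords, transcription):
        # Proportion of sentence keywords reproduced in the transcription.
        produced = set(transcription.lower().split())
        hits = sum(1 for word in keywords if word in produced)
        return hits / len(keywords)

    # Meaningful (Harvard-style) sentence, cf. example (1) in the text.
    meaningful = keyword_score(
        ["add", "salt", "fry", "egg"],
        "add the salt before you fry an egg")

    # Semantically anomalous (Haskins-style) sentence, cf. example (2).
    anomalous = keyword_score(
        ["old", "farm", "cost", "blood"],
        "the old arm cost the blood")

    print(round(100 * meaningful, 1), round(100 * anomalous, 1))  # 100.0 75.0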
C. Listening Comprehension

Spoken language understanding is a very complex cognitive process that involves the encoding of sensory information, retrieval of previously stored knowledge from long-term memory, and the subsequent interpretation and integration of various sources of knowledge available to a listener [26], [39]. Language comprehension, therefore, depends on a relatively large number of diverse and complex factors, many of which are still only poorly understood by cognitive psychologists at the present time. Measuring comprehension is difficult, therefore, because of the interaction of several different knowledge sources in the comprehension process. This problem is made worse because there is no coherent theoretical model of language comprehension to guide the development of measurement procedures. Moreover, there are presently no theoretical models that can deal with the diverse strategies employed by listeners to mediate language understanding under a wide variety of listening conditions and task demands.

One of the factors that obviously plays an important role in listening comprehension is the quality of the initial input signal, that is, the intelligibility of the speech itself. But the acoustic-phonetic properties of the input signal are only one source of information used by listeners in speech perception and spoken language understanding. As we have seen from the results summarized in the previous sections, additional consideration must also be given to the contribution of higher levels of linguistic knowledge to perception and comprehension.

In our initial attempts to measure comprehension of synthetic speech, we wanted to obtain a gross estimate of how well listeners could understand the linguistic content of continuous, fluent speech produced by the MITalk-79 text-to-speech system (see [40], [44]). As far as we have been able to determine, little attention has actually been devoted to the problems surrounding comprehension of the linguistic content of synthetically produced speech, particularly passages of meaningful fluent continuous speech [21], [40].

To assess comprehension, we selected fifteen narrative passages and an appropriate set of multiple-choice test questions from several standardized adult reading comprehension tests [10], [20], [30], [55]. The passages were quite diverse, covering a wide range of topics, writing styles, and vocabulary. These passages were also selected to be interesting for subjects to listen to in the context of laboratory-based tests designed to assess language understanding. Since these test passages were chosen from several different types of reading tests, they varied in difficulty and style. This variation permitted us to evaluate the contribution of all of the individual components of a particular text-to-speech system to comprehension in one relatively gross measure. We assumed that the results of these comprehension tests would provide an initial benchmark against which the entire text-to-speech system could be evaluated with materials that would be comparable to those used in a more realistic application such as a reading machine for the blind or a database information retrieval system [1], [2].

In our initial study, we tested three groups of naive subjects with 20 subjects in each group [44]. One group of subjects listened to MITalk-79 versions of the passages, another listened to natural speech, while a third group read the passages silently. All three groups answered the same set of test questions immediately after each passage. In a subsequent study [5], a group of subjects listened to the prototype of the Speech Plus Prose-2000 (then known as Telesensory Systems, Incorporated). The same prose passages were used in this study as in the original study. The comprehension results for all groups are summarized in Table 3.

Table 3. Percent Correct Performance on the Comprehension Tests [5], [40], [44]

    Group                 1st Half           2nd Half           Total
                          (6 passages) (%)   (6 passages) (%)   (13 passages) (%)
    Reading                   77.2               76.1               74.8
    Natural Speech            68.5               67.3               --
    MITalk-79                 64.1               70.3               67.8
    (TSI) Prose-2000          60.9               65.6               --

Averaged over the last thirteen test passages, the reading group showed a significant 7-percent advantage (p < 0.001) over the synthetic speech group and a 12-percent advantage over the Prose (TSI) group (p < 0.001). However, the differences in performance between the groups appeared to be localized primarily in the first half of the test. By the second half, performance for the groups listening to synthetic speech improved substantially, whereas performance for the reading group remained about the same. Although the scores for the natural speech group were slightly lower overall, no improvement was observed in their performance from the first half to the second half of the test. The finding of improved performance in the second half of the test for subjects listening to synthetic speech is consistent with the earlier results from word recognition in sentences, which showed that recognition performance improves for synthetic speech after only a short period of exposure. These results suggest that the overall difference in performance between the groups is probably due to familiarity with the output of the synthesizer and not due to any inherent difference in the basic strategies used in comprehending or understanding the content of these passages.
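The familiarity effect described above falls out of a simple tabulation: question accuracy is pooled separately over the first and second halves of the passage set for each group. A sketch of that computation in Python, using invented records rather than the actual data:

    # Each record: (group, passage_index, questions_correct, questions_total).
    # The numbers below are invented solely to show the computation.
    records = [
        ("synthetic", 1, 6, 10), ("synthetic", 8, 8, 10),
        ("natural",   1, 7, 10), ("natural",   8, 7, 10),
    ]

    def half_scores(records, group, split_at=7):
        first = [r for r in records if r[0] == group and r[1] < split_at]
        second = [r for r in records if r[0] == group and r[1] >= split_at]
        pct = lambda rs: 100.0 * sum(r[2] for r in rs) / sum(r[3] for r in rs)
        return pct(first), pct(second)

    print(half_scores(records, "synthetic"))  # (60.0, 80.0): improves with exposure
    print(half_scores(records, "natural"))    # (70.0, 70.0): flat across halves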
D. Conclusions: Intelligibility and Comprehension

The results of the Modified Rhyme Test revealed relatively high levels of segmental intelligibility for speech generated by MITalk-79, Prose-2000, Infovox, and DECtalk. The results for the Votrax Type-'N'-Talk using this measure showed much lower levels of performance. The progression from MITalk-77 (the forerunner of Prose-2000), to the MITalk-79 research system, to Infovox, to the Prose-2000, and finally to DECtalk shows the continual refinement of speech synthesis technology. With additional research and development, the speech generated by these high-quality text-to-speech systems may soon approach the almost perfect levels of intelligibility observed for natural speech under laboratory testing conditions.

The results from the two sentence tasks indicated that context is a powerful aid to recognition. When both semantic and syntactic information was available to subjects, higher levels of performance were obtained in recognizing words in sentences. However, when the use of semantic knowledge was modified or eliminated, as in the Haskins sentences, subjects must, of necessity, rely primarily on the acoustic-phonetic information in the signal and their knowledge of morphology. Clearly, the contribution of higher level sources of knowledge is responsible for the superior performance obtained on Harvard sentences; in the absence of this knowledge, subjects' performance was considerably poorer.

Finally, the results of the listening comprehension tests reveal that subjects are able to correctly answer multiple-choice comprehension questions about the content of passages of fluent connected synthetic speech. After only a few minutes of exposure to the output of a speech synthesizer, comprehension improves substantially and eventually approximates levels observed when subjects read the same passages of text or listen to naturally produced versions of the same materials.

There are, however, a number of problems in measuring comprehension with the materials we have used. First, these materials were designed to measure reading comprehension, not listening comprehension. Thus for these tests, a reader was expected to re-read the material in order to answer some of the questions. The reader always has access to the passage; the listener cannot go back and hear some portion of the passage again. Second, the multiple-choice questions do not directly assess the perceptual processes used to encode the speech input. Moreover, these questions measure comprehension after the materials have been presented, therefore reflecting post-perceptual comprehension strategies and subject biases. Thus multiple-choice questions are not measures of the on-line, real-time cognitive processes used in comprehension but reflect the final product of comprehension.

IV. PERCEPTUAL ENCODING OF SYNTHETIC SPEECH

The results of the MRT and word identification studies of natural and synthetic speech clearly indicate that synthetic speech is less intelligible than natural speech. In addition, these studies demonstrate that, as synthetic speech becomes less intelligible, listeners rely more on linguistic knowledge and response-set constraints to aid word identification. However, these studies do not account for the differences in perception between natural and synthetic speech; rather, they just demonstrate and describe some of these basic differences.
A. Lexical Decision and Naming Latencies

In order to begin to investigate differences in the perceptual processing of natural and synthetic speech, we carried out a series of experiments designed to measure the time listeners need to recognize words and pronounceable nonwords produced by a human talker and a text-to-speech system. In carrying out these studies, we wanted to know how long it takes a listener to recognize an isolated word, and how the process of word recognition might be affected by the quality of the acoustic-phonetic information in the signal. To measure the duration of the recognition process, we used a lexical decision task [41], [54]. Listeners were presented with a single word or a nonword stimulus item on each trial. Each listener was required to classify the item as either a "word" or a "nonword" as quickly and accurately as possible by pressing one of two buttons located on a response box that was interfaced to a minicomputer. Examples of the stimuli are shown in Table 4.

Table 4. Examples of Lexical Decision Stimuli

    Words        Nonwords
    PROMINENT    PRADAMENT
    BAKED        BEPT
    TINY         TADGY
    GLASS        CEEP
    PARENTS      PEEMERS
    TOLD         TAVED
    BLACK        BAEP
    CONCERT      CAELIMPS
    DARK         DUT
    BABBLE       BURTLE
    CRITIC       CRAENICK
    BOUGHT       BUPPED
    PAIN         POON
    GORGEOUS     GAETLESS
    COLORED      COOBERED

Subjects responded significantly faster to natural words (903 ms) and nonwords (1046 ms) than to synthetic words (1056 ms) and nonwords (1179 ms). On the average, response times to the synthetic speech were 145 ms longer than response times to the natural speech. These findings demonstrate two important differences in perception between natural and synthetic speech. First, perception of synthetic speech requires more cognitive "effort" than the perception of natural speech. Second, because the differences in latency were observed for both words and nonwords alike, and therefore do not depend on the lexical status of the test item, the extra processing effort appears to be related to the process of extracting the acoustic-phonetic information from the signal and not the process of identifying words in the lexicon. In short, the pattern of results suggests that the perceptual processes used to encode synthetic speech require more cognitive "effort" or resources than the processes used to encode natural speech.

Similar results were obtained by Pisoni [42] in a naming task using natural and synthetic words and nonwords. As in the lexical decision experiment, subjects were much slower to name synthetic test items than natural test items. Moreover, this difference was again observed for both words and nonwords. The naming results demonstrate that the extra processing time needed for synthetic speech does not depend on the type of response made by the listener, since the results were comparable for both manual and vocal responses. Taken together, these two sets of findings demonstrate that early stages of encoding synthetic speech require more processing time than encoding natural speech. Several additional studies were carried out to determine the nature and extent of the encoding differences between natural and synthetic speech.

B. Consonant-Vowel (CV) Confusions

Several hypotheses can be proposed to account for the greater difficulty of encoding synthetic speech. One hypothesis that has been suggested recently is that synthetic speech is simply equivalent to "noisy" natural speech [8], [56]. That is, the acoustic-phonetic structure of synthetic speech is more difficult to encode than natural speech for the same reasons that natural speech presented in noise is hard to perceive: the acoustic cues to phonemes are obscured, masked, or physically degraded in some way by the masking noise. According to this view, synthetic speech is on the same continuum as natural speech, but it is degraded in comparison with natural speech. In contrast, an alternative hypothesis, and the one we prefer, is that synthetic speech is not like "noisy" or degraded natural speech at all, but instead may be thought of as "perceptually impoverished" relative to natural speech. By this account, synthetic speech is fundamentally different from natural speech in both degree and kind because many of the important criterial acoustic cues are either poorly represented or not represented at all.

Spoken language is structurally rich and redundant at all levels of linguistic analysis. In particular, natural speech is highly redundant at the level of acoustic-phonetic structure. Natural speech contains multiple acoustic cues for almost every phonetic distinction, and these cues change as a function of context, speaking rate, and talker. However, in synthesizing speech by rule, only a small subset of the possible cues are typically implemented as phonetic implementation rules. As a result, some phonetic distinctions may be minimally cued, perhaps by only a single acoustic attribute. If all cues do not have equal importance in different phonetic contexts, a single cue may not be perceptually sufficient to convey a particular phonetic distinction in all utterances (see [12]). Moreover, the reliance on minimal sets of cues in generating synthetic speech could be disastrous for perception if a particular phonetic distinction is incorrectly synthesized or masked by environmental noise. Indeed, many of the errors we have found in our analyses of the MRT data suggest that this account is correct [15].

These two hypotheses concerning the structural relationships between synthetic and natural speech make different predictions about the types of errors and the distribution of perceptual confusions that should be observed with synthetic speech compared to natural speech. According to the "noisy speech" hypothesis, synthetic speech is similar to natural speech that has been degraded by the addition of noise. Therefore, the perceptual confusions that occur with synthetic speech should be very similar to those obtained with natural speech heard in noise. By comparison, the "impoverished speech" hypothesis claims that the acoustic-phonetic structure of synthetic speech is not as rich or redundant in segmental cues as natural speech. According to this hypothesis, two patterns of confusion errors should occur in the perception of synthetic speech. When the acoustic cues used to specify a phonetic segment are not sufficiently distinctive, confusions should occur between minimally cued segments that are phonetically similar. This error pattern should be similar to the errors predicted by the noisy speech hypothesis, since perceptual confusions of natural speech in noise also depend on the acoustic-phonetic similarity of the segments [29], [62]. However, the two hypotheses may be distinguished by the presence of a second type of error that is only predicted by the impoverished speech hypothesis. If the minimal acoustic cues used to specify phonetic contrasts are incorrect or misleading as a result of poorly specified phonetic implementation rules, then confusions should occur that are not based on the nominal acoustic-phonetic similarity of the confused segments. Instead, these confusions should be entirely determined by the listener's perceptual interpretation of the misleading cues. Thus the pattern of confusions observed with synthetic speech should be phonetically quite different from the expected ones based on the acoustic-phonetic similarity of natural speech.

To investigate the predictions made by these two hypotheses, Nusbaum, Dedina, and Pisoni [32] examined the perceptual confusions that arise within a set of 48 natural and synthetic consonant-vowel (CV) syllables used as stimuli. These were constructed from the vowels /i, a, u/ and the consonants /b, d, g, p, t, k, m, n, r, l, w, j, s, f, z, v/. The natural CV syllables were produced by a male talker. The synthetic syllables were generated by three different text-to-speech systems: the Votrax Type-'N'-Talk, the Speech Plus Prose-2000 v2.1, and the Digital Equipment Corporation DECtalk v1.8. To assess the pattern of perceptual confusions that occur for natural speech, the natural syllables were presented to listeners at four signal-to-noise (S/N) ratios of +28, 0, -5, and -10 dB.

When averaged over the three vowel contexts, the results showed that natural speech at +28 dB S/N was the most intelligible (96.6 percent correct), followed by DECtalk (92.2 percent correct), followed by the Prose-2000 (62.8 percent correct). The Type-'N'-Talk showed the worst performance (27.0 percent correct). Of special interest were the results of more detailed error analyses, which revealed that the distributions of perceptual confusions obtained for natural and synthetic speech were often quite different. For example, in the case of DECtalk, 100 percent of the errors made in identifying the segment /r/ were due to confusions with /b/, even though this type of error never occurred for natural speech at +28 dB S/N. Even at the poorest S/N (-10 dB), where the intelligibility of natural speech in noise was actually worse than DECtalk presented without noise (29.1 percent correct versus 92.2 percent correct), this type of error accounted for only a small percentage of the total errors observed for this segment.

In order to examine the segmental confusions more precisely, we compared the confusion matrices for a particular text-to-speech system with the confusion matrices for natural speech presented at a signal-to-noise ratio that resulted in comparable overall levels of identification performance. We compared the confusion matrices for the Prose-2000 with natural speech presented at 0 dB S/N, and the confusion matrices for Votrax with natural speech presented at -10 dB S/N. An examination of the proportion of the total errors contributed by each response class (stop, nasal, liquid/glide, fricative, other) indicated that, for natural speech, most of the errors in identifying stops were due to responses that were other stop consonants. In contrast, the errors found with the Prose-2000 appeared to be more evenly distributed between stop, liquid/glide, and fricative responses. In other words, more intrusions appeared from other manner classes in the errors observed with the Prose-2000 synthetic speech than for the natural speech produced in noise. Thus the different pattern of errors obtained for the Prose-2000 and natural speech suggests that the errors produced by the Prose-2000 may be "phonetic miscues" rather than true phonetic confusions. The comparison between natural speech at -10 dB S/N and Votrax speech indicated that the pattern of errors in identifying stops was more similar for these conditions. Indeed, the comparison of identification errors for natural speech at 0 dB and -10 dB S/N was quite similar to the comparison between Votrax and natural speech. Thus, at least for the perception of stop consonants, the confusions of Votrax speech seem to be based on the acoustic-phonetic similarity of the confused segments, as in noisy speech. However, it should be emphasized that the overall performance level for Votrax synthetic speech was quite low to begin with. Therefore, these errors could reflect similarities that occur when performance begins to approach chance levels.

A very different pattern of results was obtained for the errors that occurred in the perception of liquids and glides. The distribution of errors for Prose-2000 speech and natural speech revealed that similar confusions were made for liquids and glides for both types of speech. However, the results were quite different for the errors made with Votrax speech and natural speech for these phonemes. For liquids and glides, the largest number of errors for Votrax speech resulted from confusions with stop consonants, while for natural speech, relatively few stop responses were observed. Thus for liquids and glides, errors in perception of Prose-2000 speech seem to be based on acoustic-phonetic similarity, while the errors for Votrax speech seem to be phonetic miscues.

In summary, based on these confusion analyses, it should be obvious that the predictions made by the noisy speech hypothesis are simply incorrect. Two different types of errors were observed in the perception of synthetic speech. Some consonant identification errors were based on the acoustic-phonetic similarity of the confused segments. Other errors follow a pattern that can only be explained as phonetic miscues; these are errors in which the acoustic cues used in synthesis specified the wrong segment in a particular context.
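The error analysis sketched above amounts to building a stimulus-response confusion matrix and collapsing the error responses into manner classes. The following illustrative Python fragment shows the collapsing step for the consonant set used in the experiment; the trial data are invented, and the manner-class groupings follow standard phonetic categories rather than any coding scheme published in the paper.

    from collections import Counter

    MANNER = {**{c: "stop" for c in "bdgptk"},
              **{c: "nasal" for c in "mn"},
              **{c: "liquid/glide" for c in "rlwj"},
              **{c: "fricative" for c in "sfzv"}}

    def error_distribution(trials, stimulus_class="stop"):
        # trials: (stimulus_consonant, response_consonant) pairs.
        errors = [resp for stim, resp in trials
                  if MANNER[stim] == stimulus_class and resp != stim]
        counts = Counter(MANNER[resp] for resp in errors)
        total = sum(counts.values())
        return {cls: round(100.0 * n / total, 1) for cls, n in counts.items()}

    # Invented confusions: a noise-like pattern keeps stop errors within
    # the stop class; a "phonetic miscue" pattern would scatter them
    # across manner classes instead.
    trials = [("b", "p"), ("d", "t"), ("g", "k"), ("p", "f"), ("t", "t")]
    print(error_distribution(trials))  # {'stop': 75.0, 'fricative': 25.0}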
C. Gating and Signal Duration

The results of the consonant-vowel confusion experiment support the conclusion that the differences in perception between natural and synthetic speech are largely the result of differences in the acoustic-phonetic properties of the signals. More recently, we have found further support for this account using the gating paradigm [16], [47] to investigate the perception of natural and synthetic words. In an experiment carried out recently by Manous and Pisoni [25], listeners were presented with short segments of either natural or synthetic words for identification. On the first trial using a particular word, the first 50 ms of the signal was presented for identification. On subsequent trials, the amount of signal duration was increased in 50-ms steps, so that on the next trial, 100 ms of the word was identified, and on the next trial 150 ms of the word was identified, and so on, until the entire word was presented. Manous and Pisoni found that, on the average, natural words could be identified after 67 percent of a word was heard; for synthetic words, it was necessary for listeners to hear 75 percent of a word for correct word identification. These gating results demonstrate more directly that the acoustic-phonetic structure of synthetic words conveys less information (per unit of time) than the acoustic-phonetic structure of natural speech.
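A convenient summary statistic for gating data is the identification point: the shortest gate at which the listener's response first matches the target word and remains correct at all longer gates, expressed as a fraction of the total word duration. The sketch below assumes that strict "correct from here on" criterion and 50-ms gates, with invented responses; the actual scoring rules in the study may have differed.

    def identification_point(target, responses, step_ms=50):
        # responses[i] is the listener's guess after hearing the first
        # (i + 1) * step_ms of the word; require the guess to be correct
        # from that gate onward.
        for i in range(len(responses)):
            if all(resp == target for resp in responses[i:]):
                return (i + 1) * step_ms
        return None

    responses = ["bat", "bad", "badge", "badge", "badger", "badger"]
    point = identification_point("badger", responses)
    duration_ms = len(responses) * 50  # the full word arrives at the last gate
    print(point, point / duration_ms)  # 250 0.833...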
D. Conclusions: Perceptual Encoding

Taken together, our results provide strong evidence that encoding of the acoustic-phonetic structure of synthetic speech is more difficult and requires more cognitive effort and capacity than encoding natural speech. One source of support for this conclusion comes from the finding that recognition of words and nonwords requires more processing time for synthetic speech compared to natural speech. This result indicates that a major source of difficulty in recognition is the extraction of phonetic information and not word recognition, since the same result was obtained for both words and nonwords. This conclusion is supported further by the findings of the CV confusion study. This experiment demonstrated significant differences in perception of the acoustic-phonetic structure of synthetic and natural speech. Synthetic speech may be viewed as a phonetically impoverished signal compared to natural speech. This was demonstrated clearly in the gating experiment using natural and synthetic speech. The results obtained from this experiment suggest that synthetic speech requires more acoustic-phonetic information to correctly identify isolated monosyllabic words. Taken together, the overall pattern of findings suggests that the differences in processing time between natural and synthetic speech probably lie at processing stages involved in the extraction of basic acoustic-phonetic information from the speech waveform, that is, the early pattern recognition process itself, rather than at the more cognitive levels involved in lexical search or retrieval of words from the mental lexicon (see [43]).

The results obtained in the lexical decision and naming tasks also demonstrate that even with relatively high levels of performance accuracy, synthetic speech requires more cognitive processing time than natural speech to recognize words presented in isolation. In these studies, however, subjects were performing relatively simple and straightforward tasks. As we noted at the outset, the specific task demands of a perceptual experiment almost always affect the speed and accuracy of a listener's response. The next series of experiments we will describe was designed to impose additional cognitive demands on subjects besides those already incurred by the acoustic-phonetic properties of synthetic speech.
V. CAPACITY DEMANDS IN PERCEPTION

Recent work on human selective attention has suggested that cognitive processes are limited by the capacity of short-term or working memory [51]. Thus any perceptual process that imposes a load on short-term memory may interfere with decision making, perceptual processing, and other subsequent cognitive operations. If perception of synthetic speech imposes a greater demand on the capacity of short-term memory than perception of natural speech, then the use of synthetic speech in applications where other complex cognitive operations are required might produce serious problems in recognition of the message.

Several years ago, Luce, Feustel, and Pisoni [24] conducted a series of experiments that were designed to determine the effects of processing synthetic speech on short-term memory capacity. In one experiment, on each trial, subjects were given two different lists of items to remember. The first list consisted of a set of digits visually presented on a CRT screen. On some trials, no digits were presented. On other trials, either three or six digits were presented in the visual display. Following the visual list, subjects were presented with a spoken list of ten natural words or ten synthetic words. After the spoken list was presented, the subjects were instructed to write down all the visual digits in the order of presentation and all the words they could remember from the auditory list.

Across all three visual conditions (no list, three, or six digits), recall of the natural words was significantly better than recall of the synthetic words. In addition, recall of the synthetic and natural words became worse as the size of the digit lists increased. In other words, increasing the number of digits held in short-term memory impaired recall of the spoken words. But the most important finding was the interaction between the type of speech presented (synthetic versus natural) and the number of digits presented (three versus six). This interaction was revealed by the number of subjects who could recall all the digits presented in correct order. As the size of the digit lists increased, significantly fewer subjects were able to recall all the digits for the synthetic words compared to the natural words. Thus perception of the synthetic speech impaired recall of the visually presented digits more with increasing digit list size than did natural speech. These results demonstrate that synthetic speech requires more short-term memory capacity than natural speech. As a result, it would be expected that synthetic speech should interfere much more with other cognitive processes because it imposes greater capacity demands on the human information processing system than natural speech.

To test this prediction, Luce et al. [24] carried out another experiment in which subjects were presented with lists of ten words to be memorized. The lists were either all synthetic or all natural words. The subjects were required to recall the words in the same order as the original presentation. As in the previous experiment, the natural words were recalled better overall than the synthetic words. However, a more detailed analysis revealed an interaction in recall performance depending on the position of items in the list. The first synthetic words heard in the list were recalled much less accurately than the natural words in the beginning of the lists. This result demonstrated that, in the synthetic lists, the words heard later in each list interfered with active rehearsal of the words heard earlier in the list. This is precisely the result that would be expected if the perceptual encoding of the synthetic words placed greater processing demands on short-term memory [46].

The data on serial-ordered recall of lists of natural and synthetic speech support the conclusion from the lexical decision research that the processing of synthetic speech requires more effort than perception of natural speech. The perceptual encoding of synthetic speech requires more cognitive capacity and may, in turn, affect other cognitive processes that require active attentional resources. Previous research on capacity limitations in speech perception demonstrated that paying attention to one spoken message seriously impairs the listener's ability to detect specific words in other spoken messages (e.g., [6], [57]). Moreover, several recent experiments have shown that attending to one message significantly impairs phoneme recognition in a second stream of speech [31]. Taken together, these studies indicate that speech perception requires active attention and cognitive capacity, even at the level of encoding phonemes. As a result, increased processing demands for encoding synthetic speech may place important perceptual and cognitive limitations on the use of voice response systems in high information load conditions or severe environments. This would be especially true in cases where a listener is expected to pay attention to several different sources of information at the same time.
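The critical dependent measure in the digit-preload experiment is the proportion of subjects who reproduce the entire digit list in order, cross-tabulated by speech type and digit load. A sketch of that cross-tabulation with invented subject records:

    # Each record: (speech_type, digit_load, recalled_all_digits_in_order).
    # The records are invented solely to illustrate the tabulation.
    records = [
        ("natural", 3, True), ("natural", 6, True), ("natural", 6, False),
        ("synthetic", 3, True), ("synthetic", 6, False), ("synthetic", 6, False),
    ]

    def perfect_recall_rate(records, speech, load):
        cell = [ok for typ, n, ok in records if typ == speech and n == load]
        return 100.0 * sum(cell) / len(cell)

    for speech in ("natural", "synthetic"):
        for load in (3, 6):
            print(speech, load, perfect_recall_rate(records, speech, load))
    # The interaction shows up as a steeper drop from load 3 to load 6
    # in the synthetic condition than in the natural condition.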
VI. TRAINING AND EXPERIENCE WITH SYNTHETIC SPEECH

The human observer is a very flexible processor of information. With sufficient experience, practice, and specialized training, observers are able to overcome some of the limitations on performance we have observed in our previous studies. Indeed, several researchers (e.g., [7], [44]) have reported a rapid improvement in recognition of synthetic speech during the course of their experiments. These improvements appear to be the result of subjects learning to process the acoustic-phonetic structure of synthetic speech more effectively. However, it is also possible that the reported improvements in intelligibility of synthetic speech were simply due to an increased familiarity with the experimental procedures rather than a real improvement in the perceptual processing of the synthetic speech.

In order to test these alternatives, Schwab, Nusbaum, and Pisoni [49] carried out an experiment to separate the effects of training on task performance from improvements in the recognition of synthetic speech. Three groups of subjects were given a pre-test with synthetic speech on Day 1 and a post-test with synthetic speech on Day 10 of the experiment. The pre-test established baseline performance for the Votrax Type-'N'-Talk text-to-speech system; the post-test on Day 10 was used to determine if any improvements had occurred in recognition of the synthetic speech after training. The low-cost Votrax system was used primarily because of the poor quality of its segmental synthesis. Thus ceiling effects in performance would not obscure any effects of training, and there would be room for improvement to occur during the course of the experiment.

The three groups of subjects were treated differently on Days 2-9. One group received training with Votrax synthetic speech. One group was trained with natural speech using the same words, sentences, and paragraphs as the group trained on synthetic speech; this second group served to control for familiarity with the specific experimental tasks. Finally, a third group received no training at all on Days 2-9. On the pre-test and post-test days, the subjects were given the MRT, isolated phonetically balanced (PB) words, and sentences for transcription. The word lists were taken from PB lists; the sentences consisted of both meaningful and semantically anomalous sentences used in our earlier work. Subjects were given different materials to listen to on every day of the experiment. During all the training sessions (i.e., Days 2-9), subjects were presented with spoken words and sentences, and received feedback indicating the identity of the stimulus presented on each trial.

The results showed that performance improved dramatically for only one group: the subjects that were trained with the Votrax synthetic speech. At the end of training, the Votrax-trained group showed significantly higher levels of performance than either of the other two groups. To take one example, performance in identifying isolated PB words improved for the Votrax-trained group from about 25 percent correct on the pre-test to almost 70 percent correct word recognition on the post-test. Similar improvements were found for all the word identification tasks.

The results of this training study suggest several important conclusions. First, the effects of training appear to be related to improving or modifying the encoding process used to recognize words. Clearly, subjects were not simply learning to perform the various tasks better, since the subjects trained on natural speech showed little or no improvement in performance. Moreover, training affected performance similarly with isolated words and words in sentences, and for closed- and open-response sets. The pattern of results strongly suggests that subjects in the group trained on synthetic speech were not memorizing individual test items, nor were they learning special strategies; that is, they did not learn to use linguistic knowledge or task constraints to improve their recognition performance. Rather, subjects learned something about the structural characteristics of this particular text-to-speech system that enabled them to perform better regardless of the task. This conclusion is further supported by the design of the training study. Improvements in performance were obtained with novel materials, even though the subjects never heard the same words or sentences more than once during the entire experiment. In order to show improvements in performance, subjects must have acquired detailed information and knowledge about the rule system used to generate the synthetic speech. They could not have shown improvements in performance on the post-test if they simply learned to memorize individual words or sentences, since novel materials were used in this test too.

In addition to these findings, we also found that subjects retained the training even after six months with no further contact with the synthetic speech. Thus it appears that training produced a relatively stable and long-term change in the perceptual encoding processes used by subjects. Furthermore, it is likely that more extensive training would have produced even greater persistence of the training effects. If subjects had been trained to asymptotic levels of performance, the long-term effects of training might have been even more stable. The results of this study demonstrate that human listeners can modify their perceptual strategies in encoding synthetic speech and that substantial increases in performance can be realized in relatively short periods of time, even with poor-quality synthetic speech.
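The logic of the design reduces to comparing pre-test to post-test gains across the three groups on the same classes of novel materials. A sketch of that comparison; only the Votrax-trained values (roughly 25 and 70 percent) come from the text, and the other numbers are invented placeholders.

    # (group, phase) -> mean percent correct on novel PB word lists.
    scores = {
        ("votrax-trained",  "pre"): 25.0, ("votrax-trained",  "post"): 70.0,
        ("natural-trained", "pre"): 27.0, ("natural-trained", "post"): 29.0,
        ("no-training",     "pre"): 26.0, ("no-training",     "post"): 27.0,
    }

    for group in ("votrax-trained", "natural-trained", "no-training"):
        gain = scores[(group, "post")] - scores[(group, "pre")]
        print(f"{group}: +{gain:.1f} points pre to post")
    # Only the group trained on the synthesizer shows a large gain,
    # separating perceptual learning from mere task familiarity.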
VII. FUTURE DIRECTIONS

A. Research on Comprehension

Most of the research on synthetic speech produced by text-to-speech systems has, in the past, been concerned with the acoustic-phonetic output generated by these systems (see, however, [21], [44], [49]). Researchers have focused attention on improving the segmental intelligibility of synthetic speech. At this point in time, the available perceptual data suggest that segmental intelligibility is quite good for some systems (DECtalk, Prose, Infovox) and, while it is not at the same level as natural speech, it may take a great deal of additional effort to achieve relatively small gains in improvement. On the other hand, little research effort has been directed toward assessing listening comprehension in a more general sense. Our initial efforts used relatively gross and insensitive measures of comprehension, even though these measures revealed small though reliable differences in comprehension performance between natural and synthetic speech.

Additional research is needed to understand the precise role that practice and familiarity play in comprehension and understanding. As we noted earlier, performance in comprehension tasks improves with experience in listening to the synthetic speech. Additional research should be carried out to deal with issues surrounding the nature of practice and familiarity effects and how the subject's criteria and perceptual strategies are modified after listening to synthetic speech. There are still many questions to be answered: How much practice does a listener need? Does performance with synthetic speech reach the same levels as with natural speech? Does training reduce the capacity demands imposed by synthetic speech? These questions need to be studied in carefully designed laboratory experiments using more sophisticated and sensitive measures of perception and comprehension.

B. On-Line Measures of Linguistic Processing

In order to understand the moment-to-moment demands that occur while listening to fluent synthetic speech, we will need to use on-line measures that tap the real-time computational processes used by listeners to perceive and comprehend fluent speech. The use of phoneme- and word-monitoring tasks, which require listeners to respond while processing the speech input, may provide some insight into the covert processes listeners use to understand synthetic speech (see [11]). Other psycholinguistic tasks such as mispronunciation detection [9] may be useful as well. In these tasks, response latencies are used to measure cognitive processing.
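As an illustration of how such on-line data might be summarized, the sketch below computes mean detection latencies and hit rates for natural versus synthetic materials in a hypothetical word-monitoring task. The trial values are invented and carry no empirical claim; slower latencies for synthetic speech would be read as greater processing load.

    # Summary of a word-monitoring task: subjects press a key when a target
    # word occurs, and response latency indexes moment-to-moment load.
    from statistics import mean, stdev

    trials = [
        # (voice, target_detected, latency_ms)
        ("natural", True, 412), ("natural", True, 398), ("natural", False, None),
        ("synthetic", True, 503), ("synthetic", True, 541), ("synthetic", True, 488),
    ]

    def latency_summary(voice: str):
        hits = [t[2] for t in trials if t[0] == voice and t[1]]
        n = sum(1 for t in trials if t[0] == voice)
        sd = stdev(hits) if len(hits) > 1 else 0.0
        return mean(hits), sd, len(hits) / n

    for voice in ("natural", "synthetic"):
        m, sd, hr = latency_summary(voice)
        print(f"{voice:9s} mean RT = {m:.0f} ms (sd {sd:.0f}), hit rate = {hr:.2f}")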
C. Habituation and Attention to Synthetic Speech

When listening to long passages of synthetic speech, one often experiences difficulty in maintaining focused attention on the linguistic content of the passage. While the results obtained in our listening comprehension tests indicated that subjects did, indeed, comprehend these passages quite well, we do not have any evidence that subjects paid full attention to the passages (see also [21]). We also have subjective reports from other experiments suggesting that subjects are "tuning in" and "fading out" as they listen to long passages of synthetic speech. Is synthetic speech more fatiguing to listen to than natural speech? Can a listener fully comprehend a passage when only part of the passage is processed? How is the listener's attention allocated in listening to synthetic speech compared, for example, to natural speech, and how does it change with other demands on processing capacity in short-term memory? These are all important questions that await further study.

D. Subjective Evaluation and Listener Preference

In addition to the quality of the synthetic speech signal itself, another consideration with respect to the evaluation of synthetic speech concerns the user's preferences and biases. If an individual using a particular text-to-speech system cannot tolerate the sound of the speech or does not trust the information provided by the voice output device, the usefulness of this technology will be reduced. With this goal in mind, we have developed a questionnaire to assess subjective ratings of synthetic speech [36]. Some preliminary data have been collected using various types of stimulus materials and several synthesis systems. In general, we have found that listeners' subjective evaluations of their performance correlate well with objective measures of performance. Also, we have found that the degree to which subjects are willing to trust the information provided by synthetic speech is positively correlated with objective measures of performance. For the naive user, poor performance predicts low levels of belief in the messages, whereas high levels of accuracy predict a greater degree of confidence.
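The relation between subjective and objective measures reported here is, at bottom, a simple correlation. A minimal sketch, assuming one mean rating and one accuracy score per synthesis system (all values invented for illustration):

    # Pearson correlation between mean subjective ratings and objective
    # intelligibility scores across hypothetical synthesis systems.
    from math import sqrt

    ratings  = [2.1, 3.4, 3.9, 4.5, 4.8, 5.6]    # mean subjective rating per system
    accuracy = [0.31, 0.52, 0.61, 0.70, 0.74, 0.88]  # proportion of words correct

    def pearson_r(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        var_x = sum((a - mx) ** 2 for a in x)
        var_y = sum((b - my) ** 2 for b in y)
        return cov / sqrt(var_x * var_y)

    print(f"r = {pearson_r(ratings, accuracy):.2f}")  # near +1 for these data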
E. Research on the Applications of Voice Output Technology

Finally, there are many unanswered questions related to the use of voice response systems in real-world applications. Additional research is needed on the use of synthetic speech in settings where it is already being used or could be used. Except for a few studies reporting the use of synthetic speech in military, business, and industrial settings (see, for example, [17], [52], [56], [61]), most of the reports concerning the use of synthetic speech describe a new or novel application, but they do not evaluate the usefulness or the success or failure of the application.

VIII. SUMMARY AND CONCLUSIONS

Evaluating the use of voice response systems employing synthetic speech is not just a matter of conducting standardized intelligibility tests. Different applications will impose different demands and constraints on observers. Thus it is necessary to consider the five factors that we discussed at the beginning of this paper and determine how they combine and interact to affect human performance. In particular, perceptual and cognitive processes are primarily limited by the capacity of short-term memory. Because the perception of synthetic speech imposes a severe load on short-term memory, it is reasonable to assume that in highly demanding tasks, processing of synthetic speech may impact the performance of other concurrent cognitive operations. Of course, the converse may occur also; that is, performing a demanding task may interfere with the processing of synthetic speech. The human observer is not an interrupt-driven computer that can respond immediately to the presentation of an input signal. During complex cognitive processing, an observer may not be able to make the appropriate response to a speech signal; even worse, the presentation of a synthetic message might not be detected at all under very demanding or life-critical conditions in severe environments.

In highly demanding tasks, it is important to provide messages that maximize redundancy and perceptual distinctiveness [52]. This is where the structure and content of the message set become critical. As the message set becomes simpler (e.g., isolated command words), the perceptual distinctiveness of the messages should be increased accordingly (a toy selection procedure along these lines is sketched at the end of this section). For isolated words, the listener is unable to rely on the linguistic constraints provided by syntax and semantics. Moreover, the discriminability of the message is most important when the quality of phoneme synthesis is poor, because the redundancy of the acoustic structure of the signal is minimized. As a result, more effort may be required to encode the speech. This implies that low-cost synthetic speech should only be used in applications where the task demands are not severe. To take one example, it seems more advisable to use a low-cost speech synthesizer to provide spoken confirmation of database entries than as a voice response system in the cockpit of a jet fighter or helicopter.

Furthermore, it should be recognized that the ability to respond to synthetic speech in very demanding applications cannot be predicted from the results of the traditional forced-choice MRT. In the forced-choice MRT, the listener can utilize the constraints inherent in the task, provided by the restricted set of alternative responses. However, outside the laboratory, the observer is seldom provided with these constraints. There is no simple or direct method of estimating performance in less constrained situations from the results of the forced-choice MRT. Instead, evaluation of voice response systems should be carried out under the same task requirements that are imposed in the intended application. Laboratory studies are designed to establish benchmarks of performance for comparative purposes; to be useful for development and application projects, they need to be supplemented by other relevant data.

From our research on the perception of synthetic speech, we have been able to specify some of the constraints on the use of voice response systems. However, there is still a great deal more research to be done. Basic research is needed to understand the effects of noise and distortion on the perception of synthetic speech in severe environments, how perception is influenced by practice and prior experience, and precisely how perceived naturalness interacts with intelligibility. Now that the technology of automatic text-to-speech conversion has been developed to the point where several products are commercially viable, future research will allow us to understand both the potential and the limitations of voice response systems and how human listeners interact with this new technology.
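As one concrete reading of the distinctiveness principle above, a message set of isolated command words could be chosen greedily so that each word added is maximally different from the words already selected. The sketch below uses string edit distance as a crude stand-in for perceptual (phonetic) distinctiveness; the paper prescribes no such algorithm, and the candidate vocabulary is invented.

    # Greedy max-min selection of a distinct command vocabulary.
    def edit_distance(a: str, b: str) -> int:
        # Standard dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def pick_distinct(candidates, k):
        # Repeatedly add the candidate farthest from every word already chosen.
        chosen = [candidates[0]]
        while len(chosen) < k:
            best = max((w for w in candidates if w not in chosen),
                       key=lambda w: min(edit_distance(w, c) for c in chosen))
            chosen.append(best)
        return chosen

    words = ["start", "stop", "store", "abort", "resume", "confirm", "cancel"]
    print(pick_distinct(words, 4))

A production system would of course substitute a phonetic or perceptual confusability metric (e.g., one derived from consonant confusion data) for raw edit distance.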
REFERENCES

[1] J. Allen, "Reading machines for the blind: The technical problems and the methods adopted for their solution," IEEE Trans. Audio Electroacoust., vol. AU-21, pp. 259-264, 1973.
[2] J. Allen, "Synthesis of speech from unrestricted text," Proc. IEEE, vol. 64, pp. 433-442, 1976.
[3] J. Allen, "Linguistic-based algorithms offer practical text-to-speech systems," Speech Technol., vol. 1, no. 1, pp. 12-16, 1981.
[4] J. Allen, S. Hunnicutt, and D. H. Klatt, Eds., Conversion of Unrestricted Text to Speech (Notes for MIT Summer Course 6.695), July 1979.
[5] J. Bernstein and D. B. Pisoni, "Unlimited text-to-speech device: Description and evaluation of a microprocessor-based system," in 1980 IEEE Int. Conf. Rec. on Acoustics, Speech, and Signal Processing, pp. 576-579, 1980.
[6] J. Bookbinder and E. Osman, "Attentional strategies in dichotic listening," Memory Cogn., vol. 7, pp. 511-520, 1979.
[7] R. Carlson, B. Granström, and K. Larsson, "Evaluation of a text-to-speech system as a reading machine for the blind," Quart. Progr. and Status Rep. STL-QPSR 2-3, Stockholm, Sweden: Royal Inst. Technol., Dep. Speech Commun., 1976.
[8] J. E. Clark, "Intelligibility comparisons for two synthetic and one natural speech source," J. Phonetics, vol. 11, pp. 37-49, 1983.
[9] R. A. Cole and J. Jakimik, "A model of speech perception," in R. A. Cole, Ed., Perception and Production of Fluent Speech. Hillsdale, NJ: Erlbaum, 1980.
[10] Cooperative English Tests: Reading Comprehension, Form 1B. Princeton, NJ: Educational Testing Service, 1960.
[11] A. Cutler and D. Norris, "Monitoring sentence comprehension," in W. E. Cooper and E. C. T. Walker, Eds., Sentence Processing: Psycholinguistic Studies Presented to Merrill Garrett. Hillsdale, NJ: Erlbaum, 1979.
[12] M. Dorman, M. Studdert-Kennedy, and L. Raphael, "Stop consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues," Perception Psychophys., vol. 22, pp. 109-122, 1977.
[13] J. P. Egan, "Articulation testing methods," Laryngoscope, vol. 58, pp. 955-991, 1948.
[14] G. Fairbanks, "Test of phonemic differentiation: The rhyme test," J. Acoust. Soc. Amer., vol. 30, pp. 596-600, 1958.
[15] B. G. Greene, L. M. Manous, and D. B. Pisoni, "Preliminary evaluation of DECtalk," Speech Res. Lab. Tech. Note 84-03, Bloomington, IN: Indiana University, 1984.
[16] F. Grosjean, "Spoken word recognition processes and the gating paradigm," Perception Psychophys., vol. 28, pp. 267-283, 1980.
[17] M. T. Hakkinen and B. H. Williges, "Synthesized warning messages: Effects of an alerting cue in single- and multiple-function voice synthesis systems," Human Factors, vol. 26, pp. 185-195, 1984.
[18] A. S. House, C. E. Williams, M. H. L. Hecker, and K. Kryter, "Articulation-testing methods: Consonantal differentiation with a closed-response set," J. Acoust. Soc. Amer., vol. 37, pp. 158-166, 1965.
[19] F. Ingemann, "Speech synthesis by rule using the FOVE program," Haskins Lab. Status Rep. on Speech Research SR-54, pp. 165-173, 1978.
[20] Iowa Silent Reading Tests, Level 3, Form E. New York: Harcourt Brace Jovanovich, 1972.
[21] J. J. Jenkins and L. D. Franklin, "Recall of passages of synthetic speech," presented at the 22nd Meet. Psychonomic Soc., Philadelphia, PA, Nov. 1981.
[22] B. H. Kantowitz and R. D. Sorkin, Human Factors: Understanding People-System Relationships. New York: Wiley, 1983.
[23] D. H. Klatt, "Timing rules in Klattalk: Implications for models of speech production," J. Acoust. Soc. Amer., Suppl. 1, vol. 73, p. S66, 1983.
[24] P. A. Luce, T. C. Feustel, and D. B. Pisoni, "Capacity demands in short-term memory for synthetic and natural word lists," Human Factors, vol. 25, pp. 17-32, 1983.
[25] L. M. Manous and D. B. Pisoni, "Effects of signal duration on the perception of natural and synthetic speech," in Research on Speech Perception Progress Rep. 10, Bloomington, IN: Speech Res. Lab., Dep. Psychology, Indiana Univ., 1984.
[26] W. Marslen-Wilson, "Function and process in spoken word recognition: A tutorial review," in H. Bouma and D. G. Bouwhuis, Eds., Attention and Performance, vol. X. Hillsdale, NJ: Erlbaum, 1984.
Erlbaum, 1984 C A Miller, C.Hiese, and W Lichten, “The intelligibility of speech as a function of the context of the test materials,” / Experimental Psychol., vol 41, pp 329-335, 1951 C A Miller and S Isard, “Some perceptual consequences of linguistic rules,” Verbal Learning and Verbal Behavior, vol 2, pp 21 7-228, 1963 C A Miller, and P E Nicely, “An analysis of perceptual confusions among some English consonants,” J Acoust SOC Amer., vol 27, pp 338-352, 1955 The Nelson-Denny Reading Test, Form D Boston, MA: Houghton-Mifflin, 1973 H C Nusbaum,”Capacitylimitations in phonemeperception,”unpublisheddoctoraldissertation, SUNYat Buffalo, 1981 zyxwvuts zyxwvutsrq zyxwvut zyxwvutsrqp H C Nusbaum, M J Dedina, and D B Pisoni, ”Perceptual confusions of consonants in natural and synthetic CV syllables,’’SpeechRes.Lab.Tech Note 84-02, Bloomington,IN, Indiana Univ., 1984 H C Nusbaumand D B Pisoni, “Perceptualandcognitive constraints on the use of voice responsesystems,” in Proc 2nd Voice Data Entry Systems Applications Conf (Sunnyvale, CA Lockheed, 1982) -, “Perceptual evaluation of synthetic speech generated by rule,” i n Proc 4th Voice Data EntrySystems Applications Conf (Sunnyvale, CA, Lockheed, 1984) -, “Some constraints on theperceptionofsynthetic speech,” Behavior Res Methods Instrum (in press) H C Nusbaum, E C.Schwab, and D B Pisoni, ”Subjective evaluation of synthetic speech: Measuring preference, naturalness, andacceptability,” Research on Speech Perception Progress Rep IO, Bloomington, IN, SpeechRes.Lab., Indiana Univ., 1984 -, ”Perceptualevaluationofsyntheticspeech: Some constraints on the use ofvoice responsesystems,” in Proc 3rd Voice Data Entry Systems Applications Conf (Sunnyvale, CA, Lockheed, 1983) P W Nyeand J Caitenby,“The intelligibility ofsynthetic monosyllabic words in short, syntactically normal sentences,” Haskins Lab Status Rep on Speech Res., vol 38, pp 169-190, 1974 D B Pisoni “Speech perception,” in W K Estes, Ed., Handbookof Learning andCognitive Processes: Volume Hillsdale, NJ: Erlbaum, 1978 -, “Some measures of intelligibilityand comprehension,’’ i n J Allen, S Hunnicutt, and D H Klatt, Eds., Conversion of Unrestricted Text to Speech, Notes for MIT Summer Course 6.695, July 1979 -, “Speeded classification of natural and synthetic speech i n a lexical decision task,” Acoust SOC Amer vol 70, p S98, 1981 -, “Perceptionof speech:The humanlistener as a cognitive interface,” Speech Tecnol., vol 1, pp IC-23, 1982 -, “Acoustic-phonetic representations in the mental lexicon,” Cognition (in press) 1676 D B Pisoniand S Hunnicutt,”Perceptualevaluation of MITalk: The MIT unrestricted text-to-speech system,” in IEEE Int Conf Rec on Acoustics, Speech, and Signal Processing, pp 572-575, 1980 D B Pisoni and E Koen, “Some comparisons of intelligibility of synthetic and natural speech at different speech-to-noise ratios,” Research on Speech Perception Progress Rep 7, Bloomington, IN, Speech Res Lab., Indiana University, 1981 P M A.Rabbitt,“Channel-capacity, intelligibility and immediatememory,’’ Quart j Experimental Psychol., vol 20, pp 241 -248,1968 A Salasoo and D B Pisoni, “Interaction of knowledge sources i n spoken word identification,” Memory Language, vol 24, pp 210-231,1985, A Schmidt-Nielsen, “Listener preference and comprehension tests of stress algorithms for a text-to-phonetic speech synthesisprogram,”Naval Res.Lab.Rep 8015, Washington, DC, Naval Res.Lab., 1976 E C Schwab, H C Nusbaum, and D B Pisoni,“Effects of training on the perception of synthetic speech, Human Factors (in 
[50] R. M. Shiffrin, "Capacity limitations in information processing, attention, and memory," in W. K. Estes, Ed., Handbook of Learning and Cognitive Processes, vol. 4. Hillsdale, NJ: Erlbaum, 1976.
[51] R. M. Shiffrin and W. Schneider, "Controlled and automatic information processing: II. Perceptual learning, automatic attending, and a general theory," Psychol. Rev., vol. 84, pp. 127-190, 1977.
[52] C. A. Simpson and D. H. Williams, "Response time effects of alerting tone and semantic context for synthesized voice cockpit warnings," Human Factors, vol. 22, pp. 319-330, 1980.
[53] L. M. Slowiaczek and H. C. Nusbaum, "Intelligibility of fluent synthetic sentences: Effects of speech rate, pitch contour, and meaning," J. Acoust. Soc. Amer., vol. 73, p. S103, 1983.
[54] L. M. Slowiaczek and D. B. Pisoni, "Effects of practice on speeded classification of natural and synthetic speech," in Research on Speech Perception Progress Rep. 7, Bloomington, IN: Speech Res. Lab., Indiana University, 1982.
[55] Stanford Test of Academic Skills: Reading (College Level II-A). New York: Harcourt Brace Jovanovich, 1972.
[56] J. C. Thomas, M. B. Rosson, and N. Mellen, "Human factors and synthetic speech," in Proc. 4th Voice Data Entry Systems Applications Conf. Sunnyvale, CA: Lockheed, 1984.
[57] A. M. Treisman and J. G. A. Riley, "Is selective attention selective perception or selective response?" J. Exp. Psychol., vol. 79, pp. 27-34, 1969.
[58] R. Viswanathan, J. Makhoul, and A. W. F. Huggins, "Speech compression and evaluation," BBN Rep. 3794, Cambridge, MA: Bolt Beranek and Newman, 1978.
[59] W. D. Voiers, "Diagnostic evaluation of speech intelligibility," in M. Hawley, Ed., Benchmark Papers in Acoustics, vol. 11. Stroudsburg, PA: Dowden, Hutchinson and Ross, 1977.
[60] W. D. Voiers, "Evaluating processed speech using the Diagnostic Rhyme Test," Speech Technol., vol. 1, no. 4, pp. 30-39, 1984.
[61] J. W. Vorhees and N. M. Bucher, "The integration of voice and visual displays in helicopter cockpits," in Proc. 4th Voice Data Entry Systems Applications Conf. Sunnyvale, CA: Lockheed, 1984.
[62] M. D. Wang and R. C. Bilger, "Consonant confusions in noise: A study of perceptual features," J. Acoust. Soc. Amer., vol. 54, pp. 1248-1266, 1973.
[63] C. D. Wickens, "The structure of attentional resources," in R. S. Nickerson, Ed., Attention and Performance VIII. Hillsdale, NJ: Erlbaum, 1984.
