Lawrence R. Rabiner and B.H. Juang, "Speech Recognition by Machine."
2000 CRC Press LLC. <http://www.engnetbase.com>.

Speech Recognition by Machine

Lawrence R. Rabiner
AT&T Labs — Research

B.H. Juang
Bell Laboratories
Lucent Technologies
47.1 Introduction
47.2 Characterization of Speech Recognition Systems
47.3 Sources of Variability of Speech
47.4 Approaches to ASR by Machine
    The Acoustic-Phonetic Approach [1] • "Pattern-Matching" Approach [2] • Artificial Intelligence Approach [3, 4]
47.5 Speech Recognition by Pattern Matching
    Speech Analysis • Pattern Training • Pattern Matching • Decision Strategy • Results of Isolated Word Recognition
47.6 Connected Word Recognition
    Performance of Connected Word Recognizers
47.7 Continuous Speech Recognition
    Sub-Word Speech Units and Acoustic Modeling • Word Modeling From Sub-Word Units • Language Modeling Within the Recognizer • Performance of Continuous Speech Recognizers
47.8 Speech Recognition System Issues
    Robust Speech Recognition [18] • Speaker Adaptation [25] • Keyword Spotting [26] and Utterance Verification [27] • Barge-In
47.9 Practical Issues in Speech Recognition
47.10 ASR Applications
References
47.1 Introduction
Over the past several decades a need has arisen to enable humans to communicate with machines in order to control their actions or to obtain information. Initial attempts at providing human-machine communications led to the development of the keyboard, the mouse, the trackball, the touch screen, and the joystick. However, none of these communication devices provides the richness or the ease of use of speech, which has been the most natural form of communication between humans for tens of centuries. Hence, a need has arisen to provide a voice interface between humans and machines. This need has been met, to a limited extent, by speech processing systems which enable a machine to speak (speech synthesis systems) and which enable a machine to understand (speech recognition systems) human speech. We concentrate on speech recognition systems in this section.

Speech recognition by machine refers to the capability of a machine to convert human speech to a textual form, providing a transcription or interpretation of everything the human speaks while the machine is listening. This capability is required for tasks in which the human is controlling the actions of the machine using only limited speaking capability, e.g., while speaking simple commands or sequences of words from a limited vocabulary (e.g., digit sequences for a telephone number). In
the more general case, usually referred to as speech understanding, the machine need only recognize
a limited subset of the user input speech, namely the speech that specifies enough about the action
requested so that the machine can either respond appropriately, or initiate some action in response
to what was understood.
Speech recognition systems have been deployed in applications ranging from control of desktop
computers, to telecommunication services, to business services, and have achieved varying degrees
of success and commercialization.
In this section we discuss a range of issues involved in the design and implementation of speech
recognition systems.
47.2 Characterization of Speech Recognition Systems
A number of issues define the technology of speech recognition systems. These include:
1. The manner in which a user speaks to the machine. There are generally three modes of
speaking, including:
• isolated word (or phrase) mode in which the user speaks individual words (or
phrases) drawn from a specified vocabulary;
• connected word mode in which the user speaks fluent speech consisting entirely of
words from a specified vocabulary (e.g., telephone numbers);
• continuous speech mode in which the user can speak fluently from a large (often
unlimited) vocabulary.
2. The size of the recognition vocabulary, including:
• small vocabulary systems which provide recognition capability for up to 100 words;
• medium vocabulary systems which provide recognition capability for 100 to 1000 words;
• large vocabulary systems which provide recognition capability for over 1000 words.
3. The knowledge of the user’s speech patterns, including:
• speaker dependent systems which have been custom tailored to each individual
talker;
• speaker independent systems which work on broad populations of talkers, most of
which the system has never encountered or adapted to;
• speaker adaptive systems which customize their knowledge to each individual user
over time while the system is in use.
4. The amount of acoustic and lexical knowledge used in the system, including:
• simple acoustic systems which have no linguistic knowledge;
• systems which integrate acoustic and linguistic knowledge, where the linguistic
knowledge is generally represented via syntactical and semantic constraints on the
output of the recognition system.
5. The degree of dialogue between the human and the machine, including:
• one-way (passive) communication in which each user-spoken input is acted upon;
• system-driven dialog systems in which the system is the sole initiator of a dialog,
requesting information from the user via verbal input;
• natural dialogue systems in which the machine conducts a conversation with the speaker, solicits inputs, acts in response to user inputs, or even tries to clarify ambiguity in the conversation.
47.3 Sources of Variability of Speech
Speech recognition by machine is inherently difficult because of the variability in the signal. Sources
of this variability include:
1. Within-speaker variability in maintaining consistent pronunciation and use of words and phrases.
2. Across-speaker variability due to physiological differences (e.g., different vocal tract lengths), regional accents, foreign languages, etc.
3. Transducer variability while speaking over different microphones/telephone handsets.
4. Variability introduced by the transmission system (the media through which speech is transmitted, telecommunication networks, cellular phones, etc.).
5. Variability in the speaking environment, including extraneous conversations and acoustic background events (e.g., noise, door slams).
47.4 Approaches to ASR by Machine
47.4.1 The Acoustic-Phonetic Approach [1]
The earliest approaches to speech recognition were based on finding speech sounds and providing appropriate labels to these sounds. This is the basis of the acoustic-phonetic approach, which postulates that there exist finite, distinctive phonetic units (phonemes) in spoken language, and that these units are broadly characterized by a set of acoustic properties that are manifest in the speech signal over time. Even though the acoustic properties of phonetic units are highly variable, both with speakers and with neighboring sounds (the so-called coarticulation), it is assumed in the acoustic-phonetic approach that the rules governing the variability are straightforward and can be readily learned by a machine. The first step in the acoustic-phonetic approach is a segmentation and labeling phase in which the speech signal is segmented into stable acoustic regions, followed by attaching one or more phonetic labels to each segmented region, resulting in a phoneme lattice characterization of the speech (see Fig. 47.1). The second step attempts to determine a valid word (or string of words) from the phonetic label sequences produced in the first step. In the validation process, linguistic constraints of the task (i.e., the vocabulary, the syntax, and other semantic rules) are invoked in order to access the lexicon for word decoding based on the phoneme lattice. The acoustic-phonetic approach has not been widely used in commercial applications.
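As a toy illustration of the second (word-decoding) step, the sketch below matches candidate pronunciations from a small lexicon against a hypothetical phoneme lattice. The lattice entries, lexicon, scores, and penalty value are all illustrative assumptions, not taken from the chapter.

```python
# Toy sketch of lexical decoding over a phoneme lattice, as in the
# acoustic-phonetic approach: each segmented region carries candidate
# phone labels with scores, and we search for the vocabulary word whose
# pronunciation best matches one path through the lattice.

# Lattice: one entry per acoustic segment; each entry maps a candidate
# phone label to its (negative log) acoustic score -- lower is better.
lattice = [
    {"s": 0.2, "f": 0.9},          # segment 1
    {"eh": 0.3, "ih": 0.5},        # segment 2
    {"v": 0.4, "b": 1.1},          # segment 3
    {"ax": 0.6, "ah": 0.7},        # segment 4
    {"n": 0.1, "m": 0.8},          # segment 5
]

# Lexicon: word -> phone sequence (one pronunciation per word).
lexicon = {
    "seven": ["s", "eh", "v", "ax", "n"],
    "heaven": ["hh", "eh", "v", "ax", "n"],
}

def word_score(phones, lattice):
    """Score a pronunciation against the lattice; the word must account
    for every segment, and a phone the lattice never proposed for a
    segment incurs a large penalty."""
    if len(phones) != len(lattice):
        return float("inf")            # no segment-level alignment possible
    PENALTY = 5.0                      # illustrative mismatch cost
    return sum(seg.get(p, PENALTY) for p, seg in zip(phones, lattice))

best = min(lexicon, key=lambda w: word_score(lexicon[w], lattice))
print(best, word_score(lexicon[best], lattice))   # -> seven 1.6
```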
47.4.2 “Pattern-Matching” Approach [2]
The "pattern-matching approach" involves two essential steps, namely, pattern training and pattern comparison. The essential feature of this approach is that it uses a well-formulated mathematical framework, and establishes consistent speech pattern representations for reliable pattern comparison from a set of labeled training samples via a formal training algorithm. A speech pattern representation can be in the form of a speech template or a statistical model, and can be applied to a sound (smaller than a word), a word, or a phrase. In the pattern-comparison stage of the approach, a direct comparison is made between the unknown speech (the speech to be recognized) and each possible
FIGURE 47.1: Segmentation and labeling for the word sequence "seven-six".
pattern learned in the training stage, in order to determine the identity of the unknown according to the goodness of match of the patterns. The pattern-matching approach has become the predominant method of speech recognition in the last decade and we shall elaborate on it in subsequent sections.
47.4.3 Artificial Intelligence Approach [3, 4]
The "artificial intelligence approach" attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing, and characterizing speech based on a set of measured acoustic features. Among the techniques used within this class of methods is the use of an expert system (e.g., a neural network) which integrates phonemic, lexical, syntactic, semantic, and even pragmatic knowledge for segmentation and labeling, and which uses tools such as artificial neural networks for learning the relationships among phonetic events. The focus in this approach has been mostly on the representation of knowledge and the integration of knowledge sources. This method has not been used widely in commercial systems.
47.5 Speech Recognition by Pattern Matching
Figure 47.2 is a block diagram that depicts the pattern-matching framework. The speech signal is first analyzed and a feature representation is obtained for comparison with either stored reference templates or statistical models in the pattern matching block. A decision scheme determines the word or phonetic class of the unknown speech based on the matching scores with respect to the stored reference patterns.

There are two types of reference patterns that can be used with the model of Fig. 47.2. The first type, called a nonparametric reference pattern [5] (or often a template), is a pattern created from one or more spoken tokens (exemplars) of the sound associated with the pattern. The second type, called a statistical reference model, is created as a statistical characterization (via a fixed type of model) of the behavior of a collection of tokens of the sound associated with the pattern. The hidden Markov model [6] is an example of the statistical model.
FIGURE 47.2: Block diagram of pattern-recognition speech recognizer.
The model of Fig. 47.2 has been used (either explicitly or implicitly) for almost all commercial and industrial speech recognition systems for the following reasons:
1. It is invariant to different speech vocabularies, user sets, feature sets, pattern matching algorithms, and decision rules.
2. It is easy to implement in software (and hardware).
3. It works well in practice.
We now discuss the elements of the pattern recognition model and show how it has been used in
isolated word, connected word, and continuous speech recognition systems.
47.5.1 Speech Analysis
The purpose of the speech analysis block is to transform the speech waveform into a parsimonious representation which characterizes the time-varying properties of the speech. The transformation is normally done on successive and possibly overlapped short intervals 10 to 30 msec in duration (i.e., short-time analysis) due to the time-varying nature of speech. The representation [7] could be spectral parameters, such as the output from a filter bank, a discrete Fourier transform (DFT), or a linear predictive coding (LPC) analysis, or it could be temporal parameters, such as the locations of various zero or level crossing times in the speech signal.

Empirical knowledge gained over decades of psychoacoustic studies suggests that the power spectrum carries the acoustic information necessary for identifying sounds with high accuracy. Studies in psychoacoustics also suggest that our auditory perception of sound power and loudness involves compression, leading to the use of the logarithmic power spectrum and the cepstrum [8], which is the Fourier transform of the log-spectrum. The low-order cepstral coefficients (up to 10 to 20) provide a parsimonious representation of the short-time speech segment which is usually sufficient for phonetic identification. The cepstral parameters are often augmented by the so-called delta cepstrum [9], which characterizes dynamic aspects of the time-varying speech process.
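To make the analysis chain concrete, here is a minimal numpy sketch of short-time cepstral analysis as just described: frame the waveform, take the log power spectrum, invert to the cepstrum, keep the low-order coefficients, and append a delta cepstrum. The frame sizes, FFT length, coefficient count, and the simple first-difference delta are illustrative choices; practical front ends (e.g., mel-frequency cepstra) add a perceptually motivated filter bank.

```python
import numpy as np

def cepstra(frame, n_cep=12):
    """Real cepstrum of one short-time frame: FFT -> log power -> inverse FFT."""
    w = frame * np.hamming(len(frame))          # taper to reduce spectral leakage
    spec = np.abs(np.fft.rfft(w, n=512)) ** 2   # short-time power spectrum
    log_spec = np.log(spec + 1e-10)             # compression, as in auditory perception
    c = np.fft.irfft(log_spec)                  # cepstrum = IFFT of the log spectrum
    return c[1:n_cep + 1]                       # keep the low-order coefficients

def analyze(signal, fs, frame_ms=25, hop_ms=10, n_cep=12):
    """Slice the waveform into overlapping frames and compute cepstra."""
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    frames = [signal[i:i + flen] for i in range(0, len(signal) - flen, hop)]
    C = np.array([cepstra(f, n_cep) for f in frames])
    # Delta cepstrum: frame-to-frame difference capturing spectral dynamics.
    D = np.vstack([np.zeros(n_cep), np.diff(C, axis=0)])
    return np.hstack([C, D])                    # one feature vector per frame

feats = analyze(np.random.randn(16000), fs=16000)  # 1 s of noise as stand-in speech
print(feats.shape)                                  # (frames, 24)
```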
47.5.2 Pattern Training
Pattern training is the method by which representative sound patterns (for the unit being trained) are converted into reference patterns for use by the pattern matching algorithm. There are several ways in which pattern training can be performed, including:
1. Casual training in which a single sound pattern is used directly to create either a template or a crude statistical model (due to the paucity of data).
2. Robust training in which several (typically 2 to 4) versions of the sound pattern (usually extracted from the speech of a single talker) are used to create a single merged template or statistical model.
3. Clustering training in which a large number of versions of the sound pattern (extracted from a wide range of talkers) is used to create one or more templates or a reliable statistical model of the sound pattern.
In order to better understand how and why statistical models are so broadly used in speech recognition, we now formally define an important class of statistical models, namely the hidden Markov model (HMM) [6].
The HMM
The HMM is a statistical characterization of both the dynamics (time-varying nature) and the statics (the spectral characterization of sounds) of speech during the speaking of a sub-word unit, a word, or even a phrase. The basic premise of the HMM is that a Markov chain can be used to describe the probabilistic nature of the temporal sequence of sounds in speech, i.e., the phonemes in the speech, via a probabilistic state sequence. The states in the sequence are not observed with certainty because the correspondence between linguistic sounds and the speech waveform is probabilistic in nature; hence the concept of a hidden model. Instead, the states manifest themselves through the second component of the HMM, which is a set of output distributions governing the production of the speech features in each state (the spectral characterization of the sounds). In other words, the output distributions (which are observed) represent the local statistical knowledge of the speech pattern within the state, and the Markov chain characterizes, through a set of state transition probabilities, how these sound processes evolve from one sound to another. Integrated together, the HMM is particularly well suited for modeling speech processes.
FIGURE 47.3: Characterization of a word (or phrase, or subword) using an N = 5 state, left-to-right HMM, with continuous observation densities in each state of the model.
An example of an HMM of a speech pattern is shown in Fig. 47.3. The model has five states (corresponding to five distinct "sounds" or "phonemes" within the speech), and the state (corresponding to the sound being spoken) proceeds from left to right (as time progresses). Within each state (assumed to represent a stable acoustical distribution) the spectral features of the speech signal are characterized by a mixture Gaussian density of spectral features (called the observation density), along with an energy distribution and a state duration probability. The states represent the changing temporal nature of the speech signal; hence, indirectly, they represent the speech sounds within the pattern.
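The sketch below encodes the structure just described, assuming a 5-state left-to-right model as in Fig. 47.3, but with a single diagonal Gaussian per state standing in for the mixture density (a mixture would simply add a weighted sum over components). All parameter values are placeholders for an untrained model.

```python
import numpy as np

N, D = 5, 24                     # 5 states (as in Fig. 47.3), 24-dim feature vectors

# Left-to-right transition matrix: each state may repeat (self-loop) or
# advance to the next state; no backward transitions, matching the
# temporal flow of speech. Probabilities here are illustrative.
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i], A[i, i + 1] = 0.6, 0.4
A[N - 1, N - 1] = 1.0            # final state absorbs the remaining frames

# One diagonal Gaussian observation density per state (placeholder values).
means = np.random.randn(N, D)
vars_ = np.ones((N, D))

def log_gauss(x, mu, var):
    """Log density of a diagonal Gaussian, evaluated for every state at once."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

print(log_gauss(np.zeros(D), means, vars_))   # one log-density per state -> shape (5,)
```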
The training problem for HMMs consists of estimating the parameters of the statistical distributions within each state (e.g., means, variances, mixture gains, etc.), along with the state transition probabilities for the composite HMM. Well-established techniques (e.g., the Baum-Welch method [10] or the segmental K-means method [11]) have been defined for doing this pattern training efficiently.
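The following sketch conveys the flavor of such training in its simplest form: uniformly segment each training token across the states and estimate each state's Gaussian from the frames assigned to it, with the self-loop probability set from the average state occupancy. Segmental K-means [11] and Baum-Welch [10] improve on this by realigning frames to states (Viterbi alignment or posterior weighting) and iterating; that refinement is omitted here, and the token data are fabricated.

```python
import numpy as np

def uniform_segment_train(tokens, n_states=5):
    """Crude HMM training sketch: uniformly segment each training token
    across the states, then estimate each state's diagonal Gaussian from
    the frames that landed in it. This is essentially the initialization
    pass of segmental K-means, without the Viterbi realignment loop."""
    buckets = [[] for _ in range(n_states)]
    for tok in tokens:                           # tok: (frames, dim) feature matrix
        edges = np.linspace(0, len(tok), n_states + 1).astype(int)
        for s in range(n_states):
            buckets[s].extend(tok[edges[s]:edges[s + 1]])
    means = np.array([np.mean(b, axis=0) for b in buckets])
    vars_ = np.array([np.var(b, axis=0) + 1e-3 for b in buckets])  # floor variances
    # Self-loop probability from the average frames spent per state:
    # expected geometric duration d implies a_ii = 1 - 1/d.
    stay = np.mean([len(t) for t in tokens]) / n_states
    a_ii = (stay - 1) / stay
    return means, vars_, a_ii

tokens = [np.random.randn(30 + 5 * k, 24) for k in range(4)]  # 4 fake training tokens
means, vars_, a_ii = uniform_segment_train(tokens)
print(means.shape, round(a_ii, 2))               # (5, 24) and the self-loop probability
```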
47.5.3 Pattern Matching
Pattern matching refers to the process of assessing the similarity between two speech patterns, one of which represents the unknown speech and one of which represents the reference pattern (derived from the training process) of each element that can be recognized. When the reference pattern is a "typical" utterance template, pattern matching produces a gross similarity (or dissimilarity) score. When the reference pattern consists of a probabilistic model, such as an HMM, the process of pattern matching is equivalent to using the statistical knowledge contained in the probabilistic model to assess the likelihood of the speech which led to the model being realized as the unknown pattern.
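For the HMM case, this likelihood evaluation is the forward recursion; a log-domain numpy sketch follows, with small illustrative model parameters. Summing the final forward probabilities over all states is a simplification; many systems constrain the path to end in the final state.

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglik(X, A, means, vars_):
    """Forward recursion in the log domain: the total likelihood of the
    frames X under the HMM, implicitly summing over all state alignments
    (so no explicit time warping is needed)."""
    logA = np.log(A + 1e-300)                   # avoid log(0) on forbidden moves
    def log_b(x):                               # per-state diagonal-Gaussian log density
        return -0.5 * np.sum(np.log(2 * np.pi * vars_)
                             + (x - means) ** 2 / vars_, axis=-1)
    alpha = np.full(len(A), -np.inf)
    alpha[0] = log_b(X[0])[0]                   # left-to-right models start in state 0
    for x in X[1:]:
        alpha = logsumexp(alpha[:, None] + logA, axis=0) + log_b(x)
    return logsumexp(alpha)                     # sum over possible ending states

# Hypothetical 3-state left-to-right model scored against 20 random frames;
# in recognition, the unknown token is scored against every word's model
# and the highest-likelihood word wins.
A = np.array([[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
means, vars_ = np.zeros((3, 24)), np.ones((3, 24))
print(forward_loglik(np.random.randn(20, 24), A, means, vars_))
```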
FIGURE 47.4: Results of time aligning two versions of the word "seven", showing linear alignment of the two utterances (top panel); optimal time-alignment path (middle panel); and nonlinearly aligned patterns (lower panel).
A major problem in comparing speech patterns is due to speaking rate variations. HMMs provide an implicit time normalization as part of the process for measuring likelihood. However, for template approaches, explicit time normalization is required. Figure 47.4 demonstrates the effect of explicit time normalization between two patterns representing isolated word utterances. The top panel of the figure shows the log energy contours of the two patterns (for the spoken word "seven"): one called the reference (known) pattern and the other called the test (or unknown input) pattern. It can be seen that the inherent durations of the two patterns, 30 and 35 frames (where each frame is a 15-ms segment of speech), are different, and that linear alignment is grossly inadequate for internally aligning events within the two patterns (compare the locations of the vowel peaks in the two patterns). A basic principle of time alignment is to nonuniformly warp the time scale so as to achieve the best possible matching score between the two patterns (regardless of whether the two patterns are of the same word identity or not). This can be accomplished by a dynamic programming procedure, often called dynamic time warping (DTW) [12] when applied to speech template matching. The "optimal" nonlinear alignment result of dynamic time warping is shown at the bottom of Fig. 47.4, in contrast to the linear alignment of the patterns at the top. It is clear that the nonlinear alignment provides a more realistic measure of similarity between the patterns.
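A minimal DTW sketch in this spirit uses a Euclidean local frame distance and the standard stay/advance/diagonal path moves; the slope constraints and warping windows used in practice [12] are omitted, and the two tokens are random stand-ins.

```python
import numpy as np

def dtw(test, ref):
    """Dynamic time warping: nonuniformly warp the time axes of two
    feature sequences to minimize the accumulated frame distance.
    Returns a length-normalized dissimilarity score (lower = better)."""
    T, R = len(test), len(ref)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, R + 1):
            d = np.sum((test[i - 1] - ref[j - 1]) ** 2)   # local frame distance
            # Allowed moves: diagonal step, or horizontal/vertical stretching.
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T, R] / (T + R)          # normalize by a bound on the path length

# Two fake tokens of different durations (35 vs. 30 frames), as in Fig. 47.4.
test, ref = np.random.randn(35, 24), np.random.randn(30, 24)
print(dtw(test, ref))
```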
47.5.4 Decision Strategy
The decision strategy takes all the matching scores (from the unknown pattern to each of the stored reference patterns) into account, finds the "closest" match, and decides if the quality of the match is good enough to make a recognition decision. If not, the user is asked to provide another token of the speech (e.g., the word or phrase) for another recognition attempt. This is necessary because often the user may speak words that are incorrect in some sense (e.g., hesitation, an incorrectly spoken word, etc.) or simply outside of the vocabulary of the recognition system.
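A decision rule of this kind can be as simple as thresholding the best score and its margin over the runner-up; the sketch below uses made-up threshold and margin values purely for illustration (real systems tune them on held-out data).

```python
def decide(scores, threshold=5.0, margin=1.0):
    """Pick the best-matching reference pattern, but reject (and reprompt
    the user) when the best match is poor or ambiguously close to the
    second-best. Scores are dissimilarities: lower = better match."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    (best, s1), (_, s2) = ranked[0], ranked[1]
    if s1 > threshold or (s2 - s1) < margin:
        return None                    # ask the user to repeat the utterance
    return best

print(decide({"seven": 2.1, "six": 6.3, "one": 7.0}))  # -> 'seven'
print(decide({"seven": 2.1, "six": 2.3, "one": 7.0}))  # -> None (ambiguous)
```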
47.5.5 Results of Isolated Word Recognition
Using the pattern recognition model of Fig. 47.2, and using either the nonparametric template approach or the statistical HMM method to derive reference patterns, a wide variety of tests of the recognizer have been performed on telephone speech with isolated word inputs in both speaker-dependent (SD) and speaker-independent (SI) modes. Vocabulary sizes have ranged from as few as 10 words (i.e., the digits zero–nine) to as many as 1109 words. Table 47.1 gives a summary of recognizer performance under the conditions described above.
TABLE 47.1 Performance of Isolated Word Recognizers

    Vocabulary           Mode   Word error rate (%)
    10 Digits            SI     0.1
                         SD     0.0
    39 Alphadigits       SI     7.0
                         SD     4.5
    129 Airline terms    SI     2.9
                         SD     1.0
    1109 Basic English   SD     4.3
47.6 Connected Word Recognition
The systems we have been describing in previous sections have all been isolated word recognition systems. In this section we consider extensions of the basic processing methods described in previous sections in order to handle recognition of sequences of words, the so-called connected word recognition system.
The basic approach to connected word recognition is shown in Fig. 47.5. Assume we are given a fluently spoken sequence of words, represented by the (unknown) test pattern T, and we are also given a set of V reference patterns, {R_1, R_2, ..., R_V}, each representing one of the words in the vocabulary. The connected word recognition problem consists of finding the concatenated reference pattern, R_S, which best matches the test pattern, in the sense that the overall similarity between T and R_S is maximum over all sequence lengths and over all combinations of vocabulary words.
FIGURE 47.5: Illustration of the problem of matching a connected word string, spoken fluently,
using whole word patterns concatenated together to provide the best match.
There are several problems associated with solving the connected word recognition problem, as formulated above. First of all, we do not know how many words were spoken; hence, we have to consider solutions with a range on the number of words in the utterance. Second, we do not know, nor can we reliably find, word boundaries within the test pattern. Hence, we cannot use word boundary information to segment the problem into simple "word-matching" recognition problems. Finally, since the combinatorics of trying to solve the problem exhaustively (by trying to match every possible string) are exponential in nature, we need to devise efficient algorithms to solve this problem. Such efficient algorithms have been developed; they solve the connected word recognition problem by iteratively building up time-aligned matches between sequences of reference patterns and the unknown test pattern, one frame at a time [13, 14, 15].
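The sketch below is a simplified frame-synchronous one-pass matcher in that spirit: it advances a lattice of (word, reference-frame) hypotheses one test frame at a time, allowing a new word to start wherever some word can end, so neither the word boundaries nor the string length are fixed in advance. The two-template vocabulary and the restricted stay/advance path moves are illustrative simplifications of the published algorithms [13, 14, 15].

```python
import numpy as np

def one_pass(test, refs):
    """One-pass connected-word matching over whole-word templates.
    cost[v][j] is the best accumulated distance of any word string that
    ends at reference frame j of word v after the current test frame;
    each lattice cell also remembers the word string it implies."""
    names = list(refs)
    cost = {v: np.full(len(refs[v]), np.inf) for v in names}
    hist = {v: [()] * len(refs[v]) for v in names}
    end_cost, end_hist = 0.0, ()    # best word-boundary hypothesis so far
    for t in range(len(test)):
        new_cost, new_hist = {}, {}
        for v in names:
            ref = refs[v]
            d = np.sum((test[t] - ref) ** 2, axis=1)  # frame t vs. every ref frame
            nc, nh = np.full(len(ref), np.inf), [()] * len(ref)
            for j in range(len(ref)):
                stay = cost[v][j]                          # repeat ref frame j
                adv = cost[v][j - 1] if j > 0 else np.inf  # advance within the word
                enter = end_cost if j == 0 else np.inf     # start word v at a boundary
                best = min(stay, adv, enter)
                nc[j] = d[j] + best
                nh[j] = (hist[v][j] if best == stay
                         else hist[v][j - 1] if best == adv
                         else end_hist + (v,))
            new_cost[v], new_hist[v] = nc, nh
        cost, hist = new_cost, new_hist
        # Best word ending exactly at frame t seeds the next word at t+1.
        end_cost = min(cost[v][-1] for v in names)
        end_hist = next(hist[v][-1] for v in names if cost[v][-1] == end_cost)
    return end_cost, end_hist

# Hypothetical two-word vocabulary; the test utterance is an exact
# concatenation of the two templates, so the decoded string should be
# ('two', 'one') with zero accumulated distance.
refs = {"one": np.random.randn(8, 24), "two": np.random.randn(9, 24)}
test = np.vstack([refs["two"], refs["one"]])
print(one_pass(test, refs))
```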
47.6.1 Performance of Connected Word Recognizers
Typical recognition performance for connected word recognizers is given in Table 47.2 for a range
of vocabularies, and for a range of associated tasks. In the next section we will see how we exploit
linguistic constraints of the task to improve recognition accuracy for word strings beyond the level
one would expect on the basis of word error rates of the system.
[...]

47.8.1 Robust Speech Recognition [18]

[...] assume no explicit knowledge of the signal environment, while adaptive methods attempt to estimate the adverse condition and adjust the signal (or the reference models) accordingly in order to achieve reliable matching results. When channel or transducer distortions are the major factor, it is convenient to assume that the linear distortion effect appears as an additive signal bias in the cepstral domain. This distortion model leads to the method of cepstral mean subtraction and, more generally, signal bias removal [24] [...]
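Cepstral mean subtraction itself is a one-line operation on the feature matrix; the sketch below assumes cepstral features corrupted by a constant additive bias. Signal bias removal [24] generalizes the idea by estimating the bias against the recognizer's models, which is not shown here.

```python
import numpy as np

def cepstral_mean_subtraction(C):
    """Remove a stationary channel/transducer distortion modeled as an
    additive bias in the cepstral domain: subtract the utterance-level
    mean from every frame's cepstral vector."""
    return C - C.mean(axis=0, keepdims=True)

C = np.random.randn(98, 12) + 3.0   # fake cepstra with a constant channel bias
print(np.abs(cepstral_mean_subtraction(C).mean(axis=0)).max())  # ~0: bias removed
```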
47.8.4 Barge-In

[...] beginning of the system prompt. An echo canceler, with a proper double-talk detector, is used to cancel the system prompt while attempting to detect if the near-end signal from the talker (i.e., the speech to be recognized) is present. The tentatively detected signal is then passed through the recognizer, with rejection thresholds, to produce the partial recognition results. The rejection technique is critical because [...]

47.10 ASR Applications

[...] monitoring of manufacturing processes (e.g., parts inspection) for quality control.
3. Telephone or telecommunications. Applications include automation of operator-assisted services (the Voice Recognition Call Processing system by AT&T to automate operator service routing according to call types), inbound and outbound telemarketing, and information services (the ANSER system by NTT for limited home banking services) [...]

References

[1] [...]
[2] Itakura, F., Minimum prediction residual principle applied to speech recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-23(1), 67–72, Feb. 1975.
[3] Lesser, V.R., Fennell, R.D., Erman, L.D. and Reddy, D.R., Organization of the Hearsay-II speech understanding system, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-23(1), 11–23, 1975.
[4] Lippmann, R., An introduction to computing with neural networks, [...]
[5] [...]
[6] [...]
[7] [...]
[8] Davis, S.B. and Mermelstein, P., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-28(4), 357–366, Aug. 1980.
[9] Furui, S., Speaker independent isolated word recognition using dynamic features of speech spectrum, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-34(1), 52–59, Feb. 1986.
[10] Baum, L.E., Petrie, T., Soules, G. and Weiss, N., A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Statist., 41(1), 164–171, 1970.
[11] Juang, B.H. and Rabiner, L.R., The segmental K-means algorithm for estimating parameters of hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, 38(9), 1639–1641, Sept. 1990.
[12] Sakoe, H. and Chiba, S., Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-26(1), 43–49, Feb. 1978.
[13] Sakoe, H., Two-level DP matching — a dynamic programming-based pattern matching algorithm for connected word recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-27(6), 588–595, Dec. 1979.
[14] Myers, C.S. and Rabiner, L.R., A level building dynamic time warping algorithm for connected word recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-29(3), 351–363, June 1981.
[15] Bridle, J.S., Brown, M.D. and Chamberlain, R.M., An algorithm for connected word recognition, Proc. ICASSP-82, 899–902, May 1982.
[16] Lee, C.H., Rabiner, L.R., [...]
[17] [...]
[18] [...]
[19] Hermansky, H., Morgan, N., Bayya, A. and Kohn, P., RASTA-PLP speech analysis technique, Proc. ICASSP-92, 121–124, 1992.
[20] Mansour, D. and Juang, B.H., The short-time modified coherence representation and noisy speech recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-37(6), 795–804, June 1989.
[21] Ghitza, O., Auditory nerve representation as a front-end for speech recognition in a noisy environment, Comput. Speech Lang., 1(2), 109–130, Dec. 1986.
[22] [...]
[23] [...], IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-37(11), 1659–1671, Nov. 1989.
[24] Rahim, M.G. and Juang, B.H., Signal bias removal for robust telephone speech recognition in adverse environments, Proc. ICASSP-94, Apr. 1994.
[25] Lee, C.-H., Lin, C.-H. and Juang, B.H., A study on speaker adaptation of the parameters of continuous density hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-39(4), 806–814, Apr. 1991.
[26] Wilpon, J.G., Rabiner, L.R., Lee, C.-H. and Goldman, E., Automatic recognition of keywords in unconstrained speech using hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, 38(11), 1870–1878, Nov. 1990.
[27] Rahim, M., Lee, C.-H. and Juang, B.H., Robust utterance verification for connected digit recognition, Proc. ICASSP-95, May 1995.