Lawrence R. Rabiner and B.H. Juang, "Speech Recognition by Machine."
2000 CRC Press LLC. <http://www.engnetbase.com>.

Speech Recognition by Machine

Lawrence R. Rabiner
AT&T Labs — Research

B.H. Juang
Bell Laboratories
Lucent Technologies
47.1 Introduction
47.2 Characterization of Speech Recognition Systems
47.3 Sources of Variability of Speech
47.4 Approaches to ASR by Machine
    The Acoustic-Phonetic Approach [1] • "Pattern-Matching" Approach [2] • Artificial Intelligence Approach [3, 4]
47.5 Speech Recognition by Pattern Matching
    Speech Analysis • Pattern Training • Pattern Matching • Decision Strategy • Results of Isolated Word Recognition
47.6 Connected Word Recognition
    Performance of Connected Word Recognizers
47.7 Continuous Speech Recognition
    Sub-Word Speech Units and Acoustic Modeling • Word Modeling From Sub-Word Units • Language Modeling Within the Recognizer • Performance of Continuous Speech Recognizers
47.8 Speech Recognition System Issues
    Robust Speech Recognition [18] • Speaker Adaptation [25] • Keyword Spotting [26] and Utterance Verification [27] • Barge-In
47.9 Practical Issues in Speech Recognition
47.10 ASR Applications
References
47.1 Introduction
Over the past several decades a need has arisen to enable humans to communicate with machines in order to control their actions or to obtain information. Initial attempts at providing human-machine communications led to the development of the keyboard, the mouse, the trackball, the touch screen, and the joystick. However, none of these communication devices provides the richness or the ease of use of speech, which has been the most natural form of communication between humans for tens of centuries. Hence, a need has arisen to provide a voice interface between humans and machines. This need has been met, to a limited extent, by speech processing systems which enable a machine to speak (speech synthesis systems) and which enable a machine to understand (speech recognition systems) human speech. We concentrate on speech recognition systems in this section.

Speech recognition by machine refers to the capability of a machine to convert human speech to a textual form, providing a transcription or interpretation of everything the human speaks while the machine is listening. This capability is required for tasks in which the human is controlling the actions of the machine using only limited speaking capability, e.g., while speaking simple commands or sequences of words from a limited vocabulary (e.g., digit sequences for a telephone number). In
the more general case, usually referred to as speech understanding, the machine need only recognize
a limited subset of the user input speech, namely the speech that specifies enough about the action
requested so that the machine can either respond appropriately, or initiate some action in response
to what was understood.
Speech recognition systems have been deployed in applications ranging from control of desktop
computers, to telecommunication services, to business services, and have achieved varying degrees
of success and commercialization.
In this section we discuss a range of issues involved in the design and implementation of speech
recognition systems.
47.2 Characterization of Speech Recognition Systems
A number of issues define the technology of speech recognition systems. These include:
1. The manner in which a user speaks to the machine. There are generally three modes of
speaking, including:
• isolated word (or phrase) mode in which the user speaks individual words (or
phrases) drawn from a specified vocabulary;
• connected word mode in which the user speaks fluent speech consisting entirely of
words from a specified vocabulary (e.g., telephone numbers);
• continuous speech mode in which the user can speak fluently from a large (often
unlimited) vocabulary.
2. The size of the recognition vocabulary, including:
• small vocabulary systems which provide recognition capability for up to 100 words;
• medium vocabulary systems which provide recognition capability for 100 to 1000 words;
• large vocabulary systems which provide recognition capability for over 1000 words.
3. The knowledge of the user’s speech patterns, including:
• speaker dependent systems which have been custom tailored to each individual
talker;
• speaker independent systems which work on broad populations of talkers, most of
which the system has never encountered or adapted to;
• speaker adaptive systems which customize their knowledge to each individual user
over time while the system is in use.
4. The amount of acoustic and lexical knowledge used in the system, including:
• simple acoustic systems which have no linguistic knowledge;
• systems which integrate acoustic and linguistic knowledge, where the linguistic
knowledge is generally represented via syntactical and semantic constraints on the
output of the recognition system.
5. The degree of dialogue between the human and the machine, including:
• one-way (passive) communication in which each user-spoken input is acted upon;
• system-driven dialog systems in which the system is the sole initiator of a dialog,
requesting information from the user via verbal input;
• natural dialogue systems in which the machine conducts a conversation with the speaker, solicits inputs, acts in response to user inputs, or even tries to clarify ambiguity in the conversation.
47.3 Sources of Variability of Speech
Speech recognition by machine is inherently difficult because of the variability in the signal. Sources
of this variability include:
1. Within-speaker variability in maintaining consistent pronunciation and use of words and phrases.
2. Across-speaker variability due to physiological differences (e.g., different vocal tract lengths), regional accents, foreign languages, etc.
3. Transducer variability while speaking over different microphones/telephone handsets.
4. Variability introduced by the transmission system (the media through which speech is transmitted, telecommunication networks, cellular phones, etc.).
5. Variability in the speaking environment, including extraneous conversations and acoustic background events (e.g., noise, door slams).
47.4 Approaches to ASR by Machine
47.4.1 The Acoustic-Phonetic Approach [1]
The earliest approaches to speech recognition were based on finding speech sounds and providing appropriate labels to these sounds. This is the basis of the acoustic-phonetic approach, which postulates that there exist finite, distinctive phonetic units (phonemes) in spoken language, and that these units are broadly characterized by a set of acoustic properties that are manifest in the speech signal over time. Even though the acoustic properties of phonetic units are highly variable, both with speakers and with neighboring sounds (the so-called coarticulation), it is assumed in the acoustic-phonetic approach that the rules governing the variability are straightforward and can be readily learned by a machine. The first step in the acoustic-phonetic approach is a segmentation and labeling phase in which the speech signal is segmented into stable acoustic regions, followed by attaching one or more phonetic labels to each segmented region, resulting in a phoneme lattice characterization of the speech (see Fig. 47.1). The second step attempts to determine a valid word (or string of words) from the phonetic label sequences produced in the first step. In the validation process, linguistic constraints of the task (i.e., the vocabulary, the syntax, and other semantic rules) are invoked in order to access the lexicon for word decoding based on the phoneme lattice. The acoustic-phonetic approach has not been widely used in commercial applications.
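As a toy illustration of the second (word-decoding) step, the sketch below matches candidate pronunciations from a small lexicon against a hypothetical phoneme lattice. The lattice entries, lexicon, scores, and penalty value are all illustrative assumptions, not taken from the chapter.

```python
# Toy sketch of lexical decoding over a phoneme lattice, as in the
# acoustic-phonetic approach: each segmented region carries candidate
# phone labels with scores, and we search for the vocabulary word whose
# pronunciation best matches one path through the lattice.

# Lattice: one entry per acoustic segment; each entry maps a candidate
# phone label to its (negative log) acoustic score -- lower is better.
lattice = [
    {"s": 0.2, "f": 0.9},          # segment 1
    {"eh": 0.3, "ih": 0.5},        # segment 2
    {"v": 0.4, "b": 1.1},          # segment 3
    {"ax": 0.6, "ah": 0.7},        # segment 4
    {"n": 0.1, "m": 0.8},          # segment 5
]

# Lexicon: word -> phone sequence (one pronunciation per word).
lexicon = {
    "seven": ["s", "eh", "v", "ax", "n"],
    "heaven": ["hh", "eh", "v", "ax", "n"],
}

def word_score(phones, lattice):
    """Score a pronunciation against the lattice; the word must account
    for every segment, and a phone the lattice never proposed for a
    segment incurs a large penalty."""
    if len(phones) != len(lattice):
        return float("inf")            # no segment-level alignment possible
    PENALTY = 5.0                      # illustrative mismatch cost
    return sum(seg.get(p, PENALTY) for p, seg in zip(phones, lattice))

best = min(lexicon, key=lambda w: word_score(lexicon[w], lattice))
print(best, word_score(lexicon[best], lattice))   # -> seven 1.6
```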
47.4.2 “Pattern-Matching” Approach [2]
The "pattern-matching approach" involves two essential steps, namely, pattern training and pattern comparison. The essential feature of this approach is that it uses a well-formulated mathematical framework, and establishes consistent speech pattern representations for reliable pattern comparison from a set of labeled training samples via a formal training algorithm. A speech pattern representation can be in the form of a speech template or a statistical model, and can be applied to a sound (smaller than a word), a word, or a phrase. In the pattern-comparison stage of the approach, a direct comparison is made between the unknown speech (the speech to be recognized) and each possible
FIGURE 47.1: Segmentation and labeling for the word sequence "seven-six".
pattern learned in the training stage, in order to determine the identity of the unknown according to the goodness of match of the patterns. The pattern-matching approach has become the predominant method of speech recognition in the last decade and we shall elaborate on it in subsequent sections.
47.4.3 Artificial Intelligence Approach [3, 4]
The "artificial intelligence approach" attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing, and characterizing speech based on a set of measured acoustic features. Among the techniques used within this class of methods is the use of an expert system (e.g., a neural network) which integrates phonemic, lexical, syntactic, semantic, and even pragmatic knowledge for segmentation and labeling, and which uses tools such as artificial neural networks for learning the relationships among phonetic events. The focus in this approach has been mostly on the representation of knowledge and the integration of knowledge sources. This method has not been used widely in commercial systems.
47.5 Speech Recognition by Pattern Matching
Figure 47.2 is a block diagram that depicts the pattern-matching framework. The speech signal is first analyzed and a feature representation is obtained for comparison with either stored reference templates or statistical models in the pattern matching block. A decision scheme determines the word or phonetic class of the unknown speech based on the matching scores with respect to the stored reference patterns.

There are two types of reference patterns that can be used with the model of Fig. 47.2. The first type, called a nonparametric reference pattern [5] (or often a template), is a pattern created from one or more spoken tokens (exemplars) of the sound associated with the pattern. The second type, called a statistical reference model, is created as a statistical characterization (via a fixed type of model) of the behavior of a collection of tokens of the sound associated with the pattern. The hidden Markov model [6] is an example of the statistical model.
FIGURE 47.2: Block diagram of pattern-recognition speech recognizer.
The model of Fig. 47.2 has been used (either explicitly or implicitly) for almost all commercial and industrial speech recognition systems for the following reasons:
1. It is invariant to different speech vocabularies, user sets, feature sets, pattern matching algorithms, and decision rules.
2. It is easy to implement in software (and hardware).
3. It works well in practice.
We now discuss the elements of the pattern recognition model and show how it has been used in
isolated word, connected word, and continuous speech recognition systems.
47.5.1 Speech Analysis
The purpose of the speech analysis block is to transform the speech waveform into a parsimonious representation which characterizes the time-varying properties of the speech. The transformation is normally done on successive and possibly overlapped short intervals 10 to 30 msec in duration (i.e., short-time analysis) due to the time-varying nature of speech. The representation [7] could be spectral parameters, such as the output from a filter bank, a discrete Fourier transform (DFT), or a linear predictive coding (LPC) analysis, or it could be temporal parameters, such as the locations of various zero or level crossing times in the speech signal.

Empirical knowledge gained over decades of psychoacoustic studies suggests that the power spectrum carries the acoustic information necessary for identifying sounds with high accuracy. Studies in psychoacoustics also suggest that our auditory perception of sound power and loudness involves compression, leading to the use of the logarithmic power spectrum and the cepstrum [8], which is the Fourier transform of the log-spectrum. The low-order cepstral coefficients (up to 10 to 20) provide a parsimonious representation of the short-time speech segment which is usually sufficient for phonetic identification. The cepstral parameters are often augmented by the so-called delta cepstrum [9], which characterizes dynamic aspects of the time-varying speech process.
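To make the analysis chain concrete, here is a minimal numpy sketch of short-time cepstral analysis as just described: frame the waveform, take the log power spectrum, invert to the cepstrum, keep the low-order coefficients, and append a delta cepstrum. The frame sizes, FFT length, coefficient count, and the simple first-difference delta are illustrative choices; practical front ends (e.g., mel-frequency cepstra) add a perceptually motivated filter bank.

```python
import numpy as np

def cepstra(frame, n_cep=12):
    """Real cepstrum of one short-time frame: FFT -> log power -> inverse FFT."""
    w = frame * np.hamming(len(frame))          # taper to reduce spectral leakage
    spec = np.abs(np.fft.rfft(w, n=512)) ** 2   # short-time power spectrum
    log_spec = np.log(spec + 1e-10)             # compression, as in auditory perception
    c = np.fft.irfft(log_spec)                  # cepstrum = IFFT of the log spectrum
    return c[1:n_cep + 1]                       # keep the low-order coefficients

def analyze(signal, fs, frame_ms=25, hop_ms=10, n_cep=12):
    """Slice the waveform into overlapping frames and compute cepstra."""
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    frames = [signal[i:i + flen] for i in range(0, len(signal) - flen, hop)]
    C = np.array([cepstra(f, n_cep) for f in frames])
    # Delta cepstrum: frame-to-frame difference capturing spectral dynamics.
    D = np.vstack([np.zeros(n_cep), np.diff(C, axis=0)])
    return np.hstack([C, D])                    # one feature vector per frame

feats = analyze(np.random.randn(16000), fs=16000)  # 1 s of noise as stand-in speech
print(feats.shape)                                  # (frames, 24)
```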
47.5.2 Pattern Training
Pattern training is the method by which representative sound patterns (for the unit being trained) are converted into reference patterns for use by the pattern matching algorithm. There are several ways in which pattern training can be performed, including:
1. Casual training in which a single sound pattern is used directly to create either a template or a crude statistical model (due to the paucity of data).
2. Robust training in which several (typically 2 to 4) versions of the sound pattern (usually extracted from the speech of a single talker) are used to create a single merged template or statistical model.
3. Clustering training in which a large number of versions of the sound pattern (extracted from a wide range of talkers) is used to create one or more templates or a reliable statistical model of the sound pattern.
In order to better understand how and why statistical models are so broadly used in speech recognition, we now formally define an important class of statistical models, namely the hidden Markov model (HMM) [6].
The HMM
The HMM is a statistical characterization of both the dynamics (time-varying nature) and the statics (the spectral characterization of sounds) of speech during the speaking of a sub-word unit, a word, or even a phrase. The basic premise of the HMM is that a Markov chain can be used to describe the probabilistic nature of the temporal sequence of sounds in speech, i.e., the phonemes in the speech, via a probabilistic state sequence. The states in the sequence are not observed with certainty because the correspondence between linguistic sounds and the speech waveform is probabilistic in nature; hence the concept of a hidden model. Instead, the states manifest themselves through the second component of the HMM, which is a set of output distributions governing the production of the speech features in each state (the spectral characterization of the sounds). In other words, the output distributions (which are observed) represent the local statistical knowledge of the speech pattern within the state, and the Markov chain characterizes, through a set of state transition probabilities, how these sound processes evolve from one sound to another. Integrated together, the HMM is particularly well suited for modeling speech processes.
FIGURE 47.3: Characterization of a word (or phrase, or subword) using an N = 5 state, left-to-right HMM, with continuous observation densities in each state of the model.
An example of an HMM of a speech pattern is shown in Fig. 47.3. The model has five states (corresponding to five distinct "sounds" or "phonemes" within the speech), and the state (corresponding to the sound being spoken) proceeds from left to right (as time progresses). Within each state (assumed to represent a stable acoustical distribution) the spectral features of the speech signal are characterized by a mixture Gaussian density of spectral features (called the observation density), along with an energy distribution and a state duration probability. The states represent the changing temporal nature of the speech signal; hence, indirectly, they represent the speech sounds within the pattern.
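The sketch below encodes the structure just described, assuming a 5-state left-to-right model as in Fig. 47.3, but with a single diagonal Gaussian per state standing in for the mixture density (a mixture would simply add a weighted sum over components). All parameter values are placeholders for an untrained model.

```python
import numpy as np

N, D = 5, 24                     # 5 states (as in Fig. 47.3), 24-dim feature vectors

# Left-to-right transition matrix: each state may repeat (self-loop) or
# advance to the next state; no backward transitions, matching the
# temporal flow of speech. Probabilities here are illustrative.
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i], A[i, i + 1] = 0.6, 0.4
A[N - 1, N - 1] = 1.0            # final state absorbs the remaining frames

# One diagonal Gaussian observation density per state (placeholder values).
means = np.random.randn(N, D)
vars_ = np.ones((N, D))

def log_gauss(x, mu, var):
    """Log density of a diagonal Gaussian, evaluated for every state at once."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

print(log_gauss(np.zeros(D), means, vars_))   # one log-density per state -> shape (5,)
```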
The training problem for HMMs consists of estimating the parameters of the statistical distributions within each state (e.g., means, variances, mixture gains, etc.), along with the state transition probabilities for the composite HMM. Well-established techniques (e.g., the Baum-Welch method [10] or the segmental K-means method [11]) have been defined for doing this pattern training efficiently.
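The following sketch conveys the flavor of such training in its simplest form: uniformly segment each training token across the states and estimate each state's Gaussian from the frames assigned to it, with the self-loop probability set from the average state occupancy. Segmental K-means [11] and Baum-Welch [10] improve on this by realigning frames to states (Viterbi alignment or posterior weighting) and iterating; that refinement is omitted here, and the token data are fabricated.

```python
import numpy as np

def uniform_segment_train(tokens, n_states=5):
    """Crude HMM training sketch: uniformly segment each training token
    across the states, then estimate each state's diagonal Gaussian from
    the frames that landed in it. This is essentially the initialization
    pass of segmental K-means, without the Viterbi realignment loop."""
    buckets = [[] for _ in range(n_states)]
    for tok in tokens:                           # tok: (frames, dim) feature matrix
        edges = np.linspace(0, len(tok), n_states + 1).astype(int)
        for s in range(n_states):
            buckets[s].extend(tok[edges[s]:edges[s + 1]])
    means = np.array([np.mean(b, axis=0) for b in buckets])
    vars_ = np.array([np.var(b, axis=0) + 1e-3 for b in buckets])  # floor variances
    # Self-loop probability from the average frames spent per state:
    # expected geometric duration d implies a_ii = 1 - 1/d.
    stay = np.mean([len(t) for t in tokens]) / n_states
    a_ii = (stay - 1) / stay
    return means, vars_, a_ii

tokens = [np.random.randn(30 + 5 * k, 24) for k in range(4)]  # 4 fake training tokens
means, vars_, a_ii = uniform_segment_train(tokens)
print(means.shape, round(a_ii, 2))               # (5, 24) and the self-loop probability
```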
47.5.3 Pattern Matching
Pattern matching refers to the process of assessing the similarity between two speech patterns, one of which represents the unknown speech and one of which represents the reference pattern (derived from the training process) of each element that can be recognized. When the reference pattern is a "typical" utterance template, pattern matching produces a gross similarity (or dissimilarity) score. When the reference pattern consists of a probabilistic model, such as an HMM, the process of pattern matching is equivalent to using the statistical knowledge contained in the probabilistic model to assess the likelihood of the speech which led to the model being realized as the unknown pattern.
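For the HMM case, this likelihood evaluation is the forward recursion; a log-domain numpy sketch follows, with small illustrative model parameters. Summing the final forward probabilities over all states is a simplification; many systems constrain the path to end in the final state.

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglik(X, A, means, vars_):
    """Forward recursion in the log domain: the total likelihood of the
    frames X under the HMM, implicitly summing over all state alignments
    (so no explicit time warping is needed)."""
    logA = np.log(A + 1e-300)                   # avoid log(0) on forbidden moves
    def log_b(x):                               # per-state diagonal-Gaussian log density
        return -0.5 * np.sum(np.log(2 * np.pi * vars_)
                             + (x - means) ** 2 / vars_, axis=-1)
    alpha = np.full(len(A), -np.inf)
    alpha[0] = log_b(X[0])[0]                   # left-to-right models start in state 0
    for x in X[1:]:
        alpha = logsumexp(alpha[:, None] + logA, axis=0) + log_b(x)
    return logsumexp(alpha)                     # sum over possible ending states

# Hypothetical 3-state left-to-right model scored against 20 random frames;
# in recognition, the unknown token is scored against every word's model
# and the highest-likelihood word wins.
A = np.array([[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
means, vars_ = np.zeros((3, 24)), np.ones((3, 24))
print(forward_loglik(np.random.randn(20, 24), A, means, vars_))
```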
FIGURE 47.4: Results of time aligning two versions of the word "seven", showing linear alignment of the two utterances (top panel); optimal time-alignment path (middle panel); and nonlinearly aligned patterns (lower panel).
A major problem in comparing speech patterns is due to speaking rate variations. HMMs provide an implicit time normalization as part of the process for measuring likelihood. However, for template approaches, explicit time normalization is required. Figure 47.4 demonstrates the effect of explicit time normalization between two patterns representing isolated word utterances. The top panel of the figure shows the log energy contours of the two patterns (for the spoken word "seven"): one called the reference (known) pattern and the other called the test (or unknown input) pattern. It can be seen that the inherent durations of the two patterns, 30 and 35 frames (where each frame is a 15-ms segment of speech), are different, and that linear alignment is grossly inadequate for internally aligning events within the two patterns (compare the locations of the vowel peaks in the two patterns). A basic principle of time alignment is to nonuniformly warp the time scale so as to achieve the best possible matching score between the two patterns (regardless of whether the two patterns are of the same word identity or not). This can be accomplished by a dynamic programming procedure, often called dynamic time warping (DTW) [12] when applied to speech template matching. The "optimal" nonlinear alignment result of dynamic time warping is shown at the bottom of Fig. 47.4, in contrast to the linear alignment of the patterns at the top. It is clear that the nonlinear alignment provides a more realistic measure of similarity between the patterns.
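A minimal DTW sketch in this spirit uses a Euclidean local frame distance and the standard stay/advance/diagonal path moves; the slope constraints and warping windows used in practice [12] are omitted, and the two tokens are random stand-ins.

```python
import numpy as np

def dtw(test, ref):
    """Dynamic time warping: nonuniformly warp the time axes of two
    feature sequences to minimize the accumulated frame distance.
    Returns a length-normalized dissimilarity score (lower = better)."""
    T, R = len(test), len(ref)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, R + 1):
            d = np.sum((test[i - 1] - ref[j - 1]) ** 2)   # local frame distance
            # Allowed moves: diagonal step, or horizontal/vertical stretching.
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T, R] / (T + R)          # normalize by a bound on the path length

# Two fake tokens of different durations (35 vs. 30 frames), as in Fig. 47.4.
test, ref = np.random.randn(35, 24), np.random.randn(30, 24)
print(dtw(test, ref))
```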
47.5.4 Decision Strategy
The decision strategy takes all the matching scores (from the unknown pattern to each of the stored reference patterns) into account, finds the "closest" match, and decides if the quality of the match is good enough to make a recognition decision. If not, the user is asked to provide another token of the speech (e.g., the word or phrase) for another recognition attempt. This is necessary because often the user may speak words that are incorrect in some sense (e.g., hesitation, an incorrectly spoken word, etc.) or simply outside of the vocabulary of the recognition system.
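A decision rule of this kind can be as simple as thresholding the best score and its margin over the runner-up; the sketch below uses made-up threshold and margin values purely for illustration (real systems tune them on held-out data).

```python
def decide(scores, threshold=5.0, margin=1.0):
    """Pick the best-matching reference pattern, but reject (and reprompt
    the user) when the best match is poor or ambiguously close to the
    second-best. Scores are dissimilarities: lower = better match."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    (best, s1), (_, s2) = ranked[0], ranked[1]
    if s1 > threshold or (s2 - s1) < margin:
        return None                    # ask the user to repeat the utterance
    return best

print(decide({"seven": 2.1, "six": 6.3, "one": 7.0}))  # -> 'seven'
print(decide({"seven": 2.1, "six": 2.3, "one": 7.0}))  # -> None (ambiguous)
```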
47.5.5 Results of Isolated Word Recognition
Using the pattern recognition model of Fig. 47.2, and using either the nonparametric template approach or the statistical HMM method to derive reference patterns, a wide variety of tests of the recognizer have been performed on telephone speech with isolated word inputs in both speaker-dependent (SD) and speaker-independent (SI) modes. Vocabulary sizes have ranged from as few as 10 words (i.e., the digits zero–nine) to as many as 1109 words. Table 47.1 gives a summary of recognizer performance under the conditions described above.
TABLE 47.1 Performance of Isolated Word Recognizers

    Vocabulary           Mode   Word error rate (%)
    10 Digits            SI     0.1
                         SD     0.0
    39 Alphadigits       SI     7.0
                         SD     4.5
    129 Airline terms    SI     2.9
                         SD     1.0
    1109 Basic English   SD     4.3
47.6 Connected Word Recognition
The systems we have been describing in previous sections have all been isolated word recognition systems. In this section we consider extensions of the basic processing methods described in previous sections in order to handle recognition of sequences of words, the so-called connected word recognition system.
The basic approach to connected word recognition is shown in Fig. 47.5. Assume we are given a fluently spoken sequence of words, represented by the (unknown) test pattern T, and we are also given a set of V reference patterns, {R_1, R_2, ..., R_V}, each representing one of the words in the vocabulary. The connected word recognition problem consists of finding the concatenated reference pattern, R_S, which best matches the test pattern, in the sense that the overall similarity between T and R_S is maximum over all sequence lengths and over all combinations of vocabulary words.
FIGURE 47.5: Illustration of the problem of matching a connected word string, spoken fluently,
using whole word patterns concatenated together to provide the best match.
There are several problems associated with solving the connected word recognition problem, as formulated above. First of all, we do not know how many words were spoken; hence, we have to consider solutions with a range on the number of words in the utterance. Second, we do not know, nor can we reliably find, word boundaries within the test pattern. Hence, we cannot use word boundary information to segment the problem into simple "word-matching" recognition problems. Finally, since the combinatorics of trying to solve the problem exhaustively (by trying to match every possible string) are exponential in nature, we need to devise efficient algorithms to solve this problem. Such efficient algorithms have been developed; they solve the connected word recognition problem by iteratively building up time-aligned matches between sequences of reference patterns and the unknown test pattern, one frame at a time [13, 14, 15].
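The sketch below is a simplified frame-synchronous one-pass matcher in that spirit: it advances a lattice of (word, reference-frame) hypotheses one test frame at a time, allowing a new word to start wherever some word can end, so neither the word boundaries nor the string length are fixed in advance. The two-template vocabulary and the restricted stay/advance path moves are illustrative simplifications of the published algorithms [13, 14, 15].

```python
import numpy as np

def one_pass(test, refs):
    """One-pass connected-word matching over whole-word templates.
    cost[v][j] is the best accumulated distance of any word string that
    ends at reference frame j of word v after the current test frame;
    each lattice cell also remembers the word string it implies."""
    names = list(refs)
    cost = {v: np.full(len(refs[v]), np.inf) for v in names}
    hist = {v: [()] * len(refs[v]) for v in names}
    end_cost, end_hist = 0.0, ()    # best word-boundary hypothesis so far
    for t in range(len(test)):
        new_cost, new_hist = {}, {}
        for v in names:
            ref = refs[v]
            d = np.sum((test[t] - ref) ** 2, axis=1)  # frame t vs. every ref frame
            nc, nh = np.full(len(ref), np.inf), [()] * len(ref)
            for j in range(len(ref)):
                stay = cost[v][j]                          # repeat ref frame j
                adv = cost[v][j - 1] if j > 0 else np.inf  # advance within the word
                enter = end_cost if j == 0 else np.inf     # start word v at a boundary
                best = min(stay, adv, enter)
                nc[j] = d[j] + best
                nh[j] = (hist[v][j] if best == stay
                         else hist[v][j - 1] if best == adv
                         else end_hist + (v,))
            new_cost[v], new_hist[v] = nc, nh
        cost, hist = new_cost, new_hist
        # Best word ending exactly at frame t seeds the next word at t+1.
        end_cost = min(cost[v][-1] for v in names)
        end_hist = next(hist[v][-1] for v in names if cost[v][-1] == end_cost)
    return end_cost, end_hist

# Hypothetical two-word vocabulary; the test utterance is an exact
# concatenation of the two templates, so the decoded string should be
# ('two', 'one') with zero accumulated distance.
refs = {"one": np.random.randn(8, 24), "two": np.random.randn(9, 24)}
test = np.vstack([refs["two"], refs["one"]])
print(one_pass(test, refs))
```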
47.6.1 Performance of Connected Word Recognizers
Typical recognition performance for connected word recognizers is given in Table 47.2 for a range
of vocabularies, and for a range of associated tasks. In the next section we will see how we exploit
linguistic constraints of the task to improve recognition accuracy for word strings beyond the level
one would expect on the basis of word error rates of the system.
[...]

47.8.1 Robust Speech Recognition [18]

[...] assume no explicit knowledge of the signal environment, while adaptive methods attempt to estimate the adverse condition and adjust the signal (or the reference models) accordingly in order to achieve reliable matching results. When channel or transducer distortions are the major factor, it is convenient to assume that the linear distortion effect appears as an additive signal bias in the cepstral domain. This distortion model leads to the method of cepstral mean subtraction and, more generally, signal bias removal [24] [...]
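Cepstral mean subtraction itself is a one-line operation on the feature matrix; the sketch below assumes cepstral features corrupted by a constant additive bias. Signal bias removal [24] generalizes the idea by estimating the bias against the recognizer's models, which is not shown here.

```python
import numpy as np

def cepstral_mean_subtraction(C):
    """Remove a stationary channel/transducer distortion modeled as an
    additive bias in the cepstral domain: subtract the utterance-level
    mean from every frame's cepstral vector."""
    return C - C.mean(axis=0, keepdims=True)

C = np.random.randn(98, 12) + 3.0   # fake cepstra with a constant channel bias
print(np.abs(cepstral_mean_subtraction(C).mean(axis=0)).max())  # ~0: bias removed
```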
47.8.4 Barge-In

[...] beginning of the system prompt. An echo canceler, with a proper double-talk detector, is used to cancel the system prompt while attempting to detect if the near-end signal from the talker (i.e., the speech to be recognized) is present. The tentatively detected signal is then passed through the recognizer, with rejection thresholds, to produce the partial recognition results. The rejection technique is critical because [...]

47.10 ASR Applications

[...] monitoring of manufacturing processes (e.g., parts inspection) for quality control.
3. Telephone or telecommunications. Applications include automation of operator-assisted services (the Voice Recognition Call Processing system by AT&T to automate operator service routing according to call types), inbound and outbound telemarketing, and information services (the ANSER system by NTT for limited home banking services) [...]

References

[1] [...]
[2] Itakura, F., Minimum prediction residual principle applied to speech recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-23(1), 67–72, Feb. 1975.
[3] Lesser, V.R., Fennell, R.D., Erman, L.D. and Reddy, D.R., Organization of the Hearsay-II speech understanding system, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-23(1), 11–23, 1975.
[4] Lippmann, R., An introduction to computing with neural networks, [...]
[5] [...]
[6] [...]
[7] [...]
[8] Davis, S.B. and Mermelstein, P., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-28(4), 357–366, Aug. 1980.
[9] Furui, S., Speaker independent isolated word recognition using dynamic features of speech spectrum, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-34(1), 52–59, Feb. 1986.
[10] Baum, L.E., Petrie, T., Soules, G. and Weiss, N., A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Statist., 41(1), 164–171, 1970.
[11] Juang, B.H. and Rabiner, L.R., The segmental K-means algorithm for estimating parameters of hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, 38(9), 1639–1641, Sept. 1990.
[12] Sakoe, H. and Chiba, S., Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-26(1), 43–49, Feb. 1978.
[13] Sakoe, H., Two-level DP matching — a dynamic programming-based pattern matching algorithm for connected word recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-27(6), 588–595, Dec. 1979.
[14] Myers, C.S. and Rabiner, L.R., A level building dynamic time warping algorithm for connected word recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-29(3), 351–363, June 1981.
[15] Bridle, J.S., Brown, M.D. and Chamberlain, R.M., An algorithm for connected word recognition, Proc. ICASSP-82, 899–902, May 1982.
[16] Lee, C.H., Rabiner, L.R., [...]
[17] [...]
[18] [...]
[19] Hermansky, H., Morgan, N., Bayya, A. and Kohn, P., RASTA-PLP speech analysis technique, Proc. ICASSP-92, 121–124, 1992.
[20] Mansour, D. and Juang, B.H., The short-time modified coherence representation and noisy speech recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-37(6), 795–804, June 1989.
[21] Ghitza, O., Auditory nerve representation as a front-end for speech recognition in a noisy environment, Comput. Speech Lang., 1(2), 109–130, Dec. 1986.
[22] [...]
[23] [...], IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-37(11), 1659–1671, Nov. 1989.
[24] Rahim, M.G. and Juang, B.H., Signal bias removal for robust telephone speech recognition in adverse environments, Proc. ICASSP-94, Apr. 1994.
[25] Lee, C.-H., Lin, C.-H. and Juang, B.H., A study on speaker adaptation of the parameters of continuous density hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-39(4), 806–814, Apr. 1991.
[26] Wilpon, J.G., Rabiner, L.R., Lee, C.-H. and Goldman, E., Automatic recognition of keywords in unconstrained speech using hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, 38(11), 1870–1878, Nov. 1990.
[27] Rahim, M., Lee, C.-H. and Juang, B.H., Robust utterance verification for connected digit recognition, Proc. ICASSP-95, May 1995.