Speech processing document (in English)

DOCUMENT INFORMATION

Basic information

Format
Pages	477
File size	22.43 MB

Contents

Speech processing lecture notes

Digital Speech Processing, Synthesis, and Recognition

Signal Processing and Communications

Series Editor
K. J. Ray Liu
University of Maryland
College Park, Maryland

Editorial Board
Sadaoki Furui, Tokyo Institute of Technology
Yih-Fang Huang, University of Notre Dame
Aggelos K. Katsaggelos, Northwestern University
Mos Kaveh, University of Minnesota
P. K. Raja Rajasekaran, Texas Instruments
John A. Sorenson, Technical University of Denmark

Digital Signal Processing for Multimedia Systems, edited by Keshab K. Parhi and Takao Nishitani
Multimedia Systems, Standards, and Networks, edited by Atul Puri and Tsuhan Chen
Embedded Multiprocessors: Scheduling and Synchronization, Sundararajan Sriram and Shuvra S. Bhattacharyya
Signal Processing for Intelligent Sensor Systems, David C. Swanson
Compressed Video over Networks, edited by Ming-Ting Sun and Amy R. Reibman
Modulated Coding for Intersymbol Interference Channels, Xiang-Gen Xia
Digital Speech Processing, Synthesis, and Recognition: Second Edition, Revised and Expanded, Sadaoki Furui

Additional Volumes in Preparation
Modern Digital Halftoning, David L. Lau and Gonzalo R. Arce
Blind Equalization and Identification, Zhi Ding and Ye (Geoffrey) Li
Video Coding for Wireless Communications, King N. Ngan, Chi W. Yap, and Keng T. Tan

Digital Speech Processing, Synthesis, and Recognition
Second Edition, Revised and Expanded
Sadaoki Furui
Tokyo Institute of Technology
Tokyo, Japan
MARCEL DEKKER, INC.
NEW YORK, BASEL

Library of Congress Cataloging-in-Publication Data
Furui, Sadaoki.
Digital speech processing, synthesis, and recognition / Sadaoki Furui. 2nd ed., rev. and expanded.
p. cm. (Signal processing and communications; 7)
ISBN 0-8247-0452-5 (alk. paper)
1. Speech processing systems. I. Title. II. Series.
TK7882.S65 F87 2000
006.4'54-dc21
00-060197

This book is printed on acid-free paper.

Headquarters
Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540

Eastern Hemisphere Distribution
Marcel Dekker AG
Hutgasse 4, Postfach 812, CH-4001 Basel, Switzerland
tel: 41-61-261-8482; fax: 41-61-261-8896

World Wide Web
http://www.dekker.com

The publisher offers discounts on this book when ordered in bulk quantities. For more information, write to Special Sales/Professional Marketing at the headquarters address above.

Copyright © 2001 by Marcel Dekker, Inc. All Rights Reserved.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Current printing (last digit): 10 9 8 7 6 5 4 3 2 1

PRINTED IN THE UNITED STATES OF AMERICA

Series Introduction

Over the past 50 years, digital signal processing has evolved as a major engineering discipline. The fields of signal processing have grown from the origin of fast Fourier transform and digital filter design to statistical spectral analysis and array processing, and image, audio, and multimedia processing, and shaped developments in high-performance VLSI signal processor design. Indeed, there are few fields that enjoy so many applications: signal processing is everywhere in our lives.

When one uses a cellular phone, the voice is compressed, coded, and modulated using signal processing techniques. As a cruise missile winds along hillsides searching for the target, the signal processor is busy processing the images taken along the way. When we are watching a movie in HDTV, millions of audio and video data are being sent to our homes and received with unbelievable fidelity. When scientists compare DNA samples, fast pattern recognition techniques are being used. On and on, one can see the impact of signal processing in almost every engineering and scientific discipline.

Because of the immense importance of signal processing and the fast-growing demands of business and industry, this series on signal processing serves to report up-to-date developments and advances in the field. The topics of interest include but are not limited to the following:

Signal theory and analysis
Statistical signal processing
Speech and audio processing
Image and video processing
Multimedia signal processing and technology
Signal processing for communications
Signal processing architectures and VLSI design

I hope this series will provide the interested audience with high-quality, state-of-the-art signal processing literature through research monographs, edited books, and rigorously written textbooks by experts in their fields.

K. J. Ray Liu

Preface to the Second Edition

More than a decade has passed since the first edition of Digital Speech Processing, Synthesis, and Recognition was published. The book has been widely used throughout the world as both a textbook and a reference work. The clear need for such a book stems from the fact that speech is the most natural form of communication among humans and that it also plays an ever more salient role in human-machine communication. Realizing any such system of communication necessitates a clear and thorough understanding of the core technologies of speech processing.

The field of speech processing, synthesis, and recognition has witnessed significant progress in this past decade, spurred by advances in signal processing, algorithms, architectures, and hardware. These advances include: (1) international standardization of various hybrid speech coding techniques, especially CELP, and its widespread use in many applications, such as cellular phones; (2) waveform unit concatenation-based speech synthesis; (3) large-vocabulary continuous-speech recognition based on a statistical pattern recognition paradigm, e.g., hidden Markov models (HMMs) and stochastic language models; (4) increased robustness of speech recognition systems against speech variation, such as speaker-to-speaker variability, noise, and channel distortion; and (5) speaker recognition methods using the HMM technology.

This second edition includes these significant advances and details important emerging technologies. The newly added sections include Robust and Flexible Speech Coding, Corpus-Based Speech Synthesis, Theory and Implementation of HMM, Large-Vocabulary Continuous-Speech Recognition, Speaker-Independent and Adaptive Recognition, and Robust Algorithms Against Noise and Channel Variations. In an effort to retain brevity, older technologies now rarely used in recent systems have been omitted. The basic technology parts of the book have also been rewritten for easier understanding.

It is my hope that users of the first edition, as well as new readers seeking to explore both the fundamental and modern technologies in this increasingly vital field, will benefit from this second edition for many years to come.

Acknowledgments

I am grateful for permission from many organizations and authors to use their copyrighted material in original or adapted form:

Figure 2.5 contains material which is copyright Lawrence Erlbaum Associates, 1986. Used with permission. All rights reserved.
Figure 2.6 contains material which is copyright © Dr. H. Sato, 1975. Reprinted with permission of copyright owner. All rights reserved.
Figures 2.7, 3.8, 4.9, 7.1, 7.4, 7.6, and 7.7 contain material which respectively is copyright 1952, 1980, 1967, 1972, 1980, 1987, and 1987 American Institute of Physics. Reproduced with permission. All rights reserved.
Figures 2.8, 2.9, and 2.10 contain material which is copyright Dr. H. Irii, 1987. Used with permission. All rights reserved.
Figure 2.11 contains material which is copyright © Dr. S. Saito, 1958. Reprinted with permission of copyright owner. All rights reserved.
Figure 3.5 contains material which is copyright © Dr. G. Fant, 1959. Reproduced with permission. All rights reserved.
Figures 3.6, 3.7, 6.6, 6.33, 6.35, and 7.8 contain material which respectively is copyright 1972, 1972, 1975, 1986,

Index

Air Travel Information System (ATIS), 323 A-law, 142 Alexander Graham Bell, Aliasing
distortion, 47 Allophones, 19, 229 All-pole: model, 89 polynomial spectral density function, 90 spectrum, 68 speech production system, 68 Allophonic variations, 320 Amplitude: density distribution function, 21 level, 20 Analog-to-digital (A/D) conversion, 45, 51 Analysis-by-synthesis coder, 196 Analysis-by-synthesis (A-b-S) method, 42, 1, 190 Analysis-synthesis, 73, 135 Antiformant, 30, 127 Antiresonance, 30 circuit, 27, 224 Anti-model, 347 A posteriori probability, 363, 365 Area function, 33, 11 AR process, 91 Arithmetic coding, 134 Articulation, 9, 11, 27, 30 manner of, 12 place of, 12 Articulation equivalent transmission loss (AEN), 201 Articulators, 11 Articulatory model, 223 Articulatory movement, 11 Articulatory organs, 11, 246 Articulatory units, 381 Artificial intelligence (AI), 382 Aspiration, 11 Auditory critical bandwidth, 251 Auditory nerve system, Auditory scene analysis, 385 Audrey, 243 Augmented transition network (ATN), 312 Autocorrelation: function, 52, 53, 251, 252 method, 87, 252 Automation control, 301 Autoregressive (AR) process, 89 Average branching factor, 322

B

Back-propagation training algorithm, 401 Backward prediction error, 102 Backward propagation wave, 33 Backward variable, 285 Bakis model, 279 Band-pass: filter (BPF), 82, 70, 250 bank, 76, 159, 251 lifters, 252 Bark-scale frequency axis, 25 Basilar membrane, 25 Baum-Welch algorithm, 282, 288 Bayes’ rule, 13 Bayes’ sense, 364 Bayesian learning, 335 Beam search method, 31 Bernoulli effect, 10 Best-first method, 1 BIC (Bayesian Information Criterion), 325, 328 Bigram, 316 Binary tree coding (BTC), 178 Blackboard model, 10 Blind equalization, 36 Bottom-up, 308 Boundary condition: at the lips and glottis, 115 for the time warping function, 268 Breadth-first method, 11

C

Cascade connection, 225 Case frame, 312 Centroid, 176, 337, 394 Cepstral analysis, 79 Cepstral coefficient, 62, 77, 251 Cepstral distance (CD), 202 Cepstrum, 62 method, 252 Cepstral mean: normalization (CMN), 325, 341,
361 subtraction (CMS), 341, 361 CHATR, 238 Cholesky decomposition method, 89 City block distance, 269 Claimed speaker, 364 Class N-gram, 317 Clustering, 332 Clustering-based methods, 176 Cluster-splitting method (LBG algorithm), 176, 395 Coarticulation, 16, 245, 378 dynamic model of, 383 Code: vectors, 176, 393 Codebook, 176, 281, 393 Codeword, 279 Code-excited linear predictive coding (CELP), 193 Coding, 45, 47, 199 bit rate, 200 delay, 200 in frequency domain, 159 methods, evaluation of, 199 in time domain, 141 Cohort speakers, 364 Complexity of coder and decoder, 200 Composite sinusoidal model (CSM), 126 Concatenation synthesizer, 238 Connected word recognition, 295 Connection strength, 399 Consonant, Context, 308 Context-dependent phoneme units, 229, 247 Context-free grammar (CFG), 312 Context-oriented-clustering (COC) method, 237 Continuous speech recognition, 246 Conversational speech recognition, 246 Convolution, 387 Convolutional (multiplicative) distortion, 344 Corpus, 314 Corpus-based speech synthesis, 237 Cosh measure, 256 Covariance method, 87 CS-ACELP, 205 Customer (registered speaker), 354 CVC syllable, 228, 247 CV syllable, 228, 247

D

DARPA speech recognition projects, 323 Database for evaluation, 386 Decision criterion (threshold), 356 DECtalk system, 236 Deemphasis, Delayed decision encoding, 173 Delayed feedback effect, Deleted interpolation method, 316 Delta-cepstrum, 262, 363 Delta-delta-cepstrum, 263 Delta modulation (DM or ΔM), 149 Demisyllable, 229, 297 Depth-first method, 1 Detection-based approach, 344 Devocalization, 266 Diaphragm, 10 Differential coding, 148 Differential PCM (DPCM), 145, 148 Differential quantization, 149 Digital filter bank, 70 Digital processing of speech, Digital signal processors (DSPs), 386 Digital-to-analog (D/A) conversion, Digitization, 45 Diphone, 229, 247 Diphthong, 13 Discounting ratio, 17 Discourse, 264 Discrete cosine transform (DCT), 163 Discrete Fourier transform (DFT), 57, 163 Discriminant
analysis, 364 Discriminative training, 293, 347 Distance (similarity) measure, 176, 249 based on LPC, 252 based on nonparametric spectral analysis, 25 Distance normalization, 364 Distinctive features, 20 Distortion rate function, 135 Divergence, 363 Double-SPLIT method, 278 Dual z-transform, 68 Duration, 230, 234, 264 Durbin’s recursive solution method, 89, 105, 108 Dyad, 229, 247 Dynamic characteristics, 367 Dynamic programming (DP), matching, 266, 277, 297 asymmetrical, 270 staggered array, 272, 249 symmetrical, 270 unconstrained endpoint, 270 variations in, 270 method, 287 CW (clockwise), 300 O(n) (order n), 301 OS (one-stage), 301 path, 270 Dynamic spectral features (spectral transition), 262, 367, 378 Dynamic time warping (DTW), 260, 266

E

Ears, EM algorithm, 290 Energy level, 248 Entropy, 322 coding, 133 Equivalent vocabulary size, 322 Error: deletion, 323 insertion, 323 rate, 323 substitution, 323 Euclidean distance, 250 Evaluation: factors for speech coding systems, 199 methods objective, 200 subjective, 200 for speech processing technologies, 385

F

False acceptance (FA), 354 False rejection (FR), 354 Fast Fourier transform (FFT), 57, 251 Feedforward nets, 399 FFT cepstrum, 69 Filler, 305 speech model, 303 Filter bank, 70 Fine structure, 64 Finite state VQ (FSVQ), 182 First-order differential processing, 114 Fixed prediction, 147 F1-F2 plane, 16 Formant, 14, 127 bandwidth, 19 frequency, 14, 39 extraction, Formant-type speech synthesis method, 224 Forward-backward algorithm, 282, 283 Forward and backward waves, 223 Forward prediction error, 102 Forward propagation wave, 33 Forward-type AP-DPCM, 153 Forward variable, 283 Fourier transform, 53 pair (Wiener-Khintchine theorem), 54 Frame, 60 interval, 60 length, 60 F-ratio (inter- to intravariance ratio), 363 Frequency resolution, 60 Frequency spectrum, 52 Fricative, 10 Full search coding (FSC), 178 Fundamental equations, 35 Fundamental frequency (pitch), 10, 24,
79, 230, 351 Fundamental period, 10

G

Gaussian, 29 mixture, 305 mixture model (GMM), 325, 37 Generation rules (rewriting rules), 12 Glottal area, 42 Glottal source, 10 Glottal volume velocity, 42 Glottis, 10 Good-Turing estimation theory, 317 Grammar, 14 Granular noise, 150

H

Hamming window, 58 Hanning window, 58 Hard limiters, 399 Harmonic plus noise model (HNM), 220 Harpy system, 31 Hat theory of intonation, 230 Hearing, Hearsay II system, 10 Hidden layers, 399 Hidden Markov model (HMM), 278 coding, 184 composition, 344, 363 continuous, 279, 290 decomposition, 344 discrete, 279 ergodic, 279, 305 based method, 371 evaluation problem, 282 hidden state sequence uncovering problem, 283 left-to-right, 279 linear predictive, 37 mixture autoregressive (AR), 37 MMI training of, 292 MCE/GPD training of, 292, 335 problems, procedures, semicontinuous, 292 system for word recognition, 293 theory and implementation of, 278 three basic algorithms for, 282 tied mixture, 292 training problem, 283 Hidden nodes, 399 Hierarchy model, 308 High-emphasis filter, 102 Homomorphic analysis, 66 Homomorphic filtering, 66 Homomorphic prediction, 129 Huffman coding, 133 Human-computer dialog systems, 323 Human-computer interaction, 243 Hybrid coding, 135, 187

I

IBM, 325 Impostor, 354 Individual characteristics, 349, 351 Individual differences: acquired, 35 hereditary, 35 Individuality, 246 Information: rate distortion theory, 134, 177 transmission theory, 13 Initial state distribution, 28 Input and output nodes, 399 Integer band sampling, 162 Intelligibility test, 200 Internal thresholds, 399 Interpolation characteristics, 126 Inter-session (temporal) variability, 360 Intonation, 7, 10 component, basic, 230 Intraspeaker variation, 360, 364 Inverse filter, 85, 255 first- or second-order critical damping, 361 Inverse filtering method, 93, 114 Irreversible coding, 133 Island-driven method, 11 Isolated word recognition, 246
Itakura-Saito distance (distortion), 254

J

Jaw,

K

Karhunen-Loeve transform (KLT), 163 Katz's backoff smoothing, 17 Kelly's speech synthesis (production) model, 37, 110 K-means algorithm (Lloyd's algorithm), 176, 394 K-nearest neighbor (KNN) method, 332 Knockout method, 363 Knowledge processing, advanced, 382 Knowledge source, 308, 382

L

Lag window, 252 Language model, 314, 344 Large-vocabulary continuous speech recognition, 306 Larynx, Lattice, 248 filter, 109 diagram, 285 LBG algorithm (cluster-splitting method), 176, 395 LD-CELP, 205 Left-to-right method, 11 Level building (LB) method, 298 Lexicon, 306 Lifter, 77, 261 Liftering, 65 Likelihood, 248, 282 normalization, 364 ratio, 347, 363, 364 LIMSI, 324 Linear delta modulation (LDM), 149 Linearly separable equivalent circuit, 30, 64, 73, 85 Linear PCM, 142 Linear prediction, 2, 83, 145 Linear predictive coding (LPC), 2, 78 analysis, 68, 83, 250, 252 procedure, 86 methods: code-excited, 138 multi-pulse-excited, 138 residual-excited, 138, 187 speech-excited, 138, 187 parameters, mutual relationships between, 127 speech synthesizer, 228 Linear predictor: coefficients, 84 filter, 84 Linear transformation, 335 based on multiple regression analysis, 336 Line spectrum pair (LSP), 16 analysis, 16 principle of, 16 solution of, 119 parameters, 121 coding of, 126 synthesis filter, 122 Linguistic constraints, 246 Linguistic information, 5, 243 Linguistic knowledge, 246 Linguistic science, new, 383 Linguistic units, 38 Lip rounding, 12 Lips, Lloyd's algorithm (K-means algorithm), 176, 394 Local decoder, 145 Locus theory, 229 Log likelihood ratio distance, 255 Log PCM, 142 Lombard effect, 341 Long-term (pitch) prediction, 148, 153 Long-term (term) averaged speech spectrum (LAS), 23, 370 Long-term-statistics-based method, 368 Loss: heat conduction, 32 leaky, 32 viscous, 32 Loudness, 230 LPC: cepstral coefficients, 257 cepstral distance, 257 cepstrum, 69 correlation coefficients, 260
correlation function, 127 LSI for speech processing use, 386 Lungs, 89

M

Markov: chains, 279 sources, 279 Mass conservation equation, 32 Matched filter principle, 197 Matrix quantization (MQ), 138, 182, 337 Maximum a posteriori (MAP), 330 decoding rule, 314 estimates, 335 probability, 13 Maximum likelihood (ML): estimation, 293 method, 70, 254 spectral distance, 254 spectral estimation, 89 formulation of, 89 physical meaning of, 93 MDL (Minimum Description Length) criterion, 325 Mean opinion score (MOS), 200 Mel frequency cepstral coefficient (MFCC), 252 Mel-scale frequency axis, 25 Mimicked voice, 352 Minimum phase impulse response, 77 Minimum residual energy, 256 Mismatches: acoustic, 341 linguistic, 341 MITalk-79 system, 234 Mixed excitation LPC (MELP), 196 Mixture, 290 M-L method, 173 MLLR (maximum likelihood linear regression) method, 325, 330 Models, 244 Modified autocorrelation function, 14, 98, 107 Modified correlation method, 79 Momentum equation, 32 Morph, 234 Morphemes, 17 Morphological analysis, 17 μ-law, 142 Multiband excitation (MBE), 196 Multilayer perceptrons, 399 Multipath search coding, 173 Multiple regression analysis, 336 Multi-pulse-excited LPC (MPC), 189 Multistage processing, 178 Multistage VQ, 179 Multitemplate method, 332 Multivariate autoregression (MAR), 370 Mutual information, 292

N

N-best: based adaptation, 339 hypotheses, 339 results, 12 N-gram language model, 316 Nasal, 11 cavity, Nasalization, 11 Nasalized vowel, 1 Nearest-neighbor selection rule, 394 Network model, 310 Neural net, 399 Neutral vowel, 13 Neyman-Pearson: hypothesis testing formulation, 305 lemma, 347 Noise: additive, 341 shaping, 138, 156 source, 44 threshold, 135 Nonlinear quantization, 138 Nonlinear warping of the spectrum, 335 Nonparametric analysis (NPA), 52 Nonuniform sampling, 266 Nonspeech sounds, 249 Normal equation, 89 Normalized residual energy, 256 Nyquist rate, 47

O

Objective evaluation, 200 Observation probability, 28 distribution, 28
Opinion-equivalent SNR (SNRq), 200 Opinion tests, 200 Optimal (minimum-distortion) quantizer, 394 Oral cavity, Orthogonal polynomial representation, 367 Out-of-vocabulary, 305, 344

P

Pair comparison (A-B test), 200 Parallel connection, 225 Parallel model combination (PMC), 344, 363 Parametric analysis (PA), 52 PARCOR (partial autocorrelation): analysis, 102 formulation of, 102 analysis-synthesis system, 110 coefficient, 102 extraction process, 89 and LPC coefficients, relationship between, 108 synthesis filter, 109 Partial correlator, 107 Peak factor, 21 Peak-weighted distance, 258 Perceiving dynamic signals, 385 Perceptually-based weighting, 192 Perceptual units, 38 Periodogram, 92 Perplexity, 322 log, 322 test-set, 322 Pharynx, Phase equalization, 195 Phone, Phoneme, -6, 247 reference template, 275 Phoneme-based algorithm, 247 Phoneme-based system, 229 Phoneme-based word recognition, 275 Phoneme-like templates, 277 Phoneme context, 238 Phonemic symbol, Phonetic decision tree, 320 Phonetic information, 246 Phonetic invariants, 331 Phonetic symbol, Phonocode method, 184 Phrase component, 230 Physical units, 382 Pitch, 10, 264 error double-, 79 half-, 79 extraction, 78 by correlation processing, 79 by spectral processing, 79 by waveform processing, 79 Pitch-synchronous waveform concatenation, 220 Pitch (long-term) prediction, 148, 153 π-type four-terminal circuits, 223 Plosive, 10 Pole-zero analysis, 127 by maximum likelihood estimation, 130 Polynomial coefficients, 367 Polynomial expansion coefficients, lower order, 262 Positive definiteness, 250 Postfilter, adaptive noise-shaping, 158 Postfiltering, 158 Pragmatics, 264, 308 Preemphasis, 51 Predicate logic, 12 Prediction, 145 error, 102 operators, forward and backward, 106 gain, 147 residual, 141, 145, 256 Predictive coding, 141, 143 Procedural knowledge representation, 12 Production: model, 383 system, 12 Progressing wave model, 32 Prosodic features, 379 control of, 230 Prosodics, 308 Prosody, 264
Pseudophoneme, 277 PSI-CELP, 205 Pulse code modulation (PCM), 138, 141 Pulse generator, 27

Q

Quadrature mirror filter (QMF), 162 Quantization, 47 distortion, 49, 177 error, 49 noise, 49 step size, 47 Quantizing, 45 Quefrency, 64 Quefrency-weighted cepstral distance measure, 262

R

Radiation, 9, 27 Random learning, 176 Rate distortion function, 135 Receiver operating characteristic (ROC) curve, 354 Recognition: speaker, 349 speech, 243 Rectangular window, 58 Reduction, 245 Reference template, 244, 264 Reflection coefficient, 35, 111, 223 Registered speaker (customer), 354 Regression coefficients, 262 Residual: energy, 255 error, 84 signal, 99, 107 Residual-excited LPC vocoder (RELP), 187 Resonance (formant), 30 characteristics, 12 circuit, 27, 224 model, 38 Reversible coding, 133 Rewriting rules (generation rules), 12 Robust algorithms, 339 Robust and flexible speech coding, 21

S

Sampling, 45, 46 frequency, 46 period, 46 Scalar quantization, 177 Search: one-pass, 320 multi-pass, 320 Segment quantization, 138 Segmental k-means training procedure, 295 Segmental SNR (SNRseg), 201 Segmentation, 245 Selective listening, Semantic class, 12 Semantic information, 312 Semantic markers, 312 Semantic net, 312 Semantics, 264, 308 Semivowel, 1 Sentence, hypothesis, 248 Shannon-Fano coding, 133 Shannon’s information source coding theory, 133 Shannon-Someya’s sampling theorem, 46 Sheep and goats phenomenon, 334, 379 Short-term (spectral envelope) prediction, 148 Short-term spectrum, 52 Side information, 143, 156 Sigmoidal nonlinearities, 399 Signal-to-amplitude-correlated noise ratio, 200 Signal-to-quantization noise ratio (SNR), 507 of a PCM signal, 142 Similarity matrix, 277 Similarity (distance) measure, 249 Simplified inverse filter tracking (SIFT) algorithm, 79 Single-path search coding, 175 Sinusoidal transform coder (STC), 196 Slope: constraint, 270 overload distortion, 149 Smaller-than-word units, 248 Soft palate (velum), 10 Sound: pressure, 33 source model, 383
production, 27 spectrogram (voice print), 14, 60, 70, 349 spectrograph, 60 Source, 30 generation, parameter, 78 estimation, 98 from residual signals, 98 Speaker: adaptation, 331, 335 unsupervised, 336 cluster selection, 335 identification, 352 normalization, 331, 334 recognition, 349 algorithms, text-independent, 380 human and computer, 349 methods, 352 principles of, 349 systems: examples of, 366 structure of, 354 text-dependent, 366 text-independent, 368 text-prompted, 373 text-dependent, 352 text-independent, 352 text-prompted, 353 verification, 352 Special-purpose LSIs, 386 Spectral analysis, 52 Spectral clustering, hierarchical, 337 Spectral distance measure, 249 Spectral distortion, 126 Spectral envelope, 52, 64, 351 prediction, 148 Spectral equalization, 114, 361 Spectral equalizer, 102 Spectral fine structure, 52 Spectral parameters, statistical features of, 362 Spectral mapping, 335 Spectral similarity, 249 Speech: acoustic characteristics of, 14 analysis-synthesis system by LPC, 99 chain, coding, 133 principal techniques for, 133 voice dependency in, 380 communication, corpus, 237 database, 237 information processing future directions of, 375 technologies, 375 perception mechanism, clarification of, 384 period detection, 248 principal characteristics of, processing basic units for, 381 technologies, evaluation methods for, 385 production, 5, 27, 383 mechanism, clarification of, 383 ratio, 26 recognition, 243 advantages of, 243 based method, 371 classification of, 246 continuous, 245 conversational, 246 difficulties in, 245 principles of, 243 speaker-adaptive, 330 speaker-dependent, 246 speaker-independent, 246, 330 spectral structure of, 52 statistical characteristics of, 20 synthesis, 13 based on analysis-synthesis method, 216, 221 based on speech production mechanism, 222 based on waveform coding, 216, 217 by HMM, 222 principles of, 213 synthesizer by J Q Stewart, 216 by von Kempelen, 214 understanding, 246 SPLIT method,
277, 333 Spoken language, 385 Spontaneous speech recognition, 344 Stability, 101, 107, 121, 391 Stack algorithm, 311 Standardization of speech coding methods, 199, 203 State transition probability, 28 distribution, 28 State-tying, 320 Stationary Gaussian process, 90 Statistical characteristics, 351 Statistical features, 359 Statistical language modeling, 312, 314 Stochastically excited LPC, 193 Stop consonant, 10 Stress, 7, 264 Sturm-Liouville derivative equation, 38 Subband coding (SBC), 143, 159 Subglottal air pressure, 10 Subjective evaluation, 200 Subword units, 248, 264 Supra-segmental attributes, 264 Syllable, Symmetry, 250 Syntactic information, 312 Syntax, 264, 308 Synthesis by rule, 216, 226 principles of, 226 Synthesized voice quality, 380

T

Talker recognition, 349 Task evaluation, 385 Technique evaluation, 386 Telephone, Templates, 176 Temporal characteristics, 35 Temporal (inter-session) variability, 360, 381 Terminal analog method, 222, 224 Text-to-speech conversion, 231, 234 Threshold logic elements, 399 Tied-mixture models, 18 Tied-state Gaussian-mixture triphone models, 320 Time: and frequency division, 141 resolution, 60 warping function, 267 Time-averaged spectrum, 361 Time-domain harmonic scaling (TDHS) algorithm, 168 Time domain pitch synchronous overlap add (TD-PSOLA) method, 220 Toeplitz matrix, 89 Tokyo Institute of Technology, 328 Tongue, Top-down, 248, 308 Trachea, Training mechanism, 331 Transcription, 243, 246, 323 Transform coding, 141 Transitional cepstral coefficient, 252 Transitional cepstral distance, 262 Transitional distance measure, 263 Transitional features, 378 Transitional logarithmic energy, 263 Tree coding: variable rate (VTRC), 196 Tree search, 178, 11 coding, 173 Tree-trellis algorithm, 12 Trellis: coding, 173, 184 diagram, 285 Trigram, 316 Triphone, 318 Two-level DP matching, 295 Two-mass model, 40

U

Unigram, 316 Units of reference templates/ models, 247 Universal coding, 134 Unsupervised (online)
adaptation, 33 Unvoiced consonant, 11 Unvoiced sound, 11

V

Variable length: coding, 133 VCV syllable, 247 VCV units, 228 Vector PCM (VPCM), 176 Vector quantization (VQ), 141, 173, 278, 279 algorithm, 393 based method, 370 based word recognition, 337 codebook, 337, 370 for linear predictor parameters, 180 principles of, 175 Vector-scalar quantization, 179 Velum (soft palate), 10 VFS (vector-field smoothing), 330 Visual units, 382 Viterbi algorithm, 282, 286, Vocal cord, 10 model, 40 spectrum, 334, 363 vibration waveform, 42 Vocal organ, Vocal tract, analog method, 222, 223 area, estimation based on PARCOR analysis, 110 characteristics, 363 length, 334 transmission function, 38 model, 32 Vocal vibration, 10 Vocoder, 73 baseband, 187 channel, 76 correlation, 77 formant, 77 homomorphic, 77 linear predictive, 78 LSP, 78 maximum likelihood, 78 PARCOR, 78 pattern matching, 77 voice-excited, 187 Vocoder-driven ATC, 166, 188 Voder by H Dudley, 216 Voiced consonant, 11 Voiced sound, 11 Voiced/unvoiced decision, 77, 1, 249 Voice-excited LPC vocoder (VELP), 187 Voice individuality, extraction and normalization of, 379 Voice print, 349 Volume velocity, 33 Vowel, 6, 10 triangle, 16 VQ-based preprocessor, 333 VQ-based word recognition, 337 VSELP, 205

W

Waveform coding, 135 Waveform interpolation (WI), 196 Waveform-based method, 228 Webster's horn equation, 38 Weighted cepstral distance, 260, 370 Weighted distances based on auditory sensitivity, 250 Weighted likelihood ratio (WLR), 258 Weighted slope metric, 262 Whispering, 11 White noise generator, 27 Wiener-Khintchine theorem, 54 Window function, 57 Word, 6, 247 dictionary, 264 lattice, 320 model, 264 recognition, 247 systems, structure of, 264 using phoneme units, 275 spotting, 249, 303 template, 264 World model, 366

Y

Yule-Walker equation, 89

Z

Zero-crossing: analysis, 70 number, 248 rate, 71 Zero-phase impulse response, 77 Z-transform, 68, 387, 388

Posted: 15 December 2017, 14:07