Furui, S. & Rosenberg, A.E. “Speaker Verification”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
© 1999 by CRC Press LLC
48
Speaker Verification
Sadaoki Furui
Tokyo Institute of Technology
Aaron E. Rosenberg
AT&T Labs — Research
48.1 Introduction
48.2 Personal Identity Characteristics
48.3 Vocal Personal Identity Characteristics
48.4 Basic Elements of a Speaker Recognition System
48.5 Extracting Speaker Information from the Speech Signal
48.6 Feature Similarity Measurements
48.7 Units of Speech for Representing Speakers
48.8 Input Modes
Text-Dependent (Fixed Passwords) • Text Independent (No Specified Passwords) • Text Dependent (Randomly Prompted Passwords)
48.9 Representations
Representations That Preserve Temporal Characteristics • Representations That Do Not Preserve Temporal Characteristics
48.10 Optimizing Criteria for Model Construction
48.11 Model Training and Updating
48.12 Signal Feature and Score Normalization Techniques
Signal Feature Normalization • Likelihood and Normalized Scores • Cohort or Speaker Background Models
48.13 Decision Process
Specifying Decision Thresholds and Measuring Performance • ROC Curves • Adaptive Thresholds • Sequential Decisions (Multi-Attempt Trials)
48.14 Outstanding Issues
Defining Terms
References
48.1 Introduction
Speaker recognition is the process of automatically extracting personal identity information by analysis of spoken utterances. In this section, speaker recognition is taken to be a general process whereas speaker identification and speaker verification refer to specific tasks or decision modes associated with this process. Speaker identification refers to the task of determining who is speaking and speaker verification is the task of validating a speaker's claimed identity.
Many applications have been considered for automatic speaker recognition. These include secure access control by voice, customizing services or information to individuals by voice, indexing or labeling speakers in recorded conversations or dialogues, surveillance, and criminal and forensic investigations involving recorded voice samples. Currently, the most frequently mentioned application
is access control. Access control applications include voice dialing, banking transactions over a tele-
phone network, telephone shopping, database access services, information and reservation services,
voice mail, and remote access to computers. Speaker recognition technology, as such, is expected
to create new services and make our daily lives more convenient. Another potentially important
application of speaker recognition technology is its use for forensic purposes [24].
For access control and other important applications, speaker recognition operates in a speaker verification task decision mode. For this reason the section is entitled speaker verification. However, the term speaker recognition is used frequently in this section when referring to general processes.
This section is not intended to be a comprehensive review of speaker recognition technology. Rather, it is intended to give an overview of recent advances and the problems that must be solved in the future. The reader is referred to papers by Doddington [4], Furui [10, 11, 12, 13], O'Shaughnessy [39], and Rosenberg and Soong [48] for more general reviews.
48.2 Personal Identity Characteristics
A universal human faculty is the ability to distinguish one person from another by personal identity characteristics. The most prominent of these characteristics are facial and vocal features. Organized, scientific efforts to make use of personal identifying characteristics for security and forensic purposes
began about 100 years ago. The most successful such effort was fingerprint classification which has
gained widespread use in forensic investigations.
Today, there is a rapidly growing technology based on biometrics, the measurement of human physiological or behavioral characteristics, for the purpose of identifying individuals or verifying the claimed or asserted identity of an individual [34]. The goal of these technological efforts is to produce completely automated systems for personal identity identification or verification that are convenient
to use and offer high performance and reliability. Some of the personal identity characteristics
which have received serious attention are blood typing, DNA analysis, hand shape, retinal and iris
patterns, and signatures, in addition to fingerprints, facial features, and voice characteristics. In
general, characteristics that are subject to the least amount of contamination or distortion and variability provide the greatest accuracy and reliability. Difficulties arise, for example, with smudged fingerprints, inconsistent signature handwriting, recording and channel distortions, and inconsistent speaking behavior for voice characteristics. Indeed, behavioral characteristics, intrinsic to signature and voice features, although potentially an important source of identifying information, are also subject to large amounts of variability from one sample to another.
The demand for effective biometric techniques for personal identity verification comes from forensic and security applications. For security applications, especially, there is a great need for techniques that are not intrusive, that are convenient and efficient, and are fully automated. For these reasons, techniques such as signature verification or speaker verification are attractive even if they are subject to more sources of variability than other techniques. Speaker verification, in addition, is particularly useful for remote access, since voice characteristics are easily recorded and transmitted over telephone lines.
48.3 Vocal Personal Identity Characteristics
Both physiology and behavior underlie personal identity characteristics of the voice. Physiological correlates are associated with the size and configuration of the components of the vocal tract (see Fig. 48.1).
For example, variations in the size of vocal tract cavities are associated with characteristic variations in the spectral distributions in the speech signal for different speech sounds. The most prominent of these spectral features are the characteristic resonances associated with voiced speech sounds known as formants [6]. Vocal cord variations are associated with the average pitch or fundamental frequency of voiced speech sounds. Variations in the velum and nasal cavities are associated with characteristic variations in the spectrum of nasalized speech sounds. Atypical anatomical variations in the configuration of the teeth or the structure of the palate are associated with atypical speech sounds such as lisps or abnormal nasality.
FIGURE 48.1: Simplified diagram of the human vocal tract showing how speech sounds are generated. The size and shape of the articulators differ from person to person.
Behavioral correlates of speaker identity in the speech signal are more difficult to specify. "Low-level" behavioral characteristics are associated with individuality in articulating speech sounds, characteristic pitch contours, rhythm, timing, etc. Characteristics of speech that have to do with individual speech sounds, or phones, are referred to as "segmental", while those that pertain to speech phenomena over a sequence of phones are referred to as "suprasegmental". Phonetic or articulatory suprasegmental "settings" distinguishing speakers have been identified which are associated with characteristic "breathy", nasal, and other voice qualities [38]. "High-level" speaker behavioral characteristics refer to individual choice of words and phrases and other aspects of speaking style.
48.4 Basic Elements of a Speaker Recognition System
The basic elements of a speaker recognition system are shown in Fig. 48.2. An input utterance from
an unknown speaker is analyzed to extract speaker characteristic features. The measured features are
compared with prototype features obtained from known speaker models.
Speaker recognition systems can operate in either an identification decision mode (Fig. 48.2(a))
or verification decision mode (Fig. 48.2(b)). The fundamental difference between these two modes
is the number of decision alternatives.
FIGURE 48.2: Basic structures of speaker recognition systems.
In the identification mode, a speech sample from an unknown speaker is analyzed and compared with models of known speakers. The unknown speaker is identified as the speaker whose model best
matches the input speech sample. In the “closed set” identification mode, the number of decision
alternatives is equal to the size of the population. In the “open set” identification mode, a reference
model for the unknown speaker may not exist. In this case, an additional alternative, “the unknown
does not match any of the models”, is required.
In the verification decision mode, an identity claim is made by or asserted for the unknown speaker.
The unknown speaker’s speech sample is compared with the model for the speaker whose identity
is claimed. If the match is good enough, as indicated by passing a threshold test, the identity claim
is verified. In the verification mode there are two decision alternatives, accept or reject the identity
claim, regardless of the size of the population. Verification can be considered as a special case of the
“open set” identification mode in which the known population size is one.
Crucial to the operation of a speaker recognition system is the establishment and maintenance
of speaker models. One or more enrollment sessions are required in which training utterances are
obtained from known speakers. Features are extracted from the training utterances and compiled
into models. In addition, if the system operates in the "open set" or verification decision mode,
decision thresholds must also be set. Many speaker recognition systems include an updating facility
in which test utterances are used to adapt speaker models and decision thresholds.
A list of terms commonly found in the speaker recognition literature can be found at the end
of this chapter. In the remaining sections of the chapter, the following subjects are treated: how
speaker characteristic features are extracted from speech signals, how these features are used to
represent speakers, how speaker models are constructed and maintained, how speech utterances
from unknown speakers are compared with speaker models and scored to make speaker recognition
decisions, and how speaker verification performance is measured. The chapter concludes with a
discussion of outstanding issues in speaker recognition.
48.5 Extracting Speaker Information from the Speech Signal
Explicit measurements of speaker characteristics in the speech signal are often difficult to carry out.
Segmenting, labeling, and measuring specific segmental speech events that characterize speakers,
such as nasalized speech sounds, is difficult because of variable speech behavior and variable and
distorted recording and transmission conditions. Overall qualities, such as breathiness, are difficult
to correlate with specific speech signal measurements and are subject to variability in the same way
as segmental speech events.
Even though voice characteristics are difficult to specify and measure explicitly, most characteristics
are captured implicitly in the kinds of speech measurements that can be performed relatively easily.
Such measurements as short-time and long-time spectral energy, overall energy, and fundamental
frequency are relatively easy to obtain. They can often resolve differences in speaker characteristics
surpassing human discriminability. Although subject to distortion and variability, features based on
these analysis tools form the basis for most automatic speaker recognition systems.
The most important analysis tool is short-time spectral analysis. It is no coincidence that short-time spectral analysis also forms the basis for most speech recognition systems [42]. Short-time spectral analysis not only resolves the characteristics that differentiate one speech sound from another, but also many of the characteristics already mentioned that differentiate one speaker from another. There are
two principal modes of short-time spectral analysis: filter bank analysis and linear predictive coding
(LPC) analysis.
In filter bank analysis, the speech signal is passed through a bank of bandpass filters covering the
available range of frequencies associated with the signal. Typically, this range is 200 to 3,000 Hz
for telephone band speech and 50 to 8,000 Hz for wide band speech. A typical filter bank for wide
band speech contains 16 bandpass filters spaced uniformly 500 Hz apart. The output of each filter
is usually implemented as a windowed, short-time Fourier transform [using fast Fourier transform
(FFT) techniques] at the center frequency of the filter. The speech is typically windowed using a
10 to 30 ms Hamming window. Instead of uniformly spacing the bandpass filters, a nonuniform
spacing is often carried out reflecting perceptual criteria that allot approximately equal perceptual
contributions for each such filter. Such mel scale or bark scale filters [42] provide a spacing linear in
frequency below 1000 Hz and logarithmic above.
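As an illustration of the filter bank analysis just described, the following sketch builds a bank of triangular filters spaced uniformly on a mel scale and applies it to the FFT magnitude spectrum of a single Hamming-windowed frame. The sampling rate, frame length, FFT size, and number of filters are arbitrary choices for the example, not values prescribed by the text.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=16, n_fft=512, fs=16000, f_lo=50.0, f_hi=8000.0):
    """Triangular bandpass filters with equal spacing on the mel scale."""
    mel_edges = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def filterbank_energies(frame, fbank, n_fft=512):
    """Log energy in each band of one Hamming-windowed analysis frame."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    return np.log(fbank @ spectrum + 1e-10)

# Example: one 25-ms frame of wide band (16 kHz) speech.
frame = np.random.randn(400)          # stand-in for real speech samples
energies = filterbank_energies(frame, mel_filterbank())
```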
LPC-based spectral analysis is widely used for speech and speaker recognition. The LPC model of
the speech signal specifies that a speech sample at time t, s(t), can be represented as a linear sum of
the p previous samples plus an excitation term, as follows:
s(t) = a_1 s(t − 1) + a_2 s(t − 2) + ··· + a_p s(t − p) + Gu(t)   (48.1)

The LPC coefficients, a_i, are computed by solving a set of linear equations resulting from the minimization of the mean-squared error between the signal at time t and the linearly predicted estimate
of the signal. Two generally used methods for solving the equations, the autocorrelation method and the covariance method, are described in Rabiner and Juang [42].
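The autocorrelation method can be sketched as a Levinson-Durbin recursion. The code below assumes the predictor sign convention of Eq. (48.1), a Hamming-windowed analysis frame, and an arbitrarily chosen model order; it is a minimal illustration rather than a production LPC routine.

```python
import numpy as np

def lpc_autocorrelation(frame, p=12):
    """Estimate LPC coefficients a_1..a_p of Eq. (48.1) by the autocorrelation
    method (Levinson-Durbin recursion). Returns (coefficients, residual error power)."""
    windowed = frame * np.hamming(len(frame))
    # Autocorrelation lags r[0]..r[p]
    r = np.array([np.dot(windowed[:len(windowed) - k], windowed[k:])
                  for k in range(p + 1)])
    a = np.zeros(p + 1)          # a[0] is unused; a[1..p] are predictor coefficients
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k = acc / err
        a_new = a.copy()
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        err *= (1.0 - k * k)
    return a[1:], err

# Example: order-12 model of a 30-ms frame at 8 kHz (random data as a stand-in).
coeffs, err_power = lpc_autocorrelation(np.random.randn(240), p=12)
```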
The LPC representation is computationally efficient and easily convertible to other types of spectral representations. While the computational advantage is less important today than it was for early digital implementations of speech and speaker recognition systems, LPC analysis competes well with other spectral analysis techniques and continues to be widely used.
An important spectral representation for speech and speaker recognition is the cepstrum. The cepstrum is the (inverse) Fourier transform of the log of the signal spectrum. Thus, the log spectrum can be represented as a Fourier series expansion in terms of a set of cepstral coefficients c_n:

log S(ω) = ∑_{n=−∞}^{∞} c_n e^{−jnω}   (48.2)
The cepstrum can be calculated from the filter-bank spectrum or from LPC coefficients by a recursion formula [42]. In the latter case it is known as the LPC cepstrum, indicating that it is based on an all-pole representation of the speech signal. The cepstrum has many interesting properties. Since the cepstrum represents the log of the signal spectrum, signals that can be represented as the cascade of two effects which are products in the spectral domain are additive in the cepstral domain. Also, pitch harmonics, which produce prominent ripples in the spectrum, are associated with high-order cepstral coefficients. Thus, the set of cepstral coefficients truncated, for example, at order 12 to 24 can be used to reconstruct a relatively smooth version of the speech spectrum. The spectral envelope obtained is associated with vocal tract resonances and does not have the variable, oscillatory effects of the pitch excitation. This property of separability of source and tract is considered to be one of the reasons that the cepstral representation has been found to be more effective than other representations for speech and speaker recognition. Since the excitation function is considered to have speaker-dependent characteristics, it may seem contradictory that a representation which largely removes these effects works well for speaker recognition. However, in short-time spectral analysis the effects of the source spectrum are highly variable, so they do not provide consistent representations of the source spectrum.
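The smoothing effect of truncating the cepstrum can be demonstrated directly: take the log magnitude spectrum of a frame, transform it to the cepstral domain, keep only the low-order coefficients, and transform back. The sketch below uses a generic FFT-based real cepstrum (not the LPC cepstrum recursion referred to above), with the FFT size and truncation order chosen arbitrarily.

```python
import numpy as np

def smoothed_log_spectrum(frame, n_fft=512, order=16):
    """Reconstruct a smooth spectral envelope from low-order cepstral coefficients."""
    windowed = frame * np.hamming(len(frame))
    log_spec = np.log(np.abs(np.fft.rfft(windowed, n_fft)) + 1e-10)
    # Real cepstrum: inverse transform of the log magnitude spectrum.
    cepstrum = np.fft.irfft(log_spec, n_fft)
    # Keep c_0..c_order (plus the mirrored high-end terms, since the cepstrum of a
    # real log spectrum is symmetric); zero out the higher-order coefficients.
    lifter = np.zeros(n_fft)
    lifter[:order + 1] = 1.0
    lifter[-order:] = 1.0
    smooth = np.fft.rfft(cepstrum * lifter, n_fft).real
    return log_spec, smooth    # full log spectrum (with pitch ripple) vs. smooth envelope

log_spec, envelope = smoothed_log_spectrum(np.random.randn(400))
```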
Other spectral features, such as PARCOR coefficients, log area ratio coefficients, and LSP (line spectral pair) coefficients, have been used for both speech and speaker recognition [42]. Generally speaking, however, the cepstral representation is most widely used and is usually associated with better speaker recognition performance than other representations.
Cruder measures of spectral energy, such as waveform zero-crossing or level-crossing measurements, have also been used with some success for speech and speaker recognition, in the interest of saving computation.
Additional features that are not often used for speech recognition, or are considered only marginally useful for it, have been proposed for speaker recognition. For example, pitch and energy features, particularly when measured as a function of time over a sufficiently long utterance, have been shown to be useful for speaker recognition [27]. Such time sequences or "contours" are thought to represent characteristic speaking inflections and rhythms associated with individual speaking behavior. Pitch and energy measurements have an advantage over short-time spectral measurements in that they are more robust to many different kinds of transmission and recording variations and distortions, since they are not sensitive to spectral amplitude variability. However, since speaking behavior can be highly variable due to both voluntary and involuntary activity, pitch and energy can acquire more variability than short-time spectral features and are more susceptible to imitation.
The time course of feature measurements, as represented by so-called feature contours, provides
valuable speaker-characterizing information. This is because such contours provide overall, suprasegmental information characterizing speaking behavior and also because they contain information on a more local, segmental time scale describing transitions from one speech sound to another. This
latter kind of information can be obtained explicitly by measuring the local trajectory in time of a measured feature at each analysis frame. Such measurements can be obtained by averaging successive differences of the feature in a window around each analysis frame, or by fitting a polynomial in time to the successive feature measurements in the window. The window size is typically 5 to 9 analysis frames. The polynomial fit provides a less noisy estimate of the trajectory than averaging successive differences. The order of the polynomial is typically 1 or 2, and the polynomial coefficients are called delta- and delta-delta-feature coefficients. It has been shown in experiments that such dynamic feature measurements are fairly uncorrelated with the original static feature measurements and provide improved speech and speaker recognition performance [9].
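For the first-order (delta) case, the least-squares polynomial fit over a window of frames reduces to a weighted sum of neighboring frames. A minimal sketch, assuming a matrix of feature vectors with one row per analysis frame and a 5-frame window:

```python
import numpy as np

def delta_features(features, half_window=2):
    """First-order (delta) coefficients by least-squares linear regression over a
    window of 2*half_window + 1 analysis frames.

    features: array of shape (n_frames, n_coeffs), e.g., cepstral vectors.
    """
    n_frames, _ = features.shape
    padded = np.pad(features, ((half_window, half_window), (0, 0)), mode="edge")
    lags = np.arange(-half_window, half_window + 1)
    norm = np.sum(lags ** 2)
    deltas = np.zeros_like(features, dtype=float)
    for t in range(n_frames):
        window = padded[t:t + 2 * half_window + 1]
        # Slope of the best-fit line through the feature trajectory.
        deltas[t] = lags @ window / norm
    return deltas

# Delta-delta coefficients: apply the same regression to the deltas, e.g.
#   d = delta_features(cepstra); dd = delta_features(d)
```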
48.6 Feature Similarity Measurements
Much of the originality and distinctiveness in the design of a speaker recognition system is found in how features are combined and compared with reference models. Underlying this design is the basic representation of features in some space and the formation of a distance or distortion measurement to use when one set of features is compared with another. The distortion measure can be used to partition the feature vectors representing a speaker's utterances into regions representative of the most prominent speech sounds for that speaker, as in the vector quantization (VQ) codebook representation (Section 48.9.2). It can be used to segment utterances into speech sound units. And it can be used to score an unknown speaker's utterances against a known speaker's utterance models.
A general approach for calculating a distance between two feature vectors is to make use of a distance metric from the family of L_p norm distances d_p, such as the absolute value of the difference between the feature vectors

d_1 = ∑_{i=1}^{D} |f_i − f′_i|   (48.3)

or the Euclidean distance

d_2 = ∑_{i=1}^{D} (f_i − f′_i)^2   (48.4)

where f_i, f′_i, i = 1, 2, ..., D are the coefficients of two feature vectors f and f′. The feature vectors, for example, could comprise filter-bank outputs or cepstral coefficients described in the previous section. (It is not common, however, to use filter-bank outputs directly, as previously mentioned, because of the variability associated with these features due to harmonics from the pitch excitation.)
For example, a weighted Euclidean distance distortion measure for cepstral features of the form

d^2_cw = ∑_{i=1}^{D} w_i (c_i − c′_i)^2   (48.5)

where

w_i = 1/σ_i   (48.6)

and σ^2_i is an estimate of the variance of the ith coefficient, has been shown to provide good performance for both speech and speaker recognition. A still more general formulation is the Mahalanobis distance formulation, which accounts for interactions between coefficients with a full covariance matrix.
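The distances of Eqs. (48.3) through (48.6), and the Mahalanobis generalization, translate directly into code. The sketch below assumes the per-coefficient standard deviations (or the full covariance matrix) have already been estimated from training data.

```python
import numpy as np

def l1_distance(f, g):                       # Eq. (48.3)
    return np.sum(np.abs(f - g))

def euclidean_distance(f, g):                # Eq. (48.4), sum of squared differences
    return np.sum((f - g) ** 2)

def weighted_cepstral_distance(c, c_ref, sigma):   # Eqs. (48.5)-(48.6)
    # sigma: per-coefficient standard deviations estimated from training data
    w = 1.0 / sigma
    return np.sum(w * (c - c_ref) ** 2)

def mahalanobis_distance(f, g, cov):
    # Full-covariance generalization: accounts for interactions between coefficients.
    diff = f - g
    return diff @ np.linalg.inv(cov) @ diff
```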
An alternate approach to comparing vectors in a feature space with a distortion measurement is
to establish a probabilistic formulation of the feature space. It is assumed that the feature vectors
in a subspace associated with, for example, a particular speech sound for a particular speaker, can
be specified by some probability distribution. A common assumption is that the feature vector is a
random variable x whose probability distribution is Gaussian
p(x|λ) = [1 / ((2π)^{D/2} |Σ|^{1/2})] exp[ −(1/2) (x − µ)^T Σ^{−1} (x − µ) ]   (48.7)

where λ represents the parameters of the distribution, which are the mean vector µ and covariance matrix Σ.
When x is a feature vector sample, p(x|λ) is referred to as the likelihood of x with respect to
λ. Suppose there is a population of n speakers each modeled by a Gaussian distribution of feature
vectors, λ_i, i = 1, 2, ..., n. In the maximum likelihood formulation, a sample x is associated with speaker I if

p(x|λ_I) > p(x|λ_i),   for all i ≠ I   (48.8)

where p(x|λ_i) is the likelihood of the test vector x for speaker model λ_i. It is common to use log likelihoods to evaluate Gaussian models. From Eq. (48.7),
L(x|λ_i) = log p(x|λ_i) = −(D/2) log 2π − (1/2) log |Σ_i| − (1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i)   (48.9)
It can be seen from Eq. (48.9) that, using log likelihoods, the maximum likelihood classifier is
equivalent to the minimum distance classifier using a Mahalanobis distance formulation.
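A direct transcription of Eqs. (48.8) and (48.9): each known speaker is represented by a mean vector and covariance matrix, and a test vector is assigned to the speaker whose model yields the largest log likelihood. This is only a per-vector sketch; in practice the log likelihoods of all the frames of a test utterance would be accumulated before deciding.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, cov):
    """L(x|lambda_i) of Eq. (48.9) for a single D-dimensional feature vector."""
    D = len(x)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return (-0.5 * D * np.log(2.0 * np.pi)
            - 0.5 * logdet
            - 0.5 * diff @ np.linalg.inv(cov) @ diff)

def identify_speaker(x, models):
    """Maximum likelihood decision of Eq. (48.8).

    models: dict mapping speaker name -> (mu, cov) estimated from training data.
    """
    scores = {name: gaussian_log_likelihood(x, mu, cov)
              for name, (mu, cov) in models.items()}
    return max(scores, key=scores.get)
```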
A more general probabilistic formulation is the Gaussian mixture distribution of a feature vector x:

p(x|λ) = ∑_{i=1}^{M} w_i b_i(x)   (48.10)

where b_i(x) is the Gaussian probability density function with mean µ_i and covariance Σ_i, w_i is the weight associated with the ith component, and M is the number of Gaussian components in the mixture. The weights w_i are constrained so that ∑_{i=1}^{M} w_i = 1. The model parameters λ are

λ = { µ_i, Σ_i, w_i,  i = 1, 2, ..., M }   (48.11)
The Gaussian mixture probability function is capable of approximating a wide variety of smooth, continuous probability functions.
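Evaluating Eq. (48.10) might look like the sketch below, which assumes diagonal covariance matrices for the mixture components and uses a log-sum-exp step for numerical stability. Estimating the mixture parameters themselves (typically by an iterative procedure such as the EM algorithm, which this excerpt does not cover) is not shown.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x|lambda) for the Gaussian mixture of Eq. (48.10), assuming diagonal
    covariances (variances[i] holds the diagonal of Sigma_i)."""
    D = len(x)
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_b = (-0.5 * D * np.log(2.0 * np.pi)
                 - 0.5 * np.sum(np.log(var))
                 - 0.5 * np.sum((x - mu) ** 2 / var))
        log_terms.append(np.log(w) + log_b)
    # log-sum-exp over the mixture components
    m = max(log_terms)
    return m + np.log(sum(np.exp(t - m) for t in log_terms))

def utterance_score(frames, weights, means, variances):
    """Average per-frame log likelihood of an utterance under a speaker's GMM."""
    return np.mean([gmm_log_likelihood(x, weights, means, variances) for x in frames])
```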
48.7 Units of Speech for Representing Speakers
An important consideration in the design of a speaker recognition system is the choice of a speech
unit to model a speaker’s utterances. The choice of units includes phonetic or linguistic units such
as whole sentences or phrases, words, syllables, and phone-like units. It also includes acoustic
units such as subword segments, segmented from utterances and labeled on the basis of acoustic
rather than phonetic criteria. Some speaker recognition systems model speakers directly from single feature vectors rather than through an intermediate speech unit representation. Such systems usually operate in a text-independent mode (see Sections 48.8 and 48.9) and seek to obtain a general model
of a speaker’s utterances from a usually large number of training feature vectors. Direct models
might include long-time averages, VQ codebooks, segment and matrix quantization codebooks, or
Gaussian mixture models of the feature vectors.
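One of the direct models mentioned above, the VQ codebook, can be sketched with a simple k-means procedure: a codebook is trained on a speaker's feature vectors, and a test utterance is scored by its average quantization distortion against that codebook (lower distortion indicating a better match). The codebook size, iteration count, and random seed are arbitrary choices for the example.

```python
import numpy as np

def train_vq_codebook(train_vectors, codebook_size=64, iterations=20, seed=0):
    """Build a speaker's VQ codebook by k-means over the training feature vectors."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_vectors), codebook_size, replace=False)
    codebook = train_vectors[idx].astype(float)
    for _ in range(iterations):
        # Assign each training vector to its nearest codeword (Euclidean distance).
        d = np.linalg.norm(train_vectors[:, None, :] - codebook[None, :, :], axis=2)
        nearest = np.argmin(d, axis=1)
        for k in range(codebook_size):
            members = train_vectors[nearest == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

def vq_distortion(test_vectors, codebook):
    """Average distortion of test vectors against a codebook (lower = better match)."""
    d = np.linalg.norm(test_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()
```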
Most speech recognizers of moderate to large vocabulary are based on subword units such as
phones so that large numbers of utterances transcribed as sequences of phones can be represented
as concatenations of phone models. For speaker recognition, there is no absolute need to represent
utterances in terms of phones or other phonetically based units because there is no absolute need
to account for the linguistic or phonetic content of utterances in order to build speaker recognition
models. Generally speaking, systems in which phonetic representations are used are more complex than systems based on other representations, because they require phonetic transcriptions for both training and testing utterances and because they require accurate and reliable segmentations of utterances in terms of these units. The case in which phonetic representations are required for speaker recognition is the same as for speech recognition: where there is a need to represent utterances as concatenations of smaller units. Speaker recognition systems based on subword units have been described by Rosenberg
et al. [46] and Matsui and Furui [31].
48.8 Input Modes
Speaker recognition systems typically operate in one of two input modes: text dependent or text
independent. In the text-dependent mode, speakers must provide utterances of the same text for
both training and recognition trials. In the text-independent mode, speakers are not constrained to
provide specific texts in recognition trials. Since the text-dependent mode can directly exploit the
voice individuality associated with each phoneme or syllable, it generally achieves higher recognition
performance than the text-independent mode.
48.8.1 Text-Dependent (Fixed Passwords)
The structure of a system using fixed passwords is rather simple; input speech is time aligned with
reference templates or models created by using training utterances for the passwords. If the fixed
passwords are different from speaker to speaker, the difference can also be used as additional individual
information. This helps to increase performance.
48.8.2 Text Independent (No Specified Passwords)
There are several applications in which predetermined passwords cannot be used. In addition, human beings can recognize speakers irrespective of the content of the utterance. Therefore, text-independent methods have recently been actively investigated. Another advantage of text-independent recognition is that it can be done sequentially, until a desired significance level is reached, without the annoyance
of having to repeat passwords again and again.
48.8.3 Text Dependent (Randomly Prompted Passwords)
Both text-dependent and independent methods have a potentially serious problem. Namely, these
systems can be defeated because someone who plays back the recorded voice of a registered speaker
uttering key words or sentences into the microphone could be accepted as the registered speaker. To
cope with this problem, there are methods in which a small set of words, such as digits, are used as
key words and each user is prompted to utter a given sequence of key words that is randomly chosen
every time the system is used [20, 47].
Recently, a text-prompted speaker recognition method was proposed in which password sentences are completely changed every time [31, 33]. The system accepts the input utterance only when it judges that the registered speaker uttered the prompted sentence. Because the vocabulary is unlimited, prospective impostors cannot know in advance the sentence they will be prompted to say. This method can not only accurately recognize speakers, but can also reject utterances whose text differs from the prompted text, even if it is uttered by a registered speaker. Thus, a recorded and played-back voice can be correctly rejected.