Furui, S. & Rosenberg, A.E. “Speaker Verification”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999

© 1999 by CRC Press LLC

48 Speaker Verification

Sadaoki Furui, Tokyo Institute of Technology
Aaron E. Rosenberg, AT&T Labs - Research

48.1 Introduction
48.2 Personal Identity Characteristics
48.3 Vocal Personal Identity Characteristics
48.4 Basic Elements of a Speaker Recognition System
48.5 Extracting Speaker Information from the Speech Signal
48.6 Feature Similarity Measurements
48.7 Units of Speech for Representing Speakers
48.8 Input Modes
    Text-Dependent (Fixed Passwords) • Text Independent (No Specified Passwords) • Text Dependent (Randomly Prompted Passwords)
48.9 Representations
    Representations That Preserve Temporal Characteristics • Representations That Do Not Preserve Temporal Characteristics
48.10 Optimizing Criteria for Model Construction
48.11 Model Training and Updating
48.12 Signal Feature and Score Normalization Techniques
    Signal Feature Normalization • Likelihood and Normalized Scores • Cohort or Speaker Background Models
48.13 Decision Process
    Specifying Decision Thresholds and Measuring Performance • ROC Curves • Adaptive Thresholds • Sequential Decisions (Multi-Attempt Trials)
48.14 Outstanding Issues
Defining Terms
References

48.1 Introduction

Speaker recognition is the process of automatically extracting personal identity information by analysis of spoken utterances. In this section, speaker recognition is taken to be a general process whereas speaker identification and speaker verification refer to specific tasks or decision modes associated with this process. Speaker identification refers to the task of determining who is speaking and speaker verification is the task of validating a speaker’s claimed identity.

Many applications have been considered for automatic speaker recognition.
These include secure access control by voice, customizing services or information to individuals by voice, indexing or labeling speakers in recorded conversations or dialogues, surveillance, and criminal and forensic investigations involving recorded voice samples. Currently, the most frequently mentioned application is access control. Access control applications include voice dialing, banking transactions over a telephone network, telephone shopping, database access services, information and reservation services, voice mail, and remote access to computers. Speaker recognition technology, as such, is expected to create new services and make our daily lives more convenient. Another potentially important application of speaker recognition technology is its use for forensic purposes [24].

For access control and other important applications, speaker recognition operates in a speaker verification task decision mode. For this reason the section is entitled speaker verification. However, the term speaker recognition is used frequently in this section when referring to general processes.

This section is not intended to be a comprehensive review of speaker recognition technology. Rather, it is intended to give an overview of recent advances and the problems that must be solved in the future. The reader is referred to papers by Doddington [4], Furui [10, 11, 12, 13], O’Shaughnessy [39], and Rosenberg and Soong [48] for more general reviews.

48.2 Personal Identity Characteristics

A universal human faculty is the ability to distinguish one person from another by personal identity characteristics. The most prominent of these characteristics are facial and vocal features. Organized, scientific efforts to make use of personal identifying characteristics for security and forensic purposes began about 100 years ago. The most successful such effort was fingerprint classification, which has gained widespread use in forensic investigations.
Today, there is a rapidly growing technology based on biometrics, the measurement of human physiological or behavioral characteristics, for the purpose of identifying individuals or verifying the claimed or asserted identity of an individual [34]. The goal of these technological efforts is to produce completely automated systems for personal identity identification or verification that are convenient to use and offer high performance and reliability. Some of the personal identity characteristics which have received serious attention are blood typing, DNA analysis, hand shape, retinal and iris patterns, and signatures, in addition to fingerprints, facial features, and voice characteristics. In general, characteristics that are subject to the least amount of contamination or distortion and variability provide the greatest accuracy and reliability. Difficulties arise, for example, with smudged fingerprints, inconsistent signature handwriting, recording and channel distortions, and inconsistent speaking behavior for voice characteristics. Indeed, behavioral characteristics, intrinsic to signature and voice features, although potentially an important source of identifying information, are also subject to large amounts of variability from one sample to another.

The demand for effective biometric techniques for personal identity verification comes from forensic and security applications. For security applications, especially, there is a great need for techniques that are not intrusive, that are convenient and efficient, and that are fully automated. For these reasons, techniques such as signature verification or speaker verification are attractive even if they are subject to more sources of variability than other techniques. Speaker verification, in addition, is particularly useful for remote access, since voice characteristics are easily recorded and transmitted over telephone lines.

48.3 Vocal Personal Identity Characteristics

Both physiology and behavior underlie personal identity characteristics of the voice.
Physiological correlates are associated with the size and configuration of the components of the vocal tract (see Fig. 48.1). For example, variations in the size of vocal tract cavities are associated with characteristic variations in the spectral distributions in the speech signal for different speech sounds. The most prominent of these spectral features are the characteristic resonances associated with voiced speech sounds known as formants [6]. Vocal cord variations are associated with the average pitch or fundamental frequency of voiced speech sounds. Variations in the velum and nasal cavities are associated with characteristic variations in the spectrum of nasalized speech sounds. Atypical anatomical variations in the configuration of the teeth or the structure of the palate are associated with atypical speech sounds such as lisps or abnormal nasality.

FIGURE 48.1: Simplified diagram of the human vocal tract showing how speech sounds are generated. The size and shape of the articulators differ from person to person.

Behavioral correlates of speaker identity in the speech signal are more difficult to specify. “Low-level” behavioral characteristics are associated with individuality in articulating speech sounds, characteristic pitch contours, rhythm, timing, etc. Characteristics of speech that have to do with individual speech sounds, or phones, are referred to as “segmental”, while those that pertain to speech phenomena over a sequence of phones are referred to as “suprasegmental”. Phonetic or articulatory suprasegmental “settings” distinguishing speakers have been identified which are associated with characteristic “breathy”, nasal, and other voice qualities [38]. “High-level” speaker behavioral characteristics refer to individual choice of words and phrases and other aspects of speaking styles.

48.4 Basic Elements of a Speaker Recognition System

The basic elements of a speaker recognition system are shown in Fig. 48.2.
An input utterance from an unknown speaker is analyzed to extract speaker characteristic features. The measured features are compared with prototype features obtained from known speaker models. Speaker recognition systems can operate in either an identification decision mode (Fig. 48.2(a)) or verification decision mode (Fig. 48.2(b)). The fundamental difference between these two modes is the number of decision alternatives.

FIGURE 48.2: Basic structures of speaker recognition systems.

In the identification mode, a speech sample from an unknown speaker is analyzed and compared with models of known speakers. The unknown speaker is identified as the speaker whose model best matches the input speech sample. In the “closed set” identification mode, the number of decision alternatives is equal to the size of the population. In the “open set” identification mode, a reference model for the unknown speaker may not exist. In this case, an additional alternative, “the unknown does not match any of the models”, is required.

In the verification decision mode, an identity claim is made by or asserted for the unknown speaker. The unknown speaker’s speech sample is compared with the model for the speaker whose identity is claimed. If the match is good enough, as indicated by passing a threshold test, the identity claim is verified. In the verification mode there are two decision alternatives, accept or reject the identity claim, regardless of the size of the population. Verification can be considered as a special case of the “open set” identification mode in which the known population size is one.

Crucial to the operation of a speaker recognition system is the establishment and maintenance of speaker models. One or more enrollment sessions are required in which training utterances are obtained from known speakers. Features are extracted from the training utterances and compiled into models.
In addition, if the system operates in the “open set” or verification decision mode, decision thresholds must also be set. Many speaker recognition systems include an updating facility in which test utterances are used to adapt speaker models and decision thresholds.

A list of terms commonly found in the speaker recognition literature can be found at the end of this chapter. In the remaining sections of the chapter, the following subjects are treated: how speaker characteristic features are extracted from speech signals, how these features are used to represent speakers, how speaker models are constructed and maintained, how speech utterances from unknown speakers are compared with speaker models and scored to make speaker recognition decisions, and how speaker verification performance is measured. The chapter concludes with a discussion of outstanding issues in speaker recognition.

48.5 Extracting Speaker Information from the Speech Signal

Explicit measurements of speaker characteristics in the speech signal are often difficult to carry out. Segmenting, labeling, and measuring specific segmental speech events that characterize speakers, such as nasalized speech sounds, is difficult because of variable speech behavior and variable and distorted recording and transmission conditions. Overall qualities, such as breathiness, are difficult to correlate with specific speech signal measurements and are subject to variability in the same way as segmental speech events.

Even though voice characteristics are difficult to specify and measure explicitly, most characteristics are captured implicitly in the kinds of speech measurements that can be performed relatively easily. Such measurements as short-time and long-time spectral energy, overall energy, and fundamental frequency are relatively easy to obtain. They can often resolve differences in speaker characteristics surpassing human discriminability.
Although subject to distortion and variability, features based on these analysis tools form the basis for most automatic speaker recognition systems.

The most important analysis tool is short-time spectral analysis. It is no coincidence that short-time spectral analysis also forms the basis for most speech recognition systems [42]. Short-time spectral analysis not only resolves the characteristics that differentiate one speech sound from another, but also many of the characteristics already mentioned that differentiate one speaker from another. There are two principal modes of short-time spectral analysis: filter bank analysis and linear predictive coding (LPC) analysis.

In filter bank analysis, the speech signal is passed through a bank of bandpass filters covering the available range of frequencies associated with the signal. Typically, this range is 200 to 3,000 Hz for telephone band speech and 50 to 8,000 Hz for wide band speech. A typical filter bank for wide band speech contains 16 bandpass filters spaced uniformly 500 Hz apart. The output of each filter is usually implemented as a windowed, short-time Fourier transform [using fast Fourier transform (FFT) techniques] at the center frequency of the filter. The speech is typically windowed using a 10 to 30 ms Hamming window. Instead of uniformly spacing the bandpass filters, a nonuniform spacing is often carried out reflecting perceptual criteria that allot approximately equal perceptual contributions for each such filter. Such mel scale or bark scale filters [42] provide a spacing linear in frequency below 1000 Hz and logarithmic above.

LPC-based spectral analysis is widely used for speech and speaker recognition.
The LPC model of the speech signal specifies that a speech sample at time $t$, $s(t)$, can be represented as a linear sum of the $p$ previous samples plus an excitation term, as follows:

$$ s(t) = a_1 s(t-1) + a_2 s(t-2) + \cdots + a_p s(t-p) + G u(t) \tag{48.1} $$

The LPC coefficients, $a_i$, are computed by solving a set of linear equations resulting from the minimization of the mean-squared error between the signal at time $t$ and the linearly predicted estimate of the signal. Two generally used methods for solving the equations, the autocorrelation method and the covariance method, are described in Rabiner and Juang [42].

The LPC representation is computationally efficient and easily convertible to other types of spectral representations. While the computational advantage is less important today than it was for early digital implementations of speech and speaker recognition systems, LPC analysis competes well with other spectral analysis techniques and continues to be widely used.

An important spectral representation for speech and speaker recognition is the cepstrum. The cepstrum is the (inverse) Fourier transform of the log of the signal spectrum. Thus, the log spectrum can be represented as a Fourier series expansion in terms of a set of cepstral coefficients $c_n$:

$$ \log S(\omega) = \sum_{n=-\infty}^{\infty} c_n e^{-jn\omega} \tag{48.2} $$

The cepstrum can be calculated from the filter-bank spectrum or from LPC coefficients by a recursion formula [42]. In the latter case it is known as the LPC cepstrum, indicating that it is based on an all-pole representation of the speech signal. The cepstrum has many interesting properties. Since the cepstrum represents the log of the signal spectrum, signals that can be represented as the cascade of two effects which are products in the spectral domain are additive in the cepstral domain. Also, pitch harmonics, which produce prominent ripples in the spectral envelope, are associated with high order cepstral coefficients.
Thus, the set of cepstral coefficients truncated, for example, at order 12 to 24 can be used to reconstruct a relatively smooth version of the speech spectrum. The spectral envelope obtained is associated with vocal tract resonances and does not have the variable, oscillatory effects of the pitch excitation. It is considered that one of the reasons that the cepstral representation has been found to be more effective than other representations for speech and speaker recognition is this property of separability of source and tract. Since the excitation function is considered to have speaker dependent characteristics, it may seem contradictory that a representation which largely removes these effects works well for speaker recognition. However, in short-time spectral analysis the effects of the source spectrum are highly variable so that they are not especially effective in providing consistent representations of the source spectrum.

Other spectral features such as PARCOR coefficients, log area ratio coefficients, and line spectral pair (LSP) coefficients have been used for both speech and speaker recognition [42]. Generally speaking, however, the cepstral representation is most widely used and is usually associated with better speaker recognition performance than other representations.

Cruder measures of spectral energy, such as waveform zero-crossing or level-crossing measurements, have also been used, with some success, for speech and speaker recognition in the interest of saving computation.

Additional features have been proposed for speaker recognition which are not often used, or are considered only marginally useful, for speech recognition. For example, pitch and energy features, particularly when measured as a function of time over a sufficiently long utterance, have been shown to be useful for speaker recognition [27]. Such time sequences or “contours” are thought to represent characteristic speaking inflections and rhythms associated with individual speaking behavior.
Pitch and energy measurements have an advantage over short-time spectral measurements in that they are more robust to many different kinds of transmission and recording variations and distortions since they are not sensitive to spectral amplitude variability. However, since speaking behavior can be highly variable due to both voluntary and involuntary activity, pitch and energy can acquire more variability than short-time spectral features and are more susceptible to imitation.

The time course of feature measurements, as represented by so-called feature contours, provides valuable speaker characterizing information. This is because such contours provide overall, suprasegmental information characterizing speaking behavior and also because they contain information on a more local, segmental time scale describing transitions from one speech sound to another. This latter kind of information can be obtained explicitly by measuring the local trajectory in time of a measured feature at each analysis frame. Such measurements can be obtained by averaging successive differences of the feature in a window around each analysis frame, or by fitting a polynomial in time to the successive feature measurements in the window. The window size is typically 5 to 9 analysis frames. The polynomial fit provides a less noisy estimate of the trajectory than averaging successive differences. The order of the polynomial is typically 1 or 2, and the polynomial coefficients are called delta- and delta-delta-feature coefficients. It has been shown in experiments that such dynamic feature measurements are fairly uncorrelated with the original static feature measurements and provide improved speech and speaker recognition performance [9].

48.6 Feature Similarity Measurements

Much of the originality and distinctiveness in the design of a speaker recognition system is found in how features are combined and compared with reference models.
Underlying this design is the basic representation of features in some space and the formation of a distance or distortion measurement to use when one set of features is compared with another. The distortion measure can be used to partition the feature vectors representing a speaker’s utterances into regions representative of the most prominent speech sounds for that speaker, as in the vector quantization (VQ) codebook representation (Section 48.9.2). It can be used to segment utterances into speech sound units. And it can be used to score an unknown speaker’s utterances against a known speaker’s utterance models.

A general approach for calculating a distance between two feature vectors is to make use of a distance metric from the family of $L_p$ norm distances $d_p$, such as the absolute value of the difference between the feature vectors

$$ d_1 = \sum_{i=1}^{D} \left| f_i - f'_i \right| \tag{48.3} $$

or the Euclidean distance

$$ d_2 = \sum_{i=1}^{D} \left( f_i - f'_i \right)^2 \tag{48.4} $$

where $f_i, f'_i$, $i = 1, 2, \ldots, D$ are the coefficients of two feature vectors $f$ and $f'$. The feature vectors, for example, could comprise filter-bank outputs or cepstral coefficients described in the previous section. (It is not common, however, to use filter bank outputs directly, as previously mentioned, because of the variability associated with these features due to harmonics from the pitch excitation.) For example, a weighted Euclidean distance distortion measure for cepstral features of the form

$$ d_{cw}^2 = \sum_{i=1}^{D} \left( w_i \left( c_i - c'_i \right) \right)^2 \tag{48.5} $$

where

$$ w_i = 1/\sigma_i \tag{48.6} $$

and $\sigma_i^2$ is an estimate of the variance of the $i$th coefficient has been shown to provide good performance for both speech and speaker recognition. A still more general formulation is the Mahalanobis distance formulation, which accounts for interactions between coefficients with a full covariance matrix.

An alternate approach to comparing vectors in a feature space with a distortion measurement is to establish a probabilistic formulation of the feature space.
It is assumed that the feature vectors in a subspace associated with, for example, a particular speech sound for a particular speaker, can be specified by some probability distribution. A common assumption is that the feature vector is a random variable $x$ whose probability distribution is Gaussian

$$ p(x|\lambda) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\} \tag{48.7} $$

where $\lambda$ represents the parameters of the distribution, which are the mean vector $\mu$ and covariance matrix $\Sigma$. When $x$ is a feature vector sample, $p(x|\lambda)$ is referred to as the likelihood of $x$ with respect to $\lambda$.

Suppose there is a population of $n$ speakers each modeled by a Gaussian distribution of feature vectors, $\lambda_i$, $i = 1, 2, \ldots, n$. In the maximum likelihood formulation, a sample $x$ is associated with speaker $I$ if

$$ p(x|\lambda_I) > p(x|\lambda_i), \quad \text{for all } i \neq I \tag{48.8} $$

where $p(x|\lambda_i)$ is the likelihood of the test vector $x$ for speaker model $\lambda_i$. It is common to use log likelihoods to evaluate Gaussian models. From Eq. (48.7)

$$ L(x|\lambda_i) = \log p(x|\lambda_i) = -\frac{D}{2} \log 2\pi - \frac{1}{2} \log |\Sigma_i| - \frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \tag{48.9} $$

It can be seen from Eq. (48.9) that, using log likelihoods, the maximum likelihood classifier is equivalent to the minimum distance classifier using a Mahalanobis distance formulation, provided the log-determinant term is the same for every speaker model (for example, when the models share a common covariance matrix).

A more general probabilistic formulation is the Gaussian mixture distribution of a feature vector $x$

$$ p(x|\lambda) = \sum_{i=1}^{M} w_i b_i(x) \tag{48.10} $$

where $b_i(x)$ is the Gaussian probability density function with mean $\mu_i$ and covariance $\Sigma_i$, $w_i$ is the weight associated with the $i$th component, and $M$ is the number of Gaussian components in the mixture. The weights $w_i$ are constrained so that $\sum_{i=1}^{M} w_i = 1$. The model parameters $\lambda$ are

$$ \lambda = \{ \mu_i, \Sigma_i, w_i, \; i = 1, 2, \ldots, M \} \tag{48.11} $$

The Gaussian mixture probability function is capable of approximating a wide variety of smooth, continuous probability functions.
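
As a rough illustration of Eqs. (48.8) through (48.10), the following Python sketch scores a few feature vectors against Gaussian mixture speaker models and picks the model with the highest log likelihood. Several things here are assumptions made for the example rather than anything specified in the text: the covariance matrices are taken to be diagonal (so each is stored as a list of variances), frames are treated as independent so that their log likelihoods can be summed, and the function names, the speaker names, and the toy model parameters are all hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log likelihood of x under one Gaussian, Eq. (48.9),
    assuming a diagonal covariance given as a list of variances."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)  # log|Sigma| for a diagonal Sigma
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)

def log_gmm(x, weights, means, variances):
    """Log of the mixture density in Eq. (48.10), computed with the
    log-sum-exp trick to avoid underflow for unlikely vectors."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def identify(frames, speaker_models):
    """Maximum likelihood decision of Eq. (48.8): sum frame log
    likelihoods per model (independence assumed) and take the best."""
    scores = {name: sum(log_gmm(x, *model) for x in frames)
              for name, model in speaker_models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy models: (weights, component means, component variances).
models = {
    "alice": ([0.5, 0.5], [[0.0, 0.0], [2.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "bob": ([1.0], [[5.0, 5.0]], [[1.0, 1.0]]),
}
frames = [[0.1, -0.2], [1.9, 2.2]]
best, scores = identify(frames, models)
print(best, scores)
```

With these toy parameters both frames lie close to the two components of the first model and far from the single component of the second, so the first model is selected. In practice the feature vectors would be cepstral coefficients and the model parameters would be estimated from enrollment data; neither step is shown here.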

48.7 Units of Speech for Representing Speakers

An important consideration in the design of a speaker recognition system is the choice of a speech unit to model a speaker’s utterances. The choice of units includes phonetic or linguistic units such as whole sentences or phrases, words, syllables, and phone-like units. It also includes acoustic units such as subword segments, segmented from utterances and labeled on the basis of acoustic rather than phonetic criteria. Some speaker recognition systems model speakers directly from single feature vectors rather than through an intermediate speech unit representation. Such systems usually operate in a text-independent mode (see Sections 48.8 and 48.9) and seek to obtain a general model of a speaker’s utterances from a usually large number of training feature vectors. Direct models might include long-time averages, VQ codebooks, segment and matrix quantization codebooks, or Gaussian mixture models of the feature vectors.

Most speech recognizers of moderate to large vocabulary are based on subword units such as phones so that large numbers of utterances transcribed as sequences of phones can be represented as concatenations of phone models. For speaker recognition, there is no absolute need to represent utterances in terms of phones or other phonetically based units because there is no absolute need to account for the linguistic or phonetic content of utterances in order to build speaker recognition models. Generally speaking, systems in which phonetic representations are used are more complex than other representations because they require phonetic transcriptions for both training and testing utterances and because they require accurate and reliable segmentations of utterances in terms of these units. The case in which phonetic representations are required for speaker recognition is the same as for speech recognition: where there is a need to represent utterances as concatenations of smaller units.
Speaker recognition systems based on subword units have been described by Rosenberg et al. [46] and Matsui and Furui [31].

48.8 Input Modes

Speaker recognition systems typically operate in one of two input modes: text dependent or text independent. In the text-dependent mode, speakers must provide utterances of the same text for both training and recognition trials. In the text-independent mode, speakers are not constrained to provide specific texts in recognition trials. Since the text-dependent mode can directly exploit the voice individuality associated with each phoneme or syllable, it generally achieves higher recognition performance than the text-independent mode.

48.8.1 Text-Dependent (Fixed Passwords)

The structure of a system using fixed passwords is rather simple; input speech is time aligned with reference templates or models created by using training utterances for the passwords. If the fixed passwords are different from speaker to speaker, the difference can also be used as additional individual information. This helps to increase performance.

48.8.2 Text Independent (No Specified Passwords)

There are several applications in which predetermined passwords cannot be used. In addition, human beings can recognize speakers irrespective of the content of the utterance. Therefore, text-independent methods have recently been actively investigated. Another advantage of text-independent recognition is that it can be done sequentially, until a desired significance level is reached, without the annoyance of having to repeat passwords again and again.

48.8.3 Text Dependent (Randomly Prompted Passwords)

Both text-dependent and independent methods have a potentially serious problem. Namely, these systems can be defeated because someone who plays back the recorded voice of a registered speaker uttering key words or sentences into the microphone could be accepted as the registered speaker.
To cope with this problem, there are methods in which a small set of words, such as digits, are used as key words and each user is prompted to utter a given sequence of key words that is randomly chosen every time the system is used [20, 47].

Recently, a text-prompted speaker recognition method was proposed in which password sentences are completely changed every time [31, 33]. The system accepts the input utterance only when it judges that the registered speaker uttered the prompted sentence. Because the vocabulary is unlimited, prospective impostors cannot know in advance the sentence they will be prompted to say. This method can not only accurately recognize speakers, but can also reject utterances whose text differs from the prompted text, even if it is uttered by a registered speaker. Thus, a recorded and played-back voice can be correctly rejected.

[...]

... the nearest reference speaker, conditional probabilities must be calculated for all the reference speakers, which involves a high computational cost. Second, the maximum conditional probability value is rather variable from speaker to speaker, depending on how close the nearest speaker is in the reference set.

48.12.3 Cohort or Speaker Background Models

A set of speakers, “cohort speakers”, has been chosen ...

... reference speaker, customer

Genuine speaker: A speaker whose real identity is in accordance with the claimed identity. Alternative terms: true speaker, correct speaker.

Impostor: In the context of speaker identification, a speaker who does not belong to the set of registered speakers. In the context of speaker verification, a speaker whose real identity is different from his/her claimed identity ...
response to a speaker (or speaker class) verification task.

Rejection: A decision outcome which involves refusal to assign a registered identity (or class) in the context of open-set speaker identification or speaker verification.

Misclassification: Erroneous identity assignment to a registered speaker in speaker identification.

False rejection: Erroneous rejection of a genuine speaker in open-set speaker identification ...

... or not the claimed speaker is included in the speaker set for normalization; the cohort speaker set in the likelihood-ratio-based method does not include the claimed speaker, whereas the normalization term for the a posteriori-probability-based method is calculated using all the reference speakers, including the claimed speaker. Matsui and Furui approximated the summation in Eq. (48.15) by the summation ...

... different speakers must be to be acoustically and perceptually indistinguishable. Fundamental research on these questions will provide answers for developing better speaker recognition technology.

Defining Terms

Registered speaker: A speaker who belongs to the list of known (registered) users for a given speaker recognition system. Alternative terms: reference speaker, customer.

Genuine speaker: ...

... efficient means of characterizing speaker-specific features [25, 29, 45, 52]. A speaker-specific codebook is generated by clustering the training feature vectors of each speaker. In the recognition stage, an input utterance is vector-quantized using the codebook of each reference speaker, and the ...

FIGURE 48.3: Typical structure of the DTW-based text-dependent speaker verification system.

... chosen for calculating the normalization term of Eq. (48.12). Higgins et al. proposed the use of speakers that are representative of the population near the claimed speaker:

$$ \log l(x) = \log p(x|S = S_c) - \log \sum_{S \in \mathrm{Cohort},\, S \neq S_c} p(x|S) \tag{48.14} $$

Experimental results show that this normalization method improves speaker separability and reduces the need for speaker-dependent or text-dependent thresholding, compared ...

... described. The density at point $x$ for all speakers other than the true speaker $S$ can be dominated by the density for the nearest reference speaker, if we assume that the set of reference speakers is representative of all speakers. We can, therefore, arrive at the decision criterion

$$ \log l(x) = \log p(x|S = S_c) - \max_{S \in \mathrm{Ref},\, S \neq S_c} \log p(x|S) \tag{48.13} $$

This shows that likelihood ratio ...

... system each speaker is represented by a VQ codebook and an NTN classifier. The NTN classifier is trained on both customer and impostor training data.

48.11 Model Training and Updating

Trial-to-trial variations have a major impact on the performance of speaker recognition systems. Variations arise from the speaker himself/herself, from differences in recording and transmission conditions, and from noise. Speakers ...

... way of choosing the cohort speaker set is to use speakers who are typical of the general population. Reynolds [43] reported that a randomly selected, gender-balanced background speaker population outperformed a population near the claimed speaker. Matsui and Furui [31] proposed a normalization method based on a posteriori probability:

$$ \log l(x) = \log p(x|S = S_c) - \log \sum_{S \in \mathrm{Ref}} p(x|S) \tag{48.15} $$

The difference ...