Richard V. Cox. “Speech Coding.”
2000 CRC Press LLC. <http://www.engnetbase.com>.

Speech Coding
Richard V. Cox
AT&T Labs — Research
45.1 Introduction
  Examples of Applications • Speech Coder Attributes
45.2 Useful Models for Speech and Hearing
  The LPC Speech Production Model • Models of Human Perception for Speech Coding
45.3 Types of Speech Coders
  Model-Based Speech Coders • Time Domain Waveform-Following Speech Coders • Frequency Domain Waveform-Following Speech Coders
45.4 Current Standards
  Current ITU Waveform Signal Coders • ITU Linear Prediction Analysis-by-Synthesis Speech Coders • Digital Cellular Speech Coding Standards • Secure Voice Standards • Performance
References
45.1 Introduction
Digital speech coding is used in a wide variety of everyday applications that the ordinary person takes for granted, such as network telephony or telephone answering machines. By speech coding we mean a method for reducing the amount of information needed to represent a speech signal for transmission or storage applications. For most applications this means using a lossy compression algorithm because a small amount of perceptible degradation is acceptable. This section reviews some of the applications, the basic attributes of speech coders, methods currently used for coding, and some of the most important speech coding standards.
45.1.1 Examples of Applications

Digital speech transmission is used in network telephony. The speech coding used is just sample-by-sample quantization. The transmission rate for most calls is fixed at 64 kilobits per second (kb/s). The speech is sampled at 8000 Hz (8 kHz) and a logarithmic 8-bit quantizer is used to represent each sample as one of 256 possible output values. International calls over transoceanic cables or satellites are often reduced in bit rate to 32 kb/s in order to boost the capacity of this relatively expensive equipment. Digital wireless transmission has already begun. In North America, Europe, and Japan there are digital cellular phone systems already in operation with bit rates ranging from 6.7 to 13 kb/s for the speech coders. Secure telephony has existed since World War II, based on the first vocoder. (Vocoder is a contraction of the words voice coder.) Secure telephony involves first converting the speech to a digital form, then digitally encrypting it, and then transmitting it. At the receiver, it is decrypted, decoded, and reconverted back to analog. Current video telephony is accomplished through digital transmission of both the speech and the video signals. An emerging use of speech coders is for simultaneous voice and data. In these applications, users exchange data (text, images, FAX, or any other form of digital information) while carrying on a conversation.

© 1999 by CRC Press LLC
All of the above examples involve real-time conversations. Today we use speech coders for many storage applications that make our lives easier. For example, voice mail systems and telephone answering machines allow us to leave messages for others. The called party can retrieve the message when they wish, even from halfway around the world. The same storage technology can be used to broadcast announcements to many different individuals. Another emerging use of speech coding is multimedia. Most forms of multimedia involve only one-way communications, so we include them with storage applications. Multimedia documents on computers can have snippets of speech as an integral part. Capabilities currently exist to allow users to make voice annotations onto documents stored on a personal computer (PC) or workstation.
45.1.2 Speech Coder Attributes
Speech coders have attributes that can be placed in four groups: bit rate, quality, complexity, and
delay. For a given application, some of these attributes are pre-determined while tradeoffs can be
made among the others. For example, the communications channel may set a limit on bit rate, or
cost considerations may limit complexity. Quality can usually be improved by increasing bit rate or
complexity, and sometimes by increasing delay. In the following sections, we discuss these attributes.
Primarily we will be discussing telephone bandwidth speech. This is a slightly nebulous term. In the telephone network, speech is first bandpass filtered from roughly 200 to 3200 Hz. This is often referred to as 3 kHz speech. Speech is sampled at 8 kHz in the telephone network. The usual telephone bandwidth filter rolls off to about 35 dB by 4 kHz in order to eliminate the aliasing artifacts caused by sampling.
There is a second bandwidth of interest. It is referred to as wideband speech. The sampling rate is doubled to 16 kHz. The lowpass filter is assumed to begin rolling off at 7 kHz. At the low end, the speech is assumed to be uncontaminated by line noise and only the DC component needs to be filtered out. Thus, the highpass filter cutoff frequency is 50 Hz. When we refer to wideband speech, we mean speech with a bandwidth of 50 to 7000 Hz and a sampling rate of 16 kHz. This is also referred to as 7 kHz speech.
Bit Rate
Bit rate tells us the degree of compression that the coder achieves. Telephone bandwidth speech is sampled at 8 kHz and digitized with an 8-bit logarithmic quantizer, resulting in a bit rate of 64 kb/s. For telephone bandwidth speech coders, we measure the degree of compression by how much the bit rate is lowered from 64 kb/s. International telephone network standards currently exist for coders operating from 64 kb/s down to 5.3 kb/s. The speech coders for regional cellular standards span the range from 13 to 3.45 kb/s and those for secure telephony span the range from 16 kb/s to 800 b/s. Finally, there are proprietary speech coders in common use which span the entire range.
Speech coders need not have a constant bit rate. Considerable compression can be gained by not transmitting speech during the silence intervals of a conversation. Nor is it necessary to keep the bit rate fixed during the talkspurts of a conversation.
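The compression figures quoted above follow directly from the 64 kb/s log-PCM reference rate. A few lines make the arithmetic explicit; the list of rates simply echoes the ones mentioned in this subsection:

```python
# Reference rate: 8 kHz sampling * 8 bits/sample = 64 kb/s log-PCM.
BASE_KBPS = 8000 * 8 / 1000

# Rates quoted in this section (kb/s): network, cellular, secure telephony.
for rate_kbps in (32, 16, 13, 8, 6.7, 5.3, 3.45, 0.8):
    ratio = BASE_KBPS / rate_kbps
    print(f"{rate_kbps:5.2f} kb/s -> {ratio:4.1f}:1 compression")
```

For example, the 5.3 kb/s network coder mentioned above corresponds to roughly 12:1 compression, and an 800 b/s secure-voice coder to 80:1.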
Delay
The communication delay of the coder is more important for transmission than for storage
applications. In real-time conversations, a large communication delay can impose an awkward
protocol on talkers. Large communication delays of 300 ms or greater are particularly objectionable
to users even if there are no echoes.
Most low bit rate speech coders are block coders. They encode a block of speech, also known as a frame, at a time. Speech coding delay can be allocated as follows. First, there is algorithmic delay. Some coders have an amount of look-ahead or other inherent delays in addition to their frame size. The sum of frame size and other inherent delays constitutes algorithmic delay. The coder requires computation. The amount of time required for this is called processing delay. It is dependent on the speed of the processor used. Other delays in a complete system are the multiplexing delay and the transmission delay.
Complexity
The degree of complexity is a determining factor in both the cost and power consumption of a speech coder. Cost is almost always a factor in the selection of a speech coder for a given application. With the advent of wireless and portable communications, power consumption has also become an important factor. Simple scalar quantizers, such as linear or logarithmic PCM, are necessary in any coding system and have the lowest possible complexity.
More complex speech coders are first simulated on host processors, then implemented on DSP chips, and may later be implemented on special purpose VLSI devices. Speed and random access memory (RAM) are the two most important contributing factors of complexity. The faster the chip or the greater the chip size, the greater the cost. In fact, complexity is a determining factor for both cost and power consumption. Generally 1 word of RAM takes up as much on-chip area as 4 to 6 words of read-only memory (ROM). Most speech coders are implemented on fixed point DSP chips, so one way to compare the complexity of coders is to measure their speed and memory requirements when efficiently implemented on commercially available fixed point DSP chips.
DSP chips are available in both 16-bit fixed point and 32-bit floating point versions. 16-bit DSP chips are generally preferred for dedicated speech coder implementations because the chips are usually less expensive and consume less power than implementations based on floating point DSPs. A disadvantage of fixed point DSP chips is that the speech coding algorithm must be implemented using 16-bit arithmetic. As part of the implementation process, a representation must be selected for each and every variable. Some can be represented in a fixed format, some in block floating point, and still others may require double precision. As VLSI technology has advanced, fixed point DSP chips contain a richer set of instructions to handle the data manipulations required to implement representations such as block floating point. The advantage of floating point DSP chips is that implementing speech coders is much quicker. Their arithmetic precision is about the same as that of a high-level language simulation, so the steps of determining the representation of each and every variable and how these representations affect performance can be omitted.
Quality
The attribute of quality has many dimensions. Ultimately quality is determined by how the
speech sounds to a listener. Some of the factors that affect the performance of a coder are whether
the input speech is clean or noisy, whether the bit stream has been corrupted by errors, and whether
multiple encodings have taken place.
Speech coder quality ratings are determined by means of subjective listening tests. The listening is done in a quiet booth and may use specified telephone handsets, headphones, or loudspeakers. The speech material is presented to the listeners at specified levels and is originally prepared to have particular frequency characteristics. The most often used test is the absolute category rating (ACR) test. Subjects hear pairs of sentences and are asked to give one of the following ratings: excellent, good, fair, poor, or bad. A typical test contains a variety of different talkers and a number of different coders or reference conditions. The data resulting from this test can be analyzed in many ways. The simplest way is to assign a numerical ranking to each response, giving a 5 to the best possible rating, 4 to the next best, down to a 1 for the worst rating, then computing the mean rating for each of the conditions under test. This is referred to as a mean opinion score (MOS) and the ACR test is often referred to as a MOS test.
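The MOS computation described above is just an average over numerically coded category responses. A minimal sketch, where the list of listener votes is made up purely for illustration:

```python
# ACR category-to-score mapping used for MOS.
ACR_SCORES = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def mean_opinion_score(responses):
    """Average the numeric rankings of a list of ACR category labels."""
    return sum(ACR_SCORES[r] for r in responses) / len(responses)

# Hypothetical listener responses for one condition under test.
votes = ["good", "good", "fair", "excellent", "poor"]
print(mean_opinion_score(votes))  # -> 3.6
```

In a real test, each condition's MOS is averaged over many listeners and talkers before conditions are compared.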
There are many other dimensions to quality besides those pertaining to noiseless channels. Bit error sensitivity is another aspect of quality. For some low bit rate applications such as secure telephones over 2.4 or 4.8 kb/s modems, it might be reasonable to expect the distribution of bit errors to be random, and coders should be made robust for low random bit error rates up to 1 to 2%. For radio channels, such as in digital cellular telephony, provision is made for additional bits to be used for channel coding to protect the information bearing bits. Errors are more likely to occur in bursts, and the speech coder requires a mechanism to recover from an entire lost frame. This is referred to as frame erasure concealment, another aspect of quality for cellular speech coders.
For the purposes of conserving bandwidth, voice activity detectors are sometimes used with speech coders. During non-speech intervals, the speech coder bit stream is discontinued. At the receiver “comfort noise” is injected to simulate the background acoustic noise at the encoder. This method is used for some cellular systems and also in digital speech interpolation (DSI) systems to increase the effective number of channels or circuits. Most international phone calls carried on undersea cables or satellites use DSI systems. There is some impact on quality when these techniques are used. Subjective testing can determine the degree of degradation.
45.2 Useful Models for Speech and Hearing
45.2.1 The LPC Speech Production Model
Human speech is produced in the vocal tract by a combination of the vocal cords in the glottis interacting with the articulators of the vocal tract. The vocal tract can be approximated as a tube of varying diameter. The shape of the tube gives rise to resonant frequencies called formants. Over the years, the most successful speech coding techniques have been based on linear prediction coding (LPC). The LPC model is derived from a mathematical approximation to the vocal tract representation as a variable diameter tube. The essential element of LPC is the linear prediction filter. This is an all-pole filter which predicts the value of the next sample based on a linear combination of previous samples.
Let x_n be the speech sample value at sampling instant n. The object is to find a set of prediction coefficients {a_i} such that the prediction error for a frame of size M is minimized:

    \varepsilon = \sum_{m=0}^{M-1} \left( \sum_{i=1}^{I} a_i x_{n+m-i} + x_{n+m} \right)^2    (45.1)

where I is the order of the linear prediction model. The prediction value for x_n is given by

    \tilde{x}_n = -\sum_{i=1}^{I} a_i x_{n-i}    (45.2)

The prediction error signal {e_n} is also referred to as the residual signal. In z-transform notation we can write

    A(z) = 1 + \sum_{i=1}^{I} a_i z^{-i}    (45.3)

1/A(z) is referred to as the LPC synthesis filter and (ironically) A(z) is referred to as the LPC inverse filter.
LPC analysis is carried out as a block process on a frame of speech. The most often used techniques are referred to as the autocorrelation and the autocovariance methods [1]–[3]. Both methods involve inverting matrices containing correlation statistics of the speech signal. If the poles of the LPC filter are close to the unit circle, then these matrices become more ill-conditioned, which means that the techniques used for inversion are more sensitive to errors caused by finite numerical precision. Various techniques for dealing with this aspect of LPC analysis include windows for the data [1, 2], windows for the correlation statistics [4], and bandwidth expansion of the LPC coefficients.
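The autocorrelation method can be sketched with the Levinson-Durbin recursion, which solves the normal equations without an explicit matrix inversion and yields the reflection coefficients (used below for quantization) as a byproduct. This is an illustrative sketch, not any standard's reference implementation; the Hamming data window and the frame length are arbitrary choices:

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns (a, k, err): a = [1, a_1, ..., a_I] holds the coefficients of
    A(z) = 1 + sum_i a_i z^-i as in Eqs. (45.1) through (45.3), k holds the
    reflection coefficients, and err is the final prediction-error energy.
    """
    x = frame * np.hamming(len(frame))      # data window aids conditioning
    n = len(x)
    # Autocorrelation lags 0..order of the windowed frame.
    r = np.array([x[: n - lag] @ x[lag:] for lag in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        k[i - 1] = -(r[i] + a[1:i] @ r[1:i][::-1]) / err
        # Update predictor coefficients (order i-1 -> order i).
        a[1:i] = a[1:i] + k[i - 1] * a[1:i][::-1]
        a[i] = k[i - 1]
        err *= 1.0 - k[i - 1] ** 2
    return a, k, err
```

For a valid analysis the recursion keeps every |k_i| below 1, which is exactly the stability property exploited by the reflection-coefficient quantizers discussed next.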
For forward adaptive coders, the LPC information must also be quantized and transmitted or stored. Direct quantization of LPC coefficients is not efficient. A small quantization error in a single coefficient can render the entire LPC filter unstable. Even if the filter is stable, sufficient precision is required and too many bits will be needed. Instead, it is better to transform the LPC coefficients to another domain in which stability is more easily determined and fewer bits are required for representing the quantization levels.
The first such domain to be considered is the reflection coefficient [5]. Reflection coefficients are computed as a byproduct of LPC analysis. One of their properties is that all reflection coefficients must have magnitudes less than 1, making stability easily verified. Direct quantization of reflection coefficients is still not efficient because the sensitivity of the LPC filter to errors is much greater when reflection coefficients are nearly 1 or −1. More efficient quantizers have been designed by transforming the individual reflection coefficients with a nonlinearity that makes the error sensitivity more uniform. Two such nonlinear functions are the inverse sine function, \arcsin(k_i), and the logarithm of the area ratio, \log\frac{1+k_i}{1-k_i}.
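The effect of these nonlinearities is easy to see numerically: a uniform quantizer applied in the transformed domain produces steps in k that are finest where |k_i| approaches 1 and the filter is most error-sensitive. The range and bit allocation below are illustrative only, not taken from any standard:

```python
import numpy as np

def arcsine_transform(k):
    """Inverse sine transform of a reflection coefficient, |k| < 1."""
    return np.arcsin(k)

def log_area_ratio(k):
    """Log area ratio transform of a reflection coefficient, |k| < 1."""
    return np.log((1.0 + k) / (1.0 - k))

def inverse_log_area_ratio(g):
    """Inverse transform: maps a log area ratio back into (-1, 1)."""
    return (np.exp(g) - 1.0) / (np.exp(g) + 1.0)

def quantize_uniform(v, lo, hi, bits):
    """Uniform scalar quantizer with 2**bits mid-step levels on [lo, hi]."""
    levels = 2 ** bits
    step = (hi - lo) / levels
    idx = np.clip(np.floor((v - lo) / step), 0, levels - 1)
    return lo + (idx + 0.5) * step

# Equal steps in the LAR domain give unequal steps in k, finest near |k| = 1.
for k in (0.0, 0.9, 0.98):
    g_hat = quantize_uniform(log_area_ratio(k), -6.0, 6.0, 6)
    print(k, float(inverse_log_area_ratio(g_hat)))
```

The same idea applies to the arcsine transform, whose derivative likewise grows as |k_i| approaches 1.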
A second domain that has attracted even greater interest recently is the line spectral frequency (LSF) domain [6]. The transformation is given as follows. We first use A(z) to define two polynomials:

    P(z) = A(z) + z^{-(I+1)} A(z^{-1})    (45.4a)
    Q(z) = A(z) - z^{-(I+1)} A(z^{-1})    (45.4b)

These polynomials can be shown to have two useful properties: all zeroes of P(z) and Q(z) lie on the unit circle, and they are interlaced with each other. Thus, stability is easily checked by assuring both the interlaced property and that no two zeroes are too close together. A second property is that the frequencies tend to be clustered near the formant frequencies; the closer together two LSFs are, the sharper the formant. LSFs have attracted more interest recently because they typically result in quantizers having either better representations or using fewer bits than reflection coefficient quantizers.
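Eq. (45.4) can be checked numerically. The sketch below forms P(z) and Q(z) from the coefficient vector of A(z) (in the convention of Eq. (45.3)) and extracts the root angles; polynomial rooting via `np.roots` is used purely for illustration, since practical coders use much faster LSF search methods:

```python
import numpy as np

def lsf_polynomials(a):
    """Form P(z) and Q(z) of Eq. (45.4) from a = [1, a_1, ..., a_I],
    the coefficients of A(z) in powers z^0 .. z^-I."""
    # z^-(I+1) A(1/z) has coefficients [0, a_I, ..., a_1, 1].
    flipped = np.concatenate(([0.0], a[::-1]))
    padded = np.concatenate((a, [0.0]))
    return padded + flipped, padded - flipped

def line_spectral_frequencies(a):
    """LSFs in radians: root angles of P and Q in (0, pi), merged and sorted.
    (The fixed roots at z = +1 and z = -1 carry no information.)"""
    p, q = lsf_polynomials(a)
    angles = []
    for poly in (p, q):
        w = np.angle(np.roots(poly))
        angles.extend(w[(w > 1e-8) & (w < np.pi - 1e-8)])
    return np.sort(angles)
```

For a stable A(z), all roots of P and Q fall on the unit circle and the merged angles alternate between the two polynomials, which is the interlacing property used for the stability check.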
The simplest quantizers are scalar quantizers [8]. Each of the values (in whatever domain is
being used to represent the LPC coefficients) is represented by one of the possible quantizer levels.
The individual values are quantized independently of each other. There may also be additional
redundancy between successive frames, especially during stationary speech. In such cases, values
may be quantized differentially between frames.
A more efficient, but also more complex, method of quantization is called vector quantization [9]. In this technique, the complete set of values is quantized jointly. The actual set of values is compared against all sets in the codebook using a distance metric. The set that is nearest is selected. In practice, an exhaustive codebook search is too complex. For example, a 10-bit codebook has 1024 entries. This seems like a practical limit for most codebooks, but does not give sufficient performance for typical 10th order LPC. A 20-bit codebook would give increased performance, but would contain over 1 million vectors. This is both too much storage and too much computational complexity to be practical. Instead of using large codebooks, product codes are used. In one technique, an initial codebook is used, then the remaining error vector is quantized by a second stage codebook. In the second technique, the vector is sub-divided and each sub-vector is quantized using its own codebook. Both of these techniques lose efficiency compared to a full-search vector quantizer, but represent a good means for reducing computational complexity and codebook size at a given bit rate or quality.
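The second (split) product-code technique can be sketched in a few lines. The random codebooks here are stand-ins; a real coder would train them (for example with the LBG algorithm) on LSF vectors:

```python
import numpy as np

def split_vq_encode(v, codebooks):
    """Quantize v with a split VQ: each sub-vector searches its own
    codebook independently. codebooks is a list of (levels x dim) arrays
    whose dims sum to len(v). Returns (indices, reconstruction)."""
    indices, parts, pos = [], [], 0
    for cb in codebooks:
        dim = cb.shape[1]
        sub = v[pos : pos + dim]
        # Full search of this sub-codebook with a squared-error metric.
        j = int(np.argmin(np.sum((cb - sub) ** 2, axis=1)))
        indices.append(j)
        parts.append(cb[j])
        pos += dim
    return indices, np.concatenate(parts)

# Two 7-bit sub-codebooks (128 entries each) for a 10-dim LSF vector cost
# 2 * 128 distance computations, versus the 2**14 a joint 14-bit codebook
# would require for the same total index length.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((128, 5)), rng.standard_normal((128, 5))]
idx, v_hat = split_vq_encode(rng.standard_normal(10), codebooks)
```

The efficiency loss mentioned above comes from the fact that the two halves are quantized independently, so correlation between sub-vectors goes unexploited.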
45.2.2 Models of Human Perception for Speech Coding
Our ears have a limited dynamic range that depends on both the level and the frequency content of the input signal. The typical bandpass telephone filter has a stopband of only about 35 dB. Also, the logarithmic quantizer characteristics specified by CCITT Rec. G.711 result in a signal-to-quantization noise ratio of about 35 dB. Is this a coincidence? Of course not! If a signal maintains an SNR of about 35 dB or greater for telephone bandwidth, then most humans will perceive little or no noise.
Conceptually, the masking property tells us that we can permit greater amounts of noise in and near the formant regions and that noise will be most audible in the spectral valleys. If we use a coder that produces a white noise characteristic, then the noise spectrum is flat. The white noise would probably be audible in all but the formant regions.
In modern speech coders, an additional linear filter is added to weight the difference between the original speech signal and the synthesized signal. The object is to minimize the error in a space whose metric is like that of the human auditory system. If the LPC filter information is available, it constitutes the best available estimate of the speech spectrum. It can be used to form the basis for this “perceptual weighting filter” [10]. The perceptual weighting filter is given by

    W(z) = \frac{1 - A(z/\gamma_1)}{1 - A(z/\gamma_2)}, \qquad 0 < \gamma_2 < \gamma_1 < 1    (45.5)
The perceptual weighting filter de-emphasizes the importance of noise in the formant region and emphasizes its importance in spectral valleys. The quantization noise will have a spectral shape that is similar to that of the LPC spectral estimate, making it easier to mask.
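The operation underlying Eq. (45.5) is bandwidth expansion: replacing z by z/γ scales coefficient i of the polynomial by γ^i and moves each root r to γr, broadening the formant peaks. The sketch below builds both halves of a weighting filter from bandwidth-expanded copies of the LPC polynomial of Eq. (45.3). Note that sign conventions for A(z) vary across the literature, so this is an assumed reading of Eq. (45.5), and the γ values are typical but application-dependent:

```python
import numpy as np

def bandwidth_expand(a, gamma):
    """Coefficients of the polynomial evaluated at z/gamma, given
    a = [1, a_1, ..., a_I]: coefficient i is scaled by gamma**i,
    which moves each root r to gamma * r."""
    return a * gamma ** np.arange(len(a))

def pole_zero_filter(b, a, x):
    """Direct-form filter y = (B(z)/A(z)) x, with a[0] == 1 and zero
    initial state."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc
    return y

def perceptual_weighting(a, error_signal, gamma1=0.9, gamma2=0.6):
    """Weight a coding-error signal with a ratio of bandwidth-expanded
    LPC polynomials, in the spirit of Eq. (45.5)."""
    return pole_zero_filter(bandwidth_expand(a, gamma1),
                            bandwidth_expand(a, gamma2),
                            error_signal)
```

Because γ1 > γ2, the numerator keeps more of the formant structure than the denominator, so the weighted error is attenuated near the formants and amplified in the valleys, matching the shaping described above.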
The adaptive postfilter is an additional linear filter that is combined with the synthesis filter to reduce noise in the spectral valleys [11]. Once again the LPC synthesis filter is available as the estimate of the speech spectrum. As in the perceptual weighting filter, the synthesis filter is modified. This idea was later further extended to include a long-term (pitch) filter. A tilt-compensation filter was added to correct for the lowpass characteristic that causes a muffled sound. A gain control strategy helped prevent any segments from being either too loud or too soft. Adaptive postfilters are now included as a part of many standards.
45.3 Types of Speech Coders
This part of the section describes a variety of speech coders that are widely used. They are divided into two categories: waveform-following coders and model-based coders. Waveform-following coders have the property that if there were no quantization error, the original speech signal would be exactly reproduced. Model-based coders are based on parametric models of speech production. Only the values of the parameters are quantized. If there were no quantization error, the reproduced signal would not be the original speech.
45.3.1 Model-Based Speech Coders
LPC Vocoders
A block diagram of the LPC vocoder is shown in Fig. 45.1. LPC analysis is performed on a frame of speech and the LPC information is quantized and transmitted. A voiced/unvoiced determination is made. The decision may be based on either the original speech or the LPC residual signal, but it will always be based on the degree of periodicity of the signal. If the frame is classified as unvoiced, the excitation signal is white noise. If the frame is voiced, the pitch period is transmitted and the excitation signal is a periodic pulse train. In either case, the amplitude of the output signal is selected such that its power matches that of the original speech. For more information on the LPC vocoder, the reader is referred to [12].
FIGURE 45.1: Block diagram of LPC vocoder.
Multiband Excitation (MBE) Coders
Figure 45.2 is a block diagram of a multiband sinusoidal excitation coder. The basic premise of these coders is that the speech waveform can be modeled as a combination of harmonically related sinusoidal waveforms and narrowband noise. Within a given bandwidth, the speech is classified as periodic or aperiodic. Harmonically related sinusoids are used to generate the periodic components and white noise is used to generate the aperiodic components. Rather than transmitting a single voiced/unvoiced decision, a frame consists of a number of voiced/unvoiced decisions corresponding to the different bands. In addition, the spectral shape and gain must be transmitted to the receiver. LPC may or may not be used to quantize the spectral shape. Most often the analysis of the encoder is performed via fast Fourier transform (FFT). Synthesis at the decoder is usually performed by a number of parallel sinusoid and white noise generators. MBE coders are model-based because they do not transmit the phase of the sinusoids, nor do they attempt to capture anything more than the energy of the aperiodic components. For more information the reader is referred to [13]–[16].
FIGURE 45.2: Block diagram of multiband excitation coder.
Waveform Interpolation Coders
Figure 45.3 is a block diagram of a waveform interpolation coder. In this coder, the speech is assumed to be composed of a slowly evolving periodic waveform (SEW) and a rapidly evolving noise-like waveform (REW). A frame is analyzed first to extract a “characteristic waveform”. The evolution of these waveforms is filtered to separate the REW from the SEW. REW updates are made several times more often than SEW updates. The LPC, the pitch, the spectra of the SEW and REW, and the overall energy are all transmitted independently. At the receiver a parametric representation of the SEW and REW information is constructed, summed, and passed through the LPC synthesis filter to produce output speech. For more information the reader is referred to [17, 18].
FIGURE 45.3: Block diagram of waveform interpolation coder.
45.3.2 Time Domain Waveform-Following Speech Coders
All of the time domain waveform coders described in this section include a prediction filter. We begin with the simplest.
Adaptive Differential Pulse Code Modulation (ADPCM)
Adaptive differential pulse code modulation (ADPCM) [19] is based on sample-by-sample quantization of the prediction error. A simple block diagram is shown in Fig. 45.4. Two parts of the coder may be adaptive: the quantizer step-size and/or the prediction filter. ITU Recommendations G.726 and G.727 adapt both. The adaptation may be either forward or backward adaptive. In a backward adaptive system, the adaptation is based only on the previously quantized sample values and the quantizer codewords. At the receiver, the backward adaptive parameter values must be recomputed. An important feature of such adaptation schemes is that they must use predictors that include a leakage factor that allows the effects of erroneous values caused by channel errors to die out over time. In a forward adaptive system, the adapted values are quantized and transmitted. This additional “side information” uses bit rate, but can improve quality. Additionally, it does not require recomputation at the decoder.
Delta Modulation Coders
In delta modulation coders [20], the quantizer is just the sign bit. The quantization step size is adaptive. Not all the adaptation schemes used for ADPCM will work for delta modulation because the quantization is so coarse. The quality of delta modulation coders tends to be proportional to their sampling clock: the greater the sampling clock, the greater the correlation between successive samples, and the finer the quantization step size that can be used. The block diagram for delta modulation is the same as that of ADPCM.
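A toy adaptive delta modulator illustrates the sign-bit quantizer with backward step-size adaptation: the step grows while successive bits agree (fighting slope overload) and shrinks while they alternate (reducing granular noise). All constants here are illustrative, not taken from any standard:

```python
import numpy as np

def adaptive_delta_modulate(x, step0=0.01, grow=1.5, shrink=0.66, leak=0.99):
    """Encode x at 1 bit/sample; return (bits, local decoder output).
    The decoder applies the identical recursion, so only the sign bits
    need to be transmitted. `leak` is a leakage factor that lets the
    effect of channel errors die out, as discussed for backward
    adaptation in the ADPCM subsection."""
    bits = np.empty(len(x), dtype=int)
    y = np.empty(len(x))
    est, step, prev = 0.0, step0, 0
    for n, sample in enumerate(x):
        bit = 1 if sample >= est else -1          # sign of prediction error
        step *= grow if bit == prev else shrink   # backward step adaptation
        step = max(step, step0)                   # keep a minimum step size
        est = leak * est + bit * step             # leaky accumulator update
        bits[n], y[n], prev = bit, est, bit
    return bits, y

t = np.arange(2000)
tone = 0.5 * np.sin(2 * np.pi * 50 * t / 8000.0)  # slow 50 Hz tone at 8 kHz
bits, decoded = adaptive_delta_modulate(tone)
```

For a slowly varying input the decoded staircase tracks the waveform closely; raising the sampling clock relative to the signal bandwidth tightens the fit, matching the proportionality noted above.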
FIGURE 45.4: ADPCM encoder and decoder block diagrams.
Adaptive Predictive Coding
The better the performance of the prediction filter, the lower the bit rate needed to encode a
speech signal. This is the basis of the adaptive predictive coder [21] shown in Fig. 45.5. A forward
adaptive higher order linear prediction filter is used. The speech is quantized on a frame-by-frame
basis. In this way the bit rate for the excitation can be reduced compared to an equivalent quality
ADPCM coder.
FIGURE 45.5: Adaptive predictive coding encoder and decoder.
Linear Prediction Analysis-by-Synthesis Speech Coders
Figure 45.6 shows a typical linear prediction analysis-by-synthesis speech coder [22]. Like APC, these are frame-by-frame coders. They begin with an LPC analysis. Typically the LPC information is forward adaptive, but there are exceptions. LPAS coders borrow the concept from ADPCM of having a locally available decoder. The difference between the quantized output signal and the original signal is passed through a perceptual weighting filter. Possible excitation signals are considered and the best (minimum mean square error in the perceptual domain) is selected. The long-term prediction filter removes long-term correlation (the pitch structure) in the signal. If pitch structure is present in the coder, the parameters for the long-term predictor are determined first. The most commonly used system is the adaptive codebook, where samples from previous excitation sequences are stored. The pitch period and gain that result in the greatest reduction of perceptual error are selected, quantized, and transmitted. The fixed codebook excitation is next considered and, again, the excitation vector