The purpose of Chapter 2 is to provide the reader with a detailed overview of low-level audio descriptors. To a large extent this chapter provides the foundations and definitions for most of the remaining chapters of the book. Since MPEG-7 provides an established framework with a large set of descriptors, the standard is used as an example to illustrate the concept. The mathematical definitions of all MPEG-7 low-level audio descriptors are outlined in detail. Other established low-level descriptors beyond MPEG-7 are introduced. To help the reader visualize the kind of information that these descriptors convey, some experimental results are given to illustrate the definitions.
In Chapter 3 the reader is introduced to the concepts of sound similarity and sound classification. Various classifiers and their properties are discussed. Low-level descriptors introduced in the previous chapter are employed for illustration. The MPEG-7 standard is again used as a starting point to explain the practical implementation of sound classification systems. The performance of MPEG-7 systems is compared with the well-established MFCC feature extraction method. The chapter provides detailed simulation results for various sound classification systems.
Chapter 4 focuses on MPEG-7 SpokenContent description. It is possible to follow most of the chapter without reading the other parts of the book. The primary goal is to provide the reader with a detailed overview of automatic speech recognition (ASR) and its use for MPEG-7 SpokenContent description. The structure of the MPEG-7 SpokenContent description itself is presented in detail and discussed in the context of the spoken document retrieval (SDR) application. The contribution of the MPEG-7 SpokenContent tool to the standardization and development of future SDR applications is emphasized. Many application examples and experimental results are provided to illustrate the concept.
Music description tools for specifying the properties of musical signals are discussed in Chapter 5. We focus explicitly on MPEG-7 tools. Concepts for instrument timbre description to specify perceptual features of musical sounds are discussed using reduced sets of descriptors. Melodies can be described using MPEG-7 description schemes for melodic similarity matching. We will discuss query-by-humming applications to provide the reader with examples of how melody can be extracted from a user’s input and matched against melodies contained in a database.
An overview of audio fingerprinting and audio signal quality description is provided in Chapter 6. In general, the MPEG-7 low-level descriptors can be seen as providing a fingerprint for describing audio content. Audio fingerprinting has to a certain extent been described in Chapters 2 and 3. We will focus in Chapter 6 on fingerprinting tools specifically developed for the identification of a piece of audio and for describing its quality.
Chapter 7 finally provides an outline of example applications using the concepts developed in the previous chapters. Various applications and experimental results are provided to help the reader visualize the capabilities of concepts for content analysis and description.
• Basic descriptors: audio waveform (AWF), audio power (AP).
• Basic spectral descriptors: audio spectrum envelope (ASE), audio spectrum centroid (ASC), audio spectrum spread (ASS), audio spectrum flatness (ASF).
• Basic signal parameters: audio harmonicity (AH), audio fundamental frequency (AFF).
• Temporal timbral descriptors: log attack time (LAT) and temporal centroid (TC).
• Spectral timbral descriptors: harmonic spectral centroid (HSC), harmonic spectral deviation (HSD), harmonic spectral spread (HSS), harmonic spectral variation (HSV) and spectral centroid (SC).
• Spectral basis representations: audio spectrum basis (ASB) and audio spectrum projection (ASP).
An additional silence descriptor completes the MPEG-7 foundation layer.
MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, H.-G. Kim, N. Moreau and T. Sikora
This chapter gives the mathematical definitions of all low-level audio descriptors according to the MPEG-7 audio standard. To help the reader visualize the kind of information that these descriptors convey, some experimental results are given to illustrate the definitions.¹

2.2 BASIC PARAMETERS AND NOTATIONS
There are two ways of describing low-level audio features in the MPEG-7 standard:
• An LLD feature can be extracted from sound segments of variable lengths to mark regions with distinct acoustic properties. In this case, the summary descriptor extracted from a segment is stored as an MPEG-7 AudioSegment description. An audio segment represents a temporal interval of audio material, which may range from arbitrarily short intervals to the entire audio portion of a media document.
• An LLD feature can be extracted at regular intervals from sound frames. In this case, the resulting sampled values are stored as an MPEG-7 ScalableSeries description.
This section provides the basic parameters and notations that will be used to describe the extraction of the frame-based descriptors. The scalable series descriptions used to store the resulting series of LLDs will be described in Section 2.3.
2.2.1 Time Domain
In the time domain, the following notations will be used for the input audio signal:
• $n$ is the index of time samples.
• $s(n)$ is the input digital audio signal.
• $F_s$ is the sampling rate of $s(n)$.
And for the time frames:
• $l$ is the index of time frames.
• hopSize is the time interval between two successive time frames.
¹ See also the LLD extraction demonstrator from the Technische Universität Berlin (MPEG-7 Audio Analyzer), available on-line at: http://mpeg7lld.nue.tu-berlin.de/.
• $N_{hop}$ denotes the integer number of time samples corresponding to hopSize.
• $L_w$ is the length of a time frame (with $L_w \geq$ hopSize).
• $N_w$ denotes the integer number of time samples corresponding to $L_w$.
• $L$ is the total number of time frames in $s(n)$.
These notations are portrayed in Figure 2.1.
The choice of hopSize and $L_w$ depends on the kind of descriptor to extract. However, the standard constrains hopSize to be an integer multiple or divisor of 10 ms (its default value), in order to make descriptors that were extracted at different hopSize intervals compatible with each other.
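As an illustration of these constraints, the following minimal Python sketch checks whether a hop size is an integer multiple or divisor of 10 ms and converts hopSize and $L_w$ into sample counts. The function names, the rounding to the nearest sample and the frame-count convention (only frames that fit entirely in the signal are counted) are assumptions made for this example; they are not prescribed by the standard.

```python
from fractions import Fraction

DEFAULT_HOP_MS = Fraction(10)  # default hopSize of 10 ms


def is_valid_hop_size(hop_ms):
    """True if hop_ms is an integer multiple or an integer divisor of 10 ms."""
    hop = Fraction(hop_ms)
    if hop <= 0:
        return False
    q = hop / DEFAULT_HOP_MS
    # q = m (multiple) has denominator 1; q = 1/m (divisor) has numerator 1.
    return q.denominator == 1 or q.numerator == 1


def samples_per_interval(duration_ms, fs_hz):
    """Number of samples (N_hop or N_w) for a duration in ms.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(Fraction(duration_ms) * fs_hz / 1000)


def num_full_frames(n_samples, n_hop, n_w):
    """Frame count L, assuming only complete frames are kept (hypothetical)."""
    if n_samples < n_w:
        return 0
    return 1 + (n_samples - n_w) // n_hop


# Example: 1 s of audio at 44.1 kHz, hopSize = 10 ms, Lw = 30 ms.
n_hop = samples_per_interval(10, 44100)
n_w = samples_per_interval(30, 44100)
n_frames = num_full_frames(44100, n_hop, n_w)
```

Exact rational arithmetic is used for the hop-size test so that values such as 2.5 ms are not affected by floating-point rounding.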
2.2.2 Frequency Domain
The extraction of some MPEG-7 LLDs is based on the estimation of short-term power spectra within overlapping time frames. In the frequency domain, the following notations will be used:
• $k$ is the frequency bin index.
• $S_l(k)$ is the spectrum extracted from the $l$th frame of $s(n)$.
• $P_l(k)$ is the power spectrum extracted from the $l$th frame of $s(n)$.
Several techniques for spectrum estimation are described in the literature (Gold and Morgan, 1999). MPEG-7 does not standardize the technique itself, even though a number of implementation features are recommended (e.g. an $L_w$ of 30 ms for a default hopSize of 10 ms). The following just describes the most classical method, based on squared magnitudes of discrete Fourier transform (DFT) coefficients. After multiplying the frames with a windowing function $w(n)$ (e.g. a Hamming window), the DFT is applied as:

$$S_l(k) = \sum_{n=0}^{N_w-1} s(n + l N_{hop})\, w(n)\, \exp\left(-j \frac{2\pi n k}{N_{FT}}\right) \qquad (0 \leq l \leq L-1;\ 0 \leq k \leq N_{FT}-1) \qquad (2.1)$$

where $N_{FT}$ is the size of the DFT ($N_{FT} \geq N_w$). In general, a fast Fourier transform (FFT) algorithm is used and $N_{FT}$ is the power of 2 just larger than $N_w$ (the enlarged frame is then padded with zeros).

Figure 2.1 Notations for frame-based descriptors
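According to Parseval’s theorem, the average power of the signal in the $l$th analysis window can be written in two ways, as:

$$P_l = \frac{1}{E_w} \sum_{n=0}^{N_w-1} \left| s(n + l N_{hop})\, w(n) \right|^2 = \frac{1}{N_{FT} E_w} \sum_{k=0}^{N_{FT}-1} \left| S_l(k) \right|^2 \qquad (2.2)$$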
The power spectrum $P_l(k)$ of the $l$th frame is defined as the squared magnitude of the DFT spectrum $S_l(k)$. Since the signal spectrum is symmetric around the Nyquist frequency $F_s/2$, it is possible to consider the first half of the power spectrum only ($0 \leq k \leq N_{FT}/2$) without losing any information. In order to ensure that the sum of all power coefficients equates to the average power defined in Equation (2.2), each coefficient can be normalized in the following way:
In the FFT spectrum, the discrete frequencies corresponding to bin indexes
Figure 2.2 Spectrogram of a music signal (cor anglais, 44.1 kHz)
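The spectrum estimation of Section 2.2.2 can be sketched in a few lines of Python. The sketch assumes that the normalization factor $E_w$ is the energy of the window, and that the one-sided spectrum is normalized by doubling every coefficient except those at 0 Hz and at the Nyquist frequency. Both choices are assumptions made for this example, selected so that the coefficients sum to the average power of Equation (2.2). A direct DFT is used instead of an FFT for brevity, and all function names are hypothetical.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w (n_w >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1))
            for n in range(n_w)]


def frame_power_spectrum(s, l, n_hop, n_w, n_ft):
    """One-sided power spectrum P_l(k), 0 <= k <= n_ft // 2 (n_ft even)."""
    w = hamming(n_w)
    e_w = sum(x * x for x in w)  # assumption: E_w is the window energy
    frame = [s[n + l * n_hop] * w[n] for n in range(n_w)]
    power = []
    for k in range(n_ft // 2 + 1):
        # DFT of the windowed frame; the zero-padded samples add nothing.
        s_lk = sum(x * cmath.exp(-2j * math.pi * n * k / n_ft)
                   for n, x in enumerate(frame))
        # Assumption: interior bins are doubled to account for the mirrored
        # half of the spectrum; the 0 Hz and Nyquist bins are not.
        scale = 1.0 if k in (0, n_ft // 2) else 2.0
        power.append(scale * abs(s_lk) ** 2 / (n_ft * e_w))
    return power


def frame_average_power(s, l, n_hop, n_w):
    """Time-domain side of Equation (2.2) for frame l."""
    w = hamming(n_w)
    e_w = sum(x * x for x in w)
    return sum((s[n + l * n_hop] * w[n]) ** 2 for n in range(n_w)) / e_w
```

With these conventions, summing the output of `frame_power_spectrum` over $k$ gives the same value as `frame_average_power` for any real-valued signal, which is a convenient sanity check on an implementation.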
2.3 SCALABLE SERIES
An MPEG-7 ScalableSeries description is a standardized way of representing a series of LLD features (scalars or vectors) extracted from sound frames at regular time intervals. Such a series can be described at full resolution or after a scaling operation. In the latter case, the series of original samples is decomposed into consecutive sub-sequences of samples. Each sub-sequence is then summarized by a single scaled sample.
An illustration of the scaling process and the resulting scalable series description is shown in Figure 2.3 (ISO/IEC, 2001), where $i$ is the index of the scaled series. In this example, the 31 samples of the original series (filled circles) are summarized by 13 samples of the scaled series (open circles).

Figure 2.3 Structure of a scalable series description
The scale ratio of a given scaled sample is the number of original samples it stands for. Within a scalable series description, the scaled series is itself decomposed into successive sequences of scaled samples. In such a sequence, all scaled samples share the same scale ratio. In Figure 2.3, for example, the first three scaled samples each summarize two original samples (scale ratio is equal to 2), the next two six, the next two one, etc.
The attributes of a ScalableSeries are the following:
• Scaling: is a flag that specifies how the original samples are scaled. If absent, the original samples are described without scaling.
• totalNumOfSamples: indicates the total number of samples of the original series before any scaling operation.
• ratio: is an integer value that indicates the scale ratio of a scaled sample, i.e. the number of original samples represented by that scaled sample. This parameter is common to all the elements in a sequence of scaled samples. The value to be used when Scaling is absent is 1.
• numOfElements: is an integer value indicating the number of consecutive elements in a sequence of scaled samples that share the same scale ratio. If Scaling is absent, it is equal to the value of totalNumOfSamples.
The last sample of the series may summarize fewer than ratio samples. In the example of Figure 2.3, the last scaled sample has a ratio of 2, but actually summarizes only one original sample. This situation is detected by comparing the sum of the ratio × numOfElements products to totalNumOfSamples.
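The grouping implied by these attributes can be sketched as follows. The Python function names are hypothetical, and the list of (ratio, numOfElements) pairs used in the example is an assumption: the first three pairs are stated in the text, while the last pair (ratio 2, six elements) is inferred from the totals of 31 original and 13 scaled samples.

```python
def group_bounds(sequences, total_num_of_samples):
    """Return inclusive (lo, hi) original-sample indexes for each scaled sample.

    sequences is a list of (ratio, numOfElements) pairs; indexes are 0-based.
    """
    bounds = []
    lo = 0
    for ratio, num_of_elements in sequences:
        for _ in range(num_of_elements):
            if lo >= total_num_of_samples:
                raise ValueError("more scaled samples than original samples")
            # The last group is cut short if the series ends early.
            hi = min(lo + ratio, total_num_of_samples) - 1
            bounds.append((lo, hi))
            lo += ratio
    return bounds


def last_sample_is_truncated(sequences, total_num_of_samples):
    """Compare the sum of ratio * numOfElements with totalNumOfSamples."""
    nominal = sum(ratio * num for ratio, num in sequences)
    return nominal > total_num_of_samples


# Example consistent with the description of Figure 2.3 (last pair inferred).
figure_sequences = [(2, 3), (6, 2), (1, 2), (2, 6)]
figure_bounds = group_bounds(figure_sequences, 31)
```

Here the nominal coverage is 32 samples for a series of 31, so the last scaled sample summarizes a single original sample.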
Two distinct types of scalable series are defined for representing series of scalars and series of vectors in the MPEG-7 LLD framework. Both types inherit from the scalable series description. The following sections present them in detail.
2.3.1 Series of Scalars
The MPEG-7 standard contains a SeriesOfScalar descriptor to represent a series of scalar values, at full resolution or scaled. This can be used with any temporal series of scalar LLDs. The attributes of a SeriesOfScalar description are:
• Raw: may contain the original series of scalars when no scaling operation is applied. It is only used if the Scaling flag is absent to store the entire series at full resolution.
• Weight: is an optional series of weights. If this attribute is present, each weight corresponds to a sample in the original series. These parameters can be used to control scaling.
• Min, Max and Mean: are three real-valued vectors in which each dimension characterizes a sample in the scaled series. For a given scaled sample, a Min, Max and Mean coefficient is extracted from the corresponding group of samples in the original series. The coefficient in Min is the minimum original sample value, the coefficient in Max is the maximum original sample value and the coefficient in Mean is the mean sample value. The original samples are averaged by arithmetic mean, taking the sample weights into account if the Weight attribute is present (see formulae below). These attributes are absent if the Raw element is present.
• Variance: is a real-valued vector. Each element corresponds to a scaled sample. It is the variance computed within the corresponding group of original samples. This computation may take the sample weights into account if the Weight attribute is present (see formulae below). This attribute is absent if the Raw element is present.
• Random: is a vector resulting from the selection of one sample at random within each group of original samples used for scaling. This attribute is absent if the Raw element is present.
• First: is a vector resulting from the selection of the first sample in each group of original samples used for scaling. This attribute is absent if the Raw element is present.
• Last: is a vector resulting from the selection of the last sample in each group of original samples used for scaling. This attribute is absent if the Raw element is present.
These different attributes allow us to summarize any series of scalar features. Such a description allows scalability, in the sense that a scaled series can be derived either from an original series (scaling operation) or from a previously scaled SeriesOfScalar (rescaling operation).
Initially, a series of scalar LLD features is stored in the Raw vector. Each element Raw($l$) ($0 \leq l \leq L-1$) contains the value of the scalar feature extracted from the $l$th frame of the signal. Optionally, the Weight series may contain the weights $W(l)$ associated to each Raw($l$) feature.
When a scaling operation is performed, a new SeriesOfScalar is generated by grouping the original samples (see Figure 2.3) and calculating the above-mentioned attributes. The Raw attribute is absent in the scaled series descriptor. Let us assume that the $i$th scaled sample stands for the samples Raw($l$) contained between $l = l_{Lo}^{(i)}$ and $l = l_{Hi}^{(i)}$ with:

where ratio is the scale ratio of the $i$th scaled sample (i.e. the number of original samples it stands for). The corresponding Min and Max values are then defined as:

$$\text{Min}(i) = \min_{l = l_{Lo}^{(i)}}^{l_{Hi}^{(i)}} \text{Raw}(l) \quad \text{and} \quad \text{Max}(i) = \max_{l = l_{Lo}^{(i)}}^{l_{Hi}^{(i)}} \text{Raw}(l) \qquad (2.8)$$
The Mean value is given by:

if no sample weights $W(l)$ are specified in Weight. If weights are present, the Mean value is computed as:

In the same way, there are two computational methods for the Variance depending on whether the original sample weights are absent or present:
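A minimal Python sketch of the scalar scaling operation is given here for illustration. It uses the arithmetic (optionally weighted) mean described in the attribute list. The variance normalization is an assumption of this sketch: the squared deviations are divided by the number of samples in the group, or by the sum of the weights when weights are present. The function name and the dictionary layout are hypothetical.

```python
def scale_scalar_group(raw, lo, hi, weight=None):
    """Summarize the samples Raw(lo) .. Raw(hi) (inclusive) as one scaled sample."""
    group = raw[lo:hi + 1]
    # Without a Weight series, every sample counts equally.
    w = [1.0] * len(group) if weight is None else weight[lo:hi + 1]
    w_sum = sum(w)
    mean = sum(wi * x for wi, x in zip(w, group)) / w_sum
    # Assumption: variance normalized by the group size (or sum of weights).
    variance = sum(wi * (x - mean) ** 2 for wi, x in zip(w, group)) / w_sum
    return {
        "Min": min(group),
        "Max": max(group),
        "Mean": mean,
        "Variance": variance,
        "First": group[0],
        "Last": group[-1],
    }


# Example: scale the first two samples, with and without weights.
raw_series = [1.0, 3.0, 2.0, 6.0]
plain = scale_scalar_group(raw_series, 0, 1)
weighted = scale_scalar_group(raw_series, 0, 1, weight=[1.0, 3.0, 1.0, 1.0])
```

A weight of zero can be used in such a scheme to exclude a sample from the Mean and Variance, although Min and Max in this sketch still consider every sample in the group.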
2.3.2 Series of Vectors
The MPEG-7 standard also contains a SeriesOfVector descriptor to represent temporal series of feature vectors. As before, a series can be stored at the full original resolution or scaled. The attributes of a SeriesOfVector description are:
• vectorSize: is the number of elements of each vector in the series.
• Raw: may contain the original series of vectors when no scaling operation is applied. It is only used if the Scaling flag is absent to store the entire series at full resolution.
• Weight: is an optional series of weights. If this attribute is present, each weight corresponds to a vector in the original series. These parameters can be used to control scaling in the same way as for the SeriesOfScalar description.
• Min, Max and Mean: are three real-valued matrices. The number of rows is equal to the sum of numOfElements over the scaled series (i.e. the number of scaled vectors). The number of columns is equal to vectorSize. Each row characterizes a scaled vector. For a given scaled vector, a Min, Max and Mean row vector is extracted from the corresponding group of vectors in the original series. The row vector in Min contains the minimum coefficients observed among the original vectors, the row vector in Max contains the maximum coefficients observed among the original vectors and the row vector in Mean is the mean of the original vectors. Each vector coefficient is averaged in the same way as the Mean scalars in the previous section. These attributes are absent if the Raw element is present.
• Variance: is a series of variance vectors whose size is set to vectorSize. Each vector corresponds to a scaled vector. Its coefficients are equal to the variance computed within the corresponding group of original vectors. This computation may take the sample weights into account if the Weight attribute is present. This attribute is absent if the Raw element is present.
• Covariance: is a series of covariance matrices. It is represented as a three-dimensional matrix: the number of rows is equal to the sum of numOfElements parameters over the scaled series; the number of columns and number of pages are both equal to vectorSize. Each row is a covariance matrix describing a given scaled vector. It is estimated from the corresponding group of original vectors (see formula below). This attribute is absent if the Raw element is present.
• VarianceSummed: is a series of summed variance coefficients. Each coefficient corresponds to a scaled vector. For a given scaled vector, it is obtained by summing the elements of the corresponding Variance vector (see formula below). This attribute is absent if the Raw element is present.
• MaxSqDist: is a series of maximum squared distance (MSD) coefficients. For each scaled vector, an MSD coefficient is estimated (see formula below), representing an upper bound of the distance between the corresponding group of original vectors and their mean. This attribute is absent if the Raw element is present.
• Random: is a series of vectors resulting from the selection of one vector at random within each group of original vectors used for scaling. This attribute is absent if the Raw element is present.
• First: is a series of vectors resulting from the selection of the first vector in each group of original vectors used for scaling. This attribute is absent if the Raw element is present.
• Last: is a series of vectors resulting from the selection of the last vector in each group of original vectors used for scaling. This attribute is absent if the Raw element is present.
As in the case of SeriesOfScalar, these attributes aim at summarizing a series of vectors through scaling and/or rescaling operations.
Initially, a series of vector LLD features is stored in the Raw attribute. Each element Raw($l$) ($0 \leq l \leq L-1$) contains the vector extracted from the $l$th frame of the signal. Optionally, the Weight series may contain the weights $W(l)$ associated to each vector.
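When a scaling operation is performed, a new SeriesOfVector is generated. The Min, Max, Mean and Weight attributes of the scaled series are defined in the same way as for the SeriesOfScalar scaling operation described in Section 2.3.1 (the same formulae are applied with vectors instead of scalars). The elements of the Covariance matrix of the $i$th scaled sample are defined as:

where $b$ and $b'$ are indexes of vector dimensions. Raw($l$, $b$) and Mean($i$, $b$) are the $b$th coefficients of vectors Raw($l$) and Mean($i$). The VarianceSummed attribute of the $i$th scaled sample is defined as the sum of the Variance coefficients over all vector dimensions $b$:

$$\text{VarianceSummed}(i) = \sum_{b} \text{Variance}(i, b)$$

The MaxSqDist attribute of the $i$th scaled sample is given by:

$$\text{MaxSqDist}(i) = \max_{l = l_{Lo}^{(i)}}^{l_{Hi}^{(i)}} \left\| \text{Raw}(l) - \text{Mean}(i) \right\|^2$$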
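The vector attributes can be illustrated with a short Python sketch. It is a simplified, unweighted version: the covariance is normalized by the number of vectors in the group, which is an assumption of this example, and the Variance vector is taken as the diagonal of that covariance matrix. The function name and the dictionary layout are hypothetical.

```python
def scale_vector_group(raw, lo, hi):
    """Summarize the vectors Raw(lo) .. Raw(hi) (inclusive) as one scaled vector."""
    group = raw[lo:hi + 1]
    size = len(group[0])   # vectorSize
    count = len(group)
    mean = [sum(v[b] for v in group) / count for b in range(size)]
    # Assumption: covariance normalized by the number of vectors in the group.
    cov = [[sum((v[b] - mean[b]) * (v[c] - mean[c]) for v in group) / count
            for c in range(size)]
           for b in range(size)]
    variance = [cov[b][b] for b in range(size)]
    # Largest squared Euclidean distance between a group vector and the mean.
    max_sq_dist = max(sum((v[b] - mean[b]) ** 2 for b in range(size))
                      for v in group)
    return {
        "Min": [min(v[b] for v in group) for b in range(size)],
        "Max": [max(v[b] for v in group) for b in range(size)],
        "Mean": mean,
        "Variance": variance,
        "Covariance": cov,
        "VarianceSummed": sum(variance),
        "MaxSqDist": max_sq_dist,
    }


# Example: two 2-dimensional vectors summarized by one scaled vector.
summary = scale_vector_group([[0.0, 0.0], [2.0, 4.0]], 0, 1)
```

Note that VarianceSummed equals the trace of the covariance matrix, so it can be stored in place of the full matrix when only a scalar measure of spread is needed.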
2.3.3 Binary Series
The standard defines a binary form of the aforementioned SeriesOfScalar and SeriesOfVector descriptors: namely, the SeriesOfScalarBinary and SeriesOfVectorBinary descriptors. These descriptors are used to instantiate series of scalars or vectors with a uniform power-of-2 ratio. The goal is to ease the comparison of series with different scaling ratios, as the decimation required for the comparison between two binary series is also a power of 2.
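To illustrate why power-of-2 ratios simplify comparison, the following Python sketch brings two Mean series to a common ratio by decimating the finer one. It is only a sketch under stated assumptions: the series are unweighted, every group is complete, and the function names are hypothetical. Under those assumptions, the mean of a merged group is simply the average of the means it replaces.

```python
def decimate_mean_series(mean, factor):
    """Merge each run of `factor` consecutive Mean values into one value."""
    if factor < 1 or factor & (factor - 1):
        raise ValueError("factor must be a power of 2")
    merged = []
    for start in range(0, len(mean), factor):
        chunk = mean[start:start + factor]
        merged.append(sum(chunk) / len(chunk))
    return merged


def align_binary_series(mean_a, ratio_a, mean_b, ratio_b):
    """Rescale the finer of two power-of-2 series to the coarser ratio."""
    if ratio_a < ratio_b:
        mean_a = decimate_mean_series(mean_a, ratio_b // ratio_a)
    elif ratio_b < ratio_a:
        mean_b = decimate_mean_series(mean_b, ratio_a // ratio_b)
    return mean_a, mean_b, max(ratio_a, ratio_b)


# Example: a series with ratio 2 compared with a series with ratio 4.
a, b, common_ratio = align_binary_series([1.0, 3.0, 5.0, 7.0], 2, [2.0, 6.0], 4)
```

Because both ratios are powers of 2, the decimation factor is always an integer power of 2 and no interpolation between scaled samples is required.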
2.4 BASIC DESCRIPTORS
The goal of the following two descriptors is to provide a simple and economical description of the temporal properties of an audio signal.
2.4.1 Audio Waveform
A simple way to get a compact description of the shape of an audio signal $s(n)$ is to consider its minimum and maximum samples within successive non-overlapping frames (i.e. $L_w$ = hopSize). For each frame, two values are stored:
• minRange: the lower limit of audio amplitude in the frame.
• maxRange: the upper limit of audio amplitude in the frame.
The audio waveform (AWF) descriptor consists of the resulting temporal series of these (minRange, maxRange) pairs. The temporal resolution of the AWF is given by the hopSize parameter. If desired, the raw signal can be stored in an AWF descriptor by setting hopSize to the sampling period $1/F_s$ of $s(n)$.
The AWF provides an estimate of the signal envelope in the time domain. It also allows economical and straightforward storage, display or comparison techniques of waveforms. The display of the AWF description of a signal consists in drawing for each frame a vertical line from minRange to maxRange. The time axis is then labelled according to the hopSize information.
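The AWF extraction can be sketched in Python as follows. The function name is hypothetical, and dropping a trailing incomplete frame is an assumption of this sketch rather than a rule taken from the standard.

```python
def audio_waveform(s, n_hop):
    """Return the (minRange, maxRange) pair of each non-overlapping frame.

    s is the sequence of samples s(n); n_hop is the frame length in samples.
    A trailing frame shorter than n_hop is ignored (assumption).
    """
    pairs = []
    for start in range(0, len(s) - n_hop + 1, n_hop):
        frame = s[start:start + n_hop]
        pairs.append((min(frame), max(frame)))
    return pairs


# Example: seven samples, frames of three samples each.
signal = [0.1, -0.4, 0.3, 0.9, -0.2, 0.0, 0.5]
awf = audio_waveform(signal, 3)
```

With `n_hop` set to 1, each pair collapses to a single sample value, which corresponds to storing the raw signal as described above.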
Figure 2.4 gives graphical representations of the series of basic LLDs extracted from the music excerpt used in Figure 2.2. We can see that the MPEG-7 AWF provides a good approximation of the shape of the original waveform.

Figure 2.4 MPEG-7 basic descriptors extracted from a music signal (cor anglais, 44.1 kHz)
2.4.2 Audio Power
The audio power (AP) LLD describes the temporally smoothed instantaneous power of the audio signal. The AP coefficients are the average square of waveform values $s(n)$ within successive non-overlapping frames ($L_w$ = hopSize). The AP coefficient of the $l$th frame of the signal is thus:

$$AP(l) = \frac{1}{N_{hop}} \sum_{n=0}^{N_{hop}-1} \left| s(n + l N_{hop}) \right|^2 \qquad (0 \leq l \leq L-1) \qquad (2.17)$$

where $L$ is the total number of time frames. The AP allows us to measure the evolution of the amplitude of the signal as a function of time. In conjunction with other basic spectral descriptors (described below), it provides a quick representation of the spectrogram of a signal.
An example of the AP description of a music signal is given in Figure 2.4. The AP is measured in successive signal frames and given as a function of time (expressed in terms of frame index $l$). This provides a very simple representation of the signal content: the power peaks correspond to the parts where the original signal has a higher amplitude.
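Equation (2.17) translates directly into a few lines of Python. The function name is hypothetical, and ignoring a trailing incomplete frame is an assumption of this sketch.

```python
def audio_power(s, n_hop):
    """Average squared sample value of each non-overlapping frame of n_hop samples."""
    n_frames = len(s) // n_hop  # trailing partial frame is ignored (assumption)
    return [
        sum(x * x for x in s[l * n_hop:(l + 1) * n_hop]) / n_hop
        for l in range(n_frames)
    ]


# Example: five samples, frames of two samples each.
ap = audio_power([1.0, -1.0, 2.0, 0.0, 3.0], 2)
```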
2.5 BASIC SPECTRAL DESCRIPTORS
The four basic spectral LLDs provide time series of logarithmic frequency descriptions of the short-term audio power spectrum. The use of logarithmic frequency scales is supposed to approximate the response of the human ear. All these descriptors are based on the estimation of short-term power spectra within overlapping time frames. This section describes the descriptors, based on the notations and definitions introduced in Section 2.2. For reasons of clarity, the frame index $l$ will be discarded in the following formulae.
2.5.1 Audio Spectrum Envelope
The audio spectrum envelope (ASE) is a log-frequency power spectrum that can be used to generate a reduced spectrogram of the original audio signal. It is obtained by summing the energy of the original power spectrum within a series of frequency bands.
The bands are logarithmically distributed (base 2 logarithms) between two frequency edges loEdge (lower edge) and hiEdge (higher edge). The spectral resolution $r$ of the frequency bands within the [loEdge, hiEdge] interval can be chosen from eight possible values, ranging from 1/16 of an octave to 8 octaves:

$$r = 2^j \text{ octaves} \qquad (-4 \leq j \leq +3) \qquad (2.18)$$
Both loEdge and hiEdge must be related to 1 kHz in the following way:

$$\text{Edge} = 2^{rn} \times 1\ \text{kHz} \qquad (2.19)$$

where $r$ is the resolution in octaves and $n$ is an integer value.
The default value of hiEdge is 16 kHz, which corresponds to the upper limit of hearing. The default value of loEdge is 62.5 Hz so that the default [loEdge, hiEdge] range corresponds to an 8-octave interval, logarithmically centred at a frequency of 1 kHz.
Within the default [loEdge, hiEdge] range, the number of logarithmic bands that corresponds to $r$ is $B_{in} = 8/r$. The low ($loF_b$) and high ($hiF_b$) frequency edges of each band are given by:

$$loF_b = \text{loEdge} \times 2^{(b-1)r}, \qquad hiF_b = \text{loEdge} \times 2^{br} \qquad (1 \leq b \leq B_{in}) \qquad (2.20)$$

The sum of power coefficients in band $b$ [$loF_b$, $hiF_b$] gives the ASE coefficient for this frequency range. The coefficient for the band $b$ is:
However, the distribution of the power spectrum coefficients $P(k)$ among the different frequency bands can be a problem, particularly for the narrower low-frequency bands when the resolution $r$ is high. It is reasonable to assume that a power spectrum coefficient whose distance to a band edge is less than half the FFT resolution (i.e. less than $\Delta F/2$) contributes to the ASE coefficients of both neighbouring bands. How such a coefficient should be shared by the two bands is not specified by the standard. A possible method is depicted in Figure 2.5.
The $B_{in}$ within-band power coefficients are completed by two additional values: the powers of the spectrum between 0 Hz and loEdge and between hiEdge and the Nyquist frequency $F_s/2$ (provided that hiEdge < Nyquist frequency). These two values represent the out-of-band energy.
In the following, $B = B_{in} + 2$ will describe the total number of coefficients ASE($b$) ($0 \leq b \leq B-1$) forming the ASE descriptor extracted from one frame. With loEdge and hiEdge default values, the dimension of an ASE can be chosen between $B = 3$ ($B_{in} = 1$) with the minimal resolution of 8 octaves and $B = 130$ ($B_{in} = 128$) with the maximal resolution of 1/16 octave.
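The band layout of Equation (2.20) can be checked with a short Python sketch. The function name is hypothetical, and the sketch assumes that the [loEdge, hiEdge] range spans a whole number of bands of width $r$, as it does for the default edges. How the power coefficients are then shared between neighbouring bands is deliberately left out, since the standard does not specify it.

```python
import math


def ase_band_edges(j, lo_edge=62.5, hi_edge=16000.0):
    """Return the (loF_b, hiF_b) pairs in Hz for a resolution of r = 2**j octaves."""
    if not -4 <= j <= 3:
        raise ValueError("j must satisfy -4 <= j <= 3")
    r = 2.0 ** j
    n_octaves = math.log2(hi_edge / lo_edge)
    # Assumption: the range is an integer multiple of r (true for the defaults).
    b_in = round(n_octaves / r)
    return [
        (lo_edge * 2.0 ** ((b - 1) * r), lo_edge * 2.0 ** (b * r))
        for b in range(1, b_in + 1)
    ]


# Example: one-octave resolution (j = 0) over the default range.
octave_bands = ase_band_edges(0)
# Total ASE dimension B = B_in + 2 (two out-of-band coefficients).
b_total = len(octave_bands) + 2
```

For the default edges, `j = 3` gives a single band and `j = -4` gives 128 bands, which matches the range of ASE dimensions stated above once the two out-of-band coefficients are added.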
The extraction of an ASE vector from a power spectrum is depicted in
Figure 2.6 with, as an example, the loEdge and hiEdge default values and a