MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval (Part 2)


The purpose of Chapter 2 is to provide the reader with a detailed overview of low-level audio descriptors. To a large extent this chapter provides the foundations and definitions for most of the remaining chapters of the book. Since MPEG-7 provides an established framework with a large set of descriptors, the standard is used as an example to illustrate the concept. The mathematical definitions of all MPEG-7 low-level audio descriptors are outlined in detail. Other established low-level descriptors beyond MPEG-7 are introduced. To help the reader visualize the kind of information that these descriptors convey, some experimental results are given to illustrate the definitions.

In Chapter 3 the reader is introduced to the concepts of sound similarity and sound classification. Various classifiers and their properties are discussed. Low-level descriptors introduced in the previous chapter are employed for illustration. The MPEG-7 standard is again used as a starting point to explain the practical implementation of sound classification systems. The performance of MPEG-7 systems is compared with the well-established MFCC feature extraction method. The chapter provides detailed simulation results of various systems for sound classification.

Chapter 4 focuses on MPEG-7 SpokenContent description. It is possible to follow most of the chapter without reading the other parts of the book. The primary goal is to provide the reader with a detailed overview of automatic speech recognition (ASR) and its use for MPEG-7 SpokenContent description. The structure of the MPEG-7 SpokenContent description itself is presented in detail and discussed in the context of the spoken document retrieval (SDR) application. The contribution of the MPEG-7 SpokenContent tool to the standardization and development of future SDR applications is emphasized. Many application examples and experimental results are provided to illustrate the concept.

Music description tools for specifying the properties of musical signals are discussed in Chapter 5. We focus explicitly on MPEG-7 tools. Concepts for instrument timbre description to specify perceptual features of musical sounds are discussed using reduced sets of descriptors. Melodies can be described using MPEG-7 description schemes for melodic similarity matching. We will discuss query-by-humming applications to provide the reader with examples of how melody can be extracted from a user's input and matched against melodies contained in a database.

An overview of audio fingerprinting and audio signal quality description is provided in Chapter 6. In general, the MPEG-7 low-level descriptors can be seen as providing a fingerprint for describing audio content. Audio fingerprinting has to a certain extent been described in Chapters 2 and 3. We will focus in Chapter 6 on fingerprinting tools specifically developed for the identification of a piece of audio and for describing its quality.

Chapter 7 finally provides an outline of example applications using the concepts developed in the previous chapters. Various applications and experimental results are provided to help the reader visualize the capabilities of concepts for content analysis and description.


• Basic descriptors: audio waveform (AWF), audio power (AP).

• Basic spectral descriptors: audio spectrum envelope (ASE), audio spectrum centroid (ASC), audio spectrum spread (ASS), audio spectrum flatness (ASF).

• Basic signal parameters: audio harmonicity (AH), audio fundamental frequency (AFF).

• Temporal timbral descriptors: log attack time (LAT) and temporal centroid (TC).

• Spectral timbral descriptors: harmonic spectral centroid (HSC), harmonic spectral deviation (HSD), harmonic spectral spread (HSS), harmonic spectral variation (HSV) and spectral centroid (SC).

• Spectral basis representations: audio spectrum basis (ASB) and audio spectrum projection (ASP).

An additional silence descriptor completes the MPEG-7 foundation layer.

MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval H.-G Kim, N Moreau and T Sikora


This chapter gives the mathematical definitions of all low-level audio descriptors according to the MPEG-7 audio standard. To help the reader visualize the kind of information that these descriptors convey, some experimental results are given to illustrate the definitions.¹

2.2 BASIC PARAMETERS AND NOTATIONS

There are two ways of describing low-level audio features in the MPEG-7 standard:

• A low-level descriptor (LLD) feature can be extracted from sound segments of variable lengths to mark regions with distinct acoustic properties. In this case, the summary descriptor extracted from a segment is stored as an MPEG-7 AudioSegment description. An audio segment represents a temporal interval of audio material, which may range from arbitrarily short intervals to the entire audio portion of a media document.

• An LLD feature can be extracted at regular intervals from sound frames. In this case, the resulting sampled values are stored as an MPEG-7 ScalableSeries description.

This section provides the basic parameters and notations that will be used to describe the extraction of the frame-based descriptors. The scalable series descriptions used to store the resulting series of LLDs will be described in Section 2.3.

2.2.1 Time Domain

In the time domain, the following notations will be used for the input audio signal:

• n is the index of time samples.

• $s(n)$ is the input digital audio signal.

• $F_s$ is the sampling rate of $s(n)$.

And for the time frames:

• l is the index of time frames

• hopSize is the time interval between two successive time frames.

¹ See also the LLD extraction demonstrator from the Technische Universität Berlin (MPEG-7 Audio Analyzer), available online at: http://mpeg7lld.nue.tu-berlin.de/.


• $N_{hop}$ denotes the integer number of time samples corresponding to hopSize.

• $L_w$ is the length of a time frame (with $L_w \ge$ hopSize).

• $N_w$ denotes the integer number of time samples corresponding to $L_w$.

• L is the total number of time frames in $s(n)$.

These notations are portrayed in Figure 2.1.

The choice of hopSize and $L_w$ depends on the kind of descriptor to extract. However, the standard constrains hopSize to be an integer multiple or divisor of 10 ms (its default value), in order to make descriptors that were extracted at different hopSize intervals compatible with each other.
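As a rough illustration of these notations, the sketch below converts hopSize and $L_w$ into the sample counts $N_{hop}$ and $N_w$ and slices a signal into frames. The function names, the rounding rule and the way frames are counted (only frames lying entirely inside the signal are kept) are assumptions made for this example; the text above does not prescribe them.

```python
def frame_parameters(fs, hop_size=0.010, lw=0.030):
    """Convert hopSize and Lw (in seconds) into sample counts Nhop and Nw."""
    n_hop = int(round(hop_size * fs))
    n_w = int(round(lw * fs))
    if n_w < n_hop:
        raise ValueError("Lw must be at least hopSize")
    return n_hop, n_w


def split_into_frames(s, n_hop, n_w):
    """Return the frames s(n + l*Nhop), 0 <= n < Nw, that fit entirely in s.

    Assumption: an incomplete trailing frame is dropped.
    """
    if len(s) < n_w:
        return []
    num_frames = 1 + (len(s) - n_w) // n_hop  # L, the total number of frames
    return [s[l * n_hop : l * n_hop + n_w] for l in range(num_frames)]


# One second of dummy samples at 16 kHz with the default 10 ms / 30 ms setting.
n_hop, n_w = frame_parameters(fs=16000)
frames = split_into_frames(list(range(16000)), n_hop, n_w)
print(n_hop, n_w, len(frames))
```

With a 30 ms window and a 10 ms hop, consecutive frames overlap by two thirds of their length.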

2.2.2 Frequency Domain

The extraction of some MPEG-7 LLDs is based on the estimation of short-term power spectra within overlapping time frames. In the frequency domain, the following notations will be used:

• k is the frequency bin index.

• $S_l(k)$ is the spectrum extracted from the lth frame of $s(n)$.

• $P_l(k)$ is the power spectrum extracted from the lth frame of $s(n)$.

Several techniques for spectrum estimation are described in the literature (Gold and Morgan, 1999). MPEG-7 does not standardize the technique itself, even though a number of implementation features are recommended (e.g. an $L_w$ of 30 ms for a default hopSize of 10 ms). The following just describes the most classical method, based on squared magnitudes of discrete Fourier transform (DFT) coefficients. After multiplying the frames with a windowing function $w(n)$ (e.g. a Hamming window), the DFT is applied as:

$$S_l(k) = \sum_{n=0}^{N_w-1} s(n + lN_{hop})\, w(n)\, \exp\left(-j\,\frac{2\pi nk}{N_{FT}}\right) \qquad (0 \le l \le L-1;\ 0 \le k \le N_{FT}-1) \tag{2.1}$$

where $N_{FT}$ is the size of the DFT ($N_{FT} \ge N_w$). In general, a fast Fourier transform (FFT) algorithm is used and $N_{FT}$ is the power of 2 just larger than $N_w$ (the enlarged frame is then padded with zeros).

Figure 2.1 Notations for frame-based descriptors

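According to Parseval's theorem, the average power of the signal in the lth analysis window can be written in two ways, as:

$$P_l = \frac{1}{E_w}\sum_{n=0}^{N_w-1}\left|s(n+lN_{hop})\,w(n)\right|^2 = \frac{1}{N_{FT}E_w}\sum_{k=0}^{N_{FT}-1}\left|S_l(k)\right|^2 \tag{2.2}$$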

The power spectrum $P_l(k)$ of the lth frame is defined as the squared magnitude of the DFT spectrum $S_l(k)$. Since the signal spectrum is symmetric around the Nyquist frequency $F_s/2$, it is possible to consider the first half of the power spectrum only ($0 \le k \le N_{FT}/2$) without losing any information. In order to ensure that the sum of all power coefficients equates to the average power defined in Equation (2.2), each coefficient can be normalized in the following way:

In the FFT spectrum, the discrete frequencies corresponding to bin indexes


Figure 2.2 Spectrogram of a music signal (cor anglais, 44.1 kHz)
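The classical method described in this section can be sketched as follows for a single frame. This is only an illustration under stated assumptions: $E_w$ is taken to be the energy of the window $w(n)$, and the one-sided normalization doubles every bin except $k = 0$ and $k = N_{FT}/2$ so that the half spectrum still sums to the average power of Equation (2.2). The function names are hypothetical, and a naive DFT is used instead of an FFT to keep the example dependency-free.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window w(n) of length Nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def power_spectrum(frame, n_ft):
    """One-sided power spectrum P(k), 0 <= k <= NFT/2, of one frame.

    Assumptions: NFT is even, E_w is the window energy, and bins other than
    k = 0 and k = NFT/2 are doubled to account for the discarded half.
    """
    n_w = len(frame)
    w = hamming(n_w)
    e_w = sum(v * v for v in w)
    # Windowed frame, padded with zeros up to the DFT size.
    x = [frame[n] * w[n] for n in range(n_w)] + [0.0] * (n_ft - n_w)
    half = n_ft // 2
    p = []
    for k in range(half + 1):
        s_k = sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_ft) for n in range(n_ft))
        scale = 1.0 if k in (0, half) else 2.0
        p.append(scale * abs(s_k) ** 2 / (n_ft * e_w))
    average_power = sum(v * v for v in x) / e_w  # time-domain side of Parseval
    return p, average_power


# A 48-sample frame analysed with NFT = 64, the next power of 2.
frame = [math.sin(2 * math.pi * 5 * n / 48) for n in range(48)]
p, avg = power_spectrum(frame, n_ft=64)
print(abs(sum(p) - avg) < 1e-9)
```

The final comparison checks numerically that the normalized half spectrum and the time-domain average power agree, which is the property the normalization is meant to preserve.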

2.3 SCALABLE SERIES

An MPEG-7 ScalableSeries description is a standardized way of representing a series of LLD features (scalars or vectors) extracted from sound frames at regular time intervals. Such a series can be described at full resolution or after a scaling operation. In the latter case, the series of original samples is decomposed into consecutive sub-sequences of samples. Each sub-sequence is then summarized by a single scaled sample.

An illustration of the scaling process and the resulting scalable series description is shown in Figure 2.3 (ISO/IEC, 2001), where i is the index of the scaled series. In this example, the 31 samples of the original series (filled circles) are summarized by 13 samples of the scaled series (open circles).

Figure 2.3 Structure of a scalable series description

The scale ratio of a given scaled sample is the number of original samples it stands for. Within a scalable series description, the scaled series is itself decomposed into successive sequences of scaled samples. In such a sequence, all scaled samples share the same scale ratio. In Figure 2.3, for example, the first three scaled samples each summarize two original samples (scale ratio is equal to 2), the next two summarize six each, the next two summarize one each, etc.

The attributes of a ScalableSeries are the following:

• Scaling: is a flag that specifies how the original samples are scaled. If absent, the original samples are described without scaling.

• totalNumOfSamples: indicates the total number of samples of the original series before any scaling operation.

• ratio: is an integer value that indicates the scale ratio of a scaled sample, i.e. the number of original samples represented by that scaled sample. This parameter is common to all the elements in a sequence of scaled samples. The value to be used when Scaling is absent is 1.

• numOfElements: is an integer value indicating the number of consecutive elements in a sequence of scaled samples that share the same scale ratio. If Scaling is absent, it is equal to the value of totalNumOfSamples.

The last sample of the series may summarize fewer than ratio samples. In the example of Figure 2.3, the last scaled sample has a ratio of 2, but actually summarizes only one original sample. This situation is detected by comparing the sum of ratio times numOfElements products to totalNumOfSamples.
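The consistency check on the last scaled sample can be sketched as follows. The representation of a scaled series as a list of (ratio, numOfElements) pairs and the function names are assumptions made for this example, and the layout used at the end is a small hypothetical one rather than the exact layout of Figure 2.3.

```python
def covered_samples(sequences):
    """Sum of ratio * numOfElements over all sequences of scaled samples."""
    return sum(ratio * num_of_elements for ratio, num_of_elements in sequences)


def last_sample_size(sequences, total_num_of_samples):
    """Number of original samples actually summarized by the last scaled sample."""
    excess = covered_samples(sequences) - total_num_of_samples
    last_ratio = sequences[-1][0]
    if not 0 <= excess < last_ratio:
        raise ValueError("sequences are inconsistent with totalNumOfSamples")
    return last_ratio - excess


# Hypothetical layout: three samples of ratio 2, one of ratio 1, one of ratio 2,
# describing an original series of 8 samples.
layout = [(2, 3), (1, 1), (2, 1)]
print(covered_samples(layout), last_sample_size(layout, 8))
```

Here the products sum to 9 while the original series holds 8 samples, so the last scaled sample (ratio 2) summarizes a single original sample.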

Two distinct types of scalable series are defined for representing series of scalars and series of vectors in the MPEG-7 LLD framework. Both types inherit from the scalable series description. The following sections present them in detail.

2.3.1 Series of Scalars

The MPEG-7 standard contains a SeriesOfScalar descriptor to represent a series of scalar values, at full resolution or scaled. This can be used with any temporal series of scalar LLDs. The attributes of a SeriesOfScalar description are:

• Raw: may contain the original series of scalars when no scaling operation is applied. It is only used if the Scaling flag is absent to store the entire series at full resolution.


• Weight: is an optional series of weights. If this attribute is present, each weight corresponds to a sample in the original series. These parameters can be used to control scaling.

• Min, Max and Mean: are three real-valued vectors in which each dimension characterizes a sample in the scaled series. For a given scaled sample, a Min, Max and Mean coefficient is extracted from the corresponding group of samples in the original series. The coefficient in Min is the minimum original sample value, the coefficient in Max is the maximum original sample value and the coefficient in Mean is the mean sample value. The original samples are averaged by arithmetic mean, taking the sample weights into account if the Weight attribute is present (see formulae below). These attributes are absent if the Raw element is present.

• Variance: is a real-valued vector. Each element corresponds to a scaled sample. It is the variance computed within the corresponding group of original samples. This computation may take the sample weights into account if the Weight attribute is present (see formulae below). This attribute is absent if the Raw element is present.

• Random: is a vector resulting from the selection of one sample at random within each group of original samples used for scaling. This attribute is absent if the Raw element is present.

• First: is a vector resulting from the selection of the first sample in each group of original samples used for scaling. This attribute is absent if the Raw element is present.

• Last: is a vector resulting from the selection of the last sample in each group of original samples used for scaling. This attribute is absent if the Raw element is present.

These different attributes allow us to summarize any series of scalar features. Such a description allows scalability, in the sense that a scaled series can be derived equally well from an original series (scaling operation) or from a previously scaled SeriesOfScalar (rescaling operation).

Initially, a series of scalar LLD features is stored in the Raw vector. Each element Raw(l) ($0 \le l \le L-1$) contains the value of the scalar feature extracted from the lth frame of the signal. Optionally, the Weight series may contain the weights $W(l)$ associated with each Raw(l) feature.

When a scaling operation is performed, a new SeriesOfScalar is generated by grouping the original samples (see Figure 2.3) and calculating the above-mentioned attributes. The Raw attribute is absent in the scaled series descriptor.

Let us assume that the ith scaled sample stands for the samples Raw(l) contained between $l = l_{Lo}(i)$ and $l = l_{Hi}(i)$ with:


where ratio is the scale ratio of the ith scaled sample (i.e. the number of original samples it stands for). The corresponding Min and Max values are then defined as:

$$\mathrm{Min}(i) = \min_{l=l_{Lo}(i)}^{l_{Hi}(i)} \mathrm{Raw}(l) \quad \text{and} \quad \mathrm{Max}(i) = \max_{l=l_{Lo}(i)}^{l_{Hi}(i)} \mathrm{Raw}(l) \tag{2.8}$$

The Mean value is given by:

if no sample weights $W(l)$ are specified in Weight. If weights are present, the Mean value is computed as:

In the same way, there are two computational methods for the Variance, depending on whether the original sample weights are absent or present:
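The scaling operation described in this section can be sketched as follows for a series of scalars. This is a hedged illustration, not the normative formulae: it assumes a single uniform ratio, normalizes the weighted mean and variance by the sum of weights, uses the population variance (division by the group size when no weights are given), and lets the last group be shorter than ratio. The function name and the dictionary layout are hypothetical.

```python
def scale_series(raw, ratio, weights=None):
    """Summarize groups of `ratio` consecutive samples of a scalar series.

    Assumptions: uniform ratio, positive weights, weighted mean and variance
    normalized by the sum of weights, last group possibly shorter than ratio.
    """
    scaled = {"Min": [], "Max": [], "Mean": [], "Variance": [], "First": [], "Last": []}
    for lo in range(0, len(raw), ratio):
        group = raw[lo : lo + ratio]
        # Without a Weight series, every sample counts equally.
        w = weights[lo : lo + ratio] if weights is not None else [1.0] * len(group)
        w_sum = sum(w)
        mean = sum(wi * x for wi, x in zip(w, group)) / w_sum
        variance = sum(wi * (x - mean) ** 2 for wi, x in zip(w, group)) / w_sum
        scaled["Min"].append(min(group))
        scaled["Max"].append(max(group))
        scaled["Mean"].append(mean)
        scaled["Variance"].append(variance)
        scaled["First"].append(group[0])
        scaled["Last"].append(group[-1])
    return scaled


raw = [1.0, 3.0, 2.0, 6.0, 5.0]
unweighted = scale_series(raw, ratio=2)
weighted = scale_series(raw, ratio=2, weights=[1.0, 3.0, 1.0, 1.0, 1.0])
print(unweighted["Mean"], weighted["Mean"])
```

In this example the five raw samples are grouped as (1, 3), (2, 6) and (5), so the last scaled sample summarizes only one original sample.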

2.3.2 Series of Vectors

The MPEG-7 standard also contains a SeriesOfVector descriptor to represent temporal series of feature vectors. As before, a series can be stored at the full original resolution or scaled. The attributes of a SeriesOfVector description are:

• vectorSize: is the number of elements of each vector in the series.

• Raw: may contain the original series of vectors when no scaling operation is applied. It is only used if the Scaling flag is absent to store the entire series at full resolution.


• Weight: is an optional series of weights. If this attribute is present, each weight corresponds to a vector in the original series. These parameters can be used to control scaling in the same way as for the SeriesOfScalar description.

• Min, Max and Mean: are three real-valued matrices. The number of rows is equal to the sum of numOfElements over the scaled series (i.e. the number of scaled vectors). The number of columns is equal to vectorSize. Each row characterizes a scaled vector. For a given scaled vector, a Min, Max and Mean row vector is extracted from the corresponding group of vectors in the original series. The row vector in Min contains the minimum coefficients observed among the original vectors, the row vector in Max contains the maximum coefficients observed among the original vectors and the row vector in Mean is the mean of the original vectors. Each vector coefficient is averaged in the same way as the Mean scalars in the previous section. These attributes are absent if the Raw element is present.

• Variance: is a series of variance vectors whose size is set to vectorSize. Each vector corresponds to a scaled vector. Its coefficients are equal to the variance computed within the corresponding group of original vectors. This computation may take the sample weights into account if the Weight attribute is present. This attribute is absent if the Raw element is present.

• Covariance: is a series of covariance matrices. It is represented as a three-dimensional matrix: the number of rows is equal to the sum of numOfElements parameters over the scaled series; the number of columns and number of pages are both equal to vectorSize. Each row is a covariance matrix describing a given scaled vector. It is estimated from the corresponding group of original vectors (see formula below). This attribute is absent if the Raw element is present.

• VarianceSummed: is a series of summed variance coefficients. Each coefficient corresponds to a scaled vector. For a given scaled vector, it is obtained by summing the elements of the corresponding Variance vector (see formula below). This attribute is absent if the Raw element is present.

• MaxSqDist: is a series of maximum squared distance (MSD) coefficients. For each scaled vector, an MSD coefficient is estimated (see formula below), representing an upper bound of the distance between the corresponding group of original vectors and their mean. This attribute is absent if the Raw element is present.

• Random: is a series of vectors resulting from the selection of one vector at random within each group of original vectors used for scaling. This attribute is absent if the Raw element is present.

• First: is a series of vectors resulting from the selection of the first vector in each group of original vectors used for scaling. This attribute is absent if the Raw element is present.

• Last: is a series of vectors resulting from the selection of the last vector in each group of original samples used for scaling. This attribute is absent if the Raw element is present.


As in the case of SeriesOfScalar, these attributes aim at summarizing a series of vectors through scaling and/or rescaling operations.

Initially, a series of vector LLD features is stored in the Raw attribute. Each element Raw(l) ($0 \le l \le L-1$) contains the vector extracted from the lth frame of the signal. Optionally, the Weight series may contain the weights $W(l)$ associated with each vector.

When a scaling operation is performed, a new SeriesOfVector is generated. The Min, Max, Mean and Weight attributes of the scaled series are defined in the same way as for the SeriesOfScalar scaling operation described in Section 2.3.1 (the same formulae are applied with vectors instead of scalars). The elements of the Covariance matrix of the ith scaled sample are defined as:

where b and b′ are indexes of vector dimensions. Raw(l, b) and Mean(i, b) are the bth coefficients of vectors Raw(l) and Mean(i). The VarianceSummed attribute of the ith scaled sample is defined as:

$$\mathrm{VarianceSummed}(i) = \sum_{b} \mathrm{Variance}(i, b)$$

where the sum runs over all vectorSize dimensions, and the MaxSqDist attribute as:

$$\mathrm{MaxSqDist}(i) = \max_{l=l_{Lo}(i)}^{l_{Hi}(i)} \left\| \mathrm{Raw}(l) - \mathrm{Mean}(i) \right\|^2$$
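The vector attributes can be illustrated with the following sketch, which summarizes one group of original vectors. It is an assumption-laden example: no weights are used, the covariance is normalized by the group size (the normalization is not recoverable from the text above), and the function name is hypothetical. Under that normalization the diagonal of the covariance matrix is the Variance vector, so VarianceSummed is its trace.

```python
def summarize_group(vectors):
    """Mean, Covariance, VarianceSummed and MaxSqDist of one group of vectors.

    Assumption: unweighted estimates normalized by the group size.
    """
    count = len(vectors)
    size = len(vectors[0])  # vectorSize
    mean = [sum(v[b] for v in vectors) / count for b in range(size)]
    covariance = [
        [sum((v[b] - mean[b]) * (v[c] - mean[c]) for v in vectors) / count
         for c in range(size)]
        for b in range(size)
    ]
    # Sum of the per-dimension variances, i.e. the trace of the covariance.
    variance_summed = sum(covariance[b][b] for b in range(size))
    # Largest squared Euclidean distance between a vector and the group mean.
    max_sq_dist = max(
        sum((v[b] - mean[b]) ** 2 for b in range(size)) for v in vectors
    )
    return mean, covariance, variance_summed, max_sq_dist


group = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mean, covariance, variance_summed, max_sq_dist = summarize_group(group)
print(mean, covariance, variance_summed, max_sq_dist)
```

For the four corner points of a square, every vector lies at the same squared distance from the mean, so MaxSqDist equals VarianceSummed in this particular case.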

2.3.3 Binary Series

The standard defines a binary form of the aforementioned SeriesOfScalar and SeriesOfVector descriptors: namely, the SeriesOfScalarBinary and SeriesOfVectorBinary descriptors. These descriptors are used to instantiate series of scalars or vectors with a uniform power-of-2 ratio. The goal is to ease the comparison of series with different scaling ratios, as the decimation required for the comparison between two binary series is also a power of 2.
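To make the benefit of power-of-2 ratios concrete, the sketch below brings a Mean series to a coarser ratio so that it can be compared sample by sample with another series. It is only an illustration under simplifying assumptions: no weights, every scaled sample summarizes exactly ratio original samples, and the function name is hypothetical.

```python
def rescale_mean(mean, ratio, target_ratio):
    """Rescale a Mean series from `ratio` to a coarser `target_ratio`.

    Assumptions: both ratios are powers of 2, no weights are used, and every
    scaled sample summarizes exactly `ratio` original samples.
    """
    for r in (ratio, target_ratio):
        if r < 1 or r & (r - 1):
            raise ValueError("ratios must be powers of 2")
    if target_ratio < ratio:
        raise ValueError("cannot rescale to a finer resolution")
    step = target_ratio // ratio  # decimation factor, itself a power of 2
    result = []
    for i in range(0, len(mean), step):
        group = mean[i : i + step]
        result.append(sum(group) / len(group))
    return result


# A series with ratio 2 brought to ratio 4 for comparison with a coarser series.
print(rescale_mean([1.0, 3.0, 5.0, 7.0], ratio=2, target_ratio=4))
```

Because both ratios are powers of 2, the decimation factor is always an integer, which is what makes the comparison straightforward.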

2.4 BASIC DESCRIPTORS

The goal of the following two descriptors is to provide a simple and economical description of the temporal properties of an audio signal.


2.4.1 Audio Waveform

A simple way to get a compact description of the shape of an audio signal $s(n)$ is to consider its minimum and maximum samples within successive non-overlapping frames (i.e. $L_w$ = hopSize). For each frame, two values are stored:

• minRange: the lower limit of audio amplitude in the frame.

• maxRange: the upper limit of audio amplitude in the frame.

The audio waveform (AWF) descriptor consists of the resulting temporal series of these (minRange, maxRange) pairs. The temporal resolution of the AWF is given by the hopSize parameter. If desired, the raw signal can be stored in an AWF descriptor by setting hopSize to the sampling period $1/F_s$ of $s(n)$.

The AWF provides an estimate of the signal envelope in the time domain. It also allows economical and straightforward storage, display or comparison techniques of waveforms. The display of the AWF description of a signal consists in drawing for each frame a vertical line from minRange to maxRange. The time axis is then labelled according to the hopSize information.
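The AWF extraction can be sketched in a few lines. The function name is hypothetical, and dropping an incomplete trailing frame is an assumption made for this example.

```python
def audio_waveform(s, n_hop):
    """Return the (minRange, maxRange) pair of each non-overlapping frame.

    Assumption: an incomplete trailing frame is dropped.
    """
    pairs = []
    for start in range(0, len(s) - n_hop + 1, n_hop):
        frame = s[start : start + n_hop]
        pairs.append((min(frame), max(frame)))
    return pairs


signal = [0.1, -0.4, 0.3, 0.2, -0.1, 0.5, 0.9]
print(audio_waveform(signal, n_hop=3))
print(audio_waveform(signal, n_hop=1))  # one-sample frames keep the raw signal
```

With one-sample frames, minRange and maxRange coincide with the sample itself, which corresponds to storing the raw signal as described above.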

Figure 2.4 gives graphical representations of the series of basic LLDs extracted from the music excerpt used in Figure 2.2. We can see that the MPEG-7 AWF provides a good approximation of the shape of the original waveform.

Figure 2.4 MPEG-7 basic descriptors extracted from a music signal (cor anglais, 44.1 kHz)


2.4.2 Audio Power

The audio power (AP) LLD describes the temporally smoothed instantaneous power of the audio signal. The AP coefficients are the average square of waveform values $s(n)$ within successive non-overlapping frames ($L_w$ = hopSize). The AP coefficient of the lth frame of the signal is thus:

$$AP(l) = \frac{1}{N_{hop}} \sum_{n=0}^{N_{hop}-1} \left| s(n + lN_{hop}) \right|^2 \qquad (0 \le l \le L-1) \tag{2.17}$$

where L is the total number of time frames. The AP allows us to measure the evolution of the amplitude of the signal as a function of time. In conjunction with other basic spectral descriptors (described below), it provides a quick representation of the spectrogram of a signal.

An example of the AP description of a music signal is given in Figure 2.4. The AP is measured in successive signal frames and given as a function of time (expressed in terms of frame index l). This provides a very simple representation of the signal content: the power peaks correspond to the parts where the original signal has a higher amplitude.
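Equation (2.17) translates directly into code. In this minimal sketch the function name is hypothetical and an incomplete trailing frame is dropped, which is an assumption.

```python
def audio_power(s, n_hop):
    """AP(l): mean of the squared samples in each non-overlapping frame.

    Assumption: an incomplete trailing frame is dropped.
    """
    return [
        sum(x * x for x in s[start : start + n_hop]) / n_hop
        for start in range(0, len(s) - n_hop + 1, n_hop)
    ]


print(audio_power([1.0, -1.0, 2.0, -2.0, 3.0, 3.0], n_hop=2))
```

The sign of the samples does not matter, so frames with larger absolute amplitude yield larger AP coefficients.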

2.5 BASIC SPECTRAL DESCRIPTORS

The four basic spectral LLDs provide time series of logarithmic frequency descriptions of the short-term audio power spectrum. The use of logarithmic frequency scales is intended to approximate the response of the human ear. All these descriptors are based on the estimation of short-term power spectra within overlapping time frames. This section describes the descriptors, based on the notations and definitions introduced in Section 2.2. For reasons of clarity, the frame index l will be discarded in the following formulae.

2.5.1 Audio Spectrum Envelope

The audio spectrum envelope (ASE) is a log-frequency power spectrum that can be used to generate a reduced spectrogram of the original audio signal. It is obtained by summing the energy of the original power spectrum within a series of frequency bands.

The bands are logarithmically distributed (base 2 logarithms) between two frequency edges loEdge (lower edge) and hiEdge (higher edge). The spectral resolution r of the frequency bands within the [loEdge, hiEdge] interval can be chosen from eight possible values, ranging from 1/16 of an octave to 8 octaves:

$$r = 2^{j}\ \text{octaves} \qquad (-4 \le j \le +3) \tag{2.18}$$


Both loEdge and hiEdge must be related to 1 kHz in the following way:

$$\mathrm{Edge} = 2^{rn} \times 1\ \text{kHz} \tag{2.19}$$

where r is the resolution in octaves and n is an integer value.

The default value of hiEdge is 16 kHz, which corresponds to the upper limit of hearing. The default value of loEdge is 62.5 Hz so that the default [loEdge, hiEdge] range corresponds to an 8-octave interval, logarithmically centred at a frequency of 1 kHz.
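The relation between an edge frequency and 1 kHz can be checked numerically. The sketch below assumes the relation has the form edge = 2^(rn) × 1 kHz with n an integer, which is how the sentence above is read here; the function name and the tolerance are hypothetical choices.

```python
import math


def is_valid_edge(edge_hz, r):
    """True if edge_hz equals 2**(r*n) * 1000 Hz for some integer n.

    Assumption: the constraint has the form edge = 2^(r*n) x 1 kHz.
    """
    n = math.log2(edge_hz / 1000.0) / r
    return math.isclose(n, round(n), abs_tol=1e-9)


# Default edges with a resolution of one octave, then an arbitrary frequency.
print(is_valid_edge(62.5, 1.0), is_valid_edge(16000.0, 1.0), is_valid_edge(100.0, 1.0))
```

The default edges are 4 octaves below and 4 octaves above 1 kHz, so they satisfy the relation for a resolution of one octave (n = −4 and n = +4).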

Within the default [loEdge, hiEdge] range, the number of logarithmic bands that corresponds to r is $B_{in} = 8/r$. The low ($loF_b$) and high ($hiF_b$) frequency edges of each band are given by:

$$loF_b = \mathrm{loEdge} \times 2^{(b-1)r}, \qquad hiF_b = \mathrm{loEdge} \times 2^{br} \qquad (1 \le b \le B_{in}) \tag{2.20}$$

The sum of power coefficients in band b, $[loF_b, hiF_b]$, gives the ASE coefficient for this frequency range. The coefficient for the band b is:

However, the distribution of the power spectrum coefficients $P(k)$ among the different frequency bands can be a problem, particularly for the narrower low-frequency bands when the resolution r is high. It is reasonable to assume that a power spectrum coefficient whose distance to a band edge is less than half the FFT resolution (i.e. less than $\Delta F/2$) contributes to the ASE coefficients of both neighbouring bands. How such a coefficient should be shared by the two bands is not specified by the standard. A possible method is depicted in Figure 2.5.

The $B_{in}$ within-band power coefficients are completed by two additional values: the powers of the spectrum between 0 Hz and loEdge and between hiEdge and the Nyquist frequency $F_s/2$ (provided that hiEdge < Nyquist frequency). These two values represent the out-of-band energy.

In the following, $B = B_{in} + 2$ will describe the total number of coefficients $ASE(b)$ ($0 \le b \le B-1$) forming the ASE descriptor extracted from one frame. With loEdge and hiEdge default values, the dimension of an ASE can be chosen between $B = 3$ ($B_{in} = 1$) with the minimal resolution of 8 octaves and $B = 130$ ($B_{in} = 128$) with the maximal resolution of 1/16 octave.
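The band layout of Equation (2.20) and the resulting ASE vector can be sketched as follows. The example rests on simplifying assumptions: bin k of the one-sided power spectrum is taken to lie at the frequency k·Fs/N_FT, and each bin is given entirely to the band that contains it, with no sharing between neighbouring bands. The function names are hypothetical.

```python
import math


def band_edges(lo_edge=62.5, hi_edge=16000.0, r=1.0):
    """Return the (loF_b, hiF_b) pairs of the B_in bands, following Eq. (2.20)."""
    b_in = round(math.log2(hi_edge / lo_edge) / r)
    return [
        (lo_edge * 2 ** ((b - 1) * r), lo_edge * 2 ** (b * r))
        for b in range(1, b_in + 1)
    ]


def audio_spectrum_envelope(p, fs, n_ft, lo_edge=62.5, hi_edge=16000.0, r=1.0):
    """Sum one-sided power coefficients P(k) into B = B_in + 2 ASE coefficients.

    Assumptions: bin k lies at k * fs / n_ft and is assigned to a single band.
    """
    b_in = len(band_edges(lo_edge, hi_edge, r))
    ase = [0.0] * (b_in + 2)
    for k, power in enumerate(p):
        f = k * fs / n_ft
        if f < lo_edge:
            ase[0] += power  # out-of-band energy below loEdge
        elif f >= hi_edge:
            ase[-1] += power  # out-of-band energy above hiEdge
        else:
            b = int(math.log2(f / lo_edge) / r)  # zero-based in-band index
            ase[1 + min(b, b_in - 1)] += power
    return ase


edges = band_edges()
flat = [1.0] * 513  # hypothetical flat one-sided spectrum for NFT = 1024
ase = audio_spectrum_envelope(flat, fs=44100, n_ft=1024)
print(len(edges), edges[0], edges[-1], len(ase))
```

With a flat spectrum the ASE coefficients grow with the band index, since each octave band is twice as wide in hertz as the previous one and therefore collects roughly twice as many bins.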

The extraction of an ASE vector from a power spectrum is depicted in

Figure 2.6 with, as an example, the loEdge and hiEdge default values and a
