Image Databases: Search and Retrieval of Digital Imagery
Edited by Vittorio Castelli, Lawrence D. Bergman
Copyright © 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-32116-8 (Hardback); 0-471-22463-4 (Electronic)
15 Multimedia Indexing
CHRISTOS FALOUTSOS
Carnegie Mellon University, Pittsburgh, Pennsylvania
15.1 INTRODUCTION
In this chapter we focus on the design of methods for rapidly searching a database
of multimedia objects, allowing us to locate objects that match a query object,
exactly or approximately. We want a method that is general and that can handle
any type of multimedia object. Objects can be two-dimensional (2D) color
images, gray-scale medical images in two or three dimensions (2D or 3D;
e.g., MRI brain scans), one-dimensional (1D) time series, digitized voice or
music, video clips, and so on. A typical query-by-content is “in a collection of
color photographs, find ones with the same color distribution as a sample sunset
photograph.”
Specific applications include the following:
• Image databases [1] in which we would like to support queries on color
(Chapter 11), shape (Chapter 13), and texture (Chapter 12).
• Video databases [2,3].
• Financial, marketing, and production time series, such as stock prices, sales
numbers and so on. In such databases, typical queries would be “find
companies whose stock prices move similarly,” or “find other companies
that have sales patterns similar to our company,” or “find cases in the past
that resemble last month’s sales pattern of our product”[4].
• Scientific databases (Chapters 3 and 5), with collections of sensor data. In
this case, the objects are time series or, more generally, vector fields, that is,
tuples of the form <x, y, z, t, pressure, temperature, ...>. For example,
in weather data [5], geologic, environmental, and astrophysics databases,
and so on, we want to ask queries of the form “find previous days in which
the solar magnetic wind showed patterns similar to today’s pattern” to help
in predictions of the Earth’s magnetic field [6].
• Multimedia databases, with audio (voice, music), video, and so on [7]. Users
might want to retrieve, for example, music scores or video clips that are
similar to provided examples.
• Medical databases (Chapter 4) in which 1D objects (e.g., ECGs), 2D images
(e.g., X rays), and 3D images (e.g., MRI brain scans) are stored. The ability to
rapidly retrieve past cases with similar symptoms would be valuable for
diagnosis and for medical and research purposes [8,9].
• Text and photographic archives [10], digital libraries [11,12] containing
ASCII text, bitmaps, gray scale, and color images.
• DNA databases [13] containing large collections of long strings (hundreds
or thousands of characters long) from a four-letter alphabet (A, G, C, T); a new
string has to be matched against the old strings to find the best candidates.
Searching for similar patterns in databases such as these is essential because it
helps in predictions, computer-aided medical diagnosis and teaching, hypothesis
testing and, in general, in “data mining” [14–16] and rule discovery.
Of course, the dissimilarity between two objects has to be quantified. Dissimilarity is measured as a distance between feature vectors extracted from the objects to be compared. We rely on a domain expert to supply such a distance function $D()$:
Definition 1. The distance (= dissimilarity) between two objects $O_1$ and $O_2$ is denoted by
$$D(O_1, O_2). \tag{15.1}$$
For example, if the objects are two (equal-length) time series, the distance $D()$ could be their Euclidean distance (the square root of the sum of squared differences), whereas for DNA sequences, the editing distance (the smallest number of insertions, deletions, and substitutions needed to transform the first string into the second) is customarily used.
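The two distance functions mentioned above can be sketched as follows. This is only an illustration under simple assumptions: the function names `euclidean_distance` and `edit_distance` are hypothetical, sequences are plain Python lists or strings, and every insertion, deletion, or substitution is assumed to cost 1.

```python
import math


def euclidean_distance(s, q):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(s) != len(q):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))


def edit_distance(a, b):
    """Smallest number of insertions, deletions, and substitutions
    needed to turn string a into string b (unit costs assumed)."""
    prev = list(range(len(b) + 1))      # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

The dynamic-programming table in `edit_distance` has one cell per pair of prefixes, so its running time grows with the product of the two string lengths.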
Similarity queries can be classified into two categories:
Whole Match. Given a collection of N objects $O_1, O_2, \ldots, O_N$ and a query object Q, we want to find those data objects that are within distance ε from Q. Notice that the query and the objects are of the same type: for example, if the objects are 512 × 512 gray-scale images, so is the query.
Subpattern Match. Here, the query is allowed to specify only part of an object being searched. Specifically, given N data objects (e.g., images) $O_1, O_2, \ldots, O_N$, a query object Q, and a tolerance ε, we want to identify the parts of the data objects that match the query. If the objects are, for example, 512 × 512 gray-scale images (such as medical X-rays), the query might be a 16 × 16 subpattern (e.g., a typical X-ray of a tumor).
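To make the difference from a whole match concrete, here is a naive subpattern search over a single 1D sequence. It is a brute-force sketch, not an indexing method: the function name `subpattern_matches` is hypothetical, and the Euclidean distance between the query and each same-length window is assumed as the dissimilarity.

```python
import math


def subpattern_matches(data, query, eps):
    """Return (offset, distance) for every window of `data` that lies
    within distance eps of `query` (windows have the query's length)."""
    w = len(query)
    hits = []
    for start in range(len(data) - w + 1):
        window = data[start:start + w]
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, query)))
        if d <= eps:
            hits.append((start, d))
    return hits


# Example: look for the short pattern [1, 2, 3] inside a longer series.
matches = subpattern_matches([0, 0, 1, 2, 3, 0, 0], [1, 2, 3], eps=0.5)
```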
Additional types of queries include “nearest neighbors” queries (e.g., “find the
five most similar stocks to IBM’s stock”) and “all pairs” queries or “spatial joins”
(e.g., “report all the pairs of stocks that are within distance ε from each other”).
Both these types of queries can be supported by our approach: As we shall see,
we can reduce the problem into searching for multidimensional points that will be
organized into R-trees; in this case, nearest-neighbor search can be handled with
a branch-and-bound algorithm [17,18] and the spatial-join query can be handled
with recently developed, finely tuned algorithms [19].
For both “whole match” and “subpattern match,” the ideal method should
fulfill the following requirements:
• It should be fast. Sequential scanning and computing distances for each and
every object can be too slow for large databases.
• It should be correct. In other words, it should return all the qualifying
objects without missing any (i.e., no “false dismissals”). Notice that “false
alarms” are acceptable because they can be discarded easily through a post-
processing step.
• The ideal method should require a small amount of additional memory.
• The method should be dynamic. It should be easy to insert, delete, and
update objects.
The remainder of the chapter is organized as follows. Section 15.2 describes the
main ideas for “GEMINI,” a generic approach to indexing multimedia objects.
Section 15.3 shows the application of the approach for 1D time series indexing.
Section 15.4 focuses on indexing methods for shape, texture, and particularly,
color. Section 15.5 shows how to extend the ideas to handle subpattern matching.
Section 15.6 summarizes the chapter and lists problems for future research.
Appendix 15.6 gives some background material on past related work, on image
indexing, and on spatial access methods (SAMs).
15.2 GEMINI: FUNDAMENTALS
To illustrate the basic concepts of indexing, we shall focus on “whole match”
queries. The problem is defined as follows:
• We have a collection of N objects: $O_1, O_2, \ldots, O_N$;
• The distance (dissimilarity) between two objects $(O_i, O_j)$ is given by the function $D(O_i, O_j)$;
• The user specifies a query object Q and a tolerance ε.
Our goal is to find the objects in the collection that are within distance ε of the query object. An obvious solution is to apply sequential scanning: for each and every object $O_i$ ($1 \le i \le N$), we can compute its distance from Q and report the objects with distance $D(Q, O_i) \le \varepsilon$.
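The sequential scan just described amounts to one distance computation per stored object. A minimal sketch, assuming objects are equal-length numeric lists and using hypothetical names (`sequential_scan`, `euclidean_distance`):

```python
import math


def euclidean_distance(s, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))


def sequential_scan(objects, query, eps, distance=euclidean_distance):
    """Return the indices i for which distance(query, objects[i]) <= eps.

    Every object is examined, so the cost is N distance computations."""
    return [i for i, obj in enumerate(objects) if distance(query, obj) <= eps]
```

Any other distance function can be passed through the `distance` parameter, which is why the scan serves as the correctness baseline for faster methods.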
However, sequential scanning may be slow, for two reasons:
1. The distance computation might be expensive. For example, the editing distance in DNA strings requires a dynamic-programming algorithm, whose cost grows with the product of the string lengths (typically in the hundreds or thousands for DNA databases);
2. The database size N might be huge.
Thus, we look for a faster alternative. The “GEMINI” (GEneric Multimedia
object INdexIng) approach is based on two ideas, each of which tries to avoid
one of the two disadvantages of sequential scanning:
• a “quick-and-dirty” test, to discard quickly the vast majority of nonqualifying
objects (possibly, allowing some false alarms);
• the use of SAM, to achieve faster-than-sequential searching, as suggested
by Jagadish [20].
This is best illustrated with an example. Consider a database of time series,
such as yearly stock price movements, with one price per day. Assume that the
distance function between two such series S and Q is the Euclidean distance
$$D(S, Q) \equiv \left( \sum_{i=1}^{365} (S[i] - Q[i])^2 \right)^{1/2}, \tag{15.2}$$
where S[i] stands for the value of stock S on the i-th day. Clearly, computing
the distance between two stocks will take 365 subtractions and 365 squarings.
The idea behind the “quick-and-dirty” test is to characterize a sequence with a
single number, which will help us discard many nonqualifying sequences. Such
a number could be, for example, the average stock price over the year. Clearly,
if two stocks differ in their averages by a large margin, they cannot be similar.
The converse is not true, which is exactly the reason we may have false alarms.
Numbers that contain some information about a sequence (or a multimedia object,
in general) will be referred to as “features” for the rest of this chapter. A good
feature (such as the “average” in the stock prices example) will allow us to
perform a quick test, which will discard many items, using a single numerical
comparison for each.
If using a single feature is good, using two or more features might be even
better because they may reduce the number of false alarms, at the cost of making
the “quick-and-dirty” test a bit more elaborate and expensive. In our stock prices
example, additional features might include the standard deviation or some of the
discrete Fourier transform (DFT) coefficients, as we shall see in Section 15.3.
By using f features, we can map each object into a point in f -dimensional
(f -d) space. We shall refer to this mapping as F():
Definition 2. Let F() be the mapping of objects to f -d points, that is, F(O)
will be the f -d point that corresponds to object O.
This mapping provides the key to improving on the second drawback of sequential scanning: by organizing these f-d points into a SAM, we can cluster them in a hierarchical structure, for example, an R*-tree. In processing a query, we use the R*-tree to prune out large portions of the database that are not promising. Such a structure will be referred to as an F-index (for “Feature index”). By using an F-index, we do not even have to do the “quick-and-dirty” test on all of the f-d points!
Figure 15.1. Illustration of the basic idea: a database of sequences S1, ..., Sn; each sequence is mapped to a point in feature space (axes: Feature 1, Feature 2); a query with tolerance ε becomes a sphere of radius ε.
Figure 15.1 illustrates the basic idea: Objects (e.g., time series that are 365 points long) are mapped into 2D points (e.g., using the average and standard deviation as features). Consider the “whole-match” query that requires all the objects that are similar to $S_n$ within tolerance ε: this query becomes an f-d sphere in feature space, centered on the image $F(S_n)$ of $S_n$. Such queries on multidimensional points are exactly what R-trees and other SAMs are designed to answer efficiently. More specifically, the search algorithm for a whole-match query is as follows:
Algorithm 1. Search an F-index:
1. Map the query object Q into a point F(Q) in feature space;
2. Using the SAM, retrieve all points within the desired tolerance ε from
F(Q);
3. Retrieve the corresponding objects, compute their actual distance from Q,
and discard the false alarms.
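Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the chapter's implementation: the class name `FIndex`, the helper `scaled_mean_feature`, and the toy data are assumptions, and a flat list of feature points stands in for a real SAM such as an R*-tree. The single feature is the sequence average scaled by √n; by the Cauchy-Schwarz inequality, its feature-space distance never exceeds the Euclidean distance between the sequences, so the filter step may produce false alarms but cannot drop a qualifying sequence.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scaled_mean_feature(seq):
    """1-d feature point: sqrt(n) * mean(seq), i.e. sum(seq) / sqrt(n)."""
    return (sum(seq) / math.sqrt(len(seq)),)


class FIndex:
    """Toy F-index; a flat list of feature points stands in for the SAM."""

    def __init__(self, objects, feature_fn):
        self.objects = objects
        self.feature_fn = feature_fn
        self.points = [feature_fn(o) for o in objects]

    def range_query(self, query, eps):
        # Step 1: map the query into feature space.
        fq = self.feature_fn(query)
        # Step 2: retrieve feature points within eps (a SAM would prune here).
        candidates = [i for i, p in enumerate(self.points)
                      if euclidean(fq, p) <= eps]
        # Step 3: compute actual distances and discard the false alarms.
        hits = [i for i in candidates
                if euclidean(query, self.objects[i]) <= eps]
        return candidates, hits


series = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0],
          [10.0, 10.0, 10.0, 10.0], [1.0, 2.0, 3.0, 5.0]]
index = FIndex(series, scaled_mean_feature)
candidates, hits = index.range_query([1.0, 2.0, 3.0, 4.0], eps=1.5)
```

In this toy run the reversed series has the same average as the query, so it survives the filter as a false alarm and is removed only in step 3.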
Intuitively, an F-index has the potential to relieve both problems of the sequential
scan, presumably resulting in much faster searches.
However, the mapping F() from objects to f -d points must not distort the
distances. More specifically, let $D()$ be the distance function between two objects and $D_{\mathrm{feature}}()$ be the distance between the corresponding feature vectors. Ideally,
the mapping F() should preserve the distances exactly, in which case the SAM
will have neither false alarms nor false dismissals. However, preserving distances
exactly might be very difficult: for example, it is not obvious which features can
be used to match the editing distance between two DNA strings. Even if the
features are obvious, there might be practical problems: for example, we could
treat every stock price sequence as a 365-dimensional vector. Although in theory
a SAM can support an arbitrary number of dimensions, in practice they all suffer
from the “dimensionality curse” discussed in the survey appendix.
The crucial observation is that we can avoid false dismissals completely in
the “F-index” method if the distance in feature space never overestimates the
distance between two objects. Intuitively, this means that our mapping F() from
objects to points should make things look closer. Mathematically, let $O_1$ and $O_2$ be two objects (e.g., same-length sequences) with distance function $D()$ (e.g., the Euclidean distance) and $F(O_1)$, $F(O_2)$ be their feature vectors (e.g., their first few Fourier coefficients), with distance function $D_{\mathrm{feature}}()$ (e.g., the Euclidean distance, again). Then we have:
Lemma 1. To guarantee no false dismissals for whole-match queries, the feature extraction function F() should satisfy the following formula:
$$D_{\mathrm{feature}}[F(O_1), F(O_2)] \le D(O_1, O_2). \tag{15.3}$$
Proof. Let Q be the query object, O be a qualifying object, and ε be the tolerance. We want to prove that if the object O qualifies for the query, then it will be retrieved when we issue a range query on the feature space. That is, we want to prove that
$$D(Q, O) \le \varepsilon \;\Rightarrow\; D_{\mathrm{feature}}[F(Q), F(O)] \le \varepsilon. \tag{15.4}$$
However, this is obvious, because
$$D_{\mathrm{feature}}[F(Q), F(O)] \le D(Q, O) \le \varepsilon. \tag{15.5}$$
QED.
Notice that we can still guarantee no false dismissals if
$$K \cdot D_{\mathrm{feature}}[F(O_1), F(O_2)] \le D(O_1, O_2), \tag{15.6}$$
where K is a constant. In this case, the only modification is that the query in
feature space should have a radius of ε/K. We shall need this generalization in
Section 15.4.
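The radius ε/K follows in one line, assuming K > 0. The second display is an illustrative example that is not taken from the chapter: it uses the sequence average as a single feature for two length-n sequences x and y under the Euclidean distance, and applies the Cauchy-Schwarz inequality.

$$D(Q, O) \le \varepsilon \;\Rightarrow\; K \cdot D_{\mathrm{feature}}[F(Q), F(O)] \le D(Q, O) \le \varepsilon \;\Rightarrow\; D_{\mathrm{feature}}[F(Q), F(O)] \le \varepsilon / K.$$

$$|\bar{x} - \bar{y}| = \frac{1}{n} \left| \sum_{i=1}^{n} (x_i - y_i) \right| \le \frac{1}{n} \sqrt{n} \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2} = \frac{D(x, y)}{\sqrt{n}}.$$

Under these assumptions the average satisfies the generalized condition with K = √n, so a query with tolerance ε becomes a range of ±ε/√n around the query's average.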
In conclusion, the approach to indexing multimedia objects for fast similarity
searching is as follows:
Algorithm 2. “GEMINI” approach:
1. Determine the distance function $D()$ between two objects;
2. Find one or more numerical feature-extraction functions, to provide a “quick-and-dirty” test;
3. Prove that the distance in feature space lower-bounds the actual distance $D()$, to guarantee correctness;
4. Choose a SAM and use it to manage the f-d feature vectors.
In the next sections we show two case studies of applying this approach to 2D
color images and to 1D time series. We shall see that the philosophy of the
“quick-and-dirty” filter, in conjunction with the lower-bounding lemma, can lead
to solutions to two problems:
• The dimensionality curse (time series)
• The “cross talk” of features (color images)
For each case study we (1) describe the objects and the distance function,
(2) show how to apply the lower-bounding lemma, and (3) give experimental
results on real or realistic data.
15.3 1D TIME SERIES
Here the goal is to search a collection of (equal length) time series to find the
ones that are similar to a desired series. For example, in a collection of yearly
stock price movements, we want to find the ones that are similar to IBM. For
the rest of the chapter, we shall use the following notational conventions: If S and
Q are two sequences, then:
• Len(S) denotes the length of S;
• S[i : j] denotes the subsequence that includes entries in positions i through j;
• S[i] denotes the ith entry of sequence S;
• D(S, Q) denotes the distance of the two (equal length) sequences S and Q.
15.3.1 Distance Function
The first step in the GEMINI algorithm is to determine the distance measure
between two time series. This is clearly application-dependent. Several measures
have been proposed for 1D and 2D signals. In a recent survey for images (2D
signals), Brown [21] mentions that one of the typical similarity measures is the
cross-correlation (which reduces to the Euclidean distance, plus some additive
and multiplicative constants).
As in Ref. [23], we chose the Euclidean distance because (1) it is useful in many cases and is generally applicable, and (2) other similarity measures can often be expressed as the Euclidean distance between feature vectors after some appropriate transformation [22].
We denote the Euclidean distance between two sequences S and Q by D(S, Q).
Additional and more elaborate distance functions, such as time-warping [24],
can also be handled [4] as long as we are able to extract appropriate features
from the time series.
15.3.2 Feature Extraction and Lower-Bounding
Having decided on the Euclidean distance as the dissimilarity measure, the next
step is to find some features that can lower-bound it. We would like a set of
features that preserve or lower-bound the distance and carry enough information
about the corresponding time series to limit the number of false alarms. The
second requirement suggests that we use “good” features, namely, features with
more discriminatory power. In the stock price example, a “bad” feature would
be, for example, the value during the first day: two stocks might have similar
first-day values, yet they may differ significantly from then on. Conversely, two
otherwise similar sequences may agree everywhere except for the first day’s
values.
A natural feature to use is the average. Additional features might include the
average of the first half, of the second half, of the first quarter, and so on. These
features resemble the first coefficients of the Hadamard transform [25]. In signal
processing, the most well-known transform is the Fourier transform, and, for
our case, the discrete Fourier transform (DFT). Before we describe the desirable
features of the DFT, we proceed with its definition and some of its properties.
15.3.3 Introduction to DFT
The n-point DFT [26,27] of a signal $x = [x_i]$, $i = 0, \ldots, n-1$, is defined to be a sequence $X$ of n complex numbers $X_F$, $F = 0, \ldots, n-1$, given by
$$X_F = \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} x_i \exp(-j 2\pi F i / n), \qquad F = 0, 1, \ldots, n-1, \tag{15.7}$$
where j is the imaginary unit, $j = \sqrt{-1}$. The signal x can be recovered by the inverse transform:
$$x_i = \frac{1}{\sqrt{n}} \sum_{F=0}^{n-1} X_F \exp(j 2\pi F i / n), \qquad i = 0, 1, \ldots, n-1, \tag{15.8}$$
where $X_F$ is a complex number (with the exception of $X_0$, which is real if the signal x is real). The energy $E(x)$ of a sequence x is defined as the sum of energies (squares of the amplitude $|x_i|$) at every point of the sequence:
$$E(x) \equiv \|x\|^2 \equiv \sum_{i=0}^{n-1} |x_i|^2. \tag{15.9}$$
A fundamental theorem for the correctness of our method is Parseval’s
theorem [27], which states that the DFT preserves the energy of a signal:
Theorem (Parseval). Let $X$ be the DFT of the sequence $x$. Then:
$$\sum_{i=0}^{n-1} |x_i|^2 = \sum_{F=0}^{n-1} |X_F|^2. \tag{15.10}$$
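The definitions and Parseval's theorem can be checked numerically with a direct, quadratic-time implementation of the orthonormal DFT. This is a sketch for illustration only: the function names `dft`, `idft`, and `energy` are hypothetical, and a production system would use a fast Fourier transform instead of the double loop.

```python
import cmath
import math


def dft(x):
    """Orthonormal n-point DFT with the 1/sqrt(n) scaling used above."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * f * i / n) for i in range(n))
            / math.sqrt(n)
            for f in range(n)]


def idft(X):
    """Inverse of dft(): recovers the signal from its coefficients."""
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * cmath.pi * f * i / n) for f in range(n))
            / math.sqrt(n)
            for i in range(n)]


def energy(seq):
    """Sum of squared amplitudes of a real or complex sequence."""
    return sum(abs(v) ** 2 for v in seq)


signal = [1.0, 2.0, 0.0, -1.0, 3.0]
coeffs = dft(signal)
# Parseval: energy(signal) and energy(coeffs) agree up to rounding error.
energies_agree = math.isclose(energy(signal), energy(coeffs), rel_tol=1e-9)
```

The 1/√n factor in both directions is what makes the transform orthonormal; with an unscaled forward transform the two energies would differ by a factor of n.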
Because the DFT is a linear transformation [27] and the Euclidean distance
between two signals x and y is the Euclidean norm of their difference, Parseval’s
theorem implies that the DFT preserves the Euclidean distance also:
$$D(x, y) = D(X, Y), \tag{15.11}$$
where $X$ and $Y$ are the Fourier transforms of x and y, respectively.
Thus, if we keep the first f coefficients of the DFT as the features, we have
$$D_{\mathrm{feature}}^2(F(x), F(y)) = \sum_{F=0}^{f-1} |X_F - Y_F|^2 \le \sum_{F=0}^{n-1} |X_F - Y_F|^2 = \sum_{i=0}^{n-1} |x_i - y_i|^2 \equiv D^2(x, y), \tag{15.12}$$
that is, the resulting distance in the f-d feature space never overestimates the distance between the two sequences. Thus, according to Lemma 1, there will be no false dismissals.
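The lower-bounding property can be observed directly by truncating the DFT at different values of f. The sketch below assumes the orthonormal DFT defined earlier; the names `dft_features` and `distance` and the two sample sequences are hypothetical.

```python
import cmath
import math


def dft_features(x, f):
    """First f coefficients of the orthonormal DFT of x (0 <= f <= len(x))."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
            / math.sqrt(n)
            for k in range(f)]


def distance(a, b):
    """Euclidean distance; also valid for complex-valued vectors."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))


x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 7.0]
y = [2.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 6.0]
true_distance = distance(x, y)
# Feature-space distance for f = 1, 2, ..., n retained coefficients.
feature_distances = [distance(dft_features(x, f), dft_features(y, f))
                     for f in range(1, len(x) + 1)]
```

The feature-space distance grows as more coefficients are kept and reaches the true distance when all n are used, which is the chain of inequalities above read from left to right.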
Note that the F-index approach can be applied with any orthonormal transform,
such as the discrete cosine transform (DCT) [28], the wavelet transform [29],
and so on, because they all preserve the distance between the original and the
transformed space. In fact, our response time will improve with the ability of the
transform to concentrate the energy: the fewer the coefficients that contain most
of the energy, the fewer the false alarms, and the faster our response time. Thus,
the performance results presented next are pessimistic bounds; better transforms
will achieve even better response times.
We have chosen the DFT because it is the most well known, its code is readily
available (e.g., in the Mathematica package [30] or in “C” [31]), and it does a
good job of concentrating the energy in the first few coefficients. In addition, the
DFT has the attractive property that the amplitude of the Fourier coefficients is
invariant under time shifts. Thus, using the DFT for feature extraction allows us
to extend our technique to finding similar sequences, while ignoring shifts.
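The shift invariance of the amplitudes can be illustrated as follows. One assumption needs stating: the sketch uses a circular shift (values that fall off one end wrap around to the other), which is the case in which the DFT amplitudes are exactly unchanged. The names `amplitude_spectrum` and `circular_shift` are hypothetical.

```python
import cmath
import math


def amplitude_spectrum(x):
    """Amplitudes |X_F| of the orthonormal DFT of x."""
    n = len(x)
    return [abs(sum(x[i] * cmath.exp(-2j * cmath.pi * f * i / n)
                    for i in range(n))) / math.sqrt(n)
            for f in range(n)]


def circular_shift(x, k):
    """Delay x by k positions, wrapping around at the end."""
    k %= len(x)
    return x[-k:] + x[:-k]


original = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
shifted = circular_shift(original, 2)
# A shift only changes the phase of each coefficient, not its amplitude.
same_amplitudes = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
    for a, b in zip(amplitude_spectrum(original), amplitude_spectrum(shifted)))
```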
15.3.4 Energy-Concentrating Properties of DFT
Having proved that keeping the first few DFT coefficients lower-bounds the
actual distance, we address the question of how good DFT is, that is, whether it
produces few false alarms. To achieve that, we have to argue that the first few
DFT coefficients will usually contain most of the information about the signal.
The worst-case signal for the method is white noise, in which each value $x_i$ is completely independent of its neighbors $x_{i-1}$ and $x_{i+1}$. The energy spectrum of white noise follows $O(F^0)$ [32], that is, it has the same energy in every frequency. This is bad for the F-index because it implies that all the frequencies are equally important. However, many real signals have a skewed energy spectrum. For example, random walks (also known as brown noise or Brownian walks) exhibit an energy spectrum of $O(F^{-2})$ [32] and therefore an amplitude spectrum of $O(F^{-1})$. Random walks follow the formula
$$x_i = x_{i-1} + z_i, \tag{15.13}$$
where $z_i$ is noise, that is, a random variable. Stock movements and exchange rates have been successfully modeled as random walks [33,34].
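The contrast between white noise and a random walk can be reproduced with synthetic data. The sketch below is an illustration under stated assumptions, not the chapter's experiment: the noise terms are drawn from a standard Gaussian, the sequences have 128 points, the seed is arbitrary, and the helper names (`random_walk`, `white_noise`, `leading_energy_fraction`) are hypothetical.

```python
import cmath
import math
import random


def dft(x):
    """Orthonormal n-point DFT (direct O(n^2) evaluation)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * f * i / n) for i in range(n))
            / math.sqrt(n)
            for f in range(n)]


def leading_energy_fraction(x, f):
    """Fraction of the total energy held by the first f DFT coefficients."""
    energies = [abs(c) ** 2 for c in dft(x)]
    return sum(energies[:f]) / sum(energies)


def white_noise(n, rng):
    """n independent Gaussian samples (assumed noise model)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def random_walk(n, rng):
    """x_i = x_{i-1} + z_i with Gaussian steps z_i, starting from 0."""
    x, walk = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)
        walk.append(x)
    return walk


rng = random.Random(42)
noise_fraction = leading_energy_fraction(white_noise(128, rng), 4)
walk_fraction = leading_energy_fraction(random_walk(128, rng), 4)
```

Comparing `walk_fraction` with `noise_fraction` shows how much more of a random walk's energy sits in the first few coefficients, which is what keeps the number of false alarms low for such signals.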
Figure 15.2 plots the movement of the exchange rate between the Swiss franc
and the U.S. dollar from August 7, 1990 to April 18, 1991 (30,000 measurements).
This data set is available through ftp from sfi.santafe.edu. Figure 15.3 shows the
amplitude of the Fourier coefficients and the 1/F line, in a doubly logarithmic
plot. Notice that, because it is a random walk, the amplitude of the Fourier
coefficients follows the 1/F line.
The mathematical argument for keeping the first few Fourier coefficients
agrees with the intuitive argument of the Dow Jones theory for stock price
Figure 15.2. The Swiss franc exchange rate; August 7, 1990 to April 18, 1991 (first 3,000 values).