Image Databases, Part 15 (PDF)


Image Databases: Search and Retrieval of Digital Imagery
Edited by Vittorio Castelli, Lawrence D. Bergman
Copyright © 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-32116-8 (Hardback); 0-471-22463-4 (Electronic)

15 Multimedia Indexing

CHRISTOS FALOUTSOS
Carnegie Mellon University, Pittsburgh, Pennsylvania

15.1 INTRODUCTION

In this chapter we focus on the design of methods for rapidly searching a database of multimedia objects, allowing us to locate objects that match a query object, exactly or approximately. We want a method that is general and that can handle any type of multimedia object. Objects can be two-dimensional (2D) color images, gray scale medical images in two or three dimensions (2D or 3D; e.g., MRI brain scans), one-dimensional (1D) time series, digitized voice or music, video clips, and so on. A typical query by content is “in a collection of color photographs, find ones with the same color distribution as a sample sunset photograph.”

Specific applications include the following:

• Image databases [1] in which we would like to support queries on color (Chapter 11), shape (Chapter 13), and texture (Chapter 12).
• Video databases [2,3].
• Financial, marketing, and production time series, such as stock prices, sales numbers, and so on. In such databases, typical queries would be “find companies whose stock prices move similarly,” or “find other companies that have sales patterns similar to our company,” or “find cases in the past that resemble last month’s sales pattern of our product” [4].
• Scientific databases (Chapters 3 and 5), with collections of sensor data. In this case, the objects are time series or, more generally, vector fields, that is, tuples of the form $\langle x, y, z, t, \text{pressure}, \text{temperature}, \ldots \rangle$.
For example, in weather data [5], geologic, environmental, and astrophysics databases, and so on, we want to ask queries of the form “find previous days in which the solar magnetic wind showed patterns similar to today’s pattern” to help predict the Earth’s magnetic field [6].
• Multimedia databases, with audio (voice, music), video, and so on [7]. Users might want to retrieve, for example, music scores or video clips that are similar to provided examples.
• Medical databases (Chapter 4) in which 1D objects (e.g., ECGs), 2D images (e.g., X-rays), and 3D images (e.g., MRI brain scans) are stored. The ability to rapidly retrieve past cases with similar symptoms would be valuable for diagnosis and for medical and research purposes [8,9].
• Text and photographic archives [10] and digital libraries [11,12] containing ASCII text, bitmaps, gray scale images, and color images.
• DNA databases [13] containing large collections of long strings (hundreds or thousands of characters long) from a four-letter alphabet (A, G, C, T); a new string has to be matched against the old strings to find the best candidates.

Searching for similar patterns in databases such as these is essential because it helps in predictions, computer-aided medical diagnosis and teaching, hypothesis testing and, in general, in “data mining” [14–16] and rule discovery.

Of course, the dissimilarity between two objects has to be quantified. Dissimilarity is measured as a distance between feature vectors extracted from the objects to be compared. We rely on a domain expert to supply such a distance function D():

Definition 1. The distance (= dissimilarity) between two objects $O_1$ and $O_2$ is denoted by $D(O_1, O_2)$.
(15.1)

For example, if the objects are two (equal length) time series, the distance D() could be their Euclidean distance (the square root of the sum of squared differences), whereas for DNA sequences, the editing distance (the smallest number of insertions, deletions, and substitutions needed to transform the first string into the second) is customarily used.

Similarity queries can be classified into two categories:

Whole Match. Given a collection of N objects $O_1, O_2, \ldots, O_N$ and a query object Q, we want to find those data objects that are within distance ε from Q. Notice that the query and the objects are of the same type: for example, if the objects are 512 × 512 gray scale images, so is the query.

Subpattern Match. Here, the query is allowed to match only part of an object being searched. Specifically, given N data objects (e.g., images) $O_1, O_2, \ldots, O_N$, a query object Q, and a tolerance ε, we want to identify the parts of the data objects that match the query. If the objects are, for example, 512 × 512 gray scale images (such as medical X-rays), the query might be a 16 × 16 subpattern (e.g., a typical X-ray of a tumor).

Additional types of queries include “nearest neighbors” queries (e.g., “find the five most similar stocks to IBM’s stock”) and “all pairs” queries or “spatial joins” (e.g., “report all the pairs of stocks that are within distance ε from each other”). Both these types of queries can be supported by our approach: as we shall see, we can reduce the problem to searching for multidimensional points that will be organized into R-trees; in this case, nearest-neighbor search can be handled with a branch-and-bound algorithm [17,18] and the spatial-join query can be handled with recently developed, finely tuned algorithms [19].

For both “whole match” and “subpattern match,” the ideal method should fulfill the following requirements:

• It should be fast.
Sequential scanning and computing distances for each and every object can be too slow for large databases.
• It should be correct. In other words, it should return all the qualifying objects without missing any (i.e., no “false dismissals”). Notice that “false alarms” are acceptable because they can be discarded easily through a postprocessing step.
• It should require a small amount of additional memory.
• It should be dynamic: it should be easy to insert, delete, and update objects.

The remainder of the chapter is organized as follows. Section 15.2 describes the main ideas for “GEMINI,” a generic approach to indexing multimedia objects. Section 15.3 shows the application of the approach for 1D time series indexing. Section 15.4 focuses on indexing methods for shape, texture, and particularly, color. Section 15.5 shows how to extend the ideas to handle subpattern matching. Section 15.6 summarizes the chapter and lists problems for future research. Appendix 15.6 gives some background material on related past work, on image indexing, and on spatial access methods (SAMs).

15.2 GEMINI: FUNDAMENTALS

To illustrate the basic concepts of indexing, we shall focus on “whole match” queries. The problem is defined as follows:

• We have a collection of N objects: $O_1, O_2, \ldots, O_N$;
• The distance (dissimilarity) between two objects $(O_i, O_j)$ is given by the function $D(O_i, O_j)$;
• The user specifies a query object Q and a tolerance ε.

Our goal is to find the objects in the collection that are within distance ε of the query object. An obvious solution is to apply sequential scanning: for each and every object $O_i$ ($1 \le i \le N$), we can compute its distance from Q and report the objects with distance $D(Q, O_i) \le \varepsilon$.

However, sequential scanning may be slow, for two reasons:

1. The distance computation might be expensive.
For example, the editing distance in DNA strings requires a dynamic-programming algorithm whose cost grows with the product of the string lengths (typically in the hundreds or thousands for DNA databases);
2. The database size N might be huge.

Thus, we look for a faster alternative. The “GEMINI” (GEneric Multimedia object INdexIng) approach is based on two ideas, each of which tries to avoid one of the two disadvantages of sequential scanning:

• a “quick-and-dirty” test, to discard quickly the vast majority of nonqualifying objects (possibly allowing some false alarms);
• the use of spatial access methods (SAMs), to achieve faster-than-sequential searching, as suggested by Jagadish [20].

This is best illustrated with an example. Consider a database of time series, such as yearly stock price movements, with one price per day. Assume that the distance function between two such series S and Q is the Euclidean distance

$$D(S, Q) \equiv \left( \sum_{i=1}^{365} (S[i] - Q[i])^2 \right)^{1/2} \tag{15.2}$$

where S[i] stands for the value of stock S on the i-th day. Clearly, computing the distance between two stocks will take 365 subtractions and 365 squarings.

The idea behind the “quick-and-dirty” test is to characterize a sequence with a single number, which will help us discard many nonqualifying sequences. Such a number could be, for example, the average stock price over the year. Clearly, if two stocks differ in their averages by a large margin, they cannot be similar. The converse is not true, which is exactly the reason we may have false alarms. Numbers that contain some information about a sequence (or a multimedia object, in general) will be referred to as “features” for the rest of this chapter. A good feature (such as the “average” in the stock prices example) will allow us to perform a quick test, which will discard many items, using a single numerical comparison for each.
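The quick-and-dirty test on the average can be sketched as follows. This is a minimal illustration, not the chapter's code: the toy sequences, the names `quick_and_dirty_reject`, `stocks`, and `query`, and the value of `eps` are all hypothetical. The sketch also assumes a scaling factor that the text does not state: by the Cauchy-Schwarz inequality, sqrt(n) times the gap between the two averages never exceeds the Euclidean distance of Eq. (15.2), so rejecting on that scaled gap cannot cause a false dismissal.

```python
import math

def euclidean(s, q):
    """Euclidean distance of Eq. (15.2) for two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

def mean(s):
    return sum(s) / len(s)

def quick_and_dirty_reject(s, q, eps):
    """Return True if s can be safely discarded for tolerance eps.

    Assumption (not from the chapter): by Cauchy-Schwarz,
    sqrt(n) * |mean(s) - mean(q)| <= euclidean(s, q),
    so this test never discards a qualifying sequence.
    """
    n = len(s)
    return math.sqrt(n) * abs(mean(s) - mean(q)) > eps

# Hypothetical toy data: four short "stock" sequences and a query.
stocks = {
    "A": [10.0, 11.0, 12.0, 11.0],
    "B": [10.5, 11.5, 11.5, 11.0],
    "C": [50.0, 52.0, 51.0, 53.0],   # very different average: rejected cheaply
    "D": [8.0, 13.0, 9.0, 13.0],     # same average as the query, but not similar
}
query = [10.0, 11.0, 11.0, 11.0]
eps = 1.5

# One numerical comparison per sequence removes the obvious non-matches ...
candidates = [k for k, s in stocks.items()
              if not quick_and_dirty_reject(s, query, eps)]
# ... and the full distance is computed only for the survivors.
answers = [k for k in candidates if euclidean(stocks[k], query) <= eps]

print(candidates)  # ['A', 'B', 'D']  ("D" is a false alarm)
print(answers)     # ['A', 'B']
```

Sequence "D" shows why false alarms arise: its average equals the query's, so the cheap test cannot reject it, and only the full distance computation removes it.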
If using a single feature is good, using two or more features might be even better because they may reduce the number of false alarms, at the cost of making the “quick-and-dirty” test a bit more elaborate and expensive. In our stock prices example, additional features might include the standard deviation or some of the discrete Fourier transform (DFT) coefficients, as we shall see in Section 15.3.

By using f features, we can map each object into a point in f-dimensional (f-d) space. We shall refer to this mapping as F():

Definition 2. Let F() be the mapping of objects to f-d points, that is, F(O) will be the f-d point that corresponds to object O.

This mapping provides the key to improving on the second drawback of sequential scanning: by organizing these f-d points into a SAM, we can cluster them in a hierarchical structure, for example, an R*-tree. In processing a query, we use the R*-tree to prune out large portions of the database that are not promising. Such a structure will be referred to as an F-index (for “Feature index”). By using an F-index, we do not even have to do the “quick-and-dirty” test on all of the f-d points!

Figure 15.1. Illustration of the basic idea: a database of sequences $S_1, \ldots, S_n$; each sequence is mapped to a point in feature space (axes: Feature 1, Feature 2); a query with tolerance ε becomes a sphere of radius ε.

Figure 15.1 illustrates the basic idea: objects (e.g., time series that are 365 points long) are mapped into 2D points (e.g., using the average and standard deviation as features). Consider the “whole-match” query that requires all the objects that are similar to $S_n$ within tolerance ε: this query becomes an f-d sphere in feature space, centered on the image $F(S_n)$ of $S_n$. Such queries on multidimensional points are exactly what R-trees and other SAMs are designed to answer efficiently.
More specifically, the search algorithm for a whole-match query is as follows:

Algorithm 1. Search an F-index:
1. Map the query object Q into a point F(Q) in feature space;
2. Using the SAM, retrieve all points within the desired tolerance ε from F(Q);
3. Retrieve the corresponding objects, compute their actual distance from Q, and discard the false alarms.

Intuitively, an F-index has the potential to relieve both problems of the sequential scan, presumably resulting in much faster searches. However, the mapping F() from objects to f-d points must not distort the distances. More specifically, let D() be the distance function between two objects and $D_{\text{feature}}()$ be the distance between the corresponding feature vectors. Ideally, the mapping F() should preserve the distances exactly, in which case the SAM will have neither false alarms nor false dismissals. However, preserving distances exactly might be very difficult: for example, it is not obvious which features can be used to match the editing distance between two DNA strings. Even if the features are obvious, there might be practical problems: for example, we could treat every stock price sequence as a 365-dimensional vector. Although in theory a SAM can support an arbitrary number of dimensions, in practice all SAMs suffer from the “dimensionality curse” discussed in the survey appendix.

The crucial observation is that we can avoid false dismissals completely in the “F-index” method if the distance in feature space never overestimates the distance between two objects. Intuitively, this means that our mapping F() from objects to points should make things look closer. Mathematically, let $O_1$ and $O_2$ be two objects (e.g., same-length sequences) with distance function D() (e.g., the Euclidean distance) and $F(O_1)$, $F(O_2)$ be their feature vectors (e.g., their first few Fourier coefficients), with distance function $D_{\text{feature}}()$ (e.g., the Euclidean distance, again). Then we have:

Lemma 1.
To guarantee no false dismissals for whole-match queries, the feature extraction function F() should satisfy the following formula:

$$D_{\text{feature}}[F(O_1), F(O_2)] \le D(O_1, O_2) \tag{15.3}$$

Proof. Let Q be the query object, O be a qualifying object, and ε be the tolerance. We want to prove that if the object O qualifies for the query, then it will be retrieved when we issue a range query on the feature space. That is, we want to prove that

$$D(Q, O) \le \varepsilon \Rightarrow D_{\text{feature}}[F(Q), F(O)] \le \varepsilon \tag{15.4}$$

However, this is obvious, because

$$D_{\text{feature}}[F(Q), F(O)] \le D(Q, O) \le \varepsilon \tag{15.5}$$

QED.

Notice that we can still guarantee no false dismissals if

$$K \, D_{\text{feature}}[F(O_1), F(O_2)] \le D(O_1, O_2) \tag{15.6}$$

where K is a constant. In this case, the only modification is that the query in feature space should have a radius of ε/K. We shall need this generalization in Section 15.4.

In conclusion, the approach to indexing multimedia objects for fast similarity searching is as follows:

Algorithm 2. “GEMINI” approach:
1. Determine the distance function D() between two objects;
2. Find one or more numerical feature-extraction functions, to provide a “quick-and-dirty” test;
3. Prove that the distance in feature space lower-bounds the actual distance D(), to guarantee correctness;
4. Choose a SAM and use it to manage the f-d feature vectors.

In the next sections we show two case studies of applying this approach to 2D color images and to 1D time series. We shall see that the philosophy of the “quick-and-dirty” filter, in conjunction with the lower-bounding lemma, can lead to solutions to two problems:

• The dimensionality curse (time series)
• The “cross talk” of features (color images)

For each case study we (1) describe the objects and the distance function, (2) show how to apply the lower-bounding lemma, and (3) give experimental results on real or realistic data.
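The filter-and-refine search of Algorithm 1, together with the scaled bound of Eq. (15.6), can be sketched as below. Several things here are assumptions made for illustration rather than the chapter's method: the feature map is (mean, population standard deviation), a linear scan over precomputed feature points stands in for the SAM, and all names and data are hypothetical. For these two features one can show that sqrt(n) times the feature-space distance never exceeds the Euclidean distance between the sequences, so K = sqrt(n) and the feature-space query uses radius ε/K.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def features(seq):
    """F(): map a sequence to the 2-d point (mean, population std)."""
    n = len(seq)
    m = sum(seq) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in seq) / n)
    return (m, sd)

def build_f_index(db):
    """Precompute feature points; a real F-index would store them in a SAM."""
    return {name: features(seq) for name, seq in db.items()}

def gemini_range_query(db, f_index, query, eps):
    """Algorithm 1, with a linear scan standing in for the SAM (assumption)."""
    k = math.sqrt(len(query))      # constant K of Eq. (15.6) for these features
    fq = features(query)           # step 1: map the query into feature space
    candidates = [name for name, f in f_index.items()
                  if euclidean(f, fq) <= eps / k]     # step 2: radius eps / K
    return [name for name in candidates
            if euclidean(db[name], query) <= eps]     # step 3: drop false alarms

# Hypothetical toy database of length-4 sequences.
db = {
    "a": [1.0, 2.0, 3.0, 4.5],
    "b": [4.0, 3.0, 2.0, 1.0],      # same mean and std as the query: false alarm
    "c": [10.0, 20.0, 30.0, 40.0],
    "d": [1.2, 2.1, 2.9, 4.3],
}
f_index = build_f_index(db)
query = [1.0, 2.0, 3.0, 4.0]
result = gemini_range_query(db, f_index, query, eps=1.0)
print(result)  # ['a', 'd']
```

Because the feature distance lower-bounds the true distance, the result matches what a full sequential scan would return; only the number of full distance computations changes.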
15.3 1D TIME SERIES

Here the goal is to search a collection of (equal length) time series to find the ones that are similar to a desired series. For example, in a collection of yearly stock price movements, we want to find the ones that are similar to IBM.

For the rest of the chapter, we shall use the following notational conventions. If S and Q are two sequences, then:

• Len(S) denotes the length of S;
• S[i : j] denotes the subsequence that includes entries in positions i through j;
• S[i] denotes the ith entry of sequence S;
• D(S, Q) denotes the distance of the two (equal length) sequences S and Q.

15.3.1 Distance Function

The first step in the GEMINI algorithm is to determine the distance measure between two time series. This is clearly application-dependent. Several measures have been proposed for 1D and 2D signals. In a recent survey for images (2D signals), Brown [21] mentions that one of the typical similarity measures is the cross-correlation (which reduces to the Euclidean distance, plus some additive and multiplicative constants). As in Ref. [23], we choose the Euclidean distance because (1) it is useful in many cases and (2) other similarity measures can often be expressed as the Euclidean distance between feature vectors after some appropriate transformation [22]. We denote the Euclidean distance between two sequences S and Q by D(S, Q).

Additional and more elaborate distance functions, such as time-warping [24], can also be handled [4] as long as we are able to extract appropriate features from the time series.

15.3.2 Feature Extraction and Lower-Bounding

Having decided on the Euclidean distance as the dissimilarity measure, the next step is to find some features that can lower-bound it.
We would like a set of features that preserve or lower-bound the distance and carry enough information about the corresponding time series to limit the number of false alarms. The second requirement suggests that we use “good” features, namely, features with more discriminatory power. In the stock price example, a “bad” feature would be, for example, the value during the first day: two stocks might have similar first-day values, yet they may differ significantly from then on. Conversely, two otherwise similar sequences may agree everywhere except for the first day’s values.

A natural feature to use is the average. Additional features might include the average of the first half, of the second half, of the first quarter, and so on. These features resemble the first coefficients of the Hadamard transform [25]. In signal processing, the most well-known transform is the Fourier transform, and, for our case, the discrete Fourier transform (DFT). Before we describe the desirable properties of the DFT, we proceed with its definition and some of its properties.

15.3.3 Introduction to DFT

The n-point DFT [26,27] of a signal $x = [x_i]$, $i = 0, \ldots, n-1$, is defined to be a sequence $\vec{X}$ of n complex numbers $X_F$, $F = 0, \ldots, n-1$, given by

$$X_F = \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} x_i \exp(-j 2\pi F i / n), \quad F = 0, 1, \ldots, n-1 \tag{15.7}$$

where j is the imaginary unit, $j = \sqrt{-1}$. The signal x can be recovered by the inverse transform:

$$x_i = \frac{1}{\sqrt{n}} \sum_{F=0}^{n-1} X_F \exp(j 2\pi F i / n), \quad i = 0, 1, \ldots, n-1 \tag{15.8}$$

where $X_F$ is a complex number (with the exception of $X_0$, which is real if the signal x is real). The energy E(x) of a sequence x is defined as the sum of energies (squares of the amplitude $|x_i|$) at every point of the sequence:

$$E(x) \equiv \|x\|^2 \equiv \sum_{i=0}^{n-1} |x_i|^2 \tag{15.9}$$

A fundamental theorem for the correctness of our method is Parseval’s theorem [27], which states that the DFT preserves the energy of a signal:

Theorem (Parseval). Let $\vec{X}$ be the DFT of the sequence x.
Then:

$$\sum_{i=0}^{n-1} |x_i|^2 = \sum_{F=0}^{n-1} |X_F|^2 \tag{15.10}$$

Because the DFT is a linear transformation [27] and the Euclidean distance between two signals x and y is the Euclidean norm of their difference, Parseval’s theorem implies that the DFT also preserves the Euclidean distance:

$$D(x, y) = D(\vec{X}, \vec{Y}) \tag{15.11}$$

where $\vec{X}$ and $\vec{Y}$ are the Fourier transforms of x and y, respectively. Thus, if we keep the first f coefficients of the DFT as the features, we have

$$D_{\text{feature}}^2(F(x), F(y)) = \sum_{F=0}^{f-1} |X_F - Y_F|^2 \le \sum_{F=0}^{n-1} |X_F - Y_F|^2 = \sum_{i=0}^{n-1} |x_i - y_i|^2 \equiv D^2(x, y) \tag{15.12}$$

that is, the resulting distance in the f-d feature space will never overestimate the distance of two sequences. Thus, according to Lemma 1, there will be no false dismissals.

Note that the F-index approach can be applied with any orthonormal transform, such as the discrete cosine transform (DCT) [28], the wavelet transform [29], and so on, because they all preserve the distance between the original and the transformed space. In fact, our response time will improve with the ability of the transform to concentrate the energy: the fewer the coefficients that contain most of the energy, the fewer the false alarms, and the faster our response time. Thus, the performance results presented next are pessimistic bounds; better transforms will achieve even better response times.

We have chosen the DFT because it is the most well known, its code is readily available (e.g., in the Mathematica package [30] or in “C” [31]), and it does a good job of concentrating the energy in the first few coefficients. In addition, the DFT has the attractive property that the amplitude of the Fourier coefficients is invariant under time shifts. Thus, using the DFT for feature extraction allows us to extend our technique to finding similar sequences while ignoring shifts.
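The lower-bounding argument of Section 15.3.3 can be checked numerically with a small sketch. It uses a naive O(n²) implementation of the orthonormal DFT of Eq. (15.7), written with the standard library only; the two sequences and the choice f = 2 are hypothetical, and a practical system would presumably use an FFT routine instead.

```python
import cmath
import math

def dft(x):
    """Orthonormal n-point DFT of Eq. (15.7), computed naively in O(n^2)."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * F * i / n) for i in range(n))
        / math.sqrt(n)
        for F in range(n)
    ]

def dist(a, b):
    """Euclidean distance; works for real and complex entries alike."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical equal-length sequences.
x = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 5.0, 7.0]
y = [1.5, 2.5, 3.0, 3.5, 4.0, 6.5, 6.0, 6.5]
X, Y = dft(x), dft(y)

f = 2                                      # number of coefficients kept as features
energy_time = sum(v * v for v in x)        # left side of Eq. (15.10)
energy_freq = sum(abs(c) ** 2 for c in X)  # right side of Eq. (15.10)
d_true = dist(x, y)                        # D(x, y)
d_full = dist(X, Y)                        # Eq. (15.11): equals d_true
d_feat = dist(X[:f], Y[:f])                # feature distance, never above d_true

print(math.isclose(energy_time, energy_freq))  # True (Parseval)
print(math.isclose(d_true, d_full))            # True (distance preserved)
print(d_feat <= d_true)                        # True (lower bound, Lemma 1)
```

Keeping more coefficients (a larger f) can only raise `d_feat` toward `d_true`, which is why additional features reduce false alarms without ever introducing false dismissals.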
15.3.4 Energy-Concentrating Properties of DFT

Having proved that keeping the first few DFT coefficients lower-bounds the actual distance, we address the question of how good the DFT is, that is, whether it produces few false alarms. To achieve that, we have to argue that the first few DFT coefficients will usually contain most of the information about the signal.

The worst-case signal for the method is white noise, in which each value $x_i$ is completely independent of its neighbors $x_{i-1}$ and $x_{i+1}$. The energy spectrum of white noise follows $O(F^0)$ [32], that is, it has the same energy in every frequency. This is bad for the F-index because it implies that all the frequencies are equally important. However, many real signals have a skewed energy spectrum. For example, random walks (also known as brown noise or Brownian walks) exhibit an energy spectrum of $O(F^{-2})$ [32] and therefore an amplitude spectrum of $O(F^{-1})$. Random walks follow the formula

$$x_i = x_{i-1} + z_i \tag{15.13}$$

where $z_i$ is noise, that is, a random variable. Stock movements and exchange rates have been successfully modeled as random walks [33,34]. Figure 15.2 plots the movement of the exchange rate between the Swiss franc and the U.S. dollar from August 7, 1990 to April 18, 1991 (30,000 measurements). This data set is available through ftp from sfi.santafe.edu. Figure 15.3 shows the amplitude of the Fourier coefficients and the 1/F line, in a doubly logarithmic plot. Notice that, because it is a random walk, the amplitude of the Fourier coefficients follows the 1/F line.

The mathematical argument for keeping the first few Fourier coefficients agrees with the intuitive argument of the Dow Jones theory for stock price

Figure 15.2. The Swiss franc exchange rate; August 7, 1990 to April 18, 1991 (first 3,000 values).
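The contrast between white noise and a random walk described in Section 15.3.4 can be explored with a short simulation. This is only a sketch on synthetic data, not the Swiss franc data set: the sequence length, the seed, the Gaussian noise model, and the helper name `low_freq_energy_fraction` are all assumptions chosen for illustration.

```python
import cmath
import math
import random

def dft(x):
    """Orthonormal n-point DFT of Eq. (15.7), computed naively in O(n^2)."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * F * i / n) for i in range(n))
        / math.sqrt(n)
        for F in range(n)
    ]

def low_freq_energy_fraction(x, f):
    """Fraction of the total energy held by the first f DFT coefficients."""
    X = dft(x)
    total = sum(abs(c) ** 2 for c in X)
    return sum(abs(c) ** 2 for c in X[:f]) / total

random.seed(0)                      # arbitrary seed, fixed for repeatability
n = 256
noise = [random.gauss(0.0, 1.0) for _ in range(n)]   # white noise z_i

walk = []                           # random walk of Eq. (15.13)
level = 0.0
for z in noise:
    level += z                      # x_i = x_{i-1} + z_i
    walk.append(level)

noise_frac = low_freq_energy_fraction(noise, 4)
walk_frac = low_freq_energy_fraction(walk, 4)
print(f"white noise: {noise_frac:.3f}, random walk: {walk_frac:.3f}")
```

Under these assumptions the random walk is expected to place a much larger share of its energy in the first few coefficients than the white noise does, which is the property that keeps the number of false alarms low for an F-index built on DFT features.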

Date posted: 26/01/2014
