Image Databases: Search and Retrieval of Digital Imagery
Edited by Vittorio Castelli, Lawrence D. Bergman
Copyright © 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-32116-8 (Hardback); 0-471-22463-4 (Electronic)
10 Introduction to Content-Based
Image Retrieval — Overview of
Key Techniques
YING LI and C.-C. JAY KUO
University of Southern California, Los Angeles, California
X. WAN
U.S. Research Lab, San Jose, California
10.1 INTRODUCTION
Advances in modern computer and telecommunication technologies have led to
huge archives of multimedia data in diverse application areas such as medicine,
remote sensing, entertainment, education, and on-line information services. This
is similar to the rapid increase in the amount of alphanumeric data during the
early days of computing, which led to the development of database management
systems (DBMS). Traditional DBMSs were designed to organize alphanumeric
data into interrelated collections so that information retrieval and storage could
be done conveniently and efficiently. However, this technology is not well suited
to the management of multimedia information. The diversity of data types and
formats, the large size of media objects, and the difficulties in automatically
extracting semantic meanings from data are entirely foreign to traditional database
management techniques. To use this widely available multimedia information
effectively, efficient methods for storage, browsing, indexing, and retrieval [1,2]
must be developed. Different multimedia data types may require specific indexing
and retrieval tools and methodologies. In this chapter, we present an overview
of indexing and retrieval methods for image data.
Since the 1970s, image retrieval has been a very active research area within
two major research communities — database management and computer vision.
These research communities study image retrieval from two different angles. The
first is primarily text-based, whereas the second relies on visual properties of the
data [3].
Text-based image retrieval can be traced back to the late 1970s. At that
time, images were annotated with key words, which were stored as retrieval keys in
traditional databases. Some relevant research in this field can be found in
Refs. [4,5]. Two problems render manual annotation ineffective when the size
of image databases becomes very large. The first is the prohibitive amount
of labor involved in image annotation. The other, probably more essential,
results from the difficulty of capturing the rich content of images using a small
number of key words, a difficulty which is compounded by the subjectivity
of human perception.
In the early 1990s, because of the emergence of large-scale image collec-
tions, content-based image retrieval (CBIR) was proposed as a way to overcome
these difficulties. In CBIR, images are automatically indexed by summarizing
their visual contents through automatically extracted quantities or features such
as color, texture, or shape. Thus, low-level numerical features, extracted by a
computer, are substituted for higher-level, text-based manual annotations or key
words. Since the inception of CBIR, many techniques have been developed along
this direction, and many retrieval systems, both research and commercial, have
been built [3].
Note that ideally CBIR systems should automatically extract (and index) the
semantic content of images to meet the requirements of specific application areas.
Although it seems effortless for a human being to pick out photos of horses
from a collection of pictures, automatic object recognition and classification are
still among the most difficult problems in image understanding and computer
vision. This is the main reason why low-level features such as colors [6–8],
textures [9–11], and shapes of objects [12,13] are widely used for content-based
image retrieval. However, in specific applications, such as medical or petroleum
imaging, low-level features play a substantial role in defining the content of
the data.
A typical content-based image retrieval system is depicted in Figure 10.1 [3].
The image collection database contains raw images for the purpose of visual
display. The visual feature repository stores visual features extracted from images
needed to support content-based image retrieval. The text annotation reposi-
tory contains key words and free-text descriptions of images. Multidimensional
indexing is used to achieve fast retrieval and to make the system scalable to large
image collections.
The retrieval engine includes a query interface and a query-processing unit.
The query interface, typically employing graphical displays and direct manipu-
lation techniques, collects information from users and displays retrieval results.
The query-processing unit is used to translate user queries into an internal form,
which is then submitted to the DBMS. Moreover, in order to bridge the gap
between low-level visual features and high-level semantic meanings, users are
usually allowed to communicate with the search engine in an interactive way.
We will address each part of this structure in more detail in later sections.
This chapter is organized as follows. In Section 10.2, the extraction and
integration of some commonly used features, such as color, texture, shape,
[Figure 10.1. An image retrieval system architecture. Components shown: image collection, feature extraction, visual features, text annotation, multidimensional indexing, and a retrieval engine (query interface and query-processing unit) interacting with the user.]
object spatial relationships, and so on, are briefly discussed. Some feature indexing
techniques are reviewed in Section 10.4. Section 10.5 presents the key concepts of
interactive content-based image retrieval and briefly discusses the main components
of a CBIR system. Section 10.6 introduces a new work item of the ISO/MPEG family,
the "Multimedia Content Description Interface," or MPEG-7 for short, which defines
a standard for describing multimedia content features and descriptors. Finally,
concluding remarks are drawn in Section 10.7.
10.2 FEATURE EXTRACTION AND INTEGRATION
Feature extraction is the basis of CBIR. Features can be categorized as general or
domain-specific. General features typically include color, texture, shape, sketch,
spatial relationships, and deformation, whereas domain-specific features are appli-
cable in specialized domains such as human face recognition or fingerprint
recognition.
Each feature may have several representations. For example, color histogram
and color moments are both representations of the image color feature. Moreover,
numerous variations of the color histogram itself have been proposed, each of
which differs in the selected color-quantization scheme.
10.2.1 Feature Extraction
10.2.1.1 Color. Color is one of the most recognizable elements of image
content [14] and is widely used in image retrieval because of its invariance with
respect to image scaling, translation, and rotation. The key issues in color feature
extraction include the color space, color quantization, and the choice of similarity
function.
Color Spaces. The commonly used color spaces include RGB, YCbCr, HSV,
CIELAB, CIELUV, and Munsell spaces. The CIELAB and CIELUV color
spaces usually give a better performance because of their improved perceptual
uniformity with respect to RGB [15]. MPEG-7 XM V2 supports RGB, YCbCr,
HSV color spaces, and some linear transformation matrices with reference to
RGB [16].
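
As a minimal illustration of moving from RGB to a more perceptually oriented space, the following Python sketch converts one RGB pixel to HSV using the standard-library colorsys module; the choice of HSV (rather than, say, CIELUV) and the per-pixel formulation are assumptions made purely for illustration.

    # Convert one 8-bit RGB pixel to HSV before histogramming.
    import colorsys

    def rgb_to_hsv_pixel(r, g, b):
        # colorsys expects channel values in [0, 1]; h, s, v are also returned in [0, 1]
        return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)

    print(rgb_to_hsv_pixel(200, 30, 30))   # a saturated red; hue close to 0
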
Color Quantization. Color quantization is used to reduce the color resolution of
an image. Using a quantized color map can considerably decrease the computa-
tional complexity during image retrieval. The commonly used color-quantization
schemes include uniform quantization, vector quantization, tree-structured vector
quantization, and product quantization [17–19]. In MPEG-7 XM V2 [16], three
quantization types are supported: linear, nonlinear, and lookup table.
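
The sketch below illustrates uniform quantization for an 8-bit RGB image stored as a NumPy array; the choice of 8 bins per channel (512 colors in total) is an illustrative assumption, not a value prescribed by MPEG-7 XM.

    import numpy as np

    def uniform_quantize(image, bins_per_channel=(8, 8, 8)):
        """image: H x W x 3 uint8 array; returns an H x W array of color indices."""
        br, bg, bb = bins_per_channel
        r = (image[..., 0].astype(np.int32) * br) // 256
        g = (image[..., 1].astype(np.int32) * bg) // 256
        b = (image[..., 2].astype(np.int32) * bb) // 256
        return (r * bg + g) * bb + b          # single color index in [0, br * bg * bb)

    img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
    assert uniform_quantize(img).max() < 8 * 8 * 8
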
10.3 SIMILARITY FUNCTIONS
A similarity function maps a pair of feature vectors to a positive real-valued
number chosen to be representative of the visual similarity
between two images. Let us take the color histogram as an example. There
are two main approaches to histogram formation. The first one is based on the
global color distribution across the entire image, whereas the second one consists
of computing the local color distribution for a certain partition of the image.
These two techniques are suitable for different types of queries. If users are
concerned only with the overall colors and their amounts, regardless of their
spatial arrangement in the image, then indexing using the global color distribution
is useful. However, if users also want to take into consideration the positional
arrangement of colors, the local color histogram will be a better choice.
A global color histogram represents an image I by an N-dimensional vector,
H(I) = [H(I, j), j = 1, 2, ..., N], where N is the number of quantization colors
and H(I, j) is the number of pixels having color j . The similarity of two images
can be easily computed on the basis of this representation. The four common types
of similarity measurements are the L1 norm [20], the L2 norm [21], the color
histogram intersection [7], and the weighted distance metric [22]. The L1 norm
has the lowest computational complexity. However, it was shown in Ref. [23]
that it could produce false negatives (not all similar images are retrieved). The
L2 norm (i.e., the Euclidean distance) is probably the most widely used metric.
However, it can result in false positives (dissimilar images are retrieved). The
color histogram intersection proposed by Swain and Ballard [7] has been adopted
by many researchers because of its simplicity and effectiveness. The weighted
distance metric proposed by Hafner and coworkers [22] takes into account the
perceptual similarity between two colors, thus making retrieval results consistent
with human visual perception. Other weighting matrices for similarity measures
can be found in Ref. [24]. See Chapter 14 for a more detailed description of
these metrics.
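
To make the preceding definitions concrete, the sketch below builds the normalized global histogram H(I) from a map of quantized color indices (such as the one produced by the quantizer sketched earlier) and computes the L1 distance, the L2 distance, and the histogram intersection; normalizing each histogram to unit sum is an illustrative convention.

    import numpy as np

    def global_histogram(color_indices, n_colors=512):
        h = np.bincount(color_indices.ravel(), minlength=n_colors).astype(float)
        return h / h.sum()                      # normalize so images of different sizes are comparable

    def l1_distance(h1, h2):
        return np.abs(h1 - h2).sum()

    def l2_distance(h1, h2):
        return np.sqrt(((h1 - h2) ** 2).sum())  # Euclidean distance

    def histogram_intersection(h1, h2):
        return np.minimum(h1, h2).sum()         # larger means more similar; 1.0 for identical histograms
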
Local color histograms are used to retrieve images on the basis of their
color similarity in local spatial regions. One natural approach is to partition
the whole image into several regions and then extract color features from each of
them [25,26]. In this case, the similarity of two images will be determined by the
similarity of the corresponding regions. Of course, the two images should have
the same number of partitions, each of the same size. If they happen to have different
aspect ratios, then normalization will be required.
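
A minimal sketch of the local variant follows: the quantized index map is split into a fixed grid of cells, one histogram is computed per cell, and two images are compared cell by cell; the 4 x 4 grid and the averaging of per-cell L1 distances are illustrative assumptions.

    import numpy as np

    def local_histograms(color_indices, grid=(4, 4), n_colors=512):
        """Return one normalized histogram per grid cell, scanned row by row."""
        H, W = color_indices.shape
        gh, gw = grid
        hists = []
        for i in range(gh):
            for j in range(gw):
                cell = color_indices[i * H // gh:(i + 1) * H // gh,
                                     j * W // gw:(j + 1) * W // gw]
                h = np.bincount(cell.ravel(), minlength=n_colors).astype(float)
                hists.append(h / max(h.sum(), 1.0))
        return hists

    def local_distance(hists1, hists2):
        # average L1 distance over corresponding cells of the two images
        return np.mean([np.abs(a - b).sum() for a, b in zip(hists1, hists2)])
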
10.3.1 Some Color Descriptors
A compact color descriptor, called a binary representation of the image histogram,
was proposed in Ref. [27]. With this approach, each region is represented by a
binary signature, which is a binary sequence generated by a two-level quantiza-
tion of wavelet coefficients obtained by applying the two-dimensional (2D) Haar
transform to the 2D color histogram. In Ref. [28], a scalable blob histogram was
proposed, where the term blob denotes a group of pixels with homogeneous
color. One advantage of this descriptor is that images containing objects with
different sizes and shapes can be easily distinguished without color segmenta-
tion. A region-based image retrieval approach was presented in Ref. [29]. The
main idea of this work is to adaptively segment the whole image into sets of
regions according to the local color distribution [30] and then compute the simi-
larity on the basis of each region’s dominant colors, which are extracted by
applying color quantization.
Some other commonly used color feature representations in image retrieval
include color moments and color sets. For example, in Ref. [31], Stricker and
Dimai extracted the first three color moments from five partially overlapping
fuzzy regions. In Ref. [32], Stricker and Orengo proposed to use color moments
to overcome undesirable quantization effects. To speed up the retrieval process in
a very large image database, Smith and Chang approximated the color histogram
with a selection of colors (color sets) from a prequantized color space [33,34].
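
As a rough sketch of the color-moment idea, the code below computes the first three moments (mean, standard deviation, and skewness) of each channel over the whole image; computing them globally rather than over Stricker and Dimai's five fuzzy regions is a simplification made here for brevity.

    import numpy as np

    def color_moments(image):
        """image: H x W x 3 array; returns 9 values (3 moments for each of 3 channels)."""
        feats = []
        for c in range(3):
            ch = image[..., c].astype(float).ravel()
            mean = ch.mean()
            std = ch.std()
            skew = np.cbrt(((ch - mean) ** 3).mean())   # signed cube root of the third central moment
            feats.extend([mean, std, skew])
        return np.array(feats)
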
10.3.2 Texture
Texture refers to visual patterns with properties of homogeneity that do not
result from the presence of only a single color or intensity [35]. Tree barks,
clouds, water, bricks, and fabrics are examples of texture. Typical textural
features include contrast, uniformity, coarseness, roughness, frequency, density,
and directionality. Texture features usually contain important information about
the structural arrangement of surfaces and their relationship to the surrounding
environment [36]. To date, a large amount of research in texture analysis has been
done as a result of the usefulness and effectiveness of this feature in application
areas such as pattern recognition, computer vision, and image retrieval.
There are two basic classes of texture descriptors: statistical model-based and
transform-based. The first approach explores the gray-level spatial dependence
of textures and then extracts meaningful statistics as texture representation. In
Ref. [36], Haralick and coworkers proposed the co-occurrence matrix represen-
tation of texture features, in which they explored the gray-level spatial depen-
dence of texture. They also studied the line-angle-ratio statistics by analyzing
the spatial relationships of lines and the properties of their surroundings. Inter-
estingly, Tamura and coworkers addressed this topic from a totally different
viewpoint [37]. They showed, on the basis of psychological measurements, that
six basic textural features were coarseness, contrast, directionality, line-likeness,
regularity, and roughness. This approach selects numerical features that corre-
spond to characteristics of the human visual system, rather than to statistical
measures of the data and, therefore, seems well suited to the retrieval of natural
images. Two well-known CBIR systems (the QBIC system [38] and the MARS
system [39,40]) adopted Tamura’s texture representation and made some further
improvements. Liu and Picard [10] and Niblack and coworkers [11,41] used
a subset of the above-mentioned six features, namely contrast, coarseness, and
directionality, to achieve texture classification and recognition.
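
The sketch below gives the flavor of the statistical approach: it accumulates a gray-level co-occurrence matrix for horizontally adjacent pixels and derives two simple statistics (contrast and energy); the 16-level quantization, the single displacement, and the particular statistics are illustrative assumptions rather than the exact formulation of Ref. [36].

    import numpy as np

    def cooccurrence_matrix(gray, levels=16):
        """gray: 2D uint8 array; co-occurrence of horizontally adjacent pixel pairs."""
        q = (gray.astype(np.int32) * levels) // 256          # quantize to `levels` gray levels
        left, right = q[:, :-1].ravel(), q[:, 1:].ravel()
        glcm = np.zeros((levels, levels), dtype=float)
        np.add.at(glcm, (left, right), 1.0)                  # count each (left, right) pair
        return glcm / glcm.sum()                             # joint probability matrix

    def contrast(glcm):
        i, j = np.indices(glcm.shape)
        return ((i - j) ** 2 * glcm).sum()

    def energy(glcm):
        return (glcm ** 2).sum()
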
A human texture perception study, conducted by Rao and Lohse [42], indi-
cated that the three most important orthogonal dimensions are “repetitiveness,”
“directionality,” and “granularity and complexity.”
Some commonly used transforms for transform-based texture extraction are the
discrete cosine transform (DCT), the Fourier-Mellin transform, the polar Fourier
transform, and the Gabor and wavelet transforms. Alata and
coworkers [43] proposed classifying rotated and scaled textures by using the
combination of a Fourier-Mellin transform and a parametric 2D spectrum esti-
mation method called harmonic mean horizontal vertical (HMHV). Wan and
Kuo [44] extracted texture features in the Joint Photographic Experts Group
(JPEG) compressed domain by analyzing the AC coefficients of the DCT.
The Gabor filters proposed by Manjunath and Ma [45] offer texture descriptors
with a set of “optimum joint bandwidth.” A tree-structured wavelet transform
presented by Chang and Kuo [46] provides a natural and effective way to describe
textures that have dominant middle- or high-frequency subbands. In Ref. [47],
Nevel developed a texture feature-extraction method by matching the first- and
second-order statistics of wavelet subbands.
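
As a rough illustration of the transform-based approach, the sketch below summarizes each wavelet detail subband by its mean absolute value and standard deviation, in the spirit of matching first- and second-order subband statistics; it assumes the third-party PyWavelets package (pywt), and the Haar wavelet with three decomposition levels is an arbitrary choice.

    import numpy as np
    import pywt   # PyWavelets, a third-party package assumed to be installed

    def wavelet_texture_features(gray, wavelet="haar", levels=3):
        """Return per-subband statistics of a 2D image as a texture feature vector."""
        coeffs = pywt.wavedec2(gray.astype(float), wavelet, level=levels)
        feats = []
        for detail in coeffs[1:]:             # skip the coarsest approximation band
            for band in detail:               # horizontal, vertical, and diagonal details
                feats.extend([np.abs(band).mean(), band.std()])
        return np.array(feats)
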
10.3.3 Shape
Two major steps are involved in shape feature extraction: object segmentation
and shape representation.
10.3.3.1 Object Segmentation. Image retrieval based on object shape is
considered to be one of the most difficult aspects of content-based image
retrieval because of difficulties in low-level image segmentation and the variety
of ways a given three-dimensional (3D) object can be projected into 2D
shapes. Several segmentation techniques have been proposed so far and include
the global threshold-based technique [21], the region-growing technique [48], the
split-and-merge technique [49], the edge-detection-based technique [41,50], the
texture-based technique [51], the color-based technique [52], and the model-
based technique [53]. Generally speaking, it is difficult to do a precise
segmentation owing to the complexity of the individual object shape, the
existence of shadows, noise, and so on.
10.3.3.2 Shape Representation. Once objects are segmented, their shape
features can be represented and indexed. In general, shape representations can be
classified into three categories [54]:
• Boundary-Based Representations (Based on the Outer Boundary of the
Shape). The commonly used descriptors of this class include the chain
code [55], the Fourier descriptor [55], and the UNL descriptor [56].
• Region-Based Representations (Based on the Entire Shape Region).
Descriptors of this class include moment invariants [57], Zernike
moments [55], the morphological descriptor [58], and pseudo-Zernike
moments [56].
• Combined Representations. We may consider the integration of several
basic representations such as moment invariants with the Fourier descriptor
or moment invariants with the UNL descriptor.
The Fourier descriptor is extracted by applying the Fourier transform to the
parameterized 1D boundary. Because digitization noise can significantly affect
this technique, robust approaches have been developed such as the one described
in Ref. [54], which is also invariant to geometric transformations. Region-based
moments are invariant with respect to affine transformations of images. Details
can be found in Refs. [57,59,60]. Recent work in shape representation includes the
finite element method (FEM) [61], the turning function developed by Arkin and
coworkers [62], and the wavelet descriptor developed by Chuang and Kuo [63].
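
A minimal sketch of a Fourier descriptor is given below, assuming the object boundary is already available as an ordered sequence of (x, y) points; zeroing the DC term for translation invariance, taking magnitudes for rotation invariance, and dividing by the first harmonic for scale invariance are common conventions, not necessarily the exact variant of Ref. [54].

    import numpy as np

    def fourier_descriptor(boundary, n_coeffs=16):
        """boundary: (M, 2) array of ordered boundary points, with M > n_coeffs."""
        z = boundary[:, 0] + 1j * boundary[:, 1]   # treat each (x, y) point as a complex sample
        Z = np.fft.fft(z)
        Z[0] = 0.0                                 # remove the centroid (translation invariance)
        mag = np.abs(Z)                            # discard phase (rotation and starting-point invariance)
        mag = mag / (mag[1] + 1e-12)               # normalize by the first harmonic (scale invariance)
        return mag[1:n_coeffs + 1]
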
Chamfer matching is the most popular shape-matching technique. It was first
proposed by Barrow and coworkers [64] for comparing two collections of shape
fragments and was then further improved by Borgefors in Ref. [65].
Besides the aforementioned work in 2D shape representation, some research
has focused on 3D shape representations. For example, Borgefors and coworkers
[66] used binary pyramids in 3D space to improve the shape and the topology
preservation in lower-resolution representations. Wallace and Mitchell [67]
presented a hybrid structural or statistical local shape analysis algorithm for 3D
shape representation.
10.3.4 Spatial Relationships
There are two classes of spatial relationships. The first class, containing topolog-
ical relationships, captures the relations between element boundaries. The second
class, containing orientation or directional relationships, captures the relative posi-
tions of elements with respect to each other. Examples of topological relationships
are “near to,” “within,” or “adjacent to.” Examples of directional relationships are
“in front of,” “on the left of,” and “on top of.” A well-known method to describe
spatial relationship is the attributed-relational graph (ARG) [68] in which objects
are represented by nodes, and an arc between two nodes represents a relationship
between them.
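
A small sketch of an attributed relational graph as a plain data structure is shown below; the object names, attributes, and relationship labels are invented for illustration.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class ARG:
        nodes: Dict[str, dict] = field(default_factory=dict)              # object id -> attributes
        arcs: List[Tuple[str, str, str]] = field(default_factory=list)    # (object, object, relationship)

        def add_object(self, obj_id, **attributes):
            self.nodes[obj_id] = attributes

        def relate(self, a, b, relationship):
            self.arcs.append((a, b, relationship))

    g = ARG()
    g.add_object("sun", color="yellow", shape="circle")
    g.add_object("tree", color="green", shape="blob")
    g.relate("sun", "tree", "above")
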
So far, spatial-based modeling has been widely addressed, mostly in the liter-
ature on spatial reasoning, for application areas such as geographic information
systems [69,70]. We can distinguish two main categories that are called qualita-
tive and quantitative spatial modeling, respectively.
A typical application of the qualitative spatial model to image databases,
based on symbolic projection theory, was proposed by Chang [71]; it allows a
bidimensional arrangement of a set of objects to be encoded into a sequential
structure called a 2D string. Because the 2D string structure reduces the matching
complexity from a quadratic function to a linear one, the approach has been
adopted in several other works [72,73].
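
The sketch below conveys the basic idea of a 2D string by projecting object centroids onto the x and y axes and recording their order along each axis; using only the "precedes" relation is a simplification of the full symbolic-projection formalism.

    def two_d_string(objects):
        """objects: dict mapping an object name to its (x, y) centroid."""
        by_x = sorted(objects, key=lambda name: objects[name][0])   # left-to-right order
        by_y = sorted(objects, key=lambda name: objects[name][1])   # top-to-bottom order
        return " < ".join(by_x), " < ".join(by_y)

    print(two_d_string({"sun": (120, 10), "tree": (40, 80), "house": (90, 85)}))
    # ('tree < house < sun', 'sun < tree < house')
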
Compared to qualitative modeling, quantitative spatial modeling can provide
a more continuous relationship between perceived spatial arrangements and their
representations by using numeric quantities as classification thresholds [74,75].
Lee and Hsu [74] proposed a quantitative modeling technique that enables the
comparison of the mutual position of a pair of extended regions. In this approach,
the spatial relationship between an observer and an observed object is represented
by a finite set of equivalence classes based on the dense sets of possible paths
leading from any pixel of one object to that of the other.
10.3.5 Features of Nonphotographic Images
The discussion in the previous section focused on features for indexing and
retrieving natural images. Nonphotographic images such as medical and satellite
images can be retrieved more effectively using special-purpose features, owing
to their special content and their complex and variable characteristics.
10.3.5.1 Medical Images. Medical images include diagnostic X-ray images,
ultrasound images, computer-aided tomographical images, magnetic resonance
images, and nuclear medicine images. Typical medical images contain many
complex, irregular objects. These exhibit a great deal of variability due to
differences in modality, equipment, procedure, and patient [76]. This variability
poses a big challenge to efficient image indexing and retrieval.
Features suitable for medical images can be categorized into two basic classes:
text-based and content-based.
Text-Based Features. Because of the uniqueness of each medical image (for
example, the unique relationship between a patient and an X-ray image of his
or her lungs at a particular time), text-based features are widely used in some
medical image retrieval systems. This information usually includes the institution
name, the institution patient identifier, patient’s name and birth date, patient study
identifiers, modality, date, and time [76].
Usually, these features are incorporated into labels, which are digitally or
physically affixed to the images and then used as the primary indexing key in
medical imaging libraries.
Content-Based Features. Two commonly used content-based features are shape
and object spatial relationship, which are very useful in helping physicians locate
images containing the objects of their interest. In Ref. [76], Cabral and coworkers
proposed a new feature called anatomic labels. This descriptor is associated
with the anatomy and pathology present in the image and provides a means for
assigning Unified Medical Language System (UMLS) labels to images or specific
locations within images.
10.3.5.2 Satellite Images. Recent advances in sensor and communication tech-
nologies have made it practical to launch an increasing number of space platforms
for a variety of Earth science studies. The large volume of data generated by the
instruments on the platforms has posed significant challenges for data transmis-
sion, storage, retrieval, and dissemination. Efficient image storage, indexing, and
retrieval systems are required to make this vast quantity of data useful.
The research community has devoted a significant amount of effort to this
area [77–80]. In CBIR systems for satellite imagery, different image features
are extracted, depending on the type of satellite images and research purposes.
For example, in a system used for analyzing aurora image data [79], the
authors extract two types of features: global features, including the aurora area,
the magnetic flux, the total intensity, and the variation of intensity; and radial
features, measured along a radial line from geomagnetic north, such as the average
width and the variation of width. In Ref. [77], shape and spatial relationship features
are extracted from a National Oceanic and Atmospheric Administration (NOAA)
satellite image database. In a database system for Earth-observing satellite
images [80], Li and Chen proposed an algorithm to progressively
extract and compare different texture features, such as the fractal dimension,
coarseness, entropy, circular Moran autocorrelation functions, and spatial gray-
level difference (SGLD) statistics, between an image and a target template.
In Ref. [78], Barros and coworkers explored techniques for the exploitation of
spectral distribution information in a satellite image database.
10.3.6 Some Additional Features
Some additional features that have been used in the image retrieval process are
discussed below.
10.3.6.1 Angular Spectrum. Visual properties of an image are mainly related
to the largest objects it contains. In describing an object, shape, texture, and
orientation play a major role. In many cases, because shape can also be defined
in terms of presence and distribution of oriented subcomponents, the orientation
of objects within an image becomes a key attribute in the definition of similarity to
other images. On the basis of this assumption, Lecce and Celentano [81] defined
a metric for image classification in the 2D space that is quantified by signatures
composed of angular spectra of image components. In Ref. [82], an image's
Fourier transform was analyzed to find the directional distribution of lines.
10.3.6.2 Edge Directionality. Edge directionality is another commonly used
feature. In Ref. [82], Lecce and Celentano detected edges within an image by
using the Canny algorithm [83] and then applied the Hough transform [84], which
transforms a line in Cartesian coordinate space to a point in polar coordinate
space, to each edge point. The results were then analyzed in order to detect main
directions of edges in each image.
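
A rough sketch of this measurement using OpenCV's Canny and Hough implementations follows; the Canny thresholds, the Hough accumulator threshold, and the 18-bin orientation histogram are illustrative assumptions.

    import numpy as np
    import cv2   # OpenCV (assumed available)

    def edge_direction_histogram(gray, n_bins=18):
        """gray: 2D uint8 image; returns a normalized histogram of line orientations in [0, pi)."""
        edges = cv2.Canny(gray, 100, 200)                          # binary edge map
        lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=80)
        hist = np.zeros(n_bins)
        if lines is not None:
            for rho, theta in lines[:, 0]:                         # theta is the line angle in [0, pi)
                hist[int(theta / np.pi * n_bins) % n_bins] += 1
        return hist / max(hist.sum(), 1.0)
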
10.3.7 Feature Integration
Experience shows that the use of a single class of descriptors to index an image
database does not generally produce results that are adequate for real applications
and that retrieval results are often unsatisfactory even for a research prototype.
A strategy to potentially improve image retrieval, both in terms of speed and
quality of results, is to combine multiple heterogeneous features.
We can categorize feature integration as either sequential or parallel. Sequential
feature integration, also called feature filtering, is a multistage process in which
different features are sequentially used to prune a candidate image set. In the
parallel feature-integration approach, several features are used concurrently in the
retrieval process. In the latter case, different weights need to be assigned appropri-
ately to different features, because different features have different discriminating
powers, depending on the application and specific task. The feature-integration
approach appears to be superior to using individual features and, as a conse-
quence, is implemented in most current CBIR systems. The original Query by
Image Content (QBIC) system [85] allowed the user to select the relative impor-
tance of color, texture, and shape. Smith and Chang [86] proposed a spatial
and feature (SaFe) system to integrate content-based features with spatial query
methods, thus allowing users to specify a query in terms of a set of regions with
desired characteristics and simple spatial relations.
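
A minimal sketch of the parallel approach is given below: per-feature distances are normalized to a common range and combined with weights that express their relative importance; the squashing function and the example weights are illustrative assumptions rather than the scheme of any particular system.

    import numpy as np

    def combined_distance(query_feats, db_feats, weights):
        """query_feats, db_feats: dict feature name -> vector; weights: dict feature name -> float."""
        total = 0.0
        for name, w in weights.items():
            d = np.linalg.norm(query_feats[name] - db_feats[name])   # per-feature Euclidean distance
            total += w * d / (1.0 + d)                               # squash to [0, 1) before weighting
        return total

    weights = {"color": 0.5, "texture": 0.3, "shape": 0.2}   # relative importance, e.g., chosen by the user
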
Srihari [20] developed a system for identifying human faces in newspaper
photographs by integrating visual features extracted from images with texts
obtained from the associated descriptive captions. A similar system based on
textual and image content information was also described in Ref. [87]. Extensive
experiments show that the use of only one kind of information cannot produce
satisfactory results. In the newest version of the QBIC system [85], text-based
key word search is integrated with content-based similarity search, which leads