Image Databases: Search and Retrieval of Digital Imagery
Edited by Vittorio Castelli, Lawrence D. Bergman
Copyright © 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-32116-8 (Hardback); 0-471-22463-4 (Electronic)
13 Shape Representation for Image Retrieval
BENJAMIN B. KIMIA
Brown University, Providence, Rhode Island
13.1 INTRODUCTION
The human–machine interface is evolving at an incredible pace, surpassing the
traditional text-based boundaries. A driving force motivating this development
is the need to endow computers with capabilities that parallel our perceptual
abilities. Vision is arguably our most significant sense, giving rise to efforts to
empower computers to represent, process, understand, and act on visual imagery.
As a result, images are being generated at a mind-boggling pace from a variety
of sources. Terabytes of data are being generated in the form of aerial imagery,
surveillance images, mug shots, fingerprints, trademarks and logos, graphic illus-
trations, engineering line drawings, documents, manuals, medical images, images
from sports events, documentation of environmental resources in the form of
images, and entertainment industry photos and videos [1–7]. Clearly, the manage-
ment of such databases must rely on the perceptual and cognitive dimensions of
the visual space, namely, color, texture, shape, and so on. The basic premise is
that there exist qualitative aspects of images that can be used to retrieve images
without fully specifying them.
The use of shape as a cue is less developed than the use of color or texture,
mainly because of the inherent complexity of representing it. Yet, retrieval-by-
shape has the potential of being the most effective search technique in many
application fields. This chapter reviews and discusses the representation of shape
as a cue for indexing image databases. The central question is how complete or
partial information regarding a shape in an image can be represented so that it
can be easily extracted, matched, and retrieved. Specifically, five key items must
be addressed:
Image and Query Preparation. How are shapes extracted from images? The
segregation of figure from ground is rather straightforward in images that
are binary or have a bi-level histogram, but usually difficult otherwise. As
a consequence, a wide spectrum of shape-extraction techniques has been
developed, ranging from segmenting the image to extracting related lower-level
features, such as edges, that yield a partial representation of shape.
Query formulation and shape extraction are therefore inherently related.
The query-specification mechanism provided by the user interface (sketch
drawing, query-by-example, query-by-keyword, spatial-layout specification,
and so on) must closely match the shape extraction process, and, in
particular, emphasize the specific representation of shape used during the
search.
Shape Representation. How is shape represented? Is there “invariance” to
a class of transformations? Is the representation contour-based or region-
based? Is it based on local features or global attributes? Do parts play a role?
Is the spatial relationship among parts or features represented explicitly? Is
the representation multiscale?
Shape Similarity and Matching. How are the query and database items
matched? Is the matching based on geometric hashing, graph matching,
energy minimization, probabilistic formulation, and so on? How is the
similarity between two objects represented?
Indexing and Retrieval. How is the database organized? Are prototypes or
categories used? Do models guide the retrieval process?
Validation. How well does each approach perform in terms of accuracy and
precision? How efficient is the retrieval?
This chapter focuses on the second question, namely, the issue of shape
representation, although this necessarily requires a discussion of the remaining
items. We will begin by citing some examples in which indexing by shape content
is used, followed by a discussion on how the database of shapes and the related
image queries are prepared. Next, we will discuss the main issues pertaining
to shape representation. This is followed by a brief discussion of matching and
shape similarity as it pertains to the nature of the underlying representation.
13.2 WHERE IS INDEXING BY SHAPE RELEVANT?
Although it is inherently difficult to characterize and manipulate, shape is a
significant cue for describing objects. Despite the difficulty of capturing a
computational notion of shape, an increasing number of applications have used it
as a primary cue for indexing (Fig. 13.1). A few of these are now briefly
reviewed.
Trademarks and logos are often distinguished by their specific shapes. Patent
application offices must avoid duplication partly by checking the similarity
in shape with previously used forms. ARTISAN is an example of a
system that uses shape to retrieve trademarks [8–10]. Numerous shape-
representation techniques (described in Section 13.4) have been applied to
trademark and logo retrieval, including geometric invariant signatures [11],
string matching of the contour chain code [12], and combinations of
moment invariants and Fourier descriptors [13,14].
Figure 13.1. Examples of shapes for indexing into a database: trademarks and logos,
medical structures, drawings, fingerprints, face profiles, and signatures.
In the medical domain, shape is used as a cue to describe the similarity
of medical scans. Applications include detecting emphysema in high-
resolution CT scans of the lung [15,16], classifying deformations arising
from pathological changes as evident in dental radiographs (e.g., for
periapical disease), and retrieving tumors [17]. Several image query systems
supporting retrieval-by-shape have been developed [18,19].
Shape also plays a key role in the management of document databases. Sample
applications include the retrieval of architectural drawings, computer-
generated technical drawings [20], character bitmaps (e.g., Chinese
characters) [21], technical drawing of machine parts (e.g., aircraft parts),
clipart, and graphics.
Law-enforcement and security is another application area for retrieval of
images by shape. Fingerprint matching [22] is used in automatic personal
identification for criminal identification by law-enforcement agencies,
access control to restricted facilities, credit card user identification, and
other applications. The size of a fingerprint database is often very large, on
the order of hundreds of millions of fingerprint codes, and requires indexing
into terabytes of data.
Earth Science applications of retrieval-by-shape include indexing databases of
auroras [23].
Other applications include art and art history [24], electronic shopping, multi-
media systems for museums and archaeology, defense, entertainment, and so on.
13.3 IMAGE PREPARATION AND QUERY FORMULATION
The questions of how images must be prepared prior to storage in a database and
how queries can be formulated are both intimately connected with how shape is
used as an indexing mechanism.
In principle, a complete representation of a two-dimensional shape is
provided by its contour. The contour is a continuous curve in the plane,
and can be specified by a large number of points. Clearly, such a voluminous
representation of shape cannot be effectively used for similarity retrieval, and
partial representations capturing its salient aspects are used in practice. These
partial representations range from very simple (for example, a shape can be
approximated by an ellipse and represented just by its elongation) to very complex
(for example, the contour could be approximated by a piecewise polynomial
representation). The specific application imposes requirements on the richness of
the representation.
When a complete description of shape is used in the indexing scheme,
the image must be segmented and entire shapes must be stored. This process
is quite straightforward when images contain binary or nearly binary shapes,
such as trademarks, logos, bitmaps of characters, signatures, clip art, designs,
drawings, graphics, and so on. In general, however, the task of figure-ground
segregation is formidable, as is evident from the relatively large “segmentation”
literature in computer vision and image processing. Nevertheless, in certain
domains automatic segmentation has been used. For example, Gunsel and
Tekalp [25] address the segmentation, or figure-background separation, problem
by a combination of methods. A color histogram intersection method [26] is
used to eliminate database objects with significantly different color from the
query object. Boundaries are estimated using either the Canny edge detector [27]
or the graduated nonconvexity (GNC) algorithm [28,29].
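The color-based filtering step can be illustrated with a minimal sketch of histogram intersection. The bin layout, the normalization by the query histogram, the `prefilter` helper, and the threshold value are assumptions made for illustration; they are not details taken from Refs. [25,26].

```python
def histogram_intersection(query_hist, model_hist):
    """Return the normalized intersection of two histograms with equal bin counts.

    The score is the sum of bin-wise minima divided by the total mass of the
    query histogram, so it lies in [0, 1] for non-negative histograms.
    """
    if len(query_hist) != len(model_hist):
        raise ValueError("histograms must have the same number of bins")
    total = sum(query_hist)
    if total == 0:
        return 0.0
    overlap = sum(min(q, m) for q, m in zip(query_hist, model_hist))
    return overlap / total


def prefilter(query_hist, database, threshold=0.5):
    """Keep only database entries whose color histogram is close to the query.

    `database` maps a hypothetical object name to its histogram, and
    `threshold` is an assumed cutoff.
    """
    return [name for name, hist in database.items()
            if histogram_intersection(query_hist, hist) >= threshold]
```

Objects that survive such a filter would then be passed on to the more expensive boundary-estimation stage.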
As a result of the difficulty of figure-ground segregation, partial representations
are often used when application requirements permit. The most common methods
rely on edge content, which is indicative of shape boundary. A brief historical
sequence that samples these methods is presented here.
Hirata and Kato [30,31] performed a pixel-by-pixel edge-content comparison
of a query and shifted image blocks and used the resulting “edge similarity score”
to find the best match. Gray [32] evaluated this approach and concluded that its
fundamental weakness is the “pixel-by-pixel” nature of the comparison, which
produces multiple false matches. DelBimbo [33] introduced the notion of flexible
matching for indexing, which allows for significant deviations of the sketch from
the edge map. Rectangular regions of interest are identified for images containing
well-delineated objects, and a gradient-descent method detects object boundaries
from edge maps. Chan and coworkers [34] extend the pixel-by-pixel approach
to correlation of “curvelets” by grouping edge pixels into edge elements using
the Hough transform, by modeling grouped edges as curvelets using implicit
polynomial (IP) models [35], and by computing the similarity between a pair of
IP curvelets.
Other approaches augment the edge content by making the relative spatial
arrangement of edges explicit. This evolution from local models of edge content
to those that incorporate more of the global geometry, namely, deformable
templates, curves, and the inclusion of relative spatial arrangement, indicates
a move toward more complete descriptors of shape. Query formulation must
closely match the underlying shape representation: a query shape specified
by a user is necessarily an approximation of the shape the user is trying to
communicate. Thus, a neighborhood of shapes is implicitly being presented to
the system for a match, as determined by the underlying representation of the
query. The requirement of indexing robustness with variations in the underlying
representation of shape motivates the use of identical representations for the query
and for the indexing mechanism.
13.4 REPRESENTATION OF SHAPE
As mentioned in the previous section, only approximate representations of
shape are practically usable for image retrieval. There is clearly a trade-
off between the complexity of the representation and its ability to capture
different aspects of shape. However, the elusive nature of shape makes it almost
impossible to formally analyze this trade-off. As a consequence, shape has been
represented using a variety of descriptors such as moments, Fourier descriptors
(FD), geometric and algebraic invariants, polygons, polynomials, splines, strings,
deformable templates, skeletons, and so on, for both object recognition and for
indexing of image databases.
Each of these representations aims at capturing specific perceptually salient
dimensions of the qualitative aspects of shape. Because of the heterogeneous
nature of the aspects captured, it is not possible to compare different descriptors
outside the context of very specific applications.
Shape comparison is also a very difficult problem. It is well established
that neither mathematical descriptions based on differential geometry [36],
mathematical morphology [37], or statistics [38], nor formal metrics for shape
comparison [39,40], fully capture the salient aspects of shape. The key
observation is that shape, a construct of the projected object that is a perceptual
invariant of the object, is multifaceted.
Existing approaches can be organized according to the particular facets
that have been targeted in the representation. We specifically analyze several
dimensions; we distinguish first between methods that describe the boundary
and methods that describe the interior; we then contrast global and local
representations; we differentiate between composition-based and deformation-based
approaches; we discuss representations of shape at multiple scales; we
categorize shape representations by their completeness; and finally, we distinguish
between the descriptions of isolated shapes and of shape arrangements.
13.4.1 Boundary Versus Interior
Two large categories of shape descriptors can be identified: those capturing
the boundary (or contour appearance) and those characterizing the interior
region. Boundary representations emphasize the closed curve that surrounds the
shape. This curve has been described by numerous models, including chain
codes [41], polygons [42–46], circular arcs [9], splines [47–49], explicit and
implicit polynomials [35,50], and boundary Fourier descriptors. Alternately, a
boundary can be described by its features, for example, curvature extrema and
inflection points [51,52].
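As an illustration of the simplest of these boundary models, the sketch below encodes an 8-connected pixel contour as a chain code. The direction numbering (0 = east, increasing counterclockwise, with y increasing upward) is one common convention and is assumed here; it is not specified in the text.

```python
# Map a unit step (dx, dy) to an 8-direction code; 0 = east, counterclockwise.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def chain_code(points, closed=True):
    """Encode a sequence of 8-connected (x, y) pixel coordinates as a chain code."""
    pts = list(points)
    if closed and pts:
        pts.append(pts[0])  # return to the start to close the contour
    codes = []
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"{(x0, y0)} and {(x1, y1)} are not 8-connected neighbors")
        codes.append(DIRECTIONS[step])
    return codes
```

For example, the unit square traversed counterclockwise from the origin yields the code sequence 0, 2, 4, 6.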
Interior descriptions of shape, on the other hand, emphasize the “material”
within the closed boundary. The interior has been modeled in a variety of ways,
including collections of primitives [53] (rectangles, disks, superquadrics, etc.),
deformable templates [54–56], modes of resonance, skeletal models, or simply
as a set of points (as in mathematical morphology).
Each description, whether boundary-based or region-based, is intuitively
appealing and corresponds to a perceptually meaningful dimension. Clearly, each
representation is complete, and can be used as a basis to compute the other, that
is, by filling in the interior region or by tracing the boundary. Although the two
representations are interchangeable in the sense of information content, the issue of
which aspects of shape have been made explicit matters to the subsequent phases
of the computation. For example, in boundary-based models, features such as
curvature and arc length are immediately available; in region-based methods, the
explicit features are quite different and include spatial relationship among shape
features (for example, the shortest regional distance used in determining a neck).
Shape features that are represented explicitly will generally permit more efficient
retrieval when these particular features are queried. Because both contours and
interiors correspond to meaningful perceptual dimensions, an ideal representation
would include both, enabling a full range of queries. We now consider examples
utilizing either contours, interiors, or both, in their representation of shape.
13.4.1.1 Boundary Representations of Shape. Grosky and Mehrotra [6,57]
represent shape as an ordered set of boundary features, encoded as a polygonal
approximation. Shape similarity is the distance between two boundary feature
vectors. Eakins and coworkers [8–10] represent boundaries with circular polyarcs
and discontinuities. The query-by-visual-example (QVE) system [30] follows a
boundary-based approach: edges are extracted, thinned, binarized,
and stored in a 64 × 64 binary-edge map. A user query, which is formulated as
a sketch, is similarly represented but viewed as a collection of 64 blocks of 8 × 8 pixels each.
The sketch is correlated with the edge map in each block, allowing for one to
four pixel horizontal and vertical shifts, thus effectively building some tolerance
against deformation and warping.
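The block-wise matching with limited shifts can be sketched as follows. The scoring rule (a count of coinciding edge pixels, maximized over the allowed shifts and summed over blocks) is a simplifying assumption; the actual correlation measure used in QVE [30] is not reproduced here.

```python
def block_shift_score(sketch, edge_map, block=8, max_shift=4):
    """Sum, over blocks of the sketch, the best edge overlap within a shift range.

    `sketch` and `edge_map` are equally sized square lists of 0/1 rows.
    Each block of the sketch is compared with the edge map displaced by up to
    `max_shift` pixels horizontally and vertically, and its best overlap is kept.
    """
    n = len(edge_map)
    total = 0
    for by in range(0, n, block):
        for bx in range(0, n, block):
            best = 0
            for dy in range(-max_shift, max_shift + 1):
                for dx in range(-max_shift, max_shift + 1):
                    overlap = 0
                    for y in range(by, min(by + block, n)):
                        for x in range(bx, min(bx + block, n)):
                            yy, xx = y + dy, x + dx
                            if 0 <= yy < n and 0 <= xx < n:
                                overlap += sketch[y][x] & edge_map[yy][xx]
                    best = max(best, overlap)
            total += best
    return total
```

Because each block chooses its own best shift, a sketch that is locally displaced by a few pixels can still score as highly as an exact copy, which is the tolerance described above.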
The approach in DelBimbo and coworkers [48] is one of matching user
sketches, which represent the boundaries of the object of interest. They argue that
straightforward correlation measures, such as those used in QVE [30], produce
good matches only when sketches are drawn exactly. In QVE, the lack of an exact
match between a sketch and a set of image edges is tolerated only to some extent
by allowing for limited horizontal and vertical shifts. In Ref. [48], the approach
relies on a different measure of similarity in which the sketches are allowed to
elastically deform. The sketch is deformed to adjust to the shape of target models;
the extent of the final match and the elastic deformation energy are used as a
measure of shape similarity. Specifically, the one-dimensional sketched template
is modeled by a second-order spline and parameterized by arc length. The sketch
is then allowed to act as an active contour (or snake) [58], namely, it is allowed
to deform to maximize the edge strength integral along the curve, at the same
time minimizing the strain and bending energies. These energies are typically
modeled by integrals of the first and second derivatives of the deformation
along the curve. Shape similarity is then measured as a combination of strain
and bending energy, edge strength function along the curve, curve complexity,
and correlation between certain functions classified by a back-propagation neural
network subject to appropriate training (Fig. 13.2). This approach is translation-
invariant, but requires template scaling and rotation.
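A discrete version of the strain and bending terms can be sketched as below. It assumes that the template and its deformed version are sampled at the same number of points with unit spacing, treats the curve as open, and replaces the integrals by sums of squared finite differences; the weighting of the two terms and the remaining similarity components are left out.

```python
def deformation_energies(template, deformed):
    """Return (strain, bending) for a deformation of an open sampled curve.

    Strain sums squared first differences of the displacement field, and
    bending sums squared second differences.
    """
    if len(template) != len(deformed):
        raise ValueError("template and deformed curve need the same number of samples")
    # Displacement of every sample point.
    disp = [(xd - xt, yd - yt) for (xt, yt), (xd, yd) in zip(template, deformed)]
    strain = 0.0
    for (ax, ay), (bx, by) in zip(disp, disp[1:]):
        strain += (bx - ax) ** 2 + (by - ay) ** 2
    bending = 0.0
    for (ax, ay), (bx, by), (cx, cy) in zip(disp, disp[1:], disp[2:]):
        bending += (cx - 2 * bx + ax) ** 2 + (cy - 2 * by + ay) ** 2
    return strain, bending
```

A pure translation gives zero for both terms, which is consistent with the translation invariance noted above, whereas a uniform stretch produces strain but no bending.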
Kliot and Rivlin [11] represent a binary shape via the local multivalued
invariant signatures of its boundary. First, edge contours are traced and described
as a set of geometric entities, such as circles, ellipses, and straight lines.
Then, the relative position of these geometric entities is described via a
containment tree in which each directed edge points to a curve contained
in the current curve. Finally, each curve is represented by an invariant
signature, which is essentially the derivative of the curve in a transform-invariant
parameterization [60,61].
The shape representation by Gunsel and Tekalp [25] uses edge features
obtained by either the Canny edge detector [27] or the graduated nonconvexity
(GNC) algorithm [28]. If boundaries are closed, the method organizes the edges
as B-splines [49,62]; otherwise, it represents them as a set of feature points. The
advantages of the B-spline representation are the reduction of data volume to a
small number of control points, affine invariance, and robustness to noise because
of inherent smoothing.
Figure 13.2. This figure from Ref. [59] illustrates the use of deformable models in
matching user-drawn sketches to shapes in images.
Jain and Vailaya [63] represent edge directions in a histogram, which is used
as a shape feature. An alternate representation of shape boundary is a series of
2D strings, as presented in Refs. [64–66].
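A histogram of edge directions can be sketched in a few lines. The version below is an assumption-laden simplification: it estimates gradients by central differences rather than with a dedicated edge detector, uses the gradient direction as a stand-in for the edge direction (the two differ by a fixed 90-degree offset), and uses an assumed bin count and magnitude threshold.

```python
import math


def edge_direction_histogram(image, bins=8, threshold=1e-9):
    """Normalized histogram of gradient directions over pixels with nonzero gradient.

    `image` is a list of rows of gray values; border pixels are skipped.
    """
    h, w = len(image), len(image[0])
    hist = [0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (image[y][x + 1] - image[y][x - 1]) / 2.0
            gy = (image[y + 1][x] - image[y - 1][x]) / 2.0
            if math.hypot(gx, gy) <= threshold:
                continue  # flat region: no edge evidence
            angle = math.atan2(gy, gx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += 1
    total = sum(hist)
    return [c / total for c in hist] if total else [0.0] * bins
```

Normalizing by the number of contributing pixels makes the feature independent of how many edge pixels the image contains; rotating the image shifts the histogram cyclically rather than leaving it unchanged.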
13.4.1.2 Interior Representations of Shape. Jagadish [67] represents a shape
by a fixed number of largest rectangles covering the shape. This allows a shape
to be represented by a finite number of numeric parameters, which are mapped
to a point in a multidimensional space, and indexed by a point-access method
(PAM, Chapter 14).
Pentland and Sclaroff propose a physically motivated modal representation
in which the low-order vibration modes of a shape are used as its
representation [68–70]. For a related approach, see Ref. [71].
A class of rather intuitive representations of shape relies on the axis of
symmetry between a pair of boundaries. The earliest use of this representation is
by Blum [72], who defined the medial axis as a locus of inscribed circles that are
maximal in size. The trace of this representation, typically known as a skeleton,
is usually represented by a graph and used in Refs. [73,74].
The symmetry set is the locus of bitangent circles; its definition is identical to
that of the medial axis minus the maximality condition. Thus, the medial axis is
a subset of the symmetry set. However, although it appears that the symmetry set
contains more information than the medial axis, the additional branches of the
symmetry set are in fact redundant. Furthermore, their presence creates numerous
difficulties for indexing when shapes undergo slight perturbations.
The shock set is another variant of the medial axis and is based on the
notion of propagation from boundaries, much like a “grassfire” initiated from
the boundaries of a field. Shocks are singularities that form from the collision
of fronts. These shocks flow along with the wave-front itself [39,75–77]. This
addition of a sense of flow or dynamics to each point of the medial axis
and grouping of monotonically flowing shocks into branches leads to a shock
graph, which is analogous to a skeletal graph, but is a finer partition of the
medial axis. The shock graph has been used for indexing and recognition of
shapes [74,78–84].
13.4.2 Local Versus Global
Shape can also be viewed either from a local or from a global perspective.
Many early models in indexing by shape content used features such as moments,
eccentricity, area, and so on, which are typically based on the entire shape and are
thus global. Similarly, Fourier descriptors of two-dimensional shape are global
descriptors. On the other hand, local representations restrict computations to small
neighborhoods of the shape. For example, a representation based on curvature
extrema and inflection points of the boundary is local.
Purely global representations are affected by variations, such as partial
occlusion and articulation, whereas purely local representations are sensitive to
noise. Ideally, our ability to focus on either facet implies that both must be
emphasized in the representation for successful and intuitive indexing by shape.
The binary edge map used in the query by image content system (QBIC) [4,85]
is an example of global shape representation. Here, edges are extracted (either
manually or automatically) and represented as a binary edge map from which
twenty-two global features are extracted (area, circularity, eccentricity, the major
axis, and a set of associated algebraic moments up to degree 8). A Karhunen-Loeve
(KL) transform reduces the dimensionality of the feature space.
Transform-based methods are also typically global: Fourier descriptors [86,87],
frequency subband decomposition, coefficients of the 2D discrete
Fourier transform (DFT) [88], the wavelet transform [89], the Karhunen-Loeve
transform [19], and others all encode global measures.
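To make the global character of Fourier descriptors concrete, the following sketch computes a few normalized descriptor magnitudes of a closed contour. It assumes a counterclockwise traversal with uniformly spaced samples, and it uses one common normalization (drop the zero-frequency term, divide by the first harmonic, keep magnitudes only); other normalizations appear in the literature.

```python
import cmath


def fourier_descriptors(contour, count=4):
    """Return `count` normalized Fourier-descriptor magnitudes of a closed contour.

    Dropping the zero-frequency term removes translation, dividing by the first
    harmonic removes scale, and taking magnitudes removes rotation and the
    choice of starting point.
    """
    z = [complex(x, y) for x, y in contour]
    n = len(z)
    if n < count + 2:
        raise ValueError("contour has too few samples for the requested descriptors")

    def coeff(k):
        # k-th coefficient of the discrete Fourier transform of the contour.
        return sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / n

    scale = abs(coeff(1))
    if scale == 0:
        raise ValueError("degenerate contour: first harmonic is zero")
    return [abs(coeff(k)) / scale for k in range(2, 2 + count)]
```

Every coefficient depends on all contour samples, so a local change to the boundary (for example, a partial occlusion) alters the whole descriptor vector.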
Orientation radiograms [90] project an image onto an axis by integrating
image intensities along lines orthogonal to that axis. This results in a histogram
for each of the four or eight orientations of the axis used. This is a global
representation because each profile integrates over the entire image, so local
variations are not explicitly captured.
Grosky and Mehrotra represent boundary features by a property vector, which
is matched using a string edit-distance similarity measure [6]. They use an m-way
search tree-based index structure to organize boundary features.
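A string edit distance over sequences of boundary features can be sketched with the standard dynamic-programming recurrence. Unit costs for insertion, deletion, and substitution are assumed here, and features are compared by simple equality; the cost model actually used in Ref. [6] is not reproduced.

```python
def edit_distance(a, b):
    """Return the unit-cost edit distance between two sequences of features."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of `a`
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

For instance, two boundaries described by the hypothetical feature sequences (corner, line, corner) and (corner, arc, corner) are at distance 1.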
A few approaches cannot be easily characterized as either global or
local. These include local differential invariants [91] and semidifferential
invariants [61,92,93].
Shyu and coworkers [94] discuss and compare the utility of local and global
features in the context of a medical application [15].
Wang and coworkers [95] note the limited discrimination capability of global
features, on the one hand, and the noise sensitivity of local features, on the other.
They propose combining both and use two global features (shape elongation and
compactness) as a filter to eliminate the images most dissimilar to the query
template and then use local features to refine the search. Recall that the elongation
of a shape is the ratio of the eigenvalues of the covariance matrix of the
contour-point coordinates, and compactness is the ratio of perimeter squared to area. Both
measures are invariant under Euclidean (i.e., rotation plus translation) and scaling
transformations. Wang and coworkers define a set of local features, referred to
as interest points, which are a small subset of the contour points derived by a
pairwise growing algorithm. First, a pair of contour points with maximal distance
from each other is selected. Then a second pair farthest from the line connecting
the first pair is chosen. The latter part of this process is repeated for each adjacent
pair of points until a sufficient number of interest points have been obtained.
Finally, the coordinates of the interest points are converted through a normalized
affine-invariant transformation [96].
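The two global features defined above can be computed directly from an ordered list of contour points. In this sketch the contour is treated as a closed polygon, the elongation is taken as the larger eigenvalue divided by the smaller one (the text does not say which way the ratio is formed), and the area is obtained with the shoelace formula; these choices are assumptions.

```python
import math


def elongation(points):
    """Ratio of the larger to the smaller eigenvalue of the 2 x 2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form eigenvalues of a symmetric 2 x 2 matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    large, small = half_trace + root, half_trace - root
    if small <= 0:
        raise ValueError("degenerate (collinear) contour")
    return large / small


def compactness(points):
    """Perimeter squared divided by area for a closed polygonal contour."""
    closed = list(points) + [points[0]]
    perimeter = sum(math.dist(p, q) for p, q in zip(closed, closed[1:]))
    area = abs(sum(x0 * y1 - x1 * y0
                   for (x0, y0), (x1, y1) in zip(closed, closed[1:]))) / 2
    if area == 0:
        raise ValueError("contour encloses no area")
    return perimeter ** 2 / area
```

Both quantities are ratios of terms that scale identically, which is why they are unchanged by translation, rotation, and uniform scaling.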
13.4.3 Composition of Parts Versus Deformation
Shape can also be viewed either as the composition of simpler, elementary parts,
or as the deformation of simpler shapes.
In the “part-based view,” shapes are composed of simple components; for
example, a tennis racket is easily described as an elliptical head attached to a
rectangular handle, and a hand is seen as four fingers and one thumb attached
to a palm. Superquadrics [53] represent a rich space of shape primitives from
which to choose [97].
The partitioning can be based on either global fit or local evidence. An example
of global fit is the minimum description length (MDL) approach. Here, a shape
is represented as a combination of primitives selected from a collection; for each
combination, two quantities are computed: the fitting error, and the encoding
cost. The encoding cost (expressed in “bits”) is called the description length, and
measures the complexity of describing the combination. The overall energy is
defined as an increasing function of both fit error and description length (e.g.,
a weighted average). The shape representation with the lowest energy is selected.
Representations with few simple parts have a short description length but can also
have a poor fit; complex representations better approximate the shape but have
a long description length. The method therefore selects the one that optimizes a linear
combination of fit and description length [98].
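The selection rule can be sketched as a minimization over candidate part combinations. The dictionary keys, the single `weight` parameter, and the idea that both quantities are already expressed on comparable scales are assumptions made for illustration; how the fit error and the encoding cost are actually computed in Ref. [98] is not shown.

```python
def select_representation(candidates, weight=0.5):
    """Return the candidate with the lowest weighted energy.

    Each candidate is a dict with hypothetical keys "fit_error" and
    "description_length"; the energy is their weighted average.
    """
    def energy(candidate):
        return (weight * candidate["fit_error"]
                + (1.0 - weight) * candidate["description_length"])

    return min(candidates, key=energy)
```

With equal weights, a two-part description that fits well would typically beat both a single poorly fitting primitive and an exact but very long description; moving `weight` toward 0 or 1 shifts the choice toward the shortest or the best-fitting candidate, respectively.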
Shape can also be decomposed into parts based on “local” evidence. Properties
of the boundary belong to this category. For example, the boundary can be
decomposed into codons along negative minima of its curvature [51,99–101]
or by taking into account regional properties, such as good continuation of
tangents [102]. The latter approach has been shown to produce parts that are
perceptually meaningful [103].
The “part-based” methodology is not universally applicable. Biological shapes,
such as the corpus callosum boundary in the brain, leaves, animal limbs, and
so on, are often best described as the deformation of a simpler shape. This
morph-based view has given rise to deformable templates [55,104–106], modal
representation [69], and so on.
Deformable templates are representations in which shape variability is
captured by allowable transformations of a template. Generally, two forms of
deformable shape models have been proposed, which differ in whether
the model itself or the deformation of the model is parameterized.
Parameterized (geometric) models use an underlying representation that has
a few variable parameters. For example, Yuille and coworkers [73] use conic
curve segments as templates for the eyes and the mouth in face recognition. The
parameters of the conic allow for shape variations. As another example, Staib and
Duncan [107] use elliptical Fourier descriptors to represent boundary templates.
Superquadrics provide yet another example of parameterized shape models [97].
Parametric-deformation approaches represent the object by fitting it to a fixed
template, using a set of allowable parametric deformations. For example, Jain
and coworkers [108] represent the template shape via a bitmap and impose a
probability distribution (a Bayesian prior) on the admissible mappings. Matching
then reduces to selecting the transformation that minimizes a Bayesian objective
function.
This class of methods also contains approaches based on skeletons
[21], deformable templates [47,48,108], the methods by Grenander and