Image Databases: Search and Retrieval of Digital Imagery
Edited by Vittorio Castelli, Lawrence D. Bergman
Copyright © 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-32116-8 (Hardback); 0-471-22463-4 (Electronic)
8 Image Compression — A Review
SHEILA S. HEMAMI
Cornell University, Ithaca, New York
8.1 INTRODUCTION
Compression reduces the number of bits required to represent a signal. The
appropriate compression strategy is a function of the type of signal to be
compressed. Here, we focus on images, which can be single component (e.g.,
gray scale) or multiple component (e.g., three-component color or higher-
component remote sensing data). Each component can be considered to
be an “image” — a two-dimensional (2D) array of pixels, and this chapter
reviews the fundamentals of image compression as a 2D signal with specific
statistical characteristics. Application to multicomponent imagery is achieved by
separately compressing each component. Compression of the higher-dimensional
multicomponent data is possible, but it is very uncommon, so we concentrate on
2D image compression.
Compression can be thought of as redundancy reduction; its goal is to eliminate
the redundancy in the data to provide an efficient representation that preserves
only the essential information. Compression can be performed in one of
two regimes: lossless compression and lossy compression. Lossless compression
permits an exact recovery of the original signal and permits compression ratios
for images of not more than approximately 4 : 1. In lossy compression, the
original signal cannot be recovered from the compressed representation. Lossy
compression can provide images that are visually equivalent to the original at
compression ratios that range from 8 : 1 to 20 : 1, depending on the image
content. Incorporation of human visual system characteristics can be important in
providing high-quality lossy compression. Higher compression ratios are possible,
but they produce a visual difference between the original and the compressed
images.
A block diagram of a typical generic image-compression system is shown
in Figure 8.1 and consists of three components: pixel-level redundancy reduction, data discarding, and bit-level redundancy reduction.

Figure 8.1. Three components of an image-compression system.

A lossless image-
compression system omits the data-discarding step, and as such, lossless
compression results from redundancy reduction alone. A lossy algorithm uses
all three blocks, although extremely efficient techniques can produce excellent
results even without the third block. Although both compression types can be
achieved using simpler block diagrams (e.g., omitting the first block), these three
steps are required to produce state-of-the-art lossy image compression. Each of
these blocks is described briefly.
Pixel-level redundancy reduction performs an invertible mapping of the input
image into a different domain in which the output data w is less correlated than
the original pixels. The most efficient and widely used mapping is a frequency
transformation (also called a transform code), which maps the spatial informa-
tion contained in the pixels into a frequency space, in which the image data is
more efficiently represented numerically and is well matched to the human visual
system frequency response. Data discarding provides the “loss” in lossy compres-
sion and is performed by quantization of w to form x. Both statistical properties
of images and human visual system characteristics are used to determine how the
data w should be quantized while minimally impacting the fidelity of the images.
Fidelity can be easily measured numerically but such metrics do not necessarily
match subjective judgments, making visually pleasing quantization of image data
an inexact science. Finally, bit-level redundancy reduction removes or reduces
dependencies in the data x and is itself lossless.
Instead of studying the blocks sequentially, this chapter begins by describing
basic concepts in both lossless and lossy coding: entropy and rate-distortion
theory (RD theory). Entropy provides a computable bound on bit-level
redundancy reduction and hence lossless compression ratios for specific sources,
whereas RD theory provides a theory of lossy compression bounds. Useful for
understanding limits of compression, neither the concept of entropy nor RD
theory tells us how these bounds may be achieved, nor whether the computed
or theorized bounds are absolute bounds themselves. However, they suggest
that the desired lossless or lossy compression is indeed possible. Next, a brief
description of the human visual system is provided, giving an understanding
of the relative visual impact of image information. This provides guidance
in matching pixel-level redundancy reduction techniques and data-discarding
techniques to human perception. Pixel-level redundancy reduction is then
described, followed by quantization and bit-level redundancy reduction. Finally,
several standard and nonstandard state-of-the-art image compression techniques
are described.
8.2 ENTROPY — A BOUND ON LOSSLESS COMPRESSION
Entropy provides a computable bound by which a source with a known probability
mass function can be losslessly compressed. For example, the “source” could be
the data entering Block 3 in Figure 8.1; as such, entropy does not suggest how
this source has been generated from the original data. Redundancy reduction prior
to entropy computation can reduce the entropy of the processed data below that
of the original data.
The concept of entropy is required for variable-rate coding, in which a code
can adjust its own bit rate to better match the local behavior of a source. For
example, if English text is to be encoded with a fixed-length binary code, each code
word requires $\lceil \log_2 27 \rceil = 5$ bits/symbol (assuming only the alphabet and a space
symbol). However, letters such as “s” and “e” appear far more frequently than do
letters such as “x” and “j.” A more efficient code would assign shorter code words
to more frequently occurring symbols and longer code words to less frequently
occurring symbols, resulting in a lower average number of bits/symbol to encode
the source. In fact, the entropy of English has been estimated at 1.34 bits/letter [1],
indicating that substantial savings are possible over using a fixed-length code.
8.2.1 Entropy Definition
For a discrete source $X$ with a finite alphabet of $N$ symbols $(x_0, \ldots, x_{N-1})$ and a probability mass function $p(x)$, the entropy of the source in bits/symbol is given by

$$H(X) = -\sum_{n=0}^{N-1} p(x_n) \log_2 p(x_n) \qquad (8.1)$$
and measures the average number of bits/symbol required to describe the source.
Such a discrete source is encountered in image compression, in which the acquired
digital image pixels can take on only a finite number of values as determined by
the number of bits used to represent each pixel.
It is easy to show (using the method of Lagrange multipliers) that the
uniform distribution achieves maximum entropy, given by $H(X) = \log_2 N$. A
uniformly distributed source can be considered to have maximum randomness
when compared with sources having other distributions — each alphabet value
is no more likely than any other. Combining this with the intuitive English text
example mentioned previously, it is apparent that entropy provides a measure of
the compressibility of a source. High entropy indicates more randomness; hence
the source requires more bits on average to describe a symbol.
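As a concrete illustration of Eq. (8.1), the following sketch (an illustrative helper, not code from this chapter; the function name is hypothetical) estimates the entropy of a discrete source from its empirical symbol frequencies and shows that a uniform 8-symbol source attains the maximum of log2(8) = 3 bits/symbol.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(symbols):
    """Empirical entropy H(X) = -sum p(x) log2 p(x), in bits/symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A uniformly distributed 8-symbol alphabet attains the maximum, log2(8) = 3 bits/symbol.
print(entropy_bits_per_symbol(list(range(8))))  # 3.0
```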
8.2.2 Calculating Entropy — An Example
An example illustrates the computation of entropy and the difficulty in determining
the entropy of a fixed-length signal. Consider the four-point signal [3/4 1/4 0 0].
There are three distinct values (or symbols) in this signal, with probabilities 1/4,
1/4, and 1/2 for the symbols 3/4, 1/4, and 0, respectively. The entropy of the
signal is then computed as
$$H = -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1.5 \text{ bits/symbol.} \qquad (8.2)$$
This indicates that a variable length code requires 1.5 bits/symbol on average to
represent this source. In fact, a variable-length code that achieves this entropy
is [10 11 0] for the symbols [3/4 1/4 0].
Now consider taking the Walsh-Hadamard transform of this signal (block-
based transforms are described in more detail in Section 8.5.2). This is an invert-
ible transform, so the original data can be uniquely recovered from the trans-
formed data. The forward transform is given by
$$\frac{1}{2}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}\begin{bmatrix} 3/4 \\ 1/4 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1/2 \\ 1/2 \\ 1/4 \\ 1/4 \end{bmatrix} \qquad (8.3)$$
with a resulting entropy easily calculated as 1 bit/symbol. With a simple forward
transform before computing the entropy and an inverse transform to get the
original signal back from the coded signal, the entropy has been reduced by
0.5 bit/symbol. The entropy reduction achieved by a different signal representa-
tion suggests that measuring entropy is not as straightforward as plugging into
the mathematical definition; with an appropriate invertible signal representation,
the entropy can be reduced and the original signal still represented.
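The calculation above can be checked numerically. The sketch below (illustrative only; the entropy helper is an assumption, not part of the chapter) applies the scaled Walsh-Hadamard matrix of Eq. (8.3) to the four-point signal and confirms that the empirical entropy drops from 1.5 to 1.0 bits/symbol.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical entropy in bits/symbol over the distinct values of a signal."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

x = (0.75, 0.25, 0.0, 0.0)      # the four-point example signal, H = 1.5 bits/symbol

# Scaled 4-point Walsh-Hadamard transform, as in Eq. (8.3)
H4 = ((1, 1, 1, 1), (1, 1, -1, -1), (1, -1, -1, 1), (1, -1, 1, -1))
w = tuple(0.5 * sum(h * xi for h, xi in zip(row, x)) for row in H4)

print(w)              # (0.5, 0.5, 0.25, 0.25)
print(entropy(x))     # 1.5
print(entropy(w))     # 1.0 -- the invertible transform lowered the entropy by 0.5 bit
```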
Although the entropy example calculations given earlier are simple to compute,
the results and the broader definition of entropy as the “minimum number of bits
required to describe a source” suggest that defining the entropy of an image is
not as trivial as it may seem. An appropriate pixel-level redundancy reduction
such as a transform can reduce entropy. Such redundancy reduction techniques
for images are discussed later in the chapter; however, it should be mentioned
that pixel transformation into the “right” domain can reduce the required bit rate
to describe the image.
8.2.3 Entropy Coding Techniques
Entropy coding techniques, also known as noiseless coding, lossless coding,
or data compaction coding, are variable-rate coding techniques that provide
compression at rates close to the source entropy. Although the source entropy
provides a lower bound, several of these techniques can approach this bound
arbitrarily closely. Three specific techniques are described.
Huffman coding achieves variable-length coding by assigning code words of
differing lengths to different source symbols. The code word length is directly
proportional to $-\log(f(x))$, where $f(x)$ is the frequency of occurrence of the
symbol x, and a simple algorithm exists to design a Huffman code when the
source symbol probabilities are known [2]. If the probabilities are all powers
of (1/2), then the entropy bound can be achieved exactly by a binary Huffman
code. Because a code word is assigned explicitly to each alphabet symbol, the
minimum number of bits required to code a single source symbol is 1. The
example variable-length code given in the previous section is a Huffman code.
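As a sketch of Huffman code construction (using Python's standard heapq module; the helper below is illustrative, not the chapter's own algorithm), the example source with probabilities 1/4, 1/4, and 1/2 yields code word lengths 2, 2, and 1, matching the 1.5 bits/symbol entropy; the exact 0/1 labeling may differ from the code given above.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a binary Huffman code for a {symbol: probability} mapping."""
    tiebreak = count()  # unique counter keeps heap comparisons away from the dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # merge the two least probable subtrees
        p1, _, code1 = heapq.heappop(heap)
        merged = {sym: "0" + cw for sym, cw in code0.items()}
        merged.update({sym: "1" + cw for sym, cw in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

source = {0.75: 0.25, 0.25: 0.25, 0.0: 0.5}               # symbol -> probability
code = huffman_code(source)
print(code)                                               # e.g. {0.0: '0', 0.75: '10', 0.25: '11'}
print(sum(p * len(code[s]) for s, p in source.items()))   # 1.5 bits/symbol on average
```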
In arithmetic coding, a variable number of input symbols are required to
produce each code symbol. A sequence of source symbols is represented by a
subinterval of real numbers within the unit interval [0,1]. Smaller intervals require
more bits to specify them; larger intervals require fewer. Longer sequences of
source symbols require smaller intervals to uniquely specify them and hence
require more bits than shorter sequences. Successive symbols in the input data
reduce the size of the current interval proportionally to their probabilities; more
probable symbols reduce an interval by a smaller amount than less probable
symbols and hence add fewer bits to the message. Arithmetic coding is more
complex than Huffman coding; typically, it provides a gain of approximately
10 percent more compression than Huffman coding in imaging applications.
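The interval-narrowing idea can be sketched as follows (a toy illustration only; a practical arithmetic coder also needs incremental bit output and interval renormalization, which are omitted here). Each symbol shrinks the current interval in proportion to its probability, and roughly minus log2 of the final interval width bits are needed to identify it.

```python
import math

def narrow_interval(sequence, probabilities):
    """Return the [low, high) subinterval of [0, 1) identifying the sequence."""
    # Cumulative distribution: symbol -> (cumulative low, cumulative high)
    cdf, running = {}, 0.0
    for sym, p in probabilities.items():
        cdf[sym] = (running, running + p)
        running += p
    low, high = 0.0, 1.0
    for sym in sequence:
        width = high - low
        c_low, c_high = cdf[sym]
        low, high = low + width * c_low, low + width * c_high
    return low, high

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
low, high = narrow_interval("aab", probs)
print((low, high))                         # (0.125, 0.1875)
print(math.ceil(-math.log2(high - low)))   # 4 bits suffice to isolate a width-1/16 interval
```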
Lempel-Ziv-Welch (LZW) coding is very different from both Huffman and
arithmetic coding in that it does not require the probability distribution of the
input. Instead, LZW coding is dictionary-based: the code “builds itself” from the
input data, recursively parsing an input sequence into nonoverlapping blocks of
variable size and constructing a dictionary of blocks seen thus far. The dictio-
nary is initialized with the symbols 0 and 1. In general, LZW works best on
large inputs, in which the overhead involved in building the dictionary decreases
as the number of source symbols increases. Because of its complexity and the
possibility of expanding small data sets, LZW coding is not frequently used in image-compression schemes (it is, however, the basis for the Unix utility compress).
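A minimal sketch of the LZW dictionary-building step over a binary alphabet is shown below (illustrative encoder only; index-to-bit packing and the matching decoder are omitted). The dictionary starts with the symbols 0 and 1 and grows as longer blocks are parsed.

```python
def lzw_encode(bits):
    """Encode a string of '0'/'1' symbols; returns dictionary indices."""
    dictionary = {"0": 0, "1": 1}          # initialized with the source alphabet
    output, current = [], ""
    for b in bits:
        candidate = current + b
        if candidate in dictionary:        # keep extending the current block
            current = candidate
        else:
            output.append(dictionary[current])
            dictionary[candidate] = len(dictionary)   # learn a new, longer block
            current = b
    if current:
        output.append(dictionary[current])
    return output, dictionary

codes, table = lzw_encode("10110101101")
print(codes)        # [1, 0, 1, 2, 5, 5] for this input
print(len(table))   # 7 entries: the 2 initial symbols plus 5 learned blocks
```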
8.3 RATE-DISTORTION THEORY — PERFORMANCE BOUNDS FOR LOSSY COMPRESSION
Lossy compression performance bounds are provided by rate-distortion theory
(RD theory) and are more difficult to quantify and compute than the lossless
compression performance bound. RD theory approaches the problem of maxi-
mizing fidelity (or minimizing distortion) for a class of sources for a given bit
rate. This description immediately suggests the difficulties in applying such theory
to image compression. First, fidelity must be defined. Numerical metrics are easy
to calculate but an accepted numerical metric that corresponds to perceived visual
quality has yet to be defined. Secondly, an appropriate statistical description for
images is required. Images are clearly complex and even sophisticated statistical
models for small subsets of image types fail to adequately describe the source for
RD theory purposes. Nevertheless, the basic tenets of RD theory can be applied
operationally to provide improved compression performance in a system. The
aim of this section is to introduce readers to the concepts of RD theory and their
applications in operational rate distortion.
Figure 8.2. A sample rate-distortion curve.
A representative RD curve is shown in Figure 8.2 — as the rate increases,
distortion decreases, and vice versa. RD theory provides two classes of
performance bounds. Shannon theory, introduced in Claude Shannon's seminal
works [3,4], provides performance bounds as the data samples to be coded are
grouped into infinitely long blocks. Alternatively, high-rate low-distortion theory
provides bounds for fixed block size as the rate approaches infinity. Although the
second class of bounds is more realistic for use in image compression (an image
has a finite number of pixels), both classes only provide existence proofs; they
are not constructive. As such, although performance bounds can be derived, no instruction is provided on designing a system that can achieve them. How, then, can RD
theory be applied to practical image compression?
First, consider a simple example. Suppose that an image-compression algorithm
can select among one of several data samples to add to the compressed stream. Data
sample 1 requires three bits to code and reduces the mean-squared error (MSE) of
the reconstructed image by 100; data sample 2 requires 7 bits to code and reduces
the MSE by 225. Which sample should be added to the data stream? A purely rate-
based approach would select the first sample — it requires fewer bits to describe.
Conversely, a purely distortion-based approach would select the second sample, as
it reduces the MSE by over twice the first sample. An RD-based approach compares
the trade-off between the two: the first sample produces an average decrease in MSE
per bit of 100/3 = 33.3 and the second produces an average decrease in MSE per
bit of 225/7 = 32.1. The RD-based approach selects data sample 1 as maximizing
the decrease in MSE per bit. In other words, the coefficient that yields a steeper
slope for the RD curve is selected.
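The selection in this example reduces to comparing RD slopes; a hypothetical helper (the names and dictionary layout below are assumptions for illustration) might look like the following.

```python
def pick_sample(candidates):
    """Pick the candidate with the largest MSE decrease per bit spent (steepest RD slope)."""
    return max(candidates, key=lambda c: c["mse_decrease"] / c["bits"])

candidates = [{"name": "sample 1", "bits": 3, "mse_decrease": 100},
              {"name": "sample 2", "bits": 7, "mse_decrease": 225}]
print(pick_sample(candidates)["name"])   # sample 1 (100/3 = 33.3 > 225/7 = 32.1)
```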
The previous example demonstrates how RD theory is applied in practice; this
is generally referred to as determining operational (rather than theoretical) RD
curves. When dealing with existing compression techniques (e.g., a particular
transform coder followed by a particular quantization strategy, such as JPEG
or a zerotree-based wavelet coder), RD theory is reduced to operational rate
distortion — for a given system (the compression algorithm) and source model
(a statistical description of the image), which system parameters produce the
best RD performance? Furthermore, human visual system (HVS) characteristics
must be taken into account even when determining operational RD curves. Blind
application of minimizing the MSE subject to a rate constraint for image data
following a redundancy-reducing transform code suggests that the average error
introduced into each quantized coefficient be equal, and that such a quantization
strategy will indeed minimize the MSE over all other step size selections at the
same rate. However, the human visual system is not equally sensitive to errors
at different frequencies, and HVS properties suggest that more quantization error
can be tolerated at higher frequencies to produce the same visual quality. Indeed,
images compressed with HVS-motivated quantization step sizes are of visually
higher quality than those compressed to minimize the MSE.
The operational RD curve consists of points that can be achieved using a
given compression system, whereas RD theory provides existence proofs only.
The system defines the parameters that must be selected (e.g., quantizer step
sizes), and a constrained minimization then solves for the parameters that will
provide the best RD trade-offs. Suppose there are N sources to be represented
using a total of R bits/symbol; these N sources could represent the 64 discrete
cosine transform (DCT) coefficients in a Joint Photographic Experts Group (JPEG)
compression system or the 10 subband coefficients in a three-level hierarchically
subband-transformed image (both the DCT and subband transforms are described
in Section 8.5). For each source, there is an individual RD curve, which may itself be operationally determined or generated from a model, indicating that source $i$ will incur a distortion of $D_{ij}$ when coded at a rate of $R_{ij}$ at operating point $j$. Then the operational RD curve is obtained by solving the following
constrained minimization:
For each source $i$, find the operating point $y(i)$ that minimizes the distortion $D = f(D_{1y(1)}, D_{2y(2)}, \ldots, D_{Ny(N)})$ such that

$$\sum_{i=1}^{N} R_{iy(i)} \leq R \qquad (8.4)$$
Most commonly, the distortion is additive and $D = \sum_{i=1}^{N} D_{iy(i)}$. This constrained
minimization can be solved using the method of Lagrange multipliers. When
solved, the result is that to minimize the total distortion D, each source should
operate at a point on its RD curve such that the tangents to the curves are equal;
that is, the operating points have equal slopes. The minimization finds that slope.
Both RD theory and operational RD curves provide a guiding tenet for lossy
image compression: maximize the quality for a given bit rate. Although some
image-compression algorithms actually use RD theory to find operating param-
eters, more commonly, the general tenet is used without explicitly invoking the
theory. This yields theoretically suboptimal compression, but as Section 8.8 will
show, the performance of many compression algorithms is still excellent even
without explicit RD optimization.
8.4 HUMAN VISUAL SYSTEM CHARACTERISTICS
Lossy image compression must discard information in the image that is not or
is only minimally visible, producing the smallest possible perceptible change
to the image. To determine what information can be discarded, an elementary
understanding of the HVS is required. Understanding the HVS has been a topic
of research for over a century, but here we review the points salient to providing
high-quality lossy image compression.
8.4.1 Vision and Luminance-Chrominance Representations
Light enters the eye through the pupil and strikes the retina at the back of the eye.
The retina contains two types of light receptors: cones and rods. Approximately
eight million cones located in the central portion of the retina and sensitive to
red, blue, or green light provide color vision under high-illumination conditions,
such as in a well-lighted room. Each cone is connected to its own nerve end,
providing high resolution in photopic (color) vision. Approximately 120 million
rods are distributed over the entire retina and provide vision at low-illumination
levels, such as a moonlit night (scotopic vision). A single nerve end is connected
to multiple rods; as such, resolution is lower and rods provide a general picture
of the field of view without color information. With midrange illumination, both
rods and cones are active to provide mesopic vision. Because most digital images
are viewed on well-lit displays, the characteristics of photopic vision are most
applicable to digital imaging and compression.
The HVS processes color information by converting the red, green, and blue
data from the cones into a luminance-chrominance space, with the luminance
channel having approximately five times the bandwidth of the chrominance
channel. Consequently, much more error in color (or chrominance) than in lumi-
nance information can be tolerated in compressed images. Color digital images
are often represented in a luminance-chrominance color space (one luminance
component, and two chrominance components). The chrominance components
are often reduced in size by a factor of 2 in each dimension through low-pass
filtering followed by downsampling; these lower-resolution chrominance compo-
nents are then compressed along with the full-size luminance component. In
decompression, the chrominance components are upsampled and interpolated to
full size for display. No noticeable effects are seen when this is applied to natural
images, and it reduces the amount of chrominance information by a factor of 4
even before the compression operation.
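A sketch of this chrominance handling is shown below (assuming numpy; a 2 x 2 box average stands in for the low-pass filter and pixel replication for the interpolation, both of which are simplifications of what a real codec would use).

```python
import numpy as np

def downsample_chroma(channel):
    """Low-pass (2x2 box average) then decimate by 2 in each dimension."""
    h, w = channel.shape
    blocks = channel[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

def upsample_chroma(channel):
    """Nearest-neighbor interpolation back to full size for display."""
    return np.repeat(np.repeat(channel, 2, axis=0), 2, axis=1)

cb = np.random.rand(480, 640)          # one chrominance component
cb_small = downsample_chroma(cb)       # 240 x 320: one quarter of the samples
cb_full = upsample_chroma(cb_small)    # back to 480 x 640 after decompression
print(cb.size, cb_small.size, cb_full.size)
```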
Because understanding and representation of color is itself a well-developed
science, the remainder of this section will focus on the HVS characteristics for
monochrome (gray scale) images on which most lossy image-compression algo-
rithms rely. The resulting characteristics are typically also applied to chrominance
data without modification.
8.4.2 The Human Contrast Sensitivity Function and Visibility Thresholds
Studies on perception of visual stimuli indicate that many factors influence the
visibility of noise in a degraded image when compared with the original. These
factors are functions of the image itself and include the average luminance of
the image, the spatial frequencies present in the image, and the image content.
Because images can have widely varying average luminances and content, the first
and third factors are more difficult to include in an image-compression algorithm
for general use. However, all images can be decomposed into their frequency
content and the HVS sensitivities to different frequencies can be incorporated
into an algorithm.
The human contrast sensitivity function (CSF) is a well-accepted, experi-
mentally obtained description of spatial frequency perception and plots contrast
sensitivity versus spatial frequency. A common contrast measure is the Michelson
contrast, given in terms of the minimum and maximum luminances in the stimulus, $l_{\min}$ and $l_{\max}$, as $C = (l_{\max} - l_{\min})/(l_{\max} + l_{\min})$. The visibility threshold (VT) is
defined as the contrast at which the stimulus can be perceived, and the contrast
sensitivity is defined as 1/VT. The units of spatial frequency are cycles/degree,
where a cycle refers to a full period of a sinusoid, and degrees are a measure
of visual range, where the visual field is described by 180°. Spatial frequency is
a function of viewing distance, so the same sinusoid represents a higher spatial
frequency at a larger viewing distance. The CSF has been determined experi-
mentally with stimuli of sinusoidal gratings at differing frequencies. Figure 8.3
plots a representative CSF. This function peaks around 5–10 cycles/degree and
exhibits exponential falloff — humans are exponentially less sensitive to higher
frequencies.
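The following sketch illustrates these definitions numerically (the CSF shape used here is a purely qualitative stand-in with a peak near 5 cycles/degree, not the experimentally measured function; both function names are hypothetical).

```python
import math

def michelson_contrast(l_min, l_max):
    """Michelson contrast C = (l_max - l_min) / (l_max + l_min)."""
    return (l_max - l_min) / (l_max + l_min)

def illustrative_csf(frequency_cpd):
    """Toy CSF shape: peaks near 5 cycles/degree, falls off at high frequency.

    Only a qualitative stand-in for the experimentally measured curve.
    """
    return frequency_cpd * math.exp(-frequency_cpd / 5.0)

print(michelson_contrast(40.0, 60.0))          # 0.2
for f in (1, 5, 10, 30):
    print(f, round(illustrative_csf(f), 3))    # sensitivity is highest near 5 cpd
```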
The CSF represents measured sensitivity to a simple, single-frequency stimulus. Although images can be decomposed into individual frequencies, they in general consist of many such frequencies. Factors influencing the VTs for a complex stimulus include luminance masking, in which VTs are affected by background luminance, and contrast masking, in which VTs are affected for one
image component in the presence of another. Contrast masking is sometimes
informally referred to as texture masking and includes the interactions of different
frequencies. The combination of luminance and contrast masking is referred to as
spatial masking. When spatial masking is exploited, more compression artifacts
can be hidden in appropriate parts of an image. For example, an observer is less likely to see compression artifacts in a dark, textured region, when compared with artifacts in a midgray flat region.

Figure 8.3. The human contrast sensitivity function.

However, fully exploiting spatial masking
in compression is difficult because it is fairly image-dependent.
The CSF is not orientation-specific. An HVS model that incorporates orientation as well as spatial frequency is the multichannel model [5], which asserts that the visual cortex
contains sets of neurons (called channels) tuned to different spatial frequencies
at different orientations. Although the multichannel model is itself not typically
applied directly to obtaining VTs, it is used as an argument for using wavelet
transforms (described in Section 8.5).
8.4.3 Application to Image Compression
Although the CSF can be mapped directly to a compression algorithm (see [6–8]
for details), it is more common to experimentally measure the VTs for basis
functions of the transform used in an image-compression algorithm. These VTs
are then translated to quantizer step sizes such that quantization-induced distortion
in image components will be below the measured VTs. Such an application
assumes that results from individual experimental stimuli such as a single basis
function or band-pass noise add independently, so that the measured VTs are
equally valid when all transform coefficients in an image are simultaneously
quantized. This approach is argued to be valid when all individual distortions are
subthreshold, that is, they are all below the experimentally measured VTs. Such an
application is image-independent — only measured VTs are used in determining a
quantization strategy. Quantization can be made image-dependent by modifying
the VTs by incorporating various spatial masking models [9].
Roughly speaking, then, a good image-compression algorithm will discard
more higher frequencies than lower frequencies, putting more compression arti-
facts in the frequencies to which the eye is less sensitive. Note that if the
data-discarding step is uniform quantization, this result is in conflict with that
from RD theory, which indicates that all frequencies should be quantized with the
same step size. As such, a perceptually weighted distortion measure is required
to obtain the best-looking images when using an RD technique.
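A sketch of how such thresholds translate into frequency-dependent quantization is shown below (numpy is assumed; the coefficients and step sizes are hypothetical, chosen only to illustrate coarser quantization of higher-frequency bands).

```python
import numpy as np

def quantize(coefficients, step_sizes):
    """Uniform quantization with a per-frequency (per-band) step size."""
    return np.round(coefficients / step_sizes) * step_sizes

# Hypothetical transform coefficients for four frequency bands (low to high)
# and step sizes that grow with frequency, reflecting the higher VTs there.
coeffs = np.array([52.0, 13.0, -6.0, 2.0])
steps = np.array([4.0, 8.0, 16.0, 32.0])

quantized = quantize(coeffs, steps)
print(quantized)                   # quantized band values
print(np.abs(coeffs - quantized))  # error per band, bounded by half of each band's step size
```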
8.5 PIXEL-BASED REDUNDANCY REDUCTION
In this section, techniques for redundancy reduction in images are examined.
These include predictive coding and transform coding. The high redundancy
present in image pixels can be quantified by relatively high inter-pixel correlation coefficients and can be intuitively understood by considering the prediction of a single pixel.
High correlation intuitively means that given a group of spatially close pixels and
an unknown pixel in that group, the unknown pixel can be predicted with very
little error from the known pixels. As such, most of the information required
to determine the unknown pixel is contained in the surrounding pixels and the
unknown pixel itself contains relatively little information that is not represented in
the surrounding pixels. Predictive coding exploits this redundancy by attempting