Image Databases: Search and Retrieval of Digital Imagery
Edited by Vittorio Castelli, Lawrence D. Bergman
Copyright © 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-32116-8 (Hardback); 0-471-22463-4 (Electronic)
8 Image Compression — A Review
SHEILA S. HEMAMI
Cornell University, Ithaca, New York
8.1 INTRODUCTION
Compression reduces the number of bits required to represent a signal. The
appropriate compression strategy is a function of the type of signal to be
compressed. Here, we focus on images, which can be single component (e.g.,
gray scale) or multiple component (e.g., three-component color or higher-
component remote sensing data). Each component can be considered to
be an “image” — a two-dimensional (2D) array of pixels, and this chapter
reviews the fundamentals of image compression as a 2D signal with specific
statistical characteristics. Application to multicomponent imagery is achieved by
separately compressing each component. Compression of the higher-dimensional
multicomponent data is possible, but it is very uncommon, so we concentrate on
2D image compression.
Compression can be thought of as redundancy reduction; its goal is to eliminate
the redundancy in the data to provide an efficient representation that preserves
only the essential information. Compression can be performed in one of
two regimes: lossless compression and lossy compression. Lossless compression
permits an exact recovery of the original signal and permits compression ratios
for images of not more than approximately 4 : 1. In lossy compression, the
original signal cannot be recovered from the compressed representation. Lossy
compression can provide images that are visually equivalent to the original at
compression ratios that range from 8 : 1 to 20 : 1, depending on the image
content. Incorporation of human visual system characteristics can be important in
providing high-quality lossy compression. Higher compression ratios are possible,
but they produce a visual difference between the original and the compressed
images.
A block diagram of a typical generic image-compression system is shown
in Figure 8.1 and consists of three components: pixel-level redundancy reduction, data discarding, and bit-level redundancy reduction.

Figure 8.1. Three components of an image-compression system.

A lossless image-
compression system omits the data-discarding step, and as such, lossless
compression results from redundancy reduction alone. A lossy algorithm uses
all three blocks, although extremely efficient techniques can produce excellent
results even without the third block. Although both compression types can be
achieved using simpler block diagrams (e.g., omitting the first block), these three
steps are required to produce state-of-the-art lossy image compression. Each of
these blocks is described briefly.
Pixel-level redundancy reduction performs an invertible mapping of the input
image into a different domain in which the output data w is less correlated than
the original pixels. The most efficient and widely used mapping is a frequency
transformation (also called a transform code), which maps the spatial informa-
tion contained in the pixels into a frequency space, in which the image data is
more efficiently represented numerically and is well matched to the human visual
system frequency response. Data discarding provides the “loss” in lossy compres-
sion and is performed by quantization of w to form x. Both statistical properties
of images and human visual system characteristics are used to determine how the
data w should be quantized while minimally impacting the fidelity of the images.
Fidelity can be easily measured numerically but such metrics do not necessarily
match subjective judgments, making visually pleasing quantization of image data
an inexact science. Finally, bit-level redundancy reduction removes or reduces
dependencies in the data x and is itself lossless.
Instead of studying the blocks sequentially, this chapter begins by describing
basic concepts in both lossless and lossy coding: entropy and rate-distortion
theory (RD theory). Entropy provides a computable bound on bit-level
redundancy reduction and hence lossless compression ratios for specific sources,
whereas RD theory provides a theory of lossy compression bounds. Useful for
understanding limits of compression, neither the concept of entropy nor RD
theory tells us how these bounds may be achieved, nor whether the computed
or theorized bounds are absolute bounds themselves. However, they suggest
that the desired lossless or lossy compression is indeed possible. Next, a brief
description of the human visual system is provided, giving an understanding
of the relative visual impact of image information. This provides guidance
in matching pixel-level redundancy reduction techniques and data-discarding
techniques to human perception. Pixel-level redundancy reduction is then
described, followed by quantization and bit-level redundancy reduction. Finally,
several standard and nonstandard state-of-the-art image compression techniques
are described.
8.2 ENTROPY — A BOUND ON LOSSLESS COMPRESSION
Entropy provides a computable bound by which a source with a known probability
mass function can be losslessly compressed. For example, the “source” could be
the data entering Block 3 in Figure 8.1; as such, entropy does not suggest how
this source has been generated from the original data. Redundancy reduction prior
to entropy computation can reduce the entropy of the processed data below that
of the original data.
The concept of entropy is required for variable-rate coding, in which a code
can adjust its own bit rate to better match the local behavior of a source. For
example, if English text is to be encoded with a fixed-length binary code, each code
word requires $\lceil \log_2 27 \rceil = 5$ bits/symbol (assuming only the alphabet and a space
symbol). However, letters such as “s” and “e” appear far more frequently than do
letters such as “x” and “j.” A more efficient code would assign shorter code words
to more frequently occurring symbols and longer code words to less frequently
occurring symbols, resulting in a lower average number of bits/symbol to encode
the source. In fact, the entropy of English has been estimated at 1.34 bits/letter [1],
indicating that substantial savings are possible over using a fixed-length code.
8.2.1 Entropy Definition
For a discrete source $X$ with a finite alphabet of $N$ symbols $(x_0, \ldots, x_{N-1})$ and a probability mass function $p(x)$, the entropy of the source in bits/symbol is given by

$$H(X) = -\sum_{n=0}^{N-1} p(x_n) \log_2 p(x_n) \qquad (8.1)$$
and measures the average number of bits/symbol required to describe the source.
Such a discrete source is encountered in image compression, in which the acquired
digital image pixels can take on only a finite number of values as determined by
the number of bits used to represent each pixel.
It is easy to show (using the method of Lagrange multipliers) that the
uniform distribution achieves maximum entropy, given by $H(X) = \log_2 N$. A
uniformly distributed source can be considered to have maximum randomness
when compared with sources having other distributions — each alphabet value
is no more likely than any other. Combining this with the intuitive English text
example mentioned previously, it is apparent that entropy provides a measure of
the compressibility of a source. High entropy indicates more randomness; hence
the source requires more bits on average to describe a symbol.
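As a concrete illustration of Eq. (8.1), the following sketch (an illustrative helper, not code from this chapter; the function name is hypothetical) estimates the entropy of a discrete source from its empirical symbol frequencies and shows that a uniform 8-symbol source attains the maximum of log2(8) = 3 bits/symbol.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(symbols):
    """Empirical entropy H(X) = -sum p(x) log2 p(x), in bits/symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A uniformly distributed 8-symbol alphabet attains the maximum, log2(8) = 3 bits/symbol.
print(entropy_bits_per_symbol(list(range(8))))  # 3.0
```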
8.2.2 Calculating Entropy — An Example
An example illustrates the computation of entropy and the difficulty in determining
the entropy of a fixed-length signal. Consider the four-point signal [3/4 1/4 0 0].
There are three distinct values (or symbols) in this signal, with probabilities 1/4,
1/4, and 1/2 for the symbols 3/4, 1/4, and 0, respectively. The entropy of the
signal is then computed as
$$H = -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1.5 \text{ bits/symbol.} \qquad (8.2)$$
This indicates that a variable length code requires 1.5 bits/symbol on average to
represent this source. In fact, a variable-length code that achieves this entropy
is [10 11 0] for the symbols [3/4 1/4 0].
Now consider taking the Walsh-Hadamard transform of this signal (block-
based transforms are described in more detail in Section 8.5.2). This is an invert-
ible transform, so the original data can be uniquely recovered from the trans-
formed data. The forward transform is given by
$$\frac{1}{2}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}\begin{bmatrix} 3/4 \\ 1/4 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1/2 \\ 1/2 \\ 1/4 \\ 1/4 \end{bmatrix} \qquad (8.3)$$
with a resulting entropy easily calculated as 1 bit/symbol. With a simple forward
transform before computing the entropy and an inverse transform to get the
original signal back from the coded signal, the entropy has been reduced by
0.5 bit/symbol. The entropy reduction achieved by a different signal representa-
tion suggests that measuring entropy is not as straightforward as plugging into
the mathematical definition; with an appropriate invertible signal representation,
the entropy can be reduced and the original signal still represented.
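The calculation above can be checked numerically. The sketch below (illustrative only; the entropy helper is an assumption, not part of the chapter) applies the scaled Walsh-Hadamard matrix of Eq. (8.3) to the four-point signal and confirms that the empirical entropy drops from 1.5 to 1.0 bits/symbol.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical entropy in bits/symbol over the distinct values of a signal."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

x = (0.75, 0.25, 0.0, 0.0)      # the four-point example signal, H = 1.5 bits/symbol

# Scaled 4-point Walsh-Hadamard transform, as in Eq. (8.3)
H4 = ((1, 1, 1, 1), (1, 1, -1, -1), (1, -1, -1, 1), (1, -1, 1, -1))
w = tuple(0.5 * sum(h * xi for h, xi in zip(row, x)) for row in H4)

print(w)              # (0.5, 0.5, 0.25, 0.25)
print(entropy(x))     # 1.5
print(entropy(w))     # 1.0 -- the invertible transform lowered the entropy by 0.5 bit
```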
Although the entropy example calculations given earlier are simple to compute,
the results and the broader definition of entropy as the “minimum number of bits
required to describe a source” suggest that defining the entropy of an image is
not as trivial as it may seem. An appropriate pixel-level redundancy reduction
such as a transform can reduce entropy. Such redundancy reduction techniques
for images are discussed later in the chapter; however, it should be mentioned
that pixel transformation into the “right” domain can reduce the required bit rate
to describe the image.
8.2.3 Entropy Coding Techniques
Entropy coding techniques, also known as noiseless coding, lossless coding,
or data compaction coding, are variable-rate coding techniques that provide
compression at rates close to the source entropy. Although the source entropy
provides a lower bound, several of these techniques can approach this bound
arbitrarily closely. Three specific techniques are described.
Huffman coding achieves variable-length coding by assigning code words of
differing lengths to different source symbols. The code word length is directly
proportional to $-\log(f(x))$, where $f(x)$ is the frequency of occurrence of the
symbol x, and a simple algorithm exists to design a Huffman code when the
source symbol probabilities are known [2]. If the probabilities are all powers
of (1/2), then the entropy bound can be achieved exactly by a binary Huffman
code. Because a code word is assigned explicitly to each alphabet symbol, the
minimum number of bits required to code a single source symbol is 1. The
example variable-length code given in the previous section is a Huffman code.
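As a sketch of Huffman code construction (using Python's standard heapq module; the helper below is illustrative, not the chapter's own algorithm), the example source with probabilities 1/4, 1/4, and 1/2 yields code word lengths 2, 2, and 1, matching the 1.5 bits/symbol entropy; the exact 0/1 labeling may differ from the code given above.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a binary Huffman code for a {symbol: probability} mapping."""
    tiebreak = count()  # unique counter keeps heap comparisons away from the dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # merge the two least probable subtrees
        p1, _, code1 = heapq.heappop(heap)
        merged = {sym: "0" + cw for sym, cw in code0.items()}
        merged.update({sym: "1" + cw for sym, cw in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

source = {0.75: 0.25, 0.25: 0.25, 0.0: 0.5}               # symbol -> probability
code = huffman_code(source)
print(code)                                               # e.g. {0.0: '0', 0.75: '10', 0.25: '11'}
print(sum(p * len(code[s]) for s, p in source.items()))   # 1.5 bits/symbol on average
```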
In arithmetic coding, a variable number of input symbols are required to
produce each code symbol. A sequence of source symbols is represented by a
subinterval of real numbers within the unit interval [0,1]. Smaller intervals require
more bits to specify them; larger intervals require fewer. Longer sequences of
source symbols require smaller intervals to uniquely specify them and hence
require more bits than shorter sequences. Successive symbols in the input data
reduce the size of the current interval proportionally to their probabilities; more
probable symbols reduce an interval by a smaller amount than less probable
symbols and hence add fewer bits to the message. Arithmetic coding is more
complex than Huffman coding; typically, it provides a gain of approximately
10 percent more compression than Huffman coding in imaging applications.
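The interval-narrowing idea can be sketched as follows (a toy illustration only; a practical arithmetic coder also needs incremental bit output and interval renormalization, which are omitted here). Each symbol shrinks the current interval in proportion to its probability, and roughly minus log2 of the final interval width bits are needed to identify it.

```python
import math

def narrow_interval(sequence, probabilities):
    """Return the [low, high) subinterval of [0, 1) identifying the sequence."""
    # Cumulative distribution: symbol -> (cumulative low, cumulative high)
    cdf, running = {}, 0.0
    for sym, p in probabilities.items():
        cdf[sym] = (running, running + p)
        running += p
    low, high = 0.0, 1.0
    for sym in sequence:
        width = high - low
        c_low, c_high = cdf[sym]
        low, high = low + width * c_low, low + width * c_high
    return low, high

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
low, high = narrow_interval("aab", probs)
print((low, high))                         # (0.125, 0.1875)
print(math.ceil(-math.log2(high - low)))   # 4 bits suffice to isolate a width-1/16 interval
```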
Lempel-Ziv-Welch (LZW) coding is very different from both Huffman and
arithmetic coding in that it does not require the probability distribution of the
input. Instead, LZW coding is dictionary-based: the code “builds itself” from the
input data, recursively parsing an input sequence into nonoverlapping blocks of
variable size and constructing a dictionary of blocks seen thus far. The dictio-
nary is initialized with the symbols 0 and 1. In general, LZW works best on
large inputs, in which the overhead involved in building the dictionary decreases
as the number of source symbols increases. Because of its complexity and the
possibility of expanding small data sets, LZW coding is not frequently used in image-compression schemes (it is, however, the basis for the Unix utility compress).
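A minimal sketch of the LZW dictionary-building step over a binary alphabet is shown below (illustrative encoder only; index-to-bit packing and the matching decoder are omitted). The dictionary starts with the symbols 0 and 1 and grows as longer blocks are parsed.

```python
def lzw_encode(bits):
    """Encode a string of '0'/'1' symbols; returns dictionary indices."""
    dictionary = {"0": 0, "1": 1}          # initialized with the source alphabet
    output, current = [], ""
    for b in bits:
        candidate = current + b
        if candidate in dictionary:        # keep extending the current block
            current = candidate
        else:
            output.append(dictionary[current])
            dictionary[candidate] = len(dictionary)   # learn a new, longer block
            current = b
    if current:
        output.append(dictionary[current])
    return output, dictionary

codes, table = lzw_encode("10110101101")
print(codes)        # [1, 0, 1, 2, 5, 5] for this input
print(len(table))   # 7 entries: the 2 initial symbols plus 5 learned blocks
```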
8.3 RATE-DISTORTION THEORY — PERFORMANCE BOUNDS FOR LOSSY COMPRESSION
Lossy compression performance bounds are provided by rate-distortion theory
(RD theory) and are more difficult to quantify and compute than the lossless
compression performance bound. RD theory approaches the problem of maxi-
mizing fidelity (or minimizing distortion) for a class of sources for a given bit
rate. This description immediately suggests the difficulties in applying such theory
to image compression. First, fidelity must be defined. Numerical metrics are easy
to calculate but an accepted numerical metric that corresponds to perceived visual
quality has yet to be defined. Secondly, an appropriate statistical description for
images is required. Images are clearly complex and even sophisticated statistical
models for small subsets of image types fail to adequately describe the source for
RD theory purposes. Nevertheless, the basic tenets of RD theory can be applied
operationally to provide improved compression performance in a system. The
aim of this section is to introduce readers to the concepts of RD theory and their
applications in operational rate distortion.
Figure 8.2. A sample rate-distortion curve.
A representative RD curve is shown in Figure 8.2 — as the rate increases,
distortion decreases, and vice versa. RD theory provides two classes of
performance bounds. Shannon theory, introduced in Claude Shannon's seminal
works [3,4], provides performance bounds as the data samples to be coded are
grouped into infinitely long blocks. Alternatively, high-rate low-distortion theory
provides bounds for fixed block size as the rate approaches infinity. Although the
second class of bounds is more realistic for use in image compression (an image
has a finite number of pixels), both classes only provide existence proofs; they
are not constructive. As such, although performance bounds can be derived, no instruction is provided on designing a system that can achieve them. How, then, can RD
theory be applied to practical image compression?
First, consider a simple example. Suppose that an image-compression algorithm
can select among one of several data samples to add to the compressed stream. Data
sample 1 requires three bits to code and reduces the mean-squared error (MSE) of
the reconstructed image by 100; data sample 2 requires 7 bits to code and reduces
the MSE by 225. Which sample should be added to the data stream? A purely rate-
based approach would select the first sample — it requires fewer bits to describe.
Conversely, a purely distortion-based approach would select the second sample, as
it reduces the MSE by over twice the first sample. An RD-based approach compares
the trade-off between the two: the first sample produces an average decrease in MSE
per bit of 100/3 = 33.3 and the second produces an average decrease in MSE per
bit of 225/7 = 32.1. The RD-based approach selects data sample 1 as maximizing
the decrease in MSE per bit. In other words, the coefficient that yields a steeper
slope for the RD curve is selected.
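The selection in this example reduces to comparing RD slopes; a hypothetical helper (the names and dictionary layout below are assumptions for illustration) might look like the following.

```python
def pick_sample(candidates):
    """Pick the candidate with the largest MSE decrease per bit spent (steepest RD slope)."""
    return max(candidates, key=lambda c: c["mse_decrease"] / c["bits"])

candidates = [{"name": "sample 1", "bits": 3, "mse_decrease": 100},
              {"name": "sample 2", "bits": 7, "mse_decrease": 225}]
print(pick_sample(candidates)["name"])   # sample 1 (100/3 = 33.3 > 225/7 = 32.1)
```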
The previous example demonstrates how RD theory is applied in practice; this
is generally referred to as determining operational (rather than theoretical) RD
curves. When dealing with existing compression techniques (e.g., a particular
transform coder followed by a particular quantization strategy, such as JPEG
or a zerotree-based wavelet coder), RD theory is reduced to operational rate
distortion — for a given system (the compression algorithm) and source model
(a statistical description of the image), which system parameters produce the
best RD performance? Furthermore, human visual system (HVS) characteristics
must be taken into account even when determining operational RD curves. Blind
application of minimizing the MSE subject to a rate constraint for image data
following a redundancy-reducing transform code suggests that the average error
introduced into each quantized coefficient be equal, and that such a quantization
strategy will indeed minimize the MSE over all other step size selections at the
same rate. However, the human visual system is not equally sensitive to errors
at different frequencies, and HVS properties suggest that more quantization error
can be tolerated at higher frequencies to produce the same visual quality. Indeed,
images compressed with HVS-motivated quantization step sizes are of visually
higher quality than those compressed to minimize the MSE.
The operational RD curve consists of points that can be achieved using a
given compression system, whereas RD theory provides existence proofs only.
The system defines the parameters that must be selected (e.g., quantizer step
sizes), and a constrained minimization then solves for the parameters that will
provide the best RD trade-offs. Suppose there are N sources to be represented
using a total of R bits/symbol; these N sources could represent the 64 discrete
cosine transform (DCT) coefficients in a Joint Photographic Experts Group (JPEG)
compression system or the 10 subband coefficients in a three-level hierarchically
subband-transformed image (both the DCT and subband transforms are described
in Section 8.5). For each source, there is an individual RD curve, which may itself be operationally determined or generated from a model, indicating that source $i$ will incur a distortion of $D_{ij}$ when coded at a rate of $R_{ij}$ at operating point $j$. Then the operational RD curve is obtained by solving the following
constrained minimization:
For each source $i$, find the operating point $y(i)$ that minimizes the distortion $D = f(D_{1y(1)}, D_{2y(2)}, \ldots, D_{Ny(N)})$ such that

$$\sum_{i=1}^{N} R_{iy(i)} \leq R \qquad (8.4)$$
Most commonly, the distortion is additive and $D = \sum_{i=1}^{N} D_{iy(i)}$. This constrained
minimization can be solved using the method of Lagrange multipliers. When
solved, the result is that to minimize the total distortion D, each source should
operate at a point on its RD curve such that the tangents to the curves are equal;
that is, the operating points have equal slopes. The minimization finds that slope.
Both RD theory and operational RD curves provide a guiding tenet for lossy
image compression: maximize the quality for a given bit rate. Although some
image-compression algorithms actually use RD theory to find operating param-
eters, more commonly, the general tenet is used without explicitly invoking the
theory. This yields theoretically suboptimal compression, but as Section 8.8 will
show, the performance of many compression algorithms is still excellent even
without explicit RD optimization.
8.4 HUMAN VISUAL SYSTEM CHARACTERISTICS
Lossy image compression must discard information in the image that is not or
is only minimally visible, producing the smallest possible perceptible change
to the image. To determine what information can be discarded, an elementary
understanding of the HVS is required. Understanding the HVS has been a topic
of research for over a century, but here we review the points salient to providing
high-quality lossy image compression.
8.4.1 Vision and Luminance-Chrominance Representations
Light enters the eye through the pupil and strikes the retina at the back of the eye.
The retina contains two types of light receptors: cones and rods. Approximately
eight million cones located in the central portion of the retina and sensitive to
red, blue, or green light provide color vision under high-illumination conditions,
such as in a well-lighted room. Each cone is connected to its own nerve end,
providing high resolution in photopic (color) vision. Approximately 120 million
rods are distributed over the entire retina and provide vision at low-illumination
levels, such as a moonlit night (scotopic vision). A single nerve end is connected
to multiple rods; as such, resolution is lower and rods provide a general picture
of the field of view without color information. With midrange illumination, both
rods and cones are active to provide mesopic vision. Because most digital images
are viewed on well-lit displays, the characteristics of photopic vision are most
applicable to digital imaging and compression.
The HVS processes color information by converting the red, green, and blue
data from the cones into a luminance-chrominance space, with the luminance
channel having approximately five times the bandwidth of the chrominance
channel. Consequently, much more error in color (or chrominance) than in lumi-
nance information can be tolerated in compressed images. Color digital images
are often represented in a luminance-chrominance color space (one luminance
component, and two chrominance components). The chrominance components
are often reduced in size by a factor of 2 in each dimension through low-pass
filtering followed by downsampling; these lower-resolution chrominance compo-
nents are then compressed along with the full-size luminance component. In
decompression, the chrominance components are upsampled and interpolated to
full size for display. No noticeable effects are seen when this is applied to natural
images, and it reduces the amount of chrominance information by a factor of 4
even before the compression operation.
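A sketch of this chrominance handling is shown below (assuming numpy; a 2 x 2 box average stands in for the low-pass filter and pixel replication for the interpolation, both of which are simplifications of what a real codec would use).

```python
import numpy as np

def downsample_chroma(channel):
    """Low-pass (2x2 box average) then decimate by 2 in each dimension."""
    h, w = channel.shape
    blocks = channel[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

def upsample_chroma(channel):
    """Nearest-neighbor interpolation back to full size for display."""
    return np.repeat(np.repeat(channel, 2, axis=0), 2, axis=1)

cb = np.random.rand(480, 640)          # one chrominance component
cb_small = downsample_chroma(cb)       # 240 x 320: one quarter of the samples
cb_full = upsample_chroma(cb_small)    # back to 480 x 640 after decompression
print(cb.size, cb_small.size, cb_full.size)
```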
Because understanding and representation of color is itself a well-developed
science, the remainder of this section will focus on the HVS characteristics for
monochrome (gray scale) images on which most lossy image-compression algo-
rithms rely. The resulting characteristics are typically also applied to chrominance
data without modification.
8.4.2 The Human Contrast Sensitivity Function and Visibility Thresholds
Studies on perception of visual stimuli indicate that many factors influence the
visibility of noise in a degraded image when compared with the original. These
factors are functions of the image itself and include the average luminance of
the image, the spatial frequencies present in the image, and the image content.
Because images can have widely varying average luminances and content, the first
and third factors are more difficult to include in an image-compression algorithm
for general use. However, all images can be decomposed into their frequency
content and the HVS sensitivities to different frequencies can be incorporated
into an algorithm.
The human contrast sensitivity function (CSF) is a well-accepted, experi-
mentally obtained description of spatial frequency perception and plots contrast
sensitivity versus spatial frequency. A common contrast measure is the Michelson
contrast, given in terms of the minimum and maximum luminances in the stimulus, $l_{\min}$ and $l_{\max}$, as $C = (l_{\max} - l_{\min})/(l_{\max} + l_{\min})$. The visibility threshold (VT) is
defined as the contrast at which the stimulus can be perceived, and the contrast
sensitivity is defined as 1/VT. The units of spatial frequency are cycles/degree,
where a cycle refers to a full period of a sinusoid, and degrees are a measure
of visual range, where the visual field is described by 180°. Spatial frequency is
a function of viewing distance, so the same sinusoid represents a higher spatial
frequency at a larger viewing distance. The CSF has been determined experi-
mentally with stimuli of sinusoidal gratings at differing frequencies. Figure 8.3
plots a representative CSF. This function peaks around 5–10 cycles/degree and
exhibits exponential falloff — humans are exponentially less sensitive to higher
frequencies.
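The following sketch illustrates these definitions numerically (the CSF shape used here is a purely qualitative stand-in with a peak near 5 cycles/degree, not the experimentally measured function; both function names are hypothetical).

```python
import math

def michelson_contrast(l_min, l_max):
    """Michelson contrast C = (l_max - l_min) / (l_max + l_min)."""
    return (l_max - l_min) / (l_max + l_min)

def illustrative_csf(frequency_cpd):
    """Toy CSF shape: peaks near 5 cycles/degree, falls off at high frequency.

    Only a qualitative stand-in for the experimentally measured curve.
    """
    return frequency_cpd * math.exp(-frequency_cpd / 5.0)

print(michelson_contrast(40.0, 60.0))          # 0.2
for f in (1, 5, 10, 30):
    print(f, round(illustrative_csf(f), 3))    # sensitivity is highest near 5 cpd
```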
The CSF represents measured sensitivity to a simple, single-frequency stimulus. Although images can be decomposed into individual frequencies, they in general consist of many such frequencies. Factors influencing the VTs for a complex stimulus include luminance masking, in which VTs are affected by background luminance, and contrast masking, in which VTs are affected for one
image component in the presence of another. Contrast masking is sometimes
informally referred to as texture masking and includes the interactions of different
frequencies. The combination of luminance and contrast masking is referred to as
spatial masking. When spatial masking is exploited, more compression artifacts
can be hidden in appropriate parts of an image. For example, an observer is less likely to see compression artifacts in a dark, textured region, when compared with artifacts in a midgray flat region.

Figure 8.3. The human contrast sensitivity function.

However, fully exploiting spatial masking
in compression is difficult because it is fairly image-dependent.
The CSF is not orientation-specific. An HVS model that incorporates orientation as well as spatial frequency is the multichannel model [5], which asserts that the visual cortex
contains sets of neurons (called channels) tuned to different spatial frequencies
at different orientations. Although the multichannel model is itself not typically
applied directly to obtaining VTs, it is used as an argument for using wavelet
transforms (described in Section 8.5).
8.4.3 Application to Image Compression
Although the CSF can be mapped directly to a compression algorithm (see [6–8]
for details), it is more common to experimentally measure the VTs for basis
functions of the transform used in an image-compression algorithm. These VTs
are then translated to quantizer step sizes such that quantization-induced distortion
in image components will be below the measured VTs. Such an application
assumes that results from individual experimental stimuli such as a single basis
function or band-pass noise add independently, so that the measured VTs are
equally valid when all transform coefficients in an image are simultaneously
quantized. This approach is argued to be valid when all individual distortions are
subthreshold, that is, they are all below the experimentally measured VTs. Such an
application is image-independent — only measured VTs are used in determining a
quantization strategy. Quantization can be made image-dependent by modifying
the VTs by incorporating various spatial masking models [9].
Roughly speaking, then, a good image-compression algorithm will discard
more higher frequencies than lower frequencies, putting more compression arti-
facts in the frequencies to which the eye is less sensitive. Note that if the
data-discarding step is uniform quantization, this result is in conflict with that
from RD theory, which indicates that all frequencies should be quantized with the
same step size. As such, a perceptually weighted distortion measure is required
to obtain the best-looking images when using an RD technique.
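A sketch of how such thresholds translate into frequency-dependent quantization is shown below (numpy is assumed; the coefficients and step sizes are hypothetical, chosen only to illustrate coarser quantization of higher-frequency bands).

```python
import numpy as np

def quantize(coefficients, step_sizes):
    """Uniform quantization with a per-frequency (per-band) step size."""
    return np.round(coefficients / step_sizes) * step_sizes

# Hypothetical transform coefficients for four frequency bands (low to high)
# and step sizes that grow with frequency, reflecting the higher VTs there.
coeffs = np.array([52.0, 13.0, -6.0, 2.0])
steps = np.array([4.0, 8.0, 16.0, 32.0])

quantized = quantize(coeffs, steps)
print(quantized)                   # quantized band values
print(np.abs(coeffs - quantized))  # error per band, bounded by half of each band's step size
```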
8.5 PIXEL-BASED REDUNDANCY REDUCTION
In this section, techniques for redundancy reduction in images are examined.
These include predictive coding and transform coding. The high redundancy
present in image pixels can be quantified by relatively high inter-pixel correlation coefficients and can be intuitively understood by considering the prediction of a single pixel.
High correlation intuitively means that given a group of spatially close pixels and
an unknown pixel in that group, the unknown pixel can be predicted with very
little error from the known pixels. As such, most of the information required
to determine the unknown pixel is contained in the surrounding pixels and the
unknown pixel itself contains relatively little information that is not represented in
the surrounding pixels. Predictive coding exploits this redundancy by attempting