CSPIHT BASED SCALABLE VIDEO CODEC FOR
LAYERED VIDEO STREAMING
FENG WEI
(B.Eng. (Hons), Xi'an Jiaotong University)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2003
ACKNOWLEDGEMENT
I would like to express my gratitude to all those who gave me the possibility to
complete this thesis.
First of all, I would like to extend my sincere gratitude to my two supervisors, A/P
Ashraf A. Kassim and Dr. Tham Chen Khong, for their insightful guidance throughout
my project and their valuable time and inputs on this thesis. They have helped and
encouraged me in numerous ways, especially when my progress was slow.
I am grateful to my three seniors -- Mr. Lee Wei Siong, Mr. Tan Eng Hong and Mr.
See Toh Chee Wah, who have provided me with much information and many helpful
discussions. Their assistance was vital to this project. I wish to thank all the friends and
fellow students in the Vision and Image Processing lab, especially the lab officer Mr.
Francis Hoon. They have been wonderful company to me for these two years.
Last but not least, I wish to thank my boyfriend Huang Qijie for his support all along.
Almost all of my progress was made when he was by my side.
TABLE OF CONTENTS
ACKNOWLEDGEMENT……………………………………….……..…………......i
TABLE OF CONTENTS……………………………………….……..………….......ii
LIST OF FIGURES.…………………………………………….……..……………..iv
LIST OF TABLES…………………………………………………….….……….…vii
SUMMARY……………………………………………………………..…………...viii
CHAPTER 1 INTRODUCTION……………………………..…………….…….......1
CHAPTER 2 IMAGE AND VIDEO CODING………………..…………….……....5
2.1 Transform Coding…………………………………...……..……………………..5
2.1.1 Linear Transform…..…….…………………………...…..…………………….6
2.1.2 Quantization……………….……………………………..………………….….7
2.1.3 Arithmetic Coding……………….……….………………......………………...8
2.1.4 Binary Coding………………….…………………………...….…..………….10
2.2 Video Compression Using MEMC………………………………..…………….10
2.3 Wavelet Based Image and Video Coding…………………………...…..………12
2.3.1 Discrete Wavelet Transform………………………………………….……….13
2.3.2 EZW Coding …………….………………………….……………...…………16
2.3.3 SPIHT Coding Scheme…………………………….……………...………..…18
2.3.4 Scalability……………………………………….………………..…………...23
2.4 Image and Video Coding Standards………………………………..…………..25
CHAPTER 3 VIDEO STREAMING AND NETWORK QoS………..…….……..25
3.1 Video Streaming Models……………………………………………….….…….25
3.2 Characteristics and Challenges of Video Streaming…………………..………26
3.3 Quality of Service……………………………………………………….…..……27
3.3.1 Definition of QoS ……...….…………………………….……………..……...27
3.3.2 IntServ Framework………….…………………………….………….….…….28
3.3.3 DiffServ Framework………….…………………………….…………...……..31
3.4 Layered Video Streaming……………………………………………..………...33
CHAPTER 4 LAYERED 3D-CSPIHT CODEC……………………….……..……….36
4.1 CSPIHT and 3D-CSPIHT Video Coder ………..……..….……...…..……….. 36
4.2 Limitations of Original 3-D CSPIHT Codec……………..……..…..………….41
4.3 Layered 3D-CSPIHT Video Codec………………………….…….…...……….42
4.3.1 Overview of New Features………………………………….…….…………43
4.3.2 Layer IDs…………………………………………………….…….………...44
4.3.3 Production of Multiresolutional Scalable Bit Streams………………………46
4.3.4 How the Codec Functions in the Network………………………….………54
4.3.5 Layered 3D-CSPIHT Algorithm……………………………….………...…57
CHAPTER 5 PERFORMANCE DATA………………………………….….…….59
5.1 Coding Performance Measurements………………………………….……….59
5.2 PSNR Performance of the layered 3D-CSPIHT Codec………….….………..60
5.3 Coding Time and Compression Ratio………………………………….……...70
CHAPTER 6 CONCLUSIONS……………………………………………….……71
REFERENCES…………………………..…………………………………….……74
SUMMARY
A layered scalable codec based on the 3-D Color Set Partitioning in Hierarchical
Trees (3D-CSPIHT) coder is presented in this thesis. The layered 3D-CSPIHT codec
introduces layering of encoded bit streams to support layered scalable video streaming.
It restricts the significance criteria of the original 3D-CSPIHT coder to generate
separate bit streams comprised of cumulative layers. Layers are defined according to
resolution subbands. The layered 3D-CSPIHT codec incorporates a new sorting
algorithm to produce multi-resolution scalable bit streams, and a specially designed
layer ID to identify the layer that a particular data packet belongs to. By doing so,
decoding of lossy data is achieved.
The layered 3D-CSPIHT codec is tested using both high motion and low motion
standard QCIF video sequences at 10 frames per second. It is compared against the
original 3D-CSPIHT and the 2D-CSPIHT video coder in terms of PSNR, encoding
time and compression ratio. In the luminance plane, the original 3D-CSPIHT and the
2D-CSPIHT give better PSNR than the layered 3D-CSPIHT, while in the
chrominance planes they give similar PSNR results. The layered 3D-CSPIHT also
requires more computation time and produces less compressed bit streams because of
the overhead incurred by incorporating the layer ID. However, encoded video data is
very likely to encounter loss in real network transmission. When decoding lossy data,
the layered 3D-CSPIHT codec outperforms the original 3D-CSPIHT significantly.
LIST OF TABLES
Table 2.1 Image and video compression standards……………………..…………….24
Table 4.1 Resolution options……………………………………………….…………47
Table 4.2 LIP, LIS, LSP state after sorting at bit plane 2 (original CSPIHT)……...…50
Table 4.3 LIP, LIS, LSP state after sorting at bit plane 1 (original CSPIHT)………...51
Table 4.4 LIP, LIS, LSP state after sorting at bit plane 0 (original CSPIHT)………...51
Table 4.5 LIP, LIS, LSP state after sorting at bit plane 2 (layered CSPIHT, layer 1
effective)………………………………………………………………………………52
Table 4.6 LIP, LIS, LSP state after sorting at bit plane 1 (layered CSPIHT, layer 1
effective)………………………………………………………………………………52
Table 4.7 LIP, LIS, LSP state after sorting at bit plane 0 (layered CSPIHT, layer 1
effective)………………………………………………………………………………52
Table 4.8 LIP, LIS, LSP state after sorting at bit plane 2 (layered CSPIHT, layer 2
effective)………………………………………………………………………………53
Table 4.9 LIP, LIS, LSP state after sorting at bit plane 1 (layered CSPIHT, layer 2
effective)………………………………………………………………………………53
Table 4.10 LIP, LIS, LSP state after sorting at bit plane 0 (layered CSPIHT, layer 2
effective)……………………………………………………………………………....53
Table 5.1 Average PSNR (dB) at 3 different resolutions…………………………..…61
Table 5.2 Encoding time (in second) of the original and layered codec……...………70
LIST OF FIGURES
Fig. 1.1 A typical video streaming system…………..…………………………………2
Fig. 2.1 Encoding model………………………………..………………………………5
Fig. 2.2 Decoding model……………………………….....……………………………6
Fig. 2.3 Binary coding model……………………………..………………….……….10
Fig. 2.4 Block matching motion estimation………………..……….…………………11
Fig. 2.5 1-D DWT decomposition…………………………..………………………...14
Fig 2.6 Dyadic DWT decomposition of an image……………..…….………………..14
Fig 2.7 Subbands after 3-level dyadic wavelet decomposition………..……………...15
Fig. 2.8 2-level DWT decomposed Barbara image……………………..…………….15
Fig. 2.9 Spatial Orientation Tree for EZW………………………………..…………..17
Fig. 2.10 Spatial Orientation Tree of SPIHT………………………………………….18
Fig. 2.11 SPIHT coding algorithm…………………………………………..………..25
Fig. 3.1 Unicast video streaming……………………………………………..……….25
Fig. 3.2 Multicast video streaming……………………………………………..……..26
Fig. 3.3 IntServ architecture……………………………………………………..……29
Fig. 3.4 Leaky bucket regulator…………………………………………………...…..30
Fig. 3.5 An example of the DiffServ network……………………………………..….32
Fig. 3.6 DiffServ inter-domain operations………………….………………..……..…33
Fig. 3.7 Principle of a layered codec………………………...………….………….....35
Fig. 4.1 CSPIHT SOT (2-D) ……………………………………………….……...….37
Fig. 4.2 CSPIHT video encoder …………………………………………………..….37
Fig. 4.3 CSPIHT video decoder……………………………...……………….……….38
Fig. 4.4 3D-CSPIHT STOT …………………………………….…………...……..…39
Fig. 4.5 3D-CSPIHT video encoder……………………………..……………….…...40
Fig. 4.6 3D-CSPIHT video decoder……………………………..…….………..…….41
Fig. 4.7 Confusion when decoding lossy data using the original 3D-CSPIHT decoder…....41
Fig. 4.8 Network scenario considered for design of the layered codec………..…......43
Fig. 4.9 The bit stream after layer ID is added…………………………………..……45
Fig. 4.10 Resolution layers in layered 3D-CSPIHT……………………..……………47
Fig. 4.11 Progressively transmitted and decoded layers ……………………..…...….47
Fig. 4.12 (a) An example video frame after DWT transform ………………..…….…49
Fig. 4.12 (b) SOT for Fig. 4.12 (a)……...………………………………...……..……49
Fig. 4.13 Bit stream structure of the layered 3D-CSPIHT coder…………….....….….55
Fig. 4.14 Flowchart of the layered decoder algorithm…………………………...……56
Fig. 4.15 Layered 3D-CSPIHT algorithm………………………………………….…58
Fig. 5.1 Frame by frame PSNR results on (a) foreman and (b) container sequences at 3
different resolutions………………………………………………………………..….61
Fig. 5.2 Rate distortion curve of the layered 3D-CSPIHT codec..………………..…..62
Fig. 5.3 PSNR (dB) comparison of the original and the layered codec in (a) luminance
plane, (b) Cb plane and (c) Cr plane for the foreman sequence……………………....63
Fig. 5.4 Frame 1 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c)
resolution 3 and (d) original…………………………………………………………...64
Fig. 5.5 Frame 58 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c)
resolution 3 and (d) original…………………………………………………………...64
Fig. 5.6 Frame 120 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c)
resolution 3 and (d) original…………………………………………………………...65
Fig. 5.7 Frame 190 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c)
resolution 3 and (d) original…………………………………………………………...65
Fig. 5.8 Comparison on carphone sequence………………………………………….66
Fig. 5.9 Comparison on akiyo sequence……………………………………………....67
Fig. 5.10 Manually formed incomplete bit streams...…………………………………68
Fig. 5.11 Reconstruction of frame (a)(b)1, (c)(d)5, (e)(f)10 of the foreman sequence
……………………………………………………………………………………..….69
CHAPTER 1
INTRODUCTION
With the increasing demand for rich multimedia content on the Internet, video
streaming has become popular in both academia and industry.
Video streaming technology enables real time or on-demand distribution of video
resources over the network. Compressed video data are transmitted by a server
application, and received and displayed in real time by the corresponding client
applications. These applications normally start to display the video as soon as a certain
amount of data arrives at the client’s buffer, thus providing downloading and viewing
of the video simultaneously.
A typical video streaming system consists of five core functional blocks, i.e., coding
module, network sender, network receiver, decoding module and video renderer. As
shown in Fig. 1.1, raw video data will undergo compression in the coding module to
reduce the data load in the network. The compressed video is then transmitted by the
sender to the client on the other side of the network, where a decoding procedure is
performed to reconstruct the video for the renderer to display.
Video streaming is advantageous because a user does not have to wait for the whole
file to arrive before viewing the video. Besides, video streaming leaves no physical
files on the client's computer.
Fig. 1.1 A typical video streaming system (raw video → encoder → sender → network → receiver → decoder → renderer)
The challenge of video streaming lies in the highly delay-sensitive characteristic of
video applications. Video/audio data need to arrive on time to be useful. Unfortunately,
current Internet service is best effort (BE) and guarantees no delay bound. Delay
sensitive applications need a new service model in which they can ask for higher
assurance or priority from the network. Research in network Quality of Service (QoS)
aims to investigate and provide such service models. Technical details of QoS include
control protocols such as the Resource Reservation Protocols (RSVP), and individual
building blocks such as traffic policing, buffer management and admission control [1].
Layered scalable streaming is one of the QoS supportive video streaming mechanisms
that provide both efficiency and flexibility.
The basic idea of layered scalable streaming is to encode raw video into multiple
layers that can be separately transmitted, cumulatively received and progressively
decoded [2]-[4]. Clients obtain a preferred video quality by subscribing to different
layers and combining these layers into different bit streams. The base layer of the video
stream must be received for any other layers to be useful, and each additional layer
improves the video quality. As network clients always differ significantly in their
capacities and preferences, layered scalable streaming is efficient in that it is able to
deliver one video stream over the network, while at the same time it enables the clients
to receive a video that is specially “shaped” for each of them.
Besides adaptive QoS support from the network, layered scalable video streaming
requests a scalable video codec. Recent subband coding algorithms based on the
Discrete Wavelet Transform (DWT) support scalability. The DWT based Set
Partitioning in Hierarchical Trees (SPIHT) scheme [5] [6] for coding of monochrome
images has yielded desirable results despite its simplicity in implementation. The
Color SPIHT (CSPIHT) [7]-[9] improves the SPIHT and achieves comparable
compression results to SPIHT in color image coding. In the area of video compression,
interest is focused on the removal of temporal redundancy. The use of 3-D subband
coding schemes is one of the successful solutions. Karlsson and Vetterli implemented a
3-D subband coding system in [10] by generalizing the common 2-D filter banks to 3-D
subband analysis and synthesis. As one of the embedded 3-D subband coding
algorithms that follow it, 3D-CSPIHT [11] is an extension of the CSPIHT coding
scheme for video coding.
The above coding schemes achieve satisfactory PSNR performance; however, they
have been designed from a pure compression point of view, which makes their direct
application to a QoS enabled streaming system problematic.
In this project, we extended the 3D-CSPIHT codec to address these problems and
enable it to produce layered bit streams that are suitable for layered video streaming.
The rest of this thesis is organized as follows: In chapter 2 we provide background
information on image/video compression, and in chapter 3 we discuss related research
in multimedia communications and network QoS. The details of our extension of the
3D-CSPIHT codec, called layered 3D-CSPIHT video codec, are presented in chapter 4.
We analyze performance of the layered codec in chapter 5. Finally, in chapter 6 we
conclude this thesis.
CHAPTER 2
IMAGE AND VIDEO CODING
This chapter begins with an overview of transform coding for still images and video
coding using motion compensation. Then wavelet based image and video coding is
introduced and the subband coding techniques are described in detail. Finally, current
image and video coding standards are briefly summarized.
2.1 Transform Coding
A typical transform coding system comprises a forward transform, quantization and
entropy coding, as shown in Fig. 2.1. First, a reversible linear transform is used to
reduce redundancy between adjacent pixels, i.e., the inter-pixel redundancy, in an
image. After that, the image undergoes the quantization stage to reduce psychovisual
redundancy. Lastly, the quantized image goes through entropy coding which aims to
reduce coding redundancy. Transform coding is a core technique recommended by
JPEG and adopted by H.261, H.263, and MPEG 1/2/4. The corresponding decoding
procedure is depicted in Fig. 2.2. We will discuss the three encoding stages in this
section.
Fig. 2.1 Encoding model (input signal → transform → quantization → entropy coding → compressed signal)
Fig. 2.2 Decoding model (compressed signal → entropy decoding → inverse transform → reconstructed signal)
2.1.1 Linear Transforms
Transform coding exploits the inter-pixel redundancy of an image by mapping the
image to the transform domain using a reversible linear transform. For most natural
images, a significant number of coefficients will have small magnitudes after the
transform. These coefficients therefore can be coarsely quantized or entirely discarded
without causing much image degradation [12]. There is no information loss during the
transform process, and the number of coefficients produced is equal to the number of
pixels transformed. Transform itself does not directly reduce the amount of data
required to represent the image. However, a set of transform coefficients are obtained
in this way, which makes the inter-pixel redundancies of the input image more
accessible for compression in later stages of the encoding process [12].
Defining the input signal x = [x_1, x_2, ..., x_N]^T as a vector of data samples with the
standard basis {a_1, a_2, ..., a_N} of an N-dimensional Euclidean space, we obtain:

    x = \sum_{n=1}^{N} x_n a_n    (2.1)

where A = [a_1, a_2, ..., a_N] is the identity matrix of size N × N.
A different basis [b_1, b_2, ..., b_N] can be used to represent x as

    x = \sum_{n=1}^{N} y_n b_n    (2.2)

with y_n being the coordinates of x with respect to b_n (n ∈ {1, 2, ..., N}).
Let B = [b_1, b_2, ..., b_N] and y = [y_1, y_2, ..., y_N]^T; we then have

    x = B y    (2.3)

Rearranging equation (2.3), we get

    y = T x    (2.4)

where T = B^{-1}. Equation (2.4) then defines the one-dimensional linear transform from
vector x to y.
The goal of the transform process is to de-correlate the pixels, or to pack the signal energy
into as few transform coefficients as possible. However, not all linear transforms are
optimal in this sense. Only the whitening transform (viz. the Karhunen-Loeve transform
(KLT), Hotelling transform or the method of principal components) [13], in which the
eigenvectors of the input covariance matrix form the basis functions, de-correlates the
input signal or image and is optimal in the sense of energy compaction. However, the KLT is
seldom used in practice because it is data dependent, which makes it computationally
expensive. Instead, near-optimal transforms such as the discrete cosine transform (DCT)
are normally selected in practical transform coding systems because they provide a good
compromise between energy compaction ability and computational complexity [14].
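To make the mapping y = Tx of equation (2.4) concrete, the short sketch below (an illustrative example only, not code from this project) builds an orthonormal DCT basis matrix T and applies it to a smooth signal; most of the signal energy ends up in a few coefficients, which is exactly the energy compaction the transform stage relies on.

    import numpy as np

    def dct_matrix(n):
        # Orthonormal DCT-II basis: row k is the basis vector b_k^T, so y = T x.
        k = np.arange(n).reshape(-1, 1)
        i = np.arange(n).reshape(1, -1)
        T = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
        T[0, :] = np.sqrt(1.0 / n)        # the first row uses a different normalization
        return T

    n = 8
    x = np.linspace(0.0, 1.0, n) ** 2      # a smooth "row of pixels"
    T = dct_matrix(n)
    y = T @ x                              # forward transform, equation (2.4)
    x_rec = T.T @ y                        # T is orthonormal, so B = T^T inverts it

    energy = np.cumsum(np.sort(y ** 2)[::-1]) / np.sum(y ** 2)
    print("energy captured by the 2 largest coefficients:", energy[1])
    print("reconstruction error:", np.max(np.abs(x - x_rec)))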
2.1.2 Quantization
After transform process, quantization is used to reduce the accuracy of the transform
coefficients according to a pre-established fidelity criterion [14]. The effect of
compression is achieved in this way. Quantization is an irreversible process.
Quantization is the mapping from the source data vector x to a code word rk = Q[x] in a
code book { rk ; 1 ≤ k ≤ L}. The criterion to choose the proper code word is to reduce
the expected distortion due to quantization with respect to a particular probability
density distribution of the data. Assume the probability density function of x is f(x).
The expected distortion can be formulated as:
    D = \sum_{k=1}^{L} \int \|x - r_k\|^2 \, I(x, r_k) \, f_x(x) \, dx    (2.5)

where

    I(x, r_k) = \begin{cases} 1, & Q[x] = r_k \\ 0, & \text{otherwise} \end{cases}    (2.6)

is an indicator function.
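As a hypothetical illustration of this mapping, the sketch below quantizes samples drawn from a Gaussian source with a uniform codebook and estimates the expected distortion of equation (2.5) empirically (this quantizer is only meant to make the definitions concrete; it is not part of the codec described later).

    import numpy as np

    def uniform_quantizer(x, step):
        # Map each sample to the nearest codeword r_k on a uniform grid, i.e. Q[x] = r_k.
        return step * np.round(x / step)

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 100_000)      # samples drawn from the source density f_x
    step = 0.5
    xq = uniform_quantizer(x, step)

    # Monte Carlo estimate of the expected distortion D of equation (2.5)
    D = np.mean((x - xq) ** 2)
    print("empirical distortion:", D)
    print("high-rate approximation step^2/12:", step ** 2 / 12)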
2.1.3 Arithmetic Coding
In the final stage of transform coding, a symbol coder is used to create code to
represent the output from the quantization process. In most cases, the quantized data are
mapped to a set of variable-length codes. The coder assigns the shortest code to the output
value that occurs most frequently, thereby reducing the coding redundancy and saving the
amount of data required to represent the quantized data set. The following results from
information theory provide the basic tools for dealing with information representation
quantitatively.
Let {a1 , a2 ,...ai ,...ak } be a set of symbols from a memoryless source of messages, each
with a known probability of occurrence, denoted as p(ai ) . The amount of information
imparted by the occurrence of the symbol ai in the message is:
    I(a_i) = -\log_2 p(a_i), \quad 1 \le i \le k    (2.7)

where the unit of information is the bit for a logarithm of base 2.
The entropy of the message source is then defined as

    H = -\sum_{j=1}^{k} p(a_j) \log_2 p(a_j)    (2.8)
Entropy specifies the average information content (per symbol) of the messages
generated by the source [14] and gives the minimum amount of bits (average) required
to encode all the symbols in the system. Entropy coding aims to encode a given set of
symbols with the minimum number of bits required so as to approach the entropy of
the system. Examples of entropy coding include Huffman coding, run length coding
and arithmetic coding. We give some details on arithmetic coding in the following.
Arithmetic coding is a variable-length coding method based on the frequency of each character
or symbol. It is well suited to encoding a long stream of symbols or long messages. In
arithmetic coding, probabilities of all code words sum up to unity. The events in the
data set are arranged in an interval between 0 and 1. Each code word probability can be
related to a subdivision of this interval. The algorithm for arithmetic coding then works
as follows:
i) Begin with a current interval [L, H) initialized to [0, 1);
ii) For each incoming event, subdivide the current interval into subintervals, one for each
possible event, proportional to the events' probabilities of occurrence;
iii) Select the subinterval corresponding to the incoming event, make it the new
current interval and go back to step ii).
Arithmetic coding reduces the information that needs to be transmitted to a single
number within the final interval, which is identified after the whole data set is encoded.
The arithmetic decoder, knowing the occurrence probabilities of the different
events and the received number, then identifies and rescales the corresponding
intervals to decode the data set.
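A toy floating-point version of this interval narrowing (ignoring the incremental renormalization and finite-precision arithmetic that a practical coder needs) might look like the sketch below; the message and probability table are made up for illustration.

    def arithmetic_encode(message, probs):
        # Lay out one sub-interval of [0, 1) per symbol, proportional to its probability.
        cum, intervals = 0.0, {}
        for symbol, p in probs.items():
            intervals[symbol] = (cum, cum + p)
            cum += p

        low, high = 0.0, 1.0
        for symbol in message:              # steps ii) and iii): narrow the interval
            span = high - low
            s_low, s_high = intervals[symbol]
            low, high = low + span * s_low, low + span * s_high
        # Any number inside the final interval identifies the whole message.
        return (low + high) / 2

    code = arithmetic_encode("aabca", {"a": 0.6, "b": 0.2, "c": 0.2})
    print("single number to transmit:", code)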
2.1.4 Binary Coding
Binary coding is lossless, and is a necessary step in any coding system. The process of
binary coding is shown in Fig. 2.3.
Fig. 2.3 Binary coding model (a symbol a_i and the probability table p_i enter the binary encoder, which outputs the codeword c_i of bit length l_i)
Denote the bit rate produced by such a binary coding system as R. According to Fig.
2.3, we have
    R = \sum_{a_i \in A} p(a_i) \, l(a_i)    (2.9)
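The sketch below ties equations (2.8) and (2.9) together for a made-up probability table: for this dyadic source, a Huffman-style code achieves an average rate R equal to the entropy H.

    import math

    probs   = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # hypothetical p(a_i)
    lengths = {"a": 1,   "b": 2,    "c": 3,     "d": 3}       # hypothetical lengths l(a_i)

    R = sum(probs[s] * lengths[s] for s in probs)              # equation (2.9)
    H = -sum(p * math.log2(p) for p in probs.values())         # equation (2.8)
    print("average rate R =", R, "bits/symbol; entropy H =", H, "bits/symbol")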
2.2 Video Compression Using MEMC
Unlike still image compression, video compression attempts to exploit the temporal
redundancy. There are two types of coding categorized according to the type of
redundancy being exploited, i.e., intraframe coding and interframe coding. In
intraframe coding, each frame is coded separately using still image compression
methods such as transform coding, while interframe coding uses spatial redundancies
and motion compensation to exploit temporal redundancy of the video sequence. This
is done by predicting a new frame from its previous frame, thus the original frame to
code is reduced to the prediction error or residual frame [15]. We do this because
prediction errors have smaller energy than the original pixel values and therefore can
be coded with fewer bits. Those regions with high motion or scene changes will be
coded directly using transform coding. A video compression system is evaluated using
three criteria: reconstruction quality, compression rate and complexity.
The method used to predict a frame from its previous one is called Motion Estimation
(ME) or Motion Compensation (MC) [16] [17]. MC uses the motion vectors to
eliminate or reduce the effects of motion, while ME computes motion vectors that carry
the displacement information of a moving object. The two terms are often referred to
together as MEMC.
Fig. 2.4 Block matching motion estimation (a block in the actual frame is matched against candidate prediction blocks in the reference frame; the displacement of the best match is the motion vector)
MEMC is normally done independently at the macro block (MB) level (16×16 pixels) in
order to reduce computational complexity; this is called the Block Matching Algorithm.
In the Block Matching Algorithm (Fig. 2.4), a video frame is divided into macro
blocks. Each pixel within the block is assumed to have the same amount of
translational motion. Motion estimation is achieved by doing block matching between
a block in the current frame and a similar matching block within a search window in
the reference frame. A two-dimensional displacement vector or motion vector (MV) is
then obtained by finding the displaced co-ordinate of the match block to the reference
frame. The best prediction is found by minimizing a matching criterion such as the
Sum of Absolute Difference (SAD). SAD is defined as:
    SAD = \sum_{x=1}^{M} \sum_{y=1}^{N} | B_{i,j}(x, y) - B_{i-u,j-v}(x, y) |    (2.10)

where B_{i,j}(x, y) represents the pixel with coordinate (x, y) in an M×N block from the
current frame at spatial location (i, j), while B_{i-u,j-v}(x, y) represents the pixel with
coordinate (x, y) in the candidate matching block from the reference frame at spatial
location (i, j) displaced by the vector (u, v).
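A minimal full-search implementation of this matching criterion (an illustrative sketch on synthetic data, not the MEMC module used later in this thesis) could look like:

    import numpy as np

    def best_motion_vector(cur, ref, i, j, block=16, search=7):
        # Full-search block matching: minimize the SAD of equation (2.10)
        # over all displacements (u, v) within a +/- search window.
        b = cur[i:i + block, j:j + block].astype(np.int32)
        best, best_sad = (0, 0), np.inf
        for u in range(-search, search + 1):
            for v in range(-search, search + 1):
                r, c = i - u, j - v        # candidate block B_{i-u, j-v}
                if r < 0 or c < 0 or r + block > ref.shape[0] or c + block > ref.shape[1]:
                    continue
                cand = ref[r:r + block, c:c + block].astype(np.int32)
                sad = int(np.abs(b - cand).sum())
                if sad < best_sad:
                    best_sad, best = sad, (u, v)
        return best, best_sad

    rng = np.random.default_rng(1)
    ref = rng.integers(0, 256, (64, 64), dtype=np.uint8)
    cur = np.roll(ref, shift=(2, -3), axis=(0, 1))   # frame with a known global motion
    print(best_motion_vector(cur, ref, 16, 16))       # recovers the displacement (2, -3)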
2.3 Wavelet Based Image and Video Coding
This section provides a brief overview of wavelet based image and video coding [18]-[22].
The Discrete Wavelet Transform (DWT) is introduced and the subband coding
schemes, including the Embedded Zerotree Wavelet (EZW) and the Set Partitioning in
Hierarchical Trees (SPIHT), are discussed in detail. In the last sub-section, the concept of
scalability is introduced.
2.3.1 Discrete Wavelet Transform
The Discrete Wavelet Transform (DWT) is an invertible linear transform that
decomposes a signal into a set of orthogonal functional basis called wavelets. The
fundamental idea behind DWT is to present each frequency component as a resolution
matched to its scale, so that a signal can be analyzed at various levels of scales or
resolutions. In the field of image and video coding, DWT performs decomposition of
video frames or residual frames into a multi-resolution subband representation.
We denote the wavelet basis as
    \phi_{j,k}(x) = 2^{-j/2} \, \phi(2^{-j} x - k)    (2.11)

where the integer variables j and k are the scale and location indices indicating the
wavelet's width and position, respectively. They are used to scale or "dilate" the mother
function φ(x) to generate the wavelets.
The DWT transform pair is then defined as

    f(x) = \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} c_{j,k} \, \phi_{j,k}(x)    (2.12)

    c_{j,k} = \int_{-\infty}^{\infty} \phi_{j,k}^{*}(x) \, f(x) \, dx    (2.13)

where f(x) is the signal to be decomposed, and c_{j,k} is the wavelet coefficient. To span
the data domain at different resolutions, we use equation (2.14):

    W(x) = \sum_{k=-1}^{N-2} (-1)^{k} c_{k+1} \, \phi(2x + k)    (2.14)

W(x) is called the scaling function for the mother function φ(x).
Fig. 2.5 1-D DWT decomposition (the input vector is passed through low-pass (L) and high-pass (H) filters, each followed by down-sampling by 2; the outputs a_j feed the next decomposition stage and the outputs c_j are wavelet coefficients)
Fig 2.6 Dyadic DWT decomposition of an image (the rows of the input image are filtered and down-sampled first, then the columns, producing the LL, LH, HL and HH subbands)
In real applications, the DWT is often performed on a vector whose length is an integer
power of 2. As Fig. 2.5 shows, the process of 1-D DWT computation comprises
series of filtering and sub-sampling operations. H and L denote high and low-pass
filters respectively, ↓ 2 denotes down-sampling by a factor of 2. Elements aj are passed
on to the next step of the DWT and elements cj are the final wavelet coefficients
obtained from the DWT. The 1-D DWT can be extended to 2-D for image and video
processing. In this case, filtering and sub-sampling are first performed along all the
rows of the image and then all the columns. 2-D DWT is called dyadic DWT. 1-level
dyadic DWT results in four different resolution subbands, namely the LL, LH, HL and
the HH subbands. The decomposition process is shown in Fig. 2.6. The LL subband
contains the low frequency image and can be further decomposed by 2-level or 3-level
dyadic DWT. Fig. 2.7 depicts the subbands of an image decomposed using a 3-level
dyadic DWT. Fig. 2.8 shows the Barbara image after 2- level decomposition.
Fig 2.7 Subbands after 3-level dyadic wavelet decomposition (LL, HL3, LH3, HH3, HL2, LH2, HH2, HL1, LH1, HH1)
Fig. 2.8 2-level DWT decomposed Barbara image
The advantage of the DWT is its versatile time-frequency localization. This is
because the DWT has shorter basis functions for higher frequencies and longer basis
functions for lower frequencies. The DWT also has an important advantage over the
traditional Fourier Transform in that it can analyze signals containing discontinuities
and sharp spikes.
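As a concrete sketch of one dyadic decomposition level (using the simple Haar filters rather than the longer wavelet filters a practical codec would use), the row and column filtering of Fig. 2.6 can be written as:

    import numpy as np

    def haar_dwt2(img):
        # One level of dyadic DWT with Haar filters: filter and down-sample the rows,
        # then the columns, giving four subbands (the LL/LH/HL/HH naming here follows
        # one common convention; some texts swap LH and HL).
        img = img.astype(np.float64)
        lo = (img[:, 0::2] + img[:, 1::2]) / np.sqrt(2)   # low-pass along the rows
        hi = (img[:, 0::2] - img[:, 1::2]) / np.sqrt(2)   # high-pass along the rows
        LL = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)
        LH = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)
        HL = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
        HH = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
        return LL, LH, HL, HH

    rng = np.random.default_rng(2)
    frame = rng.integers(0, 256, (8, 8))
    LL, LH, HL, HH = haar_dwt2(frame)
    print(LL.shape, LH.shape, HL.shape, HH.shape)   # each subband is 4x4

Feeding the LL subband back into the same routine yields the 2- and 3-level decompositions of Fig. 2.7.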
2.3.2 EZW Coding Scheme
Good energy compaction property has attracted huge research interest on DWT based
image and video coding schemes. The main challenge of wavelet-based coding is to
achieve an efficient structure to quantize and code the wavelet coefficients in the
transform domain. Lewis and Knowles defined a spatial orientation tree (SOT)
structure [23] - [27] and Shapiro then made use of the SOT concept and introduced the
Embedded Zerotree Wavelet (EZW) encoder [28] in 1993. The idea was further improved
by Said and Pearlman, who modified the EZW SOT structure. Their new structure is
called Set Partitioning in Hierarchical Trees (SPIHT). A brief discussion on the EZW
scheme is provided in this section and a detailed description on SPIHT is provided in
the next section.
Shapiro’s EZW coder contains 4 key steps:
i) the discrete wavelet transform;
ii) subband coding using the EZW SOT structure (Fig. 2.9);
iii) entropy coded successive-approximation quantization;
iv) adaptive arithmetic coding.
A zerotree is actually a SOT which has no significant coefficients with respect to a
given threshold. For simplicity, the image in Fig. 2.9 is transformed using a 2-level
DWT. However, in most situations, a 3-level DWT is applied to ensure better
reconstruction quality. As shown in Fig. 2.9, the image is divided into 7 subbands
after the 2-level wavelet transform. Each node in the lowest subband has 3 children
nodes, one in each of the neighboring subbands. Its children, in turn, each have 4
children nodes which reside in the same spatial location of the corresponding higher
subband. Thus, all the nodes are linked in SOTs; a search through these trees is then
performed so that the significant coefficients are found and coded with higher
priority. The core of the EZW encoder,
series of decreasing thresholds representing the current bit plane, ordered bit plane
coding of refinement bits and exploitation of the correlation across subbands in the
transform domain.
Fig. 2.9 Spatial Orientation Tree for EZW
The EZW coding scheme has proved competitive in performance with virtually all known
compression techniques, while still generating a fully embedded bit stream. It utilizes
both bit plane coding and the zerotree concept.
2.3.3 SPIHT Coding Scheme
Said and Pearlman's SPIHT coder is an enhancement of the EZW coder. Basically, SPIHT is
also a sorting algorithm which tries to code wavelet coefficients according to priority
defined by their significance with respect to a certain threshold. This is achieved by
tracking down the SPIHT SOT and comparing the coefficients against the given
threshold. SPIHT scheme inherits the basic concepts of the EZW, except that it uses a
modified SOT called SPIHT SOT (Fig. 2.10).
Fig. 2.10 Spatial Orientation Tree of SPIHT
The SPIHT SOT structure is designed according to the observation that if a coefficient
magnitude in a certain node of a SOT does not exceed a given threshold, it is very
likely that none of the nodes in the same location in the higher subbands will exceed
that threshold. The SPIHT SOT naturally defines this spatial relationship using a
hierarchical pyramid. Each node is identified by the coordinate of the pixel and its
magnitude is the corresponding absolute value of that pixel. As Fig. 2.10 shows, each
node has either no or 4 offspring, which are located at the same spatial orientation in
the next finer level of the pyramid. The 4 offspring always form 2 × 2 adjacent pixel
groups. The nodes in the lowest subband of the image or the highest level of the
pyramid will be the roots of the SOT. There is a slight difference in the offspring
branching rule for the tree roots, i.e., in each 2 × 2 group, the upper left node will be
childless. Thus, the wavelet coefficients are organized in hierarchical trees with nodes
in common orientation across all subbands linked in one same SOT. This will allow us
to predict a coefficient’s significance according to the magnitude of its parent node
later.
We use the symbols in Said and Pearlman’s paper to denote the coordinates and the
sets:
• O(i,j) denotes the set of coordinates of all offspring of node (i,j);
• D(i,j) denotes the set of coordinates of all descendants of node (i,j);
• H denotes all nodes in the lowest subband, inclusive of the childless nodes;
• L(i,j) denotes the set of coordinates of all non-direct descendants of node (i,j), i.e., L(i,j) = D(i,j) - O(i,j).
Now we are able to express the SOT descendants branching rule by equation (2.15):

    O(i,j) = {(2i, 2j), (2i+1, 2j), (2i, 2j+1), (2i+1, 2j+1)}    (2.15)
After the sets are defined, the set partitioning rule is used to create new partitions in
order to effectively predict and code significant nodes. A magnitude test is performed
on each partitioned subset to determine its significance. If significant, the subset will
be further partitioned into new subsets, and the magnitude test will again be applied to
the new subsets until each individual significant coefficient is identified. Note that an
individual coefficient is significant when it is larger than the current threshold and
insignificant otherwise. To make a significant set, at least one descendant must be
significant on an individual basis. We denote the transformed coefficients as c_{i,j} and a
pixel set as τ, and use the following function to define the relationship between
magnitude comparisons and message bits:

    S(\tau) = \begin{cases} 1, & \text{if } \max_{(i,j) \in \tau} |c_{i,j}| \ge 2^n \\ 0, & \text{otherwise} \end{cases}    (2.16)
The set partitioning rule is then defined as follows:
i) The initial partition is formed with the sets {(i,j)} and D(i,j), for all (i,j) ∈ H;
ii) If D(i,j) is significant, then it is partitioned into L(i,j) plus the four single-element sets {(k,l)} with (k,l) ∈ O(i,j);
iii) If L(i,j) is significant, then it is partitioned into the four sets D(k,l) with (k,l) ∈ O(i,j).
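The two primitives this rule relies on, the offspring set of equation (2.15) and the significance test of equation (2.16), can be sketched as follows (an illustration only; as noted above, the branching rule for tree roots is slightly different, and the real coder works on actual wavelet coefficients):

    import numpy as np

    def offspring(i, j):
        # O(i,j) of equation (2.15): the 2x2 group of children in the next finer level.
        return [(2 * i, 2 * j), (2 * i + 1, 2 * j), (2 * i, 2 * j + 1), (2 * i + 1, 2 * j + 1)]

    def significant(coeffs, nodes, n):
        # S(tau) of equation (2.16): 1 if any coefficient in the set reaches bit plane n.
        return int(max(abs(coeffs[i, j]) for (i, j) in nodes) >= 2 ** n)

    coeffs = np.array([[34, 2, 0, 1],
                       [ 5, 3, 1, 0],
                       [ 1, 0, 2, 1],
                       [ 0, 1, 0, 3]])
    n = int(np.floor(np.log2(np.max(np.abs(coeffs)))))   # initial bit plane, here n = 5
    print("node (0,0) significant at bit plane", n, ":", significant(coeffs, [(0, 0)], n))
    print("O(1,1) =", offspring(1, 1),
          "significant:", significant(coeffs, offspring(1, 1), n))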
Following the SOT structure and the set partitioning rule, an image that has large
coefficients at the SOT roots and zero or very small coefficients in higher level of the
SOTs will need very little sorting and partitioning of the pixel sets. This property
reduces the computational complexity greatly and allows for a better reconstruction of
the image.
In implementation, the SPIHT coding algorithm uses 3 ordered lists to store the
significance information, i.e.:
i) list of insignificant pixels (LIP)
ii) list of significant pixels (LSP)
iii) list of insignificant sets (LIS)
In case of LIP and LSP, the coordinates of the pixels will be stored in the list. In case
of LIS, however, the list contains two types of coordinates categorized according to
which set it represents. If an entry represents the set D(i,j), we say it is a type A entry;
if it represents L(i,j), we say it is a type B entry.
To initialize the coding algorithm, the maximum coefficient in the image is identified
and the initial bit plane is assigned the value n = ⌊log2(max_(i,j){|c_i,j|})⌋. The threshold
value is then obtained by computing 2^n. Also, the LIS and LIP lists are initialized with
pixel coordinates in the highest subband. The set partitioning rule is then applied to the
LIP and LIS lists to judge the significance status of the pixels or sets. This is called the
sorting pass. Thereafter, the refinement pass goes through the LSP list to code the bits
necessary to enhance the precision of the significant coefficients from the previous
sorting passes by one bit position. Thus, coding under the first bit plane is completed. To
continue, the bit plane is decreased by 1 and the sorting and refinement passes are
re-executed in the next iteration. This process is repeated until the bit plane is reduced to
zero or a user given bit budget runs out. Fig. 2.11 summarizes this coding algorithm.
Note that in step 2.2), the entries added to the end of the LIS list are evaluated before
that same sorting pass ends. So, step 2.2) not only sorts the originally initialized entries,
but also sorts the entries that are added to the LIS list during the pass.
The SPIHT coder improves the performance of the EZW coder by 0.3-0.6 dB. This
gain is mostly due to the fact that the original zerotree algorithms allow special
symbols only for single zerotrees, while there are often other sets of zeros in reality. In
particular, the SPIHT coder provides symbols for combinations of parallel zerotrees.
Moreover, SPIHT produces a fully embedded bit stream whose bit rate can be precisely
controlled. The SPIHT coder is fast and has low computational complexity.
Both EZW and SPIHT are subband coding schemes, and both exploit the
correlation between subbands through the SOT.
1) Initialization:
   1.1) Output n = ⌊log2(max_(i,j){|c_i,j|})⌋;
   1.2) Set the LSP, LIP and LIS as empty lists.
        Add coordinates (i,j) ∈ H to the LIP and those with descendants
        to the LIS as TYPE A entries.
2) Sorting Pass:
   2.1) For each entry (i,j) in LIP do:
        2.1.1) Output S(i,j);
        2.1.2) If S(i,j) = 1 then
               - Move (i,j) to LSP;
               - Output sign of c_i,j;
   2.2) For each entry (i,j) in LIS do:
        2.2.1) If the entry is TYPE A then
               - Output S(D(i,j));
               - If S(D(i,j)) = 1 then
                 + For each offspring (k,l) of (i,j) do:
                   - Output S(k,l);
                   - If S(k,l) = 1 then add (k,l) to LSP
                     and output sign of c_k,l;
                   - If S(k,l) = 0 then add (k,l) to end of LIP;
                 + If L(i,j) ≠ ∅ then move (i,j) to end of LIS as
                   TYPE B and go to step 2.2.2);
                   otherwise remove (i,j) from LIS;
        2.2.2) If the entry is TYPE B then
               - Output S(L(i,j));
               - If S(L(i,j)) = 1 then
                 + Add each element in L(i,j) to end of LIS as TYPE A;
                 + Remove (i,j) from LIS;
3) Refinement Pass:
   For each entry (i,j) in LSP, except those from the last sorting pass:
   - Output the nth most significant bit of c_i,j;
4) Quantization-Step Update:
   Decrease n by 1 and go to step 2.
Fig. 2.11 SPIHT coding algorithm
2.3.4 Scalability
One advantage of the SPIHT image and video coder is the bit rate scalability.
Scalability is the degree to which video and image formats can be sized in systematic
proportions for distribution over communication channels of varying capacities [29]. In
other words, it measures how flexible an encoded bit stream is. Scalable image and
video coding has received considerable attention from the research community due to
the diversity of the communication networks and network users.
There are three basic types of scalability, and they refine video quality along three
different dimensions, i.e.:
• Temporal scalability or temporal resolution/frame rate
• Spatial scalability or spatial resolution
• SNR scalability or amplitude resolution
Each type of scalable coding provides scalability of one dimension of the video
sequence. Multiple types of scalability can be combined to provide scalability along
multiple dimensions. In real applications, being temporally scalable often means
supporting different frame rates, while spatial and SNR scalability mean video of
different spatial resolutions and visual qualities, respectively.
One common method of providing scalability is to apply subband decomposition on
the video sequences. Thus, the full resolution video can be decoded using both the low
pass and high pass subbands, while half resolution video can be decoded using only the
low pass subband. The resulting half resolution video sequences can be passed through
further subband decomposition to create quarter resolution video, and so on. We will
use this concept in chapter 4.
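For instance, a half-resolution version of a frame can be obtained from the LL subband alone. The sketch below (a toy example with Haar filtering, not the layered codec of chapter 4) shows that, up to a scale factor, the LL subband is simply a 2×2 block-averaged version of the frame:

    import numpy as np

    rng = np.random.default_rng(3)
    frame = rng.integers(0, 256, (8, 8)).astype(float)

    # One Haar analysis step, rows then columns (as in section 2.3.1), keeping only LL.
    lo = (frame[:, 0::2] + frame[:, 1::2]) / np.sqrt(2)
    LL = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)

    # A half-resolution client decodes the LL subband alone: divided by 2 it equals
    # the 2x2 block averages of the original frame.
    half_res = LL / 2.0
    print(half_res.shape)                                  # (4, 4)
    print(np.allclose(half_res,
                      frame.reshape(4, 2, 4, 2).mean(axis=(1, 3))))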
2.4 Image and Video Coding Standards
As international organizations, ISO/IEC and ITU-T have been heavily involved in the
standardization of image, audio and video coding. Specifically, ISO/IEC focuses on
video storage, broadcast video and video streaming applications, while ITU-T caters to
real time video applications. Current video standards mainly comprise the ISO
MPEG family and the ITU-T H.26x family.
these standards and their applications. JPEG and JPEG 2000 are also listed as still
image coding standards for reference.
Standard   | Application                                                             | Bit Rate
JPEG       | Still image compression                                                 | Variable
JPEG 2000  | Improved still image compression                                        | Variable
MPEG-1     | Video on CD                                                             | 1.5 Mbps
MPEG-2     | Digital television, video on DVD                                        | 2-20 Mbps
MPEG-4     | Object-based coding, interactive video                                  | 28-1024 kbps
H.261      | Video conferencing over ISDN                                            | Variable
H.263      | Video conferencing over Internet and PSTN, wireless video conferencing  | >= 33 kbps
H.26L      | Improved video compression                                              | 10-100 kbps
Table 2.1 Image and video compression standards
CHAPTER 3
VIDEO STREAMING AND NETWORK QoS
In this chapter some fundamentals of video streaming are provided. Network
Quality of Service (QoS) is defined and two frameworks, i.e., the Integrated Services
(IntServ) and the Differentiated Services (DiffServ), are discussed in detail. Finally,
the principles of layered video streaming are presented.
3.1 Video Streaming Models
Unicast and multicast are the two models of video streaming. Unicast is the
communication between a single sender and a single receiver. As shown in Fig. 3.1,
the sender sends individual copies of video streams to each client even when some of
the clients require the same video resource. Unicast is also called point-to-point
communication because there is effectively a dedicated, non-shared connection from the
server to each client.
Fig. 3.1 Unicast video streaming (the server sends a separate copy of the encoded stream to each client)
By contrast, communication between a single sender and multiple receivers is called
multicast or point-to-multi-points communication. In multicast scenario (Fig. 3.2), the
sender sends only one copy of the required video over the network. It is then routed to
several destinations by the network switches or routers. A client receives the video
stream by tuning in to a multicast group in its neighborhood. When the clients belong to
multiple groups, the video is duplicated and branched at fork points, as shown at
router R1 (Fig. 3.2).
Fig. 3.2 Multicast video streaming (a single copy from the server is duplicated and branched at routers R1-R4 on its way to the clients)
3.2 Characteristics and Challenges of Video Streaming
Video streaming is a real time application. Unlike traditional data-oriented applications
such as email, ftp, and web browsing, video streaming applications are highly
delay-sensitive and need the data to arrive on time to be useful. As such, the service
requirements of video streaming applications differ significantly from those of
traditional data-oriented applications. To satisfy these requirements is a great challenge
under today’s Internet.
First of all, the best effort (BE) service that the current Internet provides is far from
sufficient for a real time application such as video streaming. Under the BE service model,
there is no guarantee on delay bound or loss rate. When the data load on the Internet is
heavy, delivery quality can become unacceptable. On the other hand, video streaming
requires timely and, to some extent, correct delivery. We must ensure that streaming
remains viable, with decreased quality, in times of congestion.
Second, client machines on the Internet normally vary significantly in their computing,
display and memory capabilities. In most cases, these heterogeneous clients will
require video of different qualities. It is obviously inefficient to deliver the same
video stream to all clients separately. Instead, streaming in response to the particular
requests of each individual client is desirable.
In conclusion, suitable coding and streaming strategies are needed to support
efficient real time video streaming. Scalable video coding and new network service
models that support network Quality of Service have been developed to address the
above challenges. We discuss QoS in detail in the next section.
3.3 Quality of Service
3.3.1 Definition of QoS
The current Internet provides one single class of service, the BE service. BE service
generally treats all data as one service class and gives no priority to any particular data
or user. This is not enough for new real time applications such as video streaming.
The Internet must be modified to provide more service options, which can, to some
extent, keep the service quality up to a level that has been previously agreed on
by the user and the network. This new service model is enabled by support for network
Quality of Service.
Quality of Service (QoS) is the ability of a network element, such as the video
streaming server/client and the switches/routers in the network, to have a certain level
of assurance that its traffic and service requirements can be satisfied [31] [32]. QoS
enables a network to deliver a data flow end-to-end with a guaranteed delay bound and
bit rate required by user applications.
There are generally two approaches to supporting QoS. One is the fine-grained approach,
which provides QoS to individual applications or flows; the other is the coarse-grained
approach, which aggregates traffic into classes and provides QoS to these classes. The
Internet Engineering Task Force (IETF) has developed QoS frameworks through both
approaches, namely the Integrated Services (IntServ) framework as an example of the
fine-grained approach, and the Differentiated Services (DiffServ) framework as an
example of the coarse-grained approach.
3.3.2 IntServ Framework
IntServ is a per-flow based QoS framework with dynamic resource reservation [31]
[32]. It provides QoS for specific flows by reserving resource requested by that flow
through signaling between the host and network routers using the Resource
Reservation Protocol (RSVP) [33]. As shown in Fig. 3.3, RSVP acts as a messenger
between a particular application flow and various network mechanisms. These
mechanisms are individual functional blocks that work together with the RSVP to
determine or assist in the reservation. The packet classifier classifies a packet into an
appropriate QoS class. Policy control is then used to examine the packet to see whether
it has administrative permission to make the requested reservation. For final
reservation success, however, admission control must also be passed to ensure that the
desired resources can be granted without affecting the QoS previously requested by
and admitted to other flows. Finally, the packet is scheduled to enter the network at a
proper time by the packet scheduler in the primary data forwarding path.
Fig. 3.3 IntServ architecture (the application on the host and each router implement resource reservation together with policy control, admission control, a packet classifier and a packet scheduler; routing resides in the routers)
The IntServ architecture adds two service classes to the existing BE model, i.e.,
guaranteed service and controlled load service.
The basic concept of guaranteed service can be described using a linear flow regulator
called the leaky bucket regulator (Fig. 3.4). Suppose the bucket can hold b tokens and
new tokens are added at a rate of r tokens/sec. Packets arrive at the regulator at a
variable rate, but they must wait at the input queue until an equivalent number of
tokens is available before they can proceed into the network. Such a regulator therefore
allows flows with a maximum burst of b tokens and an average rate of r tokens/sec to
pass; it confines the traffic entering the network to at most b + rt tokens over any
interval of t seconds.
Fig. 3.4 Leaky bucket regulator (packets arriving at a variable rate wait in the input queue; tokens fill a bucket of depth b at r tokens/sec, and a token is removed for each packet admitted to the network)
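A minimal sketch of such a regulator (with hypothetical parameter values, purely to illustrate the b + rt bound) is shown below:

    class TokenBucket:
        # Toy token-bucket regulator: bucket depth b tokens, refill rate r tokens/sec.
        # Over any interval of t seconds at most b + r*t tokens' worth of data passes.
        def __init__(self, b, r):
            self.capacity = b
            self.rate = r
            self.tokens = b
            self.last = 0.0

        def admit(self, size, now):
            # Add the tokens earned since the last call, capped at the bucket depth.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if size <= self.tokens:        # enough tokens: the packet may proceed
                self.tokens -= size
                return True
            return False                    # otherwise it waits in the input queue

    bucket = TokenBucket(b=10, r=5)         # hypothetical parameters
    print([bucket.admit(4, t) for t in (0.0, 0.1, 0.2, 1.0)])   # [True, True, False, True]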
To invoke the service, a router needs to be informed of the traffic and the reservation
characteristics, denoted by Tspec and Rspec respectively.
Tspec contains the following parameters:
• p = peak rate of flow (bytes/s)
• b = bucket depth (bytes)
• r = token bucket rate (bytes/s)
• m = minimum policed unit (bytes)
• M = maximum datagram size (bytes)
Rspec contains the following parameters:
• R = bandwidth, i.e., service rate (bytes/s)
• S = slack term (ms)
Guaranteed service promises a maximum delay for a flow, provided that the flow
conforms to its specified traffic parameters. This service model aims to support
applications with hard real time requirements.
Unlike guaranteed service, controlled-load service provides no rigid delay or loss
guarantees. Instead, it provides a QoS similar to BE service in an under-utilized
network, with almost no loss or delay. When the network is overloaded, it tries to share
the bandwidth among multiple streams in a controlled way so as to maintain
approximately the same level of QoS. Controlled-load service is intended to support
applications that can tolerate a reasonable amount of delay and loss.
3.3.3 DiffServ Framework
IntServ provides fine-grained QoS guarantees using the Tspec message. However,
introducing Tspec for each flow may be too expensive in implementation. Besides,
incremental deployment is only possible for controlled-load service, while it is difficult
to realize guaranteed service across the network. Therefore, there is a need for more
flexible service models to allow for more qualitative definitions of service distinctions.
The solution is DiffServ, which aims to develop an architecture for providing scalable and
flexible service differentiation.
Generally, the DiffServ architecture comprises 4 key concepts:
• DiffServ Domain;
• Per Hop Behaviors (PHB) for forwarding;
• Packet classification;
• Traffic conditioning, including metering, marking, shaping and policing.
DiffServ exploits edge-core distinction for scalability. As shown in Fig. 3.5, packet
classification and traffic conditioning are done at the edge routers, where packets are
marked in their Differentiated Services field (DS field) to identify the behavior aggregate
they belong to. At the core routers, forwarding is done very quickly according to the
PHB associated with the DS mark.
Fig. 3.5 An example of the DiffServ network (leaf routers at the edge of the domain, core routers inside it)
It is necessary that we clearly define PHB here for further understanding of the
DiffServ framework. A PHB is the description of the externally observable forwarding
behavior of a DiffServ node applied to a particular DiffServ behavior aggregate [34].
To date, the IETF has defined 2 PHBs, i.e., the Expedited Forwarding (EF) PHB
and the Assured Forwarding (AF) PHB. The EF PHB promises a service like a "virtual
leased line" or "premium service". The AF PHB provides several class categories, with
each class allocated a different level of forwarding assurance. These assurances in turn
guarantee that a packet will be forwarded within the timeframe previously agreed on.
Fig. 3.6 DiffServ inter-domain operations (traffic flows from the source through a leaf router, the ingress router, the core routers and the egress router of the DiffServ domain to the destination; a Bandwidth Broker (BB) manages each domain)
Between different domains, Bandwidth Brokers (BB) are used to negotiate service
agreements (Fig. 3.6). The leaf router polices and marks incoming flows, and the
egress edge router shapes aggregates. In the DiffServ domain, the ingress edge router
is used to classify, police and mark the aggregates. BB performs admission control,
manages network resources and configures leaf and edge routers.
3.4 Layered Video Streaming
As discussed above, the current Internet is unreliable, which contradicts the QoS
requirements of video streaming applications. In addition, the heterogeneity of
receivers makes it difficult to achieve efficiency and flexibility in multicast video
streaming. To address these problems, scalable video streaming [35]-[37] is introduced.
Scalable video streaming comprises two related parts: scalable video coding and
adaptive QoS support from the network and end systems.
One approach to accomplish scalability is to split and distribute the video information
over a number of layers, including a base layer and several enhancement layers. This
technique is referred to as layered video coding [38]-[44]. Layered video coding
produces multiple bit streams that can be progressively decoded. It is especially
suitable for receiver-driven layered multicast, where the clients consciously select the
layers they need from a group of common stream layers, and combine them into
different bit streams for the decoder to produce video of different quality.
Layered scalable video coding was first suggested by Ghanbari [45] and has since been
widely studied. Early groundwork includes Taubman and Zakhor's multi-rate 3-D
subband coding [46], and Vishwanath and Chou's combined wavelet and Hierarchical
Vector Quantization (HVQ) coder [47] for interactive multicast network scenarios.
Khansari et al. modified the H.261 (p×64) standard algorithm and proposed an
approach to encoding video data into two priority streams, thereby enabling the
transmission of video data over wireless links to be switched between two bit rates
[48]. McCanne et al. developed a hybrid DCT/wavelet-based video coder and
introduced the concept of Receiver-driven Layered Multicast (RLM) into the literature
[49]. MPEG-4 based layered video coding has also been studied, such as the MPEG-4
video coder in [50], which uses a prediction-based base layer and a fine-granular
enhancement layer.
Layered scalable video coding makes it possible to represent the raw video in a layered
format, which can be progressively transmitted and decoded. Fig. 3.7 is an illustration
of such a layered codec.
Fig. 3.7 Principle of a layered codec (the layered encoder produces several layer streams; each layered decoder combines the base layer with the enhancement layers it receives to refine the reconstructed video)
Network QoS support is another key mechanism in scalable video streaming.
Principles of QoS have been discussed in detail in the previous sections. Particularly,
layered scalable video streaming is used to realize a better allocation of bandwidth in
time of congestion. The basic idea of layered streaming is to encode raw video into
multiple layers of cumulative bit streams that can be separately transmitted and
progressively decoded. Packets from different layers will be marked with different
drop priorities in the network. Clients consciously select different combinations of
layers to receive a video of the quality they prefer. Layered multicast streaming will be
discussed in detail in chapter 4.
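Because the layers are cumulative, an enhancement layer is only useful if all lower layers have arrived; the small sketch below (hypothetical, not part of the codec) captures this rule:

    def usable_layers(received):
        # Cumulative layering: an enhancement layer can only be decoded if every
        # lower layer, starting from the base layer 0, has also been received.
        count = 0
        for layer in sorted(received):
            if layer != count:
                break
            count += 1
        return count

    # Layer 1 was dropped in the network, so layers 2 and 3 are useless to the client
    # and only the base layer can be decoded.
    print(usable_layers({0, 2, 3}))    # -> 1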
CHAPTER 4
LAYERED 3D-CSPIHT CODEC
In this chapter, we propose a layered multi-resolution scalable video codec based on
the 3-D Color Set Partitioning in Hierarchical Trees (3D-CSPIHT) scheme, called
layered 3D-CSPIHT. The original CSPIHT and 3D-CSPIHT video coder are
introduced and limitations with regard to layered multicast video streaming are
discussed before presenting the layered 3D-CSPIHT codec.
4.1 CSPIHT and 3D-CSPIHT Video Coder
To effectively code color images, SPIHT has been extended to Color SPIHT (CSPIHT).
Like the SPIHT scheme, the CSPIHT is essentially an algorithm for sorting wavelet
coefficients across subbands. The coefficients in the transform domain are linked by an
extended SOT structure that is then partitioned such that the coefficients are divided
into sets defined by the level of the most significant bit in a bit-plane representation of
their magnitudes [51]. Significant bits are coded with higher priority under certain bit
budget constraints, thus creating a rate controllable bit stream. In the luminance plane,
the SPIHT algorithm is used to define the SOT while in chrominance planes, the EZW
structure is adopted. To efficiently link the nodes in the luminance plane to those in the
chrominance planes and still produce an embedded bit stream, the childless nodes in
the luminance plane are linked to the root nodes in the chrominance planes. Fig. 4.1
depicts the SOT structure of the CSPIHT for image coding.
Fig. 4.1 CSPIHT SOT (2-D): the Y plane uses the SPIHT structure while the Cb and Cr planes use the EZW structure
To code video sequences, the CSPIHT integrates block motion estimation into the
coding system. The overall CSPIHT structure includes four functional blocks, namely,
the DWT, the block motion estimation scheme, the CSPIHT kernel and the
binary/arithmetic coding module (Fig. 4.2).
Fig. 4.2 CSPIHT video encoder (the intra-frame passes through the DWT, the CSPIHT coding kernel with rate-control feedback, and binary/arithmetic coding to produce the bit stream; inter-frames pass through block MEMC, the resulting error frames are coded in the same way, and the motion vectors are Huffman coded into the motion vector stream)
The incoming video sequences are separately coded with the first frame coded under
the intra-frame mode and the rest under the inter-frame mode. The intra-frame is
passed directly to the 2-D DWT for filtering. The transformed image first undergoes
encoding in the CSPIHT kernel and then goes through binary or arithmetic coding to
produce encoded bit streams. All the inter-frames, however, are passed into block
MEMC, resulting in a series of error frames and motion vectors. The error frames are
then coded following exactly the intra-frame coding procedure and the motion vectors
are coded using Huffman coding scheme to reduce statistical redundancy.
The transmitted bits are constantly counted and updated to the CSPIHT kernel via the
coding feedback (Fig. 4.2). The CSPIHT kernel compares this information against a user-given bit budget and stops the coding procedure when the budget is used up. This allows for precise rate control. If the bit budget is not exhausted by the time the bit plane reduces to zero, the coding halts automatically. The corresponding CSPIHT video decoder is
illustrated in Fig 4.3.
Fig. 4.3 CSPIHT video decoder (bit stream into binary/arithmetic decoding, inverse CSPIHT and inverse DWT; the motion vector stream is Huffman decoded and used for MEMC prediction to produce the video out)
The CSPIHT video coder has also been extended to 3-D for the compression of video
sequences. 3D-CSPIHT encoder takes a group of frames (GOF), which normally
consists of 16 frames, as input. 3-D DWT will then be applied, where we do temporal
filtering in addition to horizontal and vertical filtering at each stage of the wavelet
decomposition. Finally, the wavelet coefficients will be linked using a Spatial
Temporal Orientation Tree (STOT) (Fig. 4.4) and coded in the 3D-CSPIHT kernel.
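To illustrate the separable temporal plus spatial filtering described above, the following is a toy one-level Haar sketch in Python (assuming a NumPy array for the GOF; it is not the actual filter bank or implementation used in this work):

    import numpy as np

    def haar_1d(x, axis):
        # One level of Haar analysis along one axis: low (average) and high (difference) bands
        even = x.take(np.arange(0, x.shape[axis], 2), axis=axis)
        odd = x.take(np.arange(1, x.shape[axis], 2), axis=axis)
        return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

    def dwt3d_one_level(gof):
        # One stage of the 3-D DWT: filter along time (axis 0), rows (axis 1) and columns (axis 2)
        bands = [gof]
        for axis in (0, 1, 2):
            bands = [b for s in bands for b in haar_1d(s, axis)]
        return bands  # eight subbands; the first is the lowest (LLL) subband

    gof = np.random.rand(16, 144, 176)  # a 16-frame QCIF GOF (one color plane)
    print([b.shape for b in dwt3d_one_level(gof)])  # eight subbands of shape (8, 72, 88)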
Fig. 4.4 3D-CSPIHT STOT
The 3-D STOT is obtained by a straightforward extension of the 2D-CSPIHT SOT to
3-D. The roots of the luminance STOT consist of 2×2×2 cubes in the lowest subband
where there is one childless node. This childless node is then linked as in the CSPIHT
to the roots of the chrominance STOTs.
Denoting the coordinates of each coefficient by (i,j,k), the previously defined SPIHT
set notation is extended as follows:
• O(i,j,k) represents the set containing all the offspring of node (i,j,k);
• D(i,j,k) represents the set containing all the descendants of node (i,j,k);
• H represents the set of all the STOT roots;
• L(i,j,k) = D(i,j,k) − O(i,j,k).
We can then express the 3-D descendant branching rule using equation (4.1):

O(i,j,k) = { (2i,2j,2k), (2i,2j+1,2k), (2i,2j,2k+1), (2i,2j+1,2k+1), (2i+1,2j,2k), (2i+1,2j+1,2k), (2i+1,2j,2k+1), (2i+1,2j+1,2k+1) }    (4.1)
The 3D-CSPIHT set partitioning rule is defined as follows:
i) The initial partition is formed with sets {(i,j,k)} and D(i,j,k), for all (i,j,k) ∈ H;
ii) If D(i,j,k) is significant, then it is partitioned into L(i,j,k) plus the eight single-element sets {(l,m,n)} with (l,m,n) ∈ O(i,j,k);
iii) If L(i,j,k) is significant, then it is partitioned into the eight sets D(l,m,n) with (l,m,n) ∈ O(i,j,k).
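As a small illustration of the branching rule in equation (4.1), and of how a set D(i,j,k) splits into its offspring in rules ii) and iii), the following Python sketch (an illustrative helper, not part of the original implementation) generates the offspring coordinates of a node:

    def offspring(i, j, k):
        # O(i,j,k): the eight children of node (i,j,k), listed in the order of equation (4.1)
        return [(2*i + di, 2*j + dj, 2*k + dk)
                for di in (0, 1) for dk in (0, 1) for dj in (0, 1)]

    # Example: the eight offspring of node (1, 1, 1)
    print(offspring(1, 1, 1))
    # [(2, 2, 2), (2, 3, 2), (2, 2, 3), (2, 3, 3), (3, 2, 2), (3, 3, 2), (3, 2, 3), (3, 3, 3)]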
Following the above extensions, the 3D-CSPIHT encoder (Fig. 4.5) and decoder (Fig.
4.6) are very similar to the CSPIHT encoder and decoder.
Fig. 4.5 3D-CSPIHT video encoder (video in, 3-D DWT, 3-D CSPIHT coding kernel, binary/arithmetic coding, bit stream; rate control is driven by a feedback)
Fig. 4.6 3D-CSPIHT video decoder (bit stream in, binary/arithmetic decoding, 3-D CSPIHT decoding kernel, inverse 3-D DWT, video out)
4.2 Limitations of the Original 3D-CSPIHT Codec
The 3D-CSPIHT video codec provides satisfactory results for color image and video coding in terms of PSNR, but there are some limitations which prevent it from being directly incorporated into layered video streaming systems.
First of all, the original 3D-CSPIHT aims at precise rate control and sorting of
coefficients terminates when a user defined bit budget is used up. The decoder must
know the same bit budget information to decode a particular stream. For video
reconstruction at a different bit-rate, the encoder needs to run again with the desired bit
budget. However, in layered streaming systems, it is not appropriate to control bit rate
by re-encoding. Instead, the encoder is expected to produce a fully coded video stream.
Video quality is controlled at the client side by subscribing to multicast groups that
carry the desirable layers.
Fig. 4.7 Confusion when decoding incomplete data using the original 3D-CSPIHT decoder (top: the stream at the server, GOF1 to GOF4 each with a block header and data; bottom: the stream at the decoder, where lost data in GOF2 turns all subsequent data into confused data)
Furthermore, as discussed in chapter 3, a layered bit stream is assumed for layered
video streaming. We achieve this in the layered 3D-CSPIHT codec by re-sorting the
wavelet coefficients with different priorities according to pre-defined resolution layers
[52]-[56].
Finally, since data loss is very possible over the network, the decoder must be able to
decode differently truncated versions as well as incomplete versions of the original
encoded stream (stream before network transmission). Fig. 4.7 illustrates the confusion
when decoding incomplete data under the original 3D-CSPIHT scheme. The first
stream in Fig. 4.7 is the encoded video data sent from the network server, and the
second is the one that arrives at the decoder. As shown, the loss of one packet in GOF2
renders the decoder unable to correctly decode all data that arrive after it. To overcome
this problem, additional flags are needed to enable the decoder to identify the
beginning of new layers and GOFs.
4.3 Layered 3D-CSPIHT Video Codec
In this section we present the Layered 3D-CSPIHT codec which overcomes the
limitations of the 3D-CSPIHT highlighted above so that it can be applied to a layered
video transmission system. New features of the layered codec and the layer ID
functions are presented. The production of the multi-resolution scalable bit stream is
then discussed, and how the layered codec functions in the network is explained.
Finally, in the last sub-section, the layered 3D-CSPIHT algorithm is provided.
4.3.1 Overview of New Features
Our main consideration is how to cooperate with the network elements and enable the
decoder to decode incomplete bit streams. We suppose the video streaming system is real-time and multicast (Fig. 4.8).
Fig. 4.8 Network scenario considered for design of the layered codec (the sender/encoder feeds a server that multicasts layers 1 to 7; heterogeneous receivers such as a PDA, a laptop PC and a desktop PC each run a receiver, decoder and player and subscribe to different sets of layers)
Layered multicast enables receiver-based subscription of multicast groups that carry
incrementally decodable video layers. In a layered multicast system, the server sends
only one video stream and the receivers/clients can choose different resolutions, sizes,
and frame rates for display, resulting in different bit rates. As shown in Fig 4.8, the
encoder is executed offline. The resulting stream is separated into layers according to
resolution and stored separately so that they can be sent over different multicast groups.
The server must make various decisions, including the number of layers to be sent and the layers to be discarded if the bandwidth is not adequate for all layers. The server is
able to do this because it has information about the network status including the
available bandwidth and the congestion level. Heterogeneous clients subscribe to the
layers they want based on the capacity of the client machines and user requests. Users
may not always want to see the best quality video even if they could because it takes
more time and costs more. In Fig. 4.8 for example, the PDA client only subscribes to
the first layer of each GOF, the laptop PC client subscribes to the first two layers while
the powerful desktop PC client subscribes to all seven layers. Under such a network
scenario, we need a codec which provides layered and progressively decodable streams,
so that the encoding can be executed offline and the clients are able to control their
own video qualities from one copy of fully encoded bit stream. Also, to accommodate
possible loss during network delivery, the decoder needs to know which layer the
incoming data belong to.
Due to the above considerations, we extended the original 3D-CSPIHT codec so that it
can be used in layered video streaming systems by removing the bit budget parameter
so that all coefficients are coded and by incorporating the following new features:
• A new sorting algorithm that produces resolution/frame rate scalable bit streams in a layered format;
• A specially designed layer ID in the encoded bit stream that identifies the layer that a particular data packet belongs to.
4.3.2 Layer IDs
The layer ID is designed to act as a layer identifier, which tells the beginning positions
of new layers. It must be unique and result in minimal overhead for any combination
of video data. Synchronization bits consisting of k consecutive ‘1’s, i.e., ‘1111…11’
are introduced as the layer ID at the beginning of each layer.
Fig. 4.9 The bit stream after the layer ID is added (a header is followed by the k-bit layer ID '11...1', the layer 1 data, another layer ID, the layer 2 data, and so on; within the data, a protecting '0' is added after every k-1 consecutive '1's)
To make the ID unique, occurrences of k consecutive ‘1’s in the video data stream are
extended by inserting a ‘0’ bit or layer ID protecting bit after k -1 ‘1’s so that the
sequence becomes ‘1111…101’. If the data bit after k-1 consecutive ‘1’s is ‘0’, an
additional ‘0’ will still be added to protect the original ‘0’ from being removed at the
decoder (Fig. 4.9). Once the video stream is received at the decoder, layer ID
protecting bits are removed from occurrences of ‘1111…10’ while conducting normal
3D-CSPIHT decoding. A good value of k is one that results in the smallest overhead: if k is too large, the layer ID itself becomes a large overhead, while if it is too small, many protecting '0's have to be inserted. From simulations that examined the resulting compression ratios for different reasonable values of k, k=8 was found to be a good choice.
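A minimal sketch of this layer ID protecting step is given below (the stream is treated as a list of bits and the function names are illustrative; this is not the thesis implementation):

    K = 8  # length of the layer ID '11111111'

    def stuff(data_bits):
        # Insert a protecting '0' after every K-1 consecutive '1's in the data
        out, run = [], 0
        for b in data_bits:
            out.append(b)
            run = run + 1 if b == 1 else 0
            if run == K - 1:
                out.append(0)  # keep the K-bit layer ID unique
                run = 0
        return out

    def unstuff(data_bits):
        # Remove the protecting '0' that follows every K-1 consecutive '1's
        out, run, skip = [], 0, False
        for b in data_bits:
            if skip:               # this is the inserted protecting bit; drop it
                skip, run = False, 0
                continue
            out.append(b)
            run = run + 1 if b == 1 else 0
            if run == K - 1:
                skip = True
        return out

    bits = [1] * 10 + [0, 1, 0]
    assert unstuff(stuff(bits)) == bits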
Different layers use the same ID by maintaining an ID counter that counts the number
of times the layer ID is captured. For example, the decoder knows that the successive
data belongs to layer 3 when it detects the layer ID for the third time. ID counter is
reset to zero as soon as it reaches the maximum number of layers. Our layered codec
has 7 layers as it uses 3-level spatial and temporal wavelet transform.
In a congestion adaptive network, the network elements can consciously select less
important data to discard when congestion occurs. In our proposed codec, different
layers have different priorities according to their resolution. The Layer ID is essential
as it enables network elements and decoders to identify the beginning position of a
new layer or the boundary between two layers. Knowing the layer boundaries not only
enables decoding of video streams that have experienced packet loss, but also helps QoS
marking when unicast streaming is needed.
4.3.3 Production of Multi-Resolution Scalable Bit Streams
As stated in chapter 3, layered video streaming systems demand a layered video codec.
This is achieved in the layered 3D-CSPIHT video coder by re-arranging the wavelet
coefficients to produce multi-resolution scalable bit streams. The encoded bit stream is
divided into progressively decodable layers according to their resolution levels.
Fig. 4.10 depicts the relationship of the subbands and the layers for one GOF. The total
of 22 subbands are divided into 7 layers that are coded in the order of layer 1, layer 2,
up to layer 7. Subbands of the same pattern are aggregated as one layer. Fig. 4.11
shows how the layers are sent over different multicast groups. As more layers are
added on, the video quality will improve in both spatial and temporal resolution. Table
4.1 shows the 7 resolution options which provide different combinations of spatial
resolutions and frame rates.
To re-sort the transform coefficients according to layer priorities, the significance
criteria of the original 3D-CSPIHT needs to be restricted. In the original 3D-CSPIHT,
there are four types of significance criteria:
Fig. 4.10 Resolution layers in the layered 3D-CSPIHT (the subbands of one GOF grouped into layers 1 to 7)

Fig. 4.11 Progressively transmitted and decoded layers (layers 1 to 7 of each GOF are sent over multicast groups 1 to 7)

Layers            Spatial resolution    Frame rate
1                 Low                   Low
1+2               Medium                Low
1+3               Low                   Medium
1+2+3+4           Medium                Medium
1+2+3+4+5         High                  Medium
1+2+3+4+6         Medium                High
All 7 layers      High                  High
Table 4.1 Resolution options
i) A node in the LIP list is significant if its magnitude is larger than the current threshold;
ii) A type A entry is significant if one of its descendants is larger in magnitude than the current threshold;
iii) A type A offspring is significant if its magnitude is larger than the current threshold;
iv) A type B entry is significant if one of its indirect or non-offspring descendants is larger in magnitude than the current threshold.
In the layered 3D-CSPIHT, a set entry is significant when it satisfies the above criteria
and when it is in an effective subband, i.e., a subband that is currently being coded.
However, the significance criteria for individual nodes remain the same as in the
original 3D-CSPIHT, because previous subband effectiveness checks on the LIS sets
have prevented any non-effective node from being selected as future significant node
candidates.
We now compare the original 3D-CSPIHT and the layered 3D-CSPIHT sorting on an
example video frame to derive the extended significance criteria. For simplicity, we
suppose that the maximum bit plane is 2.
Fig. 4.12 (a) shows a decomposed video frame using a 3-level wavelet transform. The DWT divides the 16 × 16 image into seven subbands, which comprise three layers, shown in different colors. Except for the childless root A, the roots B, C and D and their descendants are organized into three SOTs according to the 3D-CSPIHT descendant branching rule (Fig. 4.12 (b)).
Fig. 4.12 (a) An example video frame after the DWT transform (a 3-level decomposition with roots A, B, C and D in the lowest subband; legend: A* is the childless node, separate markers indicate nodes significant at bit plane 2 and at bit plane 1, {D11} denotes D111, D112, D113, D114, and the colors indicate layer 1, layer 2 and layer 3)

Fig. 4.12 (b) SOT for Fig. 4.12 (a) (root B with offspring B1 to B4, their offspring B11 to B44, and the leaf sets {B11} to {B44})
At initialization of the original 3D-CSPIHT algorithm, nodes from the lowest subband
(i.e., A, B, C and D) are added to the LIP list and those with descendants (i.e., B, C, D)
are added to the LIS list as type A entries. The bit plane is set to 2. Sorting then begins
with significance checks on the LIP nodes. As assumed in Fig. 4.12 (a), only node B is
significant at bit plane 2, therefore B is moved to the LSP list. The LIP sorting then
terminates because no other significant nodes are present.
In the LIS sorting, we first determine the significance of type A entries B, C and D.
There are significant descendants for all of them, therefore their offspring B1, B2 up to
D4 are coded and moved to the LIP or LSP list accordingly. Meanwhile, nodes B, C
and D are moved to the end of the LIS list as type B entries. The processing of LIS
entries continues with significance checks on these newly added type B entries (i.e.,
nodes B, C and D). As entry D has no significant non-offspring descendants, it remains
in the list, while B and C are removed and their offspring are added to the end of the
list as new type A entries. These entries are processed recursively as above until no
significant entry is present in the LIS list. Table 4.2 shows the initial state and the final
state of the LIP and LIS sorting at bit plane 2. We underline an entry to indicate that it
is a type B entry, and use ‘~’ to substitute nodes or entries in the previous column of
the same list.
              Initialization      After LIP sorting      After LIS sorting
LIP           ABCD                ACD                    ~B3C1C2C3D1D2D3B21B23B24B42C42C44
LIS           BCD                 ~                      DB3C1C2C3B14B21B23B24B42C42C43C44
LSP           Φ                   B                      BB1B2B4C4D4B11B12B13B14B22B41B43B44C41C43{B11}{B12}{B13}{B22}{B41}{B43}{B44}{C41}
Table 4.2 LIP, LIS, LSP state after sorting at bit plane 2 (original 3D-CSPIHT)
To continue the sorting, bit plane is reduced by 1 and the same process is carried out in
the LIP and LIS lists. Sorting terminates when the bit plane is reduced to 0. The sorting
results at bit plane 1 and 0 are shown in Table 4.3 and Table 4.4.
              After LIP sorting                          After LIS sorting
LIP           DB3C3D2D3B23C44                            ~C13
LIS           DB3C1C2C3B14B21B23B24B42C42C43C44          DB3C3B14B23C42C43C44C12C13
LSP           ~ACC1C2D1B21B24B42C42                      ~C11C12C14C21C22C23C24{B21}{B24}{B42}{C11}{C14}{C21}{C22}{C23}{C24}
Table 4.3 LIP, LIS, LSP state after sorting at bit plane 1 (original 3D-CSPIHT)
              After LIP sorting                          After LIS sorting
LIP           Φ                                          Φ
LIS           DB3C1C2C3B14B21B23B24B42C42C43C44          Φ
LSP           ~DB3C3D2D3B23C44                           ~B31-B44C31-C44{B14}{B23}{C42}{C44}{C12}{C13}D11-D44{B31}-{B34}{C31}-{C34}{D11}-{D44}
Table 4.4 LIP, LIS, LSP state after sorting at bit plane 0 (original 3D-CSPIHT)
There is no layering in the original 3D-CSPIHT algorithm; therefore, the above
described sorting process does not conduct checks of subband effectiveness. In the
layered 3D-CSPIHT, however, subband effectiveness checks are necessary to confine
sorting within an effective subband.
At initialization, layer 1 is set as the effective layer. In the LIP sorting, only node B has
significant descendants, and the descendants are in the currently effective layer (layer
1). Therefore, node B is moved to the LSP list. The LIS sorting then begins by
examining type A entries (i.e., B, C and D) in the LIS list. Each of them has significant
descendants in layer 1, so they are coded as in the original 3D-CSPIHT. When
examination of type B entries is conducted, however, it is found that although
significant non-offspring descendants are present for entry B and C, none of them
resides in the current effective layer. Therefore, the final result of LIS sorting at bit
plane 2 is different from that of the original 3D-CSPIHT. (Table 4.5)
              Initialization      After LIP sorting      After LIS sorting
LIP           ABCD                ACD                    ACDB3C1C2C3D1D2D3
LIS           BCD                 ~                      BCD
LSP           Φ                   B                      BB1B2B4C4D4
Table 4.5 LIP, LIS, LSP state after sorting at bit plane 2 (Layered 3D-CSPIHT, layer 1 effective)
The bit plane is then reduced by 1 and the sorting continues. Significant LIP nodes are
now found effective and moved to the LSP list as in the original 3D-CSPIHT, while
significant LIS nodes remains non-effective, resulting in no change in the LIS list
(Table 4.6). At bit plane 0, all LIP nodes and no LIS nodes are identified as significant
(Table 4.7).
              After LIP sorting              After LIS sorting
LIP           DB3C3D2D3                      ~
LIS           BCD                            ~
LSP           BB1B2B4C4D4ACC1C2D1            ~
Table 4.6 LIP, LIS, LSP state after sorting at bit plane 1 (Layered 3D-CSPIHT, layer 1 effective)
              After LIP sorting                       After LIS sorting
LIP           Φ                                       Φ
LIS           BCD                                     ~
LSP           BB1B2B4C4D4ACC1C2D1DB3C3D2D3            ~
Table 4.7 LIP, LIS, LSP state after sorting at bit plane 0 (Layered 3D-CSPIHT, layer 1 effective)
              Initialization                      After LIP sorting     After LIS sorting
LIP           Φ                                   Φ                     B21B23B24B42C42C44
LIS           BCD                                 ~                     DB3C1C2C3B1B2B4C4
LSP           BB1B2B4C4D4ACC1C2D1DB3C3D2D3        ~                     ~B11B12B13B14B22B41B43B44C41C43
Table 4.8 LIP, LIS, LSP state after sorting at bit plane 2 (Layered 3D-CSPIHT, layer 2 effective)
The effective layer is then updated to layer 2 and the bit plane is reset to 2. Those non-effective descendants of entry B and C become effective, resulting in the coding of all
significant coefficients in layer 2 (Table 4.8). Coding results of the LIP and LIS sorting
at bit plane 1 and 0 are shown in Table 4.9 and Table 4.10.
              After LIP sorting           After LIS sorting
LIP           B23C44                      B23C44C13
LIS           DB3C1C2C3B1B2B4C4           DB3C3B1B2B4C4C1C2
LSP           ~B21B24B42C42               ~C11C12C14C21C22C23C24
Table 4.9 LIP, LIS, LSP state after sorting at bit plane 1 (Layered 3D-CSPIHT, layer 2 effective)
              After LIP sorting           After LIS sorting
LIP           Φ                           Φ
LIS           DB3C3B1B2B4C4C1C2           ~
LSP           ~B23C44C13                  ~
Table 4.10 LIP, LIS, LSP state after sorting at bit plane 0 (Layered 3D-CSPIHT, layer 2 effective)
From the above analysis, it is clear that any node in a non-effective layer cannot enter the LIP list as a future candidate for significance, because of the previous subband effectiveness checks on the LIS entries. Consequently, subband effectiveness checks are necessary only when determining the significance of type A and type B entries, i.e.:
• A type A LIS entry is significant if at least one of its effective descendants is larger in magnitude than the current threshold;
• A type B LIS entry is significant if at least one of its effective non-offspring descendants is larger in magnitude than the current threshold.
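A toy sketch of these restricted tests is given below (the magnitudes and layer assignments are made up for illustration; the real codec derives them from the wavelet coefficients and subband layout of a GOF):

    # Each candidate descendant is represented as (magnitude, resolution layer).
    descendants_of_entry = [(3, 1), (12, 2), (40, 3)]

    def set_significant(descendants, bit_plane, effective_layer):
        # Layered test: a descendant must exceed the threshold AND lie in the
        # currently effective layer (the original test uses magnitude only).
        threshold = 1 << bit_plane
        return any(mag >= threshold and layer == effective_layer
                   for mag, layer in descendants)

    print(set_significant(descendants_of_entry, 5, effective_layer=1))  # False
    print(set_significant(descendants_of_entry, 5, effective_layer=3))  # True (40 >= 32)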
4.3.4 How the Codec Functions in the Network
The layered 3D-CSPIHT solves the problem mentioned in section 4.2 (Fig. 4.7). Fig.
4.13 is a detailed version of the first stream in Fig. 4.7. Each GOF is re-sorted and
separated into 7 resolution layers. Layer ID is slotted between every two layers at the
encoding stage. As layer ID is designed to be a unique binary code, the decoder can
easily identify it while ‘reading’ the stream. When a packet in layer 6 is lost (dark area
in Fig. 4.13), the decoder will stop decoding the current layer (layer 6) on detecting the
subsequent layer ID. Thus, the confusion in Fig. 4.7 is avoided and correct decoding of
the subsequent layers after the lost packet is realized. If the lost packet is not the last
packet in a layer, the decoder will have to wait until the next layer ID before it can
conduct correct decoding.
In the network, block headers and layer IDs should be marked with the lowest drop
precedence. That is to say, correspondent QoS support from the network is expected to
ensure that layer ID is safely transmitted. The layered 3D-CSPIHT codec relies on the
layer ID to support decoding when corruption or loss of the encoded bit stream occurs.
Therefore, layer ID itself must be protected from corruption or loss. As stated in
chapter 3, a layered video streaming system comprises two related parts: a layered codec and network QoS support. The focus of the layered 3D-CSPIHT codec is to
provide the required layered codec to work with the network providing QoS support. It
is reasonable to make the assumption that the layer ID will be safely transmitted when
certain QoS is guaranteed from the network.
Fig. 4.13 Bit stream structure of the layered 3D-CSPIHT coder (each GOF begins with a block header and contains layers 1 to 7, each preceded by a layer ID; shaded data indicates a lost packet)
Fig. 4.14 is a flowchart of the layered 3D-CSPIHT decoder. Before the normal 3D-CSPIHT sorting is carried out, the incoming stream is inspected to detect the presence of layer ID sequences. Any data between two layer IDs is considered to be in the same layer as the first layer ID. Layer serial numbers are stored in a variable called id_cnt. The decoder switches to the layer identified by the id_cnt value upon detection of each layer ID sequence. When id_cnt is 0, it is assumed that the successive stream belongs to layer 7.
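This switching logic can be sketched as follows (a simplified illustration: the stream is modelled as already-parsed chunks, process_layer is a placeholder for the actual 3D-CSPIHT decoding of one layer, and resetting the counter at each block header is an assumption of this sketch):

    MAX_LAYERS = 7

    def dispatch(chunks, process_layer):
        # Route each data chunk to the layer identified by the running id_cnt counter
        id_cnt = 0
        for kind, payload in chunks:
            if kind == 'block_header':      # start of a new GOF (assumed to reset the counter)
                id_cnt = 0
            elif kind == 'layer_id':        # start of a new layer within the GOF
                id_cnt += 1
                if id_cnt == MAX_LAYERS:    # reset as soon as the maximum number of layers is reached
                    id_cnt = 0
            elif kind == 'data':
                process_layer(MAX_LAYERS if id_cnt == 0 else id_cnt, payload)

    # A GOF whose later packets were lost still decodes the layers that did arrive.
    stream = [('block_header', None), ('layer_id', None), ('data', 'bits of layer 1'),
              ('layer_id', None), ('data', 'bits of layer 2')]
    dispatch(stream, lambda layer, d: print('decoding layer', layer, ':', d))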
When unicast streaming is required, layer IDs also act as signaling to the network routers or switches. Unlike multicast streaming, unicast streaming does not provide multiple channels for data from different layers, and it is very possible that different layers get mixed up during transmission. For example, a layer 1 packet appears no different from a layer 7 packet to a router in the network. On the other hand, in layered streaming algorithms, the network relies on packet marking to provide QoS. Packets from different resolution layers will be marked with different priorities since
they contribute differently to the reconstruction. Hence, it is highly desirable for
different layers in the encoded bit stream to be easily identified by network routers.
Fig. 4.14 Flowchart of the layered decoder algorithm (read in the stream; a block header marks the start of a GOF and a layer ID marks the start of a new layer within one GOF; on each layer ID, id_cnt is incremented, layer [id_cnt] is processed, layer 7 is processed when id_cnt is 0, and id_cnt is reset to 0 once it reaches 7)
In the original 3D-CSPIHT coder, the encoding is transparent to the network. In other
words, once encoded, network elements (e.g., servers, routers and clients) are not able
to know how a particular chunk of bits will contribute to reconstruction of the
compressed video. In the layered 3D-CSPIHT bit stream, the layer IDs are used to
inform the network routers which layer the data being currently processed belong to.
By doing this, the router is able to drop packets according to their layer priorities when
the network is congested.
4.3.5 Layered 3D-CSPIHT Algorithm
Our layered 3D-CSPIHT algorithm is similar to the original 3D-CSPIHT algorithm
except for the following:
i) coefficients are re-sorted by redefining the criterion for a node to be significant;
ii) the layer ID is inserted in the encoded bit stream between consecutive layers;
iii) additional zeros are inserted to protect the uniqueness of the layer ID.
A bit-counter is used to keep track of the number of ‘1’ bits. At initialization stage, the
bit-counter is reset to zero and subbands belonging to layer 1 are marked as effective
subbands. Next, the layered 3D-CSPIHT sorting pass is conducted. Nodes in the LIP
list are coded as in the original 3D-CSPIHT algorithm, while subband effectiveness is
checked when judging significance of entries in the LIS list. As explained in section
4.3.3, subband effectiveness checks are not necessary in the LIP list because the check
done in the LIS list prevents nodes from non-effective subbands from entering the LIP.
A special step, called layer ID protecting, is carried out during the sorting pass in the
layered 3D-CSPIHT whenever a ‘1’ is output to the encoded bit stream. In layer ID
protecting, we increment the bit-counter by 1. When the bit-counter reaches k-1, a '0' will
be added to the encoded bit stream to prevent the occurrence of ‘1111…11’. Also,
layer effectiveness must be updated to the next layer, and layer ID must be written to
the encoded bit stream at end of coding each layer. The entire layered 3D-CSPIHT
algorithm is listed in Fig. 4.15.
1) Initialization:
1.1) Output n = ⌊ log2 ( max_(i,j,k) { |c_(i,j,k)| } ) ⌋;
1.2) Set subbands belonging to layer 1 as effective subbands;
1.3) Set bit-counter to 0;
1.4) Set the LSP, LIP and LIS as empty lists and add coordinates (i,j,k) in the first
subband to the LIP and those with descendants to the LIS as TYPE A entries.
2) Sorting Pass:
2.1) For each entry (i,j,k) in LIP do:
2.1.1) Check for significance (two conditions must both be satisfied);
2.1.2) If significant then
-Output ONE and execute step (4);
-Output sign of ci,j,k;
+If positive, execute step (4);
-Move (i,j,k) to LSP (i.e., add to LSP and remove from LIP);
If insignificant then
-Output ZERO and move to next node in LIP.
2.2) For each entry (i,j,k) in LIS do:
2.2.1) If entry is TYPE A then
2.2.1.1) Output significance of descendents;
2.2.1.2) If one of the descendents is significant then
-Output ONE;
-For each offspring (k,l,m) of (i,j,k) do
+Output significance;
+If significant then
/Output ONE;
/Encode sign bit, if positive, execute step (4);
/Add (k,l,m) to LSP;
+If insignificant, then
/Output ZERO;
/Add (k,l,m) to end of LIP;
-If further descendents are present, then
+Move (i,j,k) to end of LIS as TYPE B entry;
+Remove (i,j,k) from LIS;
+Go to step (2.2.2);
2.2.1.3) If no descendant is significant then
-Output ZERO and move to next node in LIS;
2.2.2) If the entry is TYPE B then
2.2.2.1) Output significance;
2.2.2.2) If significant then
-Output ONE and execute step (4);
-Add all offspring of (i,j,k) to end of LIS as TYPE A ;
-Remove (i,j,k) from LIS;
2.2.2.3) If insignificant then
-Output ZERO and move to next node in LIS;
3) Refinement Pass:
For each entry (i,j,k) in LSP, except those from the last sorting pass:
-Output the nth most significant bit of ci,j,k;
4) Layer ID Checking:
Increment bit-counter by 1;
If bit-counter is k-1 then
-Output ZERO and reset bit-counter to ZERO;
5) Quantization-Step Update:
Decrement n by 1 and go to step (2).
6) Layer Effectiveness Update and Layer ID Writing:
If n=0, update effective subbands to the next layer, write layer ID and go to step (1).
Fig. 4.15 Layered 3D-CSPIHT Algorithm
CHAPTER 5
PERFORMANCE DATA
In this chapter we present performance data of the layered 3D-CSPIHT video coder.
Performance measurements for image and video coding are introduced and the 3D-CSPIHT video coder is evaluated in terms of PSNR, encoding time, and compression
ratio.
5.1 Coding Performance Measurements
Image and video quality is often measured in terms of the Mean Square Error (MSE)
or the peak signal-to-noise ratio (PSNR). Suppose the total number of pixels in an
image is N. Denote the original value of a pixel by x_i and its reconstructed value by x'_i. The mean square error (MSE) is then defined as:

MSE = (1/N) Σ_{i=0}^{N−1} | x_i − x'_i |²    (5.1)
Distortion measured by the MSE does not necessarily represent the real quality of a coded image, so the peak signal-to-noise ratio (PSNR), which is defined in equation (5.2), is normally used instead.
PSNR = 10 log10 ( M² / MSE )  dB    (5.2)
where M is the maximum peak-to-peak value in the signal. For 8-bit images, M is
chosen to be 255. For an average PSNR on the luminance and chrominance channels,
equation (5.3) is used.
PSNR = 10 log10 ( 255² / ( (MSE(Y) + MSE(Cb) + MSE(Cr)) / 3 ) )  dB    (5.3)
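For reference, these measurements can be computed directly; a minimal NumPy sketch (assuming 8-bit frames stored as arrays) is:

    import numpy as np

    def mse(orig, recon):
        return np.mean((orig.astype(float) - recon.astype(float)) ** 2)

    def psnr(orig, recon, peak=255.0):
        return 10.0 * np.log10(peak ** 2 / mse(orig, recon))  # equation (5.2)

    def psnr_yuv(y, y2, cb, cb2, cr, cr2):
        # Average-MSE PSNR over the luminance and chrominance planes, equation (5.3)
        avg = (mse(y, y2) + mse(cb, cb2) + mse(cr, cr2)) / 3.0
        return 10.0 * np.log10(255.0 ** 2 / avg)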
5.2 PSNR Performance of the Layered 3D-CSPIHT Codec
In this section, we discuss the coding performance of the layered 3D-CSPIHT video
codec. Experiments are done with standard 4:1:1 color QCIF (176×144) video
sequences foreman, carphone, suzie, news, container, mother and akiyo at 10 frames
per second. All experiments are performed on Pentium IV 1.6GHz computers.
Fig. 5.1 shows frame by frame PSNR results of the foreman and the container
sequences at three different resolutions: the lowest resolution (resolution 1), the
medium resolution (resolution 2) and the highest resolution (resolution 3) in both
spatial and temporal dimension. Clearly, high resolution results in high PSNR.
Foreman at resolution 1 has an average PSNR of 26.12 dB in the luminance plane.
When 3 more layers (layer 2, 3 and 4) are coded, the PSNR improves by 0.38 to 8.81
dB. In full resolution coding, the resulting average PSNR can reach as high as 46.09
dB in the luminance plane. The average PSNR results of the luminance plane as well
as the chrominance planes on the foreman, news, container and suzie sequences are
given in Table 5.1 and a rate-distortion curve of the layered 3D-CSPIHT codec is
given in Fig. 5.2.
Fig. 5.1 Frame by frame PSNR(Y) results, in dB, on (a) foreman and (b) container sequences at 3 different resolutions (resolution 1, 2 and 3; horizontal axis: frame number, 0 to 300)
                        foreman   news    container   suzie
Resolution 1   Lum      26.21     21.92   21.62       30.31
               Cb       37.96     31.51   37.88       46.02
               Cr       37.58     37.06   36.30       44.88
Resolution 2   Lum      31.09     26.30   26.75       35.06
               Cb       41.78     37.57   43.68       50.32
               Cr       42.31     42.71   41.06       49.59
Resolution 3   Lum      46.09     48.42   51.39       52.09
               Cb       51.97     50.01   54.24       54.69
               Cr       50.97     52.18   54.58       54.32
Table 5.1 Average PSNR (dB) at 3 different resolutions
Fig. 5.2 Rate distortion curve of the layered 3D-CSPIHT codec: PSNR in dB for the Y, U and V planes against resolution options 1 to 7 (refer to Table 4.1 in chapter 4 for the resolution options)
We compare the performance of the layered 3D-CSPIHT codec and the original 3D-CSPIHT codec in terms of PSNR. Comparisons are done at the same bit rate and frame
rate. The layered codec is run at resolution 1, i.e., only layer 1 is coded. The bit rate
required to fully code layer 1 is computed and the original codec is run at this bit rate.
In our experiment the bit rate is 216580 bps.
Fig. 5.3 gives a frame by frame comparison of the original and the layered codec on
foreman sequence in terms of PSNR in the luminance and chrominance planes. In the
luminance plane, the original codec outperforms the layered codec significantly. This
is expected because confining the coding to the first resolution layer causes the coder
to miss significant coefficients in the higher resolution subbands. These coefficients
may be very large and discarding them can cause the PSNR to decrease significantly.
However, in the chrominance planes, the layered codec performs on par with or even
better than the original codec. Because chrominance nodes are normally smaller than luminance nodes, the effect of restricting the resolution is less significant.
Visually, the layered codec gives quite pleasant reconstruction with less brightness. Fig.
5.4, Fig. 5.5, Fig. 5.6 and Fig. 5.7 show frame 1, 58, 120 and 190 of the reconstructed
foreman sequence at resolution 1, 2 and 3 respectively. Significant improvements in
visual quality can be observed when more layers are added.
Fig. 5.3 PSNR (dB) comparison of the original and the layered codec in (a) luminance plane, (b) Cb plane and (c) Cr plane for the foreman sequence (solid line: original 3D-CSPIHT; '+++++': layered 3D-CSPIHT; horizontal axis: frame number)
Fig. 5.4 Frame 1 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original

Fig. 5.5 Frame 58 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original

Fig. 5.6 Frame 120 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original

Fig. 5.7 Frame 190 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original
Fig. 5.8 Comparison on carphone sequence: (a) layered codec at resolution 2 (30.06 dB), (b) 3D-CSPIHT at 312560 bps (35.67 dB), (c) 2D-CSPIHT at 312560 bps (34.86 dB), (d) original

Fig. 5.9 Comparison on akiyo sequence: (a) layered codec at resolution 2 (36.23 dB), (b) 3D-CSPIHT at 312560 bps (41.73 dB), (c) 2D-CSPIHT at 312560 bps (42.65 dB), (d) original
Fig. 5.8 and Fig. 5.9 give a visual comparison of the 2D-CSPIHT, 3D-CSPIHT and the
layered 3D-CSPIHT video coders. Frame 96 of the carphone sequence and frame 1 of
the akiyo sequence are shown. The two sequences are chosen to contrast high motion
sequence (carphone) with low motion sequence (akiyo). The 2D CSPIHT and 3D
CSPIHT codecs are run at the same bit rate as the layered codec, which is run at
resolution 2. The layered 3D-CSPIHT codec gives a PSNR of 30.06 dB and 36.23 dB
on the selected frames of the carphone and akiyo sequences respectively, which is
about 5 dB less than the original 2-D and 3-D codecs. Again, this is because of the loss of large coefficients due to the layer restriction. Although the PSNR of the layered codec is lower than that of the original 2-D and 3-D codecs, the visual quality is still pleasant. Despite decreased brightness, Fig. 5.8 (a) shows comparable
visual quality to Fig. 5.8 (b) and (c). The background, the eyes and the hair area are
clearly reconstructed. The mouth area shows better details. For the akiyo sequence,
however, the original 2-D and 3-D codecs give much sharper edges on the human subject. This is because of the loss of high-subband information in the layered codec on low-motion videos. On high-motion videos, the original 2-D and 3-D codecs are not obviously superior, because the effect of motion estimation and compensation on high-motion videos is greater.
As stated in previous chapters, the objective of the layered 3D-CSPIHT codec is to
support layered scalable video streaming. In a real network environment, the layered
codec is expected to perform much better than the original codec, due to its flexibility
and the ability of the decoder to work with incomplete data.
We demonstrate this by decoding using manually truncated bit streams of the foreman
sequence. Some bits in the encoded stream are cut off to produce incomplete video
streams (Fig. 5.10).
Fig. 5.10 Manually formed incomplete bit streams (a) and (b): each stream keeps its stream header and block header, while the shaded area is discarded
Fig. 5.11 shows the decoding results visually. Without quoting the actual PSNR values, we can see that the original codec has problems decoding bit streams that become incomplete due to network transmission. A typical observation is many artifacts overlapping the video. By frame 10, the bits are totally misinterpreted and the reconstruction is unacceptable. This is expected as there is no layer ID in the original
codec to assist in incomplete data decoding. On the other hand, the layered codec
performs very well when data loss occurs.
Fig. 5.11 Reconstruction of frames 1 (a)(b), 5 (c)(d) and 10 (e)(f) of the foreman sequence with the layered codec (left) and with the original codec (right)
5.3 Coding Time and Compression Ratio
Table 5.2 shows the encoding time of the original and the layered codec on four video
sequences: foreman, carphone, mother and suzie. As the layered codec adds special
markings to the encoded bit stream to support QoS implementation and incomplete
stream decoding, it takes more time in coding and produces a less compressed bit
stream. Experimental results also show that when coding 16, 64, 128 and 256 frames,
the original codec saves about 0.1, 0.5, 0.9 and 1.5 seconds on average. Also, the
layered codec has a lower compression ratio (1:729) than the original CSPIHT (1:825)
due to extra bits introduced by the layer ID. It is worth mentioning that the re-sorting
of wavelet coefficients in the transform domain incurs longer encoding time, too,
because more iterations are needed. Specifically, if the original codec requires n iterations to fully code a video, the layered codec will require n × l iterations, where l is the number of layers. That is to say, n iterations are needed for each layer in the layered 3D-CSPIHT codec.
Frames                  16      64      128     256
foreman    original     0.87    3.32    6.52    12.82
           layered      1.13    3.87    7.61    15.28
           ∆t           0.26    0.55    1.09    2.46
carphone   original     0.87    3.16    6.19    12.57
           layered      0.99    3.7     7.1     14.13
           ∆t           0.12    0.54    0.91    1.56
mother     original     0.88    3.21    6.34    12.55
           layered      0.97    3.69    7.24    14.23
           ∆t           0.09    0.48    0.9     1.68
suzie      original     0.87    3.24    6.23    12.07
           layered      1.02    3.75    7.24    13.18
           ∆t           0.15    0.51    1.01    1.11
Table 5.2 Encoding time (in seconds) of the original and layered codec
CHAPTER 6
CONCLUSIONS
In this thesis, we present a layered 3D-CSPIHT codec based on the CSPIHT and 3D-CSPIHT coders. The layered codec incorporates a new sorting algorithm that produces
resolution/frame rate scalable bit streams in a layered format. Moreover, it carries a
specially designed layer ID that identifies the layer that a particular data packet
belongs to. By doing so, layered scalable video streaming is supported. A unicast video
delivery system using the layered 3D-CSPIHT codec is also implemented.
As extensions to the SPIHT wavelet coding scheme, CSPIHT and 3D-CSPIHT codecs
give satisfactory PSNR performance. However, they are designed assuming ideal
network conditions and thus have limitations when applied to real network situations.
In a layered multicast video streaming system, the server sends only one video stream, which is normally fully coded. Receivers/clients obtain video of different resolutions, sizes or frame rates by subscribing to the corresponding multicast groups that carry progressively decodable stream layers. Each client then combines its received layers into a stream that provides the video quality it prefers. Under such a network scenario, the
encoder is expected to produce bit streams in a layered structure, and the decoder must
be able to decode different combined versions as well as incomplete versions of the
original encoded bit stream. The layered 3D-CSPIHT codec achieves this by re-sorting
the coefficients in the transform domain following a restricted significance criterion,
and slotting a flag, called layer ID, as a layer identifier to the decoder and the network
routers/switches.
In the layered 3D-CSPIHT codec, layers are defined according to resolutions or
subbands. Significance status in the original 3D-CSPIHT, which depends only on the
magnitude of a node or its descendants, is re-determined by checking a modified
significance criterion: an entry is significant when it is significant in the original 3D-CSPIHT and when it resides in a subband belonging to the currently effective layer, or
the layer that is being coded. Thus, coefficients in lower layers are coded with higher
priorities. Seven resolution options are provided in terms of different spatial
resolutions and frame rates.
To enable decoding of incomplete data, eight consecutive ‘1’s (11111111) are
introduced as layer ID at the beginning of each layer. A zero (0) is inserted in the data
after occurrence of every seven consecutive ‘1’s (1111111) to remove the ‘11111111’
sequences in the data section and protect the uniqueness of the layer ID. The same
sequence is used as layer ID for all layers by maintaining an ID counter to track the
number of times that the layer ID sequence is captured. The decoder switches to
process the kth layer upon detection of the layer ID for the kth time. Thus, when a
packet in the (k-1)th layer is lost, streams from the kth layer can still be decoded
correctly. In the network, the layer ID should be marked with the lowest drop
precedence to ensure safe delivery.
The layered 3D-CSPIHT codec is tested using both high motion and low motion
standard QCIF video sequences at 10 frames per second. It is compared against the
original 3D-CSPIHT and the 2D-CSPIHT video coder in terms of PSNR, encoding
time and compression ratio. In the luminance plane, the original 3D-CSPIHT and the
2D-CSPIHT outperform the layered 3D-CSPIHT significantly in PSNR results. While
in the chrominance planes, they give similar PSNR results. The layered 3D-CSPIHT
also costs more in time and provides less compressed bit streams, because of the
expense incurred by incorporating the layer ID.
In conclusion, the layered 3D-CSPIHT codec improves the original 3D-CSPIHT codec
in terms of network friendliness. It overcomes the limitations of the original 3D-CSPIHT codec for application to a layered multicast streaming system.
REFERENCES
[1]. S. Keshav, An Engineering Approach to Computer Networking, Addison Wesley, 1997.
[2]. Y.S. Gan and C.K. Tham, "Random Early Detection Assisted Layered Multicast", in "Managing IP Multimedia End-to-End", 5th IFIP/IEEE International Conference on Management of Multimedia Networks and Services, Oct. 2002, pp. 341-353, Santa Barbara, USA.
[3]. C.K. Tham, Y.S. Gan and Y. Jiang, "Congestion Adaptation and Layer Prioritization in a Multicast Scalable Video Delivery System", Proceedings of IEEE/EURASIP Packet Video 2003, Apr. 2003, Nantes, France.
[4]. Y.S. Gan and C.K. Tham, "Loss Differentiated Multicast Congestion Control", Computer Networks, vol. 41, no. 2, pp. 161-176, Feb. 2003.
[5]. A. Said and W.A. Pearlman, "A New, Fast, and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees", IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, pp. 243-250, Jun. 1996.
[6]. F.W. Wheeler and W.A. Pearlman, "SPIHT Image Compression without Lists", Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Jun. 2000, vol. 4, pp. 2047-2050.
[7]. W.S. Lee and A.A. Kassim, "Low Bit-Rate Video Coding Using Color Set Partitioning In Hierarchical Trees Scheme", International Conference on Communication Systems 2001, Nov. 2001, Singapore.
[8]. A.A. Kassim and W.S. Lee, "Performance of the Color Set Partitioning In Hierarchical Tree Scheme (C-SPIHT) in Video Coding", Circuits, Systems and Signal Processing, vol. 20, pp. 253-270, 2001.
[9]. A.A. Kassim and W.S. Lee, "Embedded Color Image Coding Using SPIHT with Partial Linked Spatial Orientation Trees", IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 203-206, Feb. 2003.
[10]. G.D. Karlsson and M. Vetterli, "Three Dimensional subband coding of video", International Conference on Acoustics, Speech and Signal Processing 1998, pp. 1100-1103.
[11]. A.A. Kassim, E.H. Tan and W.S. Lee, "3D Wavelet Video Codec based on Color Set Partitioning in Hierarchical Trees (CSPIHT)", submitted to IEEE Transactions on Circuits and Systems for Video Technology.
[12]. Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing (Third Edition), pp. 317-372, Addison Wesley, 1992.
[13]. Simon Haykin, Neural Networks: A Comprehensive Foundation (Second Edition), pp. 392-437, Prentice Hall, 1998.
[14]. K.R. Castleman, Digital Image Processing, Prentice Hall, 1996.
[15]. R.J. Clarke, Digital Compression of Still Images and Video, pp. 245-274, Academic Press, 1995.
[16]. J.R. Ohm, "Three-Dimensional Subband Coding with Motion Compensation", IEEE Transactions on Image Processing, vol. 3, pp. 559-571, Sep. 1994.
[17]. S.J. Choi and J.W. Woods, "Motion-Compensated 3-D Subband Coding of Video", IEEE Transactions on Image Processing, vol. 8, pp. 155-167, Feb. 1999.
[18]. B.J. Kim et al., "Low Bit-Rate Scalable Video Coding with 3-D Set Partitioning in Hierarchical Trees (3-D SPIHT)", IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, pp. 1374-1386, Dec. 2000.
[19]. J.Y. Tham, S. Ranganath, and A.A. Kassim, "Highly Scalable Wavelet-based Video Codec for Very Low Bit Rate Environment", IEEE Journal on Selected Areas in Communications – Special Issue on Very Low Bit-rate Video Coding, vol. 16, pp. 12-27, Jan. 1998.
[20]. G.D. Karlsson and M. Vetterli, "Three Dimensional subband coding of video", International Conference on Acoustics, Speech and Signal Processing 1998, pp. 1100-1103.
[21]. C.I. Podilchuk, N.S. Jayant and N. Farvardin, "Three-dimensional subband coding of video", IEEE Transactions on Image Processing, vol. 4, pp. 125-139, Feb. 1995.
[22]. D. Taubman and A. Zakhor, "Multirate 3-D Subband Coding of Video", IEEE Transactions on Image Processing, vol. 3, pp. 572-588, Sep. 1994.
[23]. A.S. Lewis and G. Knowles, "Image compression using the 2-D wavelet transform", IEEE Transactions on Image Processing, vol. 1, pp. 244-250, Feb. 1992.
[24]. S.H. Man and F. Kossentini, "Robust EZW Image Coding for Noisy Channels", IEEE Signal Processing Letters, vol. 4, pp. 227-229, Aug. 1997.
[25]. C.D. Creusere, "A New Method of Robust Image Compression Based on the Embedded Zerotree Wavelet Algorithm", IEEE Transactions on Image Processing, vol. 6, pp. 1436-1442, Oct. 1997.
[26]. J.K. Rogers and P.C. Cosman, "Robust Wavelet Zerotree Image Compression with Fixed-Length Packetization", Proceedings of Data Compression Conference, 1998, Snowbird, UT, pp. 418-427.
[27]. P.C. Cosman, J.K. Rogers, P.G. Sherwood and K. Zeger, "Combined Forward Error Control Packetized Zerotree Wavelet Encoding for Transmission of Images over Varying Channels", IEEE Transactions on Image Processing, vol. 9, pp. 982-993, Jun. 2000.
[28]. J.M. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelets", IEEE Transactions on Signal Processing, vol. 41, pp. 3445-3462, Dec. 1993.
[29]. D. Wu, T. Hou and Y.Q. Zhang, "Scalable Video Coding and Transport over Broad-Band Wireless Networks", invited paper, Proceedings of the IEEE, Special Issue on Multi-Dimensional Broadband Wireless Technologies and Applications, vol. 89, pp. 6-20, Jan. 2001.
[30]. A. Tamhankar and K.R. Rao, "An Overview of H.264/MPEG-4 Part 10", 4th EURASIP Conference focused on Video/Image Processing and Multimedia Communications, Jul. 2003, vol. 1, pp. 1-51.
[31]. Z. Wang, Internet QoS: Architectures and Mechanisms for Quality of Service (First Edition), Morgan Kaufmann, 2001.
[32]. P. Ferguson and G. Huston, Quality of Service: Delivering QoS on the Internet and in Corporate Networks (First Edition), John Wiley & Sons, 1998.
[33]. http://www.ietf.org/rfc.html, RFC 2205.
[34]. http://www.ietf.org/rfc.html, RFC 2597, 2598, 3246, 3247.
[35]. Y. Wang, J. Ostermann, and Y.Q. Zhang, Video Processing and Communications, Prentice Hall, 2001.
[36]. E.C. Reed and F. Dufaux, "Constrained Bit-Rate Control for Very Low Bit-Rate Streaming-Video Applications", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 882-889, Jul. 2001.
[37]. C.W. Yap and K.N. Ngan, "Error Resilient Transmission of SPIHT Coded Images over Fading Channels", IEE Proceedings of Vision, Image and Signal Processing, Feb. 2001, vol. 148, issue 1, pp. 59-64.
[38]. C.K. Tham et al., "Layered Coding for a Scalable Video Delivery System", Proceedings of IEEE/EURASIP Packet Video 2003, Apr. 2003, Nantes, France.
[39]. W. Feng, A.A. Kassim and C.K. Tham, "Layered Self-Identifiable and Scalable Video Codec for Delivery to Heterogeneous Receivers", Proceedings of SPIE Visual Communications and Image Processing, Jul. 2003, Lugano, Switzerland.
[40]. J.R. Ohm, "Advanced Packet Video Coding Based on Layered VQ and SBC Techniques", IEEE Transactions on Circuits and Systems for Video Technology, vol. 3, pp. 208-221, Jun. 1993.
[41]. M. Mrak, M. Grgic and S. Grgic, "Scalable Video Coding in Network Applications", 4th EURASIP-IEEE Region 8 International Symposium on Video/Image Processing and Multimedia Communications, Jun. 2002, pp. 205-211.
[42]. S. Han and B. Girod, "Robust and Efficient Scalable Video Coding with Leaky Prediction", Proceedings of IEEE International Conference on Image Processing, Sep. 2002, vol. 2, pp. II-41-II-44.
[43]. H. Danyali and A. Mertins, "Highly Scalable Image Compression Based on SPIHT for Network Applications", Proceedings of IEEE International Conference on Image Processing, Sep. 2002, vol. 1, pp. I-217-I-220.
[44]. J.Y. Lee, H.S. Oh and S.J. Ko, "Motion-Compensated Layered Video Coding for Playback Scalability", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 5, pp. 619-629, May 2001.
[45]. M. Ghanbari, "Two-layer coding of video signals for VBR networks", IEEE Journal on Selected Areas in Communications, vol. 7, pp. 771-781, Jun. 1989.
[46]. D. Taubman and A. Zakhor, "Multi-rate 3-D subband coding of video", IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 572-588, Sep. 1994.
[47]. M. Vishwanath and P. Chou, "An efficient algorithm for hierarchical compression of video", Proceedings of the IEEE International Conference on Image Processing, Texas, USA, Nov. 1994.
[48]. M. Khansari, A. Zakauddin, W.Y. Chan, E. Dubois and P. Mermelstein, "Approaches to layered coding for dual-rate wireless video transmission", Proceedings of the IEEE International Conference on Image Processing, Texas, USA, Nov. 1994.
[49]. S.R. McCanne, M. Vetterli and V. Jacobson, "Low-complexity video coding for receiver-driven layered multicast", IEEE Journal on Selected Areas in Communications, vol. 15, no. 6, pp. 983-1001, Aug. 1997.
[50]. H. Radha, Y. Chen, K. Parthasarathy and R. Cohen, "Scalable Internet Video Using MPEG-4", Signal Processing: Image Communication, vol. 15, no. 1-2, pp. 95-126, Sep. 1999.
[51]. T. Kim, S.K. Choi, R.E. Van Dyck and N.K. Bose, "Classified Zerotree Wavelet Image Coding and Adaptive Packetization for Low-Bit-Rate Transport", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 1022-1034, Sep. 2001.
[52]. J.W. Woods and G. Lilienfield, "A Resolution and Frame-Rate Scalable Subband/Wavelet Video Coder", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 1035-1044, Sep. 2001.
[53]. A. Puri et al., "Temporal Resolution Scalable Video Coding", Proceedings of IEEE International Conference on Image Processing, Nov. 1994, vol. 2, pp. 947-951.
[54]. D. Taubman and A. Zakhor, "Rate- and Resolution-Scalable Video and Image Compression with Subband Coding", The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, Nov. 1993, vol. 2, pp. 1489-1493.
[55]. G.J. Conklin and S.S. Hemami, "Evaluation of Temporally Scalable Video Coding Techniques", Proceedings of IEEE International Conference on Image Processing, Oct. 1997, vol. 2, pp. 61-64.
[56]. G.J. Conklin and S.S. Hemami, "Comparison of Temporal Scalability Techniques", IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, issue 6, pp. 909-919, Sep. 1999.