CSPIHT BASED SCALABLE VIDEO CODEC FOR
LAYERED VIDEO STREAMING
FENG WEI
(B.Eng. (Hons), Xi'an Jiaotong University)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2003
ACKNOWLEDGEMENT
I would like to express my gratitude to all those who gave me the possibility to
complete this thesis.
First of all, I would like to extend my sincere gratitude to my two supervisors, A/P
Ashraf A. Kassim and Dr. Tham Chen Khong, for their insightful guidance throughout
my project and their valuable time and inputs on this thesis. They have helped and
encouraged me in numerous ways, especially when my progress was slow.
I am grateful to my three seniors -- Mr. Lee Wei Siong, Mr. Tan Eng Hong and Mr.
See Toh Chee Wah, who have provided me with much information and many helpful
discussions. Their assistance was vital to this project. I wish to thank all the friends and
fellow students in the Vision and Image Processing lab, especially the lab officer Mr.
Francis Hoon. They have been wonderful company to me for these two years.
Last but not least, I wish to thank my boyfriend Huang Qijie for his support all along.
Almost all of my progress was made when he was by my side.
TABLE OF CONTENTS
ACKNOWLEDGEMENT……………………………………….……..…………......i
TABLE OF CONTENTS……………………………………….……..………….......ii
LIST OF FIGURES.…………………………………………….……..……………..iv
LIST OF TABLES…………………………………………………….….……….…vii
SUMMARY……………………………………………………………..…………...viii
CHAPTER 1 INTRODUCTION……………………………..…………….…….......1
CHAPTER 2 IMAGE AND VIDEO CODING………………..…………….……....5
2.1 Transform Coding…………………………………...……..……………………..5
2.1.1 Linear Transform…..…….…………………………...…..…………………….6
2.1.2 Quantization……………….……………………………..………………….….7
2.1.3 Arithmetic Coding……………….……….………………......………………...8
2.1.4 Binary Coding………………….…………………………...….…..………….10
2.2 Video Compression Using MEMC………………………………..…………….10
2.3 Wavelet Based Image and Video Coding…………………………...…..………12
2.3.1 Discrete Wavelet Transform………………………………………….……….13
2.3.2 EZW Coding …………….………………………….……………...…………16
2.3.3 SPIHT Coding Scheme…………………………….……………...………..…18
2.3.4 Scalability……………………………………….………………..…………...23
2.4 Image and Video Coding Standards………………………………..…………..25
CHAPTER 3 VIDEO STREAMING AND NETWORK QoS………..…….……..25
3.1 Video Streaming Models……………………………………………….….…….25
3.2 Characteristics and Challenges of Video Streaming…………………..………26
3.3 Quality of Service……………………………………………………….…..……27
3.3.1 Definition of QoS ……...….…………………………….……………..……...27
3.3.2 IntServ Framework………….…………………………….………….….…….28
3.3.3 DiffServ Framework………….…………………………….…………...……..31
3.4 Layered Video Streaming……………………………………………..………...33
CHAPTER 4 LAYERED 3D-CSPIHT CODEC……………………….……..……….36
4.1 CSPIHT and 3D-CSPIHT Video Coder ………..……..….……...…..……….. 36
4.2 Limitations of Original 3-D CSPIHT Codec……………..……..…..………….41
4.3 Layered 3D-CSPIHT Video Codec………………………….…….…...……….42
4.3.1 Overview of New Features………………………………….…….…………43
4.3.2 Layer IDs…………………………………………………….…….………...44
4.3.3 Production of Multiresolutional Scalable Bit Streams………………………46
4.3.4 How the Codec Functions in the Network………………………….………54
4.3.5 Layered 3D-CSPIHT Algorithm……………………………….………...…57
CHAPTER 5 PERFORMANCE DATA………………………………….….…….59
5.1 Coding Performance Measurements………………………………….……….59
5.2 PSNR Performance of the layered 3D-CSPIHT Codec………….….………..60
5.3 Coding Time and Compression Ratio………………………………….……...70
CHAPTER 6 CONCLUSIONS……………………………………………….……71
REFERENCES…………………………..…………………………………….……74
SUMMARY
A layered scalable codec based on the 3-D Color Set Partitioning in Hierarchical
Trees (3D-CSPIHT) coder is presented in this thesis. The layered 3D-CSPIHT codec
introduces layering of encoded bit streams to support layered scalable video streaming.
It restricts the significance criteria of the original 3D-CSPIHT coder to generate
separate bit streams comprised of cumulative layers. Layers are defined according to
resolution subbands. The layered 3D-CSPIHT codec incorporates a new sorting
algorithm to produce multi-resolution scalable bit streams, and a specially designed
layer ID to identify the layer that a particular data packet belongs to. By doing so,
decoding of lossy data is achieved.
The layered 3D-CSPIHT codec is tested using both high motion and low motion
standard QCIF video sequences at 10 frames per second. It is compared against the
original 3D-CSPIHT and the 2D-CSPIHT video coder in terms of PSNR, encoding
time and compression ratio. In the luminance plane, the original 3D-CSPIHT and the
2D-CSPIHT give better PSNR than the layered 3D-CSPIHT, while in the
chrominance planes they give similar PSNR results. The layered 3D-CSPIHT also
requires more computation time and produces less compressed bit streams because of
the overhead incurred by incorporating the layer ID. However, encoded video data is
very likely to encounter loss in real network transmission. When decoding lossy data,
the layered 3D-CSPIHT codec outperforms the original 3D-CSPIHT significantly.
LIST OF TABLES
Table 2.1 Image and video compression standards……………………..…………….24
Table 4.1 Resolution options……………………………………………….…………47
Table 4.2 LIP, LIS, LSP state after sorting at bit plane 2 (original CSPIHT)……...…50
Table 4.3 LIP, LIS, LSP state after sorting at bit plane 1 (original CSPIHT)………...51
Table 4.4 LIP, LIS, LSP state after sorting at bit plane 0 (original CSPIHT)………...51
Table 4.5 LIP, LIS, LSP state after sorting at bit plane 2 (layered CSPIHT, layer 1
effective)………………………………………………………………………………52
Table 4.6 LIP, LIS, LSP state after sorting at bit plane 1 (layered CSPIHT, layer 1
effective)………………………………………………………………………………52
Table 4.7 LIP, LIS, LSP state after sorting at bit plane 0 (layered CSPIHT, layer 1
effective)………………………………………………………………………………52
Table 4.8 LIP, LIS, LSP state after sorting at bit plane 2 (layered CSPIHT, layer 2
effective)………………………………………………………………………………53
Table 4.9 LIP, LIS, LSP state after sorting at bit plane 1 (layered CSPIHT, layer 2
effective)………………………………………………………………………………53
Table 4.10 LIP, LIS, LSP state after sorting at bit plane 0 (layered CSPIHT, layer 2
effective)……………………………………………………………………………....53
Table 5.1 Average PSNR (dB) at 3 different resolutions…………………………..…61
Table 5.2 Encoding time (in second) of the original and layered codec……...………70
LIST OF FIGURES
Fig. 1.1 A typical video streaming system…………..…………………………………2
Fig. 2.1 Encoding model………………………………..………………………………5
Fig. 2.2 Decoding model……………………………….....……………………………6
Fig. 2.3 Binary coding model……………………………..………………….……….10
Fig. 2.4 Block matching motion estimation………………..……….…………………11
Fig. 2.5 1-D DWT decomposition…………………………..………………………...14
Fig 2.6 Dyadic DWT decomposition of an image……………..…….………………..14
Fig 2.7 Subbands after 3-level dyadic wavelet decomposition………..……………...15
Fig. 2.8 2-level DWT decomposed Barbara image……………………..…………….15
Fig. 2.9 Spatial Orientation Tree for EZW………………………………..…………..17
Fig. 2.10 Spatial Orientation Tree of SPIHT………………………………………….18
Fig. 2.11 SPIHT coding algorithm…………………………………………..………..25
Fig. 3.1 Unicast video streaming……………………………………………..……….25
Fig. 3.2 Multicast video streaming……………………………………………..……..26
Fig. 3.3 IntServ architecture……………………………………………………..……29
Fig. 3.4 Leaky bucket regulator…………………………………………………...…..30
Fig. 3.5 An example of the DiffServ network……………………………………..….32
Fig. 3.6 DiffServ inter-domain operations………………….………………..……..…33
Fig. 3.7 Principle of a layered codec………………………...………….………….....35
Fig. 4.1 CSPIHT SOT (2-D) ……………………………………………….……...….37
Fig. 4.2 CSPIHT video encoder …………………………………………………..….37
Fig. 4.3 CSPIHT video decoder……………………………...……………….……….38
Fig. 4.4 3D-CSPIHT STOT …………………………………….…………...……..…39
Fig. 4.5 3D-CSPIHT video encoder……………………………..……………….…...40
Fig. 4.6 3D-CSPIHT video decoder……………………………..…….………..…….41
Fig. 4.7 Confusion when decoding lossy data using the original 3D-CSPIHT decoder…....41
Fig. 4.8 Network scenario considered for design of the layered codec………..…......43
Fig. 4.9 The bit stream after layer ID is added…………………………………..……45
Fig. 4.10 Resolution layers in layered 3D-CSPIHT……………………..……………47
Fig. 4.11 Progressively transmitted and decoded layers ……………………..…...….47
Fig. 4.12 (a) An example video frame after DWT transform ………………..…….…49
Fig. 4.12 (b) SOT for Fig. 4.12 (a)……...………………………………...……..……49
Fig. 4.13 Bit stream structure of the layered 3D-CSPIHT coder…………….....….….55
Fig. 4.14 Flowchart of the layered decoder algorithm…………………………...……56
Fig. 4.15 Layered 3D-CSPIHT algorithm………………………………………….…58
Fig. 5.1 Frame by frame PSNR results on (a) foreman and (b) container sequences at 3
different resolutions………………………………………………………………..….61
Fig. 5.2 Rate distortion curve of the layered 3D-CSPIHT codec..………………..…..62
Fig. 5.3 PSNR (dB) comparison of the original and the layered codec in (a) luminance
plane, (b) Cb plane and (c) Cr plane for the foreman sequence……………………....63
Fig. 5.4 Frame 1 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c)
resolution 3 and (d) original…………………………………………………………...64
Fig. 5.5 Frame 58 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c)
resolution 3 and (d) original…………………………………………………………...64
Fig. 5.6 Frame 120 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c)
resolution 3 and (d) original…………………………………………………………...65
Fig. 5.7 Frame 190 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c)
resolution 3 and (d) original…………………………………………………………...65
Fig. 5.8 Comparison on carphone sequence………………………………………….66
Fig. 5.9 Comparison on akiyo sequence……………………………………………....67
Fig. 5.10 Manually formed incomplete bit streams...…………………………………68
Fig. 5.11 Reconstruction of frame (a)(b)1, (c)(d)5, (e)(f)10 of the foreman sequence
……………………………………………………………………………………..….69
CHAPTER 1
INTRODUCTION
With the increasing demand for rich multimedia content on the Internet, video
streaming has become popular in both academia and industry.
Video streaming technology enables real time or on-demand distribution of video
resources over the network. Compressed video data are transmitted by a server
application, and received and displayed in real time by the corresponding client
applications. These applications normally start to display the video as soon as a certain
amount of data arrives at the client’s buffer, thus providing downloading and viewing
of the video simultaneously.
A typical video streaming system consists of five core functional blocks, i.e., coding
module, network sender, network receiver, decoding module and video renderer. As
shown in Fig. 1.1, raw video data will undergo compression in the coding module to
reduce the data load in the network. The compressed video is then transmitted by the
sender to the client on the other side of the network, where a decoding procedure is
performed to reconstruct the video for the renderer to display.
Video streaming is advantageous because a user does not have to wait for the whole
file to arrive before viewing the video. Besides, video streaming leaves no physical
files on the client's computer.
Fig. 1.1 A typical video streaming system (raw video → encoder → sender → network → receiver → decoder → renderer)
The challenge of video streaming lies in the highly delay-sensitive characteristic of
video applications. Video/audio data need to arrive on time to be useful. Unfortunately,
current Internet service is best effort (BE) and guarantees no delay bound. Delay
sensitive applications need a new service model in which they can ask for higher
assurance or priority from the network. Research in network Quality of Service (QoS)
aims to investigate and provide such service models. Technical details of QoS include
control protocols such as the Resource Reservation Protocols (RSVP), and individual
building blocks such as traffic policing, buffer management and admission control [1].
Layered scalable streaming is one of the QoS supportive video streaming mechanisms
that provide both efficiency and flexibility.
The basic idea of layered scalable streaming is to encode raw video into multiple
layers that can be separately transmitted, cumulatively received and progressively
decoded [2]-[4]. Clients obtain a preferred video quality by subscribing to different
layers and combining these layers into different bit streams. The base layer of the video
stream must be received for any other layers to be useful, and each additional layer
improves the video quality. As network clients always differ significantly in their
capacities and preferences, layered scalable streaming is efficient in that it is able to
deliver one video stream over the network, while at the same time it enables the clients
to receive a video that is specially “shaped” for each of them.
Besides adaptive QoS support from the network, layered scalable video streaming
requests a scalable video codec. Recent subband coding algorithms based on the
Discrete Wavelet Transform (DWT) support scalability. The DWT based Set
Partitioning in Hierarchical Trees (SPIHT) scheme [5] [6] for coding of monochrome
images has yielded desirable results despite its simplicity in implementation. The
Color SPIHT (CSPIHT) [7]-[9] improves the SPIHT and achieves comparable
compression results to SPIHT in color image coding. In the area of video compression,
interest is focused on the removal of temporal redundancy. The use of 3-D subband
coding schemes is one of the successful solutions. Karlsson and Vetterli implemented a
3-D subband coding system in [10] by generalizing the common 2-D filter banks to 3-D
subband analysis and synthesis. As one of the embedded 3-D subband coding
algorithms that follow it, 3D-CSPIHT [11] is an extension of the CSPIHT coding
scheme for video coding.
The above coding schemes achieve satisfactory PSNR performance; however, they
have been designed from a pure compression point of view, which makes their direct
application to a QoS enabled streaming system problematic.
In this project, we extended the 3D-CSPIHT codec to address these problems and
enable it to produce layered bit streams that are suitable for layered video streaming.
The rest of this thesis is organized as follows: In chapter 2 we provide background
information on image/video compression, and in chapter 3 we discuss related research
in multimedia communications and network QoS. The details of our extension of the
3D-CSPIHT codec, called layered 3D-CSPIHT video codec, are presented in chapter 4.
We analyze performance of the layered codec in chapter 5. Finally, in chapter 6 we
conclude this thesis.
CHAPTER 2
IMAGE AND VIDEO CODING
This chapter begins with an overview of transform coding for still images and video
coding using motion compensation. Then wavelet based image and video coding is
introduced and the subband coding techniques are described in detail. Finally, current
image and video coding standards are briefly summarized.
2.1 Transform Coding
A typical transform coding system comprises a forward transform, quantization and
entropy coding, as shown in Fig. 2.1. First, a reversible linear transform is used to
reduce redundancy between adjacent pixels, i.e., the inter-pixel redundancy, in an
image. After that, the image undergoes the quantization stage to reduce psychovisual
redundancy. Lastly, the quantized image goes through entropy coding which aims to
reduce coding redundancy. Transform coding is a core technique recommended by
JPEG and adopted by H.261, H.263, and MPEG 1/2/4. The corresponding decoding
procedure is depicted in Fig. 2.2. We will discuss the three encoding stages in this
section.
Fig. 2.1 Encoding model (input signal → transform → quantization → entropy coding → compressed signal)
Fig. 2.2 Decoding model (compressed signal → entropy decoding → inverse transform → reconstructed signal)
2.1.1 Linear Transforms
Transform coding exploits the inter-pixel redundancy of an image by mapping the
image to the transform domain using a reversible linear transform. For most natural
images, a significant number of coefficients will have small magnitudes after the
transform. These coefficients therefore can be coarsely quantized or entirely discarded
without causing much image degradation [12]. There is no information loss during the
transform process, and the number of coefficients produced is equal to the number of
pixels transformed. Transform itself does not directly reduce the amount of data
required to represent the image. However, a set of transform coefficients are obtained
in this way, which makes the inter-pixel redundancies of the input image more
accessible for compression in later stages of the encoding process [12].
Defining the input signal x = [x_1, x_2, ..., x_N]^T as a vector of data samples with the
standard basis {a_1, a_2, ..., a_N} of an N-dimensional Euclidean space, we obtain:

    x = \sum_{n=1}^{N} x_n a_n    (2.1)

where A = [a_1, a_2, ..., a_N] is the identity matrix of size N × N.
A different basis [b_1, b_2, ..., b_N] can be used to represent x as

    x = \sum_{n=1}^{N} y_n b_n    (2.2)

with y_n being the coordinates of x with respect to b_n (n ∈ {1, 2, ..., N}).
Let B = [b_1, b_2, ..., b_N] and y = [y_1, y_2, ..., y_N]^T; we then have

    x = B y    (2.3)

Rearranging equation (2.3), we get

    y = T x    (2.4)

where T = B^{-1}. Equation (2.4) then defines the one-dimensional linear transform from
vector x to y.
The goal of the transform process is to de-correlate the pixels, or to pack the signal energy
into as few transform coefficients as possible. However, not all linear transforms are
optimal in this sense. Only the whitening transform (viz. the Karhunen-Loeve transform
(KLT), Hotelling transform or the method of principal components) [13], in which the
eigenvectors of the input covariance matrix form the basis functions, de-correlates the
input signal or image and is optimal in the sense of energy compaction. However, the KLT is
seldom used in practice because it is data dependent, which makes it computationally
expensive. Instead, near-optimal transforms such as the discrete cosine transform (DCT)
are normally selected in practical transform coding systems because they provide a good
compromise between energy compaction ability and computational complexity [14].
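To make the mapping y = Tx of equation (2.4) concrete, the short sketch below (an illustrative example only, not code from this project) builds an orthonormal DCT basis matrix T and applies it to a smooth signal; most of the signal energy ends up in a few coefficients, which is exactly the energy compaction the transform stage relies on.

    import numpy as np

    def dct_matrix(n):
        # Orthonormal DCT-II basis: row k is the basis vector b_k^T, so y = T x.
        k = np.arange(n).reshape(-1, 1)
        i = np.arange(n).reshape(1, -1)
        T = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
        T[0, :] = np.sqrt(1.0 / n)        # the first row uses a different normalization
        return T

    n = 8
    x = np.linspace(0.0, 1.0, n) ** 2      # a smooth "row of pixels"
    T = dct_matrix(n)
    y = T @ x                              # forward transform, equation (2.4)
    x_rec = T.T @ y                        # T is orthonormal, so B = T^T inverts it

    energy = np.cumsum(np.sort(y ** 2)[::-1]) / np.sum(y ** 2)
    print("energy captured by the 2 largest coefficients:", energy[1])
    print("reconstruction error:", np.max(np.abs(x - x_rec)))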
2.1.2 Quantization
After transform process, quantization is used to reduce the accuracy of the transform
coefficients according to a pre-established fidelity criterion [14]. The effect of
compression is achieved in this way. Quantization is an irreversible process.
Quantization is the mapping from the source data vector x to a code word rk = Q[x] in a
code book { rk ; 1 ≤ k ≤ L}. The criterion to choose the proper code word is to reduce
the expected distortion due to quantization with respect to a particular probability
density distribution of the data. Assume the probability density function of x is f(x).
The expected distortion can be formulated as:
    D = \sum_{k=1}^{L} \int \|x - r_k\|^2 \, I(x, r_k) \, f_x(x) \, dx    (2.5)

where

    I(x, r_k) = \begin{cases} 1, & Q[x] = r_k \\ 0, & \text{otherwise} \end{cases}    (2.6)

is an indicator function.
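As a hypothetical illustration of this mapping, the sketch below quantizes samples drawn from a Gaussian source with a uniform codebook and estimates the expected distortion of equation (2.5) empirically (this quantizer is only meant to make the definitions concrete; it is not part of the codec described later).

    import numpy as np

    def uniform_quantizer(x, step):
        # Map each sample to the nearest codeword r_k on a uniform grid, i.e. Q[x] = r_k.
        return step * np.round(x / step)

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 100_000)      # samples drawn from the source density f_x
    step = 0.5
    xq = uniform_quantizer(x, step)

    # Monte Carlo estimate of the expected distortion D of equation (2.5)
    D = np.mean((x - xq) ** 2)
    print("empirical distortion:", D)
    print("high-rate approximation step^2/12:", step ** 2 / 12)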
2.1.3 Arithmetic Coding
In the final stage of transform coding, a symbol coder is used to create code to
represent the output from the quantization process. In most cases, the quantized data are
mapped to a set of variable-length codes. The coder assigns the shortest code to the output
value that occurs most frequently, thereby reducing the coding redundancy and saving the
amount of data required to represent the quantized data set. The following results from
information theory provide the basic tools for dealing with information representation
quantitatively.
Let {a1 , a2 ,...ai ,...ak } be a set of symbols from a memoryless source of messages, each
with a known probability of occurrence, denoted as p(ai ) . The amount of information
imparted by the occurrence of the symbol ai in the message is:
    I(a_i) = -\log_2 p(a_i), \quad 1 \le i \le k    (2.7)

where the unit of information is the bit for a logarithm of base 2.
The entropy of the message source is then defined as

    H = -\sum_{j=1}^{k} p(a_j) \log_2 p(a_j)    (2.8)
Entropy specifies the average information content (per symbol) of the messages
generated by the source [14] and gives the minimum amount of bits (average) required
to encode all the symbols in the system. Entropy coding aims to encode a given set of
symbols with the minimum number of bits required so as to approach the entropy of
the system. Examples of entropy coding include Huffman coding, run length coding
and arithmetic coding. We give some details on arithmetic coding in the following.
Arithmetic coding is a variable-length coding method based on the frequency of each character
or symbol. It is well suited to encoding a long stream of symbols or long messages. In
arithmetic coding, probabilities of all code words sum up to unity. The events in the
data set are arranged in an interval between 0 and 1. Each code word probability can be
related to a subdivision of this interval. The algorithm for arithmetic coding then works
as follows:
i) Begin with a current interval [L, H) initialized to [0, 1);
ii) For each incoming event, subdivide the current interval into subintervals, one for each
possible event, proportional to the events' probabilities of occurrence;
iii) Select the subinterval corresponding to the incoming event, make it the new
current interval and go back to step ii).
Arithmetic coding reduces the information that needs to be transmitted to a single
number within the final interval, which is identified after the whole data set is encoded.
The arithmetic decoder, knowing the occurrence probabilities of the different
events and the received number, then identifies and rescales the corresponding
intervals to decode the data set.
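A toy floating-point version of this interval narrowing (ignoring the incremental renormalization and finite-precision arithmetic that a practical coder needs) might look like the sketch below; the message and probability table are made up for illustration.

    def arithmetic_encode(message, probs):
        # Lay out one sub-interval of [0, 1) per symbol, proportional to its probability.
        cum, intervals = 0.0, {}
        for symbol, p in probs.items():
            intervals[symbol] = (cum, cum + p)
            cum += p

        low, high = 0.0, 1.0
        for symbol in message:              # steps ii) and iii): narrow the interval
            span = high - low
            s_low, s_high = intervals[symbol]
            low, high = low + span * s_low, low + span * s_high
        # Any number inside the final interval identifies the whole message.
        return (low + high) / 2

    code = arithmetic_encode("aabca", {"a": 0.6, "b": 0.2, "c": 0.2})
    print("single number to transmit:", code)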
2.1.4 Binary Coding
Binary coding is lossless, and is a necessary step in any coding system. The process of
binary coding is shown in Fig. 2.3.
Fig. 2.3 Binary coding model (a symbol a_i and the probability table p_i enter the binary encoder, which outputs the codeword c_i of bit length l_i)
Denote the bit rate produced by such a binary coding system as R. According to Fig.
2.3, we have
    R = \sum_{a_i \in A} p(a_i) \, l(a_i)    (2.9)
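The sketch below ties equations (2.8) and (2.9) together for a made-up probability table: for this dyadic source, a Huffman-style code achieves an average rate R equal to the entropy H.

    import math

    probs   = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # hypothetical p(a_i)
    lengths = {"a": 1,   "b": 2,    "c": 3,     "d": 3}       # hypothetical lengths l(a_i)

    R = sum(probs[s] * lengths[s] for s in probs)              # equation (2.9)
    H = -sum(p * math.log2(p) for p in probs.values())         # equation (2.8)
    print("average rate R =", R, "bits/symbol; entropy H =", H, "bits/symbol")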
2.2 Video Compression Using MEMC
Unlike still image compression, video compression attempts to exploit the temporal
redundancy. There are two types of coding categorized according to the type of
redundancy being exploited, i.e., intraframe coding and interframe coding. In
intraframe coding, each frame is coded separately using still image compression
methods such as transform coding, while interframe coding uses spatial redundancies
and motion compensation to exploit temporal redundancy of the video sequence. This
is done by predicting a new frame from its previous frame, thus the original frame to
code is reduced to the prediction error or residual frame [15]. We do this because
prediction errors have smaller energy than the original pixel values and therefore can
be coded with fewer bits. Those regions with high motion or scene changes will be
coded directly using transform coding. A video compression system is evaluated using
three criteria: reconstruction quality, compression rate and complexity.
The method used to predict a frame from its previous one is called Motion Estimation
(ME) or Motion Compensation (MC) [16] [17]. MC uses the motion vectors to
eliminate or reduce the effects of motion, while ME computes motion vectors that carry
the displacement information of a moving object. The two terms are often referred to
together as MEMC.
Fig. 2.4 Block matching motion estimation (a block in the actual frame is matched against candidate prediction blocks in the reference frame; the displacement of the best match is the motion vector)
MEMC is normally done independently at the macro block (MB) level (16×16 pixels) in
order to reduce computational complexity; this is called the Block Matching Algorithm.
In the Block Matching Algorithm (Fig. 2.4), a video frame is divided into macro
blocks. Each pixel within the block is assumed to have the same amount of
translational motion. Motion estimation is achieved by doing block matching between
a block in the current frame and a similar matching block within a search window in
the reference frame. A two-dimensional displacement vector or motion vector (MV) is
then obtained by finding the displaced co-ordinate of the match block to the reference
frame. The best prediction is found by minimizing a matching criterion such as the
Sum of Absolute Difference (SAD). SAD is defined as:
    SAD = \sum_{x=1}^{M} \sum_{y=1}^{N} | B_{i,j}(x, y) - B_{i-u,j-v}(x, y) |    (2.10)

where B_{i,j}(x, y) represents the pixel with coordinate (x, y) in an M×N block from the
current frame at spatial location (i, j), while B_{i-u,j-v}(x, y) represents the pixel with
coordinate (x, y) in the candidate matching block from the reference frame at spatial
location (i, j) displaced by the vector (u, v).
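A minimal full-search implementation of this matching criterion (an illustrative sketch on synthetic data, not the MEMC module used later in this thesis) could look like:

    import numpy as np

    def best_motion_vector(cur, ref, i, j, block=16, search=7):
        # Full-search block matching: minimize the SAD of equation (2.10)
        # over all displacements (u, v) within a +/- search window.
        b = cur[i:i + block, j:j + block].astype(np.int32)
        best, best_sad = (0, 0), np.inf
        for u in range(-search, search + 1):
            for v in range(-search, search + 1):
                r, c = i - u, j - v        # candidate block B_{i-u, j-v}
                if r < 0 or c < 0 or r + block > ref.shape[0] or c + block > ref.shape[1]:
                    continue
                cand = ref[r:r + block, c:c + block].astype(np.int32)
                sad = int(np.abs(b - cand).sum())
                if sad < best_sad:
                    best_sad, best = sad, (u, v)
        return best, best_sad

    rng = np.random.default_rng(1)
    ref = rng.integers(0, 256, (64, 64), dtype=np.uint8)
    cur = np.roll(ref, shift=(2, -3), axis=(0, 1))   # frame with a known global motion
    print(best_motion_vector(cur, ref, 16, 16))       # recovers the displacement (2, -3)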
2.3 Wavelet Based Image and Video Coding
This section provides a brief overview of wavelet based image and video coding [18]-[22].
The Discrete Wavelet Transform (DWT) is introduced and the subband coding
schemes, including the Embedded Zerotree Wavelet (EZW) and the Set Partitioning in
Hierarchical Trees (SPIHT), are discussed in detail. In the last sub-section, the concept of
scalability is introduced.
2.3.1 Discrete Wavelet Transform
The Discrete Wavelet Transform (DWT) is an invertible linear transform that
decomposes a signal into a set of orthogonal functional basis called wavelets. The
fundamental idea behind DWT is to present each frequency component as a resolution
matched to its scale, so that a signal can be analyzed at various levels of scales or
resolutions. In the field of image and video coding, DWT performs decomposition of
video frames or residual frames into a multi-resolution subband representation.
We denote the wavelet basis as
    \phi_{j,k}(x) = 2^{-j/2} \, \phi(2^{-j} x - k)    (2.11)

where the integer variables j and k are the scale and location indices indicating the
wavelet's width and position, respectively. They are used to scale or "dilate" the mother
function φ(x) to generate the wavelets.
The DWT transform pair is then defined as

    f(x) = \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} c_{j,k} \, \phi_{j,k}(x)    (2.12)

    c_{j,k} = \int_{-\infty}^{\infty} \phi_{j,k}^{*}(x) \, f(x) \, dx    (2.13)

where f(x) is the signal to be decomposed, and c_{j,k} is the wavelet coefficient. To span
the data domain at different resolutions, we use equation (2.14):

    W(x) = \sum_{k=-1}^{N-2} (-1)^{k} c_{k+1} \, \phi(2x + k)    (2.14)

W(x) is called the scaling function for the mother function φ(x).
Fig. 2.5 1-D DWT decomposition (the input vector is passed through low-pass (L) and high-pass (H) filters, each followed by down-sampling by 2; the outputs a_j feed the next decomposition stage and the outputs c_j are wavelet coefficients)
Fig 2.6 Dyadic DWT decomposition of an image (the rows of the input image are filtered and down-sampled first, then the columns, producing the LL, LH, HL and HH subbands)
In real applications, the DWT is often performed on a vector whose length is an integer
power of 2. As Fig. 2.5 shows, the process of 1-D DWT computation comprises
series of filtering and sub-sampling operations. H and L denote high and low-pass
filters respectively, ↓ 2 denotes down-sampling by a factor of 2. Elements aj are passed
on to the next step of the DWT and elements cj are the final wavelet coefficients
obtained from the DWT. The 1-D DWT can be extended to 2-D for image and video
processing. In this case, filtering and sub-sampling are first performed along all the
rows of the image and then all the columns. 2-D DWT is called dyadic DWT. 1-level
dyadic DWT results in four different resolution subbands, namely the LL, LH, HL and
the HH subbands. The decomposition process is shown in Fig. 2.6. The LL subband
contains the low frequency image and can be further decomposed by 2-level or 3-level
dyadic DWT. Fig. 2.7 depicts the subbands of an image decomposed using a 3-level
dyadic DWT. Fig. 2.8 shows the Barbara image after 2- level decomposition.
Fig 2.7 Subbands after 3-level dyadic wavelet decomposition (LL, HL3, LH3, HH3, HL2, LH2, HH2, HL1, LH1, HH1)
Fig. 2.8 2-level DWT decomposed Barbara image
The advantage of the DWT is its versatile time-frequency localization. This is
because the DWT has shorter basis functions for higher frequencies and longer basis
functions for lower frequencies. The DWT also has an important advantage over the
traditional Fourier Transform in that it can analyze signals containing discontinuities
and sharp spikes.
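As a concrete sketch of one dyadic decomposition level (using the simple Haar filters rather than the longer wavelet filters a practical codec would use), the row and column filtering of Fig. 2.6 can be written as:

    import numpy as np

    def haar_dwt2(img):
        # One level of dyadic DWT with Haar filters: filter and down-sample the rows,
        # then the columns, giving four subbands (the LL/LH/HL/HH naming here follows
        # one common convention; some texts swap LH and HL).
        img = img.astype(np.float64)
        lo = (img[:, 0::2] + img[:, 1::2]) / np.sqrt(2)   # low-pass along the rows
        hi = (img[:, 0::2] - img[:, 1::2]) / np.sqrt(2)   # high-pass along the rows
        LL = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)
        LH = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)
        HL = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
        HH = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
        return LL, LH, HL, HH

    rng = np.random.default_rng(2)
    frame = rng.integers(0, 256, (8, 8))
    LL, LH, HL, HH = haar_dwt2(frame)
    print(LL.shape, LH.shape, HL.shape, HH.shape)   # each subband is 4x4

Feeding the LL subband back into the same routine yields the 2- and 3-level decompositions of Fig. 2.7.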
2.3.2 EZW Coding Scheme
Good energy compaction property has attracted huge research interest on DWT based
image and video coding schemes. The main challenge of wavelet-based coding is to
achieve an efficient structure to quantize and code the wavelet coefficients in the
transform domain. Lewis and Knowles defined a spatial orientation tree (SOT)
structure [23] - [27] and Shapiro then made use of the SOT concept and introduced the
Embedded Zerotree Wavelet (EZW) encoder [28] in 1993. The idea was further improved
by Said and Pearlman, who modified the EZW SOT structure. Their new structure is
called Set Partitioning in Hierarchical Trees (SPIHT). A brief discussion on the EZW
scheme is provided in this section and a detailed description on SPIHT is provided in
the next section.
Shapiro’s EZW coder contains 4 key steps:
i) the discrete wavelet transform;
ii) subband coding using the EZW SOT structure (Fig. 2.9);
iii) entropy coded successive-approximation quantization;
iv) adaptive arithmetic coding.
A zerotree is actually a SOT which has no significant coefficients with respect to a
given threshold. For simplicity, the image in Fig. 2.9 is transformed using a 2-level
DWT. However, in most situations, a 3-level DWT is applied to ensure better
reconstruction quality. As shown in Fig. 2.9, the image is divided into 7 subbands
after the 2-level wavelet transform. Each node in the lowest subband has 3 children
nodes, one in each of the neighboring subbands. Its children, in turn, each have 4
children nodes which reside in the same spatial location of the corresponding higher
subband. Thus, all the nodes are linked in SOTs; a search through these trees is then
performed so that the significant coefficients are found and coded with higher
priority. The core of the EZW encoder,
series of decreasing thresholds representing the current bit plane, ordered bit plane
coding of refinement bits and exploitation of the correlation across subbands in the
transform domain.
Fig. 2.9 Spatial Orientation Tree for EZW
The EZW coding scheme has proved competitive in performance with virtually all known
compression techniques, while still generating a fully embedded bit stream. It utilizes
both bit plane coding and the zerotree concept.
2.3.3 SPIHT Coding Scheme
Said and Pearlman's SPIHT coder is an enhancement of the EZW coder. Basically, SPIHT is
also a sorting algorithm which tries to code wavelet coefficients according to priority
defined by their significance with respect to a certain threshold. This is achieved by
tracking down the SPIHT SOT and comparing the coefficients against the given
threshold. SPIHT scheme inherits the basic concepts of the EZW, except that it uses a
modified SOT called SPIHT SOT (Fig. 2.10).
Fig. 2.10 Spatial Orientation Tree of SPIHT
The SPIHT SOT structure is designed according to the observation that if a coefficient
magnitude in a certain node of a SOT does not exceed a given threshold, it is very
likely that none of the nodes in the same location in the higher subbands will exceed
that threshold. The SPIHT SOT naturally defines this spatial relationship using a
hierarchical pyramid. Each node is identified by the coordinate of the pixel and its
magnitude is the corresponding absolute value of that pixel. As Fig. 2.10 shows, each
node has either no or 4 offspring, which are located at the same spatial orientation in
the next finer level of the pyramid. The 4 offspring always form 2 × 2 adjacent pixel
groups. The nodes in the lowest subband of the image or the highest level of the
pyramid will be the roots of the SOT. There is a slight difference in the offspring
branching rule for the tree roots, i.e., in each 2 × 2 group, the upper left node will be
childless. Thus, the wavelet coefficients are organized in hierarchical trees with nodes
in common orientation across all subbands linked in one same SOT. This will allow us
to predict a coefficient’s significance according to the magnitude of its parent node
later.
We use the symbols in Said and Pearlman’s paper to denote the coordinates and the
sets:
• O(i,j) denotes the set of coordinates of all offspring of node (i,j);
• D(i,j) denotes the set of coordinates of all descendants of node (i,j);
• H denotes all nodes in the lowest subband, inclusive of the childless nodes;
• L(i,j) denotes the set of coordinates of all non-direct descendants of node (i,j), i.e., L(i,j) = D(i,j) - O(i,j).
Now we are able to express the SOT descendants branching rule by equation (2.15):

    O(i,j) = {(2i, 2j), (2i+1, 2j), (2i, 2j+1), (2i+1, 2j+1)}    (2.15)
After the sets are defined, the set partitioning rule is used to create new partitions in
order to effectively predict and code significant nodes. A magnitude test is performed
on each partitioned subset to determine its significance. If significant, the subset will
be further partitioned into new subsets, and the magnitude test will again be applied to
the new subsets until each individual significant coefficient is identified. Note that an
individual coefficient is significant when it is larger than the current threshold and
insignificant otherwise. To make a significant set, at least one descendant must be
significant on an individual basis. We denote the transformed coefficients as c_{i,j} and a
pixel set as τ, and use the following function to define the relationship between
magnitude comparisons and message bits:

    S(\tau) = \begin{cases} 1, & \text{if } \max_{(i,j) \in \tau} |c_{i,j}| \ge 2^n \\ 0, & \text{otherwise} \end{cases}    (2.16)
The set partitioning rule is then defined as follows:
i) The initial partition is formed with the sets {(i,j)} and D(i,j), for all (i,j) ∈ H;
ii) If D(i,j) is significant, then it is partitioned into L(i,j) plus the four single-element sets {(k,l)} with (k,l) ∈ O(i,j);
iii) If L(i,j) is significant, then it is partitioned into the four sets D(k,l) with (k,l) ∈ O(i,j).
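The two primitives this rule relies on, the offspring set of equation (2.15) and the significance test of equation (2.16), can be sketched as follows (an illustration only; as noted above, the branching rule for tree roots is slightly different, and the real coder works on actual wavelet coefficients):

    import numpy as np

    def offspring(i, j):
        # O(i,j) of equation (2.15): the 2x2 group of children in the next finer level.
        return [(2 * i, 2 * j), (2 * i + 1, 2 * j), (2 * i, 2 * j + 1), (2 * i + 1, 2 * j + 1)]

    def significant(coeffs, nodes, n):
        # S(tau) of equation (2.16): 1 if any coefficient in the set reaches bit plane n.
        return int(max(abs(coeffs[i, j]) for (i, j) in nodes) >= 2 ** n)

    coeffs = np.array([[34, 2, 0, 1],
                       [ 5, 3, 1, 0],
                       [ 1, 0, 2, 1],
                       [ 0, 1, 0, 3]])
    n = int(np.floor(np.log2(np.max(np.abs(coeffs)))))   # initial bit plane, here n = 5
    print("node (0,0) significant at bit plane", n, ":", significant(coeffs, [(0, 0)], n))
    print("O(1,1) =", offspring(1, 1),
          "significant:", significant(coeffs, offspring(1, 1), n))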
Following the SOT structure and the set partitioning rule, an image that has large
coefficients at the SOT roots and zero or very small coefficients in higher level of the
SOTs will need very little sorting and partitioning of the pixel sets. This property
reduces the computational complexity greatly and allows for a better reconstruction of
the image.
In implementation, the SPIHT coding algorithm uses 3 ordered lists to store the
significance information, i.e.:
i) list of insignificant pixels (LIP)
ii) list of significant pixels (LSP)
iii) list of insignificant sets (LIS)
In case of LIP and LSP, the coordinates of the pixels will be stored in the list. In case
of LIS, however, the list contains two types of coordinates categorized according to
which set it represents. If an entry represents the set D(i,j), we say it is a type A entry;
if it represents L(i,j), we say it is a type B entry.
To initialize the coding algorithm, the maximum coefficient in the image is identified
and the initial bit plane is assigned the value n = ⌊log2(max_(i,j){|c_i,j|})⌋. The threshold
value is then obtained by computing 2^n. Also, the LIS and LIP lists are initialized with
pixel coordinates in the highest subband. The set partitioning rule is then applied to the
LIP and LIS lists to judge the significance status of the pixels or sets. This is called the
sorting pass. Thereafter, the refinement pass goes through the LSP list to code the bits
necessary to enhance the precision of the significant coefficients from the previous
sorting passes by one bit position. Thus, coding under the first bit plane is completed. To
continue, the bit plane is decreased by 1 and the sorting and refinement passes are
re-executed in the next iteration. This process is repeated until the bit plane is reduced to
zero or a user given bit budget runs out. Fig. 2.11 summarizes this coding algorithm.
Note that in step 2.2), the entries added to the end of the LIS list are evaluated before
that same sorting pass ends. So, step 2.2) not only sorts the originally initialized entries,
but also sorts the entries that are added to the LIS list during the pass.
The SPIHT coder improves the performance of the EZW coder by 0.3-0.6 dB. This
gain is mostly due to the fact that the original zerotree algorithms allow special
symbols only for single zerotrees, while there are often other sets of zeros in reality. In
particular, the SPIHT coder provides symbols for combinations of parallel zerotrees.
Moreover, SPIHT produces a fully embedded bit stream whose bit rate can be precisely
controlled. The SPIHT coder is fast and has low computational complexity.
Both EZW and SPIHT are subband coding schemes, and both exploit the
correlation between subbands through the SOT.
1) Initialization:
   1.1) Output n = ⌊log2(max_(i,j){|c_i,j|})⌋;
   1.2) Set the LSP, LIP and LIS as empty lists.
        Add coordinates (i,j) ∈ H to the LIP and those with descendants
        to the LIS as TYPE A entries.
2) Sorting Pass:
   2.1) For each entry (i,j) in LIP do:
        2.1.1) Output S(i,j);
        2.1.2) If S(i,j) = 1 then
               - Move (i,j) to LSP;
               - Output sign of c_i,j;
   2.2) For each entry (i,j) in LIS do:
        2.2.1) If the entry is TYPE A then
               - Output S(D(i,j));
               - If S(D(i,j)) = 1 then
                 + For each offspring (k,l) of (i,j) do:
                   - Output S(k,l);
                   - If S(k,l) = 1 then add (k,l) to LSP
                     and output sign of c_k,l;
                   - If S(k,l) = 0 then add (k,l) to end of LIP;
                 + If L(i,j) ≠ ∅ then move (i,j) to end of LIS as
                   TYPE B and go to step 2.2.2);
                   otherwise remove (i,j) from LIS;
        2.2.2) If the entry is TYPE B then
               - Output S(L(i,j));
               - If S(L(i,j)) = 1 then
                 + Add each element in L(i,j) to end of LIS as TYPE A;
                 + Remove (i,j) from LIS;
3) Refinement Pass:
   For each entry (i,j) in LSP, except those from the last sorting pass:
   - Output the nth most significant bit of c_i,j;
4) Quantization-Step Update:
   Decrease n by 1 and go to step 2.
Fig. 2.11 SPIHT coding algorithm
2.3.4 Scalability
One advantage of the SPIHT image and video coder is the bit rate scalability.
Scalability is the degree to which video and image formats can be sized in systematic
proportions for distribution over communication channels of varying capacities [29]. In
other words, it measures how flexible an encoded bit stream is. Scalable image and
video coding has received considerable attention from the research community due to
the diversity of the communication networks and network users.
There are three basic types of scalability, and they refine video quality along three
different dimensions, i.e.:
• Temporal scalability or temporal resolution/frame rate
• Spatial scalability or spatial resolution
• SNR scalability or amplitude resolution
Each type of scalable coding provides scalability of one dimension of the video
sequence. Multiple types of scalability can be combined to provide scalability along
multiple dimensions. In real applications, being temporally scalable often means
supporting different frame rates, while spatial and SNR scalability mean video of
different spatial resolutions and visual qualities, respectively.
One common method of providing scalability is to apply subband decomposition on
the video sequences. Thus, the full resolution video can be decoded using both the low
pass and high pass subbands, while half resolution video can be decoded using only the
low pass subband. The resulting half resolution video sequences can be passed through
further subband decomposition to create quarter resolution video, and so on. We will
use this concept in chapter 4.
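For instance, a half-resolution version of a frame can be obtained from the LL subband alone. The sketch below (a toy example with Haar filtering, not the layered codec of chapter 4) shows that, up to a scale factor, the LL subband is simply a 2×2 block-averaged version of the frame:

    import numpy as np

    rng = np.random.default_rng(3)
    frame = rng.integers(0, 256, (8, 8)).astype(float)

    # One Haar analysis step, rows then columns (as in section 2.3.1), keeping only LL.
    lo = (frame[:, 0::2] + frame[:, 1::2]) / np.sqrt(2)
    LL = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)

    # A half-resolution client decodes the LL subband alone: divided by 2 it equals
    # the 2x2 block averages of the original frame.
    half_res = LL / 2.0
    print(half_res.shape)                                  # (4, 4)
    print(np.allclose(half_res,
                      frame.reshape(4, 2, 4, 2).mean(axis=(1, 3))))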
2.4 Image and Video Coding Standards
As international organizations, ISO/IEC and ITU-T have been heavily involved in the
standardization of image, audio and video coding. Specifically, ISO/IEC focuses on
video storage, broadcast video and video streaming applications, while ITU-T caters to
real time video applications. Current video standards mainly comprise the ISO
MPEG family and the ITU-T H.26x family.
these standards and their applications. JPEG and JPEG 2000 are also listed as still
image coding standards for reference.
Standard   | Application                                                             | Bit Rate
JPEG       | Still image compression                                                 | Variable
JPEG 2000  | Improved still image compression                                        | Variable
MPEG-1     | Video on CD                                                             | 1.5 Mbps
MPEG-2     | Digital television, video on DVD                                        | 2-20 Mbps
MPEG-4     | Object-based coding, interactive video                                  | 28-1024 kbps
H.261      | Video conferencing over ISDN                                            | Variable
H.263      | Video conferencing over Internet and PSTN, wireless video conferencing  | >= 33 kbps
H.26L      | Improved video compression                                              | 10-100 kbps
Table 2.1 Image and video compression standards
CHAPTER 3
VIDEO STREAMING AND NETWORK QoS
In this chapter some fundamentals of video streaming are provided. Network
Quality of Service (QoS) is defined and two frameworks, i.e., the Integrated Services
(IntServ) and the Differentiated Services (DiffServ), are discussed in detail. Finally,
the principles of layered video streaming are presented.
3.1 Video Streaming Models
Unicast and multicast are the two models of video streaming. Unicast is the
communication between a single sender and a single receiver. As shown in Fig. 3.1,
the sender sends individual copies of video streams to each client even when some of
the clients require the same video resource. Unicast is also called point-to-point
communication because there is effectively a dedicated, non-shared connection from the
server to each client.
Fig. 3.1 Unicast video streaming (the server sends a separate copy of the encoded stream to each client)
By contrast, communication between a single sender and multiple receivers is called
multicast or point-to-multi-points communication. In multicast scenario (Fig. 3.2), the
sender sends only one copy of the required video over the network. It is then routed to
several destinations by the network switches or routers. A client receives the video
stream by tuning in to a multicast group in its neighborhood. When the clients belong to
multiple groups, the video is duplicated and branched at fork points, as shown at
router R1 (Fig. 3.2).
Fig. 3.2 Multicast video streaming (a single copy from the server is duplicated and branched at routers R1-R4 on its way to the clients)
3.2 Characteristics and Challenges of Video Streaming
Video streaming is a real time application. Unlike traditional data-oriented applications
such as email, ftp, and web browsing, video streaming applications are highly
delay-sensitive and need the data to arrive on time to be useful. As such, the service
requirements of video streaming applications differ significantly from those of
traditional data-oriented applications. To satisfy these requirements is a great challenge
under today’s Internet.
First of all, the best effort (BE) service that the current Internet provides is far from
sufficient for a real time application such as video streaming. Under the BE service model,
there is no guarantee on delay bound or loss rate. When the data load on the Internet is
heavy, delivery quality can become unacceptable. On the other hand, video streaming
requires timely and, to some extent, correct delivery. We must ensure that streaming
remains viable, with decreased quality, in times of congestion.
Second, client machines on the Internet normally vary significantly in their computing,
display and memory capabilities. In most cases, these heterogeneous clients will
require video of different qualities. It is obviously inefficient to deliver the same
video stream to all clients separately. Instead, streaming in response to the particular
requests of each individual client is desirable.
In conclusion, suitable coding and streaming strategies are needed to support
efficient real time video streaming. Scalable video coding and new network service
models that support network Quality of Service have been developed to address the
above challenges. We discuss QoS in detail in the next section.
3.3 Quality of Service
3.3.1 Definition of QoS
The current Internet provides one single class of service, the BE service. BE service
generally treats all data as one service class and gives no priority to any particular data
or user. This is not enough for new real time applications such as video streaming.
The Internet must be modified to provide more service options, which can, to some
extent, keep the service quality up to a level that has been previously agreed on
by the user and the network. This new service model is enabled by support for network
Quality of Service.
Quality of Service (QoS) is the ability of a network element, such as the video
streaming server/client and the switches/routers in the network, to have a certain level
of assurance that its traffic and service requirements can be satisfied [31] [32]. QoS
enables a network to deliver a data flow end-to-end with a guaranteed delay bound and
bit rate required by user applications.
There are generally two approaches to supporting QoS. One is the fine-grained approach,
which provides QoS to individual applications or flows; the other is the coarse-grained
approach, which aggregates traffic into classes and provides QoS to these classes. The
Internet Engineering Task Force (IETF) has developed QoS frameworks through both
approaches, namely the Integrated Services (IntServ) framework as an example of the
fine-grained approach, and the Differentiated Services (DiffServ) framework as an
example of the coarse-grained approach.
3.3.2 IntServ Framework
IntServ is a per-flow based QoS framework with dynamic resource reservation [31]
[32]. It provides QoS for specific flows by reserving resource requested by that flow
through signaling between the host and network routers using the Resource
Reservation Protocol (RSVP) [33]. As shown in Fig. 3.3, RSVP acts as a messenger
between a particular application flow and various network mechanisms. These
mechanisms are individual functional blocks that work together with the RSVP to
determine or assist in the reservation. The packet classifier classifies a packet into an
appropriate QoS class. Policy control is then used to examine the packet to see whether
it has administrative permission to make the requested reservation. For final
reservation success, however, admission control must also be passed to ensure that the
desired resources can be granted without affecting the QoS previously requested by
and admitted to other flows. Finally, the packet is scheduled to enter the network at a
proper time by the packet scheduler in the primary data forwarding path.
Fig. 3.3 IntServ architecture (the application on the host and each router implement resource reservation together with policy control, admission control, a packet classifier and a packet scheduler; routing resides in the routers)
The IntServ architecture adds two service classes to the existing BE model, i.e.,
guaranteed service and controlled load service.
The basic concept of guaranteed service can be described using a linear flow regulator
called the leaky bucket regulator (Fig. 3.4). Suppose the bucket can hold b tokens and
new tokens are added at a rate of r tokens/sec. Packets arrive at the regulator at a
variable rate, but they must wait at the input queue until an equivalent number of
tokens is available before they can proceed into the network. Such a regulator therefore
allows flows with a maximum burst of b tokens and an average rate of r tokens/sec to
pass; it confines the traffic entering the network to at most b + rt tokens over any
interval of t seconds.
Fig. 3.4 Leaky bucket regulator (packets arriving at a variable rate wait in the input queue; tokens fill a bucket of depth b at r tokens/sec, and a token is removed for each packet admitted to the network)
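A minimal sketch of such a regulator (with hypothetical parameter values, purely to illustrate the b + rt bound) is shown below:

    class TokenBucket:
        # Toy token-bucket regulator: bucket depth b tokens, refill rate r tokens/sec.
        # Over any interval of t seconds at most b + r*t tokens' worth of data passes.
        def __init__(self, b, r):
            self.capacity = b
            self.rate = r
            self.tokens = b
            self.last = 0.0

        def admit(self, size, now):
            # Add the tokens earned since the last call, capped at the bucket depth.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if size <= self.tokens:        # enough tokens: the packet may proceed
                self.tokens -= size
                return True
            return False                    # otherwise it waits in the input queue

    bucket = TokenBucket(b=10, r=5)         # hypothetical parameters
    print([bucket.admit(4, t) for t in (0.0, 0.1, 0.2, 1.0)])   # [True, True, False, True]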
To invoke the service, a router needs to be informed of the traffic and the reservation
characteristics, denoted by Tspec and Rspec respectively.
Tspec contains the following parameters:
• p = peak rate of flow (bytes/s)
• b = bucket depth (bytes)
• r = token bucket rate (bytes/s)
• m = minimum policed unit (bytes)
• M = maximum datagram size (bytes)
Rspec contains the following parameters:
• R = bandwidth, i.e., service rate (bytes/s)
• S = slack term (ms)
Guaranteed service promises a maximum delay for a flow, provided that the flow
conforms to its specified traffic parameters. This service model aims to support
applications with hard real time requirements.
Unlike guaranteed service, controlled-load service provides no rigid delay or loss
guarantees. Instead, it provides a QoS similar to BE service in an under-utilized
network, with almost no loss or delay. When the network is overloaded, it tries to share
the bandwidth among multiple streams in a controlled way so as to maintain
approximately the same level of QoS. Controlled-load service is intended to support
applications that can tolerate a reasonable amount of delay and loss.
3.3.3 DiffServ Framework
IntServ provides fine-grained QoS guarantees using the Tspec message. However,
introducing Tspec for each flow may be too expensive in implementation. Besides,
incremental deployment is only possible for controlled-load service, while it is difficult
to realize guaranteed service across the network. Therefore, there is a need for more
flexible service models to allow for more qualitative definitions of service distinctions.
The solution is DiffServ, which aims to develop an architecture for providing scalable and
flexible service differentiation.
Generally, the DiffServ architecture comprises 4 key concepts:
• DiffServ Domain;
• Per Hop Behaviors (PHB) for forwarding;
• Packet classification;
• Traffic conditioning, including metering, marking, shaping and policing.
DiffServ exploits edge-core distinction for scalability. As shown in Fig. 3.5, packet
classification and traffic conditioning are done at the edge routers, where packets are
marked in their Differentiated Services field (DS field) to identify the behavior aggregate
they belong to. At the core routers, forwarding is done very quickly according to the
PHB associated with the DS mark.
Fig. 3.5 An example of the DiffServ network (leaf routers at the edge of the domain, core routers inside it)
It is necessary that we clearly define PHB here for further understanding of the
DiffServ framework. A PHB is the description of the externally observable forwarding
behavior of a DiffServ node applied to a particular DiffServ behavior aggregate [34].
To date, the IETF has defined 2 PHBs, i.e., the Expedited Forwarding (EF) PHB
and the Assured Forwarding (AF) PHB. The EF PHB promises a service like a "virtual
leased line" or "premium service". The AF PHB provides several class categories, with
each class allocated a different level of forwarding assurance. These assurances in turn
guarantee that a packet will be forwarded within the timeframe previously agreed on.
Fig. 3.6 DiffServ inter-domain operations (traffic flows from the source through a leaf router, the ingress router, the core routers and the egress router of the DiffServ domain to the destination; a Bandwidth Broker (BB) manages each domain)
Between different domains, Bandwidth Brokers (BB) are used to negotiate service
agreements (Fig. 3.6). The leaf router polices and marks incoming flows, and the
egress edge router shapes aggregates. In the DiffServ domain, the ingress edge router
is used to classify, police and mark the aggregates. BB performs admission control,
manages network resources and configures leaf and edge routers.
3.4 Layered Video Streaming
As discussed above, the current Internet is unreliable, which contradicts the QoS
requirements of video streaming applications. In addition, the heterogeneity of
receivers makes it difficult to achieve efficiency and flexibility in multicast video
streaming. To address these problems, scalable video streaming [35]-[37] is introduced.
Scalable video streaming comprises two related parts: scalable video coding and
adaptive QoS support from the network and end systems.
One approach to accomplish scalability is to split and distribute the video information
over a number of layers, including a base layer and several enhancement layers. This
technique is referred to as layered video coding [38]-[44]. Layered video coding
produces multiple bit streams that can be progressively decoded. It is especially
suitable for receiver-driven layered multicast, where the clients consciously select the
layers they need from a group of common stream layers, and combine them into
different bit streams for the decoder to produce video of different quality.
Layered scalable video coding was first suggested by Ghanbari [45] and has since been
widely studied. Early groundwork includes Taubman and Zakhor's multi-rate 3-D
subband coding [46], and Vishwanath and Chou's combined wavelet and Hierarchical
Vector Quantization (HVQ) coder [47] for interactive multicast network scenarios.
Khansari et al. modified the H.261 (p×64) standard algorithm and proposed an
approach to encoding video data into two priority streams, thereby enabling the
transmission of video data over wireless links to be switched between two bit rates
[48]. McCanne et al. developed a hybrid DCT/wavelet-based video coder and
introduced the concept of Receiver-driven Layered Multicast (RLM) into the literature
[49]. MPEG-4 based layered video coding has also been studied, such as the MPEG-4
video coder in [50], which uses a prediction-based base layer and a fine-granular
enhancement layer.
Layered scalable video coding makes it possible to represent the raw video in a layered
format, which can be progressively transmitted and decoded. Fig. 3.7 is an illustration
of such a layered codec.
Fig. 3.7 Principle of a layered codec (the layered encoder produces several layer streams; each layered decoder combines the base layer with the enhancement layers it receives to refine the reconstructed video)
Network QoS support is another key mechanism in scalable video streaming.
Principles of QoS have been discussed in detail in the previous sections. Particularly,
layered scalable video streaming is used to realize a better allocation of bandwidth in
time of congestion. The basic idea of layered streaming is to encode raw video into
multiple layers of cumulative bit streams that can be separately transmitted and
progressively decoded. Packets from different layers will be marked with different
drop priorities in the network. Clients consciously select different combinations of
layers to receive a video of the quality they prefer. Layered multicast streaming will be
discussed in detail in chapter 4.
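Because the layers are cumulative, an enhancement layer is only useful if all lower layers have arrived; the small sketch below (hypothetical, not part of the codec) captures this rule:

    def usable_layers(received):
        # Cumulative layering: an enhancement layer can only be decoded if every
        # lower layer, starting from the base layer 0, has also been received.
        count = 0
        for layer in sorted(received):
            if layer != count:
                break
            count += 1
        return count

    # Layer 1 was dropped in the network, so layers 2 and 3 are useless to the client
    # and only the base layer can be decoded.
    print(usable_layers({0, 2, 3}))    # -> 1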
CHAPTER 4
LAYERED 3D-CSPIHT CODEC
In this chapter, we propose a layered multi-resolution scalable video codec based on
the 3-D Color Set Partitioning in Hierarchical Trees (3D-CSPIHT) scheme, called
layered 3D-CSPIHT. The original CSPIHT and 3D-CSPIHT video coder are
introduced and limitations with regard to layered multicast video streaming are
discussed before presenting the layered 3D-CSPIHT codec.
4.1 CSPIHT and 3D-CSPIHT Video Coder
To effectively code color images, SPIHT has been extended to Color SPIHT (CSPIHT).
Like the SPIHT scheme, the CSPIHT is essentially an algorithm for sorting wavelet
coefficients across subbands. The coefficients in the transform domain are linked by an
extended SOT structure that is then partitioned such that the coefficients are divided
into sets defined by the level of the most significant bit in a bit-plane representation of
their magnitudes [51]. Significant bits are coded with higher priority under certain bit
budget constraints, thus creating a rate controllable bit stream. In the luminance plane,
the SPIHT algorithm is used to define the SOT while in chrominance planes, the EZW
structure is adopted. To efficiently link the nodes in the luminance plane to those in the
chrominance planes and still produce an embedded bit stream, the childless nodes in
the luminance plane are linked to the root nodes in the chrominance planes. Fig. 4.1
depicts the SOT structure of the CSPIHT for image coding.
Fig. 4.1 CSPIHT SOT (2-D): the Y plane uses the SPIHT structure while the Cb and Cr planes use the EZW structure
To code video sequences, the CSPIHT integrates block motion estimation into the
coding system. The overall CSPIHT structure includes four functional blocks, namely,
the DWT, the block motion estimation scheme, the CSPIHT kernel and the
binary/arithmetic coding module (Fig. 4.2).
Fig. 4.2 CSPIHT video encoder (the intra-frame passes through the DWT, the CSPIHT coding kernel with rate-control feedback, and binary/arithmetic coding to produce the bit stream; inter-frames pass through block MEMC, the resulting error frames are coded in the same way, and the motion vectors are Huffman coded into the motion vector stream)
The incoming video sequences are separately coded with the first frame coded under
the intra-frame mode and the rest under the inter-frame mode. The intra-frame is
passed directly to the 2-D DWT for filtering. The transformed image first undergoes
encoding in the CSPIHT kernel and then goes through binary or arithmetic coding to
produce encoded bit streams. All the inter-frames, however, are passed into block
MEMC, resulting in a series of error frames and motion vectors. The error frames are
then coded following exactly the intra-frame coding procedure and the motion vectors
are coded using Huffman coding scheme to reduce statistical redundancy.
The transmitted bits are constantly counted and updated to the CSPIHT kernel via the
coding feedback (Fig. 4.2). The CSPIHT kernel compares this information against a user-given bit budget and stops the coding procedure when the budget is used up. This allows for precise rate control. If the bit budget is not exhausted by the time the bit plane reduces to zero, the coding halts automatically. The corresponding CSPIHT video decoder is
illustrated in Fig 4.3.
Fig. 4.3 CSPIHT video decoder (bit stream into binary/arithmetic decoding, inverse CSPIHT and inverse DWT; the motion vector stream is Huffman decoded and used for MEMC prediction to produce the video out)
The CSPIHT video coder has also been extended to 3-D for the compression of video
sequences. 3D-CSPIHT encoder takes a group of frames (GOF), which normally
consists of 16 frames, as input. 3-D DWT will then be applied, where we do temporal
filtering in addition to horizontal and vertical filtering at each stage of the wavelet
decomposition. Finally, the wavelet coefficients will be linked using a Spatial
Temporal Orientation Tree (STOT) (Fig. 4.4) and coded in the 3D-CSPIHT kernel.
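To illustrate the separable temporal plus spatial filtering described above, the following is a toy one-level Haar sketch in Python (assuming a NumPy array for the GOF; it is not the actual filter bank or implementation used in this work):

    import numpy as np

    def haar_1d(x, axis):
        # One level of Haar analysis along one axis: low (average) and high (difference) bands
        even = x.take(np.arange(0, x.shape[axis], 2), axis=axis)
        odd = x.take(np.arange(1, x.shape[axis], 2), axis=axis)
        return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

    def dwt3d_one_level(gof):
        # One stage of the 3-D DWT: filter along time (axis 0), rows (axis 1) and columns (axis 2)
        bands = [gof]
        for axis in (0, 1, 2):
            bands = [b for s in bands for b in haar_1d(s, axis)]
        return bands  # eight subbands; the first is the lowest (LLL) subband

    gof = np.random.rand(16, 144, 176)  # a 16-frame QCIF GOF (one color plane)
    print([b.shape for b in dwt3d_one_level(gof)])  # eight subbands of shape (8, 72, 88)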
Fig. 4.4 3D-CSPIHT STOT
The 3-D STOT is obtained by a straightforward extension of the 2D-CSPIHT SOT to
3-D. The roots of the luminance STOT consist of 2×2×2 cubes in the lowest subband
where there is one childless node. This childless node is then linked as in the CSPIHT
to the roots of the chrominance STOTs.
Denoting the coordinates of each coefficient by (i,j,k), the previously defined SPIHT
set notation is extended as follows:
• O(i,j,k) represents the set containing all the offspring of node (i,j,k);
• D(i,j,k) represents the set containing all the descendants of node (i,j,k);
• H represents the set of all the STOT roots;
• L(i,j,k) = D(i,j,k) − O(i,j,k).
We can then express the 3-D descendant branching rule using equation (4.1):

O(i,j,k) = { (2i,2j,2k), (2i,2j+1,2k), (2i,2j,2k+1), (2i,2j+1,2k+1), (2i+1,2j,2k), (2i+1,2j+1,2k), (2i+1,2j,2k+1), (2i+1,2j+1,2k+1) }    (4.1)
The 3D-CSPIHT set partitioning rule is defined as follows:
i) The initial partition is formed with sets {(i,j,k)} and D(i,j,k), for all (i,j,k) ∈ H;
ii) If D(i,j,k) is significant, then it is partitioned into L(i,j,k) plus the eight single-element sets {(l,m,n)} with (l,m,n) ∈ O(i,j,k);
iii) If L(i,j,k) is significant, then it is partitioned into the eight sets D(l,m,n) with (l,m,n) ∈ O(i,j,k).
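As a small illustration of the branching rule in equation (4.1), and of how a set D(i,j,k) splits into its offspring in rules ii) and iii), the following Python sketch (an illustrative helper, not part of the original implementation) generates the offspring coordinates of a node:

    def offspring(i, j, k):
        # O(i,j,k): the eight children of node (i,j,k), listed in the order of equation (4.1)
        return [(2*i + di, 2*j + dj, 2*k + dk)
                for di in (0, 1) for dk in (0, 1) for dj in (0, 1)]

    # Example: the eight offspring of node (1, 1, 1)
    print(offspring(1, 1, 1))
    # [(2, 2, 2), (2, 3, 2), (2, 2, 3), (2, 3, 3), (3, 2, 2), (3, 3, 2), (3, 2, 3), (3, 3, 3)]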
Following the above extensions, the 3D-CSPIHT encoder (Fig. 4.5) and decoder (Fig.
4.6) are very similar to the CSPIHT encoder and decoder.
Fig. 4.5 3D-CSPIHT video encoder (video in, 3-D DWT, 3-D CSPIHT coding kernel, binary/arithmetic coding, bit stream; rate control is driven by a feedback)
Fig. 4.6 3D-CSPIHT video decoder (bit stream in, binary/arithmetic decoding, 3-D CSPIHT decoding kernel, inverse 3-D DWT, video out)
4.2 Limitations of the Original 3D-CSPIHT Codec
The 3D-CSPIHT video codec provides satisfactory results for color image and video coding in terms of PSNR, but there are some limitations which prevent it from being directly incorporated into layered video streaming systems.
First of all, the original 3D-CSPIHT aims at precise rate control and sorting of
coefficients terminates when a user defined bit budget is used up. The decoder must
know the same bit budget information to decode a particular stream. For video
reconstruction at a different bit-rate, the encoder needs to run again with the desired bit
budget. However, in layered streaming systems, it is not appropriate to control bit rate
by re-encoding. Instead, the encoder is expected to produce a fully coded video stream.
Video quality is controlled at the client side by subscribing to multicast groups that
carry the desirable layers.
Fig. 4.7 Confusion when decoding incomplete data using the original 3D-CSPIHT decoder (top: the stream at the server, GOF1 to GOF4 each with a block header and data; bottom: the stream at the decoder, where lost data in GOF2 turns all subsequent data into confused data)
Furthermore, as discussed in chapter 3, a layered bit stream is assumed for layered
video streaming. We achieve this in the layered 3D-CSPIHT codec by re-sorting the
wavelet coefficients with different priorities according to pre-defined resolution layers
[52]-[56].
Finally, since data loss is very possible over the network, the decoder must be able to
decode differently truncated versions as well as incomplete versions of the original
encoded stream (stream before network transmission). Fig. 4.7 illustrates the confusion
when decoding incomplete data under the original 3D-CSPIHT scheme. The first
stream in Fig. 4.7 is the encoded video data sent from the network server, and the
second is the one that arrives at the decoder. As shown, the loss of one packet in GOF2
renders the decoder unable to correctly decode all data that arrive after it. To overcome
this problem, additional flags are needed to enable the decoder to identify the
beginning of new layers and GOFs.
4.3 Layered 3D-CSPIHT Video Codec
In this section we present the Layered 3D-CSPIHT codec which overcomes the
limitations of the 3D-CSPIHT highlighted above so that it can be applied to a layered
video transmission system. New features of the layered codec and the layer ID
functions are presented. The production of the multi-resolution scalable bit stream is
then discussed, and how the layered codec functions in the network is explained.
Finally, in the last sub-section, the layered 3D-CSPIHT algorithm is provided.
4.3.1 Overview of New Features
Our main consideration is how to cooperate with the network elements and enable the
decoder to decode incomplete bit streams. We suppose the video streaming system is real-time and multicast (Fig. 4.8).
Fig. 4.8 Network scenario considered for design of the layered codec (the sender/encoder feeds a server that multicasts layers 1 to 7; heterogeneous receivers such as a PDA, a laptop PC and a desktop PC each run a receiver, decoder and player and subscribe to different sets of layers)
Layered multicast enables receiver-based subscription of multicast groups that carry
incrementally decodable video layers. In a layered multicast system, the server sends
only one video stream and the receivers/clients can choose different resolutions, sizes,
and frame rates for display, resulting in different bit rates. As shown in Fig 4.8, the
encoder is executed offline. The resulting stream is separated into layers according to
resolution and stored separately so that they can be sent over different multicast groups.
The server must make various decisions, including the number of layers to be sent and the layers to be discarded if the bandwidth is not adequate for all layers. The server is
able to do this because it has information about the network status including the
available bandwidth and the congestion level. Heterogeneous clients subscribe to the
layers they want based on the capacity of the client machines and user requests. Users
may not always want to see the best quality video even if they could because it takes
more time and costs more. In Fig. 4.8 for example, the PDA client only subscribes to
the first layer of each GOF, the laptop PC client subscribes to the first two layers while
the powerful desktop PC client subscribes to all seven layers. Under such a network
scenario, we need a codec which provides layered and progressively decodable streams,
so that the encoding can be executed offline and the clients are able to control their
own video qualities from one copy of fully encoded bit stream. Also, to accommodate
possible loss during network delivery, the decoder needs to know which layer the
incoming data belong to.
Due to the above considerations, we extended the original 3D-CSPIHT codec so that it
can be used in layered video streaming systems by removing the bit budget parameter
so that all coefficients are coded and by incorporating the following new features:
• A new sorting algorithm that produces resolution/frame rate scalable bit streams in a layered format;
• A specially designed layer ID in the encoded bit stream that identifies the layer that a particular data packet belongs to.
4.3.2 Layer IDs
The layer ID is designed to act as a layer identifier, which tells the beginning positions
of new layers. It must be unique and result in minimal overhead for any combination
of video data. Synchronization bits consisting of k consecutive ‘1’s, i.e., ‘1111…11’
are introduced as the layer ID at the beginning of each layer.
Fig. 4.9 The bit stream after the layer ID is added (a header is followed by the k-bit layer ID '11...1', the layer 1 data, another layer ID, the layer 2 data, and so on; within the data, a protecting '0' is added after every k-1 consecutive '1's)
To make the ID unique, occurrences of k consecutive ‘1’s in the video data stream are
extended by inserting a ‘0’ bit or layer ID protecting bit after k -1 ‘1’s so that the
sequence becomes ‘1111…101’. If the data bit after k-1 consecutive ‘1’s is ‘0’, an
additional ‘0’ will still be added to protect the original ‘0’ from being removed at the
decoder (Fig. 4.9). Once the video stream is received at the decoder, layer ID
protecting bits are removed from occurrences of ‘1111…10’ while conducting normal
3D-CSPIHT decoding. A good value of k is one that results in the smallest overhead: if k is too large, the layer ID itself becomes a large overhead, while if it is too small, many protecting '0's have to be inserted. From simulations that examined the resulting compression ratios for different reasonable values of k, k=8 was found to be a good choice.
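A minimal sketch of this layer ID protecting step is given below (the stream is treated as a list of bits and the function names are illustrative; this is not the thesis implementation):

    K = 8  # length of the layer ID '11111111'

    def stuff(data_bits):
        # Insert a protecting '0' after every K-1 consecutive '1's in the data
        out, run = [], 0
        for b in data_bits:
            out.append(b)
            run = run + 1 if b == 1 else 0
            if run == K - 1:
                out.append(0)  # keep the K-bit layer ID unique
                run = 0
        return out

    def unstuff(data_bits):
        # Remove the protecting '0' that follows every K-1 consecutive '1's
        out, run, skip = [], 0, False
        for b in data_bits:
            if skip:               # this is the inserted protecting bit; drop it
                skip, run = False, 0
                continue
            out.append(b)
            run = run + 1 if b == 1 else 0
            if run == K - 1:
                skip = True
        return out

    bits = [1] * 10 + [0, 1, 0]
    assert unstuff(stuff(bits)) == bits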
Different layers use the same ID by maintaining an ID counter that counts the number
of times the layer ID is captured. For example, the decoder knows that the successive
data belongs to layer 3 when it detects the layer ID for the third time. ID counter is
reset to zero as soon as it reaches the maximum number of layers. Our layered codec
has 7 layers as it uses 3-level spatial and temporal wavelet transform.
In a congestion adaptive network, the network elements can consciously select less
important data to discard when congestion occurs. In our proposed codec, different
layers have different priorities according to their resolution. The Layer ID is essential
as it enables network elements and decoders to identify the beginning position of a
new layer or the boundary between two layers. Knowing the layer boundaries not only
enables decoding of video streams that have experienced packet loss, but also helps QoS
marking when unicast streaming is needed.
4.3.3 Production of Multi-Resolution Scalable Bit Streams
As stated in chapter 3, layered video streaming systems demand a layered video codec.
This is achieved in the layered 3D-CSPIHT video coder by re-arranging the wavelet
coefficients to produce multi-resolution scalable bit streams. The encoded bit stream is
divided into progressively decodable layers according to their resolution levels.
Fig. 4.10 depicts the relationship of the subbands and the layers for one GOF. The total
of 22 subbands are divided into 7 layers that are coded in the order of layer 1, layer 2,
up to layer 7. Subbands of the same pattern are aggregated as one layer. Fig. 4.11
shows how the layers are sent over different multicast groups. As more layers are
added on, the video quality will improve in both spatial and temporal resolution. Table
4.1 shows the 7 resolution options which provide different combinations of spatial
resolutions and frame rates.
To re-sort the transform coefficients according to layer priorities, the significance
criteria of the original 3D-CSPIHT needs to be restricted. In the original 3D-CSPIHT,
there are four types of significance criteria:
Fig. 4.10 Resolution layers in the layered 3D-CSPIHT (the subbands of one GOF grouped into layers 1 to 7)

Fig. 4.11 Progressively transmitted and decoded layers (layers 1 to 7 of each GOF are sent over multicast groups 1 to 7)

Layers            Spatial resolution    Frame rate
1                 Low                   Low
1+2               Medium                Low
1+3               Low                   Medium
1+2+3+4           Medium                Medium
1+2+3+4+5         High                  Medium
1+2+3+4+6         Medium                High
All 7 layers      High                  High
Table 4.1 Resolution options
i) A node in the LIP list is significant if its magnitude is larger than the current threshold;
ii) A type A entry is significant if one of its descendants is larger in magnitude than the current threshold;
iii) A type A offspring is significant if its magnitude is larger than the current threshold;
iv) A type B entry is significant if one of its indirect or non-offspring descendants is larger in magnitude than the current threshold.
In the layered 3D-CSPIHT, a set entry is significant when it satisfies the above criteria
and when it is in an effective subband, i.e., a subband that is currently being coded.
However, the significance criteria for individual nodes remain the same as in the
original 3D-CSPIHT, because previous subband effectiveness checks on the LIS sets
have prevented any non-effective node from being selected as future significant node
candidates.
We now compare the original 3D-CSPIHT and the layered 3D-CSPIHT sorting on an
example video frame to derive the extended significance criteria. For simplicity, we
suppose that the maximum bit plane is 2.
Fig. 4.12 (a) shows a decomposed video frame using a 3-level wavelet transform. The DWT divides the 16 × 16 image into seven subbands, which comprise three layers, shown in different colors. Except for the childless root A, the roots B, C and D and their descendants are organized into three SOTs according to the 3D-CSPIHT descendant branching rule (Fig. 4.12 (b)).
Fig. 4.12 (a) An example video frame after the DWT transform (a 3-level decomposition with roots A, B, C and D in the lowest subband; legend: A* is the childless node, separate markers indicate nodes significant at bit plane 2 and at bit plane 1, {D11} denotes D111, D112, D113, D114, and the colors indicate layer 1, layer 2 and layer 3)

Fig. 4.12 (b) SOT for Fig. 4.12 (a) (root B with offspring B1 to B4, their offspring B11 to B44, and the leaf sets {B11} to {B44})
At initialization of the original 3D-CSPIHT algorithm, nodes from the lowest subband
(i.e., A, B, C and D) are added to the LIP list and those with descendants (i.e., B, C, D)
are added to the LIS list as type A entries. The bit plane is set to 2. Sorting then begins
with significance checks on the LIP nodes. As assumed in Fig. 4.12 (a), only node B is
significant at bit plane 2, therefore B is moved to the LSP list. The LIP sorting then
terminates because no other significant nodes are present.
In the LIS sorting, we first determine the significance of type A entries B, C and D.
There are significant descendants for all of them, therefore their offspring B1, B2 up to
D4 are coded and moved to the LIP or LSP list accordingly. Meanwhile, nodes B, C
and D are moved to the end of the LIS list as type B entries. The processing of LIS
entries continues with significance checks on these newly added type B entries (i.e.,
nodes B, C and D). As entry D has no significant non-offspring descendants, it remains
in the list, while B and C are removed and their offspring are added to the end of the
list as new type A entries. These entries are processed recursively as above until no
significant entry is present in the LIS list. Table 4.2 shows the initial state and the final
state of the LIP and LIS sorting at bit plane 2. We underline an entry to indicate that it
is a type B entry, and use ‘~’ to substitute nodes or entries in the previous column of
the same list.
              Initialization      After LIP sorting      After LIS sorting
LIP           ABCD                ACD                    ~B3C1C2C3D1D2D3B21B23B24B42C42C44
LIS           BCD                 ~                      DB3C1C2C3B14B21B23B24B42C42C43C44
LSP           Φ                   B                      BB1B2B4C4D4B11B12B13B14B22B41B43B44C41C43{B11}{B12}{B13}{B22}{B41}{B43}{B44}{C41}
Table 4.2 LIP, LIS, LSP state after sorting at bit plane 2 (original 3D-CSPIHT)
To continue the sorting, bit plane is reduced by 1 and the same process is carried out in
the LIP and LIS lists. Sorting terminates when the bit plane is reduced to 0. The sorting
results at bit plane 1 and 0 are shown in Table 4.3 and Table 4.4.
              After LIP sorting                          After LIS sorting
LIP           DB3C3D2D3B23C44                            ~C13
LIS           DB3C1C2C3B14B21B23B24B42C42C43C44          DB3C3B14B23C42C43C44C12C13
LSP           ~ACC1C2D1B21B24B42C42                      ~C11C12C14C21C22C23C24{B21}{B24}{B42}{C11}{C14}{C21}{C22}{C23}{C24}
Table 4.3 LIP, LIS, LSP state after sorting at bit plane 1 (original 3D-CSPIHT)
              After LIP sorting                          After LIS sorting
LIP           Φ                                          Φ
LIS           DB3C1C2C3B14B21B23B24B42C42C43C44          Φ
LSP           ~DB3C3D2D3B23C44                           ~B31-B44C31-C44{B14}{B23}{C42}{C44}{C12}{C13}D11-D44{B31}-{B34}{C31}-{C34}{D11}-{D44}
Table 4.4 LIP, LIS, LSP state after sorting at bit plane 0 (original 3D-CSPIHT)
There is no layering in the original 3D-CSPIHT algorithm; therefore, the above
described sorting process does not conduct checks of subband effectiveness. In the
layered 3D-CSPIHT, however, subband effectiveness checks are necessary to confine
sorting within an effective subband.
At initialization, layer 1 is set as the effective layer. In the LIP sorting, only node B has
significant descendants, and the descendants are in the currently effective layer (layer
1). Therefore, node B is moved to the LSP list. The LIS sorting then begins by
examining type A entries (i.e., B, C and D) in the LIS list. Each of them has significant
descendants in layer 1, so they are coded as in the original 3D-CSPIHT. When
examination of type B entries is conducted, however, it is found that although
significant non-offspring descendants are present for entry B and C, none of them
resides in the current effective layer. Therefore, the final result of LIS sorting at bit
plane 2 is different from that of the original 3D-CSPIHT. (Table 4.5)
              Initialization      After LIP sorting      After LIS sorting
LIP           ABCD                ACD                    ACDB3C1C2C3D1D2D3
LIS           BCD                 ~                      BCD
LSP           Φ                   B                      BB1B2B4C4D4
Table 4.5 LIP, LIS, LSP state after sorting at bit plane 2 (Layered 3D-CSPIHT, layer 1 effective)
The bit plane is then reduced by 1 and the sorting continues. Significant LIP nodes are
now found effective and moved to the LSP list as in the original 3D-CSPIHT, while
significant LIS nodes remains non-effective, resulting in no change in the LIS list
(Table 4.6). At bit plane 0, all LIP nodes and no LIS nodes are identified as significant
(Table 4.7).
              After LIP sorting              After LIS sorting
LIP           DB3C3D2D3                      ~
LIS           BCD                            ~
LSP           BB1B2B4C4D4ACC1C2D1            ~
Table 4.6 LIP, LIS, LSP state after sorting at bit plane 1 (Layered 3D-CSPIHT, layer 1 effective)
              After LIP sorting                       After LIS sorting
LIP           Φ                                       Φ
LIS           BCD                                     ~
LSP           BB1B2B4C4D4ACC1C2D1DB3C3D2D3            ~
Table 4.7 LIP, LIS, LSP state after sorting at bit plane 0 (Layered 3D-CSPIHT, layer 1 effective)
              Initialization                      After LIP sorting     After LIS sorting
LIP           Φ                                   Φ                     B21B23B24B42C42C44
LIS           BCD                                 ~                     DB3C1C2C3B1B2B4C4
LSP           BB1B2B4C4D4ACC1C2D1DB3C3D2D3        ~                     ~B11B12B13B14B22B41B43B44C41C43
Table 4.8 LIP, LIS, LSP state after sorting at bit plane 2 (Layered 3D-CSPIHT, layer 2 effective)
The effective layer is then updated to layer 2 and the bit plane is reset to 2. Those non-effective descendants of entry B and C become effective, resulting in the coding of all
significant coefficients in layer 2 (Table 4.8). Coding results of the LIP and LIS sorting
at bit plane 1 and 0 are shown in Table 4.9 and Table 4.10.
              After LIP sorting           After LIS sorting
LIP           B23C44                      B23C44C13
LIS           DB3C1C2C3B1B2B4C4           DB3C3B1B2B4C4C1C2
LSP           ~B21B24B42C42               ~C11C12C14C21C22C23C24
Table 4.9 LIP, LIS, LSP state after sorting at bit plane 1 (Layered 3D-CSPIHT, layer 2 effective)
              After LIP sorting           After LIS sorting
LIP           Φ                           Φ
LIS           DB3C3B1B2B4C4C1C2           ~
LSP           ~B23C44C13                  ~
Table 4.10 LIP, LIS, LSP state after sorting at bit plane 0 (Layered 3D-CSPIHT, layer 2 effective)
From the above analysis, it is clear that any node in a non-effective layer cannot enter the LIP list as a future candidate for significance, because of the previous subband effectiveness checks on the LIS entries. Consequently, subband effectiveness checks are necessary only when determining the significance of type A and type B entries, i.e.:
• A type A LIS entry is significant if at least one of its effective descendants is larger in magnitude than the current threshold;
• A type B LIS entry is significant if at least one of its effective non-offspring descendants is larger in magnitude than the current threshold.
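A toy sketch of these restricted tests is given below (the magnitudes and layer assignments are made up for illustration; the real codec derives them from the wavelet coefficients and subband layout of a GOF):

    # Each candidate descendant is represented as (magnitude, resolution layer).
    descendants_of_entry = [(3, 1), (12, 2), (40, 3)]

    def set_significant(descendants, bit_plane, effective_layer):
        # Layered test: a descendant must exceed the threshold AND lie in the
        # currently effective layer (the original test uses magnitude only).
        threshold = 1 << bit_plane
        return any(mag >= threshold and layer == effective_layer
                   for mag, layer in descendants)

    print(set_significant(descendants_of_entry, 5, effective_layer=1))  # False
    print(set_significant(descendants_of_entry, 5, effective_layer=3))  # True (40 >= 32)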
4.3.4 How the Codec Functions in the Network
The layered 3D-CSPIHT solves the problem mentioned in section 4.2 (Fig. 4.7). Fig.
4.13 is a detailed version of the first stream in Fig. 4.7. Each GOF is re-sorted and
separated into 7 resolution layers. Layer ID is slotted between every two layers at the
encoding stage. As layer ID is designed to be a unique binary code, the decoder can
easily identify it while ‘reading’ the stream. When a packet in layer 6 is lost (dark area
in Fig. 4.13), the decoder will stop decoding the current layer (layer 6) on detecting the
subsequent layer ID. Thus, the confusion in Fig. 4.7 is avoided and correct decoding of
the subsequent layers after the lost packet is realized. If the lost packet is not the last
packet in a layer, the decoder will have to wait until the next layer ID before it can
conduct correct decoding.
In the network, block headers and layer IDs should be marked with the lowest drop
precedence. That is to say, correspondent QoS support from the network is expected to
ensure that layer ID is safely transmitted. The layered 3D-CSPIHT codec relies on the
layer ID to support decoding when corruption or loss of the encoded bit stream occurs.
Therefore, layer ID itself must be protected from corruption or loss. As stated in
chapter 3, a layered video streaming system comprises two related parts: a layered codec and network QoS support. The focus of the layered 3D-CSPIHT codec is to
provide the required layered codec to work with the network providing QoS support. It
is reasonable to make the assumption that the layer ID will be safely transmitted when
certain QoS is guaranteed from the network.
Fig. 4.13 Bit stream structure of the layered 3D-CSPIHT coder (each GOF begins with a block header and contains layers 1 to 7, each preceded by a layer ID; shaded data indicates a lost packet)
Fig. 4.14 is a flowchart of the layered 3D-CSPIHT decoder. Before the normal 3D-CSPIHT sorting is carried out, the incoming stream is inspected to detect the presence of layer ID sequences. Any data between two layer IDs is considered to be in the same layer as the first layer ID. Layer serial numbers are stored in a variable called id_cnt. The decoder switches to the layer identified by the id_cnt value upon detection of each layer ID sequence. When id_cnt is 0, it is assumed that the successive stream belongs to layer 7.
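This switching logic can be sketched as follows (a simplified illustration: the stream is modelled as already-parsed chunks, process_layer is a placeholder for the actual 3D-CSPIHT decoding of one layer, and resetting the counter at each block header is an assumption of this sketch):

    MAX_LAYERS = 7

    def dispatch(chunks, process_layer):
        # Route each data chunk to the layer identified by the running id_cnt counter
        id_cnt = 0
        for kind, payload in chunks:
            if kind == 'block_header':      # start of a new GOF (assumed to reset the counter)
                id_cnt = 0
            elif kind == 'layer_id':        # start of a new layer within the GOF
                id_cnt += 1
                if id_cnt == MAX_LAYERS:    # reset as soon as the maximum number of layers is reached
                    id_cnt = 0
            elif kind == 'data':
                process_layer(MAX_LAYERS if id_cnt == 0 else id_cnt, payload)

    # A GOF whose later packets were lost still decodes the layers that did arrive.
    stream = [('block_header', None), ('layer_id', None), ('data', 'bits of layer 1'),
              ('layer_id', None), ('data', 'bits of layer 2')]
    dispatch(stream, lambda layer, d: print('decoding layer', layer, ':', d))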
When unicast streaming is required, layer IDs also act as signaling to the network routers or switches. Unlike multicast streaming, unicast streaming does not provide multiple channels for data from different layers, and it is very possible that different layers get mixed up during transmission. For example, a layer 1 packet appears no different from a layer 7 packet to a router in the network. On the other hand, in layered streaming algorithms, the network relies on packet marking to provide QoS. Packets from different resolution layers will be marked with different priorities since
they contribute differently to the reconstruction. Hence, it is highly desirable for
different layers in the encoded bit stream to be easily identified by network routers.
Fig. 4.14 Flowchart of the layered decoder algorithm (read in the stream; a block header marks the start of a GOF and a layer ID marks the start of a new layer within one GOF; on each layer ID, id_cnt is incremented, layer [id_cnt] is processed, layer 7 is processed when id_cnt is 0, and id_cnt is reset to 0 once it reaches 7)
In the original 3D-CSPIHT coder, the encoding is transparent to the network. In other
words, once encoded, network elements (e.g., servers, routers and clients) are not able
to know how a particular chunk of bits will contribute to reconstruction of the
compressed video. In the layered 3D-CSPIHT bit stream, the layer IDs are used to
inform the network routers which layer the data being currently processed belong to.
By doing this, the router is able to drop packets according to their layer priorities when
the network is congested.
4.3.5 Layered 3D-CSPIHT Algorithm
Our layered 3D-CSPIHT algorithm is similar to the original 3D-CSPIHT algorithm
except for the following:
i) coefficients are re-sorted by redefining the criterion for a node to be significant;
ii) the layer ID is inserted in the encoded bit stream between consecutive layers;
iii) additional zeros are inserted to protect the uniqueness of the layer ID.
A bit-counter is used to keep track of the number of ‘1’ bits. At initialization stage, the
bit-counter is reset to zero and subbands belonging to layer 1 are marked as effective
subbands. Next, the layered 3D-CSPIHT sorting pass is conducted. Nodes in the LIP
list are coded as in the original 3D-CSPIHT algorithm, while subband effectiveness is
checked when judging significance of entries in the LIS list. As explained in section
4.3.3, subband effectiveness checks are not necessary in the LIP list because the check
done in the LIS list prevents nodes from non-effective subbands from entering the LIP.
A special step, called layer ID protecting, is carried out during the sorting pass in the
layered 3D-CSPIHT whenever a ‘1’ is output to the encoded bit stream. In layer ID
protecting, we increment the bit-counter by 1. When the bit-counter reaches k-1, a '0' will
be added to the encoded bit stream to prevent the occurrence of ‘1111…11’. Also,
layer effectiveness must be updated to the next layer, and layer ID must be written to
the encoded bit stream at end of coding each layer. The entire layered 3D-CSPIHT
algorithm is listed in Fig. 4.15.
1) Initialization:
1.1) Output n = ⌊ log2 ( max_(i,j,k) { |c_(i,j,k)| } ) ⌋;
1.2) Set subbands belonging to layer 1 as effective subbands;
1.3) Set bit-counter to 0;
1.4) Set the LSP, LIP and LIS as empty lists and add coordinates (i,j,k) in the first
subband to the LIP and those with descendants to the LIS as TYPE A entries.
2) Sorting Pass:
2.1) For each entry (i,j,k) in LIP do:
2.1.1) Check for significance (two conditions must both be satisfied);
2.1.2) If significant then
-Output ONE and execute step (4);
-Output sign of ci,j,k;
+If positive, execute step (4);
-Move (i,j,k) to LSP (i.e., add to LSP and remove from LIP);
If insignificant then
-Output ZERO and move to next node in LIP.
2.2) For each entry (i,j,k) in LIS do:
2.2.1) If entry is TYPE A then
2.2.1.1) Output significance of descendents;
2.2.1.2) If one of the descendents is significant then
-Output ONE;
-For each offspring (k,l,m) of (i,j,k) do
+Output significance;
+If significant then
/Output ONE;
/Encode sign bit, if positive, execute step (4);
/Add (k,l,m) to LSP;
+If insignificant, then
/Output ZERO;
/Add (k,l,m) to end of LIP;
-If further descendents are present, then
+Move (i,j,k) to end of LIS as TYPE B entry;
+Remove (i,j,k) from LIS;
+Go to step (2.2.2);
2.2.1.3) If no descendant is significant then
-Output ZERO and move to next node in LIS;
2.2.2) If the entry is TYPE B then
2.2.2.1) Output significance;
2.2.2.2) If significant then
-Output ONE and execute step (4);
-Add all offspring of (i,j,k) to end of LIS as TYPE A ;
-Remove (i,j,k) from LIS;
2.2.2.3) If insignificant then
-Output ZERO and move to next node in LIS;
3) Refinement Pass:
For each entry (i,j,k) in LSP, except those from the last sorting pass:
-Output the nth most significant bit of ci,j,k;
4) Layer ID Checking:
Increment bit-counter by 1;
If bit-counter is k-1 then
-Output ZERO and reset bit-counter to ZERO;
5) Quantization-Step Update:
Decrement n by 1 and go to step (2).
6) Layer Effectiveness Update and Layer ID Writing:
If n=0, update effective subbands to the next layer, write layer ID and go to step (1).
Fig. 4.15 Layered 3D-CSPIHT Algorithm
CHAPTER 5
PERFORMANCE DATA
In this chapter we present performance data of the layered 3D-CSPIHT video coder.
Performance measurements for image and video coding are introduced and the 3D-CSPIHT video coder is evaluated in terms of PSNR, encoding time, and compression
ratio.
5.1 Coding Performance Measurements
Image and video quality is often measured in terms of the Mean Square Error (MSE)
or the peak signal-to-noise ratio (PSNR). Suppose the total number of pixels in an
image is N. Denote the original value of a pixel by x_i and its reconstructed value by x'_i. The mean square error (MSE) is then defined as:

MSE = (1/N) Σ_{i=0}^{N−1} | x_i − x'_i |²    (5.1)
Distortion measured by the MSE does not necessarily represent the real quality of a coded image, so the peak signal-to-noise ratio (PSNR), which is defined in equation (5.2), is normally used instead.
PSNR = 10 log10 ( M² / MSE )  dB    (5.2)
where M is the maximum peak-to-peak value in the signal. For 8-bit images, M is
chosen to be 255. For an average PSNR on the luminance and chrominance channels,
equation (5.3) is used.
PSNR = 10 log10 ( 255² / ( (MSE(Y) + MSE(Cb) + MSE(Cr)) / 3 ) )  dB    (5.3)
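For reference, these measurements can be computed directly; a minimal NumPy sketch (assuming 8-bit frames stored as arrays) is:

    import numpy as np

    def mse(orig, recon):
        return np.mean((orig.astype(float) - recon.astype(float)) ** 2)

    def psnr(orig, recon, peak=255.0):
        return 10.0 * np.log10(peak ** 2 / mse(orig, recon))  # equation (5.2)

    def psnr_yuv(y, y2, cb, cb2, cr, cr2):
        # Average-MSE PSNR over the luminance and chrominance planes, equation (5.3)
        avg = (mse(y, y2) + mse(cb, cb2) + mse(cr, cr2)) / 3.0
        return 10.0 * np.log10(255.0 ** 2 / avg)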
5.2 PSNR Performance of the Layered 3D-CSPIHT Codec
In this section, we discuss the coding performance of the layered 3D-CSPIHT video
codec. Experiments are done with standard 4:1:1 color QCIF (176×144) video
sequences foreman, carphone, suzie, news, container, mother and akiyo at 10 frames
per second. All experiments are performed on Pentium IV 1.6GHz computers.
Fig. 5.1 shows frame by frame PSNR results of the foreman and the container
sequences at three different resolutions: the lowest resolution (resolution 1), the
medium resolution (resolution 2) and the highest resolution (resolution 3) in both
spatial and temporal dimension. Clearly, high resolution results in high PSNR.
Foreman at resolution 1 has an average PSNR of 26.12 dB in the luminance plane.
When 3 more layers (layer 2, 3 and 4) are coded, the PSNR improves by 0.38 to 8.81
dB. In full resolution coding, the resulting average PSNR can reach as high as 46.09
dB in the luminance plane. The average PSNR results of the luminance plane as well
as the chrominance planes on the foreman, news, container and suzie sequences are
given in Table 5.1 and a rate-distortion curve of the layered 3D-CSPIHT codec is
given in Fig. 5.2.
Fig. 5.1 Frame by frame PSNR(Y) results, in dB, on (a) foreman and (b) container sequences at 3 different resolutions (resolution 1, 2 and 3; horizontal axis: frame number, 0 to 300)
                        foreman   news    container   suzie
Resolution 1   Lum      26.21     21.92   21.62       30.31
               Cb       37.96     31.51   37.88       46.02
               Cr       37.58     37.06   36.30       44.88
Resolution 2   Lum      31.09     26.30   26.75       35.06
               Cb       41.78     37.57   43.68       50.32
               Cr       42.31     42.71   41.06       49.59
Resolution 3   Lum      46.09     48.42   51.39       52.09
               Cb       51.97     50.01   54.24       54.69
               Cr       50.97     52.18   54.58       54.32
Table 5.1 Average PSNR (dB) at 3 different resolutions
Fig. 5.2 Rate distortion curve of the layered 3D-CSPIHT codec: PSNR in dB for the Y, U and V planes against resolution options 1 to 7 (refer to Table 4.1 in chapter 4 for the resolution options)
We compare the performance of the layered 3D-CSPIHT codec and the original 3D-CSPIHT codec in terms of PSNR. Comparisons are done at the same bit rate and frame
rate. The layered codec is run at resolution 1, i.e., only layer 1 is coded. The bit rate
required to fully code layer 1 is computed and the original codec is run at this bit rate.
In our experiment the bit rate is 216580 bps.
Fig. 5.3 gives a frame by frame comparison of the original and the layered codec on
foreman sequence in terms of PSNR in the luminance and chrominance planes. In the
luminance plane, the original codec outperforms the layered codec significantly. This
is expected because confining the coding to the first resolution layer causes the coder
to miss significant coefficients in the higher resolution subbands. These coefficients
may be very large and discarding them can cause the PSNR to decrease significantly.
However, in the chrominance planes, the layered codec performs on par with or even
better than the original codec. Because chrominance nodes are normally smaller than luminance nodes, the effect of restricting the resolution is less significant.
Visually, the layered codec gives quite pleasant reconstruction with less brightness. Fig.
5.4, Fig. 5.5, Fig. 5.6 and Fig. 5.7 show frame 1, 58, 120 and 190 of the reconstructed
foreman sequence at resolution 1, 2 and 3 respectively. Significant improvements in
visual quality can be observed when more layers are added.
Fig. 5.3 PSNR (dB) comparison of the original and the layered codec in (a) luminance plane, (b) Cb plane and (c) Cr plane for the foreman sequence (solid line: original 3D-CSPIHT; '+++++': layered 3D-CSPIHT; horizontal axis: frame number)
Fig. 5.4 Frame 1 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original

Fig. 5.5 Frame 58 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original

Fig. 5.6 Frame 120 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original

Fig. 5.7 Frame 190 of foreman reconstructed at (a) resolution 1, (b) resolution 2, (c) resolution 3 and (d) original
Fig. 5.8 Comparison on carphone sequence: (a) layered codec at resolution 2 (30.06 dB), (b) 3D-CSPIHT at 312560 bps (35.67 dB), (c) 2D-CSPIHT at 312560 bps (34.86 dB), (d) original

Fig. 5.9 Comparison on akiyo sequence: (a) layered codec at resolution 2 (36.23 dB), (b) 3D-CSPIHT at 312560 bps (41.73 dB), (c) 2D-CSPIHT at 312560 bps (42.65 dB), (d) original
Fig. 5.8 and Fig. 5.9 give a visual comparison of the 2D-CSPIHT, 3D-CSPIHT and the
layered 3D-CSPIHT video coders. Frame 96 of the carphone sequence and frame 1 of
the akiyo sequence are shown. The two sequences are chosen to contrast high motion
sequence (carphone) with low motion sequence (akiyo). The 2D CSPIHT and 3D
CSPIHT codecs are run at the same bit rate as the layered codec, which is run at
resolution 2. The layered 3D-CSPIHT codec gives a PSNR of 30.06 dB and 36.23 dB
on the selected frames of the carphone and akiyo sequences respectively, which is
about 5 dB less than the original 2-D and 3-D codecs. Again, this is because of the loss of large coefficients due to the layer restriction. Although the PSNR of the layered codec is lower than that of the original 2-D and 3-D codecs, the visual quality is still pleasant. Despite decreased brightness, Fig. 5.8 (a) shows comparable
visual quality to Fig. 5.8 (b) and (c). The background, the eyes and the hair area are
clearly reconstructed. The mouth area shows better details. For the akiyo sequence,
however, the original 2-D and 3-D codecs give much sharper edges on the human subject. This is because of the loss of high-subband information in the layered codec on low-motion videos. On high-motion videos, the original 2-D and 3-D codecs are not obviously superior, because the effect of motion estimation and compensation on high-motion videos is greater.
As stated in previous chapters, the objective of the layered 3D-CSPIHT codec is to
support layered scalable video streaming. In a real network environment, the layered
codec is expected to perform much better than the original codec, due to its flexibility
and the ability of the decoder to work with incomplete data.
We demonstrate this by decoding using manually truncated bit streams of the foreman
sequence. Some bits in the encoded stream are cut off to produce incomplete video
streams (Fig. 5.10).
Fig. 5.10 Manually formed incomplete bit streams (a) and (b): each stream keeps its stream header and block header, while the shaded area is discarded
Fig. 5.11 shows the decoding results visually. Without quoting the actual PSNR values, we can see that the original codec has problems decoding bit streams that become incomplete due to network transmission. A typical observation is many artifacts overlapping the video. By frame 10, the bits are totally misinterpreted and the reconstruction is unacceptable. This is expected as there is no layer ID in the original
codec to assist in incomplete data decoding. On the other hand, the layered codec
performs very well when data loss occurs.
Fig. 5.11 Reconstruction of frames 1 (a)(b), 5 (c)(d) and 10 (e)(f) of the foreman sequence with the layered codec (left) and with the original codec (right)
5.3 Coding Time and Compression Ratio
Table 5.2 shows the encoding time of the original and the layered codec on four video
sequences: foreman, carphone, mother and suzie. As the layered codec adds special
markings to the encoded bit stream to support QoS implementation and incomplete
stream decoding, it takes more time in coding and produces a less compressed bit
stream. Experimental results also show that when coding 16, 64, 128 and 256 frames,
the original codec saves about 0.1, 0.5, 0.9 and 1.5 seconds on average. Also, the
layered codec has a lower compression ratio (1:729) than the original CSPIHT (1:825)
due to extra bits introduced by the layer ID. It is worth mentioning that the re-sorting
of wavelet coefficients in the transform domain incurs longer encoding time, too,
because more iterations are needed. Specifically, if the original codec requires n iterations to fully code a video, the layered codec will require n × l iterations, where l is the number of layers. That is to say, n iterations are needed for each layer in the layered 3D-CSPIHT codec.
Frames                  16      64      128     256
foreman    original     0.87    3.32    6.52    12.82
           layered      1.13    3.87    7.61    15.28
           ∆t           0.26    0.55    1.09    2.46
carphone   original     0.87    3.16    6.19    12.57
           layered      0.99    3.7     7.1     14.13
           ∆t           0.12    0.54    0.91    1.56
mother     original     0.88    3.21    6.34    12.55
           layered      0.97    3.69    7.24    14.23
           ∆t           0.09    0.48    0.9     1.68
suzie      original     0.87    3.24    6.23    12.07
           layered      1.02    3.75    7.24    13.18
           ∆t           0.15    0.51    1.01    1.11
Table 5.2 Encoding time (in seconds) of the original and layered codec
CHAPTER 6
CONCLUSIONS
In this thesis, we present a layered 3D-CSPIHT codec based on the CSPIHT and 3D-CSPIHT coders. The layered codec incorporates a new sorting algorithm that produces
resolution/frame rate scalable bit streams in a layered format. Moreover, it carries a
specially designed layer ID that identifies the layer that a particular data packet
belongs to. By doing so, layered scalable video streaming is supported. A unicast video
delivery system using the layered 3D-CSPIHT codec is also implemented.
As extensions to the SPIHT wavelet coding scheme, CSPIHT and 3D-CSPIHT codecs
give satisfactory PSNR performance. However, they are designed assuming ideal
network conditions and thus have limitations when applied to real network situations.
In a layered multicast video streaming system, the server sends only one video stream, which is normally fully coded. Receivers/clients obtain video of different resolutions, sizes or frame rates by subscribing to the corresponding multicast groups that carry progressively decodable stream layers. Each client then combines its received layers into a stream that provides the video quality it prefers. Under such a network scenario, the
encoder is expected to produce bit streams in a layered structure, and the decoder must
be able to decode different combined versions as well as incomplete versions of the
original encoded bit stream. The layered 3D-CSPIHT codec achieves this by re-sorting
the coefficients in the transform domain following a restricted significance criterion,
and slotting a flag, called layer ID, as a layer identifier to the decoder and the network
routers/switches.
In the layered 3D-CSPIHT codec, layers are defined according to resolutions or
subbands. Significance status in the original 3D-CSPIHT, which depends only on the
magnitude of a node or its descendants, is re-determined by checking a modified
significance criterion: an entry is significant when it is significant in the original 3D-CSPIHT and when it resides in a subband belonging to the currently effective layer, or
the layer that is being coded. Thus, coefficients in lower layers are coded with higher
priorities. Seven resolution options are provided in terms of different spatial
resolutions and frame rates.
To enable decoding of incomplete data, eight consecutive ‘1’s (11111111) are
introduced as layer ID at the beginning of each layer. A zero (0) is inserted in the data
after occurrence of every seven consecutive ‘1’s (1111111) to remove the ‘11111111’
sequences in the data section and protect the uniqueness of the layer ID. The same
sequence is used as layer ID for all layers by maintaining an ID counter to track the
number of times that the layer ID sequence is captured. The decoder switches to
process the kth layer upon detection of the layer ID for the kth time. Thus, when a
packet in the (k-1)th layer is lost, streams from the kth layer can still be decoded
correctly. In the network, the layer ID should be marked with the lowest drop
precedence to ensure safe delivery.
The layered 3D-CSPIHT codec is tested using both high motion and low motion
standard QCIF video sequences at 10 frames per second. It is compared against the
original 3D-CSPIHT and the 2D-CSPIHT video coder in terms of PSNR, encoding
time and compression ratio. In the luminance plane, the original 3D-CSPIHT and the
2D-CSPIHT outperform the layered 3D-CSPIHT significantly in PSNR results. While
in the chrominance planes, they give similar PSNR results. The layered 3D-CSPIHT
also costs more in time and provides less compressed bit streams, because of the
expense incurred by incorporating the layer ID.
In conclusion, the layered 3D-CSPIHT codec improves the original 3D-CSPIHT codec
in terms of network friendliness. It overcomes the limitations of the original 3D-CSPIHT codec for application to a layered multicast streaming system.
REFERENCES
[1]. S. Keshav, An Engineering Approach to Computer Networking, Addison Wesley, 1997.
[2]. Y.S. Gan and C.K. Tham, "Random Early Detection Assisted Layered Multicast", in "Managing IP Multimedia End-to-End", 5th IFIP/IEEE International Conference on Management of Multimedia Networks and Services, Oct. 2002, pp. 341-353, Santa Barbara, USA.
[3]. C.K. Tham, Y.S. Gan and Y. Jiang, "Congestion Adaptation and Layer Prioritization in a Multicast Scalable Video Delivery System", Proceedings of IEEE/EURASIP Packet Video 2003, Apr. 2003, Nantes, France.
[4]. Y.S. Gan and C.K. Tham, "Loss Differentiated Multicast Congestion Control", Computer Networks, vol. 41, no. 2, pp. 161-176, Feb. 2003.
[5]. A. Said and W.A. Pearlman, "A New, Fast, and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees", IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, pp. 243-250, Jun. 1996.
[6]. F.W. Wheeler and W.A. Pearlman, "SPIHT Image Compression without Lists", Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Jun. 2000, vol. 4, pp. 2047-2050.
[7]. W.S. Lee and A.A. Kassim, "Low Bit-Rate Video Coding Using Color Set Partitioning In Hierarchical Trees Scheme", International Conference on Communication Systems 2001, Nov. 2001, Singapore.
[8]. A.A. Kassim and W.S. Lee, "Performance of the Color Set Partitioning In Hierarchical Tree Scheme (C-SPIHT) in Video Coding", Circuits, Systems and Signal Processing, vol. 20, pp. 253-270, 2001.
[9]. A.A. Kassim and W.S. Lee, "Embedded Color Image Coding Using SPIHT with Partial Linked Spatial Orientation Trees", IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 203-206, Feb. 2003.
[10]. G.D. Karlsson and M. Vetterli, "Three Dimensional subband coding of video", International Conference on Acoustics, Speech and Signal Processing 1998, pp. 1100-1103.
[11]. A.A. Kassim, E.H. Tan and W.S. Lee, "3D Wavelet Video Codec based on Color Set Partitioning in Hierarchical Trees (CSPIHT)", submitted to IEEE Transactions on Circuits and Systems for Video Technology.
[12]. Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing (Third Edition), pp. 317-372, Addison Wesley, 1992.
[13]. Simon Haykin, Neural Networks: A Comprehensive Foundation (Second Edition), pp. 392-437, Prentice Hall, 1998.
[14]. K.R. Castleman, Digital Image Processing, Prentice Hall, 1996.
[15]. R.J. Clarke, Digital Compression of Still Images and Video, pp. 245-274, Academic Press, 1995.
[16]. J.R. Ohm, "Three-Dimensional Subband Coding with Motion Compensation", IEEE Transactions on Image Processing, vol. 3, pp. 559-571, Sep. 1994.
[17]. S.J. Choi and J.W. Woods, "Motion-Compensated 3-D Subband Coding of Video", IEEE Transactions on Image Processing, vol. 8, pp. 155-167, Feb. 1999.
[18]. B.J. Kim et al., "Low Bit-Rate Scalable Video Coding with 3-D Set Partitioning in Hierarchical Trees (3-D SPIHT)", IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, pp. 1374-1386, Dec. 2000.
[19]. J.Y. Tham, S. Ranganath, and A.A. Kassim, "Highly Scalable Wavelet-based Video Codec for Very Low Bit Rate Environment", IEEE Journal on Selected Areas in Communications – Special Issue on Very Low Bit-rate Video Coding, vol. 16, pp. 12-27, Jan. 1998.
[20]. G.D. Karlsson and M. Vetterli, "Three Dimensional subband coding of video", International Conference on Acoustics, Speech and Signal Processing 1998, pp. 1100-1103.
[21]. C.I. Podilchuk, N.S. Jayant and N. Farvardin, "Three-dimensional subband coding of video", IEEE Transactions on Image Processing, vol. 4, pp. 125-139, Feb. 1995.
[22]. D. Taubman and A. Zakhor, "Multirate 3-D Subband Coding of Video", IEEE Transactions on Image Processing, vol. 3, pp. 572-588, Sep. 1994.
[23]. A.S. Lewis and G. Knowles, "Image compression using the 2-D wavelet transform", IEEE Transactions on Image Processing, vol. 1, pp. 244-250, Feb. 1992.
[24]. S.H. Man and F. Kossentini, "Robust EZW Image Coding for Noisy Channels", IEEE Signal Processing Letters, vol. 4, pp. 227-229, Aug. 1997.
[25]. C.D. Creusere, "A New Method of Robust Image Compression Based on the Embedded Zerotree Wavelet Algorithm", IEEE Transactions on Image Processing, vol. 6, pp. 1436-1442, Oct. 1997.
[26]. J.K. Rogers and P.C. Cosman, "Robust Wavelet Zerotree Image Compression with Fixed-Length Packetization", Proceedings of Data Compression Conference, 1998, Snowbird, UT, pp. 418-427.
[27]. P.C. Cosman, J.K. Rogers, P.G. Sherwood and K. Zeger, "Combined Forward Error Control Packetized Zerotree Wavelet Encoding for Transmission of Images over Varying Channels", IEEE Transactions on Image Processing, vol. 9, pp. 982-993, Jun. 2000.
[28]. J.M. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelets", IEEE Transactions on Signal Processing, vol. 41, pp. 3445-3462, Dec. 1993.
[29]. D. Wu, T. Hou and Y.Q. Zhang, "Scalable Video Coding and Transport over Broad-Band Wireless Networks", invited paper, Proceedings of the IEEE, Special Issue on Multi-Dimensional Broadband Wireless Technologies and Applications, vol. 89, pp. 6-20, Jan. 2001.
[30]. A. Tamhankar and K.R. Rao, "An Overview of H.264/MPEG-4 Part 10", 4th EURASIP Conference focused on Video/Image Processing and Multimedia Communications, Jul. 2003, vol. 1, pp. 1-51.
[31]. Z. Wang, Internet QoS: Architectures and Mechanisms for Quality of Service (First Edition), Morgan Kaufmann, 2001.
[32]. P. Ferguson and G. Huston, Quality of Service: Delivering QoS on the Internet and in Corporate Networks (First Edition), John Wiley & Sons, 1998.
[33]. http://www.ietf.org/rfc.html, RFC 2205.
[34]. http://www.ietf.org/rfc.html, RFC 2597, 2598, 3246, 3247.
[35]. Y. Wang, J. Ostermann, and Y.Q. Zhang, Video Processing and Communications, Prentice Hall, 2001.
[36]. E.C. Reed and F. Dufaux, "Constrained Bit-Rate Control for Very Low Bit-Rate Streaming-Video Applications", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 882-889, Jul. 2001.
[37]. C.W. Yap and K.N. Ngan, "Error Resilient Transmission of SPIHT Coded Images over Fading Channels", IEE Proceedings of Vision, Image and Signal Processing, Feb. 2001, vol. 148, issue 1, pp. 59-64.
[38]. C.K. Tham et al., "Layered Coding for a Scalable Video Delivery System", Proceedings of IEEE/EURASIP Packet Video 2003, Apr. 2003, Nantes, France.
[39]. W. Feng, A.A. Kassim and C.K. Tham, "Layered Self-Identifiable and Scalable Video Codec for Delivery to Heterogeneous Receivers", Proceedings of SPIE Visual Communications and Image Processing, Jul. 2003, Lugano, Switzerland.
[40]. J.R. Ohm, "Advanced Packet Video Coding Based on Layered VQ and SBC Techniques", IEEE Transactions on Circuits and Systems for Video Technology, vol. 3, pp. 208-221, Jun. 1993.
[41]. M. Mrak, M. Grgic and S. Grgic, "Scalable Video Coding in Network Applications", 4th EURASIP-IEEE Region 8 International Symposium on Video/Image Processing and Multimedia Communications, Jun. 2002, pp. 205-211.
[42]. S. Han and B. Girod, "Robust and Efficient Scalable Video Coding with Leaky Prediction", Proceedings of IEEE International Conference on Image Processing, Sep. 2002, vol. 2, pp. II-41-II-44.
[43]. H. Danyali and A. Mertins, "Highly Scalable Image Compression Based on SPIHT for Network Applications", Proceedings of IEEE International Conference on Image Processing, Sep. 2002, vol. 1, pp. I-217-I-220.
[44]. J.Y. Lee, H.S. Oh and S.J. Ko, "Motion-Compensated Layered Video Coding for Playback Scalability", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 5, pp. 619-629, May 2001.
[45]. M. Ghanbari, "Two-layer coding of video signals for VBR networks", IEEE Journal on Selected Areas in Communications, vol. 7, pp. 771-781, Jun. 1989.
[46]. D. Taubman and A. Zakhor, "Multi-rate 3-D subband coding of video", IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 572-588, Sep. 1994.
[47]. M. Vishwanath and P. Chou, "An efficient algorithm for hierarchical compression of video", Proceedings of the IEEE International Conference on Image Processing, Texas, USA, Nov. 1994.
[48]. M. Khansari, A. Zakauddin, W.Y. Chan, E. Dubois and P. Mermelstein, "Approaches to layered coding for dual-rate wireless video transmission", Proceedings of the IEEE International Conference on Image Processing, Texas, USA, Nov. 1994.
[49]. S.R. McCanne, M. Vetterli and V. Jacobson, "Low-complexity video coding for receiver-driven layered multicast", IEEE Journal on Selected Areas in Communications, vol. 15, no. 6, pp. 983-1001, Aug. 1997.
[50]. H. Radha, Y. Chen, K. Parthasarathy and R. Cohen, "Scalable Internet Video Using MPEG-4", Signal Processing: Image Communication, vol. 15, no. 1-2, pp. 95-126, Sep. 1999.
[51]. T. Kim, S.K. Choi, R.E. Van Dyck and N.K. Bose, "Classified Zerotree Wavelet Image Coding and Adaptive Packetization for Low-Bit-Rate Transport", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 1022-1034, Sep. 2001.
[52]. J.W. Woods and G. Lilienfield, "A Resolution and Frame-Rate Scalable Subband/Wavelet Video Coder", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 1035-1044, Sep. 2001.
[53]. A. Puri et al., "Temporal Resolution Scalable Video Coding", Proceedings of IEEE International Conference on Image Processing, Nov. 1994, vol. 2, pp. 947-951.
[54]. D. Taubman and A. Zakhor, "Rate- and Resolution-Scalable Video and Image Compression with Subband Coding", The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, Nov. 1993, vol. 2, pp. 1489-1493.
[55]. G.J. Conklin and S.S. Hemami, "Evaluation of Temporally Scalable Video Coding Techniques", Proceedings of IEEE International Conference on Image Processing, Oct. 1997, vol. 2, pp. 61-64.
[56]. G.J. Conklin and S.S. Hemami, "Comparison of Temporal Scalability Techniques", IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, issue 6, pp. 909-919, Sep. 1999.