6 Multimedia Processing Scheme

Minoru Etoh, Hiroyuki Yamaguchi, Tomoyuki Ohya, Toshiro Kawahara, Hiroshi Uehara, Teruhiro Kubota, Masayuki Tsuda, Seishi Tsukada, Wataru Takita, Kimihiko Sekino and Nobuyuki Miura

From: W-CDMA: Mobile Communications System, edited by Keiji Tachikawa. Copyright 2002 John Wiley & Sons, Ltd. ISBN 0-470-84761-1.

6.1 Overview

The introduction of International Mobile Telecommunications-2000 (IMT-2000) has enabled high-speed data transmission, laying the groundwork for full-scale multimedia communications in mobile environments. Given the characteristics and limitations of radio access, multimedia processing suited to mobile communication is required.

This chapter first discusses signal processing, a basic technology for implementing multimedia communication. It describes the technology, characteristics and trends of the Moving Picture Experts Group (MPEG-4) image coding method, Adaptive MultiRate (AMR) speech coding, and 3G-324M. MPEG-4, developed for use in mobile communication and standardized on the basis of various existing coding methods, is regarded as a key technology for IMT-2000. AMR achieves excellent quality and is designed for use under various conditions, such as indoors or on the move. 3G-324M has been adopted by the 3rd Generation Partnership Project (3GPP) as a terminal system technology for implementing audiovisual services.

A functional overview of mobile Internet Service Provider (ISP) services using the IMT-2000 network is also provided, together with other important issues that must be taken into account when providing such services, including the information distribution method, the copyright protection scheme and trends in content markup languages. The chapter also covers the standardization direction of the Wireless Application Protocol (WAP) Forum, a body responsible for developing an open, globally standardized specification for accessing the Internet from wireless networks, and the technical and standardization trends of the common platform functions required for expanding and rolling out applications in the future, with particular focus on such technologies as messaging, location information and electronic authentication.

6.2 Multimedia Signal Processing Scheme

6.2.1 Image Processing

The MPEG-4 image coding method is used in various IMT-2000 multimedia services such as videophone and video distribution. MPEG-4 is positioned as a compilation of existing image coding technologies. This section explains its element technologies and the characteristics of the various image-coding methods developed before MPEG-4.

6.2.1.1 Image Coding Element Technology

Uncompressed image signals carry roughly 100 Mbit/s of information. To handle images efficiently, various coding methods have been developed that exploit the characteristics of images. Element technologies common to these methods include interframe motion-compensated prediction, the Discrete Cosine Transform (DCT), and variable length coding [1–3].

Interframe Motion-Compensated Prediction

Interframe motion-compensated prediction determines how much, and in which direction, a specific part of an image has moved by referencing the previous and subsequent images, rather than encoding each image independently (Figure 6.1). The direction and amount of movement (the motion vector) vary from block to block. A frame is therefore divided into blocks of about 16 by 16 pixels, called macroblocks, and a motion vector is obtained for each block. The difference between a macroblock of the current frame and the corresponding motion-compensated block of the previous frame is called the predicted error; the DCT described in the following section is applied to this error.

(Figure 6.1: Basic idea of interframe motion-compensated prediction. Between the present frame and the next frame, only the movement of the smoke and the airplane constitutes the difference.)
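To make the block-matching search concrete, the sketch below performs full-search motion estimation for a single macroblock, returning the best motion vector and the predicted error. It is an illustrative sketch only: the function name, block size and search radius are assumptions of ours, not values fixed by the standards discussed in this section.

```python
import numpy as np

def motion_search(prev, curr, bx, by, block=16, radius=7):
    """Full-search block matching: find the motion vector minimising the
    sum of absolute differences (SAD) between the current macroblock and
    candidate blocks in the previous (reference) frame."""
    target = curr[by:by + block, bx:bx + block].astype(np.int32)
    best_mv, best_sad = (0, 0), None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > prev.shape[0] or x + block > prev.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = prev[y:y + block, x:x + block].astype(np.int32)
            sad = int(np.abs(target - cand).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    dx, dy = best_mv
    # The residual is the "predicted error" to which the DCT is applied next.
    residual = target - prev[by + dy:by + dy + block,
                             bx + dx:bx + dx + block].astype(np.int32)
    return best_mv, residual
```

Exhaustive search dominates encoding cost, so practical encoders replace it with faster patterns such as logarithmic or diamond search; the structure of the prediction is unchanged.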
DCT

Each frame of a video can be expressed as a weighted sum of image components, ranging from simple (low-frequency) components to complex (high-frequency) components (Figure 6.2). Information is generally concentrated in the low-frequency components, which also play the visually important role. The aim of the DCT is to extract only the important frequency components so that information compression can be performed on them.

(Figure 6.2: Concept of decomposing a screen into frequency components; a block is expressed as a1 x (pattern 1) + a2 x (pattern 2) + ... + a16 x (pattern 16).)

This transform is widely adopted because the conversion into the spatial frequency domain can be carried out efficiently. In practice, the DCT is applied to each block of a frame divided into blocks of about 8 by 8 pixels. In Figure 6.2, "a_i" denotes a DCT coefficient. Each coefficient is further quantized, that is, rounded to a quantization representative value, after which variable length coding is applied as described in the following section.

Variable Length Coding

Variable length coding compresses information by exploiting the uneven distribution of input signal values: short codes are allocated to signal values that occur frequently and long codes to less frequent values. As mentioned in the previous section, many coefficients of high-frequency components become zero in the process of rounding to the quantization representative value. Consequently, two situations occur very often: "all subsequent values are zero" (EOB: End of Block) and "a value L follows after a certain number of zeros." Information can be compressed further by allocating short codes to frequently occurring combinations of the number of zeros (the zero run) and the value L (the level). A scheme that allocates one code to such a combination of two values is called two-dimensional variable length coding.
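The element technologies above can be sketched end to end: an 8 x 8 DCT, uniform quantization, a zigzag scan from low to high frequency, and conversion into (run, level) symbols terminated by EOB. This is an illustrative sketch and not a standard-conformant coder; the uniform quantizer step is an assumption of ours, and the standards define their own scan and quantization details.

```python
import numpy as np

N = 8  # DCT block size used by H.261 and its successors

def dct2(block):
    """Orthonormal 2-D DCT-II of an NxN block via matrix multiplication."""
    k = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos((2 * k[None, :] + 1) * k[:, None] * np.pi / (2 * N))
    C[0, :] = np.sqrt(1.0 / N)  # DC basis row
    return C @ block @ C.T

def zigzag():
    """Zigzag scan order: anti-diagonals from low to high frequency,
    alternating direction on odd and even diagonals."""
    return sorted(((i, j) for i in range(N) for j in range(N)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def run_level_encode(block, qstep=16.0):
    """Quantize DCT coefficients and emit two-dimensional (run, level)
    symbols: a run of zeros followed by a nonzero level, closed by EOB."""
    q = np.round(dct2(block.astype(np.float64)) / qstep).astype(int)
    symbols, run = [], 0
    for i, j in zigzag():
        if q[i, j] == 0:
            run += 1
        else:
            symbols.append((run, q[i, j]))
            run = 0
    symbols.append("EOB")  # every remaining coefficient is zero
    return symbols
```

Each (run, level) pair would then be mapped to one variable length codeword, with the shortest codewords reserved for the most frequent pairs.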
6.2.1.2 Positioning of Various Video-Coding Methods

Internationally standardized video-coding methods include H.261, MPEG-1, MPEG-2, H.263, and MPEG-4. Figure 6.3 shows the applicable area of each scheme. The subsequent sections describe how each method uses the above-mentioned element technologies to improve compression efficiency, and the functional differences between the methods.

(Figure 6.3: Relationship between MPEG-4 video coding and other standards, mapping each scheme by quality against transmission speed over the range 10 kbit/s to 10 Mbit/s.)

H.261 Video Coding

H.261 is virtually the world's first international standard for video coding. It was designed for use in ISDN videophone and videoconferencing and was standardized by the International Telecommunication Union-Telecommunication (ITU-T) in 1990 [4]. H.261 uses all the element technologies mentioned in the preceding text. That is, it:

1. Predicts the motion vector of each macroblock of 16 by 16 pixels, in units of one pixel, to perform interframe motion-compensated prediction.
2. Applies the DCT, on blocks of 8 by 8 pixels, to the predicted error against the previous frame. For areas with rapid motion, where the predicted error exceeds a certain threshold, interframe motion-compensated prediction is not performed; instead, the 8 x 8 pixel DCT is applied within the frame to increase coding efficiency.
3. Performs variable length coding on the motion vectors obtained by interframe motion compensation and on the DCT results, respectively. Two-dimensional variable length coding is used for the DCT results.

H.261 assumes the use of conventional TV cameras and monitors. TV signal formats (the number of frames and of scanning lines), however, vary from region to region. To support international communications, these formats are converted into a common intermediate format, named the Common Intermediate Format (CIF) and defined as 352 (horizontal) by 288 (vertical) pixels, a maximum of 30 frames per second, noninterlaced. Quarter CIF (QCIF), a quarter the size of CIF, was defined at the same time and is also used in subsequent video-coding applications.

MPEG-1/MPEG-2 Video Coding

MPEG-1 was standardized by the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) in 1993 for use with storage media such as CD-ROM [5]. It is designed to handle visual data at around 1.5 Mbit/s. Since it is a coding scheme for storage media, the requirements for real-time processing are relaxed compared with H.261, which made it possible to adopt new capabilities such as random search. While basically the same element technologies as in H.261 are used, the following capabilities were added:

1. An all-intraframe image is inserted periodically to enable random-access replay.
2. H.261 predicts the motion vector from past frames only (forward prediction). In addition, MPEG-1 enables prediction from future frames (backward prediction), taking advantage of the characteristics of storage media. Moreover, MPEG-1 evaluates forward prediction, backward prediction and the average of the two, and selects the one with the least prediction error, improving the compression rate.
3. While H.261 predicts motion vectors in units of one pixel, MPEG-1 introduced prediction in units of 0.5 pixel. To achieve this, an interpolated image is created by averaging adjacent pixels, and interframe motion prediction is performed on the interpolated image to enhance the compression rate (a sketch of the interpolation follows this section).

With these capabilities added, MPEG-1 is widely used as a video encoder and player for personal computers.
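The 0.5-pixel prediction in item 3 rests on a simple interpolation: a reference plane at twice the resolution is built by averaging neighbouring pixels, and the motion search then runs on this plane with half-pixel vector units. A minimal sketch, with layout and naming of our own choosing:

```python
import numpy as np

def half_pel_plane(frame):
    """Upsample a reference frame by 2x, filling half-pixel positions with
    the average of the adjacent integer pixels (bilinear interpolation)."""
    f = frame.astype(np.float32)
    h, w = f.shape
    up = np.zeros((2 * h - 1, 2 * w - 1), dtype=np.float32)
    up[::2, ::2] = f                                  # integer positions
    up[::2, 1::2] = (f[:, :-1] + f[:, 1:]) / 2.0      # horizontal half-pels
    up[1::2, ::2] = (f[:-1, :] + f[1:, :]) / 2.0      # vertical half-pels
    up[1::2, 1::2] = (f[:-1, :-1] + f[:-1, 1:] +
                      f[1:, :-1] + f[1:, 1:]) / 4.0   # diagonal half-pels
    return up
```

A motion vector of (+1, 0) on the upsampled plane then corresponds to a displacement of half a pixel in the original frame.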
MPEG-2 is a generic video-coding method developed by taking into account the requirements of telecommunications, broadcasting, and storage. It was standardized by ISO/IEC in 1996 and shares a common text with ITU-T H.262 [6]. MPEG-2 is the coding scheme for video at 3 to 20 Mbit/s, widely used for digital TV broadcasting, High Definition Television (HDTV), and the Digital Versatile Disk (DVD). MPEG-2 inherits the element technologies of MPEG-1 and adds the following new features:

1. The capability to efficiently encode the interlaced images used in conventional TV signals.
2. Functions to adjust the screen size and the quality as required (called spatial scalability and SNR scalability, respectively) by retrieving only part of the coded data.

Since capabilities were added for such varied uses, special attention must be paid to ensuring the compatibility of coded data. To cope with this issue, MPEG-2 introduced the concepts of "profile" and "level", which classify the differences in capabilities and in processing complexity. These concepts are used in MPEG-4 as well.

H.263 Video Coding

H.263 is an ultra-low bit rate video-coding method for videophones over analog networks, standardized by ITU-T in 1996. It assumes the use of a 28.8 kbit/s modem and adopts some of the new technologies developed for MPEG-1. Interframe motion-compensated prediction in units of 0.5 pixel is a mandatory basic function (baseline). Another baseline function is three-dimensional variable length coding, which extends the conventional two-dimensional coding (run and level) by folding the EOB information into each code. Furthermore, interframe motion-compensated prediction in units of 8 by 8 pixel blocks and processing to reduce block distortion in images were newly added as options. With these functional additions, H.263 is now used in some ISDN videophone and videoconference equipment.

6.2.1.3 MPEG-4 Video Coding

MPEG-4 video coding was developed by making various improvements, including enhanced error resilience, on top of ITU-T H.263 video coding. The method is backward compatible with the H.263 baseline.

MPEG-2 was designed mainly for image handling on computers, digital broadcasting and high-speed communications. In addition to these services, MPEG-4 was standardized with a special focus on its application to telecommunications, in particular mobile communications. As a result, MPEG-4 was established as a very generic video-coding method [7] and became an ISO/IEC standard in 1999. MPEG-4 is therefore recognized as a key technology for image-based multimedia services in IMT-2000, including video mail, video distribution and videophone (Figure 6.4).

(Figure 6.4: Scope of MPEG-4, spanning broadcast (mobile TV, mobile video and audio information distribution), communication (mobile videophone, mobile videoconference) and computer (video mail, multimedia on demand, mobile Internet) applications.)

Profile and Level

To ensure the interchangeability and interoperability of encoded data, the functions of MPEG-4 are classified by profile, and the computational complexity by level, as in MPEG-2. The defined profiles include Simple, Core, Main, and Simple Scalable, among which the Simple profile defines the common functions. Interframe motion-compensated prediction on 8 by 8 pixel blocks, defined as an option in H.263, is part of the Simple profile. Within the Simple profile, QCIF images are handled by levels 0 and 1, and CIF images by level 2. The Core and Main profiles can define an arbitrary area of a video as an "object", so as to improve the image quality or to incorporate the object into other coded data. Other, more sophisticated profiles, such as those composed with CG (Computer Generated) images, are also provided in MPEG-4.

IMT-2000 Standards

3GPP 3G-324M, the videophone standard in IMT-2000 detailed in Section 6.4, requires the H.263 baseline as a mandatory video-coding scheme and highly recommends the use of MPEG-4 Simple profile level 0.
The Simple profile contains the following error-resilience tools:

1. Resynchronization: Localizes transmission errors by inserting a resynchronization code into the variable length coded data, partitioning it at appropriate positions within a frame. Since header information specifying the coding parameters follows the resynchronization code, swift recovery from a decoding-error state is possible. The insertion interval of the resynchronization code can be optimized taking into account the overhead of the header information, the type of input visual scene and the transmission characteristics.
2. Data Partitioning: Enables error concealment by inserting a Synchronization Code (SC) at the boundaries between different types of coded data. For example, by inserting an SC between the motion vectors and the DCT coefficients, the motion vectors can be transmitted correctly even if a bit error corrupts the DCT coefficients, enabling more natural error concealment.
3. Reversible Variable Length Code (RVLC): As shown in Figure 6.5, an RVLC is a variable length code that can also be decoded from the reverse direction. It is applied to the DCT coefficients. With this tool, all macroblocks can be decoded except those that actually contain bit errors.
4. Adaptive Intra Refresh: Prevents error propagation by applying intraframe coding to areas with high motion.

(Figure 6.5: Example of decoding a reversible variable length code (RVLC). (a) With unidirectional decoding of a normal variable length code, everything after a decode error cannot be decoded and is discarded. (b) With bidirectional RVLC decoding, the stream is also decoded in reverse, so only the erroneous region is discarded.)

As described in the preceding text, MPEG-4 Simple profile level 0 constitutes a very simple CODEC suitable for mobile communications.
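The bidirectional decoding of tool 3 works because RVLC codewords can be recognised from either end. The sketch below uses a toy palindromic codeword table of our own (not the RVLC table of the MPEG-4 standard) to show the two decoding passes; a real decoder additionally uses consistency checks, such as macroblock counts, to decide how much of the middle region to discard.

```python
# Toy palindromic prefix code: each codeword reads the same backwards,
# so the same greedy lookup decodes the stream in either direction.
CODE = {"0": "A", "11": "B", "101": "C", "1001": "D"}
MAXLEN = max(len(c) for c in CODE)

def greedy_decode(bits):
    """Decode symbols until the buffer can no longer match any codeword."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in CODE:
            out.append(CODE[buf])
            buf = ""
        elif len(buf) >= MAXLEN:
            break  # a transmission error has broken the codeword structure
    return out

def rvlc_decode(bits):
    """Forward pass, plus a backward pass over the reversed stream."""
    forward = greedy_decode(bits)
    backward = greedy_decode(bits[::-1])[::-1]
    return forward, backward

# On a clean stream both passes recover all symbols: B, C, A, D.
print(rvlc_decode("11" + "101" + "0" + "1001"))
```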
6.2.2 Speech and Audio Processing

6.2.2.1 Code Excited Linear Prediction (CELP) Algorithm

There are typically three classes of speech coding methods: waveform coding, the vocoder, and hybrid coding. Like Pulse Code Modulation (PCM) or Adaptive Differential PCM (ADPCM), waveform coding encodes the waveform of the signal as accurately as possible, without depending on the nature of the signal. If the bit rate is high enough, high-quality coding is therefore possible; if the bit rate becomes low, however, the quality drops sharply. The vocoder, on the other hand, assumes a generation model of speech and analyzes and encodes its parameters. Although this approach keeps the bit rate low, it is difficult to improve the quality even by increasing the bit rate, because the voice quality depends largely on the assumed speech generation model. Hybrid coding combines waveform coding and the vocoder: it assumes a voice generation model, analyzes and encodes its parameters, and then applies waveform coding to the remaining information (the residual signal) not expressed by the parameters. A typical hybrid method is CELP, which is widely used in mobile communication speech coding as a generic algorithm for implementing highly efficient, high-quality speech coding.

(Figure 6.6: Voice generation model used in CELP coding. Excitation information from voiced and unvoiced sources, corresponding to vocalization at the vocal cords, drives a synthesis filter, corresponding to articulation in the oral cavity, to produce the voice waveform and its spectrum.)

Figure 6.6 shows the speech generation model used in CELP coding. The CELP decoder consists of a linear prediction synthesis filter and two codebooks (an adaptive codebook and a stochastic codebook) that generate the excitation signals driving the filter. The linear prediction synthesis filter corresponds to the human vocal tract and represents the spectrum envelope characteristics of the speech signal, while the excitation signals generated from the codebooks correspond to the air exhaled from the lungs as it passes through the glottis. CELP thus simulates the vocalization mechanism of human beings. The subsequent sections explain the basic technologies used in CELP coding.

Linear Prediction Analysis

As shown in Figure 6.7, linear prediction analysis uses the temporal correlation of speech signals to predict the current sample from past inputs:

    x̂_t = a_1 x_{t-1} + a_2 x_{t-2} + ... + a_p x_{t-p},    x_t = x̂_t + e_t,

where the difference e_t between the actual and predicted samples is the prediction residual. In transfer-function terms, the linear prediction filter is F(z) = Σ_{i=1..p} a_i z^{-i}, the inverse (analysis) filter is A(z) = 1 - F(z), and the synthesis filter is 1/A(z) (Figure 6.7: linear prediction analysis). The CELP encoder calculates the autocorrelation of the speech signal and obtains the linear prediction coefficients a_i using, for example, the Levinson-Durbin-Itakura method. The prediction order in telephone-band coding is normally ten. Since it is difficult to verify filter stability on the coefficients directly, the linear prediction coefficients are converted to equivalent, stability-preserving representations such as reflection coefficients or Line Spectrum Pair (LSP) coefficients, and are then quantized for transmission. The decoder constructs the synthesis filter from the transmitted a_i and drives it with the prediction residual to obtain the decoded speech. The frequency characteristics of the synthesis filter correspond to the speech spectrum envelope.
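A compact sketch of this analysis step: compute frame autocorrelations, then solve for the prediction coefficients with the Levinson-Durbin recursion. The frame length and order follow the telephone-band example in the text; the code itself is illustrative and not taken from any codec specification.

```python
import numpy as np

def levinson_durbin(r, order=10):
    """Solve the normal equations for linear prediction coefficients
    from autocorrelation values r[0..order]."""
    a = np.zeros(order + 1)   # error-filter coefficients, a[0] = 1
    a[0] = 1.0
    err = r[0]                # prediction-error power
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err        # reflection (PARCOR) coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)  # error power shrinks at every order
    return -a[1:], err        # predictor coefficients a_i, residual power

# Usage sketch: 10th-order analysis of one 20 ms frame at 8 kHz sampling.
x = np.random.randn(160)      # stand-in for a windowed speech frame
r = np.array([np.dot(x[:160 - i], x[i:]) for i in range(11)])
coeffs, residual_power = levinson_durbin(r)
```

The reflection coefficients k computed along the way are exactly the stability-friendly representation mentioned above: |k| < 1 at every step guarantees a stable synthesis filter.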
Perceptual Weighting Filter

The CELP encoder has the same internal structure as the decoder. It encodes a signal by searching the patterns and gains in each codebook so that the error between the synthesized speech signal and the input speech signal is minimized. This technique is called Analysis-by-Synthesis (A-b-S) and is one of the characteristics of CELP.

A-b-S evaluates the error with a weighting based on the perceptual characteristics of human hearing. The perceptual weighting filter is an ARMA (Auto Regressive Moving Average) type filter built from the coefficients obtained through linear prediction analysis. By having frequency characteristics that are a vertically inverted version of the speech spectrum envelope, it minimizes the quantization error in the spectrum valleys, which is relatively easy to hear. Although using nonquantized linear prediction coefficients improves the characteristics, the computational complexity increases; in the past, the complexity was therefore sometimes reduced, at the cost of some quality, by offsetting the quantized linear prediction coefficients against the synthesis filter. Today, the calculation is mainly performed using the impulse response of the combined synthesis and perceptual weighting filters.

Adaptive Codebook

The adaptive codebook stores past excitation signals in memory and changes dynamically. If the excitation signal is cyclic, as in voiced sound, it can be expressed efficiently with the adaptive codebook, because the excitation repeats at the pitch cycle corresponding to the pitch of the voice. The pitch cycle chosen is the one for which the difference between the source voice and the adaptive codebook vector passed through the synthesis filter is smallest in the perceptually weighted domain. To cover the range of average voice pitch, cycles of about 16 to 144 samples are searched for an 8 kHz sampled input. If the pitch cycle is relatively short, it is quantized to noninteger accuracy by oversampling, to increase the frequency resolution. Since this error calculation involves considerable computational complexity, the autocorrelation of the speech is normally calculated in advance to obtain an approximate pitch cycle, and the full error calculation, including oversampling, is then performed only around that cycle, which reduces the computational complexity significantly. Searching only around the previously obtained pitch cycle and quantizing the difference is also effective in reducing both the amount of information and the computational complexity.
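The first, coarse stage of that two-step search can be sketched as an open-loop pitch estimate: pick the lag in the 16 to 144 sample range whose normalised autocorrelation is largest. This is a toy sketch; the normalisation and loop bounds are our own choices, and a real coder refines the result with the closed-loop, perceptually weighted search described above.

```python
import numpy as np

def open_loop_pitch(x, lag_min=16, lag_max=144):
    """Coarse pitch estimate for an 8 kHz signal: x must hold the current
    frame preceded by enough past samples (len(x) > lag_max)."""
    best_lag, best_score = lag_min, -np.inf
    for lag in range(lag_min, lag_max + 1):
        seg, past = x[lag:], x[:-lag]   # signal vs. itself `lag` samples ago
        energy = float(np.dot(past, past))
        if energy <= 0.0:
            continue                    # silent history: no evidence at this lag
        score = float(np.dot(seg, past)) / np.sqrt(energy)
        if score > best_score:
            best_score, best_lag = score, lag
    return best_lag
```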
Stochastic Codebook

The stochastic codebook expresses the residual signals that cannot be expressed with the adaptive codebook, and it therefore contains noncyclic patterns. Traditionally, the codebook contained Gaussian random noise or learned noise signals. Today, the algebraic codebook, which expresses residual signals with sparse pulses, is often used instead; it significantly reduces the memory required for storing noise vectors, the orthogonalization operations with the adaptive codebook, and the amount of error calculation.

Post Filter

The post filter is used in the final stage of decoding to improve the subjective quality of the decoded voice by reshaping it. The formant emphasis filter, a typical post filter, is an ARMA-type filter with the inverse characteristics of the perceptual weighting filter, capable of suppressing spectrum valleys to make quantization errors less noticeable. Normally, a filter for correcting the spectral tilt of the output signal is added to it.

6.2.2.2 Peripheral Technologies for Mobile Communications

In mobile communications, various peripheral technologies are used to cope with conditions specific to mobile use, such as radio links and service use outdoors or on the move. This section outlines these technologies.

Error Correction Technology

Error-correcting codes are used to correct the transmission errors generated on radio channels. Bit Selective Forward Error Correction (BS-FEC), or Unequal Error Protection (UEP), performs error correction efficiently by applying codes of different strengths according to the error sensitivity of each speech coding information bit, that is, the amount of distortion caused in the decoded voice when that bit is erroneous.

Error-Concealment Technology

If an error is not corrected by the aforementioned error-correcting code, or if information is lost, correct decoding cannot be performed from the received information. In such cases, the speech signal for the erroneous part is generated by parameter interpolation using past speech information, minimizing the deterioration of speech quality. This is called error-concealment technology. The parameters interpolated include the linear prediction coefficients, the pitch cycle, and the gain, all of which have high temporal correlation.

Discontinuous Transmission

Discontinuous Transmission (DTX) sends no information, or very little, during periods without speech, which is effective in saving Mobile Station (MS) battery and reducing interference. A Voice Activity Detector (VAD) uses voice parameters to determine whether speech is present (a toy decision rule is sketched below). During silent periods, background noise is generated at the receiver from background noise information that requires far less data than speech information, reducing the user's discomfort caused by DTX.

Noise Suppression

As mentioned in Section 6.2.2.1, the CELP algorithm is built on the human vocal model, so the quality of other sounds, such as street noise, deteriorates. Suppressing noises other than the human voice needed for conversation therefore improves speech quality.
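As a toy illustration of a VAD decision: compare the frame energy against an adaptive noise-floor estimate and flag speech only when the floor is clearly exceeded. Every constant below is an arbitrary assumption of ours; production VADs, such as the one standardized alongside AMR, use much richer spectral features.

```python
import numpy as np

def simple_vad(frame, noise_floor, margin_db=6.0, alpha=0.95):
    """Energy-based voice activity decision for one frame.
    Returns (is_speech, updated_noise_floor)."""
    energy = float(np.dot(frame, frame)) / len(frame)
    is_speech = (10 * np.log10(energy + 1e-12)
                 > 10 * np.log10(noise_floor + 1e-12) + margin_db)
    if not is_speech:
        # Track the noise level only during silence, so speech does not
        # inflate the floor estimate.
        noise_floor = alpha * noise_floor + (1 - alpha) * energy
    return is_speech, noise_floor
```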
[...]

... [36–38]

6.3.2.3 Multimedia Information Storage Methods

As mentioned in Section 6.3.2.1, to distribute multimedia information, multimedia contents are first created with the contents production system, stored in the multimedia information distribution server and then distributed to users. The contents production system and the multimedia information distribution server transfer the multimedia contents in a ...

[...]

... easily and quickly.

6.3.2 Multimedia Information Distribution Methods

6.3.2.1 Overview of the Multimedia Information Distribution Server

In contrast with the relatively small amounts of information, such as voice and text, handled by conventional communications, large amounts of digital information such as images and sound are called multimedia information. When multimedia information including text, images and sound ...

[...]

... authoring tool in the multimedia information distribution server and distributed to the terminals on request. A terminal that receives the content performs decoding in order to replay the images and sound in the format they had before encoding; the contents are then reconfigured and replayed. There are two methods of distribution between the multimedia information distribution server and ...

[...]

... embedded in multimedia information, that it functions when part of the multimedia information ...

(Figure 6.17: Basic concept of the copyright protection method. The user terminal sends an authentication request and, after authentication, a contents distribution request to the multimedia information distribution server; it receives the encrypted contents and the decryption key, decrypts the contents, and re-encrypts them with encryption specific to the storage media.)

[...]

... (IPTEL), and Audio/Video Transport (AVT). Meanwhile, 3GPP is proceeding with its standardization tasks in cooperation with these organizations, with the aim of implementing IP over mobile networks.

6.2.3 Multimedia Signal Processing Systems

6.2.3.1 History of Standardization

Figure 6.9 shows the history of the international standardization of audiovisual terminals. H.320 [11] is the recommendation for audiovisual ...

[...]

... progress of the standardization activity of the third-generation mobile communication system, ITU-T commenced studies on audiovisual terminals for mobile communications networks in 1995. Studies were made by extending the H.324 recommendation for PSTN and led to the development of H.324 Annex C in February 1998. H.324 Annex C enhances error resilience against transmission over radio channels. Since H.324 ...

[...]

... control unit and multimedia-multiplexing unit. The speech CODEC requires AMR support as a mandatory function, and the video CODEC requires the H.263 baseline as a mandatory capability, with MPEG-4 support recommended. Support of H.223 Annex B, which offers improved error resilience, is a mandatory requirement for the multimedia-multiplexing unit.

6.2.3.3 Media Coding

While various media coding schemes can be ...

[...]

6.4 Multimedia Messaging Methods

6.4.1 Overview

Multimedia messaging is a technology for transferring multimedia information using a store-and-forward transmission technology called messaging. It is distinguished from real-time communication technologies, such as videophone and remote conferencing, in terms of the immediacy of the information. Multimedia information integrates multiple media information ...

[...]

(Flattened table fragment: AMR bit allocation with columns ending in "code", "gain" and "total"; the surviving row reads 8, 8, 23, 20, 36, 16, with a total of 95.)

The Radio Access Network (RAN) of IMT-2000 is defined so that it can be designed flexibly, as a toolbox. To enable this, a classification of the coding information according to its significance is defined, so that the RAN can apply UEP to the AMR coding information. Note that the IMT-2000 Steering Group (ISG) defines radio parameters to meet this classification ...

[...]

(Figure: Configuration of the multimedia information distribution server. Contents are delivered from the multimedia information distribution server to terminal software on a mobile phone or similar device, via a communication processor, using HTTP/TCP/IP or RTSP/RTP/UDP/IP.)

(Figure: Streaming timeline. Distributed contents are sent from the server, buffered as they are received at the terminal, and played back sequentially between commencement and completion of playback at the terminal.)
