20 MPEG System — Video, Audio, and Data Multiplexing

In this chapter, we present the methods and standards specifying how to multiplex and synchronize the MPEG-coded video, audio, and other data into a single bitstream or multiple bitstreams for storage and transmission.

20.1 INTRODUCTION

ISO/IEC MPEG has completed work on the ISO/IEC 11172 and 13818 standards, known as MPEG-1 and MPEG-2, respectively, which deal with the coding of digital audio and video signals. Currently, ISO/IEC is working on ISO/IEC 14496, known as MPEG-4, which specifies object-based generic coding for multimedia applications. As mentioned in the previous chapters, the MPEG-1, 2, and 4 standards are designed as generic standards and as such are suitable for use in a wide range of audiovisual applications. The coding parts of the standards convert the digital visual, audio, and data signals into compressed formats represented as binary bits. The task of the MPEG system is focused on multiplexing and synchronizing the coded audio, video, and data into a single bitstream or multiple bitstreams. In other words, the digital compressed video, audio, and data are all first represented in binary formats, which are referred to as bitstreams, and the function of the system layer is then to mix the bitstreams from video, audio, and data together. For this purpose, several issues have to be addressed by the system part of the standard:

• Distinguishing different data, such as audio, video, or other data;
• Allocating bandwidth during muxing;
• Reallocating or decoding the different data during demuxing;
• Protecting the bitstreams in error-prone media and detecting the errors;
• Dynamically multiplexing several bitstreams.

Additional requirements for the system include extensibility issues, such as:

• New service extensions should be possible;
• Existing decoders should recognize and ignore data they cannot understand;
• The syntax should have extension capability.

It should also be noted that all system-timing signals are included in the bitstream. This is a major difference from traditional analog systems, in which the timing signals are transmitted separately. In this chapter, we introduce the concept of systems and give detailed explanations of existing standards such as MPEG-2. However, we will not go through the standards page by page to explain the syntax; instead, we pay more attention to the core parts of the standard and to the parts that often cause confusion during implementation. One of the key issues is system timing. For MPEG-4, we present the current status of the system part of the standard.

20.2 MPEG-2 SYSTEM

The MPEG-2 system standard is also referred to as ITU-T Rec. H.222.0/ISO/IEC 13818-1 (ISO/IEC, 1996). The ISO document gives a very detailed description of this standard. A simplified overview of the system is shown in Figure 20.1. The MPEG-2 system coding is specified in two forms: the transport stream and the program stream. Each is optimized for a different set of applications. The audio and video data are first encoded by an audio and a video encoder, respectively. The coded data are the compressed bitstreams, which follow the syntax rules specified by the video-coding standard 13818-2 and the audio-coding standard 13818-3. The compressed audio and video bitstreams are then packetized into packetized elementary streams (PES).

FIGURE 20.1 Simplified overview of system layer scope. (From ISO/IEC 13818-1, 1996. With permission.)
The video PES and audio PES are coded by the system coding into the transport stream or the program stream according to the requirements of the application. The system coding provides a coding syntax that is necessary and sufficient to synchronize the decoding and presentation of the video and audio information; at the same time it also has to ensure that data buffers in the decoders do not overflow or underflow. Of course, buffer regulation is also considered by the buffer control or rate control mechanism in the encoder. The video, audio, and data information are multiplexed according to the system syntax by inserting time stamps for decoding, presenting, and delivering the coded audio, video, and other data. It should be noted that both the program stream and the transport stream are packet-oriented multiplexing schemes. Before we explain these streams, we first give a set of parameter definitions used in the system documents. Then, we describe the overall picture regarding the basic multiplexing approach for single video and audio elementary streams.

20.2.1 MAJOR TECHNICAL DEFINITIONS IN THE MPEG-2 SYSTEM DOCUMENT

In this section, the technical definitions that are often used in the system document are provided. First, the major packet- and stream-related definitions are given.

Access unit: A coded representation of a presentation unit. In the case of audio, an access unit is the coded representation of an audio frame. In the case of video, an access unit indicates all the coded data for a picture, and any stuffing that follows it, up to but not including the start of the next access unit. In other words, the access unit begins with the first byte of the first start code. Except for the end of a sequence, all bytes between the last byte of the coded picture and the sequence end code belong to the access unit.

DSM-CC: Digital storage media command and control.

Elementary stream (ES): A generic term for one of the coded video, coded audio, or other coded bitstreams carried in PES packets. One elementary stream is carried in a sequence of PES packets with one and only one stream identification. This implies that one elementary stream can carry only one type of data, such as audio or video.

Packet: A packet consists of a header followed by a number of contiguous bytes from an elementary data stream.

Packet identification (PID): A unique integer value used to associate elementary streams of a program in a single- or multiprogram transport stream. It is a 13-bit field, which indicates the type of data stored in the packet payload.

PES packet: The data structure used to carry elementary stream data. It contains a PES packet header followed by the PES packet payload.

PES stream: A PES stream consists of PES packets, all of whose payloads consist of data from a single elementary stream, and all of which have the same stream identification. Specific semantic constraints apply.

PES packet header: The leading fields in the PES packet up to and not including the PES packet data byte fields. Its function will be explained in the section on syntax description.

System target decoder (STD): A hypothetical reference model of a decoding process used to describe the semantics of the MPEG-2 system-multiplexed bitstream.

Program-specific information (PSI): PSI includes normative data that are used for the demultiplexing of programs in the transport stream by decoders. One case of PSI, the nonmandatory network information table, is privately defined.

System header: The leading fields of program stream packets.

Transport stream packet header: The leading fields of transport stream packets.
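Before turning to the timing-related definitions, it may help to picture how these constructs nest: an elementary stream is a sequence of access units, the access-unit data are carried in PES packets, and PES packets are in turn carried in fixed-size transport stream packets identified by a PID. The following minimal C sketch models that layering; the structure and field names are illustrative only and do not reproduce the normative bitstream syntax.

```c
/* Illustrative data model of the MPEG-2 system layering (not normative syntax). */
#include <stddef.h>
#include <stdint.h>

typedef struct {               /* one coded access unit of an elementary stream  */
    const uint8_t *data;       /* e.g., one coded picture or one audio frame     */
    size_t         size;
} AccessUnit;

typedef struct {               /* PES packet: header plus payload from ONE ES    */
    uint8_t  stream_id;        /* identifies the elementary stream               */
    int      has_pts, has_dts; /* optional time stamps in the PES packet header  */
    uint64_t pts, dts;         /* time-stamp values, if present                  */
    const uint8_t *payload;    /* contiguous bytes of elementary stream data     */
    size_t         payload_size;
} PesPacket;

typedef struct {               /* transport stream packet: fixed size, 4-byte header */
    uint16_t       pid;        /* 13-bit packet identification                   */
    const uint8_t *payload;    /* PES or PSI bytes (and optional adaptation field) */
    size_t         payload_size;
} TsPacket;
```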
The following definitions are related to timing information:

Time stamp: A term that indicates the time of a specific action, such as the arrival of a byte or the presentation of a presentation unit.

System clock reference (SCR): A time stamp in the program stream from which decoder timing is derived.

Elementary stream clock reference (ESCR): A time stamp in the PES stream from which decoders of the PES stream may derive timing information.

Decoding time stamp (DTS): A time stamp that may be present in a PES packet header and that indicates the time at which an access unit is decoded in the system target decoder.

Program clock reference (PCR): A time stamp in the transport stream from which decoder timing is derived.

Presentation time stamp (PTS): A time stamp that may be present in the PES packet header and that indicates the time at which a presentation unit is presented in the system target decoder.
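These clock references and time stamps are expressed in different units: in MPEG-2 the system time clock runs at 27 MHz, PTS and DTS values tick at 90 kHz (27 MHz divided by 300), and PCR/SCR samples are coded as a 33-bit base in 90 kHz units plus a 9-bit extension that restores the full 27 MHz resolution. The sketch below shows the arithmetic a decoder might use to relate these values to seconds; it illustrates only the units, not a complete clock-recovery scheme, and the numeric values are hypothetical.

```c
/* Time-stamp unit arithmetic (illustrative).  PTS/DTS tick at 90 kHz;
 * PCR/SCR are 27 MHz values coded as a 33-bit base plus a 9-bit extension. */
#include <stdint.h>
#include <stdio.h>

#define SYSTEM_CLOCK_HZ 27000000.0
#define PTS_CLOCK_HZ       90000.0

/* Reassemble a 27 MHz clock reference from its two coded parts. */
static uint64_t clock_reference(uint64_t base_33bit, uint32_t ext_9bit)
{
    return base_33bit * 300u + ext_9bit;
}

/* Convert time stamps to seconds for comparison or display. */
static double pts_to_seconds(uint64_t pts) { return pts / PTS_CLOCK_HZ; }
static double pcr_to_seconds(uint64_t pcr) { return pcr / SYSTEM_CLOCK_HZ; }

int main(void)
{
    uint64_t pcr = clock_reference(900000, 150);   /* hypothetical PCR sample      */
    uint64_t pts = 903000;                         /* hypothetical PTS             */
    /* Here the PTS (10.033 s) lies about 33 ms after the PCR sample (10.000 s),
     * so the presentation unit is due shortly after this clock sample arrives. */
    printf("PCR = %.6f s, PTS = %.6f s\n", pcr_to_seconds(pcr), pts_to_seconds(pts));
    return 0;
}
```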
20.2.2 TRANSPORT STREAMS

The transport stream is a stream definition that is designed for communicating or storing one or more programs of coded video, audio, and other kinds of data in lossy or noisy environments where significant errors may occur. A transport stream combines one or more programs with one or more time bases into a single stream. However, there are some difficulties with constructing and delivering a transport stream containing multiple programs with independent time bases such that the overall bit rate is variable. As in other standards, the transport stream may be constructed by any method that results in a valid stream. In other words, the standard specifies only the system coding syntax. In this way, all compliant decoders can decode bitstreams generated according to the standard syntax; the standard does not, however, specify how the encoder generates the bitstreams. It is possible to generate transport streams containing one or more programs from elementary coded data streams, from program streams, or from other transport streams, which may themselves contain one or more programs.

An important feature of the transport stream is that it is designed in such a way that the following operations are possible with minimum effort. These operations include several transcoding requirements:

• Retrieve the coded data from one program within the transport stream, decode it, and present the decoded results. In this operation, the transport stream is directly demultiplexed and decoded. The data in the transport stream are constructed in two layers: a system layer and a compression layer. The system decoder decodes the transport stream and demultiplexes it into the compressed video and audio streams, which are further decoded into video and audio data by the video decoder and the audio decoder, respectively. It should be noted that non-audio/video data are also allowed. The functions of the transport decoder include demultiplexing, depacketization, and other functions such as error detection, which will be explained in detail later. This procedure is shown in Figure 20.2.

• Extract the transport stream packets of one program within the transport stream and produce as output a new transport stream that contains only that one program. This operation can be seen as system-layer transcoding that converts a transport stream containing multiple programs into a transport stream containing only a single program. In this case, the remultiplexing operation may need to correct the PCR values to account for changes in the PCR locations in the bitstream.

• Extract the transport stream packets of one or more programs from one or more transport streams and produce as output a new transport stream. This is another kind of transcoding that converts selected programs of one transport stream into a different one.

• Extract the contents of one program from the transport stream and produce as output a program stream. This is a transcoding that converts the transport program into a program stream for certain applications.

• Convert a program stream to a transport stream that can be used in a lossy communication environment.

To show how the transport stream is defined so as to make the above transcoding operations simple and efficient, we describe the technical details of the systems specification in the following sections.

FIGURE 20.2 Example of transport demultiplexing and decoding. (From ISO/IEC 13818-1, 1996. With permission.)

20.2.2.1 Structure of Transport Streams

As described earlier, the task of the transport stream coding layer is to allow one or more programs to be combined into a single stream. Data from each elementary stream are multiplexed together with timing information, which is used for synchronization and presentation of the elementary streams during decoding. Therefore, the transport stream consists of one or more programs comprising audio, video, and data elementary stream access units.

The transport stream structure is a layered structure. All the bits in the transport stream are packetized into transport packets. The size of the transport packet is chosen to be 188 bytes, of which 4 bytes are used as the transport stream packet header. In the first layer, the header of the transport packet indicates whether or not the transport packet has an adaptation field. If there is no adaptation field, the transport payload may consist of only PES packets or of both PES packets and PSI packets. Figure 20.3 illustrates the case in which the payload contains PES packets only. If the transport stream carries both PES and PSI packets, the structure shown in Figure 20.4 results. If the transport stream packet header indicates that the packet includes an adaptation field, the structure is as shown in Figure 20.5. In Figure 20.5, the appearance of the optional fields depends on the flag settings. The function of the adaptation field will be explained in the syntax section.

Before we go ahead, however, we should give a little explanation regarding the size of the transport stream packet. More specifically, why is a packet size of 188 bytes chosen? There are several reasons. First, the transport packet size needs to be large enough that the overhead due to the transport headers is not too significant. Second, the size should not be so large that the packet-based error correction code becomes inefficient. Finally, 188 bytes is also compatible with ATM transmission: each ATM cell carries a 48-byte payload, of which 47 bytes remain for data after 1 byte of adaptation-layer overhead, so one transport stream packet maps onto exactly four ATM cells. The size of 188 bytes is therefore not a theoretical solution but a practical compromise.
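Because every transport stream packet is exactly 188 bytes long and begins with a sync byte (defined in the next subsection as 0x47), a demultiplexer typically locks onto packet boundaries by finding a byte position at which the sync byte recurs every 188 bytes. The following sketch illustrates one simple way this might be done on a buffered stream; the requirement of five consecutive sync bytes is an arbitrary confidence threshold chosen for illustration.

```c
/* Find the offset of the first transport packet by checking that the sync
 * byte 0x47 repeats at 188-byte intervals (illustrative sketch only).      */
#include <stddef.h>
#include <stdint.h>

#define TS_PACKET_SIZE 188
#define TS_SYNC_BYTE   0x47

/* Returns the byte offset of the first aligned packet, or -1 if none is found. */
static long find_ts_alignment(const uint8_t *buf, size_t len)
{
    const int needed = 5;                      /* arbitrary confidence threshold */
    for (size_t start = 0; start < TS_PACKET_SIZE && start + TS_PACKET_SIZE <= len; start++) {
        int hits = 0;
        for (size_t pos = start; pos < len && hits < needed; pos += TS_PACKET_SIZE) {
            if (buf[pos] != TS_SYNC_BYTE)
                break;                         /* not aligned at this offset     */
            hits++;
        }
        if (hits == needed)
            return (long)start;
    }
    return -1;
}
```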
FIGURE 20.3 Structure of transport stream containing only PES packets. (From ISO/IEC 13818-1, 1996. With permission.)

FIGURE 20.4 Structure of transport stream containing both PES packets and PSI packets.

FIGURE 20.5 Structure of transport stream whose header contains an adaptation field.

20.2.2.2 Transport Stream Syntax

As we indicated, the transport stream has a layered structure. To explain the transport stream syntax we start from the transport stream packet header. Since the header is the highest layer of the stream and is very important, we describe it in more detail. For the rest, we do not repeat the standard document and only point out the parts that we think may cause confusion for readers. The details of the parts that are not covered here can be found in the MPEG standard document (ISO/IEC, 1996).

Transport stream packet header — This header contains four bytes that are assigned as eight fields:

Syntax                           No. of bits   Mnemonic
sync_byte                        8             bslbf
transport_error_indicator        1             bslbf
payload_unit_start_indicator     1             bslbf
transport_priority               1             bslbf
PID                              13            uimsbf
transport_scrambling_control     2             bslbf
adaptation_field_control         2             bslbf
continuity_counter               4             uimsbf

The mnemonics in the above table mean:

bslbf     Bit string, left bit first
uimsbf    Unsigned integer, most significant bit first

• The sync_byte is a fixed 8-bit field whose value is 0100 0111 (hexadecimal 47, decimal 71).
• The transport_error_indicator is a 1-bit flag; when it is set to 1, it indicates that at least 1 uncorrectable bit error exists in the associated transport stream packet. It is not reset to 0 unless the bit values in error have been corrected. This flag is useful for error concealment purposes, since it indicates the error location. When an error exists, either resynchronization or another concealment method can be used.
• The payload_unit_start_indicator is a 1-bit flag that is used to indicate whether the transport stream packet carries PES packets or PSI data. If it carries PES packets, a PES header starts in this transport packet. If it contains PSI data, a PSI table starts in this transport packet.
• The transport_priority is a 1-bit flag that is used to indicate that the associated packet is of greater priority than other packets having the same PID that do not have the flag bit set to 1. The original idea of adding a flag to indicate the priority of packets comes from video coding. The video elementary bitstream contains mostly bits that are converted from DCT coefficients. The priority indicator can set a partitioning point that divides the data into a more important part and a less important part. The important part includes the header information and low-frequency coefficients, and the less important part includes only the high-frequency coefficients, which have less effect on the decoding and quality of the reconstructed pictures.
• PID is a 13-bit field that provides information for multiplexing and demultiplexing by uniquely identifying to which bitstream a packet belongs.
• The transport_scrambling_control is a 2-bit field. The value 00 indicates that the packet is not scrambled; the other three values (01, 10, and 11) indicate that the packet is scrambled by a user-defined scrambling method. It should be noted that the transport packet header and the adaptation field (when present) are not scrambled. In other words, only the payload of a transport packet can be scrambled.
• The adaptation_field_control is a 2-bit indicator that is used to signal whether or not an adaptation field is present in the transport packet. The value 00 is reserved for future use; 01 indicates that there is no adaptation field, only payload; 10 indicates that there is only an adaptation field and no payload; and 11 indicates that there is an adaptation field followed by a payload in the transport stream packet.
• The continuity_counter is a 4-bit counter that increases with each transport stream packet having the same PID, wrapping around to 0 after reaching its maximum value.
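Since the four header bytes are packed bit fields, a demultiplexer normally recovers them with shifts and masks. The sketch below parses the eight fields listed above from the first four bytes of a packet; it is a minimal illustration and deliberately omits adaptation-field and payload handling.

```c
/* Parse the 4-byte transport stream packet header (illustrative sketch). */
#include <stdint.h>

typedef struct {
    uint8_t  sync_byte;                    /* must be 0x47 */
    uint8_t  transport_error_indicator;    /* 1 bit        */
    uint8_t  payload_unit_start_indicator; /* 1 bit        */
    uint8_t  transport_priority;           /* 1 bit        */
    uint16_t pid;                          /* 13 bits      */
    uint8_t  transport_scrambling_control; /* 2 bits       */
    uint8_t  adaptation_field_control;     /* 2 bits       */
    uint8_t  continuity_counter;           /* 4 bits       */
} TsHeader;

/* Returns 0 on success, -1 if the sync byte does not match. */
static int parse_ts_header(const uint8_t p[4], TsHeader *h)
{
    h->sync_byte                    = p[0];
    h->transport_error_indicator    = (p[1] >> 7) & 0x01;
    h->payload_unit_start_indicator = (p[1] >> 6) & 0x01;
    h->transport_priority           = (p[1] >> 5) & 0x01;
    h->pid                          = (uint16_t)(((p[1] & 0x1F) << 8) | p[2]);
    h->transport_scrambling_control = (p[3] >> 6) & 0x03;
    h->adaptation_field_control     = (p[3] >> 4) & 0x03;
    h->continuity_counter           =  p[3]       & 0x0F;
    return (h->sync_byte == 0x47) ? 0 : -1;
}
```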
From the header of the transport stream packet we can determine how to interpret the bits that follow. There are two possibilities: if the adaptation field control value is 10 or 11, the bits following the header form the adaptation field; otherwise, the bits are payload. The information contained in the adaptation field is described as follows.

Adaptation field — The structure of the adaptation field data is shown in Figure 20.5. The functionality of these fields is basically related to the timing and decoding of the elementary bitstream. Some important fields are explained below:

• Adaptation field length is an 8-bit field specifying the number of bytes immediately following it in the adaptation field, including stuffing bytes.
• Discontinuity indicator is a 1-bit flag which, when set to 1, indicates that the discontinuity state is true for the current transport packet. When this flag is set to 0, the discontinuity state is false. This indicator is used to signal two types of discontinuities: system time-base discontinuities and continuity-counter discontinuities. In the first type, the transport stream packet is a packet of a PID designated as a PCR-PID, and the next PCR represents a sample of a new system time clock for the associated program. In the second type, the transport stream packet may be of any PID type. If the transport stream packet is not designated as a PCR-PID, the continuity counter may be discontinuous with respect to the previous packet with the same PID, or a system time-base discontinuity may occur. For those PIDs that are not designated as PCR-PIDs, the discontinuity indicator may be set to 1 in the next transport stream packet with the same PID, but will not be set to 1 in three consecutive transport stream packets with the same PID.
• Random access indicator is a 1-bit flag that indicates that the current and subsequent transport stream packets with the same PID contain some information to aid random access at this point. Specifically, when this flag is set to 1, the next PES packet in the payload of the transport stream packets with the current PID will contain the first byte of a video sequence header or the first byte of an audio frame.
• Elementary stream priority indicator is used for data-partitioning applications in the elementary stream. If this flag is set to 1, the payload contains high-priority data, such as the header information or low-order DCT coefficients of the video data. Such a packet would be highly protected.
• PCR flag and OPCR flag: If these flags are set to 1, the adaptation field contains the PCR data and the original PCR data, respectively. These data are coded in two parts, a base and an extension.
• Splicing point flag: When this flag is set to 1, it indicates that a splice-countdown field is present to specify the occurrence of a splicing point. The splice point is used to splice two bitstreams smoothly into one stream. The Society of Motion Picture and Television Engineers (SMPTE) has developed a standard for seamless splicing of two streams (SMPTE, 1997). We will describe the function of splicing later.
• Transport private data flag: This flag is used to indicate whether the adaptation field contains private data.
• Adaptation field extension flag: This flag is used to indicate whether the adaptation field contains an extension field that gives more detailed splicing information.
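When the PCR flag is set, the two parts mentioned above are a 33-bit base followed by 6 reserved bits and a 9-bit extension, 48 bits (6 bytes) in all. The sketch below shows how a decoder might reassemble the 27 MHz PCR value from those six bytes; it assumes the pointer passed in addresses the first PCR byte within the adaptation field.

```c
/* Reassemble the program clock reference from the six PCR bytes of the
 * adaptation field: 33-bit base, 6 reserved bits, 9-bit extension.
 * (Illustrative sketch; 'p' must point at the first PCR byte.)          */
#include <stdint.h>

static uint64_t parse_pcr(const uint8_t p[6])
{
    uint64_t base = ((uint64_t)p[0] << 25) |
                    ((uint64_t)p[1] << 17) |
                    ((uint64_t)p[2] <<  9) |
                    ((uint64_t)p[3] <<  1) |
                    ((uint64_t)p[4] >>  7);                  /* 33-bit base (90 kHz) */
    uint32_t ext  = (uint32_t)((p[4] & 0x01) << 8) | p[5];   /* 9-bit extension      */
    return base * 300u + ext;                                /* value in 27 MHz ticks */
}
```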
Packetized elementary stream — The elementary stream data are carried in PES packets. A PES packet consists of a PES packet header followed by packet data, or payload. The PES packet header begins with a 32-bit start code that also identifies the stream or stream type to which the packet data belong. The first byte of each PES packet header is located at the first available payload location of a transport stream packet. The PES packet header may also contain decoding time stamps (DTS), presentation time stamps (PTS), elementary stream clock references (ESCR), and other optional fields such as DSM trick-mode information. The PES packet data field contains a variable number of contiguous bytes from one elementary stream. Readers can study this part of the syntax in the same way as described for the transport packet header and adaptation field.

Program-specific information — PSI includes both MPEG-2 system-compliant data and private data. In transport streams, the program-specific information is classified into four table structures: the program association table, the program map table, the conditional access table, and the network information table. The network information table is private data and the other three are MPEG-2 system-compliant data. The program association table provides the correspondence between a program number and the PID value of the transport stream packets that carry the program definition. The program map table specifies the PID values for the components of one or more programs. The conditional access (CA) table provides the association between one or more CA systems, their entitlement management messages (EMM), and any special parameters associated with them. The EMM are private conditional-access information that specify the authorization levels or the services of specific decoders. They may be addressed to a single decoder or to groups of decoders. The network information table is optional and its contents are private. Its contents provide physical network parameters such as FDM frequencies, transponder numbers, and the like.

20.2.3 TRANSPORT STREAM SPLICING

The operation of bitstream splicing is switching from one source to another according to the requirements of the application. Splicing is the most common operation performed in TV stations today (Hurst, 1997). Examples include inserting commercials into programming; editing, inserting, or replacing a segment in an existing stream; and inserting local commercials or news into a network feed. The most important problem for bitstream splicing is managing the buffer fullness at the decoder. Usually, the encoded bitstream satisfies the buffer regulation through a buffer control algorithm at the encoder, so that during decoding the bitstream does not cause the decoder buffer to overflow or underflow. A typical example of the buffer fullness trajectory at the decoder is shown in Figure 20.6. After bitstream splicing, however, the buffer regulation is no longer guaranteed; it depends on the selection of the splicing point and the bit rate of the new bitstream.
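The buffer behavior sketched in Figure 20.6 can also be checked numerically: bits flow into the decoder buffer at the channel rate and are removed at each decoding time. The toy simulation below flags overflow and underflow for a given schedule of access-unit sizes and decode times; the buffer size, bit rate, start-up delay, and access-unit sizes are made-up numbers purely for illustration, and with these numbers the trajectory stays within bounds.

```c
/* Toy decoder-buffer fullness check: fill at a constant bit rate, drain one
 * access unit at each decoding instant, and flag overflow or underflow.     */
#include <stdio.h>

int main(void)
{
    const double bitrate   = 4.0e6;      /* channel rate, bits per second (example) */
    const double buf_size  = 1.8e6;      /* decoder buffer size in bits   (example) */
    const double au_bits[] = { 6.0e5, 1.5e5, 1.2e5, 3.0e5, 1.4e5 }; /* access units */
    const double frame_dt  = 1.0 / 30.0; /* one access unit decoded per frame time  */
    const double start_delay = au_bits[0] / bitrate + 0.05; /* wait before decoding */

    double fullness = start_delay * bitrate;      /* bits accumulated before start  */
    double t = 0.0;

    for (int i = 0; i < 5; i++, t += frame_dt) {
        if (fullness > buf_size)                  /* peak occurs just before removal */
            printf("t=%.3f s: overflow by %.0f bits\n", t, fullness - buf_size);
        fullness -= au_bits[i];                   /* instantaneous removal at decode */
        if (fullness < 0.0)
            printf("t=%.3f s: underflow by %.0f bits\n", t, -fullness);
        fullness += bitrate * frame_dt;           /* refill until the next decode    */
    }
    printf("final fullness: %.0f bits\n", fullness);
    return 0;
}
```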
It is therefore necessary to have rules for selecting the splicing point. The committee on packetized television technology, PT20 of SMPTE, has proposed a standard that deals with the splice point for MPEG-2 transport streams (SMPTE, 1997). In this standard, two techniques have been proposed for selecting splicing points. One is seamless splicing and the other is nonseamless splicing. The seamless splicing approach can provide clean and instant switching of bitstreams, but it requires careful selection of splicing points in the video bitstreams. The nonseamless splicing approach inserts a "drain time," a period of time between the end of the old stream and the start of the new stream, to avoid overflow in the decoder buffer. The drain time ensures that the new stream begins with an empty buffer. However, the decoder has to freeze the final presented picture of the old stream and wait for a period of start-up delay while the new stream initially fills the buffer. The difference between seamless splicing and nonseamless splicing is shown in Figure 20.7.

FIGURE 20.6 Typical buffer fullness trajectory at the decoder.

FIGURE 20.7 Difference between seamless splicing and nonseamless splicing: (a) the VBV buffer behavior of seamless splicing, (b) the VBV buffer behavior of nonseamless splicing.

In the SMPTE proposed standard (SMPTE, 1997), optional indicator data in the PID streams (all the packets with the same PID within a transport stream) are used to provide important information about the splice for applications such as inserting commercial programs. The proposed standard defines a syntax that may be carried in the adaptation field of the transport stream packets. The syntax provides a way to convey two kinds of information. One kind is splice-point information, which consists of four splicing parameters: drain time, in-point flag, ground id, and picture-param-type. The other kind is splice-point indicators, which provide a method for signaling application-specific information. One such application example is the insertion indicator for commercial advertisements. This indicator includes flags to indicate that the original stream is obtained from the network and that the splice point is the time point where the stream goes out of or comes back into the network feed. Other fields give information about whether the splice is scheduled and how long it is expected to last, as well as an ID code. The details about splicing can be found in the proposed standard (SMPTE, 1997).

Although the standard provides a tool for bitstream splicing, there are still some difficulties in performing bitstream splicing in practice. One problem is that the selection of a splicing point has to take into account that the bitstream contains video encoded with a predictive coding scheme; therefore, the new stream should begin with an anchor picture. Other problems include uneven timing frames and the splicing of bitstreams with different bit rates. In such cases, one needs to be aware of the consequences related to buffer overflow and underflow.
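The drain time and start-up delay discussed above can be estimated with simple arithmetic: the old stream must be given time for the decoder buffer to empty before the new stream starts filling it, and the new stream then needs its own start-up delay. The rough back-of-the-envelope sketch below, with made-up numbers, approximates the drain time as the remaining buffered bits divided by the nominal bit rate; real splicers follow the SMPTE splice-point syntax rather than this simplification.

```c
/* Rough estimate of drain time and start-up delay around a nonseamless splice
 * (illustrative approximation only; not the SMPTE procedure).                 */
#include <stdio.h>

int main(void)
{
    const double fullness_at_splice = 1.2e6; /* bits left in the buffer (example)   */
    const double old_rate           = 4.0e6; /* bit rate of the old stream, bits/s  */
    const double new_rate           = 6.0e6; /* bit rate of the new stream, bits/s  */
    const double new_first_au       = 8.0e5; /* bits needed before first new decode */

    /* Time for the remaining old-stream bits to be decoded out of the buffer
     * while no new bits are sent: the buffer drains to empty.                */
    const double drain_time    = fullness_at_splice / old_rate;
    /* Start-up delay while the new stream initially fills the empty buffer.  */
    const double startup_delay = new_first_au / new_rate;

    printf("drain time      : %.3f s\n", drain_time);
    printf("start-up delay  : %.3f s\n", startup_delay);
    printf("display freeze  : about %.3f s\n", drain_time + startup_delay);
    return 0;
}
```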
20.2.4 PROGRAM STREAMS

The program stream is defined for the multiplexing of audio, video, and other data into a single stream for communication or storage applications. The essential difference between the program stream and the transport stream is that the transport stream is designed for applications with noisy media, such as terrestrial broadcasting. Since the program stream is designed for applications in relatively error-free environments, such as digital video disk (DVD) and digital storage applications, the overhead in the program stream is less than that in the transport stream. A program stream contains one or more elementary streams. The data from the elementary streams are organized in the form of PES packets. The PES packets from different elementary streams are multiplexed together. The structure of a program stream is shown in Figure 20.8.

FIGURE 20.8 Structure of program stream.

[...] length, rate bound, audio bound, video bound, stream id, and other system parameters. The rate bound is used to indicate the maximum rate in any pack of the program stream, and it may be used to assess whether the decoder is capable of decoding the entire stream. The audio bound and video bound are used to indicate the maximum numbers of audio and video streams in the program stream. There are some other flags [...]

[...] the program stream and their relationship to one another. The data structure of the program stream map is shown in Figure 20.9. Other special types of PES packets include the program stream directory and program element descriptors. The major information contained in the program stream directory includes the number of access units, the packet stream id, and the presentation time stamp (PTS). The program and program element descriptors [...] audio descriptor, and hierarchy descriptor. For details on these descriptors, the reader is referred to the standard document (ISO/IEC, 1996).

20.2.5 TIMING MODEL AND SYNCHRONIZATION

The principal function of the MPEG system is to define the syntax and semantics of the bitstreams that allow the system decoder to perform two operations among multiple elementary streams: demultiplexing and resynchronization [...]

[...] graphics, and synthetic video. Since MPEG-4 is the first object-based coding standard, reconstructing or composing a multiple audiovisual scene is quite new. The decoder not only needs the elementary streams for the individual audiovisual objects, but also the synchronization timing information and the scene structure. This information is called the scene description, and it specifies the temporal and spatial [...]

The system part of the MPEG-4 standard specifies the overall architecture of a general receiving terminal. Figure 20.12 shows the basic architecture of the receiving terminal. The major elements of this architecture are the delivery layer, the sync layer (SL), and the compression layer. The delivery layer consists of the FlexMux and the TransMux. At the encoder, [...] and other data, with the synchronization and scene description information, are multiplexed into the FlexMux streams. The FlexMux streams are transmitted to the TransMux of the delivery layer from the network. The function of the TransMux is not within the scope of the system standard and [...]

FIGURE 20.12 The MPEG-4 system terminal architecture. (From ISO/IEC 14496-1, 1998. With permission.)

[...] AAL5/ATM, and H223/Mux. Only the interface to the TransMux layer is part of the standard. Usually, the interface is the DMIF application interface (DAI), which is not specified in the system part, but in Part 6 of the MPEG-4 standard. The DAI specifies the data that need to be exchanged between the SL and the delivery layer. The DAI also defines the interface for signaling information required for session and channel [...]

[...] for packetizing the elementary streams into SL packets. The SL packets contain an SL packet header and an SL packet payload. The SL packet header provides the information for continuity checking in case of data loss and also carries the timing and synchronization information as well as fragmentation and random access information. The SL packet does not contain its length information. Therefore, SL packets [...] FlexMux tool. At the decoder, the SL packets are demultiplexed into the elementary streams in the SL. At the same time, the timing and synchronization information as well as the fragmentation and random access information are extracted for synchronizing the decoding process and subsequently for the composition of the elementary streams. At the compression layer, the encoded elementary streams are decoded [...]

[...] streams and descriptors to make protected IS 14496 content available to the terminal. An application may choose not to use an IPMP system, thereby offering no management and protection features.

20.4 SUMMARY

In this chapter, the MPEG system issues are discussed. Two typical systems, MPEG-2 and MPEG-4, are introduced. The major task of the system layer is to multiplex and demultiplex the video, audio, and other [...] contained in the bitstreams and used in the decoding processes.