A REALTIME SOFTWARE SOLUTION FOR RESYNCHRONIZING FILTERED MPEG2 TRANSPORT STREAM

Bin Yu, Klara Nahrstedt
Department of Computer Science, University of Illinois at Urbana-Champaign
DCL, 1304 W Springfield, Urbana IL 61801
{binyu, klara}@cs.uiuc.edu

ABSTRACT

With the increasing demand for and popularity of multimedia streaming applications over the current Internet, manipulating MPEG streams in real time in software is gaining more and more importance. In this work, we study the synchronization problem that arises when a gateway changes the data content carried in an MPEG2 Transport stream. In short, the distance between the original time stamps changes non-uniformly when video frames are resized, and decoders then fail to reconstruct the encoding clock from the resulting stream. We propose a cheap, real-time software approach to solve this problem, which reuses the original time stamp packets and adapts their spacing to accommodate the changes in bit rate. Experimental results from a real-time HDTV stream filter show that our approach is correct and efficient.

1. INTRODUCTION

Video streaming is gaining more and more attention from both academia and industry, and primarily three things are behind this popularity: a widely accepted video compression standard, MPEG2 [4]; a widely available Internet, with high bandwidth becoming commonplace; and ever-growing user demand for the more easily understood visual presentation of information. Beyond simply sending the video content, people are working on adapting the content at intermediate gateways before it reaches the client, either to tackle heterogeneity in resource availability or to increase client customization and interaction. Example prototype systems include ProxiNet [1], the IBM transcoding proxy [8], UC-Berkeley TranSend [5] and the Content Services Network [10]. There could be many kinds of video editing services, such as watermarking, frequency-domain low-pass filtering, frame/color dropping, external content embedding [11] and so on.

As we focus on the case of HDTV streaming, the tension between streaming and decoding becomes obvious. On the one hand, the Internet is bringing to end hosts video streams above 10 Mbps, thanks to technologies like IP multicast on the MBone, Fast Ethernet [9] and Gigabit Ethernet [3] in office buildings, and xDSL [2] and cable modems [6] at home. On the other hand, PCs and even gateway servers are still not able to decode or perform non-trivial video manipulation on high volume HD streams in real time, for lack of sufficient computing power and real-time support. For example, with an ordinary 30-frame-per-second HDTV stream at 18 Mbps, a 100 Mbps local area network could afford several high definition video conference sessions in an office building, but even the most advanced desktop computer can only decode and render two frames per second. Also, the PC monitor can never match the experience rendered by TV screens and big video walls, and in many situations high definition video needs to be shown on large screens for a large audience.

In such a situation, we propose to combine the software video delivery channel with hardware decoding/rendering interfaces by using desktop PCs to receive and process the HD video streams and then feed the resulting streams into a hardware decoding board. For example, in [11], we presented how we implemented software real-time Picture-in-Picture for HDTV streams in this way. However, one key problem we are facing is that hardware decoding boards rely on the time stamps contained in MPEG2 Transport Layer streams to maintain their hardware clock, while almost all software editing operations compromise these time stamps.
This problem has to be solved before any similar software video manipulation can be applied to HD video streams, and in this paper we present our solutions to it. Our solutions are cheap in the sense that they are simple and easy to implement, and no hardware real-time support is necessary. This way, they can be adopted by desktop PCs or intermediate gateway servers at minimum extra cost.

This paper is organized as follows. In section 2, we briefly introduce how the synchronization between MPEG encoder and decoder works according to the MPEG2 standard, and the problem of re-synchronization that arises after video editing operations. Our solution is then discussed in detail in section 3, and experimental results follow in section 4. Finally, we discuss related work and conclude this paper in section 5.

Figure 1: Synchronization between the encoder and decoder

2. THE SYNCHRONIZATION PROBLEM

In this section, we first briefly review how the time stamps encoded in an MPEG2 Transport stream are used by the decoder to reconstruct the encoder's clock, and then we introduce what kind of video editing system we are focusing on and how it affects the synchronization between the encoder and the decoder.

2.1 The MPEG2 Transport Layer Stream Timestamps

Figure 1 shows how MPEG2 Transport streams maintain synchronization between the sender, which encodes the stream, and the receiver, which decodes it. As the elementary streams carrying video and audio content are packetized, their target Decoding Time Stamp (DTS) and Presentation Time Stamp (PTS) are determined based on the current sender clock and inserted into the packet headers. For video streams, the access unit is a frame, and both DTS and PTS are given only for the first bit of each frame, after its picture header. These time stamps are later used by the decoder to control the timing at which it starts decoding and presentation. For example, suppose that at time t0 s an encoded frame comes to the multiplexing stage at the sending side, and the encoder believes (based on calculations using predefined parameters) that the decoder should begin to decode this frame d s after receiving it and output the decoded frame p s thereafter. Assuming the decoder can reconstruct the encoder clock, and that the time at which it receives this frame is also t0 s, then the DTS should be set to t0 + d s and the PTS to t0 + d + p s.

After that, as all of these packetized elementary stream packets are further multiplexed together, the final stream is time-stamped with the Program Clock Reference (PCR), which is obtained by periodically sampling the encoder clock. The resulting transport layer stream is then sent over the network to the receiver, or stored on storage devices for the decoder to read in the future. As long as the delay the whole stream experiences remains constant from the receiver's point of view, the receiver should be able to reconstruct the sender's clock that was used when the stream was encoded. The accuracy and stability of this recovered clock is very important, since the decoder will match the PTS and DTS against this clock to guide its decoding and displaying activities.

Figure 2: MPEG2 Transport Stream Syntax

Knowing the general idea of the timing, we now introduce how the Transport Layer syntax works, as shown in Figure 2. All sub-streams (video, audio, data and time stamps) are segmented into small packets of constant size (188 bytes), and the Packet ID (PID) field in the 4-byte header of each packet tells which sub-stream the packet belongs to.
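To make the packet syntax concrete, the following minimal sketch (our illustration, not code from the paper) parses the fixed 4-byte packet header and demultiplexes a stream by PID, following the ISO/IEC 13818-1 bit layout; the function names are ours.

```python
# Minimal sketch (our illustration, not from the paper): parse the 4-byte
# MPEG2 TS packet header and group packets by PID.

TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47

def parse_ts_header(packet: bytes) -> dict:
    """Extract the header fields of one 188-byte Transport Stream packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid TS packet")
    return {
        # PID is 13 bits: the low 5 bits of byte 1 plus all of byte 2
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],
        "payload_unit_start": bool(packet[1] & 0x40),
        "adaptation_field_control": (packet[3] >> 4) & 0x03,
        "continuity_counter": packet[3] & 0x0F,
    }

def demux_by_pid(stream: bytes) -> dict:
    """Split a multiplexed byte stream into per-PID packet lists."""
    substreams = {}
    for off in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        pkt = stream[off:off + TS_PACKET_SIZE]
        substreams.setdefault(parse_ts_header(pkt)["pid"], []).append(pkt)
    return substreams
```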
The PCR packets are placed at constant intervals, and they form a running time line along which all other packets are positioned at their target time points. On this time line, each 188-byte packet occupies one time slot, and the exact time stamp of each packet/slot can be interpolated using the neighboring PCR packets. Data packets arrive and are read into the decoder buffer at a constant rate, and this rate can be calculated by dividing the number of bits between any two consecutive PCR packets by the difference between their time stamps. In other words, if the number of packets between any two PCR packets remains constant, then the difference between their time stamps should also be constant.
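As an illustration of this interpolation rule, the sketch below computes the constant bit rate implied by two consecutive PCR packets and the interpolated time stamp of any slot between them. This is our own sketch, assuming the PCR values have already been converted from clock units to seconds and that slots are numbered from the start of the stream.

```python
TS_PACKET_BITS = 188 * 8

def implied_bitrate(slot1, time1, slot2, time2):
    """Bit rate implied by two consecutive PCR packets (bits/second)."""
    return (slot2 - slot1) * TS_PACKET_BITS / (time2 - time1)

def slot_timestamp(slot, slot1, time1, slot2, time2):
    """Interpolated time stamp of an arbitrary 188-byte slot."""
    return time1 + (slot - slot1) * (time2 - time1) / (slot2 - slot1)

# Two PCRs 1000 slots and 83.6 ms apart imply 1000*188*8/0.0836 bits/s,
# i.e. roughly the 18 Mbps of the HD streams used later in the paper.
```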
In an ideal state, packets are read into the decoder at the constant bit rate, and whenever a new PCR packet arrives, its time stamp matches the receiver clock exactly, which confirms to the decoder that so far it has successfully reconstructed the same clock as the encoder. However, since PCR packets may have experienced jitter in network transmission or storage device access before they arrive at the receiver, we cannot simply set the receiver's local clock to the time stamp carried by the next incoming PCR, no matter when it comes. To smooth out the jitter and maintain a stable clock with a limited buffer size at the receiver, the receiver will generally resort to a smoothing technique like the Phase-Locked Loop (PLL) [7] to generate a stable clock from the jittered PCR packets. A PLL is a feedback loop that uses an external signal (the incoming PCR packets in our case) to tune a local signal source (generated by a local oscillator in our case) so as to produce a relatively more stable result signal (the receiver's reconstructed local clock in our case). So long as the timing relation between PCR packets is correct, the jitter can be smoothed out with a PLL.

Figure 3: Layered coding scheme of MPEG-2 Transport Stream

2.2 HDTV Stream Editing/Streaming Test Bed

In the following, we will base our discussion on a video editing/streaming test bed. A live high definition digital TV stream from the satellite or from an HD storage device is fed into the server PC, which encodes it into an MPEG2 Transport stream and multicasts this stream over the high speed local area network. Players on the client PCs join this multicast group to receive the HD stream, and then feed it into the decoding board. The decoded analogue signal is then sent to a wide-screen TV for display. Our filter receives this stream in the same way as a normal player, and performs various kinds of video editing operations on it in real time, such as low pass filtering, frame/color dropping and visual information embedding [11]. Multiple editing operations can be applied to the same stream in a chain, and the resulting streams at all stages are available to clients through other multicast groups.

2.3 How Video Editing Affects Clock Reconstruction

Since the timing and spacing of PCR packets are so important for clock reconstruction, video editing operations will obviously cause malfunctions, since they change both. First, all intermediate operations that a video stream goes through before it reaches the decoder contribute to the delay and jitter of the PCR sub-stream. Different filtering operations, such as low pass filtering and Picture-in-Picture, or even the same operation, take varying processing time for the necessary calculations on different frames or on different parts of the same frame. In compensation, traditional solutions would either try to adjust the resulting stream at each intermediate point or push all the trouble to the final client. The former suffers from the fact that processing times for different operations and frames tend to be quite different and variable, which makes it very hard to find a locally optimal answer. The latter implies that the client needs a very large buffer and a long waiting time because of the unpredictable delay and jitter of the incoming stream. We will see later how our solutions solve this problem by utilizing the inherent PCR time stamps of the streams.

The second problem, the changed spacing between PCR packets, is even more intractable. As we said above, each access unit (video frame or audio packet) should be positioned on the time line formed by the PCR sub-stream. If a video frame arrives at the receiver at its destined time point, the decoder is able to correctly schedule where and for how long to buffer it before decoding it. However, after the filtering operations a video frame normally becomes smaller or larger: it takes fewer or more packets to carry, and so the following frames are dragged earlier or pushed later along the time line. In such circumstances, if we keep both the time stamps and the spacing of the PCR packets unchanged, the receiver's clock can still be correctly reconstructed, but the arrival time of each frame will be skewed along the time line. For example, if the stream is low pass filtered, every frame becomes shorter, and so following frames are dragged forward to fill up the vacated space. If the decoder still reads in data at the original speed, it finds that future frames begin to arrive earlier and earlier. Since they are all buffered until their stamped decoding time, the buffer will overflow in the long run no matter how large it is.

The fundamental problem is that after the filtering, the actual bit rate becomes lower or higher, but the data is still read in by the decoder at the original rate, since the timing and spacing of the PCR packets are not changed. So if the new rate is lower, more and more future frames are read in by the decoder, causing the receiving buffer for the network connection to be emptied while the decoder's decoding buffer overflows; on the other hand, if the new rate is higher, then at some point in the future data will remain in the receiving buffer and not be read in by the decoder even at its decoding time.
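The following toy calculation (our illustration, with hypothetical numbers) makes the lower-rate case concrete: if filtering shrinks the content to 80% of 18 Mbps but the decoder keeps ingesting at the PCR-implied 18 Mbps, the decoder-side backlog grows without bound.

```python
READ_RATE = 18e6           # bits/s, fixed by the unchanged PCR timing
CONTENT_RATE = 0.8 * 18e6  # bits/s actually needed after filtering
FPS = 30

buffered = 0.0
for frame in range(1, 3001):            # simulate 100 seconds of video
    buffered += READ_RATE / FPS         # read in at the original rate
    buffered -= CONTENT_RATE / FPS      # one frame's worth leaves at its DTS
    if frame % 900 == 0:
        print(f"t={frame // 30:3d}s  buffer {buffered / 8e6:5.1f} MB ahead")
# -> the backlog grows by 3.6 Mbit/s, i.e. about 13.5 MB every 30 seconds
```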
3. OUR SOLUTIONS

To solve the problems described above, an immediate thought would be to do the same kind of clock reconstruction at the filter as the decoder does, and then re-generate the PCR packets to reflect the changes at the filter output. However, smoothing mechanisms like the PLL are implemented in hardware circuits containing a voltage controlled oscillator that generates high frequency signals to be tuned with the incoming PCR time stamps. This is not easy, if not impossible, to do in software on computers without hardware real-time support. Therefore, a pure software mechanism that does not require hardware real-time support would enable us to distribute the video editing service across the network to any point on the streaming path. Another goal is to achieve a cheap and efficient solution that can easily be implemented and carried out by any computer with modest CPU and memory resources.

The key idea behind our solution comes from the observation that the DTS and PTS are only associated with the beginning bit of each frame. Consequently, so long as we manage to fix that point to the correct position on the time line, the decoder should work fine even if the remaining bits of the frame following the starting point are stretched shorter or longer.

3.1 Simple Solution: Padding

Following the discussion above, we have designed a simple solution that works for bit rate reducing video editing operations. We do not change the time stamp or the position of any PCR packet along the time line within the stream, and we also preserve the position of each frame header, and thus that of the beginning bit of every frame. What is changed is the size of each frame in terms of the number of bits, and we simply pack the filtered bits of a frame closely following the picture header. Since each frame takes fewer 188-byte packets to carry, yet the frame headers are still positioned at their original time points, there will be some "white space" left between the last bit of one frame and the first bit of the header of the next frame. The capacity of this space is exactly the reduction in the number of bits used to encode this frame as a result of the video editing operations, and we can simply pad this space with empty packets (NULL packets). This solution is very simple to understand and implement, and it preserves the timing synchronization, since we only need to pack the filtered bits of each frame continuously after the picture header and then insert NULL packets until the header of the next frame. No new time stamps need to be generated in real time, and the bit rate remains stable at the original rate.
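A minimal sketch of this padding step is given below. It is our code, under simplifying assumptions: the caller knows the slot range of each frame and which slots carried its video payload, and `frame_packets` is the filtered frame already re-packetized into 188-byte packets. NULL packets carry the reserved PID 0x1FFF.

```python
NULL_PACKET = bytes([0x47, 0x1F, 0xFF, 0x10]) + bytes(184)  # PID 0x1FFF

def pad_frame(out_slots, header_slot, next_header_slot,
              frame_packets, is_video_payload):
    """Overwrite this frame's original video slots with the (fewer)
    filtered packets, packed right after the picture header; slots left
    over become NULL packets. PCR and other packets stay untouched."""
    k = 0
    for i in range(header_slot + 1, next_header_slot):
        if not is_video_payload(i):
            continue                      # PCR/audio/data keep their slots
        out_slots[i] = (frame_packets[k] if k < len(frame_packets)
                        else NULL_PACKET)
        k += 1
```

Because the output has exactly as many 188-byte slots as the input and no PCR or picture-header packet moves, the decoder observes the original timing throughout.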
However, this solution inevitably has some drawbacks. First, it can only handle bit rate reduction operations. We only fix the header of each frame to its original position on the time line, which means a changed frame must not occupy more bits than the distance between the current frame header and the next. This property does not always hold, since some filtering operations, like information embedding and watermarking, may increase the frame size in bits. Secondly, the saved bits are replaced with NULL packets to maintain the original constant bit rate and the starting point of each frame, and this ironically runs counter to our initial goal of bit rate reduction for operations like low pass filtering and color/frame dropping. The resulting stream contains the same number of packets as the original one; the only difference is that the number of bits representing each frame has shrunk, yet this saving is spent immediately by padding NULL packets at the end of each frame.

We want to mention that there does exist an approach to bypass the second problem. Up to now we have been using a filter model that is transparent to the client player, which confines us strictly to the MPEG2 standard syntax. However, if some of the filtering intelligence is exported to the end hosts, then some saving can be expected. For example, instead of inserting NULL packets, we may compress them by inserting only a special packet saying that the next n packets should be NULL packets. At the end host, a stub proxy could watch the incoming stream and, on seeing this packet, replace it with the supposed amount of padding packets before sending the stream to the client player. Note that this padding is important to maintain correct timing, especially if the client is using a standard hardware decoding board. This way, bandwidth is indeed saved, but at the price of relying on a non-standard protocol outside MPEG2. Of course, this will in turn introduce the problems associated with non-standardized solutions, such as difficulty in software maintenance and upgrading. Therefore, we only consider this a secondary choice, and not a major solution.

3.2 Enhanced Solution: Time-Invariant Bitrate Scaling

To ultimately solve the synchronization problem, a more general algorithm has been designed. The key insight behind it is that we can change the bit rate to another constant value while preserving the PCR time stamps, by changing the number of packets between any PCR pair to another constant value. This way, we can scale the distance between PCR packets and achieve a fixed new bit rate, as if the time line were scaled looser or tighter to carry more or fewer packets, yet we do not need to re-generate new PCR time stamps, which would rely on hardware real-time support. All non-video stream packets can simply be mapped to the new position on the scaled output time line that corresponds to the same time point as on the original input time line. In case no exact mapping is available, because packets are aligned at units of 188 bytes, we can simply use the nearest time point on the new time line without introducing any serious problem. For the video stream, the same kind of picture header fixing and frame data packing is conducted as in the first solution, but in a scaled way.

Figure 4: Example: 2/3 shrinking

An example of shrinking a stream to 2/3 of its bandwidth is given in Figure 4. All non-video packets and video packets that carry picture headers are mapped to their corresponding positions on the new time line, and so their distances are also shrunk to 2/3 of the original. After the video editing operations, the resulting video packets are packed closely and as early as possible within the new stream following the frame header. Intuitively, the filtered video data is squeezed into the remaining space between the non-video packets and the picture header packets. For example, suppose that in the input stream packet 6 is a frame header, packet 9 is an audio packet, and packets 7, 8 and 10 through 24 are video data from the frame started by packet 6. After 2/3 shrinking, packet 6 is positioned in slot 4 and packet 9 goes to slot 6. The other video data packets are processed by the video editing filter, and the resulting bits are packed again into packets of 188 bytes each. All empty slots, such as slots 5 and 7 through 16, are then used to carry the resulting bits. If the filter shrinks the video frame to occupy less than 2/3 of its original number of bits, the new slots will be enough to carry the resulting frame.

This algorithm is also very simple to implement. For each non-video packet, its distance (in number of packets) from the last PCR packet is multiplied by a scaling factor s, and the result is used to set the distance between this packet and the last PCR packet in the output stream. For video frames, the header containing the DTS and PTS is scaled and positioned in the same way, and the remaining bits are closely appended to the header in the result stream. Note that when s is set to 1, this reduces to the simple solution above.
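A sketch of this mapping is shown below (our code, not the paper's implementation). It processes one PCR-to-PCR interval at a time: non-video packets and picture headers are anchored at the nearest scaled slot, and the filtered video packets fill the remaining slots in order, with NULL packets as a fallback; rounding collisions between anchored packets are ignored for simplicity.

```python
NULL_PACKET = bytes([0x47, 0x1F, 0xFF, 0x10]) + bytes(184)

def scale_interval(interval, filtered_video, s):
    """interval: list of (slot, packet, kind) with slot counted from the
    last PCR and kind in {"pcr", "header", "video", "other"}.
    filtered_video: this interval's filtered video data, already cut into
    188-byte packets. Returns the rescaled interval; PCR time stamps are
    reused verbatim, only the packet count between them changes."""
    new_len = round(len(interval) * s)
    out = [None] * new_len
    for slot, pkt, kind in interval:
        if kind != "video":               # anchor at the nearest scaled slot
            out[min(round(slot * s), new_len - 1)] = pkt
    video = iter(filtered_video)          # pack video early into empty slots
    for i in range(new_len):
        if out[i] is None:
            out[i] = next(video, NULL_PACKET)
    return out
```

With s = 1 this degenerates into the padding solution of section 3.1, matching the remark above.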
Now the only problem is how to determine s for a specific streaming path. If we shrink the time line too much and for some frames the bit rate reducing operation does not have a significant effect, then again we will not have enough space to squeeze in such a frame, which will push the beginning bit of the next frame behind schedule. On the other hand, if we shrink the time line too little, or expand it (s > 1) too much, then more space will be padded with NULL packets to preserve the important time points, leading to a waste of bandwidth. There exists one optimal scale factor that balances these two forces: it fulfills the conditions that (1) the filtered frame data can always be squeezed into the scaled stream, and (2) the number of NULL packets needed for padding is minimal. However, this optimal scale factor is hard to estimate in advance, since different operations with various parameters have quite varying effects on distinct video clips in terms of bit rate change. Therefore, in our current implementation, we simply use a slightly exaggerated scale factor based on the operation type and parameters. For example, for low pass filtering with a threshold of 5, a scaling factor of 0.80 works for almost all streams. Even if we meet a frame that still occupies more than 0.9 of its original number of packets after the filtering, only the next few frames may be slightly affected. Since a smaller-than-average frame can be expected to follow shortly, this local skew is absorbed easily by the decoder and does not have any chain effect. Our next step will be to look into how to "learn" this optimal scale factor by analyzing the history of a stream's bit rate changes and adjusting the factor on the fly. It is not specified by the MPEG standard how a decoder, especially a hardware decoding board, should react if the incoming stream changes from one constant bit rate to another, and it is also an open question how quickly it would adapt to the new rate.

4. EXPERIMENTAL RESULTS

Figure 5: Result of time line scaling

Figure 5 shows the effect of the time line scaling approach for a low pass filter with threshold 5. Each point on the x axis represents an occurrence of a PCR packet, and the y axis shows in three colors how many video packets, NULL packets or packets of other data streams lie between each two PCR packets. We can see that the distribution of the three areas is kept almost constant for the original stream, except for more NULL packets at the end of a frame. Without scaling, the number of video packets varies across different PCR intervals and a lot of extra space is padded with NULL packets, as shown in the upper right subfigure. On the other hand, if we scale with a scaling factor of 80%, then the padding occurs mostly only at the end of frames and the stream contains mostly useful data.

            Original BR (Mbps)   Average resulting BR (Mbps)   Average relative change   Suggested s
  LP (10)         18.0                    15.45                        0.86                  0.90
  LP (5)          18.0                    13.71                        0.76                  0.80
  PIP             18.0                    19.04                        1.06                  1.10

Table 1: Final Statistics

One thing we need to point out here is that the skew of video access units along the time line still exists with this scaling approach. What happens is that after the filtering operation, each frame shrinks to a size mostly less than 80% of its original size. If we mask out all other packets, we can see that in the video stream frames are packed closely one after another. If one frame takes more space than its share, the next frame may be pushed behind its time point, but this skew will be compensated later by another frame with a larger shrink effect. As we said before, this kind of small jitter around the exact time point on the scaled time line is acceptable, and it is the change in the bit rate at which the decoder reads in the data that fundamentally makes our scaling algorithm able to solve the problem.
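The "learning" step suggested above could be as simple as the following sketch (our extrapolation, not the paper's implementation): keep a history of per-frame shrink ratios and choose a slightly exaggerated factor, mirroring how Table 1 rounds the measured average relative change up (0.76 to 0.80, 1.06 to 1.10).

```python
def suggest_scale_factor(ratios, margin=0.05, quantum=0.05):
    """ratios: history of filtered/original frame sizes for one stream."""
    avg = sum(ratios) / len(ratios)
    padded = avg + margin               # headroom for occasional large frames
    return round(padded / quantum) * quantum

print(suggest_scale_factor([0.74, 0.78, 0.75, 0.77]))   # -> 0.8
```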
Another experiment concerns how to determine the scaling factor s for a particular kind of video editing operation. Three kinds of operations were tested: low pass filtering with thresholds of 10 and 5, and Picture-in-Picture. The original stream is an HD stream, "stars1.mpg", with a bit rate of 18 Mbps in MPEG2 Transport Layer format. The embedded frame used for Picture-in-Picture is a shrunk version of another HD stream, "football1.mpg". Since the content of this stream is more condensed than the background stream (i.e., more DCT coefficients are used to describe each block), it is expected that the bit rate will increase after the Picture-in-Picture operation. The final statistics are shown in Table 1. The experimental results show that with the suggested scaling factor based on real world statistics, our time-invariant scaling algorithm successfully solves the synchronization problem.

5. CONCLUSION

In this paper, we focus on the scenario of streaming HD video in MPEG2 Transport Layer syntax with software streaming/processing and hardware decoding, which will remain commonplace until the processing power of personal computers becomes strong enough to cope with high bandwidth/definition video streams. We have studied an important problem: decoders may lose synchronization and fail to reconstruct the encoder's clock because of video editing operations on the streaming path. We have proposed two solutions to this problem, both based on the idea of reusing the original time stamp packets (PCR packets) and adjusting the number of packets between them to reflect the changes in bit rate caused by the video editing operations. Experimental results have shown that our solutions are efficient and work well without any requirement for real-time support from the system. As far as we know, our work is among the first efforts to promote real-time software filtering of High Definition MPEG2 streams, and it can benefit many real-time applications that work with MPEG2 system streams, such as HDTV broadcast.

ACKNOWLEDGMENT

This work was supported by NASA under contract number NASA NAG 2-1406 and by the National Science Foundation under contract numbers NSF CCR-9988199, NSF CCR 0086094, NSF EIA 99-72884 EQ, and NSF EIA 98-70736. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or NASA.

REFERENCES

[1] ProxiNet. http://www.proxinet.com
[2] Emerging high-speed xDSL access services: architectures, issues, insights, and implications. IEEE Communications Magazine, 37(11):106-114, Nov. 1999.
[3] Gigabit Ethernet circuits and systems. Tutorial Guide, ISCAS 2001, The IEEE International Symposium on Circuits and Systems, pp. 9.4.1-9.4.16, 2001.
[4] ISO/IEC IS 13818. Generic coding of moving pictures and associated audio information. 1994.
[5] A. Fox, S. D. Gribble, Y. Chawathe and E. Brewer. Adapting to network and client variation using active proxies: lessons and perspectives. IEEE Personal Communications, 5(4):10-19, 1998.
[6] A. Dutta-Roy. An overview of cable modem technology and market perspectives. IEEE Communications Magazine, 39(6):81-88, June 2001.
[7] C. E. Holborow. Simulation of Phase-Locked Loop for processing jittered PCRs. ISO/IEC JTC1/SC29/WG11, MPEG94/071, 1994.
[8] R. Mohan, J. R. Smith and C.-S. Li. Scalable multimedia delivery for pervasive computing. ACM Multimedia 1999, 1999.
[9] J. Spragins. Fast Ethernet: Dawn of a New Network [New Books and Multimedia]. IEEE Network, 10(2):4, March-April 1996.
[10] W.-Y. Ma, B. Shen and J. Brassil. Content Services Network: The Architecture and Protocols. Proceedings of the Sixth International Workshop on Web Caching and Content Distribution, 2001.
[11] B. Yu and K. Nahrstedt. A Compressed-Domain Visual Information Embedding Algorithm for MPEG2 HDTV Streams. ICME 2002, 2002.