ICEIC 2014, Jan 15 - 18, 2014, Kota Kinabalu, Malaysia

Region-of-Interest Tracking Method for Video Plus Depth Coding

Nam Pham Thanh, Thanh Nguyen Xuan, Hung Luu Viet, Hung Bui Quang, Ha Le Thanh
Human-Machine Interaction Laboratory
University of Engineering and Technology, Vietnam National University, Hanoi
{nampt.mi12, thanhnx_54, hunglv_54, hungbq, ltha}@vnu.edu.vn

Abstract

3D video can bring viewers an exhilarating experience; however, it contains more detail than 2D video and therefore demands higher bandwidth for transmission. To overcome this hurdle, this paper proposes a method that detects and tracks the video regions viewers are interested in. These regions can then be coded at a higher bitrate than the rest of the frame to preserve visual quality while the total video bitrate is reduced. Experimental results show the effectiveness of our method, which can be applied in interactive video coders.

Keywords: video plus depth, ROI, tracking, detection

1. Introduction

In video communication, visual attention has been shown to affect how viewers perceive video content. In addition, managing the quality of video content is extremely challenging, especially when the goal is to reduce its size. Region-of-Interest (ROI) coding is a useful approach to shrink the size of the transmitted video without decreasing the perceived quality inside the ROI. To bring viewers the highest perceived quality, it is necessary to detect and track the ROI. Video plus depth is a kind of video that contains both color information and depth information, with which the ROI can be detected and tracked more efficiently.

To model attention objects, Han et al. [1] used three attributes (attention value, edge set, and homogeneity measure), which makes the method fairly complicated. Because of the complexity of their ROI detection algorithms, the approaches in [1] and [2] are difficult to apply on portable devices. In [3], ROI is detected as a pre-processing step of video coding using both luminance and chrominance information with skin-color matching; not only human faces but also hands are considered as ROI. Two ROI detection methods based on depth images, with and without skin-color detection, are proposed in [4]. In [5], ROI extraction sometimes produces an unexpected result in which an unimportant region is detected as ROI, which reduces coding efficiency without increasing perceived quality. Using many cues such as depth, illumination, motion, and contour, [6] and [8] present complex algorithms for extracting ROI in 3D multi-view video. Faces are usually the most interesting regions in videos, but extracting only faces as ROI, as in [7], is not sufficient, since human visual attention can focus on other objects as well.

For these reasons, this paper proposes a method for tracking ROI efficiently in order to encode video plus depth. The analysis of a video consists of three main steps: detecting the ROIs, tracking them, and predicting their movement. Specifically, ROI detection is based on a flood-fill algorithm, while ROI tracking uses either a block-matching algorithm or motion vectors from an H.264 encoder.

The rest of the paper is organized as follows: ROI detection, tracking, and some related problems and solutions are presented in Section 2. Section 3 describes the experiments performed to demonstrate the efficiency of the proposed method. Finally, Section 4 concludes with the remarkable aspects of this research.

2. Proposed method for ROI detection and tracking

In order to encode video with ROI, it is necessary to detect the position of the ROI in the first frame and to track it in subsequent frames. In this research, we build an interactive video coding model in which viewers choose their ROIs by touching or clicking on the video display screen. The touched location is recorded and signaled to the video encoder. Given this initial location, the encoder detects and tracks ROIs based on both the color and the depth information obtained from the video plus depth sequence. Finally, the detected ROI regions are coded at a higher bitrate than the non-ROI regions, which significantly reduces the total encoded information while the video quality in the ROIs remains acceptable to viewers. The details of ROI coding, however, are beyond the scope of this work.

2.1 ROI detection

Each frame of a 2D plus depth video sequence consists of a monoscopic 2D color frame and a depth image frame, which contains information about the depth of the objects in the scene. Observations show that any two adjacent points belonging to the same object have similar depth values. Starting from the input point, called the anchor point, the ROI is detected using a flood-fill algorithm. The basic idea of this algorithm is to visit every pixel neighboring the anchor point and to mark it as part of the ROI if it has features similar to the anchor point or to other points already in the ROI. Two adjacent pixels are considered to belong to the same region if the difference between their depth values is smaller than a defined threshold.

Fig. 1. A depth frame of the "Ballet" video. The point at location (x, y) is input by the user as a feature of the ROI. The pixels in the transition region between the dancer's legs and the floor have the same depth values; hence the flood-fill algorithm recognizes these two regions as one.
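The depth-based flood fill described above can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: the function name, the 4-connected neighborhood, and the value of the depth threshold are all assumptions made for the sake of the example.

```python
from collections import deque

import numpy as np


def detect_roi(depth, anchor, threshold=5):
    """Grow an ROI mask outward from the user-selected anchor point."""
    h, w = depth.shape
    roi = np.zeros((h, w), dtype=bool)
    queue = deque([anchor])          # queue Q of pixels waiting to be examined
    roi[anchor] = True

    while queue:
        r, c = queue.popleft()
        # visit the 4-connected neighbours of the current pixel
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not roi[nr, nc]:
                # adjacent pixels with similar depth belong to the same object
                if abs(int(depth[nr, nc]) - int(depth[r, c])) < threshold:
                    roi[nr, nc] = True
                    queue.append((nr, nc))
    return roi


# hypothetical usage: mask = detect_roi(depth_frame, anchor=(240, 320), threshold=5)
```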
2.2 ROI-off problem and solution

In some cases, two distinct objects share the same depth value. The flood-fill algorithm may then enlarge the ROI to cover two or more objects, which degrades coding efficiency. In the flood-fill algorithm, a queue Q stores the points that are waiting to be checked for membership in the ROI. Our solution to this problem is based on observing how the number of pixels in queue Q grows. In the correct detection case, the size of this queue has only one explosion period, during which it increases sharply. In the ROI-off case, there are two explosion periods, and the second one indicates that the ROI extraction is expanding into unexpected regions.

Fig. 1 shows the case of ROI detection in the "Ballet" video in which the flood-fill algorithm extracts the region of the dancer. In this case, the original algorithm would mistakenly treat the floor as part of the same region as the dancer, resulting in a much larger ROI. When the ROI is extracted from the dancer, the sizes of queue Q in ten consecutive frames are as shown in Fig. 2. It can easily be seen that in all examined frames the size of queue Q has two explosion periods: the first corresponds to the expansion over the dancer, while the second corresponds to the extraction of the floor. From this observation, we conclude that when the size of queue Q starts to explode for the second time, the detection process has begun to absorb the wrong region rather than the expected ROI. Therefore, we apply a cut-off method inside the flood-fill algorithm that monitors the queue size: the flood fill stops extracting the ROI as soon as the second explosion of queue Q is observed, so the unexpected region is not included in the ROI. The size of queue Q in the same ten consecutive frames with the cut-off applied is shown in Fig. 3.

Fig. 2. The sizes of queue Q in 10 consecutive frames of the "Ballet" video.

Fig. 3. The development of queue Q's size in 10 frames of the "Ballet" video when the cut-off method is applied.
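One way to realize the cut-off is to record the queue size at every iteration and abort region growing when a second sharp increase is observed. The sketch below is only an assumed interpretation of that rule: the size thresholds used to declare the start and the end of an "explosion" (`explosion_size`, `quiet_size`) are our own heuristics and are not taken from the paper.

```python
from collections import deque

import numpy as np


def detect_roi_with_cutoff(depth, anchor, threshold=5,
                           explosion_size=200, quiet_size=20):
    """Flood fill that stops when queue Q explodes for a second time."""
    h, w = depth.shape
    roi = np.zeros((h, w), dtype=bool)
    queue = deque([anchor])
    roi[anchor] = True

    explosions = 0        # number of explosion periods observed so far
    in_explosion = False  # currently inside an explosion period?

    while queue:
        size = len(queue)
        if not in_explosion and size > explosion_size:
            in_explosion = True
            explosions += 1
            if explosions == 2:
                # second explosion: the region is leaking into another
                # object, so stop the extraction here (the cut-off)
                break
        elif in_explosion and size < quiet_size:
            in_explosion = False   # the first explosion has died down

        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not roi[nr, nc]:
                if abs(int(depth[nr, nc]) - int(depth[r, c])) < threshold:
                    roi[nr, nc] = True
                    queue.append((nr, nc))
    return roi
```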
2.3 ROI tracking

In order to improve the efficiency of ROI extraction, the ROI is tracked after it has been detected. Once the information about the ROI in the first frame has been collected, it is necessary to keep track of this ROI in subsequent frames, since this information will be used for encoding the ROI. Two methods are used to follow the movement of the ROI.

First, an independent module was implemented that estimates block-based motion vectors with a block-matching algorithm. The main idea of block matching is to divide each frame into so-called macroblocks and to find the best matching block by comparing the absolute differences of all pixels of two blocks. This module worked well but was time-consuming because of the many complex calculations it performs.

Another solution, with low computational complexity and high performance, is to reuse the block-based motion vectors produced by the H.264 encoder. The whole ROI is divided into macroblocks. For a macroblock at location (a, b) with motion vector (m, n), its location in the next frame is predicted as

    (a', b') = (a, b) + (m, n)    (1)

Fig. 4. Motion estimation of the ROI.

In Fig. 4, the frame is divided into macroblocks, represented by green dots. The man is moving, so all macroblocks in that region have their own motion. When the ROI moves, its features such as shape and color usually remain the same or change only slightly between two consecutive frames. Therefore, the motion vector of the whole ROI is estimated from the motions of all macroblocks in the ROI as

    (\bar{\omega}_x, \bar{\omega}_y) = \frac{1}{n} \sum_{i=1}^{n} (\omega_x^i, \omega_y^i)    (2)

where n is the total number of macroblocks in the ROI and (\omega_x^i, \omega_y^i) is the motion vector of the i-th macroblock. To represent the movement of the ROI from frame to frame, we estimate the movement of the anchor point by the approximate motion vector of the ROI:

    (a_{k+1}, b_{k+1}) = (a_k, b_k) + (\bar{\omega}_x, \bar{\omega}_y)    (3)

where (a_k, b_k) is the location of the anchor point in the k-th frame and (a_1, b_1) is the input point in the first frame.
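Equations (1)-(3) translate almost directly into code. The sketch below assumes that the per-macroblock motion vectors are already available (for example, reused from the H.264 encoder as described above); the function names and the data layout are our own choices, not the authors'.

```python
import numpy as np


def predict_macroblock(pos, mv):
    """Eq. (1): predicted macroblock location in the next frame."""
    return (pos[0] + mv[0], pos[1] + mv[1])


def roi_motion_vector(mb_motion_vectors):
    """Eq. (2): average motion vector over all macroblocks inside the ROI.

    mb_motion_vectors: iterable of (mx, my) pairs, one per ROI macroblock.
    """
    mvs = np.asarray(list(mb_motion_vectors), dtype=float)
    return tuple(mvs.mean(axis=0))        # (w_x_bar, w_y_bar)


def update_anchor(anchor, mb_motion_vectors):
    """Eq. (3): move the anchor point by the averaged ROI motion."""
    wx, wy = roi_motion_vector(mb_motion_vectors)
    return (anchor[0] + wx, anchor[1] + wy)


# hypothetical usage with motion vectors of three ROI macroblocks:
# anchor = update_anchor((120, 240), [(2, 0), (3, -1), (2, 1)])
```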
2.4 Cumulative error

When the ROI is tracked through a video sequence by motion vectors, a small error is introduced in each frame. If these errors are not corrected, they accumulate over a large number of frames and eventually cause the ROI to be tracked incorrectly. Therefore, at each frame, after computing the ROI motion vector, we take an additional step to check whether the current object is the one we are looking for, by comparing the depth value at the anchor point with the recorded average depth value of the ROI in the previous frame against a threshold determined experimentally. The object detection based on the flood-fill algorithm is then reapplied to extract the ROI, and the anchor point's location is redefined as the average location of all pixels in the ROI:

    (x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i, y_i)    (4)

where (x, y) is the current location of the anchor point and n is the total number of pixels in the ROI.

2.5 Occlusion problem

During ROI tracking there are situations in which the tracked ROI suddenly disappears. As can be seen in Fig. 5, the ROI is the man in the white region who is moving behind another man. In Fig. 6, the ROI has almost disappeared and, in principle, the system can no longer track it. To improve the robustness of our approach, we predict the most likely location of the object when it reappears. When an object disappears, our approach enters a special mode called Prediction Mode (PM). The basic idea of PM rests on two assumptions:

a) A video object normally moves with a constant velocity. The object's velocity data is collected over the previous frames. Since the velocity of the object is proportional to its displacement from frame to frame, we use the vector \bar{v} to represent the object velocity. Based on this velocity data, we predict the new position where the object may reappear.

b) PM is only applied for a limited period of time. Observation shows that the probability of the ROI reappearing after a long absence is very small. Hence, a threshold constant TIME_OUT (in seconds) is defined. After this period, if the ROI has not reappeared, we consider that it has disappeared permanently and stop predicting its location.

Fig. 5. The ROI is moving into an invisible area behind the standing man.

Fig. 6. The ROI disappears.
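A possible realization of the Prediction Mode is sketched below: the anchor is advanced by the last observed per-frame velocity, and prediction is abandoned once the time-out expires. The class name, the way the velocity is supplied, and the conversion of TIME_OUT into a frame count are assumptions; the paper's TIME_OUT value is left as a parameter because it is not reproduced in this copy.

```python
class PredictionMode:
    """Constant-velocity prediction of an occluded ROI (illustrative sketch)."""

    def __init__(self, anchor, velocity, timeout_frames):
        self.anchor = anchor               # last known anchor location (x, y)
        self.velocity = velocity           # per-frame displacement (vx, vy), e.g.
                                           # the averaged ROI motion before occlusion
        self.frames_left = timeout_frames  # TIME_OUT expressed in frames

    def step(self):
        """Predict the anchor position for the next frame.

        Returns the predicted (x, y), or None once the time-out has expired
        and the ROI is considered to have disappeared permanently.
        """
        if self.frames_left <= 0:
            return None
        self.frames_left -= 1
        self.anchor = (self.anchor[0] + self.velocity[0],
                       self.anchor[1] + self.velocity[1])
        return self.anchor


# hypothetical usage:
# pm = PredictionMode(anchor=(120, 240), velocity=(2, 1),
#                     timeout_frames=TIME_OUT * frame_rate)
```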
3. Experimental results

Several videos with different conditions (camera position, resolution) and different kinds of ROI (moving person, standing person, ball, rectangular box, ...) have been tested. The detection accuracy is assessed by monitoring the size of queue Q: as described in Section 2, frames in which the queue size shows a single explosion, or in which the cut-off method is applied successfully, are counted as correctly detected frames. The results of detection and tracking are shown in Table I.

TABLE I. ROI DETECTION AND TRACKING RESULTS

ROI                      | Environment                                | Result
Moving dancer (Ballet)   | Indoor, stable camera, resolution 1024x768 | Detection: 80%, Tracking: good
Moving rectangular box   | Indoor, stable camera, resolution 640x480  | Detection: 100%, Tracking: good
Moving ball              | Indoor, stable camera, resolution 640x480  | Detection: 100%, Tracking: good
Standing man             | Indoor, moving camera, resolution 640x480  | Detection: 100%, Tracking: good
Moving man (disappears)  | Indoor, stable camera, resolution 640x480  | Detection: 90%, Tracking: medium

4. Conclusion

A method for detecting and tracking ROI in 2D video plus depth using depth information has been presented in this paper. In our method, the viewer provides prior information about the interesting video regions by clicking or touching the display screen, and a flood-fill algorithm detects the ROI precisely. Afterwards, motion vectors of the ROIs are extracted to predict their movement. We succeed in detecting the correct ROIs and tracking them in subsequent frames, which helps not only to increase the perceived quality of the video in the ROI but also to reduce the bitrate of the coded video.

Acknowledgment

This work was supported by the basic research projects in natural science in 2012 of the National Foundation for Science & Technology Development (Nafosted), Vietnam (102.01-2012.36, Coding and communication of multiview video plus depth for 3D Television Systems).

References

[1] J. Han, K. N. Ngan, M. Li, and H. Zhang, "Unsupervised extraction of visual attention objects in color images," IEEE Trans. Circuits and Systems for Video Technology, vol. 16, no. 1, pp. 141-145, Jan. 2006.
[2] Y. Wang, K.-F. Loe, T. Tan, and J.-K. Wu, "Spatiotemporal video segmentation based on graphical models," IEEE Trans. Image Processing, vol. 14, no. 7, pp. 937-947, Jul. 2005.
[3] M. Wang, T. Zhang, C. Liu, and S. Goto, "Region-of-interest based H.264 encoding parameter allocation for low power video communication," presented at the 5th International Colloquium on Signal Processing & Its Applications (CSPA), 2009.
[4] L. S. Karlsson and M. Sjöström, "Region-of-interest 3D video coding based on depth images," presented at the 3DTV Conference, May 2008.
[5] D. V. S. X. De Silva, W. A. C. Fernando, and S. L. P. Yasakethu, "Object based coding of the depth maps for 3D video coding," IEEE Trans. Consumer Electronics, vol. 55, no. 3, pp. 1699-1706, Aug. 2009.
[6] Y. Zhang, M. Yu, and G.-Y. Jiang, "Depth based region of interest extraction for multiview video coding," presented at the 2009 International Conference on Machine Learning and Cybernetics, Jul. 2009.
[7] T. Zhang, C. Liu, M. Wang, and S. Goto, "Region-of-interest based H.264 encoder for videophone with a hardware macroblock level face detector," in Proc. IEEE International Workshop on Multimedia Signal Processing (MMSP 2009), Oct. 2009, pp. 1-6.
[8] Y. Zhang, G. Jiang, M. Yu, Y. Yang, Z. Peng, and K. Chen, "Object based coding of the depth maps for 3D video coding," Journal of Visual Communication and Image Representation, vol. 21, no. 5-6, pp. 498-512, Jul. 2010.