
Chapter 4
Background Context Augmented Hypothesis Graph for Object Segmentation

In this chapter, we address the problem of semantic segmentation. Inspired by the significant role of context information in this task, our solution makes use of semantically meaningful overlapping object hypotheses augmented by contextual information, which is obtained from a novel background mining procedure. More precisely, a fully connected conditional random field is defined over a set of overlapping segment hypotheses, and the unlabeled background regions are learned from a training set and applied in the unary terms corresponding to the foreground regions. The final segmentation result is obtained via maximum-a-posteriori inference, after which the segments are merged by sequential aggregation followed by morphological hole filling and super-pixel refinement serving as post-processing. Moreover, by incorporating other kinds of contextual cues, such as global image classification and object detection cues, our proposed solution achieves new state-of-the-art performance, as experimentally verified on the challenging PASCAL VOC 2012 and MSRC-21 object segmentation datasets.

4.1 Introduction

Semantic object segmentation, or scene labelling, is one of the central problems in computer vision and has drawn much interest in recent years [11, 16, 65, 72, 85, 99, 105]. The goal is to assign each pixel in the image a class label taken from a predefined set, or the background label. This is quite challenging in general due to large intra-class pose and appearance diversity as well as occlusions.

Over the past few years, various approaches have been proposed to solve this problem. Bottom-up segment ranking approaches generate a large pool of object hypotheses; the regions are then scored and ranked by their "objectness", and finally they are combined to obtain the final segmentation [9, 11, 65]. However, the inter-segment relationships and the segment background information are generally not well modelled, especially for visually confusing categories, so these approaches cannot guarantee correct classification and ranking of the segments. Detection-based methods [64, 83, 85] utilize top-down guidance obtained from object detectors and refine the coarse object localization within the predicted bounding boxes. Their main shortcoming is that poor detections or missed detections degrade the segmentation performance, especially in the case of interacting objects. Besides, there are also methods that consider a graphical representation of the problem, where the nodes represent pixels or super-pixels and the graph is partitioned into several sub-graphs corresponding to different object regions. Such models, which mainly build on a conditional random field (CRF), have been very successful in semantic object segmentation [16, 72, 104, 106]. Although these methods generalize well, their main bottleneck, as shown in [107], is the lack of rich features that can discriminate local patches of similar categories. Since the breakthrough of deep learning in image classification [20], great progress has also been made in other visual recognition tasks, including semantic segmentation [39, 108, 109]. A popular recent approach is to use a convolutional neural network (CNN) trained from raw pixels to extract feature vectors with great representational power.
In this chapter, we propose a CRF model built on a fully connected segment hypothesis graph in order to capture the interactions between the segments. It is motivated by the observation that when the hypotheses are classified independently, without considering the inter-segment relationships and other high-level cues, it is hard to distinguish some confusing classes [11]. On the other hand, using the semantically more meaningful segment hypotheses as the nodes of the CRF model in turn alleviates its low discriminative power among local patches. Furthermore, the unlabeled regions (i.e. background) are often discarded, yet they may contain a large portion of the pixels. For example, in the PASCAL VOC 2012 TrainVal segmentation dataset [1], 69.3% of the pixels belong to the background. Li et al. [77] showed that the background regions actually contain useful contextual information for accurate recognition; for instance, plane and bird are more likely to occur in the presence of a sky background (see Fig. 4.1). Intuitively, the contextual information obtained by learning the relationship between the foreground objects and their background regions of interest can augment the CRF model.

Figure 4.1: Illustration of the role of background context information (e.g. sky or indoor). In many cases it can help recognize the objects (e.g. the bird instead of boat, or the potted plant instead of tree).

The main contributions of this work are summarized as follows:

• We propose a CRF-based solution over a hypothesis graph that utilizes the relationships between the overlapping object-level segments, which are semantically more informative than disjoint local regions such as pixels and super-pixels (see Section 4.3). A fully connected graph is also employed to enhance the interaction between the segments.

• Obviously, with more annotated data the learned model will be more accurate; nevertheless, in many situations a large training set does not exist. We therefore propose a novel background-aware approach that re-scores the unary term of the segment hypotheses by extracting contextual cues from the background regions without explicit labelling of the background categories (see Section 4.3.2), which alleviates the annotation burden for cluttered background categories. Moreover, owing to the generality of the proposed CRF model, various contextual cues, such as global image classification and object detection cues, can easily be integrated into our method to further boost performance.

• We conduct a comprehensive analysis to verify the roles of different contextual cues and the improvement provided by the proposed background context (see Section 4.5.1), and we demonstrate the superiority of the proposed method over the state of the art on benchmark datasets such as PASCAL VOC [1] and MSRC-21 [4] (see Section 4.5.2).

4.2 Related Work

Bottom-up segment ranking methods. Carreira et al. [11] proposed a method where figure-ground hypotheses are generated by solving the constrained parametric min-cut (CPMC) [56] problem with various choices of a parameter. The hypotheses are then ranked and classified using Support Vector Regression (SVR) [11]. In [59] a generative model is introduced, which maximizes the composite likelihood of the underlying statistical model by applying the expectation-maximization algorithm.
Figure 4.2: Overview of the proposed solution. First, a pool of object hypotheses is generated. A fully connected hypothesis graph is then built to model the relationships between the possibly overlapping segments. A novel background contextual cue is predicted for the segments via sub-category classifiers. The scores are fed into the CRF model together with other cues, such as image classification and object detection. Finally, the coarse segmentations obtained via MAP inference are merged and post-processed to achieve the final segmentation result.

Analogous to average and max-pooling, second-order pooling (O2P) is applied in [9] to encode the second-order statistics of local descriptors inside a region. The segments generated by CPMC [56] generally have a very high overlap ratio with the ground truth (the maximum overlap with the ground truth is 81.2% on average on the PASCAL VOC 2012 TrainVal dataset [1]), and O2P is shown in [9] to have significant discriminative ability, achieving state-of-the-art performance. Compared with our proposed framework, however, those methods mainly focus on the foreground hypothesis segments and ignore the background regions that might be informative; furthermore, no inter-segment relation is modelled.

CRF-based methods. Ladický et al. [72] introduced a hierarchical CRF model to incorporate information from different scales, such as the object detectors in [12] and the object co-occurrence in [104]. Boix et al. [16] also incorporated the global classification as an extra higher-order potential, called the harmony potential, in the CRF formulation. Yadollahpour et al. [107] introduced a two-stage approach that discriminatively re-ranks the M-best diverse segmentations obtained by the CRF model. Generally, those approaches utilize different contextual cues to help classify the local patches, which are intrinsically not as discriminative as the object-level hypotheses used in our framework; if the local classification is not accurate enough, many mislabellings cannot be recovered even with carefully designed optimization algorithms. Ion et al. [99, 110, 111] also considered a CRF model over a set of possibly overlapping figure-ground hypotheses. Given a bag of figure-ground segmentations, a joint probability distribution is provided over the compatible image interpretations as well as the labellings of the composite tilings, which are cast as sets of maximal cliques. Some contextual information, such as the pairwise compatibilities among spatially neighboring segments, is also modelled. However, in contrast to our method, in [99, 110] only the valid compositions of hypotheses are used (i.e. the maximal clique tilings have no spatial overlap), whereas we consider all the generated segments in a fully connected CRF model. Furthermore, there is no contextual modeling from the unlabeled background regions, nor from the global classification and detection cues used in the proposed model.

CNN-based methods. Farabet et al. [108] proposed a method that uses a multi-scale CNN to extract dense feature vectors capturing texture, shape and contextual information. Building on this representation, multiple post-processing methods (e.g. a CRF model) are applied to produce the final labelling from a pool of segmentation components. In [109] a recurrent CNN is proposed for scene labelling, allowing for a larger input context while limiting the capacity of the model.
Trained end-to-end on raw pixels, without depending on any segmentation technique or task-specific features, the system can identify and correct its own errors, leading to state-of-the-art performance on several scene labelling benchmarks. Girshick et al. [39] applied a CNN framework pre-trained on the ImageNet classification dataset [31] to extract features from object segment proposals, and used linear SVMs to classify the proposals. Although CNN-based features have great representational power compared to hand-crafted features, such models usually have millions of parameters to learn and thus require a tremendous quantity of annotated training data, which is quite difficult to obtain in some cases.

Non-parametric methods. Liu et al. [112] proposed a non-parametric label transfer model for scene labelling, which transfers and warps the annotations in the training set to a test image by matching dense SIFT flow between the training and test samples. Tighe and Lazebnik [113] presented another non-parametric approach that matches a test image against the training set, followed by super-pixel level matching and Markov random field (MRF) optimization to incorporate neighborhood context. Myeong and Lee [105] applied higher-order semantic contextual relationships between objects in a non-parametric manner. In [114] the relevance of individual feature channels is learned using a locally adaptive feature metric based on small patches and simple gradient, color and location features.

Contextual modeling. Numerous contextual cues, such as the global scene layout [74] and the interaction between objects and regions [77–80], have been successfully integrated into object recognition frameworks. In [74], a holistic CRF model is presented, which integrates different levels of contextual cues such as scene labelling and detection. Li et al. [77] extract contextual cues from unlabeled regions in order to boost traditional object detection. Heitz and Koller [78] proposed a probabilistic "things and stuff" model that exploits the contextual relationship between regions and detected objects to improve detection performance. The method proposed by Cinbis and Sclaroff [80] makes use of relative locations and scores between pairs of detections. In [115], stacked sequential scale-space Taylor coefficients are proposed to gather contextual information by sampling the posterior label field sequentially, achieving state-of-the-art performance on the MSRC-21 benchmark [4]. In [79], the context information is obtained in a supervised manner; however, background annotations are generally very difficult and time-consuming to obtain in practice due to the heavy clutter. Although employing background information to help classify the foreground objects is not a new idea, the novelty of our approach lies mainly in how the background context (BC) information is obtained: in contrast to the previous methods, it is extracted from the background regions without knowing the exact labelling of the background categories.

4.3 Proposed Solution

Given a test image I : Ω → ℝ³, a set of object hypotheses {S_i ⊆ Ω}_{i=1}^{m} is extracted by applying the method proposed in [11, 56], which provides visually coherent segments. In this work m is set to 150, as in [11]. We aim to assign a class label l_i ∈ L to each segment S_i (1 ≤ i ≤ m) via a CRF-based formulation (see Section 4.3.1), where L is a finite predefined label set.
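Concretely, the input to the rest of the pipeline can be thought of as a pool of binary masks over Ω together with their per-class, per-feature classifier scores. The following is a minimal NumPy sketch of this setup, including the single-hypothesis overlap bound whose dataset average is the 81.2% figure quoted above for CPMC pools; the container and helper names are our own illustration (the actual proposal generator, CPMC [11, 56], is an external tool), not part of the method.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Hypothesis:
    mask: np.ndarray      # boolean H x W array: the segment S_i as a subset of Omega
    scores: np.ndarray    # shape (|L|, n): score s_ij^(l) of feature kind j for label l

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def pool_upper_bound(pool: list, gt_mask: np.ndarray) -> float:
    """Best overlap any single hypothesis achieves with one ground-truth region."""
    return max(iou(h.mask, gt_mask) for h in pool)

# Toy example: random masks standing in for m CPMC proposals on a 16x16 image,
# with |L| = 21 labels (20 VOC classes plus background) and n = 3 feature kinds.
rng = np.random.default_rng(0)
pool = [Hypothesis(mask=rng.random((16, 16)) > 0.5,
                   scores=rng.standard_normal((21, 3))) for _ in range(5)]
gt = np.zeros((16, 16), dtype=bool)
gt[4:12, 4:12] = True
print(f"best achievable IoU in this toy pool: {pool_upper_bound(pool, gt):.3f}")
```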
The contextual information of the background regions for each class, called background context (see Section 4.3.2), as well as other kinds of cues, is also extracted and applied to augment the unary term. After calculating the optimal labelling of the segments, they are projected back to I and merged into the final segmentation result, followed by some simple post-processing techniques (see Section 4.3.3). The proposed pipeline is shown in Fig. 4.2.

4.3.1 CRF-based Formulation

Here, the graphical representation of the labelling problem is briefly introduced. We consider a complete graph G = {V, E} whose node set consists of the segment hypotheses. To each segment node S_u ∈ V a random variable x_u is assigned, which takes a label from L. The CRF model has an energy function defined over all possible labellings x = (x_1, x_2, ..., x_m) ∈ L^m of the following form [16]:

    E(x) = \alpha \sum_{S_u \in V} \varphi_u(x_u) + \beta \sum_{(S_u, S_v) \in E} \psi_{uv}(x_u, x_v),    (4.1)

where \alpha and \beta are global weighting parameters. Note that the variable x follows a Gibbs distribution, i.e. p(x) = \frac{1}{Z} \exp(-E(x)), where the partition function Z = \sum_{x \in L^m} \exp(-E(x)). The first term \varphi_u(x_u) is called the unary term; it expresses the local confidence of the label x_u ∈ L for the segment S_u. \psi_{uv}(x_u, x_v) is the pairwise term expressing the compatibility of the labels x_u and x_v for adjacent nodes. The goal is to find the optimal labelling

    x^* = \arg\min_{x \in L^m} E(x).    (4.2)

Unary term. In this term we incorporate different kinds of cues. To this end, for each segment S_i (1 ≤ i ≤ m) we extract n different kinds of feature vectors, denoted by f_{ij} (1 ≤ j ≤ n). These descriptors are classified by |L| binary linear classifiers such as SVR [11] (we remark that one can use any classifier, e.g. a multi-class classifier, to obtain the scores for the segments belonging to a certain class label), providing the score s_{ij}^{(l)} for S_i to have the class label l ∈ L based on f_{ij}. The scores are passed through a sigmoid function and the negative log-likelihood is applied [16]:

    \varphi_u(x_u) = -|S_u| \log \prod_{j=1}^{n} p\left(x_u \mid s_{uj}^{(x_u)}; w_{uj}^{(x_u)}, b_{uj}^{(x_u)}\right)
                  = |S_u| \sum_{j=1}^{n} \log\left(1 + \exp\left(-\left(w_{uj}^{(x_u)} s_{uj}^{(x_u)} + b_{uj}^{(x_u)}\right)\right)\right).    (4.3)

|S_u| denotes the number of pixels inside the given segment. The two parameters w_{uj}^{(x_u)} and b_{uj}^{(x_u)} of each sigmoid function are learned simultaneously on the validation set (for more details please refer to Section 4.4).

Pairwise term. For \psi_{uv}(x_u, x_v) we apply the Potts model, which has the form [16, 106]

    \psi_{uv}(x_u, x_v) = [x_u \neq x_v] \sum_{j=1}^{\tilde{n}} \lambda_j k_j(f_{uj}, f_{vj}),    (4.4)

where [x_u \neq x_v] is the indicator function taking the value 1 if x_u \neq x_v and 0 otherwise, and \lambda_j is the weight of the Gaussian kernel k_j for all j = 1, ..., \tilde{n}, with \tilde{n} being the number of kernels involved. The kernels are defined as

    k_j(f_{uj}, f_{vj}) = \exp\left(-(f_{uj} - f_{vj})^T \Sigma_j (f_{uj} - f_{vj})\right),

where \Sigma_j stands for a positive-definite matrix. By applying Gaussian kernels in the pairwise term, very efficient inference can be performed, as shown in [106]. As noted, the function [x_u \neq x_v] introduces a penalty for nearby similar segments that are assigned different labels, but it is insensitive to the compatibility between the labels.
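To make the formulation concrete, the following self-contained NumPy sketch evaluates the energy (4.1) with the unary term (4.3) and the Potts pairwise term (4.4), and recovers the MAP labelling (4.2) by exhaustive search. This brute-force minimizer is for illustration on toy sizes only (the chapter relies on the efficient Gaussian-kernel inference of [106]), and the dictionary-based segment representation, as well as sharing the sigmoid parameters w, b across segments and labels, are our own simplifying assumptions.

```python
import itertools
import numpy as np

def unary(seg, label, w, b):
    """phi_u(x_u) of Eq. (4.3): |S_u| * sum_j log(1 + exp(-(w_j s_j + b_j))).
    w, b are per-(u, j, label) in the chapter; shared here for brevity."""
    s = np.asarray(seg["scores"][label])   # s_uj^(x_u) over the n feature kinds
    return seg["size"] * np.sum(np.log1p(np.exp(-(np.asarray(w) * s + np.asarray(b)))))

def pairwise(xu, xv, feats_u, feats_v, lambdas, Sigmas):
    """psi_uv of Eq. (4.4): Potts indicator times a weighted sum of the Gaussian
    kernels k_j(f_uj, f_vj) = exp(-(f_uj - f_vj)^T Sigma_j (f_uj - f_vj))."""
    if xu == xv:
        return 0.0
    total = 0.0
    for lam, fu, fv, Sig in zip(lambdas, feats_u, feats_v, Sigmas):
        d = np.asarray(fu) - np.asarray(fv)
        total += lam * np.exp(-d @ np.asarray(Sig) @ d)
    return total

def energy(x, segs, alpha, beta, w, b, lambdas, Sigmas):
    """E(x) of Eq. (4.1) over the fully connected hypothesis graph."""
    E = alpha * sum(unary(seg, xu, w, b) for seg, xu in zip(segs, x))
    for u in range(len(segs)):
        for v in range(u + 1, len(segs)):   # every pair: complete graph
            E += beta * pairwise(x[u], x[v], segs[u]["feats"], segs[v]["feats"],
                                 lambdas, Sigmas)
    return E

def map_labelling(segs, n_labels, **params):
    """x* = argmin_x E(x), Eq. (4.2), by exhaustive enumeration of L^m."""
    return min(itertools.product(range(n_labels), repeat=len(segs)),
               key=lambda x: energy(x, segs, **params))

# Toy instance: m = 3 segments, |L| = 2 labels, n = n~ = 1 feature kind.
segs = [{"size": 120, "scores": {0: [1.5], 1: [-0.8]}, "feats": [np.array([0.10, 0.20])]},
        {"size":  90, "scores": {0: [0.9], 1: [-0.2]}, "feats": [np.array([0.12, 0.21])]},
        {"size":  60, "scores": {0: [-1.0], 1: [1.2]}, "feats": [np.array([0.80, 0.90])]}]
x_star = map_labelling(segs, n_labels=2, alpha=1.0, beta=50.0,
                       w=[2.0], b=[0.0], lambdas=[1.0], Sigmas=[np.eye(2)])
print("MAP labelling:", x_star)
```

In this toy run the first two segments, which are visually similar, keep the same label, while the third segment is far enough in feature space that its own unary evidence wins; this is exactly the behaviour the Potts term is meant to encode.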
Instead, one can learn a symmetric compatibility function μ(x_u, x_v) that also considers the interactions between the labels.

4.3.2 Background Context Modeling

In this section, we introduce how we model and obtain contextual information from unlabeled background regions in a weakly supervised manner; this information is used in the unary term in (4.3). Assume that we are given a set of training images {Ĩ_j : Ω → ℝ³}_{j=1}^{r} with annotated ground-truth regions {R_i^{(j)} ⊆ Ω}_{i=1}^{m_j}, where m_j is the number of objects in Ĩ_j. For ease of notation we use a single index i for all the ground-truth regions:

    \bigcup_{i=1}^{\tilde{m}} \{R_i\} := \bigcup_{j=1}^{r} \bigcup_{i=1}^{m_j} \left\{ R_i^{(j)} \right\}, \qquad \text{where } \tilde{m} = \sum_{j=1}^{r} m_j,    (4.5)

and the class label of R_i is denoted by y_i ∈ L.

Figure 4.3: Exemplar sub-category clusters for the horse category from the PASCAL VOC 2012 TrainVal dataset [1]. Each row shows images of a certain sub-category. It can be observed that each cluster shares significant consistency among both the foreground horse objects and the background regions.

[...]

References

[16] Xavier Boix, Josep Gonfaus, Joost van de Weijer, Andrew Bagdanov, Joan Gual, and Jordi González. Harmony potentials - Fusing global and local scale for semantic image segmentation. International Journal of Computer Vision, 96(1):83–102, January 2012.
[17] Zheng Song, Qiang Chen, ZhongYang Huang, Yang Hua, and Shuicheng Yan. Contextualizing object detection and classification. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 1585–1592, Colorado Springs, CO, USA, June 2011. IEEE.
[18] Qiang Chen, Zheng Song, Yang Hua, ZhongYang Huang, and Shuicheng Yan. Hierarchical matching with side information for image classification. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 3426–3433, Providence, RI, USA, June 2012. IEEE.
[19] Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, and Shuicheng Yan. Subcategory-aware object classification. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 827–834, Portland, OR, USA, June 2013. IEEE.
[20] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, December 2012. MIT Press.
[21] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of International Conference of Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014. IEEE.
[22] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In arXiv, 2013.
[23] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In arXiv, 2014.
[24] N. Nasrabadi and R. King. Image coding using vector quantization: A review. IEEE Trans. Communications, 36(8):957–971, 1988.
[25] John Wright, Yi Ma, Julien Mairal, Guillermo Sapiro, Thomas Huang, and Shuicheng Yan. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98(6):1031–1044, 2010.
[26] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas S. Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 3360–3367, San Francisco, CA, USA, June 2010. IEEE.
[27] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In Proceedings of European Conference of Computer Vision, volume 6314 of LNCS, pages 143–156, Crete, Greece, October 2010. Springer.
[28] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 1794–1801, Miami, FL, USA, June 2009. IEEE.
[29] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[30] C. Chang and C. Lin. LIBSVM: a library for support vector machines. ACM Trans. on Intelligent Systems and Technology, 2(3):27, 2011.
[31] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 248–255, Miami, FL, USA, June 2009. IEEE.
[32] Yunchao Wei, Wei Xia, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. CNN: Single-label to multi-label. arXiv preprint arXiv:1406.5726, June 2014.
[33] Pedro Felzenszwalb, Ross Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transaction on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, September 2010.
[34] P. Felzenszwalb, R. Girshick, and D. McAllester. Cascaded object detection with deformable part models. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 2241–2248, San Francisco, CA, June 2010. IEEE.
[35] Long Zhu, Yuanhao Chen, Alan Yuille, and William Freeman. Latent hierarchical structural learning for object detection. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 1062–1069, San Francisco, CA, USA, June 2010. IEEE.
[36] Yuanhao Chen, Long Zhu, and Alan Yuille. Active mask hierarchies for object detection. In Proceedings of European Conference of Computer Vision, pages 43–56, Crete, Greece, September 2010. Springer.
[37] Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks for object detection. In Proceedings of Advances in Neural Information Processing Systems, 2013.
[38] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable object detection using deep neural networks. In Proceedings of International Conference of Computer Vision and Pattern Recognition, 2014.
[39] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of International Conference of Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014. IEEE.
[40] J. Uijlings, K. Van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[41] J. Muerle and D. Allen. Experimental evaluation of techniques for automatic segmentation of objects in a complex scene. Pictorial Pattern Recognition, pages 3–13, 1968.
[42] D. Comaniciu and P. Meer. A robust approach toward feature space analysis. IEEE Transaction on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[43] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transaction on Pattern Analysis and Machine Intelligence, 2000.
[44] B. Russell, W. Freeman, A. Efros, J. Sivic, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In Proceedings of International Conference of Computer Vision, volume 2, pages 1605–1614. IEEE, June 2006.
[45] T. Malisiewicz and A. Efros. Improving spatial support for objects via multiple segmentations.
In British Machine Vision Conference, September 2007.
[46] A. Rabinovich, S. Belongie, T. Lange, and J. M. Buhmann. Model order selection and cue combination for image segmentation. In Proceedings of International Conference of Computer Vision and Pattern Recognition, volume 1, pages 1130–1137. IEEE, June 2006.
[47] Eitan Sharon, Achi Brandt, and Ronen Basri. Segmentation and boundary detection using multiscale intensity measurements. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 469–476, Kauai, HI, USA, December 2001. IEEE.
[48] E. Sharon, M. Galun, R. Basri, and A. Brandt. Hierarchy and adaptivity in segmenting visual scenes. Nature, 442(7104):719–846, June 2006.
[49] Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transaction on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[50] M. Maire, P. Arbeláez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 1–8. IEEE, June 2008.
[51] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum. Learning to detect a salient object. In Proceedings of International Conference of Computer Vision and Pattern Recognition, 2007.
[52] T. Nguyen, Bingbing Ni, Hairong Liu, Wei Xia, Jiebo Luo, Mohan Kankanhalli, and Shuicheng Yan. Image re-attentionizing. IEEE Transactions on Multimedia, 15(8):1910–1919, December 2013.
[53] X. Ren and J. Malik. Learning a classification model for segmentation. In Proceedings of International Conference of Computer Vision, volume 1. IEEE, October 2003.
[54] I. Endres and A. Hoiem. Category independent object proposals. In Proceedings of European Conference of Computer Vision, 2010.
[55] A. Levinshtein, C. Sminchisescu, and S. Dickinson. Optimal contour closure by superpixel grouping. In Proceedings of European Conference of Computer Vision, 2010.
[56] João Carreira and Cristian Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. IEEE Transaction on Pattern Analysis and Machine Intelligence, 34(7):1312–1328, July 2012.
[57] Jamie Shotton, John M. Winn, Carsten Rother, and Antonio Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Ales Leonardis, Horst Bischof, and Axel Pinz, editors, ECCV, pages 1–15, Graz, Austria, May 2006. Springer.
[58] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. In Proceedings of International Conference of Computer Vision, volume 1, pages 377–384, Kerkyra, Greece, September 1999. IEEE.
[59] Fuxin Li, João Carreira, Guy Lebanon, and Cristian Sminchisescu. Composite statistical inference for semantic segmentation. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 3302–3309, Portland, OR, USA, June 2013. IEEE.
[60] Daniel Kuettel and Vittorio Ferrari. Figure-ground segmentation by transferring window masks. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 558–565, Providence, RI, USA, June 2012. IEEE.
[61] Jaechul Kim and Kristen Grauman. Shape sharing for object segmentation. In Proceedings of European Conference of Computer Vision, pages 444–458, Firenze, Italy, October 2012. Springer.
[62] Yi Yang, Sam Hallman, Deva Ramanan, and Charless Fowlkes.
Layered object detection for multi-class segmentation. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 3113–3120, San Francisco, CA, USA, June 2010. IEEE.
[63] Nhat Vu and B. Manjunath. Shape prior segmentation of multiple objects with graph cuts. In Proceedings of International Conference of Computer Vision and Pattern Recognition, Anchorage, AK, USA, June 2008. IEEE.
[64] Thomas Brox, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Object segmentation by alignment of poselet activations to image contours. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 2225–2232, Colorado Springs, USA, June 2011. IEEE.
[65] Pablo Arbeláez, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, and Jitendra Malik. Semantic segmentation using regions and parts. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 3378–3385, Providence, RI, USA, June 2012. IEEE.
[66] Gabriela Csurka and Florent Perronnin. An efficient approach to semantic segmentation. International Journal of Computer Vision, 95(2):198–212, 2011.
[67] Wenbin Zou, Kidiyo Kpalma, and Joseph Ronsin. Semantic segmentation via sparse coding over hierarchical regions. In Proceedings of International Conference of Image Processing, pages 2577–2580, Orlando, FL, USA, September 2012. IEEE.
[68] Patrick Etyngier, Renaud Keriven, and Jean Pons. Towards segmentation based on a shape prior manifold. In Scale Space and Variational Methods in Computer Vision, pages 895–906, Ischia, Italy, May 2007. Springer.
[69] Yogesh Rathi, Samuel Dambreville, and Allen Tannenbaum. Comparative analysis of kernel methods for statistical shape learning. In Computer Vision Approaches to Medical Image Analysis, pages 96–107, Graz, Austria, May 2006. Springer.
[70] Christian Walder and Bernhard Schölkopf. Diffeomorphic dimensionality reduction. In Proceedings of Advances in Neural Information Processing Systems, pages 1713–1720, Vancouver, British Columbia, Canada, December 2009. Curran Associates, Inc.
[71] James Malcolm, Yogesh Rathi, and Allen Tannenbaum. Graph cut segmentation with nonlinear shape priors. In Proceedings of International Conference of Image Processing, pages 365–368, San Antonio, TX, USA, September 2007. IEEE.
[72] Lubor Ladický, Christopher Russell, Pushmeet Kohli, and Philip Torr. Associative hierarchical CRFs for object class image segmentation. In Proceedings of International Conference of Computer Vision, pages 739–746, Kyoto, Japan, September 2009. IEEE.
[73] H. Harzallah, F. Jurie, and C. Schmid. Combining efficient object localization and image classification. In Proceedings of International Conference of Computer Vision, 2009.
[74] Jian Yao, Sanja Fidler, and Raquel Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 702–709, Providence, RI, USA, June 2012. IEEE.
[75] Qiang Chen, Zheng Song, Jian Dong, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing object detection and classification. IEEE Transaction on Pattern Analysis and Machine Intelligence, 2014.
[76] Jian Dong, Qiang Chen, Shuicheng Yan, and Alan Yuille. Towards unified object detection and semantic segmentation. In Proceedings of European Conference of Computer Vision, pages 299–314. Springer, September 2014.
[77] Congcong Li, Devi Parikh, and Tsuhan Chen.
Extracting adaptive contextual cues from unlabeled regions. In Proceedings of International Conference of Computer Vision, pages 511–518, Barcelona, Spain, November 2011. IEEE.
[78] Geremy Heitz and Daphne Koller. Learning spatial context: Using stuff to find things. In Proceedings of European Conference of Computer Vision, volume 5302 of LNCS, pages 30–43, Marseille, France, October 2008. Springer.
[79] Peter Kontschieder, Pushmeet Kohli, Jamie Shotton, and Antonio Criminisi. GeoF: Geodesic forests for learning coupled predictors. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 65–72, Portland, OR, USA, June 2013. IEEE.
[80] Ramazan Cinbis and Stan Sclaroff. Contextual object detection using set-based classification. In Proceedings of European Conference of Computer Vision, volume 7577 of LNCS, pages 43–57, Florence, Italy, October 2012. Springer.
[81] Zhuowen Tu and Xiang Bai. Auto-context and its application to high-level vision tasks and 3D brain image segmentation. IEEE Transaction on Pattern Analysis and Machine Intelligence, 32(10):1744–1757, October 2010.
[82] S. Rota Bulò, P. Kontschieder, M. Pelillo, and H. Bischof. Structured local predictors for image labelling. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 3530–3537, 2012.
[83] Wei Xia, Zheng Song, Jiashi Feng, Loong Fah Cheong, and Shuicheng Yan. Segmentation over detection by coupled global and local sparse representations. In Proceedings of European Conference of Computer Vision, volume 7576 of LNCS, pages 662–675, Firenze, Italy, October 2012. Springer.
[84] Wei Xia, Csaba Domokos, Loong-Fah Cheong, and Shuicheng Yan. Segmentation over detection via optimal sparse reconstructions. IEEE Transactions on Circuits and Systems for Video Technology, 2014.
[85] Wei Xia, Csaba Domokos, Jian Dong, Loong-Fah Cheong, and Shuicheng Yan. Semantic segmentation without annotating segments. In Proceedings of International Conference of Computer Vision, pages 2176–2183, Sydney, Australia, December 2013. IEEE.
[86] Wei Xia, Csaba Domokos, Loong Fah Cheong, and Shuicheng Yan. Background context augmented hypothesis graph for object segmentation. IEEE Transactions on Circuits and Systems for Video Technology, September 2014.
[87] Long Zhu, Yuanhao Chen, and Alan L. Yuille. Learning a hierarchical deformable template for rapid deformable object parsing. IEEE Transaction on Pattern Analysis and Machine Intelligence, 32(6):1029–1043, 2010.
[88] John Wright, Allen Yang, Arvind Ganesh, Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE Transaction on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.
[89] Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 2790–2797, Miami, FL, USA, June 2009. IEEE.
[90] Xiaotong Yuan and Shuicheng Yan. Visual classification with multi-task joint sparse representation. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 3493–3500, San Francisco, CA, USA, June 2010. IEEE.
[91] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3):309–314, March 2004.
[92] Pawan Kumar, Philip Torr, and Andrew Zisserman. OBJ CUT.
In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 18–25, San Diego, CA, USA, June 2005. IEEE.
[93] Yisong Chen, Antoni Chan, and Guoping Wang. Adaptive figure-ground classification. In CVPR, pages 654–661, Providence, RI, USA, June 2012. IEEE.
[94] M. Figueiredo, R. Nowak, and S. Wright. Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–598, 2007.
[95] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267–288, 1996.
[96] Jun Liu, Shuiwang Ji, and Jieping Ye. SLEP: Sparse learning with efficient projections, 2009.
[97] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to Journal of Optimization, May 2008.
[98] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[99] Adrian Ion, João Carreira, and Cristian Sminchisescu. Probabilistic joint image segmentation and labeling. In Proceedings of Advances in Neural Information Processing Systems, pages 1827–1835, Granada, Spain, December 2011. MIT Press.
[100] Eran Borenstein and Jitendra Malik. Shape guided object segmentation. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 969–976, New York, NY, USA, June 2006. IEEE.
[101] Timothée Cour and Jianbo Shi. Recognizing objects by piecing together the segmentation puzzle. In Proceedings of International Conference of Computer Vision and Pattern Recognition, Minneapolis, MN, USA, June 2007. IEEE.
[102] Anat Levin and Yair Weiss. Learning to combine bottom-up and top-down segmentation. International Journal of Computer Vision, 81(1):105–118, 2009.
[103] John Winn and Nebojsa Jojic. LOCUS: Learning object classes with unsupervised segmentation. In Proceedings of International Conference of Computer Vision, pages 756–763, Beijing, China, October 2005. IEEE.
[104] Lubor Ladický, Christopher Russell, Pushmeet Kohli, and Philip Torr. Graph cut based inference with co-occurrence statistics. In Proceedings of European Conference of Computer Vision, volume 6315 of LNCS, pages 239–253, Crete, Greece, September 2010. Springer.
[105] Heesoo Myeong and Kyoung Lee. Tensor-based high-order semantic relation transfer for semantic scene segmentation. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 3073–3080, Portland, OR, USA, June 2013. IEEE.
[106] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Proceedings of Advances in Neural Information Processing Systems, pages 109–117, Granada, Spain, December 2011. MIT Press.
[107] Payman Yadollahpour, Dhruv Batra, and Gregory Shakhnarovich. Discriminative re-ranking of diverse segmentations. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 1923–1930, Portland, OR, USA, June 2013. IEEE.
[108] Clément Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. IEEE Transaction on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, August 2013.
[109] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. Journal of Machine Learning Research, 1(32):82–90, December 2014.
[110] Adrian Ion, João Carreira, and Cristian Sminchisescu.
Image segmentation by figure-ground composition into maximal cliques. In Proceedings of International Conference of Computer Vision, pages 2110–2117. IEEE, June 2011.
[111] Adrian Ion, João Carreira, and Cristian Sminchisescu. Probabilistic joint image segmentation and labeling by figure-ground composition. International Journal of Computer Vision, 107(1):40–57, March 2014.
[112] Ce Liu, Jenny Yuen, and Antonio Torralba. Nonparametric scene parsing via label transfer. IEEE Transaction on Pattern Analysis and Machine Intelligence, 33(12):2368–2382, 2011.
[113] J. Tighe and S. Lazebnik. Superparsing - scalable nonparametric image parsing with superpixels. International Journal of Computer Vision, 101(2):329–349, January 2013.
[114] G. Singh and J. Kosecka. Nonparametric scene parsing with adaptive feature relevance and semantic context. In Proceedings of International Conference of Computer Vision and Pattern Recognition, pages 3151–3157, Portland, OR, USA, June 2013. IEEE.
[115] Carlo Gatta and F. Ciompi. Stacked sequential scale-space Taylor context. IEEE Transaction on Pattern Analysis and Machine Intelligence, 36(8):1694–1700, August 2014.
[116] Santosh Divvala, Alexei Efros, and Martial Hebert. How important are "deformable parts" in the deformable parts model? In Proceedings of European Conference of Computer Vision, volume 7585 of LNCS, pages 31–40, Florence, Italy, October 2012. Springer.
[117] Julia Bergbauer, Claudia Nieuwenhuis, Mohamed Souiai, and Daniel Cremers. Proximity priors for variational semantic segmentation and recognition. In Proceedings of International Conference of Computer Vision, pages 15–21, Sydney, Australia, December 2013. IEEE.
[118] Junshi Huang, Wei Xia, and Shuicheng Yan. Deep search with attribute-aware deep network. In ACM Multimedia, Orlando, FL, USA, November 2014.
[119] Jian Dong, Qiang Chen, Wei Xia, and Shuicheng Yan. A deformable mixture parsing model with parselets. In Proceedings of International Conference of Computer Vision, pages 3408–3415, Sydney, Australia, December 2013. IEEE.
[120] Jian Dong, Qiang Chen, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. Towards unified human parsing and pose estimation. In Proceedings of International Conference of Computer Vision and Pattern Recognition, 2014.

List of Publications

1. Wei Xia, Zheng Song, Jiashi Feng, Shuicheng Yan, Loong Fah Cheong. Segmentation over Detection by Coupled Global and Local Sparse Representations. In European Conference of Computer Vision (ECCV), 2012.
2. Wei Xia, Csaba Domokos, Jian Dong, Loong Fah Cheong, Shuicheng Yan. Semantic Segmentation without Annotating Segments. In International Conference of Computer Vision (ICCV), 2013.
3. Jian Dong, Qiang Chen, Wei Xia, Shuicheng Yan. A Deformable Mixture Parsing Model with Parselets. In International Conference of Computer Vision (ICCV), 2013.
4. Jian Dong, Wei Xia, Qiang Chen, Jiashi Feng, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In International Conference of Computer Vision and Pattern Recognition (CVPR), 2013.
5. Tam Nguyen, Bingbing Ni, Hairong Liu, Wei Xia, Jiebo Luo, Mohan Kankanhalli, and Shuicheng Yan. Image Re-Attentionizing. IEEE Transactions on Multimedia (TMM), 2013.
6. Junshi Huang*, Wei Xia*, Shuicheng Yan. Deep Search: Attribute-aware Neural Network for Clothes Retrieval. ACM Multimedia (ACM MM) Demo, 2014. (* equal contribution)
7. Wei Xia, Csaba Domokos, Loong Fah Cheong, Shuicheng Yan. Segmentation over Detection via Optimal Sparse Reconstructions.
In IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2014.
8. Wei Xia, Csaba Domokos, Loong Fah Cheong, Shuicheng Yan. Background Context Augmented Hypothesis Graph for Object Segmentation. In IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2014.
9. Yunchao Wei, Wei Xia, Junshi Huang, Bingbing Ni, Yao Zhao, Shuicheng Yan. CNN: Single Label to Multi-Label. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014. (In review)

List of Challenge Awards

1. Wei Xia, Csaba Domokos, Jian Dong, Loong Fah Cheong, Shuicheng Yan, Zhongyang Huang, Shengmei Shen. DM2: Detection, Mask transfer, MRF pruning. PASCAL VOC Challenge, Workshop of ECCV 2012. (Winner of the segmentation competition)
2. Jian Dong, Qiang Chen, Zheng Song, Yan Pan, Wei Xia, Shuicheng Yan, Yang Hua, Zhongyang Huang, Shengmei Shen. Sub-class-aware Object Classification. PASCAL VOC Challenge, Workshop of ECCV 2012. (Winner of the classification competition)
3. Wei Xia, Zheng Song, Qiang Chen, Shuicheng Yan, Loong Fah Cheong. Object Segmentation Using CRF with Detection Mask. PASCAL VOC Challenge, Workshop of ICCV 2011. (Ranked 3rd in the segmentation competition)
4. Min Lin, Qiang Chen, Jian Dong, Junshi Huang, Wei Xia, Shuicheng Yan. Workshop of ImageNet Large Scale Visual Recognition Challenge (ILSVRC), ICCV 2013. (Runner-up in the classification competition)
5. Jian Dong, Yunchao Wei, Min Lin, Qiang Chen, Wei Xia, Shuicheng Yan. Workshop of ImageNet Large Scale Visual Recognition Challenge (ILSVRC), ECCV 2014. (Winner of the detection competition with provided data)

[...]

[Excerpted results table: per-class IoU accuracy (%) on the PASCAL VOC 2012 dataset.]

Class     O2P-CPMC-CSI    O2P-CPMC-FGT-SEGM    Proposed method
b/g       85.0 (85.0)     85.1 (85.2)          85.5 (85.7)
plane     59.3 (63.6)     65.4 (63.4)          68.1 (68.5)
bike      27.9 (26.8)     29.3 (27.3)          29.5 (29.6)
bird      43.9 (45.6)     51.3 (56.1)          46.6 (46.9)
boat      39.8 (41.7)     33.4 (37.7)          44.6 (45.1)
bottle    41.4 (47.1)     44.2 (47.2)          45.8 (47.2)
bus       52.2 (54.3)     59.8 (57.9)          65.4 (66.1)
car       61.5 (58.6)     60.3 (59.3)          65.5 (65.9)
cat       56.4 (55.1)     52.5 (55.0)          58.7 (59.2)
chair     13.6 (14.5)     13.6 (11.5)          14.0 (14.7)
cow       44.5 (49.0)     53.6 (50.8)          45.7 (46.3)
table     26.1 (30.9)     32.6 (30.5)          23.3 (23.7)
dog       42.8 (46.1)     40.3 (45.0)          45.3 (46.2)
horse     51.7 (52.6)     57.6 (58.4)          45.6 (45.9)
m/bike    57.9 (58.2)     57.3 (57.4)          55.9 (58.3)
person    51.3 (53.4)     49.0 (48.6)          51.2 (51.5)
plant     29.8 (32.0)     33.5 (34.6)          37.2 (37.4)
sheep     45.7 (44.5)     53.5 (53.3)          52.1 (52.3)
sofa      28.8 (34.6)     29.2 (32.4)          31.5 (31.9)
train     49.9 (45.3)     47.6 (47.6)          60.7 (60.6)
tv        43.3 (43.1)     37.6 (39.2)          49.3 (47.5)
avg       45.4 (46.8)     47.0 (47.5)          48.6 (49.0)

... all the three different cues, referred to as the Full model. The detailed results are presented in Fig. 4.5 and Fig. 4.6 qualitatively ...

Figure 4.5: The improvement of the IoU accuracy on the PASCAL ... [bar chart; per-class values recovered from the axis labels: b/g (83.71), plane (67.85), bike (23.85), bird (43.72), boat (42.06), bottle (41.41), bus (64.73), car (65.01), cat (56.34), chair (11.89), cow (39.11), table (19.25), dog (42.62), horse (38.37), m/bike, person, plant, sheep, sofa, train, tv]

[Excerpted results table: per-class accuracy on the MSRC-21 benchmark comparing Boix et al. [16], Gatta et al. [115], Bergbauer et al. [117], and the proposed method in fully and weakly supervised variants (Prop-FullySup, Prop-WeaklySup) over the 21 classes (building, grass, tree, sky, water, road, ...), with foreground and full averages (FgAvg, FullAvg).]

MSRC-21 dataset. In order to test the generalization ability of the proposed method, we also ...

... method, we conduct experiments on the latest PASCAL VOC 2012 object segmentation dataset [1], which consists of 20 object classes. Due to the large intra-class variability and object interaction, this dataset is among the most challenging ones in the semantic segmentation field. The average image size is 473 × 382 pixels, and on average 2.38 objects are contained per image. For quantitative evaluation, the ...

... 2012 Segmentation Challenge is a detection-based approach, which first computes the optimal sparse representation of the training objects [83] and provides an initial segmentation mask for each bounding box. These masks are then used in an MRF formulation to obtain the final result. The method proposed by Xia et al. [85] is also detection-based; it estimates a shape guidance for each object bounding box based ...

... In Chapter 4: Background context augmented hypothesis graph for object segmentation, we proposed a unified fully connected CRF model over a set of semantically meaningful overlapping object hypotheses augmented by different contextual cues, including image classification cues, object detection cues, as well as novel background context cues obtained from the unlabeled background regions. The final segmentation result ...

Figure 4.4: Illustration of the effects of post-processing parameters: τ1, τ2 (top) and τ3 (bottom). [Plots of the IoU accuracy (%) against the parameter values.]
