
Neurocomputing 74 (2011) 2767–2779

Contents lists available at ScienceDirect — Neurocomputing — journal homepage: www.elsevier.com/locate/neucom

A novel distribution-based feature for rapid object detection

Jifeng Shen a,*, Changyin Sun a, Wankou Yang a, Zhenyu Wang a, Zhongxi Sun a,b
a School of Automation, Southeast University, Nanjing 210096, China
b College of Science, Hohai University, Nanjing 210098, China

Article history: Received June 2010; received in revised form 16 February 2011; accepted 17 March 2011. Communicated by Tao Mei. Available online 23 May 2011.

Abstract

The discriminative power of a feature has an impact on the convergence rate in training and the running speed in evaluating an object detector. In this paper, a novel distribution-based discriminative feature is proposed to distinguish objects of rigid object categories from background. It makes full use of the advantage of the local binary pattern (LBP), which specializes in encoding local structures, and of the statistical distribution information of the training data, which is utilized to obtain an optimal separating hyperplane. The proposed feature maintains the merit of simplicity in calculation and a powerful discriminative ability to distinguish objects from background patches. Three LBP-based features are extended to adaptive projection versions, which are more discriminative than the originals. An asymmetric Gentle Adaboost organized in a nested cascade structure constructs the final detector. The proposed features are evaluated on two different object categories: frontal human faces and side-view cars. Experimental results demonstrate that the proposed features are more discriminative than traditional Haarlike features and multi-block LBP (MBLBP) features. Furthermore, they are also robust to monotonous variations of illumination.

© 2011 Elsevier B.V. All rights reserved.

Keywords: Object detection; LBP; Adaptive projection-MBLBP; Asymmetric Gentle Adaboost

1. Introduction

In the last few years, the problem of
localizing specified objects in still images or video frames has received a lot of attention. It is widely used in biometric verification, video surveillance, automatic driver-aid systems [1,2], etc. Object detection in arbitrary scenes is rather challenging since objects vary greatly in appearance even within the same category. The main factors are roughly divided into three aspects: photometric variation, viewpoint variation and intra-class variability. For example, faces can have different expressions and appear under different illumination. Among the many object detection methods, the appearance-based method using a boosting algorithm is the most popular one, which is attributed to Viola and Jones [3], who proposed the first real-time face detection system. Object detection research topics can be divided into two categories. The first category is feature representation. The most famous feature is the Haar wavelet [3,4], which encodes the difference between joint areas located at specified positions. But it only reflects changes of intensity in the horizontal, vertical and diagonal directions. The extended Haarlike feature [5] was proposed to enrich the feature set, which decreases the false positives by about 10% at the same recall rate. After that, the disjoint Haarlike feature [6,7], which breaks up the connected sub-regions, was proposed to deal with multi-view face detection. Huang et al. [8] propose a new granular feature, which is also applied to multi-view face detection and proves to be the current state of the art. Recently, it has also been applied to the human detection field [9] with considerable results.

* Corresponding author. E-mail addresses: shenjifeng@gmail.com (J. Shen), cysun@seu.edu.cn (C. Sun), wankou_yang@yahoo.com.cn (W. Yang), seuwzy@gmail.com (Z. Wang), sunzhx@hhu.edu.cn (Z. Sun). 0925-2312/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2011.03.032

Another important feature worth mentioning is the LBP-based feature. The LBP [10] was first used
in face recognition and was extended to MBLBP [11] for face detection, which greatly decreases the number of features needed in training a detector. A similar feature was proposed in [12], using a different coding rule to define features. More recently, more variants [13] have appeared; they are more robust to illumination variations. The covariance feature [14], first used in human detection [15], has also been applied to face detection. This feature outperforms the Haarlike feature when the sample size is small, but has high computational complexity. The histogram of gradient orientation (HOG) [16] exploits the histogram of gradient orientations weighted by amplitude to model the human silhouette and has become the state-of-the-art algorithm in the human detection field. It derives from EOH [17] and SIFT [18], which are widely used in object detection and interest point detection, respectively. Others use color information [19,20] to aid object detection. The other category is related to classifier construction. There are many classifiers, such as Markov processes [21], SVM [20,22,23], NN [24], SNoW [25], etc., in the earlier research, but most of them cannot be applied to real-time applications and have high false positive rates. The first real-time object detector comes from the cascade structure [3], which derives from the coarse-to-fine strategy [26]. But this method needs to set parameters empirically and uses thousands of features to exclude a large amount of the negative samples. After that, more research works concern how to improve detection speed and adaptively choose optimal parameters, such as boosting chain [27], WaldBoost [28], soft cascade [29], dynamic cascade [30], multi-exit boosting [31], MSL [32], etc. Many state-of-the-art object detection systems suffer from two main issues. First, it is rather time consuming to train a robust detector. Second, the extracted features are not discriminative enough, especially in the later cascade nodes, to exclude hard
negative samples. To overcome these problems, we propose a new distribution-based feature which possesses strong discriminative ability, is very efficient to train and can be applied to real-time detection. Partial work on this idea was published in [33], but was not evaluated as thoroughly. The remainder of this paper is organized as follows. Section 2 presents an overview of the LBP-based methods exploited in current object detection. Section 3 describes our proposed new features in detail. Section 4 demonstrates the hierarchy of our detector. Experimental results and performance analysis are presented in Section 5. Finally, conclusions are given in Section 6.

2. Related works

2.1. LBP-based features

Ojala and Pietikäinen [10] propose the local binary pattern (LBP), which is widely used in texture classification. It encodes the differences between a center pixel and its surrounding ones in a circular sequence manner. It characterizes the local spatial structure of an image as in Eq. (1):

$$f_{R,N} = \sum_{i=0}^{N-1} s(p_i - p_c)\,2^i,\qquad s(x) = \begin{cases}1, & x \ge 0\\ 0, & x < 0\end{cases} \tag{1}$$

where $p_i$ is one of the N neighbor pixels around the center pixel $p_c$, on a circle or square of radius R. LBP is favored as a feature descriptor for its tolerance against illumination changes and its computational simplicity. It has been successfully applied to many computer vision and pattern recognition fields, such as face recognition [10,34], face detection [11], facial expression recognition [35,36], background subtraction [37], dynamic texture analysis [38,39], gender classification [40,41] and so on.

Zabih and Woodfill [42] propose a non-parametric transform called the census transform (CT), which is a summary of local spatial structure. It defines an ordered set of comparisons of pixel intensities in a local area, recording which pixels have lesser intensity than the pixel in the center. The coding rule is defined as follows:

$$C(x) = \bigotimes_{y \in N(x)} \zeta(I(x), I(y)) \tag{2}$$

where $\bigotimes$ denotes the concatenation operation, I(x) is the intensity of pixel x, N(x) is the neighborhood of pixel x and $\zeta(x,y)$ is the comparison function. Froba and Ernst [12] extend the original work and propose the modified census transform (MCT), which is applied to face detection. It differs from the census transform by taking the center pixel intensity and the average pixel intensity into consideration, which generates a longer binary string. The formula for encoding MCT is as follows:

$$\Gamma(x) = \bigotimes_{y \in N'(x)} \zeta(\bar I(x), I(y)),\qquad N'(x) = N(x) \cup \{x\} \tag{3}$$

where $\bar I(x)$ is the average intensity over N'(x). Recently, Heikkilä et al. [43] proposed a center-symmetric LBP (CSLBP) descriptor for interest region detection, which combines the good properties of SIFT and LBP. It also uses only half the coding length of the traditional LBP descriptor. CSLBP is defined as follows:

$$f_{R,N,T} = \sum_{i=0}^{N/2-1} s(p_i - p_{i+N/2})\,2^i,\qquad s(x) = \begin{cases}1, & x \ge T\\ 0, & \text{otherwise}\end{cases} \tag{4}$$

where $p_i$ and $p_{i+N/2}$ correspond to the gray values of center-symmetric pairs among N equally spaced pixels on a circle of radius R. The main difference between the aforementioned non-parametric local binary encoding features is their coding sequence. All of these features use a binary string to represent the local feature instead of using pixel intensities directly.

More recently, Zhang et al. [11] proposed the MBLBP feature, which extends the coding from single pixels to blocks and has been successfully used in face detection. An MBLBP feature is composed of 3 × 3 cells, and each cell contains a set of pixels. The feature is an 8-bit string encoded in a circular manner from 0 to 7 (see Fig. 1(a)), where each bit reflects whether the average intensity of a cell is larger or smaller than that of the center one. Its multi-scale version is shown in Fig. 1(b) and (c). MBLBP is less sensitive to intensity variations in a local area than the original LBP. The MBLBP feature, as used in a face detector, needs only 10% of the quantity of Haarlike features and is more discriminative than Haarlike features at the same time. Yan et al. [44] propose a similar feature called LAB and utilize a feature-centric method to improve the efficiency of calculation.
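As a concrete illustration of Eq. (1), the 8-neighbor LBP code can be sketched as follows (a minimal NumPy sketch; the function name and the neighbor ordering are our own assumptions, not the paper's):

```python
import numpy as np

def lbp_code(patch):
    """LBP code of the center pixel of a 3x3 patch, per Eq. (1):
    neighbor p_i contributes 2^i when s(p_i - p_c) = 1, i.e. p_i >= p_c."""
    pc = patch[1, 1]
    # 8 neighbors visited in a fixed circular order (i = 0..7)
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for i, p in enumerate(neighbors):
        if p >= pc:            # s(x) = 1 when x >= 0
            code |= 1 << i
    return code

patch = np.array([[10, 20, 30],
                  [40, 25, 60],
                  [70, 80, 90]])
print(lbp_code(patch))  # -> 252: bits 2..7 set, since only 10 and 20 fall below 25
```

The same skeleton yields CT, MCT and CSLBP of Eqs. (2)-(4) by swapping the reference value (center pixel, local mean, or the center-symmetric partner) in the comparison.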
2.2. Distribution-based features

Pavani et al. [45] propose new optimally weighted rectangles for face and heart detection; they integrate the distribution information of positive and negative training samples into Haarlike features in order to find an optimal separating hyperplane, which maximizes the margin between the positive class and the negative class. It can be formulated as follows:

$$w_{opt} = \arg\max_{w} \frac{w^T S_b w}{w^T S_w w} \tag{5}$$

$$S_b = (m_+ - m_-)(m_+ - m_-)^T,\qquad S_w = S_+ + S_-$$

$$S_+ = \sum_{\forall i,\, y_i = 1} (m_i - m_+)(m_i - m_+)^T,\qquad S_- = \sum_{\forall i,\, y_i = -1} (m_i - m_-)(m_i - m_-)^T$$

where $w_{opt}$ is the optimal projection that maximizes the between-class distance and minimizes the within-class distance, $m_+$ and $m_-$ are the mean vectors of the positive and negative samples, respectively, and $S_b$ and $S_w$ are the between-class and within-class scatter matrices, respectively. Fig. 2(b) shows the two-dimensional distribution of positive and negative samples with the default projection direction and the optimal projection direction. This kind of feature makes full use of the distribution information of the training samples in order to improve the discriminative power of classifiers and to decrease the evaluation time needed to exclude negative patches.

Fig. 1. MB-LBP feature.

2.3. Asymmetric Gentle Adaboost

Although symmetric Adaboost can achieve good performance in object detection, ignoring the distribution imbalance between positive and negative samples makes the boosting algorithm converge much more slowly than asymmetric ones. Asymmetric boosting [46,47] utilizes a cost function to penalize false negatives more than false positives, so as to get a better separating hyperplane. Recently, Gentle Adaboost has been extended to an asymmetric version, which further improves the performance on classification problems [48]. The goal function of asymmetric Gentle Adaboost, which uses an asymmetric additive logistic regression model, is formulated as follows:

$$J_{asym}(F) = E\left[\,I(y=1)\,e^{-yC_1F(x)} + I(y=-1)\,e^{-yC_2F(x)}\,\right] \tag{7}$$

where $C_1$ and $C_2$ are the cost factors for false negatives and false positives, respectively, and F(x) is a signed real confidence value.
Asymmetric characteristics inherently exist in many object detection fields, where positive targets need to be distinguished from an enormous number of background patterns. Using the Newton update technique, the optimal weak classifier can be worked out as follows:

$$f_{opt}(x) = \frac{C_1 P_w(y=1|x) - C_2 P_w(y=-1|x)}{C_1^2 P_w(y=1|x) + C_2^2 P_w(y=-1|x)} \tag{8}$$

where $P_w(y=1|x)$ and $P_w(y=-1|x)$ are the cumulative weight distributions from the positive and negative samples, respectively. The asymmetric Gentle Adaboost algorithm used in this paper is shown in Fig. 3.

Fig. 2. Distributions between positive and negative samples.

Fig. 3. Asymmetric Gentle Adaboost.

A Haarlike feature comprising k connected areas can be formulated as follows:

$$f = \sum_{i=1}^{k} w_i m_i = w^T u,\qquad k = 2,3,4 \tag{6}$$

The default Haarlike feature, which is shown in Fig. 2(a), can be represented as $[1\ {-1}][u_1\ u_2]^T$, where $u_1$ and $u_2$ are the average intensities in the white and black areas, respectively. The optimally weighted feature has a projection direction different from the vector (−1, 1), which is the key improvement in performance.

3. Adaptive projection of block-based local binary features

Inspired by the effectiveness of the optimal weights embedded in Haarlike features, we combine block-based LBP with an adaptive projection strategy, which merges the distribution information of the training samples into the descriptors to enhance their discriminative ability. The experimental results indicate that the adaptive projection (AP)-based MBLBP is more discriminative than the weighted Haarlike and MBLBP features, inheriting the advantages of both at the same time. Besides that, we also introduce block-based versions of the modified census transform and CSLBP and their AP versions. AP-MBLBP, AP-MBMCT and AP-MBCSLBP will be explained in detail in Sections 3.1, 3.2 and 3.3, respectively.

3.1. AP-MBLBP

The idea of AP-MBLBP [33] is to get the optimal feature in each direction so as to maximize the margin between positive and negative feature vectors. In order to combine these feature ensembles, we encode the rectangle regions of every direction similarly to LBP. The AP-MBLBP operator is defined as follows:

$$f_{mblbp} = \sum_{i=0}^{N-1} g(p_i, p_c, w_i, w_c)\,2^i \tag{9}$$

where $p_c$ is the average intensity of the center area, which is the gray area shown in Fig. 4(a); $p_i$ is the ith direction around $p_c$; $w_i$ is the weight of $p_i$; and $g(p_i, p_c, w_i, w_c)$ is a piecewise function reflecting the relationship between $p_i$ and $p_c$, defined as follows:

$$g(p_i, p_c, w_i, w_c) = \begin{cases}1, & w_i p_i > w_c p_c\\ 0, & w_i p_i \le w_c p_c\end{cases} \tag{10}$$

The weight $w_i$ is calculated by Eq. (5), which maximizes the between-class variation and minimizes the within-class variation. Fig. 4 shows the details of the process of generating AP-MBLBP features. Firstly, a block composed of 3 × 3 cells is shown in Fig. 4(a). Then the block is separated into center-adjacent features (CAF), as demonstrated in Fig. 4(b). In the following, the optimal weight of each cell is calculated by Eq. (5). Finally, the encoding of all the cells is determined by Eq. (10). We can see that the last step, Fig. 4(d), is our AP-MBLBP operator, which is similar to LBP. As shown in Fig. 4(b), the eight two-adjacent cells reflect the variations in eight directions between the center cell and its adjacent ones, which encode the local intensity variation information of different directions. Unlike Haarlike features, which utilize only the one optimal direction, the other seven less discriminative features, which would otherwise be neglected, can also contribute local variation information to enhance the total discriminative power of AP-MBLBP features.

Fig. 4. AP-MBLBP features.

Fig. 5. Selected features on average face: (a) average face on training data, (b) first selected feature, (c) second selected feature, (d) third selected feature.

Fig. 6. AP-MBMCT features.
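The steps above can be sketched in NumPy. This is a minimal sketch under our own assumptions (function names, the per-CAF two-dimensional feature vector (p_i, p_c), and the regularization term are ours; how the per-CAF weight pairs are merged into a single center weight w_c is glossed over): for the rank-one S_b of Eq. (5) the maximizer has the closed form w ∝ S_w⁻¹(m₊ − m₋), and the learned weights then drive the comparison of Eqs. (9)-(10).

```python
import numpy as np

def caf_weights(pos_cafs, neg_cafs, eps=1e-6):
    """Optimal projection of Eq. (5) for one center-adjacent feature (CAF).
    pos_cafs, neg_cafs: (n, 2) arrays of (p_i, p_c) cell means per sample.
    Since S_b = (m+ - m-)(m+ - m-)^T is rank one, the maximizer of the
    Rayleigh quotient in Eq. (5) is w proportional to S_w^{-1}(m+ - m-)."""
    m_pos, m_neg = pos_cafs.mean(0), neg_cafs.mean(0)
    d_pos, d_neg = pos_cafs - m_pos, neg_cafs - m_neg
    # within-class scatter S_w = S_+ + S_-, regularized for invertibility
    Sw = d_pos.T @ d_pos + d_neg.T @ d_neg + eps * np.eye(2)
    w = np.linalg.solve(Sw, m_pos - m_neg)
    return w / np.linalg.norm(w)

def ap_mblbp_code(cells, pc, weights, wc):
    """AP-MBLBP code of Eqs. (9)-(10).
    cells: 8 surrounding cell means p_0..p_7; pc: center cell mean;
    weights: learned w_0..w_7; wc: weight of the center cell."""
    code = 0
    for i in range(8):
        if weights[i] * cells[i] > wc * pc:   # g(p_i, p_c, w_i, w_c) = 1
            code |= 1 << i
    return code
```

Note that with weights = (1, ..., 1) and wc = 1 the encoder reduces exactly to MBLBP, matching the degeneration property stated above.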
The key step, Fig. 4(c), fully makes use of the distribution of the samples to maximize the margin between positive and negative samples for each CAF, in order to promote the global discriminative power of AP-MBLBP. It is worth mentioning that AP-MBLBP is different from MBLBP, since the former has different weights for the surrounding rectangle regions; these weights encode the discriminative information between positive and negative samples. AP-MBLBP can be considered a generalized MBLBP. Denoting the weight vector $w = (w_1, w_2, \ldots, w_8)$, when $w = (1, 1, \ldots, 1)$, AP-MBLBP degenerates to MBLBP. AP-MBLBP has all the advantages of MBLBP; furthermore, it is more discriminative than MBLBP. In order to find the intrinsic reason for its strong discriminative ability, we visualize the first three AP-MBLBP features from our trained detector. From Fig. 5(b), we can see that the first feature is symmetric about the center of the face, and it covers the most important regions of the face area, such as the eyes and nose. It encodes the symmetric context information of the human face, which inherently exists in nature. The features in Fig. 5(c) and (d) encode the geometric information of the eyes, nose and mouth and the face silhouette, which is very discriminative for excluding negative patches. More discussion will be given in Section 5.2.

The process of generating AP-MBCSLBP is demonstrated in Fig. 7, where the length of the final bit string is half of that of AP-MBLBP. The four pairs of sub-regions are centrally symmetric with respect to the center block, which is different from MBLBP and MBMCT.

3.2. MBMCT and AP-MBMCT

We propose the block-based version of the modified census transform (MBMCT) and its AP version (AP-MBMCT), which are similar to MBLBP and AP-MBLBP. MBMCT encodes not only the surrounding cells of the center cell, but also the average intensity of all cells in a block, so the total length of its encoding bit string is longer than MBLBP's. The coding method is defined in Eq. (11) below.

3.4. Time complexity of generating adaptive projection features

The adaptive projection features utilize N training samples; each sample is of d × d size,
and the dimension of the feature vector is K. The final optimal weight of the feature is

$$w_{opt} = S_b^{-1}(m_+ - m_-)$$

where $S_b$ is the K × K between-class scatter matrix, and $m_+$ and $m_-$ are the K-dimensional positive and negative average feature vectors, respectively. We use the LU decomposition algorithm to solve the matrix inverse, so its complexity is $(2/3)K^3$. The complexity of computing each feature is $O(K^3 + K^2 + NK)$, where K is small in this paper, so it is very efficient to calculate, though a little slower than the original MBLBP features in the training phase.

$$f_{mbmct} = \sum_{i=0}^{N} g(p_i, \bar p, w_i)\,2^i \tag{11}$$

$$g(p_i, \bar p, w_i) = \begin{cases}1, & w_i p_i > \bar p\\ 0, & w_i p_i \le \bar p\end{cases},\qquad \bar p = \frac{1}{N+1}\sum_{i=0}^{N} p_i$$

The process of generating AP-MBMCT is shown in Fig. 6, which is similar to that of AP-MBLBP except that the length of the final bit string is longer and it uses the average block intensity instead of the center block.

3.3. MBCSLBP and AP-MBCSLBP

We also propose the block-based version of CSLBP (MBCSLBP) and AP-MBCSLBP, defined as follows:

$$f_{mbcslbp} = \sum_{i=0}^{N/2-1} g(p_i, p_{i+N/2}, w_i, w_{i+N/2})\,2^i \tag{12}$$

$$g(p_i, p_j, w_i, w_j) = \begin{cases}1, & w_i p_i > w_j p_j\\ 0, & w_i p_i \le w_j p_j\end{cases}$$

Fig. 7. AP-MBCSLBP features.

Fig. 8. Object detection architecture.

4. Object detection architecture

The proposed feature is designed to detect multiple object instances at different scales and locations in an input image. The object detection architecture is shown in Fig. 8. The process of constructing an object detector comprises two phases: one is training based on a large amount of training data, and the other is testing on a multi-scale image pyramid.
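The test-time pipeline of Fig. 8 can be sketched as follows (a minimal NumPy sketch under our own assumptions: `classify_window` is a stand-in for the trained cascade, the resizing is plain nearest-neighbor subsampling, and non-maximum suppression of overlapping hits is omitted for brevity):

```python
import numpy as np

def _subsample(image, factor):
    """Nearest-neighbor downscale by 'factor' (>= 1), one pyramid level."""
    h, w = image.shape[:2]
    ys = (np.arange(int(h / factor)) * factor).astype(int)
    xs = (np.arange(int(w / factor)) * factor).astype(int)
    return image[np.ix_(ys, xs)]

def detect(image, classify_window, win=24, scale=1.25, step=4):
    """Slide a win x win window over every pyramid level and keep the
    windows the classifier accepts, mapped back to image coordinates."""
    detections = []
    factor = 1.0
    while True:
        level = _subsample(image, factor)
        h, w = level.shape[:2]
        if min(h, w) < win:          # level smaller than the window: stop
            break
        for y in range(0, h - win + 1, step):
            for x in range(0, w - win + 1, step):
                if classify_window(level[y:y + win, x:x + win]):
                    detections.append((int(x * factor), int(y * factor),
                                       int(win * factor)))
        factor *= scale              # next, coarser pyramid level
    return detections
```

For example, on a 48 × 48 image whose top-left 24 × 24 quadrant is bright, `detect(img, lambda p: p.mean() > 0.5)` reports a hit at (0, 0, 24); in the real system the per-window test is the nested cascade, which rejects most background windows after only a few weak classifiers.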
In the training section, the adaptive projection feature set is calculated from the original feature set and the training data; then a nested cascade is constructed using asymmetric Gentle Adaboost. It is worth mentioning that the adaptive projection feature set has to be recalculated whenever the training data changes, such as after bootstrapping negative samples in each stage. The resulting final object detector is shown in the bottom part of Fig. 8. In the test section, the trained object detector is evaluated on the test database. Firstly, an image pyramid is constructed for every test image; secondly, the sliding window technique is utilized to check whether every possible window contains an object; thirdly, non-maximum suppression is used to merge the multiple detection windows around one object. This procedure is shown in the top part of Fig. 8.

5.1. Experimental setup

In the face detection experiment, the training set consists of approximately 10 000 faces, which cover out-of-plane rotation and in-plane rotation in the range of [−20°, 20°]. Through mirroring and random shifting operations, the size of the positive sample set is enlarged to 40 000. Besides that, more than 20 000 large images, which do not contain any faces, were used for bootstrapping. All samples are normalized to 24 × 24 pixels. We use N = 8 in Eqs. (9), (11) and (12). The number of MBLBP features generated in the face detection experiment is 8464. The size of the MBLBP features varies up to 24 pixels in width and height, respectively. The cost factors are set to C1 = 1 and C2 = 0.25 in asymmetric Gentle Adaboost. 256-bin, 512-bin and 16-bin LUT algorithms [8] are utilized as weak classifiers for the AP-MBLBP, AP-MBMCT and AP-MBCSLBP features, respectively. The minimum detection rate and maximum false positive rate are set to 0.9995 and 0.4, respectively. In the car detection experiment, we take training data from the UIUC side-view car dataset, which contains 550 car images and 500 negative images in
the training set. The original training samples are at a resolution of 100 × 40; we manually cropped and resized them to 64 × 32 in order to reduce the negative effect of background. The number of MBLBP features generated in the car detection experiment is 30 513. The size of the MBLBP features varies up to 63 pixels in width and up to 30 pixels in height. The minimum detection rate and maximum false positive rate are set to 0.995 and 0.4, respectively. Other parameter settings are the same as those mentioned above.

5. Experimental results and analysis

We evaluate our proposed features on three well-known databases: the MIT+CMU face dataset [24], the UIUC car dataset [49] and the PASCAL dataset [50]. All of them are freely available on the Internet.

5.2. Features comparison

In order to evaluate the effectiveness of our proposed features, we conduct a comparison among Haar [3], Flda+Haar [45], LBP, MBLBP [11], MBMCT, MBCSLBP, AP-MBLBP, AP-MBMCT and AP-MBCSLBP features on the face dataset. The results are shown in Fig. 9. In this experiment, we randomly choose 20 000 face samples and 20 000 negative samples for training, and the validation set uses all the remaining 20 000 faces for cross-validation. Eight single-node detectors, each containing 50 weak classifiers, are trained and evaluated on the validation set.

Fig. 9. Feature comparison: (a) false negative rate vs. weak classifier number, (b) false positive rate vs. weak classifier number, (c) error rate vs. weak classifier number.

From Fig. 9, we conclude that AP-MBMCT, AP-MBLBP and AP-MBCSLBP are more discriminative than MBMCT, MBLBP and MBCSLBP, respectively. AP-MBMCT and AP-MBLBP have the lowest false negative rate and false positive rate with the same number of weak classifiers, where AP-MBMCT is slightly better than AP-MBLBP in the first 30 rounds, converging to the same level after that. MBCSLBP is less discriminative than LBP, MBMCT and MBLBP, but better than the Haar and Flda+Haar features (a difference of about 10%). From Fig. 9(a), we
can see that Haarlike features cannot distinguish faces from difficult negative samples, but LBP-based features are able to discriminate them effectively. This is due to the advantage of LBP, which possesses good discriminative power in classifying weakly textured objects. In Fig. 9(b), we can see that both the Haarlike features and the LBP-based features can exclude negative patches effectively (a difference of about 2%). There are slight differences among AP-MBLBP, AP-MBMCT and AP-MBCSLBP in excluding negative samples, and AP-MBLBP and AP-MBMCT also perform better than AP-MBCSLBP. Fig. 9(c) shows the total error rate, which comprises false positives and false negatives. AP-MBLBP and AP-MBMCT show little difference in discriminative power, and all distribution-based features (AP versions) perform better than the original ones. So features making use of the distribution of the training samples can improve classification performance.

We visualize the top 12 FS2 features in Fig. 10. In Fig. 10, the most discriminative features are located at the eyes, the corners of the mouth and nose, and the face silhouette. So the symmetric context of the human face is utilized implicitly. The symmetric geometric relationship of the two eyes in the first feature can exclude more than 30% of the negative patches in the initial training set with zero false negatives. Furthermore, the left-half (top 2) and right-half (top 8) features describe the co-occurrence of one eye, a corner of the mouth and half of the face silhouette. So the feature makes use of both the symmetric context information and the geometric relationships of the facial components implicitly, which improves the discriminative ability. To sum up, most of the discriminative features lie in the face area or overlap the face and background. So the silhouette of the face and the eyes are the most important areas for distinguishing faces from non-faces.

Fig. 10. Top 12 features on FS2 set selected by our detector.

Fig. 11. ROC of test results on MIT+CMU dataset. (For interpretation of the references to color in this figure, the
reader is referred to the web version of this article.)

That is why our proposed feature is superior to the others.

5.3. Evaluation on MIT+CMU dataset

5.3.1. Comparison of different features
We evaluate our detector on the CMU+MIT frontal face database, which consists of 130 files containing 507 faces. Due to the similarity in discriminative power among MBLBP, MBMCT and CSLBP, we evaluate these feature sets together instead of one by one. One feature set (FS-1) comprises MBLBP, MBMCT and MBCSLBP; the other set (FS-2) includes AP-MBLBP, AP-MBMCT and AP-MBCSLBP. The ROC curves are displayed in Fig. 11. The data of the red solid curve in Fig. 11 are adopted from the original papers and the others are our implementations. The final trained detectors based on LBP-like features contain nodes with about 400 weak classifiers. Some of the test results on the MIT+CMU database are shown in Fig. 13. From Fig. 11, we can see that the discriminative ability of FS-2 is much better than that of FS-1. This is due to the distribution information, which is embedded into the features to enhance their discriminative ability. The ROC curve of our FS-2 detector is also higher than that of Yan [44], which is the state of the art. During the experiments, we observed that it is very effective to use the MSL [44] method to generate a well-distributed positive training set; the final positive training set (about 9000 samples), which is generated by bootstrapping in each stage, is a very good representation of the whole 40 000 original training samples. We also trained on this distilled set of 9000 samples and got nearly the same result. We also find that a well-distributed training set has an important effect on the accuracy of the detector. So it is very important to collect well-distributed positive and negative training sets in order to train a robust detector.

5.3.2. Comparison of different training methods
In this section we compare the nested cascade with the MSL method in training a face
detector. Both of them use real-value inheriting techniques to decrease the number of features. The difference between them is that MSL needs to bootstrap positives incrementally to get a well-distributed face subset which effectively represents all the face images in a huge validation set; in our method, we use a pre-selected face dataset, which is much smaller. In order to use the MSL method to train a face detector, we collected a huge face dataset containing about 390 000 faces from many well-known face databases; the quantity is even larger than that used in [44]. We implemented five different detectors (MSL with different features, including LAB, FS1 and FS2, and the nested cascade with FS1 and FS2), and the comparison results are shown in Fig. 12. From Fig. 12, we can see that when training with the FS1 feature set, both MSL and the nested cascade give very similar results. In the case of the FS2 feature set, the nested cascade is slightly better than the MSL method. We think that the MSL method with the FS2 feature set needs to bootstrap positives in each round of every stage in an incremental manner, which is more sensitive to the changing distribution of positive and negative data, especially in the earlier stages. Furthermore, much of the computation load focuses on recalculating the adaptive projection feature set after the positive samples are bootstrapped, so it is more time consuming to train with the MSL structure. From Fig. 12, we can also see that the ROC of our implementation of MSL with the LAB feature is slightly lower than its published result [44]. We think this may be due to different training data, because how to generate a well-distributed face manifold as training data is still a difficult problem [51], which is out of the scope of this paper. Furthermore, the MSL method is more complicated to implement, while our method is easy to implement with reasonable performance.

5.4. Evaluation on UIUC dataset

We also evaluated our features on the UIUC side-view car dataset, which
consists of a single-scale test set with 170 images containing 200 cars and a multi-scale test set with 108 images containing 139 cars. We denote them Test Set I and Test Set II and use the procedure provided with this dataset to evaluate the detection results. The car detector is trained on the UIUC car training set, which comprises 550 car images and 500 background images, using the same method as for faces. Two detectors with different features are evaluated on the UIUC datasets. The final trained detector contains only about 90 weak classifiers. The recall-precision curves for the single-scale and multi-scale test sets are shown in Fig. 14. We can see that the FS-2 feature set has a higher recall rate at the same precision; it is superior to the FS-1 feature set, which is consistent with the results evaluated on the face dataset.

Fig. 12. Comparison of different training methods.

Fig. 13. Face detector output on test images.

Fig. 14. 1-precision vs. recall curves of test results.

We compare our approach with previous approaches following the equal precision and recall rate (EPR) method. The results are listed in Table 1. From that we can see that the FS-1 and FS-2 feature sets have moderate detection rates in detecting low-resolution and complex-background images.

We visualize the top 12 features of the FS2 detector in Fig. 15. We can see that the most discriminative features focus on the bottom of the car, especially in the area of the wheels. In this experiment, we find that nearly no features lie on the top of the car; we think this is because the training data have left-side views and right-side views mixed together. So we manually separated the training set into two datasets (left-side view and right-side view) and trained on them. We observe a slight improvement in detection rate, so it is effective to detect left and right side views separately.

Table 1. EPR rates
of different methods on UIUC dataset Methods Agarwal et al [49] Mutch and Lowe [52] Wu and Nevatia [53] Fritz et al [54] Lampert et al [55] MB-(LBP,MCT,CSLBP) AP-MB-(LBP,MCT,CSLBP) From Table 1, we can see that our method does not get the state-of-art result; we think that the side view car has smaller discriminative area than faces and all the positions mainly lay on the silhouette of car, so shape information predominates the importance of feature, that is why other shape-based descriptor gets the state-of-art result Some of the car detection results are displayed in Fig 16 Single-scale (Test Set I) (%) Multi-scale (Test Set II) (%) 76.5 99.94 97.5 88.6 98.5 85.5 90.5 39.6 90.6 93.5 87.8 98.6 81.3 87 Dataset name VOC 2007 test set VOC 2008 test set VOC 2009 test set Description 4952 images 4336 images 6925 images Feature Haar FS1 FS2 Haar FS1 FS2 Haar FS1 FS2 False positives 94 87 85 66 77 57 112 125 103 True positives 853 1083 1135 1026 1246 1312 1352 1531 1597 The bold numbers in Table means the number of false positives and true positives found by our detector with feature FS2 in the VOC 2007, 2008 and 2009 dataset respectively For the number of false positives, the less the better For the number of true positives, the more the better Fig 15 Top 12 features selected by our detector Fig 16 Car detector output on test images 2777 J Shen et al / Neurocomputing 74 (2011) 2767–2779 5.5 Evaluation on Pascal VOC dataset 5.6 Speed comparison of object detectors We also evaluate our trained face detector on PASCAL VOC dataset in order to test its generalization ability We compared three detectors with Haarlike, FS1 and FS2 features All the detectors are terminated training when all the negative images are used out Because there are no annotations on face images and there is only a bounding box on person So we run our detectors on the test images of VOC data and manually count the true positive and false positives The test dataset contains thousands of images, where 
some of them are very challenge to detect The result is shown in Table We can see from Table 2, the true positives of detector with the FS2 is better than FS1 and Haar feature in all the three VOC datasets with smaller false positives The FS1 feature detector ranked second We also obtained two other observations from this experiment The first observation is that the FS1 and FS2 features are much robust with illumination variation We carefully choose these missed faces from Haar feature and detect by FS2 feature; then we found that most of the faces are in bright or darker environment rather than regular lighting conditions So it validates the effectiveness of LBP-based feature is robust to monotonous variations of illumination The second one is that the false positives are very similar to human faces We show some false positives from our trained detector in Fig 17 It is easy to see that most of the images are very similar to human faces such as cat and dog, or it has some common structures with face Some detection results in which many of them are difficult for Haar and FS1 are shown in Fig 18 We evaluate the speed of the face detectors with different features using average number of weak classifiers to reject a background patch The evaluation criterion used in [45] is defined as follows: c¼ n X ci pi , pi ¼ i¼1 ei N where ci is the number of weak classifiers in node i, pi is the probability of evaluating a background patch in node i, ei is number of background patches evaluated in node i and N is the total negative patches in the evaluation We randomly generated 10 000 negatives patches from the 15 000 big images without faces in it and evaluated the final detector on this negative patch set We also evaluate the speed in frames per seconds at the 320  240 resolutions from digital camera The comparison results are shown in Table We can see from Table that both of the features can run in the real-time application Table Comparison of c and speed for different 
detectors.

Weak classifier type | c | Fps
MB-(LBP, MCT, CSLBP) | 5.63 | 16
AP-MB-(LBP, MCT, CSLBP) | 3.27 | 18

Fig. 17. Some false positives on the Pascal VOC datasets (three rows, from VOC 2007, VOC 2008 and VOC 2009, respectively).
Fig. 18. Some detection results on the VOC datasets.

Conclusion

In this paper, we propose a novel adaptive projection feature, which fully exploits the distribution information of the training samples to enhance the discriminative power of block-based LBP. We extend the census transform and CSLBP into block-based versions and use the distribution information of the training samples to enhance their discriminative power. These features possess good properties: strong discriminative ability and simplicity of calculation. The comparison between these features indicates that AP-MBMCT is superior to AP-MBLBP, AP-MBCSLBP, MBMCT, MBLBP and MBCSLBP. Experiments on the CMU+MIT face, PASCAL VOC and UIUC car datasets show that our proposed features are fast to compute and possess powerful discriminative ability. There is still much work to do in the future, such as extending the approach to other descriptors for non-rigid object detection on more complicated datasets with more object categories.

Acknowledgments

This project is supported by the NSF of China (90820009, 60803049 and 60875010), the Program for New Century Excellent Talents in University of China (NCET-08-0106), the China Postdoctoral Science Foundation (20100471000) and the Fundamental Research Funds for the Central Universities (2010B10014).

References

[1] M. Enzweiler, D.M. Gavrila, Monocular pedestrian detection: survey and experiments, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (12) (2009) 2179–2195.
[2] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, Z. Jing, Framework for performance evaluation of face, text, and vehicle detection and tracking in video: data, metrics, and protocol, IEEE
Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 319–336.
[3] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: IEEE Conference on Computer Vision and Pattern Recognition, 2001, pp. 511–518.
[4] C.P. Papageorgiou, M. Oren, T. Poggio, A general framework for object detection, in: Sixth International Conference on Computer Vision, 1998, pp. 555–562.
[5] R. Lienhart, J. Maydt, An extended set of Haar-like features for rapid object detection, in: International Conference on Image Processing, 2002, pp. 900–903.
[6] M. Jones, P. Viola, Fast multi-view face detection, http://www.merl.com/papers/docs/TR2003-96.pdf.
[7] S. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, H. Shum, Statistical learning of multi-view face detection, in: Proceedings of the 7th European Conference on Computer Vision—Part IV, 2002, pp. 67–81.
[8] C. Huang, H. Ai, Y. Li, S. Lao, Learning sparse features in granular space for multi-view face detection, in: International Conference on Automatic Face and Gesture Recognition, 2006, pp. 401–406.
[9] W. Gao, H. Ai, S. Lao, Adaptive contour features in oriented granular space for human detection and segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[10] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7) (2002) 971–987.
[11] L. Zhang, R. Chu, S. Xiang, S. Liao, S. Li, Face detection based on multi-block LBP representation, in: 2nd International Conference on Biometrics, 2007, pp. 11–18.
[12] B. Froba, A. Ernst, Face detection with the modified census transform, in: IEEE International Conference on Automatic Face and Gesture Recognition, 2004, pp. 91–96.
[13] A. Roy, S. Marcel, Haar local binary pattern feature for fast illumination invariant face detection, in: BMVC, 2009.
[14] C. Shen, S. Paisitkriangkrai, J. Zhang, Face detection from few training examples, in: IEEE International Conference on Image Processing, 2008, pp. 2764–2767.
[15] O. Tuzel, F. Porikli, P. Meer, Human detection via classification on Riemannian manifolds, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[16] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
[17] K. Levi, Y. Weiss, Learning object detection from a small number of examples: the importance of good features, in: IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 53–60.
[18] D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[19] Z. Jin, et al., Face detection using template matching and skin-color information, Neurocomputing 70 (4–6) (2007) 794–800.
[20] C.F. Juang, W.K. Sun, G.C. Chen, Object detection by color histogram-based fuzzy classifier with support vector learning, Neurocomputing 72 (10–12) (2009) 2464–2476.
[21] A. Colmenarez, T. Huang, Face detection with information-based maximum discrimination, in: IEEE Conference on Computer Vision and Pattern Recognition, 1997.
[22] E. Osuna, R. Freund, F. Girosit, Training support vector machines: an application to face detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 1997, pp. 130–136.
[23] C.F. Juang, S.-J. Shiu, Using self-organizing fuzzy network with support vector learning for face detection in color images, Neurocomputing 71 (16–18) (2008) 3409–3420.
[24] H.A. Rowley, S. Baluja, T. Kanade, Neural network-based face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1) (1998) 23–38.
[25] M.H. Yang, D. Roth, N. Ahuja, A SNoW-based face detector, Advances in Neural Information Processing Systems 12 (2000) 855–861.
[26] F. Fleuret, D. Geman, Coarse-to-fine face detection, International Journal of Computer Vision 41 (1) (2001) 85–107.
[27] R. Xiao, L. Zhu, H.J. Zhang, Boosting chain learning for object detection, in: IEEE International Conference on Computer Vision, 2003, pp. 709–715.
[28] J. Sochman, J. Matas, WaldBoost—learning for time constrained sequential detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 150–156.
[29] L. Bourdev, J. Brandt, Robust object detection via soft cascade, in: IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 236–243.
[30] R. Xiao, H. Zhu, H. Sun, X. Tang, Dynamic cascades for face detection, in: IEEE International Conference on Computer Vision, 2007, pp. 1–8.
[31] M.T. Pham, V.D. Hoang, T.J. Cham, Detection with multi-exit asymmetric boosting, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[32] S. Yan, S. Shan, X. Chen, W. Gao, J. Chen, Matrix-structural learning (MSL) of cascaded classifier from enormous training set, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–7.
[33] J. Shen, W. Yang, C. Sun, Learning discriminative features based on distribution, in: International Conference on Pattern Recognition, 2010, pp. 1401–1404.
[34] B. Zhang, S. Shan, X. Chen, W. Gao, Histogram of Gabor phase pattern: a novel object representation for face recognition, IEEE Transactions on Image Processing 16 (1) (2007) 57–68.
[35] G. Zhao, M. Pietikäinen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6) (2007) 915–928.
[36] T. Kim, D. Kim, MMI-based optimal LBP code selection for facial expression recognition, in: IEEE International Symposium on Signal Processing and Information Technology, 2009, pp. 384–391.
[37] M. Heikkilä, M. Pietikäinen, A texture-based method for modeling the background and detecting moving objects, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4) (2006) 657–662.
[38] Y. Guo, G. Zhao, J. Chen, M. Pietikäinen, Z. Xu, Dynamic texture synthesis using a spatial temporal descriptor, in: IEEE International Conference on Image Processing, 2009.
[39] Z. Guo, L. Zhang, D. Zhang, Rotation invariant texture classification using LBP variance (LBPV) with global matching, Pattern Recognition 43 (3) (2010) 706–719.
[40] A. Hadid, M. Pietikäinen, Combining appearance and motion for face and gender recognition from videos, Pattern Recognition 42 (11) (2009) 2818–2827.
[41] Z. Guo, L. Zhang, D. Zhang, A completed modeling of local binary pattern operator for texture classification, IEEE Transactions on Image Processing 19 (6) (2010) 1657–1663.
[42] R. Zabih, J. Woodfill, A non-parametric approach to visual correspondence, IEEE Transactions on Pattern Analysis and Machine Intelligence (1996).
[43] M. Heikkilä, M. Pietikäinen, C. Schmid, Description of interest regions with local binary patterns, Pattern Recognition 42 (3) (2009) 425–436.
[44] S. Yan, S. Shan, X. Chen, W. Gao, Locally assembled binary (LAB) feature with feature-centric cascade for fast and accurate face detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.
[45] S.K. Pavani, D. Delgado, A.F. Frangi, Haar-like features with optimally weighted rectangles for rapid object detection, Pattern Recognition 43 (1) (2009) 160–172.
[46] P. Viola, M. Jones, Fast and robust classification using asymmetric AdaBoost and a detector cascade, Advances in Neural Information Processing Systems (2002) 1311–1318.
[47] H. Masnadi-Shirazi, N. Vasconcelos, Asymmetric boosting, in: Proceedings of the Twenty-Fourth International Conference on Machine Learning, 2007, pp. 609–619.
[48] Q. Li, Y. Mao, Z. Wang, W. Xiang, Cost-sensitive boosting: fitting an additive asymmetric logistic regression model, Advances in Machine Learning 5828 (2009).
[49] S. Agarwal, A. Awan, D. Roth, Learning to detect objects in images via a sparse, part-based representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (11) (2004) 1475–1490.
[50] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, International Journal of Computer Vision 88 (2) (2010) 303–338.
[51] J. Chen, X. Chen, J. Yang, S. Shan, R. Wang, W. Gao, Optimization of a training set for more robust face detection, Pattern Recognition 42 (11) (2009) 2828–2840.
[52] J. Mutch, D.G. Lowe, Multiclass object recognition with sparse, localized features, in: IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 11–18.
[53] B. Wu, R. Nevatia, Cluster boosted tree classifier for multi-view, multi-pose object detection, in: IEEE 11th International Conference on Computer Vision, 2007.
[54] M. Fritz, B. Leibe, B. Caputo, B. Schiele, Integrating representative and discriminative models for object category detection, in: IEEE International Conference on Computer Vision, 2005, pp. 1363–1370.
[55] C.H. Lampert, M.B. Blaschko, T. Hofmann, Beyond sliding windows: object localization by efficient subwindow search, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.

Jifeng Shen received his B.S. and M.S. degrees in Computer Science from East China Shipbuilding Institute and Jiangsu University of Science and Technology, Zhenjiang, China, in 2003 and 2006, respectively. He is currently a Ph.D. student in the Department of Automation at Southeast University, Nanjing, China. His research interests include object detection, scene classification and content-based image retrieval.

Changyin Sun is a professor in the School of Automation at Southeast University, China. He received the M.S. and Ph.D. degrees in Electrical Engineering from Southeast University, Nanjing, China, in 2001 and 2003, respectively. His research interests include intelligent control, neural networks, SVM, pattern recognition, optimal theory, etc. He has received the First Prize of Nature Science of the Ministry of Education, China. He has published more than 40 papers. He is an associate editor of IEEE Transactions on Neural Networks, Neural Processing Letters, International Journal of Swarm Intelligence Research and Recent Patents on Computer
Science. He is an IEEE Member.

Wankou Yang received his B.S., M.S. and Ph.D. degrees at the School of Computer Science and Technology, Nanjing University of Science and Technology (NUST), P.R. China, in 2002, 2004 and 2009, respectively. Now he is a Postdoctoral Fellow in the School of Automation, Southeast University, P.R. China. His research interests include pattern recognition, computer vision and digital image processing.

Zhongxi Sun received the B.S. degree in Information and Computation Science and the M.S. degree in Applied Mathematics from Harbin University of Science and Technology in 2002 and 2005, respectively. He has worked in the College of Science, Hohai University, since then. Now he is pursuing his doctor's degree in Control Theory and Control Engineering at Southeast University. His current interests are in the areas of image processing, pattern recognition and wavelet analysis.

Zhenyu Wang received the B.Eng. degree in Electrical Engineering from the Anhui University of Technology, Anhui, China, in 2001, and the M.Eng. degree in control theory from Kunming University of Science and Technology, Yunnan, China, in 2008. He is currently pursuing the Ph.D. degree at Southeast University, China. His research interests include image/video processing, computer vision and super-resolution.

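As an illustrative aside, the multi-block LBP encoding that underlies the features discussed above can be computed cheaply with an integral image, so that each 8-bit code costs only a handful of array lookups regardless of block size. The sketch below is our own, not the authors' implementation; the function names, the 3 × 3 block layout and the clockwise bit ordering are assumptions for illustration.

```python
def integral_image(img):
    """Summed-area table for a 2-D list: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def block_sum(ii, y, x, h, w):
    """Sum over the h x w block with top-left corner (y, x), in O(1)."""
    s = ii[y + h - 1][x + w - 1]
    if y > 0:
        s -= ii[y - 1][x + w - 1]
    if x > 0:
        s -= ii[y + h - 1][x - 1]
    if y > 0 and x > 0:
        s += ii[y - 1][x - 1]
    return s

def mb_lbp(ii, y, x, bh, bw):
    """8-bit MB-LBP code for a 3x3 grid of bh x bw blocks anchored at (y, x)."""
    sums = [[block_sum(ii, y + r * bh, x + c * bw, bh, bw)
             for c in range(3)] for r in range(3)]
    centre = sums[1][1]
    # Clockwise neighbour order starting from the top-left block.
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(order):
        if sums[r][c] >= centre:
            code |= 1 << bit
    return code

patch = [[9 * r + c for c in range(9)] for r in range(9)]  # toy gradient patch
print(mb_lbp(integral_image(patch), 0, 0, 3, 3))  # -> 120
```

Because all nine blocks share the same area, comparing block sums is equivalent to comparing block means, which is why no division appears in `mb_lbp`; on the gradient patch only the four bottom/right neighbours exceed the centre block, giving the code 120.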
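The expected-cost criterion c = Σ_{i=1}^{n} c_i p_i with p_i = e_i/N, used above to compare detector speeds, reduces to a one-line computation once the per-node counts are known. This sketch is ours; the function name and the toy counts are illustrative assumptions, not the paper's data.

```python
def expected_cost(weak_counts, evaluated, n_total):
    """Average weak-classifier evaluations spent per background patch:
    c = sum_i c_i * p_i with p_i = e_i / N, where c_i is the number of weak
    classifiers in cascade node i and e_i is the number of negative patches
    that reach (are evaluated in) node i out of N total."""
    return sum(c_i * e_i / n_total for c_i, e_i in zip(weak_counts, evaluated))

# Toy cascade: all 100 patches pass through node 1 (2 weak classifiers),
# and 50 survive to node 2 (4 more weak classifiers).
print(expected_cost([2, 4], [100, 50], 100))  # -> 4.0
```

With these toy counts, every patch pays for node 1 and half the patches also pay for node 2, so the detector spends 2 + 2 = 4 weak-classifier evaluations per background patch on average; a lower c means faster rejection.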