
An Efficient Framework for Pixel-wise Building Segmentation from Aerial Images


Nguyen Tien Quang (Hanoi University of Science and Technology, octagon9x@gmail.com), Nguyen Thi Thuy (Faculty of Information Technology, Vietnam National University of Agriculture, ntthuy@vnua.edu.vn), Dinh Viet Sang (Hanoi University of Science and Technology, sangdv@soict.hust.edu.vn), Huynh Thi Thanh Binh (Hanoi University of Science and Technology, binhht@soict.hust.edu.vn)

ABSTRACT

Detection of buildings in aerial images is an important and challenging task in computer vision and aerial image interpretation. This paper presents an efficient approach that combines a random forest (RF) and a fully connected conditional random field (CRF) over various features for the detection and segmentation of buildings at the pixel level. The RF learns extremely fast on big aerial image data. The unary potentials given by the RF are then combined in a fully connected CRF model for pixel-wise classification. The use of high-dimensional Gaussian filtering for the pairwise potentials makes inference tractable while preserving high classification accuracy. Experiments have been conducted on a challenging aerial image dataset from a recent ISPRS Semantic Labeling Contest [9]. We obtained state-of-the-art accuracy with a reasonable computation time.

CCS Concepts

• Computing methodologies → Image segmentation; Supervised learning by classification; Latent variable models; Deep belief networks

Keywords

Aerial image, building detection, random forest, fully connected CRF, semantic segmentation, feature extraction

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. SoICT 2015, December 03-04, 2015, Hue City, Viet Nam. © 2015 ACM. ISBN 978-1-4503-3843-1/15/12, $15.00. DOI: http://dx.doi.org/10.1145/2833258.2833311

1. INTRODUCTION

Detection and segmentation of buildings in aerial images is important for aerial image analysis and interpretation. Applications include cartography, 3D city modeling, land cover classification and Internet applications. The topic has been widely researched over the last decades. The problem is challenging because of the natural complexity of terrestrial scenes and the demand for efficient processing of big image datasets.

Building segmentation is difficult for many reasons. Buildings are mostly located in urban scenes, surrounded by various objects in close proximity or acting as distractors, such as parking lots, vehicles, streets and trees. Some objects are occluded or cluttered. Buildings may have complex shapes with varied architectural details, roofs show varying reflectance, and gray rooftops are very similar to the street layer.

With the advance of aerial imaging technology, high-resolution aerial images can be produced and made available for various tasks [8, 9, 18]. Aerial images are usually taken over large areas on the ground, typically a city or an urban area of hundreds of square kilometers. The ground sampling distance of aerial imagery may reach a pixel size of 10 cm, and such a large urban area may be covered by thousands of large-format aerial photographs with high overlap [28]. The high resolution makes it convenient to analyze small objects in detail; however, processing such big image data is computationally demanding.

In this paper, we address a concrete task: detecting buildings at the pixel level, i.e. building footprint extraction. The detection and segmentation of buildings is necessary for many tasks, such as change detection for map revision, or providing building footprints for subsequent building extraction and reconstruction steps [4, 11, 31].

Over the years, automated building detection from aerial images has been an active research topic, and many methods have been proposed in the literature [11, 21]. These approaches differ in the data sources they use, the models they employ and the evaluation methods [23, 26, 34]. However, how to exploit and integrate multiple data sources in an efficient learning framework, so as to obtain satisfactory detection and segmentation of buildings at the pixel level, is still an open problem. This paper proposes an efficient approach that combines a random forest (RF) and a fully connected conditional random field (CRF) over various features for the detection and segmentation of building footprints. Six informative feature types are extracted from a rich source of image data. The random forest learns very fast on these feature sets and assigns high probability to pixels belonging to the building class. A CRF is then employed to exploit the interactions between neighboring pixels, aiming to improve the classification results given by the RF. A CRF with Gaussian kernels can perform inference efficiently, which reduces the computational time on big datasets.

2. RELATED WORK

Building detection and extraction is an active research topic in photogrammetry and computer vision [23, 24, 28, 33]. The approaches typically differ in the type of image data and the methods used. Some works use a single intensity image only [19], while others use data from multiple aerial images, including colour and height-field data [11, 39, 43]. Early works mainly relied on geometric image features [7, 27]; these approaches often fail when the building structures are complex [6]. In some works, rooftops were used as evidence of a building's presence, and a perceptual grouping method or a geometry-based method was then employed to detect and reconstruct the buildings. This allows detection and reconstruction to be done at the same time, but the resulting system is usually complicated and human interaction is needed in many cases [19, 30]. Matikainen et al. [22] proposed a system for building detection from laser scanning data and aerial colour images: the DSM data is classified into ground and building-or-tree objects, and buildings are then separated from trees. The work in [5] showed the feasibility of classification-based building detection and the potential for automating the approach. Rottensteiner [32] proposed a per-pixel classification approach for building change detection for map revision. Xu et al. [40] proposed a three-step point-based method for detecting changes to buildings and trees using airborne light detection and ranging (LiDAR) data.

Some approaches have employed graphical models to integrate contextual information and improve classification results, cf. Kumar and Hebert [15] and Verbeek and Triggs [36]. Korc and Forstner [14] used a Markov random field model and showed that parameter learning methods can be improved. There have also been attempts to use conditional random fields to model contextual information for the detection of urban areas [42] or of objects in aerial images [41]. Meng et al. [25] used a multi-directional ground filter on LiDAR data to obtain bare ground points, and then used the NDVI to remove trees; a supervised C4.5 decision tree was applied to separate building pixels from non-building pixels, with about 2.55 percent of tree pixels misclassified as buildings.

Recently, the ISPRS benchmark dataset for urban object detection was released [9], providing ground truth for the evaluation of methods. The recent results reported by Rottensteiner et al. [33] show the efforts of many researchers in developing efficient methods for automated object detection and 3D building reconstruction from aerial imagery. Despite this, how to effectively detect and segment building footprints at the pixel level from high-resolution aerial images remains a challenge, especially with respect to computation time.

3. OUR PROPOSED FRAMEWORK

Our framework consists of three steps: feature extraction, RF learning, and CRF inference. For feature extraction, powerful techniques are employed to extract representative features from the given aerial image data, which include a true orthophoto (TOP) and a digital surface model (DSM) [9]. The feature types are NDVI, NDSM, texton, color, saturation and entropy. An RF is then learned on these features, and CRF inference is finally performed on the classification output of the RF. The details of each step are presented in the following.

3.1 Feature extraction

We use the following features to describe the image data.

NDVI: the normalized difference vegetation index, computed from the first (IR) and second (R) channels of the CIR true orthophoto (TOP):

\mathrm{NDVI} = \frac{IR - R}{IR + R} \qquad (1)

The use of the NDVI is based on the fact that green vegetation has low reflectance in the red spectrum (R) due to chlorophyll and much higher reflectance in the infrared spectrum (IR) due to its cell structure. Hence, it is a good feature for distinguishing green vegetation from the other classes.

NDSM: the difference between the DSM and the derived DTM, which separates pixels into ground and off-ground:

\mathrm{NDSM} = \mathrm{DSM} - \mathrm{DTM} \qquad (2)

This feature helps to distinguish the high object classes from the low object classes.
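As a concrete illustration of Eqs. (1) and (2), the following sketch computes the two indices with NumPy. It is not the authors' code: the array layout (an H x W x 3 CIR orthophoto with IR in channel 0 and R in channel 1, plus H x W DSM/DTM rasters) and the small epsilon guarding against division by zero are assumptions made for the example.

```python
import numpy as np

def ndvi(ir, r, eps=1e-6):
    """Normalized difference vegetation index, Eq. (1)."""
    ir = ir.astype(np.float64)
    r = r.astype(np.float64)
    # eps avoids division by zero where IR + R == 0 (an assumption, not from the paper)
    return (ir - r) / (ir + r + eps)

def ndsm(dsm, dtm):
    """Normalized DSM, Eq. (2): height above the derived terrain model."""
    return dsm.astype(np.float64) - dtm.astype(np.float64)

# Usage with a CIR true orthophoto whose first channel is IR and second is R:
# ndvi_map = ndvi(top[..., 0], top[..., 1])   # top: (H, W, 3) CIR orthophoto
# ndsm_map = ndsm(dsm, dtm)                   # dsm, dtm: (H, W) height rasters
```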
Texton: a texton is a unit of texture, reflecting the human perception of textured images. Textons have been proven effective in image segmentation: when an image is represented in terms of textons, each pixel carries more useful information than raw color values alone [38].

Color: in this work we use the CIELab color space. Unlike the RGB and CMYK color models, Lab color is designed to approximate human vision. It aspires to perceptual uniformity, and its L component closely matches the human perception of lightness.

Saturation of the CIR image: previous works have shown that saturation further supports the separation of vegetation from impervious surfaces.

Entropy: computed over a square neighborhood of the DSM to exploit the spatial context information around each pixel.
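The remaining per-pixel cues can be obtained with standard image-processing routines. The sketch below uses scikit-image purely as a stand-in; the paper does not say which tools were used, the entropy window size `win` is left as a parameter because its value is not given in this excerpt, and the texton feature is omitted since it requires a separately learned filter bank and clustering stage.

```python
import numpy as np
from skimage import color
from skimage.filters.rank import entropy as local_entropy
from skimage.morphology import square
from skimage.util import img_as_ubyte

def color_and_context_features(cir_top, dsm, win):
    """CIELab, saturation and local DSM entropy per pixel (illustrative only).

    cir_top: (H, W, 3) CIR true orthophoto; dsm: (H, W) surface model.
    win is the entropy window size; the value used in the paper is not
    stated in this excerpt, so it is left as a caller-supplied parameter.
    """
    # CIELab representation (the three CIR channels are treated like RGB here)
    lab = color.rgb2lab(cir_top)

    # Saturation channel of the CIR image via an HSV conversion
    sat = color.rgb2hsv(cir_top)[..., 1]

    # Local entropy of the DSM over a square neighborhood (spatial context)
    dsm_norm = (dsm - dsm.min()) / (dsm.max() - dsm.min() + 1e-6)
    ent = local_entropy(img_as_ubyte(dsm_norm), square(win))

    return lab, sat, ent
```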
3.2 Learning Model

3.2.1 Random Forest

With the extracted features, we use a random forest classifier to build the unary potentials for the CRF model. The random forest used in this work is Breiman's CART-RF [3]. Its training algorithm applies the general technique of bootstrap aggregating (bagging) to tree learners. Given a training set I = {i_1, i_2, ..., i_n}, where i_j is the feature vector at pixel j, with responses X = {x_1, x_2, ..., x_n}, where x_j ∈ L = {1, ..., l}, bagging repeatedly selects a random sample of the training set with replacement and fits a tree to each sample:

for b = 1, ..., n_tree:
    1. Sample n training examples (I_b, X_b) with replacement from (I, X).
    2. Train a classification tree f_b on (I_b, X_b).
end for

After training, the prediction for an unseen sample i' is made by averaging the predictions of all the individual classification trees on i':

\hat{f}(i') = \frac{1}{n_{\mathrm{tree}}} \sum_{b=1}^{n_{\mathrm{tree}}} f_b(i') \qquad (3)

For classification trees this amounts to taking the majority vote. Random forests have several advantages: computational efficiency in both training and classification, probabilistic output, seamless handling of a large variety of visual features, and the inherent feature sharing of a multi-class classifier. However, with this technique the image pixels are labeled independently, without regard to the interrelations between them. Therefore, in the subsequent step we further improve the segmentation results by employing an efficient inference model (the CRF), which can exploit the interrelations between image pixels.
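A minimal sketch of this stage, using scikit-learn's RandomForestClassifier as a stand-in for Breiman's CART-RF [3] (the paper does not name an implementation). The feature matrices, label vector and number of trees are assumed inputs; scikit-learn's predict_proba averages the per-tree class probabilities, which matches the averaging in Eq. (3), and the negative-log transform is one common way to turn them into unary potentials.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_unary_potentials(X_train, y_train, X_all, n_trees=100):
    """Train an RF on labelled pixels and return per-pixel unary potentials.

    X_train: (n_labelled, D) feature vectors; y_train: (n_labelled,) labels;
    X_all:   (n_pixels, D) feature vectors for the whole image.
    n_trees corresponds to n_tree in the bagging loop above; 100 is illustrative.
    """
    rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1)
    rf.fit(X_train, y_train)

    # Averaged per-tree class probabilities (cf. Eq. (3))
    proba = rf.predict_proba(X_all)              # (n_pixels, n_classes)

    # Negative log-probabilities as unary potentials (a common choice, assumed here)
    unary = -np.log(np.clip(proba, 1e-8, 1.0))
    return proba, unary
```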
3.2.2 Fully Connected Conditional Random Field

In this subsection we give a brief overview of fully connected conditional random fields (full-CRF) for pixel-wise labelling and introduce the technique used in this paper. A full-CRF, used in the context of pixel-wise label prediction, models pixel labels as random variables that are conditioned on a global observation and obey the Markov property; the global observation is usually taken to be the whole image. Let X be a random field over the set of random variables {X_1, X_2, ..., X_N}, where N is the number of pixels in the image and X_i is the random variable associated with pixel i, representing the label assigned to that pixel; it can take any value from a predefined label set L = {1, 2, ..., l}. Let I be the image observation, which represents the features corresponding to the pixels. The pair (I, X) can be seen as a CRF model characterized by a Gibbs distribution:

P(X = x \mid I) = \frac{1}{Z(I)} \exp\bigl(-E(x \mid I)\bigr) \qquad (4)

where Z(I) is the partition function and E(x \mid I) is the Gibbs energy, consisting of unary and pairwise terms:

E(x) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j) \qquad (5)

The unary potentials \psi_u(x_i) are given by the random forest output, and the pairwise potentials are defined as weighted Gaussians:

\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{M} w^{(m)} k_G^{(m)}(f_i, f_j) \qquad (6)

where each k_G^{(m)}, m = 1, ..., M, is a Gaussian kernel applied to feature vectors. The feature vector of pixel i, denoted by f_i, is computed from image features such as the spatial location and color values [13]. The function \mu(\cdot, \cdot), called the label compatibility function, introduces a penalty for nearby similar pixels that are assigned different labels.

Inference algorithm: minimizing the CRF energy E(x) yields the most probable label assignment x for the given image, which is equivalent to maximum a posteriori (MAP) inference. Since exact minimization is intractable, mean-field inference computes a distribution Q(X) that best approximates the model distribution P(X). Here Q(X) = \prod_i Q_i(X_i) is a product of independent marginals over the individual variables, and each marginal is constrained to be a proper probability distribution: \sum_{x_i} Q_i(X_i = x_i) = 1 and Q_i(X_i) \ge 0. The mean-field approximation minimizes the KL-divergence

D(Q \,\|\, P) = \sum_{x} Q(x) \log \frac{Q(x)}{P(x \mid I)} \qquad (7)

which, using (4) and (5), expands to

D(Q \,\|\, P) = \sum_{i} \sum_{x_i} Q_i(x_i) \log Q_i(x_i) + \sum_{i} \sum_{x_i} Q_i(x_i)\, \psi_u(x_i) + \sum_{i<j} \sum_{x_i, x_j} Q_i(x_i) Q_j(x_j)\, \psi_p(x_i, x_j) + \log Z(I)

Minimizing this KL-divergence leads to iterative mean-field updates of the marginals Q_i, which can be computed efficiently for Gaussian pairwise potentials using high-dimensional filtering [13].
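For this inference step, a publicly available implementation of the fully connected CRF of Krähenbühl and Koltun [13] can be used. The sketch below relies on the pydensecrf wrapper as a stand-in for whatever implementation the authors used; the kernel widths, compatibility weights and number of mean-field iterations are illustrative values, not the paper's settings, and the RF probabilities are assumed to be reshaped to (n_classes, H, W).

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def dense_crf_refine(rf_proba, guide_image, n_iters=5):
    """Mean-field inference in a fully connected CRF seeded with RF probabilities.

    rf_proba:    (n_classes, H, W) per-pixel class probabilities from the RF.
    guide_image: (H, W, 3) uint8 image (e.g. the CIR orthophoto) used by the
                 appearance (bilateral) kernel.
    All kernel parameters below are illustrative, not the paper's values.
    """
    n_classes, H, W = rf_proba.shape
    d = dcrf.DenseCRF2D(W, H, n_classes)

    # Unary potentials: negative log of the RF class probabilities
    d.setUnaryEnergy(unary_from_softmax(rf_proba))

    # Smoothness kernel over spatial positions only
    d.addPairwiseGaussian(sxy=3, compat=3)

    # Appearance kernel over spatial positions and colour values (cf. Eq. (6))
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(guide_image), compat=10)

    Q = np.array(d.inference(n_iters))       # (n_classes, H*W) approximate marginals
    return Q.argmax(axis=0).reshape(H, W)    # MAP labelling
```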
