UNIVERSITY OF CALIFORNIA
SANTA CRUZ

LEARNING-BASED APPROACH FOR VISION PROBLEMS

A dissertation submitted in partial satisfaction of the
requirements for the degree of

DOCTOR OF PHILOSOPHY

in

COMPUTER ENGINEERING

by

Dan Kong

December 2006

The Dissertation of Dan Kong
is approved:

Professor Hai Tao, Chair

Professor Roberto Manduchi

Professor James Davis

Lisa C. Sloan
Vice Provost and Dean of Graduate Studies
UMI Number: 3241208

INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

UMI Microform 3241208
Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Copyright © by Dan Kong 2006
Table of Contents

List of Figures
List of Tables
Abstract
Dedication
Acknowledgments

I Learning-based Stereo

1 Introduction
   1.1 The Problem
   1.2 Foundations of Stereo
       1.2.1 Calibration
       1.2.2 Correspondence
       1.2.3 Triangulation
   1.3 Constraints

2 Related Work and Motivation
   2.1 Related Work
       2.1.1 Pixelwise Matching
       2.1.2 Window-based Matching
       2.1.3 Cooperative Methods
       2.1.4 Dynamic Programming
       2.1.5 MRF-based Methods
       2.1.6 Segmentation-based Methods
       2.1.7 Symmetric Matching
   2.2 Motivation
       2.2.1 Observation I
       2.2.2 Observation II
       2.2.3 Observation III

3 The Approach
   3.1 Representing and Learning Matching Behaviors
       3.1.1 Representing matching behaviors
       3.1.2 Learning the distribution
       3.1.3 Adaptive bin selection
   3.2 A Probabilistic Stereo Model
       3.2.1 Stereo as an MAP-MRF problem
   3.4 Using segmentation

4 Results and Discussion
   4.1 Learning
   4.2 Depth estimation

6 Primal Sketch Priors
       6.1.1 Analytical Priors
       6.1.2 Example-based Priors
   7.3 Scene-specific priors
       7.3.1 Generalization
       7.3.3 Predictability
       7.3.4 Scene-specific priors
       7.4.1 Introduction to CRF
       7.4.2 PCRF formulation
       7.4.3 Inference

Implementation and Results

III Human Detection and Counting

10 Introduction and Related Work

12 Human Counting: A Detection-based Approach
   12.1 Convolutional Neural Network (CNN)
   12.3 Results

13 Conclusions and Future Work

Bibliography
List of Figures

(a) Tsukuba left image. (b) Synthetic alteration of the Tsukuba right image by increasing the intensity. (c) Depth map computed using multi-scale Belief Propagation. (d) Depth computed using 9 x 9 correlation window.
Toy stereo image pair from CMU VASC image database: (a) Left. (b) Right. (c) Depth computed using graph cut. (d) Depth computed using 9 x 9 correlation window.
All the experts used in the algorithm. A black dot marks the center of the matching window.
For depth discontinuity regions, the accuracy of depth estimates depends on the matching position of the correlation window. In this example, window A is better than B.
Tsukuba color image (a) and the depth map computed using 7 x 7 NCC (b). Typical errors in NCC-based stereo matching.
(a) Probability of depth error as a function of distance to the nearest foreground object. (b) Probability of depth error as a function of texture strength. (c) Probability of estimating true depth for 36 experts on the textured and textureless foreground. (d) Probability of estimating true depth for 36 experts at depth discontinuity regions.
Disparity map using 9 x 9 correlation window and pixels A, B, C from three typical regions. (a) Matching scores at different disparity levels.
The depth map for a scene of two objects: foreground (white rectangle) and background (gray rectangle). The shaded rectangle A is a background region close to the foreground. (a) Depth map with fattening effect where A has the foreground depth. (b) True map for the left view where A has the background depth.
The texture and structure attributes around a pixel.
(a) The marginal likelihood density of the 3 x 3 scale texture strength evaluated on Middlebury stereo data. The vertical axis labels the probability density and the horizontal axis labels the texture strength. The vertical dashed lines indicate the positions of the bin boundaries, which are adaptively chosen. (b) The posterior probability distribution based on the adaptively chosen bins. The expectation of the posterior entropy rapidly reaches an asymptotic value as a function of the number of bins.
Graphical model for stereo. (a) Traditional MRF model. (b) MRF model in this paper.
Illustration of how to compute the proposal probability ratio for one super-pixel.
(a) Color segmentation using mean-shift. (b) Depth segmentation based on median filtered SSD depth map. (c) Joint color and depth segmentation.
Computation of likelihood and smoothness change for one super-pixel in the segmentation-based approach.
The learned matching behavior for 7 x 7 correlation window. (a)(c)(d) Fattening effect. (b) Occlusion.
Dense disparity map for the "Tsukuba", "Sawtooth", "Venus" and "Map" images. (a) Left image. (b) Ground truth. (c) Our method (pixel-based). (d) Our method (segmentation-based).
Intermediate results on Tsukuba data at different iterations.
Comparisons of the disparity maps for the "Tsukuba", "Sawtooth", "Venus" and "Map" images using 7 x 7 NCC matching cost as the likelihood. (a) Our method. (b) Belief propagation. (c) Graph cut.
Dense disparity maps for the "Teddy" and "Cones" images. (a) Left image. (b) Ground truth. (c) Our result.
Comparisons of the disparity maps for the "face" stereo pair. (a) Left image. (b) Right image. (c) Initial depth from 7 x 7 correlation window. (d) Belief propagation result. (e) Graph cut result. (f) Our result.
Energy of estimated depth map and ground truth. (a) Tsukuba. (b) Venus. (c) Teddy. (d) Cones.
Overview of our video super-resolution approach.
The ROC curves of primitive training data (a) and component training data (b) at different sizes. X-axis is match error and Y-axis is hit-rate.
The prediction ROC curves of primitive training data (a) and component training data (b) at different sizes. X-axis is match error and Y-axis is hit-rate.
The ROC curves for the scene-specific dictionary and the general dictionary, measuring sufficiency (a) and predictability (b). The specific dictionary outperforms the general dictionary.
Graphical model for super-resolution. (a) Image hallucination. (b)
Comparison of video super-resolution results. Top: the original adjacent low resolution frames. (a)(b) Independent super-resolution of each frame. (c)(d) Super-resolution with temporal smoothing.
Training phase of the algorithm.
Select the training frame using the relative blurriness measure.
Super-resolution results for frames 8 and 87 from the plant video sequence. The input video has resolution 240x160. Top: bi-cubic interpolation results (720x480). Bottom: results using the customized dictionary plus temporal constraint (720x480).
Super-resolution results for frames 12 and 78 from the face video sequence. The input video has resolution 240x160. Top: bi-cubic interpolation results (720x480). Bottom: results using the scene-specific dictionary plus temporal constraint (720x480).
Super-resolution results for frames 9 and 121 from the keyboard video sequence. The input video has resolution 160x120. Top: bi-cubic interpolation results (640x480). Bottom: results using the customized dictionary plus temporal constraint (640x480).
Super-resolution results for frames 56 and 73 from the MPEG-4 encoded video sequence. The input video has resolution 352x288. (a)(b) Low resolution frame 117 x 96. (c)(d) Bicubic interpolation to 352 x 288. (e)(f) Super-resolution using our approach.
RMS errors for the first 20 frames of the testing video sequences: (a) "plant" (b) "face" (c) "keyboard".
Features for crowd counting: (a) one frame from the videos, (b) foreground mask image, (c) edge map, (d) the edge map after the 'AND' operation between (b) and (c).
The same person has different projected height in the image when translating on the ground plane.
(a) Density estimation using homography. (b) ROI in the image. (c) Density map.
Three layer neural network architecture. The input is the normalized blob and edge orientation histograms. The output is the crowdedness measure.
Model selection: the cross validation errors for different numbers of
Crowd counting results from site A. (a) 30 degree sequences. (b) 70 degree sequences.
Crowd counting results for the sequence from site B.
Architecture of our Convolutional Neural Networks for human detection.
Multi-scale detector.
Some example images from the MIT database.
Some example images from the INRIA database.
CNN performance on the MIT database with different scales. (a) Detection rate. (b) False alarm rate.
CNN performance on the INRIA database with different scales. (a) Detection rate. (b) False alarm rate.
Crowd counting results for the video sequence from Beijing, China. (a)(b) Initial detection results for two frames. (c)(d) Results after bootstrapping the CNN using 'hard examples'.
Crowd counting results for Bookstore, UCSC. (a) Initial detection results for one frame. (b) Results after bootstrapping the CNN using 'hard examples'.

List of Tables
4.1 Performance comparisons using NCC matching cost
4.2 Performance of the proposed method for the new testbed images
12.1 Confusion matrix for MIT testing data of size 16 x 32
12.2 Confusion matrix for INRIA testing data of size 16 x 32
Abstract

Learning-based Approach for Vision Problems

by

Dan Kong
Learning-based techniques have seen more and more successful application in computer vision, and "learning for vision" is viewed as the next challenging frontier for computer vision. Technical challenges in applying learning-based methods in vision include picking the appropriate representation, model generalization and model complexity. This dissertation investigates different vision problems together with the learning algorithms proposed for them. In particular, three vision problems are studied, from low level to high level: stereo, super-resolution and human detection.

In the first part, we present a learning-based approach [73, 74] to address the visual correspondence problem when the stereo images have different intensity levels. The algorithm first learns the matching behaviors of multiple local-window methods (called experts) using a simple histogram-based method. The learned behaviors are then integrated into a MAP-MRF depth estimation framework, and the Metropolis-Hastings algorithm is used to find the MAP solution. Segmentation is also used to accelerate the computation and improve the performance. Qualitative and quantitative experimental results are presented, which demonstrate that, for stereo image pairs having different intensity levels, the proposed algorithm significantly outperforms the state-of-the-art methods.

Using prior knowledge can significantly improve the performance of low-level image processing and vision problems. In the second part, we propose a learning-based approach [72, 71] for video super-resolution. The approach extends the previous primal sketch image hallucination method by learning scene-specific priors from examples. This is achieved by constructing training examples from high resolution images captured by a still camera and using them to enhance the low resolution videos. As a result, information from cameras with different spatio-temporal resolutions is combined in our framework. In addition, we use a conditional random field (CRF) to enforce a smoothness constraint between adjacent super-resolved frames, and video super-resolution is posed as finding the high resolution video that maximizes the conditional probability. Extensive experimental results demonstrate that our approach can produce high quality super-resolution videos.

In the third part, we explore the problem of human detection and counting using supervised learning [70, 69]. We first propose a solution based on background subtraction and edge detection. A three layer neural network is trained with a novel feature representation and used for online human counting. Since the neural network approach works by first segmenting the foreground region, it cannot count static people. To solve this problem, we further propose a detection-based approach using a Convolutional Neural Network (CNN). This approach applies the detector at every scale and position of the image and collects the total positive responses. Experimental results show that the CNN works extremely well for videos where the resolution of humans is very low.
To my family and Qclick
Acknowledgments

First, I would like to thank my advisor, Professor Hai Tao, for his continuous support, guidance, encouragement and understanding during my journey of doctoral study and research. The research methodology I learned from you will benefit me forever.

Second, I want to show my deep gratitude to my committee members, Professors Roberto Manduchi and James Davis, for their valuable advice and feedback. I also thank Mei Han from NEC Research Lab America and Jian Sun from Microsoft Research Asia for their help and guidance during my internships.

Third, I want to thank my friends in the U.S., including, but not limited to, Feng Tang, Qi Zhao, Dan Yuan, Xiaoye Lu, Feng Wang, Deyan Liu, Zuobing Xu, Yi Zhang, and Xianren Wu. Their friendship has indeed made my graduate life in the U.S. very enjoyable.

Finally, and most important of all, I want to thank my parents and my wife Shuang for their never-ending love and faith in me throughout the years. I love you forever.
Part I

Learning-based Stereo
1 Introduction

Stereo is one of the most extensively studied problems in computer vision. Stereo matching computes a dense disparity or depth map from a pair of images under a known camera configuration. In general, the scene is assumed Lambertian, or intensity-consistent from different view points, without reflection and transparency. The known camera parameters provide an epipolar geometry constraint for stereo matching. Although various methods have been proposed to solve stereo matching, it remains one of the most difficult vision problems, due to the following reasons. First, there are always light variations, image blurring and sensor noise during image formation. Second, the intensity consistency constraint is useless in textureless regions and for scenes with repetitive patterns. Third, object boundaries should be well preserved in the recovered depth map. Fourth, occluded pixels in one view should not be matched with pixels in the other view. Overall, significant progress has been made in several areas, including new techniques for window- and feature-based matching, global optimization methods based on Markov Random Field (MRF) theory, scanline-based dynamic programming, methods for occlusion handling, and segmentation-based stereo matching. We will review and categorize existing stereo algorithms in the next chapter.
1.2 Foundations of Stereo
Computational stereo refers to the problem of determining the three-dimensional structure of a scene from two or more images taken from different viewpoints. The fundamental basis for stereo is the fact that a single 3D physical location projects to a unique pair of image locations in two observing camera images; if it is possible to locate the image locations that correspond to the same physical point in 3D space, then it is possible to determine its 3D location. Thus, the three core problems that need to be solved in stereo are calibration, correspondence, and triangulation.
Figure 1.1: The geometry of nonverged stereo.
1.2.1 Calibration
Calibration is the process of estimating the external and internal camera geometry parameters. The external parameters determine the relative positions and orientations of each camera, while the internal parameters include focal lengths, optical centers and lens distortions. Accurate estimates of these parameters are necessary in order to relate image information to the external world coordinate system. The calibration problem is well studied at this point and high quality toolkits are available online (e.g., [1]). For good discussions of recent work on camera calibration, see [49]. From now on, we assume the cameras have been calibrated.
Consider now the camera configuration shown in Fig. 1.1. We define the baseline of the stereo pair to be the line segment connecting the optical centers O_L and O_R. For the nonverged geometry depicted in Fig. 1.1, both cameras' coordinate axes are aligned and the baseline is parallel to the camera x coordinate axis. Under such a configuration, a point in space projects to two locations on the same scanline in the left and right camera images. We call the displacement of a projected point in one image with respect to the other the disparity. The set of all disparities between two images is called a disparity map. From this definition, it is clear that disparities can only be computed for points visible in both images; features visible in one image but not the other are said to be occluded. How to handle occluded pixels is one of the key problems in computational stereo.
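For this nonverged geometry, disparity is inversely proportional to depth: Z = fT/d, a relation derived by similar triangles in Section 1.2.3. The conversion can be sketched as follows; the focal length and baseline values in the usage example are illustrative, not taken from any particular camera.

```python
import numpy as np

def depth_from_disparity(disparity, focal_length, baseline):
    """Convert a disparity map to a depth map for a nonverged (rectified)
    stereo pair: Z = f * T / d.  Zero disparities (unmatched or infinitely
    distant points) are mapped to infinity."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0
    depth[valid] = focal_length * baseline / disparity[valid]
    return depth

# A point 10 m away seen by cameras with f = 500 px and T = 0.1 m
# produces a disparity of f*T/Z = 5 px.
d = depth_from_disparity([[5.0, 0.0]], focal_length=500.0, baseline=0.1)
```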
1.2.2 Correspondence
In practice, we are given two images, and from the information contained in the images, we must compute disparities. The correspondence problem consists of determining the locations in each camera image that are the projection of the same physical point in the scene. Thus, accurately solving the correspondence problem is the key to accurately solving the stereo problem. No general solution to the correspondence problem exists, due to ambiguous matches caused by occlusion, specularities or textureless regions. Thus, a variety of constraints and assumptions are exploited to make the problem tractable. We will discuss some of the constraints commonly used in stereo algorithms in the next section.
1.2.3 Triangulation

The triangulation problem consists of determining three-dimensional structure from a disparity map, based on the known camera geometry. The depth of a point P in space imaged by two cameras with optical centers O_L and O_R is defined by intersecting the rays from the optical centers through their respective image observations p and p'. Given the baseline T between O_L and O_R and the focal length f of the cameras, the depth Z at a given point can be computed by similar triangles as Z = fT/d, where d is the disparity at that point (see Fig. 1.1).
1.3 Constraints

The first constraint that is widely used in stereo is the color constancy constraint. The fundamental hypothesis behind stereo correspondence is that the appearance of any sufficiently small region in the world changes little from image to image. In general, "appearance" might emphasize higher-level descriptors over raw intensity values, but in its strongest sense, this hypothesis means that the color of any world point remains constant from image to image. In other words, if image points p and p' are both images of the same world point P, then the color values at
Figure 1.2: Rectification can make epipolar lines parallel to scanlines and reduce the epipolar constraint to a 1D search.
p and p’ are equal This color constancy or brightness constancy (in the case of grayscale images) hypothesis is in face true with identical cameras if all visible surfaces inthe world are perfectly diffuse or Lambertian In practice, given photometric camera
calibration and typical scenes, color constancy holds well enough to justify its use by
most algorithms for stereo correspondence
The geometry of the stereo imaging process also significantly prunes the set of possible correspondences, from lying potentially anywhere within the 2D image to lying necessarily somewhere along a 1D line embedded in that image, commonly referred to as the epipolar constraint. Fig. 1.2 shows the imaging geometry for two cameras with optical centers O_L and O_R. A point P in the scene is imaged by the left and right cameras respectively as points p and p'. The baseline T and the optical rays O_L to P and O_R to P define the plane of projection for the point P, called the epipolar plane. This epipolar plane intersects the image planes in lines called epipolar lines. The epipolar line through a point p' is the image of the opposite ray, O_L to P, through point p. The point at which an image's epipolar lines intersect the baseline is called the epipole, and this point corresponds to the image of the opposite camera's optical center as imaged by the corresponding camera. Given this unique geometry, the corresponding point p' of any point p may be found along its respective epipolar line.
In practice, it is difficult to build stereo systems with nonverged geometry. However, by rectifying the images such that corresponding epipolar lines lie along horizontal scanlines, the two-dimensional correspondence search problem is reduced to a scanline search, greatly reducing both computational complexity and the likelihood of false matches. Even so, color constancy and the epipolar constraint alone cannot uniquely solve the correspondence problem. Thus, other constraints are needed to reconstruct a meaningful three-dimensional structure. Marr and Poggio [87] proposed two additional rules to guide the stereo correspondence: uniqueness, which states that "each item
from each image may be assigned at most one disparity value", and continuity, which states that "disparity varies smoothly almost everywhere." These two rules further disambiguate the correspondence problem. Together with color constancy and the epipolar constraint, uniqueness and continuity typically provide sufficient constraints to yield a reasonable solution to the stereo correspondence problem.
(ARPA). Barnard and Fischler [9] reviewed stereo research through 1981, focusing on the fundamentals of stereo reconstruction, criteria for evaluating performance, and a survey of well-known approaches at that time. Stereo continued to be a significant focus of research in the computer vision community through the 1980s. Dhond and Aggarwal [32] reviewed many stereo advances in that decade, including a wealth of new matching methods, the introduction of hierarchical processing, and the use of trinocular constraints to reduce ambiguity in stereo. By the early 1990s, stereo research had, in many ways, matured. Although some general stereo matching research continued, much of the community's focus turned to more specific problems. Many stereo techniques were developed, including early research on occlusion and transparency, active and dynamic stereo, real-time implementation, etc. Substantial progress in each of these lines of research has been made in the last decade and new trends have emerged. A more detailed review and taxonomy of stereo correspondence algorithms is given by Scharstein and Szeliski in [113], together with algorithm implementations, a test-bed and results [2].
No review can cite every paper that has been published. In the following survey, we have included what we believe to be a representative sampling of important work and current trends. In particular, we classify the existing stereo methods into the following categories and review the most representative methods in each category: pixelwise matching, window-based matching, cooperative stereo, dynamic programming, MRF-based methods, segmentation-based methods and symmetric matching.
2.1.1 Pixelwise Matching
The simplest way to apply uniqueness on top of color constancy and the epipolar constraint is to match each image pixel with the one pixel of the most similar color on the corresponding epipolar line. This naive technique should work well in an ideal, Lambertian world in which every physical point has a unique color. However, in practice, the discretized color values of digital images can cause trouble. As an extreme example, consider a binary random-dot stereo image. With pixelwise matching, there will be a true match at the correct disparity, but there will also be a fifty percent chance of a false match at any incorrect disparity. The main reason is that simply looking at a single pixel does not provide enough information to disambiguate the false matches. To make pixelwise matching insensitive to image sampling and noise, Birchfield and Tomasi [11] proposed a symmetric pixel dissimilarity matching function, which is widely used in most global optimization methods.
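The Birchfield–Tomasi measure compares a pixel against the linearly interpolated half-pixel neighborhood of its candidate match, and symmetrizes by taking the better of the two directions. A rough sketch of the idea follows; the function and variable names are ours, and boundary handling is simplified.

```python
import numpy as np

def bt_dissimilarity(IL, IR, xl, xr):
    """Sampling-insensitive dissimilarity between pixel xl of scanline IL
    and pixel xr of scanline IR, in the spirit of Birchfield-Tomasi [11]."""
    def one_sided(a, xa, b, xb):
        # Intensity of b linearly interpolated at the half-pixel offsets.
        b_minus = 0.5 * (b[xb] + b[max(xb - 1, 0)])
        b_plus = 0.5 * (b[xb] + b[min(xb + 1, len(b) - 1)])
        b_min = min(b_minus, b_plus, b[xb])
        b_max = max(b_minus, b_plus, b[xb])
        # Zero if a[xa] falls inside the interpolated intensity interval.
        return max(0.0, a[xa] - b_max, b_min - a[xa])
    return min(one_sided(IL, xl, IR, xr), one_sided(IR, xr, IL, xl))

IL = np.array([10.0, 20.0, 30.0, 40.0])
IR = np.array([10.0, 20.0, 30.0, 40.0])
d0 = bt_dissimilarity(IL, IR, 2, 2)  # identical scanlines match exactly
```

A plain absolute difference would penalize a half-pixel sampling shift, while this measure does not.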
2.1.2 Window-based Matching
By using the continuity constraint, which implies that neighboring image pixels will likely have similar disparities, we can accumulate information from neighboring points to reduce ambiguity and false matches. This is the basic idea behind window-based matching, on which many early stereo methods are based.
To properly deal with the image ambiguity problem, local and area-based methods generally use some kind of statistical correlation between color or intensity patterns in local support windows. By using local support windows, image ambiguity is reduced efficiently while the discriminative power of the similarity measure is increased. In this approach, it is implicitly assumed that all pixels in a support window come from similar depths in a scene and, therefore, that they have similar disparities. Accordingly, pixels in homogeneous regions get assigned the disparities of neighboring pixels. However, support windows that are located on depth discontinuities contain pixels from different depths, and this may result in the fattening effect.
To obtain more accurate results not only at depth discontinuities but also in homogeneous regions, an appropriate support window should be selected for each pixel adaptively. To this end, many methods have been proposed; they can be roughly divided into several categories.

Adaptive-window methods [63][17][131][132] try to find an optimal support window for each pixel by changing the size and shape of the window adaptively. Kanade and Okutomi [63] presented a method to select an appropriate window by evaluating the local variation of intensity and disparity. The shape of the support window is constrained to a rectangle, which is not appropriate for pixels near arbitrarily shaped depth boundaries. On the other hand, Boykov et al. [17] tried to choose an arbitrarily shaped connected window. In [131] and [132], Veksler found a useful range of window sizes and shapes to explore while evaluating the window cost, which works well for comparing windows of different sizes.
Multiple-window methods [40][16] select an optimum among pre-defined multiple windows, which are located at different positions with the same shape. [40] performed the correlation with nine different windows for each pixel and retained the disparity with the smallest matching cost.

Other local methods include using implicit "windows" formed by iterative nonlinear diffusion [112], and methods [139][138] that try to assign appropriate support weights while fixing the shape and size of the local support window.
Trang 31In practice, window-based techniques work fairly well within smooth, tured regions, but tend to blur across any discontinuities Moreover, they generallyperform poorly in textureless regions, but they do not specifically penalize disconti-nuities in the recovered depth map Thus, although these methods assume continuity
tex-by their use of windows, they do not directly encourage continuity in the case of
ambiguous matches
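As a concrete baseline for the methods above, a fixed-window, winner-take-all matcher can be sketched as follows; this illustrative version uses SSD with a square window, whereas practical implementations typically use NCC or rank measures and incremental window sums for speed.

```python
import numpy as np

def block_match(left, right, max_disp, radius=1):
    """Winner-take-all window matching on a rectified pair: for each left
    pixel, slide a (2r+1)x(2r+1) SSD window along the same scanline of
    the right image and keep the disparity with the smallest cost."""
    H, W = left.shape
    r = radius
    disp = np.zeros((H, W), dtype=np.int64)
    for y in range(r, H - r):
        for x in range(r, W - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1]
            best, best_d = np.inf, 0
            for d in range(0, min(max_disp, x - r) + 1):
                tgt = right[y - r:y + r + 1, x - d - r:x - d + r + 1]
                ssd = np.sum((ref - tgt) ** 2)
                if ssd < best:
                    best, best_d = ssd, d
            disp[y, x] = best_d
    return disp

# Synthetic pair: the right view is the left view shifted by 2 pixels,
# so the true disparity is 2 everywhere the data is valid.
left = np.tile(np.arange(12.0), (5, 1))
right = np.zeros_like(left)
right[:, :10] = left[:, 2:]
disp = block_match(left, right, max_disp=4)
```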
2.1.3 Cooperative Methods
We have seen that window-based methods do not support occlusions: they try to find unique disparity values for one reference image, but without checking for "collisions" in the other image. Inspired by biological nervous systems, cooperative methods directly implement the assumptions of continuity and two-way uniqueness in an iterative, locally parallel manner. These techniques [87][143] operate directly in the space of correspondences, rather than in image space, evolving a 3D volume of real weights via mutual excitation and inhibition.

With different models of smooth surfaces, cooperative algorithms use different models of excitation. Marr and Poggio [87] use a fixed, 2D excitation region for a constant-disparity surface model. Zitnick and Kanade [143] use a fixed, 3D excitation region for continuous surfaces. One problem with cooperative methods in practice is that object boundaries may be rounded or blurred due to the use of a fixed window for excitation. In addition, good initialization of the 3D matching volume is crucial for fast convergence of cooperative methods.
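The excitation/inhibition iteration can be caricatured as follows. This is a simplified sketch in the spirit of Marr and Poggio's update, not the exact rule from [87]; the threshold and the inhibition weight are arbitrary illustrative choices, and two-way uniqueness is approximated by per-pixel inhibition only.

```python
import numpy as np

def cooperative_step(C, C0, excite_radius=1, epsilon=2.0):
    """One iteration on a binary match volume C[y, x, d]: spatial support
    is summed over a neighbourhood at the same disparity (excitation),
    while competing disparities at the same pixel inhibit each other."""
    H, W, D = C.shape
    r = excite_radius
    # Spatial excitation: box sum at constant disparity.
    pad = np.pad(C, ((r, r), (r, r), (0, 0)))
    excite = np.zeros_like(C)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            excite += pad[dy:dy + H, dx:dx + W, :]
    # Inhibition: total support of the other disparities at each pixel.
    inhibit = C.sum(axis=2, keepdims=True) - C
    activation = excite - epsilon * inhibit + C0
    # Threshold nonlinearity keeps the volume binary.
    return (activation > 0.5 * (2 * r + 1) ** 2).astype(C.dtype)

# Initial matches that all agree on disparity 0 reinforce each other,
# while the empty competing disparity planes stay suppressed.
C0 = np.zeros((4, 4, 3))
C0[:, :, 0] = 1.0
C1 = cooperative_step(C0, C0)
```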
2.1.4 Dynamic Programming
Based on the ordering constraint, dynamic programming can find the global minimum for independent scanlines in polynomial time. These approaches work by computing the minimum-cost path through the matrix of all pairwise matching costs between two corresponding scanlines. Partial occlusion is handled explicitly by assigning a group of pixels in one image to a single pixel in the other image.
Geiger et al. [43] and Ishikawa and Geiger [58] derived an occlusion process and a disparity field from a matching process. Assuming an "ordering constraint" and a "uniqueness constraint", the matching process is transformed into a path-finding problem where the global optimum is obtained by dynamic programming. Belhumeur [10] defined a set of priors from simple scenes to complex scenes. A simplified relationship between disparity and occlusion is used to solve scanline matching by dynamic programming. Unlike Geiger and Belhumeur, who enforced a piecewise-smooth constraint, Cox et al. [27] and Bobick and Intille [16] did not require the smoothing prior. Assuming corresponding features are normally distributed and a fixed cost for occlusion, Cox proposed a dynamic programming solution using only the occlusion and ordering constraints. Bobick and Intille incorporated the Ground Control Points (GCP) constraint to reduce the sensitivity to occlusion cost and the computational complexity of Cox's method.
Problems with dynamic programming stereo include the selection of the right cost for occluded pixels and the difficulty of enforcing inter-scanline consistency. Another problem is that the dynamic programming approach requires enforcing the ordering constraint. This constraint requires that the relative ordering of pixels on a scanline remain the same between the two views, which may not be the case in scenes containing narrow foreground objects.
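The scanline alignment can be sketched as a standard edit-distance style recurrence, in the spirit of Cox's formulation; the squared-difference cost and the occlusion penalty value are illustrative choices.

```python
import numpy as np

def dp_scanline(left, right, occlusion_cost=10.0):
    """Dynamic-programming matching of two rectified scanlines: cell
    (i, j) holds the minimum cost of aligning left[:i] with right[:j];
    a pixel is either matched (diagonal move, squared-difference cost)
    or declared occluded (fixed cost)."""
    n, m = len(left), len(right)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = occlusion_cost * np.arange(m + 1)
    D[:, 0] = occlusion_cost * np.arange(n + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = D[i - 1, j - 1] + (left[i - 1] - right[j - 1]) ** 2
            occl_l = D[i - 1, j] + occlusion_cost   # left pixel unmatched
            occl_r = D[i, j - 1] + occlusion_cost   # right pixel unmatched
            D[i, j] = min(match, occl_l, occl_r)
    return D[n, m]  # backtracking through D would recover the disparities
```

Identical scanlines align for free, while a pixel missing from one view is absorbed by a single occlusion move rather than a bad match.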
2.1.5 MRF-based Methods
Markov Random Fields (MRFs) are a powerful tool for modeling spatial interaction. Bayesian stereo matching can be formulated as a maximum a posteriori MRF (MAP-MRF) problem. There are several methods for solving the MAP-MRF problem: simulated annealing, mean-field annealing, the Graduated Non-Convexity (GNC) algorithm, and variational approximation. Finding a solution by simulated annealing can take an unacceptably long time, although global optimization is achievable in theory. Mean-field annealing is a deterministic approximation to simulated annealing that averages over the statistics of the annealing process. GNC can only be applied to some special energy functions, and variational approximation converges only to a local minimum. Recently, the Graph Cut (GC) method [16][66] has been proposed, based on the max-flow algorithm from graph theory. It is a fast, efficient algorithm for finding a strong local minimum of a MAP-MRF whose energy function is the Potts or Generalized Potts model. The absence of an efficient stochastic computing method
has made probabilistic models less attractive. In [121], a probabilistic stereo model is proposed and solved by a Bayesian belief propagation algorithm; an accelerated version has since been developed.
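For reference, the energy that these MAP-MRF methods minimize generally takes the form below. The exact data term and smoothness term vary from paper to paper; this generic form is only a summary of the common structure:

$$E(d) \;=\; \sum_{p} D_p(d_p) \;+\; \sum_{(p,q)\in\mathcal{N}} V(d_p, d_q), \qquad V(d_p, d_q) \;=\; \lambda_{pq}\,[\,d_p \neq d_q\,],$$

where $d_p$ is the disparity at pixel $p$, $D_p$ the matching (likelihood) cost, $\mathcal{N}$ the set of neighboring pixel pairs, and $[\cdot]$ the indicator function. A single constant $\lambda_{pq} = \lambda$ gives the Potts model; letting $\lambda_{pq}$ vary per edge (for example with the local intensity gradient) gives the Generalized Potts model.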
2.1.6 Segmentation-based Methods

Segmentation-based methods assume that there are no large disparity discontinuities inside homogeneous color segments. The
main idea is that if a disparity hypothesis is correct, warping the reference image to the other view according to that disparity will render an image that matches the real view. The stereo matching problem is therefore solved by minimizing a global image similarity energy. Hong and Chen [54] proposed segment-based stereo matching using graph cuts. In their approach, the reference image is divided into non-overlapping homogeneous segments and the scene structure is represented as a set of planes in disparity space. The stereo matching problem is formulated as an energy minimization problem in the segment domain instead of the traditional pixel domain. To improve boundary localization, Zhang and Kambhamettu [141] use a variable 3D excitation region that depends on an initial color segmentation of the input images. Occlusion is not explicitly modeled in [54] or [141], so occlusions are hard to identify; both methods instead rely on a robust error criterion before global matching or region growing.
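The warping test at the heart of these methods can be sketched as follows. This is a simplified illustration, not the energy used in [54] or [141]: it uses absolute intensity error and ignores visibility reasoning.

```python
import numpy as np

def warp_cost(ref, other, disparity):
    """Photometric cost of a disparity hypothesis.

    Warps each reference pixel (y, x) to (y, x - d) in the other view and
    accumulates the absolute intensity error; a correct disparity field
    should render an image that matches the real view, giving a low cost.
    """
    h, w = ref.shape
    cost = 0.0
    for y in range(h):
        for x in range(w):
            xs = x - int(disparity[y, x])
            if 0 <= xs < w:  # pixels warped outside the view are skipped
                cost += abs(float(ref[y, x]) - float(other[y, xs]))
    return cost

# Toy check: 'other' shows the same content shifted by 2 pixels, so a
# constant disparity of 2 explains the pair perfectly.
ref = np.arange(16.0).reshape(2, 8)
other = np.roll(ref, -2, axis=1)
print(warp_cost(ref, other, np.full((2, 8), 2)))  # 0.0
```

A segment-based method evaluates such a cost per segment-plane hypothesis rather than per pixel, which is what makes the segment-domain energy minimization tractable.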
2.1.7 Symmetric Matching
Occlusion is one of the major challenges in stereo. For a two-frame stereo system, a point in one image is occluded if its corresponding point is invisible in the other image. Computing occlusion is ambiguous, so prior constraints need to be imposed; ordering and uniqueness are the two constraints typically used. To deal with occlusions, researchers have also formulated stereo matching using both the left and right images symmetrically [120][31]. Jian et al. [120] relaxed the uniqueness constraint to a weaker visibility constraint, so that the sampling problem of the uniqueness constraint is avoided when the scene contains horizontally slanted planes. In [31], a novel patch-based stereo algorithm cuts the segments of one image using the segments of the other and handles occluded areas explicitly; a symmetric graph-cuts optimization framework is used to find correspondence and occlusion simultaneously.
2.2 Motivation
2.2.1 Observation I
The visual correspondence problem is to compute the pairs of pixels from two images that arise from the same scene point. Most state-of-the-art stereo methods compute the likelihood using single-pixel dissimilarity, based on the assumption that corresponding pixels in the two images have identical intensity values. Another advantage of single-pixel matching is that it avoids the fattening effect at depth boundaries. However, this assumption holds only when the surfaces in the scene are Lambertian and the mapping from reflectance to intensity captured by the camera (e.g., camera gain and bias) is identical across views. When the constant brightness assumption is violated, for example in the presence of non-Lambertian reflectance or different camera gains or biases, corresponding scene elements in different images can be poorly correlated, leading to incorrect depth results. To show this, we use the Tsukuba image pair [2] as an example and slightly increase the brightness of the right image. We then apply the belief propagation (BP) algorithm implemented in [37] to compute the depth; the result is shown in Fig. 2.1(c). As a comparison, we also compute a depth map using 9 x 9 normalized cross-correlation (NCC). It can be observed from Fig. 2.1(d) that NCC generates much better results. Another example is shown in Fig. 2.2, comparing the graph cut method [18] with normalized cross-correlation. The stereo pair is taken from the CMU VASC image database [3], and the original stereo images have different overall intensity levels in the two views. We can see that the traditional graph cuts algorithm gives very poor results compared with NCC. The main reason that belief propagation and graph cut methods do not perform well in the above two examples is that, when the constant brightness assumption is violated due to imaging noise or different camera gains and biases, finding correspondences from single-pixel evidence is not reliable. The problem we identify here was also observed in [64], where the proposed solution integrated mutual information into an energy minimization framework, with the energy minimized by the graph cut method. It should be noted that belief propagation and graph cut are two successful global optimization methods widely used in computer vision; for the above two image pairs, both perform better if NCC scores are used as the likelihood term. Our main argument is that single-pixel matching scores for stereo images with different intensity levels do not yield satisfactory results, even with sophisticated optimization methods.
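To make the contrast concrete, a minimal sketch (with grayscale patches represented as NumPy arrays) shows why NCC is invariant to an affine intensity change between views while a single-pixel difference is not:

```python
import numpy as np

def ncc(a, b):
    # Normalized cross-correlation of two equal-sized patches:
    # subtract each patch's mean, then correlate and normalize.
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))

rng = np.random.default_rng(0)
left = rng.random((9, 9))          # a 9 x 9 patch from the left image
right = 1.5 * left + 0.2           # same patch, different gain and bias

print(np.abs(left - right).sum())  # single-pixel SAD cost: large
print(ncc(left, right))            # ~1.0: NCC cancels gain and bias
```

Mean subtraction removes the bias term and the normalization removes the gain, which is exactly the failure mode of the single-pixel likelihoods discussed above.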
Figure 2.1: (a) Tsukuba left image. (b) Synthetic alteration of the Tsukuba right image by increasing the intensity. (c) Depth map computed using multi-scale belief propagation. (d) Depth map computed using a 9 x 9 correlation window.
Figure 2.2: Toy stereo image pair from the CMU VASC image database: (a) Left image. (b) Right image. (c) Depth computed using graph cut. (d) Depth computed using a 9 x 9 correlation window.
2.2.2 Observation II
Window-based methods such as normalized cross-correlation (NCC) aggregate support over local image regions and are robust against intensity changes. However, window-based methods suffer from the well-known limitations of poor performance at depth discontinuities and in low-texture regions. Single-pixel matching, on the other hand, can produce accurate depth boundaries but fails when the image intensity levels differ between the two views. The question is how to overcome both problems: handling intensity changes while at the same time preserving depth discontinuities and producing accurate results in textureless regions. In [5], an algorithm is presented that incorporates window-based local matching into a global optimization framework to preserve discontinuities. However, to compute the local matching efficiently using graph cuts, [5] assumes that local windows contain at most two disparities, which is limiting in practice.
Inspired by recent research on learning Markov random field (MRF) priors [25][140][108][109], we propose a novel learning-based approach that learns the matching
Figure 2.3: All the experts used in the algorithm. The black dot marks the center of the matching window.
behavior of local methods (SAD, SSD, NCC) and integrates the learned knowledge into a global probabilistic framework to estimate depth. Thus, instead of learning priors, the method proposed here can be regarded as the first attempt to learn the behaviors of a family of stereo matching algorithms. We consider normalized cross-correlation with different window sizes and matching centers in this work because of its robustness against image intensity changes. Each NCC configuration is called an expert in the algorithm. In the current work, we limit the expert shape to a rectangular window with 4 scales (3 x 3, 5 x 5, 7 x 7, and 9 x 9) and 9 matching centers. There are therefore 36 experts in total, and each expert makes a local decision based on the disparity level at which the maximum matching score is obtained (winner-takes-all). Fig. 2.3 shows the experts used in the algorithm. We develop the multiple-experts approach for two reasons. First, different window sizes suit different situations.
Figure 2.4: For depth discontinuity regions, the accuracy of depth estimates depends on the matching position of the correlation window. In this example, window A is better than window B.
For example, small windows give accurate estimates in textured regions, while large windows are appropriate for low-texture areas. Second, different matching positions suit pixels on different sides of disparity discontinuities. For example, in Fig. 2.4, window A is better than window B for the pixel on the left side of the depth boundary; a poorly positioned window contributes to the phenomenon that foreground objects appear to be bigger in the depth map
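The expert bank described above can be enumerated as a simple sketch. The 4 scales and the count of 9 matching centers follow the description in the text, but the exact placement of the centers is not specified there, so the half-window offsets used below are an illustrative assumption.

```python
import numpy as np

# Expert bank: 4 window scales x 9 matching-center offsets = 36 experts.
SCALES = [3, 5, 7, 9]

def expert_bank():
    experts = []
    for w in SCALES:
        r = w // 2
        # 9 matching centers per scale: the window center plus the 8
        # positions offset by the window half-radius in each direction
        # (an assumed layout; the text only fixes the count at 9).
        for dy in (-r, 0, r):
            for dx in (-r, 0, r):
                experts.append((w, dy, dx))
    return experts

def winner_takes_all(scores):
    # scores: per-disparity NCC matching scores for one expert at one
    # pixel; the expert votes for the disparity with the maximum score.
    return int(np.argmax(scores))

experts = expert_bank()
print(len(experts))                                  # 36
print(winner_takes_all(np.array([0.1, 0.9, 0.4])))   # 1
```

Each expert's winner-takes-all vote is the "local decision" that the global probabilistic framework then learns to weigh.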