UNIVERSITY OF CALIFORNIA
SANTA CRUZ

LEARNING-BASED APPROACH FOR VISION PROBLEMS

A dissertation submitted in partial satisfaction of the
requirements for the degree of

DOCTOR OF PHILOSOPHY

in

COMPUTER ENGINEERING

by

Dan Kong

December 2006

The Dissertation of Dan Kong
is approved:

Professor Hai Tao, Chair

Professor Roberto Manduchi

Professor James Davis

Lisa C. Sloan
Vice Provost and Dean of Graduate Studies
UMI Number: 3241208

INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

UMI Microform 3241208
Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Copyright © by Dan Kong 2006
Table of Contents

List of Figures
List of Tables
Abstract
Dedication
Acknowledgments

I Learning-based Stereo

1 Introduction
   1.1 The Problem
   1.2 Foundations of Stereo
       1.2.1 Calibration
       1.2.2 Correspondence
       1.2.3 Triangulation
   1.3 Constraints

2 Related Work and Motivation
   2.1 Related Work
       2.1.1 Pixelwise Matching
       2.1.2 Window-based Matching
       2.1.3 Cooperative Methods
       2.1.4 Dynamic Programming
       2.1.5 MRF-based Methods
       2.1.6 Segmentation-based Methods
       2.1.7 Symmetric Matching
   2.2 Motivation
       2.2.1 Observation I
       2.2.2 Observation II
       2.2.3 Observation III

3 The Approach
   3.1 Representing and Learning Matching Behaviors
       3.1.1 Representing matching behaviors
       3.1.2 Learning the distribution
       3.1.3 Adaptive bin selection
   3.2 A Probabilistic Stereo Model
       3.2.1 Stereo as an MAP-MRF problem
   3.4 Using segmentation

4 Results and Discussion
   4.1 Learning
   4.2 Depth estimation

6 Primal Sketch Priors
       6.1.1 Analytical Priors
       6.1.2 Example-based Priors
   7.3 Scene-specific priors
       7.3.1 Generalization
       7.3.3 Predictability
       7.3.4 Scene-specific priors
       7.4.1 Introduction to CRF
       7.4.2 PCRF formulation
       7.4.3 Inference

Implementation and Results

III Human Detection and Counting

10 Introduction and Related Work

12 Human Counting: A Detection-based Approach
   12.1 Convolutional Neural Network (CNN)
   12.3 Results

13 Conclusions and Future Work

Bibliography
List of Figures

(a) Tsukuba left image. (b) Synthetic alteration of the Tsukuba right image by increasing the intensity. (c) Depth map computed using multi-scale Belief Propagation. (d) Depth computed using 9 x 9 correlation window.
Toy stereo image pair from CMU VASC image database: (a) Left. (b) Right. (c) Depth computed using graph cut. (d) Depth computed using 9 x 9 correlation window.
All the experts used in the algorithm. A black dot marks the center of the matching window.
For depth discontinuity regions, the accuracy of depth estimates depends on the matching position of the correlation window. In this example, window A is better than B.
Tsukuba color image (a) and the depth map computed using 7 x 7 NCC (b). Typical errors in NCC-based stereo matching.
(a) Probability of depth error as a function of distance to the nearest foreground object. (b) Probability of depth error as a function of texture strength. (c) Probability of estimating true depth for 36 experts on the textured and textureless foreground. (d) Probability of estimating true depth for 36 experts at depth discontinuity regions.
Disparity map using 9 x 9 correlation window and pixels A, B, C from three typical regions. (a) Matching scores at different disparity levels.
The depth map for a scene of two objects: foreground (white rectangle) and background (gray rectangle). The shaded rectangle A is a background region close to the foreground. (a) Depth map with fattening effect where A has the foreground depth. (b) True map for the left view where A has the background depth.
The texture and structure attributes around a pixel.
(a) The marginal likelihood density of the 3 x 3 scale texture strength evaluated on Middlebury stereo data. The vertical axis labels the probability density and the horizontal axis labels the texture strength. The vertical dashed lines indicate the positions of the bin boundaries, which are adaptively chosen. (b) The posterior probability distribution based on the adaptively chosen bins. The expectation of the posterior entropy rapidly reaches an asymptotic value as a function of the number of bins.
Graphical model for stereo. (a) Traditional MRF model. (b) MRF model in this paper.
Illustration of how to compute the proposal probability ratio for one super-pixel.
(a) Color segmentation using mean-shift. (b) Depth segmentation based on median filtered SSD depth map. (c) Joint color and depth segmentation.
Computation of likelihood and smoothness change for one super-pixel in the segmentation-based approach.
The learned matching behavior for 7 x 7 correlation window. (a)(c)(d) Fattening effect. (b) Occlusion.
Dense disparity map for the "Tsukuba", "Sawtooth", "Venus" and "Map" images. (a) Left image. (b) Ground truth. (c) Our method (pixel-based). (d) Our method (segmentation-based).
Intermediate results on Tsukuba data at different iterations.
Comparisons of the disparity maps for the "Tsukuba", "Sawtooth", "Venus" and "Map" images using 7 x 7 NCC matching cost as the likelihood. (a) Our method. (b) Belief propagation. (c) Graph cut.
Dense disparity maps for the "Teddy" and "Cones" images. (a) Left image. (b) Ground truth. (c) Our result.
Comparisons of the disparity maps for the "face" stereo pair. (a) Left image. (b) Right image. (c) Initial depth from 7 x 7 correlation window. (d) Belief propagation result. (e) Graph cut result. (f) Our result.
Energy of estimated depth map and ground truth. (a) Tsukuba. (b) Venus. (c) Teddy. (d) Cones.
Overview of our video super-resolution approach.
The ROC curves of primitive training data (a) and component training data (b) at different sizes. X-axis is match error and Y-axis is hit-rate.
The prediction ROC curves of primitive training data (a) and component training data (b) at different sizes. X-axis is match error and Y-axis is hit-rate.
The ROC curves for the scene-specific dictionary and the general dictionary, measuring sufficiency (a) and predictability (b). The specific dictionary outperforms the general dictionary.
Graphical model for super-resolution. (a) Image hallucination. (b)
Comparison of video super-resolution results. Top: the original adjacent low resolution frames. (a)(b) Independent super-resolution of each frame. (c)(d) Super-resolution with temporal smoothing.
Training phase of the algorithm.
Select the training frame using the relative blurriness measure.
Super-resolution results for frames 8 and 87 from the plant video sequence. The input video has resolution 240x160. Top: bi-cubic interpolation results (720x480). Bottom: results using the customized dictionary plus temporal constraint (720x480).
Super-resolution results for frames 12 and 78 from the face video sequence. The input video has resolution 240x160. Top: bi-cubic interpolation results (720x480). Bottom: results using the scene-specific dictionary plus temporal constraint (720x480).
Super-resolution results for frames 9 and 121 from the keyboard video sequence. The input video has resolution 160x120. Top: bi-cubic interpolation results (640x480). Bottom: results using the customized dictionary plus temporal constraint (640x480).
Super-resolution results for frames 56 and 73 from the MPEG-4 encoded video sequence. The input video has resolution 352x288. (a)(b) Low resolution frame 117 x 96. (c)(d) Bicubic interpolation to 352 x 288. (e)(f) Super-resolution using our approach.
RMS errors for the first 20 frames of the testing video sequences: (a) "plant" (b) "face" (c) "keyboard".
Features for crowd counting: (a) one frame from the videos, (b) foreground mask image, (c) edge map, (d) the edge map after the 'AND' operation between (b) and (c).
The same person has different projected height in the image when translating on the ground plane.
(a) Density estimation using homography. (b) ROI in the image. (c) Density map.
Three layer neural network architecture. The input is the normalized blob and edge orientation histograms. The output is the crowdedness measure.
Model selection: the cross validation errors for different numbers of
Crowd counting results from site A. (a) 30 degree sequences. (b) 70 degree sequences.
Crowd counting results for the sequence from site B.
Architecture of our Convolutional Neural Networks for human detection.
Multi-scale detector.
Some example images from the MIT database.
Some example images from the INRIA database.
CNN performance on the MIT database with different scales. (a) Detection rate. (b) False alarm rate.
CNN performance on the INRIA database with different scales. (a) Detection rate. (b) False alarm rate.
Crowd counting results for the video sequence from Beijing, China. (a)(b) Initial detection results for two frames. (c)(d) Results after bootstrapping the CNN using 'hard examples'.
Crowd counting results for Bookstore, UCSC. (a) Initial detection results for one frame. (b) Results after bootstrapping the CNN using 'hard examples'.

List of Tables
4.1 Performance comparisons using NCC matching cost
4.2 Performance of the proposed method for the new testbed images
12.1 Confusion matrix for MIT testing data of size 16 x 32
12.2 Confusion matrix for INRIA testing data of size 16 x 32
Abstract

Learning-based Approach for Vision Problems

by

Dan Kong
Learning-based techniques have seen more and more successful application in computer vision, and "learning for vision" is viewed as the next challenging frontier for computer vision. Technical challenges in applying learning-based methods in vision include picking the appropriate representation, model generalization and model complexity. This dissertation investigates different vision problems together with the learning algorithms proposed for them. In particular, three vision problems are studied, from low level to high level: stereo, super-resolution and human detection.

In the first part, we present a learning-based approach [73, 74] to address the visual correspondence problem when the stereo images have different intensity levels. The algorithm first learns the matching behaviors of multiple local-window methods (called experts) using a simple histogram-based method. The learned behaviors are then integrated into a MAP-MRF depth estimation framework, and the Metropolis-Hastings algorithm is used to find the MAP solution. Segmentation is also used to accelerate the computation and improve the performance. Qualitative and quantitative experimental results are presented, which demonstrate that, for stereo image pairs having different intensity levels, the proposed algorithm significantly outperforms the state-of-the-art methods.

Using prior knowledge can significantly improve the performance of low-level image processing and vision problems. In the second part, we propose a learning-based approach [72, 71] for video super-resolution. The approach extends the previous primal sketch image hallucination method by learning scene-specific priors from examples. This is achieved by constructing training examples from high resolution images captured by a still camera and using them to enhance the low resolution videos. As a result, information from cameras with different spatio-temporal resolutions is combined in our framework. In addition, we use a conditional random field (CRF) to enforce a smoothness constraint between adjacent super-resolved frames, and video super-resolution is posed as finding the high resolution video that maximizes the conditional probability. Extensive experimental results demonstrate that our approach can produce high quality super-resolution videos.

In the third part, we explore the problem of human detection and counting using supervised learning [70, 69]. We first propose a solution based on background subtraction and edge detection. A three layer neural network is trained with a novel feature representation and used for online human counting. Since the neural network approach works by first segmenting the foreground region, it cannot count static people. To solve this problem, we further propose a detection-based approach using a Convolutional Neural Network (CNN). This approach applies the detector at every scale and position of the image and collects the total positive responses. Experimental results show that the CNN works extremely well for videos where the resolution of humans is very low.
To my family and Qclick
Acknowledgments

First, I would like to thank my advisor, Professor Hai Tao, for his continuous support, guidance, encouragement and understanding during my journey of doctoral study and research. The research methodology I learned from you will benefit me forever.

Second, I want to show my deep gratitude to my committee members, Professors Roberto Manduchi and James Davis, for their valuable advice and feedback. I also thank Mei Han from NEC Research Lab America and Jian Sun from Microsoft Research Asia for their help and guidance during my internships.

Third, I want to thank my friends in the U.S., including, but not limited to, Feng Tang, Qi Zhao, Dan Yuan, Xiaoye Lu, Feng Wang, Deyan Liu, Zuobing Xu, Yi Zhang, and Xianren Wu. Their friendship has indeed made my graduate life in the U.S. very enjoyable.

Finally, and most important of all, I want to thank my parents and my wife Shuang for their never-ending love and faith in me throughout the years. I love you forever.
Part I

Learning-based Stereo
1 Introduction

Stereo is one of the most extensively studied problems in computer vision. Stereo matching computes a dense disparity or depth map from a pair of images under a known camera configuration. In general, the scene is assumed Lambertian, or intensity-consistent from different view points, without reflection and transparency. The known camera parameters provide an epipolar geometry constraint for stereo matching. Although various methods have been proposed to solve stereo matching, it remains one of the most difficult vision problems, due to the following reasons. First, there are always light variations, image blurring and sensor noise during image formation. Second, the intensity consistency constraint is useless in textureless regions and for scenes with repetitive patterns. Third, object boundaries should be well preserved in the recovered depth map. Fourth, occluded pixels in one view should not be matched with pixels in the other view. Overall, significant progress has been made in several areas, including new techniques for window- and feature-based matching, global optimization methods based on Markov Random Field (MRF) theory, scanline-based dynamic programming, methods for occlusion handling, and segmentation-based stereo matching. We will review and categorize existing stereo algorithms in the next chapter.
1.2 Foundations of Stereo
Computational stereo refers to the problem of determining the three-dimensional structure of a scene from two or more images taken from different viewpoints. The fundamental basis for stereo is the fact that a single 3D physical location projects to a unique pair of image locations in two observing camera images; if it is possible to locate the image locations that correspond to the same physical point in 3D space, then it is possible to determine its 3D location. Thus, the three core problems that need to be solved in stereo are calibration, correspondence, and triangulation.
Figure 1.1: The geometry of nonverged stereo.
1.2.1 Calibration
Calibration is the process of estimating the external and internal camera geometry parameters. The external parameters determine the relative positions and orientations of each camera, while the internal parameters include focal lengths, optical centers and lens distortions. Accurate estimates of these parameters are necessary in order to relate image information to the external world coordinate system. The calibration problem is well studied at this point and high quality toolkits are available online (e.g., [1]). For good discussions of recent work on camera calibration, see [49]. From now on, we assume the cameras have been calibrated.
Consider now the camera configuration shown in Fig. 1.1. We define the baseline of the stereo pair to be the line segment connecting the optical centers O_L and O_R. For the nonverged geometry depicted in Fig. 1.1, both cameras' coordinate axes are aligned and the baseline is parallel to the camera x coordinate axis. Under such a configuration, a point in space projects to two locations on the same scanline in the left and right camera images. We call the displacement of a projected point in one image with respect to the other the disparity. The set of all disparities between two images is called a disparity map. From this definition, it is clear that disparities can only be computed for points visible in both images; features visible in one image but not the other are said to be occluded. How to handle occluded pixels is one of the key problems in computational stereo.
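For this nonverged geometry, disparity is inversely proportional to depth: Z = fT/d, a relation derived by similar triangles in Section 1.2.3. The conversion can be sketched as follows; the focal length and baseline values in the usage example are illustrative, not taken from any particular camera.

```python
import numpy as np

def depth_from_disparity(disparity, focal_length, baseline):
    """Convert a disparity map to a depth map for a nonverged (rectified)
    stereo pair: Z = f * T / d.  Zero disparities (unmatched or infinitely
    distant points) are mapped to infinity."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0
    depth[valid] = focal_length * baseline / disparity[valid]
    return depth

# A point 10 m away seen by cameras with f = 500 px and T = 0.1 m
# produces a disparity of f*T/Z = 5 px.
d = depth_from_disparity([[5.0, 0.0]], focal_length=500.0, baseline=0.1)
```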
1.2.2 Correspondence
In practice, we are given two images, and from the information contained in the images, we must compute disparities. The correspondence problem consists of determining the locations in each camera image that are the projection of the same physical point in the scene. Thus, accurately solving the correspondence problem is the key to accurately solving the stereo problem. No general solution to the correspondence problem exists, due to ambiguous matches caused by occlusion, specularities or textureless regions. Thus, a variety of constraints and assumptions are exploited to make the problem tractable. We will discuss some of the constraints commonly used in stereo algorithms in the next section.
1.2.3 Triangulation

The triangulation problem consists of determining three-dimensional structure from a disparity map, based on the known camera geometry. The depth of a point P in space imaged by two cameras with optical centers O_L and O_R is defined by intersecting the rays from the optical centers through their respective image observations p and p'. Given the baseline T between O_L and O_R and the focal length f of the cameras, the depth Z at a given point can be computed by similar triangles as Z = fT/d, where d is the disparity at that point (see Fig. 1.1).
1.3 Constraints

The first constraint that is widely used in stereo is the color constancy constraint. The fundamental hypothesis behind stereo correspondence is that the appearance of any sufficiently small region in the world changes little from image to image. In general, "appearance" might emphasize higher-level descriptors over raw intensity values, but in its strongest sense, this hypothesis means that the color of any world point remains constant from image to image. In other words, if image points p and p' are both images of the same world point P, then the color values at
Figure 1.2: Rectification can make epipolar lines parallel to scanlines and reduce the epipolar constraint to a 1D search.
p and p’ are equal This color constancy or brightness constancy (in the case of grayscale images) hypothesis is in face true with identical cameras if all visible surfaces inthe world are perfectly diffuse or Lambertian In practice, given photometric camera
calibration and typical scenes, color constancy holds well enough to justify its use by
most algorithms for stereo correspondence
The geometry of the stereo imaging process also significantly prunes the set of possible correspondences, from lying potentially anywhere within the 2D image to lying necessarily somewhere along a 1D line embedded in that image, commonly referred to as the epipolar constraint. Fig. 1.2 shows the imaging geometry for two cameras with optical centers O_L and O_R. A point P in the scene is imaged by the left and right cameras respectively as points p and p'. The baseline T and the optical rays O_L to P and O_R to P define the plane of projection for the point P, called the epipolar plane. This epipolar plane intersects the image planes in lines called epipolar lines. The epipolar line through a point p' is the image of the opposite ray, O_L to P, through point p. The point at which an image's epipolar lines intersect the baseline is called the epipole, and this point corresponds to the image of the opposite camera's optical center as imaged by the corresponding camera. Given this unique geometry, the corresponding point p' of any point p may be found along its respective epipolar line.
In practice, it is difficult to build stereo systems with nonverged geometry. However, by rectifying the images such that corresponding epipolar lines lie along horizontal scanlines, the two-dimensional correspondence search problem is reduced to a scanline search, greatly reducing both computational complexity and the likelihood of false matches. Even so, color constancy and the epipolar constraint alone cannot uniquely solve the correspondence problem. Thus, other constraints are needed to reconstruct a meaningful three-dimensional structure. Marr and Poggio [87] proposed two additional rules to guide the stereo correspondence: uniqueness, which states that "each item
from each image may be assigned at most one disparity value", and continuity, which states that "disparity varies smoothly almost everywhere." These two rules further disambiguate the correspondence problem. Together with color constancy and the epipolar constraint, uniqueness and continuity typically provide sufficient constraints to yield a reasonable solution to the stereo correspondence problem.
(ARPA). Barnard and Fischler [9] reviewed stereo research through 1981, focusing on the fundamentals of stereo reconstruction, criteria for evaluating performance, and a survey of well-known approaches at that time. Stereo continued to be a significant focus of research in the computer vision community through the 1980s. Dhond and Aggarwal [32] reviewed many stereo advances in that decade, including a wealth of new matching methods, the introduction of hierarchical processing, and the use of trinocular constraints to reduce ambiguity in stereo. By the early 1990s, stereo research had, in many ways, matured. Although some general stereo matching research continued, much of the community's focus turned to more specific problems. Many stereo techniques were developed, including early research on occlusion and transparency, active and dynamic stereo, real-time implementation, etc. Substantial progress in each of these lines of research has been made in the last decade and new trends have emerged. A more detailed review and taxonomy of stereo correspondence algorithms is given by Scharstein and Szeliski in [113], together with algorithm implementations, a test-bed and results [2].
No review can cite every paper that has been published. In the following survey, we have included what we believe to be a representative sampling of important work and current trends. In particular, we classify the existing stereo methods into the following categories and review the most representative methods in each category: pixelwise matching, window-based matching, cooperative stereo, dynamic programming, MRF-based methods, segmentation-based methods and symmetric matching.
2.1.1 Pixelwise Matching
The simplest way to apply uniqueness on top of color constancy and the epipolar constraint is to match each image pixel with the one pixel of the most similar color on the corresponding epipolar line. This naive technique should work well in an ideal, Lambertian world in which every physical point has a unique color. However, in practice, the discretized color values of digital images can cause trouble. As an extreme example, consider a binary random-dot stereo image. With pixelwise matching, there will be a true match at the correct disparity, but there will also be a fifty percent chance of a false match at any incorrect disparity. The main reason is that simply looking at a single pixel does not provide enough information to disambiguate the false matches. To make pixelwise matching insensitive to image sampling and noise, Birchfield and Tomasi [11] proposed a symmetric pixel dissimilarity matching function, which is widely used in most global optimization methods.
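The Birchfield–Tomasi measure compares a pixel against the linearly interpolated half-pixel neighborhood of its candidate match, and symmetrizes by taking the better of the two directions. A rough sketch of the idea follows; the function and variable names are ours, and boundary handling is simplified.

```python
import numpy as np

def bt_dissimilarity(IL, IR, xl, xr):
    """Sampling-insensitive dissimilarity between pixel xl of scanline IL
    and pixel xr of scanline IR, in the spirit of Birchfield-Tomasi [11]."""
    def one_sided(a, xa, b, xb):
        # Intensity of b linearly interpolated at the half-pixel offsets.
        b_minus = 0.5 * (b[xb] + b[max(xb - 1, 0)])
        b_plus = 0.5 * (b[xb] + b[min(xb + 1, len(b) - 1)])
        b_min = min(b_minus, b_plus, b[xb])
        b_max = max(b_minus, b_plus, b[xb])
        # Zero if a[xa] falls inside the interpolated intensity interval.
        return max(0.0, a[xa] - b_max, b_min - a[xa])
    return min(one_sided(IL, xl, IR, xr), one_sided(IR, xr, IL, xl))

IL = np.array([10.0, 20.0, 30.0, 40.0])
IR = np.array([10.0, 20.0, 30.0, 40.0])
d0 = bt_dissimilarity(IL, IR, 2, 2)  # identical scanlines match exactly
```

A plain absolute difference would penalize a half-pixel sampling shift, while this measure does not.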
2.1.2 Window-based Matching
By using the continuity constraint, which implies that neighboring image pixels will likely have similar disparities, we can accumulate information from neighboring points to reduce ambiguity and false matches. This is the basic idea behind window-based matching, on which many early stereo methods are based.
To properly deal with the image ambiguity problem, local and area-based methods generally use some kind of statistical correlation between color or intensity patterns in local support windows. By using local support windows, image ambiguity is reduced efficiently while the discriminative power of the similarity measure is increased. In this approach, it is implicitly assumed that all pixels in a support window come from similar depths in a scene and, therefore, that they have similar disparities. Accordingly, pixels in homogeneous regions get assigned the disparities of neighboring pixels. However, support windows that are located on depth discontinuities contain pixels from different depths, and this may result in the fattening effect.
To obtain more accurate results not only at depth discontinuities but also in homogeneous regions, an appropriate support window should be selected for each pixel adaptively. To this end, many methods have been proposed; they can be roughly divided into several categories.

Adaptive-window methods [63][17][131][132] try to find an optimal support window for each pixel by changing the size and shape of the window adaptively. Kanade and Okutomi [63] presented a method to select an appropriate window by evaluating the local variation of intensity and disparity. The shape of the support window is constrained to a rectangle, which is not appropriate for pixels near arbitrarily shaped depth boundaries. On the other hand, Boykov et al. [17] tried to choose an arbitrarily shaped connected window. In [131] and [132], Veksler found a useful range of window sizes and shapes to explore while evaluating the window cost, which works well for comparing windows of different sizes.
Multiple-window methods [40][16] select an optimum among pre-defined multiple windows, which are located at different positions with the same shape. [40] performed the correlation with nine different windows for each pixel and retained the disparity with the smallest matching cost.

Other local methods include using implicit "windows" formed by iterative nonlinear diffusion [112], and methods [139][138] that try to assign appropriate support weights while fixing the shape and size of the local support window.
Trang 31In practice, window-based techniques work fairly well within smooth, tured regions, but tend to blur across any discontinuities Moreover, they generallyperform poorly in textureless regions, but they do not specifically penalize disconti-nuities in the recovered depth map Thus, although these methods assume continuity
tex-by their use of windows, they do not directly encourage continuity in the case of
ambiguous matches
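As a concrete baseline for the methods above, a fixed-window, winner-take-all matcher can be sketched as follows; this illustrative version uses SSD with a square window, whereas practical implementations typically use NCC or rank measures and incremental window sums for speed.

```python
import numpy as np

def block_match(left, right, max_disp, radius=1):
    """Winner-take-all window matching on a rectified pair: for each left
    pixel, slide a (2r+1)x(2r+1) SSD window along the same scanline of
    the right image and keep the disparity with the smallest cost."""
    H, W = left.shape
    r = radius
    disp = np.zeros((H, W), dtype=np.int64)
    for y in range(r, H - r):
        for x in range(r, W - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1]
            best, best_d = np.inf, 0
            for d in range(0, min(max_disp, x - r) + 1):
                tgt = right[y - r:y + r + 1, x - d - r:x - d + r + 1]
                ssd = np.sum((ref - tgt) ** 2)
                if ssd < best:
                    best, best_d = ssd, d
            disp[y, x] = best_d
    return disp

# Synthetic pair: the right view is the left view shifted by 2 pixels,
# so the true disparity is 2 everywhere the data is valid.
left = np.tile(np.arange(12.0), (5, 1))
right = np.zeros_like(left)
right[:, :10] = left[:, 2:]
disp = block_match(left, right, max_disp=4)
```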
2.1.3 Cooperative Methods
We have seen that window-based methods do not support occlusions: they try to find unique disparity values for one reference image, but without checking for "collisions" in the other image. Inspired by biological nervous systems, cooperative methods directly implement the assumptions of continuity and two-way uniqueness in an iterative, locally parallel manner. These techniques [87][143] operate directly in the space of correspondences, rather than in image space, evolving a 3D volume of real weights via mutual excitation and inhibition.

With different models of smooth surfaces, cooperative algorithms use different models of excitation. Marr and Poggio [87] use a fixed, 2D excitation region for a constant-disparity surface model. Zitnick and Kanade [143] use a fixed, 3D excitation region for continuous surfaces. One problem with cooperative methods in practice is that object boundaries may be rounded or blurred due to the use of a fixed window for excitation. In addition, good initialization of the 3D matching volume is crucial for fast convergence of cooperative methods.
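The excitation/inhibition iteration can be caricatured as follows. This is a simplified sketch in the spirit of Marr and Poggio's update, not the exact rule from [87]; the threshold and the inhibition weight are arbitrary illustrative choices, and two-way uniqueness is approximated by per-pixel inhibition only.

```python
import numpy as np

def cooperative_step(C, C0, excite_radius=1, epsilon=2.0):
    """One iteration on a binary match volume C[y, x, d]: spatial support
    is summed over a neighbourhood at the same disparity (excitation),
    while competing disparities at the same pixel inhibit each other."""
    H, W, D = C.shape
    r = excite_radius
    # Spatial excitation: box sum at constant disparity.
    pad = np.pad(C, ((r, r), (r, r), (0, 0)))
    excite = np.zeros_like(C)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            excite += pad[dy:dy + H, dx:dx + W, :]
    # Inhibition: total support of the other disparities at each pixel.
    inhibit = C.sum(axis=2, keepdims=True) - C
    activation = excite - epsilon * inhibit + C0
    # Threshold nonlinearity keeps the volume binary.
    return (activation > 0.5 * (2 * r + 1) ** 2).astype(C.dtype)

# Initial matches that all agree on disparity 0 reinforce each other,
# while the empty competing disparity planes stay suppressed.
C0 = np.zeros((4, 4, 3))
C0[:, :, 0] = 1.0
C1 = cooperative_step(C0, C0)
```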
2.1.4 Dynamic Programming
Based on the ordering constraint, dynamic programming can find the global minimum for independent scanlines in polynomial time. These approaches work by computing the minimum-cost path through the matrix of all pairwise matching costs between two corresponding scanlines. Partial occlusion is handled explicitly by assigning a group of pixels in one image to a single pixel in the other image.
Geiger et al. [43] and Ishikawa and Geiger [58] derived an occlusion process and a disparity field from a matching process. Assuming an "ordering constraint" and a "uniqueness constraint", the matching process is transformed into a path-finding problem where the global optimum is obtained by dynamic programming. Belhumeur [10] defined a set of priors from simple scenes to complex scenes. A simplified relationship between disparity and occlusion is used to solve scanline matching by dynamic programming. Unlike Geiger and Belhumeur, who enforced a piecewise-smooth constraint, Cox et al. [27] and Bobick and Intille [16] did not require the smoothing prior. Assuming corresponding features are normally distributed and a fixed cost for occlusion, Cox proposed a dynamic programming solution using only the occlusion and ordering constraints. Bobick and Intille incorporated the Ground Control Points (GCP) constraint to reduce the sensitivity to occlusion cost and the computational complexity of Cox's method.
Problems with dynamic programming stereo include the selection of the right cost for occluded pixels and the difficulty of enforcing inter-scanline consistency. Another problem is that the dynamic programming approach requires enforcing the ordering constraint. This constraint requires that the relative ordering of pixels on a scanline remain the same between the two views, which may not be the case in scenes containing narrow foreground objects.
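The scanline alignment can be sketched as a standard edit-distance style recurrence, in the spirit of Cox's formulation; the squared-difference cost and the occlusion penalty value are illustrative choices.

```python
import numpy as np

def dp_scanline(left, right, occlusion_cost=10.0):
    """Dynamic-programming matching of two rectified scanlines: cell
    (i, j) holds the minimum cost of aligning left[:i] with right[:j];
    a pixel is either matched (diagonal move, squared-difference cost)
    or declared occluded (fixed cost)."""
    n, m = len(left), len(right)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = occlusion_cost * np.arange(m + 1)
    D[:, 0] = occlusion_cost * np.arange(n + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = D[i - 1, j - 1] + (left[i - 1] - right[j - 1]) ** 2
            occl_l = D[i - 1, j] + occlusion_cost   # left pixel unmatched
            occl_r = D[i, j - 1] + occlusion_cost   # right pixel unmatched
            D[i, j] = min(match, occl_l, occl_r)
    return D[n, m]  # backtracking through D would recover the disparities
```

Identical scanlines align for free, while a pixel missing from one view is absorbed by a single occlusion move rather than a bad match.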
2.1.5 MRF-based Methods
Markov Random Fields (MRFs) are a powerful tool for modeling spatial interaction. Bayesian stereo matching can be formulated as a maximum a posteriori MRF (MAP-MRF) problem. There are several methods for solving the MAP-MRF problem: simulated annealing, mean-field annealing, the Graduated Non-Convexity (GNC) algorithm, and variational approximation. Finding a solution by simulated annealing can take an unacceptably long time, although global optimization is achievable in theory. Mean-field annealing is a deterministic approximation to simulated annealing that averages over the statistics of the annealing process. GNC can only be applied to some special energy functions, and variational approximation converges only to a local minimum. Recently, the Graph Cut (GC) method [16][66] has been proposed, based on the max-flow algorithm from graph theory. It is a fast, efficient algorithm for finding a strong local minimum of a MAP-MRF whose energy function is the Potts or Generalized Potts model. The absence of an efficient stochastic computing method
has made probabilistic models less attractive. In [121], a probabilistic stereo model is proposed and solved by a Bayesian belief propagation algorithm; an accelerated version has since been developed.
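For reference, the energy that these MAP-MRF methods minimize generally takes the form below. The exact data term and smoothness term vary from paper to paper; this generic form is only a summary of the common structure:

$$E(d) \;=\; \sum_{p} D_p(d_p) \;+\; \sum_{(p,q)\in\mathcal{N}} V(d_p, d_q), \qquad V(d_p, d_q) \;=\; \lambda_{pq}\,[\,d_p \neq d_q\,],$$

where $d_p$ is the disparity at pixel $p$, $D_p$ the matching (likelihood) cost, $\mathcal{N}$ the set of neighboring pixel pairs, and $[\cdot]$ the indicator function. A single constant $\lambda_{pq} = \lambda$ gives the Potts model; letting $\lambda_{pq}$ vary per edge (for example with the local intensity gradient) gives the Generalized Potts model.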
2.1.6 Segmentation-based Methods

Segmentation-based methods assume that there are no large disparity discontinuities inside homogeneous color segments. The
main idea is that if a disparity hypothesis is correct, warping the reference image to the other view according to that disparity will render an image that matches the real view. The stereo matching problem is therefore solved by minimizing a global image similarity energy. Hong and Chen [54] proposed segment-based stereo matching using graph cuts. In their approach, the reference image is divided into non-overlapping homogeneous segments and the scene structure is represented as a set of planes in disparity space. The stereo matching problem is formulated as an energy minimization problem in the segment domain instead of the traditional pixel domain. To improve boundary localization, Zhang and Kambhamettu [141] use a variable 3D excitation region that depends on an initial color segmentation of the input images. Occlusion is not explicitly modeled in [54] or [141], so occlusions are hard to identify; both methods instead rely on a robust error criterion before global matching or region growing.
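The warping test at the heart of these methods can be sketched as follows. This is a simplified illustration, not the energy used in [54] or [141]: it uses absolute intensity error and ignores visibility reasoning.

```python
import numpy as np

def warp_cost(ref, other, disparity):
    """Photometric cost of a disparity hypothesis.

    Warps each reference pixel (y, x) to (y, x - d) in the other view and
    accumulates the absolute intensity error; a correct disparity field
    should render an image that matches the real view, giving a low cost.
    """
    h, w = ref.shape
    cost = 0.0
    for y in range(h):
        for x in range(w):
            xs = x - int(disparity[y, x])
            if 0 <= xs < w:  # pixels warped outside the view are skipped
                cost += abs(float(ref[y, x]) - float(other[y, xs]))
    return cost

# Toy check: 'other' shows the same content shifted by 2 pixels, so a
# constant disparity of 2 explains the pair perfectly.
ref = np.arange(16.0).reshape(2, 8)
other = np.roll(ref, -2, axis=1)
print(warp_cost(ref, other, np.full((2, 8), 2)))  # 0.0
```

A segment-based method evaluates such a cost per segment-plane hypothesis rather than per pixel, which is what makes the segment-domain energy minimization tractable.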
2.1.7 Symmetric Matching
Occlusion is one of the major challenges in stereo. For a two-frame stereo system, a point in one image is occluded if its corresponding point is invisible in the other image. Computing occlusion is ambiguous, so prior constraints need to be imposed; ordering and uniqueness are the two constraints typically used. To deal with occlusions, researchers have also formulated stereo matching using both the left and right images symmetrically [120][31]. Jian et al. [120] relaxed the uniqueness constraint to a weaker visibility constraint, so that the sampling problem of the uniqueness constraint is avoided when the scene contains horizontally slanted planes. In [31], a novel patch-based stereo algorithm cuts the segments of one image using the segments of the other and handles occluded areas explicitly; a symmetric graph-cuts optimization framework is used to find correspondence and occlusion simultaneously.
2.2 Motivation
2.2.1 Observation I
The visual correspondence problem is to compute the pairs of pixels from two images that arise from the same scene point. Most state-of-the-art stereo methods compute the likelihood using single-pixel dissimilarity, based on the assumption that corresponding pixels in the two images have identical intensity values. Another advantage of single-pixel matching is that it avoids the fattening effect at depth boundaries. However, this assumption holds only when the surfaces in the scene are Lambertian and the mapping from reflectance to intensity captured by the camera (e.g., camera gain and bias) is identical across views. When the constant brightness assumption is violated, for example in the presence of non-Lambertian reflectance or different camera gains or biases, corresponding scene elements in different images can be poorly correlated, leading to incorrect depth results. To show this, we use the Tsukuba image pair [2] as an example and slightly increase the brightness of the right image. We then apply the belief propagation (BP) algorithm implemented in [37] to compute the depth; the result is shown in Fig. 2.1(c). As a comparison, we also compute a depth map using 9 x 9 normalized cross-correlation (NCC). It can be observed from Fig. 2.1(d) that NCC generates much better results. Another example is shown in Fig. 2.2, comparing the graph cut method [18] with normalized cross-correlation. The stereo pair is taken from the CMU VASC image database [3], and the original stereo images have different overall intensity levels in the two views. We can see that the traditional graph cuts algorithm gives very poor results compared with NCC. The main reason that belief propagation and graph cut methods do not perform well in the above two examples is that, when the constant brightness assumption is violated due to imaging noise or different camera gains and biases, finding correspondences from single-pixel evidence is not reliable. The problem we identify here was also observed in [64], where the proposed solution integrated mutual information into an energy minimization framework, with the energy minimized by the graph cut method. It should be noted that belief propagation and graph cut are two successful global optimization methods widely used in computer vision; for the above two image pairs, both perform better if NCC scores are used as the likelihood term. Our main argument is that single-pixel matching scores for stereo images with different intensity levels do not yield satisfactory results, even with sophisticated optimization methods.
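To make the contrast concrete, a minimal sketch (with grayscale patches represented as NumPy arrays) shows why NCC is invariant to an affine intensity change between views while a single-pixel difference is not:

```python
import numpy as np

def ncc(a, b):
    # Normalized cross-correlation of two equal-sized patches:
    # subtract each patch's mean, then correlate and normalize.
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))

rng = np.random.default_rng(0)
left = rng.random((9, 9))          # a 9 x 9 patch from the left image
right = 1.5 * left + 0.2           # same patch, different gain and bias

print(np.abs(left - right).sum())  # single-pixel SAD cost: large
print(ncc(left, right))            # ~1.0: NCC cancels gain and bias
```

Mean subtraction removes the bias term and the normalization removes the gain, which is exactly the failure mode of the single-pixel likelihoods discussed above.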
Figure 2.1: (a) Tsukuba left image. (b) Synthetic alteration of the Tsukuba right image by increasing the intensity. (c) Depth map computed using multi-scale belief propagation. (d) Depth map computed using a 9 x 9 correlation window.
Figure 2.2: Toy stereo image pair from the CMU VASC image database: (a) Left image. (b) Right image. (c) Depth computed using graph cut. (d) Depth computed using a 9 x 9 correlation window.
2.2.2 Observation II
Window-based methods such as normalized cross-correlation (NCC) aggregate support over local image regions and are robust against intensity changes. However, window-based methods suffer from the well-known limitations of poor performance at depth discontinuities and in low-texture regions. Single-pixel matching, on the other hand, can produce accurate depth boundaries but fails when the image intensity levels differ between the two views. The question is how to overcome both problems: handling intensity changes while at the same time preserving depth discontinuities and producing accurate results in textureless regions. In [5], an algorithm is presented that incorporates window-based local matching into a global optimization framework to preserve discontinuities. However, to compute the local matching efficiently using graph cuts, [5] assumes that local windows contain at most two disparities, which is limiting in practice.
Inspired by recent research on learning Markov random field (MRF) priors [25][140][108][109], we propose a novel learning-based approach that learns the matching
Figure 2.3: All the experts used in the algorithm. The black dot marks the center of the matching window.
behavior of local methods (SAD, SSD, NCC) and integrates the learned knowledge into a global probabilistic framework to estimate depth. Thus, instead of learning priors, the method proposed here can be regarded as the first attempt to learn the behaviors of a family of stereo matching algorithms. We consider normalized cross-correlation with different window sizes and matching centers in this work because of its robustness against image intensity changes. Each NCC configuration is called an expert in the algorithm. In the current work, we limit the expert shape to a rectangular window with 4 scales (3 x 3, 5 x 5, 7 x 7, and 9 x 9) and 9 matching centers. There are therefore 36 experts in total, and each expert makes a local decision based on the disparity level at which the maximum matching score is obtained (winner-takes-all). Fig. 2.3 shows the experts used in the algorithm. We develop the multiple-experts approach for two reasons. First, different window sizes suit different situations.
Figure 2.4: For depth discontinuity regions, the accuracy of depth estimates depends on the matching position of the correlation window. In this example, window A is better than window B.
For example, small windows give accurate estimates in textured regions, while large windows are appropriate for low-texture areas. Second, different matching positions suit pixels on different sides of disparity discontinuities. For example, in Fig. 2.4, window A is better than window B for the pixel on the left side of the depth boundary; a poorly positioned window contributes to the phenomenon that foreground objects appear to be bigger in the depth map
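The expert bank described above can be enumerated as a simple sketch. The 4 scales and the count of 9 matching centers follow the description in the text, but the exact placement of the centers is not specified there, so the half-window offsets used below are an illustrative assumption.

```python
import numpy as np

# Expert bank: 4 window scales x 9 matching-center offsets = 36 experts.
SCALES = [3, 5, 7, 9]

def expert_bank():
    experts = []
    for w in SCALES:
        r = w // 2
        # 9 matching centers per scale: the window center plus the 8
        # positions offset by the window half-radius in each direction
        # (an assumed layout; the text only fixes the count at 9).
        for dy in (-r, 0, r):
            for dx in (-r, 0, r):
                experts.append((w, dy, dx))
    return experts

def winner_takes_all(scores):
    # scores: per-disparity NCC matching scores for one expert at one
    # pixel; the expert votes for the disparity with the maximum score.
    return int(np.argmax(scores))

experts = expert_bank()
print(len(experts))                                  # 36
print(winner_takes_all(np.array([0.1, 0.9, 0.4])))   # 1
```

Each expert's winner-takes-all vote is the "local decision" that the global probabilistic framework then learns to weigh.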