Abstract
Image refocusing is a promising research area in computer graphics and computer vision. By
definition it means focusing an image again, or changing the emphasized region in a given image.
To achieve refocusing, a shallow depth of field is required to create a focus-defocus scene,
which depends on a larger lens aperture.
In our project, we simulate a larger camera lens aperture by using several photos taken from
slightly different viewpoints. Based on these images, a layer depth map is generated to represent
how the objects are distributed in the real-world scene. The user can arbitrarily select one of the
objects/layers to focus on, and the other parts are naturally blurred according to their depth values
in the scene.
This project can be divided into two parts. The first is how to produce a layer depth map. Computing
a depth map is essentially a labeling assignment task. This type of problem can be solved by
finding the minimum value of a constructed energy function. The graph cuts algorithm is one of the
most efficient optimization methods; we use it to optimize our energy function because of its
fast convergence. The second part is to blur each layer that is not assigned to be
focused. Several blurring algorithms are applied to achieve this goal.
In this paper, I first describe some related work and background studies on labeling
assignment theories and related topics in the vision area. I then explore the refocusing-related
principles in computational photography. Based on these studies, I go through our image
refocusing project in detail and compare the experimental results to other existing approaches.
Finally, I propose some possible future work in this research area.
Acknowledgment
First, I would like to express my sincere appreciation to my supervisor, Dr Low Kok Lim. He has
offered me a large amount of precious advice and suggestions, contributed his valuable time to
review this paper, and provided constructive comments.
I would also like to thank my family, especially my parents. They always offer me continuous
encouragement and infinite support. My acknowledgement goes to my labmates as well:
when I felt tired and wanted to give up, they were by my side and supported me.
Contents

Abstract
Acknowledgment
Contents
1 Introduction
   1.1 Introduction
   1.2 Problem Statement
   1.3 Advantages and Contributions
2 Related Work
   2.1 Depth Estimation
      2.1.1 Single Image Input
      2.1.2 Stereo Matching – Two Images Input
      2.1.3 Multi-View Scene Reconstruction – Multiple Images Input
      2.1.4 3D World Reconstruction from a Large Dataset
   2.2 Image Refocusing
3 Background Studies
   3.1 Standard Energy Function in Vision
   3.2 Optimization Methods
      3.2.1 Introduction to Optimization
      3.2.2 Graph Cuts
   3.3 Graph Cuts
      3.3.1 Preliminary Knowledge
      3.3.2 Details of Graph Cuts
   3.4 Concepts of Stereo Matching
      3.4.1 Problem Formulation
      3.4.2 Conclusion and Future Work
   3.5 Photography
      3.5.1 Technical Principles of Photography
      3.5.2 Effects and Processing in Photography
   3.6 Defocus Magnification
      3.6.1 Overview and Problem Formulation
      3.6.2 Results and Conclusion
   3.7 Refocus Imaging
   3.8 Adobe Photoshop
4 Refocusing from Multiple Images
   4.1 Data Set
   4.2 Computation of Camera Parameters
   4.3 Estimation of Depth Value in 3D Scene
   4.4 Problem Formulation and Graph Cuts
      4.4.1 Overview of Problem Formulation
      4.4.2 Results of Single Depth Map
      4.4.3 Considerations and Difficulties
   4.5 Layer Depth Map
   4.6 Layer Blurring
   4.7 Combining Blurry Layers
   4.8 Results and Comparison
      4.8.1 Experimental Results
      4.8.2 Comparison to Adobe Photoshop
      4.8.3 Different Lens Apertures
      4.8.4 Bokeh
      4.8.5 Different Number of Input Images
      4.8.6 Comparison to Other Software on Camera
      4.8.7 Limitation
5 Conclusion
Bibliography
Chapter 1
Introduction
1.1 Introduction
In photography, focusing, refocusing and depth of field are popular research topics which attract
a large amount of attention. If an object is at the exact focus distance, we say that it is focused
precisely; at any other distance from the camera, the object is defocused and will look blurred in
the resulting photo. The refocusing technique changes the focused part of a picture, so that the
objects that are not focused become blurred. Given the position of a camera, the depth of field
(DOF) is controlled by the diameter or the shape of the camera lens aperture. In general, a small
aperture results in a large depth of field, so we obtain an image with sharp objects everywhere
(deep focus). In other cases, some objects are emphasized while others are blurred; this results
from a relatively large aperture, i.e. a small depth of field (shallow focus).
In practice, people sometimes want to shoot a photo with a sharp foreground and a blurred
background. Moreover, they would like to sharpen the objects they prefer and blur the parts they
do not consider important. In this case, we need a camera with a large aperture to create a shallow
depth of field. However, typical point-and-shoot cameras have small sensors and lenses, which
makes it difficult to generate the effect of shallow focus. To address this problem, our project uses
multiple small-aperture photos taken from slightly different viewpoints to simulate a bigger camera
aperture and create a shallow depth of field. It means that with only one point-and-shoot camera,
ordinary users without any photography training can take an 'artistic' depth-of-field photo.
Another advantage of our project is that the user can arbitrarily select one part of the reference
photo to sharpen, while the other parts are blurred. This is achieved by producing a depth map
from the given photos.
1.2 Problem Statement
Most people will ask: what kind of problem does our project solve? The first thing we consider
is convenience for camera users. The top-selling point-and-shoot cameras in the current market
are indeed convenient and easy to use. People just need to press the shutter button, and a beautiful
photo will be produced. However, simply generating a picture is not enough; users prefer more
realistic effects in their photos. For example, if we choose a depth-of-field simulation function on
the camera, the foreground objects (e.g. people) will be kept sharp while the background will be
blurred, just by pressing a button on the camera.
Current popular point-and-shoot cameras with a depth-of-field effect usually produce one of two
results: either the objects are sharp everywhere (Figure 1.1, left), or the camera detects the
foreground objects automatically, keeps them sharp, and blurs all the other regions in the photo
(Figure 1.1, right). Here is the problem: what if the user wants to sharpen the background objects
instead of the foreground ones? For example, in Figure 1.1 (right), users may not want to focus on
the flower; maybe they would like to see the white object in the background more clearly. In that
case we need to refocus the whole scene in the photo, i.e. change the emphasized part of the scene.
In a word, our project should solve a problem like this: with the fewest handling steps on the input,
how can users finally obtain a photo with the sharp and blurred parts they prefer?
Figure 1.1 – left: all the objects are sharp everywhere; right: shallow depth of field – flower in the foreground is
sharp, while background is largely blurred.
There is some existing work in the area of refocusing; I will describe the details of these methods
in the next chapter. In this thesis, we present a simple but effective idea to implement refocusing
for current popular point-and-shoot cameras. The whole procedure can be divided into two parts.
The first part is to compute a depth map based on the given input photos. The theory and problem
formulation underlying the depth map computation belong to a classical research area in early
vision: label assignment, which focuses on how to assign a label to each pixel (in an image) given
some observed data. This kind of problem can be solved by energy minimization: a way to carry
out the label assignment is to construct an energy function and minimize it.
The second part of the project is to sharpen one part of the photo while blurring the other parts
based on the depth map computed in the previous step. For this phase, we need to handle problems
that arise, such as partially occluded regions, the blur kernel, and issues from the layers of the
constructed 3D scene. After combining these two steps, the user can finally obtain a new image
with the sharp parts they want by only shooting several photos with a common point-and-shoot
camera.
Therefore, we can summarize that the input of this project is a sequence of sharp photos taken
from slightly different viewpoints. The user then selects a preferred region on the reference photo
to be emphasized, and the final output is a depth-of-field image.
1.3 Advantages and Contributions
We analyze the advantages of our project from three aspects: depth map generation, refocusing
procedure and quantity of input parameters.
(a). Depth map generation:
There are many approaches to producing a depth map in the early vision area. Most of them are
stereo matching methods. The user needs to prepare two images – left and right – with only a
translation between them. The output is a disparity map based on the distance each pixel moves
between the two images: a large translation indicates objects nearer to the camera, while a small
movement corresponds to scene background that is far away from the camera.
Another type of depth map generation is to reconstruct a 3D scene from a large set of photos.
Those photos can be taken from very different viewpoints. Actually, to produce a coarse depth
map, there is no need to reconstruct the corresponding 3D scene. The information a depth map
requires depends on where it is applied. Sometimes a rough depth value (the distance from the
camera lens to the objects) for each pixel in the image is enough, while in other cases, especially
full 3D scene reconstruction, the exact depth value (e.g. an x, y, z point representation) of each 3D
point is necessary.
Our method is a trade-off between the previous two approaches. We use several images to
generate an ordinary depth map instead of a 3D reconstruction. First, users do not need to shoot a
large set of photos; 5-7 are enough. Second, the theory of multi-view scene reconstruction is
simple: generate an appropriate energy function for the given problem and apply an efficient
energy minimization algorithm such as graph cuts to minimize it [33]. The details and our
implementation will be described in a later chapter.
In short, our depth map production requires less input and uses a simpler algorithmic idea than a
3D reconstruction approach with its many steps, such as feature point extraction, structure from
motion, and bundle adjustment. The reason we do not apply a 3D reconstruction approach is also
that the information we require for the project is less than what is needed to reconstruct a 3D scene
of the real world – a rough depth value for each pixel in the reference image is truly enough.
(b). Refocusing procedure:
One common approach to refocusing is to acquire a sequence of images with different focus
settings. Using a spatially varying blur estimate of the whole scene, where the information is taken
from the focus differences, an all-focused image can be computed for refocusing. However, from
the user's point of view, they need to take several photos under different focus settings, which is
not easy if they have little knowledge about photography. What if they do not know how to adjust
the focus settings? In other words, some existing refocusing works require at least two input
images with very different properties. For example, one of the input photos has sharp objects in
the foreground, while in the other image the background objects are sharp and the foreground is
blurred. Then, according to the different degrees and regions of blur, the program can compute a
relatively accurate blur kernel to accomplish the refocusing task. Notice that a camera with a
depth-of-field effect is required for this kind of approach.
A different approach to refocusing is to measure the light field associated with a scene. In this
case, the measured rays can be combined to simulate new depth of field settings without
explicitly computing depth [8]. The drawback of this method is that it requires either a large
number of photos or a large camera array.
The blur estimation in our project is based on the depth map produced in the previous step. It
does not require a large number of photos or any additional equipment like a camera array. What
we need is only the depth value of each pixel; the objects in the picture are then blurred according
to the magnitude of their depth values. The nearer an object is to the camera, the less it is blurred.
Besides, the photos users shoot are all under the same focus setting, so there is no need for them
to have any extra photography knowledge.
(c). Quantity of input parameters:
Compared to ours, most existing refocusing methods need more input parameters. In [8], in order
to estimate a depth map with a single camera (with a wide depth of field), the author uses a sparse
set of dots projected onto the scene (with a shallow depth of field projector), and refocuses based
on the resulting depth map.
Another type of method modifies the camera lens: various shapes of camera aperture are used, or
a coded aperture is applied to a conventional camera [1]. With knowledge of the aperture shape,
the blur kernel can be computed as well.
Our project does not require extra equipment like a projector or any camera modification. We do
not need to take any photos with a shallow depth of field to estimate the blur kernel either. All
users need to do is shoot several all-focused photos from slightly different shooting angles using
a point-and-shoot digital camera. This makes it more convenient for ordinary people without any
photography training to obtain a final refocused photo. In our project, the only interaction between
the user and the computer is the selection of the preferred region to be emphasized (sharpened).
Chapter 2
Related Work
Following the procedure of our project – depth map computation and image refocusing based
on the depth information – the literature survey in this chapter is also divided into two
categories: existing work on depth estimation and existing work on image refocusing.
In this chapter, we not only introduce the existing work related to our project, but also
describe some key algorithms and their corresponding applications. These algorithms, such as
the concepts of stereo matching, also play an important part in our project. Therefore, introducing
such concepts in detail is necessary.
2.1 Depth Estimation
We divide this part into four categories according to the required input, which in my opinion
makes it clearer to compare the various methods.
2.1.1 Single Image Input
Most approaches for obtaining a depth map from a single input image require additional
equipment such as a projector, or a device modification such as a change of the camera aperture
shape. In [8], the author uses a single camera (with a wide depth of field), and the depth value
computation is based on the defocus of a sparse set of dots projected onto the scene (using a
narrow depth of field projector). With the help of the dots projected by the projector and a
color segmentation algorithm, an approximate depth map of the scene with sharp boundaries
can be computed for the next step. Figure 2.1 is an example from [8]: (a) is the image acquired
from the single input image and projector, and (b) is the depth map computed from the information
provided by (a). The produced depth map has very sharp and accurate object boundaries.
However, it cannot handle the partial occlusion problem, i.e., we are not able to see the region of
the man behind the flower in the depth map.
Figure 2.1 – example result from [8].
In [1], the authors use a single image capture and a small modification to the traditional camera
lens – a simple piece of cardboard suffices. Given that the shape of the lens aperture is already
known, the corresponding blur kernel can be estimated, and deconvolution can then be applied to
the blurry parts of the image in order to recover an all-focused final image (in the refocusing step).
The output of this method is a coarse depth map, which is sufficient for the subsequent refocusing
phase in most applications.
2.1.2 Stereo Matching – Two Images Input
Stereo matching is one of the most active research areas in computer vision, and it serves as a
significant intermediate step in many applications, such as view synthesis, image-based rendering,
and 3D scene reconstruction. Given two images/photos taken from slightly different viewpoints,
the goal of stereo matching is to assign a depth value to each pixel in the reference image, where
the final result is represented as a disparity map.
Disparity indicates the difference in location between two corresponding pixels and is also
considered a synonym for inverse depth. The most important first step is to find the corresponding
pixels that refer to the same scene point in the given left and right images. Once these
correspondences are known, we can find out how much displacement the camera movement
produced.
Because it is hard to identify one-to-one corresponding pixels under casual camera motion, the
image pair is commonly rectified so that the two views differ by a horizontal translation; the stereo
problem is then reduced to a one-dimensional search along corresponding scan lines. Therefore,
we can simply view the disparity value as the offset between x-coordinates in the left and right
images (Figure 2.2). Objects nearer to the viewpoint have a larger translation, while farther ones
move only slightly.
Figure 2.2 - The stereo vision is captured in a left and a right image.
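As a small aside (a minimal sketch, not taken from the thesis), the standard relation for a rectified pair makes the "large disparity = near object" statement concrete: depth is inversely proportional to disparity, Z = f·B/d, where f is the focal length in pixels, B the camera baseline, and d the disparity in pixels. The function name and numbers below are hypothetical.

```python
# Minimal sketch: convert a disparity (pixels) into a depth (metres) for a
# rectified stereo pair, using Z = f * B / d.
def disparity_to_depth(disparity_px, focal_px=700.0, baseline_m=0.1):
    """Depth of the scene point seen at this disparity."""
    if disparity_px <= 0:
        return float("inf")          # zero disparity corresponds to a point at infinity
    return focal_px * baseline_m / disparity_px

# Nearer objects move more between the two views (large disparity, small depth).
print(disparity_to_depth(50))   # ~1.4 m
print(disparity_to_depth(5))    # ~14 m
```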
An excellent review of stereo work can be found in [4]. It presents the taxonomy and comparison
of two-frame stereo correspondence algorithms.
Stereo matching algorithms generally perform (subsets of) the following three steps [4]:
a. matching cost computation;
My own comparison and implementation focuses only on pixel-based matching costs, which is
enough for this project. The most common pixel-based methods are squared intensity differences
(SD) [14, 18, 12, 7] and absolute intensity differences (AD) [30]. The SD and AD costs are
computed between individual pixels of the given left and right images.
b. cost (support) aggregation;
Cost aggregation is usually done with a window-based (local) method, which aggregates the
matching cost by summing or averaging over a support region. For commonly used stereo
matching, a support region is a two-dimensional area, often a square window (3-by-3, 7-by-7).
The aggregated cost of each pixel in the image is then calculated over such a region.
c. disparity computation / optimization;
Disparity computation can be separated into two classes: local methods and global methods.
Local methods usually perform a local "winner-take-all" (WTA) optimization at each pixel [40].
For global optimization methods, the objective is to find a disparity function that minimizes a
global energy. Such an energy function includes a data term and a smoothness term, which
together represent the energy contribution of each pixel in the two stereo images. With the energy
function defined, several minimization algorithms such as belief propagation [19], graph cuts
[28, 34, 17, 2], dynamic programming [20, 31], and simulated annealing [26, 11] can be used to
compute the final depth map (a minimal sketch combining the three steps above follows below).
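The following minimal sketch (hypothetical function name; not the thesis implementation) strings the three steps together for a rectified grayscale pair, using the squared-difference cost, a square-window aggregation, and a local winner-take-all choice:

```python
import numpy as np

def wta_disparity(left, right, max_disp=16, win=3):
    """Toy local stereo: squared-difference cost, box-window aggregation,
    winner-take-all disparity selection (grayscale images as 2-D float arrays)."""
    h, w = left.shape
    pad = win // 2
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        # a. matching cost: squared intensity difference between I_left(x, y) and I_right(x - d, y)
        diff = np.full((h, w), np.inf)
        diff[:, d:] = (left[:, d:] - right[:, :w - d]) ** 2
        # b. cost aggregation: sum the per-pixel cost over a (win x win) support window
        padded = np.pad(diff, pad, mode="edge")
        agg = np.zeros((h, w))
        for dy in range(win):
            for dx in range(win):
                agg += padded[dy:dy + h, dx:dx + w]
        cost[d] = agg
    # c. disparity computation: local winner-take-all over the candidate disparities
    return np.argmin(cost, axis=0)
```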
2.1.3 Multi-View scene reconstruction - Multiple Images Input
A 3D scene can be roughly reconstructed from a small set of images (about 10 pictures). The
result of this type of method is usually a dense depth map similar to those of stereo matching. In
stereo matching, however, it is difficult to deal with partially occluded regions because of the lack
of information in only two input images. With multiple images as input, users have enough
information to produce a relatively exact result, and partial occlusion may also be handled. In
[Multi-camera; asymmetrical], the energy function has more than two terms, i.e. a data term, a
smoothness term and a visibility term; the visibility term is used to handle the partial occlusion
problem. Besides, [13, 25, 23] not only take advantage of multiple images as input, but also build
a new type of data structure as output – a layered depth map, which represents the layered
structure of the real 3D scene more clearly. The whole scene is divided into several planes, each
of which represents the distance of objects to the camera. Figure 2.3 shows an example of a
layered depth map [23]. This new representation of the depth map deals with partial occlusion
better, and most importantly, it is more convenient and exact for the next step – refocusing.
Figure 2.3 – (a) one of the input sequence; (b) the recovered depth map; (c) the separated layers.
2.1.4 3D World Reconstruction from a large dataset
As in [16], the input is a large dataset where the photos are taken from very different views. From
it we can obtain plenty of information, including the camera position and the affine matrix of each
photo, and finally compute the exact depth value of each pixel in the real world. Given each pixel's
x, y, z values in an image (the z value is the one we compute from the large dataset), we can easily
reconstruct the 3D scene of this image, and of course the computed depth map is much more
accurate than those produced from stereo matching or multi-view scene reconstruction. See
Figure 2.4. The result of the 3D reconstruction is comprised of a large number of sparse points.
For people who would like to see a rough sketch of a certain building, a set of sparse points is
enough. If we try to obtain a dense map, estimation of the regions without known points, or
triangulation, may be performed in order to connect the discrete points together.
Figure 2.4 – example from [16]: (a) large dataset as input; (b) 3D reconstruction result.
2.2 Image Refocusing
In our project, this part is based on the previous depth estimation phase. All our research and
implementation of refocusing are based on the single dense depth map that we have produced.
Therefore, when introducing related work on image refocusing, we describe only the refocusing
part of each paper, i.e. we presume that the depth map (single or layered) has already been
provided and focus on how to blur or sharpen the original image.
One common approach is to compute the blur scale based on a set of images. These images can
have different focus settings, or consist of one all-focused image plus known focus settings or
parameters. In [22, 27], the degree of blur of different parts of the image is computed from the
given defocused image sequence. For this kind of approach, to compute the blur of the whole
scene, in addition to the size of the lens aperture and the focal length, there are two issues to note.
The first is partial occlusion: different parts of the lens, or different cameras, may capture
different views because of partial occlusion. Second, pixel values at object boundaries may have
ambiguous sources, i.e., background and foreground could both contribute to the boundary pixels.
The refocusing algorithm of [8] addresses both of these issues. It views the partial occlusions as
missing regions in the resulting depth map; the missing parts are then recreated by estimation,
i.e., the algorithm extends the occluded surface using texture synthesis. Moreover, to deal with
foreground and background transitions at pixel boundaries, the authors blend a foreground-focused
image with a background-focused image within the boundary region. They combine their depth
map with a matting matrix computed from the depth estimation refinement to produce a better
result. [1] is another approach that uses blur kernel estimation to carry out the refocusing work.
The input is a single blurred photograph taken with the modified camera (providing both depth
information and an all-focus image), and the output is coarse depth information together with a
normal high-resolution RGB image. Therefore, to reconstruct the original sharp image, the correct
blur scale of the observed image has to be identified with the help of the modified shape of the
camera aperture. The process is based on a probabilistic model which finds the maximum
likelihood of a blur scale estimation equation. To summarize, the method follows the general
convolution equation y = fk * x, where y is the observed image, x is the sharp image to be
recovered, and the blur filter fk is a scaled version of the aperture shape [1]. The defocus step of
this method is to find the correct kernel fk with the help of the known coded aperture.
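To make the forward model y = fk * x concrete, here is a toy sketch (not the probabilistic method of [1]): it assumes, purely for illustration, a known sharp reference image and a circular disk as a stand-in for the scaled aperture shape, and selects the blur scale by brute force. All names are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

def disk_kernel(radius):
    """Scaled 'aperture' kernel: a normalized disk of the given radius (a toy
    stand-in for the scaled coded-aperture shape f_k)."""
    r = max(int(radius), 1)
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x * x + y * y <= r * r).astype(float)
    return k / k.sum()

def best_blur_scale(observed, sharp, radii=(1, 2, 3, 4, 5, 6)):
    """Toy brute-force search: re-blur a sharp reference with each candidate kernel
    scale and keep the scale whose prediction best matches the observed image
    (the real method of [1] scores scales with a probabilistic deconvolution model)."""
    errors = []
    for r in radii:
        predicted = fftconvolve(sharp, disk_kernel(r), mode="same")   # y ~ f_k * x
        errors.append(np.mean((predicted - observed) ** 2))
    return radii[int(np.argmin(errors))]
```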
Another kind of method produces a layered depth map which divides the whole scene into
several parts, each with a certain depth value. These separated layers are treated as individual
parts, which is a natural way to avoid the missing regions between objects lying in different scene
layers. Given a layered depth map, the only thing we need to do is blur each layer according to
the depth value of that layer, without considering whether there is any missing region. Our
refocusing algorithm is based on this layered depth map idea. We will describe the details in a
later chapter.
A different approach to refocusing is to measure the light field of a scene. In this case, there is no
need to compute the exact depth value of each pixel; the measured rays can be combined to
simulate new depth of field settings [light field papers]. Since depth estimation runs through our
whole project and this thesis, we will not say more about light fields. Detailed information about
light fields in the depth estimation area can be found in [32].
Chapter 3
Background Studies
In this chapter, I mainly discuss some background theories related to this project. Some other
similar work is also introduced to offer a rough picture of, and comparison to, our project.
In Sections 3.1 and 3.2, basic knowledge of the energy function used in vision and of optimization
methods, including graph cuts, is explored. Section 3.3 discusses the deeper theory of the graph
cuts algorithm, including implementation details. A classical and critical application related to
graph cuts and multi-view depth map production is stereo matching; we discuss it in Section 3.4
in order to better understand the basic theories underlying our project.
I also introduce some concepts related to the refocusing part of the project. The basic theories and
relationships among common camera parameters are described in Section 3.5. Besides, several
different methods (applications) on the topics of refocusing, defocusing or depth of field are
introduced for a direct comparison with our method.
3.1 Standard Energy Function in Vision
In early vision problems, the label assignment can be explained as follows: every pixel p ∈ P must
be assigned a label in some finite set L. In image restoration, the assignments represent pixel
intensities, while for stereo vision and motion, the labels are disparities. Another example is
image segmentation, where labels represent the pixel values of each segment. The goal of label
assignment is to find a labeling f that assigns each pixel p ∈ P a label fp ∈ L, where f is both
piecewise smooth and consistent with the observed data. In the general energy function definition,
the vision problems can be naturally formulated as

E(f) = Esmooth(f) + Edata(f)    (1)

where Esmooth(f) measures the extent to which f is not piecewise smooth, while Edata(f) measures
the disagreement between f and the observed data [39]. Typically, in many papers and
applications, the form of Edata(f) is defined as

Edata(f) = Σ_{p∈P} Dp(fp)    (2)

where Dp measures how close the label fp is to pixel p given the observed data. If a minimum
value of this data term is found, it means the configuration f fits the corresponding pixels very
well. In the stereo vision problem, Dp(fp) is usually (Ileft(p) − Iright(q))², where q is the pixel p
shifted by the disparity fp, and Ileft and Iright are the pixel intensities of the two corresponding
points p and q in the given left and right images, respectively. In the image restoration problem,
Dp(fp) is normally (fp − Ip)², where Ip is the observed intensity of p.
While the data term is relatively easy to define and apply to different practical problems, the
choice of smoothness term is a very important and critical issue in current research. It directly
affects whether the final result is optimal, and the form of this term usually depends on the
application. For example, in some approaches, Esmooth(f) makes f smooth everywhere according
to the demands of the algorithm. For many other applications, Esmooth(f) has to preserve the object
boundaries as clearly as possible, which is often referred to as discontinuity preserving. For image
segmentation and stereo vision problems, object boundaries are a primary consideration. The
Potts model, described in Section 3.3.1.1, is also a popularly used type of smoothness term.
From the discussion about the data term and smoothness term in the energy function, we consider
energies of the form

E(f) = Σ_{{p,q}∈N} Vp,q(fp, fq) + Σ_{p∈P} Dp(fp)    (3)

where N is the set of pairs of neighboring pixels. Normally, N is composed of adjacent pixels (i.e.
left and right, top and bottom, etc.), but it can contain arbitrary pairs as well according to the
problem requirements. Most applications only consider Vp,q under pair-wise interactions, since
pixel dependence and interaction often happen between adjacent pixels.
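As a concrete illustration (a minimal sketch, not the thesis implementation), the energy of a candidate labeling can be evaluated directly from definition (3); here the data term is a squared difference to the observed intensities (the image restoration case) and the smoothness term is a Potts penalty over 4-connected neighbor pairs. The function name and the weight `lam` are hypothetical.

```python
import numpy as np

def energy(labels, observed, lam=10.0):
    """Evaluate E(f) = sum_p D_p(f_p) + sum_{(p,q) in N} V(f_p, f_q) for an
    image-restoration style problem: labels and observed are 2-D intensity arrays,
    D_p(f_p) = (f_p - I_p)^2 and V is a Potts penalty over 4-connected neighbors."""
    data = np.sum((labels - observed) ** 2)                    # data term E_data(f)
    smooth = lam * (np.sum(labels[:, 1:] != labels[:, :-1])    # horizontal neighbor pairs
                    + np.sum(labels[1:, :] != labels[:-1, :])) # vertical neighbor pairs
    return data + smooth
```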
3.2 Optimization Methods
Finding a minimum value for a given energy function is a typical kind of optimization problem.
In this section, I first give an introduction and motivation to optimization techniques. Then I focus
on introducing an efficient optimization method – graph cuts.
3.2.1 Introduction to Optimization
In mathematics and computer science, optimization, or mathematical programming, refers to
choosing the best element from some set of available alternatives. In the simplest case,
optimization means solving problems where one seeks to minimize or maximize a real function
by systematically choosing the values of real or integer variables from within an allowed set.
This formulation, using a scalar, real-valued objective function, is probably the simplest example.
Generally, optimization means finding "best available" values of some objective function given a
defined domain, including a variety of different types of objective functions and different types
of domains.
Greig et al. [3] were the first to use powerful min-cut/max-flow algorithms from combinatorial
optimization to minimize certain typical energy functions in computer vision. Combinatorial
optimization is a branch of optimization. The feasible solutions of such problems are discrete, or
can be reduced to the discrete case, and the goal is to find the best possible solution; most of the
time an approximate one is sought. Approximation algorithms run in polynomial time and find a
solution that is "close" to optimal.
For a given energy function, in general, a labeling f is a local minimum of the energy E if

E(f) ≤ E(f')

for any labeling f' "near to" f. In the discrete labeling case, the labelings near to f are those that
lie within a single move of f. Many local optimization approaches apply standard moves, where
only one pixel can change its label at a time [39].
3.2.2 Graph Cuts
The graph cuts algorithm is one of the most popular optimization algorithms in the related
research areas. It can rapidly compute a local minimum with relatively good results. Figures
3.1-3.3 show some examples produced by graph cuts; the object boundaries are quite clear, which
fits the requirements of image segmentation and stereo vision.
Since the graph cuts algorithm is the major topic discussed in this paper, I will introduce it in
detail later.
Figure 3.1 - Results of color segmentation on the Berkeley dataset by using graph cuts.
Figure 3.2 - Results of texture segmentation on the MIT VisTex and the Berkeley dataset by using graph cuts.
Figure 3.3 – Disparity map of stereo vision matching using graph cuts.
3.3 Graph Cuts
3.3.1 Preliminary Knowledge
To fully understand the idea of graph cuts, some fundamental theories should be known first.
3.3.1.1 Metric and Semi-Metric
V is called a metric on the label space L if, for any labels α, β, γ ∈ L, it satisfies

(a) V(α, β) = 0 ⇔ α = β
(b) V(α, β) = V(β, α) ≥ 0
(c) V(α, β) ≤ V(α, γ) + V(γ, β)

If V satisfies only (a) and (b), it is called a semi-metric.

For example, the Potts model V(α, β) = K · T(α ≠ β) is a metric, where T(·) is 1 if its argument is
true and 0 otherwise. The Potts model encourages labelings consisting of several regions where
pixels in the same region have equal labels [4]. The discontinuity-preserving results produced
from this model are also called piecewise constant, which is widely used in segmentation and
stereo vision problems.

Another type of model is called piecewise smooth. The truncated quadratic V(α, β) =
min(K, |α − β|²) is a semi-metric, while the truncated absolute distance V(α, β) = min(K, |α − β|)
is a metric, where K is some constant. The role of the constant K is to restrict the possibly large
discontinuity penalty imposed on the smoothness term in the energy function. These models
encourage labelings consisting of several regions where pixels in the same region have similar
labels.
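For reference, the three interaction penalties above can be written as plain functions (a minimal sketch; the default constants K are arbitrary illustrations, not values used in the thesis):

```python
def potts(a, b, K=1.0):
    """Potts model: V(a, b) = K * T(a != b); a metric, piecewise-constant prior."""
    return K if a != b else 0.0

def truncated_quadratic(a, b, K=4.0):
    """V(a, b) = min(K, |a - b|^2); a semi-metric (triangle inequality can fail)."""
    return min(K, abs(a - b) ** 2)

def truncated_absolute(a, b, K=4.0):
    """V(a, b) = min(K, |a - b|); a metric, piecewise-smooth prior."""
    return min(K, abs(a - b))
```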
3.3.1.2 Graph denotation and construction
Let G = ⟨V, E⟩ be a weighted graph. It consists of a set of nodes V and a set of edges E that
connect them. The set of nodes has several distinguished vertices which are called the terminals.
In the context of vision problems, the nodes normally correspond to pixels, voxels or other types
of image features, and the terminals correspond to the set of labels which can be assigned to each
pixel in the image. For simplification, I will only focus on the case of two terminals (i.e. two
labels to be assigned). Usually the two terminal nodes are also called the source node and the sink
node. The multiple-terminal problem can be naturally extended from the two-label case. In
Figure 3.4, a simple example of a two-terminal graph is shown. This graph construction can be
used on a 3 x 3 image with two to-be-assigned labels.
For the edges connecting different nodes, a t-link is an edge that connects a terminal node
(source or sink) to an image pixel node, while an n-link is an edge that connects two image nodes
within a neighborhood system.
Figure 3.4 – Example of a constructed graph. A similar graph cuts construction was first introduced in vision by
Greig et al. [3] for binary image restoration.
A cut C ⊂ E is a set of edges such that the terminals are separated by this cut in the induced
graph G(C) = ⟨V, E − C⟩. After the cut, a subset of nodes belongs to the source terminal, while the
other subset of nodes is categorized into the sink terminal. The cost of the cut C, denoted |C|,
equals the sum of the edge weights of this cut. Figure 3.5 shows a typical cut on the constructed
graph. The cut is represented as a green dotted line.
The minimum cut problem is to find the optimal cut with lowest cost among all cuts separating
the terminals.
Figure 3.5 – Example of a cut on the constructed graph.
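To make the min-cut idea concrete, the following toy illustration (not part of the thesis; node names and capacities are invented) builds a tiny two-terminal graph and computes its minimum cut with the networkx max-flow/min-cut routine:

```python
import networkx as nx

# Tiny two-terminal graph: source "s" and sink "t" are the terminals, "p" and "q"
# stand for two pixel nodes; edge capacities play the role of t-link/n-link weights.
G = nx.DiGraph()
G.add_edge("s", "p", capacity=4.0)   # t-link: source -> pixel p
G.add_edge("s", "q", capacity=1.0)   # t-link: source -> pixel q
G.add_edge("p", "t", capacity=2.0)   # t-link: pixel p -> sink
G.add_edge("q", "t", capacity=5.0)   # t-link: pixel q -> sink
G.add_edge("p", "q", capacity=1.0)   # n-links between neighboring pixels
G.add_edge("q", "p", capacity=1.0)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
print(cut_value)                      # cost |C| of the minimum cut
print(source_side, sink_side)         # node partition = label assignment
```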
3.3.1.3 Minimizing the Potts energy is NP-hard
The details of the NP-hardness proof will not be described here. All we need to know is that a
polynomial-time method for finding an optimal configuration f* would provide a polynomial-time
algorithm for finding the minimum-cost multi-way cut, which is known to be NP-hard [38].
3.3.2 Details of Graph Cuts
3.3.2.1 Overview
In contrast to the other optimization approaches described before, which use standard moves
where only one pixel changes its label at a time, graph cuts algorithms can change the labels of a
large number of pixels simultaneously. This improvement directly speeds up the processing time
on images. There are two types of large moves in the graph cuts algorithm: α-β swap and
α-expansion. Figure 3.6 shows a comparison of local minima with respect to standard and large
moves for image restoration.
Figure 3.6 – Comparison of local minima with respect to standard and large moves for image restoration. (a)
Original image. (b) Observed noisy image. (c) Local minima with respect to standard moves (i.e. only one label
changes at a time). (d) Local minima with respect to large moves. Both local minima in (c) and (d) were obtained
using labeling (b) as an initial labeling estimate [39].
3.3.2.2 Definition of α-β Swap and α-Expansion Algorithms
For α-β swap, given a pair of labels α and β, an α-β swap is a move from an old labeling f to a
new labeling f'. If the change of pixel labels between the old and new labelings leads to a decrease
of the energy function, we say that the α-β swap succeeds and continue to the next iteration. In
other words, an α-β swap means that some pixels that were labeled α are now labeled β, and some
pixels that were labeled β are now labeled α.
For α-expansion, given a label α, an α-expansion move is also a move from an old labeling f to a
new labeling f'. It means that some pixels that were not assigned the label α are now assigned the
label α.
Figure 3.7 shows the α-β swap and α-expansion algorithms, respectively. We call a single
execution of Steps 3.1-3.2 an iteration, and an execution of Steps 2, 3, and 4 a cycle. In each
cycle, an iteration is performed for every label α, or for every pair of labels α and β. The
algorithm continues until it cannot find any successful labeling change. It can be seen that a cycle
of the α-β swap algorithm takes |L|² iterations, while a cycle of the α-expansion algorithm takes
only |L| iterations.
Figure 3.7 – Overview of α-β swap algorithm (top) and α-expansion algorithm (bottom).
Given an input initial labeling f and a pair of labels α and β (swap algorithm) or a label α
(expansion algorithm), we want to find a new labeling f' which can minimize the given energy
function

E(f') = Σ_{{p,q}∈N} V(f'p, f'q) + Σ_{p∈P} Dp(f'p)
I will discuss the procedure and results of these two algorithms with respect to a constructed
graph. There are also a number of theorems and corollaries to be proved; I will not prove them
here. Instead, I will use those theorems directly to give a clear and straightforward interpretation
that makes these two algorithms easier to understand.
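The outer loop of the expansion algorithm can be sketched as follows (a pseudocode-level sketch of the cycle/iteration structure in Figure 3.7; the `expansion_move` step, which would build the graph for label α and solve the min cut as defined in [39], is left as an assumed helper and is not implemented here):

```python
def alpha_expansion(labels, energy, expansion_move, label_set):
    """Outer loop of the alpha-expansion algorithm (Figure 3.7, bottom).
    `expansion_move(labels, alpha)` is assumed to build the graph for label alpha,
    solve the min cut, and return the best labeling within one alpha-expansion."""
    best = labels
    success = True
    while success:                      # one pass of the while-body is a "cycle"
        success = False
        for alpha in label_set:         # one loop body is an "iteration"
            candidate = expansion_move(best, alpha)
            if energy(candidate) < energy(best):
                best = candidate        # keep the move and try another cycle
                success = True
    return best
```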
3.3.2.3 α-β Swap
Any cut leaves each pixel in the image with exactly one t-link, which means that the result of a
cut determines the labeling f' of every pixel. From another point of view, a cut can be described
as follows: a pixel p is assigned label α when the cut C separates p from the terminal α; similarly,
p is assigned label β when the cut C separates p from the terminal β. If the cut does not change
p's label, its original label fp is kept.
Lemma 3.1: A labeling fC corresponding to a cut C on the constructed graph is one α-β swap
away from the initial labeling f.

Lemma 3.2: There is a one-to-one correspondence between cuts C on the constructed graph and
labelings that are one α-β swap from f. Moreover, the cost of a cut C on the graph is |C| = E(fC)
plus a constant.

Corollary 3.1: The lowest energy labeling within a single α-β swap move from f is f̂ = fC,
where C is the minimum cut on the constructed graph.
3.3.2.4 α-Expansion
The theory of this algorithm is similar to the α-β swap described above. The biggest difference
from the swap algorithm is the introduction of auxiliary nodes. I will not explain them in detail
here, only give an overall idea of the relationship between the edge weights and the energy
function defined before.
A cut can be described as follows: a pixel p is assigned label α when the cut C separates p from
the terminal α. If p is not chosen to change to the label α, its original label fp is kept.
Lemma 3.3: A labeling fC corresponding to a cut C on the constructed graph is one α-expansion
away from the initial labeling f.

Lemma 3.4: There is a one-to-one correspondence between elementary cuts on the constructed
graph and labelings within one α-expansion of f. Moreover, for any elementary cut C, we have
|C| = E(fC).

Corollary 3.2: The lowest energy labeling within a single α-expansion move from f is f̂ = fC,
where C is the minimum cut on the constructed graph.
In [39], the authors define edge weights that are directly related to the energy function: the weight
of a t-link corresponds to the data term and the weight of an n-link corresponds to the smoothness
term. Based on these definitions of edge weights, all the theorems and corollaries can finally be
proved.
Figure 3.8 and Figure 3.9 are two examples to illustrate the results of α-β swap and α-expansion
algorithms.
Figure 3.8 – Examples of α-β swap and α-expansion algorithms.
Figure 3.9 – Example of α-expansion. Leftmost is the input initial labeling. An expansion move is shown in the
middle and the right one is the corresponding binary labeling.
The window-based algorithms described above may produce results with a number of errors such
as black holes or mismatches in the disparity map. Using the graph cuts algorithm, the object
boundaries can be clearly detected thanks to the choice of smoothness term, and the label
assignment of the disparity map is obtained thanks to the data term.
3.4 Concepts of Stereo Matching
3.4.1 Problem Formulation
For every pixel in one image, finding the corresponding pixel in the other image is the basic idea
of stereo matching. Here the authors of [35] refer to this definition as the traditional stereo
problem. The goal of this problem is also to find the labeling f (a disparity for each pixel) that
minimizes

E(f) = Σ_{{p,q}∈N} V(fp, fq) + Σ_{p∈P} Dp(fp)

Again, Dp is the penalty for assigning a label to the pixel p; N is the neighborhood system
composed of pairs of adjacent pixels; and V is the penalty for assigning different labels to
adjacent pixels.

In the traditional stereo matching problem, the location movement of each pixel goes along the
horizontal or vertical direction. So if we assign a label fp to the pixel p in the reference image I,
the corresponding pixel in the matching image I' is (p + fp). The matching penalty Dp enforces
photo-consistency, which is the tendency of corresponding pixels to have similar intensities. A
possible form of Dp is

Dp(fp) = ||I(p) − I'(p + fp)||²
For the smoothness term, the Potts model is usually used to impose a penalty for different fp, fq.
The natural form of the Potts model for the smoothness term is V(fp, fq) = λ · T(fp ≠ fq), where the
indicator function T(·) is 1 if its argument is true and 0 otherwise [35].

We can see the shape of the terms D and V from the tables below, which are of size |P| × |L| and
|L| × |L|, respectively. For stereo with the squared intensity difference as data term and the Potts
model applied to the smoothness term, they are

Dp(d) = (I(p) − I'(p + d))²   for every pixel p and candidate disparity d,
V(α, β) = λ · T(α ≠ β)        for every pair of labels α, β.
A more efficient and fast implementation of graph cuts uses the min-cut/max-flow algorithm of
[36] to find the optimal cuts. Figure 3.10 shows some experimental results of two-view stereo
matching with graph cuts. We can see that even for heavily textured images, graph cuts can still
detect clear object boundaries and assign correct labels to the pixels.
Figure 3.10 – Left images, their computed results, and ground truth (top row only). Top row is the "lamp" data
sequence and the bottom row is the "tree" data set.
3.4.2 Conclusion and Future Work
The binocular stereo vision area is a relatively mature field. A large number of two-view stereo
matching algorithms have been proposed in recent years. The results of most approaches turn out
to be quite good, with not only clear object boundaries but also accurate disparity values. The
Middlebury evaluation of stereo methods [5] weighs the advantages and disadvantages among
them, which gives readers an overall understanding of the current trends in the stereo vision area.
However, the given data have been simplified to a pair of rectified images, where the
corresponding points are easy to find (one only needs to search along a horizontal or vertical
line), since the data sets offered by Middlebury have a strict offset of only a few pixels between
the left and right images. Future work may focus on two or more casually taken images without
such a strict horizontal or vertical translation. Handling slight rotation can be viewed as a
challenge to be addressed.
3.5 Photography
In this section, I introduce a series of concepts that are closely related to this project. Camera
parameters, post-processing techniques, and technical principles are included to help better
understand our project. In section 3.5.1, some basic theories are introduced to build a
fundamental picture of photography, especially the camera. In section 3.5.2, I describe some
pre-processing and post-processing techniques as well, such as focusing, refocusing and
defocusing methods.
3.5.1 Technical Principles of Photography
3.5.1.1 Pinhole Camera Model
A pinhole camera is a simple camera with a single small aperture but without a lens to focus light.
Figure 3.11 shows a diagram of a pinhole camera. This type of camera model is usually used as a
first order approximation of the mapping from a 3D scene to a 2D image, which is the main
assumption in our project (I will describe it later).
Figure 3.11 – A diagram of pinhole camera
The pinhole camera model describes the mathematical relationship between the coordinates of a
3D point and its projection onto the 2D image plane of an ideal pinhole camera. The reason we
use such a simple model is that, even though some effects like geometric distortion or depth of
field cannot be taken into account, they can still be handled by applying a suitable coordinate
transformation on the image coordinates. Therefore, the pinhole camera model is a reasonable
description of how a common camera with a 2D image plane depicts a 3D scene.
Figure 3.12 illustrates the geometry related to the mapping of a pinhole camera.
Figure 3.12 – The geometry of a pinhole camera
A point R lies at the intersection of the optical axis and the image plane; it is referred to as the
principal point or image center.
A point P somewhere in the world, at coordinates (x1, x2, x3), represents an object in the real
world captured by the camera.
The projection of point P onto the image plane is denoted Q. This point is given by the
intersection of the projection line (green) and the image plane. Figure 3.13 shows the same
pinhole camera geometry viewed from the X2 axis, which better demonstrates how the model
works in practice.
Figure 3.13 - The geometry of a pinhole camera as seen from the X2 axis
I apply this model to our project in the first step of computing the depth map. It is very useful
when calculating the camera parameters and the partially occluded parts of the 3D scene. I will
describe it in detail in a later chapter.
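As a small illustration of the model (a minimal sketch under the ideal pinhole assumption, using the non-inverted frontal image plane convention; the function name and numbers are hypothetical), a camera-space point (x1, x2, x3) projects to image coordinates (f·x1/x3, f·x2/x3):

```python
def pinhole_project(point3d, focal_length):
    """Ideal pinhole projection: a camera-space point (x1, x2, x3), with x3 the
    distance along the optical axis, maps to image-plane coordinates
    (y1, y2) = (f * x1 / x3, f * x2 / x3), as in Figures 3.12 and 3.13."""
    x1, x2, x3 = point3d
    if x3 <= 0:
        raise ValueError("point must lie in front of the camera")
    return (focal_length * x1 / x3, focal_length * x2 / x3)

# A point twice as far away projects half as far from the image center.
print(pinhole_project((1.0, 0.5, 2.0), focal_length=0.035))   # (0.0175, 0.00875)
print(pinhole_project((1.0, 0.5, 4.0), focal_length=0.035))   # (0.00875, 0.004375)
```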
3.5.1.2 Camera Aperture
In optics, an aperture is a hole or an opening through which light travels. In photography in
particular, aperture refers to the diameter or shape of the opening in the camera lens. A camera
can have a large or small aperture to control the amount of light reaching the film or image
sensor, and the aperture can also have different shapes to control the shape of the rays passing
through it. Combined with the shutter speed, the aperture size regulates the film's or image
sensor's degree of exposure to light. Typically, a fast shutter speed requires a larger aperture to
ensure sufficient light exposure, and a slow shutter speed requires a smaller aperture to avoid
excessive exposure. Figure 3.14 shows two different sizes of a given camera aperture.
Figure 3.14 - A large (1) and a small (2) aperture
The lens aperture is usually specified as an f-number, the ratio of focal length to effective
aperture diameter. A lower f-number denotes a greater aperture opening which allows more light
to reach the film or image sensor. Figure 3.15 illustrates some standard aperture sizes. For
convenience, I will use this “f / f-number” form to represent the size of aperture.
Figure 3.15 - Diagram of decreasing aperture sizes (increasing f-numbers)
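As a worked example of this definition: since N = f / D, a 50 mm lens set to f/2 has an effective aperture diameter of D = 50/2 = 25 mm, while stopping the same lens down to f/8 shrinks the diameter to 6.25 mm and admits roughly one sixteenth of the light (light gathering scales with the aperture area, i.e. with D²).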
I will not say more about the camera aperture here. It has a strong relationship with photography
effects such as depth of field, and it will be discussed again in detail in section 3.5.2 together with
the practical techniques.
3.5.1.3 Circle of Confusion (CoC)
In photography, the circle of confusion is also used to determine the depth of field. It defines
how much a point needs to be blurred in order to be perceived as unsharp by the human eye.
When the circle of confusion becomes perceptible to the human eye, we say that this area is
outside the depth of field and therefore no longer "acceptably sharp" under the definition of DOF.
Figures 3.16 and 3.17 illustrate how the circle of confusion relates to depth of field.
Figure 3.16 – The range of circle of confusion
Figure 3.17 – Illustration of circle of confusion and depth of field
Again, the relationship between circle of confusion and depth of field will be further described in
the later sections.
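As a rough guide (a standard thin-lens approximation, not derived in this thesis): when a lens of focal length f and aperture diameter A is focused at distance S1, a point at distance S2 produces a blur circle on the sensor of diameter roughly

c = A · f/(S1 − f) · |S2 − S1| / S2

The point counts as inside the depth of field as long as c stays below the chosen circle-of-confusion limit (a commonly quoted criterion for a full-frame sensor is about 0.03 mm).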
3.5.2 Effects and Processing in Photography
3.5.2.1 Depth of Field
In optics, particularly in photography, depth of field (DOF) is the range of distance within which
the subject appears acceptably sharp in the image. Figure 3.18 depicts the depth of field in a
real-world scene.
Figure 3.18 - The area within the depth of field appears sharp, while the areas in front of and behind the depth of
field appear blurry
In general, depth of field does not abruptly change from sharp to unsharp, but instead appears as
a gradual transition (Figure 3.19).
Figure 3.19 – An image with very shallow depth of field, which appears as a gradual transition (from blurry to sharp,
then to blurry again)
From the introduction above, it is obvious that if we prefer a photo with objects sharp
everywhere, we should enlarge the depth of field, while we should shorten the depth of field if
we want to emphasize one object and blur the rest.
The DOF is determined by the camera-to-subject distance, the lens focal length, the lens
f-number, and the format size or circle of confusion criterion. The camera aperture (lens f-number)
and the lens focal length are the two main factors that determine how big the depth of field will
be. For a given focal length, increasing the aperture diameter decreases the depth of field. For a
given lens f-number, using a lens of greater focal length also decreases the depth of field.
Figure 3.20 shows shots taken with two different aperture sizes. In the left image, the background
competes for the viewer's attention: even when our eyes focus on the flower in the foreground,
we cannot ignore the background scene. In the right image, however, the flowers are isolated
from the background.
Figure 3.20 – Left: f/32 - narrow aperture and slow shutter speed; Right: f/5.6 - wide aperture and fast shutter speed
For a pinhole camera or a point-and-shoot camera, the diameter of aperture is usually small,
which results in sharp scenes everywhere in the photo. The goal of our project is to use several
sharp images to create a relatively shallow depth of field effect in the resultant image.
3.5.2.2 Bokeh
In photography, bokeh is the blur, or the aesthetic quality of the blur, in the out-of-focus areas of
an image, or the way the lens renders out-of-focus points of light. Different lens aberrations and
camera aperture shapes cause different bokeh. It is hard to say whether the bokeh in a certain
photo is good or bad; it depends on how the lens design blurs the image. If the bokeh pleases our
eyes, it is said to be good, while unpleasant or distracting blur can be considered bad bokeh.
Photographers with larger camera apertures sometimes use a shallow depth of field to shoot
photos with prominent out-of-focus regions, which easily separates the foreground objects from
the background scene. The examples in Figure 3.21 illustrate different bokeh effects.
Figure 3.21 – Different effects of bokeh
Bokeh is often most visible around small background highlights, such as specular reflections and
light sources, which is why it is often associated with such areas [9]. However, bokeh is not
limited to highlights, as blur occurs in all out-of-focus regions of the image.
Bokeh has a strong relationship with the shape of camera aperture. The shape of the aperture has
a great influence on the subjective quality of bokeh. Actually in the out-of-focus blurry regions,
we can clearly observe the aperture shape (see Figure 3.22 [15]).
Figure 3.22 – Different shapes of aperture cause different effects of bokeh
In our project as well, if we change the shape of the simulated aperture, the shape of the small
background highlights in the blurry regions changes accordingly.
3.5.2.3 Defocusing and Refocusing
In optics, and especially in photography, defocus simply means out of focus. Generally, defocus
reduces the sharpness of the whole image: in a defocused image, sharp edges become gradual
transitions and fine detail in the scene is blurred or can no longer be seen clearly by the viewer.
To generate the shallow depth of field of a lens with a larger aperture, which is similar to the goal
of our project, one can increase the defocused parts of the image using a suitable technique, such
as the one in the paper "Defocus Magnification" [24].
Refocusing by definition means to focus again, or to change the emphasis in a given image.
Refocus imaging is a way of redefining photography: the user can select one region in the image
that they would like to emphasize, and the other parts are then naturally blurred. This topic is not
new in some areas, where researchers implement refocusing by modifying certain imaging
hardware in conventional cameras. The main goal of our project is refocusing from multiple
images, which is similar to the final objective of some existing companies and techniques. The
difference, however, is that we only develop software instead of adjusting the camera hardware.
Using software is, to some degree, easier to accept for users who may not have much
professional knowledge of photography.
3.6 Defocus Magnification [24]
3.6.1 Overview and Problem Formulation
Photographers often prefer a blurry background caused by a shallow depth of field in order to
emphasize the foreground object they want, for example in portraits. Unfortunately, common
point-and-shoot cameras limit the amount of defocus that can occur because of their small lens
apertures. Defocus magnification is an image-processing technique which simulates the shallow
depth of field of a lens with a larger aperture by magnifying the defocused regions in an image.
This technique takes a single input image with a large depth of field and then increases the degree
of blur in the out-of-focus regions. The authors first estimate the spatially varying amount of blur
over the image by estimating the size of the blur kernel at edges, and then propagate this blur
measure over the areas of the image that need further blurring. Based on the estimated amount of
blur, they generate a defocus map, which acts much like a depth map. They propagate the blur
measure to neighbors with similar color value under the assumption that blurriness changes
smoothly over the whole image except in regions where the color intensity is not continuous.
Finally, they magnify the defocus effects in the image using the produced defocus map. According
to the defocus map, the authors rely on the lens blur filter of Photoshop to compute the defocused
output.
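To make the magnification step concrete, here is a minimal sketch (not the authors' implementation, which hands its defocus map to Photoshop's lens blur filter): it assumes a grayscale image and a per-pixel defocus map, quantizes the map into a few levels, and replaces each pixel with a Gaussian-blurred version whose strength grows with the amplified defocus. All names and the `gain` parameter are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def magnify_defocus(image, defocus_map, gain=2.0, levels=4):
    """Toy defocus magnification: quantize the per-pixel defocus estimate into a few
    levels, pre-blur the image once per level with a Gaussian whose sigma grows with
    the amplified defocus, then pick each output pixel from the version matching its
    level. Sharp regions (level 0) are copied unchanged."""
    norm = defocus_map / (defocus_map.max() + 1e-8)            # 0 = sharp, 1 = most blurred
    idx = np.minimum((norm * levels).astype(int), levels - 1)  # per-pixel blur level
    out = image.astype(float).copy()
    for level in range(1, levels):
        sigma = gain * level                                   # amplified blur for this level
        blurred = gaussian_filter(image.astype(float), sigma)
        out[idx == level] = blurred[idx == level]
    return out
```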
3.6.2 Results and Conclusion
Figure 3.23 shows one significant result from the “Defocus Magnification” paper.
Figure 3.23 – Using the defocus map (b), this method can synthesize refocusing effects. Here (a) is the input image.
(b) is the produced defocus map. The result (c) looks as if the foreground is focused.
This example uses the defocus map to synthesize a refocusing effect. The result simply looks as
if the man in the foreground is focused. Actually, the man himself is unchanged between the
input image and the refocusing result; they magnify the blurriness of the background to make the
foreground object appear sharper. In other words, it is not a real refocusing process, since the
given input image already contains some regions that are blurred. However, the blurry areas of
the input image are necessary, as this approach has to estimate a defocus map from the degree of
blur at edges.
Figure 3.24 is another set of results from this paper. It is obvious that what they do is to increase
the defocus part which has been already blurred in order to present a more realistic or artistic
impression to viewers (emphasize the major object in foreground), and keep the sharp part
unchanged.
For future work, this approach can be extended to video inputs. Besides, the partial occlusion
problem should also be studied, which is a traditional issue for depth of field effects.
Figure 3.24 – Other results. From left to right: the original images, their defocus maps, and results blurred using the
magnification approach
3.7 Refocus Imaging [21]
Refocus Imaging, Inc. is an early-stage company headquartered in Mountain View, California, working in the area of computational photography. There is little information online about how they implement their refocusing method. What we do know is that they develop a special lens and capture the entire "light field" entering a given camera. The main idea of the refocus imaging project is to use light field photography, which requires a "simple optical modification to existing digital cameras". This type of camera is called a "4D light field camera"; the new plenoptic camera can essentially turn a 2D photo into 4D. In this thesis I will not elaborate on the 4D light field, but only present some results from the company's website (Figure 3.25). The quality of the results turns out to be good. Compared to this, the basic principle of our implementation is simpler to understand, and the method we develop is more convenient and accessible for ordinary people without much professional photography knowledge.
Figure 3.25 – Refocusing on three different layers. The first image focuses on the red lady layer, the second on the yellow lady layer, and the third on the blue lady layer
3.8 Adobe Photoshop
In Adobe Photoshop, there is a filter called lens blur. Its input is one original image and a given depth map. Using the depth map together with options such as aperture shape, specular highlights and blur focal distance, the user can finally obtain a refocusing result. It is not difficult to see the principle behind it: once we hold a depth map, we fully understand the relationships among all the depth values in the image, and the work Photoshop needs to do is the blurring step according to the focal distance. Therefore, whether the result is good mostly depends on the quality of the input depth map.
Besides, the compulsory depth-map input is inconvenient for users. Most of the time people do not have a depth map at hand, and sometimes they do not even know what a depth map is. This limits ordinary people's ability to refocus a photo through Photoshop themselves. Figure 3.26 shows several results produced by the lens blur filter in Photoshop. Note that the depth map we use here is the ground truth of the original image; in practice, there is no way for a photographer to obtain a ground-truth depth map directly. I will compare the results of Photoshop to those of our project using the same original images and depth map, where the depth map is generated by our program instead of the ground truth from the website.
Figure 3.26 – Results from Adobe Photoshop. The first row shows the input image, which is sharp everywhere, and the given ground-truth depth map. The second row shows three different refocusing results. From left to right: focus far from the camera, focus on the middle of the scene, and focus on the front of the image
Chapter 4
Refocusing from Multiple Images
In computer vision there are a large number of methods for producing a depth map. However, some of those approaches only accept two images (left and right) as input to perform the stereo matching procedure. With only two input images, the information about partial occlusion is lost, because the user cannot provide enough data to reconstruct the distinct layers of the scene (or the whole 3D scene). Another family of depth-map computation reconstructs a 3D scene from multiple images. In general, 3D scene reconstruction needs a large set of similar images (i.e. the same real scene taken from different viewpoints). Those photos can be shot from widely different viewpoints, as in the example in [16]. The output is a full view of the 3D scene in the real world, from which we can naturally obtain accurate depth values for each object.
Although the intermediate result of our project is also to obtain depth values from multiple images, preparing a large dataset to reconstruct a whole 3D scene is excessive for our purposes. We only require five to ten input photos, and obtaining a dense depth map with roughly separated layers is enough.
Once we acquire the depth map from the given input images, the next step of our project is to refocus a reference image with the help of the depth map produced before. The main task of refocusing in our project is blurring the user-assigned regions of the reference photo. We choose the common blurring approach – convolving the sharp image with a blur kernel – to obtain the final blurred result. Some additional post-processing work has to be done as well, such as handling object boundaries, alpha blending and bokeh.
In photography, lens aperture refers to the size of the opening in the lens of the camera through which light can pass. By adjusting the size of the aperture, the photographer can ensure that the correct amount of light reaches the digital sensor during any given exposure. Here is a simple example to illustrate the relationship between aperture and depth of field. Picture this: we are taking a portrait photo and focus the lens on the subject's face, with a tree behind him. If we set the lens at a large aperture (small f-number), the tree behind the subject will not be in focus (it is blurred). In contrast, if we use a small aperture (large f-number), the tree will be in focus (sharp). Image refocusing takes advantage of this behavior. By changing how much of the photograph is in focus, users can control exactly which details show up and which do not. This also allows them to lead the viewer's eye wherever they wish.
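The effect of the f-number on background blur can be quantified with the standard thin-lens approximation for the blur-circle (circle of confusion) diameter; the formula and numbers below are a textbook illustration, not measurements from our project.

```python
def blur_circle_diameter(f_mm, f_number, focus_dist_mm, subject_dist_mm):
    """Approximate blur-circle diameter on the sensor (mm) for a point at
    subject_dist_mm when a lens of focal length f_mm is focused at
    focus_dist_mm: c = A * |S2 - S1| / S2 * f / (S1 - f), with A = f / N."""
    A = f_mm / f_number                    # aperture diameter
    S1, S2 = focus_dist_mm, subject_dist_mm
    return A * abs(S2 - S1) / S2 * f_mm / (S1 - f_mm)

# Portrait example: subject at 2 m, tree at 7 m, 50 mm lens.
for N in (1.8, 4.0, 16.0):
    c = blur_circle_diameter(50.0, N, 2000.0, 7000.0)
    print(f"f/{N}: blur circle of about {c:.3f} mm")   # larger aperture -> more blur
```

At f/1.8 the tree's blur circle is almost nine times larger than at f/16, which is exactly the shallow depth of field that our multi-image simulation tries to reproduce.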
Our main task is to achieve the effect of shallow depth of field from multiple small-aperture photos taken from slightly different viewpoints. The multiple photos together simulate a larger camera aperture. The first and most critical step in this project is to compute the depth map of the objects in the photo. After that we can refocus the objects according to their respective depths.
4.1 Data Set
The assumptions of our project are as follows:
1. There are n given calibrated images, including the camera intrinsic and extrinsic parameters (camera positions, camera rotation and translation matrices, etc.).
2. The rotation and translation among all the photos should not be large. Our goal is to simulate a larger camera aperture, so slight rotation and translation is enough.
3. All the photos are taken by one given point-and-shoot camera, which means the focal length remains the same. What we need to consider is the camera movement during the shooting period.
4. All the photos should be taken with a small aperture (the aperture of a point-and-shoot camera), i.e. they are all sharp enough.
For the first assumption, how to compute the intrinsic and extrinsic camera parameters is introduced in the next section. Points 2–4 describe the limitations of our dataset: if we use only five to ten images shot from widely different viewpoints, we may lose too much information to produce a dense depth map successfully. Figures 4.1–4.3 show several sets of our original input photos.
Figure 4.1 – Flower garden set. They are extracted from a flower garden video
Figure 4.2 – Rocks set. This set comes from the Middlebury website [5]
Figure 4.3 – Desk set. This set is taken from our own point-and-shoot camera
4.2 Computation of Camera Parameters
According to the first assumption in the previous section, the camera intrinsic and extrinsic parameters (camera positions, camera rotation and translation matrices, etc.) should be provided in order to give enough information to the depth-map computation procedure.
We use a popular piece of software, the Voodoo Camera Tracker [31]. It is a tool for integrating virtual and real scenes: it can estimate camera parameters and reconstruct a 3D scene from image sequences. Here we only need the camera parameters rather than a reconstructed 3D scene; the reason is described later on.
The estimation method of Voodoo Tracker consists of the following five processing steps [31]:
Automatic detection of feature points
Automatic correspondence analysis
Outlier elimination
Robust incremental estimation of the camera parameters
Final refinement of the camera parameters
These are the usual fundamental steps for estimating camera parameters. We use Voodoo as a convenient tool to obtain the parameters quickly, without computing them step by step ourselves.
The parameters of each input image obtained from Voodoo and used in the following problem formulation are:
Camera position
Focal length
Intrinsic matrix, including radial distortion, pixel size, focal length and horizontal field of
view
Projection of 3D coordinates in the camera image (computed from rotation
matrix/extrinsic matrix, camera position and 3D coordinates [mm])
3D feature points
We can obtain a set of 3D feature points, which are real points in the 3D scene. However, the number of feature points is far from sufficient for generating a dense depth map; they can only reconstruct the outline of a 3D scene as a sparse point distribution. As Figure 4.4 shows, the 3D model gives us a rough picture of what the real world looks like, but it still lacks the information needed to build a dense map.
Figure 4.4 – 3D model showing the feature points, observed from two different angles
It is obvious that the feature points are too scattered to provide enough information for further computation. In other words, if we want to obtain a dense depth map from such information, we would need to connect these sparse points under certain geometrical rules or algorithms, or perform triangulation on them. That would be another kind of estimation task, with its own uncertainty and computational workload. Instead, we use other methods, combined with the camera parameters above, to compute the dense depth map.
4.3 Estimation of Depth Value in 3D Scene
We need to divide the entire 3D scene into several layers according to the distribution of objects in the real world. For example, the left image in Figure 4.5 can be roughly separated into three layers: box, baby and map. This is a simpler case. In the right image of the cloth, we can still divide the scene into front, middle and back layers. Although an object such as the baby in the left picture is not really a plane, we simplify it in this way in our project, because people prefer to view an object as a single unit rather than have a human being or an integrated object partitioned into several parts.
Figure 4.5 – Two examples for illustrating depth value in real world
We have already obtained the depth value (z-value) of the feature points from Voodoo. Figure 4.6 shows the z-values of hundreds of feature points extracted from two image sequences, where the y-axis is the depth value of each feature point in the 3D scene.
Figure 4.6 – Z-value distribution of 541 feature points from two different sequences
By observing the distribution charts of many input image sequences, we conclude that most z-value distributions follow the pattern shown in Figure 4.6, except for a few points caused by erroneous detections, and the distribution can be approximated as roughly linear. Therefore, for depth estimation, we first decide how many discrete layers are appropriate for the real scene. Secondly, we group the z-values into distinct classes according to the number of discrete layers, so that within each class the z-values are close to each other, and then we average the z-values of each group. For instance, in Figure 4.6, the left chart can be divided into 7–10 layers, while for the right one 3 layers are enough. The output of this step is a set of discrete values representing the layers of the 3D scene in the real world.
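The thesis does not prescribe a specific grouping algorithm, so the sketch below uses a simple one-dimensional k-means as one possible way to realize this "group and average" step; all names are illustrative.

```python
import numpy as np

def depth_layers(z_values, num_layers, iters=20):
    """Group feature-point z-values into num_layers discrete depth labels and
    return the averaged depth of each layer (simple 1-D k-means)."""
    z = np.sort(np.asarray(z_values, dtype=float))
    # Initialise the layer centres evenly over the observed depth range.
    centres = np.linspace(z.min(), z.max(), num_layers)
    for _ in range(iters):
        # Assign every z-value to its nearest layer centre ...
        labels = np.argmin(np.abs(z[:, None] - centres[None, :]), axis=1)
        # ... then move each centre to the average of its group.
        for k in range(num_layers):
            if np.any(labels == k):
                centres[k] = z[labels == k].mean()
    return np.sort(centres)

# e.g. a scene with three dominant depths, as in the right chart of Figure 4.6:
zs = np.concatenate([np.random.normal(m, 0.05, 150) for m in (1.0, 1.6, 2.3)])
print(depth_layers(zs, 3))   # roughly [1.0, 1.6, 2.3]
```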
4.4 Problem Formulation and Graph Cuts
4.4.1 Overview of Problem Formulation
Our current goal is to compute a depth map from multiple images based on the given and computed parameters above. [33] provides a good framework for solving such problems; in our case, we combine the problem formulation of that paper with our own ideas in order to reach the final goal – a layer depth map.
Suppose we have n given calibrated images of the same scene taken from slightly different viewpoints (the data set was shown in Section 4.1). From the calibrated images we can obtain enough information, such as the camera intrinsic and extrinsic parameters (Section 4.2), which are extremely useful for finding corresponding pixels among the n images. Let $\mathcal{P}_i$ be the set of pixels in camera i, and $\mathcal{P} = \bigcup_i \mathcal{P}_i$ be the set of all pixels. Our goal is to find the depth of every pixel (a dense depth map). Thus, we want to find a labeling $f : \mathcal{P} \rightarrow L$, where L is a discrete set of labels corresponding to different depths in the real world (Section 4.3). Figure 4.7 shows a simple case of this scene construction.
Figure 4.7 – Example of interactions between camera pixels and 3D scene.
For a certain pixel p in one image, we first project it into the 3D scene using the camera parameter matrices (from 2D to 3D). The pixel p corresponds to a ray in 3D space; in Figure 4.7, C1p represents a ray intersecting the scene. We store the intersection values together with the depth labels for the subsequent energy minimization algorithm. The discrete depth labels have already been provided in the previous section.

Further suppose that the intersection point of C1p and depth label l is t1, with coordinates (tx, ty, l). The next step is to project t1 back into the other cameras. Since each camera has a corresponding projection matrix, for the 3D point t1 we obtain corresponding points t2', t3', ..., tn' in the other cameras (n is the number of cameras). We then compare the intensity differences of the pairs (t1, t2'), (t1, t3'), ..., (t1, tn'). The smallest difference and the corresponding camera index are stored for the graph cuts algorithm.
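A minimal sketch of this projection step is given below. It assumes each camera is described by intrinsics K and extrinsics (R, t) with the pinhole model x ~ K(RX + t), that the depth labels are planes z = l in world coordinates, and that camera 0 is the reference view; the helper names are ours, not part of any library.

```python
import numpy as np

def backproject_to_depth(K, R, t, u, v, depth_l):
    """Intersect the viewing ray of pixel (u, v) with the plane z = depth_l
    in world coordinates (pinhole model x ~ K (R X + t))."""
    C = -R.T @ t                                        # camera centre in world coords
    d = R.T @ np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction in world coords
    s = (depth_l - C[2]) / d[2]                         # ray parameter at the plane
    return C + s * d                                    # the 3D point t1

def project(K, R, t, X):
    """Project a world point into pixel coordinates of a camera."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def best_photo_consistency(u, v, depth_l, cams, imgs):
    """Smallest intensity difference (and camera index) between the reference
    pixel (u, v) and its reprojections at depth label depth_l.
    cams[i] = (K, R, t); imgs[i] is a grayscale image indexed as img[row, col]."""
    K0, R0, t0 = cams[0]
    X = backproject_to_depth(K0, R0, t0, u, v, depth_l)
    best_diff, best_cam = np.inf, -1
    for i in range(1, len(cams)):
        K, R, t = cams[i]
        uu, vv = np.round(project(K, R, t, X)).astype(int)
        if 0 <= vv < imgs[i].shape[0] and 0 <= uu < imgs[i].shape[1]:
            diff = abs(float(imgs[0][v, u]) - float(imgs[i][vv, uu]))
            if diff < best_diff:
                best_diff, best_cam = diff, i
    return best_diff, best_cam
```

Running this for every reference pixel and every depth label produces the per-label matching costs that feed the data term of the energy function below.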
We define an energy function to be minimized, which has the standard form

$$E(f) = E_{\text{data}}(f) + E_{\text{smooth}}(f)$$

Then we can use graph cuts to find an optimal labeling of this function.

The data term imposes photo-consistency:

$$E_{\text{data}}(f) = \sum_{\{\langle p, f(p)\rangle,\, \langle q, f(q)\rangle\} \in \mathcal{I}} D(p, q)$$

where D(p, q) measures the intensity difference between the corresponding pixels. Here $\mathcal{I}$ consists of all pairs of 3D points having the same depth, i.e. if $\{\langle p, f(p)\rangle, \langle q, f(q)\rangle\} \in \mathcal{I}$ then f(p) = f(q).

The smoothness term can be written as

$$E_{\text{smooth}}(f) = \sum_{\{p, q\} \in \mathcal{N}} V_{\{p,q\}}\bigl(f(p), f(q)\bigr)$$

The term $V_{\{p,q\}}$ is required to be a metric. The smoothness term here is the same as in the two-view stereo problem.
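The thesis does not state which metric V is used; one common choice that satisfies the metric requirement (and therefore allows the expansion-move construction of [39]) is the Potts model, shown here only as an example:

$$V_{\{p,q\}}\bigl(f(p), f(q)\bigr) = \lambda \cdot T\bigl(f(p) \neq f(q)\bigr), \qquad T(\text{true}) = 1,\ T(\text{false}) = 0$$

It charges a constant penalty λ whenever two neighboring pixels receive different depth labels, which encourages the piecewise-constant (layered) depth maps our project aims for.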
4.4.2 Results of Single Depth Map
The remaining job of this step is to assign a suitable depth label to each pixel of the reference image using the graph cuts algorithm; the fundamental theory and implementation of graph cuts have been discussed before. Figure 4.8 shows some results from our program. The object boundaries and depth information are computed relatively accurately.
Figure 4.8 – Results on flower garden and rocks dataset.
4.4.3 Considerations and Difficulties
One difficulty in computing an accurate depth value for each pixel of the reference image is obtaining a proper set of test data. All the captured images should satisfy the assumptions described in Section 4.1: because we are simulating a larger aperture than the small one of point-and-shoot cameras, the rotation and translation among the photos should not be too large.

Finding corresponding pixels among the n calibrated images is also a critical task in this project. At first I tried to detect corresponding points using the SIFT algorithm. This method is able to find accurate corresponding points, but it only yields sparse points. For extracting a depth map in this project, dense-point correspondence is extremely important: for every pixel in the reference image, we must compute the corresponding pixel in each matching image. Therefore, camera information such as the camera position and translation matrix has to be very accurate for calculating the intersection of an image pixel's ray with the corresponding 3D point.

Finally, we need an approximation algorithm based on graph cuts that finds a strong local minimum. The traditional energy function has two terms – a data term and a smoothness term. It has been proved that our energy function can be minimized by graph cuts, since it satisfies the conditions defined in [39].
4.5 Layer Depth Map
The depth maps in Figure 4.8 only represent the depth values seen from the single view of one particular reference image. A single-view depth map can hardly deal with the partial occlusion problem, and one of the reasons we use multiple images is to handle occluded parts that cannot be seen in the reference image but appear in other cameras. For example, in the flower garden sequence, there is actually something behind the tree that can be seen from another viewpoint (Figure 4.9).
Figure 4.9 – Occlusion problem. Left: the reference image, in which we cannot see the occluded parts behind the tree. Right: viewed from another viewpoint, the occluded region behind the tree becomes visible (inside the red outline)
To solve the occlusion problem, we still divide the whole scene into several discrete layers. For
the flower garden example in Figure 4.8, the scene is separated into 8 layers as in Figure 4.10.
Figure 4.10 – Eight layers of the scene extracted from flower garden referenced image
Actually, a single-view depth map is only a simple division along object boundaries; because multiple cameras are available, we can gather more information about each layer. The principle is that for a certain 3D point in the real scene, we can determine whether it appears in each input image. The details of the method are as follows (processed layer by layer); a simplified sketch is given after Figure 4.11:
1. Extend the non-zero regions of each layer following the original shape (Figure 4.11).
2. For a pixel p in the extended area of the reference image, project it into the 3D scene to obtain a 3D coordinate p_3d.
3. Project p_3d back into each matching image by applying the corresponding projection matrix, producing a sequence of corresponding points (q2, q3, ..., qn).
4. For each point in the new sequence, compare the intensity difference between the corresponding point and the nearest point of the original non-zero region in the reference image. If the difference is smaller than a threshold, select the corresponding point as a candidate, i.e. this point may be occluded.
5. For each candidate point, check whether it lies behind the previous layers. If yes, it belongs to the occluded part; if not, it is discarded.
Figure 4.11 – Extension of each layer to be prepared for further estimation
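A simplified sketch of steps 2–5, reusing the backproject_to_depth and project helpers from the earlier sketch, is shown below. The masks, the intensity threshold and the convention that a larger depth value means farther from the camera are illustrative assumptions, not exact details of our implementation.

```python
import numpy as np

def recover_occluded(layer_mask, extended_mask, layer_depth, front_depth,
                     cams, imgs, thresh=10.0):
    """Mark pixels of one layer's extended region that are likely occluded.

    layer_mask    : bool image, the layer's original non-zero region
    extended_mask : bool image, the extended region around it (step 1)
    layer_depth   : depth label of this layer
    front_depth   : per-pixel depth of the layers already in front (0 where none)
    """
    occluded = np.zeros_like(layer_mask)
    by, bx = np.nonzero(layer_mask)                  # pixels of the original region
    K0, R0, t0 = cams[0]                             # reference camera
    for v, u in zip(*np.nonzero(extended_mask & ~layer_mask)):
        # Steps 2-3: lift the pixel to 3D at this layer's depth and reproject.
        X = backproject_to_depth(K0, R0, t0, u, v, layer_depth)
        # Intensity of the nearest pixel inside the original region.
        j = np.argmin((by - v) ** 2 + (bx - u) ** 2)
        ref_val = float(imgs[0][by[j], bx[j]])
        candidate = False
        for i in range(1, len(cams)):                # step 4: matching images
            K, R, t = cams[i]
            uu, vv = np.round(project(K, R, t, X)).astype(int)
            if 0 <= vv < imgs[i].shape[0] and 0 <= uu < imgs[i].shape[1]:
                if abs(float(imgs[i][vv, uu]) - ref_val) < thresh:
                    candidate = True
        # Step 5: keep the pixel only if it lies behind the front layers
        # (assuming a larger depth value means farther from the camera).
        if candidate and layer_depth > front_depth[v, u]:
            occluded[v, u] = True
    return occluded
```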
After the procedure above, we can compute the occluded parts of each layer (Figure 4.12). Thus, when we combine all the layers they may overlap each other, whereas there is no overlapping in the single-view depth map. Note that even though some points are still wrongly detected, they can be ignored in the blurring step; I will describe this later on.
Figure 4.12 – Final layer depth map. Note that there is no occluded part in the first layer
4.6 Layer Blurring
We have already obtained the layer depth map. The current work is to blur each layer according to the user's selection and its depth value. Before performing the blurring, one important point is which type of blur kernel is used and how it works during the blurring process. When a kernel is applied, the neighbors of the center pixel also contribute to the center pixel's value. In each layer map, e.g. in Figure 4.12, the pixels in the black regions have no intensity. In other words, if we blur the boundary of a layer we may wrongly mix in the black points and obtain incorrect results, because there is a sharp fall-off in intensity along the boundary (Figure 4.13).
Figure 4.13 – Incorrect results. The boundary of layers is obvious
To solve the black-point problem, we choose a simple method to estimate values for the neighboring black regions. Figure 4.14 shows an example of the intensities of boundary pixels. We try to fill the zero-region with values close to the boundary pixels. There are several choices for this simple estimation.
Symmetric: Neighbor pixel values outside the bounds of the array are computed by
mirror-reflecting the array across the array border.
Replicate: Neighbor pixel values outside the bounds of the array are assumed to equal the
nearest array border value.
Circular: Neighbor pixel values outside the bounds of the array are computed by
implicitly assuming the input array is periodic.
Figure 4.14 – Example of intensity of boundary pixels
We decide to replicate the nearest border value into the zero-region. In this way, the error in computing blurred boundary pixels is essentially eliminated.
Following this simple replication method, we perform the blurring on each layer. The user can choose different kernels, such as a box filter, disk filter or Gaussian filter, to simulate different shapes of the lens aperture.
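A minimal sketch of this step is shown below. It assumes the layer is stored as a full-size image that is zero outside its mask; the empty region is first filled with the nearest in-layer value (the "replicate" idea above, here implemented with a distance transform), and a disk kernel stands in for a circular aperture. Names and parameters are illustrative.

```python
import numpy as np
from scipy import ndimage

def disk_kernel(radius):
    """Disk-shaped kernel, a rough stand-in for a circular lens aperture."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    k = (x * x + y * y <= radius * radius).astype(float)
    return k / k.sum()

def blur_layer(layer, mask, radius):
    """Blur one layer without mixing the black (empty) pixels into its boundary.

    layer : float image, zero outside the layer
    mask  : bool image, True inside the layer's non-zero region
    """
    # Replicate the nearest in-layer value into the empty region, so the kernel
    # never averages real intensities with black pixels near the boundary.
    _, (iy, ix) = ndimage.distance_transform_edt(~mask, return_indices=True)
    filled = layer[iy, ix]
    return ndimage.convolve(filled, disk_kernel(radius), mode='nearest')
```

Swapping disk_kernel for a box (np.ones) or a Gaussian kernel changes the simulated aperture shape, which is how the different blur styles in Section 4.8.3 are obtained.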
4.7 Combining Blurry Layers
Given the blurred layers from the previous sections, the final step is to combine them all to achieve our ultimate goal – image refocusing. Since two layers may overlap, an approach is needed to deal with this problem. Alpha blending is one of the classical methods for handling the overlap of foreground and background.

Actually, in our project there is no need to use alpha blending for every background–foreground pair. Two cases call for this blending algorithm: one is when we would like to sharpen the objects in the background while blurring the foreground; the other is when there is a very bright light source in the background. In the first case, when an object far from the camera is sharpened, we can see blurred boundaries around the foreground objects in the overlapped regions – the so-called transparent effect around the boundary (Figure 4.15). In other words, the border of the foreground object is not completely sharp; it turns out to be a little blurry. Therefore, we use the alpha blending algorithm to implement this partially transparent effect.
Figure 4.15 – Examples of transparent effect around the boundary of the bottle. The right image comes from
www.refocusimaging.com [21].
Alpha blending is a convex combination of two colors that allows for transparency effects in computer graphics. The alpha value ranges from 0.0 to 1.0, where 0.0 represents a fully transparent color and 1.0 a fully opaque color. The resulting color when a foreground color fg with an alpha value of α is laid over an opaque background of color bg is given by:
Result = (1 − α) · bg + α · fg
The alpha component may be used to blend the red, green and blue components equally, as in 32-bit RGBA, which is important for transparency effects in color images.
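A sketch of this compositing step in code; the soft matte in the comment is one simple way to obtain the partially transparent boundary and is only an illustration, not the exact matte we compute.

```python
import numpy as np
from scipy import ndimage

def alpha_blend(fg, bg, alpha):
    """Composite a foreground layer over an opaque background.
    fg, bg : float RGB images in [0, 1]
    alpha  : per-pixel matte in [0, 1]; 1.0 means fully opaque foreground."""
    a = alpha[..., None]               # broadcast the matte over the RGB channels
    return (1.0 - a) * bg + a * fg

# A soft matte can be obtained by feathering the binary layer mask, so the
# foreground boundary fades smoothly into the blurred background, e.g.:
# alpha = ndimage.gaussian_filter(layer_mask.astype(float), sigma=2.0)
```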
Figure 4.16 shows a result on one of our test images. In the left one the foreground object is sharpened, while in the right one the background object is sharpened. We handle the overlap and the object border by applying alpha blending. In the right image we can clearly see that the boundary of the cloth is not completely sharp, which looks more realistic. In the left picture, the boundaries of the small foreground objects remain sharp without any blending.
Figure 4.16 – One result of layer combination. Left - the foreground is focused; Right – the background is focused.
4.8 Results and Comparison
In this section, I show several sets of our experimental results. Each set includes the original image, the single depth map, and 2–4 results focused on different layers. We compare our results to those of Adobe Photoshop using the same depth map produced by our project. In addition, a series of resulting images with different shapes of lens aperture (different blur kernels) is presented as well.
4.8.1 Experimental Results
1st set (Figure 4.17):
Figure 4.17 – Result of the flower garden sequence: original image, depth map, and refocused results (front, middle, back)
2nd set (Figure 4.18):
Figure 4.18 – Result of the rocks sequence: original image, depth map, and refocused results (front, middle, back)
3rd set (Figure 4.19):
Figure 4.19 – Result of the gifts sequence: original image, depth map, and refocused results (front, middle, back)
4th set (Figure 4.20):
Figure 4.20 – Result of the cap sequence: refocused results (front, middle, back)
5th set (Figure 4.21):
Figure 4.21 – Result of the book sequence: original image, depth map, and refocused results (front, middle, back)
4.8.2 Comparison to Adobe Photoshop
In Photoshop, the lens blur filter can achieve the same refocusing goal as our project. Its input is one original image and one given depth map. For a fair comparison, we use the same single depth map that we computed ourselves as the common input for both Photoshop and our project. All the depth maps have already been shown in Section 4.8.1, so here we only present the results of the two methods.
Figure 4.22 – Comparison between our project and Photoshop. Each pair shows our result together with the corresponding Photoshop result
The results of the two methods are actually quite similar under the same conditions, i.e. common inputs. However, the boundary between sharp and blurred objects is much more clear-cut in Photoshop than in our results. In terms of artistic quality and realism, our output is better than Photoshop's: our method combines a multi-layer representation with alpha blending, so the boundaries in our results turn out to be smoother and more natural. The biggest advantage of our method is the input – the user only needs to provide several hand-taken photos, without any additional information such as a depth map.
4.8.3 Different Lens Apertures
The user can also change the shape of the lens aperture in order to obtain different effects in the blurred regions. In this section, to better show the shape effect, we divide the original image into only two parts – foreground and background – and we assume that the foreground is very far from the background. Under this assumption, the background should be heavily blurred compared to the sharp foreground.
Figure 4.23 – Four different blur kernels (box, Gaussian, disk, motion), which simulate different shapes of the lens aperture
4.8.4 Bokeh
We also implement different bokeh effects. As described in the previous chapter (the "bokeh" section), bokeh can be roughly divided into two classes with no clear dividing line between them: one concerns the regions around small background highlights, the other concerns prominent out-of-focus regions. Results of the second class were shown in Section 4.8.3 ("different lens apertures"): in those flower-garden images the house is far away from the tree in the foreground, so we blur the background with different aperture shapes, and the shapes can be seen clearly in the blurry background.

In this section we show results of the first class of bokeh, i.e. bokeh around small background highlights. For convenience of demonstration, we again separate the input photo into two parts – foreground and background. There is a bright point in the background, which we blur to show the bokeh effect, while the foreground object is kept sharp. Figure 4.24 shows our bokeh results; the outline of the aperture shape appears in the pink highlight in the background. In practice the whole scene would not be divided into just these two layers – we do this only to show the bokeh shape more clearly.
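The aperture-shaped highlights can be produced by convolving the background layer with a binary mask of the desired shape. Below is a minimal sketch with a few simple shapes (the heart shape of Figure 4.24 is omitted for brevity); names and parameters are illustrative.

```python
import numpy as np
from scipy import ndimage

def aperture_kernel(radius, shape='circle'):
    """Binary aperture mask used as the blur kernel; its outline shows up
    around small bright highlights in the defocused background."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    if shape == 'circle':
        m = x * x + y * y <= radius * radius
    elif shape == 'diamond':
        m = np.abs(x) + np.abs(y) <= radius
    elif shape == 'triangle':
        m = (y >= -radius // 2) & (np.abs(x) <= (radius - y) / 2.0)
    else:
        raise ValueError(f"unknown aperture shape: {shape}")
    k = m.astype(float)
    return k / k.sum()

# Blurring the background layer (with its bright pink highlight) with different
# kernels reproduces the corresponding bokeh outlines, e.g.:
# bokeh = ndimage.convolve(background, aperture_kernel(15, 'diamond'), mode='nearest')
```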
Figure 4.24 – Bokeh effects of our project: (a) original, (b) circle, (c) heart, (d) triangle, (e) diamond
4.8.5 Different number of input images
We also study the effect of varying the number of input images. The ideal number of input photos for our project is 6–10. What if a user shoots only 3–4 photos, or provides a larger dataset of, say, 15 photos? We analyze these two situations in turn.
(a). feature point extraction
One function of feature point extraction is to find the common corresponding points among the input images. In theory, the more input images, the more extracted feature points, and the more accurate the 3D scene reconstruction. In our practice and implementation, however, the number of feature points extracted from an image sequence has an upper bound. For example, if 10 images yield 1000 feature points and this number has already reached the upper limit, then even with 10 more input images the number of feature points will stay around 1000; it cannot increase much. This is because of the restrictions on our input images: as mentioned in Section 4.1, we require the input photos to have only a slight viewpoint change, which means the whole scene does not change much, so the common feature points among all the images remain relatively stable. Figure 4.25 is a chart for the flower garden sequence showing how the number of extracted feature points changes with the number of input images.
[Chart: number of extracted feature points (0–600) versus number of input images (3–30)]
Figure 4.25 – Relationship between the number of input images and the number of extracted feature points. Input image size is 320x240.
Therefore, we can conclude that with fewer than 5 input images the information may not be sufficient, leading to a small number of feature points, so the extracted layers may not be accurate. On the other hand, with more than 10, or even 15, input photos, the number of feature points stays almost the same; this only adds extra workload and longer computing time, so increasing the number of input images further is not useful in our project.
(b). graph cuts
To study this factor, we assume that different numbers of input images yield the same extracted feature points. We apply the graph cuts algorithm to determine in which of the given images a 3D point of the real scene appears; this is used to deal with the partial occlusion problem. If we have only a few images, e.g. 3 images, an occluded object may not be fully seen by the other cameras. Figure 4.26 illustrates such a case: since our project restricts the input photos to only slight viewpoint changes, i.e. small translations or rotations, the pink bottle in this example is occluded by the white bottle and the three images cannot capture the full view of the pink one. On the other hand, if we have more than 10 input photos, the information provided by the cameras may be redundant, i.e. an occluded object may be seen by two or more cameras at the same time. This increases the workload of the graph cuts algorithm. Figure 4.27 shows the relationship between the number of input images and the running time of graph cuts; even one additional image can result in a large increase in elapsed time. This is not efficient, and we must avoid such a situation.
Figure 4.26 – Partial occlusion problem with only 3 input photos.
[Chart: running time of graph cuts in seconds (0–6000) versus number of input images (3–15)]
Figure 4.27 – Relationship between the number of input images and the running time of graph cuts. Image size is 640x480.
4.8.6 Comparison to Other Software on Camera
We compare our results to those of depth-of-field software found on current cameras. That software also allows users to select a region they prefer and blurs the non-selected regions. However, it only offers an adjustable circle as the region-selection tool, which means users can only choose a circle-like area of the photo to be kept sharp. The results of our method are more realistic: we blur or sharpen an entire layer instead of a circular region. Theoretically, the idea behind that software – "inside the circle is sharp, outside is blurred" – is incorrect. Besides, that software does not allow users to choose the aperture shape either.
Figure 4.28 – Comparison of refocusing between other camera software and our method. In each pair, the first image is produced by the other software and the second by our method.
4.8.7 Limitation
Our method has its own drawbacks and can lead to failure cases with poor results.
1. Small viewpoint change required when shooting the input photos:
This is a critical restriction. If the input images have very different viewpoints, i.e. the user applies a large translation or rotation, or even changes the shooting plane, the produced depth map will be totally wrong. This is because our algorithm for finding corresponding points among the input images uses a limited search area. Once the viewpoints change too much, each pixel in the reference image may have many similar corresponding points in the other images, which confuses the matching algorithm, and it finally fails to produce a correct depth map for the refocusing phase.
2. Very bright point as a partially occluded object:
If there is a very bright point behind a box, with half of it visible beyond the box, our method will fail to blur it correctly. The intensity of a very bright point exceeds the common color range of 0–255, so our method is unable to estimate its neighboring pixel values. We suggest that HDR (High Dynamic Range) techniques could handle this type of problem, but that is outside the scope of this paper and we skip it.
Chapter 5
Conclusion
Image refocusing is a promising and interesting research area in computer graphics and computer vision, especially in computational photography. There are many methods to solve this type of problem, both in hardware and in software. Our project – multi-view image refocusing – is a new approach that is convenient for ordinary point-and-shoot users. It does not require any professional knowledge of photography: the user simply shoots several photos from close viewpoints and selects a preferred region. The output is one reference image with the selected region sharp and the other regions blurred.
Our project can be divided into two sub-procedures. The first part computes the layer depth map; labeling assignment is the basic theory in this step, and to assign a label to each pixel we use the graph cuts algorithm, which is fast and efficient. The second part refocuses the reference image based on the depth map produced in the first step; several methods are used to obtain a better blurring result. A comparison with the lens blur filter in Adobe Photoshop is also performed to weigh the advantages and disadvantages of our project.
Future work can focus on relaxing the requirements on the input data set. The current limitation is that better results come from image sequences with only a little translation and rotation, whereas allowing larger viewpoint differences would be more convenient for users. Besides, our project is not robust to illumination variation: image sequences taken under different lighting conditions may cause unpredictable problems.
Bibliography
Conference / Journals / Technical Papers / Books
[1] Anat Levin, Rob Fergus, Fredo Durand, and William T.Freeman, “Image and Depth from a
Conventional Camera with a Coded Aperture”, In ACM SIGGRAPH 2007 papers
(SIGGRAPH '07), ACM, New York, NY, USA, Article 70, DOI =
10.1145/1275808.1276464, http://doi.acm.org/10.1145/1275808.1276464, 2007.
[2] Chia-Kai Liang, Tai-Hsu Lin, Bing-Yi Wong, Chi Liu and Homer H. Chen, “Programmable
aperture photography: multiplexed light field acquisition”, SIGGRAPH '08 ACM
SIGGRAPH 2008 papers, ACM New York, NY, USA, 2008, doi>10.1145/1399504.1360654.
[3] D. Greig, B. Porteous, and A. Seheult, “Exact Maximum a posteriori Estimation for Binary
Images,” Journal of the Association for Computing Machinery, 35(4):921-940, October 1988.
[4] D.Scharstein, R.Szeliski, “A Taxonomy and Evaluation of Dense Two-Frame Stereo
Correspondence Algorithms,” International Journal of Computer Vision, 47(1): 7-42, May,
2002.
[5] D. Scharstein and R. Szeliski, “http://vision.middlebury.edu/stereo/”, The Middlebury vision
website.
[6] E. Dahlhaus, D. Johnson, et al, “The Complexity of Multiway Cuts,” ACM symp. Theory of
Computing, pp. 241-251, 1992.
[7] E. P. Simoncelli, E. H.Adelson, and D. J. Heeger, “Probability distributions of optic flow”, In
CVPR, pages 310–315, 1991.
[8] F.Moreno-Noguer, P.N.Belhumeur and S.K.Nayar, “Active Refocusing of Images and
Videos”, ACM Trans. on Graphics (also Proc. of ACM SIGGRAPH), Aug, 2007.
[9] Harold Davis, Practical Artistry: Light & Exposure for Digital Photographers, O'Reilly Media, 2008, p. 62, ISBN 9780596529888.
[10] I. J. Cox, S. L. Hingorani, S. B. Rao, and B. M. Maggs, “A maximum likelihood stereo
algorithm”, CVIU, 63(3):542–567, 1996.
[11] J. Besag, “On the Statistical Analysis of Dirty Pictures,” Journal of the Royal Statistical
Society, Series B 48 (1986) 259-302.
[12] L. Matthies, R. Szeliski, and T. Kanade, “Kalman filter-based algorithms for estimating
depth from image sequences”, IJCV, 3:209–236, 1989.
[13] Lee, S., Eisemann, E., Seidel, H, “Depth-of-Field Rendering with Multiview Synthesis”,
ACM Trans. Graph, 28, 5, Article 134 (December 2009), 6 pages. DOI =
10.1145/1618452.1618480, http://doi.acm.org/10.1145/1618452.1618480.
[14] M. J. Hannah, “Computer Matching of Areas in Stereo Images”, PhD thesis, Stanford
University, 1974.
[15] Matthew Kozak, “Camera Aperture Design”, http://www.screamyguy.net/iris/index.htm
[16] Noah Snavely, Steven M. Seitz, Richard Szeliski, “Modeling the World from Internet Photo
Collections”, International Journal of Computer Vision, DOI 10.1007/s11263-007-0107-3.
[17] O. Veksler, “Efficient Graph-based Energy Minimization Methods in Computer Vision”,
PhD thesis, Cornell University, 1999.
[18] P. Anandan, “A computational framework and an algorithm for the measurement of visual
motion”, IJCV, 2(3):283–310,1989.
[19] P. Felzenszwalb and P. Huttenlocher, “Efficient belief propagation for early vision,” In
CVPR, 2004, 261-268.
[20] P. N. Belhumeur, “A Bayesian approach to binocular stereopsis”, IJCV, 19(3):237–260,
1996.
[21] Refocus Imaging, Inc. http://www.refocusimaging.com.
[22] Rajagopalan, A. N., and Chaudhuri, S., "An MRF model-based approach to simultaneous recovery of depth and restoration from defocused images", IEEE Trans. Pattern Anal. Mach. Intell., 21(7):577–589, 1999.
[23] Shade, Jonathan, Steven J. Gortler, Li-wei He, and Richard Szeliski, “Layered depth
images”, In Proceedings of the 25th annual conference on computer graphics and interactive
techniques (SIGGRAPH 1998), July 19-24, 1998, Orlando, Flor., ed. SIGGRAPH and
Michael Cohen, 231-242. New York, N.Y.: ACM Press.
[24] Soonmin Bae and Fredo Durand, “Defocus Magnification”, Computer Graphics Forum,
Volume 26, Issue 3 (Proc. Of Eurographics 2007).
[25] Simon Baker, R.Szeliski and P.Anandan, “A Layered Approach to Stereo Reconstruction”,
To appear in the 1998 conference on CVPR, Santa Barbara, CA, June 1998.
[26] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian
Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
6:721-741, 1984.
[27] Subbarao, M., Wei, T., and Surya, G., "Focused image recovery from two defocused images recorded with different camera settings", IEEE Trans. Image Processing, 4(12):1613–1628, 1995.
[28] S. Roy and I. J. Cox, “A maximum-flow formulation of the N-camera stereo correspondence
problem”, In ICCV, pages 492–499, 1998.
[29] S. T. Barnard, “Stochastic stereo matching over scale”, IJCV, 3(1):17–32, 1989.
[30] T. Kanade, “Development of a video-rate stereo machine”, In Image Understanding
Workshop, pages 549–557, Monterey, CA, 1994. Morgan Kaufmann Publishers.
[31] Voodoo Camera Tracker, http://www.digilab.uni-hannover.de/docs/manual.html, Copyright (C) 2002-2010 Laboratorium für Informationstechnologie.
[32] V. Kolmogorov and R. Zabih, “Computing visual correspondence with occlusions using
graph cuts”, In ICCV, volume II, pages 508–515, 2001.
[33] V. Kolmogorov and R. Zabih, “Multi-Camera Scene Reconstruction via Graph Cuts,” Proc.
Seventh European Conf. Computer Vision, vol. III, pp. 82-96, May 2002.
[34] V. Kolmogorov and R. Zabih, “What Energy Functions Can Be Minimized via Graph Cuts?”
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 147-159, Feb.
2004.
[35] V. Kolmogorov and R. Zabih, “Graph Cut Algorithms for Binocular Stereo with
Occlusions,” In N. Paragios, Y. Chen, and O. Faugeras, editors, The Handbook of
Mathematical Models in Computer Vision. Springer, 2005.
[36] Y. Boykov and V. Kolmogorov, “An Experimental Comparison of Min-Cut / Max-Flow
Algorithms for Energy Minimization in Vision,” Proc. Int’l Workshop Energy Minimization
Methods in Computer Vision and Pattern Recognition, pp. 359-374, Sept. 2001.
[37] Y. Boykov and M. Jolly, “Interactive Graph Cuts for Optimal Boundary and Region
Segmentation of Objects in N-D images,” proc. Eighth IEEE int’l Conf. Computer Vision,
vol.1, pp. 105-112, 2001.
[38] Y. Boykov, O. Veksler, and R. Zabih, “Markov Random Fields with Efficient
Approximations,” Proc. IEEE conf. Computer Vision and Pattern Recognition, pp. 648-655,
1998.
[39] Y. Boykov, O. Veksler and R. Zabih, “Fast Approximate Energy Minimization via Graph
Cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No.11,
2001.
[40] Y.Kuk-Jin, K.In-So, “Adaptive Support-Weight Approach for Correspondence Search”,
IEEE PAMI, Vol.28, No.4, April, 2006.