
Multi-View Image Refocusing


Abstract

Image refocusing is a promising research area in computer graphics and computer vision. By definition, it means focusing an image again, or changing the emphasized region in a given image. Achieving this requires a shallow depth of field to create a focus-defocus scene, which in turn depends on a larger lens aperture. In our project, we simulate a larger camera lens aperture by using several photos taken from slightly different viewpoints. Based on these images, a layered depth map is generated to represent how the objects are distributed in the real-world scene. The user can arbitrarily select one of the objects/layers to focus on, and the other parts are naturally blurred according to their depth values in the scene.

This project can be divided into two parts. The first is how to produce a layered depth map. Computing a depth map is essentially a labeling-assignment problem, which can be solved by finding the minimum of a constructed energy function. The graph cuts algorithm is one of the most efficient optimization methods, and we use it to minimize our energy function because of its fast convergence. The second part is to blur each layer that is not assigned to be focused; several blurring algorithms are applied to achieve this goal.

In this paper, I first describe some related work and background studies on labeling-assignment theories and their related topics in the vision area. I then explore the refocusing-related principles in computational photography. Based on these studies, I go through our image refocusing project in detail and compare the experimental results to other existing approaches. Finally, I propose some possible future work in this research area.

Acknowledgment

First, I would like to express my sincere appreciation to my supervisor, Dr Low Kok Lim. He has offered me a large amount of precious advice and suggestions, contributed his valued time to review this paper, and provided helpful comments. I would also like to thank my family, especially my parents, who always offer me continuous encouragement and infinite support. My acknowledgement goes to my labmates as well: when I felt tired and wanted to give up, they were by my side and supported me.
Contents

Abstract
Acknowledgment
Contents
1 Introduction
  1.1 Introduction
  1.2 Problem Statement
  1.3 Advantages and Contributions
2 Related Work
  2.1 Depth Estimation
    2.1.1 Single Image Input
    2.1.2 Stereo Matching – Two Images Input
    2.1.3 Multi-View Scene Reconstruction – Multiple Images Input
    2.1.4 3D World Reconstruction from a Large Dataset
  2.2 Image Refocusing
3 Background Studies
  3.1 Standard Energy Function in Vision
  3.2 Optimization Methods
    3.2.1 Introduction to Optimization
    3.2.2 Graph Cuts
  3.3 Graph Cuts
    3.3.1 Preliminary Knowledge
    3.3.2 Details of Graph Cuts
  3.4 Concepts of Stereo Matching
    3.4.1 Problem Formulation
    3.4.2 Conclusion and Future Work
  3.5 Photography
    3.5.1 Technical Principles of Photography
    3.5.2 Effects and Processing in Photography
  3.6 Defocus Magnification
    3.6.1 Overview and Problem Formulation
    3.6.2 Results and Conclusion
  3.7 Refocus Imaging
  3.8 Adobe Photoshop
4 Refocusing from Multiple Images
  4.1 Data Set
  4.2 Computation of Camera Parameters
  4.3 Estimation of Depth Value in 3D Scene
  4.4 Problem Formulation and Graph Cuts
    4.4.1 Overview of Problem Formulation
    4.4.2 Results of Single Depth Map
    4.4.3 Considerations and Difficulties
  4.5 Layer Depth Map
  4.6 Layer Blurring
  4.7 Combining Blurry Layers
  4.8 Results and Comparison
    4.8.1 Experimental Results
    4.8.2 Comparison to Adobe Photoshop
    4.8.3 Different Lens Apertures
    4.8.4 Bokeh
    4.8.5 Different Number of Input Images
    4.8.6 Comparison to Other Software on Camera
    4.8.7 Limitation
5 Conclusion
Bibliography

Chapter 1 Introduction

1.1 Introduction

In photography, focusing, refocusing, and depth of field are popular research topics that attract a large amount of attention. If an object is at the exact focus distance, we say that it is focused precisely; at any other distance from the camera, the object is defocused and will appear blurred in the resulting photo. Refocusing changes the focused part of a picture, so that the objects not chosen to be focused become blurred.

Given the position of a camera, the depth of field (DOF) is controlled by the diameter or the shape of the camera lens aperture. In general, a small aperture results in a large depth of field, so we can obtain an image with sharp objects everywhere (deep focus). In other cases, some objects are emphasized while others are blurred; this results from a relatively large aperture, i.e. a small depth of field (shallow focus).

In practice, people sometimes want to shoot a photo with a sharp foreground and a blurred background. Moreover, they would like to sharpen the objects they prefer and blur the parts they do not consider important. In this case, we need a camera with a large aperture to create a shallow depth of field. However, typical point-and-shoot cameras have small sensors and lenses, which makes it difficult to generate a shallow-focus effect. To solve this problem, our project uses multiple small-aperture photos taken from slightly different viewpoints, which together simulate a bigger camera aperture, to create a shallow depth of field. This means that with only one point-and-shoot camera, ordinary users without any photography technique can take an 'artistic' depth-of-field photo.
Another advantage of our project is that the user can freely select one part of the reference photo to sharpen, while the other parts are blurred. This is achieved by producing a depth map from the given set of photos.

1.2 Problem Statement

Most people will ask: what kind of problem does our project solve? The first thing we consider is convenience for camera users. The top-selling point-and-shoot cameras on the current market are indeed convenient and easy to use: people just press the shutter button and a beautiful photo is produced. However, merely generating a picture is not enough; users prefer more realistic effects in their photos. For example, if we choose a depth-of-field simulation function on the camera, the foreground objects (e.g. people) are sharpened while the background is blurred just by pressing a button.

Current point-and-shoot cameras with a depth-of-field effect usually produce one of two kinds of result: either the objects are sharp everywhere (Figure 1.1, left), or the camera detects the foreground objects automatically, keeps them sharp, and blurs all other regions of the photo (Figure 1.1, right). Here is the problem: what if the user wants to sharpen the background objects instead of the foreground ones? For example, in Figure 1.1 (right), users may not want to focus on the flower; maybe they would like to see the white object in the background more clearly. In that case we need to refocus the whole scene in the photo, i.e. change the emphasized part of the scene. In short, our project should solve the following problem: with the fewest handling steps on the input, how can users obtain a photo with the sharp and blurred parts they prefer?

Figure 1.1 – left: all the objects are sharp everywhere; right: shallow depth of field – the flower in the foreground is sharp, while the background is largely blurred.

There is some existing work in the area of refocusing; I will describe the details of these methods in the next chapter. In this thesis, we present a simple but effective way to implement refocusing for current popular point-and-shoot cameras. The whole procedure can be divided into two parts. The first part is to compute a depth map from the given input photos. The underlying problem formulation is a classical research topic in early vision: label assignment, which concerns how to assign a label to each pixel in an image given some observed data. This kind of problem can be solved by energy minimization: we construct an energy function and minimize it. The second part of the project is to sharpen one part of the photo while blurring the other parts based on the depth map computed in the previous step. For this phase, we need to handle problems such as partially occluded regions, the blur kernel, and issues arising from the layers of the constructed 3D scene. After combining these two steps, the user can obtain a new image with the sharp parts they want by shooting only a few photos with a common point-and-shoot camera.

Therefore, the input of this project is a sequence of sharp photos taken from slightly different viewpoints. The user then selects a preferred region on the reference photo to be emphasized, and the output is a depth-of-field image.
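To make the two-stage workflow concrete, the sketch below illustrates the second stage under the assumption that the layered depth map from the first stage is already available as an integer label image. It is only an illustration, not the implementation described in Chapter 4: it uses a simple Gaussian blur, and the function and parameter names are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def composite_refocus(reference, layer_map, focus_layer, sigma_per_layer=2.0):
    """Blur each depth layer in proportion to its distance from the focused
    layer, then composite the layers back into one image.

    reference   : HxW grayscale float array, the sharp reference photo
    layer_map   : HxW integer array, the layered depth map from stage 1
    focus_layer : the layer index the user selected to keep sharp
    """
    out = reference.astype(np.float64).copy()
    for layer_id in np.unique(layer_map):
        sigma = sigma_per_layer * abs(int(layer_id) - int(focus_layer))
        if sigma == 0:
            continue                      # the focused layer stays sharp
        blurred = gaussian_filter(reference.astype(np.float64), sigma=sigma)
        mask = layer_map == layer_id
        out[mask] = blurred[mask]
    return out
```

The real pipeline additionally handles object boundaries, alpha blending, and bokeh, as discussed in Chapter 4.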
1.3 Advantages and Contributions

We analyze the advantages of our project from three aspects: depth map generation, the refocusing procedure, and the quantity of input parameters.

(a) Depth map generation: There are many approaches to producing a depth map in the early-vision area. Most of them are stereo matching methods: the user needs to prepare two images – left and right – with only a translational movement between them. The output is a disparity map based on the moving distance of each pixel between the two images: a large translation indicates objects nearer to the camera, while a small movement corresponds to scene background far away from the camera holder. Another type of depth map generation is to reconstruct a 3D scene from a large set of photos, which can be taken from very different viewpoints. Actually, to produce a coarse depth map there is no need to reconstruct the corresponding 3D scene; the information a depth map requires depends on what it is used for. Sometimes a rough depth value (the distance from the camera lens to the object) for each pixel in the image is enough, while in other cases, especially full 3D scene reconstruction, the exact depth value (e.g. an x, y, z point representation) of each 3D point is necessary.

Our method is a tradeoff between these two approaches. We use several images to generate an ordinary depth map instead of a 3D reconstruction. First, users do not need to shoot a large set of photos; 5-7 is enough. Second, the theory of multi-view scene reconstruction is simple: generate an appropriate energy function for the given problem and apply an efficient energy minimization algorithm such as graph cuts to minimize it [33]. The details and our implementation are described in a later chapter. In short, our depth map production requires less input and uses a simple algorithmic idea instead of the many steps of a 3D reconstruction approach, such as feature point extraction, structure from motion, or bundle adjustment. We also avoid 3D reconstruction because the information we require is less than what reconstructing a real-world 3D scene provides – a rough depth value for each pixel in the reference image is truly enough.

(b) Refocusing procedure: One common approach to refocusing is to acquire a sequence of images with different focus settings. From a spatially varying blur estimate of the whole scene, where the information comes from the focus differences, an all-focused image can be computed and used to refocus. However, users then need to take several photos under different focus settings, which is not so easy if they have little knowledge of photography; what if they do not know how to adjust the focus settings? In other words, some existing refocusing work requires at least two input images with very different properties: for example, one input photo has sharp objects in the foreground, while in the other image the background is sharp and the foreground is blurred. From the different blur degrees and regions, the program can then compute a relatively accurate blur kernel to accomplish the refocusing task. Notice that a camera with a depth-of-field effect is required for this kind of approach.
A different approach to refocusing is to measure the light field associated with a scene. In this case, the measured rays can be combined to simulate new depth-of-field settings without explicitly computing depth [8]. The drawback of this method is that it requires either a large number of photos or a large camera array. The blur estimation in our project is based on the depth map produced in the previous step. It requires neither a large number of photos nor any additional equipment such as a camera array. All we need is the depth value of each pixel, and we blur the objects in the picture according to the magnitude of that depth value: the nearer an object is to the camera, the less it is blurred. Besides, the photos users shoot all use the same focus setting, so there is no need for them to have any extra photography knowledge.

(c) Quantity of input parameters: Most existing refocusing methods need more input parameters than ours. In [8], in order to estimate a depth map with a single camera (with a wide depth of field), the author uses a sparse set of dots projected onto the scene (with a shallow-depth-of-field projector) and refocuses based on the resulting depth map. Another type of method modifies the camera lens, using various shapes of camera aperture or applying a coded aperture to a conventional camera [1]. With knowledge of the aperture shape, the blur kernel can be computed as well. Our project does not require extra equipment such as a projector or any camera modification, and we do not need any photos with a shallow depth of field to estimate the blur kernel either. All users need to do is shoot several all-focused photos from slightly different shooting angles with a point-and-shoot digital camera. This makes it more convenient for ordinary people without any photography technique to obtain a final refocused photo. In our project, the only interaction between the user and the computer is the selection of the preferred region to be emphasized (sharpened).

Chapter 2 Related Work

Following the procedure of our project – depth map computation and image refocusing based on the depth information – the literature survey in this chapter is also divided into two categories: existing work on depth estimation and existing work on image refocusing. In this chapter, we not only introduce existing work related to our project, but also describe some key algorithms and their corresponding applications. These algorithms, such as the stereo matching concepts, also play an important part in our project, so introducing them in detail is necessary.

2.1 Depth Estimation

We divide this part into four categories according to the required input, which in my opinion makes it much clearer to compare the various methods.

2.1.1 Single Image Input

Most approaches that obtain a depth map from a single input image require additional equipment such as a projector, or a device modification such as changing the camera aperture shape. In [8], the author uses a single camera (with a wide depth of field), and the depth computation is based on the defocus of a sparse set of dots projected onto the scene (using a narrow-depth-of-field projector). With the help of the projected dots and a color segmentation algorithm, an approximate depth map of the scene with sharp boundaries can be computed for the next step. Figure 2.1 shows the example from [8]: (a) is the image acquired from the single input image and the projector; (b) is the depth map computed from the information provided by (a).
The produced depth map has very sharp and accurate object boundaries. However, it cannot handle the partial occlusion problem, i.e., we are not able to see the region of the man behind the flower in the depth map.

Figure 2.1 – example result from [8].

In [1], the authors use a single image capture and a small modification to the traditional camera lens – a simple piece of cardboard suffices. Provided the shape of the lens aperture is already known, the corresponding blur kernel can be estimated, and deconvolution can then be applied to the blurry parts of the image in order to recover an all-focused final image (in the refocusing step). The output of this method is a coarse depth map, which is sufficient for the subsequent refocusing phase in most applications.

2.1.2 Stereo Matching – Two Images Input

Stereo matching is one of the most active research areas in computer vision, and it serves as a significant intermediate step in many applications, such as view synthesis, image-based rendering, and 3D scene reconstruction. Given two images taken from slightly different viewpoints, the goal of stereo matching is to assign a depth value to each pixel in the reference image, with the final result represented as a disparity map. Disparity indicates the difference in location between two corresponding pixels and is also considered a synonym for inverse depth. The most important first step is to find the corresponding pixels that refer to the same scene point in the given left and right images. Once these correspondences are known, we can measure how much displacement the camera movement produced. Because it is hard to identify one-to-one pixel correspondences under casual camera motion, the image pair is commonly rectified so that the motion becomes a horizontal translation and the stereo problem is reduced to a one-dimensional search along corresponding scan lines. The disparity value can then be viewed simply as the offset between x-coordinates in the left and right images (Figure 2.2). Objects nearer to the viewpoint have a larger translation, while farther ones move only slightly.

Figure 2.2 – The stereo vision is captured in a left and a right image.

An excellent review of stereo work can be found in [4], which presents a taxonomy and comparison of two-frame stereo correspondence algorithms. Stereo matching algorithms generally perform (subsets of) the following three steps [4]:

a. Matching cost computation. My own comparison and implementation focus only on pixel-based matching costs, which is enough for this project. The most common pixel-based methods are squared intensity differences (SD) [14, 18, 12, 7] and absolute intensity differences (AD) [30]. SD and AD are computed between individual pixels of the given left and right images.

b. Cost (support) aggregation. Cost aggregation is usually a window-based (local) method that aggregates the matching cost by summing or averaging over a support region. In common stereo matching, a support region is a two-dimensional area, often a square window (3-by-3, 7-by-7), so the aggregated cost of each pixel is computed over such a region.

c. Disparity computation / optimization. This step can be separated into two classes: local methods and global methods (discussed below, with a code sketch after this list).
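As an illustration of steps (a)–(c), here is a minimal block-matching sketch. It is not the method used in this project, which relies on global optimization with graph cuts; it computes the absolute-difference cost, aggregates it over a square window, and selects disparities with the local winner-take-all rule discussed next. Image names and parameters are illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def wta_disparity(left, right, max_disp=32, window=7):
    """Minimal block-matching stereo on rectified grayscale float images:
    (a) absolute-difference matching cost, (b) square-window aggregation,
    (c) local winner-take-all (WTA) disparity selection.
    """
    h, w = left.shape
    cost = np.empty((max_disp, h, w), dtype=np.float64)
    for d in range(max_disp):
        # shift the right image by the candidate disparity d
        # (columns x < d are filled with the first column, a simplification)
        shifted = np.empty_like(right)
        shifted[:, d:] = right[:, :w - d]
        shifted[:, :d] = right[:, :1]
        ad = np.abs(left - shifted)                       # (a) matching cost
        cost[d] = uniform_filter(ad, size=window, mode='nearest')  # (b) aggregation
    return np.argmin(cost, axis=0)                        # (c) WTA selection
```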
The local methods usually perform a local "winner-take-all" (WTA) optimization at each pixel [40]. For global optimization methods, the objective is to build a disparity function that minimizes a global energy. Such an energy function includes a data term and a smoothness term, which together represent the energy contribution of each pixel in the two stereo images. Given the energy function, several minimization algorithms, such as belief propagation [19], graph cuts [28, 34, 17, 2], dynamic programming [20, 31], and simulated annealing [26, 11], can be used to compute the final depth map.

2.1.3 Multi-View Scene Reconstruction – Multiple Images Input

A 3D scene can be roughly reconstructed from a small set of images (about 10 pictures). The result of this type of method is usually a dense depth map similar to those of stereo matching. However, stereo matching has difficulty dealing with partially occluded parts because only two input images provide too little information. When users provide multiple images as input, there is enough information to produce a relatively exact result, and partial occlusion may also be resolved. In [Multi-camera; asymmetrical], the energy function has more than two terms: a data term, a smoothness term, and a visibility term, where the visibility term handles the partial occlusion problem. Besides, [13, 25, 23] not only take advantage of multiple input images but also build a new type of output data structure – the layered depth map – which represents the layered pattern of the real 3D scene more clearly. The whole scene is divided into several planes, each representing the distance of its objects to the camera holder. Figure 2.3 shows an example of a layered depth map [23]. This representation deals with partial occlusion better, and most importantly, it is more convenient and exact for the next step – refocusing.

Figure 2.3 – (a) one of the input sequence; (b) the recovered depth map; (c) the separated layers.

2.1.4 3D World Reconstruction from a Large Dataset

As in [16], the input is a large dataset in which the photos are taken from very different views. From it we can obtain plenty of information, including the camera position and the affine matrix of each photo, and finally compute the exact depth value of each pixel in the real world. Given each pixel's x, y, z values (the z value is the one computed from the large dataset), we can easily reconstruct the 3D scene of the image, and of course the computed depth map is much more accurate than those produced by stereo matching or multi-view scene reconstruction. See Figure 2.4. The result of 3D reconstruction consists of a large number of sparse points. For people who would like to see a rough sketch of a certain building, a set of sparse points is enough; to obtain a dense map, estimation of the regions without known points, or triangulation, may be performed to connect the discrete points together.

Figure 2.4 – example from [16]: (a) large dataset as input; (b) 3D reconstruction result.

2.2 Image Refocusing

In our project, this part builds on the preceding depth estimation phase. All our research and implementation of refocusing rely on the single dense depth map that we have produced. Therefore, when introducing related work on image refocusing, we describe only the refocusing part of each paper, i.e. we presume that the depth map (single or layered) has already been provided, and focus on how the original image is blurred or sharpened.
One common approach is to compute the blur scale from a set of images. These images can be differently focused, or one all-focused image plus known focus settings or parameters. In [22, 27], the degree of blur of different parts of the image is computed from a given defocused image sequence. For this kind of approach, in addition to the size of the lens aperture and the focal length, two issues must be handled when computing the blur of the whole scene. The first is partial occlusion: different parts of the lens, or different cameras, may capture different views. Second, pixel values at object boundaries may have ambiguous sources, i.e., both background and foreground can contribute to boundary pixels. The refocusing algorithm of [8] addresses both issues. It treats partial occlusions as missing regions in the resulting depth map and recreates the missing parts by estimation, i.e., it extends the occluded surface using texture synthesis. Moreover, to deal with foreground-background transitions at object boundaries, the authors blend a foreground-focused image with a background-focused image within the boundary region, combining their depth map with a matting matrix computed from the depth estimation refinement to produce a better result.

[1] is another approach that uses blur kernel estimation for the refocusing work. The input is a single blurred photograph taken with the modified camera (providing both depth information and an all-focus image), and the output is coarse depth information together with a normal high-resolution RGB image. To reconstruct the original sharp image, the correct blur scale of the observed image has to be identified with the help of the modified camera aperture shape. The process is based on a probabilistic model that maximizes the likelihood of a blur-scale estimation equation. In summary, it follows the general convolution equation y = fk * x, where y is the observed image, x is the sharp image to be recovered, and the blur filter fk is a scaled version of the aperture shape [1]. The defocus step of this method is to find the correct kernel fk with the help of the known coded aperture.

Another kind of method produces a layered depth map that divides the whole scene into several parts, each with a certain depth value. These separated layers are treated as individual parts, which avoids the problem of missing regions between objects lying in different scene layers. Given a layered depth map, all we need to do is blur each layer according to the depth value of that individual layer, without worrying about missing regions. Our refocusing algorithm is based on this layered depth map idea; we describe the details in a later chapter.

A different approach to refocusing is to measure the light field of a scene. In this case, there is no need to compute the exact depth value of each pixel; the measured rays can be combined to simulate new depth-of-field settings [light field papers]. Since depth estimation runs through our whole project and this thesis, we will not say more about light fields here. Detailed information about light fields in the depth estimation area is in [32].
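To make the convolution model y = fk * x and the layered-blurring idea concrete, here is a minimal sketch that blurs one scene layer with a disk-shaped kernel standing in for a circular aperture. The linear mapping from a layer's depth to the kernel radius is an illustrative assumption, not the calibration used in this project.

```python
import numpy as np
from scipy.signal import fftconvolve

def disk_kernel(radius):
    """Pillbox (disk) blur kernel approximating a circular lens aperture."""
    if radius < 1:
        return np.ones((1, 1))
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    k = (x * x + y * y <= radius * radius).astype(np.float64)
    return k / k.sum()

def defocus_layer(layer_image, layer_depth, focal_depth, pixels_per_unit=3):
    """Blur one layer following y = f_k * x: the kernel f_k is a disk whose
    radius grows with the layer's distance from the focal plane (an
    illustrative mapping, not a calibrated one)."""
    radius = int(round(pixels_per_unit * abs(layer_depth - focal_depth)))
    return fftconvolve(layer_image, disk_kernel(radius), mode='same')
```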
Chapter 3 Background Studies

In this chapter, I mainly discuss some background theories relevant to this project. Some similar work is also introduced to give a rough picture and a comparison with our project. In section 3.1, basic knowledge of the energy function used in vision is explored. Section 3.2 introduces optimization methods, and section 3.3 discusses the graph cuts algorithm in depth, including implementation details. A classical and critical application related to graph cuts and multi-view depth map production is stereo matching, which we discuss in section 3.4 in order to better understand the basic theories underlying our project. I also introduce some concepts related to the refocusing part of the project: the basic theories and relationships among common camera parameters are described in section 3.5. Besides, several other methods (applications) on the topics of refocusing, defocusing, or depth of field are introduced for a direct comparison with our method.

3.1 Standard Energy Function in Vision

In early vision problems, label assignment can be described as follows: every pixel must be assigned a label from some finite set L. In image restoration, the assignments represent pixel intensities, while for stereo vision and motion, the labels are disparities. Another example is image segmentation, where labels represent the pixel values of each segment. The goal of label assignment is to find a labeling f that assigns each pixel p a label f_p ∈ L, where f is both piecewise smooth and consistent with the observed data. In the general energy function formulation, such vision problems can be naturally written as

E(f) = E_smooth(f) + E_data(f)    (1)

where E_smooth(f) measures the extent to which f is not piecewise smooth, while E_data(f) measures the disagreement between f and the observed data [39]. Typically, in the literature and in applications, E_data(f) takes the form

E_data(f) = Σ_{p∈P} D_p(f_p)    (2)

where D_p measures how well label f_p fits pixel p given the observed data. If a minimum of this data term is found, the configuration f fits the corresponding pixels very well. In the stereo vision problem, D_p(f_p) is usually (I_left(p) − I_right(p + f_p))², where I_left and I_right are the pixel intensities at the two corresponding points in the left and right images, respectively. In the image restoration problem, D_p(f_p) is normally (f_p − I_p)², where I_p is the observed intensity of p.

While the data term is comparatively easy to define and apply to different practical problems, the choice of smoothness term is an important and critical issue in current research. It directly decides whether the final result is optimal, and the form of this term usually depends on the application. In some approaches, E_smooth(f) makes f smooth everywhere, according to the demands of the algorithm. For many other applications, E_smooth(f) has to preserve object boundaries as clearly as possible, which is referred to as discontinuity preserving. For image segmentation and stereo vision problems, object boundaries are a necessary factor to consider first. The Potts model, described in section 3.3.1.1, is also a popular choice of smoothness term. From this discussion of the data term and the smoothness term, we consider energies of the form

E(f) = Σ_{p∈P} D_p(f_p) + Σ_{(p,q)∈N} V_{p,q}(f_p, f_q)    (3)

where N is the set of pairs of neighboring pixels. Normally, N is composed of adjacent pixels (i.e. left and right, top and bottom, etc.), but it can contain arbitrary pairs as required by the problem. Most applications only consider V_{p,q} for pairwise interactions between adjacent pixels, since pixel dependence and interaction mostly happen between neighbors.
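To make the notation concrete, the following sketch evaluates E(f) on a 4-connected pixel grid using the image-restoration data term and the Potts smoothness term from above. The weight and names are illustrative; it only implements the definitions, not any minimization.

```python
import numpy as np

def energy(labels, observed, smooth_weight=1.0):
    """Evaluate E(f) = E_data(f) + E_smooth(f) on a 4-connected grid.

    labels   : HxW integer label image f (e.g. restored intensities)
    observed : HxW observed data I (e.g. the noisy input image)
    Data term      : D_p(f_p) = (f_p - I_p)^2        (image restoration form)
    Smoothness term: Potts model, smooth_weight * [f_p != f_q]
    """
    f = labels.astype(np.float64)
    data = np.sum((f - observed.astype(np.float64)) ** 2)
    # neighbor pairs N: horizontal and vertical adjacencies only
    horiz = np.sum(labels[:, 1:] != labels[:, :-1])
    vert = np.sum(labels[1:, :] != labels[:-1, :])
    smooth = smooth_weight * (horiz + vert)
    return data + smooth
```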
3.2 Optimization Methods

Finding a minimum of a given energy function is a typical optimization problem. In this section, I first give an introduction to and motivation for optimization techniques, and then focus on an efficient optimization method – graph cuts.

3.2.1 Introduction to Optimization

In mathematics and computer science, optimization, or mathematical programming, refers to choosing the best element from some set of available alternatives. In the simplest case, optimization means solving problems where one seeks to minimize or maximize a real function by systematically choosing the values of real or integer variables from within an allowed set. This formulation, using a scalar, real-valued objective function, is probably the simplest example. Generally, optimization means finding the "best available" values of some objective function over a defined domain, covering a variety of types of objective functions and domains.

Greig et al. [3] were the first to use powerful min-cut/max-flow algorithms from combinatorial optimization to minimize typical energy functions in computer vision. Combinatorial optimization is a branch of optimization in which the feasible solutions are discrete or can be reduced to the discrete case, and the goal is to find the best possible solution – most of the time an approximate one. Approximation algorithms run in polynomial time and find a solution that is "close" to optimal. For a given energy function, a labeling f is a local minimum of the energy E if

E(f) ≤ E(f′) for any labeling f′ "near to" f.

In the discrete labeling case, the labelings near to f lie within a single move of f. Many local optimization approaches apply standard moves, where only one pixel can change its label at a time [39].

3.2.2 Graph Cuts

The graph cuts algorithm is one of the most popular optimization algorithms in the related research areas. It can rapidly compute a local minimum with relatively good results. Figures 3.1-3.3 show some examples produced by graph cuts; the object boundaries are quite clear, which fits the requirements of image segmentation and stereo vision. Since the graph cuts algorithm is the major topic discussed in this paper, I introduce it in detail later.

Figure 3.1 – Results of color segmentation on the Berkeley dataset using graph cuts.

Figure 3.2 – Results of texture segmentation on the MIT VisTex and the Berkeley datasets using graph cuts.

Figure 3.3 – Disparity map of stereo vision matching using graph cuts.

3.3 Graph Cuts

3.3.1 Preliminary Knowledge

To fully understand the idea of graph cuts, some fundamental theory needs to be known first.

3.3.1.1 Metric and Semi-Metric

V is called a metric on the label space L if, for any labels α, β, γ ∈ L, it satisfies

(a) V(α, β) = 0 ⇔ α = β,
(b) V(α, β) = V(β, α) ≥ 0,
(c) V(α, β) ≤ V(α, γ) + V(γ, β).

If V satisfies only (a) and (b), it is called a semi-metric. For example, the Potts model V(α, β) = K · T(α ≠ β) is a metric, where T(·) is 1 if its argument is true and 0 otherwise. The Potts model encourages labelings consisting of several regions where pixels in the same region have equal labels [4]. The discontinuity-preserving results produced by this model are also called piecewise constant, which is widely used in segmentation and stereo vision problems. Another type of model is called piecewise smooth.
The truncated quadratic V(α, β) = min(K, |α − β|²) is a semi-metric, while the truncated absolute distance V(α, β) = min(K, |α − β|) is a metric, where K is some constant. The role of the constant K is to cap the discontinuity penalty that the smoothness term can impose in the energy function. These models encourage labelings consisting of several regions where pixels in the same region have similar labels.

3.3.1.2 Graph Denotation and Construction

Let G = ⟨V, E⟩ be a weighted graph. It consists of a set of nodes V and a set of edges E that connect them. The set of nodes contains several distinguished vertices called the terminals. In the context of vision problems, the nodes normally correspond to pixels, voxels, or other types of image features, and the terminals correspond to the set of labels that can be assigned to each pixel in the image. For simplicity, I will focus only on the case of two terminals (i.e. two labels to be assigned); the two terminal nodes are usually called the source node and the sink node. The multiple-terminal problem is a natural extension of the two-label case. In Figure 3.4, a simple example of a two-terminal graph is shown. This graph construction can be used for a 3 x 3 image with two labels to be assigned. Among the edges connecting different nodes, a t-link is an edge that connects a terminal node (source or sink) to an image pixel node, while an n-link is an edge that connects two image nodes within a neighborhood system.

Figure 3.4 – Example of a constructed graph.

A similar graph cuts construction was first introduced in vision by Greig et al. [3] for binary image restoration. A cut C ⊂ E is a set of edges such that the terminals are separated in the induced graph G(C) = ⟨V, E \ C⟩. After the cut, one subset of nodes belongs to the source terminal, while the other subset is assigned to the sink terminal. The cost of the cut C, denoted |C|, equals the sum of the edge weights in the cut. Figure 3.5 shows a typical cut on the constructed graph, represented as a green dotted line. The minimum cut problem is to find the cut with the lowest cost among all cuts separating the terminals.

Figure 3.5 – Example of a cut on the constructed graph.

3.3.1.3 Minimizing the Potts Energy is NP-hard

The details of the NP-hardness proof are not described here. All we need to know is that a polynomial-time method for finding an optimal configuration f* would provide a polynomial-time algorithm for finding the minimum-cost multi-way cut, which is known to be NP-hard [38].

3.3.2 Details of Graph Cuts

3.3.2.1 Overview

In contrast to the other optimization approaches described before, which use standard moves with only one label change at a time, graph cuts algorithms can change the labels of a large number of pixels simultaneously. This improvement directly speeds up the processing time on images. There are two types of large moves in the graph cuts algorithm: α-β swap and α-expansion. Figure 3.6 shows a comparison of local minima with respect to standard and large moves for image restoration.

Figure 3.6 – Comparison of local minima with respect to standard and large moves for image restoration. (a) Original image. (b) Observed noisy image. (c) Local minimum with respect to standard moves (i.e. only one label changes at a time). (d) Local minimum with respect to large moves. Both local minima in (c) and (d) were obtained using labeling (b) as an initial labeling estimate [39].
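Before turning to the move algorithms, here is a toy version of the two-terminal construction of section 3.3.1.2 for a binary labeling, solved with the general-purpose minimum-cut routine of the networkx library. The data costs below are made up for illustration (the image is assumed normalized to [0, 1]), and in practice a dedicated max-flow implementation such as [36] is used instead.

```python
import numpy as np
import networkx as nx

def binary_min_cut_labeling(image, smooth_weight=0.3):
    """Two-terminal graph construction for a binary label assignment.

    Pixels are nodes; the two terminals play the role of labels 0 and 1.
    t-links carry toy data costs D_p(label), and n-links carry a Potts
    smoothness cost between 4-connected neighbors.
    """
    image = np.asarray(image, dtype=float)      # assumed in [0, 1]
    h, w = image.shape
    g = nx.DiGraph()
    src, snk = 's', 't'                         # terminal nodes = the two labels
    for y in range(h):
        for x in range(w):
            p = (y, x)
            # t-links: cut s->p is paid if p gets the sink label (1),
            # cut p->t is paid if p gets the source label (0)
            g.add_edge(src, p, capacity=1.0 - image[y, x])   # D_p(1)
            g.add_edge(p, snk, capacity=float(image[y, x]))  # D_p(0)
            for q in ((y, x + 1), (y + 1, x)):               # n-links
                if q[0] < h and q[1] < w:
                    g.add_edge(p, q, capacity=smooth_weight)
                    g.add_edge(q, p, capacity=smooth_weight)
    cut_value, (src_side, snk_side) = nx.minimum_cut(g, src, snk)
    labels = np.zeros((h, w), dtype=int)
    for node in snk_side:
        if node != snk:
            labels[node] = 1
    return labels, cut_value
```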
3.3.2.2 Definition of Swap and Expansion Algorithms

For α-β swap, given a pair of labels α and β, an α-β swap is a move from an old labeling f to a new labeling f′. If the change in pixel labels leads to a decrease of the energy function, we say the α-β swap succeeds and continue to the next iteration. In other words, an α-β swap means that some pixels that were labeled α are now labeled β, and some pixels that were labeled β are now labeled α. For α-expansion, given a label α, an α-expansion move is also a move from an old labeling f to a new labeling f′; it means that some pixels that were not assigned label α are now assigned label α.

Figure 3.7 shows the α-β swap and α-expansion algorithms, respectively. We call a single execution of Steps 3.1-3.2 an iteration, and an execution of Steps 2, 3, and 4 a cycle. In each cycle, an iteration is performed for every label α or for every pair of labels α and β. The algorithms continue until no successful labeling change can be found. It is easy to see that a cycle of the α-β swap algorithm takes |L|² iterations, while a cycle of the α-expansion algorithm takes only |L| iterations.

Figure 3.7 – Overview of the α-β swap algorithm (top) and the α-expansion algorithm (bottom).

Given an initial labeling f and a pair of labels α and β (swap algorithm) or a label α (expansion algorithm), we want to find a new labeling f′ that minimizes the given energy function

E(f′) = Σ_{p∈P} D_p(f′_p) + Σ_{(p,q)∈N} V(f′_p, f′_q).

I will discuss the procedure and results of these two algorithms with respect to a constructed graph. There are a number of theorems and corollaries to be proved; I will not prove them here, but instead use them directly to give a clear interpretation that makes the two algorithms easier to understand.

3.3.2.3 Swap

Any cut leaves each pixel in the image with exactly one t-link, which means that the result of a cut determines the labeling f′ of every pixel. Put differently, a cut can be described as follows: a pixel p is assigned label α when the cut C separates p from the terminal α; similarly, p is assigned label β when the cut C separates p from the terminal β. If p is not chosen to change its label, its original label f_p is kept.

Lemma 3.1: A labeling f^C corresponding to a cut C on the constructed graph is one α-β swap away from the initial labeling f.

Lemma 3.2: There is a one-to-one correspondence between cuts C on the constructed graph and labelings that are one α-β swap from f. Moreover, the cost of a cut C on the graph is |C| = E(f^C) plus a constant.

Corollary 3.1: The lowest-energy labeling within a single α-β swap move from f is f̂ = f^C, where C is the minimum cut on the constructed graph.

3.3.2.4 Expansion

The theory of this algorithm is similar to the α-β swap described above. The main difference from the swap algorithm is the introduction of auxiliary nodes. I will not explain it in full here, only give an overall idea of the relationship between the edge weights and the energy function introduced before. A cut can be described as follows: a pixel p is assigned label α when the cut C separates p from the terminal α. If p is not chosen to change to label α, its original label f_p is kept.

Lemma 3.3: A labeling f^C corresponding to a cut C on the constructed graph is one α-expansion away from the initial labeling f.
Lemma 3.4: There is a one-to-one correspondence between elementary cuts on the constructed graph and labelings within one α-expansion of f. Moreover, for any elementary cut C, we have |C| = E(f^C).

Corollary 3.2: The lowest-energy labeling within a single α-expansion move from f is f̂ = f^C, where C is the minimum cut on the constructed graph.

In [39], the authors define the edge weights so that they correspond directly to the energy function: the weight of a t-link corresponds to the data term and the weight of an n-link corresponds to the smoothness term. Based on these definitions of the edge weights, all the theorems and corollaries can be proved. Figure 3.8 and Figure 3.9 are two examples illustrating the results of the α-β swap and α-expansion algorithms.

Figure 3.8 – Examples of α-β swap and α-expansion algorithms.

Figure 3.9 – Example of α-expansion. Leftmost is the input initial labeling; an expansion move is shown in the middle, and on the right is the corresponding binary labeling.

Window-based (local) algorithms may produce results with a number of errors, such as black holes or mismatches in the disparity map. With the graph cuts algorithm, object boundaries can be detected clearly thanks to the choice of smoothness term, and the label assignment of the disparity map is obtained thanks to the data term.

3.4 Concepts of Stereo Matching

3.4.1 Problem Formulation

The basic idea of stereo matching is: for every pixel in one image, find the corresponding pixel in the other image. The authors of [35] refer to this as the traditional stereo problem. The goal is again to find the labeling f that minimizes

E(f) = Σ_{p∈P} D_p(f_p) + Σ_{(p,q)∈N} V(f_p, f_q)

where D_p is the penalty for assigning a label to pixel p, N is the neighborhood system composed of pairs of adjacent pixels, and V is the penalty for assigning different labels to adjacent pixels. In the traditional stereo matching problem, the location displacement of each pixel is along the horizontal or vertical direction, so if we assign a label f_p to the pixel p in the reference image I, the corresponding pixel in the matching image I′ is (p + f_p). The matching penalty D_p enforces photo-consistency, the tendency of corresponding pixels to have similar intensities. A possible form of D_p is

D_p(f_p) = ‖I(p) − I′(p + f_p)‖.

For the smoothness term, the Potts model is usually used to impose a penalty when f_p and f_q differ. Its natural form as a smoothness term is V(f_p, f_q) = λ · T[f_p ≠ f_q], where the indicator function T[·] is 1 if its argument is true and 0 otherwise [35]. The terms D and V can be arranged as tables: D as a table of matching penalties with one entry per pixel and candidate disparity, and V as a table over pairs of labels. For stereo with the intensity difference as the data term and the Potts model as the smoothness term, the entries of D are the intensity differences between I(p) and I′(p + d) for each candidate disparity d, while V is λ for every off-diagonal entry and 0 on the diagonal. A more efficient and fast implementation of graph cuts uses a new min-cut/max-flow algorithm [36] to find the optimal cuts. Figure 3.10 shows some experimental results of two-view stereo matching with graph cuts. We can see that even for heavily textured images, graph cuts can still detect clear object boundaries and assign correct labels to the pixels.

Figure 3.10 – Original images and their results (left image, result, and ground truth). The top row is the "lamp" data sequence and the bottom row is the "tree" data set.
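The stereo energy above is exactly the form minimized by the expansion moves of section 3.3.2. As a reminder of how the outer loop is organized (cycles over labels, one iteration per label, as in Figure 3.7), here is a small sketch; the expansion_move solver that returns the best labeling within a single α-expansion (computed in practice via a min cut on the constructed graph) is assumed to be supplied by the caller, so the names here are illustrative.

```python
def alpha_expansion(labels, label_set, expansion_move, energy):
    """Outer loop of the alpha-expansion algorithm (Figure 3.7, bottom).

    labels         : initial labeling f (e.g. an integer disparity image)
    label_set      : the finite label set L (candidate disparities)
    expansion_move : callable (f, alpha) -> f', the lowest-energy labeling
                     within one alpha-expansion of f (assumed supplied,
                     e.g. via the min-cut construction of [39])
    energy         : callable f -> E(f), the stereo energy of section 3.4.1
    """
    best = energy(labels)
    success = True
    while success:                 # one pass over all labels = one "cycle"
        success = False
        for alpha in label_set:    # each alpha-expansion attempt = one "iteration"
            candidate = expansion_move(labels, alpha)
            e = energy(candidate)
            if e < best:
                labels, best = candidate, e
                success = True     # keep cycling until no expansion lowers E(f)
    return labels
```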
3.4.2 Conclusion and Future Work

Binocular stereo vision is a relatively mature field, and a large number of two-view stereo matching algorithms have been proposed in recent years. The results of most approaches turn out to be quite good, with not only clear object boundaries but also accurate disparity values. An evaluation of these stereo methods provided by Middlebury [5] weighs their advantages and disadvantages, and gives readers an overall understanding of the current trends in the stereo vision area. However, the given data have been simplified to a pair of rectified images, where the corresponding points are easy to find (one only needs to search along a horizontal or vertical line), since the data sets offered by Middlebury are strictly offset by several pixels between the left and right images. Future work may focus on two or more casually taken images without such a strict horizontal or vertical translation; handling slight rotation can be viewed as a challenge to be addressed.

3.5 Photography

In this section, I introduce a series of concepts closely related to this project. Camera parameters, post-processing techniques, and technical principles are included to help the reader better understand our project. In section 3.5.1, some basic theory is introduced to build a fundamental picture of photography, especially of the camera. In section 3.5.2, I describe some pre-processing and post-processing techniques as well, such as focusing, refocusing, and defocusing methods.

3.5.1 Technical Principles of Photography

3.5.1.1 Pinhole Camera Model

A pinhole camera is a simple camera with a single small aperture and no lens to focus light. Figure 3.11 shows a diagram of a pinhole camera. This camera model is usually used as a first-order approximation of the mapping from a 3D scene to a 2D image, which is the main assumption in our project (described later).

Figure 3.11 – A diagram of a pinhole camera.

The pinhole camera model describes the mathematical relationship between the coordinates of a 3D point and its projection onto the 2D image plane of an ideal pinhole camera. The reason we use such a simple form is that, even though effects such as geometric distortion or depth of field are not taken into account, they can still be handled by applying suitable coordinate transformations to the image coordinates. Therefore, the pinhole camera model is a reasonable description of how a common camera with a 2D image plane depicts a 3D scene. Figure 3.12 illustrates the geometry of the pinhole camera mapping.

Figure 3.12 – The geometry of a pinhole camera.

A point R located at the intersection of the optical axis and the image plane is referred to as the principal point or image center. A point P somewhere in the world at coordinates (x1, x2, x3) represents an object in the real world captured by the camera. The projection of point P onto the image plane is denoted Q; this point is given by the intersection of the projection line (green) and the image plane. Figure 3.13 shows the same geometry viewed from the X2 axis, which better demonstrates how the model works in practice.

Figure 3.13 – The geometry of a pinhole camera as seen from the X2 axis.

I apply this model in the first step of our project, computing the depth map. It is very useful when calculating the camera parameters and the partially occluded parts of the 3D scene; I describe it in detail in a later chapter.
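A minimal sketch of the pinhole projection just described, written in the usual K[R|t] form, is given below; the intrinsic values used in the example are illustrative, not those of our camera.

```python
import numpy as np

def project_pinhole(point_3d, K, R, t):
    """Project a 3D world point onto the 2D image plane of a pinhole camera.

    K : 3x3 intrinsic matrix (focal length and principal point)
    R : 3x3 rotation, t : translation (extrinsic parameters)
    Returns the pixel coordinates of the projection Q and the depth.
    """
    p_cam = R @ np.asarray(point_3d, dtype=float) + np.asarray(t, dtype=float).ravel()
    u, v, depth = K @ p_cam                          # perspective projection
    return np.array([u / depth, v / depth]), depth   # homogeneous -> pixel coords

# Illustrative intrinsics: focal length 800 px, principal point (320, 240)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
pixel, depth = project_pinhole([0.1, -0.05, 2.0], K, np.eye(3), np.zeros(3))
```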
3.5.1.2 Camera Aperture

In optics, an aperture is a hole or opening through which light travels. In other contexts, especially photography, aperture refers to the diameter or shape of the opening in the camera lens. A camera can use a large or small aperture to control the amount of light reaching the film or image sensor, and an aperture can also have different shapes to control the shape of the rays passing through. Combined with variation of the shutter speed, the aperture size regulates the film's or image sensor's degree of exposure to light. Typically, a fast shutter speed requires a larger aperture to ensure sufficient light exposure, and a slow shutter speed requires a smaller aperture to avoid excessive exposure. Figure 3.14 shows two different sizes of a given camera aperture.

Figure 3.14 – A large (1) and a small (2) aperture.

The lens aperture is usually specified as an f-number, the ratio of focal length to effective aperture diameter. A lower f-number denotes a larger aperture opening, which allows more light to reach the film or image sensor. Figure 3.15 illustrates some standard aperture sizes. For convenience, I will use the "f / f-number" form to represent the size of the aperture.

Figure 3.15 – Diagram of decreasing aperture sizes (increasing f-numbers).

I will not say more about the camera aperture here. It has a strong relationship with photographic effects such as depth of field, and it is discussed again in section 3.5.2 together with the practical techniques.

3.5.1.3 Circle of Confusion (CoC)

In photography, the circle of confusion is also used to determine the depth of field. It defines how much a point needs to be blurred in order to be perceived as unsharp by the human eye. When the circle of confusion becomes perceptible, we say that the area is outside the depth of field and therefore no longer "acceptably sharp" under the definition of DOF. Figures 3.16 and 3.17 picture how the circle of confusion relates to depth of field.

Figure 3.16 – The range of the circle of confusion.

Figure 3.17 – Illustration of the circle of confusion and depth of field.

Again, the relationship between the circle of confusion and the depth of field is described further in the later sections.

3.5.2 Effects and Processing in Photography

3.5.2.1 Depth of Field

In optics, particularly in photography, the depth of field (DOF) is the range of distance around the subject that appears acceptably sharp in the image. Figure 3.18 depicts the depth of field in the real world.

Figure 3.18 – The area within the depth of field appears sharp, while the areas in front of and behind the depth of field appear blurry.

In general, the depth of field does not change abruptly from sharp to unsharp, but instead appears as a gradual transition (Figure 3.19).

Figure 3.19 – An image with very shallow depth of field, which appears as a gradual transition (from blurry to sharp, then to blurry again).

From the introduction above, it is clear that if we prefer a photo with objects sharp everywhere, we should enlarge the depth of field, while we should shorten the depth of field if we want to emphasize one object and blur the rest. The DOF is determined by the camera-to-subject distance, the lens focal length, the lens f-number, and the format size or circle-of-confusion criterion. The camera aperture (lens f-number) and the lens focal length are the two main factors that determine how large the depth of field is.
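To relate the f-number, the circle of confusion, and the depth of field numerically, here is a small sketch using the common thin-lens approximations (hyperfocal distance and near/far limits). It assumes the subject distance is much larger than the focal length, and is only meant to illustrate the trends discussed above; the numeric values are illustrative.

```python
def depth_of_field(focal_length_mm, f_number, subject_dist_mm, coc_mm=0.03):
    """Approximate depth of field from the quantities discussed above.

    Uses the common thin-lens approximations (valid when the subject distance
    is much larger than the focal length); coc_mm is the circle-of-confusion
    criterion, roughly 0.03 mm for a full-frame sensor.
    """
    aperture_diameter = focal_length_mm / f_number            # f-number = f / D
    hyperfocal = focal_length_mm ** 2 / (f_number * coc_mm) + focal_length_mm
    near = hyperfocal * subject_dist_mm / (hyperfocal + subject_dist_mm)
    far = (hyperfocal * subject_dist_mm / (hyperfocal - subject_dist_mm)
           if subject_dist_mm < hyperfocal else float('inf'))
    return aperture_diameter, near, far

# A 50 mm lens focused at 2 m: f/1.8 gives a much shallower sharp range
# (smaller near-to-far span) than f/5.6, matching the discussion above.
print(depth_of_field(50, 1.8, 2000))
print(depth_of_field(50, 5.6, 2000))
```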
For a given focal length, increasing the aperture diameter decreases the depth of field; likewise, for a given f-number, using a lens of greater focal length decreases the depth of field. Figure 3.20 shows scenes taken with two different aperture sizes. In the left image, the background competes for the viewer's attention: even when our eyes focus on the flower in the foreground, we cannot ignore the background scene. In the right image, however, the flowers are isolated from the background.

Figure 3.20 – Left: f/32 – narrow aperture and slow shutter speed; Right: f/5.6 – wide aperture and fast shutter speed.

For a pinhole camera or a point-and-shoot camera, the aperture diameter is usually small, which results in scenes that are sharp everywhere in the photo. The goal of our project is to use several sharp images to create a relatively shallow depth-of-field effect in the resulting image.

3.5.2.2 Bokeh

In photography, bokeh is the blur, or the aesthetic quality of the blur, in the out-of-focus areas of an image – the way the lens renders out-of-focus points of light. Different lens aberrations and camera aperture shapes produce different bokeh. It is hard to say whether the bokeh in a certain photo is good or bad; it depends on how the lens design blurs the image. If the bokeh pleases our eyes, it is said to be good, while unpleasant blurring can be considered bad bokeh. Photographers with a larger camera aperture sometimes use a shallow depth-of-field technique to shoot photos with prominent out-of-focus regions, which easily separates the foreground objects from the background scene. The examples in Figure 3.21 illustrate different bokeh effects.

Figure 3.21 – Different effects of bokeh.

Bokeh is often most visible around small background highlights, such as specular reflections and light sources, which is why it is often associated with such areas [9]. However, bokeh is not limited to highlights, as blur occurs in all out-of-focus regions of the image. Bokeh has a strong relationship with the shape of the camera aperture: the aperture shape has a great influence on the subjective quality of bokeh, and in the out-of-focus blurry regions we can clearly observe the aperture shape (see Figure 3.22 [15]).

Figure 3.22 – Different shapes of aperture cause different bokeh effects.

Even in our project, if we change the shape of the aperture, the shape of the small background highlights in the blurry regions changes accordingly.
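To illustrate how the aperture shape appears in out-of-focus highlights, the sketch below convolves an image containing a few bright points with an aperture-shaped kernel; the hexagonal mask stands in for a six-blade diaphragm and is purely illustrative, not the bokeh model used in our project.

```python
import numpy as np
from scipy.signal import fftconvolve

def polygon_aperture_kernel(radius, sides=6):
    """Binary mask approximating a polygonal (e.g. six-blade) lens aperture."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    angles = np.arctan2(y, x)
    # distance from the center to the polygon edge in each direction
    edge = radius * np.cos(np.pi / sides) / np.cos(
        (angles % (2 * np.pi / sides)) - np.pi / sides)
    k = (np.hypot(x, y) <= edge).astype(np.float64)
    return k / k.sum()

# A black image with a few bright highlights: after convolution each
# highlight takes on the shape of the aperture, as in Figure 3.22.
img = np.zeros((101, 101))
img[30, 30] = img[60, 75] = img[80, 20] = 1.0
bokeh = fftconvolve(img, polygon_aperture_kernel(12, sides=6), mode='same')
```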
3.5.2.3 Defocusing and Refocusing

In optics, and especially in photography, defocus simply means out of focus. Generally, defocus reduces the sharpness of the whole image: in a defocused image, sharp edges become gradual transitions and finer detail in the scene is blurred or can no longer be seen clearly. To simulate the shallow depth of field of a lens with a larger aperture, which is similar to the goal of our project, one can increase the defocused parts of the image with a suitable technique, such as the paper "Defocus Magnification" [24]. Refocusing, by definition, means to focus again, or to change the emphasis in a given image. Refocus imaging is a kind of redefining photography: the user selects one region of the image to emphasize, and the other parts are then naturally blurred. This topic is not new in some areas, where researchers implement refocusing by modifying the imaging hardware of conventional cameras.

The main goal of our project is refocusing from multiple images, which is similar to the final objective of some existing companies and techniques. The difference is that we only develop software instead of adjusting the camera hardware. Software is, to some degree, easier to accept for users who may not have much professional knowledge of photography.

3.6 Defocus Magnification [24]

3.6.1 Overview and Problem Formulation

Photographers often prefer a blurry background produced by a shallow depth of field in order to emphasize the foreground object, as in portraits. Unfortunately, the small lens apertures of common point-and-shoot cameras limit how much defocus can occur. Defocus magnification is an image-processing technique that simulates the shallow depth of field of a larger lens aperture by magnifying the defocused regions of an image. It takes a single input image with a large depth of field and then increases the degree of blur in the out-of-focus regions. The authors first estimate the spatially varying amount of blur over the image by estimating the size of the blur kernel at edges, and then propagate this blur estimate over the areas of the image that need further blurring. Based on the estimated amount of blur, they generate a defocus map, which acts much like a depth map. They propagate the blur measure to neighbors with similar color values, under the assumption that blurriness changes smoothly over the whole image except where the color intensity is discontinuous. Finally, they magnify the defocus effects in the image using the produced defocus map; given the defocus map, the authors rely on the lens blur filter of Photoshop to compute the defocused output.

3.6.2 Results and Conclusion

Figure 3.23 shows one significant result from the "Defocus Magnification" paper.

Figure 3.23 – Using the defocus map (b), this method can synthesize refocusing effects. Here (a) is the input image, (b) is the produced defocus map, and the result (c) looks as if the foreground is focused.

This example uses the defocus map to synthesize a refocusing effect. The result simply looks as if the man in the foreground is focused; actually, the man is identical in the input image and the refocusing result. The method magnifies the blurriness of the background to make the foreground object appear sharper. In other words, it is not a real refocusing process, since some regions of the given input image were already blurred from the very beginning. However, the blurry areas of the input image are necessary, as this approach has to estimate the defocus map from the degree of blur at edges. Figure 3.24 shows another set of results from this paper. What they do is clearly to increase the defocus of the parts that are already blurred, in order to present a more realistic or artistic impression to viewers (emphasizing the major object in the foreground), while keeping the sharp parts unchanged. For future work, this approach could be extended to video input. Besides, the partial occlusion problem should also be studied, which is a traditional issue for depth-of-field effects.

Figure 3.24 – Other results. From left to right: the original images, their defocus maps, and results blurred using the magnification approach.
is an early-stage company headquartered in Mountain View, California, working in the area of computational photography. There is little information online about how they implement refocusing. What we do know is that they develop a special lens and capture the entire "light field" entering the camera. The main idea of their project is light field photography, which requires only a "simple optical modification to existing digital cameras". This type of camera is called a 4D light field camera; the new plenoptic camera essentially turns a 2D photo into 4D. In this thesis I will not discuss the 4D light field further, and only present some results taken from the company's website (Figure 3.25). The quality of these results is good. Compared to this approach, the basic principle of our implementation is simpler to understand, and the method we develop is more convenient and accessible for ordinary people without much professional photography knowledge.

Figure 3.25 – Refocusing on three different layers. The first image focuses on the red lady layer, the second on the yellow lady layer, and the third on the blue lady layer

3.8 Adobe Photoshop

Adobe Photoshop provides a filter called lens blur. Its input is one original image and a given depth map. According to the depth map, together with options such as aperture shape, specular highlights and blur focal distance, the user can obtain a refocusing result. The theory behind it is not hard to see: once a depth map is available, i.e. the relationships among all the depth values in the image are fully known, the work left to Photoshop is the blurring step according to the chosen focal distance. Therefore, the quality of the result depends mostly on the quality of the input depth map. Moreover, requiring a depth map as input is inconvenient for users: most of the time people cannot provide one, and sometimes they do not even know what a depth map is. This limits ordinary users who want to refocus a photo in Photoshop by themselves. Figure 3.26 shows several results produced by the lens blur filter in Photoshop. Note that the depth map used here is a ground-truth depth map of the original image; in practice, there is no way for a photographer to obtain a ground-truth depth map directly. I will later compare the results of Photoshop to those of our project using the same original images and the same depth map, where the depth map is generated by our program instead of the ground truth from the website.

Figure 3.26 – Results from Adobe Photoshop. The first row is the input image, which is sharp everywhere, and the given ground-truth depth map. The second row shows three different refocusing results. From left to right: focus far away from the camera, focus on the middle of the scene, and focus on the front part of the image

Chapter 4
Refocusing from Multiple Images

In computer vision there are many methods for producing a depth map. However, some of those approaches accept only two images (left and right) as input and perform stereo matching. With only two input images, the information about partial occlusion is lost, because the user cannot provide enough data to reconstruct the layers of the scene (or the whole 3D scene). Another family of depth computation methods reconstructs a 3D scene from multiple images.
In general, 3D scene reconstruction needs a large set of similar images, i.e. the same real scene taken from different viewpoints. Those photos can be shot from viewpoints that differ considerably, as in the example in [16]. The output is a full 3D view of the real-world scene, from which we can naturally obtain an accurate depth value for each object. Although the intermediate result of our project is also to obtain depth values from multiple images, preparing a large dataset to reconstruct an entire 3D scene is excessive for our purpose. We only require five to ten input photos, and a dense depth map with roughly separated layers is enough.

Once we acquire the depth map from the given input images, the next step is to refocus a reference image with the help of that depth map. The main task of refocusing in our project is to blur the parts of the reference photo outside the user-selected focus region. We choose the common blurring approach, convolution of the sharp image with a blur kernel, to obtain the final blurred result. Some additional post-processing has to be done as well, such as handling object boundaries, alpha blending and bokeh.

In photography, lens aperture refers to the size of the opening in the camera lens through which light passes. By adjusting the aperture size, the photographer ensures that the correct amount of light reaches the digital sensor during a given exposure. Here is a simple example of the relationship between aperture and depth of field. Picture this: we are taking a portrait photo and focus the lens on the subject's face; behind him is a tree. If we set the lens to a large aperture (small f-number), the tree behind the subject will not be in focus (blurred). In contrast, if we use a small aperture (large f-number), the tree will be in focus (sharp). Image refocusing takes advantage of this behaviour. By changing how much of the photograph is in focus, users can control exactly which details show up and which do not, and can lead the viewer's eye wherever they wish.

Our main task is to achieve a shallow depth-of-field effect from multiple small-aperture photos taken from slightly different viewpoints; together, the multiple photos simulate a bigger camera aperture. The first and most critical step in this project is to compute the depth map of the objects in the photo. After that we can refocus the objects according to their respective depths.

4.1 Data Set

The assumptions of our project are as follows:

1. n calibrated images are given, including camera intrinsic and extrinsic parameters (camera positions, rotation and translation matrices, etc.).
2. The rotation and translation among the photos should not be large. Our goal is to simulate a larger camera aperture, so slight rotation and translation is enough.
3. All photos are taken with the same point-and-shoot camera, so the focal length remains the same; what we need to consider is the camera movement during shooting.
4. All photos should be taken with a small aperture (the aperture of a point-and-shoot camera), i.e. they are all sharp enough.

For the first assumption, how to compute the camera intrinsic and extrinsic parameters is introduced in the next section. Assumptions 2-4 describe the limitations of our dataset: if we use only five to ten images shot from very different viewpoints, we may lose too much information to produce a dense depth map successfully.
Figures 4.1-4.3 show several of our original input photo sets.

Figure 4.1 – Flower garden set, extracted from a flower garden video

Figure 4.2 – Rocks set, from the Middlebury website [18]

Figure 4.3 – Desk set, taken with our own point-and-shoot camera

4.2 Computation of Camera Parameters

According to the first assumption in the previous section, the camera intrinsic and extrinsic parameters (camera positions, rotation and translation matrices, etc.) must be provided so that the depth map computation has enough information. We use a popular piece of software, the Voodoo Camera Tracker [31]. It is a tool for integrating virtual and real scenes: it can estimate camera parameters and reconstruct a 3D scene from image sequences. Here we only need the camera parameters rather than a reconstructed 3D scene; the reason is described below. The estimation method of Voodoo consists of the following five processing steps [31]:

- Automatic detection of feature points
- Automatic correspondence analysis
- Outlier elimination
- Robust incremental estimation of the camera parameters
- Final refinement of the camera parameters

These are the standard steps for estimating camera parameters. We use Voodoo as a convenient tool to obtain the parameters quickly, without computing each of them step by step ourselves. The parameters of each input image obtained from Voodoo and used in the following problem formulation are:

- Camera position
- Focal length
- Intrinsic matrix, including radial distortion, pixel size, focal length and horizontal field of view
- Projection of 3D coordinates into the camera image (computed from the rotation/extrinsic matrix, the camera position and the 3D coordinates in millimetres)
- 3D feature points

We obtain a set of 3D feature points, which are real points in the 3D scene. However, the number of feature points is by no means sufficient for generating a dense depth map; they can only reconstruct the outline of a 3D scene as a sparse point distribution. As Figure 4.4 shows, the 3D model gives a rough impression of the real world, but it lacks the information needed to build a dense map.

Figure 4.4 – 3D model showing the feature points, observed from two different angles

The feature points are clearly too scattered to support the further computation directly. In other words, if we wanted to obtain a dense depth map from this information alone, we would have to connect these sparse points under some geometric rule or algorithm, or triangulate them. That would be another kind of estimation, with its own uncertainty and computational workload. Instead, we combine the given camera parameters above with other methods to compute a dense depth map.

4.3 Estimation of Depth Value in 3D Scene

For the entire 3D scene, we need to divide it into several layers according to how the objects are distributed in the real world. For example, the left image in Figure 4.5 can be roughly separated into three layers: box, baby and map. This is a simple case. The right image, of a piece of cloth, can still be divided into front, middle and back layers.
Even though an object like the baby in the left picture is not actually a plane, in our project we simplify it to one, because people prefer to view an object as a single unit rather than have a human being or another integrated object partitioned into several parts.

Figure 4.5 – Two examples illustrating depth values in the real world

We have already obtained the depth value (z value) of the feature points from Voodoo. Figure 4.6 plots hundreds of feature points extracted from two image sequences, where the y axis is the depth value of each feature point in the 3D scene.

Figure 4.6 – Z-value distribution of 541 feature points from two different sequences

By observing the distribution charts of many input image sequences, we conclude that most z-value distributions follow the pattern shown in Figure 4.6, except for a few erroneously detected points, and can be approximated as linear. Therefore, for depth estimation we first decide how many discrete layers are appropriate for the real scene. Second, we group the z-values into that many classes, so that within each class the z-values are close to each other, and then average the values of each group. For instance, the left chart in Figure 4.6 suggests dividing the scene into 7-10 layers, while for the right one 3 layers are enough. The output of this step is therefore a set of discrete values representing the layers of the 3D scene in the real world.

4.4 Problem Formulation and Graph Cuts

4.4.1 Overview of Problem Formulation

Our current goal is to compute a depth map from multiple images based on the parameters given and computed above. [33] builds a good framework for solving such problems; we combine the problem formulation of that paper with our own ideas in order to reach the final goal, a layer depth map.

Suppose we have n calibrated images of the same scene taken from slightly different viewpoints (the data sets were shown in Section 4.1). From the calibrated images we obtain the camera intrinsic and extrinsic parameters (Section 4.2), which are essential for finding corresponding pixels among the n images. Let P_i be the set of pixels of camera i, and let P = P_1 ∪ … ∪ P_n be the set of all pixels. Our goal is to find the depth of every pixel (a dense depth map), i.e. a labeling f : P → L, where L is a discrete set of labels corresponding to different depths in the real world (Section 4.3). Figure 4.7 shows a simple case of this scene construction.

Figure 4.7 – Example of interactions between camera pixels and the 3D scene

For a certain pixel p in one image, we first project it into the 3D scene using the camera matrices (from 2D to 3D); the pixel p corresponds to a ray in 3D space. In Figure 4.7, C1p represents such a ray intersecting the scene. We store the intersections with the depth labels for the subsequent energy minimization; the discrete depth labels were provided by the previous section. Suppose the intersection of C1p with depth label l is t1, with coordinates (tx, ty, l). The next step is to project t1 back into the other cameras. Since each camera has its own projection matrix, the 3D point t1 yields corresponding points t2', t3', …, tn', one per camera (n is the number of cameras). We then compare the intensity difference of each pair (t1, t2'), (t1, t3'), …, (t1, tn'); a minimal sketch of this project-and-compare step is given below.
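As a concrete illustration of the project-and-compare step just described, here is a minimal sketch, not our actual implementation. For simplicity it assumes that the reference camera sits at the world origin with identity rotation, that K_ref is its 3x3 intrinsic matrix, that each matching camera is given by a 3x4 projection matrix (all obtainable from the Voodoo output), and that the images are float grayscale arrays; the function and variable names are illustrative.

import numpy as np

def backproject(u, v, depth, K_ref):
    # Lift reference pixel (u, v) to a 3D point at the given depth label,
    # assuming the reference camera is at the world origin with identity rotation.
    ray = np.linalg.inv(K_ref) @ np.array([u, v, 1.0])
    return depth * ray

def project(X, P):
    # Project a 3D world point X into another camera with its 3x4 matrix P.
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def photo_consistency(u, v, depth, ref_img, match_imgs, proj_mats, K_ref):
    # Cost of assigning this depth label to pixel (u, v): the smallest squared
    # intensity difference between the reference pixel and its reprojections.
    X = backproject(u, v, depth, K_ref)
    best_cost, best_cam = np.inf, -1
    for i, (img, P) in enumerate(zip(match_imgs, proj_mats)):
        uq, vq = np.round(project(X, P)).astype(int)
        if 0 <= vq < img.shape[0] and 0 <= uq < img.shape[1]:
            cost = float((ref_img[v, u] - img[vq, uq]) ** 2)
            if cost < best_cost:
                best_cost, best_cam = cost, i
    return best_cost, best_cam

Sweeping this function over all pixels and all depth labels yields the per-label costs used in the energy formulation that follows.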
The smallest difference and the corresponding camera index are stored for the graph cuts algorithm.

We define an energy function to be minimized, in the standard form

E(f) = E_{\mathrm{data}}(f) + E_{\mathrm{smooth}}(f).

We then use graph cuts to find the optimal labeling of this function. The data term imposes photo-consistency:

E_{\mathrm{data}}(f) = \sum_{\langle p, q \rangle \in \mathcal{I}} \left( I(p) - I(q) \right)^2,

where I(\cdot) denotes pixel intensity and \mathcal{I} consists of all pixel pairs whose 3D points have the same depth, i.e. if \langle p, f(p) \rangle and \langle q, f(q) \rangle are in \mathcal{I}, then f(p) = f(q). The smoothness term can be written as

E_{\mathrm{smooth}}(f) = \sum_{\{p, q\} \in N} V_{p,q}\left( f(p), f(q) \right),

where N is the neighborhood system within each image and the term V_{p,q} is required to be a metric. The smoothness term here is the same as in the two-view stereo problem.

4.4.2 Results of Single Depth Map

The remaining job of this step is to assign a suitable depth label to each pixel of the reference image using the graph cuts algorithm; the fundamental theory and implementation of graph cuts were discussed earlier. Figure 4.8 shows some results from our program. The object boundaries and depth information are computed relatively accurately.

Figure 4.8 – Results on the flower garden and rocks datasets

4.4.3 Considerations and Difficulties

One difficulty in computing an accurate depth value for each pixel of the reference image is obtaining a proper test data set. All the images must satisfy the assumptions described in Section 4.1: because we simulate an aperture larger than the small aperture of a point-and-shoot camera, the rotation and translation among the photos must not be large.

Finding corresponding pixels among the n calibrated images is also a critical task. At first I tried to detect corresponding points with the SIFT algorithm. This method finds accurate correspondences, but only sparse ones, whereas extracting a depth map requires dense correspondence: for every pixel of the reference image we must compute the corresponding pixel in each matching image. The camera information, such as the camera position and translation matrix, therefore has to be very accurate when intersecting an image pixel's ray with the corresponding 3D point.

Finally, we need an approximation algorithm based on graph cuts that finds a strong local minimum. The energy function has the two traditional terms, a data term and a smoothness term, and it can be minimized by graph cuts because it satisfies the conditions defined in [39].

4.5 Layer Depth Map

The depth maps in Figure 4.8 only represent depth values from the single view of one particular reference image. A single-view depth map cannot easily handle the partial occlusion problem. One of the reasons we use multiple images is to handle occluded parts that cannot be seen in the reference image but appear in other cameras. For example, in the flower garden sequence, there is actually something behind the tree that can be seen from another viewpoint (Figure 4.9).

Figure 4.9 – Occlusion problem. Left: the reference image, in which the parts behind the tree cannot be seen. Right: viewed from another viewpoint, the occluded region behind the tree becomes visible (inside the red line)

To address the occlusion problem, we again divide the whole scene into several discrete layers. For the flower garden example in Figure 4.8, the scene is separated into 8 layers, as in Figure 4.10.
Figure 4.10 – Eight layers of the scene extracted from the flower garden reference image

The single-view depth map is only a simple division along object boundaries; because we have multiple cameras, we can gather more information about each layer. The principle is that for a 3D point located in the real scene, we can determine whether it appears in each input image. The method proceeds layer by layer as follows:

- Extend the non-zero region of the layer, following its original shape (Figure 4.11).
- For a pixel p of the reference image lying in the extended area, project it into the 3D scene, giving it a 3D coordinate p_3d.
- Project p_3d back into each matching image using the corresponding projection matrix, producing a sequence of corresponding points (q2, q3, …, qn).
- For each point in this sequence, compare its intensity with that of the nearest point of the reference image inside the original non-zero region. If the difference is smaller than a threshold, keep the corresponding point as a candidate, meaning this point may be occluded.
- For each candidate point, check whether it lies behind the previous layers. If yes, it belongs to the occluded part; if no, discard it.

Figure 4.11 – Extension of each layer in preparation for further estimation

After this procedure we can compute the occluded parts of each layer (Figure 4.12). Thus, when we combine all the layers, they may overlap each other, whereas there is no overlap in the single-view depth map. Even though some points are still detected wrongly, they can be ignored in the blurring step, as described later.

Figure 4.12 – Final layer depth map. Note that the first layer has no occluded part

4.6 Layer Blurring

Having obtained the layer depth map, the current task is to blur each layer according to the user's selection and its depth value. Before blurring, one important issue is which type of blur kernel is used and how it behaves during the process. When a kernel is applied, the neighbors of the center pixel also contribute to its output value. In each layer map, e.g. in Figure 4.12, the pixels in the black regions have no intensity; if we blur the boundary of a layer directly, we wrongly mix in these black points and obtain incorrect results, because there is a sharp drop in intensity along the boundary (Figure 4.13).

Figure 4.13 – Incorrect results: the boundaries of the layers are clearly visible

To solve this black-point problem, we use a simple method to estimate values for the neighboring black region. Figure 4.14 shows an example of the intensities of boundary pixels. We fill the zero region with values close to the boundary pixels, for which there are several simple choices:

- Symmetric: neighbor values outside the bounds of the array are computed by mirror-reflecting the array across its border.
- Replicate: neighbor values outside the bounds of the array are assumed to equal the nearest border value.
- Circular: neighbor values outside the bounds of the array are computed by implicitly assuming the array is periodic.

Figure 4.14 – Example of the intensities of boundary pixels

We decided to replicate the nearest border value into the zero region, which effectively eliminates the error in the blurred boundary pixels; a small sketch of this padded per-layer blur is given below.
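The following is a minimal sketch of the padded per-layer blur described above; it is an illustration rather than our exact implementation. It assumes a float grayscale layer image together with its binary mask, uses replicate-style padding as chosen above, and uses a normalized box kernel as a stand-in for whichever aperture-shaped kernel the user selects; all names are illustrative.

import numpy as np
from scipy.ndimage import convolve, distance_transform_edt

def blur_layer(layer, mask, kernel_size=9):
    # layer: float grayscale image, zero outside the layer; mask: 1 inside, 0 outside.
    # Step 1: fill the black (zero) region with the value of the nearest layer
    # pixel, i.e. the 'replicate' choice discussed above.
    _, indices = distance_transform_edt(mask == 0, return_indices=True)
    filled = layer[indices[0], indices[1]]
    # Step 2: blur with a normalized kernel; a box kernel stands in for the
    # user-selected aperture shape (disk, Gaussian, ...). mode='nearest' also
    # replicates values at the image border.
    kernel = np.ones((kernel_size, kernel_size), dtype=float) / (kernel_size ** 2)
    return convolve(filled, kernel, mode='nearest')

Replacing the box kernel with a disk-, heart- or other aperture-shaped mask reproduces the different bokeh shapes shown later in the results.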
Following this simple replication method, we blur each layer. The user can choose different kernels, such as a box filter, disk filter or Gaussian filter, to simulate different shapes of the lens aperture.

4.7 Combining Blurry Layers

Given the blurred layers from the previous sections, the final step is to combine them to achieve our ultimate goal, image refocusing. Since two layers may overlap, an approach is needed to deal with this. Alpha blending is one of the classical methods for handling the overlap of foreground and background. In our project there is no need to apply alpha blending to every background-foreground pair; two cases call for it. One is when we sharpen objects in the background while blurring the foreground; the other is when there is a very bright light source in the background. In the first case, when an object far from the camera is sharpened, we can see blurred boundaries around the foreground objects in the overlapped regions, the so-called transparent effect around the boundary (Figure 4.15). In other words, the border of the foreground object is not perfectly sharp but slightly blurry, and alpha blending lets us implement this partial transparency (a small compositing sketch is given just before the experimental results).

Figure 4.15 – Examples of the transparent effect around the boundary of the bottle. The right image comes from www.refocusimaging.com [21]

Alpha blending is a convex combination of two colors that allows for transparency effects in computer graphics. The alpha value ranges from 0.0 to 1.0, where 0.0 represents a fully transparent color and 1.0 a fully opaque color. When a color fg with alpha value α is laid over an opaque background color bg, the resulting color is

Result = (1 - α) * bg + α * fg

The alpha component may be used to blend the red, green and blue components equally, as in 32-bit RGBA, which is what produces the transparent effect in color images. Figure 4.16 shows a result on one of our test images: on the left the foreground object is sharpened, on the right the background object. We handle the overlap and the object border with alpha blending. In the right image, the boundary of the cloth is not perfectly sharp, which looks more realistic; in the left image, the boundaries of the small foreground objects remain sharp, without any blending.

Figure 4.16 – One result of layer combination. Left: the foreground is focused. Right: the background is focused

4.8 Results and Comparison

In this section I show several sets of experimental results. Each set includes the original image, the single depth map, and two to four results focused on different layers. We compare our results to those of Adobe Photoshop using the same depth map produced by our project. In addition, a series of result images with different lens aperture shapes (different blur kernels) is presented.
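Before moving on to the results, here is a minimal sketch of how the steps of Sections 4.6 and 4.7 fit together: each layer is blurred according to its distance from the focused layer and then composited back to front with the alpha rule above. It is only an illustration under simplifying assumptions (larger depth values are taken to be farther from the camera, the blur radius grows with the depth difference from the focused layer, and the soft alpha mask is obtained by smoothing the binary layer mask); blur_layer refers to the sketch in Section 4.6 and all other names are illustrative.

import numpy as np
from scipy.ndimage import uniform_filter

def refocus(layers, masks, depths, focus_depth, radius_scale=2.0):
    # layers: per-layer float grayscale images (zero outside each layer)
    # masks:  per-layer binary masks; depths: one representative depth per layer
    order = np.argsort(depths)[::-1]          # composite back to front
    out = np.zeros_like(layers[0])
    for i in order:
        # Kernel size grows with the distance from the focused depth;
        # the focused layer gets kernel size 1, i.e. it stays sharp.
        k = 1 + 2 * int(radius_scale * abs(depths[i] - focus_depth))
        blurred = blur_layer(layers[i], masks[i], kernel_size=k)  # see Section 4.6 sketch
        # Soft alpha: smooth the binary mask so the layer border becomes
        # slightly transparent, as described in Section 4.7.
        alpha = uniform_filter(masks[i].astype(float), size=k)
        out = (1.0 - alpha) * out + alpha * blurred
    return out

Refocusing on a different layer is then simply a matter of calling refocus with a different focus_depth.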
4.8.1 Experimental Results

- 1st set: Figure 4.17 – Result of the flower garden sequence (original image, depth map, and results focused on the front, middle and back layers)
- 2nd set: Figure 4.18 – Result of the rocks sequence (original image, depth map, and results focused on the front, middle and back layers)
- 3rd set: Figure 4.19 – Result of the gifts sequence (original image, depth map, and results focused on the front, middle and back layers)
- 4th set: Figure 4.20 – Result of the cap sequence (results focused on the front, middle and back layers)
- 5th set: Figure 4.21 – Result of the book sequence (original image, depth map, and results focused on the front, middle and back layers)

4.8.2 Comparison to Adobe Photoshop

In Photoshop, the lens blur filter can achieve the same refocusing goal as our project. Its input is one original image and one given depth map. For a fair comparison, we use the same single depth map, computed by our program, as the common input of both Photoshop and our project. All the depth maps have already been shown in Section 4.8.1, so here we only present the results of the two methods.

Figure 4.22 – Comparison between our project (left of each pair) and Photoshop (right of each pair)

Under the same conditions, i.e. common inputs, the results of the two methods are quite similar. However, the boundary between sharp and blurred objects is much harder in the Photoshop results than in ours. In terms of artistic and realistic appearance, our output is better: our method combines the multi-layer representation with alpha blending, so the boundaries in our results are smoother and more natural than Photoshop's. The biggest advantage of our method is the input: the user only needs to provide several hand-taken photos, without any additional information such as a depth map.

4.8.3 Different Lens Apertures

The user can also change the shape of the lens aperture to obtain different effects in the blurred regions. In this section, to show the shape effect more clearly, we divide the original image into only two parts, foreground and background, and assume that the foreground is very far from the background. Under this assumption the background is blurred heavily compared to the sharp foreground.

Figure 4.23 – Four different blur kernel shapes (box, Gaussian, disk and motion), simulating different lens aperture shapes

4.8.4 Bokeh

We also implement different bokeh effects. As described in the previous chapter (the bokeh section), bokeh can be roughly divided into two classes, with no clear dividing line between them: the blur around small background highlights, and the prominent out-of-focus region. Results for the second class were shown in Section 4.8.3: in those flower-garden images the house is far from the tree in the foreground, so blurring the background with different aperture shapes makes the shapes clearly visible in the blurry background. In this section we show results for the first class of bokeh, i.e. around small background highlights. For convenience of demonstration, we again separate the input photo into foreground and background. There is a bright point in the background, and we blur it to show the bokeh effect while the foreground object stays sharp. Figure 4.24 shows our bokeh results; the outline of the aperture shape appears in the pink bright light in the background. Of course, the whole scene would not normally be divided into just these two layers.
We do this only to make the bokeh shape easier to see.

Figure 4.24 – Bokeh effects produced by our project: (a) original, (b) circle, (c) heart, (d) triangle, (e) diamond

4.8.5 Different number of input images

We also study the effect of varying the number of input images. The ideal number of input photos for our project is 6-10. What if the user shoots only 3-4 photos, or provides a larger dataset of, say, 15 photos? We analyze these two situations with respect to two factors.

(a) Feature point extraction

One function of feature point extraction is to find the common related points among the input images. In theory, the more input images, the more extracted feature points, and the more accurate the 3D scene reconstruction. In our practice and implementation, however, the number of feature points extracted from an image sequence has an upper bound. For example, if 10 images yield 1000 feature points and this number has already reached the upper limit, then adding 10 more input images will still give around 1000 feature points, with no large increase. The reason is the restriction on our input images: as mentioned in Section 4.1, we require the photos to have only slight viewpoint changes, so the whole scene does not change much and the common feature points among the images stay relatively stable. Figure 4.25 is a chart for the flower garden sequence showing how the number of extracted feature points changes with the number of input images.

Figure 4.25 – Relationship between the number of input images (from 3 to 30) and the number of extracted feature points; input image size is 320x240

Therefore, with fewer than 5 input images the information may not be sufficient, leading to a small number of feature points and hence inaccurate extracted layers. On the other hand, with more than 10, or even 15, input photos the number of feature points stays almost the same; the extra images only add workload and computing time, so increasing the number of input images further is not useful in our project.

(b) Graph cuts

For this factor, we assume that the different numbers of input images yield the same extracted feature points. We apply the graph cuts algorithm to determine in which input images a 3D point of the real scene appears; this is used to handle the partial occlusion problem. If we have only a few images, e.g. 3, the occluded object may not be fully seen by the other cameras. Figure 4.26 illustrates this case: since our project restricts the input photos to slight viewpoint changes, i.e. small translation or rotation, the pink bottle is occluded by the white bottle and the three images cannot see the full view of the pink one. On the other hand, with more than 10 input photos the information provided by the cameras may be redundant, i.e. an occluded object may be seen by two or more cameras at the same time, which increases the workload of the graph cuts algorithm. Figure 4.27 shows the relationship between the number of input images and the running time of graph cuts; even one additional image can cause a large increase in elapsed time, which is inefficient and should be avoided.

Figure 4.26 – The partial occlusion problem with only 3 input photos
Figure 4.27 – Relationship between the number of input images (from 3 to 15) and the running time of graph cuts in seconds; image size is 640x480

4.8.6 Comparison to Other Software on Camera

We also compare our results to those of the depth-of-field software found on current cameras. That software likewise lets the user select a preferred region and blurs the non-selected regions. However, it only offers an adaptive circle as the region-selection tool, so the user can only select a roughly circular area of the photo to be sharpened. The results of our method are more realistic: we blur or sharpen an integrated layer instead of a circular region. Conceptually, the idea behind that software, "inside the circle is sharpened, outside is blurred", is wrong. Besides, the software does not allow the user to choose the aperture shape either.

Figure 4.28 – Comparison between the on-camera software and our refocusing method

4.8.7 Limitation

Our method has its own drawbacks and fails, with poor results, in the following cases.

1. The input photos must be shot with only small viewpoint changes. This is a critical restriction: if the input images have very different viewpoints, i.e. the user applies a large translation or rotation, or even changes the shooting plane, the produced depth map will be completely wrong. Our algorithm for finding corresponding points among the input images searches within a limited area; once the images change too much, each pixel of the reference image may have many similar corresponding points in the other images, which confuses the matching and ultimately prevents a correct depth map from being produced for the refocusing phase.

2. A very bright point as a partially occluded object. If a very bright point sits behind a box with half of it visible outside the box, our method fails to blur it. The intensity of a very bright point exceeds the common color range of 0-255, so our method cannot estimate its neighboring pixel values. We suggest that HDR (High Dynamic Range) techniques could handle this kind of problem, but that is outside the scope of this paper.

Chapter 5
Conclusion

Image refocusing is a promising and interesting research area in computer graphics and computer vision, especially in computational photography. There are many ways to solve this type of problem, both in hardware and in software. Our project, Multi-view Image Refocusing, is a new approach that is convenient for ordinary point-and-shoot users. It does not require any professional photography knowledge: the user simply shoots several photos from nearby viewpoints and selects a preferred region, and the output is a reference image with the selected region sharp and the rest blurred.

The project consists of two sub-procedures. The first is to compute a layer depth map; label assignment is the basic theory of this step, and we assign a label to each pixel with the graph cuts algorithm, which is fast and efficient. The second is to refocus the reference image based on the depth map produced in the first step; several methods are used to improve the blurring result. A comparison with the lens blur filter of Adobe Photoshop is also performed to weigh the advantages and disadvantages of our project.

Future work can focus on improving the input data set.
The current limitation is that better results come from image sequences with only a little translation and rotation; supporting larger viewpoint differences would be more convenient for users. In addition, our project is not robust to illumination variation: image sequences taken under different lighting conditions may cause unpredictable problems.

Bibliography

Conference / Journals / Technical Papers / Books

[1] Anat Levin, Rob Fergus, Fredo Durand, and William T. Freeman, "Image and Depth from a Conventional Camera with a Coded Aperture", In ACM SIGGRAPH 2007 papers (SIGGRAPH '07), ACM, New York, NY, USA, Article 70, DOI 10.1145/1275808.1276464, 2007.
[2] Chia-Kai Liang, Tai-Hsu Lin, Bing-Yi Wong, Chi Liu and Homer H. Chen, "Programmable aperture photography: multiplexed light field acquisition", In ACM SIGGRAPH 2008 papers, ACM, New York, NY, USA, 2008, DOI 10.1145/1399504.1360654.
[3] D. Greig, B. Porteous, and A. Seheult, "Exact Maximum a Posteriori Estimation for Binary Images," Journal of the Association for Computing Machinery, 35(4):921-940, October 1988.
[4] D. Scharstein and R. Szeliski, "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms," International Journal of Computer Vision, 47(1):7-42, May 2002.
[5] D. Scharstein and R. Szeliski, The Middlebury vision website, http://vision.middlebury.edu/stereo/.
[6] E. Dahlhaus, D. Johnson, et al., "The Complexity of Multiway Cuts," ACM Symp. Theory of Computing, pp. 241-251, 1992.
[7] E. P. Simoncelli, E. H. Adelson, and D. J. Heeger, "Probability distributions of optic flow", In CVPR, pages 310-315, 1991.
[8] F. Moreno-Noguer, P. N. Belhumeur and S. K. Nayar, "Active Refocusing of Images and Videos", ACM Trans. on Graphics (also Proc. of ACM SIGGRAPH), Aug. 2007.
[9] Harold Davis, Practical Artistry: Light & Exposure for Digital Photographers, O'Reilly Media, 2008, p. 62, ISBN 9780596529888.
[10] I. J. Cox, S. L. Hingorani, S. B. Rao, and B. M. Maggs, "A maximum likelihood stereo algorithm", CVIU, 63(3):542-567, 1996.
[11] J. Besag, "On the Statistical Analysis of Dirty Pictures," Journal of the Royal Statistical Society, Series B, 48:259-302, 1986.
[12] L. Matthies, R. Szeliski, and T. Kanade, "Kalman filter-based algorithms for estimating depth from image sequences", IJCV, 3:209-236, 1989.
[13] S. Lee, E. Eisemann, and H. Seidel, "Depth-of-Field Rendering with Multiview Synthesis", ACM Trans. Graph., 28(5), Article 134, December 2009, 6 pages, DOI 10.1145/1618452.1618480.
[14] M. J. Hannah, "Computer Matching of Areas in Stereo Images", PhD thesis, Stanford University, 1974.
[15] Matthew Kozak, "Camera Aperture Design", http://www.screamyguy.net/iris/index.htm.
[16] Noah Snavely, Steven M. Seitz, and Richard Szeliski, "Modeling the World from Internet Photo Collections", International Journal of Computer Vision, DOI 10.1007/s11263-007-0107-3.
[17] O. Veksler, "Efficient Graph-based Energy Minimization Methods in Computer Vision", PhD thesis, Cornell University, 1999.
[18] P. Anandan, "A computational framework and an algorithm for the measurement of visual motion", IJCV, 2(3):283-310, 1989.
[19] P. Felzenszwalb and D. Huttenlocher, "Efficient belief propagation for early vision," In CVPR, pp. 261-268, 2004.
[20] P. N. Belhumeur, "A Bayesian approach to binocular stereopsis", IJCV, 19(3):237-260, 1996.
[21] Refocus Imaging, Inc., http://www.refocusimaging.com.
[22] Rajagopalan, A.
N., and Chaudhuri, S., "An MRF model-based approach to simultaneous recovery of depth and restoration from defocused images", IEEE Trans. Pattern Anal. Mach. Intell., 21(7):577-589, 1999.
[23] Jonathan Shade, Steven J. Gortler, Li-wei He, and Richard Szeliski, "Layered depth images", In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 1998), July 19-24, 1998, Orlando, Florida, pp. 231-242, ACM Press, New York, NY.
[24] Soonmin Bae and Fredo Durand, "Defocus Magnification", Computer Graphics Forum, 26(3) (Proc. of Eurographics 2007).
[25] Simon Baker, R. Szeliski and P. Anandan, "A Layered Approach to Stereo Reconstruction", In Proc. CVPR 1998, Santa Barbara, CA, June 1998.
[26] S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[27] M. Subbarao, T. Wei, and G. Surya, "Focused image recovery from two defocused images recorded with different camera settings", IEEE Trans. Image Processing, 4(12):1613-1628, 1995.
[28] S. Roy and I. J. Cox, "A maximum-flow formulation of the N-camera stereo correspondence problem", In ICCV, pages 492-499, 1998.
[29] S. T. Barnard, "Stochastic stereo matching over scale", IJCV, 3(1):17-32, 1989.
[30] T. Kanade, "Development of a video-rate stereo machine", In Image Understanding Workshop, pages 549-557, Monterey, CA, 1994, Morgan Kaufmann Publishers.
[31] Voodoo Camera Tracker, http://www.digilab.uni-hannover.de/docs/manual.html, Copyright (C) 2002-2010 Laboratorium für Informationstechnologie.
[32] V. Kolmogorov and R. Zabih, "Computing visual correspondence with occlusions using graph cuts", In ICCV, volume II, pages 508-515, 2001.
[33] V. Kolmogorov and R. Zabih, "Multi-Camera Scene Reconstruction via Graph Cuts," Proc. Seventh European Conf. Computer Vision, vol. III, pp. 82-96, May 2002.
[34] V. Kolmogorov and R. Zabih, "What Energy Functions Can Be Minimized via Graph Cuts?" IEEE Trans. Pattern Analysis and Machine Intelligence, 26(2):147-159, Feb. 2004.
[35] V. Kolmogorov and R. Zabih, "Graph Cut Algorithms for Binocular Stereo with Occlusions," In N. Paragios, Y. Chen, and O. Faugeras, editors, The Handbook of Mathematical Models in Computer Vision, Springer, 2005.
[36] Y. Boykov and V. Kolmogorov, "An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision," Proc. Int'l Workshop Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 359-374, Sept. 2001.
[37] Y. Boykov and M. Jolly, "Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images," Proc. Eighth IEEE Int'l Conf. Computer Vision, vol. 1, pp. 105-112, 2001.
[38] Y. Boykov, O. Veksler, and R. Zabih, "Markov Random Fields with Efficient Approximations," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 648-655, 1998.
[39] Y. Boykov, O. Veksler, and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11), 2001.
[40] Kuk-Jin Yoon and In-So Kweon, "Adaptive Support-Weight Approach for Correspondence Search", IEEE PAMI, 28(4), April 2006.