Computer Vision and Image Understanding 115 (2011) 1516–1524
Journal homepage: www.elsevier.com/locate/cviu

A semi-interactive panorama based 3D reconstruction framework for indoor scenes

Trung Kien Dang a,⇑, Marcel Worring a, The Duy Bui b

a Intelligent Systems Lab Amsterdam, Informatics Institute, University of Amsterdam, The Netherlands
b Human Machine Interaction Laboratory, University of Engineering and Technology, Vietnam National University, Hanoi, Viet Nam

Article info
Article history: Received 20 May 2010; Accepted 13 July 2011; Available online 27 July 2011
Keywords: 3D reconstruction; Panorama; Interactive

Abstract

We present a semi-interactive method for 3D reconstruction specialized for indoor scenes, which combines computer vision techniques with efficient interaction. As the starting point we use panoramas, which are popular for visualization of indoor scenes because of their great field of view, but which clearly cannot show depth. Exploiting user-defined knowledge, in terms of a rough sketch of orthogonality and parallelism in scenes, we design smart interaction techniques to semi-automatically reconstruct a scene from coarse to fine level. The framework is flexible and efficient. Users can build a coarse walls-and-floor textured model in five mouse clicks, or a detailed model showing all furniture in a couple of minutes of interaction. We show results of reconstruction on four different scenes. The accuracy of the reconstructed models is quite high, around 1% error at full room scale. Thus, our framework is a good choice for applications requiring accuracy as well as applications requiring a 3D impression of the scene.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

A realistic 3D model of a scene and the objects it contains is ideal for applications such as giving an impression of a room in a house for sale, reconstruction of bullet trajectories in crime scene
investigation, or building realistic settings for virtual training [1]. It gives good spatial perception and enables functionalities such as measurement, manipulation, and annotation.

One broad categorization of scenes is outdoor versus indoor. Outdoor scenes have been popular in many modeling applications [2,3], especially creating models of urban scenes [4,5]. Indoor scenes are prevalent in applications like real estate management, home decoration, or crime scene investigation (CSI), but research on them is limited, with some notable exceptions [6–8]. In this paper we consider the 3D reconstruction of indoor scenes.

While in applications like real estate management a coarse model of a room is sufficient, other applications need more complete models. For instance, in CSI the model should be complete and show all the details in the crime scene, as any object is potentially evidence. Each application also requires a different level of accuracy. Home decoration, for example, does not need extreme accuracy, for its purpose is merely to give an impression of the scene. For the CSI application, the model should be as accurate as possible in order to make measurements and hypothesis validation reliable. Here we are seeking a framework that can create complete and accurate models in highly demanding applications such as CSI, as well as coarse models for less demanding applications.

⇑ Corresponding author. Address: 42A – 144/4, Quan Nhan, Thanh Xuan, Hanoi, Viet Nam. Fax: +84 437547460.
E-mail address: dtkien123@gmail.com (T.K. Dang).
1077-3142/$ - see front matter © 2011 Elsevier Inc. All rights reserved.
doi:10.1016/j.cviu.2011.07.001

3D models are often built manually from measurements and images using the background map technique. Modelers take images of the object from orthogonal views (top, side and front), and try to create a model matching those images. A measurement is required to scale the model of the object to the right size. Modeling from measurements and images is only suitable
for simple scenes, as complex scenes with many objects require a lot of measurements, images, and interaction. Even with measurements, accurately modeling objects is difficult since the assumption that the line of view is orthogonal to the object is hard to meet in practice.

Since manual reconstruction is cumbersome and time consuming [9], automatic or semi-interactive reconstruction is preferred. Automatic methods exist and have shown good results for isolated objects and outdoor scenes [10,3,11–13]. Those methods require a camera moving around and looking towards the scene to capture it from multiple viewpoints [14–17]. Such moves maintain a large difference between viewpoints, giving accurately estimated 3D coordinates [18]. Unfortunately, in practice people tend not to follow such moves, making these methods inaccurate and unreliable. Indeed, in the well-known PhotoSynth system it has been observed that quality suffers when users do not follow the appropriate moves [12]. In simple cases, when modeling single-object scenes, automatic methods give results of 2–5% relative error [19]. This is sufficient for visualization, but the accuracy is rather low for measurements such as in CSI applications. In indoor scenes where the space is limited, the situation is even worse, as it is difficult, if not impossible, to perform the capturing moves suitable for automatic reconstruction. So, automatic reconstruction methods in their current state are not sufficient for accurate indoor scene reconstruction.

Semi-interactive methods are potential solutions [20–22]. A small amount of interaction, helping computers to identify important features, makes reconstruction more reliable. A few mouse clicks are enough to build a coarse model [8]. Recent work, such as the VideoTrace system [13], shows that interaction can be made smart and efficient by exploiting automatically estimated geometric information. While interaction helps to
efficiently improve the reliability, there is still the problem of having limited space to move around in indoor scenes. Using panoramas is a potential solution. Panoramas give a broad field of view, so a few panoramas are enough to completely capture a scene, and moving around the scene is no longer a problem. Furthermore, building panoramas is reliable, thus using panoramas contributes to the reliability of the overall solution. The advantages of interaction on the one hand and panoramas on the other suggest that a combination of them would be a good solution for indoor scene reconstruction.

Following the above observations, we propose a multi-stage, semi-interactive, panorama based framework for indoor scenes. In the first stage, a coarse model is built. This stage extends the technique in [8]. We make the interaction more efficient by providing a smart interaction technique, and rectify panoramas to guarantee that the accuracy meets our target quality. Furthermore, we give a reconstructability analysis and, based on that, present a capture assistant to guide the placement of the camera. Results of the first stage, a coarse model and geometric constraints, facilitate efficient interaction to build a detailed model in the second stage. This framework overcomes the problems mentioned and makes it easier to create accurate and complete models.

In the next section we summarize related work. Section 3 gives an overview of our framework. Section 4 describes how to turn panoramas into a floor-plan and how to build a coarse 3D model. Section 5 describes the interaction to add details to the coarse model. Then we evaluate the accuracy and show how efficient the framework is. We close the paper with a discussion on how to further automate the framework.

2. Related work

2.1. Reconstruction from panoramas

A panorama is a wide-angle image, typically generated by stitching images from the same viewpoint [23]. Since panoramas cover a wide view, they must be mapped on a cylinder or sphere to view.
Accordingly, they are called cylindric or spherical panoramas. Being wide-angle, panoramas give a good overview of a scene, especially in indoor scenes where the field of view is limited. On the other hand, they do not give a good spatial perception since the viewpoint is fixed at one point. There is work on creating panoramas using multiple viewpoints, called multi-perspective panoramas [7,24,25]. However, multi-perspective panoramas only yield a 3D impression from the original viewpoints. Other methods are needed to make real 3D models.

3D reconstruction from panoramas is found in [6,7,22]. In [6], a scene is modeled from geometric primitives, which are manually selected in panoramas of the scene. Reconstruction is done separately for each panorama, and then the results of different panoramas are merged together. In [7] a dense 3D point cloud is estimated from multi-perspective panoramas. It, however, requires a special rig for capturing the panoramas. In [22] a method for reconstruction from a cylindric panorama is proposed. It assumes that the scene, e.g. a room, is composed of a set of connected rectangles. This method requires that all corners of the room are visible, which is not often the case in practice. In [8], a method to reconstruct an indoor scene from normal single-perspective panoramas is described. The result is a coarse 3D model including walls onto which panoramas are projected. Such a model is not sufficient for some applications such as CSI, but this simple and flexible method gives good intermediate results towards building a detailed model.

2.2. Interaction in reconstruction

There are many types of interaction in reconstruction. In the simplest case users define geometric primitives, such as points, lines, or pyramids, and match these to the image data [20]. In [21], quadric surfaces are used to support more complex objects. VideoTrace [13] lets users draw and correct vertices of a model in an image sequence.

The efficiency of interaction can be improved by exploiting
what is already known about the scene. The guiding principle is to get as many geometric constraints as possible, and use them to assist interaction. These constraints can come from domain knowledge, from the user interacting with the model, or from automatic estimation by the system. We now briefly describe each of them.

Domain knowledge, in the form of prior knowledge about the type of scenes to be reconstructed, is helpful in designing efficient interaction. For example, when modeling man-made scenes we can assume that there are many parallel lines. Thus, vanishing points are helpful in constraining the interaction [6,26,27]. In urban scenes there are often repeated components such as windows. Hence, instead of modeling them separately, the user can copy them [21]. In a man-made scene, objects are stacked on each other, e.g. a table is on the floor and books are on the table. We can exploit these relations to reduce the interaction and improve accuracy [9].

Scene-specific geometric constraints can be provided by users. In [9], users define how an object should be bound to another one, to reduce the degrees of freedom in the interaction to reconstruct that object. In [8], after roughly defining a room by a sketch, users can build a coarse model with a few mouse clicks.

Some geometric constraints can be reliably estimated by computers. In some cases, coarse 3D structure and camera motion information can be estimated. State-of-the-art interactive reconstruction systems, including [13,12], take advantage of such information sources to create intuitive and efficient interaction. For example, in the VideoTrace [13] system, vertices drawn in one frame by the user are tracked and rendered in other frames by the system. Users browse forward or backward in the video sequence to correct those vertices until satisfied. For the user it is like refining a model rather than creating it from scratch.

In practice, those three sources of constraints are often mixed in the modeling flow, which is also what we will do in this
paper.

3. Framework overview

Our framework is an A-to-Z solution, from capturing an indoor scene to modeling it, which is summarized in Fig. 2.

The framework takes as input a sketch of the floor-plan, a top-down design drawing of a room (e.g. Fig. 1a) that describes its walls and their relative positions, drawn by the user. The capture planning module analyzes the sketch to tell the user how many panoramas are needed to completely capture the scene, and suggests camera placement, i.e. the appropriate viewpoints. Either calibrated or uncalibrated cameras can be used, but to guarantee good accuracy we advise pre-calibrating the camera and correcting the lens distortion of the images before stitching them into panoramas. Users can use a software package of their own choice to estimate the camera motion and stitch corrected images together into panoramas, for example using Hugin (Fig. 1b).

Fig. 1. Illustration of input and (intermediate) results of the reconstruction process: (a) a rectangular room; (b) (unwrapped) panorama of the room; (c) the walls-and-floor model; (d) adding more detail to the model. A simple rectangular room is used as example.

To build a coarse model of a room, the user picks the corners, i.e. the intersections of walls, in the panoramas. The framework provides a smart corner picking method to make the interaction comfortable. The locations of the corners in the panoramas and the sketch are enough to estimate the correct floor-plan and build a coarse model of the scene [8] (Fig. 1c). More expressively, we call this coarse model, which includes textured walls and floor, a walls-and-floor model. A typical rectangular room needs only one panorama to build such a model, whereas irregular rooms may need more than one panorama, depending on the shape of the room and the viewpoints of the panoramas. This stage is discussed in detail in Section 4.

In order to add more detail efficiently, we exploit the geometric constraint resulting from the
observation that indoor scenes contain many flat objects aligned to walls. We iteratively use known surfaces to guide an interaction type that we call perspective extrusion to add objects. This technique helps to quickly build a detailed model (Fig. 1d). Details of this stage are given in Section 5.

4. Building a walls-and-floor model

In this section we discuss methods for building a walls-and-floor model. For easier comprehension, we present the floor-plan estimation and other elements prior to the capture planning. For the moment, we assume that the given set of panoramas is sufficient for floor-plan estimation.

We let the user draw a sketch of the floor-plan indicating orthogonality and parallelism of walls, and use a method built upon the method in [8] to estimate an accurate floor-plan. This method is based on the observation that the horizontal dimension of the panoramic image is proportional to the horizontal view angle of the panorama. Thus a set of corners divides the panorama into horizontal view angles of known ratio. If we ensure that every panorama looks all around the room, the total horizontal view angle is 360 degrees, without any measurement. Hence we know each horizontal view angle. This observation is valid when the corners are perfectly aligned to the vertical dimension. Thus, to make a more accurate floor-plan estimation than in [8], we rectify the panoramas to meet that condition first.

Hugin: http://hugin.sourceforge.net

Building 360-degree panoramas is well studied [23], thus we do not discuss it here. For the next step, indicating corners in panoramas, we provide smart corner picking. Rectifying panoramas and estimating the floor-plan are subsequently discussed below. Then we present the reconstructability analysis and the capture assistant.

4.1. Smart corner picking

In order to estimate the floor-plan, coordinates of the top-down projections of corners are needed. As panoramas may not be well aligned, getting one point on a corner is not enough. Instead we need to
identify a corner by a line segment. One way to do that is to ask a user to manually draw a line onto a panorama. To make it even simpler, we provide a utility that lets users just casually pick a point in a panorama, after which the system automatically identifies the corner line.

Since the straightness of lines is not preserved in the coordinate system of a panorama, here a cylindric one, we must project the user-picked point into one of the images from which the panorama is created, in order to work in the image coordinate system. We assume that the best image is the one whose image plane is most orthogonal to the projection ray of the picked point. In other words, it is the image for which the angle between the ray from the viewpoint to the image center and the projection ray $r_c$ of the picked point is smallest:

$$\hat{\imath} = \arg\min_i \angle\big(r(i),\, r_c\big) \tag{1}$$

where $r(i)$ is the principal ray of image $i$.

Since panoramas are usually approximately aligned, we limit the detection to a vertical image band around the picked point. We detect vertical edges around that point, and fit a line through the picked point and edge points using RANSAC [28]. The picked point is used here as an anchor to avoid the auto-detected line moving to a wrong location. Since the picked point is not exactly at the right position, we afterwards relax the condition, optimizing the line without constraining it to go through the picked point, to yield the final line. The process is summarized in Table 1 and two examples are given in Fig. 3.

4.2. Rectifying panoramas

To accurately estimate the floor-plan, we first rectify the panoramas so that corners are aligned to the vertical dimension of a cylindrical panorama. Each corner together with the viewpoint defines a plane. These planes remain unchanged no matter how we move the coordinate system, since they are defined by the scene and the viewpoint. To align the panorama cylinder we need to find the rotation R that makes those planes parallel to the
vertical direction. In other words, after transforming by R, the normals of the planes are orthogonal to $w = (0, 0, 1)^T$, i.e.

$$u_i^T R^{-1} w = 0 \tag{2}$$

where $u_i$ are the plane normals. Using this constraint, given at least three corners, we can compute the last column of $R^{-1}$, or equivalently the last row of R, by finding the least-squares solution. If the last row of R is $r_3 = (a, b, c)$, then from the constraint that R is orthogonal we choose its other rows as:

$$r_1 \cong (-b,\; a,\; 0), \qquad r_2 \cong (-ac,\; -bc,\; a^2 + b^2) \tag{3}$$

where $\cong$ means equal up to a scale, and $|r_1| = |r_2| = |r_3| = 1$. Once R has been computed, we resample the panoramic image to finish the rectification.

Fig. 2. Overview of the proposed framework.

Table 1. Smart corner picking process.
1. Let the user pick a point in/near a corner from the panorama.
2. Find the best image, according to Eq. (1).
3. Perform Canny edge detection in a horizontal band of one tenth of the image width around the picked point.
4. Fit a line through the picked point and the edges using RANSAC, where the line must go through the picked point.
5. Optimize the line without constraining it to the picked point.

4.3. Estimating the floor-plan

The locations of corners in panoramas, identified in the previous step, give sets of horizontal angles between the corners when viewed from the panorama viewpoint. If we have a way to represent those angles in terms of the coordinates of the projections of corners and viewpoints in the floor-plan, we have a set of constraints to estimate the floor-plan and the viewpoints. Here we briefly review such a method presented in [8], discuss its applicability, and show how we extend it for our work.

A sketch is a model of the floor-plan. We force users to draw rectilinear lines parallel to the axes by providing them with a drawing grid. Of course, this alignment can be done automatically, but drawing in such a way helps users to correctly define parallelism and orthogonality. Note that, as only parallelism and orthogonality are important in the parameterization, a sketch of a rectangular room is any
arbitrary rectangle.

Assuming that the room has n corners, we need at most 2n parameters to represent it. A viewpoint, whose coordinates both have to be estimated, is represented by a pair of separate parameters. Suppose that we have v panoramas; then the total number of parameters is 2n + 2v. For each wall drawn in the sketch that is parallel to an axis, since the two corners of the wall share a horizontal or vertical coordinate, the number of parameters is reduced by one (Fig. 4a). Hence the number of parameters is reduced by the number of those walls, m. To further reduce the number of parameters, the origin of the coordinate system is set at one corner, and the length of a wall is set to one, as the reconstruction is up to a scale anyway. These settings reduce the number of parameters by 3. In summary, the number of parameters to be estimated is:

$$2n + 2v - m - 3 \tag{4}$$

From the model of the floor-plan that contains the coordinates of corners and viewpoints, we can estimate the angle between two corners as seen from a viewpoint (Fig. 4b). These angles are equal to the set of angles defined by user-picked corners in the panoramas. This set of constraints can be used to estimate the parameters of the floor-plan model and the viewpoints.

Fig. 3. Two examples of smart corner picking. (a) The user picks a point. (b) Edges are detected in a vertical image band; a line is fitted through the picked point and edges. Note that there is another (even longer) vertical line, but the algorithm smartly takes the edge close to the picked point. (c) The final result.

Given a rectilinear floor-plan, one panorama that sees all corners might be enough to estimate it. A special, yet the most common, case is a rectangular room. Since we see all four corners from any viewpoint, one panorama might be enough to reconstruct the walls-and-floor model. We need more panoramas when the floor-plan is not a rectilinear polygon, and when from the chosen viewpoint we cannot see all corners. Fig. 5 shows examples.
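The counting argument behind the floor-plan estimation can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions, not the authors' implementation: the function names (`num_parameters`, `num_constraints`, `is_solvable`, `horizontal_angles`) are hypothetical. It assumes the unknown count of Eq. (4), where the offset of 3 comes from fixing the origin (two coordinates) and one wall length, and it assumes that a viewpoint seeing c corners contributes c − 1 independent angle constraints because the angles around a viewpoint sum to 360 degrees.

```python
import math


def num_parameters(n_corners, n_viewpoints, n_axis_walls):
    """Unknowns of the floor-plan model as in Eq. (4): 2n + 2v - m - 3."""
    return 2 * n_corners + 2 * n_viewpoints - n_axis_walls - 3


def num_constraints(corners_seen):
    """A viewpoint seeing c corners gives c - 1 independent angles,
    since the angles around the viewpoint sum to 360 degrees."""
    return sum(c - 1 for c in corners_seen)


def is_solvable(n_corners, n_axis_walls, corners_seen):
    """True if there are at least as many constraints as unknowns.
    corners_seen holds one entry per panorama viewpoint."""
    v = len(corners_seen)
    return num_constraints(corners_seen) >= num_parameters(
        n_corners, v, n_axis_walls
    )


def horizontal_angles(viewpoint, corners):
    """Angles (radians) between consecutive corners as seen from a
    viewpoint in the top-down floor-plan. Corners are assumed to be
    listed counter-clockwise around the viewpoint."""
    vx, vy = viewpoint
    bearings = [math.atan2(cy - vy, cx - vx) for cx, cy in corners]
    k = len(bearings)
    return [
        (bearings[(i + 1) % k] - bearings[i]) % (2 * math.pi)
        for i in range(k)
    ]


# A rectangular room (n = 4 corners, m = 4 axis-parallel walls) seen
# from one viewpoint that observes all four corners.
rect_ok = is_solvable(4, 4, [4])
# The same room when the single viewpoint sees only three corners.
rect_blocked = is_solvable(4, 4, [3])
# Predicted angles for a unit square viewed from its center.
angles = horizontal_angles((0.5, 0.5), [(0, 0), (1, 0), (1, 1), (0, 1)])
```

In an estimation loop, the angles predicted by `horizontal_angles` for candidate corner and viewpoint coordinates would be compared with the angles measured from the corner positions picked in the panorama; the choice of optimizer is left open here.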
Fig. 4. Parameterization of the floor-plan model given a sketch, simplified from a figure in [8]. (a) To reduce the number of parameters, corners are represented by shared parameters. (b) Each viewpoint is parameterized separately. Locations of corners in a panorama at the viewpoint give a set of angles between corners as viewed from the viewpoint.

Fig. 5. When the floor-plan is not rectilinear (a), or if from the viewpoint we cannot see all corners (b), we may need more than one panorama to estimate it.

At this point, the coordinates of the top-down projections of the viewpoints are estimated, but the viewpoints' heights are missing. Complete viewpoint coordinates are required to add more details to the model in the later stage. Since we already know the floor and the projection of the viewpoint on the floor, we only need one point to compute the relative distance from the viewpoint to the floor. To get that point, we ask the user to pick any floor point in each panorama to compute its viewpoint height.

4.4. Reconstructability analysis

We now give an analysis of the floor-plan estimation method. To estimate the floor-plan and the viewpoint coordinates, the number of constraints must be greater than or equal to the number of unknowns given in Eq. (4) of the previous sub-section. Suppose that viewpoint i sees $c_i$ corners. Since the sum of the angles is 360 degrees, we have $c_i - 1$ independent constraints. Since the viewpoints are different, constraints of one viewpoint are independent of constraints of other viewpoints. The problem is solvable when the number of constraints is greater than or equal to the number of parameters:

$$\sum_{i=1}^{v} c_i \ge 2n + 3v - m - 3 \tag{5}$$

Common rooms have all walls parallel to an axis, i.e. the floor-plan is a rectilinear polygon, thus m is equal to n. Eq. (5) then simplifies to:

$$\sum_{i=1}^{v} c_i \ge n + 3v - 3 \tag{6}$$

Suppose that we can find a point from which all corners are visible, i.e. $c_i = n$. Eq. (6) is then further simplified to $v \ge 1$. So indeed, given a rectilinear floor-plan, one panorama that sees all corners might be enough to estimate it.

4.5. The capture assistant

The capture assistant helps users in planning viewpoints in the room so that the reconstruction is possible and the model covers all of the room. To that end, it must know the number of unknowns given a sketch, the number of constraints produced by the viewpoints, and the area they cover. Furthermore, it is preferred that the number of viewpoints is minimal.

The number of unknowns is computed easily using Eqs. (5) and (6). In a convex polygon, a line segment from any point within it to any of its vertices does not leave the polygon. Hence if the floor-plan is convex, counting the constraints is trivial since from any viewpoint we see all the corners. When the floor-plan is concave, the problem is nontrivial. Since we keep the sketching simple, only asking users to align rectilinear lines of the sketch parallel to the axes, the sketch may be freely and unevenly stretched along the axes. Our solution is to decompose the sketch into tiles and compute the minimal number of observable corners from each tile, invariant to how the sketch is stretched along the axes. The algorithm is described in Algorithm 1.

Algorithm 1. Decomposing a sketch into invariant observable areas.
Step 1: Cut the sketch into tiles using all distinct x and y coordinates. The sketch is turned into a set of rectangles and triangles (Fig. 6a), each of which is called a tile (Fig. 6b).
Step 2: For each tile, find its invariant observable area (IOA) by the following steps:
– Initialize the area to contain only the tile itself.
– Iteratively add a tile if it, together with some tiles already added, forms a convex polygon containing the initial tile.

Lemma 4.1. If the sketch differs from the real floor-plan by an uneven scaling, the IOAs are invariant to that scaling.

Proof. Since the sketch differs from the real floor-plan by an uneven scaling, the coordinates of corners are transformed by a monotonic function, thus the order between any pair of x or y coordinates is preserved. That means that if $x_a > x_b$ in the floor-plan, or in one sketch, this still holds in another sketch. Consequently, the order of tiles, as decomposed in the algorithm above, is horizontally and vertically unchanged in any sketch. Consequently the IOAs, which are sets of tiles built following step 2 in Algorithm 1, are unchanged. □

Lemma 4.2. Any point in an IOA is observable from any point in the initial tile.

Proof. Any point is observable from another point within a convex polygon. Since the extending scheme only adds a new tile if it is part of a convex polygon with the initial tile, all points in the IOA are observable from any point in the initial tile. □

Fig. 6. Illustration of the sketch decomposition algorithm. (a) The sketch is cut into rectangles and triangles using all distinct x and y coordinates. (b) The tile graph indicates possibilities of traveling among tiles. (c) For each tile the initial observable area is itself (black); then tiles reached by traveling parallel to the axes are iteratively added (gray); finally tiles reached from two ways are added (diagonal pattern). (d) The number of corners contained in the observable area is the minimal number of observable corners from the tile.

Having the IOAs, we check whether the planned viewpoints surely cover all of the room and provide enough constraints to estimate the real floor-plan. The IOA of a viewpoint is the IOA of the tile containing it. By checking the union of the planned viewpoints' IOAs, we can make sure that the set of viewpoints covers all of the scene. Checking whether the floor-plan is solvable is done by summing the number of corners observed by each IOA, and then comparing it to the condition in (5).

Given the IOAs of a sketch, finding an optimal set of viewpoints, i.e. the smallest number of viewpoints that covers the scene completely and satisfies the reconstructability condition (5), is a hard problem. Let us construct a graph representing the problem. Each tile is a node in the graph. For each tile, we have edges connecting it to all tiles in its IOA. Since, if a tile is observable from another one, then from it we can also observe the other tile, the edges are undirected. Putting aside the reconstructability condition, our problem is finding the minimal set of nodes from which edges connect to the rest of the nodes. This is the minimal dominating set problem, one of the known NP-complete problems [29]. With an additional condition, our problem is arguably of the same complexity. To suggest a solution to users in interactive time, we propose the following greedy Algorithm 2.

Algorithm 2. Suggesting viewpoints, the greedy algorithm.
Step 1: Find a dominating set. Initialize an empty dominating set of tiles. While the scene is not covered by the union of the IOAs of tiles in the set, add a tile whose IOA contains the most uncovered tiles.
Step 2: Satisfy the reconstructability condition. While the condition of (5) is not satisfied, add a tile whose IOA contains the most corners, i.e. providing the largest number of constraints.

In practice, since there are objects in the room, we might not be able to put the camera at the suggested positions, or see all the corners we should see according to the analysis. Should an object, e.g. a tall wardrobe, completely block one or more corners, it must be considered as part of the walls. The procedure to suggest viewpoints is the same. If a suggested tile is inappropriate for placing the camera, users can mark it so that Algorithm 2 can ignore that tile when recomputing the suggested viewpoints. This procedure has proven to give good results in practical cases.

Viewpoints also affect the accuracy of the floor-plan and the texture quality. In practice, since the panorama is built from high resolution images, the texture quality should not be a problem. To estimate the floor-plan accurately, intuitively one should place the camera in the center of the room to balance the constraints.

After this stage, we have a textured walls-and-floor model. In this model, objects are projected on the walls and on the floor. It gives a good overview of the scene. As indicated, in applications such as real estate management it should be satisfactory. However, for an application such as CSI, the object localization is not detailed enough. Thus, we need the second stage to add more detail.

5. Adding details using perspective extrusion

The model now contains planes of walls, the floor, and viewpoint locations. We design interactive methods to add detail to the model in the spirit of the whole framework: flexibly reconstructing objects from coarse to fine. For example, a table is reconstructed first and then the stack of books on it.

Characteristics of indoor scenes are utilized in designing interaction methods meeting that idea. In indoor scenes, many objects are composed of planes. Since objects are often aligned to walls, those planes are likely parallel to at least one wall or the floor. As indicated earlier, this gives a constraint to reconstruct objects. This action is similar to an extrusion, a popular standard technique in manual 3D modeling. In a normal extrusion, the orthogonal projection of the object's boundary on a reference plane is orthogonally popped up by a known distance, creating a new planar object surface. In our situation we do not see the object in orthogonal views, but from a panorama viewpoint. So, instead of moving the object's boundary on lines orthogonal to the reference plane, we move it on rays from the viewpoint to their original locations in the reference plane (Fig. 1d). Because of this constraint, we call it a perspective extrusion.

Our aim is to reconstruct an object surface S that has a surface parallel to an already reconstructed plane (Fig. 7). S is reconstructed from a set of three parameters. The reference plane l is a reconstructed plane to which the plane of S is parallel. The distance from S to l is denoted by d, and b is a projection of the boundary of S in a panorama. The reconstruction procedure includes shifting the
parallel plane l by distance d to get the object plane p, and cutting p by the pyramid of b and the viewpoint from which we see b. Once we have S, users can choose whether the object is a solid box or just a planar surface. The perspective extrusion process is summarized in Table 2.

In related work such as [9], object parameters are defined indirectly in terms of geometric objects, e.g. a rectangular box. In pictures of indoor scenes, objects are frequently occluded, making the use of geometric objects difficult. To give more options in reconstructing an object, we choose to let users define those parameters directly and separately. For example, a box is defined by one of its faces and the distance to the plane the face is parallel to. The distance can be defined by a line orthogonal to any reconstructed plane.

The parallel plane l is picked from the current model. We provide two ways to define d, namely using one or two viewpoints.

Table 2. Perspective extrusion process.
1. The user picks the reference plane l.
2. The user defines the distance d from l to the object plane p, either from one or two viewpoints.
3. Compute the object plane p by shifting l by d.
4. The user defines the boundary through its projection b onto a panorama.
5. Compute the initial S by cutting the object plane p by the pyramid of b and the panorama viewpoint.
6. The user chooses the object type, either a solid box or a planar surface.

Fig. 7. A perspective extrusion pops up an object from an already reconstructed plane.

To define d from a single viewpoint, the user draws a line from the object surface orthogonally to a reconstructed plane. To define d from two viewpoints, the user picks the projections of a point on the object surface in two panoramas. We then triangulate these two projections to estimate the 3D coordinates of that point, and its distance to l, which is already reconstructed, is the distance d. This strategy is useful when there is no physical clue for
guiding the drawing of a line from the object’s surface orthogonally to a reconstructed plane. For example, for a chair with bent legs standing in the middle of the room, there would be no physical clue for drawing d from a single viewpoint.

The boundary b is a polygon drawn by users from the viewpoint. To assist the drawing of b, we assume by default that the boundary of S has right angles and is symmetric, as long as the drawing of b does not break this assumption. Using those assumptions, we predict the boundary and render it. This helps to define b accurately, especially when a vertex is occluded.

For flexibility and accuracy, we let users define any parameter (l, d, or b) from any available panorama viewpoint. A possible way to further increase flexibility and accuracy is to let users adjust the boundary b from different viewpoints, as in VideoTrace [13]. However, that is only effective if we have many viewpoints, i.e., many observations of the boundary. To keep the framework simple and the number of input panoramas small, we have decided not to use that technique.

To be reconstructible, objects must be seen and the parameters for perspective extrusion must be definable. The capture assistant described in Section 4.5 handles part of this by ensuring that all of the floor and walls are seen. Of course, objects can be occluded completely by other objects, but that is hardly ever the case for the main objects in the scene. For l and b, if objects are complex or curvy, we can only approximate them (Fig. 11c and d). For a “floating” object, like the chair in Fig. 10a, since there is no solid connection from its surface to another surface, one should use two viewpoints to define d. In general, if an object has a sufficiently different appearance in two panoramas, then it is reconstructible.

6. Results

We now present results showing that the proposed framework overcomes difficulties in indoor scene reconstruction to efficiently produce complete and accurate models.

6.1. Datasets

Four scenes are used in our evaluation (Fig. 8). Three are rooms in a house, captured by ourselves. The last one is a fake crime scene captured by The Netherlands Forensic Institute. The ground truth is defined by measurements made on objects in the scenes. All scenes are typical indoor scenes: rather complex, with limited space. For every scene, the minimal number of panoramas required, as computed using our capture assistant, is one. Because of obstacles (furniture), there was no good position for capturing all corners, so we had to use two panoramas for each of the three rooms. For the fake crime scene, we use one panorama.

Fig. 8. Evaluated scenes, their sketches, and number of panoramas used: (a) Bedroom, (b) Dining room, (c) Kitchen, (d) Fake crime scene.

6.2. Accuracy

Since the reconstructed model is determined only up to a scale and a rotation, we have to eliminate that ambiguity in order to evaluate the accuracy. To do so, we estimate a transformation from the estimated floor-plan to the ground truth floor-plan. We apply this to the model, and then evaluate the model at two levels: at room scale (i.e., floor-plan error) and at object scale (i.e., object measurements).

Table 4 shows floor-plan errors with and without rectifying panoramas. In two out of three datasets the improvement is quite significant. In one dataset, the Bedroom, the error without rectification is almost the same as with rectification, since the angles of the original panoramas are almost perfect. Using uncalibrated images (calibration done during stitching) is possible, though the results are not as good as using pre-calibrated images. The errors, with pre-calibrated images and panorama rectification, are a few centimeters in a room of about ten square meters. The relative errors, computed by dividing the absolute error by the length of the diagonal of the rectangular bounding box of the true floor-plan, are about 1%. The estimated floor-plan of the dining room is less accurate since it was hard to identify some of its corners in the panoramas.

Table 4. Floor-plan relative errors (in percent, mean ± standard deviation). To achieve the best accuracy, lens distortion correction should be applied before panorama stitching, and panorama rectification (Section 4.2) should be used. The floor-plan error of the fake crime scene is not available because ground truth is lacking.

                             Bedroom        Dining room    Kitchen
Without rectification        0.48 ± 1.45    7.50 ± 3.20    9.88 ± 3.24
Uncalibrated images          0.49 ± 0.16    7.48 ± 3.17    0.48 ± 0.23
Calibrated & rectification   0.38 ± 0.14    1.18 ± 0.49    0.28 ± 0.05

Our accuracy is higher than in [8], where the error is about 4%. Two differences are responsible for the improvement: the floor-plan estimation strategy we used, and our panorama rectification. In [8], a sketch of several rooms is used to parameterize and estimate the floor-plan of multiple rooms. It was noted that doing so, and thus ignoring the thickness of walls, might reduce the accuracy [8]. To achieve high accuracy, we have estimated the floor-plan of each room separately. More importantly, our rectification eliminates the inaccurate alignment in the input panoramas (see Table 4).

For objects, since the angles between geometric primitives (lines and planes) are already enforced during the reconstruction, we only evaluate the length errors, absolute and relative to the ground truth lengths. The accuracy of our framework is quite high, e.g., compared to [8,19]. Object accuracy is slightly lower than scene accuracy in terms of relative error, but our examination shows that the absolute errors are about the same.

Table. Average object errors (mean ± standard deviation).

Average object error   Bedroom        Dining room    Kitchen        Fake crime scene
Absolute (cm)          2.4 ± 1.9      1.6 ± 1.2      1.1 ± 1.0      6.2 ± 2.6
Relative (%)           2.38 ± 1.41    1.84 ± 2.00    1.17 ± 1.06    1.84 ± 0.89

6.3. Efficiency and completeness

Our framework is efficient. A scene can be modeled in about a dozen minutes. Fig. 9 shows the model of a rather complex scene, namely the fake crime scene. The walls-and-floor model is built in seconds. All furniture is modeled in a couple of minutes. The time taken to build the final model, which includes small objects such as cups on tables, is 10 min. Furthermore, users do not need to measure objects for modeling at capture time.

Fig. 10 shows models of some scenes built using our framework. Close-ups of objects picked from reconstructed models are given in Fig. 11. Objects composed of planar surfaces are well reconstructed, while complex curvy objects can only be approximated using perspective extrusions.

Fig. 9. Resulting models as a function of the time and amount of interaction spent. The example is the fake crime scene. (a) Walls-and-floor model (mouse clicks); (b) All furniture model, 10 extrusions; (c) Final model; (d) Final textured model, 10 min, 19 extrusions.

Fig. 10. Models reconstructed using the proposed framework: (a) Bedroom, (b) Dining room, (c) Kitchen.

Fig. 11. Models of objects picked from the models in Figs. 9 and 10: (a) Stove, (b) Table, (c) Couch, (d) Fake body. It takes less than a minute to model an object. Objects composed of planar surfaces (the stove and the table) are well reconstructed using our method, while complex objects like a fake body are hard to approximate using perspective extrusions alone.

7. Conclusion

We have proposed a panorama-based semi-interactive 3D reconstruction framework for indoor scenes. The framework overcomes the problems of limited field of view in indoor scenes and has the desired properties: robustness, efficiency, and accuracy. Those properties make it suitable for a broad range of applications, from a coarse model created in a few seconds for a presentation to a detailed model for measurement in crime scene investigation. Models inexpensively created using our framework are an intuitive medium to manage and retrieve digitized information of scenes and use it in interactive applications.

A limitation
of the framework is that it lacks the ability to model complex objects. This could be counteracted by other, more expensive techniques. For example, the VideoTrace technique [13] lets users model objects from video sequences. The ortho-image technique [30] creates background maps from image sequences to assist artists in modeling objects in 3D authoring software. As objects are complex, both techniques require images from many different angles and more interaction. Since our panoramic images are calibrated, we can integrate those techniques into our framework as plugins. Once the object is reconstructed using those techniques, we can automatically integrate it back into our model by matching panoramic images to the image sequence used to model the object and then estimating the pose of the object.

Thus the framework is a useful tool both for quickly building coarse models and for efficiently building accurate models. In the accompanying video, the system is demonstrated on a number of realistic scenes.

Acknowledgments

This work is supported by the BSIK project MultimediaN and the Research Grant from Vietnam National University, Hanoi, No. QG.10.23.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.cviu.2011.07.001.

References

[1] T.L.J. Howard, A.D. Murta, S. Gibson, Virtual environments for scene of crime reconstruction and analysis, in: SPIE – Visual Data Exploration and Analysis VII, vol. 3960, 2000, pp. 1–8.
[2] M. Pollefeys, L.J.V. Gool, M. Vergauwen, K. Cornelis, F. Verbiest, J. Tops, Image-based 3D acquisition of archaeological heritage and applications, in: Virtual Reality, Archeology, and Cultural Heritage, 2001, pp. 255–262.
[3] N. Snavely, S.M. Seitz, R. Szeliski, Modeling the world from internet photo collections, International Journal of Computer Vision 80 (2) (2008) 189–210.
[4] M. Pollefeys, D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.J. Kim, P. Merrell, C. Salmi, S.N. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, H. Towles, Detailed real-time urban 3D reconstruction from video, International Journal of Computer Vision 78 (2–3) (2008) 143–167.
[5] N. Cornelis, B. Leibe, K. Cornelis, L.V. Gool, 3D urban scene modeling integrating recognition and reconstruction, International Journal of Computer Vision 78 (2–3) (2008) 121–141.
[6] H.-Y. Shum, M. Han, R. Szeliski, Interactive construction of 3D models from panoramic mosaics, in: Computer Vision and Pattern Recognition, 1998, pp. 427–433.
[7] Y. Li, H.-Y. Shum, C.-K. Tang, R. Szeliski, Stereo reconstruction from multiperspective panoramas, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (1) (2004) 45–62.
[8] D. Farin, W. Effelsberg, P.H.N. de With, Floor-plan reconstruction from panoramic images, in: ACM Multimedia, 2007, pp. 823–826.
[9] S. Gibson, R.J. Hubbold, J. Cook, T.L.J. Howard, Interactive reconstruction of virtual environments from video sequences, Computers & Graphics 27 (2) (2003) 293–301.
[10] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, R. Koch, Visual modeling with a hand-held camera, International Journal of Computer Vision 59 (2004) 207–232.
[11] M. Chandraker, S. Agarwal, F. Kahl, D. Nister, D. Kriegman, Autocalibration via rank-constrained estimation of the absolute quadric, in: IEEE Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[12] S.N. Sinha, D. Steedly, R. Szeliski, M. Agrawala, M. Pollefeys, Interactive 3D architectural modeling from unordered photo collections, ACM Transactions on Graphics 27 (5) (2008) 159.
[13] A. van den Hengel, A. Dick, T. Thormählen, B. Ward, P.H.S. Torr, VideoTrace: rapid interactive scene modelling from video, ACM Transactions on Graphics 26 (3) (2007) 86.
[14] A. Fitzgibbon, A. Zisserman, Automatic 3D model acquisition and generation of new images from video sequences, in: European Signal Processing Conference, 1998, pp. 1261–1269.
[15] M. Pollefeys, R. Koch, L. Van Gool, Self-calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters, in: IEEE International Conference on Computer Vision, 1998, pp. 90–95.
[16] M. Pollefeys, F. Verbiest, L. Van Gool, Surviving dominant planes in uncalibrated structure and motion recovery, in: European Conference on Computer Vision, 2002, pp. 837–851.
[17] J. Repko, M. Pollefeys, 3D model from extended uncalibrated video sequences: addressing key-frame selection and projective drift, in: International Conference on 3-D Digital Imaging and Modeling, 2005, pp. 150–157.
[18] R.I. Hartley, P. Sturm, Triangulation, Computer Vision and Image Understanding 68 (1998) 146–157.
[19] M. Pollefeys, R. Koch, L. Van Gool, Self-calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters, International Journal of Computer Vision 32 (1999) 7–25.
[20] P.E. Debevec, C.J. Taylor, J. Malik, Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach, in: SIGGRAPH Annual Conference on Computer Graphics and Interactive Techniques, 1996, pp. 11–20.
[21] S. El-Hakim, E. Whiting, L. Gonzo, 3D modeling with reusable and integrated building blocks, in: The 7th Conference on Optical 3-D Measurement Techniques, 2005, pp. 3–5.
[22] R. Haeusler, R. Klette, F. Huang, Monocular 3D reconstruction of objects based on cylindrical panoramas, in: 3rd Pacific Rim Symposium on Advances in Image and Video Technology, 2008, pp. 60–70.
[23] R. Szeliski, Image alignment and stitching: a tutorial, Foundations and Trends in Computer Graphics and Vision 2 (1) (2006).
[24] Z. Zhu, A.R. Hanson, LAMP: 3D layered, adaptive-resolution, and multiperspective panorama – a new scene representation, Computer Vision and Image Understanding 96 (3) (2004) 294–326.
[25] W. Wei, G. Hui, Z. Maojun, X. ZhiHui, Multi-perspective panorama based on the improved pushbroom model, in: Workshop on Digital Media and its Application in Museum & Heritage, 2007, pp. 85–90.
[26] R. Cipolla, D. Robertson, 3D models of architectural scenes from uncalibrated images and vanishing points, in: International Conference on Image Analysis and Processing, 1999, pp. 824–829.
[27] M. Wilczkowiak, P. Sturm, E. Boyer, Using geometric constraints through parallelepipeds for calibration and 3D modeling, Pattern Analysis and Machine Intelligence 27 (2) (2005) 194–207.
[28] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM 24 (1981) 381–395.
[29] B. Korte, J. Vygen, Combinatorial Optimization: Theory and Algorithms, third ed., Algorithms and Combinatorics, Springer, 2005.
[30] T. Thormählen, H.-P. Seidel, 3D-modeling by ortho-image generation from image sequences, in: ACM SIGGRAPH, 2008, pp. 1–5.
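
Illustrative sketch of the perspective extrusion. The geometric core of the perspective extrusion described in the object reconstruction part of the paper (shift the reference plane l by d to obtain the object plane p, then cut p with the pyramid spanned by the viewpoint and the boundary b) can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the authors' implementation: the function name `perspective_extrude`, the representation of l as a unit normal plus offset, and the assumption that the vertices of b have already been converted from panorama coordinates into 3D viewing directions are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def dot(a: Vec, b: Vec) -> float:
    """Euclidean dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def perspective_extrude(
    normal: Vec,                   # unit normal n of reference plane l (assumed form: n . x = offset)
    offset: float,                 # offset of l along n
    d: float,                      # signed distance from l to the object plane p along n
    viewpoint: Vec,                # panorama centre from which b is seen
    boundary_rays: Sequence[Vec],  # viewing directions of the vertices of b (assumed precomputed)
) -> List[Vec]:
    """Return the 3D vertices of S: each ray through b intersected with p."""
    plane_offset = offset + d      # p is l shifted by d: n . x = offset + d
    vertices: List[Vec] = []
    for ray in boundary_rays:
        denom = dot(normal, ray)
        if abs(denom) < 1e-12:
            raise ValueError("viewing ray is parallel to the object plane")
        # Solve n . (viewpoint + t * ray) = plane_offset for t.
        t = (plane_offset - dot(normal, viewpoint)) / denom
        if t <= 0.0:
            raise ValueError("object plane lies behind the viewpoint along this ray")
        vertices.append((
            viewpoint[0] + t * ray[0],
            viewpoint[1] + t * ray[1],
            viewpoint[2] + t * ray[2],
        ))
    return vertices
```

For instance, with the floor as reference plane (normal (0, 0, 1), offset 0), a table top at d = 0.75 and a viewpoint at height 1.5, each boundary ray is scaled until it reaches height 0.75. The boundary is moved along rays from the viewpoint instead of along lines orthogonal to the reference plane, which is what distinguishes this from a normal extrusion.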