Multimed Tools Appl DOI 10.1007/s11042-013-1826-9 Building 3D event logs for video investigation Trung Kien Dang · Marcel Worring · The Duy Bui © Springer Science+Business Media New York 2014 Abstract In scene investigation, creating a video log captured using a handheld camera is more convenient and more complete than taking photos and notes By introducing video analysis and computer vision techniques, it is possible to build a spatio-temporal representation of the investigation Such a representation gives a better overview than a set of photos and makes an investigation more accessible We develop such methods and present an interface for navigating the result The processing includes (i) segmenting a log into events using novel structure and motion features making the log easier to access in the time dimension, and (ii) mapping video frames to a 3D model of the scene so the log can be navigated in space Our results show that, using our proposed features, we can recognize more than 70 percent of all frames correctly, and more importantly find all the events From there we provide a method to semi-interactively map those events to a 3D model of the scene With this we can map more than 80 percent of the events The result is a 3D event log that captures the investigation and supports applications such as revisiting the scene, examining the investigation itself, or hypothesis testing Keywords Scene investigation · Video analysis · Story navigation · 3D model Introduction The increasing availability of cameras and the reduced cost of storage have encouraged people to use image and videos in many aspects of their life Instead of writing a diary, nowadays many people capture their daily activities in video logs When such capturing is continuous this is known as “life logging” This idea goes back to Vannevar Bush’s Memex device [5] and is still topic of active research [2, 8, 33] Similarly, professional activities T K Dang ( ) · M Worring University of Amsterdam, Amsterdam, The Netherlands e-mail: dtkien123@gmail.com T K Dang · T D Bui University of Engineering and Technology, Vietnam National University Hanoi, Hanoi, Vietnam Multimed Tools Appl can be recorded with video to create logs For example, in home safety assessment, an investigator can walk around, examine a house and record speech notes when finding a construction issue Another interesting professional application is crime scene investigation Instead of looking for evidences, purposely taking photos and writing notes, investigators wear a head-mounted camera and focus on finding the evidence, while everything is automatically recorded in a log These professional applications all share a similar setup, namely a first person view video log recorded in a typically static scene In this paper, we focus on this group of professional logging applications, which we denote by scene investigation Our proposed scene investigation framework includes three phases: capturing, processing, and reviewing In the capturing phase, an investigator records the scene and all objects of interest it contains using various media including photos, videos, and speech The capturing is a complex process in which the investigator performs several actions to record different aspects of the scene In particular, the investigator records the overall scene to get an overview, walks around to search for objects of interest and then examines those objects in detail Together these actions form the events of the capturing process In the processing phase, the system analyzes 
all data to get information about the scene, the objects, as well as the events Later, in the reviewing phase, an investigator uses the collected information to perform various tasks: assessing the evidence, getting an overview of the case, measuring specific scene characteristics, or evaluating hypotheses In common investigation practice, experts take photos of the scene and objects they find important and add hand-written notes to them This standard way of recording does not provide sufficient basis for processing and reviewing for a number of reasons A collection of photos cannot give a good overview of the scene Thus, it is hard to understand the relation between objects In some cases, investigators use the pictures to create a panorama to get a better overview [4] But since the viewpoint is fixed for each panorama, it does not give a good spatial impression Measuring characteristics of the scene is also not easy with photos if the investigator has not planned for it in advance More complicated tasks, like making a hypothesis on how the suspect moved, are very difficult to perform using a collection of photos due to the lack of sense of space Finally, a collection of photos and notes hardly captures investigation events, which are important for understanding the investigation process itself The scene can also be captured with the aim to create 3D models 3D models make discussion easier, hypothesis assessment more accurate, and court presentation much clearer [11, 14] When 3D models are combined with video logs it could enhance scene investigation further Using a video log to capture the investigation is straightforward in the capturing phase Instead of taking photos and notes, investigators film the scene with a camera All moves and observations of investigators are thus recorded in video logs However, in order to take the benefit, it is crucial to have a method to extract information and events from the logs for reviewing (Fig 1) For example, a 3D model helps visualize the spatial relation between events; while details of certain parts of the scene can be checked by reviewing events captured in these parts Together, a 3D model of the scene and log events form what we call a 3D event log of the case Such a log enables event-based navigation in 3D, and other applications such as knowledge mining of expert moves, or finding correlation among cases The question is how to get the required information and how to combine them 3D models of a scene can be built in various ways [24, 28, 30, 35] In this work, we use 3D models reconstructed using a semi-interactive method that builds the model from Multimed Tools Appl shelf sofa c d shelf TTVV heater table sofa b dy bo table a moving path camera pose a b c d Fig An investigation log is a series of investigation events (like a taking overview, b searching, c getting details, or d examining) within a scene When reviewing the investigation, it is helpful if an analysis can point out which events happened and where in the scene they took place panoramas [6] These models are constructed prior to the work presented in this paper Here we focus on analyzing investigation log to find events and connecting them to the 3D model Multimed Tools Appl Analyzing an investigation log is different from regular video analysis The common target in video analysis is to determine the content of a shot, while in an investigation log analysis we already know the content (the scene) and the purpose (investigation) The investigation events we want to detect arise from 
both content and the intentions of the cameraman For example, when an investigator captures an object of interest she will walk around the object and zoom-in on it While the general context and purpose of the log is known, these events are not easy to recognize as the mapping from intention to scene movement is not well defined, and to some extent depends on the investigator To identify events features we need features that consider both content and camera movements Once we have the events identified in the logs we need to map the events to the 3D model of the scene When the data would be high quality imagery, accurate matching would be possible [19, 31] Video frames of investigation logs, however, are not optimal for matching as they suffer from intensity noise and motion blur This hinders the performance of familiar automatic matching methods Different mapping approaches capable of dealing with lower quality data need to be considered In the next section we review the related work Then two main components of our system are presented in subsequent sections: (i) analyzing an investigation log to segment it into events, (ii) and their mapping to a 3D model for reviewing (Fig 2) In order to segment an investigation log into events, we introduce in Section our novel features to classify frames into classes of events This turns a log into a story of the investigation, making it more accessible in the time dimension Section presents our semi-interactive approach to map events to a reconstructed 3D model Together with the log segmentation step, this builds a 3D event log containing investigation events and their spatial and temporal relations Fig Overview of the framework to build a 3D event log of an investigations Video log Automatic video log segmentation Investigation events Semiinteractive matching 3D event log 3D reconstructed model Multimed Tools Appl Section evaluates the results of the proposed solution at the two analysis steps: segmenting a log into events, and mapping the events to a 3D model Finally, in Section we present our interface which allows for navigating the 3D events Related work 2.1 Video analysis and segmentation Video analysis often starts with segmenting a video into units for easier management and processing The commonly used unit is a shot, a series of frames captured by a camera in an uninterrupted period of time Shot boundaries can be detected quite reliable, e.g based on motion [23] After shot segmentation, various low-level features and machine learning techniques are used to get more high-level information, such as whether a specific concept is present [32] Instead of content descriptors, we want to get information on the movements and actions of the investigator Thus attention and intention are two important aspects Attention analysis tries to capture the passive reaction of the viewer To that end the attention model defines what elements in the video are most likely to get the viewer’s attention Many works are based on the visual saliency of regions in frames They are then used as criteria to select key frames [1, 20, 22] When analyzing video logs intention analysis tries to find the motivation of the cameraman while capturing the scene This information leads to one more browsing dimension [21], or another way of summarization [1] Tasks such as summarization are very difficult to tackle as a general problem Indeed existing systems have been built to handle data in specific domains, such as news [3] and sports [27], In all examples mentioned, we see that domain 
specific summarization methods perform better than generic schemes New domains need new demands Social networking video sites, like YouTube, urge for research on analysis of user generated videos (UGV) [1] as well as life logs [2], a significant sub-class of UGVs Indeed, we have seen research in both hardware [2, 7, 10] and algorithms [8, 9] to meet that need Research on summarizing investigation logs is very limited Domains dictate requirements and limit the techniques applicable for analysis For example, life logs, and in general many UGVs, are one-shot This means that the familiar unit of video analysis (shots) is no longer suitable, and new units as well as new segmentation methods must be developed [1] The quality of those videos is also lower than professionally produced videos Unstable motion and varying types of scenes violate the common assumptions on the motion model A more difficult issue is that those videos are less structured, making it harder to analyze contextual information In this work we consider professional logs of scene investigation that shares many challenges with general UGVs, but also has its own domain specific characteristics 2.2 Video navigation The simplest video navigation scheme, as seen on every DVD, divides a video into tracks and presents them with a representative frame and description If we apply multimedia analysis and know more about the purpose of navigation, there are many alternative ways to navigate a video For example, in [16] the track and representative frame scheme is enhanced using an interactive mosaic as a customized interface The method to create the video mosaic takes into account various features including color distribution, existence of human faces, and time, to select and pack key frames into a mosaic template Apart from the familiar time Multimed Tools Appl dimension, we can also navigate in space Tour into video [15] shows the ability of spatial navigation in video by decomposing an object into different depth layers allowing users to watch the video from new perspectives The navigation can be object-based or frame based In [12], object tracking enables an object-based video navigation scheme For example, users can navigate video by dragging an object from one frame to a new location The system then automatically navigates to the frame in which the object location is closest to that expectation The common thing in the novel navigation schemes described above is that they depend on video analysis to get the information required for navigation That is also the way we approach the problem We first analyze video logs and then use the result as basis for navigation in a 3D model Analyzing investigation logs In this section we first discuss investigation events and their characteristics Based on that we motivate our solution for segmenting investigation logs, which is described subsequently 3.1 Investigation events Watching logs produced by professionals (policemen), we identify four types of events: search, overview, detail, and examination In a search segment investigators look around the scene for interesting objects An overview segment is taken with the intention to capture spatial relations between objects and to position oneself in the room In a detail segment the investigator is interested in a specific object e.g an important trace, and moves closer or zooms in to capture it Finally, in examination segments, investigators carefully look at every side of an important object The different situations lead to four different types of 
segments in an investigation log As a basis for video navigation, our aim is to automatically segment an investigation log into these four classes of events There are several clues for segmentation, namely structure and motion, visual content, and voice Voice is an accurate clue, however, as users usually add voice notes at a few important points only it does not cover all the frames of the video Since in investigation, the objects of interest vary greatly and are unpredictable, both in type and appearance, the visual content approach is infeasible So understanding the movement of the cameraman is the most reliable clue The class of events can be predicted by studying the trajectory of cameramen movement and his relative position to the objects In computer vision terms this represents the structure of the scene and the motion of the camera We observe that the four types of events have different structure and motion patterns For example, an overview has moderate pan and tilt camera motion and the camera is far from the objects Table summarizes the different types of events and characteristics of their motion patterns Though the description in Table looks simple, performing the log segmentation is not Some terms, such as “go around objects”, are at a conceptual level These are not considered in standard camera motion analysis, which usually classifies video motion into pan, tilt, and zoom Also it is not just camera motion For example, the term “close” implies that we also need features representing the structure (depth) of the scene Thus, to segment investigation logs, we need features containing both camera motion and structure information Multimed Tools Appl Table Types of investigation events, and characteristics of their motion patterns Investigation event Characteristics Search – Unstable, mixed motion Overview – Moderate pan/tilt; far from objects Detail – Zooming like motion; close to objects Examination – Go around objects; close to objects 3.2 Segmentation using structure-motion features As discussed, in order to find investigation events, we need features capturing patterns of motion and structure We propose such features below These features, employed in a threestep framework (Fig 3), help to segment a log into investigation events despite of varying contents 3.2.1 Extracting structure and motion features From the definition of the investigation event classes, it follows that to segment a log into these classes, we need both structure and motion information While it is possible to estimate the structure and motion even from an uncalibrated sequence [24], that approach is not robust and not efficient enough for the freely captured investigation videos To come to a solution, we look at geometric models capturing structure and motion We need models striking a balance between simple models which are robust to nosie and detailed models capturing the full geometry and motion but easily affected by noise In our case of investigation, the scene can be assumed static The most general model capturing structure and motion in this case is the fundamental matrix [13] In practice, many applications, taking advantage of domain knowledge, use more specific models Table shows those models from the most general to the most specific In studio shots where cameras are mounted on tripods, i.e no translation presents, the structure and motion are well captured by the homography model [13] If the motion between two consecutive frame is small, it can be approximated by the affine model This fact is well 
exploited in video analysis [26]. When the only information required is whether a shot is a pan, tilt, or zoom, the three-parameter model is enough [17, 23]. In that way, the structure of the scene, i.e. the variety in 3D depth, is ignored.

We base our method on the models in Table 2. We first find the correspondences between frames to derive the motion and structure information. How well those correspondences fit the above models tells us something about the structure of the scene as well as the camera motion. Such a measurement is called an information criterion (IC) [34]. An IC measures the likelihood of a model being the correct one, taking into account both fitting errors and model complexity. The lower the IC value, the more likely the model is correct. Vice versa, the higher the value, the more likely the structure and motion possess properties not captured by the model. A series of IC values computed on the four models in Table 2 characterizes the scene and the camera motion within the scene. Based on those IC values we can build features capturing structure and motion information in video. Figure 4 summarizes the proposed features and their meaning, derived from Table 2.

Fig 3 Substeps to automatically segment an investigation log into events: video log -> extracting SM features -> classifying frames -> merging labeled frames -> investigation events

Fig 4 Proposed structure and motion criteria for video analysis, and the motion model assumptions they test: cP indicates depth variety and translation in the camera motion (HP: flat scene, or no translation); cA, same as cP plus the scene's distance (HA: far flat scene); cS, same as cA plus scene-camera alignment (HS: far flat scene, image plane parallel to the scene plane); cR, same as cS plus camera rotation around the principal ray (HR: same as HS, no rotation around the principal ray)

Table 2 Models commonly used in video analysis, their degrees of freedom, and the structure and motion conditions under which they hold (the degrees of freedom are required to compute the proposed features)

Model (d.o.f.) | Structure and motion assumption
Homography HP (8) | Flat scene, or no translation in motion
Affine model HA (6) | Far flat scene
Similarity model HS (4) | Far flat scene, image plane parallel to scene plane
Three-parameter model HR (3) | Same as HS, no rotation around the principal ray

The IC we use here is the Geometric Robust Information Criterion (GRIC) [34]. GRIC, as reflected in its name, is robust against outliers. It has been successfully used in 3D reconstruction from images [25]. The main purpose of GRIC is to find the least complex model capable of describing the data. To introduce GRIC, let us first define some parameters. Let d denote the dimension of the model; r the input dimension; k the model's degrees of freedom; and E = [e_1, e_2, ..., e_n] the set of residuals resulting from fitting n corresponding data points to the model. The GRIC is formulated as:

g(d, r, k, E) = \sum_{e_i \in E} \min\left( \frac{e_i^2}{\sigma^2}, \lambda_1 (r - d) \right) + (\lambda_2 n d + \lambda_3 k)    (1)

where \sigma is the standard deviation of the residuals. The left term of (1), derived from the fitting residuals, is the model fitting error; the min function thresholds outliers. The right term, consisting of model parameters, is the model complexity. \lambda_1, \lambda_2, and \lambda_3 are parameters steering the influence of the fitting error and the model complexity on the criterion; their suggested values are 2, log(r), and log(rn) respectively [34]. In our case, we consider a two-dimensional problem, i.e. d = 2, and the dimension of the input data is r = 4 (two 2D points).
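To make (1) concrete, the following is a minimal sketch in Python/NumPy under the parameter choices suggested in [34]; the function name and argument layout are our own illustration, not part of the original method description.

```python
import numpy as np

def gric(residuals, d, r, k, sigma):
    """GRIC of Eq. (1) for residuals e_i of fitting n correspondences to a model
    of dimension d, with input dimension r and k degrees of freedom."""
    e = np.asarray(residuals, dtype=float)
    n = e.size
    lam1, lam2, lam3 = 2.0, np.log(r), np.log(r * n)     # suggested values [34]
    rho = np.minimum(e**2 / sigma**2, lam1 * (r - d))    # robust (thresholded) fitting error
    return rho.sum() + lam2 * n * d + lam3 * k           # fitting error + model complexity
```

With d = 2 and r = 4, calling this function with k = 8, 6, 4, or 3 yields the criteria for the models of Table 2, and substituting these values gives the simplified form (2) below.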
The degrees of freedom k for the models are given in Table 2, and n is the number of correspondences. The GRIC equation, making the dependence on the models of Table 2 explicit, simplifies to:

g_H(k, E) = \sum_{e_i \in E} \min\left( \frac{e_i^2}{\sigma^2}, 4 \right) + (2n \log 4 + k \log(4n))    (2)

In order to make the criteria comparable over frames, the number of correspondences n must be the same. To enforce this condition, we compute motion fields by computing correspondences on a fixed sampling grid. As mentioned, GRIC is robust against outliers, so the outliers that often exist in motion fields should not be a problem. For a pair of consecutive frames, we compute GRIC for each of the four models listed in Table 2 and Fig 4. For example, cP = g_H(8, E), where E is the set of residuals of fitting the correspondences to the HP model.

Our features include estimates of the three 2D frame motion parameters of the HR model, and the four GRIC values of the four motion models (Fig 4). The frame motion parameters (namely the dilation factor o, the horizontal movement x, and the vertical movement y) have been used in video analysis before [17, 23], e.g. to recognize detail segments [17]. We consider them the baseline features. Our proposed measurements (cP, cA, cS, and cR) add 3D structure and motion information to those baseline features. To make the features robust against noisy measurements and to capture the trend in the structure and the motion, we use the mean and variance of the criteria/parameters over a window of frames. This yields a 14-element feature vector for each frame:

F = (\bar{o}, \bar{x}, \bar{y}, \bar{c}_P, \bar{c}_A, \bar{c}_S, \bar{c}_R, \tilde{o}, \tilde{x}, \tilde{y}, \tilde{c}_P, \tilde{c}_A, \tilde{c}_S, \tilde{c}_R)    (3)

where the bar denotes the mean and the tilde the variance of a value over the feature window wf. In summary, computing the features consists of the following steps: (i) compute optical flow throughout the video; (ii) sample the optical flow on a fixed grid to get correspondences; (iii) fit the correspondences to the models HP, HA, HS, HR to get the residuals and the three parameters o, x, y of HR; (iv) compute cP, cA, cS, cR using (2); and (v) compute the feature vectors as defined in (3).
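The sketch below illustrates these steps with OpenCV and NumPy, reusing the gric function sketched above. It is one possible instantiation under our own assumptions: a fixed grid of Lucas-Kanade tracks, OpenCV's findHomography, estimateAffine2D, and estimateAffinePartial2D as the HP, HA, and HS fits, a small least-squares fit for HR, and the residual spread of the most general model as the noise scale; the paper does not prescribe these implementation details.

```python
import cv2
import numpy as np

def grid_correspondences(prev_gray, next_gray, step=16):
    """Track a fixed sampling grid with Lucas-Kanade optical flow so that every
    frame pair yields the same number n of correspondences."""
    h, w = prev_gray.shape
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    p0 = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32).reshape(-1, 1, 2)
    p1, _status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
    return p0, p1  # GRIC is robust to outliers, so mistracked points are tolerated

def fit_hr(src, dst):
    """Least-squares fit of the three-parameter model HR: x' = o*x + tx, y' = o*y + ty."""
    s, d = src.reshape(-1, 2), dst.reshape(-1, 2)
    A = np.zeros((2 * len(s), 3))
    A[0::2, 0], A[0::2, 1] = s[:, 0], 1.0
    A[1::2, 0], A[1::2, 2] = s[:, 1], 1.0
    (o, tx, ty), *_ = np.linalg.lstsq(A, d.ravel(), rcond=None)
    pred = np.stack([o * s[:, 0] + tx, o * s[:, 1] + ty], axis=1)
    return (o, tx, ty), np.linalg.norm(pred - d, axis=1)

def residuals(src, dst, warped):
    return np.linalg.norm(warped.reshape(-1, 2) - dst.reshape(-1, 2), axis=1)

def frame_pair_measurements(prev_gray, next_gray):
    """Raw per-frame-pair measurements: o, x, y of HR and the four GRIC values."""
    p0, p1 = grid_correspondences(prev_gray, next_gray)
    HP, _ = cv2.findHomography(p0, p1, cv2.RANSAC)        # 8 d.o.f.
    HA, _ = cv2.estimateAffine2D(p0, p1)                  # 6 d.o.f.
    HS, _ = cv2.estimateAffinePartial2D(p0, p1)           # 4 d.o.f.
    (o, x, y), eR = fit_hr(p0, p1)                        # 3 d.o.f.
    eP = residuals(p0, p1, cv2.perspectiveTransform(p0, HP))
    eA = residuals(p0, p1, cv2.transform(p0, HA))
    eS = residuals(p0, p1, cv2.transform(p0, HS))
    sigma = eP.std() + 1e-9                               # assumed common noise scale
    return np.array([o, x, y,
                     gric(eP, 2, 4, 8, sigma), gric(eA, 2, 4, 6, sigma),
                     gric(eS, 2, 4, 4, sigma), gric(eR, 2, 4, 3, sigma)])

def feature_vector(window_measurements):
    """Eq. (3): concatenate mean and variance of the measurements over the window wf."""
    m = np.asarray(window_measurements)
    return np.concatenate([m.mean(axis=0), m.var(axis=0)])
```

Per frame, the seven raw measurements are aggregated over the feature window wf into the 14-element vector of (3), which is then fed to the classifier.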
3.2.2 Classifying frames

Based on the new motion and structure features, we aim to classify the frames into four classes corresponding to the four types of events listed in Table 1. The search class acts as "the others" class, containing frames that do not have a clear intention. Since logs are captured with handheld or head-mounted cameras, the motion in the logs is unstable. Consequently, the input features are noisy; it is hard even for humans to classify every frame correctly. While the detail class is quite recognizable from its zooming motion, it is hard to distinguish the search and examination classes. Therefore, we expect that the boundary between classes is not well defined by traditional motion features. While the proposed features are expected to distinguish the classes, we do not know which features are best for recognizing them, so feature selection is needed. Two popular choices with implicit feature selection are support vector machines and random forest classifiers. We have carried out experiments with both, and the random forest classifier gave better results. Hence, in Section 5, we only present results obtained with the random forest classifier.

3.2.3 Merging labeled frames

As mentioned, the captured motion is unstable and the input data for classification is noisy. We thus expect many frames of other classes to be misclassified as search frames. To improve the result of the labeling step, we first apply a voting technique over a window of frames of length wv, the voting window. Within the voting window, each type of label is counted, and the center frame is relabeled with the label that received the highest vote count. Finally, we group consecutive frames having the same label into events.

4 Mapping investigation events to a 3D model

The video is now represented in the time dimension as a series of events. In this section, we present the method to enhance the comprehensiveness of the investigation in the space dimension by connecting events to a 3D model of the scene. In this way we enable interaction with a log in 3D.

For each type of event, we need one or more representative frames. These frames give the user a hint which part of the scene is covered by an event, as well as a rough indication of the camera motion. Overview and detail events are represented by the middle frame of the event. For this we make the assumption that the middle frame of an overview or a detail event is close to the average pose of all the frames in the event. Another way to create a representative frame for an overview event would be a large virtual frame, a panorama, that approximately covers the space captured by that overview. It is, however, costly and not always feasible, and thus is not implemented in this work. For search and examination events one frame is not sufficient to represent the motion; these are represented by three frames: the first, the middle, and the last.

To visualize the events in space, we have to match those representative frames to the 3D model. Matching frames is a non-trivial operation. Logs are captured at varying locations in the scene and with different poses, and video frames are not as clear as high-resolution images. Also, the number of images calibrated to the 3D model is limited. Thus we expect that some representative frames may be poorly matched or cannot be matched at all. To overcome those problems, we propose a semi-interactive solution containing two steps (Fig 5): (i) automatically map as many representative frames as possible to the 3D model, and then (ii) let users interactively adjust the predicted camera poses of the remaining representative frames.

Fig 5 Mapping events to a 3D model includes an automatic mapping that paves the way for interactive mapping: the investigation events and the 3D reconstructed model enter the automatic mapping, followed by interactive mapping, yielding the 3D event log

4.1 Automatic mapping of events

Since our 3D model is built using an image-based method [6], the frame-to-model mapping is formulated as image-to-image matching. Note that color laser scanners also use images, calibrated to the scanning points, to capture color information, so our solution is also applicable to laser scanning based systems. Let I denote the set of images from which the model is built, or more generally a set which is calibrated to the 3D model. Matching a representative frame i to one of the images in I enables us to recover the camera pose. To do the matching, we use the well-known SIFT detector and descriptor [19]. First, SIFT keypoints and descriptors are computed for representative frame i and every image in I. Keypoints of frame i are initially matched to keypoints of every image in I based on comparing descriptors [19] only. Correctly matched keypoints are then found by robustly estimating the geometric constraints between the two images [13]. When doing so, there might be more than one image in I matched to frame i. Since one matched image is enough to recover the camera pose, we take the one with the most correctly matched keypoints, which potentially gives the most reliable camera pose estimation. In our work, the numbers of images, panoramas, and representative frames are reasonably small, so we use an exhaustive method to match them. If the number of images is large, it is recommended to use more sophisticated methods such as Best Bin First [18] or Bag-of-Words [29].
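As an illustration, the snippet below sketches this matching step with OpenCV (cv2.SIFT_create is available in recent OpenCV builds). Lowe's ratio test and a RANSAC fundamental-matrix check are one possible instantiation of the descriptor comparison and the robust geometric verification; the function name and thresholds are our own assumptions.

```python
import cv2
import numpy as np

def match_frame_to_calibrated_images(frame_gray, calibrated_grays, ratio=0.75):
    """Match a representative frame against every calibrated image and return
    (best image index, inlier count); returns (None, 0) when nothing matches."""
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    kp_f, des_f = sift.detectAndCompute(frame_gray, None)
    best_idx, best_inliers = None, 0
    for idx, img in enumerate(calibrated_grays):
        kp_i, des_i = sift.detectAndCompute(img, None)
        if des_f is None or des_i is None:
            continue
        # descriptor-only matching with Lowe's ratio test
        pairs = matcher.knnMatch(des_f, des_i, k=2)
        good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        if len(good) < 8:
            continue
        pts_f = np.float32([kp_f[m.queryIdx].pt for m in good])
        pts_i = np.float32([kp_i[m.trainIdx].pt for m in good])
        # robust geometric verification of the putative matches
        _F, mask = cv2.findFundamentalMat(pts_f, pts_i, cv2.FM_RANSAC, 3.0, 0.99)
        inliers = int(mask.sum()) if mask is not None else 0
        if inliers > best_inliers:
            best_idx, best_inliers = idx, inliers
    return best_idx, best_inliers
```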
When a match is established, we can recover 3D information. To estimate the camera pose of each matched frame we use the 5-point algorithm [30]. The geometric constraints computed in the estimation process also indicate which frames cannot be mapped automatically. Those frames are mapped using the interactive method presented below.

4.2 Interactive mapping of events

To overcome the missing matches between some events and the 3D model, we employ user interaction. The simplest way is to ask the user to navigate the 3D model to viewpoints close to the representative frames of those events. However, this is ineffective as the starting viewpoints could be far from the appropriate viewpoints. To overcome this problem, we make use of the observation that the log is continuously captured and events are usually short. As a consequence, camera poses of events which are close in time are also close in space. We exploit this closeness to reduce the time users need to navigate to find those viewpoints (Fig 6). For each representative frame that is not mapped to the 3D model, we search backward and forward to find the closest mapped representative frames (both automatically mapped and previously interactively mapped, in terms of frames). We use the camera poses of these closest representative frames to initialize the camera pose of the unmapped representative frame.

Fig 6 Manually mapping an event to the 3D model can be hard (a). Fortunately, automatic mapping can provide an initial guess (b) that gives more visual similarity to the frames of the event. From there, users can quickly adjust the camera pose to a satisfactory position (c)

There are six parameters defining a camera pose: three defining the coordinates in space, and three defining the camera orientation/rotation. We interpolate each of them from the parameters of the two closest known camera poses:

p_u = \frac{p_i t_j + p_j t_i}{t_i + t_j}    (4)

where p_u is a parameter of the unknown camera pose; p_i and p_j are the same parameters of the two closest known camera poses; and t_i and t_j are the frame distances to the frames of those known camera poses. Applying this initialization, as illustrated in Fig 6, we effectively utilize automatically mapped and previously interactively mapped results to reduce the interaction time needed to register an unmapped frame.

Having the camera poses of the frames, we visualize each of them as a camera frustum. A camera frustum is a pyramid whose apex is drawn at the camera viewpoint. The image plane is visualized by the pyramid's base, which has the same size as the image up to a scale s. The distance from the apex to the pyramid's base is equal to the focal length of the camera up to the same scale s. The scale s can be adjusted for proper visualization. The extension of the planes formed by connecting the apex to the pyramid's base shows the field of view of the frame. This field of view helps to compute the points covered by a representative frame, or vice versa to find which events cover a certain point in the 3D model. The camera frustum concept is illustrated in Fig 7.

Fig 7 A camera frustum is a pyramid shape representing the camera pose and field of view of a representative frame. The pyramid's base and the distance from the pyramid's apex to its base are proportional to the image size (width w, height h) and the focal length (f), with the same scale (s)
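A minimal sketch of this initialization and of the frustum geometry is given below. The 6-vector pose layout (three coordinates plus three orientation angles), the rotation matrix R, and the function names are our own assumptions; the paper only specifies that each pose parameter is interpolated as in (4) and that the frustum base and depth are scaled image size and focal length.

```python
import numpy as np

def interpolate_pose(pose_i, pose_j, t_i, t_j):
    """Eq. (4): initialize an unmapped frame's camera pose from the two closest
    mapped frames, weighting each parameter by the frame distances t_i and t_j.
    A pose is assumed to be a 6-vector (x, y, z, and three orientation angles)."""
    p_i, p_j = np.asarray(pose_i, float), np.asarray(pose_j, float)
    return (p_i * t_j + p_j * t_i) / (t_i + t_j)

def frustum_corners(center, R, w, h, f, s=1.0):
    """Corners of the frustum base of Fig. 7: a pyramid with its apex at the
    viewpoint and a base of size (s*w, s*h) at distance s*f along the view axis.
    R is assumed to be a 3x3 rotation whose columns are the camera axes."""
    base_center = np.asarray(center, float) + R @ np.array([0.0, 0.0, s * f])
    right, up = R @ np.array([1.0, 0.0, 0.0]), R @ np.array([0.0, 1.0, 0.0])
    half_w, half_h = 0.5 * s * w, 0.5 * s * h
    return [base_center + sx * half_w * right + sy * half_h * up
            for sx in (-1, 1) for sy in (-1, 1)]
```

For example, interpolate_pose(p_a, p_b, 12, 36) weights the pose that is 12 frames away three times as strongly as the one that is 36 frames away, reflecting that poses close in time are close in space.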
After interactive mapping we have a one-to-one correspondence between the video log, with its temporal events, and the 3D spatial positions in the model. That is to say, we have arrived at our 3D event log.

5 Evaluation

In this section, we detail the implementation and give an evaluation of the log analysis method of Section 3 and the method to connect these logs to a 3D model presented in Section 4.

5.1 Dataset

Getting access to real experts in realistic settings is difficult. We were fortunate to have real experts operating in a realistic training setting. Working with real experts, we had to limit ourselves in the number of different indoor environments, though. However, in our opinion, the complexity of the scene and the number of cameramen compensate for that limitation. In order to obtain clear ground truth for training, we capture a set of videos of separate types of events, i.e. each video purely contains frames of one type of event. The setup for the training data is a typical office scene, captured using a handheld camera. In total, more than 15 thousand frames were captured for training. For testing, we have captured logs in the same office scene. Furthermore, we had policemen and others capture logs in fake crime scenes. Those logs amount to about one hour of video in total. The ground truth is obtained by manually segmenting those logs into segments corresponding to the four types of events defined in Table 1.

5.2 Analyzing investigation logs

5.2.1 Criteria

We evaluate the log analysis at the two stages of the algorithm, namely classifying frames and segmenting logs. For the former, we look at the frame classification result. For the latter, which is more important as it is the purpose of the analysis, we use three criteria to evaluate the quality of the resulting investigation story: the completeness, the purity, and the continuity.

To define those criteria, we first define what we mean by a correct event. Let S = {s_1, s_2, ..., s_k} denote a segmentation. Each event s_i has a range g_i (a tuple composed of the start and end frames of the event) and a class l_i. Now let two segmentations Ŝ and S̄ be given. To check whether a segment ŝ_i is correct with respect to the reference segmentation S̄, the first condition is that there exists an event s̄_j in S̄ that sufficiently overlaps ŝ_i and has the same class label. Formally:

\alpha(\hat{s}_i, \bar{S}) = \begin{cases} 1 & \text{if } \exists \bar{s}_j : \frac{|\hat{g}_i \cap \bar{g}_j|}{\min(|\hat{g}_i|, |\bar{g}_j|)} > z \wedge \hat{l}_i \equiv \bar{l}_j \\ 0 & \text{otherwise} \end{cases}    (5)

where |.| is the number of frames in a range, and z indicates how much the two events must overlap. Here we use z = 0.75, i.e. the overlap is at least 75 % of the shorter event. Now suppose that Ŝ is the result of automatic segmentation and S̄ is the reference segmentation. The completeness of the story C, showing whether all events are found, is defined as the ratio of segments of S̄ correctly identified in Ŝ:

C = \frac{\sum_{i=1}^{|\bar{S}|} \alpha(\bar{s}_i, \hat{S})}{|\bar{S}|}    (6)

The purity of the story P, reflecting whether the identified events are correct, is defined as the ratio of segments of Ŝ correctly identified in S̄:

P = \frac{\sum_{i=1}^{|\hat{S}|} \alpha(\hat{s}_i, \bar{S})}{|\hat{S}|}    (7)

where |S| is the total number of segments in a segmentation S. The last criterion is the continuity of the story U, reflecting how well events are recovered without being broken into several events or wrongly merged. It is defined as the ratio of the number of events in the result and in the ground truth:

U = \frac{|\hat{S}|}{|\bar{S}|}    (8)

If U is greater than 1.0, the number of events in the result is greater than the real number of events, implying that there are false alarm events. When U is less than 1.0, the number of events found is less than the actual number of events, implying that we miss some events. An important restriction on U is that we do not want a high value, as the number of events should remain manageable for reviewing logs. A perfect result has all criteria equal to 1.0.
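The sketch below implements (5)-(8) under our own representation of an event as a (start frame, end frame, label) tuple; the names are illustrative and not part of the original formulation.

```python
def overlaps(e1, e2, z=0.75):
    """Eq. (5): events are (start_frame, end_frame, label) tuples. Two events match
    when their labels agree and the overlap exceeds fraction z of the shorter one."""
    (s1, t1, l1), (s2, t2, l2) = e1, e2
    inter = max(0, min(t1, t2) - max(s1, s2) + 1)
    shorter = min(t1 - s1 + 1, t2 - s2 + 1)
    return l1 == l2 and inter > z * shorter

def alpha(event, reference, z=0.75):
    return 1 if any(overlaps(event, ref, z) for ref in reference) else 0

def story_scores(result, reference, z=0.75):
    """Eqs. (6)-(8): completeness C, purity P, and continuity U of a segmentation
    'result' with respect to the ground-truth segmentation 'reference'."""
    C = sum(alpha(ref, result, z) for ref in reference) / len(reference)
    P = sum(alpha(ev, reference, z) for ev in result) / len(result)
    U = len(result) / len(reference)
    return C, P, U
```

Calling story_scores on an automatic segmentation and the manually annotated ground truth returns the triplet (C, P, U) used in the evaluation below.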
5.2.2 Implementation

The motion fields are estimated using OpenCV's implementation of the Lucas-Kanade method (http://opencv.willowgarage.com); of course, other implementations can also be used. Results presented here are produced with feature window wf = 8 and voting window wv = 24 frames, i.e. roughly one second of video (Section 3.2.1). We use the random forest classifier implemented in the Weka package (http://www.cs.waikato.ac.nz/ml/weka).

5.2.3 Results

The accuracy of classifying frames, defined as the number of correctly classified frames over the total number of frames, using the baseline features and using the proposed features with and without voting, is given in Table 3. When using only 2D frame motion features, the accuracy of the frame classification is 0.60. Our proposed structure and motion features improve the accuracy to 0.71. Looking into the confusion matrices (Table 4a, b) we see that the recall of most classes is increased. The largest improvement is for the recall of the search class, increasing from 0.639 to 0.868. The recall of the examination class is increased considerably, from 0.074 to 0.195. Most of the incorrect results are frames misidentified as search frames. As mentioned, this is an expected problem as video logs captured from handheld or head-mounted cameras are unstable. The recall of the detail class undesirably decreased by about 30 percent. This seems to be due to an overfitting problem: this class, as discussed in Section 3.2.1, is best recognized based on the three parameters of the HR model, especially the dilation factor o. Adding the structure and motion features brings the problem into a higher-dimensional space, causing overfitting. A solution to consider in the future would be to use multiple classifiers and to select a different set of features for each class. After we apply voting (Table 4c), the overall accuracy is further improved to 0.755. The recall of the examination class is decreased. However, our ultimate goal is to recognize events, not frames, and as shown later the final result with voting is improved overall.

We evaluate the log segmentation with the overlap threshold z set to 0.75. The results are given in Table 5, including results with and without post-processing by the voting window with wv = 24. Without post-processing, the completeness is C = 1.0. As it is important in reviewing an investigation to find all events, this is a very good result. The purity of the story is reasonable, P = 0.65. However, the number of events is extremely high compared to the ground truth (U = 17.41). This is undesirable as it would take much time to review the investigation. Fortunately, after applying the voting technique, the number of identified events is much lower and acceptable (U = 2.16), while the completeness remains perfect. The purity P is decreased to 0.58. This is practically acceptable as users can correct the false alarm events during reviewing.
Table 6 gives a detailed evaluation for each class before and after voting. After applying voting, P is slightly decreased for all classes, while U is greatly reduced toward 1.0, the perfect value. The results presented above merge the data we captured ourselves in a lab room and the data captured by others in the fake crime scenes. This is because we found no significant difference between them (the average accuracy is only about one percent better for the data we captured ourselves). This shows that the method is stable.

Table 3 Accuracy of the frame classification

Baseline | 0.596
Proposed features | 0.710
Proposed features & voting | 0.755

Table 4 Classification results (confusion matrices): (a) using only 2D motion parameters as features (baseline), (b) using the proposed structure and motion features, and (c) using the proposed features and voting. Increased and decreased recall (compared to the baseline) are in bold and italic respectively. The recall is improved in most of the classes, especially the hard examination class

(a) | Search | Overview | Detail | Examination
Search | 0.639 | 0.136 | 0.174 | 0.051
Overview | 0.476 | 0.325 | 0.131 | 0.068
Detail | 0.229 | 0.053 | 0.660 | 0.058
Examination | 0.535 | 0.077 | 0.314 | 0.074

(b) | Search | Overview | Detail | Examination
Search | 0.868 | 0.046 | 0.046 | 0.041
Overview | 0.502 | 0.497 | 0.000 | 0.001
Detail | 0.537 | 0.016 | 0.328 | 0.119
Examination | 0.723 | 0.029 | 0.053 | 0.195

(c) | Search | Overview | Detail | Examination
Search | 0.931 | 0.028 | 0.026 | 0.015
Overview | 0.515 | 0.485 | 0.000 | 0.000
Detail | 0.649 | 0.012 | 0.280 | 0.059
Examination | 0.834 | 0.000 | 0.022 | 0.144

5.3 Mapping events to a 3D model

In terms of representative frames, the percentage of frames mapped to the 3D model is about 70 percent, of which about 20 percent is mapped automatically. The percentage of unmatched frames due to lack of frame-to-frame overlap, i.e. no visual clue to map at all, is about 30 percent. This results in 81.9 percent of the events being mapped to the 3D model (Table 7). Table 7 also provides more insight into the mappability of each type of event. All the overview events are matched, either automatically or interactively. The examination events are the hardest: none of them is matched automatically. This is due to the fact that those events are usually captured at close distance, while the panorama images are captured with a wide view. We have tried to extract SIFT keypoints on downscaled representative frames to match the scale. Unfortunately, this did not help, probably because the details are too blurred in the panoramas. A possible solution is to work with panoramas at higher resolution or with multiple focuses.

Table 5 Log segmentation result with and without applying voting

| C | P | U
Before voting | 1.00 | 0.65 | 17.41
After voting | 1.00 | 0.58 | 2.16

Table 6 Log segmentation evaluation per class

(a) Before voting | C | P | U
Search | 1.00 | 0.68 | 16.57
Overview | 1.00 | 0.66 | 139.50
Detail | 1.00 | 0.71 | 7.41
Examination | 1.00 | 0.54 | 54.33

(b) After voting | C | P | U
Search | 1.00 | 0.58 | 2.13
Overview | 1.00 | 0.61 | 19.00
Detail | 1.00 | 0.68 | 1.15
Examination | 1.00 | 0.33 | 4.00

In conclusion, more than 80 percent of the events can be mapped to the 3D model, of which about 25 percent is done automatically. This provides a sufficient connection to represent a log in a 3D model, giving us a spatio-temporal representation of the investigation for review.

6 Navigating investigation logs

We describe here our navigation system for investigation logs. As discussed, the system aims to enable users to re-visit and re-investigate scenes. The user interface, shown in Fig 8, includes the main window showing a 3D model of the scene and a storyboard at the bottom
showing events in chronological order Those two components present the investigation in space and time Users navigate an investigation via interaction with the two components When the user selects one event, camera frustums are displayed in the model to hint the area in the scene covered by that segment Vice versa, when the user clicks at a point in the model, log segments covering that point are highlighted and camera frustums of those events are displayed Those interactions visualize the relation between the scene and the log, i.e the spatial and the temporal elements of the investigation To take a closer look at that segment, users click on the camera frustum to transform it into a camera viewpoint and watch the video segment in an attached window Those interactions are demonstrated in the accompanying video Table Percentage of events automatic/interactive mapped, and cannot be mapped Map 81.9 Miss 18.1 Automatic 20.9 Interactive 61.0 Search 9.6 29.4 6.8 Overview 1.7 1.1 0.0 Detail 9.6 28.2 10.2 Examination 0.0 2.3 1.1 Multimed Tools Appl Fig The log navigation allows users to dive into the scene a, check event location of events related to a location b, watch a video segment of an event and compare it to the scene c Multimed Tools Appl Conclusion and future work We propose to use a combination of video logs and 3D models, coined 3D event logs, to provide a new way to scene investigation The 3D events logs provide a comprehensive representation of an investigation process in time and space, helping users to easily get an overview of the process and understand its details To build such event logs we have to overcome two problems: (i) decomposing a log into investigation events, and (ii) mapping those events into a 3D model By using novel features capable of describing scene structure and camera motion and machine learning techniques, we can classify frames into event classes at more than 70 percent accuracy This helps recovering the investigation story completely, with a fairly good purity To map events to a 3D model, we use a semi-interactive approach that combines automatic computer vision techniques with user interaction More than 80 percent of the events in our experimental logs were mapped into a 3D model of the scene, providing a presentation that supports reviewing well Improvement to the results could be obtained by developing features capturing more structure and motion information by considering more than two frames An example could be computing GRIC of three-frame constraints [25] Also feature selection and multiple classifiers would be helpful to increase the accuracy of classification Structure and motion features could also be useful in semantic scene analysis Hence, we are also looking for other domains where those features are effective Acknowledgments We thank Jurrien Bijhold and the Netherlands Forensic Institute for providing the data and bringing in domain knowledge, and the police investigators for participating in the experiment This work is supported by the Research Grant from Vietnam’s National Foundation for Science and Technology Development (NAFOSTED), No 102.02-2011.13 References Abdollahian G, Taskiran CM, Pizlo Z, Delp EJ (2010) Camera motion-based analysis of user generated video IEEE Trans Multimed 12(1):28–41 Aizawa K (2005) Digitizing personal experiences: capture and retrieval of life log In: MMM ’05: Proceedings of the 11th international multimedia modelling conference, pp 10–15 Albiol A, Torrest L, Delpt EJ (2003) The indexing of persons in news 
sequences using audio-visual data In: IEEE international conference on acoustic, speech, and signal processing Bijhold J, Ruifrok A, Jessen M, Geradts Z, Ehrhardt S, Alberink I (2007) Forensic audio and visual evidence 2004–2007: a review 15th INTERPOL forensic science symposium Bush V (1945) As we may think The atlantic Dang TK, Worring M, Bui TD (2011) A semi-interactive panorama based 3D reconstruction framework for indoor scenes Comp Vision Image Underst 115:1516–1524 Dickie C, Vertegaal R, Fono D, Sohn C, Chen D, Cheng D, Shell JS, Aoudeh O (2004) Augmenting and sharing memory with eyeblog In: CARPE’04: Proceedings of the the 1st ACM workshop on continuous archival and retrieval of personal experiences, pp 105–109 Doherty AR, Smeaton AF (2008) Automatically segmenting lifelog data into events In: WIAMIS ’08: Proceedings of the 2008 9th international workshop on image analysis for multimedia interactive services, pp 20–23 Doherty AR, Smeaton AF, Lee K, Ellis DPW (2007) Multimodal segmentation of lifelog data In: Proceedings of RIAO 2007 Pittsburgh 10 Gemmell J, Williams L, Wood K, Lueder R, Bell G (2004) Passive capture and ensuing issues for a personal lifetime store In: CARPE’04: Proceedings of the the 1st ACM workshop on continuous archival and retrieval of personal experiences, pp 48–55 11 Gibson S, Hubbold RJ, Cook J, Howard TLJ (2003) Interactive reconstruction of virtual environments from video sequences Comput Graph 27(2):293–301 Multimed Tools Appl 12 Goldman DB, Gonterman C, Curless B, Salesin D, Seitz SM (2008) Video object annotation, navigation, and composition In: UIST ’08: Proceedings of the 21st annual ACM symposium on user interface software and technology, pp 3–12 13 Hartley R, Zisserman A (2004) Multiple view geometry in computer vision, 2nd edn Cambridge University Press 14 Howard TLJ, Murta AD, Gibson S (2000) Virtual environments for scene of crime reconstruction and analysis In: SPIE – visual data exploration and analysis VII, vol 3960, pp 1–8 15 Kang HW, Shin SY (2002) Tour into the video: image-based navigation scheme for video sequences of dynamic scenes In: VRST ’02: Proceedings of the ACM symposium on virtual reality software and technology, pp 73–80 16 Kim K, Essa I, Abowd GD (2006) Interactive mosaic generation for video navigation In: MULTIMEDIA ’06: Proceedings of the 14th annual ACM international conference on multimedia, pp 655–658 17 Lan DJ, Ma YF, Zhang HJ (2003) A novel motion-based representation for video mining In: International conference on multimedia and expo, vol 3, pp 469–472 18 Lowe DG (1999) Object recognition from local scale-invariant features In: International conference on computer vision, vol 2, pp 1150–1157 19 Lowe DG (2004) Distinctive image features from scale-invariant keypoints Int J Comput Vis 60(2):91– 110 20 Ma YF, Lu L, Zhang HJ, Li M (2003) A user attention model for video summarization In: ACM multimedia, pp 533–542 21 Mei T, Hua XS, Zhou HQ, Li S (2007) Modeling and mining of users’ capture intention for home video IEEE Trans Multimed 9(1) 22 Meur OL, Thoreau D, Callet PL, Barba D (2005) A spatial-temporal model of the selective human visual attention In: International conference on image processing, vol 3, pp 1188–1191 23 Ngo CW, Pong TC, Zhang H (2002) Motion-based video representation for scene change detection Int J Comput Vis 50(2):127–142 24 Pollefeys M, Van Gool L, Vergauwen M, Verbiest F, Cornelis K, Tops J, Koch R (2004) Visual modeling with a hand-held camera Int J Comput Vis 59:207–232 25 Pollefeys M, Verbiest 
F, Van Gool L (2002) Surviving dominant planes in uncalibrated structure and motion recovery In: European conference on computer vision, pp 837–851 26 Robinson D, Milanfar P (2003) Fast local and global projection-based methods for affine motion estimation J Math Imaging Vis 8(1):35–54 27 Rui Y, Gupta A, Acero A (2000) Automatically extracting highlights for TV baseball program In: ACM multimedia, pp 105–115 28 Sinha SN, Steedly D, Szeliski R, Agrawala M, Pollefeys M (2008) Interactive 3D architectural modeling from unordered photo collections ACM Trans Graph 27(5):159 29 Sivic J, Zisserman A (2009) Efficient visual search of videos cast as text retrieval IEEE Trans Pattern Anal Mach Intell 31(4):591–606 30 Snavely N, Seitz SM, Szeliski R (2006) Photo tourism: exploring photo collections in 3D ACM Trans Graph 25(3):835–846 31 Snavely N, Seitz SM, Szeliski R (2008) Modeling the world from internet photo collections Int J Comput Vis 80(2):189–210 32 Snoek CGM, Worring M (2009) Concept-based video retrieval Found Trends Inf Retr 4(2):215–322 33 Tancharoen D, Yamasaki T, Aizawa K (2005) Practical experience recording and indexing of life log video In: CARPE '05: Proceedings of the 2nd ACM workshop on continuous archival and retrieval of personal experiences, pp 61–66 34 Torr P, Fitzgibbon AW, Zisserman A (1999) The problem of degeneracy in structure and motion recovery from uncalibrated image sequences Int J Comput Vis 32(1) 35 van den Hengel A, Dick A, Thormählen T, Ward B, Torr PHS (2007) VideoTrace: rapid interactive scene modelling from video ACM Trans Graph 26(3):86

Trung Kien Dang got his M.Sc. in Telematics from the University of Twente, The Netherlands, in 2003 and received his Ph.D. in Computer Science from the University of Amsterdam, The Netherlands, in 2013. His research includes 3D model reconstruction and video log analysis.

Marcel Worring received the M.Sc. degree (honors) in computer science from the VU Amsterdam, The Netherlands, in 1988 and the Ph.D. degree in computer science from the University of Amsterdam in 1993. He is currently an Associate Professor in the Informatics Institute of the University of Amsterdam. His research focus is multimedia analytics, the integration of multimedia analysis, multimedia mining, information visualization, and multimedia interaction into a coherent framework yielding more than its constituent components. He has published over 150 scientific papers covering a broad range of topics from low-level image and video analysis up to multimedia analytics. Dr. Worring was co-chair of the 2007 ACM International Conference on Image and Video Retrieval in Amsterdam, co-initiator and organizer of the VideOlympics, and program chair for both ICMR 2013 and ACM Multimedia 2013. He was an Associate Editor of the IEEE Transactions on Multimedia and the Pattern Analysis and Applications journal.

The Duy Bui got his B.Sc. in Computer Science from the University of Wollongong, Australia, in 2000 and his Ph.D. in Computer Science from the University of Twente, The Netherlands, in 2004. He is now Associate Professor at the Human Machine Interaction Laboratory, Vietnam National University, Hanoi. His research includes computer graphics, image processing and artificial intelligence with a focus on smart interacting systems.