Related work
The existing literature on motion segmentation is vast. This hints that the problem is both difficult to solve and that a solution is valuable [1]. Among the researchers who have reviewed methods on motion segmentation are Megret and DeMenthon [3], who categorize techniques up to 2002 in a survey on video segmentation algorithms. Zappella covers research on motion segmentation until 2008 in his master's thesis [4], and reviews trends in a survey [1] from 2009. These sources provide a framework for comparing the assumptions, strengths, and weaknesses of motion segmentation methods. This framework is used in the following to review mostly recent research that is not already covered by these surveys.
General approaches
The assumptions on the scene and the camera motion control the difficulty of the problem. These assumptions differ from application to application, and it is vital to understand the implications. In this thesis, we assume that the motion in the scene is rigid and that the camera might move. Furthermore, we assume that dynamic objects can appear, temporarily stop, and disappear from the scene.
In contrast, in surveillance scenarios, one can often assume a fixed camera. In such a case, the motion segmentation problem reduces to subtracting a largely static background. For example, Kim et al. [5] describe a system that learns a model of the background and then uses this model to detect moving objects.
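To make the fixed-camera setting concrete, the following minimal sketch uses OpenCV's mixture-of-Gaussians background subtractor. This is not the model of Kim et al. [5], but it follows the same pattern of learning a background model and flagging pixels that deviate from it; the video file name is a placeholder.

```python
import cv2

cap = cv2.VideoCapture("surveillance.avi")  # placeholder input video
# MOG2 maintains a per-pixel mixture-of-Gaussians background model.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pixels that do not fit the learned background model become foreground.
    mask = subtractor.apply(frame)
    cv2.imshow("moving objects", mask)
    if cv2.waitKey(30) & 0xFF == 27:  # Esc quits
        break
cap.release()
```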
The idea of building a model of the background can also be found in applications with moving cameras. Izadi et al. [6] present an extension to the KinectFusion [7] pipeline. Here, a dense surface of the background is reconstructed from the input of an RGB-D camera. Once this model is built, dynamic objects can be detected by determining whether newly arriving data fits the model of the background or not. This method is reviewed in more detail later in this chapter. A similar approach to motion segmentation is taken by Keller et al. [8], who use point clouds to model both the background and dynamic objects. In order to subtract the background with moving cameras, Sheikh et al. [9] fit a subspace to trajectories on the background by random sample consensus [10], thereby assuming that the background is the predominant structure in the scene. In summary, there is a group of approaches [5, 6, 7, 8, 9] that first construct a model and then identify moving objects as outliers with respect to that model. Zhou et al. [11] avoid a separate training phase by detecting outliers against a low-rank approximation of the video frames.
Assuming the background to be predominant in the scene, or assuming the objects to appear in a certain order, might not always be viable. There are motion segmentation methods that treat all the objects in the scene equally. Among these are the trajectory grouping methods by Brox and Malik [2], and by Perera and Barnes [12]. Since the ideas for the method presented in this thesis effectively stem from these papers, they are discussed in more detail.
Besides the differences in the assumptions made on the input, there are different requirements on the output of motion segmentation. There seems to be agreement that motion segmentation is about obtaining a pixel-wise annotation of the moving objects in each frame. Trajectory-based methods often result only in a sparse annotation of the moving objects, where the keypoints are labeled with the object they belong to [12, 13, 14, 15, 16]. There are efforts to obtain dense segmentations by increasing the density of the tracked points [2, 17, 18]. If this is not enough, "densification" [3] can be done as a post-processing step [19]. In contrast, statistical approaches can lead directly to a dense segmentation [1, 20].
So far, different assumptions on the input and requirements on the output of motion segmentation have been treated. Both the assumptions and the requirements are determined by the intended application of the motion segmentation algorithm. Once these are clear, it is time to take a closer look at how motion can be segmented in various settings. For this purpose, Megret and DeMenthon [3] provide a way to categorize the existing approaches. Particularly helpful is their division of methods into three bins: (1) methods that focus on segmenting single frames first and care about time later, (2) methods that explore time first and obtain a spatial segmentation later, and (3) methods that jointly estimate the motion in both time and space.
Spatial priority
Focusing on the frame-wise motion first is a natural extension of image segmentation. Given two successive frames in a video, the question becomes how to find regions of coherent motion between these frames. For the robotic application of singulating objects in cluttered scenes [21], this is already sufficient for segmenting objects based on motion cues [22, 23], especially if the robot is allowed to interact with the scene [24, 25, 26]. Commonly, such approaches use or relate to optical flow [27] or scene flow [23, 28, 29], or employ feature matching [26, 30] in order to detect the changes between pairs of images. However, these approaches offer no natural way to maintain a temporally consistent segmentation.
Temporal priority
An alternative is to follow points over time first, and decide how to cluster the points into groups with coherent motion in a later stage. This is the trajectory grouping category of methods [3]. Here, trajectories describe local motion over a long period. The trajectories can be found with optical flow [17, 18, 31, 32] or by matching feature descriptors [33, 34, 35, 36]. The long-term observations seem to provide a natural way to build temporal consistency into the method right from the beginning. The disadvantage is that drift in the tracker can cause these trajectories to cross motion boundaries, meaning that they are susceptible to corruption over time [37]. We have adopted the trajectory grouping approach in this work, building in particular on the ideas of Brox and Malik [2], who use spectral clustering to group trajectories based on pairwise similarities. Prior to their work, many other methods were built on the idea of trajectory grouping. Assuming an affine camera, the trajectories of points belonging to the same rigid motion lie in a linear subspace [38], such that finding these subspaces solves the motion segmentation problem. The local subspace affinity algorithm of Yan and Pollefeys [13] is one such method and is treated in more detail in this chapter. The benchmark by Tron and Vidal [39] compares several of these methods, and Zappella [1, 4, 40] covers them well. These methods are mathematically elegant but have the weakness that they usually require complete trajectories. With both moving objects and a moving camera, this requirement limits the use cases considerably. However, there are efforts to overcome the problem of incomplete trajectories [15, 41].
Joint segmentation
Finally, there is the category of methods that segment motion jointly in space and time. Statistical methods seem to be suitable, as they allow combining the unknowns with prior knowledge about spatial and temporal constraints in a single framework. Cremers and Soatto [20] formulate motion segmentation as a problem of Bayesian inference, which is solved with variational methods. They represent the motion boundaries either with splines or level sets. A recent method by Schoenemann and Cremers [42] segments motion into layers such that a cost for encoding the video is minimized.
Exemplary approaches
Clustering of long-term trajectories
Brox and Malik [2] cluster long-term point trajectories in 2-d videos. Their basic idea is to detect motion between pairs of points over time, arrange the collected evidence in a similarity graph, and then use spectral clustering to partition the similarity graph into clusters of similarly moving points. Their work is relevant for this thesis, as the approach allows for a moving camera, handles missing data, aims for temporally consistent segmentations, and is of reasonable implementation complexity.
The authors do not assume a static camera. However, they assume that the objects move mostly translationally with respect to the camera. The assumption is necessary because motion on the image plane is ambiguous. Follow-up work by Ochs and Brox [43] addresses this issue by considering triplets of points instead of pairs, though at the expense of increased computational costs [44]. This gives rise to the question whether it is possible to extend the approach to 3-d trajectories to get rid of this ambiguity. This thesis differs from their work mostly through the use of 3-d data and a different definition of similarity.
The graph-partitioning approach to clustering handles incomplete trajectories naturally. Non-overlapping trajectories can still be related to each other via transitive links in the similarity graph. Brox and Malik [2] do not report how much missing data can be handled. This is an important question, as it influences the choice of the keypoint tracker. They use the large displacement optical flow tracker on the GPU by Sundaram et al. [17], which has been reported to improve in accuracy and quantity over a GPU-based version of the Kanade-Lucas-Tomasi tracker [45]. Yet, the choice of a more advanced point tracker comes at increased computational cost. Sundaram et al. [17] report a runtime of 3-4 seconds per frame in 2010. For the longest of the videos used in this thesis, extrapolation shows that tracking alone would accumulate to an hour on the same machine at the same image resolution of 640 × 480 pixels. Likely due to better hardware, Ochs et al. [44] report a runtime of less than two seconds in 2013. Despite the advances in hardware, a fast, serial point tracker based on the OpenCV library [46] is used in this work instead because it runs at about half the real-time frame rate on our machine.
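As an illustration, a trajectory tracker of this kind can be assembled from OpenCV's corner detector and pyramidal Lucas-Kanade optical flow, as in the minimal sketch below. It is not the exact tracker used in this work; `first_frame`, `frames`, and all parameter values are assumptions.

```python
import cv2

# `first_frame` and `frames` are placeholders for the video input.
prev_gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=1000,
                              qualityLevel=0.01, minDistance=7)

trajectories = [[tuple(p)] for p in pts.reshape(-1, 2)]
for frame in frames:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: locate each point in the next frame.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None,
                                              winSize=(21, 21), maxLevel=3)
    for traj, p, ok in zip(trajectories, nxt.reshape(-1, 2), status.ravel()):
        if ok:
            traj.append(tuple(p))  # lost points end as incomplete trajectories
    pts, prev_gray = nxt, gray
# A production tracker would additionally drop lost points and
# periodically re-detect new corners to keep the point set dense.
```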
The trajectories provide temporal links between the video frames. Brox and Malik [2] claim that the ability to obtain "temporally consistent segmentations" is one of the main benefits of the approach. With regard to the survey of Megret and DeMenthon [3], their method falls into the category of trajectory grouping approaches that gather information in time first, and segment in space later. Analyzing along the temporal dimension first before taking further steps makes the method offline, as it needs to see all video frames. Brox and Malik [2] motivate their work by arguing that the long-term trajectories reveal more about the moving objects in the scene than the flow between two frames. The same position is taken in this thesis.
The way we define similarity between trajectories differs from the work by Brox and Malik [2]. They concentrate on identifying the instant where the motion between points seen on the image plane is maximal. We experiment with both the maximum overall difference in point distance and, alternatively, the variance as a measure for dissimilarity, formulated in a generic framework where alternative definitions can be plugged in. Which measure works best in which scenarios has, to the best of our knowledge, not been treated in depth yet and remains an open question.
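The following sketch makes the two measures concrete for a pair of 3-d trajectories observed over the same frames. The function names are ours, and the formulation in the method chapter may differ in detail.

```python
import numpy as np

def distances_over_time(traj_a, traj_b):
    """Euclidean distance between two 3-d trajectories in each common frame.

    traj_a, traj_b: arrays of shape (F, 3) with the tracked 3-d points.
    """
    return np.linalg.norm(traj_a - traj_b, axis=1)

def dissimilarity_max(traj_a, traj_b):
    # Largest overall change of the inter-point distance; exactly zero
    # if both points sit on the same rigid body and there is no noise.
    d = distances_over_time(traj_a, traj_b)
    return d.max() - d.min()

def dissimilarity_var(traj_a, traj_b):
    # Sample variance of the inter-point distance; also zero under rigid
    # motion, but less sensitive to a single noisy frame than the maximum.
    d = distances_over_time(traj_a, traj_b)
    return d.var(ddof=1)
```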
Follow-up work by Ochs et al. [19, 44] explores the method further. It includes a densification step [3] for obtaining a pixel-wise motion segmentation rather than the sparse segmentation obtained by labeling the trajectories. The densification step is particularly interesting for possible 3-d object reconstruction after motion segmentation because it identifies the regions in the image and in the depth data that should be included in the 3-d model and its texture. Fragkiadaki et al. [47] treat the discontinuities between clusters explicitly and obtain a dense segmentation from the trajectory clusters via Gabriel graphs [48].
Clustering trajectories with maximal cliques
Perera and Barnes [12] build upon the fact that the distance between any two points on a rigid object never changes. They group trajectories by finding maximal cliques in a thresholded similarity graph constructed from the sample standard deviation of the distances between pairs of points.
The maximal cliques correspond to groups of points that are compatible with a rigid motion hypothesis. The authors solve the maximum coverage problem in order to determine the final segmentation from the (not necessarily disjoint) maximal cliques. Both enumerating all maximal cliques in a graph and solving the maximum coverage problem are NP-hard [49, 50].
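The sketch below illustrates this scheme with the networkx library: maximal cliques in the thresholded graph act as rigid-motion hypotheses, and a greedy heuristic, a common approximation for the NP-hard maximum coverage problem, selects the final groups. The threshold and all names are our assumptions, not the authors' implementation.

```python
import networkx as nx

def rigid_groups(dissimilarity, threshold):
    """Group trajectories that are pairwise compatible with rigid motion.

    dissimilarity: symmetric (N, N) matrix of trajectory statistics,
                   e.g. the sample standard deviation of point distances.
    threshold: largest dissimilarity still accepted as a rigid link.
    """
    n = len(dissimilarity)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    g.add_edges_from((i, j) for i in range(n) for j in range(i + 1, n)
                     if dissimilarity[i][j] < threshold)

    # Each maximal clique is a rigid-motion hypothesis (worst-case exponential).
    cliques = [set(c) for c in nx.find_cliques(g)]

    # Greedy maximum coverage: repeatedly take the clique that covers
    # the most trajectories not yet assigned to a group.
    covered, groups = set(), []
    while cliques:
        best = max(cliques, key=lambda c: len(c - covered))
        gain = best - covered
        if not gain:
            break
        groups.append(gain)  # assign only the newly covered trajectories
        covered |= gain
    return groups
```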
In contrast to many existing motion segmentation algorithms, Perera and Barnes [12] use RGB-D videos as input data. They evaluate their method on synthetic data and real datasets. However, the latter consist only of image sequences with a handful of images and fewer than 150 keypoints. The RGB-D videos targeted in this thesis contain fifty or more frames, and the quantity of keypoints needs to be larger. While theoretically elegant, it is therefore not clear how this method scales to the problem treated in this thesis.
Perera and Barnes [12] also suggest using the sample standard deviation to summarize changes of distances between pairs of 3-d points. Their claim that 3-d data can be used for motion segmentation makes their research directly relevant to this thesis. We generally follow their notation where possible.
There are many parallels between the maximal clique based approach in [12] and the spectral clustering approach in [2]. Both construct a similarity graph. The maximal clique based approach requires an early decision about which edges in the graph link points on the same rigid object. Hence, noise is addressed early in the algorithm, whereas spectral clustering retains the uncertain information almost until the end. The approach taken in this thesis can be seen as a fusion of the ideas in [2] and [12]. The construction of the similarity graph from 3-d trajectories follows [12] in certain aspects, whereas the grouping step is performed via spectral clustering in a similar way as done in [2].
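As a minimal sketch of this grouping step, pairwise dissimilarities can be converted into affinities with a Gaussian kernel and handed to an off-the-shelf spectral clustering implementation. The kernel, its scale, and the assumption of a known number of objects are illustrative simplifications; model selection is treated later in this thesis.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_trajectories(dissimilarity, n_objects, scale=0.1):
    """Partition N trajectories into n_objects groups of coherent motion.

    dissimilarity: symmetric (N, N) matrix with zeros on the diagonal.
    scale: bandwidth of the Gaussian kernel (a tuning parameter).
    """
    # Small dissimilarities (likely the same rigid body) map to
    # affinities close to one, large dissimilarities to almost zero.
    affinity = np.exp(-dissimilarity / scale)
    model = SpectralClustering(n_clusters=n_objects, affinity="precomputed")
    return model.fit_predict(affinity)  # cluster label per trajectory
```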
Clustering trajectories by local subspace affinity
The local subspace affinity (LSA) method by Yan and Pollefeys [13] exploits the fact that trajectories of points on a rigidly moving object lie in an up to four-dimensional linear subspace under affine projection [38]. The trajectories are grouped by estimating the subspace for each trajectory locally. Spectral clustering then partitions a similarity graph based on the principal angles between pairs of estimated subspaces.
LSA belongs to a class of methods that make use of the measurement matrix W ∈ R^{2F×N} (see [15, 38]). This matrix consists of the stacked 2-d coordinates u, v of all N points in all F frames.
This matrix is formed by projections of the 3-d trajectories with an affine camera. The idea now is to find trajectories that lie in the same subspace. To this end, LSA estimates the matrix rank K and projects each of the N columns onto the K-dimensional unit hypersphere with the help of singular value decomposition. The linear subspace for each trajectory is estimated from the local neighborhood on the sphere. Finally, Yan and Pollefeys [13] use principal angles between the estimated linear subspaces for defining similarity between the subspaces. The similarity graph is partitioned via spectral clustering.
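A condensed sketch of these steps is given below. It assumes complete trajectories, a known rank K ≥ 4, and illustrative neighborhood and dimension parameters; in LSA proper, the rank and the subspace dimensions are estimated from the data.

```python
import numpy as np
from scipy.linalg import subspace_angles

def lsa_affinity(W, K, n_neighbors=5, subspace_dim=4):
    """Local subspace affinity sketch for a measurement matrix W (2F x N)."""
    # Project each trajectory (column of W) onto the K-dimensional
    # unit hypersphere via the singular value decomposition.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    X = Vt[:K]                                # K x N trajectory representation
    X = X / np.linalg.norm(X, axis=0)

    # Estimate a local linear subspace for every trajectory from its
    # nearest neighbors on the hypersphere (largest dot products).
    N = X.shape[1]
    bases = []
    for i in range(N):
        neighbors = np.argsort(-(X.T @ X[:, i]))[:n_neighbors]
        U, _, _ = np.linalg.svd(X[:, neighbors], full_matrices=False)
        bases.append(U[:, :subspace_dim])

    # Affinity from the principal angles between estimated subspaces.
    A = np.zeros((N, N))
    for i in range(N):
        for j in range(i, N):
            angles = subspace_angles(bases[i], bases[j])
            A[i, j] = A[j, i] = np.exp(-np.sum(np.sin(angles) ** 2))
    return A  # ready to be partitioned by spectral clustering
```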
While mathematically elegant, the downside of LSA is that it does not provide a built-in mechanism to treat incomplete trajectories. This is too strong a limitation given the RGB-D videos considered in this thesis. Also, estimating the rank K of the measurement matrix and the dimensions of the linear subspaces has been reported as difficult [39]. Local subspace affinity is studied in more detail in the master's thesis by Zappella [4], which provides good references to related work on the subject.
Outlier-based segmentation with KinectFusion
So far, we have presented approaches that work on sparse to semi-dense trajectories. KinectFusion by Newcombe et al. [7] is a depth-based real-time approach to the simultaneous localization and mapping (SLAM) problem. Depth readings from a depth sensor are integrated frame by frame into a global model of the surface.
Izadi et al. [6] extend the system to deal with dynamic scenes in order to support interaction, meaning a user manipulating objects in front of the camera. They also state 3-d reconstruction of a single object as their goal, with motion being the cue for segmentation. Hence, their work is relevant to this thesis.
The extension to KinectFusion is based on the assumptions that the background is the dominant structure in the scene and that parts of the background have already been integrated into the surface model prior to interaction. The idea behind these assumptions is that the camera motion can still be recovered from the dominant background and that dynamic objects reveal themselves as outlier data when fitted to the global model of the surface. In the interactive scenario, users might not only interact with the scene, but the system might also interact with the user. This allows for giving the user feedback and instructions on how to make sure that these assumptions hold. These assumptions are not made in this thesis, which treats all rigid structures equally and imposes no specific order in which the objects have to appear in the RGB-D videos.
Object-centered reconstruction
Shin et al. [51] introduce a bottom-up approach targeted solely at reconstructing moving objects in a dynamic scene. Their idea is based on co-recognizing objects that are common to a set of unordered images by matching image patches between each image pair. The correspondences are then used to obtain a dense segmentation of the objects in the images prior to the method's 3-d reconstruction step.
Unfortunately, the method scales quadratically with the number of frames. The reported running times of the reference implementation are too long for the practical orientation of this thesis.
Benchmarks
The field of RGB-D video segmentation is relatively new and seems to lack standard, large-scale benchmark datasets. No suitable public benchmark could be found that provides RGB-D videos with ground truth segmentation of each moving object in a dynamic scene. In the following, three publicly available benchmarks are presented, but none of them is fully suitable for this thesis.
TUM RGB-D Benchmark
The TUM RGB-D Benchmark dataset [52, 53, 54] is a collection of static and dynamic scenes recorded with RGB-D cameras. The benchmark is intended for the evaluation of SLAM systems. Each recording provides color and depth images. The datasets are annotated with the ground truth camera path relative to the background in the scene. The dynamic scenes contain motion such as people moving. The benchmark targets researchers who want to test the camera tracking and mapping capabilities of a SLAM system. There is no pixel-wise annotation of the moving objects in the image sequences, which limits the usefulness of the dataset in the context of motion segmentation, at least as long as the motion segmentation algorithm is not embedded into a system for robust camera tracking.
Hopkins 155 Dataset
The Hopkins 155 Dataset [39] provides 155 videos together with point trajectories. Most of the scenes consist of rigidly moving checkerboard objects, recorded with hand-held cameras. There are also scenes with non-rigid motion and articulated motion. The trajectories are labeled with ground truth according to which dynamic object they belong to. However, the Hopkins 155 Dataset does not contain outliers or incomplete trajectories, has fewer than 500 trajectories per image sequence, and does not provide any depth data as input, which makes it unsuitable for the evaluation of RGB-D video segmentation methods. An update to the dataset adds the capability to insert synthetic outliers, and provides 16 sequences with incomplete trajectories and outliers from [15, 55].
Freiburg-Berkeley Motion Segmentation Dataset
The Freiburg-Berkeley Motion Segmentation Dataset [2, 44] contains 59 videos with pixel-wise ground truth. It features mostly non-rigid motion but includes some of the car sequences from the Hopkins 155 Dataset. As with the Hopkins 155 Dataset, no depth data is included. It is thus also not suitable for the evaluation in this thesis.
Table of contents
1 Introduction
1.1 Motivation
1.2 Problem statement
1.3 Selected approach
1.4 Thesis outline
2 Related work
2.1 General approaches
2.2 Exemplary approaches
2.2.1 Clustering of long-term trajectories
2.2.2 Clustering trajectories with maximal cliques
2.2.3 Clustering trajectories by local subspace affinity
2.2.4 Outlier-based segmentation with KinectFusion
2.2.5 Object-centered reconstruction
2.3 Benchmarks
2.3.1 TUM RGB-D Benchmark
2.3.2 Hopkins 155 Dataset
2.3.3 Freiburg-Berkeley Motion Segmentation Dataset
3 Background
3.1 RGB-D cameras
3.2 Point trackers
3.3 Graph theory
3.4 Spectral clustering
4 Method
4.1 Concept
4.1.1 Trajectories
4.1.2 Scenarios without noise and full observation
4.1.3 Scenarios with noise and partial observation
4.1.4 Towards pose estimation
4.2 Point tracking
4.3 Trajectory statistics
4.3.1 Statistics of scene distances
4.3.2 Statistics of image distances
4.3.3 Computational complexity
4.4 Trajectory clustering
4.4.1 Similarity graph
4.4.2 Generalized eigensystem
4.4.3 K-Means
4.4.4 Model selection
5 Evaluation
5.1 Evaluation method
5.1.1 Tracking performance
5.1.2 Analysis of trajectory statistics
5.1.3 Clustering performance
5.2 Datasets
5.2.1 Office Dataset
5.2.2 Smart Mobility Lab Dataset
5.2.3 Simulated Dynamic Scene Dataset
5.2.4 TUM RGB-D Benchmark
5.3 Analysis
5.3.1 Point tracking
5.3.2 Trajectory statistics
5.3.3 Trajectory clustering
6 Conclusion
6.1 Key findings
6.2 Future directions
Bibliography
Notation
Abbreviations and Acronyms
A Plots
A.1 Trajectory lengths
A.2 Dissimilarity box plots
A.3 Segmentations