cvpr cvpr2013 cvpr2013-370 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Weiss, Ben Taskar
Abstract: We propose SCALPEL, a flexible method for object segmentation that integrates rich region-merging cues with mid- and high-level information about object layout, class, and scale into the segmentation process. Unlike competing approaches, SCALPEL uses a cascade of bottom-up segmentation models that is capable of learning to ignore boundaries early on, yet use them as a stopping criterion once the object has been mostly segmented. Furthermore, we show how such cascades can be learned efficiently. When paired with a novel method that generates better localized shapepriors than our competitors, our method leads to a concise, accurate set of segmentation proposals; these proposals are more accurate on the PASCAL VOC2010 dataset than state-of-the-art methods that use re-ranking to filter much larger bags of proposals. The code for our algorithm is available online.
Reference: text
sentIndex sentText sentNum sentScore
1 We propose SCALPEL, a flexible method for object segmentation that integrates rich region-merging cues with mid- and high-level information about object layout, class, and scale into the segmentation process. [sent-3, score-0.464]
2 Unlike competing approaches, SCALPEL uses a cascade of bottom-up segmentation models that is capable of learning to ignore boundaries early on, yet use them as a stopping criterion once the object has been mostly segmented. [sent-4, score-0.629]
3 These proposals can then be evaluated by a more complex model to determine a final set of localized and segmented objects in the image. [sent-11, score-0.422]
4 These pairwise features are not sufficient to discriminate between full objects and partial segmentations, so many proposals must necessarily be generated per seed, and many seeds must be sampled. [sent-23, score-0.335]
5 Therefore, the final step of the process involves learning a re-ranking classifier to filter the fixed set of segment proposals using features computed over the entire region. [sent-24, score-0.348]
6 In this work, we propose incorporating region features normally reserved for a re-ranker directly into the segmentation process. [sent-27, score-0.313]
7 This allows us to be far more efficient in terms of the number of proposals generated: our method provides accuracy similar to or better than state-of-the-art reranking-based systems with only a single proposal per region of interest. [sent-28, score-0.359]
8 Because we can evaluate features such as normalized cut energy during the segmentation process, our procedure can find the right balance between filling out a given region and finding a segmentation that has object-like properties as a whole. [sent-29, score-0.462]
9 Given a prior belief corresponding to an object in the image, our approach is more likely to get the segmentation right the first time, without needing to generate multiple guesses. [sent-30, score-0.28]
10 However, incorporating region features comes at a price: we must forgo using efficient graph-cut algorithms to produce our segmentation proposals. [sent-31, score-0.271]
11 To retain efficiency, we adopt instead a greedy superpixel selection algorithm. [sent-32, score-0.445]
12 The concept of segmentation through greedy superpixel selection is at least a decade old. [sent-33, score-0.602]
13 Specifically, we incorporate high-level information about the scale, probable layout, class of the object, and the current stage of segmentation into the procedure. [sent-36, score-0.289]
14 This is because there are intrinsic variations across and within objects that will affect whether or not a given feature is useful during the greedy segmentation process. [sent-37, score-0.4]
15 For example, the usefulness of color and [Figure 1 panels: input image, shape/object prior, model selection, cascade inference] [sent-38, score-0.531]
16 The input is a shape prior annotated with class and size information (automatically generated from the prior generator). [sent-40, score-0.295]
17 The class and size are used to select a scale- and class-specific cascade model ws,c from a lookup table. [sent-41, score-0.507]
18 The cascade greedily grows a fill region (initialized from the shape prior), with each sub-model adding up to a fixed number of segments before passing the region to the next level of the cascade, stopping whenever any level finds no superpixel candidate scoring above zero. [sent-42, score-1.014]
19 texture information from the prior depends on whether or not the object class has consistent color and texture, as well as on whether the object is too large or too small to obtain good estimates of color and texture. [sent-44, score-0.27]
20 Similarly, segmenting a large object requires ignoring interior boundaries early on in the greedy selection process, yet respecting exterior boundaries once the object has been fully segmented in the later stage of the process. [sent-45, score-0.588]
21 We learn to infer these properties using a simple localized shape prior generation scheme that localizes objects with higher recall than either purely bottom-up or top-down methods. [sent-50, score-0.498]
22 We use a cascade of selection models in which different stages of the segmentation process have different parameters. [sent-52, score-0.51]
23 Unlike a fixed model, the cascade is capable of learning to ignore boundaries early on in the process yet use them as a stopping criterion once the object has reached a certain size. [sent-53, score-0.472]
24 We demonstrate that it is feasible to incorporate features normally reserved for re-ranking directly into the segmentation process, using a simple greedy method for superpixel selection. [sent-59, score-0.623]
25 Related Work. Several previous works have attempted to form segmentations of objects in the image given a detection bounding box [6, 22, 10, 14, 15]. [sent-63, score-0.284]
26 [6] and [22] both learn several shape priors using the root or parts of the DPM [9], whereas we learn hundreds of holistic shapes from a fine-grained clustering in mask pixel space. [sent-64, score-0.444]
27 Furthermore, [6, 22] trust the class assignments of the detections; we use bottom-up bounding boxes with no class information to increase recall, and instead generate shape predictions based on the content of the bounding boxes. [sent-65, score-0.538]
28 Both [7, 6] incorporate learning into their segmentation method, either by learning a set of unary scores for graph-cut [7] based on harvested region pairs or by directly using max-margin structured learning with graph-cut as inference [6]. [sent-67, score-0.406]
29 (1) 900 bounding boxes per image are generated from several methods. (2) Top: Each box is evaluated by a shape classifier (Section 4) that provides a rough estimate of the shape of an object inside the box. [sent-71, score-0.378]
30 Bottom: These estimates are integrated over superpixels to form a localized shape prior. [sent-72, score-0.444]
31 (3) Superpixels are greedily selected using a cascaded segmentation model to form the final output (Section 3). [sent-73, score-0.38]
32 Furthermore, these approaches are again limited to sub-modular edge weights and graph-cut as inference, while we in contrast learn cascaded weights on features computed over arbitrary groups of superpixels that are not required to be sub-modular. [sent-75, score-0.521]
33 However, our goal is to group super-pixels into a single coherent region, not to produce a tiling of segments that cover coherent regions over the entire image; we stop agglomerating superpixels when the score no longer exceeds a desired threshold. [sent-77, score-0.315]
34 More importantly, our cascaded weight vector allows for the scoring function to change as inference proceeds, features can be computed over arbitrary groups of superpixels (not only pairwise), and we use the localized shape priors to seed our method and guide inference. [sent-78, score-0.977]
35 Rather, our segmentation approach is more related to the greedy MCMC inference used by [17], again with the addition of shape priors and our modeling innovations. [sent-79, score-0.611]
36 Our cascade approach is most similar to that developed independently by [18], but we incorporate high-level information rather than taking a purely bottom-up approach. [sent-80, score-0.331]
37 Finally, we note that our shape priors are inspired by mask transfer approaches to segmentation. [sent-81, score-0.384]
38 Learning to segment with SCALPEL. Given a prior belief about an object in an image, our goal is to find a selection of superpixels that both match the prior and have excellent support from image cues. [sent-88, score-0.582]
39 One price we pay by incorporating region-based features into the segmentation process is that pixel-wise segmentation becomes prohibitively expensive. [sent-90, score-0.272]
40 Therefore, we opt to perform segmentation at the superpixel level, using the output of gPb-owt-ucm [2] with 200 superpixels. [sent-91, score-0.384]
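For readers who want to reproduce this setup, below is a minimal sketch of the superpixel preprocessing assumed above; SLIC from scikit-image is used only as a stand-in for gPb-owt-ucm [2] (whose reference implementation is not in Python), and the function name and adjacency bookkeeping are illustrative assumptions rather than the released code.

```python
# Sketch of the assumed preprocessing: ~200 superpixels plus an adjacency map.
# SLIC stands in here for gPb-owt-ucm [2].
import numpy as np
from skimage import io
from skimage.segmentation import slic

def extract_superpixels(image_path, n_segments=200):
    image = io.imread(image_path)
    labels = slic(image, n_segments=n_segments, compactness=10)
    neighbors = {int(s): set() for s in np.unique(labels)}
    # Two superpixels are neighbors if any horizontally or vertically
    # adjacent pixel pair straddles their boundary.
    for a, b in ((labels[:-1, :], labels[1:, :]), (labels[:, :-1], labels[:, 1:])):
        diff = a != b
        for u, v in zip(a[diff], b[diff]):
            neighbors[int(u)].add(int(v))
            neighbors[int(v)].add(int(u))
    return labels, neighbors
```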
41 Segment selection algorithm. We first describe a greedy segment selection algorithm without a cascade; we will then extend the algorithm to the cascaded setting. [sent-94, score-0.541]
42 Intuitively, our algorithm begins with a single superpixel and then repeatedly adds neighboring superpixels to the set until a stopping criterion is reached. [sent-95, score-0.494]
43 We represent a filled-in object mask as a subset of the superpixels S(x) that we turn “on.” [sent-100, score-0.281]
44 Because the greedy inference algorithm selects superpixels sequentially, we also define a selection order z to be an ordered subset of S(x) indicating the order in which superpixels were selected by the greedy algorithm. [sent-102, score-0.97]
45 We next define our features in terms of the decisions made by the greedy inference scheme. [sent-103, score-0.334]
46 Given a selection order z and a candidate superpixel s, the algorithm computes features Δf(x, z, s) that measure the change in region properties when s is selected as the next element. [sent-104, score-0.479]
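To make Δf(x, z, s) concrete, the sketch below computes a handful of illustrative change-in-region features; the specific quantities (prior mass gained, boundary strength crossed, color similarity to the filled region) and the data layout of x are assumptions standing in for the paper's actual region and unary features.

```python
import numpy as np

def delta_f(x, z, s):
    """Illustrative change-in-region features when candidate superpixel s is
    appended to the current selection z. `x` bundles per-superpixel statistics
    (area, mean color descriptor, localized prior mass, pairwise boundary
    strength); all field names are hypothetical."""
    region = list(z) + [s]
    feats = []
    # Unary-style cues for the candidate itself.
    feats.append(x["prior"][s])                        # localized shape-prior mass
    feats.append(x["area"][s] / x["area"].sum())       # relative size
    # Region cues: how the filled region changes if s is added.
    feats.append(x["prior"][region].sum() - x["prior"][list(z)].sum())
    boundary = np.mean([x["boundary"][s, t] for t in z]) if z else 0.0
    feats.append(boundary)                             # strength of edges crossed
    color_sim = np.mean([np.dot(x["color"][s], x["color"][t]) for t in z]) if z else 0.0
    feats.append(color_sim)                            # avg color similarity to z
    return np.array(feats)
```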
47 The feature vector Δf(x, z, s) consists of 8 region features and 3 unary features. [sent-108, score-0.501]
48 The features are scored according to a single selected row of the cascaded weight vector to produce the final scores over candidate superpixels on the right. [sent-113, score-0.506]
49 Given a weight vector w, feature generating function Δf, and initialization s1, the inference procedure greedily optimizes the following linear scoring function: [sent-114, score-0.271]
50 s(t) = argmax_{s ∈ N(z(t))} w · Δf(x, z(t), s), (2) where N(z) are the neighboring superpixels to those already in the selection order z. [sent-121, score-0.326]
51 In other words, s(t) is the neighboring superpixel with largest score according to the current selection order z(t). [sent-122, score-0.321]
52 We then define the greedy update to the selection order at step t to update only if the estimated change in score is positive: [sent-123, score-0.279]
53 z(t+1) ← [z(t), s(t)] if w · Δf(x, z(t), s(t)) > 0, and z(t+1) ← z(t) (inference stops) if w · Δf(x, z(t), s(t)) < 0. (3) We now extend the greedy algorithm to a cascaded setting in a straightforward fashion. [sent-126, score-0.291]
54 Inference then proceeds in a stage-wise fashion, where stage k uses wk to either stop inference or select one or more additional superpixels before passing to the next stage (Figure 1). [sent-131, score-0.545]
55 A cascade lookup function maps the current selection order to a stage index in {1, . . . , K}, and thereby defines which weight vector wk is applied at each step of inference. [sent-136, score-0.321]
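Combining equations (2)-(3) with the stage lookup gives the following sketch of cascaded greedy inference; `delta_f`, the `neighbors` map, and `stage_of` are the hypothetical helpers from the sketches above, not the authors' implementation. With a single stage (K = 1) it reduces to the plain greedy algorithm described earlier.

```python
import numpy as np

def cascaded_greedy_select(x, neighbors, delta_f, weights, stage_of, seed):
    """Greedy superpixel selection with a cascade of weight vectors.
    weights:   list of K weight vectors w_1..w_K.
    stage_of:  cascade lookup function mapping the current selection order
               to a stage index in {0, ..., K-1}.
    Stops as soon as no neighboring candidate scores above zero."""
    z = [seed]
    while True:
        w = weights[stage_of(z)]
        # Candidates are superpixels adjacent to the current region, Eq. (2).
        candidates = set().union(*(neighbors[s] for s in z)) - set(z)
        if not candidates:
            return z
        scored = [(float(np.dot(w, delta_f(x, z, s))), s) for s in candidates]
        best_score, best_s = max(scored)
        if best_score <= 0:          # Eq. (3): stop when no score is above zero
            return z
        z.append(best_s)
```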
56 Here w · Δf(xj, z, s) plays the role of an SVM-style margin of selecting superpixel s given selection z on input xj. [sent-156, score-0.291]
57 Our training procedure is simple, and learns the cascade in a bottom-up fashion. [sent-157, score-0.288]
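One plausible reading of that bottom-up procedure is sketched below with a linear SVM (the topic terms for this paper mention liblinear); how the per-stage (Δf, label) training pairs are harvested from simulated greedy runs is left abstract, and everything here is an assumption rather than a transcription of the authors' training code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_cascade_stage(examples):
    """Fit one stage's weight vector from (feature_vector, label) pairs, where
    label=+1 means the candidate superpixel belongs to the ground-truth object
    given the partial selection, and label=-1 means it does not."""
    X = np.vstack([f for f, _ in examples])
    y = np.array([lab for _, lab in examples])
    clf = LinearSVC(C=1.0, fit_intercept=False)
    clf.fit(X, y)
    return clf.coef_.ravel()

def train_cascade(stage_examples):
    """Bottom-up training: each stage k is fit on the candidate decisions
    encountered at that point of the (partially simulated) greedy inference.
    stage_examples[k] is a list of (delta_f_vector, +/-1) pairs for stage k."""
    return [train_cascade_stage(ex) for ex in stage_examples]
```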
58 While many of the features we use are generic, there are intrinsic variations across objects that will affect whether or not a given feature is useful to greedily segment that object. [sent-190, score-0.28]
59 Although the cascaded model introduced in the previous section can learn to discount features early on in the segmentation process and change weights as inference proceeds, it cannot handle a priori variations between objects due to either object size or object category. [sent-191, score-0.736]
60 At test time, we use the area of the target bounding box to determine the scale bin and the output of the shape classifier to determine the appropriate class bin, and run the selected model accordingly (Figure 1). [sent-193, score-0.398]
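A small sketch of the lookup table for the scale- and class-specific model ws,c follows; the area bin edges and the class naming are invented for illustration.

```python
import bisect

class ModelLookup:
    """Selects a cascade model w_{s,c} from a table indexed by scale bin and
    predicted class. Scale-bin edges (in pixel area) are hypothetical."""
    def __init__(self, models, area_bin_edges=(32 * 32, 96 * 96, 256 * 256)):
        self.models = models                 # dict: (scale_bin, class_name) -> cascade weights
        self.edges = list(area_bin_edges)

    def select(self, box_area, predicted_class):
        scale_bin = bisect.bisect_right(self.edges, box_area)
        return self.models[(scale_bin, predicted_class)]
```

For example, `lookup.select(w * h, "person")` would return the cascade used to segment a person-sized prior.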
61 Note that even if the class and scale predictions are incorrect at test time, selecting a different model for each prior is a useful way to generate a diverse set of proposals from a pool of priors. [sent-194, score-0.529]
62 The segmentation model uses 8 features computed on groups of segments and 6 unary features for a total of 14 features. [sent-196, score-0.347]
63 The unary features are computed once for each superpixel s, while the region features are computed during inference, relative to a fixed already-filled region z and a candidate superpixel s. [sent-198, score-0.71]
64 The final two region features are the average similarity between superpixels within z and the candidate s in terms of color and texture. [sent-207, score-0.389]
65 While (a) is a typical informative shape prior from the sheep class, the cluster of aeroplane objects in (b) is mis-aligned, does not provide an informative prior, and is discarded at run-time. [sent-212, score-0.289]
66 First, we sample bounding boxes from three different publicly available bounding box generation methods: purely bottom-up boxes from the segmentation hierarchy of gPb-owt-ucm [2], category-independent boxes from [16], and purely top-down class-specific boxes from [9]. [sent-217, score-0.931]
67 Our next step is to predict a soft object mask (Figure 4) for each box, which we in turn use to compute a localized shape prior for the image. [sent-220, score-0.657]
68 The final input to the segmentation algorithm is therefore a pool of shape priors annotated by (scale, class) pairs, where the area of the bounding box is used for the scale of the object, and the class is taken from the predicted object mask. [sent-221, score-0.784]
69 In order to predict soft masks for each bounding box, we first need a dictionary to define the space of possible soft masks (Figure 4). [sent-223, score-0.426]
70 The use of thumbnails ensures that minor variations in shape will not significantly change distance in mask pixel space. [sent-226, score-0.312]
71 Finally, we pool all of the mask clusters for a given aspect ratio into a single set, yielding roughly 350 shapes per aspect ratio and a total of 1428 shapes. [sent-228, score-0.457]
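A hedged sketch of how such a dictionary could be built: masks are resized to thumbnails and clustered with k-means in mask-pixel space. Note the simplification: the sketch clusters directly per aspect ratio, whereas the text describes clustering and then pooling clusters per aspect ratio, and the thumbnail size is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage.transform import resize

def build_mask_dictionary(masks_by_aspect, thumb=(24, 24), clusters_per_aspect=350):
    """masks_by_aspect: dict mapping an aspect-ratio bin to a list of binary
    ground-truth object masks. Returns, per aspect bin, the cluster-mean
    'soft masks' (values in [0, 1]) that make up the dictionary."""
    dictionary = {}
    for aspect, masks in masks_by_aspect.items():
        thumbs = np.stack([
            resize(m.astype(float), thumb, anti_aliasing=True).ravel() for m in masks
        ])
        k = min(clusters_per_aspect, len(masks))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(thumbs)
        dictionary[aspect] = km.cluster_centers_.reshape(k, *thumb)
    return dictionary
```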
72 Given a bounding box, we need to choose the soft mask that best matches the object inside the box. [sent-236, score-0.449]
73 In the interests of generalization and efficiency, we opt not to use nearest-neighbor methods, and instead learn a linear SVM classifier using Histogram of Oriented Gradients (HOG) features to differentiate between exemplars in each soft mask cluster. [sent-237, score-0.441]
74 After the first round of training, we harvest false-positives from the bounding box pools on the training set and introduce them as examples of an additional negative class for a second round of training (Figure 5). [sent-239, score-0.309]
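A sketch of this two-round training, using scikit-image HOG features and a linear SVM; the HOG parameters, the window size, and the rule for deciding which background windows count as hard negatives are assumptions.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def hog_feature(patch, size=(64, 64)):
    # Patches are assumed grayscale (2D arrays).
    return hog(resize(patch, size), pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_shape_classifier(cluster_patches, background_patches):
    """cluster_patches: dict mapping an integer soft-mask cluster id to a list
    of image patches whose object matches that cluster. Round two adds
    harvested false positives as one extra 'negative' class."""
    X, y = [], []
    for cid, patches in cluster_patches.items():
        for p in patches:
            X.append(hog_feature(p))
            y.append(cid)
    clf = LinearSVC().fit(np.stack(X), np.array(y))

    # Round 2: harvest confident false positives from background boxes and
    # retrain with them labeled as an additional negative class.
    neg_label = max(cluster_patches) + 1
    hard = []
    for p in background_patches:
        f = hog_feature(p)
        if clf.decision_function([f]).max() > 0:   # confidently mistaken for some shape
            hard.append(f)
    if hard:
        X2 = np.vstack([np.stack(X), np.stack(hard)])
        y2 = np.concatenate([np.array(y), np.full(len(hard), neg_label)])
        clf = LinearSVC().fit(X2, y2)
    return clf
```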
75 The predicted soft object mask is often only roughly aligned with the object. [sent-241, score-0.442]
76 To fix this, we integrate the soft mask over underlying superpixels and normalize by the area of each superpixel. [sent-244, score-0.522]
77 This largely eliminates bleeding into the background when the background consists of large superpixels and the mask at least partially covers the object. [sent-245, score-0.42]
78 We also discard soft object masks that suffer from misalignment in the corresponding cluster by throwing out any predicted mask classifications where the average soft mask accounts for less than 40% of the hypothesized bounding box (Figure 5). [sent-246, score-0.918]
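A sketch of turning a predicted soft mask into the superpixel-localized prior, including the 40% coverage test; here the coverage check is applied after placing the mask in image coordinates, which is an assumption about where the paper applies it.

```python
import numpy as np

def localized_prior(soft_mask, box, labels, min_coverage=0.40):
    """soft_mask: HxW soft object mask placed in image coordinates (values in
    [0, 1], zero outside the hypothesized box). box: (y0, y1, x0, x1) of the
    hypothesized bounding box. labels: HxW superpixel label map."""
    y0, y1, x0, x1 = box
    if soft_mask[y0:y1, x0:x1].mean() < min_coverage:
        return None                       # mis-aligned mask cluster: discard this prior
    n = labels.max() + 1
    mass = np.bincount(labels.ravel(), weights=soft_mask.ravel(), minlength=n)
    area = np.bincount(labels.ravel(), minlength=n).astype(float)
    prior = mass / np.maximum(area, 1.0)  # integrate over superpixels, normalize by area
    seed = int(np.argmax(prior))          # highest-prior superpixel seeds the greedy selection
    return prior, seed
```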
79 The schedule should ideally be aware of not simply how many superpixels have been selected, but how much of the object has been segmented, so that different weights can be used for early vs. late stages of inference. [sent-252, score-0.396]
80 Each selected superpixel uses up a significant percentage of the prior for smaller objects. [sent-302, score-0.269]
81 To seed our segmentations, we use the single superpixel with highest localized prior score. [sent-306, score-0.453]
82 Our system runs at speeds comparable to other state-of-the-art systems; after roughly 4-5 min of preprocessing to compute gPb-owt-ucm and features per image, prior prediction takes roughly 30s and segmentation takes 2-4 min per image, running unoptimized MATLAB code. [sent-310, score-0.377]
83 Diversity of weight vectors for large object cascade for different classes. [sent-318, score-0.357]
84 Exterior edges are far more important for bicycles than for buses and people; the people cascade learns to ignore exterior edges until the object is mostly segmented. [sent-323, score-0.588]
85 As we are interested in the precision and recall of the segment pools, we compute average best overlap across all objects in the test set, as well as recall percentage at the standard 0.5 overlap threshold. [sent-325, score-0.27]
86 Because the covering penalizes incorrect segmentation of large objects more heavily than that of small objects, we also investigate the average overlap as a function of object size. [sent-329, score-0.392]
87 We calibrated each method to output roughly 650 proposals per image, which is the number of proposals produced with the CPMC default parameters (we did not run the CPMC re-ranking step). [sent-333, score-0.537]
88 We first evaluated the quality of the bounding boxes and localized priors themselves. [sent-336, score-0.429]
89 We first compared our mixed bounding box sampling approach against sampling 900 boxes of each method individually (Figure 6), and found that our method greatly increases recall compared to any individual method. [sent-337, score-0.33]
90 Next, to generate a proposed segmentation for each prior, we greedily selected superpixels to obtain a segmentation with highest overlap with the soft mask. [sent-338, score-0.784]
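A sketch of that oracle step: starting from the seed, superpixels are added as long as a soft intersection-over-union with the mask keeps improving; the exact overlap measure is an assumption.

```python
import numpy as np

def best_overlap_segmentation(mask_mass, areas, neighbors, seed):
    """Greedy oracle used to score a prior: keep adding the neighboring
    superpixel that most improves a soft IoU between the selected region and
    the soft mask, and stop when no addition helps. mask_mass[s] is the soft
    mask integrated over superpixel s; areas[s] its pixel count (assumptions)."""
    total_mass = mask_mass.sum()
    selected = {seed}

    def soft_iou(sel):
        idx = list(sel)
        inter = mask_mass[idx].sum()
        union = total_mass + areas[idx].sum() - inter
        return inter / union

    current = soft_iou(selected)
    while True:
        frontier = set().union(*(neighbors[s] for s in selected)) - selected
        if not frontier:
            return selected
        best = max(frontier, key=lambda s: soft_iou(selected | {s}))
        new = soft_iou(selected | {best})
        if new <= current:
            return selected
        selected.add(best)
        current = new
```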
91 First, we investigated the contribution of the various techniques we applied to make the greedy inference procedure robust to variations within and across objects (Table 2). [sent-351, score-0.348]
92 We find that both cascades and class- and scale-specific models are important, effective means of improving the performance of the greedy inference scheme. [sent-352, score-0.368]
93 While CPMC sacrifices recall for covering and Object Proposals sacrifices covering for recall, SCALPEL outperforms both on IoU, recall, and covering simultaneously. [sent-356, score-0.411]
94 3% covering; by proposing two shape priors per bounding box, we can increase our proposals to 1456 and achieve 84. [sent-358, score-0.546]
95 SCALPEL performs slightly worse than CPMC for the largest objects, most likely due to the greedy inference being unable to handle occlusions that separate objects into multiple disconnected regions. [sent-365, score-0.306]
96 We show the weight values for two different features of the cascade in Figure 8. [sent-367, score-0.352]
97 As desired, the cascade learns to weight features differently at different stages of inference. [Figure caption fragment: simply thresholding the prior would result in failures on the more difficult examples.] [sent-368, score-0.455]
98 For example, the (Large, Person) cascade learns to down-weight exterior edges until nearing completion of the inference process. [sent-369, score-0.499]
99 Conclusion. We have presented SCALPEL, a novel method for state-of-the-art segment proposal generation with efficient training of class- and scale-specific segmentation cascades. [sent-371, score-0.299]
100 Furthermore, our approach can be extended to incorporate arbitrary new features or bounding box proposals, and additional specificities besides class and scale (such as shape or color) could be explored as well. [sent-373, score-0.472]
wordName wordTfidf (topN-words)
[('scalpel', 0.478), ('cascade', 0.258), ('proposals', 0.243), ('superpixels', 0.231), ('superpixel', 0.196), ('mask', 0.189), ('cpmc', 0.176), ('segmentation', 0.157), ('greedy', 0.154), ('cascaded', 0.137), ('localized', 0.132), ('priors', 0.114), ('cascades', 0.109), ('bounding', 0.108), ('exterior', 0.106), ('inference', 0.105), ('soft', 0.102), ('selection', 0.095), ('box', 0.091), ('covering', 0.087), ('greedily', 0.086), ('shape', 0.081), ('boxes', 0.075), ('prior', 0.073), ('zj', 0.073), ('region', 0.069), ('class', 0.068), ('stopping', 0.067), ('pool', 0.065), ('lookup', 0.063), ('bicycles', 0.063), ('segment', 0.06), ('masks', 0.057), ('feautre', 0.057), ('selections', 0.056), ('recall', 0.056), ('wk', 0.056), ('segments', 0.054), ('proceeds', 0.053), ('seed', 0.052), ('competitors', 0.052), ('overlap', 0.051), ('roughly', 0.051), ('objecst', 0.05), ('buses', 0.05), ('object', 0.05), ('scale', 0.05), ('weight', 0.049), ('objects', 0.047), ('pascal', 0.047), ('proposal', 0.047), ('sacrifices', 0.047), ('unary', 0.046), ('aspect', 0.046), ('features', 0.045), ('sharing', 0.045), ('purely', 0.044), ('exemplars', 0.044), ('candidate', 0.044), ('classspecific', 0.044), ('variations', 0.042), ('reserved', 0.042), ('schedule', 0.042), ('pools', 0.042), ('iou', 0.04), ('mentation', 0.04), ('weights', 0.039), ('orderings', 0.039), ('taskar', 0.039), ('segmentations', 0.038), ('arbelaez', 0.038), ('strength', 0.037), ('stage', 0.035), ('generation', 0.035), ('cut', 0.034), ('early', 0.034), ('xw', 0.033), ('carreira', 0.033), ('concise', 0.032), ('boundaries', 0.032), ('ignore', 0.031), ('opt', 0.031), ('paired', 0.031), ('substitute', 0.031), ('scoring', 0.031), ('learn', 0.03), ('predictions', 0.03), ('cluster', 0.03), ('learns', 0.03), ('score', 0.03), ('ratio', 0.03), ('next', 0.03), ('price', 0.03), ('fragments', 0.03), ('liblinear', 0.03), ('informative', 0.029), ('texture', 0.029), ('incorporate', 0.029), ('fill', 0.029), ('tiny', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999869 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
Author: David Weiss, Ben Taskar
Abstract: We propose SCALPEL, a flexible method for object segmentation that integrates rich region-merging cues with mid- and high-level information about object layout, class, and scale into the segmentation process. Unlike competing approaches, SCALPEL uses a cascade of bottom-up segmentation models that is capable of learning to ignore boundaries early on, yet use them as a stopping criterion once the object has been mostly segmented. Furthermore, we show how such cascades can be learned efficiently. When paired with a novel method that generates better localized shapepriors than our competitors, our method leads to a concise, accurate set of segmentation proposals; these proposals are more accurate on the PASCAL VOC2010 dataset than state-of-the-art methods that use re-ranking to filter much larger bags of proposals. The code for our algorithm is available online.
2 0.23177543 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
Author: Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, Raquel Urtasun
Abstract: In this paper we are interested in how semantic segmentation can help object detection. Towards this goal, we propose a novel deformable part-based model which exploits region-based segmentation algorithms that compute candidate object regions by bottom-up clustering followed by ranking of those regions. Our approach allows every detection hypothesis to select a segment (including void), and scores each box in the image using both the traditional HOG filters as well as a set of novel segmentation features. Thus our model “blends ” between the detector and segmentation models. Since our features can be computed very efficiently given the segments, we maintain the same complexity as the original DPM [14]. We demonstrate the effectiveness of our approach in PASCAL VOC 2010, and show that when employing only a root filter our approach outperforms Dalal & Triggs detector [12] on all classes, achieving 13% higher average AP. When employing the parts, we outperform the original DPM [14] in 19 out of 20 classes, achieving an improvement of 8% AP. Furthermore, we outperform the previous state-of-the-art on VOC’10 test by 4%.
3 0.2137626 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
Author: Guang Shu, Afshin Dehghan, Mubarak Shah
Abstract: We propose an approach to improve the detection performance of a generic detector when it is applied to a particular video. The performance of offline-trained objects detectors are usually degraded in unconstrained video environments due to variant illuminations, backgrounds and camera viewpoints. Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video specificfeatures like consistent colorpatterns. In our approach, we apply a Superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. Compared to other related work, our method builds a video-specific detector using superpixels, hence it can handle the problem of appearance variation. Most importantly, using Conditional Random Field (CRF) along with our super pixel-based BoW model, we develop and algorithm to segment the object from the background . Therefore our method generates an output of the exact object regions instead of the bounding boxes generated by most detectors. In general, our method takes detection bounding boxes of a generic detector as input and generates the detection output with higher average precision and precise object regions. The experiments on four recent datasets demonstrate the effectiveness of our approach and significantly improves the state-of-art detector by 5-16% in average precision.
4 0.21163693 212 cvpr-2013-Image Segmentation by Cascaded Region Agglomeration
Author: Zhile Ren, Gregory Shakhnarovich
Abstract: We propose a hierarchical segmentation algorithm that starts with a very fine oversegmentation and gradually merges regions using a cascade of boundary classifiers. This approach allows the weights of region and boundary features to adapt to the segmentation scale at which they are applied. The stages of the cascade are trained sequentially, with asymetric loss to maximize boundary recall. On six segmentation data sets, our algorithm achieves best performance under most region-quality measures, and does it with fewer segments than the prior work. Our algorithm is also highly competitive in a dense oversegmentation (superpixel) regime under boundary-based measures.
Author: Dong Zhang, Omar Javed, Mubarak Shah
Abstract: In this paper, we propose a novel approach to extract primary object segments in videos in the ‘object proposal’ domain. The extracted primary object regions are then used to build object models for optimized video segmentation. The proposed approach has several contributions: First, a novel layered Directed Acyclic Graph (DAG) based framework is presented for detection and segmentation of the primary object in video. We exploit the fact that, in general, objects are spatially cohesive and characterized by locally smooth motion trajectories, to extract the primary object from the set of all available proposals based on motion, appearance and predicted-shape similarity across frames. Second, the DAG is initialized with an enhanced object proposal set where motion based proposal predictions (from adjacent frames) are used to expand the set of object proposals for a particular frame. Last, the paper presents a motion scoring function for selection of object proposals that emphasizes high optical flow gradients at proposal boundaries to discriminate between moving objects and the background. The proposed approach is evaluated using several challenging benchmark videos and it outperforms both unsupervised and supervised state-of-the-art methods.
6 0.19357614 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
7 0.18876024 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
8 0.17704704 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
9 0.17413695 29 cvpr-2013-A Video Representation Using Temporal Superpixels
10 0.16602471 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
11 0.15625623 364 cvpr-2013-Robust Object Co-detection
12 0.14726496 309 cvpr-2013-Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context
13 0.14657158 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
14 0.14176233 458 cvpr-2013-Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds
15 0.14172377 460 cvpr-2013-Weakly-Supervised Dual Clustering for Image Semantic Segmentation
16 0.13243948 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning
17 0.13195392 366 cvpr-2013-Robust Region Grouping via Internal Patch Statistics
18 0.12905611 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses
19 0.12678355 357 cvpr-2013-Revisiting Depth Layers from Occlusions
20 0.12557933 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
topicId topicWeight
[(0, 0.254), (1, -0.039), (2, 0.078), (3, -0.029), (4, 0.172), (5, 0.037), (6, 0.111), (7, 0.141), (8, -0.103), (9, -0.013), (10, 0.14), (11, -0.131), (12, 0.053), (13, 0.035), (14, 0.002), (15, 0.005), (16, 0.106), (17, -0.081), (18, -0.125), (19, 0.16), (20, 0.021), (21, 0.007), (22, 0.01), (23, -0.009), (24, 0.003), (25, 0.027), (26, -0.115), (27, -0.042), (28, 0.002), (29, -0.016), (30, 0.091), (31, -0.024), (32, -0.009), (33, -0.048), (34, 0.042), (35, 0.022), (36, 0.032), (37, -0.068), (38, 0.012), (39, 0.046), (40, 0.013), (41, 0.028), (42, 0.043), (43, 0.022), (44, -0.037), (45, 0.027), (46, 0.038), (47, -0.013), (48, 0.071), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.93813449 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
Author: David Weiss, Ben Taskar
Abstract: We propose SCALPEL, a flexible method for object segmentation that integrates rich region-merging cues with mid- and high-level information about object layout, class, and scale into the segmentation process. Unlike competing approaches, SCALPEL uses a cascade of bottom-up segmentation models that is capable of learning to ignore boundaries early on, yet use them as a stopping criterion once the object has been mostly segmented. Furthermore, we show how such cascades can be learned efficiently. When paired with a novel method that generates better localized shapepriors than our competitors, our method leads to a concise, accurate set of segmentation proposals; these proposals are more accurate on the PASCAL VOC2010 dataset than state-of-the-art methods that use re-ranking to filter much larger bags of proposals. The code for our algorithm is available online.
2 0.88001943 212 cvpr-2013-Image Segmentation by Cascaded Region Agglomeration
Author: Zhile Ren, Gregory Shakhnarovich
Abstract: We propose a hierarchical segmentation algorithm that starts with a very fine oversegmentation and gradually merges regions using a cascade of boundary classifiers. This approach allows the weights of region and boundary features to adapt to the segmentation scale at which they are applied. The stages of the cascade are trained sequentially, with asymmetric loss to maximize boundary recall. On six segmentation data sets, our algorithm achieves best performance under most region-quality measures, and does it with fewer segments than the prior work. Our algorithm is also highly competitive in a dense oversegmentation (superpixel) regime under boundary-based measures.
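The abstract above outlines the mechanism: start from a fine oversegmentation and let a cascade of boundary classifiers decide which adjacent regions to merge. The sketch below shows one greedy agglomeration stage of that general flavor; the feature representation, the classifier interface, and the stopping threshold are placeholder assumptions for illustration, not the authors' implementation.

import numpy as np

def agglomerate_stage(features, sizes, adjacency, keep_boundary_prob, threshold=0.5):
    # features: dict region_id -> 1-D np.ndarray (e.g., mean color of the region)
    # sizes: dict region_id -> pixel count
    # adjacency: set of frozenset({a, b}) pairs of adjacent region ids
    # keep_boundary_prob: callable(feat_a, feat_b) -> probability that the
    #     boundary between the two regions is a true object boundary
    while adjacency:
        # Pick the boundary the classifier is least confident about keeping.
        pair = min(adjacency,
                   key=lambda p: keep_boundary_prob(*(features[i] for i in p)))
        a, b = tuple(pair)
        if keep_boundary_prob(features[a], features[b]) >= threshold:
            break  # every remaining boundary looks real; this stage is done
        # Merge b into a using a size-weighted average of the feature vectors.
        total = sizes[a] + sizes[b]
        features[a] = (sizes[a] * features[a] + sizes[b] * features[b]) / total
        sizes[a] = total
        del features[b], sizes[b]
        # Rewire b's former neighbors to a and drop the merged boundary.
        adjacency = {frozenset(a if r == b else r for r in p)
                     for p in adjacency if p != pair}
        adjacency = {p for p in adjacency if len(p) == 2}
    return features, sizes, adjacency

A full cascade would run several such stages in sequence, with the boundary classifier retrained or reweighted at each coarser scale, which is the scale adaptation the abstract refers to.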
3 0.81998199 86 cvpr-2013-Composite Statistical Inference for Semantic Segmentation
Author: Fuxin Li, Joao Carreira, Guy Lebanon, Cristian Sminchisescu
Abstract: In this paper we present an inference procedure for the semantic segmentation of images. Different from many CRF approaches that rely on dependencies modeled with unary and pairwise pixel or superpixel potentials, our method is entirely based on estimates of the overlap between each of a set of mid-level object segmentation proposals and the objects present in the image. We define continuous latent variables on superpixels obtained by multiple intersections of segments, then output the optimal segments from the inferred superpixel statistics. The algorithm is capable of recombining and refining initial mid-level proposals, as well as handling multiple interacting objects, even from the same class, all in a consistent joint inference framework by maximizing the composite likelihood of the underlying statistical model using an EM algorithm. In the PASCAL VOC segmentation challenge, the proposed approach obtains high accuracy and successfully handles images of complex object interactions.
4 0.79630977 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
Author: Luming Zhang, Mingli Song, Zicheng Liu, Xiao Liu, Jiajun Bu, Chun Chen
Abstract: Weakly supervised image segmentation is a challenging problem in the computer vision field. In this paper, we present a new weakly supervised image segmentation algorithm by learning the distribution of spatially structured superpixel sets from image-level labels. Specifically, we first extract graphlets from each image, where a graphlet is a small-sized graph whose nodes are superpixels and which encapsulates the spatial structure of those superpixels. Then, a manifold embedding algorithm is proposed to transform graphlets of different sizes into equal-length feature vectors. Thereafter, we use a GMM to learn the distribution of the post-embedding graphlets. Finally, we propose a novel image segmentation algorithm, called graphlet cut, that leverages the learned graphlet distribution in measuring the homogeneity of a set of spatially structured superpixels. Experimental results show that the proposed approach outperforms state-of-the-art weakly supervised image segmentation methods, and its performance is comparable to that of fully supervised segmentation models.
5 0.74991435 132 cvpr-2013-Discriminative Re-ranking of Diverse Segmentations
Author: Payman Yadollahpour, Dhruv Batra, Gregory Shakhnarovich
Abstract: This paper introduces a two-stage approach to semantic image segmentation. In the first stage a probabilistic model generates a set of diverse plausible segmentations. In the second stage, a discriminatively trained re-ranking model selects the best segmentation from this set. The re-ranking stage can use much more complex features than what could be tractably used in the probabilistic model, allowing a better exploration of the solution space than possible by simply producing the most probable solution from the probabilistic model. While our proposed approach already achieves state-of-the-art results (48.1%) on the challenging VOC 2012 dataset, our machine and human analyses suggest that even larger gains are possible with such an approach.
6 0.7494958 460 cvpr-2013-Weakly-Supervised Dual Clustering for Image Semantic Segmentation
7 0.74871284 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
8 0.74274939 29 cvpr-2013-A Video Representation Using Temporal Superpixels
9 0.73587292 217 cvpr-2013-Improving an Object Detector and Extracting Regions Using Superpixels
10 0.73225921 366 cvpr-2013-Robust Region Grouping via Internal Patch Statistics
11 0.71420729 458 cvpr-2013-Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds
12 0.69687009 281 cvpr-2013-Measures and Meta-Measures for the Supervised Evaluation of Image Segmentation
13 0.6959945 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
14 0.69143921 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
15 0.68700176 145 cvpr-2013-Efficient Object Detection and Segmentation for Fine-Grained Recognition
16 0.66131961 26 cvpr-2013-A Statistical Model for Recreational Trails in Aerial Images
17 0.63656658 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses
18 0.63059491 280 cvpr-2013-Maximum Cohesive Grid of Superpixels for Fast Object Localization
19 0.62861127 173 cvpr-2013-Finding Things: Image Parsing with Regions and Per-Exemplar Detectors
20 0.61031598 437 cvpr-2013-Towards Fast and Accurate Segmentation
topicId topicWeight
[(8, 0.119), (10, 0.147), (16, 0.018), (26, 0.068), (27, 0.013), (28, 0.011), (33, 0.294), (67, 0.077), (69, 0.076), (80, 0.017), (87, 0.085)]
simIndex simValue paperId paperTitle
Author: Yue Wu, Zuoguan Wang, Qiang Ji
Abstract: Facial feature tracking is an active area in computer vision due to its relevance to many applications. It is a nontrivial task, since faces may have varying facial expressions, poses or occlusions. In this paper, we address this problem by proposing a face shape prior model that is constructed based on the Restricted Boltzmann Machines (RBM) and their variants. Specifically, we first construct a model based on Deep Belief Networks to capture the face shape variations due to varying facial expressions for near-frontal view. To handle pose variations, the frontal face shape prior model is incorporated into a 3-way RBM model that could capture the relationship between frontal face shapes and non-frontal face shapes. Finally, we introduce methods to systematically combine the face shape prior models with image measurements of facial feature points. Experiments on benchmark databases show that with the proposed method, facial feature points can be tracked robustly and accurately even if faces have significant facial expressions and poses.
2 0.94051558 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem
Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors' ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best existing systems, outperforming other HOG-based detectors on the more deformable categories.
3 0.93934995 330 cvpr-2013-Photometric Ambient Occlusion
Author: Daniel Hauagge, Scott Wehrwein, Kavita Bala, Noah Snavely
Abstract: We present a method for computing ambient occlusion (AO) for a stack of images of a scene from a fixed viewpoint. Ambient occlusion, a concept common in computer graphics, characterizes the local visibility at a point: it approximates how much light can reach that point from different directions without getting blocked by other geometry. While AO has received surprisingly little attention in vision, we show that it can be approximated using simple, per-pixel statistics over image stacks, based on a simplified image formation model. We use our derived AO measure to compute reflectance and illumination for objects without relying on additional smoothness priors, and demonstrate state-of-the-art performance on the MIT Intrinsic Images benchmark. We also demonstrate our method on several synthetic and real scenes, including 3D printed objects with known ground truth geometry.
4 0.93321085 414 cvpr-2013-Structure Preserving Object Tracking
Author: Lu Zhang, Laurens van_der_Maaten
Abstract: Model-free trackers can track arbitrary objects based on a single (bounding-box) annotation of the object. Whilst the performance of model-free trackers has recently improved significantly, simultaneously tracking multiple objects with similar appearance remains very hard. In this paper, we propose a new multi-object model-free tracker (based on tracking-by-detection) that resolves this problem by incorporating spatial constraints between the objects. The spatial constraints are learned along with the object detectors using an online structured SVM algorithm. The experimental evaluation of our structure-preserving object tracker (SPOT) reveals significant performance improvements in multi-object tracking. We also show that SPOT can improve the performance of single-object trackers by simultaneously tracking different parts of the object.
5 0.93122858 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
Author: Horst Possegger, Sabine Sternig, Thomas Mauthner, Peter M. Roth, Horst Bischof
Abstract: Combining foreground images from multiple views by projecting them onto a common ground-plane has been recently applied within many multi-object tracking approaches. These planar projections introduce severe artifacts and constrain most approaches to objects moving on a common 2D ground-plane. To overcome these limitations, we introduce the concept of an occupancy volume exploiting the full geometry and the objects' center of mass and develop an efficient algorithm for 3D object tracking. Individual objects are tracked using the local mass density scores within a particle filter based approach, constrained by a Voronoi partitioning between nearby trackers. Our method benefits from the geometric knowledge given by the occupancy volume to robustly extract features and train classifiers on-demand, when volumetric information becomes unreliable. We evaluate our approach on several challenging real-world scenarios including the public APIDIS dataset. Experimental evaluations demonstrate significant improvements compared to state-of-the-art methods, while achieving real-time performance.
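One way to picture the particle-filter step described in the abstract above is a reweighting of 3D position hypotheses by the local mass density of the occupancy volume. The sketch below is a generic illustration under that reading; the volume representation, the coordinate handling, and the exponential weighting are assumptions made for the example, not the authors' method.

import numpy as np

def reweight_particles(particles, occupancy_volume, voxel_size, temperature=1.0):
    # particles: (N, 3) array of hypothesized object-center positions, expressed
    #     in the same world coordinates as the occupancy volume's origin
    # occupancy_volume: 3-D array of occupancy / mass values
    # voxel_size: edge length of one voxel in world units
    idx = np.clip((particles / voxel_size).astype(int), 0,
                  np.array(occupancy_volume.shape) - 1)
    density = occupancy_volume[idx[:, 0], idx[:, 1], idx[:, 2]]
    weights = np.exp(density / temperature)  # higher local mass -> higher weight
    return weights / weights.sum()

In a full tracker these weights would feed a resampling step, and the Voronoi constraint mentioned in the abstract would suppress particles that stray into a neighboring tracker's cell.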
6 0.93036795 325 cvpr-2013-Part Discovery from Partial Correspondence
7 0.93025333 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
8 0.92954999 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking
9 0.92921937 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
10 0.92835987 445 cvpr-2013-Understanding Bayesian Rooms Using Composite 3D Object Models
11 0.92794609 311 cvpr-2013-Occlusion Patterns for Object Class Detection
12 0.92774719 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
13 0.92761374 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
14 0.92690516 61 cvpr-2013-Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics
15 0.92690378 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
same-paper 16 0.92687553 370 cvpr-2013-SCALPEL: Segmentation Cascades with Localized Priors and Efficient Learning
17 0.92529315 331 cvpr-2013-Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis
18 0.92499489 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning
19 0.92495227 372 cvpr-2013-SLAM++: Simultaneous Localisation and Mapping at the Level of Objects
20 0.92480701 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image