Author: Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

Abstract: Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, and contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we “plug in” human subjects for each of the various components in a state-of-the-art conditional random field (CRF) model on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much “head room” there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted an in-depth analysis of the human-generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MSRC dataset.

1 Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, and contextual reasoning. [sent-5, score-0.974]

2 In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. [sent-6, score-0.43]

3 Towards this goal, we “plug in” human subjects for each of the various components in a state-of-the-art conditional random field (CRF) model on the MSRC dataset. [sent-7, score-0.38]

4 One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! [sent-9, score-0.657]

5 Clearly, other image understanding tasks like object detection [10], scene recognition [38], contextual reasoning among objects [29], and pose estimation [39] can aid semantic segmentation. [sent-14, score-0.581]

6 Studies have shown that humans can effectively leverage contextual information from the entire scene to recognize objects in low-resolution images that cannot be recognized in isolation [35]. [sent-16, score-0.487]

7 Recent works [12, 40, 16, 23] have thus pushed on holistic scene understanding models for, among other things, improved semantic segmentation. [sent-18, score-0.529]

8 A holistic scene understanding approach to semantic segmentation consists of a conditional random field (CRF) model that jointly reasons about: (a) classification of local patches (segmentation), (b) object detection, (c) shape analysis, (d) scene recognition and (e) contextual reasoning. [sent-36, score-0.914]

9 In this paper we analyze the relative importance of each of these components by building an array of hybrid human-machine CRFs where each component is performed by a machine (default), replaced by human subjects or ground truth, or removed altogether (top). [sent-37, score-0.658]

10 We analyze the recent and most comprehensive holistic scene understanding model of Yao et al. [sent-45, score-0.4]

11 It is a conditional random field (CRF) that models the interplay between segmentation and a variety of components such as local super-pixel appearance, object detection, scene recognition, shape analysis, class co-occurrence, and compatibility of classes with scene categories. [sent-47, score-0.491]

12 To gain insights into the relative importance of these different factors or tasks, we isolate each one, and substitute a machine with a human for that task, keeping the rest of the model intact (Figure 1). [sent-48, score-0.398]

13 Hence, the use of human subjects in our studies is key, as it gives us a feasible point of what can be done. [sent-53, score-0.454]

14 However when plugged into the holistic model, human potentials provide a significant boost in performance. [sent-58, score-0.86]

15 Excited by this insight, we conducted a thorough analysis of the human generated super-pixel potentials to identify precisely how they differ from existing machine potentials. [sent-63, score-0.756]

16 Our analysis inspired a rather simple modification of the machine potentials which resulted in a significant increase of 2. [sent-64, score-0.61]

17 Related Work. Holistic Scene Understanding: The key motivation behind holistic scene understanding, going back to the seminal [...] (Footnote 1: Of course, ground truth segmentation annotations are themselves generated by humans, but by viewing the whole image and leveraging information from the entire scene.) [sent-69, score-0.372]

18 In this paper, orthogonal to these advances, we propose the use of human subjects to understand the relative importance of various recognition tasks in aiding semantic segmentation. [sent-85, score-0.639]

19 In contrast, we are interested in semantic segmentation which involves identifying the semantic category of each pixel in the image. [sent-90, score-0.451]

20 [...] that closely mimic existing holistic computational models for semantic segmentation in order to identify bottlenecks, and better guide future research efforts. [sent-101, score-0.413]

21 In contrast, in this work, we are interested in systematically analyzing the roles played by several high- and mid-level tasks such as grouping, shape analysis, scene recognition, object detection and contextual interactions in holistic scene understanding models for semantic segmentation. [sent-104, score-0.935]

22 ...gies of the human studies and machine experiments, as well as the findings and insights, are all novel. [sent-157, score-0.518]

23 This allows us to conveniently replace the machine potentials with human responses: after all, we cannot quite require humans to be submodular! [sent-166, score-0.915]

24 The problem of holistic scene understanding is formulated as that of inference in a CRF. [sent-169, score-0.367]

25 The random field contains variables representing the class labels of image segments at two levels in a segmentation hierarchy: super-pixels and larger segments. [sent-170, score-0.463]

26 The segments and super-segments reason about the semantic class labels to be assigned to each pixel in the image. [sent-174, score-0.446]

27 A shape prior is associated with these nodes encouraging segments that respect this prior to take on corresponding class labels. [sent-179, score-0.355]

28 Before we provide details about how the various machine potentials are computed, we first discuss the dataset we work with to ground further descriptions. [sent-193, score-0.566]

29 The contextual interactions are also quite skewed [7] making it less interesting for holistic scene understanding. [sent-203, score-0.43]

30 Machine CRF Potentials: We now describe the machine potentials we employed. [sent-214, score-0.566]

31 Segments and super-segments: We utilize UCM [1] to create our segments and super-segments as it returns a small number of segments that tend to respect the true object boundaries well. [sent-216, score-0.544]

32 We use the output of the modified TextonBoost [33] in [20] to get pixel-wise potentials and average those within the segments and super-segments to get the unary potentials. [sent-221, score-0.727]
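The averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `segment_unaries`, the list-based data layout, and the toy scores are all assumptions made for the example.

```python
from collections import defaultdict

def segment_unaries(pixel_potentials, segment_ids):
    """Average per-pixel class scores over the pixels of each segment.

    pixel_potentials: one list of class scores per pixel (hypothetical layout).
    segment_ids: the segment id of each pixel, in the same order.
    Returns {segment_id: [mean score per class]}.
    """
    sums = {}
    counts = defaultdict(int)
    for scores, seg in zip(pixel_potentials, segment_ids):
        if seg not in sums:
            sums[seg] = [0.0] * len(scores)
        for c, s in enumerate(scores):
            sums[seg][c] += s
        counts[seg] += 1
    return {seg: [v / counts[seg] for v in total] for seg, total in sums.items()}

# Toy data: three pixels, two classes, two segments.
pixels = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
unaries = segment_unaries(pixels, [0, 0, 1])
```

The same routine would apply to super-segments by passing super-segment ids instead of segment ids.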

33 Following [18], we connect the two levels via a pairwise Pn potential that encourages segments and super-segments to take the same label. [sent-222, score-0.438]

34 We also employ pairwise potentials between zi and zk that capture co-occurrence statistics of pairs of classes. [sent-224, score-0.605]

35 A binary variable bi is used for each detection and it is connected to the binary class variable, zci, where ci is the class of the detector that fired for the i-th hypothesis. [sent-229, score-0.357]

36 Shape: Shape potentials are incorporated in the model by connecting the binary detection variables bi to all segments xj inside the detection’s bounding box. [sent-230, score-0.91]

37 Scene and scene-class co-occurrence: We train a classifier [38] to predict each of the scene types, and use its confidence to form the unary potential for the scene variable. [sent-234, score-0.449]

38 The scene node connects to each binary class variable zi via a pairwise potential which is defined based on the co-occurrence statistics of the training data, i. [sent-235, score-0.568]

39 More than 500 subjects participated in our studies that involved ∼300,000 crowd-sourced tasks, making the results obtained likely to be fairly stable across a different sampling of subjects. [sent-246, score-0.36]

40 Segments and Super-segments: The study involves having human subjects classify segments into one of the semantic categories. [sent-247, score-0.747]

41 However, showing human subjects all the information that the machine uses would lead to nearly 100% classification accuracy by the subjects, leaving us with little insight to gain. [sent-254, score-0.66]

42 More importantly, a 200 x 200 window occupies nearly 60% of the image, resulting in humans potentially using holistic scene understanding while classifying the segments. [sent-255, score-0.524]

43 To this goal, the discrepancy in information shown to humans and machines is not a concern, as long as humans are not shown more information than the machine has access to. [sent-259, score-0.533]

44 showing subjects a collection of segments and asking them to click on all the ones likely to belong to a certain class, or allowing a subject to select only one category per segment, etc. [sent-262, score-0.583]

45 Our experiment involved having subjects label all segments and super-segments from the MSRC dataset containing more than 500 pixels. [sent-265, score-0.497]

46 Figure 4 shows examples of segmentations obtained by assigning each segment to the class with most human votes. [sent-272, score-0.38]

47 Assigning each segment to the class with the highest number of human votes achieves an accuracy of 72. [sent-274, score-0.421]
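A majority-vote labeling of this kind can be sketched as below. The snippet is illustrative only: the names `majority_label` and `segment_accuracy` are hypothetical, the votes are made up, and accuracy is computed per segment, which is an assumption (the source does not say here whether accuracy is counted per segment or per pixel).

```python
from collections import Counter

def majority_label(votes):
    """Return the most-voted class; ties go to the class seen first."""
    return Counter(votes).most_common(1)[0][0]

def segment_accuracy(votes_per_segment, gt_labels):
    """Fraction of segments whose majority-vote label matches ground truth."""
    predictions = [majority_label(v) for v in votes_per_segment]
    correct = sum(p == g for p, g in zip(predictions, gt_labels))
    return correct / len(gt_labels)

# Toy data: three segments, three votes each.
votes = [["cow", "cow", "sheep"], ["sky", "water", "water"], ["grass", "tree", "tree"]]
truth = ["cow", "water", "grass"]
acc = segment_accuracy(votes, truth)
```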

48 The C dimensional human unary potential for a (super)segment is proportional to the number of times subjects selected each class, normalized to sum to 1. [sent-281, score-0.627]

49 We set the potentials for the unlabeled (smaller than 500 pixels) (super)segments to be uniform. [sent-282, score-0.414]
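The construction of the human unary potential, including the uniform fallback for unlabeled small segments, might look like the sketch below. The function name `human_unary` and the treatment of a segment with exactly 500 pixels as unlabeled are assumptions; the source only says that segments with more than 500 pixels were labeled.

```python
def human_unary(vote_counts, num_pixels, min_pixels=500):
    """Turn per-class vote counts into a potential that sums to 1.

    Segments that were not shown to subjects (assumed here: at most
    `min_pixels` pixels) or that received no votes get a uniform potential.
    """
    num_classes = len(vote_counts)
    total = sum(vote_counts)
    if num_pixels <= min_pixels or total == 0:
        return [1.0 / num_classes] * num_classes
    return [v / total for v in vote_counts]

labeled = human_unary([3, 1, 0], num_pixels=800)      # votes normalized
unlabeled = human_unary([0, 0, 0, 0], num_pixels=100)  # uniform fallback
```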

50 For all pairs of categories, we then ask subjects which category is more likely to occur in an image from the collection. [sent-284, score-0.369]

51 We build the class unary potentials by counting how often each class was preferred over all other classes. [sent-285, score-0.648]
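One way to read the counting step is sketched below: each pairwise answer is recorded as a (preferred, other) tuple, and a class's score is its share of all wins. The function name, the tuple encoding, and the normalization to sum to 1 are assumptions made for illustration; the source states only that preferences were counted.

```python
def class_unary_from_preferences(classes, preferences):
    """Score each class by how often subjects preferred it in pairwise questions.

    preferences: list of (preferred_class, other_class) answers (hypothetical format).
    Returns {class: share of wins}; uniform if there are no answers.
    """
    wins = {c: 0 for c in classes}
    for preferred, _other in preferences:
        wins[preferred] += 1
    total = sum(wins.values())
    if total == 0:
        return {c: 1.0 / len(classes) for c in classes}
    return {c: wins[c] / total for c in classes}

# Toy answers over three classes.
answers = [("sky", "boat"), ("sky", "cow"), ("cow", "boat"), ("sky", "boat")]
scores = class_unary_from_preferences(["sky", "cow", "boat"], answers)
```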

52 Class-Class Co-occurrence: To obtain the human co-occurrence potentials we ask subjects the following question for all triplets of categories {zi, zj, zk}: “Which scenario is more likely to occur in an image? [sent-292, score-0.968]

53 We use the Chow-Liu algorithm on this matrix, as was used in [40] on the class co-occurrence potentials, to obtain the tree structure, where the edges connect highly co-occurring nodes. [sent-300, score-0.493]
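The tree-building step of the Chow-Liu algorithm amounts to finding a maximum-weight spanning tree over pairwise scores. The sketch below shows that step with Kruskal's method and a small union-find; the weight matrix is a made-up stand-in for the co-occurrence matrix, and the function name is hypothetical.

```python
def max_spanning_tree(weights):
    """Return the edges (i, j) of a maximum-weight spanning tree.

    weights: symmetric square matrix (list of lists) of pairwise scores.
    Uses Kruskal's algorithm: take edges from heaviest to lightest and
    keep an edge only if it joins two different components.
    """
    n = len(weights)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the component root, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        ((weights[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    tree = []
    for _w, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j))
    return tree

# Toy scores for three classes: 0-1 and 1-2 co-occur most strongly.
toy_weights = [[0, 5, 1], [5, 0, 3], [1, 3, 0]]
tree_edges = max_spanning_tree(toy_weights)
```

In the classic formulation the edge weights are pairwise mutual information values; here any symmetric score matrix can be substituted.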

54 As a crude proxy, we showed subjects images inside ground truth object bounding boxes and asked them to recognize the object. [sent-304, score-0.398]

55 Shape: We showed 5 subjects the segment boundaries in the ground truth object bounding boxes along with its category label and contextual information from the rest of the scene. [sent-307, score-0.674]

56 Using the interface of [14], subjects were asked to trace a subset of the segment boundaries to match their expected shape of the object. [sent-309, score-0.488]

57 This shows that humans cannot decipher the shape of an object from the UCM segment boundaries better than an automatic approach. [sent-314, score-0.408]

58 Scene Unary: We ask human subjects to classify an image into one of the 21 scene categories used in [40] (see Figure 2). [sent-316, score-0.557]

59 Subjects were allowed to select [...] (Footnote 5: We showed subjects contextual information around the bounding box because, without it, humans were unable to recognize the object category reliably using only the boundaries of the segments in the box (54% accuracy).) [sent-320, score-1.051]

60 Humans clearly outperform the machine at scene recognition, but the question of interest is whether this will translate to improved semantic segmentation performance. [sent-345, score-0.501]

61 Scene-Class Co-occurrence: Similar to the class-class experiment, subjects were asked which object category is more likely to be present in the scene. [sent-346, score-0.396]

62 Ground-truth Potentials: In addition to human potentials (which provide a feasible point), we are also interested in establishing an upper-bound on the effect each subtask can have on segmentation performance. [sent-349, score-0.707]

63 We do so by introducing ground truth (GT) potentials into the model. [sent-350, score-0.414]

64 For segments and super-segments we simply set the value of the potential to be 1 for the segment GT label and 0 otherwise, similarly for scene and class unary potentials. [sent-352, score-0.864]
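The ground-truth potential described above is a one-hot vector over the classes. A minimal sketch, with a hypothetical function name and integer class indices assumed:

```python
def gt_potential(gt_label, num_classes):
    """One-hot potential: 1 for the ground-truth class index, 0 elsewhere."""
    return [1.0 if c == gt_label else 0.0 for c in range(num_classes)]

# A segment whose ground-truth class has index 2 out of 4 classes.
potential = gt_potential(2, 4)
```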

65 Experiments with Human-Machine CRFs: We now describe the results of inserting the human potentials in the CRF model. [sent-356, score-0.573]

66 We also investigated how plugging in GT potentials or discarding certain tasks altogether affects segmentation performance on the MSRC dataset. [sent-357, score-0.604]

67 Class presence, class-class co-occurrence, and the scene-class potentials have negligible impact on the performance of semantic segmentation. [sent-367, score-0.575]

68 GT shape also improves performance, but as discussed earlier, we find that humans are unable to instantiate this potential using the UCM segment boundaries. [sent-370, score-0.541]

69 One human potential that does improve performance is the unary segment potential. [sent-372, score-0.51]

70 This is quite striking since human labeling accuracy of segments was substantially worse than machine’s (72. [sent-373, score-0.47]

71 Intrigued by this, we performed detailed analysis to identify properties of the human potential that are leading to this boost in performance. [sent-379, score-0.418]

72 Resultant insights provided us concrete guidance to improve machine potentials and hence state-of-the-art accuracies. [sent-380, score-0.653]

73 Scale: We noticed that the machine did not have access to the scale of the segments while humans did. [sent-382, score-0.576]

74 So we added a feature that captured the size of a segment relative to the image and re-trained the unary machine potentials. [sent-383, score-0.37]
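A scale feature of this kind could be computed as the fraction of image pixels covered by the segment and appended to the existing feature vector. The sketch below assumes exactly that definition; the function names and the example numbers are hypothetical, and the source does not specify how the feature was encoded.

```python
def relative_size(num_segment_pixels, image_width, image_height):
    """Fraction of the image area covered by the segment."""
    return num_segment_pixels / (image_width * image_height)

def add_scale_feature(features, num_segment_pixels, image_width, image_height):
    """Return a new feature list with the relative size appended."""
    return list(features) + [relative_size(num_segment_pixels, image_width, image_height)]

# A 6400-pixel segment in a 320x200 image covers one tenth of the image.
augmented = add_scale_feature([0.2, 0.7], 6400, 320, 200)
```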

75 Over-fitting: The machine segment unaries are trained on the same images as the CRF parameters, potentially leading to over-fitting. [sent-387, score-0.436]

76 Ranking of the correct label: It is clear that the highest ranked label of the human potential is wrong more often than the highest ranked label of the machine potential (hence the lower accuracy of the former outside the model). [sent-395, score-0.772]

77 But we wondered if perhaps even when wrong, the human potential gave a high enough score to the correct label making it revivable when used in the CRF, while the machine was more “blatantly” wrong. [sent-396, score-0.552]

78 We found that among the misclassified segments, the rank of the correct label using human potentials was 4. [sent-397, score-0.612]
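The rank statistic discussed above can be computed as in the following sketch: for each misclassified segment, find the position of the correct label when classes are sorted by decreasing potential, then average. Function names and the toy potentials are assumptions; ties are broken in favor of the lower class index.

```python
def rank_of_label(potential, label):
    """1-based rank of `label` when classes are sorted by decreasing score."""
    order = sorted(range(len(potential)), key=lambda c: -potential[c])
    return order.index(label) + 1

def mean_rank_when_wrong(potentials, gt_labels):
    """Average rank of the correct label over misclassified segments only."""
    ranks = [rank_of_label(p, g) for p, g in zip(potentials, gt_labels)]
    wrong = [r for r in ranks if r > 1]
    return sum(wrong) / len(wrong) if wrong else None

# Toy data: the first segment is classified correctly, the other two are not.
toy_potentials = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5]]
toy_truth = [0, 2, 0]
mean_rank = mean_rank_when_wrong(toy_potentials, toy_truth)
```

A lower mean rank for one source of potentials would indicate that its mistakes leave the correct label closer to the top, which is the property the comparison above is probing.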

79 Uniform potentials for small segments: Recall that we did not have human subjects label the segments smaller than 500 pixels and assigned a uniform potential to those segments. [sent-400, score-1.241]

80 We suspected that ignoring the small (likely to be misclassified) segments may give the human potential an advantage in the model. [sent-402, score-0.567]

81 So we replaced the machine potentials for small segments with a uniform distribution over the categories. [sent-403, score-0.844]

82 As a follow-up, we also weighted the machine potentials by the size of the corresponding segment. [sent-406, score-0.566]
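The two manipulations of the machine potentials described above can be sketched as follows. Both functions are illustrative assumptions: the 500-pixel threshold is borrowed from the human study, and multiplying the potential by the segment's share of the image is only one plausible reading of "weighted by the size", which the source does not define.

```python
def uniform_for_small(potential, num_pixels, min_pixels=500):
    """Replace the potential of a small segment with a uniform distribution."""
    if num_pixels <= min_pixels:
        return [1.0 / len(potential)] * len(potential)
    return list(potential)

def weight_by_size(potential, num_pixels, image_pixels):
    """Scale every class score by the fraction of the image the segment covers."""
    return [p * num_pixels / image_pixels for p in potential]

small = uniform_for_small([0.7, 0.2, 0.1, 0.0], num_pixels=120)
large = uniform_for_small([0.7, 0.2, 0.1, 0.0], num_pixels=900)
weighted = weight_by_size([0.5, 0.5], num_pixels=250, image_pixels=1000)
```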

83 We also replicated the sparsity of human potentials in the machine potentials, but this did not improve performance by much (77. [sent-415, score-0.725]

84 Complementarity: To get a deeper understanding as to why human segment potentials significantly increase performance when used in the model, we performed a variety of additional CRF experiments with hybrid potentials. [sent-417, score-0.861]

85 These included having human (H) or machine (M) potentials for segments (S) or super-segments (SS) or both, with or without the Pn potential in the model. [sent-418, score-1.133]

86 The last two rows correspond to the case where both human and machine segment potentials are used together at the same level. [sent-420, score-0.867]

87 But when the human and machine potentials are placed at different levels in the model (rows 3 and 4), not having a Pn potential (and thus losing connection between the two levels) significantly hurts performance. [sent-422, score-0.896]

88 This indicates that even though human potentials are not more accurate than machine potentials, when both human and machine potentials interact, there is a significant boost in performance, demonstrating the complementary nature of the two. [sent-423, score-1.554]

89 So we hypothesized that the types of mistakes that the machine and humans make may be different. [sent-446, score-0.411]

90 The resultant confusion matrix was more similar to that of human subjects (Figure 7(c)). [sent-456, score-0.49]

91 We re-computed the segment unaries and plugged them into the model in addition to the original unaries that used large windows. [sent-458, score-0.503]

92 Notice that the improvement provided by the entire CRF model over the original machine segment unaries alone was 3% (from 74. [sent-467, score-0.436]

93 While a fairly straightforward change in the training of machine unaries led to this improvement in performance, we note that the insight to do so was provided by our use of humans to “debug” the state-of-the-art model. [sent-470, score-0.451]

94 9% of segments, while humans assign different labels to 12% of the segments within a supersegment. [sent-473, score-0.394]

95 GT labels for segments when plugged into the CRF provide an accuracy of 94% (and not 100% because decisions are made at the segment level which are not perfect). [sent-478, score-0.497]

96 Plugging in human potentials for all the components gives us an accuracy of 89. [sent-481, score-0.614]

97 Our analysis hinges on the use of human subjects to produce the different potentials in the model. [sent-489, score-0.794]

98 One of our findings was that human responses to local segments in isolation, while being less accurate than machines’, provide complementary information that the CRF model can effectively exploit. [sent-491, score-0.537]

99 We explored various avenues to precisely characterize this complementary nature, which resulted in a novel machine potential that significantly improves accuracy over the state-of-the-art. [sent-492, score-0.455]

100 Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. [sent-771, score-0.404]

Author: Hu Ding, Branislav Stojkovic, Ronald Berezney, Jinhui Xu

Abstract: Computing accurate and robust organizational patterns of chromosome territories inside the cell nucleus is critical for understanding several fundamental genomic processes, such as co-regulation of gene activation, gene silencing, X chromosome inactivation, and abnormal chromosome rearrangement in cancer cells. The usage of advanced fluorescence labeling and image processing techniques has enabled researchers to investigate interactions of chromosome territories at large spatial resolution. The resulting high volume of generated data demands high-throughput and automated image analysis methods. In this paper, we introduce a novel algorithmic tool for investigating association patterns of chromosome territories in a population of cells. Our method takes as input a set of graphs, one for each cell, containing information about spatial interaction of chromosome territories, and yields a single graph that contains essential information for the whole population and stands as its structural representative. We formulate this combinatorial problem as a semi-definite program and present novel techniques to efficiently solve it. We validate our approach on both artificial and real biological data; the experimental results suggest that our approach yields a near-optimal solution, and can handle large-size datasets, which are significant improvements over existing techniques.

2 0.94928235 395 cvpr-2013-Shape from Silhouette Probability Maps: Reconstruction of Thin Objects in the Presence of Silhouette Extraction and Calibration Error

Author: Amy Tabb

Abstract: This paper considers the problem of reconstructing the shape of thin, texture-less objects such as leafless trees when there is noise or deterministic error in the silhouette extraction step or there are small errors in camera calibration. Traditional intersection-based techniques such as the visual hull are not robust to error because they penalize false negative and false positive error unequally. We provide a voxel-based formalism that penalizes false negative and positive error equally, by casting the reconstruction problem as a pseudo-Boolean minimization problem, where voxels are the variables of a pseudo-Boolean function and are labeled occupied or empty. Since the pseudo-Boolean minimization problem is NP-Hard for nonsubmodular functions, we developed an algorithm for an approximate solution using local minimum search. Our algorithm treats input binary probability maps (in other words, silhouettes) or continuously-valued probability maps identically, and places no constraints on camera placement. The algorithm was tested on three different leafless trees and one metal object where the number of voxels is 54.4 million (voxel sides measure 3.6 mm). Results show that our approach reconstructs the complicated branching structure of thin, texture-less objects in the presence of error where intersection-based approaches currently fail.

3 0.94663095 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem

Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors' ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best existing systems, outperforming other HOG-based detectors on the more deformable categories.

4 0.94100296 311 cvpr-2013-Occlusion Patterns for Object Class Detection

Author: Bojan Pepikj, Michael Stark, Peter Gehler, Bernt Schiele

Abstract: Despite the success of recent object class recognition systems, the long-standing problem of partial occlusion remains a major challenge, and a principled solution is yet to be found. In this paper we leave the beaten path of methods that treat occlusion as just another source of noise; instead, we include the occluder itself into the modelling, by mining distinctive, reoccurring occlusion patterns from annotated training data. These patterns are then used as training data for dedicated detectors of varying sophistication. In particular, we evaluate and compare models that range from standard object class detectors to hierarchical, part-based representations of occluder/occludee pairs. In an extensive evaluation we derive insights that can aid further developments in tackling the occlusion challenge.

same-paper 5 0.93891686 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

Author: Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, Devi Parikh

Abstract: Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, and contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this goal, we “plug in” human subjects for each of the various components in a state-of-the-art conditional random field (CRF) model on the MSRC dataset. Comparisons among various hybrid human-machine CRFs give us indications of how much “head room” there is to improve segmentation by focusing research efforts on each of the tasks. One of the interesting findings from our slew of studies was that human classification of isolated super-pixels, while being worse than current machine classifiers, provides a significant boost in performance when plugged into the CRF! Fascinated by this finding, we conducted an in-depth analysis of the human-generated potentials. This inspired a new machine potential which significantly improves state-of-the-art performance on the MSRC dataset.

6 0.93851721 325 cvpr-2013-Part Discovery from Partial Correspondence

7 0.93798411 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

8 0.93778569 414 cvpr-2013-Structure Preserving Object Tracking

9 0.93683225 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

10 0.93631768 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

11 0.93579358 204 cvpr-2013-Histograms of Sparse Codes for Object Detection

12 0.93541789 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

13 0.93532187 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

14 0.93500513 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

15 0.93281603 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection

16 0.93239343 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation

17 0.93184894 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image

18 0.93128902 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation

19 0.93090057 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds

20 0.93080884 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking