iccv iccv2013 iccv2013-375 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Phillip Isola, Ce Liu
Abstract: To quickly synthesize complex scenes, digital artists often collage together visual elements from multiple sources: for example, mountains from New Zealand behind a Scottish castle with wisps of Saharan sand in front. In this paper, we propose to use a similar process in order to parse a scene. We model a scene as a collage of warped, layered objects sampled from labeled, reference images. Each object is related to the rest by a set of support constraints. Scene parsing is achieved through analysis-by-synthesis. Starting with a dataset of labeled exemplar scenes, we retrieve a dictionary of candidate object segments that match a query image. We then combine elements of this set into a “scene collage” that explains the query image. Beyond just assigning object labels to pixels, scene collaging produces a lot more information such as the number of each type of object in the scene, how they support one another, the ordinal depth of each object, and, to some degree, occluded content. We exploit this representation for several applications: image editing, random scene synthesis, and image-to-anaglyph.
Reference: text
sentIndex sentText sentNum sentScore
1 To quickly synthesize complex scenes, digital artists often collage together visual elements from multiple sources: for example, mountains from New Zealand behind a Scottish castle with wisps of Saharan sand in front. [sent-2, score-0.777]
2 We model a scene as a collage of warped, layered objects sampled from labeled, reference images. [sent-4, score-0.901]
3 Starting with a dataset of labeled exemplar scenes, we retrieve a dictionary of candidate object segments that match a query image. [sent-7, score-0.396]
4 We then combine elements of this set into a “scene collage” that explains the query image. [sent-8, score-0.831]
5 Beyond just assigning object labels to pixels, scene collaging produces a lot more information such as the number of each type of object in the scene, how they support one another, the ordinal depth of each object, and, to some degree, occluded content. [sent-9, score-0.415]
6 We exploit this representation for several applications: image editing, random scene synthesis, and image-to-anaglyph. [sent-10, score-0.151]
7 However, many existing scene parsing approaches represent an image simply as a 2D array of pixel labels (e. [sent-16, score-0.249]
8 In typical images, huge swaths of scene structure are occluded from view. [sent-19, score-0.173]
9 To solve these problems, we propose a novel scene model, in which we represent discrete semantic objects on separate layers. [sent-21, score-0.208]
10 While there are many ways by which an artist can synthesize a scene, one of the quickest and easiest is to collage together the image out of found pieces. [sent-23, score-0.756]
11 Following this collaging approach, we model a scene as a collage of object segments. [sent-24, score-1.119]
12 "#$# "&''()# Figure 1: Top: We parse an input image (left) by recombining elements of a labeled dictionary of scenes (middle) to form a collage (right). [sent-26, score-0.936]
13 Representing discrete objects leads to easy scene editing (left; one building may be swapped for another). [sent-28, score-0.28]
14 Representing layers gives us a rough estimate of depth (center; please view with anaglyph glasses). [sent-29, score-0.255]
15 The statistics of features in the segments from the database are used to match regions in the query image. [sent-32, score-0.325]
16 Given a query image, we infer a collage through analysis-by-synthesis, building up an explanation that both generates the query’s appearance and preserves structural properties of the examples from which the collage is pieced together. [sent-33, score-1.626]
17 Many applications, including image editing, random scene synthesis, and rough image-to-anaglyph, become available once we have inferred a scene collage (Figure 1, bottom row). [sent-34, score-1.103]
18 Figure 2: We model a scene as a layer-world collage of objects. [sent-40, score-0.831]
19 For example, the scenegraph above indicates that the trees are supported by the field. [sent-42, score-0.364]
20 In computer graphics, several approaches synthesize images by compositing layers (e. [sent-45, score-0.16]
21 However, these applications rely on human input and their goal is image synthesis whereas our system aims at both scene analysis and scene synthesis. [sent-48, score-0.328]
22 In the realm of scene parsing, Guo and Hoiem have recently attempted to infer occluded scene content [7]. [sent-49, score-0.359]
23 The LabelMe3D algorithm of Russell and Torralba infers the depth ordering of segments in a scene, but only with the aid of extensive human annotation at test time [16]. [sent-51, score-0.207]
24 [15] also developed an image segmentation algorithm that involves matching a query image to an “image composite”, that is, pieces of similar images stitched together. [sent-54, score-0.154]
25 Recently, several example-based methods have been applied to scene parsing with promising results [13], [22]. [sent-59, score-0.249]
26 Other object-based methods have also been proposed: [25] and [21] used object detectors, and [14] showed promising qualitative results at scene parsing with object segments. [sent-61, score-0.341]
27 Our scene collage representation naturally consists of geometric relationships such as occlusion and depth ordering, and could be further extended to a full 3D model. [sent-63, score-0.973]
28 Furthermore, our polygon-based layer representation is user-friendly; namely, users can easily intervene with the layer world by translating and scaling the polygons or dragging each individual control point. [sent-64, score-0.281]
29 In addition, we introduce a nonparametric scene grammar for searching a space of reasonable scenes. [sent-65, score-0.24]
30 In particular, we model each scene with a scenegraph of object interrelations. [sent-66, score-0.536]
31 These scenegraphs provide scene-level context, which guides how we fit exemplar segments into a scene. [sent-67, score-0.501]
32 Scene model: We model a scene as a collage of transformed exemplar object segments. [sent-69, score-0.992]
33 Each exemplar object is a labeled object segment from a dictionary, L, of annotated images. [sent-70, score-0.247]
34 These segments consist of the object's class c˜ℓ, a geometric mask, Q˜ℓ, which specifies the object's silhouette, image pixels I˜ℓ, and an appearance model g˜ℓ (throughout this paper we use ˜· to mark variables that refer to information in our dictionary). [sent-71, score-0.197]
35 A scene collage consists of transformed versions of the dictionary segments. [sent-72, score-0.948]
36 The indices into the dictionary for the segments used in a collage are denoted as ℓ ∈ L. [sent-73, score-0.968]
37 With an object's transformation parameters denoted as θℓ, we transform the exemplar segment masks as Qℓ = T(Q˜ℓ; θℓ). [sent-74, score-0.29]
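As a concrete illustration of the transform T, here is a minimal sketch (not the authors' code) of warping a binary exemplar mask into the query frame; the parameterization θℓ = (tx, ty, s) and the use of scipy's affine_transform are assumptions made for illustration only.

```python
# Sketch of Q_l = T(Q~_l; theta_l) for a translation-and-scale transform.
import numpy as np
from scipy.ndimage import affine_transform

def transform_mask(mask_tilde, tx, ty, s, out_shape):
    """Scale a binary exemplar mask by s and translate it by (tx, ty) pixels."""
    # affine_transform maps output coords to input coords: in = A @ out + offset
    A = np.array([[1.0 / s, 0.0], [0.0, 1.0 / s]])
    offset = np.array([-ty / s, -tx / s])
    warped = affine_transform(mask_tilde.astype(float), A, offset=offset,
                              output_shape=out_shape, order=0)
    return warped > 0.5

# Example: place a 50x50 exemplar mask into a 256x256 query frame,
# doubled in size and shifted 40 pixels right and down.
Q_tilde = np.zeros((50, 50), dtype=bool)
Q_tilde[10:40, 10:40] = True
Q = transform_mask(Q_tilde, tx=40, ty=40, s=2.0, out_shape=(256, 256))
```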
38 A scene collage also consists of a set of semantic object relationships represented as a scenegraph, S (Figure 2, right). [sent-78, score-0.983]
39 The scenegraph, S, provides context for each object in the scene. [sent-79, score-0.41]
40 Treating the dictionary as deterministic, we model the probability of a scene collage just in terms of random variables for the dictionary indices used, the transformation and layering parameters, and the scenegraph relationships: X = {L, θ, z, S}. [sent-82, score-1.497]
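A minimal sketch of the state such a collage X = {L, θ, z, S} would carry; the field names and the (tx, ty, s) transform parameterization are illustrative assumptions, not taken from the paper's code.

```python
# Sketch of the random variables in a scene collage.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class SceneCollage:
    segment_ids: List[int] = field(default_factory=list)                 # L: indices into the dictionary
    transforms: Dict[int, Tuple[float, float, float]] = field(default_factory=dict)  # theta: (tx, ty, s) per segment
    layers: Dict[int, int] = field(default_factory=dict)                 # z: layer index per segment
    scenegraph: Dict[int, Optional[int]] = field(default_factory=dict)   # S: child -> supporting parent (at most one)
```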
41 Scenegraphs: We use object interrelationships to add context to our scene representation. [sent-86, score-0.198]
42 We represent these relationships as a graph of support constraints: if object A physically supports object B, then B is a child of A. [sent-87, score-0.255]
43 At most one parent is assigned to each child. Figure 3 (query image, retrieved scenes, segment transformation and recombination, scene collage): from these images, we select a set of candidate object segments for use in a collage (bottom left). [sent-89, score-1.899]
44 We align each candidate segment onto the query image (column “Translation potential” depicts the cost of centering the segment at each point in the image, with red being low cost; column “Aligned onto query” shows the aligned mask). [sent-90, score-0.262]
45 Finally, we choose a set of segments that fit together semantically and explain the appearance of the query well. [sent-92, score-0.197]
46 These segments are layered together to form a “scene collage” (far right). [sent-93, score-0.213]
47 Our representation of object relationships is especially similar to Russell et al. [sent-99, score-0.155]
48 While Russell et al. demonstrated that this representation is useful for inferring scene structure from human annotations, here we attempt to show its utility toward inferring structure in unlabeled images. [sent-102, score-0.233]
49 The authors of [8] also explored a similar representation, in which objects in a scene are related by a graph of support constraints. [sent-104, score-0.206]
50 Each object in a scene is associated with a function g˜ℓ, which is a generative model of appearance. [sent-110, score-0.197]
51 Intuitively, the object is filled with visual stuff and we model that stuff as a spatially varying distribution over features we expect to see in different subregions of the object’s mask (Figure 4). [sent-111, score-0.151]
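A minimal sketch of such a spatially varying appearance model, assuming features have already been quantized into visual-word ids; the 4x4 grid of subregions is an assumption for illustration (the paper's g˜ℓ may be parameterized differently).

```python
# Sketch: per-subregion visual-word histograms over the object's mask.
import numpy as np

def spatial_appearance_model(word_map, mask, n_words, grid=(4, 4)):
    """word_map: (H, W) int array of visual-word ids; mask: (H, W) bool."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    model = np.zeros(grid + (n_words,))
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            # bounds of this subregion within the mask's bounding box
            ya, yb = y0 + (y1 - y0) * gy // grid[0], y0 + (y1 - y0) * (gy + 1) // grid[0]
            xa, xb = x0 + (x1 - x0) * gx // grid[1], x0 + (x1 - x0) * (gx + 1) // grid[1]
            cell = mask[ya:yb, xa:xb]
            words = word_map[ya:yb, xa:xb][cell]
            hist = np.bincount(words, minlength=n_words).astype(float)
            model[gy, gx] = hist / max(hist.sum(), 1.0)   # normalized histogram per cell
    return model
```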
52 In order to sample segments of a variety of sizes, we consider big object segments separately from small object segments. [sent-121, score-0.44]
53 Then, for objects in each given size category, we select the k2 segments that maximize the histogram intersection between the spatial pyramid distribution of HOG-color visual words in the segment and in the query image (k2 = 40 in our experiments). [sent-123, score-0.399]
54 The union of all these sets of k2 segments gives us our full retrieval set Rn. [sent-124, score-0.174]
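A minimal sketch of this retrieval step, assuming precomputed, normalized spatial-pyramid histograms of HOG-color visual words for the query and for each dictionary segment; the variable names are illustrative, not the paper's.

```python
# Sketch: score segments by histogram intersection and keep the top k2.
import numpy as np

def histogram_intersection(h1, h2):
    return np.minimum(h1, h2).sum()

def retrieve_candidates(query_hist, segment_hists, k2=40):
    """segment_hists: dict mapping segment id -> pyramid histogram (1-D array)."""
    scores = {sid: histogram_intersection(query_hist, h)
              for sid, h in segment_hists.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k2]
```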
55 Segment transformation: Once we have retrieved a set of candidate segments, we warp each to align with the query image using transformation function T (Equation 1), which consists of three separate stages: “translation and scaling”, “layering”, and “trimming and growing” (Figure 3). [sent-127, score-0.213]
56 Objects are translated and scaled during an initial pass over all retrieved segments, whereas layering, trimming, and in-painting occur at each iteration of the segment recombination algorithm (Section 4. [sent-128, score-0.285]
57 Translation and scaling: Each object from our dictionary is transformed independently. [sent-132, score-0.163]
58 Layering: When an object ℓ is added to a collage, we insert it at a layer zℓ. [sent-135, score-0.162]
59 All other object segments in the collage at or below this layer are pushed back: zℓ′ ← zℓ′ − 1, ∀ zℓ′ ≥ zℓ. [sent-136, score-1.041]
60 However, we only consider layer assignments that are valid in the sense that the collage’s scenegraph remains consistent with the collage’s layering order (consistent according to the scenegraph estimation rules of Section 2. [sent-138, score-0.975]
61 Also, as a special case, “sky” segments are always placed on the bottom-most layer. [sent-140, score-0.174]
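One way to realize the layering update above is to keep an explicit front-to-back ordering of segments, as in the minimal sketch below; the index convention (position 0 = frontmost) is an assumption, not the paper's implementation.

```python
# Sketch: insert a segment into a front-to-back layer ordering.
def insert_segment(layer_order, segment_id, z, is_sky=False):
    """layer_order: list of segment ids ordered front to back."""
    if is_sky:
        layer_order.append(segment_id)            # "sky" always goes on the bottom-most (back) layer
    else:
        z = max(0, min(z, len(layer_order)))      # clamp to a valid position
        layer_order.insert(z, segment_id)         # segments at or below z are pushed one layer back
    return layer_order

# Example: start with [field, sky] (front to back) and put a tree in front.
order = ["field", "sky"]
insert_segment(order, "tree", z=0)                # -> ["tree", "field", "sky"]
```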
62 We formulate this part of the problem as 2D MRF-based segmentation, in which segments in a collage compete to explain pixels near their mask boundaries. [sent-142, score-0.966]
63 The energy of the MRF is based on the likelihood each segment assigns to the pixels under our likelihood model from Section 2. [sent-143, score-0.155]
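A much-simplified sketch of this competition idea, keeping only the per-pixel likelihood (unary) term and ignoring the MRF's pairwise smoothness; the per-segment log-likelihood maps stand in for the likelihood model of Section 2 and are assumed to be precomputed.

```python
# Sketch: near mask boundaries, each covered pixel goes to the overlapping
# segment that assigns it the highest appearance log-likelihood.
import numpy as np

def trim_and_grow(masks, logliks):
    """masks: (K, H, W) bool; logliks: (K, H, W) per-segment log-likelihoods."""
    scores = np.where(masks, logliks, -np.inf)    # a pixel may only be claimed by a segment covering it
    winner = scores.argmax(axis=0)                # competing segments claim boundary pixels
    refined = np.zeros_like(masks)
    rows = np.arange(masks.shape[1])[:, None]
    cols = np.arange(masks.shape[2])
    refined[winner, rows, cols] = True
    covered = masks.any(axis=0)
    return refined & covered[None]                # leave uncovered pixels unassigned
```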
64 Segment recombination: For each context/retrieval set, we greedily recombine our retrieved, transformed segments in order to explain the query image. [sent-147, score-0.489]
65 Each context set, Cn, defines a context-dependent scene grammar. [sent-149, score-0.152]
66 This context-dependent grammar is the same as our scene grammar from Section 5, but rather than considering moves toward any exemplar scene in our dictionary, we only consider moves toward exemplar scenes in the context set. [sent-150, score-0.819]
67 The context-dependent scene grammar defines the set of valid semantic changes we can make to a collage. [sent-152, score-0.226]
68 In order to instantiate these semantic changes with actual object segment changes, we consider using each of the object segments in the retrieval set Rn. [sent-153, score-0.462]
69 This generates N collages, from which we choose the max a posteriori collage as a final explanation of the query (Algorithm 1). [sent-155, score-0.876]
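A minimal sketch of this greedy loop and the final selection across context sets; score_collage, propose_changes, and apply_change are placeholders for the model posterior and the grammar's valid moves (apply_change is assumed not to mutate its input), so this is an interpretation rather than the authors' Algorithm 1.

```python
# Sketch: greedy recombination per context set, then pick the best collage.
def greedy_recombine(initial_collage, retrieval_set, score_collage,
                     propose_changes, apply_change, max_iters=50):
    collage, best = initial_collage, score_collage(initial_collage)
    for _ in range(max_iters):
        candidates = [(score_collage(apply_change(collage, c)), c)
                      for c in propose_changes(collage, retrieval_set)]
        if not candidates:
            break
        score, change = max(candidates, key=lambda t: t[0])
        if score <= best:                          # stop when no move improves the score
            break
        collage, best = apply_change(collage, change), score
    return collage, best

def parse_query(context_sets, retrieval_sets, init, score, propose, apply_c):
    # one greedy parse per context set, then take the highest-scoring collage
    results = [greedy_recombine(init(cs), rs, score, propose, apply_c)
               for cs, rs in zip(context_sets, retrieval_sets)]
    return max(results, key=lambda t: t[1])[0]
```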
70 Second, we consider segments in Rn in a coarse-to-fine manner. [sent-158, score-0.174]
71 Notice that each region of a query image is matched to a similar but not identical object segment from our dictionary. [sent-161, score-0.24]
72 An example inferred scenegraph is shown in Figure 9. [sent-183, score-0.475]
73 Our implementation takes about 30 minutes at test time to parse a single scene on a single core. [sent-185, score-0.187]
74 Training – computing features and learning the likelihood model parameters – takes a few hours on a single core for a dictionary of several thousand scenes, and scales linearly with dictionary size. [sent-186, score-0.21]
75 Evaluating layers and support relationships: In addition to inferring a segmentation map and pixelwise labels, our model explicitly represents discrete, layered objects and their support relationships. [sent-198, score-0.384]
76 To estimate the layer order of a scene with only 2D object annotations (i. [sent-200, score-0.316]
77 To estimate ground truth scenegraph support relationships, we apply the scenegraph construction rules (Section 2. [sent-206, score-0.413]
78 Table 2: Comparison with other methods on LMO. [sent-217, score-0.394]
79 We then compare our inferred layer orders and scenegraphs to ground truth using the following metric. [sent-219, score-0.341]
80 First, we match each object ℓ′ in the ground truth to a best-matching object ℓ in our explanation, which is defined to be the object in the explanation of the same object class as ℓ′ (if one exists) for which (Qℓ′ ∩ Qℓ)/(Qℓ′ ∪ Qℓ) is maximized. [sent-220, score-0.229]
81 Next, we list all pairwise relationships between objects in the ground truth scene, and all such relationships between objects in the inferred collage. We consider both pairwise layer order and scenegraph support relationships (parent, child, no relation). [sent-221, score-0.87]
82 We measure performance as the number of relationships in one set that also hold in the other when the objects are matched as described above, divided by the total number of relationships in both sets. [sent-222, score-0.199]
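A simplified sketch of this metric: each ground-truth object is matched to the same-class inferred object with the largest mask IoU, and agreement is then counted over matched pairs. The normalization here (agreement over matched pairs) only approximates the paper's normalization over all relationships in both sets, and the data-structure names are assumptions.

```python
# Sketch: IoU-based object matching and pairwise relationship agreement.
import numpy as np
from itertools import combinations

def iou(a, b):
    return np.logical_and(a, b).sum() / max(np.logical_or(a, b).sum(), 1)

def match_objects(gt_objects, pred_objects):
    """Each object is a dict with 'mask' (bool array) and 'cls' (class label)."""
    matches = {}
    for gi, g in enumerate(gt_objects):
        same_cls = [(pi, iou(g['mask'], p['mask']))
                    for pi, p in enumerate(pred_objects) if p['cls'] == g['cls']]
        if same_cls:
            matches[gi] = max(same_cls, key=lambda t: t[1])[0]
    return matches

def pairwise_agreement(gt_relation, pred_relation, matches, n_gt):
    """relation(i, j) returns e.g. 'in_front', 'behind', 'parent', 'child', 'none'."""
    pairs = [(i, j) for i, j in combinations(range(n_gt), 2)
             if i in matches and j in matches]
    agree = sum(gt_relation(i, j) == pred_relation(matches[i], matches[j])
                for i, j in pairs)
    return agree / max(len(pairs), 1)
```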
83 “Nearest-neighbor” refers to just using the single nearest-neighbor retrieved scene as the raw explanation for a query image. [sent-233, score-0.37]
84 “No recombination” does not allow objects from more than one dictionary scene to be used in a collage (i. [sent-234, score-0.951]
85 As the results show, allowing segments to transform and recombine from multiple sources boosts pixel labeling accuracy compared to the “Nearest-neighbor” and “No recombination” baselines. [sent-237, score-0.232]
86 However, interestingly, just using the raw nearest-neighbor does better on layer and support scores. [sent-238, score-0.165]
87 Figure 11: Random scenes (right) synthesized from the seed image in the column on the left. [sent-246, score-0.175]
88 These samples look like reasonable scenes, suggesting we are staying close to the set of natural scenes as we randomly walk over our scene grammar. [sent-247, score-0.207]
89 By better fitting the appearance of a query image through segment transformations and recombinations, we break some of the rich structure that comes naturally intact in nearest-neighbor images. [sent-249, score-0.226]
90 This prior over-regularizes somewhat, resulting in a decrease in the per-class accuracy and the layer and support scores. [sent-251, score-0.165]
91 Our system is able to recover rough 3D from the layer order. [sent-261, score-0.181]
92 Random scene synthesis: Since our method is generative, it can be used to synthesize random scenes. [sent-266, score-0.177]
93 Starting with a seed image, we retrieve a random context set from similar looking scenes, which gives us a context-dependent scene grammar. [sent-267, score-0.184]
94 We then take a random walk over productions from this grammar (biasing toward collages that cover as much of the image frame as possible). [sent-268, score-0.178]
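A minimal sketch of such a biased random walk; propose_moves, apply_move, and coverage are placeholders for the grammar's productions and for the fraction of frame pixels the collage explains, so this is an interpretation rather than the paper's sampler.

```python
# Sketch: random walk over grammar productions, biased toward frame coverage.
import random

def random_scene_walk(seed_collage, propose_moves, apply_move, coverage,
                      n_steps=20, bias=5.0):
    collage = seed_collage
    for _ in range(n_steps):
        moves = propose_moves(collage)
        if not moves:
            break
        # weight each candidate move by how much of the frame the result covers
        weights = [1.0 + bias * coverage(apply_move(collage, m)) for m in moves]
        collage = apply_move(collage, random.choices(moves, weights=weights)[0])
    return collage
```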
95 Image-to-anaglyph: The depth order of layers in a collage provides rough 3D information. [sent-270, score-0.852]
96 We can use this information to visualize a scene in anaglyph stereo (Figure 12). [sent-271, score-0.187]
97 First we transfer depth order from the collage to the query image. [sent-272, score-0.864]
98 Shifting the camera reveals occluded pixels, which we in-paint with the pixels from both the collage and nearest-neighbor images. [sent-273, score-0.775]
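A minimal sketch of turning ordinal layer depth into a red/cyan anaglyph via a per-layer horizontal disparity; the in-painting of disoccluded pixels is omitted, and disparity_per_layer is an assumption for illustration.

```python
# Sketch: per-layer horizontal shift to fake a second view, then anaglyph.
import numpy as np

def make_anaglyph(image, layer_masks, disparity_per_layer=3):
    """image: (H, W, 3) array; layer_masks: boolean masks ordered back to front."""
    right = image.copy()
    for depth, mask in enumerate(layer_masks):
        shift = disparity_per_layer * depth              # nearer layers get larger parallax
        shifted_mask = np.roll(mask, shift, axis=1)
        shifted_img = np.roll(image, shift, axis=1)
        right[shifted_mask] = shifted_img[shifted_mask]  # composite back to front
    anaglyph = image.copy()
    anaglyph[..., 1:] = right[..., 1:]                   # red from left view, green/blue from right
    return anaglyph
```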
99 Conclusion: We have presented a novel and intuitive framework for scene understanding in which we explain an image by piecing together a collection of exemplar scene elements. [sent-275, score-0.362]
100 Our system moves beyond the standard problem of labeling a pixel grid and into an exciting realm of representing layers, discrete objects, and object interrelationships. [sent-276, score-0.196]
wordName wordTfidf (topN-words)
[('collage', 0.705), ('scenegraph', 0.364), ('segments', 0.174), ('query', 0.126), ('scene', 0.126), ('parsing', 0.123), ('layer', 0.116), ('scenegraphs', 0.114), ('inferred', 0.111), ('layering', 0.101), ('recombination', 0.101), ('dictionary', 0.089), ('exemplar', 0.087), ('relationships', 0.084), ('scenes', 0.081), ('layers', 0.079), ('grammar', 0.078), ('trimming', 0.075), ('editing', 0.072), ('collages', 0.068), ('collaging', 0.068), ('segment', 0.068), ('parse', 0.061), ('anaglyph', 0.061), ('russell', 0.054), ('phillip', 0.053), ('synthesize', 0.051), ('support', 0.049), ('isola', 0.048), ('occluded', 0.047), ('object', 0.046), ('synthesis', 0.046), ('lmo', 0.046), ('explanation', 0.045), ('rgbd', 0.044), ('mask', 0.041), ('retrieved', 0.041), ('parses', 0.04), ('layered', 0.039), ('nyu', 0.038), ('cn', 0.038), ('realm', 0.037), ('recombine', 0.037), ('nonparametric', 0.036), ('rough', 0.035), ('fq', 0.035), ('labelme', 0.035), ('parent', 0.034), ('moves', 0.033), ('depth', 0.033), ('nearestneighbor', 0.032), ('tighe', 0.032), ('stuff', 0.032), ('likelihood', 0.032), ('toward', 0.032), ('seed', 0.032), ('objects', 0.031), ('synthesized', 0.031), ('demo', 0.031), ('greedy', 0.031), ('compositing', 0.03), ('swap', 0.03), ('system', 0.03), ('rules', 0.03), ('child', 0.03), ('discrete', 0.029), ('supplemental', 0.029), ('translation', 0.029), ('masks', 0.029), ('annotations', 0.028), ('segmentation', 0.028), ('transformed', 0.028), ('bb', 0.026), ('context', 0.026), ('xn', 0.025), ('microsoft', 0.025), ('generative', 0.025), ('rn', 0.025), ('yuen', 0.025), ('glasses', 0.025), ('inferring', 0.025), ('representation', 0.025), ('database', 0.025), ('scaling', 0.024), ('explain', 0.023), ('outdoor', 0.023), ('pixels', 0.023), ('posterior', 0.023), ('infer', 0.023), ('transformation', 0.023), ('please', 0.022), ('building', 0.022), ('user', 0.022), ('guo', 0.022), ('semantic', 0.022), ('labeling', 0.021), ('arxiv', 0.021), ('behind', 0.021), ('siggraph', 0.021), ('freeman', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 375 iccv-2013-Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers
Author: Phillip Isola, Ce Liu
Abstract: To quickly synthesize complex scenes, digital artists often collage together visual elements from multiple sources: for example, mountainsfrom New Zealand behind a Scottish castle with wisps of Saharan sand in front. In this paper, we propose to use a similar process in order to parse a scene. We model a scene as a collage of warped, layered objects sampled from labeled, reference images. Each object is related to the rest by a set of support constraints. Scene parsing is achieved through analysis-by-synthesis. Starting with a dataset of labeled exemplar scenes, we retrieve a dictionary of candidate object segments thaOt miginaatl icmhag ea querEdyi eimd im-a age. We then combine elements of this set into a “scene collage ” that explains the query image. Beyond just assigning object labels to pixels, scene collaging produces a lot more information such as the number of each type of object in the scene, how they support one another, the ordinal depth of each object, and, to some degree, occluded content. We exploit this representation for several applications: image editing, random scene synthesis, and image-to-anaglyph.
2 0.13410099 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
Author: Shugao Ma, Jianming Zhang, Nazli Ikizler-Cinbis, Stan Sclaroff
Abstract: We propose Hierarchical Space-Time Segments as a new representation for action recognition and localization. This representation has a two-level hierarchy. The first level comprises the root space-time segments that may contain a human body. The second level comprises multi-grained space-time segments that contain parts of the root. We present an unsupervised method to generate this representation from video, which extracts both static and non-static relevant space-time segments, and also preserves their hierarchical and temporal relationships. Using simple linear SVM on the resultant bag of hierarchical space-time segments representation, we attain better than, or comparable to, state-of-the-art action recognition performance on two challenging benchmark datasets and at the same time produce good action localization results.
3 0.11151224 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
Author: Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, James M. Rehg
Abstract: We propose an unsupervised video segmentation approach by simultaneously tracking multiple holistic figureground segments. Segment tracks are initialized from a pool of segment proposals generated from a figure-ground segmentation algorithm. Then, online non-local appearance models are trained incrementally for each track using a multi-output regularized least squares formulation. By using the same set of training examples for all segment tracks, a computational trick allows us to track hundreds of segment tracks efficiently, as well as perform optimal online updates in closed-form. Besides, a new composite statistical inference approach is proposed for refining the obtained segment tracks, which breaks down the initial segment proposals and recombines for better ones by utilizing highorder statistic estimates from the appearance model and enforcing temporal consistency. For evaluating the algorithm, a dataset, SegTrack v2, is collected with about 1,000 frames with pixel-level annotations. The proposed framework outperforms state-of-the-art approaches in the dataset, show- ing its efficiency and robustness to challenges in different video sequences.
Author: Basura Fernando, Tinne Tuytelaars
Abstract: In this paper we present a new method for object retrieval starting from multiple query images. The use of multiple queries allows for a more expressive formulation of the query object including, e.g., different viewpoints and/or viewing conditions. This, in turn, leads to more diverse and more accurate retrieval results. When no query images are available to the user, they can easily be retrieved from the internet using a standard image search engine. In particular, we propose a new method based on pattern mining. Using the minimal description length principle, we derive the most suitable set of patterns to describe the query object, with patterns corresponding to local feature configurations. This results in apowerful object-specific mid-level image representation. The archive can then be searched efficiently for similar images based on this representation, using a combination of two inverted file systems. Since the patterns already encode local spatial information, good results on several standard image retrieval datasets are obtained even without costly re-ranking based on geometric verification.
5 0.094622165 410 iccv-2013-Support Surface Prediction in Indoor Scenes
Author: Ruiqi Guo, Derek Hoiem
Abstract: In this paper, we present an approach to predict the extent and height of supporting surfaces such as tables, chairs, and cabinet tops from a single RGBD image. We define support surfaces to be horizontal, planar surfaces that can physically support objects and humans. Given a RGBD image, our goal is to localize the height and full extent of such surfaces in 3D space. To achieve this, we created a labeling tool and annotated 1449 images with rich, complete 3D scene models in NYU dataset. We extract ground truth from the annotated dataset and developed a pipeline for predicting floor space, walls, the height and full extent of support surfaces. Finally we match the predicted extent with annotated scenes in training scenes and transfer the the support surface configuration from training scenes. We evaluate the proposed approach in our dataset and demonstrate its effectiveness in understanding scenes in 3D space.
6 0.091430552 150 iccv-2013-Exemplar Cut
7 0.088340878 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
8 0.087495156 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow
9 0.086836457 317 iccv-2013-Piecewise Rigid Scene Flow
10 0.086778015 57 iccv-2013-BOLD Features to Detect Texture-less Objects
11 0.086021237 306 iccv-2013-Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items
12 0.084726326 132 iccv-2013-Efficient 3D Scene Labeling Using Fields of Trees
13 0.081169583 379 iccv-2013-Semantic Segmentation without Annotating Segments
14 0.078147367 162 iccv-2013-Fast Subspace Search via Grassmannian Based Hashing
15 0.077398643 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
16 0.077189431 337 iccv-2013-Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search
17 0.076288179 444 iccv-2013-Viewing Real-World Faces in 3D
18 0.07597322 161 iccv-2013-Fast Sparsity-Based Orthogonal Dictionary Learning for Image Restoration
19 0.074814461 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
20 0.074481867 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels
topicId topicWeight
[(0, 0.17), (1, 0.009), (2, -0.009), (3, -0.011), (4, 0.026), (5, 0.01), (6, -0.067), (7, -0.032), (8, -0.071), (9, -0.03), (10, 0.06), (11, 0.061), (12, 0.035), (13, 0.072), (14, -0.056), (15, -0.028), (16, 0.001), (17, -0.05), (18, 0.058), (19, -0.046), (20, -0.054), (21, -0.07), (22, 0.038), (23, -0.021), (24, -0.04), (25, -0.061), (26, 0.005), (27, -0.009), (28, -0.045), (29, -0.008), (30, -0.072), (31, 0.039), (32, 0.011), (33, -0.055), (34, -0.079), (35, 0.144), (36, -0.067), (37, 0.086), (38, 0.018), (39, 0.023), (40, 0.061), (41, 0.058), (42, -0.045), (43, -0.074), (44, 0.071), (45, 0.154), (46, -0.032), (47, 0.082), (48, -0.101), (49, 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.92002738 375 iccv-2013-Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers
Author: Phillip Isola, Ce Liu
Abstract: To quickly synthesize complex scenes, digital artists often collage together visual elements from multiple sources: for example, mountains from New Zealand behind a Scottish castle with wisps of Saharan sand in front. In this paper, we propose to use a similar process in order to parse a scene. We model a scene as a collage of warped, layered objects sampled from labeled, reference images. Each object is related to the rest by a set of support constraints. Scene parsing is achieved through analysis-by-synthesis. Starting with a dataset of labeled exemplar scenes, we retrieve a dictionary of candidate object segments that match a query image. We then combine elements of this set into a “scene collage” that explains the query image. Beyond just assigning object labels to pixels, scene collaging produces a lot more information such as the number of each type of object in the scene, how they support one another, the ordinal depth of each object, and, to some degree, occluded content. We exploit this representation for several applications: image editing, random scene synthesis, and image-to-anaglyph.
2 0.58871186 57 iccv-2013-BOLD Features to Detect Texture-less Objects
Author: Federico Tombari, Alessandro Franchi, Luigi Di_Stefano
Abstract: Object detection in images withstanding significant clutter and occlusion is still a challenging task whenever the object surface is characterized by poor informative content. We propose to tackle this problem by a compact and distinctive representation of groups of neighboring line segments aggregated over limited spatial supports and invariant to rotation, translation and scale changes. Peculiarly, our proposal allows for leveraging on the inherent strengths of descriptor-based approaches, i.e. robustness to occlusion and clutter and scalability with respect to the size of the model library, also when dealing with scarcely textured objects.
3 0.57857829 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
Author: Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, James M. Rehg
Abstract: We propose an unsupervised video segmentation approach by simultaneously tracking multiple holistic figureground segments. Segment tracks are initialized from a pool of segment proposals generated from a figure-ground segmentation algorithm. Then, online non-local appearance models are trained incrementally for each track using a multi-output regularized least squares formulation. By using the same set of training examples for all segment tracks, a computational trick allows us to track hundreds of segment tracks efficiently, as well as perform optimal online updates in closed-form. Besides, a new composite statistical inference approach is proposed for refining the obtained segment tracks, which breaks down the initial segment proposals and recombines for better ones by utilizing highorder statistic estimates from the appearance model and enforcing temporal consistency. For evaluating the algorithm, a dataset, SegTrack v2, is collected with about 1,000 frames with pixel-level annotations. The proposed framework outperforms state-of-the-art approaches in the dataset, show- ing its efficiency and robustness to challenges in different video sequences.
4 0.57522631 8 iccv-2013-A Deformable Mixture Parsing Model with Parselets
Author: Jian Dong, Qiang Chen, Wei Xia, Zhongyang Huang, Shuicheng Yan
Abstract: In this work, we address the problem of human parsing, namely partitioning the human body into semantic regions, by using the novel Parselet representation. Previous works often consider solving the problem of human pose estimation as the prerequisite of human parsing. We argue that these approaches cannot obtain optimal pixel level parsing due to the inconsistent targets between these tasks. In this paper, we propose to use Parselets as the building blocks of our parsing model. Parselets are a group of parsable segments which can generally be obtained by lowlevel over-segmentation algorithms and bear strong semantic meaning. We then build a Deformable Mixture Parsing Model (DMPM) for human parsing to simultaneously handle the deformation and multi-modalities of Parselets. The proposed model has two unique characteristics: (1) the possible numerous modalities of Parselet ensembles are exhibited as the “And-Or” structure of sub-trees; (2) to further solve the practical problem of Parselet occlusion or absence, we directly model the visibility property at some leaf nodes. The DMPM thus directly solves the problem of human parsing by searching for the best graph configura- tionfrom apool ofParselet hypotheses without intermediate tasks. Comprehensive evaluations demonstrate the encouraging performance of the proposed approach.
5 0.56080276 132 iccv-2013-Efficient 3D Scene Labeling Using Fields of Trees
Author: Olaf Kähler, Ian Reid
Abstract: We address the problem of 3D scene labeling in a structured learning framework. Unlike previous work which uses structured Support VectorMachines, we employ the recently described Decision Tree Field and Regression Tree Field frameworks, which learn the unary and binary terms of a Conditional Random Field from training data. We show this has significant advantages in terms of inference speed, while maintaining similar accuracy. We also demonstrate empirically the importance for overall labeling accuracy of features that make use of prior knowledge about the coarse scene layout such as the location of the ground plane. We show how this coarse layout can be estimated by our framework automatically, and that this information can be used to bootstrap improved accuracy in the detailed labeling.
6 0.52939743 306 iccv-2013-Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items
7 0.5270732 412 iccv-2013-Synergistic Clustering of Image and Segment Descriptors for Unsupervised Scene Understanding
8 0.51315039 150 iccv-2013-Exemplar Cut
9 0.51213199 148 iccv-2013-Example-Based Facade Texture Synthesis
10 0.50577629 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition
11 0.49902543 410 iccv-2013-Support Surface Prediction in Indoor Scenes
12 0.49123272 334 iccv-2013-Query-Adaptive Asymmetrical Dissimilarities for Visual Object Retrieval
13 0.48838156 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
14 0.48648587 2 iccv-2013-3D Scene Understanding by Voxel-CRF
15 0.47890699 1 iccv-2013-3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
16 0.47800916 379 iccv-2013-Semantic Segmentation without Annotating Segments
17 0.46772942 266 iccv-2013-Mining Multiple Queries for Image Retrieval: On-the-Fly Learning of an Object-Specific Mid-level Representation
18 0.46455574 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
19 0.44804162 63 iccv-2013-Bounded Labeling Function for Global Segmentation of Multi-part Objects with Geometric Constraints
20 0.44094557 3 iccv-2013-3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval
topicId topicWeight
[(2, 0.072), (26, 0.076), (31, 0.085), (35, 0.022), (40, 0.019), (42, 0.072), (48, 0.013), (64, 0.062), (73, 0.033), (78, 0.013), (81, 0.204), (89, 0.193), (98, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.82781398 375 iccv-2013-Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers
Author: Phillip Isola, Ce Liu
Abstract: To quickly synthesize complex scenes, digital artists often collage together visual elements from multiple sources: for example, mountains from New Zealand behind a Scottish castle with wisps of Saharan sand in front. In this paper, we propose to use a similar process in order to parse a scene. We model a scene as a collage of warped, layered objects sampled from labeled, reference images. Each object is related to the rest by a set of support constraints. Scene parsing is achieved through analysis-by-synthesis. Starting with a dataset of labeled exemplar scenes, we retrieve a dictionary of candidate object segments that match a query image. We then combine elements of this set into a “scene collage” that explains the query image. Beyond just assigning object labels to pixels, scene collaging produces a lot more information such as the number of each type of object in the scene, how they support one another, the ordinal depth of each object, and, to some degree, occluded content. We exploit this representation for several applications: image editing, random scene synthesis, and image-to-anaglyph.
2 0.78016382 308 iccv-2013-Parsing IKEA Objects: Fine Pose Estimation
Author: Joseph J. Lim, Hamed Pirsiavash, Antonio Torralba
Abstract: We address the problem of localizing and estimating the fine-pose of objects in the image with exact 3D models. Our main focus is to unify contributions from the 1970s with recent advances in object detection: use local keypoint detectors to find candidate poses and score global alignment of each candidate pose to the image. Moreover, we also provide a new dataset containing fine-aligned objects with their exactly matched 3D models, and a set of models for widely used objects. We also evaluate our algorithm both on object detection and fine pose estimation, and show that our method outperforms state-of-the art algorithms.
3 0.76535207 174 iccv-2013-Forward Motion Deblurring
Author: Shicheng Zheng, Li Xu, Jiaya Jia
Abstract: We handle a special type of motion blur considering that cameras move primarily forward or backward. Solving this type of blur is of unique practical importance since nearly all car, traffic and bike-mounted cameras follow out-ofplane translational motion. We start with the study of geometric models and analyze the difficulty of existing methods to deal with them. We also propose a solution accounting for depth variation. Homographies associated with different 3D planes are considered and solved for in an optimization framework. Our method is verified on several natural image examples that cannot be satisfyingly dealt with by previous methods.
4 0.76394457 275 iccv-2013-Motion-Aware KNN Laplacian for Video Matting
Author: Dingzeyu Li, Qifeng Chen, Chi-Keung Tang
Abstract: This paper demonstrates how the nonlocal principle benefits video matting via the KNN Laplacian, which comes with a straightforward implementation using motionaware K nearest neighbors. In hindsight, the fundamental problem to solve in video matting is to produce spatiotemporally coherent clusters of moving foreground pixels. When used as described, the motion-aware KNN Laplacian is effective in addressing this fundamental problem, as demonstrated by sparse user markups typically on only one frame in a variety of challenging examples featuring ambiguous foreground and background colors, changing topologies with disocclusion, significant illumination changes, fast motion, and motion blur. When working with existing Laplacian-based systems, our Laplacian is expected to benefit them immediately with improved clustering of moving foreground pixels.
5 0.76067454 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
Author: Bo Li, Wenze Hu, Tianfu Wu, Song-Chun Zhu
Abstract: Occlusion presents a challenge for detecting objects in real world applications. To address this issue, this paper models object occlusion with an AND-OR structure which (i) represents occlusion at semantic part level, and (ii) captures the regularities of different occlusion configurations (i.e., the different combinations of object part visibilities). This paper focuses on car detection on street. Since annotating part occlusion on real images is time-consuming and error-prone, we propose to learn the the AND-OR structure automatically using synthetic images of CAD models placed at different relative positions. The model parameters are learned from real images under the latent structural SVM (LSSVM) framework. In inference, an efficient dynamic programming (DP) algorithm is utilized. In experiments, we test our method on both car detection and car view estimation. Experimental results show that (i) Our CAD simulation strategy is capable of generating occlusion patterns for real scenarios, (ii) The proposed AND-OR structure model is effective for modeling occlusions, which outperforms the deformable part-based model (DPM) [6, 10] in car detec- , tion on both our self-collected streetparking dataset and the Pascal VOC 2007 car dataset [4], (iii) The learned model is on-par with the state-of-the-art methods on car view estimation tested on two public datasets.
6 0.76029456 361 iccv-2013-Robust Trajectory Clustering for Motion Segmentation
7 0.75787807 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
8 0.75472683 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification
9 0.75471801 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection
10 0.7540397 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
11 0.75181627 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow
12 0.75144786 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
13 0.7514267 210 iccv-2013-Image Retrieval Using Textual Cues
14 0.75006568 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
15 0.749933 447 iccv-2013-Volumetric Semantic Segmentation Using Pyramid Context Features
16 0.74973869 396 iccv-2013-Space-Time Robust Representation for Action Recognition
17 0.74929929 89 iccv-2013-Constructing Adaptive Complex Cells for Robust Visual Tracking
18 0.74885082 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
19 0.74882728 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
20 0.74815679 423 iccv-2013-Towards Motion Aware Light Field Video for Dynamic Scenes