iccv iccv2013 iccv2013-306 knowledge-graph by maker-knowledge-mining

306 iccv-2013-Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items


Source: pdf

Author: Kota Yamaguchi, M. Hadi Kiapour, Tamara L. Berg

Abstract: Clothing recognition is an extremely challenging problem due to wide variation in clothing item appearance, layering, and style. In this paper, we tackle the clothing parsing problem using a retrieval based approach. For a query image, we find similar styles from a large database of tagged fashion images and use these examples to parse the query. Our approach combines parsing from: pre-trained global clothing models, local clothing models learned on the fly from retrieved examples, and transferred parse masks (paper doll item transfer) from retrieved examples. Experimental evaluation shows that our approach significantly outperforms state of the art in parsing accuracy.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Abstract Clothing recognition is an extremely challenging problem due to wide variation in clothing item appearance, layering, and style. [sent-9, score-0.808]

2 In this paper, we tackle the clothing parsing problem using a retrieval based approach. [sent-10, score-0.913]

3 For a query image, we find similar styles from a large database of tagged fashion images and use these examples to parse the query. [sent-11, score-0.467]

4 Our approach combines parsing from: pre-trained global clothing models, local clothing models learned on the fly from retrieved examples, and transferred parse masks (paper doll item transfer) from retrieved examples. [sent-12, score-2.074]

5 Experimental evaluation shows that our approach significantly outperforms state of the art in parsing accuracy. [sent-13, score-0.276]

6 In addition to style variation, individual clothing items also display many different appearance characteristics. [sent-22, score-1.069]

7 In this paper, we take a data-driven approach to clothing parsing. [sent-29, score-0.591]

8 We first collect a large, complex, real world collection of outfit pictures from a social network focused on fashion, chictopia.com. [sent-30, score-0.21]

9 Using a very small set of hand-parsed images in combination with the text tags associated with each image in the collection, we can parse our large database accurately. [sent-32, score-0.46]

10 Now, given a query image without any associated text, we can predict an accurate parse by retrieving similar outfits from our parsed collection, building local models from retrieved clothing items, and transferring inferred clothing items from the retrieved samples to the query image. [sent-33, score-2.514]

11 In each of these steps we take advantage of the relationship between clothing and body pose to constrain prediction and produce a more accurate parse. [sent-35, score-0.711]

12 We call this paper doll parsing because it essentially transfers predictions from retrieved samples to the query, like laying paper cutouts of clothing items onto a paper doll. [sent-36, score-1.475]

13 In particular, we propose a retrieval based approach to clothing parsing that combines: • Pre-trained global models of clothing items. [sent-38, score-1.525]

14 • Local models of clothing items learned on the fly from retrieved examples. [sent-39, score-0.96]

15 • Parse mask predictions transferred from retrieved examples to the query image. [sent-40, score-0.3]

16 Clothing recognition is a challenging and societally important problem: global sales for clothing total over a hundred billion dollars, much of which is conducted online. [sent-42, score-0.612]

17 This is reflected in the growing interest in clothing related recognition papers [11, 10, 25, 16, 26, 7, 2, 4], perhaps boosted by recent advances in pose estimation [27, 3]. [sent-43, score-0.639]

18 Many of these papers have focused on specific aspects of clothing recognition such as predicting attributes of clothing [7, 2, 4], outfit recommendation [15], or identifying aspects of socio-identity through clothing [18, 20]. [sent-44, score-1.872]

19 We attack the problem of clothing parsing, assigning a semantic label to each pixel in the image, where labels can be selected from background, skin, hair, or from a large set of clothing items (e.g. …). [sent-45, score-1.586]

20 Effective solutions to clothing parsing could enable useful end-user applications such as pose independent clothing retrieval [26] or street to shop applications [16]. [sent-48, score-1.552]

21 This problem is closely related to the general image parsing problem, which has been approached successfully using related non-parametric methods [21, 14, 22]. [sent-49, score-0.276]

22 However, we view the clothing parsing problem as suitable for specialized exploration because it deals with people, a category that has obvious significance. [sent-50, score-0.867]

23 The clothing parsing problem is also special in that one can take advantage of body pose estimates during parsing, and we do so in all parts of our method. [sent-51, score-0.951]

24 Previous state of the art on clothing parsing [26] performed quite well on the constrained parsing problem, where test images are parsed given user-provided tags indicating depicted clothing items. [sent-52, score-1.926]

25 However, they were less effective at unconstrained clothing parsing, where test images are parsed in the absence of any textual information. [sent-53, score-0.682]

26 The training samples are used for learning feature transforms, building global clothing models, and adjusting parameters. [sent-58, score-0.64]

27 …chictopia.com with associated metadata tags denoting characteristics such as color, clothing item, or occasion. [sent-62, score-0.692]

28 From the remaining, we select pictures tagged with at least one clothing item and run a full-body pose detector [27], keeping those that have a person detection. [sent-64, score-0.969]

29 This results in 339,797 pictures weakly annotated with clothing items and estimated pose. [sent-65, score-1.003]

30 Though the annotations are not always complete (users often do not label all depicted items, especially small items or accessories), it is rare to find images where an annotated tag is not present. [sent-66, score-0.481]

31 Use retrieved images and tags to parse the query. [sent-72, score-0.515]

32 During parsing, we compute the parse in this fixed frame size, then warp it back to the original image, assuming regions outside the bounding box are background. [sent-78, score-0.268]

33 Our methods draw from a number of dense feature types (each parsing method uses some subset): RGB (the RGB color of the pixel). [sent-79, score-0.276]

34 Style retrieval: Our goal for retrieving similar pictures is two-fold: a) to predict depicted clothing items, and b) to obtain information helpful for parsing clothing items. [sent-90, score-1.616]

35 Style descriptor: We design a descriptor for style retrieval that is useful for finding styles with similar appearance. [sent-93, score-0.261]

36 Skin-hair Detection is computed using logistic regression for skin, hair, background, and clothing at each pixel. [sent-98, score-0.676]

37 Note that we do not include Pose Distance as a feature in the style descriptor, but instead use Skin-hair detection to indirectly include pose-dependent information in the representation, since the purpose of the style descriptor is to find similar styles independent of pose. [sent-100, score-0.322]
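
As a concrete illustration, here is a minimal sketch of such a descriptor, assuming mean-pooling of dense per-pixel features (e.g. color plus the skin/hair/background/clothing posteriors above) over a coarse grid of the pose-normalized frame; the grid size, feature set, and names (`style_descriptor`, `dense_feats`) are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def style_descriptor(dense_feats, grid=(4, 3)):
    """Mean-pool per-pixel features over a coarse grid of the fixed pose frame.

    dense_feats: (H, W, C) array, e.g. RGB plus skin/hair/background/clothing
    posteriors. Pooling over a fixed grid in the pose-normalized frame keeps
    the descriptor pose-independent up to the frame normalization.
    (Sketch; not the paper's exact feature configuration.)
    """
    H, W, C = dense_feats.shape
    gh, gw = grid
    cells = []
    for i in range(gh):
        for j in range(gw):
            cell = dense_feats[i * H // gh:(i + 1) * H // gh,
                               j * W // gw:(j + 1) * W // gw]
            cells.append(cell.reshape(-1, C).mean(axis=0))
    return np.concatenate(cells)

# Example: random stand-in for 7 dense features in a 600x400 frame.
desc = style_descriptor(np.random.rand(600, 400, 7))
print(desc.shape)  # (84,) = 4 * 3 grid cells * 7 features
```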

38 Tag prediction: The retrieved samples are first used to predict clothing items potentially present in a query image. [sent-112, score-1.257]

39 The purpose of tag prediction is to obtain a set of tags that might be relevant to the query, while eliminating definitely irrelevant items from consideration. [sent-113, score-0.552]

40 Each tag in the retrieved samples provides a vote weighted by the inverse of its distance from the query, which forms a confidence for presence of that item. [sent-117, score-0.296]
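
A minimal sketch of this weighted vote; the function name, data layout, and threshold value are illustrative assumptions.

```python
def predict_tags(neighbor_tags, neighbor_dists, threshold=2.5):
    """Accumulate inverse-distance-weighted votes per tag; keep tags that clear the threshold."""
    votes = {}
    for tags, dist in zip(neighbor_tags, neighbor_dists):
        weight = 1.0 / max(dist, 1e-8)  # inverse-distance weight
        for tag in tags:
            votes[tag] = votes.get(tag, 0.0) + weight
    return {tag for tag, v in votes.items() if v >= threshold}

# Three retrieved outfits at distances 0.5, 1.0, 2.0 from the query:
print(predict_tags([{"dress", "heels"}, {"dress"}, {"skirt"}], [0.5, 1.0, 2.0]))
# {'dress'} -- dress scores 2.0 + 1.0 = 3.0; heels (2.0) and skirt (0.5) fall below 2.5
```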

41 While linear classification (clothing item classifiers trained on subsets of body parts, e.g. … [sent-121, score-0.253]

42 Since the goal here is only to eliminate obviously irrelevant items while keeping most potentially relevant items, we tune the threshold to give 0. [sent-125, score-0.348]

43 Due to the skewed item distribution in the Fashionista dataset, we use the same threshold for all items to avoid over-fitting the predictive model. [sent-127, score-0.592]

44 In the parsing stage, we always include background, skin, and hair in addition to the predicted clothing tags. [sent-128, score-0.935]

45 Clothing parsing: Following tag prediction, we start to parse the image in a per-pixel fashion. [sent-130, score-0.611]

46 Compute pixel-level confidence from three methods: global parse, nearest neighbor parse, and transferred parse. [sent-132, score-0.217]
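
The sentence above only says the three confidences are combined; a weighted log-linear fusion is one plausible realization, sketched here under that assumption (the weights `lambdas` are hypothetical).

```python
import numpy as np

def combine(s_global, s_nn, s_transfer, lambdas=(1.0, 1.0, 1.0)):
    """Fuse three (H, W, L) per-pixel score maps over L labels.

    Assumed log-linear (weighted geometric mean) combination; the paper's
    exact fusion rule and weights may differ.
    """
    eps = 1e-8
    log_s = (lambdas[0] * np.log(s_global + eps)
             + lambdas[1] * np.log(s_nn + eps)
             + lambdas[2] * np.log(s_transfer + eps))
    s = np.exp(log_s)
    return s / s.sum(axis=-1, keepdims=True)  # renormalize per pixel

# MAP labeling from the fused confidence (random stand-in score maps).
maps = [np.random.rand(10, 10, 5) for _ in range(3)]
labels = combine(*maps).argmax(axis=-1)
```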

47 Pixel confidence: Let us denote yi as the clothing item label at pixel i. [sent-138, score-0.997]

48 The first step in parsing is to compute a confidence score of assigning clothing item l to yi. [sent-139, score-1.139]

49 1 Global parse: The first term in our model is a global clothing likelihood, trained for each clothing item on the hand-parsed Fashionista training split. [sent-144, score-1.779]
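
A sketch of this term, assuming one one-vs-rest logistic regression per clothing label over dense per-pixel features; the synthetic data and names (`train_global_models`, `score_global`) are stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_global_models(pixel_feats, pixel_labels, label_set):
    """One logistic regression per label (one-vs-rest) on (N, D) pixel features."""
    models = {}
    for label in label_set:
        clf = LogisticRegression(max_iter=200)
        clf.fit(pixel_feats, (pixel_labels == label).astype(int))
        models[label] = clf
    return models

def score_global(models, feats):
    """Per-label probability for each of the M pixels in feats: label -> (M,)."""
    return {l: m.predict_proba(feats)[:, 1] for l, m in models.items()}

# Tiny synthetic stand-in for hand-parsed training pixels.
X = np.random.rand(500, 8)
y = np.random.choice(np.array(["skin", "dress", "background"]), size=500)
models = train_global_models(X, y, {"skin", "dress", "background"})
```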

50 The leftmost column shows query images with ground truth item annotation. [sent-146, score-0.302]

51 The rest are retrieved images with associated tags in the top 25. [sent-147, score-0.247]

52 [Figure: Transferred parse / Combined (1+2+3) / Smoothed result; labels are MAP assignments of the scoring functions.] [sent-151, score-0.268]

53 This could potentially increase confusion between similar item types, such as blazer and jacket, since they usually do not appear together, in favor of better localization accuracy. [sent-158, score-0.366]

54 2 Nearest neighbor parse: The second term in our model is also a logistic regression, but trained only on the retrieved nearest neighbor (NN) images. [sent-164, score-0.585]

55 Here we learn a local appearance model for each clothing item based on examples that are similar to the query, e.g. [sent-165, score-0.808]

56 blazers that look similar to the query blazer because they were retrieved via style similarity. [sent-167, score-0.451]
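
Under the same assumptions, the nearest-neighbor term differs only in its training set: it is fit at query time on pixels from the retrieved images, restricted to the predicted tags. A sketch, reusing the hypothetical `train_global_models` helper from the global-parse sketch above:

```python
import numpy as np

def train_nn_models(retrieved_feats, retrieved_labels, predicted_tags):
    # Keep only pixels whose (database-estimated) label is a predicted tag,
    # then fit the same per-item logistic regressions on the fly.
    keep = np.isin(retrieved_labels, sorted(predicted_tags))
    return train_global_models(retrieved_feats[keep],
                               retrieved_labels[keep],
                               predicted_tags)
```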

57 3 Transferred parse: The third term in our model is obtained by transferring the parse mask likelihoods estimated by the global parse $S_{\text{global}}$ from the retrieved images to the query image (Figure 6 visualizes an example). [sent-177, score-1.089]

58 …, [21]) to overcome the difficulty in naively transferring deformable, often occluded clothing items pixel-wise. [sent-181, score-0.972]

59 Our approach first computes an over-segmentation of both query and retrieved images using a fast and simple segmentation algorithm [9], then finds corresponding pairs of super-pixels between the query and each retrieved image based on pose and appearance: 1. [sent-182, score-0.51]
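
A sketch of this correspondence step, assuming Felzenszwalb over-segmentation (as in [9]) and a simple mean-color-plus-centroid descriptor standing in for the paper's pose and appearance matching features.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def superpixel_descriptors(img):
    """Over-segment an image and describe each super-pixel by mean color + centroid."""
    seg = felzenszwalb(img, scale=100, sigma=0.5, min_size=50)
    descs, ids = [], []
    for s in np.unique(seg):
        mask = seg == s
        ys, xs = np.nonzero(mask)
        color = img[mask].mean(axis=0)                       # mean RGB
        pos = [ys.mean() / img.shape[0], xs.mean() / img.shape[1]]
        descs.append(np.concatenate([color, pos]))
        ids.append(s)
    return seg, np.array(ids), np.array(descs)

def match_superpixels(q_descs, r_descs):
    # Index of the nearest retrieved super-pixel for each query super-pixel.
    d = np.linalg.norm(q_descs[:, None] - r_descs[None], axis=-1)
    return d.argmin(axis=1)

# Usage on random stand-in images in the fixed pose frame.
q_seg, _, q_descs = superpixel_descriptors(np.random.rand(120, 80, 3))
r_seg, r_ids, r_descs = superpixel_descriptors(np.random.rand(120, 80, 3))
match = r_ids[match_superpixels(q_descs, r_descs)]
```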

60 Then, our transferred parse is computed as: $S_{\text{transfer}}(y_i = l \mid x_i, \mathcal{D}) \equiv \frac{1}{Z} \sum_{r \in \mathcal{D}} \frac{1}{|s_{i,r}|} \sum_{j \in s_{i,r}} P(y_j = l \mid x_j, \theta_l^g) \cdot \mathbf{1}[l \in \tau(r)]$ (5) [sent-189, score-0.337]

61 which is a mean of the global parse over the matched super-pixel $s_{i,r}$ in each retrieved image $r$, with tag set $\tau(r)$ and normalization constant $Z$. [sent-193, score-0.435]
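
A sketch of Eq. (5) for one retrieved image r, under the data-layout assumptions noted in the comments; summation over retrieved images and the normalizer Z are left to the caller.

```python
import numpy as np

def transfer_scores(match, r_seg, r_global, r_tags, label_index):
    """Per-query-super-pixel contribution of one retrieved image to Eq. (5).

    match: (Q,) matched retrieved super-pixel id per query super-pixel.
    r_seg: (H, W) retrieved over-segmentation; r_global: (H, W, L) global-parse
    scores; r_tags: tag set tau(r); label_index: label -> column in r_global.
    (Illustrative data layout, not the authors' code.)
    """
    L = r_global.shape[-1]
    gate = np.zeros(L)
    for label, k in label_index.items():
        gate[k] = 1.0 if label in r_tags else 0.0    # 1[l in tau(r)]
    scores = np.zeros((len(match), L))
    for q, s in enumerate(match):
        region = r_global[r_seg == s]                # pixels of super-pixel s_{i,r}
        scores[q] = region.mean(axis=0) * gate       # mean of the global parse
    return scores  # caller sums over retrieved images and divides by Z
```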

62 Iterative label smoothing: The combined confidence gives a rough estimate of item localization. [sent-213, score-0.344]

63 However, it does not respect boundaries of actual clothing items since it is computed per-pixel. [sent-214, score-0.939]

64 Therefore, we introduce an iterative smoothing stage that considers all pixels together to provide a smooth parse of an image. [sent-215, score-0.367]

65 Following the approach of [19], we formulate this smoothing problem by considering the joint labeling of pixels Y ≡ {yi} and item appearance models Θ ≡ {θls}, where θls is a model for a label l. [sent-216, score-0.284]

66 The goal is to find the optimal joint assignment Y∗ and item models Θ∗ for a given image. [sent-217, score-0.248]

67 We start by initializing the current predicted parsing Yˆ0 with the MAP assignment under the combined confidence S. [sent-218, score-0.396]

68 Then, we treat Yˆ0 as training data to build initial image-specific item models Θˆ0 (logistic regressions). [sent-219, score-0.217]
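
A sketch of this loop: fit image-specific models on the current labeling, re-score, and re-assign. The spatial smoothness term the paper also uses is omitted here, so this is an approximation of the described procedure, not the full method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_smoothing(feats, S, n_iters=3):
    """feats: (N, D) per-pixel features; S: (N, L) combined confidence."""
    y = S.argmax(axis=1)                              # MAP initialization (Y_0)
    for _ in range(n_iters):
        clf = LogisticRegression(max_iter=200)
        clf.fit(feats, y)                             # image-specific models Theta_t
        proba = clf.predict_proba(feats)              # (N, |classes present|)
        scores = np.zeros_like(S)
        scores[:, clf.classes_] = proba               # map back to full label set
        y = (scores * S).argmax(axis=1)               # re-estimate the labeling
    return y

# Random stand-ins for one image's pixels and combined confidence.
labels = iterative_smoothing(np.random.rand(1000, 6), np.random.rand(1000, 5))
```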

69 Offline processing: Our retrieval techniques require the large Paper Doll Dataset to be pre-processed (parsed), for building nearest neighbor models on the fly from retrieved samples and for transferring parse masks. [sent-234, score-0.614]

70 Therefore, we estimate a clothing parse for each sample in the 339K image dataset, making use of pose estimates and the tags associated with the image by the photo owner. [sent-235, score-1.008]

71 This parse makes use of the global clothing models (constrained to the tags associated with the image by the photo owner) and iterative smoothing parts of our approach. [sent-236, score-1.058]

72 Although these training images are tagged, there are often clothing items missing in the annotation. [sent-237, score-0.939]

73 To prevent this, we add an unknown item label with uniform probability and initialize it together with the global clothing model at all samples. [sent-239, score-0.856]

74 This effectively prevents the final estimated labeling Yˆ from marking missing items with incorrect labels. [sent-240, score-0.348]
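
A minimal sketch of the unknown-label mechanism: appending a uniform score column so pixels of untagged items can fall into "unknown" rather than being forced onto a wrong tagged label. Where exactly this term enters the model is not spelled out in this summary, so the placement here is an assumption.

```python
import numpy as np

def add_unknown_label(scores, unknown_prob=None):
    """scores: (N, L) per-pixel scores over the tagged labels; appends 'unknown'."""
    N, L = scores.shape
    if unknown_prob is None:
        unknown_prob = 1.0 / (L + 1)              # uniform over L+1 labels
    unknown = np.full((N, 1), unknown_prob)       # constant column for 'unknown'
    s = np.hstack([scores, unknown])
    return s / s.sum(axis=1, keepdims=True)       # renormalize per pixel
```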

75 For an unseen query image, our full parsing pipeline takes 20 to 40 seconds, including pose estimation. [sent-242, score-0.409]

76 Experimental results: We evaluate parsing performance on the 229 testing samples from the Fashionista dataset. [sent-245, score-0.276]

77 The task is to predict a label for every pixel, where labels represent a set of 56 different categories, a very large and challenging variety of clothing items. [sent-246, score-0.67]

78 In addition, we also include foreground accuracy (see Eqn. 6) as a measure of how accurate each method is at parsing foreground regions (those pixels on the body, not on the background). [sent-248, score-0.408]
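
Eqn. 6 itself is not reproduced in this summary; a standard definition consistent with the description, assumed here, is the fraction of non-background ground-truth pixels that are labeled correctly:

```latex
\mathrm{FgAcc} \;=\;
\frac{\sum_i \mathbf{1}\!\left[\hat{y}_i = y_i \,\wedge\, y_i \neq \text{background}\right]}
     {\sum_i \mathbf{1}\!\left[y_i \neq \text{background}\right]}
```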

79 Table 1 summarizes predictive performance of our parsing method, including a breakdown of how well the intermediate parsing steps perform. [sent-251, score-0.579]

80 For comparison, we include the performance of previous state of the art on clothing parsing [26]. [sent-252, score-0.867]

81 Figure 7 shows examples from our parsing method, with ground truth annotation and the method of [26]. [sent-265, score-0.276]

82 We observe that our method produces a parse that respects the actual item boundary, even if some items are incorrectly labeled; e.g. … [sent-266, score-0.833]

83 However, often these confusions are due to high similarity in appearance between items and sometimes due to non-exclusivity in item types, i.e. … [sent-269, score-0.588]

84 Figure 8 plots F-1 scores for non-empty items (items predicted on the test set) comparing the method of [26] with our method. [sent-272, score-0.382]

85 Our model outperforms the prior work on many items, especially major foreground items such as dress, jeans, coat, shorts, or skirt. [sent-273, score-0.403]

86 This results in a significant boost in foreground accuracy and perceptually better parsing results. [sent-274, score-0.352]

87 By design, our style descriptor is aimed at representing whole outfit style rather than specific details of the outfit. [sent-276, score-0.354]

88 Consequently, small items like accessories tend to be less weighted during retrieval and are therefore poorly predicted during parsing. [sent-277, score-0.467]

89 However, prediction of small items is inherently extremely challenging because they provide limited appearance information. [sent-278, score-0.384]

90 …conflicting items from being predicted for the same image, such as dress and skirt, or boots and shoes, which tend not to be worn together. [sent-322, score-0.556]

91 Our iterative smoothing effectively reduces such confusion, but the parsing result sometimes contains one item split into two conflicting items. [sent-323, score-0.593]

92 Lastly, we find it difficult to predict items with skin-like color or coarsely textured items (similar to issues reported in [26]). [sent-325, score-0.719]

93 Because of the variation in lighting conditions in pictures, it is very hard to distinguish between actual skin and clothing items that look like skin, e.g. … [sent-326, score-0.992]

94 Conclusion We describe a clothing parsing method based on nearest neighbor style retrieval. [sent-332, score-1.069]

95 Our system combines global parse models, nearest neighbor parse models, and transferred parse predictions. [sent-333, score-0.966]

96 …overall accuracy and especially foreground parsing accuracy over previous work. [sent-338, score-0.331]

97 It remains future work to resolve the confusion between very similar items and to incorporate higher-level knowledge about outfits. [sent-339, score-0.393]

98 Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. [sent-453, score-0.637]

99 Finding things: Image parsing with regions and per-exemplar detectors. [sent-493, score-0.276]

100 Who blocks who: Simultaneous clothing segmentation for grouping images. [sent-507, score-0.591]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('clothing', 0.591), ('items', 0.348), ('parsing', 0.276), ('parse', 0.268), ('item', 0.217), ('fashionista', 0.19), ('retrieved', 0.146), ('style', 0.13), ('shorts', 0.125), ('tags', 0.101), ('parsed', 0.091), ('jeans', 0.088), ('doll', 0.086), ('query', 0.085), ('yi', 0.078), ('jacket', 0.073), ('outfit', 0.071), ('transferred', 0.069), ('tag', 0.067), ('pictures', 0.064), ('pants', 0.063), ('logistic', 0.061), ('dress', 0.061), ('lg', 0.059), ('shoes', 0.059), ('skirt', 0.055), ('chapel', 0.055), ('foreground', 0.055), ('confidence', 0.055), ('blazer', 0.054), ('boots', 0.054), ('chictopia', 0.054), ('outfits', 0.054), ('sglobal', 0.054), ('shirts', 0.054), ('tights', 0.054), ('skin', 0.053), ('knn', 0.051), ('tagged', 0.049), ('pose', 0.048), ('sweater', 0.047), ('hill', 0.047), ('retrieval', 0.046), ('smoothing', 0.045), ('kiapour', 0.044), ('xi', 0.043), ('rgb', 0.043), ('accessories', 0.039), ('styles', 0.039), ('lab', 0.039), ('neighbor', 0.038), ('prediction', 0.036), ('body', 0.036), ('blazers', 0.036), ('snearest', 0.036), ('stransfer', 0.036), ('hair', 0.034), ('predicted', 0.034), ('nearest', 0.034), ('transferring', 0.033), ('pages', 0.032), ('iterative', 0.032), ('belt', 0.032), ('hadi', 0.032), ('assignment', 0.031), ('coat', 0.029), ('stony', 0.029), ('regressions', 0.029), ('shirt', 0.029), ('pixel', 0.029), ('predicting', 0.028), ('samples', 0.028), ('brook', 0.028), ('unc', 0.028), ('predictive', 0.027), ('label', 0.027), ('fashion', 0.026), ('tighe', 0.025), ('button', 0.025), ('vs', 0.025), ('boundary', 0.025), ('retrieving', 0.025), ('regression', 0.024), ('yamaguchi', 0.024), ('resolve', 0.023), ('predict', 0.023), ('descriptor', 0.023), ('crf', 0.023), ('sometimes', 0.023), ('bourdev', 0.023), ('confusion', 0.022), ('gallagher', 0.022), ('song', 0.022), ('pixels', 0.022), ('social', 0.021), ('ls', 0.021), ('global', 0.021), ('wear', 0.021), ('perceptually', 0.021), ('fly', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999869 306 iccv-2013-Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items

Author: Kota Yamaguchi, M. Hadi Kiapour, Tamara L. Berg

Abstract: Clothing recognition is an extremely challenging problem due to wide variation in clothing item appearance, layering, and style. In this paper, we tackle the clothing parsing problem using a retrieval based approach. For a query image, we find similar styles from a large database of tagged fashion images and use these examples to parse the query. Our approach combines parsing from: pre-trained global clothing models, local clothing models learned on the fly from retrieved examples, and transferred parse masks (paper doll item transfer) from retrieved examples. Experimental evaluation shows that our approach significantly outperforms state of the art in parsing accuracy.

2 0.25440192 449 iccv-2013-What Do You Do? Occupation Recognition in a Photo via Social Context

Author: Ming Shao, Liangyue Li, Yun Fu

Abstract: In this paper, we investigate the problem of recognizing occupations of multiple people with arbitrary poses in a photo. Previous work utilizing single person’s nearly frontal clothing information and fore/background context preliminarily proves that occupation recognition is computationally feasible in computer vision. However, in practice, multiple people with arbitrary poses are common in a photo, and recognizing their occupations is even more challenging. We argue that with appropriately built visual attributes, co-occurrence, and spatial configuration model that is learned through structure SVM, we can recognize multiple people’s occupations in a photo simultaneously. To evaluate our method’s performance, we conduct extensive experiments on a new well-labeled occupation database with 14 representative occupations and over 7K images. Results on this database validate our method’s effectiveness and show that occupation recognition is solvable in a more general case.

3 0.13956454 8 iccv-2013-A Deformable Mixture Parsing Model with Parselets

Author: Jian Dong, Qiang Chen, Wei Xia, Zhongyang Huang, Shuicheng Yan

Abstract: In this work, we address the problem of human parsing, namely partitioning the human body into semantic regions, by using the novel Parselet representation. Previous works often consider solving the problem of human pose estimation as the prerequisite of human parsing. We argue that these approaches cannot obtain optimal pixel level parsing due to the inconsistent targets between these tasks. In this paper, we propose to use Parselets as the building blocks of our parsing model. Parselets are a group of parsable segments which can generally be obtained by low-level over-segmentation algorithms and bear strong semantic meaning. We then build a Deformable Mixture Parsing Model (DMPM) for human parsing to simultaneously handle the deformation and multi-modalities of Parselets. The proposed model has two unique characteristics: (1) the possible numerous modalities of Parselet ensembles are exhibited as the “And-Or” structure of sub-trees; (2) to further solve the practical problem of Parselet occlusion or absence, we directly model the visibility property at some leaf nodes. The DMPM thus directly solves the problem of human parsing by searching for the best graph configuration from a pool of Parselet hypotheses without intermediate tasks. Comprehensive evaluations demonstrate the encouraging performance of the proposed approach.

4 0.11763137 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary

Author: Jungseock Joo, Shuo Wang, Song-Chun Zhu

Abstract: We present a part-based approach to the problem of human attribute recognition from a single image of a human body. To recognize the attributes of human from the body parts, it is important to reliably detect the parts. This is a challenging task due to the geometric variation such as articulation and view-point changes as well as the appearance variation of the parts arisen from versatile clothing types. The prior works have primarily focused on handling geometric variation by relying on pre-trained part detectors or pose estimators, which require manual part annotation, but the appearance variation has been relatively neglected in these works. This paper explores the importance of the appearance variation, which is directly related to the main task, attribute recognition. To this end, we propose to learn a rich appearance part dictionary of human with significantly less supervision by decomposing image lattice into overlapping windows at multiscale and iteratively refining local appearance templates. We also present quantitative results in which our proposed method outperforms the existing approaches.

5 0.11665285 191 iccv-2013-Handling Uncertain Tags in Visual Recognition

Author: Arash Vahdat, Greg Mori

Abstract: Gathering accurate training data for recognizing a set of attributes or tags on images or videos is a challenge. Obtaining labels via manual effort or from weakly-supervised data typically results in noisy training labels. We develop the FlipSVM, a novel algorithm for handling these noisy, structured labels. The FlipSVM models label noise by “flipping ” labels on training examples. We show empirically that the FlipSVM is effective on images-and-attributes and video tagging datasets.

6 0.095716424 266 iccv-2013-Mining Multiple Queries for Image Retrieval: On-the-Fly Learning of an Object-Specific Mid-level Representation

7 0.086021237 375 iccv-2013-Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers

8 0.082958959 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation

9 0.073064327 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition

10 0.068315633 225 iccv-2013-Joint Segmentation and Pose Tracking of Human in Natural Videos

11 0.067056216 337 iccv-2013-Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search

12 0.066429794 233 iccv-2013-Latent Task Adaptation with Large-Scale Hierarchies

13 0.065427318 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching

14 0.064758837 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

15 0.063519627 234 iccv-2013-Learning CRFs for Image Parsing with Adaptive Subgradient Descent

16 0.063385345 196 iccv-2013-Hierarchical Data-Driven Descent for Efficient Optimal Deformation Estimation

17 0.061272196 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion

18 0.060281921 162 iccv-2013-Fast Subspace Search via Grassmannian Based Hashing

19 0.058720987 444 iccv-2013-Viewing Real-World Faces in 3D

20 0.057920512 217 iccv-2013-Initialization-Insensitive Visual Tracking through Voting with Salient Local Features


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.145), (1, 0.038), (2, -0.011), (3, -0.048), (4, 0.065), (5, 0.019), (6, -0.017), (7, -0.014), (8, -0.033), (9, 0.035), (10, 0.047), (11, 0.023), (12, -0.011), (13, -0.022), (14, -0.032), (15, 0.045), (16, 0.018), (17, -0.105), (18, 0.047), (19, -0.012), (20, 0.029), (21, -0.009), (22, 0.024), (23, -0.008), (24, -0.036), (25, 0.017), (26, 0.066), (27, 0.023), (28, 0.001), (29, -0.008), (30, -0.025), (31, 0.047), (32, 0.115), (33, -0.017), (34, 0.026), (35, 0.003), (36, -0.011), (37, 0.11), (38, -0.031), (39, -0.013), (40, 0.027), (41, -0.02), (42, -0.069), (43, -0.04), (44, 0.084), (45, 0.062), (46, 0.054), (47, -0.015), (48, -0.077), (49, -0.063)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92776674 306 iccv-2013-Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items

Author: Kota Yamaguchi, M. Hadi Kiapour, Tamara L. Berg

Abstract: Clothing recognition is an extremely challenging problem due to wide variation in clothing item appearance, layering, and style. In this paper, we tackle the clothing parsing problem using a retrieval based approach. For a query image, we find similar styles from a large database of tagged fashion images and use these examples to parse the query. Our approach combines parsing from: pre-trained global clothing models, local clothing models learned on the fly from retrieved examples, and transferred parse masks (paper doll item transfer) from retrieved examples. Experimental evaluation shows that our approach significantly outperforms state of the art in parsing accuracy.

2 0.63274986 8 iccv-2013-A Deformable Mixture Parsing Model with Parselets

Author: Jian Dong, Qiang Chen, Wei Xia, Zhongyang Huang, Shuicheng Yan

Abstract: In this work, we address the problem of human parsing, namely partitioning the human body into semantic regions, by using the novel Parselet representation. Previous works often consider solving the problem of human pose estimation as the prerequisite of human parsing. We argue that these approaches cannot obtain optimal pixel level parsing due to the inconsistent targets between these tasks. In this paper, we propose to use Parselets as the building blocks of our parsing model. Parselets are a group of parsable segments which can generally be obtained by low-level over-segmentation algorithms and bear strong semantic meaning. We then build a Deformable Mixture Parsing Model (DMPM) for human parsing to simultaneously handle the deformation and multi-modalities of Parselets. The proposed model has two unique characteristics: (1) the possible numerous modalities of Parselet ensembles are exhibited as the “And-Or” structure of sub-trees; (2) to further solve the practical problem of Parselet occlusion or absence, we directly model the visibility property at some leaf nodes. The DMPM thus directly solves the problem of human parsing by searching for the best graph configuration from a pool of Parselet hypotheses without intermediate tasks. Comprehensive evaluations demonstrate the encouraging performance of the proposed approach.

3 0.62756282 3 iccv-2013-3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval

Author: Yen-Liang Lin, Cheng-Yu Huang, Hao-Jeng Wang, Winston Hsu

Abstract: We propose a 3D sub-query expansion approach for boosting sketch-based multi-view image retrieval. The core idea of our method is to automatically convert two (guided) 2D sketches into an approximated 3D sketch model, and then generate multi-view sketches as expanded sub-queries to improve the retrieval performance. To learn the weights among synthesized views (sub-queries), we present a new multi-query feature to model the similarity between subqueries and dataset images, and formulate it into a convex optimization problem. Our approach shows superior performance compared with the state-of-the-art approach on a public multi-view image dataset. Moreover, we also conduct sensitivity tests to analyze the parameters of our approach based on the gathered user sketches.

4 0.59782225 446 iccv-2013-Visual Semantic Complex Network for Web Images

Author: Shi Qiu, Xiaogang Wang, Xiaoou Tang

Abstract: This paper proposes modeling the complex web image collections with an automatically generated graph structure called visual semantic complex network (VSCN). The nodes on this complex network are clusters of images with both visual and semantic consistency, called semantic concepts. These nodes are connected based on the visual and semantic correlations. Our VSCN with 33,240 concepts is generated from a collection of 10 million web images. A great deal of valuable information on the structures of the web image collections can be revealed by exploring the VSCN, such as the small-world behavior, concept community, indegree distribution, hubs, and isolated concepts. It not only helps us better understand the web image collections at a macroscopic level, but also has many important practical applications. This paper presents two application examples: content-based image retrieval and image browsing. Experimental results show that the VSCN leads to significant improvement on both the precision of image retrieval (over 200%) and user experience for image browsing.

5 0.57733953 449 iccv-2013-What Do You Do? Occupation Recognition in a Photo via Social Context

Author: Ming Shao, Liangyue Li, Yun Fu

Abstract: In this paper, we investigate the problem of recognizing occupations of multiple people with arbitrary poses in a photo. Previous work utilizing single person’s nearly frontal clothing information and fore/background context preliminarily proves that occupation recognition is computationally feasible in computer vision. However, in practice, multiple people with arbitrary poses are common in a photo, and recognizing their occupations is even more challenging. We argue that with appropriately built visual attributes, co-occurrence, and spatial configuration model that is learned through structure SVM, we can recognize multiple people’s occupations in a photo simultaneously. To evaluate our method’s performance, we conduct extensive experiments on a new well-labeled occupation database with 14 representative occupations and over 7K images. Results on this database validate our method’s effectiveness and show that occupation recognition is solvable in a more general case.

6 0.56887436 334 iccv-2013-Query-Adaptive Asymmetrical Dissimilarities for Visual Object Retrieval

7 0.55923992 148 iccv-2013-Example-Based Facade Texture Synthesis

8 0.55630273 266 iccv-2013-Mining Multiple Queries for Image Retrieval: On-the-Fly Learning of an Object-Specific Mid-level Representation

9 0.55345607 311 iccv-2013-Pedestrian Parsing via Deep Decompositional Network

10 0.54023957 368 iccv-2013-SYM-FISH: A Symmetry-Aware Flip Invariant Sketch Histogram Shape Descriptor

11 0.5313493 273 iccv-2013-Monocular Image 3D Human Pose Estimation under Self-Occlusion

12 0.52702731 225 iccv-2013-Joint Segmentation and Pose Tracking of Human in Natural Videos

13 0.52172226 375 iccv-2013-Scene Collaging: Analysis and Synthesis of Natural Images with Semantic Layers

14 0.51869142 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval

15 0.4897871 118 iccv-2013-Discovering Object Functionality

16 0.48906127 330 iccv-2013-Proportion Priors for Image Sequence Segmentation

17 0.48216256 191 iccv-2013-Handling Uncertain Tags in Visual Recognition

18 0.47608489 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose

19 0.47477463 234 iccv-2013-Learning CRFs for Image Parsing with Adaptive Subgradient Descent

20 0.47050202 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.115), (12, 0.014), (26, 0.118), (31, 0.041), (34, 0.017), (40, 0.014), (41, 0.017), (42, 0.087), (52, 0.24), (64, 0.034), (73, 0.032), (89, 0.147), (98, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81078994 306 iccv-2013-Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items

Author: Kota Yamaguchi, M. Hadi Kiapour, Tamara L. Berg

Abstract: Clothing recognition is an extremely challenging problem due to wide variation in clothing item appearance, layering, and style. In this paper, we tackle the clothing parsing problem using a retrieval based approach. For a query image, we find similar styles from a large database of tagged fashion images and use these examples to parse the query. Our approach combines parsing from: pre-trained global clothing models, local clothing models learned on the fly from retrieved examples, and transferred parse masks (paper doll item transfer) from retrieved examples. Experimental evaluation shows that our approach significantly outperforms state of the art in parsing accuracy.

2 0.71066797 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection

Author: Matthijs Douze, Jérôme Revaud, Cordelia Schmid, Hervé Jégou

Abstract: This paper makes two complementary contributions to event retrieval in large collections of videos. First, we propose hyper-pooling strategies that encode the frame descriptors into a representation of the video sequence in a stable manner. Our best choices compare favorably with regular pooling techniques based on k-means quantization. Second, we introduce a technique to improve the ranking. It can be interpreted either as a query expansion method or as a similarity adaptation based on the local context of the query video descriptor. Experiments on public benchmarks show that our methods are complementary and improve event retrieval results, without sacrificing efficiency.

3 0.68528068 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation

Author: Suyog Dutt Jain, Kristen Grauman

Abstract: The mode of manual annotation used in an interactive segmentation algorithm affects both its accuracy and ease-of-use. For example, bounding boxes are fast to supply, yet may be too coarse to get good results on difficult images; freehand outlines are slower to supply and more specific, yet they may be overkill for simple images. Whereas existing methods assume a fixed form of input no matter the image, we propose to predict the tradeoff between accuracy and effort. Our approach learns whether a graph cuts segmentation will succeed if initialized with a given annotation mode, based on the image’s visual separability and foreground uncertainty. Using these predictions, we optimize the mode of input requested on new images a user wants segmented. Whether given a single image that should be segmented as quickly as possible, or a batch of images that must be segmented within a specified time budget, we show how to select the easiest modality that will be sufficiently strong to yield high quality segmentations. Extensive results with real users and three datasets demonstrate the impact.

4 0.68231988 414 iccv-2013-Temporally Consistent Superpixels

Author: Matthias Reso, Jörn Jachalsky, Bodo Rosenhahn, Jörn Ostermann

Abstract: Superpixel algorithms represent a very useful and increasingly popular preprocessing step for a wide range of computer vision applications, as they offer the potential to boost efficiency and effectiveness. In this regard, this paper presents a highly competitive approach for temporally consistent superpixels for video content. The approach is based on energy-minimizing clustering utilizing a novel hybrid clustering strategy for a multi-dimensional feature space working in a global color subspace and local spatial subspaces. Moreover, a new contour evolution based strategy is introduced to ensure spatial coherency of the generated superpixels. For a thorough evaluation the proposed approach is compared to state of the art supervoxel algorithms using established benchmarks and shows a superior performance.

5 0.68076849 180 iccv-2013-From Where and How to What We See

Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath

Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.

6 0.68074858 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification

7 0.67986202 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning

8 0.67970854 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition

9 0.67827564 383 iccv-2013-Semi-supervised Learning for Large Scale Image Cosegmentation

10 0.67795622 404 iccv-2013-Structured Forests for Fast Edge Detection

11 0.67712718 153 iccv-2013-Face Recognition Using Face Patch Networks

12 0.67682689 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions

13 0.67645788 245 iccv-2013-Learning a Dictionary of Shape Epitomes with Applications to Image Labeling

14 0.67594379 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction

15 0.67573106 150 iccv-2013-Exemplar Cut

16 0.67569274 20 iccv-2013-A Max-Margin Perspective on Sparse Representation-Based Classification

17 0.67566526 374 iccv-2013-Salient Region Detection by UFO: Uniqueness, Focusness and Objectness

18 0.67515498 6 iccv-2013-A Convex Optimization Framework for Active Learning

19 0.6751501 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification

20 0.67483002 406 iccv-2013-Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time