iccv iccv2013 iccv2013-451 knowledge-graph by maker-knowledge-mining

451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions


Source: pdf

Author: Mohamed Elhoseiny, Babak Saleh, Ahmed Elgammal

Abstract: The main question we address in this paper is how to use purely textual description of categories with no training images to learn visual classifiers for these categories. We propose an approach for zero-shot learning of object categories where the description of unseen categories comes in the form of typical text such as an encyclopedia entry, without the need for explicitly defined attributes. We propose and investigate two baseline formulations, based on regression and domain adaptation. Then, we propose a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the classifier parameters for new classes. We applied the proposed approach to two fine-grained categorization datasets, and the results indicate successful classifier prediction.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: The main question we address in this paper is how to use purely textual description of categories with no training images to learn visual classifiers for these categories. [sent-4, score-1.08]

2 We propose an approach for zero-shot learning of object categories where the description of unseen categories comes in the form of typical text such as an encyclopedia entry, without the need for explicitly defined attributes. [sent-5, score-0.605]

3 Then, we propose a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the classifier parameters for new classes. [sent-7, score-0.425]

4 Fine-grained categorization, for example building classifiers for different bird species or flower types (there are an estimated 10,000+ living bird species, and similarly many flower types). [sent-15, score-0.489]

5 This even motivated some recent work on zero-shot learning of visual categories where there are no training images available for test categories (unseen classes), e. [sent-20, score-0.244]

6 Such approaches exploit the similarity (visual or semantic) between seen classes and unseen ones, or describe unseen classes in terms of a learned vocabulary of semantic visual attributes. [sent-23, score-0.943]

7 In contrast to the lack of reasonable-size training sets for a large number of real-world categories, there is an abundance of textual descriptions of these categories. [sent-24, score-0.681]

8 The main question we address in this paper is how to use purely textual description of categories with no training images to learn visual classifiers for these categories. [sent-27, score-1.08]

9 In other words, we aim at zero-shot learning of object categories where the description of unseen categories comes in the form of typical text such as an encyclopedia entry. [sent-28, score-0.605]

10 We explicitly address the question of how to automatically decide which information to transfer between classes without the need for human intervention. [sent-29, score-0.311]

11 Similar to the setting of zero-shot learning, we use classes with training data (seen classes) to predict classifiers for classes with no training data (unseen classes). [sent-31, score-0.556]

12 Recent works on zero-shot learning of object categories focused on leveraging knowledge about common attributes and shared parts [17]. [sent-32, score-0.264]

13 Typically, attributes [28, 7] are manually defined by humans and are used to transfer knowledge between seen and unseen classes. [sent-33, score-0.504]

14 The description of a new category is purely textual and the process is totally automatic without human annotation beyond the category labels. [sent-35, score-0.881]

15 We learn from an image corpus and a textual corpus, but not in the form of image-caption pairs; instead, the only alignment between the corpora is at the level of the category. [sent-37, score-0.677]

16 We propose and investigate two baseline formulations based on regression and domain adaptation. [sent-38, score-0.233]

17 The goal is to estimate new classifier parameters given only a textual description; the formulation combines a regression function and a knowledge transfer function with additional constraints to solve the problem. [sent-42, score-0.975]

18 In general, knowledge transfer aims at enhancing recognition by exploiting shared knowledge between classes. [sent-48, score-0.266]

19 Most existing research focused on knowledge sharing within the visual domain only, e.g. [sent-49, score-0.273]

20 We explore how knowledge from the visual and textual domains can be used to learn cross-domain correlation, which facilitates prediction of visual classifiers from textual descriptions. [sent-55, score-1.576]

21 Motivated by the practical need to learn visual classifiers of rare categories, researchers have explored approaches for learning from a single image (one-shot learning [18, 9, 11, 2]) or even from no images (zero-shot learning). [sent-56, score-0.274]

22 One way of recognizing object instances from previously unseen test categories (the zero-shot learning problem) is by leveraging knowledge about common attributes and shared parts. [sent-57, score-0.457]

23 Typically an intermediate semantic layer is introduced to enable sharing knowledge between classes and facilitate describing knowledge about novel unseen classes, e. [sent-58, score-0.576]

24 For instance, given adequately labeled training data, one can learn classifiers for the attributes occurring in the training object categories. [sent-61, score-0.297]

25 Such attribute-based “knowledge transfer” approaches use an intermediate visual attribute representation to enable describing unseen object categories. [sent-64, score-0.244]

26 Therefore, an unseen category has to be specified in terms of the used vocabulary of attributes. [sent-68, score-0.293]

27 The description of a new category is purely textual. [sent-73, score-0.242]

28 [1], which showed that learning a joint distribution of words and visual elements facilitates clustering the images in a semantic way, generating illustrative images from a caption, and generating annotations for novel images. [sent-81, score-0.234]

29 There has been an increasing recent interest in the intersection between computer vision and natural language processing, with research that focuses on generating textual descriptions of images and videos, e.g. [sent-82, score-0.773]

30 In terms of the goal, we do not target generating textual descriptions from images; instead, we target predicting classifiers from text in a zero-shot setting. [sent-87, score-0.917]

31 In terms of the learning setting, the textual descriptions that we use are at the level of the category and do not come in the form of image-caption pairs, as in typical datasets used for text generation from images, e.g. [sent-88, score-0.808]

32 The information in our problem comes from two different domains: the visual domain and the textual domain, denoted by V and T , respectively. [sent-93, score-0.743]

33 Let us consider a typical binary linear classifier in the feature space of the form fk(x) = ckT · x, where x is the visual feature vector amended with a 1, and ck ∈ Rdv is the linear classifier parameter vector for class k. [sent-97, score-0.419]

34 Given a test image, its class is determined by l* = argmax_k fk(x). Our goal is to be able to predict a classifier for a new category based only on the learned classes and a textual description(s) of that category. [sent-98, score-1.022]
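
To make the prediction rule above concrete, here is a minimal NumPy sketch (not taken from the paper's code; names and shapes are illustrative) of one-vs-all prediction, assuming each visual feature vector is already amended with a trailing 1:

```python
import numpy as np

def predict_class(x, C):
    """One-vs-all prediction: l* = argmax_k f_k(x) with f_k(x) = c_k^T x.

    x : (d_v,) visual feature vector, already amended with a trailing 1.
    C : (K, d_v) matrix whose k-th row holds the classifier parameters c_k.
    """
    scores = C @ x                 # f_k(x) for every seen (or predicted) class k
    return int(np.argmax(scores))  # index l* of the winning class
```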

35 In order to achieve that, the learning process has to also include textual descriptions of the seen classes (as shown in Fig 1). [sent-99, score-0.956]

36 Depending on the domain, we might find a few, a couple, or as few as one textual description for each class. [sent-100, score-0.817]

37 We denote the textual training data for class k by {ti ∈ T }k. [sent-101, score-0.652]

38 In this paper we assume we are dealing with the extreme case of having only one textual description available per class, which makes the problem even more challenging. [sent-102, score-0.696]

39 However, the formulation we propose in this paper directly applies to the case of multiple textual descriptions per class. [sent-103, score-0.685]

40 Similar to the visual domain, the raw textual descriptions have to go through a feature extraction process, which will be described in Sec 5. [sent-104, score-0.699]

41 Let us denote the extracted textual features by T = {tk ∈ Rdt}, k = 1···Nsc. [sent-105, score-0.571]

42 A typical regression model, such as ridge regression [13] or Gaussian Process (GP) regression [24], learns a regressor for each dimension of the output domain (the parameters of a linear classifier) separately, i.e. [sent-114, score-0.38]
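
As a hedged illustration of this baseline (the paper's exact regressor, kernel, and hyper-parameters are not given in this summary), a closed-form ridge regression from textual features to classifier parameters; because the design matrix is shared, solving all output dimensions at once is equivalent to fitting each classifier dimension separately:

```python
import numpy as np

def ridge_regress_classifiers(T, C, lam=1.0):
    """Fit a ridge regressor from textual features to classifier parameters.

    T   : (N_sc, d_t) textual features of the seen classes.
    C   : (N_sc, d_v) linear classifier parameters of the seen classes.
    lam : ridge regularization weight (hypothetical value).

    Each column of C (each classifier dimension) is effectively regressed on T
    independently, which is exactly the per-dimension behaviour noted above.
    """
    d_t = T.shape[1]
    A = np.linalg.solve(T.T @ T + lam * np.eye(d_t), T.T @ C)  # (d_t, d_v)
    return A

# Prediction for a new textual description t_star of shape (d_t,):
# c_hat = t_star @ A
```

The TGP baseline mentioned in the results presumably fills the same role with a structured regressor instead of this independent per-dimension fit.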

43 Instead, a structured prediction regressor would be more suitable, since it would learn the correlation between the input and output domains. [sent-119, score-0.258]

44 However, even a structured prediction model will only learn the correlation between the textual and visual domains through the information available in the input-output pairs (tk, ck). [sent-120, score-0.888]

45 Here the visual domain information is encapsulated in the pre-learned classifiers and prediction does not have access to the original data in the visual domain. [sent-121, score-0.397]

46 Instead, we need to directly learn the correlation between the visual and textual domains and use that for prediction. [sent-122, score-0.836]

47 Another fundamental problem that a regressor would face is the sparsity of the data; the data points are the textual description-classifier pairs, and typically the number of classes can be very small compared to the dimension of the classifier space (i.e. [sent-123, score-0.954]

48 Knowledge Transfer Models: An alternative formulation is to pose the problem as domain adaptation from the textual to the visual domain. [sent-133, score-0.82]

49 In the computer vision context, domain adaptation work has focused on transferring categories learned from a source domain, with a given distribution of images, to a target domain with different distribution, e. [sent-134, score-0.374]

50 What we need is an approach that learns the correlation between the textual domain features and the visual domain features, and uses that correlation to predict a new visual classifier given textual features. [sent-137, score-1.746]

51 The approach was applied to transfer learned categories between different data distributions, both in the visual domain. [sent-140, score-0.255]

52 A particularly attractive characteristic of [15], compared to other domain adaptation models, is that the source and target domains do not have to share the same feature spaces or the same dimensionality. [sent-141, score-0.223]

53 The learned transfer function acts as a compatibility function between the textual features and the visual features, giving high values if they are from the same class and low values if they are from different classes. [sent-147, score-0.67]

54 Given a textual feature t∗ and a test image, represented by x, a classification decision can be obtained by tTWx ≷ b where b is a decision boundary which can be set to (l + u)/2. [sent-149, score-0.571]

55 Hence, our desired predicted classifier in Eq 1 can be obtained as c(t∗) = t∗TW (note that the feature vectors are amended with ones). [sent-150, score-0.216]
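
A small sketch of how the learned transfer matrix W would then be used at test time, following the two expressions above; the function names and the bounds l and u are illustrative assumptions:

```python
import numpy as np

def classifier_from_text(t_star, W):
    """Predicted classifier c(t*) = t*^T W (feature vectors amended with ones)."""
    return t_star @ W                      # shape (d_v,)

def decide(t_star, x, W, l, u):
    """Classification decision t*^T W x >< b with b = (l + u) / 2."""
    b = (l + u) / 2.0
    return float(t_star @ W @ x) > b       # True if the image matches the description
```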

56 However, since learning W was done over seen classes only, it is not clear how the predicted classifier c(t∗) will behave for unseen classes. [sent-151, score-0.607]

57 There is no guarantee that such a classifier will put all the seen data on one side and the new unseen class on the other side of that hyperplane. [sent-152, score-0.469]

58 Objective Function: The proposed formulation aims at predicting the hyperplane parameter c of a one-vs-all classifier for a new unseen class, given a textual description encoded by t and knowledge learned at the training phase from seen classes. [sent-156, score-1.278]

59 At the training phase, three components are learned. Classifiers: a set of one-vs-all classifiers {ck} are learned, one for each seen class. [sent-158, score-0.255]

60 Probabilistic Regressor: Given {(tk, ck)}, a regressor is learned that can be used to give a prior estimate preg(c|t) (details in Sec 4). [sent-159, score-0.254]

61 Domain Transfer Function: Given T and V, a domain transfer function, encoded in the matrix W, is learned, which captures the correlation between the textual and visual domains (details in Sec 4). [sent-161, score-0.977]

62 The question is how to combine such knowledge to predict a new classifier given a textual description. [sent-164, score-0.807]

63 The new classifier has to put all the seen instances at one side of the hyperplane, and has to be consistent with the learned domain transfer function. [sent-166, score-0.458]

64 The second term enforces that the predicted classifier has high correlation with tTW. [sent-175, score-0.214]

65 The constraints −cTxi ≥ ζi enforce all the seen data instances to be on the negative side of the predicted classifier hyperplane, with some misclassification allowed through the slack variables ζi. [sent-177, score-0.337]

66 The constraint t∗TWc ≥ l enforces that the correlation between the predicted classifier and t∗TW is no less than l; this enforces a minimum correlation between the text and visual features. [sent-178, score-0.279]
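
Putting the pieces described above together, a schematic and heavily simplified sketch of the constrained prediction step; the paper's exact objective is not reproduced in this summary, so the trade-off weights, the slack-penalty form, and the use of the regressor mean as the prior are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def predict_unseen_classifier(t_star, W, X_seen, c_prior,
                              alpha=1.0, beta=1.0, gamma=1.0, l=0.0):
    """Schematic constrained prediction of c for an unseen class.

    t_star  : (d_t,) textual feature of the unseen class.
    W       : (d_t, d_v) learned domain transfer matrix.
    X_seen  : (N, d_v) visual features of seen-class instances (trailing 1 included).
    c_prior : (d_v,) prior estimate of c from the probabilistic regressor.
    """
    tW = t_star @ W                                     # direction t*^T W

    def objective(c):
        slack = np.maximum(0.0, X_seen @ c).sum()       # seen data should satisfy -c^T x_i >= 0
        return (c @ c                                   # regularizer
                - alpha * (tW @ c)                      # correlation with t*^T W
                + beta * np.sum((c - c_prior) ** 2)     # stay close to the regressor prior
                + gamma * slack)                        # penalty standing in for the slack variables

    cons = [{"type": "ineq", "fun": lambda c: tW @ c - l}]   # t*^T W c >= l
    res = minimize(objective, x0=c_prior, constraints=cons, method="SLSQP")
    return res.x
```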

67 Domain Transfer Function To learn the domain transfer function W we adapted the approach in [15] as follows. [sent-181, score-0.266]

68 Let T be the textual feature data matrix and X be the visual feature data matrix where each feature vector is amended with a 1. [sent-182, score-0.684]
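
The adaptation of [15] is not reproduced in this summary, so the following is only a toy stand-in that conveys the idea of a compatibility matrix: W is fit so that t^T W x is high for same-class (text, image) pairs and low otherwise, via regularized least squares over the vectorized W. It is practical only for small feature dimensionalities:

```python
import numpy as np

def learn_transfer_matrix(T, X, labels_t, labels_x, lam=1.0):
    """Toy compatibility-matrix learner (a simplified stand-in, not the method of [15]).

    T, X     : (N_t, d_t) textual and (N_x, d_v) visual feature matrices (with trailing 1s).
    labels_t : class label of each textual description.
    labels_x : class label of each image.
    """
    pairs, targets = [], []
    for t, yt in zip(T, labels_t):
        for x, yx in zip(X, labels_x):
            pairs.append(np.outer(t, x).ravel())         # t^T W x = vec(W) . vec(t x^T)
            targets.append(1.0 if yt == yx else -1.0)    # high for same class, low otherwise
    P = np.asarray(pairs)                                 # (num_pairs, d_t * d_v)
    y = np.asarray(targets)
    w = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ y)
    return w.reshape(T.shape[1], X.shape[1])              # W of shape (d_t, d_v)
```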

69 Probabilistic Regressor: There are different regressors that can be used; however, we need a regressor that provides a probabilistic estimate preg(c|t). [sent-197, score-0.254]

70 However, it has the advantage of handling the dependency between the dimensions of the classifiers c given the textual features t. [sent-216, score-0.693]

71 Datasets: We used the CU200 Birds [32] (200 classes, 6033 images) and the Oxford Flower-102 [20] (102 classes, 8189 images) image datasets to test our approach, since they are among the largest and most widely used fine-grained datasets. [sent-232, score-0.332]

72 We generate textual descriptions for each class in both datasets. [sent-233, score-0.696]

73 The CU200 Birds image dataset was created based on birds that have a corresponding Wikipedia article, so we have developed a tool to automatically extract Wikipedia articles given the class name. [sent-234, score-0.261]

74 On the other hand, the Flower image dataset was not created using the same criteria as the Bird dataset, so classes of the Flower dataset do not necessarily have a corresponding Wikipedia article. [sent-237, score-0.332]

75 The tool managed to generate articles for about 16 of the 102 classes from Wikipedia; the remaining 86 articles were collected manually for each class from Wikipedia, the Plant Database, the Plant Encyclopedia, and BBC articles. [sent-238, score-0.422]

76 We plan to make the extracted textual descriptions available as augmentations of these datasets. [sent-239, score-0.696]

77 Sample textual descriptions can be found in the supplementary material. [sent-240, score-0.696]

78 Extracting Textual Features: The textual features were extracted in two phases, which are typical in the document retrieval literature. [sent-243, score-0.629]

79 The first phase is an indexing phase that generates textual features with a tf-idf (Term Frequency-Inverse Document Frequency) configuration (term frequency as the local weighting and inverse document frequency as the global weighting). [sent-244, score-0.827]

80 We used the normalized frequency of a term in the given textual description [29]. [sent-247, score-0.737]

81 In the Flower Dataset, the tf-idf features are in R8875 and after CLSI the final textual features are in R102. [sent-251, score-0.571]

82 In the Birds Dataset, the tf-idf features are in R7086 and after CLSI the final textual features are in R200. [sent-252, score-0.571]
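
A hedged sketch of such a textual feature pipeline with scikit-learn; plain truncated SVD (LSA) is used here only as a stand-in for the CLSI step, which is not available off the shelf, and the target dimensionality is indicative of the numbers quoted above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def extract_textual_features(class_documents, n_components=102):
    """tf-idf indexing followed by dimensionality reduction.

    class_documents : list of strings, one textual description per class.
    n_components    : target dimensionality (e.g. 102 for Flower, 200 for Birds).
    """
    tfidf = TfidfVectorizer(norm="l2", sublinear_tf=False)    # normalized term frequencies
    X = tfidf.fit_transform(class_documents)                  # (N_classes, vocab_size), sparse
    k = min(n_components, X.shape[0] - 1, X.shape[1] - 1)     # keep the SVD well defined
    svd = TruncatedSVD(n_components=k)                        # LSA stand-in for CLSI
    return svd.fit_transform(X)                               # (N_classes, k)
```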

83 Classeme features are the outputs of a set of classifiers corresponding to a set of C category labels, which are drawn from an appropriate term list defined in [31] and are not related to our textual features. [sent-263, score-0.761]

84 In the zero-shot learning setting, the test data from the seen classes are typically very large compared to those from the unseen classes. [sent-274, score-0.335]

85 This makes other measures, such as accuracy, useless since high accuracy can be obtained even if all the unseen class test data are wrongly classified; hence we used ROC curves, which are independent of this problem. [sent-275, score-0.241]

86 Five-fold cross validation over the classes was performed, where in each fold 4/5 of the classes are considered "seen classes" and are used for training, and 1/5 of the classes are considered "unseen classes", whose classifiers are predicted and tested. [sent-276, score-0.703]

87 Within each of these class-folds, the data of the seen classes are further split into training and test sets. [sent-277, score-0.259]
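
A brief sketch of this evaluation protocol, splitting at the level of classes rather than images; fold construction details such as shuffling and the random seed are assumptions:

```python
import numpy as np

def class_folds(class_ids, n_folds=5, seed=0):
    """Yield (seen_classes, unseen_classes) splits for class-level cross validation."""
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.unique(class_ids))
    folds = np.array_split(classes, n_folds)
    for i in range(n_folds):
        unseen = set(folds[i].tolist())
        seen = set(classes.tolist()) - unseen
        yield seen, unseen

# Within each fold, images of the seen classes would be further split into training
# and test sets, and the classifiers predicted for the unseen classes from their
# textual descriptions would be scored with ROC curves / AUC.
```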

88 Baselines: Since our work is the first to predict classifiers based on pure textual description, there are no other reported results to compare against. [sent-281, score-0.729]

89 Figure 4: AUC of the predicted classifiers for all classes of the Flower dataset. Since the baselines share components with our formulation, we need to test if the formulation is making any improvement over them. [sent-305, score-0.534]

90 GPR performed poorly on all classes in both datasets, which was expected since it is not a structured prediction approach. [sent-308, score-0.218]

91 The DA formulation outperformed TGP on the Flower dataset but slightly underperformed on the Bird dataset. [sent-309, score-0.246]

92 The proposed approach outperformed all the baselines on both datasets, with a significant difference on the Flower dataset. [sent-310, score-0.257]

93 Fig 3 shows the ROC curves for our approach on the best predicted unseen classes, from the Birds dataset on the left and the Flower dataset in the middle. [sent-312, score-0.409]

94 Table 2 shows the percentage of classes for which our approach makes a prediction improvement over each of the three baselines. [sent-316, score-0.218]

95 Table 3 shows the five classes in Flower ... Table 2: Percentage of classes for which the proposed approach improves prediction over each baseline (TGP, DA, GPR), relative to the total number of classes in each dataset (Flowers: 102, Birds: 200).

96 To evaluate the effect of the constraints in the objective function, we removed the constraints −(cTxi) ≥ ζi, which try to enforce all the seen examples to be on the negative side of the predicted classifier hyperplane, and evaluated the approach. [sent-326, score-0.295]

97 Conclusion and Future Work: We explored the problem of predicting visual classifiers from textual descriptions of classes with no training images. [sent-337, score-1.119]

98 We proposed a novel formulation that captures information between the visual and textual domains by involving knowledge transfer from textual features to visual features, which indirectly leads to predicting the visual classifier described by the text. [sent-339, score-1.724]

99 Furthermore, we will study predicting classifiers from complex-structured textual features. [sent-341, score-0.744]

100 Learning to detect unseen object classes by between-class attribute transfer. [sent-468, score-0.359]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('textual', 0.571), ('tgp', 0.235), ('flower', 0.209), ('unseen', 0.193), ('auc', 0.174), ('classes', 0.166), ('preg', 0.141), ('description', 0.125), ('classifiers', 0.122), ('domain', 0.121), ('clsi', 0.117), ('regressor', 0.113), ('transfer', 0.112), ('birds', 0.109), ('kc', 0.109), ('articles', 0.104), ('classifier', 0.104), ('nsc', 0.096), ('sec', 0.084), ('kt', 0.084), ('bird', 0.079), ('classeme', 0.077), ('descriptions', 0.077), ('attributes', 0.076), ('wikipedia', 0.073), ('regression', 0.073), ('encyclopedia', 0.069), ('category', 0.068), ('fig', 0.066), ('categories', 0.063), ('knowledge', 0.063), ('amended', 0.062), ('rdt', 0.062), ('domains', 0.062), ('seen', 0.06), ('wordnet', 0.06), ('correlation', 0.06), ('document', 0.058), ('text', 0.058), ('semantic', 0.053), ('prediction', 0.052), ('lnp', 0.052), ('twin', 0.052), ('predicting', 0.051), ('visual', 0.051), ('ck', 0.05), ('predicted', 0.05), ('purely', 0.049), ('hyperplane', 0.049), ('baselines', 0.048), ('generating', 0.048), ('class', 0.048), ('ctxi', 0.047), ('kernalized', 0.047), ('rdv', 0.047), ('saleh', 0.047), ('ttw', 0.047), ('ttwc', 0.047), ('ttwx', 0.047), ('twc', 0.047), ('eq', 0.046), ('da', 0.045), ('tk', 0.045), ('roc', 0.042), ('tsot', 0.042), ('gpr', 0.042), ('bfgs', 0.042), ('frequency', 0.041), ('adaptation', 0.04), ('phase', 0.04), ('formulations', 0.039), ('sharing', 0.038), ('corpus', 0.038), ('formulation', 0.037), ('gp', 0.037), ('tfidf', 0.036), ('regularizer', 0.036), ('predict', 0.036), ('asymmetric', 0.035), ('saenko', 0.035), ('std', 0.035), ('rhe', 0.035), ('corpora', 0.035), ('learning', 0.034), ('rohrbach', 0.033), ('learn', 0.033), ('question', 0.033), ('training', 0.033), ('kx', 0.032), ('vocabulary', 0.032), ('side', 0.032), ('farhadi', 0.031), ('article', 0.031), ('sentences', 0.03), ('linguistic', 0.03), ('kulkarni', 0.029), ('language', 0.029), ('learned', 0.029), ('barnard', 0.029), ('shared', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions

Author: Mohamed Elhoseiny, Babak Saleh, Ahmed Elgammal

Abstract: The main question we address in this paper is how to use purely textual description of categories with no training images to learn visual classifiers for these categories. We propose an approach for zero-shot learning of object categories where the description of unseen categories comes in the form of typical text such as an encyclopedia entry, without the need for explicitly defined attributes. We propose and investigate two baseline formulations, based on regression and domain adaptation. Then, we propose a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the classifier parameters for new classes. We applied the proposed approach to two fine-grained categorization datasets, and the results indicate successful classifier prediction.

2 0.15802421 123 iccv-2013-Domain Adaptive Classification

Author: Fatemeh Mirrashed, Mohammad Rastegari

Abstract: We propose an unsupervised domain adaptation method that exploits intrinsic compact structures of categories across different domains using binary attributes. Our method directly optimizes for classification in the target domain. The key insight is finding attributes that are discriminative across categories and predictable across domains. We achieve a performance that significantly exceeds the state-of-the-art results on standard benchmarks. In fact, in many cases, our method reaches the same-domain performance, the upper bound, in unsupervised domain adaptation scenarios.

3 0.14685336 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei

Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-ofthe-art method on the TRECVID-MED11 event kit, despite weaker supervision.

4 0.1354842 428 iccv-2013-Translating Video Content to Natural Language Descriptions

Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele

Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.

5 0.13298358 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes

Author: Sukrit Shankar, Joan Lasenby, Roberto Cipolla

Abstract: Relative (comparative) attributes are promising for thematic ranking of visual entities, which also aids in recognition tasks [19, 23]. However, attribute rank learning often requires a substantial amount of relational supervision, which is highly tedious, and apparently impractical for real-world applications. In this paper, we introduce the Semantic Transform, which, under minimal supervision, adaptively finds a semantic feature space along with a class ordering that is related in the best possible way. Such a semantic space is found for every attribute category. To relate the classes under weak supervision, the class ordering needs to be refined according to a cost function in an iterative procedure. This problem is ideally NP-hard, and we thus propose a constrained search tree formulation for the same. Driven by the adaptive semantic feature space representation, our model achieves the best results to date for all of the tasks of relative, absolute and zero-shot classification on two popular datasets.

6 0.12898374 452 iccv-2013-YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

7 0.12172323 246 iccv-2013-Learning the Visual Interpretation of Sentences

8 0.12126986 53 iccv-2013-Attribute Dominance: What Pops Out?

9 0.12031479 438 iccv-2013-Unsupervised Visual Domain Adaptation Using Subspace Alignment

10 0.11932766 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories

11 0.10927828 26 iccv-2013-A Practical Transfer Learning Algorithm for Face Verification

12 0.10780507 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction

13 0.10355724 431 iccv-2013-Unbiased Metric Learning: On the Utilization of Multiple Datasets and Web Images for Softening Bias

14 0.10340563 435 iccv-2013-Unsupervised Domain Adaptation by Domain Invariant Projection

15 0.10005351 202 iccv-2013-How Do You Tell a Blackbird from a Crow?

16 0.098427892 31 iccv-2013-A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects

17 0.097046077 190 iccv-2013-Handling Occlusions with Franken-Classifiers

18 0.095577262 233 iccv-2013-Latent Task Adaptation with Large-Scale Hierarchies

19 0.09287487 52 iccv-2013-Attribute Adaptation for Personalized Image Search

20 0.092417113 48 iccv-2013-An Adaptive Descriptor Design for Object Recognition in the Wild


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.189), (1, 0.14), (2, -0.034), (3, -0.108), (4, 0.079), (5, 0.015), (6, -0.028), (7, -0.053), (8, 0.053), (9, -0.019), (10, 0.058), (11, -0.106), (12, -0.016), (13, -0.053), (14, 0.06), (15, -0.092), (16, -0.065), (17, 0.008), (18, -0.006), (19, 0.012), (20, 0.047), (21, -0.01), (22, 0.007), (23, 0.07), (24, -0.018), (25, -0.047), (26, 0.039), (27, -0.079), (28, -0.053), (29, 0.004), (30, 0.087), (31, -0.129), (32, 0.026), (33, -0.061), (34, -0.11), (35, -0.06), (36, -0.046), (37, 0.031), (38, -0.115), (39, -0.033), (40, -0.021), (41, 0.031), (42, -0.057), (43, 0.074), (44, -0.02), (45, 0.02), (46, 0.069), (47, -0.03), (48, 0.033), (49, 0.08)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93546003 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions

Author: Mohamed Elhoseiny, Babak Saleh, Ahmed Elgammal

Abstract: The main question we address in this paper is how to use purely textual description of categories with no training images to learn visual classifiers for these categories. We propose an approach for zero-shot learning of object categories where the description of unseen categories comes in the form of typical text such as an encyclopedia entry, without the need to explicitly defined attributes. We propose and investigate two baseline formulations, based on regression and domain adaptation. Then, we propose a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the classifier parameters for new classes. We applied the proposed approach on two fine-grained categorization datasets, and the results indicate successful classifier prediction.

2 0.80572045 428 iccv-2013-Translating Video Content to Natural Language Descriptions

Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele

Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.

3 0.75739914 452 iccv-2013-YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

Author: Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko

Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities “in-the-wild”. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to “fill in ” novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.

4 0.74588233 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories

Author: Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg

Abstract: Entry-level categories, the labels people will use to name an object, were originally defined and studied by psychologists in the 1980s. In this paper we study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. Our models combine visual recognition predictions with proxies for word “naturalness” mined from the enormous amounts of text on the web. We demonstrate the usefulness of our models for predicting nouns (entry-level words) associated with images by people. We also learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval.

5 0.70230764 246 iccv-2013-Learning the Visual Interpretation of Sentences

Author: C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende

Abstract: Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. In this paper we learn the visual features that correspond to semantic phrases derived from sentences. Specifically, we extract predicate tuples that contain two nouns and a relation. The relation may take several forms, such as a verb, preposition, adjective or their combination. We model a scene using a Conditional Random Field (CRF) formulation where each node corresponds to an object, and the edges to their relations. We determine the potentials of the CRF using the tuples extracted from the sentences. We generate novel scenes depicting the sentences’ visual meaning by sampling from the CRF. The CRF is also used to score a set of scenes for a text-based image retrieval task. Our results show we can generate (retrieve) scenes that convey the desired semantic meaning, even when scenes (queries) are described by multiple sentences. Significant improvement is found over several baseline approaches.

6 0.66949958 248 iccv-2013-Learning to Rank Using Privileged Information

7 0.64324039 431 iccv-2013-Unbiased Metric Learning: On the Utilization of Multiple Datasets and Web Images for Softening Bias

8 0.63097924 181 iccv-2013-Frustratingly Easy NBNN Domain Adaptation

9 0.61974382 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields

10 0.61944056 123 iccv-2013-Domain Adaptive Classification

11 0.61744684 191 iccv-2013-Handling Uncertain Tags in Visual Recognition

12 0.5697698 202 iccv-2013-How Do You Tell a Blackbird from a Crow?

13 0.56270611 427 iccv-2013-Transfer Feature Learning with Joint Distribution Adaptation

14 0.5533179 44 iccv-2013-Adapting Classification Cascades to New Domains

15 0.55323529 332 iccv-2013-Quadruplet-Wise Image Similarity Learning

16 0.55095553 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes

17 0.5493241 109 iccv-2013-Detecting Avocados to Zucchinis: What Have We Done, and Where Are We Going?

18 0.54629278 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data

19 0.53083748 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes

20 0.52788061 438 iccv-2013-Unsupervised Visual Domain Adaptation Using Subspace Alignment


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.06), (7, 0.013), (12, 0.324), (13, 0.011), (26, 0.058), (31, 0.046), (34, 0.037), (42, 0.125), (64, 0.05), (73, 0.028), (89, 0.125), (98, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86733198 413 iccv-2013-Target-Driven Moire Pattern Synthesis by Phase Modulation

Author: Pei-Hen Tsai, Yung-Yu Chuang

Abstract: This paper investigates an approach for generating two grating images so that the moiré pattern of their superposition resembles the target image. Our method is grounded on the fundamental moiré theorem. By focusing on the visually most dominant (1, −1)-moiré component, we obtain the phase modulation constraint on the phase shifts between the two grating images. For improving the visual appearance of the grating images and the hiding capability of the embedded image, a smoothness term is added to spread information between the two grating images and an appearance phase function is used to add irregular structures into the grating images. The grating images can be printed on transparencies and the hidden image decoding can be performed optically by overlaying them together. The proposed method enables the creation of moiré art and allows visual decoding without computers.

2 0.83334434 305 iccv-2013-POP: Person Re-identification Post-rank Optimisation

Author: Chunxiao Liu, Chen Change Loy, Shaogang Gong, Guijin Wang

Abstract: Owing to visual ambiguities and disparities, person re-identification methods inevitably produce suboptimal rank lists, which still require exhaustive human eyeballing to identify the correct target from hundreds of different likely candidates. Existing re-identification studies focus on improving the ranking performance, but rarely look into the critical problem of optimising the time-consuming and error-prone post-rank visual search at the user end. In this study, we present a novel one-shot Post-rank OPtimisation (POP) method, which allows a user to quickly refine their search by either “one-shot” or a couple of sparse negative selections during a re-identification process. We conduct systematic behavioural studies to understand users' searching behaviour and show that the proposed method allows correct re-identification to converge 2.6 times faster than the conventional exhaustive search. Importantly, through extensive evaluations we demonstrate that the method is capable of achieving significant improvement over the state-of-the-art distance metric learning based ranking models, even with just “one shot” feedback optimisation, by as much as over 30% performance improvement for rank-1 re-identification on the VIPeR and i-LIDS datasets.

same-paper 3 0.76838493 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions

Author: Mohamed Elhoseiny, Babak Saleh, Ahmed Elgammal

Abstract: The main question we address in this paper is how to use purely textual description of categories with no training images to learn visual classifiers for these categories. We propose an approach for zero-shot learning of object categories where the description of unseen categories comes in the form of typical text such as an encyclopedia entry, without the need to explicitly defined attributes. We propose and investigate two baseline formulations, based on regression and domain adaptation. Then, we propose a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the classifier parameters for new classes. We applied the proposed approach on two fine-grained categorization datasets, and the results indicate successful classifier prediction.

4 0.6804955 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu

Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP) framework for low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.

5 0.67792135 299 iccv-2013-Online Video SEEDS for Temporal Window Objectness

Author: Michael Van_Den_Bergh, Gemma Roig, Xavier Boix, Santiago Manen, Luc Van_Gool

Abstract: Superpixel and objectness algorithms are broadly used as a pre-processing step to generate support regions and to speed-up further computations. Recently, many algorithms have been extended to video in order to exploit the temporal consistency between frames. However, most methods are computationally too expensive for real-time applications. We introduce an online, real-time video superpixel algorithm based on the recently proposed SEEDS superpixels. A new capability is incorporated which delivers multiple diverse samples (hypotheses) of superpixels in the same image or video sequence. The multiple samples are shown to provide a strong cue to efficiently measure the objectness of image windows, and we introduce the novel concept of objectness in temporal windows. Experiments show that the video superpixels achieve comparable performance to state-of-the-art offline methods while running at 30 fps on a single 2.8 GHz i7 CPU. State-of-the-art performance on objectness is also demonstrated, yet orders of magnitude faster and extended to temporal windows in video.

6 0.66693991 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels

7 0.61235023 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

8 0.60548979 338 iccv-2013-Randomized Ensemble Tracking

9 0.60329425 136 iccv-2013-Efficient Pedestrian Detection by Directly Optimizing the Partial Area under the ROC Curve

10 0.59167504 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?

11 0.58810252 124 iccv-2013-Domain Transfer Support Vector Ranking for Person Re-identification without Target Camera Label Information

12 0.58760786 428 iccv-2013-Translating Video Content to Natural Language Descriptions

13 0.58361697 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

14 0.57935959 452 iccv-2013-YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

15 0.57547933 180 iccv-2013-From Where and How to What We See

16 0.56825882 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation

17 0.56778568 399 iccv-2013-Spoken Attributes: Mixing Binary and Relative Attributes to Say the Right Thing

18 0.56430954 52 iccv-2013-Attribute Adaptation for Personalized Image Search

19 0.56277752 241 iccv-2013-Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection

20 0.56265461 61 iccv-2013-Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition