cvpr cvpr2013 cvpr2013-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Catherine Wah, Serge Belongie
Abstract: Recent work in computer vision has addressed zero-shot learning or unseen class detection, which involves categorizing objects without observing any training examples. However, these problems assume that attributes or defining characteristics of these unobserved classes are known, leveraging this information at test time to detect an unseen class. We address the more realistic problem of detecting categories that do not appear in the dataset in any form. We denote such a category as an unfamiliar class; it is neither observed at train time, nor do we possess any knowledge regarding its relationships to attributes. This problem is one that has received limited attention within the computer vision community. In this work, we propose a novel ap. ucs d .edu Unfamiliar? or?not? UERY?IMAGQ IMmFaAtgMechs?inIlLatsrA?inYRESg MFNaAotc?ihntIlraLsin?A YRgES UMNaotFc?hAinMltarsIinL?NIgAOR AKNTAWDNO ?Train g?imagesn U(se)alc?n)eSs(Long?bilCas n?a’t lrfyibuteIn?mfoartesixNearwter proach to the unfamiliar class detection task that builds on attribute-based classification methods, and we empirically demonstrate how classification accuracy is impacted by attribute noise and dataset “difficulty,” as quantified by the separation of classes in the attribute space. We also present a method for incorporating human users to overcome deficiencies in attribute detection. We demonstrate results superior to existing methods on the challenging CUB-200-2011 dataset.
Reference: text
sentIndex sentText sentNum sentScore
1 However, these problems assume that attributes or defining characteristics of these unobserved classes are known, leveraging this information at test time to detect an unseen class. [sent-3, score-0.423]
2 We denote such a category as an unfamiliar class; it is neither observed at train time, nor do we possess any knowledge regarding its relationships to attributes. [sent-5, score-0.922]
3 We also present a method for incorporating human users to overcome deficiencies in attribute detection. [sent-26, score-0.39]
4 Attributes can be used to describe and differentiate classes [18, 12, 16], and information about attributes and their values can be exploited in order to perform unseen class detection, where the goal is to categorize objects into classes for which we have no training examples (i. [sent-30, score-0.617]
5 Should the system then detect the presFigure 1: We address unfamiliar class detection, in which the goal is to predict if an input image belongs to an unfamiliar class (right query), defined by its lack of observed training examples and unknown class-attribute relationships. [sent-34, score-1.961]
6 Unfamiliar class detection is a challenging problem, given that the system must be able to distinguish between a difficult example of a known class and a truly unfamiliar class. [sent-37, score-1.114]
7 In this work, we study the related and more challenging problem of detecting unfamiliar classes. [sent-40, score-0.897]
8 The distinction between an unfamiliar class and an unseen class lies in the amount of knowledge regarding the category that is available to the system. [sent-41, score-1.183]
9 While examples belonging to unfamiliar and unseen classes are both unobserved during 7 7 7 7 7 7 97 797 training, unfamiliar classes lack entries in the matrix defining class-attribute associations; in contrast, an unseen class’ relationship to attributes is known. [sent-42, score-2.321]
10 The problem of unfamiliar class detection marks a departure from the predominant closed-set approach to visual categorization, in which the goal is to classify test examples as one of a fixed set of possible classes. [sent-45, score-1.045]
11 As such, detecting unfamiliar classes becomes a significant problem as recognition systems improve and are deployed in the wild. [sent-50, score-1.029]
12 While unseen class detection is similar in spirit, unfamiliar class detection is a more salient problem in practice, yet also one that has been widely overlooked in the computer vision community. [sent-61, score-1.202]
13 In a problem as challenging as detecting unfamiliar classes, it is important to have a means of incorporating humans into the loop, as human users can be engaged at test time to bring performance up to a desired level. [sent-65, score-1.032]
14 One class is kept as a seen class and the other class is deemed unfamiliar. [sent-72, score-0.38]
15 First, there is an indeterminate amount of error in attribute values that can arise from either attribute detectors or user labels. [sent-75, score-0.66]
16 Second, the number of differing attributes between classes quantifies how distinct the classes are; if the classes have low separation in the attribute space, the dataset will be more “diffi- cult” and less robust to attribute noise. [sent-76, score-1.204]
17 In Section 3, we formalize the unfamiliar class detection problem and describe our approach. [sent-81, score-1.004]
18 Related Work Recent work addressing the problem of zero-shot learning or unseen class detection has taken advantage of the generality of high-level attribute descriptions [18, 12, 16, 11, 25]. [sent-84, score-0.526]
19 Some methods treat zero-shot learning as a nearestneighbor problem, classifying a test image as the category with the most similar attribute description [12, 25]. [sent-85, score-0.356]
20 These characteristics can come in various forms, including binary attribute descriptions [25] or relative relationships between attributes [27]. [sent-87, score-0.525]
21 Of the limited works that do address unfamiliar class detection [12, 19], the focus is on basic-level categories, whereas we focus on detecting unfamiliar fine-grained categories. [sent-88, score-1.901]
22 However, these methods do not deal with unfamiliar classes that may be encountered at test time. [sent-90, score-1.015]
23 Other work directly addresses the novel category scenario in large-scale recognition by having a user provide a set of images belonging to an unfamiliar class [3]; the focus in this case is on retrieval, as the novelty of the test class is known. [sent-91, score-1.231]
24 For unfamiliar classes, the primary challenge is that the class-attribute relationships are unknown. [sent-92, score-0.875]
25 [20] address a problem similar to unfamiliar class detection but make some different assumptions. [sent-98, score-1.004]
26 However, the number of unfamiliar classes must be specified beforehand. [sent-100, score-0.993]
27 We do not place such restrictions on the unfamiliar classes; examples from both seen and unfamiliar classes occur at test time, and we assume that all classes (seen and unfamiliar) can be characterized by the same super set of attributes. [sent-101, score-2.112]
28 We note that among fine-grained visual categories of the same basic-level category, the interclass variation lies at the attribute level, and an unfamiliar class represents an unknown combination of these attributes. [sent-103, score-1.321]
29 This work presents an approach to unfamiliar class detection in the visual categorization setting that uses high-level attribute features and is not application specific. [sent-107, score-1.326]
30 Approach In this section, we introduce our algorithm for performing unfamiliar class detection using knowledge of attributes and present a method for incorporating user responses. [sent-109, score-1.256]
31 seen and unfamiliar classes all contain an object belonging to the same basic-level category, that is, we do not address detecting object presence. [sent-116, score-1.1]
32 Unfamiliar classes differ from seen classe∈s 0in, th, aast their attribute values and distributions are not known. [sent-124, score-0.505]
33 This assumption is reasonable given a sufficiently large attribute vocabulary, with which we can theoretically represent an exponential number ofclasses; in practice, it enables us to capture a significant amount of variation in attributes that may represent new categories. [sent-125, score-0.473]
34 Detecting an unfamiliar class therefore involves predicting an unfamiliar combination of attributes, such that it does not correspond with any known class-attribute relationships. [sent-126, score-1.849]
35 We note that if an unfamiliar class is detected, a necessary task is to verify this prediction. [sent-127, score-0.971]
36 Both the verification and naming of unfamiliar classes are non-trivial tasks, and we assume that they will be performed by an expert with necessary domain knowledge. [sent-129, score-1.025]
37 Detecting unfamiliar classes Our goal in unfamiliar class detection is to estimate the probability of an incoming test example belonging to an unfamiliar class in U. [sent-134, score-3.028]
38 (1) We assume a direct-class attribute model, such that p(c|a, x) is nonzero only when a = ac, where ac is the ground )tr iusth n oatntzriebruot eo vleyc tworh fnor a cl =ass a c. [sent-139, score-0.363]
39 Together with our assumption that the class can be fully determined by a unique attribute membership vector, the integral of Equa7 7 7 87 78719 919 tion 1can then be expressed as: p(c|x) = p(c|ac, x)p(ac |x) = p(c|ac)p(ac |x) . [sent-140, score-0.414]
40 While we are unable to learn parameters that optimize for unfamiliar classes in the test set, we assume that all possible unfamiliar classes are drawn from the same basic-level category and thus validate on other classes within this basic-level category. [sent-159, score-2.212]
41 Incorporating computer vision Under the assumption of mutual independence of attributes, we estimate the attribute probabilities using binary attribute classifiers. [sent-166, score-0.632]
42 ging to an unfamiliar class, as attribute classifier performance and robustness play a large role in accurately predicting unfamiliarity. [sent-193, score-1.208]
43 Our framework for unfamiliar class detection supports the addition of human users, who can boost performance by replacing outputs of poor attribute detectors. [sent-197, score-1.322]
44 We model the set of user attribute responses U in a similar fashion as [37, 6]. [sent-198, score-0.38]
45 (5) A user’s perception of an attribute value ai is denoted as a random variable ai, and we assume attribute values are perceived independently: p(U|x) = ? [sent-201, score-0.64]
46 Images from the unfamiliar categories were then removed from the train set. [sent-221, score-0.888]
47 We also removed corresponding entries in the class-attribute matrix for the unfamiliar classes. [sent-222, score-0.861]
48 We performed 5-fold cross validation on random sets of 10 unfamiliar classes drawn from the remaining 50 classes that did not appear in training or testing. [sent-224, score-1.167]
49 From each family, we randomly selected one class to be included in the dataset as seen, and we selected a second class as an unfamiliar category. [sent-230, score-1.081]
50 At the same time, it remains a challenging dataset: due to the selection of family pairs, unfamiliar classes may closely resemble seen classes. [sent-232, score-1.092]
51 For both bird datasets, we used roughly 30 images per class to train our attribute classifiers, and 15 per class to test. [sent-234, score-0.571]
52 Each attribute is associated with a certain part of the bird; object-level attributes such as Has Primary Color Blue, Has Size Large, and Has Shape PerchingLike are associated with the entire bounding box of the bird. [sent-239, score-0.473]
53 In general, we expect more sophisticated learning algorithms to yield attribute classifiers of greater discriminative power that can boost classification accuracy as well as unfamiliar class detection accuracy. [sent-242, score-1.374]
54 4 the roles that attribute noise, class attribute variation, and users respectively play in detecting unfamiliar classes accurately. [sent-250, score-1.845]
55 We measure performance in unfamiliar class prediction as the area under the ROC curve (AUC); the advantage of using this metric is that it is invariant to priors. [sent-254, score-0.971]
56 2 with an RBF kernel, the one-class SVM performed only (a) (b) Figure 3: ROC curves for unfamiliar class detection with the Bird-Families (3a) and All-Birds (3b) datasets. [sent-263, score-1.004]
57 5C 1 S24 Table 1: We present results for the multi-class classification of seen classes (average accuracy on seen class test images), and report AUCs for unfamiliar class detection (UCD); a one-class SVM (OCS) with ν = 0. [sent-269, score-1.39]
58 Also shown are the numbers of seen (SC) and unfamiliar classes (UC). [sent-272, score-1.043]
59 We train a binary classifier to recognize unfamiliar classes, in which negative examples are drawn from all familiar classes, and positive examples are drawn from a held-out set of unfamiliar categories consisting of 5 classes selected at random for Bird-Families and 25 for All-Birds. [sent-276, score-1.987]
60 We also report accuracy for the multi-class classification task on the seen classes to demonstrate the difficulty ofthese datasets without even considering the unfamiliar class problem. [sent-278, score-1.195]
61 For both multi-class classification and unfamiliar class detection, the discriminativeness of the individual attribute classifiers clearly affects performance. [sent-280, score-1.327]
62 In general, we note that better attribute classifiers are likely to produce higher overall accuracy for the classification task and consequently, unfamiliar class detection, as the prediction of unfamiliarity is based on the per-class probabilities. [sent-283, score-1.39]
63 For several bird family pairs 4(a-d), we observe how unfamiliar class probabilities change as noise is added to the attributes. [sent-292, score-1.167]
64 Each family is represented by one seen (cyan dashed lines) and one unfamiliar class (solid magenta lines). [sent-294, score-1.07]
65 An attribute classifier’s performance can be characterized in terms of noisiness if the output is treated as a binary response—a poorly performing attribute classifier will therefore detect an attribute with a high rate of error or noise. [sent-301, score-0.937]
66 The amount of noise present in detecting attributes, whether it arises from human users or attribute classifiers, impacts how well one can detect unfamiliar classes. [sent-303, score-1.309]
67 In this experiment, we observe the effect of attribute noise in unfamiliar class detection, focusing on the Bird-Families dataset as it allows us to examine this effect for similar and commonly confused classes. [sent-304, score-1.333]
68 The added noise represents the probability that an attribute bit value will be flipped. [sent-307, score-0.375]
69 For each family, recall that one class appears in the dataset as a seen class, and the other is considered an unfamiliar class. [sent-312, score-1.021]
70 As noise is added to the attribute values, we observe the probability that test examples from the two classes are unfamiliar (see Figure 4). [sent-313, score-1.431]
71 Referring to Figures 4(a-b), we observe that with no added noise, the unfamiliar classes are predicted with high probability of being unfamiliar, as compared to the corresponding seen classes. [sent-315, score-1.1]
72 We do not take into account any bias or priors, so a threshold on the unfamiliar probability is not determined. [sent-316, score-0.878]
73 Regardless, it is important to note the clear separation between how likely an unfamiliar class is indeed unfamiliar versus how likely a seen class is considered unfamiliar. [sent-317, score-1.992]
74 As noise is added and errors are introduced to the attribute values, the probability of being unfamiliar for both classes in a family tends to converge. [sent-318, score-1.417]
75 This indicates a confusion between seen and unfamiliar classes; after exceeding a certain level of noisiness, it is not possible to discern between them as both are equally likely to be unfamiliar. [sent-319, score-0.911]
76 Modeling class attribute variation Another dimension to unfamiliar class detection deals with the amount of class attribute variation, which can be thought of as the “difficulty” of the dataset. [sent-322, score-1.832]
77 We quantify it as the average number of attributes that differ between classes; this metric is computed as the average Hamming distance between all pairs of binary class attribute vectors. [sent-323, score-0.602]
78 The unfamiliar Brandt Cormorant test examples are instead mistaken for their seen class counterpart, the Redfaced Cormorant. [sent-327, score-1.083]
79 While the family pairs represent taxonomically similar species, the classes are not necessarily visually similar and may share more attributes in common with other classes. [sent-329, score-0.367]
80 7 7 7 8 8 8 42 242 (a)(b) Figure 5: 5a: Seen class classification observed as we query humans for attribute values. [sent-330, score-0.492]
81 As such, Blue Jays are not mistaken for Green Jays; however, they are also not likely to belong to an unfamiliar class. [sent-333, score-0.882]
82 Modeling users To overcome noise in attribute classifiers, which is compounded by the inherent difficulty of the dataset at hand, we can incorporate human users into the pipeline. [sent-337, score-0.504]
83 In order to observe the role users play in unfamiliar class detection, we perform an experiment in which we select attribute questions one at a time to ask users. [sent-338, score-1.395]
84 We determine how well an attribute can discriminate the classes by sorting the attributes based on their average Hamming distance to all other attributes. [sent-340, score-0.605]
85 User confidence on a perattribute basis is observed on the training set, and we query users of attributes that they tend to answer with the highest certainty (users are given the options definitely, probably, and guessing when providing attribute values). [sent-341, score-0.601]
86 The fourth method ranks the attributes based on mutual information, such that attributes that share less information with other attributes are selected first. [sent-342, score-0.524]
87 In Figure 6, we observe that we are able to improve unfamiliar class detection accuracy by querying users selectively, suggesting that despite variance in user responses, we are able to leverage them to overcome poor attribute detections. [sent-343, score-1.486]
88 For example, the first two attributes queried using the Hamming distance and MI methods are Has Eye Color Black and Has Belly Pattern Solid, which are two of the most common attributes found in the seen classes. [sent-345, score-0.419]
89 These attributes are useful in detecting unfamiliar classes, because when they are not detected in a test example, then the example is less likely to belong to any of the seen classes. [sent-346, score-1.138]
90 The attributes queried later tend to occur infrequently in the set of seen classes and thus are less informative in general. [sent-347, score-0.402]
91 We observe that by using user responses that are answered with high certainty, without considering consistency in response, we can boost unfamiliar class detection performance significantly. [sent-348, score-1.139]
92 As more questions are answered, the system incorporates less reliable attribute values, and this increased noise in attribute values (Figure 4) may cause the drop in the curve after 60 user responses. [sent-350, score-0.696]
93 At the same time, user certainty has no observable impact on seen class classification performance (Figure 5a), suggesting that the issues of detecting unfamiliarities and classification require different approaches and should be addressed individually. [sent-351, score-0.35]
94 Conclusion In this work, we have presented an attribute-based framework for unfamiliar class detection that supports the use of humans in the loop, empirically observing the roles that attribute noise and variation play in the task of unfamiliar class detection. [sent-353, score-2.385]
95 Success at detecting unfamiliar classes is influenced by how distinct the classes in the dataset are to each other, as there inevitably will be noise in the attribute detection. [sent-354, score-1.501]
96 Achieving better performance on un- familiar class detection also necessitates improving accuracy in the classification of attributes and classes; however, we emphasize that these problems should be addressed separately. [sent-355, score-0.362]
97 In order to improve attribute-based classification, one could take into account dependencies between attributes or consider multiple modalities in class attribute descriptions; these are directions for future work. [sent-356, score-0.583]
98 ) indicates that we have only seen the tip of the iceberg for the unfamiliar class problem. [sent-358, score-1.021]
99 We compare several methods of sequentially querying users and observe how each affects unfamiliar class detection performance. [sent-365, score-1.113]
100 Learning to detect unseen object classes by between-class attribute transfer. [sent-459, score-0.491]
wordName wordTfidf (topN-words)
[('unfamiliar', 0.861), ('attribute', 0.304), ('attributes', 0.169), ('classes', 0.132), ('class', 0.11), ('users', 0.072), ('aic', 0.066), ('unfamiliarity', 0.063), ('unseen', 0.055), ('species', 0.054), ('user', 0.052), ('seen', 0.05), ('family', 0.049), ('bird', 0.047), ('ac', 0.044), ('jays', 0.038), ('detecting', 0.036), ('noise', 0.036), ('detection', 0.033), ('auc', 0.033), ('queried', 0.031), ('cub', 0.031), ('taxonomic', 0.031), ('category', 0.03), ('classifiers', 0.03), ('query', 0.029), ('associations', 0.028), ('hamming', 0.028), ('drawn', 0.027), ('certainty', 0.027), ('categories', 0.027), ('humans', 0.027), ('north', 0.026), ('play', 0.026), ('amil', 0.025), ('bunting', 0.025), ('buntings', 0.025), ('fisnrane', 0.025), ('indigo', 0.025), ('kiwi', 0.025), ('noisiness', 0.025), ('ssaclcssa', 0.025), ('novelty', 0.025), ('probabilities', 0.024), ('descriptions', 0.024), ('responses', 0.024), ('families', 0.023), ('wah', 0.023), ('answered', 0.023), ('markou', 0.023), ('terns', 0.023), ('test', 0.022), ('classification', 0.022), ('observe', 0.022), ('loop', 0.021), ('mistaken', 0.021), ('belonging', 0.021), ('occur', 0.02), ('difficulty', 0.02), ('distinctions', 0.02), ('american', 0.019), ('interclass', 0.019), ('leafsnap', 0.019), ('mahajan', 0.019), ('examples', 0.019), ('differ', 0.019), ('added', 0.018), ('farhadi', 0.018), ('categorization', 0.018), ('jay', 0.018), ('vocabulary', 0.018), ('logp', 0.018), ('biometric', 0.017), ('suggesting', 0.017), ('probability', 0.017), ('empirically', 0.017), ('predicting', 0.017), ('ai', 0.017), ('share', 0.017), ('quantifies', 0.017), ('knowledge', 0.017), ('verification', 0.017), ('unobserved', 0.016), ('brandt', 0.016), ('birds', 0.016), ('quantified', 0.016), ('validation', 0.015), ('assume', 0.015), ('querying', 0.015), ('differing', 0.014), ('interactively', 0.014), ('incorporating', 0.014), ('characteristics', 0.014), ('familiar', 0.014), ('branson', 0.014), ('griffin', 0.014), ('addressed', 0.014), ('boost', 0.014), ('relationships', 0.014), ('sigmoid', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 48 cvpr-2013-Attribute-Based Detection of Unfamiliar Classes with Humans in the Loop
Author: Catherine Wah, Serge Belongie
Abstract: Recent work in computer vision has addressed zero-shot learning or unseen class detection, which involves categorizing objects without observing any training examples. However, these problems assume that attributes or defining characteristics of these unobserved classes are known, leveraging this information at test time to detect an unseen class. We address the more realistic problem of detecting categories that do not appear in the dataset in any form. We denote such a category as an unfamiliar class; it is neither observed at train time, nor do we possess any knowledge regarding its relationships to attributes. This problem is one that has received limited attention within the computer vision community. In this work, we propose a novel ap. ucs d .edu Unfamiliar? or?not? UERY?IMAGQ IMmFaAtgMechs?inIlLatsrA?inYRESg MFNaAotc?ihntIlraLsin?A YRgES UMNaotFc?hAinMltarsIinL?NIgAOR AKNTAWDNO ?Train g?imagesn U(se)alc?n)eSs(Long?bilCas n?a’t lrfyibuteIn?mfoartesixNearwter proach to the unfamiliar class detection task that builds on attribute-based classification methods, and we empirically demonstrate how classification accuracy is impacted by attribute noise and dataset “difficulty,” as quantified by the separation of classes in the attribute space. We also present a method for incorporating human users to overcome deficiencies in attribute detection. We demonstrate results superior to existing methods on the challenging CUB-200-2011 dataset.
2 0.26557559 116 cvpr-2013-Designing Category-Level Attributes for Discriminative Visual Recognition
Author: Felix X. Yu, Liangliang Cao, Rogerio S. Feris, John R. Smith, Shih-Fu Chang
Abstract: Attribute-based representation has shown great promises for visual recognition due to its intuitive interpretation and cross-category generalization property. However, human efforts are usually involved in the attribute designing process, making the representation costly to obtain. In this paper, we propose a novel formulation to automatically design discriminative “category-level attributes ”, which can be efficiently encoded by a compact category-attribute matrix. The formulation allows us to achieve intuitive and critical design criteria (category-separability, learnability) in a principled way. The designed attributes can be used for tasks of cross-category knowledge transfer, achieving superior performance over well-known attribute dataset Animals with Attributes (AwA) and a large-scale ILSVRC2010 dataset (1.2M images). This approach also leads to state-ofthe-art performance on the zero-shot learning task on AwA.
3 0.19195051 229 cvpr-2013-It's Not Polite to Point: Describing People with Uncertain Attributes
Author: Amir Sadovnik, Andrew Gallagher, Tsuhan Chen
Abstract: Visual attributes are powerful features for many different applications in computer vision such as object detection and scene recognition. Visual attributes present another application that has not been examined as rigorously: verbal communication from a computer to a human. Since many attributes are nameable, the computer is able to communicate these concepts through language. However, this is not a trivial task. Given a set of attributes, selecting a subset to be communicated is task dependent. Moreover, because attribute classifiers are noisy, it is important to find ways to deal with this uncertainty. We address the issue of communication by examining the task of composing an automatic description of a person in a group photo that distinguishes him from the others. We introduce an efficient, principled methodfor choosing which attributes are included in a short description to maximize the likelihood that a third party will correctly guess to which person the description refers. We compare our algorithm to computer baselines and human describers, and show the strength of our method in creating effective descriptions.
4 0.18784209 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes
Author: Jonghyun Choi, Mohammad Rastegari, Ali Farhadi, Larry S. Davis
Abstract: We propose a method to expand the visual coverage of training sets that consist of a small number of labeled examples using learned attributes. Our optimization formulation discovers category specific attributes as well as the images that have high confidence in terms of the attributes. In addition, we propose a method to stably capture example-specific attributes for a small sized training set. Our method adds images to a category from a large unlabeled image pool, and leads to significant improvement in category recognition accuracy evaluated on a large-scale dataset, ImageNet.
5 0.18562493 461 cvpr-2013-Weakly Supervised Learning for Attribute Localization in Outdoor Scenes
Author: Shuo Wang, Jungseock Joo, Yizhou Wang, Song-Chun Zhu
Abstract: In this paper, we propose a weakly supervised method for simultaneously learning scene parts and attributes from a collection ofimages associated with attributes in text, where the precise localization of the each attribute left unknown. Our method includes three aspects. (i) Compositional scene configuration. We learn the spatial layouts of the scene by Hierarchical Space Tiling (HST) representation, which can generate an excessive number of scene configurations through the hierarchical composition of a relatively small number of parts. (ii) Attribute association. The scene attributes contain nouns and adjectives corresponding to the objects and their appearance descriptions respectively. We assign the nouns to the nodes (parts) in HST using nonmaximum suppression of their correlation, then train an appearance model for each noun+adjective attribute pair. (iii) Joint inference and learning. For an image, we compute the most probable parse tree with the attributes as an instantiation of the HST by dynamic programming. Then update the HST and attribute association based on the in- ferred parse trees. We evaluate the proposed method by (i) showing the improvement of attribute recognition accuracy; and (ii) comparing the average precision of localizing attributes to the scene parts.
6 0.18065017 241 cvpr-2013-Label-Embedding for Attribute-Based Classification
7 0.16850911 293 cvpr-2013-Multi-attribute Queries: To Merge or Not to Merge?
8 0.16720997 101 cvpr-2013-Cumulative Attribute Space for Age and Crowd Density Estimation
9 0.152155 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes
10 0.15069535 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics
11 0.14790499 396 cvpr-2013-Simultaneous Active Learning of Classifiers & Attributes via Relative Feedback
12 0.14241494 146 cvpr-2013-Enriching Texture Analysis with Semantic Data
13 0.13368827 310 cvpr-2013-Object-Centric Anomaly Detection by Attribute-Based Reasoning
15 0.1089228 99 cvpr-2013-Cross-View Image Geolocalization
16 0.09024854 462 cvpr-2013-Weakly Supervised Learning of Mid-Level Features with Beta-Bernoulli Process Restricted Boltzmann Machines
17 0.080962606 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
18 0.069203027 239 cvpr-2013-Kernel Null Space Methods for Novelty Detection
19 0.057789247 452 cvpr-2013-Vantage Feature Frames for Fine-Grained Categorization
20 0.055290129 73 cvpr-2013-Bringing Semantics into Focus Using Visual Abstraction
topicId topicWeight
[(0, 0.121), (1, -0.108), (2, -0.036), (3, -0.026), (4, 0.104), (5, 0.105), (6, -0.247), (7, 0.067), (8, 0.083), (9, 0.179), (10, -0.046), (11, 0.084), (12, -0.036), (13, -0.001), (14, 0.061), (15, 0.028), (16, -0.032), (17, -0.021), (18, -0.034), (19, 0.076), (20, 0.002), (21, 0.023), (22, 0.003), (23, 0.009), (24, 0.019), (25, -0.002), (26, -0.045), (27, 0.024), (28, -0.042), (29, 0.003), (30, 0.029), (31, 0.015), (32, 0.042), (33, -0.0), (34, 0.011), (35, -0.009), (36, -0.005), (37, -0.015), (38, -0.009), (39, 0.081), (40, -0.003), (41, 0.007), (42, 0.016), (43, -0.031), (44, -0.026), (45, 0.043), (46, 0.018), (47, 0.049), (48, -0.008), (49, -0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.94126356 48 cvpr-2013-Attribute-Based Detection of Unfamiliar Classes with Humans in the Loop
Author: Catherine Wah, Serge Belongie
Abstract: Recent work in computer vision has addressed zero-shot learning or unseen class detection, which involves categorizing objects without observing any training examples. However, these problems assume that attributes or defining characteristics of these unobserved classes are known, leveraging this information at test time to detect an unseen class. We address the more realistic problem of detecting categories that do not appear in the dataset in any form. We denote such a category as an unfamiliar class; it is neither observed at train time, nor do we possess any knowledge regarding its relationships to attributes. This problem is one that has received limited attention within the computer vision community. In this work, we propose a novel ap. ucs d .edu Unfamiliar? or?not? UERY?IMAGQ IMmFaAtgMechs?inIlLatsrA?inYRESg MFNaAotc?ihntIlraLsin?A YRgES UMNaotFc?hAinMltarsIinL?NIgAOR AKNTAWDNO ?Train g?imagesn U(se)alc?n)eSs(Long?bilCas n?a’t lrfyibuteIn?mfoartesixNearwter proach to the unfamiliar class detection task that builds on attribute-based classification methods, and we empirically demonstrate how classification accuracy is impacted by attribute noise and dataset “difficulty,” as quantified by the separation of classes in the attribute space. We also present a method for incorporating human users to overcome deficiencies in attribute detection. We demonstrate results superior to existing methods on the challenging CUB-200-2011 dataset.
2 0.93402249 116 cvpr-2013-Designing Category-Level Attributes for Discriminative Visual Recognition
Author: Felix X. Yu, Liangliang Cao, Rogerio S. Feris, John R. Smith, Shih-Fu Chang
Abstract: Attribute-based representation has shown great promises for visual recognition due to its intuitive interpretation and cross-category generalization property. However, human efforts are usually involved in the attribute designing process, making the representation costly to obtain. In this paper, we propose a novel formulation to automatically design discriminative “category-level attributes ”, which can be efficiently encoded by a compact category-attribute matrix. The formulation allows us to achieve intuitive and critical design criteria (category-separability, learnability) in a principled way. The designed attributes can be used for tasks of cross-category knowledge transfer, achieving superior performance over well-known attribute dataset Animals with Attributes (AwA) and a large-scale ILSVRC2010 dataset (1.2M images). This approach also leads to state-ofthe-art performance on the zero-shot learning task on AwA.
3 0.88568074 310 cvpr-2013-Object-Centric Anomaly Detection by Attribute-Based Reasoning
Author: Babak Saleh, Ali Farhadi, Ahmed Elgammal
Abstract: When describing images, humans tend not to talk about the obvious, but rather mention what they find interesting. We argue that abnormalities and deviations from typicalities are among the most important components that form what is worth mentioning. In this paper we introduce the abnormality detection as a recognition problem and show how to model typicalities and, consequently, meaningful deviations from prototypical properties of categories. Our model can recognize abnormalities and report the main reasons of any recognized abnormality. We also show that abnormality predictions can help image categorization. We introduce the abnormality detection dataset and show interesting results on how to reason about abnormalities.
4 0.87981415 229 cvpr-2013-It's Not Polite to Point: Describing People with Uncertain Attributes
Author: Amir Sadovnik, Andrew Gallagher, Tsuhan Chen
Abstract: Visual attributes are powerful features for many different applications in computer vision such as object detection and scene recognition. Visual attributes present another application that has not been examined as rigorously: verbal communication from a computer to a human. Since many attributes are nameable, the computer is able to communicate these concepts through language. However, this is not a trivial task. Given a set of attributes, selecting a subset to be communicated is task dependent. Moreover, because attribute classifiers are noisy, it is important to find ways to deal with this uncertainty. We address the issue of communication by examining the task of composing an automatic description of a person in a group photo that distinguishes him from the others. We introduce an efficient, principled methodfor choosing which attributes are included in a short description to maximize the likelihood that a third party will correctly guess to which person the description refers. We compare our algorithm to computer baselines and human describers, and show the strength of our method in creating effective descriptions.
5 0.87610543 241 cvpr-2013-Label-Embedding for Attribute-Based Classification
Author: Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid
Abstract: Attributes are an intermediate representation, which enables parameter sharing between classes, a must when training data is scarce. We propose to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors. We introduce a function which measures the compatibility between an image and a label embedding. The parameters of this function are learned on a training set of labeled samples to ensure that, given an image, the correct classes rank higher than the incorrect ones. Results on the Animals With Attributes and Caltech-UCSD-Birds datasets show that the proposed framework outperforms the standard Direct Attribute Prediction baseline in a zero-shot learning scenario. The label embedding framework offers other advantages such as the ability to leverage alternative sources of information in addition to attributes (e.g. class hierarchies) or to transition smoothly from zero-shot learning to learning with large quantities of data.
6 0.86271882 293 cvpr-2013-Multi-attribute Queries: To Merge or Not to Merge?
7 0.81454587 396 cvpr-2013-Simultaneous Active Learning of Classifiers & Attributes via Relative Feedback
8 0.79107022 461 cvpr-2013-Weakly Supervised Learning for Attribute Localization in Outdoor Scenes
9 0.72881544 101 cvpr-2013-Cumulative Attribute Space for Age and Crowd Density Estimation
10 0.7139762 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes
11 0.68116838 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes
12 0.67599171 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics
13 0.59928137 146 cvpr-2013-Enriching Texture Analysis with Semantic Data
14 0.59232551 99 cvpr-2013-Cross-View Image Geolocalization
16 0.4998771 463 cvpr-2013-What's in a Name? First Names as Facial Attributes
17 0.48564658 462 cvpr-2013-Weakly Supervised Learning of Mid-Level Features with Beta-Bernoulli Process Restricted Boltzmann Machines
18 0.42391467 174 cvpr-2013-Fine-Grained Crowdsourcing for Fine-Grained Recognition
19 0.41173556 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
20 0.37169006 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
topicId topicWeight
[(10, 0.068), (16, 0.013), (26, 0.034), (27, 0.016), (28, 0.013), (33, 0.544), (67, 0.061), (69, 0.048), (77, 0.013), (87, 0.047), (95, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.99782306 48 cvpr-2013-Attribute-Based Detection of Unfamiliar Classes with Humans in the Loop
Author: Catherine Wah, Serge Belongie
Abstract: Recent work in computer vision has addressed zero-shot learning or unseen class detection, which involves categorizing objects without observing any training examples. However, these problems assume that attributes or defining characteristics of these unobserved classes are known, leveraging this information at test time to detect an unseen class. We address the more realistic problem of detecting categories that do not appear in the dataset in any form. We denote such a category as an unfamiliar class; it is neither observed at train time, nor do we possess any knowledge regarding its relationships to attributes. This problem is one that has received limited attention within the computer vision community. In this work, we propose a novel ap. ucs d .edu Unfamiliar? or?not? UERY?IMAGQ IMmFaAtgMechs?inIlLatsrA?inYRESg MFNaAotc?ihntIlraLsin?A YRgES UMNaotFc?hAinMltarsIinL?NIgAOR AKNTAWDNO ?Train g?imagesn U(se)alc?n)eSs(Long?bilCas n?a’t lrfyibuteIn?mfoartesixNearwter proach to the unfamiliar class detection task that builds on attribute-based classification methods, and we empirically demonstrate how classification accuracy is impacted by attribute noise and dataset “difficulty,” as quantified by the separation of classes in the attribute space. We also present a method for incorporating human users to overcome deficiencies in attribute detection. We demonstrate results superior to existing methods on the challenging CUB-200-2011 dataset.
2 0.99567908 189 cvpr-2013-Graph-Based Discriminative Learning for Location Recognition
Author: Song Cao, Noah Snavely
Abstract: Recognizing the location of a query image by matching it to a database is an important problem in computer vision, and one for which the representation of the database is a key issue. We explore new ways for exploiting the structure of a database by representing it as a graph, and show how the rich information embedded in a graph can improve a bagof-words-based location recognition method. In particular, starting from a graph on a set of images based on visual connectivity, we propose a method for selecting a set of subgraphs and learning a local distance function for each using discriminative techniques. For a query image, each database image is ranked according to these local distance functions in order to place the image in the right part of the graph. In addition, we propose a probabilistic method for increasing the diversity of these ranked database images, again based on the structure of the image graph. We demonstrate that our methods improve performance over standard bag-of-words methods on several existing location recognition datasets.
3 0.99485564 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
Author: Mihir Jain, Hervé Jégou, Patrick Bouthemy
Abstract: Several recent works on action recognition have attested the importance of explicitly integrating motion characteristics in the video description. This paper establishes that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and for the computation of descriptors, significantly improves action recognition algorithms. Then, we design a new motion descriptor, the DCS descriptor, based on differential motion scalar quantities, divergence, curl and shear features. It captures additional information on the local motion patterns enhancing results. Finally, applying the recent VLAD coding technique proposed in image retrieval provides a substantial improvement for action recognition. Our three contributions are complementary and lead to outperform all reported results by a significant margin on three challenging datasets, namely Hollywood 2, HMDB51 and Olympic Sports. 1. Introduction and related work Human actions often convey the essential meaningful content in videos. Yet, recognizing human actions in un- constrained videos is a challenging problem in Computer Vision which receives a sustained attention due to the potential applications. In particular, there is a large interest in designing video-surveillance systems, providing some automatic annotation of video archives as well as improving human-computer interaction. The solutions proposed to address this problem inherit, to a large extent, from the techniques first designed for the goal of image search and classification. The successful local features developed to describe image patches [15, 23] have been translated in the 2D+t domain as spatio-temporal local descriptors [13, 30] and now include motion clues [29]. These descriptors are often extracted from spatial-temporal interest points [12, 3 1]. More recent techniques assume some underlying temporal motion model involving trajectories [2, 6, 7, 17, 18, 25, 29, 32]. Most of these approaches produce large set of local descriptors which are in turn aggregated to produce a single vector representing the video, in order to enable the use of powerful discriminative classifiers such as support vector machines (SVMs). This is usually done with the bag- Figure 1. Optical flow field vectors (green vectors with red end points) before and after dominant motion compensation. Most of the flow vectors due to camera motion are suppressed after compensation. One of the contributions of this paper is to show that compensating for the dominant motion is beneficial for most of the existing descriptors used for action recognition. of-words technique [24], which quantizes the local features using a k-means codebook. Thanks to the successful combination of this encoding technique with the aforementioned local descriptors, the state of the art in action recognition is able to go beyond the toy problems ofclassifying simple human actions in controlled environment and considers the detection of actions in real movies or video clips [11, 16]. Despite these progresses, the existing descriptors suffer from an uncompleted handling of motion in the video sequence. Motion is arguably the most reliable source of information for action recognition, as often related to the actions of interest. However, it inevitably involves the background or camera motion when dealing with uncontrolled and re- alistic situations. Although some attempts have been made to compensate camera motion in several ways [10, 21, 26, 29, 32], how to separate action motion from that caused by the camera, and how to reflect it in the video description remains an open issue. The motion compensation mechanism employed in [10] is tailor-made to the Motion Interchange Pattern encoding technique. The Motion Boundary Histogram (MBH) [29] is a recent appealing approach to 222555555533 suppress the constant motion by considering the flow gradient. It is robust to some extent to the presence of camera motion, yet it does not explicitly handle the camera motion. Another approach [26] uses a sophisticated and robust (RANSAC) estimation of camera motion. It first segments the color image into regions corresponding to planar parts in the scene and estimates the (three) dominant homographies to update the motion associated with local features. A rather different view is adopted in [32] where the motion decomposition is performed at the trajectory level. All these works support the potential of motion compensation. As the first contribution of this paper, we address the problem in a way that departs from these works by considering the compensation of the dominant motion in both the tracking stages and encoding stages involved in the computation of action recognition descriptors. We rely on the pioneering works on motion compensation such as the technique proposed in [20], that considers 2D polynomial affine motion models for estimating the dominant image motion. We consider this particular model for its robustness and its low computational cost. It was already used in [21] to separate the dominant motion (assumed to be due to the camera motion) and the residual motion (corresponding to the independent scene motions) for dynamic event recognition in videos. However, the statistical modeling of both motion components was global (over the entire image) and only the normal flow was computed for the latter. Figure 1 shows the vectors of optical flow before and after applying the proposed motion compensation. Our method successfully suppresses most of the background motion and reinforces the focus towards the action of interest. We exploit this compensated motion both for descriptor computation and for extracting trajectories. However, we also show that the camera motion should not be thrown as it contains complementary information that is worth using to recognize certain action categories. Then, we introduce the Divergence-Curl-Shear (DCS) descriptor, which encodes scalar first-order motion features, namely the motion divergence, curl and shear. It captures physical properties of the flow pattern that are not involved in the best existing descriptors for action recognition, except in the work of [1] which exploits divergence and vorticity among a set of eleven kinematic features computed from the optical flow. Our DCS descriptor provides a good performance recognition performance on its own. Most importantly, it conveys some information which is not captured by existing descriptors and further improves the recognition performance when combined with the other descriptors. As a last contribution, we bring an encoding technique known as VLAD (vector oflocal aggregated descriptors) [8] to the field of action recognition. This technique is shown to be better than the bag-of-words representation for combining all the local video descriptors we have considered. The organization of the paper is as follows. Section 2 introduces the motion properties that we will consider through this paper. Section 3 presents the datasets and classification scheme used in our different evaluations. Section 4 details how we revisit several popular descriptors of the literature by the means of dominant motion compensation. Our DCS descriptor based on kinematic properties is introduced in Section 5 and improved by the VLAD encoding technique, which is introduced and bench-marked in Section 6 for several video descriptors. Section 7 provides a comparison showing the large improvement achieved over the state of the art. Finally, Section 8 concludes the paper. 2. Motion Separation and Kinematic Features In this section, we describe the motion clues we incorporate in our action recognition framework. We separate the dominant motion and the residual motion. In most cases, this will account to distinguishing the impact of camera movement and independent actions. Note that we do not aim at recovering the 3D camera motion: The 2D parametric motion model describes the global (or dominant) motion between successive frames. We first explain how we estimate the dominant motion and employ it to separate the dominant flow from the optical flow. Then, we will introduce kinematic features, namely divergence, curl and shear for a more comprehensive description of the visual motion. 2.1. Affine motion for compensating camera motion Among polynomial motion models, we consider the 2D affine motion model. Simplest motion models such as the 4parameter model formed by the combination of 2D translation, 2D rotation and scaling, or more complex ones such as the 8-parameter quadratic model (equivalent to a homography), could be selected as well. The affine model is a good trade-off between accuracy and efficiency which is of primary importance when processing a huge video database. It does have limitations since strictly speaking it implies a single plane assumption for the static background. However, this is not that penalizing (especially for outdoor scenes) if differences in depth remain moderated with respect to the distance to the camera. The affine flow vector at point p = (x, y) and at time t, is defined as waff(pt) =?cc12((t ) ?+?aa31((t ) aa42((t ) ? ?xytt?. (1) = + + = + uaff(pt) c1(t) a1(t)xt a2(t)yt and vaff(pt) c2(t) a3 (t)xt + a4(t)yt are horizontal and vertical components of waff(pt) respectively. Let us denote the optical flow vector at point p at time t as w(pt) = (u(pt) , v(pt)). We introduce the flow vector ω(pt) obtained by removing the affine flow vector from the optical flow vector ω(pt) = w(pt) − waff(pt) . (2) 222555555644 The dominant motion (estimated as waff(pt)) is usually due to the camera motion. In this case, Equation 2 amounts to canceling (or compensating) the camera motion. Note that this is not always true. For example in case of close-up on a moving actor, the dominant motion will be the affine estimation of the apparent actor motion. The interpretation of the motion compensation output will not be that straightforward in this case, however the resulting ω-field will still exhibit different patterns for the foreground action part and the background part. In the remainder, we will refer to the “compensated” flow as ω-flow. Figure 1 displays the computed optical flow and the ωflow. We compute the affine flow with the publicly available Motion2D software1 [20] which implements a realtime robust multiresolution incremental estimation framework. The affine motion model has correctly accounted for the motion induced by the camera movement which corresponds to the dominant motion in the image pair. Indeed, we observe that the compensated flow vectors in the background are close to null and the compensated flow in the foreground, i.e., corresponding to the actors, is conversely inflated. The experiments presented along this paper will show that effective separation of dominant motion from the residual motions is beneficial for action recognition. As explained in Section 4, we will compute local motion descriptors, such as HOF, on both the optical flow and the compensated flow (ω-flow), which allows us to explicitly and directly characterize the scene motion. 2.2. Local kinematic features By kinematic features, we mean local first-order differential scalar quantities computed on the flow field. We consider the divergence, the curl (or vorticity) and the hyperbolic terms. They inform on the physical pattern of the flow so that they convey useful information on actions in videos. They can be computed from the first-order derivatives of the flow at every point p at every frame t as ⎨⎪ ⎪ ⎪ ⎧hcdyuipvr1l2(p t) = −∂ u ∂ (yxp(xtp) +−∂ v ∂ v(px ypxt ) The diverg⎪⎩ence is related to axial motion, expansion scaling effects, the curl to rotation in the image plane. hyperbolic terms express the shear of the visual flow responding to more complex configuration. We take account the shear quantity only: shear(pt) = ?hyp12(pt) + hyp22(pt). (3) and The corinto (4) 1http://www.irisa.fr/vista/Motion2D/ In Section 5, we propose the DCS descriptor that is based on the kinematic features (divergence, curl and shear) of the visual motion discussed in this subsection. It is computed on either the optical or the compensated flow, ω-flow. 3. Datasets and evaluation This section first introduces the datasets used for the evaluation. Then, we briefly present the bag-of-feature model and the classification scheme used to encode the descriptors which will be introduced in Section 4. Hollywood2. The Hollywood2 dataset [16] contains 1,707 video clips from 69 movies representing 12 action classes. It is divided into train set and test set of 823 and 884 samples respectively. Following the standard evaluation protocol of this benchmark, we use average precision (AP) for each class and the mean of APs (mAP) for evaluation. HMDB51. The HMDB51 dataset [11] is a large dataset containing 6,766 video clips extracted from various sources, ranging from movies to YouTube. It consists of 51 action classes, each having at least 101 samples. We follow the evaluation protocol of [11] and use three train/test splits, each with 70 training and 30 testing samples per class. The average classification accuracy is computed over all classes. Out of the two released sets, we use the original set as it is more challenging and used by most of the works reporting results in action recognition. Olympic Sports. The third dataset we use is Olympic Sports [19], which again is obtained from YouTube. This dataset contains 783 samples with 16 sports action classes. We use the provided2 train/test split, there are 17 to 56 training samples and 4 to 11test samples per class. Mean AP is used for the evaluation, which is the standard choice. Bag of features and classification setup. We first adopt the standard BOF [24] approach to encode all kinds of descriptors. It produces a vector that serves as the video representation. The codebook is constructed for each type of descriptor separately by the k-means algorithm. Following a common practice in the literature [27, 29, 30], the codebook size is set to k=4,000 elements. Note that Section 6 will consider encoding technique for descriptors. For the classification, we use a non-linear SVM with χ2kernel. When combining different descriptors, we simply add the kernel matrices, as done in [27]: K(xi,xj) = exp?−?cγ1cD(xic,xjc)?, 2http://vision.stanford.edu/Datasets/OlympicSports/ 222555555755 (5) where D(xic, xjc) is χ2 distance between video xic and xjc with respect to c-th channel, corresponding to c-th descriptor. The quantity γc is the mean value of χ2 distances between the training samples for the c-th channel. The multiclass classification problem that we consider is addressed by applying a one-against-rest approach. 4. Compensated descriptors This section describes how the compensation ofthe dominant motion is exploited to improve the quality of descriptors encoding the motion and the appearance around spatio-temporal positions, hence the term “compensated descriptors”. First, we briefly review the local descriptors [5, 13, 16, 29, 30] used here along with dense trajectories [29]. Second, we analyze the impact of motion flow compensation when used in two different stages of the descriptor computation, namely in the tracking and the description part. 4.1. Dense trajectories and local descriptors Employing dense trajectories to compute local descriptors is one of the state-of-the-art approaches for action recognition. It has been shown [29] that when local descriptors are computed over dense trajectories the performance improves considerably compared to when computed over spatio temporal features [30]. Dense Trajectories [29]: The trajectories are obtained by densely tracking sampled points using optical flow fields. First, feature points are sampled from a dense grid, with step size of 5 pixels and over 8 scales. Each feature point pt = (xt, yt) at frame t is then tracked to the next frame by median filtering in a dense optical flow field F = (ut, vt) as follows: pt+1 = (xt+1 , yt+1) = (xt, yt) + (M ∗ F) | (x ¯t,y ¯t) , (6) where M is the kernel of median filtering and ( x¯ t, y¯ t) is the rounded position of (xt, yt). The tracking is limited to L (=15) frames to avoid any drifting effect. Excessively short trajectories and trajectories exhibiting sudden large displacements are removed as they induce some artifacts. Trajectories must be understood here as tracks in the spacetime volume of the video. Local descriptors: The descriptors are computed within a space-time volume centered around each trajectory. Four types of descriptors are computed to encode the shape of the trajectory, local motion pattern and appearance, namely Trajectory [29], HOF (histograms of optical flow) [13], MBH [4] and HOG (histograms of oriented gradients) [3]. All these descriptors depend on the flow field used for the tracking and as input of the descriptor computation: 1. The Trajectory descriptor encodes the shape of the trajectory represented by the normalized relative coor- × dinates of the successive points forming the trajectory. It directly depends on the dense flow used for tracking points. 2. HOF is computed using the orientations and magnitudes of the flow field. 3. MBH is designed to capture the gradient of horizontal and vertical components of the flow. The motion boundaries encode the relative pixel motion and therefore suppress camera motion, but only to some extent. 4. HOG encodes the appearance by using the intensity gradient orientations and magnitudes. It is formally not a motion descriptor. Yet the position where the descriptor is computed depends on the trajectory shape. As in [29], volume around a feature point is divided into a 2 2 3 space-time grid. The orientations are quantized ian 2to × ×8 b2i ×ns 3fo srp HacOe-Gti amned g g9r ibdi.ns T fhoer o oHriOenFt (awtioitnhs one a qdudainttiiozneadl zero bin). The horizontal and vertical components of MBH are separately quantized into 8 bins each. 4.2. Impact of motion compensation The optical flow is simply referred to as flow in the following, while the compensated flow (see subsection 2. 1) is denoted by ω-flow. Both of them are considered in the tracking and descriptor computation stages. The trajectories obtained by tracking with the ω-flow are called ω-trajectories. Figure 2 comparatively illustrates the ωtrajectories and the trajectories obtained using the flow. The input video shows a man moving away from the car. In this video excerpt, the camera is following the man walking to the right, thus inducing a global motion to the left in the video. When using the flow, the computed trajectories reflect the combination of these two motion components (camera and scene motion) as depicted by Subfigure 2(b), which hampers the characterization of the current action. In contrast, the ω-trajectories plotted in Subfigure 2(c) are more active on the actor moving on the foreground, while those localized in the background are now parallel to the time axis enhancing static parts of the scene. The ω-trajectories are therefore more relevant for action recognition, since they are more regularly and more exclusively following the actor’s motion. Impact on Trajectory and HOG descriptors. Table 1reports the impact of ω-trajectories on Trajectory and HOG descriptors, which are both significantly improved by 3%4% of mAP on the two datasets. When improved by ωflow, these descriptors will be respectively referred to as ω-Trajdesc and ω-HOG in the rest of the paper. Although the better performance of ω-Trajdesc versus the original Trajectory descriptor was expected, the one 222555555866 2. Trajectories obtained from optical and compensated flows. The green tail is the trajectory the current frame. The trajectories are sub-sampled for the sake of clarity. The frames are extracted Figure over every 15 frames with red dot indicating 5 frames in this example. DescriptorHollywood2HMDB51 BaseTrliaωnje- Tc(rtoarejrdpyreos[c2d9u]ced)54 7 1. 7 4% %2382.–89% BaseliHnωOe- (GHreOp [2rG9od]uced)4 451 . 658%%%2296.– 13%% Table 1. ω-Trajdesc and ω-HOG: Impact of compensating flow on Trajectory descriptor and HOG descriptors. achieved by ω-HOG might be surprising. Our interpretation is that HOG captures more context with the modified trajectories. More precisely, the original HOG descriptor is computed from a 2D+t sub-volume aligned with the corresponding trajectory and hence represents the appearance along the trajectory shape. When using ω-flow, we do not align the video sequence. As a result, the ω-HOG descriptor is no more computed around the very same tracked physical point in the space-time volume but around points lying in a patch of the initial feature point, whose size depends on the affine flow magnitude. ω-HOG can be viewed as a “patchbased” computation capturing more information about the appearance of the background or of the moving foreground. As for ω-trajectories, they are closer to the real trajectories of the moving actors as they usually cancel the camera movement, and so, more easier to train and recognize. Impact on HOF. The ω-flow impacts computation used as an input to HOF computation itself. Therefore, HOF can both types of trajectories (ω-trajectories both the trajectory and the descriptor be computed along or those extracted MethodHollywood2HMDB51 Table(ω2rHf.-alocO IwomkF)inpHgacOtFobf[2u9ωhsb]i:f-nlo ωgwot-hwωHOflFown5H 02 34O. 58291F% %descripto3 r706s38.:–1076% m%APfor Hollywood2 and average accuracy for HMDB5 1. The ω-HOF is used in subsequent evaluations. from flow) and can encode both kinds of flows (ω-flow or flow). For the sake of completeness, we evaluate all the variants as well as the combination of both flows in the descriptor computation stage. The results are presented in Table 2 and demonstrate the significant improvement obtained by computing the HOF descriptor with the ω-flow instead of the optical flow. Note that the type of trajectories which is used, either “Tracking flow” or “Tracking ω-flow”, has a limited impact in this case. From now on, we only consider the “Tracking ω-flow” case where HOF is computed along ω-trajectories. Interestingly, combining the HOF computed from the flow and the ω-flow further improves the results. This suggests that the two flow fields are complementary and the affine flow that was subtracted from ω-flow brings in additional information. For the sake of brevity, the combination of the two kinds of HOF, i.e., computed from the flow and the ω-flow using ω-trajectories, is referred to as the ω-HOF 222555555977 MethodHollywood2HMDB51 Tab(lerT3a.cIkmM inpBgMacHgωtBf-loH w [u2)s9in]gω f-lo wo MBH5 d42 e.052s7c% riptos:m34A90P.–3769f% orHllywood2 and average accuracy for HMDB5 1. DTerHasMjcBrOeblitpHGeFor4.ySumTωraw- frcilykto hw ionfgtheduωpCs-fcaolrtωmeiwNp-df/tl+Aωoutrw-finlogwthdesωcr- isTpc-fHtrMloaiOjrBpdswtGeHFosrc descriptor in the rest of this paper. Compared to the HOF baseline, the ω-HOF descriptor achieves a gain of +3.1% of mAP on Hollywood 2 and of +7.8% on HMDB51. Impact on MBH. Since MBH is computed from gradient of flow and cancel the constant motion, there is practically no benefit in using the ω-flow to compute the MBH descriptors, as shown in Table 3. However, by tracking ω-flow, the performance improves by around 1.3% for HMDB5 1 dataset and drops by around 1.5% for Hollywood2. This relative performance depends on the encoding technique. We will come back on this descriptor when considering another encoding scheme for local descriptors in Section 6. 4.3. Summary of compensated descriptors Table 4 summarizes the refined versions of the descriptors obtained by exploiting the ω-flow, and both ω-flow and the optical flow in the case of HOF. The revisited descriptors considerably improve the results compared to the orig- inal ones, with the noticeable exception of ω-MBH which gives mixed performance with a bag-of-features encoding scheme. But we already mention as this point that this incongruous behavior of ω-MBH is stabilized with the VLAD encoding scheme considered in Section 6. Another advantage of tracking the compensated flow is that fewer trajectories are produced. For instance, the total number of trajectories decreases by about 9. 16% and 22.81% on the Hollywood2 and HMDB51 datasets, respectively. Note that exploiting both the flow and the ω-flow do not induce much computational overhead, as the latter is obtained from the flow and the affine flow which is computed in real-time and already used to get the ω-trajectories. The only additional computational cost that we introduce by using the descriptors summarized in Table 4 is the computation of a second HOF descriptor, but this stage is relatively efficient and not the bottleneck of the extraction procedure. 5. Divergence-Curl-Shear descriptor This section introduces a new descriptor encoding the kinematic properties of motion discussed in Section 2.2. It is denoted by DCS in the rest of this paper. Combining kinematic features. The spatial derivatives are computed for the horizontal and vertical components of the flow field, which are used in turn to compute the divergence, curl and shear scalar values, see Equation 3. We consider all possible pairs of kinematic features, namely (div, curl), (div, shear) and (curl, shear). At each × ×× pixel, we compute the orientation and magnitude of the 2-D vector corresponding to each of these pairs. The orientation is quantized into histograms and the magnitude is used for weighting, similar to SIFT. Our motivation for encoding pairs is that the joint distribution of kinematic features conveys more information than exploiting them independently. Implementation details. The descriptor computation and parameters are similar to HOG and other popular descriptors such as MBH, HOF. We obtain 8-bin histograms for each of the three feature pairs or components of DCS. The range of possible angles is 2π for the (div,curl) pair and π for the other pairs, because the shear is always positive. The DCS descriptor is computed for a space-time volume aligned with a trajectory, as done with the four descriptors mentioned in the previous section. In order to capture the spatio-temporal structure of kinematic features, the volume (32 32 pixels and L = 15 frames) is subdivided into a spatio-temporal grid nofd s Lize = nx 5× f ny m×e nt, sw situhb nx =de ny =to 2a and nt = 3. These parameters ×hnave× × bneen fixed for the sake of consistency with the other descriptors. For each pair of kinematic features, each cell in the grid is represented by a histogram. The resulting local descriptors have a dimensionality equal to 288 = nx ny nt 8 3. At the video level, these descriptors are nenc×od end i×nto 8 a single vector representation using either BOF or the VLAD encoding scheme introduced in the next section. 6. VLAD in actions VLAD [8] is a descriptor encoding technique that aggregates the descriptors based on a locality criterion in the feature space. To our knowledge, this technique has never been considered for action recognition. Below, we briefly introduce this approach and give the performance achieved for all the descriptors introduced along the previous sections. VLAD in brief. Similar to BOF, VLAD relies on a codebook C = {c1, c2 , ...ck} of k centroids learned by k-means. bTohoek representation is ob}t oaifn ked c by summing, efodr b yea kch-m mveiasunasl. word ci, the differences x − ci of the vectors x assigned to ci, thereby producing a sv exct −or c representation oflength d×k, 222555556088 DMeBscHriptorV5 LH.A1o%Dlywo5Bo4d.O2 %F4V3L.3HA%MD B35B91.O7%F Taωbl-eDHM5rOCBa.FSjGPdHe+rsωfco-mMHaBOnFeofV54L2936A.51D% with5431ω208-.5T96% rajde3s42c97158,.ω3% -HOG342,58019ω.6-% HOF descriptors and their combination. where d is the dimension ofthe local descriptors. We use the codebook size, k = 256. Despite this large dimensionality, VLAD is efficient because it is effectively compared with a linear kernel. VLAD is post-processed using a componentwise power normalization, which dramatically improves its performance [8]. While cross validating the parameter α involved in this power normalization, we consistently observe, for all the descriptors, a value between 0.15 and 0.3. Therefore, this parameter is set to α = 0.2 in all our experiments. For classification, we use a linear SVM and oneagainst-rest approach everywhere, unless stated otherwise. Impact on existing descriptors. We employ VLAD because it is less sensitive to quantization parameters and appears to provide better performance with descriptors having a large dimensionality. These properties are interesting in our case, because the quantization parameters involved in the DCS and MBH descriptors have been used unchanged in Section 4 for the sake of direct comparison. They might be suboptimal when using the ω-flow instead of the optical flow on which they have initially been optimized [29]. Results for MBH and ω-MBH in Table 5 supports this argument. When using VLAD instead of BOF, the scores are stable in both the cases and there is no mixed inference as that observed in Table 3. VLAD also has significant positive influence on accuracy of ω-DCS descriptor. We also observe that ω-DCS is complementary to ω-MBH and adds to the performance. Still DCS is probably not best utilized in the current setting of parameters. In case of ω-Trajdesc and ω-HOG, the scores are better with BOF on both the datasets. ω-HOF with VLAD improves on HMDB5 1, but remains equivalent for Hollywood2. Although BOF leads to better scores for the descriptors considered individually, their combination with VLAD outperforms the BOF. 7. Comparison with the state of the art This section reports our results with all descriptors combined and compares our method with the state of the art. TrajectorCy+omHbOiGna+tHioOnF+ MDBCHSHol5 l98y.w76%o%od2H4M489.D02%B%51 All ω-descriptors all five compensated descriptors using combined62.5%52.1% Table 6. Combination of VLAD representation. WU*JVliaOnughreM tHaeolth. [yo2w9d87o] 256 0985. 37% SKa*duOeJinhrau tnegdMteHatlMa.h [ol1Dd.0B[91]25 24 609.8172% Table 7. Comparison with the state of the art on Hollywood2 and HMDB5 1 datasets. *Vig et al. [28] gets 61.9% by using external eye movements data. *Jiang et al. [9] used one-vs-one multi class SVM while our and other methods use one-vs-rest SVMs. With one-against-one multi class SVM we obtain 45. 1% for HMDB51. Descriptor combination. Table 6 reports the results obtained when the descriptors are combined. Since we use VLAD, our baseline is updated that is combination of Trajectory, HOG, HOF and MBH with VLAD representation. When DCS is added to the baseline there is an improvement of 0.9% and 1.2%. With combination of all five compensated descriptors we obtain 62.5% and 52.1% on the two datasets. This is a large improvement even over the updated baseline, which shows that the proposed motion compensation and the way we exploit it are significantly important for action recognition. The comparison with the state of the art is shown in Table 7. Our method outperforms all the previously reported results in the literature. In particular, on the HMDB51 dataset, the improvement over the best reported results to date is more than 11% in average accuracy. Jiang el al. [9] used a one-against-one multi-class SVM, which might have resulted in inferior scores. With a similar multi-class SVM approach, our method obtains 45. 1%, which remains significantly better than their result. All others results were reported with one-against-rest approach. On Olympic Sports dataset we obtain mAP of 83.2% with ‘All ω-descriptors combined’ and the improvement is mostly because of VLAD and ω-flow. The best reported mAPs on this dataset are Liu et al. [14] (74.4%) and Jiang et al. [9] (80.6%), which we exceed convincingly. Gaidon et al. [6] reports the best average accuracy of 82.7%. 8. Conclusions This paper first demonstrates the interest of canceling the dominant motion (predominantly camera motion) to make the visual motion truly related to actions, for both the trajectory extraction and descriptor computation stages. It pro222555556199 duces significantly better versions (called compensated descriptors) of several state-of-the-art local descriptors for action recognition. The simplicity, efficiency and effectiveness of this motion compensation approach make it applicable to any action recognition framework based on motion descriptors and trajectories. The second contribution is the new DCS descriptor derived from the first-order scalar motion quantities specifying the local motion patterns. It captures additional information which is proved complementary to the other descriptors. Finally, we show that VLAD encoding technique instead of bag-of-words boosts several action descriptors, and overall exhibits a significantly better performance when combining different types of descriptors. Our contributions are all complementary and significantly outperform the state of the art when combined, as demon- strated by our extensive experiments on the Hollywood 2, HMDB51 and Olympic Sports datasets. Acknowledgments This work was supported by the Quaero project, funded by Oseo, French agency for innovation. We acknowledge Heng Wang’s help for reproducing some of their results. References [1] S. Ali and M. Shah. Human action recognition in videos using kinematic features and multiple instance learning. IEEE T-PAMI, 32(2):288–303, Feb. 2010. [2] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, Sep. 2010. [3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, Jun. 2005. [4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, May 2006. [5] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, Oct. 2005. [6] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, Sep. 2012. [7] A. Hervieu, P. Bouthemy, and J.-P. Le Cadre. A statistical video content recognition method using invariant features on object trajectories. IEEE T-CSVT, 18(1 1): 1533–1543, 2008. [8] H. J ´egou, F. Perronnin, M. Douze, J. S ´anchez, P. P ´erez, and C. Schmid. Aggregating local descriptors into compact [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] codes. IEEE T-PAMI, 34(9):1704–1716, 2012. Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, Oct. 2012. O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In ECCV, Oct. 2012. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: A large video database for human motion recognition. In ICCV, Nov. 2011. I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, Oct. 2003. I. Laptev, M. Marzalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, Jun. 2008. J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, Jun. 2011. D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–1 10, Nov. 2004. M. Marzalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, Jun. 2009. P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV, Sep. 2009. R. Messing, C. J. Pal, and H. A. Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV, Sep. 2009. J. C. Niebles, C.-W. Chen, and F.-F. Li. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, Sep. 2010. [20] J.-M. Odobez and P. Bouthemy. Robust multiresolution estimation of parametric motion models. Jal of Vis. Comm. and Image Representation, 6(4):348–365, Dec. 1995. [21] G. Piriou, P. Bouthemy, and J.-F. Yao. Recognition of dynamic video contents with global probabilistic models of visual motion. IEEE T-IP, 15(1 1):3417–3430, 2006. [22] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, Jun. 2012. [23] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE T-PAMI, 19(5):530–534, May 1997. [24] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470– 1477, Oct. 2003. [25] J. Sun, X. Wu, S. Yan, L. F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, Jun. 2009. [26] H. Uemura, S. Ishikawa, and K. Mikolajczyk. Feature tracking and motion compensation for action recognition. In BMVC, Sep. 2008. [27] M. M. Ullah, S. N. Parizi, and I. Laptev. Improving bag-offeatures action recognition with non-local cues. In BMVC, Sep. 2010. [28] E. Vig, M. Dorr, and D. Cox. Saliency-based space-variant descriptor sampling for action recognition. In ECCV, Oct. 2012. [29] H. Wang, A. Kl¨ aser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, Jun. 2011. [30] H. Wang, M. M. Ullah, A. Kl¨ aser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, Sep. 2009. [3 1] G. Willems, T. Tuytelaars, and L. J. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, Oct. 2008. [32] S. Wu, O. Oreifej, and M. Shah. Action recognition in videos acquired by a moving camera using motion decomposition of lagrangian particle trajectories. In ICCV, Nov. 2011. 222555666200
4 0.9948048 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis
Author: Christian Thériault, Nicolas Thome, Matthieu Cord
Abstract: In this paper, we address the challenging problem of categorizing video sequences composed of dynamic natural scenes. Contrarily to previous methods that rely on handcrafted descriptors, we propose here to represent videos using unsupervised learning of motion features. Our method encompasses three main contributions: 1) Based on the Slow Feature Analysis principle, we introduce a learned local motion descriptor which represents the principal and more stable motion components of training videos. 2) We integrate our local motion feature into a global coding/pooling architecture in order to provide an effective signature for each video sequence. 3) We report state of the art classification performances on two challenging natural scenes data sets. In particular, an outstanding improvement of 11 % in classification score is reached on a data set introduced in 2012.
5 0.99477988 165 cvpr-2013-Fast Energy Minimization Using Learned State Filters
Author: Matthieu Guillaumin, Luc Van_Gool, Vittorio Ferrari
Abstract: Pairwise discrete energies defined over graphs are ubiquitous in computer vision. Many algorithms have been proposed to minimize such energies, often concentrating on sparse graph topologies or specialized classes of pairwise potentials. However, when the graph is fully connected and the pairwise potentials are arbitrary, the complexity of even approximate minimization algorithms such as TRW-S grows quadratically both in the number of nodes and in the number of states a node can take. Moreover, recent applications are using more and more computationally expensive pairwise potentials. These factors make it very hard to employ fully connected models. In this paper we propose a novel, generic algorithm to approximately minimize any discrete pairwise energy function. Our method exploits tractable sub-energies to filter the domain of the function. The parameters of the filter are learnt from instances of the same class of energies with good candidate solutions. Compared to existing methods, it efficiently handles fully connected graphs, with many states per node, and arbitrary pairwise potentials, which might be expensive to compute. We demonstrate experimentally on two applications that our algorithm is much more efficient than other generic minimization algorithms such as TRW-S, while returning essentially identical solutions.
6 0.99434584 301 cvpr-2013-Multi-target Tracking by Rank-1 Tensor Approximation
7 0.99428821 379 cvpr-2013-Scalable Sparse Subspace Clustering
8 0.99391001 343 cvpr-2013-Query Adaptive Similarity for Large Scale Object Retrieval
9 0.99357671 113 cvpr-2013-Dense Variational Reconstruction of Non-rigid Surfaces from Monocular Video
10 0.99357653 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
11 0.99337381 260 cvpr-2013-Learning and Calibrating Per-Location Classifiers for Visual Place Recognition
12 0.99320424 148 cvpr-2013-Ensemble Video Object Cut in Highly Dynamic Scenes
13 0.99319172 145 cvpr-2013-Efficient Object Detection and Segmentation for Fine-Grained Recognition
14 0.9929564 346 cvpr-2013-Real-Time No-Reference Image Quality Assessment Based on Filter Learning
15 0.99220765 180 cvpr-2013-Fully-Connected CRFs with Non-Parametric Pairwise Potential
16 0.9921779 309 cvpr-2013-Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context
17 0.99189872 340 cvpr-2013-Probabilistic Label Trees for Efficient Large Scale Image Classification
18 0.9918555 268 cvpr-2013-Leveraging Structure from Motion to Learn Discriminative Codebooks for Scalable Landmark Classification
19 0.99176669 55 cvpr-2013-Background Modeling Based on Bidirectional Analysis
20 0.99168247 357 cvpr-2013-Revisiting Depth Layers from Occlusions