cvpr cvpr2013 cvpr2013-348 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Weixin Li, Qian Yu, Harpreet Sawhney, Nuno Vasconcelos
Abstract: In this work, we propose a novel video representation for activity recognition that models video dynamics with attributes of activities. A video sequence is decomposed into short-term segments, which are characterized by the dynamics of their attributes. These segments are modeled by a dictionary of attribute dynamics templates, which are implemented by a recently introduced generative model, the binary dynamic system (BDS). We propose methods for learning a dictionary of BDSs from a training corpus, and for quantizing attribute sequences extracted from videos into these BDS codewords. This procedure produces a representation of the video as a histogram of BDS codewords, which is denoted the bag-of-words for attribute dynamics (BoWAD). An extensive experimental evaluation reveals that this representation outperforms other state-of-the-art approaches in temporal structure modeling for complex activity recognition.
Reference: text
sentIndex sentText sentNum sentScore
1 In this work, we propose a novel video representation for activity recognition that models video dynamics with attributes of activities. [sent-2, score-0.652]
2 A video sequence is decomposed into short-term segments, which are characterized by the dynamics of their attributes. [sent-3, score-0.329]
3 These segments are modeled by a dictionary of attribute dynamics templates, which are implemented by a recently introduced generative model, the binary dynamic system (BDS). [sent-4, score-0.84]
4 We propose methods for learning a dictionary of BDSs from a training corpus, and for quantizing attribute sequences extracted from videos into these BDS codewords. [sent-5, score-0.626]
5 This procedure produces a representation of the video as a histogram of BDS codewords, which is denoted the bag-of-words for attribute dynamics (BoWAD). [sent-6, score-0.754]
6 The first, motivated by the fact that an activity is naturally defined by an ordered set of short-term behaviors, aims to model the temporal composition of activities. [sent-11, score-0.232]
7 Figure 1: Challenges in modeling the dynamics of attributes of complex activities. [sent-48, score-0.283]
8 (Bottom) associated trajectory on a 3D attribute space (red for “arm-motion”, green for “foot motion” and blue for “ball motion”). [sent-50, score-0.488]
9 Note the complexity of the trajectory and the fact that only a short segment (red-shaded) is a staple of the action of interest. [sent-51, score-0.17]
10 The second, inspired by recent advances in image analysis, is to represent activities as collections of semantic attributes [15, 23, 22, 6]. [sent-53, score-0.184]
11 While a detailed characterization of the temporal structure on top of low-level features is, in general, insufficient to characterize complex activities, the representation of video as an orderless set of attributes is incapable of fine-grained activity discrimination (i.e., [sent-57, score-0.514]
12 distinguishing between activities which express the same attributes in different orders). [sent-59, score-0.184]
13 Recently, [14] has proposed to unify the two research directions, by modeling the temporal structure of the video projection in an attribute space. [sent-60, score-0.565]
14 This was implemented by introducing a dynamic model, denoted binary dynamic system (BDS), which extends classical linear dynamic systems to binary observation spaces. [sent-61, score-0.34]
15 In general, video sequences are only annotated with respect to a dominant event, or high-level subject, and not with respect to the footage that either precedes or trails it. [sent-64, score-0.171]
16 The second is that a single model, such as the BDS, is unlikely to provide a good fit to the complex attribute space trajectories produced by the video. [sent-65, score-0.448]
17 This is illustrated in Figure 1, which presents the trajectory of a video of the “tennis serve” activity in a space spanned by three closely-related attributes. [sent-66, score-0.353]
18 In this work, we propose to address these limitations with a new video representation, which is denoted the bag-of-words for attribute dynamics (BoWAD). [sent-67, score-0.7]
19 However, rather than templates of visual appearance, it relies on templates of attribute dynamics. [sent-70, score-0.456]
20 In this way, an activity is represented as a collection of characteristic short-term behaviors, and no single BDS needs to model unduly complex attribute trajectories. [sent-72, score-0.616]
21 We propose a procedure for learning a dictionary of BDSs, and for quantizing video with respect to this dictionary, and show that the representation achieves performance superior to that of state-of-the-art approaches of temporal structure modeling in challenging datasets. [sent-73, score-0.35]
22 Related Work Over the last decade, the bag-of-features (BoF) has become a popular video representation for action recognition [27]. [sent-75, score-0.189]
23 [7] represented an activity with a small number of decomposable parts or atomic actions. [sent-82, score-0.262]
24 [22] augmented video with text-script data and modeled activities as common sets of attributes, defined in terms of basic actions and objects. [sent-93, score-0.251]
25 This work suggests that the modeling of video trajectories in attribute space is crucial for the fine-grained understanding of human behavior. [sent-95, score-0.499]
26 In this work, we expand on the idea of [14], by learning dictionaries of models for attribute dynamics. [sent-96, score-0.425]
27 The main challenge of this dictionary learning problem is the difficulty of identifying the “centroid” of a collection of dynamic textures, due to the non-Euclidean nature of the space of linear dynamic systems. [sent-98, score-0.29]
28 We propose an alternative principled solution, which is specifically designed for clustering attribute sequences, and has a number of advantages over MDS-kM. [sent-100, score-0.428]
29 The Bag of Words for Attribute Dynamics In this section, we introduce a new representation for activity recognition, denoted the bag-of-words for attribute dynamics (BoWADs). [sent-103, score-0.798]
30 Words and Attributes A popular representation for image classification is the bag of visual words (BoVW) [28], which has recently also become popular for action recognition [27]. [sent-106, score-0.162]
31 This consists of representing an image as a BoF, learning a dictionary of representative feature vectors, which are denoted visual words, and using this dictionary to quantize the features extracted from an image to classify. [sent-107, score-0.287]
32 Despite the popularity of the BoVW, several works have demonstrated the benefits of alternative feature spaces, which encode higher-level semantics by representing images or video as collections of binary attributes [19, 10, 18, 20, 15, 14]. [sent-110, score-0.252]
33 Under this representation, activities are defined with respect to a set of K attributes C = {c_i}_{i=1}^K, inferred from video by a bank of K attribute classifiers {π_i}_{i=1}^K. [sent-111, score-0.184]
34 The BDS embeds the video trajectory in attribute space into a low-dimensional latent space (shown in green), by binary PCA, and learns a Gauss-Markov process that describes the corresponding trajectory in the latent state space (right). [sent-114, score-0.286]
35 A video v ∈ X is mapped into an attribute space S by a mapping π : X → S = [0, 1]^K, (1) where π(v) = (π_1(v), ..., π_K(v))^T (2) is an attribute score vector. [sent-116, score-0.499]
36 Component π_i(v) is a confidence score quantifying the presence of the i-th attribute in v. [sent-117, score-0.4]
37 In this work, these scores are the posterior probabilities π_c(v) = p(c|v) of attribute c given some low-level representation of video v, e. [sent-118, score-0.431]
38 Attribute-based Activity Recognition In [15] a vector of attribute scores π(v) is computed for the whole video sequence v. [sent-123, score-0.562]
39 This holistic attribute representation disregards the temporal structure of the different attributes. [sent-124, score-0.537]
40 This problem can be overcome by applying the attribute classifiers to video segments v_t extracted with a sliding window. [sent-126, score-0.592]
41 As illustrated in Figure 2, this produces a sequence of attribute score vectors {π_t}_{t=1}^τ, where π_t = π(v_t). [sent-127, score-0.485]
42 In summary, a video sequence is modeled as a trajectory in S and sequences of similar semantics span similar trajectories. [sent-128, score-0.247]
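The sliding-window construction of the attribute sequence {π_t} described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `sliding_windows` and `attribute_sequence` are hypothetical, and each attribute classifier is assumed to be any callable that maps a list of frames to a score in [0, 1].

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a classifier maps a video segment (list of frames) to a score in [0, 1].
AttributeClassifier = Callable[[Sequence[float]], float]


def sliding_windows(num_frames: int, window: int, step: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, keeping full windows only."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [(s, s + window) for s in range(0, num_frames - window + 1, step)]


def attribute_sequence(
    frames: Sequence[float],
    classifiers: Sequence[AttributeClassifier],
    window: int,
    step: int,
) -> List[List[float]]:
    """Apply every attribute classifier to every window: one K-dim score vector per window."""
    sequence = []
    for start, end in sliding_windows(len(frames), window, step):
        segment = frames[start:end]
        sequence.append([clf(segment) for clf in classifiers])
    return sequence
```

With a 30-frame window and a 10-frame step, a 50-frame video yields three windows, so the trajectory in S has three points, each of dimension K.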
43 Li and Vasconcelos proposed to model a video trajectory in S with a binary dynamic system (BDS) [14], defined by ? [sent-129, score-0.302]
44 Since, in the context of attribute representations, only the attribute scores π_t (and not the attribute variables themselves) are known, [14] replaced the log-likelihood of (4) by the expected log-likelihood E_Y[L] =? [sent-157, score-1.2]
45 Note that this matrix characterizes the state space trajectory, which is mapped (given C and u) into the video trajectory in S. [sent-167, score-0.218]
46 Bag of Words for Attribute Dynamics While substantially more descriptive than the holistic attribute model of [15], the BDS of [14] still has two serious limitations as a model of video dynamics. [sent-171, score-0.57]
47 First, there is, in general, no guarantee that the whole video sequence depicts the activity of interest. [sent-173, score-0.328]
48 Fitting a single dynamic model to long video sequences will lead to parameter estimates that are not representative of the event of interest. [sent-179, score-0.302]
49 Second, since complex activities are composed of several atomic actions, sometimes disjoint in time, their state trajectories are unlikely to follow the Gauss-Markov process. [sent-180, score-0.238]
50 On the other hand, most activities can be effectively inferred by a characterization of the short-term segments that compose them. [sent-182, score-0.174]
51 The presence (or absence) of a video segment with attributes “jump-jump” is sufficient to discriminate between the two activities. [sent-184, score-0.212]
52 Based on these observations, we propose to model video with an extension of the BoVW that captures the short-term dynamics of the attribute representation of an action. [sent-185, score-0.697]
53 A video sequence is first split into a collection of temporally overlapping segments {s^(i)}_{i=1}^N. [sent-186, score-0.296]
54 This produces a set of attribute score vectors Π^(i) = {π_t^(i)}_{t=1}^τ, which is denoted the attribute sequence of segment s^(i). [sent-188, score-0.942]
55 The video sequence is finally represented by a bag of attribute sequences (BoAS), which plays the role, in the proposed framework, of the BoF in image classification. [sent-189, score-0.68]
56 A dictionary of representative BDSs {Ω^(i)}_{i=1}^V, which are denoted words for attribute dynamics (WAD), learned from a set of training BoAS, is then used to quantize the BoAS extracted from the video sequence to classify. [sent-190, score-0.301]
57 The resulting histogram of WAD counts, denoted a bag of words for attribute dynamics (BoWAD) is finally used as a feature vector for video classification. [sent-191, score-0.795]
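The two bookkeeping steps of this pipeline, splitting an attribute sequence into overlapping segments and counting codeword assignments, can be sketched as below. The helper names are hypothetical, and the sketch assumes the assignment of each segment to a WAD index has already been computed by some quantizer; it returns raw counts, since the text speaks of a histogram of WAD counts.

```python
from collections import Counter
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_segments(sequence: Sequence[T], length: int, step: int) -> List[Sequence[T]]:
    """Split an attribute sequence into overlapping segments of fixed length."""
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    return [sequence[s:s + length] for s in range(0, len(sequence) - length + 1, step)]


def bowad_histogram(assignments: Sequence[int], vocab_size: int) -> List[int]:
    """Count how many segments were assigned to each of the V codewords."""
    counts = Counter(assignments)
    return [counts.get(k, 0) for k in range(vocab_size)]
```

For example, a sequence of 20 attribute vectors split with length 12 and step 3 gives three segments, and the assignments `[0, 2, 2, 1, 2]` over a four-word dictionary give the histogram `[1, 1, 3, 0]`.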
58 For now, we address the problem of quantizing attribute sequences. [sent-197, score-0.4]
59 An extension to the clustering of BoAS is not straightforward because 1) attribute sequences can have different length; 2) the space of these sequences has non-Euclidean geometry; and 3) the search for optimal prototypes, under this geometry, may lead to intractable nonlinear optimization. [sent-208, score-0.572]
60 More importantly, because we are interested in characterizing the appearance and dynamics of attribute sequences, it is more desirable to find a set of prototype BDSs than a set of prototype sequences. [sent-209, score-0.567]
61 Figure 3: BoWAD representation of the activity “diving-springboard”. [sent-224, score-0.197]
62 (Middle) the holistic vector of attribute scores is now represented as a trajectory in the attribute space (which is four dimensional, in this example, and represented as four colored functions). [sent-226, score-0.928]
63 The activity is represented by a BoWAD, which is a histogram of assignments of segments to WADs. [sent-230, score-0.233]
64 First, it clusters attribute sequences rather than the models themselves, as is done by [21, 1]. [sent-237, score-0.472]
65 Algorithm 2: Learning a Cluster for WADs Dictionary. Input: a set of n sequences of attribute score vectors {{π_t^(i)}_{t=1}^τ}_{i=1}^n, state space dimension L. [sent-250, score-1.019]
66 The algorithm of [14] for learning a BDS from a single attribute sequence. [sent-275, score-0.4]
67 A binary PCA is first applied to all attribute score vectors in P? [sent-278, score-0.461]
68 The parameters of the latent Gauss-Markov process are then learned by solving a least squares problem involving all latent state sequences returned by binary PCA. [sent-280, score-0.182]
69 In this way, the BDS learned per cluster jointly characterizes the appearance and dynamics of all attribute sequences in that cluster. [sent-281, score-0.678]
70 (13) where {σ(θ_t^(a))} and {σ(θ_t^(b))} are the parameters of the multivariate Bernoulli distributions from which the binary attribute vectors are sampled, for the two BDSs. [sent-296, score-0.461]
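Equation (13) itself is not recoverable from this extract. As a hedged illustration only, and assuming that the subscript BC in d_BC refers to a Bhattacharyya-type measure, the sketch below computes the Bhattacharyya distance between two multivariate Bernoulli distributions with independent components at a single time step. How the per-step terms are combined over t in the paper is not shown here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def bernoulli_bhattacharyya_coefficient(p: Sequence[float], q: Sequence[float]) -> float:
    """Bhattacharyya coefficient of two products of independent Bernoulli distributions."""
    if len(p) != len(q):
        raise ValueError("parameter vectors must have the same length")
    coefficient = 1.0
    for pk, qk in zip(p, q):
        if not (0.0 <= pk <= 1.0 and 0.0 <= qk <= 1.0):
            raise ValueError("Bernoulli parameters must lie in [0, 1]")
        # Sum over the two outcomes (1 and 0) of sqrt(P(outcome) * Q(outcome)).
        coefficient *= math.sqrt(pk * qk) + math.sqrt((1.0 - pk) * (1.0 - qk))
    return coefficient


def bhattacharyya_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Negative log of the coefficient; infinite when the distributions do not overlap."""
    coefficient = bernoulli_bhattacharyya_coefficient(p, q)
    return -math.log(coefficient) if coefficient > 0.0 else math.inf
```

The distance is zero (up to rounding) for identical parameter vectors, symmetric in its arguments, and grows as the two sets of attribute probabilities diverge.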
71 Quantization Given a WAD dictionary {Ω^(i)}_{i=1}^V, a BoAS {{π_t^(i)}_{t=1}^τ}_{i=1}^N is quantized by assigning the i-th attribute sequence to the k*-th cluster according to k* = argmin_j d_BC Ω^(j)? [sent-304, score-0.216]
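The assignment rule is a nearest-codeword search. The sketch below shows only that argmin step, with the distance passed in as a callable; the name `quantize` and the generic `distance(codeword, sequence)` signature are assumptions for illustration, and the actual d_BC between a BDS and an attribute sequence is not reproduced here.

```python
import math
from typing import Callable, Sequence, TypeVar

Codeword = TypeVar("Codeword")
AttributeSeq = TypeVar("AttributeSeq")


def quantize(
    sequence: AttributeSeq,
    dictionary: Sequence[Codeword],
    distance: Callable[[Codeword, AttributeSeq], float],
) -> int:
    """Return the index k* of the codeword closest to the sequence (first index on ties)."""
    if not dictionary:
        raise ValueError("dictionary must not be empty")
    best_index, best_distance = 0, math.inf
    for j, codeword in enumerate(dictionary):
        d = distance(codeword, sequence)
        if d < best_distance:
            best_index, best_distance = j, d
    return best_index
```

Applying `quantize` to every attribute sequence in a BoAS yields the list of codeword indices from which the BoWAD histogram is counted.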
72 Binary SVMs using histogram intersection kernel (HIK) with probability outputs [3] were used as attribute models, learned from annotated training video clips (see supplementary material for attribute definitions). [sent-325, score-0.522]
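For reference, the histogram intersection kernel between two histograms is the sum of their bin-wise minima. A minimal sketch follows (the function name is illustrative, and whether the histograms were normalized before applying the kernel is not stated in this extract):

```python
from typing import Sequence


def histogram_intersection(h: Sequence[float], g: Sequence[float]) -> float:
    """Histogram intersection kernel: sum over bins of min(h_k, g_k)."""
    if len(h) != len(g):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h, g))
```

A kernel matrix built from this function over all training histograms can be supplied to an SVM implementation that accepts precomputed kernels.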
73 Weizmann Activity The first set of experiments was based on composite sequences synthesized from the Weizmann dataset [8], which contains 10 atomic action classes, performed by 9 people, for a total of 90 samples. [sent-329, score-0.196]
74 BoWAD was compared to the vanilla BoF, BoF with t3 temporal pyramids [11] (denoted “BoF-TP”), holistic attributes [15] (denoted “Attribute”) and BDS [14]. [sent-330, score-0.196]
75 Attribute sequences were computed over 30-frame sliding video windows of 10-frame step. [sent-331, score-0.198]
76 To compute BoWADs, each short-term attribute sequence consisted of the attribute vectors from 12 consecutive windows, extracted with a step of 3 windows. [sent-333, score-0.939]
77 -all SVMs with HIK were used in all histogram-based methods (BoF, BoF-TP, BoWAD, attribute models), where STIP features used a 1000-word vocabulary. [sent-337, score-0.4]
78 An activity was defined as a sequence of 20 consecutive atomic actions from Weizmann. [sent-342, score-0.382]
79 This sequence was inserted at a random temporal location of a larger sequence of 40 atomic actions. [sent-343, score-0.28]
80 In this case, each activity was defined by two subsequences, each with 10 consecutive atomic actions. [sent-347, score-0.261]
81 Attribute, show that modeling of activity dynamics is critical for success in these datasets. [sent-483, score-0.333]
82 While BDS has substantially improved performance, the underlying assumption of a single dynamic process is a limitation for these sequences, where the activities of interest are not temporally aligned and are surrounded by irrelevant video. [sent-484, score-0.224]
83 The performance of BoWADs, learned with BMC and MDS-kM, was compared to BoF-TP [11], activity models with decomposable segments [16], the hidden Markov model with latent states of variable duration of [25], the holistic attribute representation of [15], and the BDS [14]. [sent-490, score-0.752]
84 A 30-frame sliding video window, with a step of 4 frames, was used to compute attribute scores. [sent-493, score-0.526]
85 For the BoWAD, attribute sequences consisted of 12 consecutive attribute vectors, with a 75% overlap between consecutive sequences. [sent-494, score-0.956]
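The relation between segment length, overlap and step can be checked with a one-line computation. The helper below is a hypothetical convenience, not part of the paper; it simply confirms that a 75% overlap between 12-vector sequences corresponds to a step of 3 vectors.

```python
def step_from_overlap(length: int, overlap: float) -> int:
    """Step (in attribute vectors) between consecutive segments for a given fractional overlap."""
    if length <= 0:
        raise ValueError("length must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must lie in [0, 1)")
    # Consecutive segments share length * overlap vectors, so they start length * (1 - overlap) apart.
    return max(round(length * (1.0 - overlap)), 1)
```

With `length=12` and `overlap=0.75` the step is 3, matching the step of 3 windows quoted for the Weizmann experiments.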
86 It works particularly well for categories, such as “tennis-serve”, which have large variability and tend to include video irrelevant for activity detection, or category pairs, such as “triple-jump” and “long-jump”, that differ in subtle ways. [sent-503, score-0.265]
87 The robustness inherent to a vocabulary of dynamics is critical for the former (compare the 73. [sent-504, score-0.199]
88 7% of BDS on “tennis serve”), while the detailed characterization of attribute dynamics is critical for the latter (75. [sent-506, score-0.603]
89 Attribute scores were computed with a 180-frame sliding window with steps of 30 frames, and attribute sub-sequences (τ = 10) were extracted every window. [sent-523, score-0.427]
90 The fact that the BoWAD substantially outperforms the BDS also confirms the observation that the robustness of a vocabulary of local attribute dynamics is critical for accurate detection of complex activities. [sent-591, score-0.656]
91 While it is difficult to model this sequence as a whole, due to the large variability of cutting in different videos, it is much easier to capture short-term signature actions, such as “slide-jump”, which are usually not broken during video editing. [sent-595, score-0.162]
92 Conclusion In this work, we proposed a novel solution to the problem of modeling attribute and dynamics for activity recognition. [sent-598, score-0.733]
93 The method combines the advantages, in terms of robustness, of histogram-based representations, with the power of BDSs to model the dynamics of video attributes. [sent-599, score-0.266]
94 We developed new algorithms for learning BDS dictionaries and quantizing video with them. [sent-600, score-0.164]
95 The proposed representation significantly outperforms other state-of-the-art attribute-based or temporal-structure-modeling approaches in complex activity recognition. [sent-601, score-0.223]
96 Group action induced distances for averaging and clustering linear dynamical systems with applications to the analysis of dynamic scenes. [sent-607, score-0.199]
97 Human activity recognition using a dynamic texture based method. [sent-660, score-0.242]
98 Learning to detect unseen object classes by between-class attribute transfer. [sent-667, score-0.4]
99 Modeling temporal structure of decomposable motion segments for activity classification. [sent-705, score-0.307]
100 Action bank: A high-level representation of activity in video. [sent-755, score-0.197]
wordName wordTfidf (topN-words)
[('bds', 0.546), ('attribute', 0.4), ('bowad', 0.364), ('dynamics', 0.167), ('activity', 0.166), ('bdss', 0.146), ('bmc', 0.134), ('boas', 0.127), ('wad', 0.127), ('bof', 0.117), ('dictionary', 0.114), ('video', 0.099), ('activities', 0.094), ('zi', 0.091), ('bowads', 0.091), ('attributes', 0.09), ('trajectory', 0.088), ('bovw', 0.085), ('dynamic', 0.076), ('ldss', 0.073), ('sequences', 0.072), ('temporal', 0.066), ('atomic', 0.065), ('sequence', 0.063), ('olympic', 0.06), ('action', 0.059), ('actions', 0.058), ('event', 0.055), ('devt', 0.055), ('wads', 0.055), ('kt', 0.046), ('trecvid', 0.046), ('bag', 0.046), ('segments', 0.044), ('ravichandran', 0.042), ('vasconcelos', 0.042), ('chaudhry', 0.04), ('holistic', 0.04), ('latent', 0.04), ('quantizing', 0.04), ('binary', 0.039), ('cluster', 0.039), ('hik', 0.039), ('throw', 0.039), ('kl', 0.037), ('harpreet', 0.036), ('laxton', 0.036), ('ykt', 0.036), ('characterization', 0.036), ('dynamical', 0.036), ('ti', 0.035), ('xt', 0.034), ('denoted', 0.034), ('events', 0.033), ('stip', 0.033), ('weizmann', 0.033), ('dbc', 0.032), ('vocabulary', 0.032), ('substantially', 0.031), ('representation', 0.031), ('yt', 0.031), ('svms', 0.031), ('decomposable', 0.031), ('niebles', 0.031), ('state', 0.031), ('behaviors', 0.03), ('consecutive', 0.03), ('laptev', 0.029), ('clustering', 0.028), ('gaidon', 0.028), ('templates', 0.028), ('tb', 0.028), ('sliding', 0.027), ('qian', 0.027), ('subsequences', 0.027), ('rk', 0.026), ('words', 0.026), ('complex', 0.026), ('pca', 0.026), ('substantial', 0.025), ('dictionaries', 0.025), ('quantize', 0.025), ('semantics', 0.024), ('consisted', 0.024), ('rohrbach', 0.024), ('bc', 0.024), ('tennis', 0.024), ('collection', 0.024), ('surrounded', 0.023), ('prototypes', 0.023), ('histogram', 0.023), ('segment', 0.023), ('sadanand', 0.023), ('inserted', 0.023), ('sports', 0.022), ('unlikely', 0.022), ('iarpa', 0.022), ('vt', 0.022), ('vectors', 0.022), ('entails', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics
Author: Weixin Li, Qian Yu, Harpreet Sawhney, Nuno Vasconcelos
Abstract: In this work, we propose a novel video representation for activity recognition that models video dynamics with attributes of activities. A video sequence is decomposed into short-term segments, which are characterized by the dynamics of their attributes. These segments are modeled by a dictionary of attribute dynamics templates, which are implemented by a recently introduced generative model, the binary dynamic system (BDS). We propose methods for learning a dictionary of BDSs from a training corpus, and for quantizing attribute sequences extracted from videos into these BDS codewords. This procedure produces a representation of the video as a histogram of BDS codewords, which is denoted the bag-of-words for attribute dynamics (BoWAD). An extensive experimental evaluation reveals that this representation outperforms other state-of-the-art approaches in temporal structure modeling for complex activity recognition.
2 0.21883543 116 cvpr-2013-Designing Category-Level Attributes for Discriminative Visual Recognition
Author: Felix X. Yu, Liangliang Cao, Rogerio S. Feris, John R. Smith, Shih-Fu Chang
Abstract: Attribute-based representation has shown great promises for visual recognition due to its intuitive interpretation and cross-category generalization property. However, human efforts are usually involved in the attribute designing process, making the representation costly to obtain. In this paper, we propose a novel formulation to automatically design discriminative “category-level attributes”, which can be efficiently encoded by a compact category-attribute matrix. The formulation allows us to achieve intuitive and critical design criteria (category-separability, learnability) in a principled way. The designed attributes can be used for tasks of cross-category knowledge transfer, achieving superior performance over well-known attribute dataset Animals with Attributes (AwA) and a large-scale ILSVRC2010 dataset (1.2M images). This approach also leads to state-of-the-art performance on the zero-shot learning task on AwA.
3 0.20132656 461 cvpr-2013-Weakly Supervised Learning for Attribute Localization in Outdoor Scenes
Author: Shuo Wang, Jungseock Joo, Yizhou Wang, Song-Chun Zhu
Abstract: In this paper, we propose a weakly supervised method for simultaneously learning scene parts and attributes from a collection of images associated with attributes in text, where the precise localization of each attribute is left unknown. Our method includes three aspects. (i) Compositional scene configuration. We learn the spatial layouts of the scene by Hierarchical Space Tiling (HST) representation, which can generate an excessive number of scene configurations through the hierarchical composition of a relatively small number of parts. (ii) Attribute association. The scene attributes contain nouns and adjectives corresponding to the objects and their appearance descriptions respectively. We assign the nouns to the nodes (parts) in HST using nonmaximum suppression of their correlation, then train an appearance model for each noun+adjective attribute pair. (iii) Joint inference and learning. For an image, we compute the most probable parse tree with the attributes as an instantiation of the HST by dynamic programming. Then update the HST and attribute association based on the inferred parse trees. We evaluate the proposed method by (i) showing the improvement of attribute recognition accuracy; and (ii) comparing the average precision of localizing attributes to the scene parts.
4 0.19814441 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes
Author: Zhigang Ma, Yi Yang, Zhongwen Xu, Shuicheng Yan, Nicu Sebe, Alexander G. Hauptmann
Abstract: Complex events essentially include human, scenes, objects and actions that can be summarized by visual attributes, so leveraging relevant attributes properly could be helpful for event detection. Many works have exploited attributes at image level for various applications. However, attributes at image level are possibly insufficient for complex event detection in videos due to their limited capability in characterizing the dynamic properties of video data. Hence, we propose to leverage attributes at video level (named as video attributes in this work), i.e., the semantic labels of external videos are used as attributes. Compared to complex event videos, these external videos contain simple contents such as objects, scenes and actions which are the basic elements of complex events. Specifically, building upon a correlation vector which correlates the attributes and the complex event, we incorporate video attributes latently as extra informative cues into the event detector learnt from complex event videos. Extensive experiments on a real-world large-scale dataset validate the efficacy of the proposed approach.
5 0.19034038 101 cvpr-2013-Cumulative Attribute Space for Age and Crowd Density Estimation
Author: Ke Chen, Shaogang Gong, Tao Xiang, Chen Change Loy
Abstract: A number of computer vision problems such as human age estimation, crowd density estimation and body/face pose (view angle) estimation can be formulated as a regression problem by learning a mapping function between a high dimensional vector-formed feature input and a scalar-valued output. Such a learning problem is made difficult due to sparse and imbalanced training data and large feature variations caused by both uncertain viewing conditions and intrinsic ambiguities between observable visual features and the scalar values to be estimated. Encouraged by the recent success in using attributes for solving classification problems with sparse training data, this paper introduces a novel cumulative attribute concept for learning a regression model when only sparse and imbalanced data are available. More precisely, low-level visual features extracted from sparse and imbalanced image samples are mapped onto a cumulative attribute space where each dimension has clearly defined semantic interpretation (a label) that captures how the scalar output value (e.g. age, people count) changes continuously and cumulatively. Extensive experiments show that our cumulative attribute framework gains notable advantage on accuracy for both age estimation and crowd counting when compared against conventional regression models, especially when the labelled training data is sparse with imbalanced sampling.
6 0.18011977 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
7 0.17490242 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
8 0.1746887 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes
9 0.17358863 229 cvpr-2013-It's Not Polite to Point: Describing People with Uncertain Attributes
10 0.16878641 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
11 0.16084597 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
12 0.1541459 241 cvpr-2013-Label-Embedding for Attribute-Based Classification
13 0.15176757 396 cvpr-2013-Simultaneous Active Learning of Classifiers & Attributes via Relative Feedback
14 0.15069535 48 cvpr-2013-Attribute-Based Detection of Unfamiliar Classes with Humans in the Loop
15 0.1480957 293 cvpr-2013-Multi-attribute Queries: To Merge or Not to Merge?
16 0.1363913 146 cvpr-2013-Enriching Texture Analysis with Semantic Data
17 0.13345811 99 cvpr-2013-Cross-View Image Geolocalization
19 0.12629497 310 cvpr-2013-Object-Centric Anomaly Detection by Attribute-Based Reasoning
20 0.12431093 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
topicId topicWeight
[(0, 0.199), (1, -0.164), (2, -0.11), (3, -0.068), (4, -0.107), (5, 0.051), (6, -0.245), (7, 0.034), (8, 0.018), (9, 0.227), (10, 0.029), (11, 0.04), (12, -0.048), (13, -0.007), (14, 0.082), (15, 0.071), (16, -0.008), (17, 0.067), (18, -0.026), (19, 0.001), (20, -0.009), (21, 0.059), (22, 0.022), (23, 0.035), (24, 0.019), (25, -0.012), (26, -0.044), (27, -0.011), (28, -0.005), (29, 0.082), (30, 0.055), (31, -0.006), (32, 0.014), (33, -0.06), (34, -0.022), (35, -0.012), (36, -0.016), (37, -0.082), (38, -0.01), (39, 0.011), (40, 0.021), (41, 0.038), (42, 0.066), (43, -0.015), (44, -0.026), (45, 0.005), (46, 0.048), (47, 0.016), (48, 0.022), (49, -0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.95411235 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics
Author: Weixin Li, Qian Yu, Harpreet Sawhney, Nuno Vasconcelos
Abstract: In this work, we propose a novel video representation for activity recognition that models video dynamics with attributes of activities. A video sequence is decomposed into short-term segments, which are characterized by the dynamics of their attributes. These segments are modeled by a dictionary of attribute dynamics templates, which are implemented by a recently introduced generative model, the binary dynamic system (BDS). We propose methods for learning a dictionary of BDSs from a training corpus, and for quantizing attribute sequences extracted from videos into these BDS codewords. This procedure produces a representation of the video as a histogram of BDS codewords, which is denoted the bag-of-words for attribute dynamics (BoWAD). An extensive experimental evaluation reveals that this representation outperforms other state-of-the-art approaches in temporal structure modeling for complex activity recognition.
2 0.73188913 116 cvpr-2013-Designing Category-Level Attributes for Discriminative Visual Recognition
Author: Felix X. Yu, Liangliang Cao, Rogerio S. Feris, John R. Smith, Shih-Fu Chang
Abstract: Attribute-based representation has shown great promises for visual recognition due to its intuitive interpretation and cross-category generalization property. However, human efforts are usually involved in the attribute designing process, making the representation costly to obtain. In this paper, we propose a novel formulation to automatically design discriminative “category-level attributes”, which can be efficiently encoded by a compact category-attribute matrix. The formulation allows us to achieve intuitive and critical design criteria (category-separability, learnability) in a principled way. The designed attributes can be used for tasks of cross-category knowledge transfer, achieving superior performance over well-known attribute dataset Animals with Attributes (AwA) and a large-scale ILSVRC2010 dataset (1.2M images). This approach also leads to state-of-the-art performance on the zero-shot learning task on AwA.
3 0.73044479 85 cvpr-2013-Complex Event Detection via Multi-source Video Attributes
Author: Zhigang Ma, Yi Yang, Zhongwen Xu, Shuicheng Yan, Nicu Sebe, Alexander G. Hauptmann
Abstract: Complex events essentially include human, scenes, objects and actions that can be summarized by visual attributes, so leveraging relevant attributes properly could be helpful for event detection. Many works have exploited attributes at image level for various applications. However, attributes at image level are possibly insufficient for complex event detection in videos due to their limited capability in characterizing the dynamic properties of video data. Hence, we propose to leverage attributes at video level (named as video attributes in this work), i.e., the semantic labels of external videos are used as attributes. Compared to complex event videos, these external videos contain simple contents such as objects, scenes and actions which are the basic elements of complex events. Specifically, building upon a correlation vector which correlates the attributes and the complex event, we incorporate video attributes latently as extra informative cues into the event detector learnt from complex event videos. Extensive experiments on a real-world large-scale dataset validate the efficacy of the proposed approach.
4 0.71860951 48 cvpr-2013-Attribute-Based Detection of Unfamiliar Classes with Humans in the Loop
Author: Catherine Wah, Serge Belongie
Abstract: Recent work in computer vision has addressed zero-shot learning or unseen class detection, which involves categorizing objects without observing any training examples. However, these problems assume that attributes or defining characteristics of these unobserved classes are known, leveraging this information at test time to detect an unseen class. We address the more realistic problem of detecting categories that do not appear in the dataset in any form. We denote such a category as an unfamiliar class; it is neither observed at train time, nor do we possess any knowledge regarding its relationships to attributes. This problem is one that has received limited attention within the computer vision community. In this work, we propose a novel approach to the unfamiliar class detection task that builds on attribute-based classification methods, and we empirically demonstrate how classification accuracy is impacted by attribute noise and dataset “difficulty,” as quantified by the separation of classes in the attribute space. We also present a method for incorporating human users to overcome deficiencies in attribute detection. We demonstrate results superior to existing methods on the challenging CUB-200-2011 dataset.
5 0.71550071 310 cvpr-2013-Object-Centric Anomaly Detection by Attribute-Based Reasoning
Author: Babak Saleh, Ali Farhadi, Ahmed Elgammal
Abstract: When describing images, humans tend not to talk about the obvious, but rather mention what they find interesting. We argue that abnormalities and deviations from typicalities are among the most important components that form what is worth mentioning. In this paper we introduce abnormality detection as a recognition problem and show how to model typicalities and, consequently, meaningful deviations from prototypical properties of categories. Our model can recognize abnormalities and report the main reasons for any recognized abnormality. We also show that abnormality predictions can help image categorization. We introduce the abnormality detection dataset and show interesting results on how to reason about abnormalities.
6 0.69582707 241 cvpr-2013-Label-Embedding for Attribute-Based Classification
7 0.67615378 229 cvpr-2013-It's Not Polite to Point: Describing People with Uncertain Attributes
8 0.6733982 293 cvpr-2013-Multi-attribute Queries: To Merge or Not to Merge?
9 0.66384363 461 cvpr-2013-Weakly Supervised Learning for Attribute Localization in Outdoor Scenes
10 0.64571905 396 cvpr-2013-Simultaneous Active Learning of Classifiers & Attributes via Relative Feedback
11 0.61868262 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
12 0.60462242 101 cvpr-2013-Cumulative Attribute Space for Age and Crowd Density Estimation
13 0.59329993 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition
14 0.57161295 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
15 0.56750029 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
16 0.54946327 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?
17 0.54053158 353 cvpr-2013-Relative Hidden Markov Models for Evaluating Motion Skill
18 0.5175392 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes
19 0.49374682 146 cvpr-2013-Enriching Texture Analysis with Semantic Data
20 0.48945633 99 cvpr-2013-Cross-View Image Geolocalization
topicId topicWeight
[(10, 0.097), (16, 0.029), (26, 0.05), (28, 0.01), (33, 0.295), (38, 0.013), (67, 0.06), (69, 0.043), (77, 0.023), (80, 0.012), (87, 0.054), (95, 0.206)]
simIndex simValue paperId paperTitle
same-paper 1 0.87845016 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics
Author: Weixin Li, Qian Yu, Harpreet Sawhney, Nuno Vasconcelos
Abstract: In this work, we propose a novel video representation for activity recognition that models video dynamics with attributes of activities. A video sequence is decomposed into short-term segments, which are characterized by the dynamics of their attributes. These segments are modeled by a dictionary of attribute dynamics templates, which are implemented by a recently introduced generative model, the binary dynamic system (BDS). We propose methods for learning a dictionary of BDSs from a training corpus, and for quantizing attribute sequences extracted from videos into these BDS codewords. This procedure produces a representation of the video as a histogram of BDS codewords, which is denoted the bag-of-words for attribute dynamics (BoWAD). An extensive experimental evaluation reveals that this representation outperforms other state-of-the-art approaches in temporal structure modeling for complex activity recognition.
2 0.87832767 464 cvpr-2013-What Makes a Patch Distinct?
Author: Ran Margolin, Ayellet Tal, Lihi Zelnik-Manor
Abstract: What makes an object salient? Most previous work asserts that distinctness is the dominating factor. The difference between the various algorithms is in the way they compute distinctness. Some focus on the patterns, others on the colors, and several add high-level cues and priors. We propose a simple, yet powerful, algorithm that integrates these three factors. Our key contribution is a novel and fast approach to compute pattern distinctness. We rely on the inner statistics of the patches in the image for identifying unique patterns. We provide an extensive evaluation and show that our approach outperforms all state-of-the-art methods on the five most commonly-used datasets.
3 0.85753846 116 cvpr-2013-Designing Category-Level Attributes for Discriminative Visual Recognition
Author: Felix X. Yu, Liangliang Cao, Rogerio S. Feris, John R. Smith, Shih-Fu Chang
Abstract: Attribute-based representation has shown great promise for visual recognition due to its intuitive interpretation and cross-category generalization property. However, human effort is usually involved in the attribute design process, making the representation costly to obtain. In this paper, we propose a novel formulation to automatically design discriminative “category-level attributes”, which can be efficiently encoded by a compact category-attribute matrix. The formulation allows us to achieve intuitive and critical design criteria (category-separability, learnability) in a principled way. The designed attributes can be used for tasks of cross-category knowledge transfer, achieving superior performance on the well-known attribute dataset Animals with Attributes (AwA) and a large-scale ILSVRC2010 dataset (1.2M images). This approach also leads to state-of-the-art performance on the zero-shot learning task on AwA.
4 0.84839076 207 cvpr-2013-Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation
Author: Ľ
Abstract: Our goal is to detect humans and estimate their 2D pose in single images. In particular, we handle cases of partial visibility where some limbs may be occluded or one person partially occludes another. Two standard, but disparate, approaches have developed in the field: the first is the part-based approach for layout-type problems, involving optimising an articulated pictorial structure; the second is the pixel-based approach for image labelling, involving optimising a random field graph defined on the image. Our novel contribution is a formulation for pose estimation which combines these two models in a principled way in one optimisation problem and thereby inherits the advantages of both of them. Inference on this joint model finds the set of instances of persons in an image, the location of their joints, and a pixel-wise body part labelling. We achieve near or state of the art results on standard human pose data sets, and demonstrate the correct estimation for cases of self-occlusion, person overlap and image truncation.
5 0.84030658 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
Author: Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese
Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.
6 0.84016794 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs
7 0.83998007 364 cvpr-2013-Robust Object Co-detection
8 0.83944607 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection
9 0.83937818 82 cvpr-2013-Class Generative Models Based on Feature Regression for Pose Estimation of Object Categories
10 0.83854574 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
11 0.8383339 202 cvpr-2013-Hierarchical Saliency Detection
12 0.8374297 96 cvpr-2013-Correlation Filters for Object Alignment
13 0.83741617 168 cvpr-2013-Fast Object Detection with Entropy-Driven Evaluation
14 0.83735353 380 cvpr-2013-Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images
15 0.83729428 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
16 0.83714706 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
17 0.83693796 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds
18 0.83682638 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
19 0.83672851 67 cvpr-2013-Blocks That Shout: Distinctive Parts for Scene Classification
20 0.83664757 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes