iccv iccv2013 iccv2013-81 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kevin Tang, Bangpeng Yao, Li Fei-Fei, Daphne Koller
Abstract: In this paper, we tackle the problem of combining features extracted from video for complex event recognition. Feature combination is an especially relevant task in video data, as there are many features we can extract, ranging from image features computed from individual frames to video features that take temporal information into account. To combine features effectively, we propose a method that is able to be selective of different subsets of features, as some features or feature combinations may be uninformative for certain classes. We introduce a hierarchical method for combining features based on the AND/OR graph structure, where nodes in the graph represent combinations of different sets of features. Our method automatically learns the structure of the AND/OR graph using score-based structure learning, and we introduce an inference procedure that is able to efficiently compute structure scores. We present promising results and analysis on the difficult and large-scale 2011 TRECVID Multimedia Event Detection dataset [17].
Reference: text
sentIndex sentText sentNum sentScore
1 Feature combination is an especially relevant task in video data, as there are many features we can extract, ranging from image features computed from individual frames to video features that take temporal information into account. [sent-2, score-0.354]
2 To combine features effectively, we propose a method that is able to be selective of different subsets of features, as some features or feature combinations may be uninformative for certain classes. [sent-3, score-0.428]
3 We introduce a hierarchical method for combining features based on the AND/OR graph structure, where nodes in the graph represent combinations of different sets of features. [sent-4, score-1.013]
4 Our method automatically learns the structure of the AND/OR graph using score-based structure learning, and we introduce an inference procedure that is able to efficiently compute structure scores. [sent-5, score-0.579]
5 Introduction As recent research in video understanding has shifted to classifying complex events like “Attempting a board trick” [17], it is now very difficult for a single feature to capture the information required to discriminate between different complex event categories. [sent-8, score-0.473]
6 Given these features, the problem we seek to address is finding the optimal way to combine them together for effective complex event recognition. [sent-12, score-0.296]
7 Together, these intuitions can help alleviate the limitations of a conventional feature combination approach where all of the features are combined or considered simultaneously. [sent-27, score-0.373]
8 As shown in Figure 1, standard methods like kernel averaging [5] do not perform feature selection, and methods like Multiple Kernel Learning (MKL) [6] consider all features together in a single combination, making it difficult to discover complementary sets of features. [sent-28, score-0.512]
9 To capture these intuitions, we introduce a novel method for feature combination that represents feature combinations using an AND/OR graph structure, with nodes in the graph representing combinations of different sets of features. [sent-29, score-1.09]
10 The presence of OR nodes allows us to be selective of the features we want to combine for each class, and the hierarchical structure of the AND/OR graph allows us to consider sets of features independently to better discover complementary information. [sent-30, score-1.05]
11 Our method is able to constrain and search the large space of possible AND/OR graph structures for the optimal structure, and we introduce an approximate inference procedure that is able to efficiently compute structure scores. [sent-31, score-0.544]
12 Related Work Many recent works in video understanding have focused on complex event recognition in large-scale datasets [17], which is the focus of this paper. [sent-34, score-0.302]
13 The standard approach to combining features is Multiple Kernel Learning (MKL), which has been used for various tasks in computer vision including object categorization [20], object detection [21], multi-class object classification [5], and complex event recognition [14]. [sent-38, score-0.394]
14 In [1], the author considers hierarchical multiple kernel learning using kernels that can be decomposed into a large sum of separate basis kernels. [sent-41, score-0.448]
15 The work of [8] considers semantic kernel forests constructed with human knowledge. (Figure 2: kernel matrices K1–K9 are computed from features extracted from positive and negative videos.) [sent-43, score-0.291]
16 The LEAF nodes encode the input kernel matrices, which are then combined using AND/OR nodes up to the root node. [sent-45, score-0.815]
17 Our method utilizes the AND/OR graph structure as a representation for combining features. [sent-49, score-0.382]
18 The AND/OR graph structure has been used for many different applications in computer vision [2, 3, 7, 24, 25]. [sent-50, score-0.326]
19 In [2], the authors use an AND/OR graph to infer composite cloth templates. [sent-51, score-0.304]
20 In [7], the AND/OR graph is used as a storyline model that encodes storyline variation in videos. [sent-53, score-0.364]
21 Given a kernel function Ki(·, ·) that defines a measure of similarity between a pair of instances using feature type i, we can compute the kernel function for all pairs of training instances to obtain a training kernel matrix for each feature: K = {K1, K2, ..., Km}. [sent-58, score-0.577]
22 Our goal is to devise a method to find a combination of these kernel matrices that can perform effective recognition for a particular event class. [sent-64, score-0.6]
23 Because we associate features with kernel matrices, the problem of kernel combination translates naturally to feature combination. [sent-65, score-0.587]
24 We introduce a method that is selective of these kernel matrices, and simultaneously considers different sets of them independently. [sent-66, score-0.395]
25 Our method uses an AND/OR graph structure to represent the possible combinations, which we describe in detail below. [sent-67, score-0.355]
26 AND/OR model The AND/OR graph structure is represented by a graph G = (V, E), where V and E denote the set of vertices and edges. [sent-70, score-0.562]
27 The edge set E consists of vertical edges that define the topological structure of the graph, connecting nodes between adjacent layers. [sent-72, score-0.3]
28 We define TVi to be the child nodes of node Vi ∈ V . [sent-74, score-0.532]
29 We define Vroot ∈ V to be the root node of the tree. [sent-77, score-0.346]
30 Each node in the AND/OR graph is a variable that encodes a kernel matrix. [sent-79, score-0.703]
31 The LEAF nodes encode the base kernel matrices {K1, K2, ..., Km} from our original features at the lowest layer of the graph. [sent-80, score-0.525]
32 The root node encodes the final kernel matrix to be used for recognition at the highest layer of the graph. [sent-83, score-0.339]
33 Because each LEAF node is just equal to a kernel matrix for one of our original features, the number of LEAF nodes is equal to m. [sent-84, score-0.677]
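To make this representation concrete, the following is a minimal Python sketch (our own illustration, not the authors' code) of the three node types; the class name and its fields are hypothetical choices.

import numpy as np

class Node:
    """One vertex of the AND/OR graph; node_type is 'AND', 'OR', or 'LEAF'."""
    def __init__(self, node_type, children=None, base_kernel=None):
        self.node_type = node_type
        self.children = children or []   # T_Vi: the child nodes of this node
        self.base_kernel = base_kernel   # only LEAF nodes carry one of K1..Km
        self.kernel = base_kernel        # v_i: the kernel matrix currently assigned
        self.choice = None               # for OR nodes: index of the selected child

# Toy example: three base kernels over four training videos.
rng = np.random.RandomState(0)
Ks = [(K + K.T) / 2 for K in (rng.rand(4, 4) for _ in range(3))]
leaves = [Node('LEAF', base_kernel=K) for K in Ks]
root = Node('AND', children=[Node('OR', children=leaves[:2]), leaves[2]])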
34 In our model, there are three types of potentials that define the energy of a particular assignment of kernel matrices v = {v1, v2, ...}. [sent-85, score-0.394]
35 The first potential captures the behavior of an AND node in the graph, forcing the node to average the kernels of its children (Equation 1). [sent-90, score-0.331]
36 The second potential captures the behavior of an OR node in the graph, forcing it to select a single kernel from its children (Equation 2). [sent-95, score-0.552]
37 The third potential captures the strength of the root node Vroot in the graph: ψROOT(Vroot = vroot) = S(vroot) (3), where S(u) is a scoring function that uses the kernel defined at node u to compute the cross-validated average precision on the training data using an SVM. [sent-100, score-0.813]
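Since S(u) is described as the cross-validated average precision of an SVM trained on the kernel at node u, one plausible way to compute such a score is sketched below with scikit-learn's precomputed-kernel SVC; the fold count, random seed, and default C are assumptions rather than details from the paper.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

def score_kernel(K, y, C=1.0, n_splits=3):
    """Cross-validated average precision for an n x n training kernel matrix K."""
    aps = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr, te in folds.split(K, y):
        svm = SVC(kernel='precomputed', C=C)
        svm.fit(K[np.ix_(tr, tr)], y[tr])                    # train-vs-train block
        scores = svm.decision_function(K[np.ix_(te, tr)])    # held-out rows vs. train columns
        aps.append(average_precision_score(y[te], scores))
    return float(np.mean(aps))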
38 The root node Vroot also appears in ψAND or ψOR, depending on its node type. [sent-101, score-0.576]
39 In the bottom-up processing stage, we construct an initial configuration by assigning kernel matrices to each node based only on its child nodes. [sent-105, score-0.783]
40 Combining the potentials, we can define the energy of a particular assignment v of kernel matrices to nodes as: E(v) = Σ_{Vi ∈ VAND} ψAND(Vi = vi, TVi) + Σ_{Vi ∈ VOR} ψOR(Vi = vi, TVi) − ψROOT(Vroot = vroot) (4). [sent-107, score-0.576]
41 Intuitively, if Vi is an AND node in the graph, then it averages the kernels of its children TVi. [sent-111, score-0.483]
42 If Vi is an OR node in the graph, then it selects a kernel amongst its children TVi . [sent-112, score-0.619]
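The deterministic AND and OR behaviors can be read directly off a configuration; reusing the hypothetical Node sketch from above, one way to compute the kernel a node encodes is:

def node_kernel(node):
    """Kernel v_i encoded at a node under the current configuration: AND nodes
    average their children's kernels, OR nodes pass through the selected child,
    and LEAF nodes return their base kernel matrix."""
    if node.node_type == 'LEAF':
        return node.base_kernel
    child_kernels = [node_kernel(c) for c in node.children]
    if node.node_type == 'AND':
        return sum(child_kernels) / len(child_kernels)
    return child_kernels[node.choice]   # OR node: node.choice is set during inference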
43 Since the behavior of the AND nodes is deterministic, the number of possible configurations is only dependent on the number of OR nodes in the graph. [sent-114, score-0.515]
44 Note that the space of possible assignments v is not actually the space of all kernel matrices, as nodes in the graph are restricted to combinations of the kernel matrices in the LEAF nodes. [sent-115, score-1.14]
45 Thus, a configuration can be seen as a parse of the graph (blue edges in Figure 2), where we can trace the kernels combined for each node down to the LEAF nodes. [sent-116, score-0.726]
46 After initializing an AND/OR graph structure, we propose a set of potential moves in the space of possible structures. [sent-119, score-0.455]
47 We define the size of a node |Vi| as the number of LEAF nodes that are combined into the kernel for node Vi. [sent-123, score-0.482]
48 Inference The inference problem seeks to find the assignment of kernel matrices v = {v1, v2, ...} that minimizes the energy in Equation 4. [sent-125, score-0.439]
49 Since the behavior of the AND nodes is deterministic, our goal in inference is to choose the children that the OR nodes select. [sent-129, score-0.824]
50 However, this is difficult because ψROOT computes a score based on the kernel at the root node, which couples the decisions of all nodes. [sent-130, score-0.422]
51 Thus, the decisions for the OR nodes cannot be made locally as they could affect the kernel combination at the root node in different ways. [sent-131, score-0.89]
52 In order to perform efficient inference, we propose an approach inspired by [2, 3] that combines a bottom-up processing stage that proposes configurations for subtrees with a top-down refinement stage that considers a global set of moves over the entire graph. [sent-132, score-0.325]
53 We start from the nodes in the lowest layer and work our way up to the root node. [sent-134, score-0.377]
54 For each OR node Vi ∈ VOR, we assume that the best kernel assignment vi is the child u ∈ TVi that achieves the best score: vi = argmax_{u ∈ TVi} S(u) (5). With this approximation, we can compute the kernel assignments for the OR nodes locally from their children. [sent-135, score-1.168]
55 Using this approximation, we can build our configuration from the bottom-up to obtain a kernel assignment for the entire AND/OR graph. [sent-137, score-0.405]
56 To limit the space of possible refinements, we only consider changing children of OR nodes for which the local estimates from Equation 5 were close in score. [sent-141, score-0.36]
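A sketch of this bottom-up stage, reusing the hypothetical Node class and the score_kernel sketch above; the top-down refinement that revisits closely scored OR decisions is only indicated in a comment.

import numpy as np

def bottom_up_assign(node, y, score_fn):
    """Build an initial configuration: OR nodes greedily pick the child whose
    kernel scores best locally (Equation 5); AND nodes average their children.
    Returns the kernel assigned to this node."""
    if node.node_type == 'LEAF':
        node.kernel = node.base_kernel
        return node.kernel
    child_kernels = [bottom_up_assign(c, y, score_fn) for c in node.children]
    if node.node_type == 'AND':
        node.kernel = sum(child_kernels) / len(child_kernels)
    else:  # OR node: local argmax over the children's scores
        child_scores = [score_fn(K, y) for K in child_kernels]
        node.choice = int(np.argmax(child_scores))
        node.kernel = child_kernels[node.choice]
        # A top-down refinement pass would re-examine OR nodes whose child_scores
        # are nearly tied, since ψROOT couples these decisions through the root kernel.
    return node.kernel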
57 Structure Learning Our goal in structure learning is to find the best AND/OR graph structure and configuration for a particular class, a difficult problem because of the large space of possible graph structures. [sent-143, score-0.798]
58 We use a greedy hill-climbing approach to structure learning, and start by initializing our AND/OR graph structure using a random initialization. [sent-144, score-0.463]
59 To help constrain the space of possible graph structures, we constrain each node to have at most λchild children and λparent parents. [sent-145, score-0.737]
60 By constraining the number of children a node can have, we help regularize our graph structures so that we select the most important kernels. [sent-146, score-0.736]
61 By constraining the number of parents a node can have, we prevent kernels from appearing in large numbers of nodes in the graph, allowing our structure to consider different subsets of kernels. [sent-147, score-0.726]
62 After initializing our graph structure G, we select a random node Vi in the graph and consider the following set of moves: • Add operation. [sent-148, score-0.839]
63 • Remove operation. We remove a child node from Vi, which corresponds to removing a node from TVi. [sent-177, score-0.322]
64 • Swap operation. We swap one of the child nodes from Vi with one of the child nodes of another node Vj. [sent-181, score-0.342]
65 This corresponds to swapping a node from TVi for a node in TVj . [sent-183, score-0.46]
66 Considering each of these moves provides us with a set of potential graph structures {G1, G2, ...}. [sent-184, score-0.455]
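The three moves can be enumerated locally on the child set of the chosen node Vi. The sketch below is a simplified, self-contained illustration (the function name and default limit are our own); the λparent constraint would additionally have to be checked on the full graph.

def candidate_child_sets(current_children, available_nodes, lam_child=3):
    """Single-move neighbours of one node's child list: add an outside node
    (if under the lam_child limit), remove an existing child, or swap an
    existing child for an outside node."""
    outside = [n for n in available_nodes if n not in current_children]
    moves = []
    if len(current_children) < lam_child:                           # Add operation
        moves += [current_children + [n] for n in outside]
    if len(current_children) > 1:                                   # Remove operation
        moves += [[c for c in current_children if c is not x] for x in current_children]
    for x in current_children:                                      # Swap operation
        moves += [[n if c is x else c for c in current_children] for n in outside]
    return moves

# e.g. candidate_child_sets(['K1', 'K2'], ['K1', 'K2', 'K3', 'K4'])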
67 Then, we compute the structure score Struct(Gi) using the following equation: Struct(Gi) = S(Gi (Vroot)) − λstruct |Gi (Vroot) | (6) where Gi (Vroot) corresponds to the root node of the potential graph structure Gi. [sent-190, score-0.855]
68 This score combines the score of the root node with a regularization on the number of LEAF nodes selected by the root node, which prevents overly complex combinations. [sent-191, score-0.903]
69 Any subtree that remains unchanged by the graph moves does not need to be re-computed, as the optimal bottom-up configuration will remain the same. [sent-198, score-0.441]
70 In practice, we use a hash table to keep track of the scores for all leaf node combinations that have been computed. [sent-199, score-0.507]
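Because every node's kernel is a convex combination of the LEAF kernels, the combination weights make a natural cache key. The sketch below is one plausible realization of such a hash table; the key format, rounding precision, and λstruct value are assumptions, not details from the paper.

score_cache = {}   # maps a leaf-weight signature to its cross-validated score

def cached_score(leaf_weights, K_leaves, y, score_fn):
    """leaf_weights: dict {leaf_index: weight} describing the convex combination
    of base kernels encoded at a node (e.g. the root of a candidate structure).
    The score is computed once per distinct combination and then reused."""
    key = tuple(sorted((i, round(w, 6)) for i, w in leaf_weights.items()))
    if key not in score_cache:
        K = sum(w * K_leaves[i] for i, w in leaf_weights.items())
        score_cache[key] = score_fn(K, y)
    return score_cache[key]

def struct_score(leaf_weights, K_leaves, y, score_fn, lam_struct=0.05):
    """Equation 6: root score minus a penalty on the number of LEAF nodes used."""
    return cached_score(leaf_weights, K_leaves, y, score_fn) - lam_struct * len(leaf_weights)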
71 There are approximately 150 training videos for each event, and in the two testing sets for DEV-T and DEV-O, we are given large databases of videos that consist of both the events in the set as well as null videos that correspond to no event. [sent-203, score-0.33]
72 Average Precision (AP) values for datasets using graph structures with different numbers of layers. [sent-216, score-0.345]
73 For all features, we used the Histogram Intersection Kernel for our kernel matrices, normalized using spherical normalization [11], as this kernel provided us with the best individual feature results. [sent-220, score-0.515]
74 For all methods that define a combination of kernels, we train an SVM over the kernel combination, and cross-validate to determine the C parameter. [sent-221, score-0.303]
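A sketch of this kernel construction, assuming that "spherical normalization" denotes the usual K(x, y) / sqrt(K(x, x) K(y, y)) normalization; the toy features and the fixed C are placeholders, and in practice C would be chosen by cross-validation as described above.

import numpy as np
from sklearn.svm import SVC

def histogram_intersection_kernel(X, Y):
    """K[i, j] = sum_d min(X[i, d], Y[j, d]) for histogram features X (n x d), Y (m x d)."""
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

def spherical_normalize(K, self_sim_rows=None, self_sim_cols=None):
    """K'(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y)); for a square training kernel
    the self-similarities are simply the diagonal of K."""
    if self_sim_rows is None:
        self_sim_rows = np.diag(K)
    if self_sim_cols is None:
        self_sim_cols = np.diag(K)
    return K / np.sqrt(np.outer(self_sim_rows, self_sim_cols))

# Toy usage: histogram features for six videos, one binary event label per video.
X = np.random.RandomState(0).rand(6, 10)
y = np.array([1, 1, 1, 0, 0, 0])
K = spherical_normalize(histogram_intersection_kernel(X, X))
svm = SVC(kernel='precomputed', C=1.0).fit(K, y)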
75 To constrain our search space, we considered 5-layer AND/OR graphs (see Figure 8), with alternating layers of AND nodes and OR nodes. [sent-230, score-0.322]
76 To help alleviate the problem of local optima in our structure search procedure, we considered multiple random initializations, and selected the graph structure whose configuration provided us with the lowest energy. [sent-231, score-0.611]
77 This method iteratively selects the best-performing individual feature through cross-validation, and combines this feature with all previously selected features using kernel averaging. [sent-235, score-0.422]
78 Visualizations of the feature combinations learned by various methods for each of the complex event classes. [sent-241, score-0.42]
79 Visualizations of AND/OR graph structures that are learned by our method for the “Wedding ceremony” and “Changing a vehicle tire” classes. [sent-246, score-0.312]
80 [19] This method uses temporal structure for complex event recognition with only HOG3D features. [sent-254, score-0.356]
81 Although kernel averaging is a special instance of our method where an AND node combines all LEAF nodes, our method sometimes performs worse than averaging. [sent-262, score-0.505]
82 This is because we place several forms of regularization on our model, including the λchild and λparent parameters, so that our method prefers simpler kernel combinations, and this constrains the space of AND/OR graphs we must search over. [sent-263, score-0.313]
83 However, it is possible to search an even larger space of AND/OR graph structures that includes kernel averaging, and that would help improve performance further. [sent-264, score-0.619]
84 Convergence would likely be much slower if we considered more complicated graph structures or additional types of moves. [sent-270, score-0.369]
85 Note that the performance of the initial graph structures is decent, as we perform inference on these structures to obtain their optimal configurations. [sent-273, score-0.461]
86 In Table 3, we also show the performance of graph structures with different numbers of layers. [sent-274, score-0.345]
87 However, because our method considers features independently in a hierarchical setting, it allows us to discover complementary features otherwise missed by MKL L1. [sent-281, score-0.344]
88 We visualize two of the learned graph structures and configurations in Figure 8. [sent-282, score-0.348]
89 The "Changing a vehicle tire" graph visualization shows how our method prefers SIFT image features for this class, possibly because the presence of tires is very indicative. [sent-284, score-0.343]
90 Note that our method is also able to do implicit kernel weighting, as seen in the graph visualization for “Wedding ceremony”, where the HOG3D feature is deemed important and combined twice. [sent-285, score-0.556]
91 Conclusion In conclusion, we have presented a method for combining features that incorporates our intuitions for how features should be combined. [sent-287, score-0.311]
92 Our method uses an AND/OR graph to represent possible feature combinations, and automatically learns the structure of the graph. [sent-288, score-0.396]
93 Using the AND/OR graph structure, our feature combination method is able to be selective of features, consider different subsets of features in a hierarchical manner, and achieve convincing results on the 2011TRECVID MED dataset [17]. [sent-289, score-0.62]
94 Designing efficient methods to utilize additional layers and nodes with non-linear behavior could be a possible direction. [sent-292, score-0.3]
95 In addition, it would be interesting to draw connections between our method and objectives that are optimized by kernel combination techniques such as MKL. [sent-293, score-0.303]
96 Exploring large feature spaces with hierarchical multiple kernel learning. [sent-302, score-0.334]
97 Rapid inference on a novel and/or graph for object detection, segmentation and parsing. [sent-319, score-0.309]
98 Recognizing complex events using large margin joint low-level event model. [sent-355, score-0.349]
99 Evaluation of low-level features and their combinations for complex event detection in open source videos. [sent-424, score-0.451]
100 Learning latent temporal structure for complex event detection. [sent-430, score-0.356]
wordName wordTfidf (topN-words)
[('vroot', 0.331), ('tvi', 0.271), ('kernel', 0.237), ('graph', 0.236), ('node', 0.23), ('event', 0.219), ('mkl', 0.218), ('nodes', 0.21), ('vi', 0.203), ('leaf', 0.164), ('children', 0.121), ('configuration', 0.117), ('root', 0.116), ('combinations', 0.113), ('intuitions', 0.111), ('kernels', 0.101), ('struct', 0.099), ('landing', 0.096), ('vor', 0.093), ('child', 0.092), ('structure', 0.09), ('moves', 0.088), ('ceremony', 0.085), ('events', 0.083), ('matrices', 0.078), ('wedding', 0.077), ('structures', 0.076), ('vand', 0.074), ('inference', 0.073), ('med', 0.073), ('features', 0.072), ('fish', 0.071), ('videos', 0.071), ('selective', 0.07), ('combination', 0.066), ('trecvid', 0.066), ('storyline', 0.064), ('classchancetang', 0.06), ('greedyaveragemkl', 0.06), ('complementary', 0.06), ('refinement', 0.057), ('hierarchical', 0.056), ('combining', 0.056), ('gi', 0.056), ('potential', 0.055), ('considers', 0.054), ('bangpeng', 0.053), ('restrictions', 0.052), ('assignment', 0.051), ('layer', 0.051), ('convincing', 0.049), ('initializing', 0.047), ('complex', 0.047), ('stage', 0.045), ('parent', 0.045), ('isa', 0.044), ('permitted', 0.044), ('combined', 0.042), ('ap', 0.042), ('help', 0.041), ('graphs', 0.041), ('feature', 0.041), ('constrain', 0.04), ('tire', 0.04), ('swap', 0.04), ('cloth', 0.04), ('score', 0.038), ('variants', 0.038), ('averaging', 0.038), ('visualizations', 0.037), ('optima', 0.037), ('configurations', 0.036), ('video', 0.036), ('prefers', 0.035), ('sets', 0.034), ('feeding', 0.034), ('numbers', 0.033), ('class', 0.032), ('varma', 0.032), ('km', 0.032), ('constraining', 0.032), ('layers', 0.031), ('instances', 0.031), ('decisions', 0.031), ('selects', 0.031), ('averages', 0.031), ('behavior', 0.03), ('combine', 0.03), ('discover', 0.03), ('oa', 0.03), ('subsets', 0.03), ('vj', 0.029), ('possible', 0.029), ('animal', 0.029), ('recognize', 0.029), ('tang', 0.029), ('complicated', 0.029), ('trying', 0.028), ('types', 0.028), ('composite', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 81 iccv-2013-Combining the Right Features for Complex Event Recognition
Author: Kevin Tang, Bangpeng Yao, Li Fei-Fei, Daphne Koller
Abstract: In this paper, we tackle the problem of combining features extracted from video for complex event recognition. Feature combination is an especially relevant task in video data, as there are many features we can extract, ranging from image features computed from individual frames to video features that take temporal information into account. To combine features effectively, we propose a method that is able to be selective of different subsets of features, as some features or feature combinations may be uninformative for certain classes. We introduce a hierarchical method for combining features based on the AND/OR graph structure, where nodes in the graph represent combinations of different sets of features. Our method automatically learns the structure of the AND/OR graph using score-based structure learning, and we introduce an inference procedure that is able to efficiently compute structure scores. We present promising results and analysis on the difficult and large-scale 2011 TRECVID Multimedia Event Detection dataset [17].
Author: Arash Vahdat, Kevin Cannons, Greg Mori, Sangmin Oh, Ilseo Kim
Abstract: We present a compositional model for video event detection. A video is modeled using a collection of both global and segment-level features and kernel functions are employed for similarity comparisons. The locations of salient, discriminative video segments are treated as a latent variable, allowing the model to explicitly ignore portions of the video that are unimportant for classification. A novel, multiple kernel learning (MKL) latent support vector machine (SVM) is defined, that is used to combine and re-weight multiple feature types in a principled fashion while simultaneously operating within the latent variable framework. The compositional nature of the proposed model allows it to respond directly to the challenges of temporal clutter and intra-class variation, which are prevalent in unconstrained internet videos. Experimental results on the TRECVID Multimedia Event Detection 2011 (MED11) dataset demonstrate the efficacy of the method.
3 0.23302142 238 iccv-2013-Learning Graphs to Match
Author: Minsu Cho, Karteek Alahari, Jean Ponce
Abstract: Many tasks in computer vision are formulated as graph matching problems. Despite the NP-hard nature of the problem, fast and accurate approximations have led to significant progress in a wide range of applications. Learning graph models from observed data, however, still remains a challenging issue. This paper presents an effective scheme to parameterize a graph model, and learn its structural attributes for visual object matching. For this, we propose a graph representation with histogram-based attributes, and optimize them to increase the matching accuracy. Experimental evaluations on synthetic and real image datasets demonstrate the effectiveness of our approach, and show significant improvement in matching accuracy over graphs with pre-defined structures.
4 0.21349263 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable result. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extendAllen ’s interval algebra network (IAN) [2]from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
5 0.20309281 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
Author: Ling Wang, Hichem Sahbi
Abstract: One of the trends of action recognition consists in extracting and comparing mid-level features which encode visual and motion aspects of objects into scenes. However, when scenes contain high-level semantic actions with many interacting parts, these mid-level features are not sufficient to capture high level structures as well as high order causal relationships between moving objects resulting into a clear drop in performances. In this paper, we address this issue and we propose an alternative action recognition method based on a novel graph kernel. In the main contributions of this work, we first describe actions in videos using directed acyclic graphs (DAGs), that naturally encode pairwise interactions between moving object parts, and then we compare these DAGs by analyzing the spectrum of their sub-patterns that capture complex higher order interactions. This extraction and comparison process is computationally tractable, re- sulting from the acyclic property of DAGs, and it also defines a positive semi-definite kernel. When plugging the latter into support vector machines, we obtain an action recognition algorithm that overtakes related work, including graph-based methods, on a standard evaluation dataset.
6 0.17957173 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
7 0.17815834 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
8 0.17155431 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
9 0.1705551 10 iccv-2013-A Framework for Shape Analysis via Hilbert Space Embedding
10 0.17002298 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
11 0.1678374 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
12 0.16296339 214 iccv-2013-Improving Graph Matching via Density Maximization
13 0.16204365 237 iccv-2013-Learning Graph Matching: Oriented to Category Modeling from Cluttered Scenes
14 0.1575563 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
15 0.14060436 165 iccv-2013-Find the Best Path: An Efficient and Accurate Classifier for Image Hierarchies
16 0.13953863 120 iccv-2013-Discriminative Label Propagation for Multi-object Tracking with Sporadic Appearance Features
17 0.13715139 295 iccv-2013-On One-Shot Similarity Kernels: Explicit Feature Maps and Properties
18 0.13392963 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
19 0.13347279 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
20 0.13115035 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
topicId topicWeight
[(0, 0.245), (1, 0.132), (2, 0.022), (3, 0.042), (4, 0.088), (5, 0.102), (6, 0.02), (7, 0.003), (8, 0.03), (9, -0.197), (10, -0.217), (11, -0.147), (12, -0.004), (13, 0.166), (14, -0.046), (15, 0.126), (16, 0.092), (17, -0.059), (18, 0.01), (19, 0.066), (20, -0.019), (21, 0.07), (22, 0.026), (23, 0.092), (24, 0.081), (25, -0.017), (26, -0.121), (27, 0.067), (28, -0.005), (29, 0.022), (30, 0.098), (31, 0.044), (32, -0.038), (33, -0.084), (34, -0.019), (35, 0.09), (36, 0.005), (37, -0.012), (38, -0.046), (39, 0.02), (40, -0.039), (41, -0.104), (42, -0.006), (43, 0.016), (44, 0.061), (45, 0.04), (46, -0.009), (47, -0.053), (48, -0.029), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.96424931 81 iccv-2013-Combining the Right Features for Complex Event Recognition
Author: Kevin Tang, Bangpeng Yao, Li Fei-Fei, Daphne Koller
Abstract: In this paper, we tackle the problem of combining features extracted from video for complex event recognition. Feature combination is an especially relevant task in video data, as there are many features we can extract, ranging from image features computed from individual frames to video features that take temporal information into account. To combine features effectively, we propose a method that is able to be selective of different subsets of features, as some features or feature combinations may be uninformative for certain classes. We introduce a hierarchical method for combining features based on the AND/OR graph structure, where nodes in the graph represent combinations of different sets of features. Our method automatically learns the structure of the AND/OR graph using score-based structure learning, and we introduce an inference procedure that is able to efficiently compute structure scores. We present promising results and analysis on the difficult and large-scale 2011 TRECVID Multimedia Event Detection dataset [17].
2 0.65652812 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable result. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extendAllen ’s interval algebra network (IAN) [2]from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
3 0.64475 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
Author: Zhongwen Xu, Yi Yang, Ivor Tsang, Nicu Sebe, Alexander G. Hauptmann
Abstract: Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT) to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on HMDB [8] action recognition dataset and CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
4 0.64254421 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
Author: Christoph Straehle, Ullrich Koethe, Fred A. Hamprecht
Abstract: We propose a scheme that allows to partition an image into a previously unknown number of segments, using only minimal supervision in terms of a few must-link and cannotlink annotations. We make no use of regional data terms, learning instead what constitutes a likely boundary between segments. Since boundaries are only implicitly specified through cannot-link constraints, this is a hard and nonconvex latent variable problem. We address this problem in a greedy fashion using a randomized decision tree on features associated with interpixel edges. We use a structured purity criterion during tree construction and also show how a backtracking strategy can be used to prevent the greedy search from ending up in poor local optima. The proposed strategy is compared with prior art on natural images.
5 0.64124036 238 iccv-2013-Learning Graphs to Match
Author: Minsu Cho, Karteek Alahari, Jean Ponce
Abstract: Many tasks in computer vision are formulated as graph matching problems. Despite the NP-hard nature of the problem, fast and accurate approximations have led to significant progress in a wide range of applications. Learning graph models from observed data, however, still remains a challenging issue. This paper presents an effective scheme to parameterize a graph model, and learn its structural attributes for visual object matching. For this, we propose a graph representation with histogram-based attributes, and optimize them to increase the matching accuracy. Experimental evaluations on synthetic and real image datasets demonstrate the effectiveness of our approach, and show significant improvement in matching accuracy over graphs with pre-defined structures.
6 0.6406222 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
7 0.63559449 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
8 0.63184804 214 iccv-2013-Improving Graph Matching via Density Maximization
9 0.62409228 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
10 0.60986626 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
11 0.6056999 224 iccv-2013-Joint Optimization for Consistent Multiple Graph Matching
12 0.60153282 237 iccv-2013-Learning Graph Matching: Oriented to Category Modeling from Cluttered Scenes
13 0.60144967 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
14 0.58777642 290 iccv-2013-New Graph Structured Sparsity Model for Multi-label Image Annotations
15 0.58522445 120 iccv-2013-Discriminative Label Propagation for Multi-object Tracking with Sporadic Appearance Features
16 0.57911146 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
17 0.57528472 11 iccv-2013-A Fully Hierarchical Approach for Finding Correspondences in Non-rigid Shapes
18 0.54316145 117 iccv-2013-Discovering Details and Scene Structure with Hierarchical Iconoid Shift
19 0.53978556 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
20 0.53936136 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
topicId topicWeight
[(2, 0.053), (4, 0.021), (26, 0.044), (31, 0.035), (42, 0.054), (64, 0.032), (73, 0.018), (89, 0.657)]
simIndex simValue paperId paperTitle
same-paper 1 0.99798054 81 iccv-2013-Combining the Right Features for Complex Event Recognition
Author: Kevin Tang, Bangpeng Yao, Li Fei-Fei, Daphne Koller
Abstract: In this paper, we tackle the problem of combining features extracted from video for complex event recognition. Feature combination is an especially relevant task in video data, as there are many features we can extract, ranging from image features computed from individual frames to video features that take temporal information into account. To combine features effectively, we propose a method that is able to be selective of different subsets of features, as some features or feature combinations may be uninformative for certain classes. We introduce a hierarchical method for combining features based on the AND/OR graph structure, where nodes in the graph represent combinations of different sets of features. Our method automatically learns the structure of the AND/OR graph using score-based structure learning, and we introduce an inference procedure that is able to efficiently compute structure scores. We present promising results and analysis on the difficult and large-scale 2011 TRECVID Multimedia Event Detection dataset [17].
2 0.99693161 39 iccv-2013-Action Recognition with Improved Trajectories
Author: Heng Wang, Cordelia Schmid
Abstract: Recently dense trajectories were shown to be an efficient video representation for action recognition and achieved state-of-the-art results on a variety of datasets. This paper improves their performance by taking into account camera motion to correct them. To estimate camera motion, we match feature points between frames using SURF descriptors and dense optical flow, which are shown to be complementary. These matches are, then, used to robustly estimate a homography with RANSAC. Human motion is in general different from camera motion and generates inconsistent matches. To improve the estimation, a human detector is employed to remove these matches. Given the estimated camera motion, we remove trajectories consistent with it. We also use this estimation to cancel out camera motion from the optical flow. This significantly improves motion-based descriptors, such as HOF and MBH. Experimental results onfour challenging action datasets (i.e., Hollywood2, HMDB51, Olympic Sports and UCF50) significantly outperform the current state of the art.
3 0.99656713 139 iccv-2013-Elastic Fragments for Dense Scene Reconstruction
Author: Qian-Yi Zhou, Stephen Miller, Vladlen Koltun
Abstract: We present an approach to reconstruction of detailed scene geometry from range video. Range data produced by commodity handheld cameras suffers from high-frequency errors and low-frequency distortion. Our approach deals with both sources of error by reconstructing locally smooth scene fragments and letting these fragments deform in order to align to each other. We develop a volumetric registration formulation that leverages the smoothness of the deformation to make optimization practical for large scenes. Experimental results demonstrate that our approach substantially increases the fidelity of complex scene geometry reconstructed with commodity handheld cameras.
4 0.99361038 216 iccv-2013-Inferring "Dark Matter" and "Dark Energy" from Videos
Author: Dan Xie, Sinisa Todorovic, Song-Chun Zhu
Abstract: This paper presents an approach to localizing functional objects in surveillance videos without domain knowledge about semantic object classes that may appear in the scene. Functional objects do not have discriminative appearance and shape, but they affect behavior of people in the scene. For example, they “attract” people to approach them for satisfying certain needs (e.g., vending machines could quench thirst), or “repel” people to avoid them (e.g., grass lawns). Therefore, functional objects can be viewed as “dark matter”, emanating “dark energy ” that affects people ’s trajectories in the video. To detect “dark matter” and infer their “dark energy ” field, we extend the Lagrangian mechanics. People are treated as particle-agents with latent intents to approach “dark matter” and thus satisfy their needs, where their motions are subject to a composite “dark energy ” field of all functional objects in the scene. We make the assumption that people take globally optimal paths toward the intended “dark matter” while avoiding latent obstacles. A Bayesian framework is used to probabilistically model: people ’s trajectories and intents, constraint map of the scene, and locations of functional objects. A data-driven Markov Chain Monte Carlo (MCMC) process is used for inference. Our evaluation on videos of public squares and courtyards demonstrates our effectiveness in localizing functional objects and predicting people ’s trajectories in unobserved parts of the video footage.
5 0.99175781 103 iccv-2013-Deblurring by Example Using Dense Correspondence
Author: Yoav Hacohen, Eli Shechtman, Dani Lischinski
Abstract: This paper presents a new method for deblurring photos using a sharp reference example that contains some shared content with the blurry photo. Most previous deblurring methods that exploit information from other photos require an accurately registered photo of the same static scene. In contrast, our method aims to exploit reference images where the shared content may have undergone substantial photometric and non-rigid geometric transformations, as these are the kind of reference images most likely to be found in personal photo albums. Our approach builds upon a recent method for examplebased deblurring using non-rigid dense correspondence (NRDC) [11] and extends it in two ways. First, we suggest exploiting information from the reference image not only for blur kernel estimation, but also as a powerful local prior for the non-blind deconvolution step. Second, we introduce a simple yet robust technique for spatially varying blur estimation, rather than assuming spatially uniform blur. Unlike the aboveprevious method, which hasproven successful only with simple deblurring scenarios, we demonstrate that our method succeeds on a variety of real-world examples. We provide quantitative and qualitative evaluation of our method and show that it outperforms the state-of-the-art.
6 0.98774207 56 iccv-2013-Automatic Registration of RGB-D Scans via Salient Directions
7 0.98758239 302 iccv-2013-Optimization Problems for Fast AAM Fitting in-the-Wild
8 0.98704463 337 iccv-2013-Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search
9 0.98091316 2 iccv-2013-3D Scene Understanding by Voxel-CRF
10 0.97531253 390 iccv-2013-Shufflets: Shared Mid-level Parts for Fast Object Detection
11 0.97089332 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects
12 0.95702511 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
13 0.95621663 129 iccv-2013-Dynamic Scene Deblurring
14 0.95514506 228 iccv-2013-Large-Scale Multi-resolution Surface Reconstruction from RGB-D Sequences
15 0.95074075 317 iccv-2013-Piecewise Rigid Scene Flow
16 0.95060432 343 iccv-2013-Real-World Normal Map Capture for Nearly Flat Reflective Surfaces
17 0.94956988 9 iccv-2013-A Flexible Scene Representation for 3D Reconstruction Using an RGB-D Camera
18 0.94727802 370 iccv-2013-Saliency Detection in Large Point Sets
19 0.94683188 42 iccv-2013-Active MAP Inference in CRFs for Efficient Semantic Segmentation
20 0.94652361 105 iccv-2013-DeepFlow: Large Displacement Optical Flow with Deep Matching