nips nips2012 nips2012-306 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sung J. Hwang, Kristen Grauman, Fei Sha
Abstract: When learning features for complex visual recognition problems, labeled image exemplars alone can be insufficient. While an object taxonomy specifying the categories’ semantic relationships could bolster the learning process, not all relationships are relevant to a given visual classification task, nor does a single taxonomy capture all ties that are relevant. In light of these issues, we propose a discriminative feature learning approach that leverages multiple hierarchical taxonomies representing different semantic views of the object categories (e.g., for animal classes, one taxonomy could reflect their phylogenic ties, while another could reflect their habitats). For each taxonomy, we first learn a tree of semantic kernels, where each node has a Mahalanobis kernel optimized to distinguish between the classes in its children nodes. Then, using the resulting semantic kernel forest, we learn class-specific kernel combinations to select only those relationships relevant to recognize each object class. To learn the weights, we introduce a novel hierarchical regularization term that further exploits the taxonomies’ structure. We demonstrate our method on challenging object recognition datasets, and show that interleaving multiple taxonomic views yields significant accuracy improvements.
Reference: text
sentIndex sentText sentNum sentScore
1 While an object taxonomy specifying the categories’ semantic relationships could bolster the learning process, not all relationships are relevant to a given visual classification task, nor does a single taxonomy capture all ties that are relevant. [sent-7, score-1.304]
2 In light of these issues, we propose a discriminative feature learning approach that leverages multiple hierarchical taxonomies representing different semantic views of the object categories (e. [sent-8, score-1.114]
3 , for animal classes, one taxonomy could reflect their phylogenic ties, while another could reflect their habitats). [sent-10, score-0.421]
4 For each taxonomy, we first learn a tree of semantic kernels, where each node has a Mahalanobis kernel optimized to distinguish between the classes in its children nodes. [sent-11, score-0.668]
5 Then, using the resulting semantic kernel forest, we learn class-specific kernel combinations to select only those relationships relevant to recognize each object class. [sent-12, score-0.646]
6 As the basic “image features + labels + classifier” paradigm has reached a level of maturity, we believe it is time to reach beyond it towards models that incorporate richer semantic knowledge about the object categories themselves. [sent-18, score-0.424]
7 A hierarchical semantic taxonomy is a tree that groups classes together in its nodes according to some human-designed merging or splitting criterion. [sent-20, score-0.994]
8 For example, well-known taxonomies include WordNet, which groups words into sets of cognitive synonyms and their super-subordinate relations [3], and the phylogenetic tree of life, which groups biological species based on their physical or genetic properties. [sent-21, score-0.593]
9 Two fundamental issues, however, complicate the use of a semantic taxonomy for learning visual objects. [sent-25, score-0.8]
10 First, a given taxonomy may offer hints about visual relatedness, but its structure need not entirely align with useful splits for recognition. [sent-26, score-0.497]
11 ) Second, given the complexity of visual objects, it is highly unlikely that some single optimal semantic taxonomy exists to lend insight for recognition. [sent-29, score-0.8]
12 Rather than commit to a single taxonomy—which may or may not align well with discriminative visual features—we learn a tree of kernels for each taxonomy that captures the granularity-specific similarity at each node. [sent-31, score-0.787]
13 Then we show how to exploit the inter-taxonomic structure when learning a combination of these kernels from multiple taxonomies (i. [sent-32, score-0.646]
14 in reality objects can be organized along many semantic dimensions or “views”. [sent-35, score-0.303]
15 ) Motivated by these issues, we present a discriminative feature learning approach that leverages multiple taxonomies capturing different semantic views of the object categories. [sent-39, score-1.055]
16 Our key insight is that some combination of the semantic views will be most informative to distinguish a given visual category. [sent-40, score-0.488]
17 Continuing with the sketch in Figure 1, that might mean that the first taxonomy helps learn dog- and cat-like features, while the second taxonomy helps elucidate spots and pointy corner features, while the last reveals context cues such as proximity to humans or indoor scene features. [sent-41, score-0.876]
18 Thus, rather than commit to a single representation, we aim to inject pieces of the various taxonomies as needed. [sent-43, score-0.467]
19 Our method takes as input training images labeled according to their object category, as well as a series of taxonomies, each of which hierarchically partitions those same labels (object classes) by a different semantic view. [sent-45, score-0.386]
20 For each taxonomy, we first learn a tree of semantic kernels: each node in a tree has a Mahalanobis-based kernel optimized to distinguish between the classes in its children nodes. [sent-46, score-0.76]
21 The kernels in one tree isolate image features useful at a range of category granularities. [sent-47, score-0.336]
22 Then, using the resulting semantic kernel forest from all taxonomies, we apply a form of multiple kernel learning (MKL) to obtain class-specific kernel combinations, in order to select only those relationships relevant to recognize each object class. [sent-48, score-0.943]
23 Our main contribution is to simultaneously exploit multiple semantic taxonomies for visual feature learning. [sent-51, score-0.915]
24 Whereas past work focuses on building object hierarchies for scalable classification [12, 13] or using WordNet to gauge semantic distance [5, 6, 8, 9], we learn discriminative kernels that capitalize on the cues in diverse taxonomy views, leading to better recognition accuracy. [sent-52, score-1.087]
25 We demonstrate our approach with challenging images from the Animals with Attributes and ImageNet datasets [14, 7] together with taxonomies spanning cognitive synsets, visual attributes, behavior, and habitats. [sent-54, score-0.543]
26 Our results show that the taxonomies can indeed boost feature learning, letting us benefit from humans’ perceived distinctions as implicitly embedded in the trees. [sent-55, score-0.499]
27 2 2 Related Work Leveraging hierarchies for object recognition Most work in object recognition that leverages category hierarchy does so for the sake of efficient classification [15, 16, 12, 13, 17]. [sent-57, score-0.286]
28 Since taxonomies need not be ideal structures for this goal, recent work focuses on novel ways to optimize the tree structure itself [12, 13, 17], while others consider splits based on initial inter-class confusions [16]. [sent-59, score-0.559]
29 Whereas all such work exploits tree structures to improve efficiency (whether in classification or browsing), our goal is for externally defined semantic hierarchies to enhance recognition accuracy. [sent-61, score-0.474]
30 More related to our problem setting are techniques that exploit the inter-class relationships in a taxonomy [5, 6, 8, 9, 10, 11]. [sent-62, score-0.421]
31 One idea is to combine the decisions of classifiers along the semantic hierarchy [5, 4]. [sent-63, score-0.345]
32 Alternatively, the semantic “distance” between nodes can be used to penalize misclassifications more meaningfully [9], or to share labeled exemplars between similar classes [8]. [sent-64, score-0.422]
33 Classification with multiple semantic views Combining information from multiple “views” of data is a well-researched topic in the machine learning, multimedia, and computer vision communities. [sent-67, score-0.454]
34 Broadly speaking, our problem has a similar spirit to such settings, since we want to leverage multiple parallel taxonomies over the data; however, our goal to aggregate portions of the taxonomies during feature learning is quite distinct. [sent-74, score-1.003]
35 More specifically, while previous methods attempt to find a single structure to accommodate both views, we seek complementary information from the semantic views and assemble task-specific discriminative features. [sent-75, score-0.436]
36 Learning kernel combinations Multiple kernel learning (MKL) algorithms [30] have shown promise for image recognition (e. [sent-76, score-0.294]
37 Our approach also employs a form of MKL, but rather than pool kernels stemming from different low-level features or kernel hyperparameters, it pools kernels stemming from different semantic sources. [sent-79, score-0.755]
38 Furthermore, our addition of a novel regularizer exploits the hierarchical structure from which the kernels originate. [sent-80, score-0.29]
39 3 Approach We cast the problem of learning semantic features from multiple taxonomies as learning to combine kernels. [sent-81, score-0.845]
40 The base kernels capture features specific to individual taxonomies and granularities within those taxonomies, and they are combined discriminatively to improve classification, weighing each taxonomy and granularity only to the extent useful for the target classification task. [sent-82, score-1.123]
41 We describe the two main components of the approach in turn: learning the base kernels—which we call a semantic kernel forest (Sec. [sent-83, score-0.618]
42 In what follows, we assume that we are given a labeled dataset D = {(xi , yi )}N where (xi , yi ) n=1 stands for the ith instance (feature vector) and its class label is yi , as well as a set of tree-structured taxonomies {Tt }T . [sent-88, score-0.467]
43 We index those nodes with double subscripts tn, where t refers to the tth taxonomy and n to the nth node in that taxonomy. [sent-91, score-0.533]
44 1 Learning a semantic kernel forest Our first step is to learn a forest of base kernels. [sent-96, score-0.748]
45 Formally, for each taxonomy Tt , we learn a set of Gaussian kernels for the superclass at every internal node tn for which n ≥ C + 1. [sent-99, score-0.669]
46 The latter exploits the taxonomy semantics, based on the intuition that features used to distinguish more abstract classes (dog vs. [sent-108, score-0.581]
47 We call the collection F = {Ktn } of all base kernels the semantic kernel forest. [sent-122, score-0.63]
48 2 Learning class-specific kernels across taxonomies Base kernels in the semantic kernel forest are learned jointly within each taxonomy but independently across taxonomies. [sent-128, score-1.735]
49 To leverage multiple taxonomies and to capture different semantic views of the object categories, we next combine them discriminatively to improve classification. [sent-129, score-0.967]
50 Additionally, instead of combining all the base kernels in the forest F , we preselect a subset of them based on the taxonomy structure. [sent-131, score-0.748]
51 Specifically, from each taxonomy, we select base kernels that correspond to the nodes on the path from the root to the leaf node class. [sent-132, score-0.353]
52 For example, in the Biological taxonomy of Figure 1, for the category Dalmatian, this path includes the nodes (superclasses) canine and animal. [sent-133, score-0.511]
53 Thus, for class c, the linearly combined kernel is given by Fc (xi , xj ) = βctn Ktn (xi , xj ), t n∼c 4 (3) where n ∼ c indexes the nodes that are ancestors of c, which is a leaf node (recall that the first C nodes in every taxonomy are reserved for leaf class nodes). [sent-134, score-0.811]
54 1 Image datasets and taxonomies We consider two publicly available image collections: Animals with Attributes (AWA) [14] and ImageNet [7]1 . [sent-168, score-0.501]
55 To obtain multiple taxonomies per dataset, we use attribute labels and WordNet. [sent-181, score-0.547]
56 To form semantic taxonomies based on attributes, we first manually divide the attribute labels into subsets according to their mutual semantic relevance (e. [sent-186, score-1.116]
57 To extract a WordNet taxonomy, we find all nodes in WordNet that contain the object class names on their word lists, and then build a hierarchy by pruning nodes with only one child and resolving multiple parentship. [sent-191, score-0.282]
58 For the AWA-4 taxonomies, we simply generate all 3 possible 2-level binary trees, which, based on manual observation, yield taxonomies reflecting Biological, Appearance, and Habitat ties between the animals. [sent-194, score-0.467]
59 We stress that these taxonomies are created externally with human knowledge, and thus they inject perceived object relationships into the feature learning problem. [sent-196, score-0.582]
60 2) Raw feature kernel + MKL: MKL combination of multiple such RBF kernels constructed by varying γ, which is a traditional approach to generate base kernels (e. [sent-200, score-0.538]
61 For this baseline, we generate the same number N of base kernels as in the semantic kernel forest, with γ = σ , for σ = {21−m , . [sent-203, score-0.63]
62 3) Perturbed semantic d 2 kernel tree: a semantic kernel tree trained with taxonomies that have randomly swapped leaves. [sent-207, score-1.425]
63 68 Raw feature kernel Raw feature kernel + MKL Perturbed semantic kernel tree + MKL-H Perturbed semantic kernel forest + MKL-H Semantic kernel tree + Avg Semantic kernel tree + MKL Semantic kernel tree + MKL-H Semantic kernel forest + MKL Semantic kernel forest + MKL-H AWA-10 30. [sent-222, score-2.598]
64 (The perturbed semantic kernel tree baseline is not applicable for AWA-4, since all possible groupings are present in the taxonomies. [sent-259, score-0.603]
65 coaster bonsai pooltable policevan drum sunflower button lamp basketball comb featherboa bridge bathtub strawberry rat panda leopard pig chimp raccoon P. [sent-270, score-0.371]
66 whale ferriswheel −10 −4 Figure 3: Per-class accuracy improvements of each individual taxonomy and the semantic kernel forest (“All”) over the raw feature kernel baseline. [sent-272, score-1.251]
67 The first two baselines will show the accuracy attainable using the same image features and basic classification tools (SVM, MKL) as our approach, but lacking the taxonomy insights. [sent-275, score-0.493]
68 The last baseline will test if weakening the semantics in the taxonomy has a negative impact on accuracy. [sent-276, score-0.451]
69 We evaluate several variants of our approach, in order to analyze the impact of each component: 1) Semantic kernel tree + Avg: an equal-weight average of the semantic kernels from one taxonomy. [sent-277, score-0.667]
70 3) Semantic kernel tree + MKL-H: the same as previous, but adding the proposed hierarchical regularization (eq. [sent-282, score-0.281]
71 4) Semantic kernel forest + MKL: semantic forest kernels from multiple taxonomies combined with MKL. [sent-284, score-1.339]
72 5) Semantic kernel forest + MKL-H: the same as previous, but adding our hierarchical regularizer. [sent-285, score-0.319]
73 For single taxonomy results, we report the average over all individual taxonomies. [sent-289, score-0.421]
74 Our semantic kernel forests approach significantly outperforms all three baselines. [sent-298, score-0.471]
75 The forests’ advantage over the individual trees supports our core claim regarding the value of interleaving semantic cues from multiple taxonomies. [sent-301, score-0.374]
76 Further, the proposed hierarchical regularization (MKL-H) outperforms the generic MKL, particularly for the multiple taxonomy forests. [sent-302, score-0.517]
77 We stress that semantic kernel forests’ success is not simply due to having access to a variety of kernels, as we can see by comparing our method to both the raw feature MKL and perturbed tree 7 l1−only (34. [sent-303, score-0.645]
78 5 5 9 8 4 12 1 22 4 leopard 2 11 14 wolf 3 1 18 8 0 4 11 15 (c) Habitat dalmatian S. [sent-305, score-0.456]
79 cat 17 3 5 5 2 13 4 11 leopard 3 1 21 5 wolf 2 3 10 15 (d) All l1 + Hierarchical (35. [sent-306, score-0.392]
80 67) chimpanzee giant panda leopard Persian cat pig hippopotamus humpback whale raccoon rat seal (e) MKL procyonid feline even−toed aquatic carnivore placental cat/rat hairless ~panda appearance racoon/rat land aquatic predator/prey behavior jungle nonjungle aquatic land habitat 3 3 (b) Appear. [sent-307, score-1.334]
81 11 6 chimpanzee giant panda leopard Persian cat pig hippopotamus humpback whale raccoon rat seal procyonid feline even−toed aquatic carnivore placental cat/rat hairless ~panda appearance racoon/rat land aquatic predator/prey behavior jungle nonjungle aquatic land habitat S. [sent-308, score-1.334]
82 cat 15 3 wolf 3 6 9 13 dalmatian wolf 4 wolf 4 All (55. [sent-311, score-0.565]
83 33) (f) MKL-H Figure 4: (a-d): AWA-4 confusion matrices for individual taxonomies (a-c) and the combined taxonomies (d). [sent-319, score-0.934]
84 Instead, the advantage is leveraging the implicit discriminative criteria embedded in the external semantic groupings. [sent-325, score-0.359]
85 In addition, we note that even perturbed taxonomies can be semantic; some of their groupings of classes may happen to be meaningful, especially when there are fewer categories. [sent-326, score-0.604]
86 Nonetheless, perturbed taxonomies are semantically weaker than the originals, and our kernel trees with the true single or multiple taxonomies perform better. [sent-328, score-1.179]
87 MKL-H has the most impact for the multiple taxonomy forests, and relatively little on the single kernel tree. [sent-329, score-0.588]
88 For a single taxonomy, a single kernel is solely responsible for discriminating a class from the others, making all kernels similarly useful. [sent-331, score-0.272]
89 A single semantic kernel tree often improves accuracy on some classes, but at the expense of reduced accuracy on others. [sent-334, score-0.525]
90 This illustrates that the structure of an individual taxonomy is often suboptimal. [sent-335, score-0.421]
91 For example, the Habitat taxonomy on AWA-10 helps distinguish humpback whale well from the others—it branches early from the other animals due to its distinctive “oceanic” background—but it hurts accuracy for giant panda. [sent-336, score-0.626]
92 The WordNet taxonomy does exactly the opposite, improving giant panda via the Biological taxonomy, but hurting humpback whale. [sent-337, score-0.607]
93 The semantic kernel forest takes the best of both through its learned combination. [sent-338, score-0.563]
94 The only cases in which it fails are when the majority of the taxonomies strongly degrade performance, as to be expected given the linear MKL combination (e. [sent-339, score-0.467]
95 We see how each taxonomy specializes the features, exactly in the manner sketched in Sec. [sent-343, score-0.421]
96 The combination of all taxonomies achieves the highest accuracy (55. [sent-345, score-0.467]
97 00), better than the maximally performing individual taxonomy (Appearance, 50. [sent-346, score-0.421]
98 For example, the humpback whale drops the kernels belonging to the whole Behavior taxonomy block, and gives the strongest weight to “hairless”, and “habitat”. [sent-350, score-0.691]
99 5 Conclusion We proposed a semantic kernel forest approach to learn discriminative visual features that leverage information from multiple semantic taxonomies. [sent-355, score-1.073]
100 In future work, we plan to explore non-additive and/or local per-instance kernel combination techniques for integrating the semantic views. [sent-357, score-0.433]
wordName wordTfidf (topN-words)
[('taxonomies', 0.467), ('taxonomy', 0.421), ('semantic', 0.303), ('leopard', 0.191), ('dalmatian', 0.166), ('habitat', 0.166), ('mkl', 0.158), ('kernels', 0.142), ('wordnet', 0.137), ('forest', 0.13), ('kernel', 0.13), ('mtn', 0.102), ('cat', 0.102), ('wolf', 0.099), ('tree', 0.092), ('ra', 0.089), ('object', 0.083), ('aquatic', 0.079), ('views', 0.077), ('panda', 0.077), ('visual', 0.076), ('attributes', 0.068), ('ctn', 0.064), ('fe', 0.064), ('humpback', 0.064), ('siamese', 0.064), ('whale', 0.064), ('nodes', 0.06), ('classes', 0.059), ('hierarchical', 0.059), ('appearance', 0.059), ('ru', 0.059), ('regularizer', 0.058), ('awa', 0.056), ('discriminative', 0.056), ('base', 0.055), ('tn', 0.054), ('ba', 0.053), ('node', 0.052), ('flo', 0.051), ('ktn', 0.051), ('oa', 0.051), ('tom', 0.05), ('hierarchies', 0.048), ('perturbed', 0.047), ('metrics', 0.046), ('le', 0.046), ('op', 0.046), ('imagenet', 0.045), ('ac', 0.045), ('giant', 0.045), ('oo', 0.045), ('leaf', 0.044), ('attribute', 0.043), ('hierarchy', 0.042), ('bu', 0.042), ('po', 0.041), ('pe', 0.041), ('raw', 0.041), ('cvpr', 0.04), ('da', 0.038), ('feline', 0.038), ('hairless', 0.038), ('isw', 0.038), ('lta', 0.038), ('placental', 0.038), ('qci', 0.038), ('raccoon', 0.038), ('seal', 0.038), ('ton', 0.038), ('zee', 0.038), ('features', 0.038), ('forests', 0.038), ('multiple', 0.037), ('land', 0.037), ('er', 0.037), ('ard', 0.036), ('fc', 0.036), ('cues', 0.034), ('biological', 0.034), ('pig', 0.034), ('rb', 0.034), ('hwang', 0.034), ('classi', 0.034), ('image', 0.034), ('st', 0.032), ('mahalanobis', 0.032), ('distinguish', 0.032), ('feature', 0.032), ('exploits', 0.031), ('wb', 0.031), ('rat', 0.031), ('groupings', 0.031), ('semantically', 0.031), ('category', 0.03), ('semantics', 0.03), ('hi', 0.03), ('ecting', 0.03), ('grauman', 0.029), ('ro', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000017 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies
Author: Sung J. Hwang, Kristen Grauman, Fei Sha
Abstract: When learning features for complex visual recognition problems, labeled image exemplars alone can be insufficient. While an object taxonomy specifying the categories’ semantic relationships could bolster the learning process, not all relationships are relevant to a given visual classification task, nor does a single taxonomy capture all ties that are relevant. In light of these issues, we propose a discriminative feature learning approach that leverages multiple hierarchical taxonomies representing different semantic views of the object categories (e.g., for animal classes, one taxonomy could reflect their phylogenic ties, while another could reflect their habitats). For each taxonomy, we first learn a tree of semantic kernels, where each node has a Mahalanobis kernel optimized to distinguish between the classes in its children nodes. Then, using the resulting semantic kernel forest, we learn class-specific kernel combinations to select only those relationships relevant to recognize each object class. To learn the weights, we introduce a novel hierarchical regularization term that further exploits the taxonomies’ structure. We demonstrate our method on challenging object recognition datasets, and show that interleaving multiple taxonomic views yields significant accuracy improvements.
2 0.19076546 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson
Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1
3 0.13525754 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof
Abstract: In this paper we introduce Context-Sensitive Decision Forests - A new perspective to exploit contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which in combination with a priority based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-ofthe-art methods. 1 Introduction and Related Work In the last years, the random forest framework [1, 6] has become a very popular and powerful tool for classification and regression problems by exhibiting many appealing properties like inherent multi-class capability, robustness to label noise and reduced tendencies to overfitting [7]. They are considered to be close to an ideal learner [13], making them attractive in many areas of computer vision like image classification [5, 17], clustering [19], regression [8] or semantic segmentation [24, 15, 18]. In this work we show how the decision forest algorithm can be extended to include contextual information during learning and inference for classification and regression problems. We focus on applying random forests to object detection, i.e. the problem of localizing multiple instances of a given object class in a test image. This task has been previously addressed in random forests [9], where the trees were modified to learn a mapping between the appearance of an image patch and its relative position to the object category centroid (i.e. center voting information). During inference, the resulting Hough Forest not only performs classification on test samples but also casts probabilistic votes in a generalized Hough-voting space [3] that is subsequently used to obtain object center hypotheses. Ever since, a series of applications such as tracking and action recognition [10], body-joint position estimation [12] and multi-class object detection [22] have been presented. However, Hough Forests typically produce non-distinctive object hypotheses in the Hough space and hence there is the need to perform non-maximum suppression (NMS) for obtaining the final results. While this has been addressed in [4, 26], another shortcoming is that standard (Hough) forests treat samples in a completely independent way, i.e. there is no mechanism that encourages the classifier to perform consistent predictions. Within this work we are proposing that context information can be used to overcome the aforementioned problems. For example, training data for visual learning is often represented by images in form of a (regular) pixel grid topology, i.e. objects appearing in natural images can often be found in a specific context. The importance of contextual information was already highlighted in the 80’s with 1 Figure 1: Top row: Training image, label image, visualization of priority-based growing of tree (the lower, the earlier the consideration during training.). Bottom row: Inverted Hough image using [9] and breadth-first training after 6 levels (26 = 64 nodes), Inverted Hough image after growing 64 nodes using our priority queue, Inverted Hough image using priority queue shows distinctive peaks at the end of training. a pioneering work on relaxation labelling [14] and a later work with focus on inference tasks [20] that addressed the issue of learning within the same framework. More recently, contextual information has been used in the field of object class segmentation [21], however, mostly for high-level reasoning in random field models or to resolve contradicting segmentation results. The introduction of contextual information as additional features in low-level classifiers was initially proposed in the Auto-context [25] and Semantic Texton Forest [24] models. Auto-context shows a general approach for classifier boosting by iteratively learning from appearance and context information. In this line of research [18] augmented the feature space for an Entanglement Random Forest with a classification feature, that is consequently refined by the class posterior distributions according to the progress of the trained subtree. The training procedure is allowed to perform tests for specific, contextual label configurations which was demonstrated to significantly improve the segmentation results. However, the In this paper we are presenting Context-Sensitve Decision Forests - A novel and unified interpretation of Hough Forests in light of contextual sensitivity. Our work is inspired by Auto-Context and Entanglement Forests, but instead of providing only posterior classification results from an earlier level of the classifier construction during learning and testing, we additionally provide regression (voting) information as it is used in Hough Forests. The second core contribution of our work is related to how we grow the trees: Instead of training them in a depth- or breadth-first way, we propose a priority-based construction (which could actually consider depth- or breadth-first as particular cases). The priority is determined by the current training error, i.e. we first grow the parts of the tree where we experience higher error. To this end, we introduce a unified splitting criterion that estimates the joint error of classification and regression. The consequence of using our priority-based training are illustrated in Figure 1: Given the training image with corresponding label image (top row, images 1 and 2), the tree first tries to learn the foreground samples as shown in the color-coded plot (top row, image 3, colors correspond to index number of nodes in the tree). The effects on the intermediate prediction quality are shown in the bottom row for the regression case: The first image shows the regression quality after training a tree with 6 levels (26 = 64 nodes) in a breadth-first way while the second image shows the progress after growing 64 nodes according to the priority based training. Clearly, the modes for the center hypotheses are more distinctive which in turn yields to more accurate intermediate regression information that can be used for further tree construction. Our third contribution is a new family of split functions that allows to learn from training images containing multiple training instances as shown for the pedestrians in the example. We introduce a test that checks the centroid compatibility for pairs of training samples taken from the context, based on the intermediate classification and regression derived as described before. To assess our contributions, we performed several experiments on the challenging TUD pedestrian data set [2], yielding a significant improvement of 9% in the recall at 90% precision rate in comparison to standard Hough Forests, when learning from crowded pedestrian images. 2 2 Context-Sensitive Decision Trees This section introduces the general idea behind the context-sensitive decision forest without references to specific applications. Only in Section 3 we show a particular application to the problem of object detection. After showing some basic notational conventions that are used in the paper, we provide a section that revisits the random forest framework for classification and regression tasks from a joint perspective, i.e. a theory allowing to consider e.g. [1, 11] and [9] in a unified way. Starting from this general view we finally introduce the context-sensitive forests in 2.2. Notations. In the paper we denote vectors using boldface lowercase (e.g. d, u, v) and sets by using uppercase calligraphic (e.g. X , Y) symbols. The sets of real, natural and integer numbers are denoted with R, N and Z as usually. We denote by 2X the power set of X and by 1 [P ] the indicator function returning 1 or 0 according to whether the proposition P is true or false. Moreover, with P(Y) we denote the set of probability distributions having Y as sample space and we implicitly assume that some σ-algebra is defined on Y. We denote by δ(x) the Dirac delta function. Finally, Ex∼Q [f (x)] denotes the expectation of f (x) with respect to x sampled according to distribution Q. 2.1 Random Decision Forests for joint classification and regression A (binary) decision tree is a tree-structured predictor1 where, starting from the root, a sample is routed until it reaches a leaf where the prediction takes place. At each internal node of the tree the decision is taken whether the sample should be forwarded to the left or right child, according to a binary-valued function. In formal terms, let X denote the input space, let Y denote the output space and let T dt be the set of decision trees. In its simplest form a decision tree consists of a single node (a leaf ) and is parametrized by a probability distribution Q ∈ P(Y) which represents the posterior probability of elements in Y given any data sample reaching the leaf. We denote this (admittedly rudimentary) tree as L F (Q) ∈ T td . Otherwise, a decision tree consists of a node with a left and a right sub-tree. This node is parametrized by a split function φ : X → {0, 1}, which determines whether to route a data sample x ∈ X reaching it to the left decision sub-tree tl ∈ T dt (if φ(x) = 0) or to the right one tr ∈ T dt (if φ(x) = 1). We denote such a tree as N D (φ, tl , tr ) ∈ T td . Finally, a decision forest is an ensemble F ⊆ T td of decision trees which makes a prediction about a data sample by averaging over the single predictions gathered from all trees. Inference. Given a decision tree t ∈ T dt , the associated posterior probability of each element in Y given a sample x ∈ X is determined by finding the probability distribution Q parametrizing the leaf that is reached by x when routed along the tree. This is compactly presented with the following definition of P (y|x, t), which is inductive in the structure of t: if t = L F (Q) Q(y) P (y | x, t ) = P (y | x, tl ) if t = N D (φ, tl , tr ) and φ(x) = 0 (1) P (y | x, tr ) if t = N D (φ, tl , tr ) and φ(x) = 1 . Finally, the combination of the posterior probabilities derived from the trees in a forest F ⊆ T dt can be done by an averaging operation [6], yielding a single posterior probability for the whole forest: P (y|x, F) = 1 |F| P (y|x, t) . (2) t∈F Randomized training. A random forest is created by training a set of random decision trees independently on random subsets of the training data D ⊆ X ×Y. The training procedure for a single decision tree heuristically optimizes a set of parameters like the tree structure, the split functions at the internal nodes and the density estimates at the leaves in order to reduce the prediction error on the training data. In order to prevent overfitting problems, the search space of possible split functions is limited to a random set and a minimum number of training samples is required to grow a leaf node. During the training procedure, each new node is fed with a set of training samples Z ⊆ D. If some stopping condition holds, depending on Z, the node becomes a leaf and a density on Y is estimated based on Z. Otherwise, an internal node is grown and a split function is selected from a pool of random ones in a way to minimize some sort of training error on Z. The selected split function induces a partition 1 we use the term predictor because we will jointly consider classification and regression. 3 of Z into two sets, which are in turn becoming the left and right childs of the current node where the training procedure is continued, respectively. We will now write this training procedure in more formal terms. To this end we introduce a function π(Z) ∈ P(Y) providing a density on Y estimated from the training data Z ⊆ D and a loss function L(Z | Q) ∈ R penalizing wrong predictions on the training samples in Z, when predictions are given according to a distribution Q ∈ P(Y). The loss function L can be further decomposed in terms of a loss function (·|Q) : Y → R acting on each sample of the training set: L(Z | Q) = (y | Q) . (3) (x,y)∈Z Also, let Φ(Z) be a set of split functions randomly generated for a training set Z and given a split φ function φ ∈ Φ(Z), we denote by Zlφ and Zr the sets identified by splitting Z according to φ, i.e. Zlφ = {(x, y) ∈ Z : φ(x) = 0} and φ Zr = {(x, y) ∈ Z : φ(x) = 1} . We can now summarize the training procedure in terms of a recursive function g : 2X ×Y → T , which generates a random decision tree from a training set given as argument: g(Z) = L F (π(Z)) ND if some stopping condition holds φ φ, g(Zlφ ), g(Zr ) otherwise . (4) Here, we determine the optimal split function φ in the pool Φ(Z) as the one minimizing the loss we incur as a result of the node split: φ φ ∈ arg min L(Zlφ ) + L(Zr ) : φ ∈ Φ(Z) (5) where we compactly write L(Z) for L(Z|π(Z)), i.e. the loss on Z obtained with predictions driven by π(Z). A typical split function selection criterion commonly adopted for classification and regression is information gain. The equivalent counterpart in terms of loss can be obtained by using a log-loss, i.e. (y|Q) = − log(Q(y)). A further widely used criterion is based on Gini impurity, which can be expressed in this setting by using (y|Q) = 1 − Q(y). Finally, the stopping condition that is used in (4) to determine whether to create a leaf or to continue branching the tree typically consists in checking |Z|, i.e. the number of training samples at the node, or the loss L(Z) are below some given thresholds, or if a maximum depth is reached. 2.2 Context-sensitive decision forests A context-sensitive (CS) decision tree is a decision tree in which split functions are enriched with the ability of testing contextual information of a sample, before taking a decision about where to route it. We generate contextual information at each node of a decision tree by exploiting a truncated version of the same tree as a predictor. This idea is shared with [18], however, we introduce some novelties by tackling both, classification and regression problems in a joint manner and by leaving a wider flexibility in the tree truncation procedure. We denote the set of CS decision trees as T . The main differences characterizing a CS decision tree t ∈ T compared with a standard decision tree are the following: a) every node (leaves and internal nodes) of t has an associated probability distribution Q ∈ P(Y) representing the posterior probability of an element in Y given any data sample reaching it; b) internal nodes are indexed with distinct natural numbers n ∈ N in a way to preserve the property that children nodes have a larger index compared to their parent node; c) the split function at each internal node, denoted by ϕ(·|t ) : X → {0, 1}, is bound to a CS decision tree t ∈ T , which is a truncated version of t and can be used to compute intermediate, contextual information. Similar to Section 2.1 we denote by L F (Q) ∈ T the simplest CS decision tree consisting of a single leaf node parametrized by the distribution Q, while we denote by N D (n, Q, ϕ, tl , tr ) ∈ T , the rest of the trees consisting of a node having a left and a right sub-tree, denoted by tl , tr ∈ T respectively, and being parametrized by the index n, a probability distribution Q and the split function ϕ as described above. As shown in Figure 2, the truncation of a CS decision tree at each node is obtained by exploiting the indexing imposed on the internal nodes of the tree. Given a CS decision tree t ∈ T and m ∈ N, 4 1 1 4 2 3 6 2 5 4 3 (b) The truncated version t(<5) (a) A CS decision tree t Figure 2: On the left, we find a CS decision tree t, where only the internal nodes are indexed. On the right, we see the truncated version t(<5) of t, which is obtained by converting to leaves all nodes having index ≥ 5 (we marked with colors the corresponding node transformations). we denote by t( < τ 2 In the experiments conducted, we never exceeded 10 iterations for finding a mode. 6 (8) where Pj = P (·|(u + hj , I), t), with j = 1, 2, are the posterior probabilities obtained from tree t given samples at position u+h1 and u+h2 of image I, respectively. Please note that this test should not be confused with the regression split criterion in [9], which tries to partition the training set in a way to group examples with similar voting direction and length. Besides the novel context-sensitive split function we employ also standard split functions performing tests on X as defined in [24]. 4 Experiments To assess our proposed approach, we have conducted several experiments on the task of pedestrian detection. Detecting pedestrians is very challenging for Hough-voting based methods as they typically exhibit strong articulations of feet and arms, yielding to non-distinctive hypotheses in the Hough space. We evaluated our method on the TUD pedestrian data base [2] in two different ways: First, we show our detection results with training according to the standard protocol using 400 training images (where each image contains a single annotation of a pedestrian) and evaluation on the Campus and Crossing scenes, respectively (Section 4.1). With this experiment we show the improvement over state-of-the-art approaches when learning can be performed with simultaneous knowledge about context information. In a second variation (Section 4.2), we use the images of the Crossing scene (201 images) as a training set. Most images of this scene contain more than four persons with strong overlap and mutual occlusions. However, instead of using the original annotation which covers only pedestrians with at least 50% overlap (1008 bounding boxes), we use the more accurate, pixel-wise ground truth annotations of [23] for the entire scene that includes all persons and consists of 1215 bounding boxes. Please note that this annotation is even more detailed than the one presented in [4] with 1018 bounding boxes. The purpose of the second experiment is to show that our context-sensitive forest can exploit the availability of multiple training instances significantly better than state-of-the-art. The most related work and therefore also the baseline in our experiments is the Hough Forest [9]. To guarantee a fair comparison, we use the same training parameters for [9] and our context sensitive forest: We trained 20 trees and the training data (including horizontally flipped images) was sampled homogeneously per category per image. The patch size was fixed to 30 × 30 and we performed 1600 node tests for finding the best split function parameters per node. The trees were stopped growing when < 7 samples were available. As image features, we used the the first 16 feature channels provided in the publicly available Hough Forest code of [9]. In order to obtain the object detection hypotheses from the Hough space, we use the same Non-maximum suppression (NMS) technique in all our experiments as suggested in [9]. To evaluate the obtained hypotheses, we use the standard PASAL-VOC criterion which requires the mutual overlap between ground truth and detected bounding boxes to be ≥ 50%. The additional parameter of (7) was fixed to σ = 7. 4.1 Evaluation using standard protocol training set The standard training set contains 400 images where each image comes with a single pedestrian annotation. For our experiments, we rescaled the images by a factor of 0.5 and doubled the training image set by including also the horizontally flipped images. We randomly chose 125 training samples per image for foreground and background, resulting in 2 · 400 · 2 · 125 = 200k training samples per tree. For additional comparisons, we provide the results presented in the recent work on joint object detection and segmentation of [23], from which we also provide evaluation results of the Implicit Shape Model (ISM) [16]. However, please note that the results of [23] are based on a different baseline implementation. Moreover, we show the results of [4] when using the provided code and configuration files from the first authors homepage. Unfortunately, we could not reproduce the results of the original paper. First, we discuss the results obtained on the Campus scene. This data set consists of 71 images showing walking pedestrians at severe scale differences and partial occlusions. The ground truth we use has been released with [4] and contains a total number of 314 pedestrians. Figure 3, first row, plot 1 shows the precision-recall curves when using 3 scales (factors 0.3, 0.4, 0.55) for our baseline [9] (blue), results from re-evaluating [4] (cyan, 5 scales), [23] (green) and our ContextSensitive Forest without and with using the priority queue based tree construction (red/magenta). In case of not using the priority queue, we trained the trees according to a breadth-first way. We obtain a performance boost of ≈ 6% in recall at a precision of 90% when using both, context information and the priority based construction of our forest. The second plot in the first row of Figure 3 shows the results when the same forests are tested on the Crossing scene, using the more detailed ground 7 TUD Campus (3 scales) TUD−Crossing (3 scales) 0.9 0.8 0.8 0.7 0.7 0.6 0.6 Precision 1 0.9 Precision 1 0.5 0.4 0.3 0.2 0.1 0 0 0.5 0.4 0.3 Baseline Hough Forest Barinova et al. CVPR’10, 5 scales Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue Riemenschneider et al. ECCV’12 0.1 0.2 0.3 0.4 0.5 Recall 0.6 0.7 0.8 0.2 0.1 0.9 0 0 1 Baseline Hough Forest Barinova et al. CVPR’10 Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue Riemenschneider et al. ECCV’12 (1 scale) Leibe et al. IJCV’08 (1 scale) 0.1 TUD Campus (3 scales) 0.3 0.4 0.5 Recall 0.6 0.7 0.8 0.9 1 0.9 1 1 0.9 0.8 0.8 0.7 0.7 0.6 0.6 Precision 1 0.9 Precision 0.2 TUD Campus (5 scales) 0.5 0.4 0.3 0 0 0.4 0.3 0.2 0.1 0.5 0.2 Baseline Hough Forest Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue 0.1 0.2 0.3 0.4 0.5 Recall 0.6 0.7 0.8 0.1 0.9 1 0 0 Baseline Hough Forest Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue 0.1 0.2 0.3 0.4 0.5 Recall 0.6 0.7 0.8 Figure 3: Precision-Recall Curves for detections, Top row: Standard training (400 images), evaluation on Campus and Crossing (3 scales). Bottom row: Training on Crossing annotations of [23], evaluation on Campus, 3 and 5 scales. Right images: Qualitative examples for Campus (top 2) and Crossing (bottom 2) scenes. (green) correctly found by our method (blue) ground truth (red) wrong association (cyan) missed detection. truth annotations. The data set shows walking pedestrians (Figure 3, right side, last 2 images) with a smaller variation in scale compared to the Campus scene but with strong mutual occlusions and overlaps. The improvement with respect to the baseline is lower (≈ 2% gain at a precision of 90%) and we find similar developments of the curves. However, this comes somewhat expectedly as the training data does not properly reflect the occlusions we actually want to model. 4.2 Evaluation on Campus scene using Crossing scene as training set In our next experiment we trained the forests (same parameters) on the novel annotations of [23] for the Crossing scene. Please note that this reduces the training set to only 201 images (we did not include the flipped images). Qualitative detection results are shown in Figure 3, right side, images 1 and 2. From the first precison-recall curve in the second row of Figure 3 we can see, that the margin between the baseline and our proposed method could be clearly improved (gain of ≈ 9% recall at precision 90%) when evaluating on the same 3 scales. With evaluation on 5 scales (factors 0.34, 0.42, 0.51, 0.65, 0.76) we found a strong increase in the recall, however, at the cost of loosing 2 − 3% of precision below a recall of 60%, as illustrated in the second plot of row 2 in Figure 3. While our method is able to maintain a precision above 90% up to a recall of ≈ 83%, the baseline implementation drops already at a recall of ≈ 20%. 5 Conclusions In this work we have presented Context-Sensitive Decision Forests with application to the object detection problem. Our new forest has the ability to access intermediate prediction (classification and regression) information about all samples of the training set and can therefore learn from contextual information throughout the growing process. This is in contrast to existing random forest methods used for object detection which typically treat training samples in an independent manner. Moreover, we have introduced a novel splitting criterion together with a mode isolation technique, which allows us to (a) perform a priority-driven way of tree growing and (b) install novel context-based test functions to check for mutual object centroid agreements. In our experimental results on pedestrian detection we demonstrated superior performance with respect to state-of-the-art methods and additionally found that our new algorithm can significantly better exploit training data containing multiple training objects. Acknowledgements. Peter Kontschieder acknowledges financial support of the Austrian Science Fund (FWF) from project ’Fibermorph’ with number P22261-N22. 8 References [1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 1997. [2] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In (CVPR), 2008. [3] D. H. Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern Recognition, 13(2), 1981. [4] O. Barinova, V. Lempitsky, and P. Kohli. On detection of multiple object instances using hough transforms. In (CVPR), 2010. [5] A. Bosch, A. Zisserman, and X. Mu˜oz. Image classification using random forests and ferns. In (ICCV), n 2007. [6] L. Breiman. Random forests. In Machine Learning, 2001. [7] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. In Foundations and Trends in Computer Graphics and Vision, volume 7, pages 81–227, 2012. [8] A. Criminisi, J. Shotton, D. Robertson, and E. Konukoglu. Regression forests for efficient anatomy detection and localization in CT scans. In MICCAI-MCV Workshop, 2010. [9] J. Gall and V. Lempitsky. Class-specific hough forests for object detection. In (CVPR), 2009. [10] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. (PAMI), 2011. [11] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 2006. [12] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In (ICCV), 2011. [13] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2009. [14] R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling. (PAMI), 5(3):267–287, 1983. [15] P. Kontschieder, S. Rota Bul` , H. Bischof, and M. Pelillo. Structured class-labels in random forests for o semantic image labelling. In (ICCV), 2011. [16] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. (IJCV), 2008. [17] R. Mar´ e, P. Geurts, J. Piater, and L. Wehenkel. Random subwindows for robust image classification. In e (CVPR), 2005. [18] A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas, and A. Criminisi. Entangled decision forests and their application for semantic segmentation of CT images. In (IPMI), 2011. [19] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. In (NIPS), 2006. [20] M. Pelillo and M. Refice. Learning compatibility coefficients for relaxation labeling processes. (PAMI), 16(9):933–945, 1994. [21] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In (ICCV), 2007. [22] N. Razavi, J. Gall, and L. Van Gool. Scalable multi-class object detection. In (CVPR), 2011. [23] H. Riemenschneider, S. Sternig, M. Donoser, P. M. Roth, and H. Bischof. Hough regions for joining instance localization and segmentation. In (ECCV), 2012. [24] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In (CVPR), 2008. [25] Z. Tu. Auto-context and its application to high-level vision tasks. In (CVPR), 2008. [26] O. Woodford, M. Pham, A. Maki, F. Perbet, and B. Stenger. Demisting the hough transform for 3d shape recognition and registration. In (BMVC), 2011. 9
4 0.10649782 155 nips-2012-Human memory search as a random walk in a semantic network
Author: Joseph L. Austerweil, Joshua T. Abbott, Thomas L. Griffiths
Abstract: The human mind has a remarkable ability to store a vast amount of information in memory, and an even more remarkable ability to retrieve these experiences when needed. Understanding the representations and algorithms that underlie human memory search could potentially be useful in other information retrieval settings, including internet search. Psychological studies have revealed clear regularities in how people search their memory, with clusters of semantically related items tending to be retrieved together. These findings have recently been taken as evidence that human memory search is similar to animals foraging for food in patchy environments, with people making a rational decision to switch away from a cluster of related information as it becomes depleted. We demonstrate that the results that were taken as evidence for this account also emerge from a random walk on a semantic network, much like the random web surfer model used in internet search engines. This offers a simpler and more unified account of how people search their memory, postulating a single process rather than one process for exploring a cluster and one process for switching between clusters. 1
5 0.10539838 289 nips-2012-Recognizing Activities by Attribute Dynamics
Author: Weixin Li, Nuno Vasconcelos
Abstract: In this work, we consider the problem of modeling the dynamic structure of human activities in the attributes space. A video sequence is Ä?Ĺš rst represented in a semantic feature space, where each feature encodes the probability of occurrence of an activity attribute at a given time. A generative model, denoted the binary dynamic system (BDS), is proposed to learn both the distribution and dynamics of different activities in this space. The BDS is a non-linear dynamic system, which extends both the binary principal component analysis (PCA) and classical linear dynamic systems (LDS), by combining binary observation variables with a hidden Gauss-Markov state process. In this way, it integrates the representation power of semantic modeling with the ability of dynamic systems to capture the temporal structure of time-varying processes. An algorithm for learning BDS parameters, inspired by a popular LDS learning method from dynamic textures, is proposed. A similarity measure between BDSs, which generalizes the BinetCauchy kernel for LDS, is then introduced and used to design activity classiÄ?Ĺš ers. The proposed method is shown to outperform similar classiÄ?Ĺš ers derived from the kernel dynamic system (KDS) and state-of-the-art approaches for dynamics-based or attribute-based action recognition. 1
6 0.10370784 188 nips-2012-Learning from Distributions via Support Measure Machines
7 0.10065551 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
8 0.08584667 185 nips-2012-Learning about Canonical Views from Internet Image Collections
9 0.085068434 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification
10 0.082854815 168 nips-2012-Kernel Latent SVM for Visual Recognition
11 0.079941183 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
12 0.075226754 231 nips-2012-Multiple Operator-valued Kernel Learning
13 0.073914349 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio
14 0.071893997 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
15 0.067466609 260 nips-2012-Online Sum-Product Computation Over Trees
16 0.064928159 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering
17 0.063931465 22 nips-2012-A latent factor model for highly multi-relational data
18 0.063166179 344 nips-2012-Timely Object Recognition
19 0.062106542 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
20 0.06058909 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity
topicId topicWeight
[(0, 0.164), (1, 0.043), (2, -0.124), (3, -0.055), (4, 0.066), (5, -0.084), (6, 0.002), (7, 0.024), (8, -0.099), (9, -0.009), (10, -0.014), (11, 0.013), (12, 0.126), (13, -0.046), (14, -0.004), (15, -0.033), (16, -0.035), (17, 0.092), (18, 0.024), (19, -0.059), (20, -0.023), (21, 0.026), (22, 0.073), (23, -0.176), (24, 0.004), (25, 0.004), (26, 0.009), (27, -0.121), (28, 0.025), (29, 0.04), (30, 0.049), (31, -0.108), (32, -0.002), (33, -0.089), (34, 0.066), (35, 0.094), (36, 0.036), (37, -0.124), (38, 0.016), (39, 0.017), (40, 0.125), (41, 0.086), (42, 0.093), (43, 0.003), (44, -0.036), (45, -0.047), (46, 0.052), (47, -0.012), (48, 0.042), (49, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.94911897 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies
Author: Sung J. Hwang, Kristen Grauman, Fei Sha
Abstract: When learning features for complex visual recognition problems, labeled image exemplars alone can be insufficient. While an object taxonomy specifying the categories’ semantic relationships could bolster the learning process, not all relationships are relevant to a given visual classification task, nor does a single taxonomy capture all ties that are relevant. In light of these issues, we propose a discriminative feature learning approach that leverages multiple hierarchical taxonomies representing different semantic views of the object categories (e.g., for animal classes, one taxonomy could reflect their phylogenic ties, while another could reflect their habitats). For each taxonomy, we first learn a tree of semantic kernels, where each node has a Mahalanobis kernel optimized to distinguish between the classes in its children nodes. Then, using the resulting semantic kernel forest, we learn class-specific kernel combinations to select only those relationships relevant to recognize each object class. To learn the weights, we introduce a novel hierarchical regularization term that further exploits the taxonomies’ structure. We demonstrate our method on challenging object recognition datasets, and show that interleaving multiple taxonomic views yields significant accuracy improvements.
Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson
Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1
3 0.61696225 188 nips-2012-Learning from Distributions via Support Measure Machines
Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf
Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1
4 0.54401129 231 nips-2012-Multiple Operator-valued Kernel Learning
Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach
Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1
5 0.5252738 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof
Abstract: In this paper we introduce Context-Sensitive Decision Forests - A new perspective to exploit contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which in combination with a priority based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-ofthe-art methods. 1 Introduction and Related Work In the last years, the random forest framework [1, 6] has become a very popular and powerful tool for classification and regression problems by exhibiting many appealing properties like inherent multi-class capability, robustness to label noise and reduced tendencies to overfitting [7]. They are considered to be close to an ideal learner [13], making them attractive in many areas of computer vision like image classification [5, 17], clustering [19], regression [8] or semantic segmentation [24, 15, 18]. In this work we show how the decision forest algorithm can be extended to include contextual information during learning and inference for classification and regression problems. We focus on applying random forests to object detection, i.e. the problem of localizing multiple instances of a given object class in a test image. This task has been previously addressed in random forests [9], where the trees were modified to learn a mapping between the appearance of an image patch and its relative position to the object category centroid (i.e. center voting information). During inference, the resulting Hough Forest not only performs classification on test samples but also casts probabilistic votes in a generalized Hough-voting space [3] that is subsequently used to obtain object center hypotheses. Ever since, a series of applications such as tracking and action recognition [10], body-joint position estimation [12] and multi-class object detection [22] have been presented. However, Hough Forests typically produce non-distinctive object hypotheses in the Hough space and hence there is the need to perform non-maximum suppression (NMS) for obtaining the final results. While this has been addressed in [4, 26], another shortcoming is that standard (Hough) forests treat samples in a completely independent way, i.e. there is no mechanism that encourages the classifier to perform consistent predictions. Within this work we are proposing that context information can be used to overcome the aforementioned problems. For example, training data for visual learning is often represented by images in form of a (regular) pixel grid topology, i.e. objects appearing in natural images can often be found in a specific context. The importance of contextual information was already highlighted in the 80’s with 1 Figure 1: Top row: Training image, label image, visualization of priority-based growing of tree (the lower, the earlier the consideration during training.). Bottom row: Inverted Hough image using [9] and breadth-first training after 6 levels (26 = 64 nodes), Inverted Hough image after growing 64 nodes using our priority queue, Inverted Hough image using priority queue shows distinctive peaks at the end of training. a pioneering work on relaxation labelling [14] and a later work with focus on inference tasks [20] that addressed the issue of learning within the same framework. More recently, contextual information has been used in the field of object class segmentation [21], however, mostly for high-level reasoning in random field models or to resolve contradicting segmentation results. The introduction of contextual information as additional features in low-level classifiers was initially proposed in the Auto-context [25] and Semantic Texton Forest [24] models. Auto-context shows a general approach for classifier boosting by iteratively learning from appearance and context information. In this line of research [18] augmented the feature space for an Entanglement Random Forest with a classification feature, that is consequently refined by the class posterior distributions according to the progress of the trained subtree. The training procedure is allowed to perform tests for specific, contextual label configurations which was demonstrated to significantly improve the segmentation results. However, the In this paper we are presenting Context-Sensitve Decision Forests - A novel and unified interpretation of Hough Forests in light of contextual sensitivity. Our work is inspired by Auto-Context and Entanglement Forests, but instead of providing only posterior classification results from an earlier level of the classifier construction during learning and testing, we additionally provide regression (voting) information as it is used in Hough Forests. The second core contribution of our work is related to how we grow the trees: Instead of training them in a depth- or breadth-first way, we propose a priority-based construction (which could actually consider depth- or breadth-first as particular cases). The priority is determined by the current training error, i.e. we first grow the parts of the tree where we experience higher error. To this end, we introduce a unified splitting criterion that estimates the joint error of classification and regression. The consequence of using our priority-based training are illustrated in Figure 1: Given the training image with corresponding label image (top row, images 1 and 2), the tree first tries to learn the foreground samples as shown in the color-coded plot (top row, image 3, colors correspond to index number of nodes in the tree). The effects on the intermediate prediction quality are shown in the bottom row for the regression case: The first image shows the regression quality after training a tree with 6 levels (26 = 64 nodes) in a breadth-first way while the second image shows the progress after growing 64 nodes according to the priority based training. Clearly, the modes for the center hypotheses are more distinctive which in turn yields to more accurate intermediate regression information that can be used for further tree construction. Our third contribution is a new family of split functions that allows to learn from training images containing multiple training instances as shown for the pedestrians in the example. We introduce a test that checks the centroid compatibility for pairs of training samples taken from the context, based on the intermediate classification and regression derived as described before. To assess our contributions, we performed several experiments on the challenging TUD pedestrian data set [2], yielding a significant improvement of 9% in the recall at 90% precision rate in comparison to standard Hough Forests, when learning from crowded pedestrian images. 2 2 Context-Sensitive Decision Trees This section introduces the general idea behind the context-sensitive decision forest without references to specific applications. Only in Section 3 we show a particular application to the problem of object detection. After showing some basic notational conventions that are used in the paper, we provide a section that revisits the random forest framework for classification and regression tasks from a joint perspective, i.e. a theory allowing to consider e.g. [1, 11] and [9] in a unified way. Starting from this general view we finally introduce the context-sensitive forests in 2.2. Notations. In the paper we denote vectors using boldface lowercase (e.g. d, u, v) and sets by using uppercase calligraphic (e.g. X , Y) symbols. The sets of real, natural and integer numbers are denoted with R, N and Z as usually. We denote by 2X the power set of X and by 1 [P ] the indicator function returning 1 or 0 according to whether the proposition P is true or false. Moreover, with P(Y) we denote the set of probability distributions having Y as sample space and we implicitly assume that some σ-algebra is defined on Y. We denote by δ(x) the Dirac delta function. Finally, Ex∼Q [f (x)] denotes the expectation of f (x) with respect to x sampled according to distribution Q. 2.1 Random Decision Forests for joint classification and regression A (binary) decision tree is a tree-structured predictor1 where, starting from the root, a sample is routed until it reaches a leaf where the prediction takes place. At each internal node of the tree the decision is taken whether the sample should be forwarded to the left or right child, according to a binary-valued function. In formal terms, let X denote the input space, let Y denote the output space and let T dt be the set of decision trees. In its simplest form a decision tree consists of a single node (a leaf ) and is parametrized by a probability distribution Q ∈ P(Y) which represents the posterior probability of elements in Y given any data sample reaching the leaf. We denote this (admittedly rudimentary) tree as L F (Q) ∈ T td . Otherwise, a decision tree consists of a node with a left and a right sub-tree. This node is parametrized by a split function φ : X → {0, 1}, which determines whether to route a data sample x ∈ X reaching it to the left decision sub-tree tl ∈ T dt (if φ(x) = 0) or to the right one tr ∈ T dt (if φ(x) = 1). We denote such a tree as N D (φ, tl , tr ) ∈ T td . Finally, a decision forest is an ensemble F ⊆ T td of decision trees which makes a prediction about a data sample by averaging over the single predictions gathered from all trees. Inference. Given a decision tree t ∈ T dt , the associated posterior probability of each element in Y given a sample x ∈ X is determined by finding the probability distribution Q parametrizing the leaf that is reached by x when routed along the tree. This is compactly presented with the following definition of P (y|x, t), which is inductive in the structure of t: if t = L F (Q) Q(y) P (y | x, t ) = P (y | x, tl ) if t = N D (φ, tl , tr ) and φ(x) = 0 (1) P (y | x, tr ) if t = N D (φ, tl , tr ) and φ(x) = 1 . Finally, the combination of the posterior probabilities derived from the trees in a forest F ⊆ T dt can be done by an averaging operation [6], yielding a single posterior probability for the whole forest: P (y|x, F) = 1 |F| P (y|x, t) . (2) t∈F Randomized training. A random forest is created by training a set of random decision trees independently on random subsets of the training data D ⊆ X ×Y. The training procedure for a single decision tree heuristically optimizes a set of parameters like the tree structure, the split functions at the internal nodes and the density estimates at the leaves in order to reduce the prediction error on the training data. In order to prevent overfitting problems, the search space of possible split functions is limited to a random set and a minimum number of training samples is required to grow a leaf node. During the training procedure, each new node is fed with a set of training samples Z ⊆ D. If some stopping condition holds, depending on Z, the node becomes a leaf and a density on Y is estimated based on Z. Otherwise, an internal node is grown and a split function is selected from a pool of random ones in a way to minimize some sort of training error on Z. The selected split function induces a partition 1 we use the term predictor because we will jointly consider classification and regression. 3 of Z into two sets, which are in turn becoming the left and right childs of the current node where the training procedure is continued, respectively. We will now write this training procedure in more formal terms. To this end we introduce a function π(Z) ∈ P(Y) providing a density on Y estimated from the training data Z ⊆ D and a loss function L(Z | Q) ∈ R penalizing wrong predictions on the training samples in Z, when predictions are given according to a distribution Q ∈ P(Y). The loss function L can be further decomposed in terms of a loss function (·|Q) : Y → R acting on each sample of the training set: L(Z | Q) = (y | Q) . (3) (x,y)∈Z Also, let Φ(Z) be a set of split functions randomly generated for a training set Z and given a split φ function φ ∈ Φ(Z), we denote by Zlφ and Zr the sets identified by splitting Z according to φ, i.e. Zlφ = {(x, y) ∈ Z : φ(x) = 0} and φ Zr = {(x, y) ∈ Z : φ(x) = 1} . We can now summarize the training procedure in terms of a recursive function g : 2X ×Y → T , which generates a random decision tree from a training set given as argument: g(Z) = L F (π(Z)) ND if some stopping condition holds φ φ, g(Zlφ ), g(Zr ) otherwise . (4) Here, we determine the optimal split function φ in the pool Φ(Z) as the one minimizing the loss we incur as a result of the node split: φ φ ∈ arg min L(Zlφ ) + L(Zr ) : φ ∈ Φ(Z) (5) where we compactly write L(Z) for L(Z|π(Z)), i.e. the loss on Z obtained with predictions driven by π(Z). A typical split function selection criterion commonly adopted for classification and regression is information gain. The equivalent counterpart in terms of loss can be obtained by using a log-loss, i.e. (y|Q) = − log(Q(y)). A further widely used criterion is based on Gini impurity, which can be expressed in this setting by using (y|Q) = 1 − Q(y). Finally, the stopping condition that is used in (4) to determine whether to create a leaf or to continue branching the tree typically consists in checking |Z|, i.e. the number of training samples at the node, or the loss L(Z) are below some given thresholds, or if a maximum depth is reached. 2.2 Context-sensitive decision forests A context-sensitive (CS) decision tree is a decision tree in which split functions are enriched with the ability of testing contextual information of a sample, before taking a decision about where to route it. We generate contextual information at each node of a decision tree by exploiting a truncated version of the same tree as a predictor. This idea is shared with [18], however, we introduce some novelties by tackling both, classification and regression problems in a joint manner and by leaving a wider flexibility in the tree truncation procedure. We denote the set of CS decision trees as T . The main differences characterizing a CS decision tree t ∈ T compared with a standard decision tree are the following: a) every node (leaves and internal nodes) of t has an associated probability distribution Q ∈ P(Y) representing the posterior probability of an element in Y given any data sample reaching it; b) internal nodes are indexed with distinct natural numbers n ∈ N in a way to preserve the property that children nodes have a larger index compared to their parent node; c) the split function at each internal node, denoted by ϕ(·|t ) : X → {0, 1}, is bound to a CS decision tree t ∈ T , which is a truncated version of t and can be used to compute intermediate, contextual information. Similar to Section 2.1 we denote by L F (Q) ∈ T the simplest CS decision tree consisting of a single leaf node parametrized by the distribution Q, while we denote by N D (n, Q, ϕ, tl , tr ) ∈ T , the rest of the trees consisting of a node having a left and a right sub-tree, denoted by tl , tr ∈ T respectively, and being parametrized by the index n, a probability distribution Q and the split function ϕ as described above. As shown in Figure 2, the truncation of a CS decision tree at each node is obtained by exploiting the indexing imposed on the internal nodes of the tree. Given a CS decision tree t ∈ T and m ∈ N, 4 1 1 4 2 3 6 2 5 4 3 (b) The truncated version t(<5) (a) A CS decision tree t Figure 2: On the left, we find a CS decision tree t, where only the internal nodes are indexed. On the right, we see the truncated version t(<5) of t, which is obtained by converting to leaves all nodes having index ≥ 5 (we marked with colors the corresponding node transformations). we denote by t( < τ 2 In the experiments conducted, we never exceeded 10 iterations for finding a mode. 6 (8) where Pj = P (·|(u + hj , I), t), with j = 1, 2, are the posterior probabilities obtained from tree t given samples at position u+h1 and u+h2 of image I, respectively. Please note that this test should not be confused with the regression split criterion in [9], which tries to partition the training set in a way to group examples with similar voting direction and length. Besides the novel context-sensitive split function we employ also standard split functions performing tests on X as defined in [24]. 4 Experiments To assess our proposed approach, we have conducted several experiments on the task of pedestrian detection. Detecting pedestrians is very challenging for Hough-voting based methods as they typically exhibit strong articulations of feet and arms, yielding to non-distinctive hypotheses in the Hough space. We evaluated our method on the TUD pedestrian data base [2] in two different ways: First, we show our detection results with training according to the standard protocol using 400 training images (where each image contains a single annotation of a pedestrian) and evaluation on the Campus and Crossing scenes, respectively (Section 4.1). With this experiment we show the improvement over state-of-the-art approaches when learning can be performed with simultaneous knowledge about context information. In a second variation (Section 4.2), we use the images of the Crossing scene (201 images) as a training set. Most images of this scene contain more than four persons with strong overlap and mutual occlusions. However, instead of using the original annotation which covers only pedestrians with at least 50% overlap (1008 bounding boxes), we use the more accurate, pixel-wise ground truth annotations of [23] for the entire scene that includes all persons and consists of 1215 bounding boxes. Please note that this annotation is even more detailed than the one presented in [4] with 1018 bounding boxes. The purpose of the second experiment is to show that our context-sensitive forest can exploit the availability of multiple training instances significantly better than state-of-the-art. The most related work and therefore also the baseline in our experiments is the Hough Forest [9]. To guarantee a fair comparison, we use the same training parameters for [9] and our context sensitive forest: We trained 20 trees and the training data (including horizontally flipped images) was sampled homogeneously per category per image. The patch size was fixed to 30 × 30 and we performed 1600 node tests for finding the best split function parameters per node. The trees were stopped growing when < 7 samples were available. As image features, we used the the first 16 feature channels provided in the publicly available Hough Forest code of [9]. In order to obtain the object detection hypotheses from the Hough space, we use the same Non-maximum suppression (NMS) technique in all our experiments as suggested in [9]. To evaluate the obtained hypotheses, we use the standard PASAL-VOC criterion which requires the mutual overlap between ground truth and detected bounding boxes to be ≥ 50%. The additional parameter of (7) was fixed to σ = 7. 4.1 Evaluation using standard protocol training set The standard training set contains 400 images where each image comes with a single pedestrian annotation. For our experiments, we rescaled the images by a factor of 0.5 and doubled the training image set by including also the horizontally flipped images. We randomly chose 125 training samples per image for foreground and background, resulting in 2 · 400 · 2 · 125 = 200k training samples per tree. For additional comparisons, we provide the results presented in the recent work on joint object detection and segmentation of [23], from which we also provide evaluation results of the Implicit Shape Model (ISM) [16]. However, please note that the results of [23] are based on a different baseline implementation. Moreover, we show the results of [4] when using the provided code and configuration files from the first authors homepage. Unfortunately, we could not reproduce the results of the original paper. First, we discuss the results obtained on the Campus scene. This data set consists of 71 images showing walking pedestrians at severe scale differences and partial occlusions. The ground truth we use has been released with [4] and contains a total number of 314 pedestrians. Figure 3, first row, plot 1 shows the precision-recall curves when using 3 scales (factors 0.3, 0.4, 0.55) for our baseline [9] (blue), results from re-evaluating [4] (cyan, 5 scales), [23] (green) and our ContextSensitive Forest without and with using the priority queue based tree construction (red/magenta). In case of not using the priority queue, we trained the trees according to a breadth-first way. We obtain a performance boost of ≈ 6% in recall at a precision of 90% when using both, context information and the priority based construction of our forest. The second plot in the first row of Figure 3 shows the results when the same forests are tested on the Crossing scene, using the more detailed ground 7 TUD Campus (3 scales) TUD−Crossing (3 scales) 0.9 0.8 0.8 0.7 0.7 0.6 0.6 Precision 1 0.9 Precision 1 0.5 0.4 0.3 0.2 0.1 0 0 0.5 0.4 0.3 Baseline Hough Forest Barinova et al. CVPR’10, 5 scales Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue Riemenschneider et al. ECCV’12 0.1 0.2 0.3 0.4 0.5 Recall 0.6 0.7 0.8 0.2 0.1 0.9 0 0 1 Baseline Hough Forest Barinova et al. CVPR’10 Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue Riemenschneider et al. ECCV’12 (1 scale) Leibe et al. IJCV’08 (1 scale) 0.1 TUD Campus (3 scales) 0.3 0.4 0.5 Recall 0.6 0.7 0.8 0.9 1 0.9 1 1 0.9 0.8 0.8 0.7 0.7 0.6 0.6 Precision 1 0.9 Precision 0.2 TUD Campus (5 scales) 0.5 0.4 0.3 0 0 0.4 0.3 0.2 0.1 0.5 0.2 Baseline Hough Forest Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue 0.1 0.2 0.3 0.4 0.5 Recall 0.6 0.7 0.8 0.1 0.9 1 0 0 Baseline Hough Forest Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue 0.1 0.2 0.3 0.4 0.5 Recall 0.6 0.7 0.8 Figure 3: Precision-Recall Curves for detections, Top row: Standard training (400 images), evaluation on Campus and Crossing (3 scales). Bottom row: Training on Crossing annotations of [23], evaluation on Campus, 3 and 5 scales. Right images: Qualitative examples for Campus (top 2) and Crossing (bottom 2) scenes. (green) correctly found by our method (blue) ground truth (red) wrong association (cyan) missed detection. truth annotations. The data set shows walking pedestrians (Figure 3, right side, last 2 images) with a smaller variation in scale compared to the Campus scene but with strong mutual occlusions and overlaps. The improvement with respect to the baseline is lower (≈ 2% gain at a precision of 90%) and we find similar developments of the curves. However, this comes somewhat expectedly as the training data does not properly reflect the occlusions we actually want to model. 4.2 Evaluation on Campus scene using Crossing scene as training set In our next experiment we trained the forests (same parameters) on the novel annotations of [23] for the Crossing scene. Please note that this reduces the training set to only 201 images (we did not include the flipped images). Qualitative detection results are shown in Figure 3, right side, images 1 and 2. From the first precison-recall curve in the second row of Figure 3 we can see, that the margin between the baseline and our proposed method could be clearly improved (gain of ≈ 9% recall at precision 90%) when evaluating on the same 3 scales. With evaluation on 5 scales (factors 0.34, 0.42, 0.51, 0.65, 0.76) we found a strong increase in the recall, however, at the cost of loosing 2 − 3% of precision below a recall of 60%, as illustrated in the second plot of row 2 in Figure 3. While our method is able to maintain a precision above 90% up to a recall of ≈ 83%, the baseline implementation drops already at a recall of ≈ 20%. 5 Conclusions In this work we have presented Context-Sensitive Decision Forests with application to the object detection problem. Our new forest has the ability to access intermediate prediction (classification and regression) information about all samples of the training set and can therefore learn from contextual information throughout the growing process. This is in contrast to existing random forest methods used for object detection which typically treat training samples in an independent manner. Moreover, we have introduced a novel splitting criterion together with a mode isolation technique, which allows us to (a) perform a priority-driven way of tree growing and (b) install novel context-based test functions to check for mutual object centroid agreements. In our experimental results on pedestrian detection we demonstrated superior performance with respect to state-of-the-art methods and additionally found that our new algorithm can significantly better exploit training data containing multiple training objects. Acknowledgements. Peter Kontschieder acknowledges financial support of the Austrian Science Fund (FWF) from project ’Fibermorph’ with number P22261-N22. 8 References [1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 1997. [2] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In (CVPR), 2008. [3] D. H. Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern Recognition, 13(2), 1981. [4] O. Barinova, V. Lempitsky, and P. Kohli. On detection of multiple object instances using hough transforms. In (CVPR), 2010. [5] A. Bosch, A. Zisserman, and X. Mu˜oz. Image classification using random forests and ferns. In (ICCV), n 2007. [6] L. Breiman. Random forests. In Machine Learning, 2001. [7] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. In Foundations and Trends in Computer Graphics and Vision, volume 7, pages 81–227, 2012. [8] A. Criminisi, J. Shotton, D. Robertson, and E. Konukoglu. Regression forests for efficient anatomy detection and localization in CT scans. In MICCAI-MCV Workshop, 2010. [9] J. Gall and V. Lempitsky. Class-specific hough forests for object detection. In (CVPR), 2009. [10] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. (PAMI), 2011. [11] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 2006. [12] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In (ICCV), 2011. [13] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2009. [14] R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling. (PAMI), 5(3):267–287, 1983. [15] P. Kontschieder, S. Rota Bul` , H. Bischof, and M. Pelillo. Structured class-labels in random forests for o semantic image labelling. In (ICCV), 2011. [16] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. (IJCV), 2008. [17] R. Mar´ e, P. Geurts, J. Piater, and L. Wehenkel. Random subwindows for robust image classification. In e (CVPR), 2005. [18] A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas, and A. Criminisi. Entangled decision forests and their application for semantic segmentation of CT images. In (IPMI), 2011. [19] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. In (NIPS), 2006. [20] M. Pelillo and M. Refice. Learning compatibility coefficients for relaxation labeling processes. (PAMI), 16(9):933–945, 1994. [21] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In (ICCV), 2007. [22] N. Razavi, J. Gall, and L. Van Gool. Scalable multi-class object detection. In (CVPR), 2011. [23] H. Riemenschneider, S. Sternig, M. Donoser, P. M. Roth, and H. Bischof. Hough regions for joining instance localization and segmentation. In (ECCV), 2012. [24] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In (CVPR), 2008. [25] Z. Tu. Auto-context and its application to high-level vision tasks. In (CVPR), 2008. [26] O. Woodford, M. Pham, A. Maki, F. Perbet, and B. Stenger. Demisting the hough transform for 3d shape recognition and registration. In (BMVC), 2011. 9
6 0.51791358 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
7 0.50649607 168 nips-2012-Kernel Latent SVM for Visual Recognition
8 0.4888151 167 nips-2012-Kernel Hyperalignment
9 0.47893092 289 nips-2012-Recognizing Activities by Attribute Dynamics
10 0.47759399 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
11 0.47657564 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity
12 0.47143793 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering
13 0.46993887 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees
14 0.46965799 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification
15 0.46542495 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
16 0.45066053 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection
17 0.44733331 155 nips-2012-Human memory search as a random walk in a semantic network
18 0.43969351 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification
19 0.43782791 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio
20 0.43627045 185 nips-2012-Learning about Canonical Views from Internet Image Collections
topicId topicWeight
[(0, 0.056), (17, 0.019), (21, 0.036), (38, 0.074), (39, 0.01), (42, 0.039), (54, 0.033), (55, 0.073), (64, 0.017), (74, 0.068), (76, 0.094), (80, 0.069), (81, 0.28), (92, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.77642041 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies
Author: Sung J. Hwang, Kristen Grauman, Fei Sha
Abstract: When learning features for complex visual recognition problems, labeled image exemplars alone can be insufficient. While an object taxonomy specifying the categories’ semantic relationships could bolster the learning process, not all relationships are relevant to a given visual classification task, nor does a single taxonomy capture all ties that are relevant. In light of these issues, we propose a discriminative feature learning approach that leverages multiple hierarchical taxonomies representing different semantic views of the object categories (e.g., for animal classes, one taxonomy could reflect their phylogenic ties, while another could reflect their habitats). For each taxonomy, we first learn a tree of semantic kernels, where each node has a Mahalanobis kernel optimized to distinguish between the classes in its children nodes. Then, using the resulting semantic kernel forest, we learn class-specific kernel combinations to select only those relationships relevant to recognize each object class. To learn the weights, we introduce a novel hierarchical regularization term that further exploits the taxonomies’ structure. We demonstrate our method on challenging object recognition datasets, and show that interleaving multiple taxonomic views yields significant accuracy improvements.
2 0.51847875 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts
Author: Francisco Ruiz, Isabel Valera, Carlos Blanco, Fernando Pérez-Cruz
Abstract: The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) database contains a large amount of information, regarding the way of life, medical conditions, etc., of a representative sample of the U.S. population. In this paper, we are interested in seeking the hidden causes behind the suicide attempts, for which we propose to model the subjects using a nonparametric latent model based on the Indian Buffet Process (IBP). Due to the nature of the data, we need to adapt the observation model for discrete random variables. We propose a generative model in which the observations are drawn from a multinomial-logit distribution given the IBP matrix. The implementation of an efficient Gibbs sampler is accomplished using the Laplace approximation, which allows integrating out the weighting factors of the multinomial-logit likelihood model. Finally, the experiments over the NESARC database show that our model properly captures some of the hidden causes that model suicide attempts. 1
3 0.51639175 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks
Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1
4 0.51349264 215 nips-2012-Minimizing Uncertainty in Pipelines
Author: Nilesh Dalvi, Aditya Parameswaran, Vibhor Rastogi
Abstract: In this paper, we consider the problem of debugging large pipelines by human labeling. We represent the execution of a pipeline using a directed acyclic graph of AND and OR nodes, where each node represents a data item produced by some operator in the pipeline. We assume that each operator assigns a confidence to each of its output data. We want to reduce the uncertainty in the output by issuing queries to a human, where a query consists of checking if a given data item is correct. In this paper, we consider the problem of asking the optimal set of queries to minimize the resulting output uncertainty. We perform a detailed evaluation of the complexity of the problem for various classes of graphs. We give efficient algorithms for the problem for trees, and show that, for a general dag, the problem is intractable. 1
5 0.51192886 193 nips-2012-Learning to Align from Scratch
Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-miller
Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method. 1
6 0.51181221 210 nips-2012-Memorability of Image Regions
7 0.50563544 274 nips-2012-Priors for Diversity in Generative Latent Variable Models
8 0.50380421 211 nips-2012-Meta-Gaussian Information Bottleneck
10 0.50052315 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
11 0.50021076 188 nips-2012-Learning from Distributions via Support Measure Machines
12 0.49952081 168 nips-2012-Kernel Latent SVM for Visual Recognition
13 0.49924847 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
14 0.49811789 155 nips-2012-Human memory search as a random walk in a semantic network
15 0.49636897 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images
16 0.49611068 197 nips-2012-Learning with Recursive Perceptual Representations
17 0.49513289 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
18 0.49493855 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition
19 0.49469405 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
20 0.49432164 298 nips-2012-Scalable Inference of Overlapping Communities