nips nips2012 nips2012-207 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wei Bi, James T. Kwok
Abstract: In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In hierarchical classification, the prediction paths may be required to always end at leaf nodes. [sent-4, score-0.436]
2 This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. [sent-5, score-0.843]
3 However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. [sent-6, score-0.496]
4 In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. [sent-7, score-0.204]
5 We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. [sent-8, score-0.225]
6 Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. [sent-10, score-0.063]
7 The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. [sent-11, score-0.347]
8 For example, gene functions are arranged in a tree in the Functional Catalog (FunCat) or as a directed acyclic graph (DAG) in the Gene Ontology (GO) [1]; musical signals are organized in an audio taxonomy [2]; and documents in the Wikipedia hierarchy. [sent-13, score-0.119]
9 Hierarchical classification algorithms, which utilize these hierarchical relationships between labels in making predictions, often lead to better performance than traditional non-hierarchical (flat) approaches. [sent-14, score-0.156]
10 In hierarchical classification, the labels associated with each pattern can be on a path from the root to a leaf (full-path prediction); or stop at an internal node (partial-path prediction [3]). [sent-15, score-0.665]
11 Following the terminology in the recent survey [4], when only full-path predictions are allowed, it is called mandatory leaf node prediction (MLNP); whereas when partial-path predictions are also allowed, it is called non-mandatory leaf node prediction (NMLNP). [sent-16, score-0.923]
12 Depending on the application and how the label hierarchy is generated, either one of these prediction modes may be more relevant. [sent-17, score-0.254]
13 For example, in the taxonomies of musical signals [2] and genes [5], the leaf nodes have much stronger semantic/biological meanings than the internal nodes, and MLNP is more important. [sent-18, score-0.393]
14 Besides, sometimes the label hierarchy is learned from the data, using methods like hierarchical clustering [6], Bayesian network structure learning [7] and label tree methods [8, 9]. [sent-19, score-0.435]
15 In these cases, the internal nodes are only artificial, and MLNP is again more relevant. [sent-20, score-0.103]
16 In this paper, we focus on hierarchical multilabel classification (HMC), which differs from hierarchical multiclass classification in that the labels of each pattern can fall on a union of paths in the hierarchy [10]. [sent-22, score-0.713]
17 1 While there have been a lot of MLNP methods in hierarchical multiclass classification [4], none of these can be easily extended for the more difficult HMC setting. [sent-25, score-0.149]
18 In hierarchical multiclass classification, exactly one subtree is to be pursued; whereas in HMC, one has to decide at each node how many and which subtrees to pursue. [sent-27, score-0.279]
19 (e.g., by adjusting the classification threshold heuristically), it is difficult to ensure that all the prediction paths will end at leaf nodes, and so many partial paths may result. [sent-30, score-0.359]
20 Alternatively, one may perform MLNP by first predicting the number of leaf labels (k) that the test pattern has, and then picking the k leaf labels whose posterior probabilities are the largest. [sent-31, score-0.571]
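As a concrete illustration of this two-stage baseline (not the method proposed in this paper), here is a minimal Python sketch; `leaf_posteriors` and `predict_num_leaves` are hypothetical inputs standing in for whatever probability model and k-estimator is used:

```python
# Hedged sketch of the top-k-leaf baseline: first estimate how many leaf labels k the
# test pattern has, then return the k leaves with the largest marginal posteriors.
# `leaf_posteriors` maps leaf -> estimated p(y_leaf = 1 | x); `predict_num_leaves(x)`
# is an assumed, separately trained predictor of k.

def topk_leaf_baseline(x, leaf_posteriors, predict_num_leaves):
    k = predict_num_leaves(x)
    ranked = sorted(leaf_posteriors.items(), key=lambda kv: kv[1], reverse=True)
    return [leaf for leaf, _ in ranked[:k]]
```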
21 Moreover, the posterior probability computed at each leaf l corresponds to a single prediction path from the root to l. [sent-33, score-0.385]
22 However, the target multilabel in HMC can have multiple paths. [sent-34, score-0.232]
23 Hence, a better approach is to compute the posterior probabilities of all subtrees/subgraphs that have k leaf nodes; and then pick the one with the largest probability. [sent-35, score-0.259]
24 Its main idea is to reduce the hierarchical problem to a non-hierarchical problem by running the (non-hierarchical) multilabel classification method of label-powerset [15] at each level of the hierarchy. [sent-39, score-0.347]
25 Moreover, as it processes the hierarchy level-by-level, it cannot be applied to DAGs, where “levels” are not well-defined. [sent-41, score-0.138]
26 In this paper, we propose an efficient algorithm for MLNP in both tree-structured and DAG-structured hierarchical multilabel classification. [sent-42, score-0.347]
27 The target multilabel is obtained by maximizing the posterior probability among all feasible multilabels. [sent-43, score-0.261]
28 2 Maximum a Posteriori MLNP on Label Trees In this section, we assume that the label hierarchy is a tree T . [sent-49, score-0.257]
29 With a slight abuse of notation, we will also use T to denote the set of all the tree nodes, which are indexed from 0 (for the root) to N. [sent-50, score-0.056]
30 For a node i, denote its parent by pa(i), and its set of children by child(i). [sent-56, score-0.158]
31 y = [y0 , y1 , . . . , yN ] ∈ {0, 1}^{N+1} is the multilabel denoting memberships of x to each of the nodes. [sent-61, score-0.232]
32 For y (or Ω) to respect the tree structure, we require that yi = 1 ⇒ ypa(i) = 1 for any non-root node i ∈ T . [sent-63, score-0.225]
33 We assume that for any group of sibling nodes {i1 , . . . , im }, their labels are conditionally independent given the label of their parent pa(i1 ) and x. [sent-67, score-0.132]
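Spelled out over the whole tree, this sibling-wise conditional independence gives a factorization of the joint posterior. The display below is a reconstruction of what equation (1) presumably looks like, inferred by analogy with the DAG factorization (8) given later; it is not quoted from the paper:

```latex
p(y_0, y_1, \ldots, y_N \mid \mathbf{x})
  = p(y_0 \mid \mathbf{x}) \prod_{i \in T \setminus \{0\}} p\big(y_i \mid y_{\mathrm{pa}(i)}, \mathbf{x}\big)
```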
34 In the experiments, we train a multitask lasso model for each group of sibling nodes, using those training examples whose shared parent is labeled positive. [sent-81, score-0.078]
35 2 Prediction For maximum a posteriori MLNP of a test pattern x, we want to find the multilabel Ω∗ that (i) maximizes the posterior probability in (1); and (ii) respects T . [sent-83, score-0.284]
36 Suppose that it is also known that x has k leaf labels. [sent-84, score-0.23]
37 The prediction problem is then maxΩ p(yΩ = 1, yΩc = 0 | x) (2), subject to: y0 = 1, k of the leaves in L are labeled 1, Ω contains no partial path, and all yi ’s respect the label hierarchy (3). [sent-87, score-0.128]
38 Note that p(yΩ = 1, yΩc = 0 | x) considers all the node labels in the hierarchy simultaneously. [sent-88, score-0.309]
39 In contrast, as discussed in Section 1, existing MLNP methods in hierarchical multiclass/multilabel classification only consider the hierarchy information locally at each node. [sent-89, score-0.253]
40 For a label tree, problem (2) can be rewritten as maxψ Σi∈T wi ψi (4), s.t. Σi∈L ψi = k, ψ0 = 1, ψi ∈ {0, 1} ∀i ∈ T . [sent-93, score-0.063]
41 A multilabel y is k-leaf-sparse if k of the leaf nodes are labeled one. [sent-99, score-0.568]
42 For a pattern x, let its optimal k-leaf-sparse multilabel be Ωk . [sent-101, score-0.232]
43 For example, consider the common approach that trains a binary classifier at each node and recursively predicts from the root to the subtrees. [sent-104, score-0.176]
44 When the classification threshold at each node is high, prediction stops early; whereas when the threshold is lowered, prediction can go further down the hierarchy. [sent-105, score-0.331]
45 Hence, nodes that are labeled positive at a high threshold will always remain labeled positive at a lower threshold, implying the nested approximation property (NAP). [sent-106, score-0.132]
46 Algorithm 1 shows the proposed algorithm, which will be called MAS (MAndatory leaf node prediction on Structures). [sent-109, score-0.413]
47 However, the definition of a supernode and its updating are different. [sent-111, score-0.238]
48 Each node i ∈ T is associated with the weight wi in (6). [sent-112, score-0.172]
49 For each leaf l in L, we create a supernode, which is a subset in T containing all the nodes on the path from l to the root. [sent-114, score-0.337]
50 Each supernode S has a supernode value (SNV), which is defined as SNV(S) = Σi∈S wi . [sent-119, score-0.46]
51 Algorithm 1 MAS (Mandatory leaf node prediction on structures). [sent-120, score-0.413]
52 1: Initialization: Initialize every node (except the root) with ψi ← 0; Ω ← {0}; Create a supernode from each leaf with its ancestors. [sent-121, score-0.569]
53 S∗ (the supernode with the largest SNV) is then assigned, with the ψi ’s of all its constituent nodes set to 1, and Ω is updated accordingly. [sent-123, score-0.08]
54 For each remaining unassigned supernode S, we update its SNV to be the value that it will take if S is merged with Ω. [sent-124, score-0.337]
55 Since each unassigned S contains exactly one leaf and we have a tree structure, this update can be implemented efficiently in O(h²) time, where h is the height of the tree (Algorithm 2). [sent-127, score-0.47]
56 Algorithm 2 Updating the SNV of an unassigned tree supernode S, containing the leaf l. [sent-128, score-0.623]
57 1: node ← l; 2: SNV(S) ← SNV(Ω); 3: repeat 4: SNV(S) ← SNV(S) + wnode ; 5: node ← pa(node); 6: until node ∈ Ω. [sent-129, score-0.422]
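A direct Python transcription of Algorithm 2, assuming a parent map `pa`, a weight dict `weights` holding the wi, the set `omega` of already-assigned nodes, and `snv_omega = SNV(Ω)`; these data structures are our own choice, not the authors' implementation:

```python
# Sketch of Algorithm 2: update the SNV of an unassigned tree supernode containing leaf l.
# Walks from the leaf towards the root, adding w_node, until it meets an assigned node.

def update_tree_snv(leaf, pa, weights, omega, snv_omega):
    node = leaf
    snv = snv_omega                    # start from SNV(Omega)
    while node not in omega:           # the root is always in omega, so this terminates
        snv += weights[node]
        node = pa[node]
    return snv
```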
58 Algorithm 3 Updating the SNV of an unassigned DAG supernode S, containing the leaf l. [sent-130, score-0.567]
59 1: insert l to T ; 2: SNV(S) ← SNV(Ω); 3: repeat 4: node ← find-max(T ); 5: delete node from T ; 6: SNV(S) ← SNV(S) + wnode ; 7: insert nodes in Pa(node)\(Ω ∪ T ) to T ; 8: until T = ∅. [sent-131, score-0.459]
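A sketch of Algorithm 3 in Python, where a max-heap keyed on topological order stands in for the self-balancing BST described below (a substitution on our part); `parents[node]` gives Pa(node) and `topo_index` assigns smaller values to nodes nearer the root, as in the paper's footnote:

```python
import heapq

# Sketch of Algorithm 3: update the SNV of an unassigned DAG supernode containing leaf l.
# T from the pseudocode is emulated with a heap ordered by (negated) topological index.

def update_dag_snv(leaf, parents, topo_index, weights, omega, snv_omega):
    snv = snv_omega
    heap, in_heap = [(-topo_index[leaf], leaf)], {leaf}
    while heap:                                    # until T is empty
        _, node = heapq.heappop(heap)              # find-max w.r.t. topological order
        in_heap.discard(node)
        snv += weights[node]
        for p in parents.get(node, ()):            # insert Pa(node) \ (omega U T)
            if p not in omega and p not in in_heap:
                heapq.heappush(heap, (-topo_index[p], p))
                in_heap.add(p)
    return snv
```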
60 Step 3 takes O(|L|) time; steps 4 and 5 take O(h) time; and updating all the remaining unassigned supernodes takes O(h²|L|) time. [sent-136, score-0.189]
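Putting the pieces together, here is a compact sketch of the greedy selection loop of Algorithm 1. For brevity the SNV of each candidate is recomputed naively from the weights instead of incrementally via Algorithms 2/3; the data structures (a weight dict and a map from each leaf to its supernode's node set) are assumptions of ours:

```python
# Hedged sketch of the MAS greedy loop (Algorithm 1), not the authors' code.
# weights[i] holds w_i; supernodes maps each leaf to the set of nodes on its
# path(s) to the root (including the root); k is the given number of leaf labels.

def mas_greedy(weights, supernodes, k):
    omega = {0}                                    # assigned nodes; the root is always positive
    candidates = {leaf: set(nodes) for leaf, nodes in supernodes.items()}
    for _ in range(k):
        snv_omega = sum(weights[i] for i in omega)
        # SNV each unassigned supernode would take if merged with omega
        snv = {leaf: snv_omega + sum(weights[i] for i in nodes - omega)
               for leaf, nodes in candidates.items()}
        best = max(snv, key=snv.get)               # supernode with the largest SNV
        omega |= candidates.pop(best)              # set psi_i = 1 for its constituent nodes
    return omega                                   # the predicted multilabel Omega*
```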
61 3 MLNP that Minimizes Risk While maximizing the posterior probability minimizes the 0-1 loss, another loss function that has been popularly used in hierarchical classification is the H-loss [12]. [sent-154, score-0.172]
62 However, along each prediction path, H-loss only penalizes the first classification mistake closest to the root. [sent-155, score-0.076]
63 On the other hand, we are more interested in the leaf nodes in MLNP. [sent-156, score-0.31]
64 Hence, we will adopt the symmetric loss instead, which is defined as ℓ(Ω, Ω̊) = |Ω\Ω̊| + |Ω̊\Ω|, where Ω̊ is the true multilabel for the given x, and Ω is the prediction. [sent-157, score-0.26]
65 However, this weights mistakes in any part of the hierarchy equally; whereas in HMC, a mistake that occurs at a higher level of the hierarchy is usually considered more crucial. [sent-158, score-0.299]
66 We thus incorporate the hierarchy structure into ℓ(Ω, Ω̊) by extending it as Σi ci I(i ∈ Ω\Ω̊) + ci I(i ∈ Ω̊\Ω), where c0 = 1, ci = cpa(i)/nsibl(i) as in [3], and nsibl(i) is the number of siblings of i (including i itself). [sent-160, score-0.21]
67 Given a loss function ℓ(·, ·), from Bayesian decision theory, the optimal multilabel Ω∗ is the one that minimizes the expected loss: Ω∗ = arg minΩ ΣΩ̊ ℓ(Ω, Ω̊) p(yΩ̊ = 1, yΩ̊c = 0 | x). [sent-162, score-0.26]
68 With a label tree and the loss function in (7), the optimal Ω∗ that minimizes the expected loss can be obtained by solving (4), but with wi = (ci⁺ + ci⁻) p(yi = 1 | x) − ci⁺. [sent-167, score-0.217]
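A small sketch of how the hierarchy-weighted costs ci and the resulting risk-minimizing weights could be computed. The exact definitions of ci⁺ and ci⁻ (e.g., how they depend on α) are not visible in this extract, so they are taken here as given inputs; treat this as an assumption-laden illustration rather than the paper's formula:

```python
# Sketch: costs c_0 = 1, c_i = c_pa(i) / nsibl(i), and weights
# w_i = (c_i_plus + c_i_minus) * p(y_i = 1 | x) - c_i_plus for problem (4).

def hierarchy_costs(children):
    """children: dict node -> list of child nodes; node 0 is the root."""
    cost = {0: 1.0}
    stack = [0]
    while stack:
        parent = stack.pop()
        kids = children.get(parent, [])
        for child in kids:
            cost[child] = cost[parent] / len(kids)   # nsibl(child) = number of children of parent
            stack.append(child)
    return cost

def risk_weights(posterior, c_plus, c_minus):
    """posterior[i] = p(y_i = 1 | x); c_plus / c_minus are per-node cost dicts."""
    return {i: (c_plus[i] + c_minus[i]) * posterior[i] - c_plus[i] for i in posterior}
```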
69 3 Maximum a Posteriori MLNP on Label DAGs When the label hierarchy is a DAG G, on using the same conditional independence simplification in Section 2, we have p(y0 , y1 , . . . , yN | x) = p(y0 | x) Πi∈G\{0} p(yi | yPa(i) , x), (8) [sent-168, score-0.201]
70 where Pa(i) is the set of parents of node i. [sent-171, score-0.166]
71 The prediction task involves the same optimization problem as in (2). [sent-172, score-0.053]
72 The first one requires that if a node is labeled positive, all its parents must also be positive. [sent-174, score-0.192]
73 In bioinformatics, this is also called the true path rule that governs the DAG-structured GO taxonomy on gene functions. [sent-175, score-0.056]
74 The alternative is that a node can be labeled positive if at least one of its parents is positive. [sent-176, score-0.192]
75 The connection between composite likelihood and various (flat) multilabel classification models is also recently discussed in [21]. [sent-191, score-0.266]
76 With the assumption (9), problem (2) for the label DAG can be rewritten as maxψ Σi∈G wi ψi (10), s.t. Σi∈L ψi = k, ψ0 = 1, ψi ∈ {0, 1} ∀i ∈ G, [sent-195, score-0.105]
77 Σj∈child(i) ψj ≥ 1 ∀i ∈ Lc : ψi = 1, and ψi ≤ ψj ∀j ∈ Pa(i), ∀i ∈ G\{0} (11), where wi = Σl∈child(0) log(1 − pl0) for i = 0; wi = Σj∈Pa(i) (log pij − log(1 − pij)) for i ∈ L; wi = Σj∈Pa(i) (log pij − log(1 − pij)) + Σl∈child(i) log(1 − pli) for i ∈ Lc\{0}; and pij ≡ p(yi = 1 | yj = 1, x) for j ∈ Pa(i). [sent-197, score-0.255]
78 Problem (10) is similar to problem (4), except in the definition of wi and that the hierarchy constraint (11) is more general than (5). [sent-198, score-0.18]
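A sketch of the DAG weight computation as reconstructed above. The grouping of terms into the three cases (i = 0, i ∈ L, i ∈ Lc\{0}) is inferred from the garbled extract and should be checked against the original paper; the pij values are assumed to lie strictly in (0, 1):

```python
import math

# Sketch of the reconstructed DAG weights w_i, with p[i][j] = p(y_i = 1 | y_j = 1, x)
# for j in Pa(i); `leaves` is the set L, `parents`/`children` give Pa(i)/child(i).

def dag_weights(nodes, leaves, parents, children, p):
    w = {}
    for i in nodes:
        parent_term = sum(math.log(p[i][j]) - math.log(1.0 - p[i][j])
                          for j in parents.get(i, ()))
        child_term = sum(math.log(1.0 - p[l][i]) for l in children.get(i, ()))
        if i == 0:                 # root: only the children term
            w[i] = child_term
        elif i in leaves:          # leaf: only the parents term
            w[i] = parent_term
        else:                      # internal non-root node: both terms
            w[i] = parent_term + child_term
    return w
```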
79 Hence, (10) can be solved efficiently as before, except for two modifications: (i) Each initial supernode now contains a leaf and its ancestors along all paths to the root. [sent-204, score-0.477]
80 (ii) Since Pa(i) is a set and the hierarchy is a DAG, updating the SNV gets more complicated. [sent-205, score-0.167]
81 In Algorithm 3, T is a self-balancing binary search tree (BST) that keeps track of the nodes in S\Ω using their topological order1 . [sent-206, score-0.136]
82 To facilitate the checking of whether a node is in Ω (step 7), Ω also stores its nodes in a self-balancing BST. [sent-207, score-0.21]
83 Recall that for a self-balancing BST, the operations of insert, delete, find-max and finding an element all take O(log V ) time, where V ≤ N is the number of nodes in the BST. [sent-208, score-0.08]
84 Hence, updating the SNV of one supernode by Algorithm 3 takes O(N log N ) time. [sent-209, score-0.238]
85 4 Experiments In this section, experiments are performed on a number of benchmark multilabel data sets2 , with both tree- and DAG-structured label hierarchies (Table 1). [sent-212, score-0.33]
86 As pre-processing, we remove examples that contain partial label paths and nodes with fewer than 10 positive examples. [sent-213, score-0.181]
87 At each parent node, we then train a multitask lasso model with logistic loss using the MALSAR package [22]. [sent-214, score-0.08]
88 These NMLNP methods include (i) HBR, which is modified from the hierarchical classifier H-SVM [3], by replacing its base learner SVM with the multitask lasso as for MAS; (ii) CLUS-HMC [1]; and (iii) flat BR [23], which is a popular MLNP method but does not use the hierarchy information. [sent-218, score-0.277]
89 For performance evaluation, we use the hierarchical F-measure (HF) which has been commonly used in hierarchical classification [4]. [sent-219, score-0.23]
90 Next, we compare the methods using the loss in (7), where the relative importance for false positives vs negatives (α) is set to be the ratio of the numbers of negative and positive training labels. [sent-222, score-0.051]
91 As can be seen, even when MAS/MASR misclassifies the image, the hierarchy often helps to keep the prediction close to the true label. [sent-231, score-0.191]
92 As brute-force search is very expensive, experiments are only performed on four of the data sets. (Footnote 1: We number the sorted order such that nodes nearer to the root are assigned smaller values.) [sent-235, score-0.126]
93 5 Conclusion In this paper, we proposed a novel hierarchical multilabel classification (HMC) algorithm for mandatory leaf node prediction. [sent-435, score-0.804]
94 Unlike many hierarchical multilabel/multiclass classification algorithms, it utilizes the global hierarchy information by finding the multilabel that has the largest posterior probability over all the node labels. [sent-436, score-0.644]
95 Experiments performed on a number of real-world data sets demonstrate that the proposed algorithms are computationally simple and more accurate than existing HMC and flat multilabel classification methods. [sent-439, score-0.232]
96 [Figure 1: Hierarchically weighted symmetric loss values (7) for different α’s; panels (a) rcv1subset1, (b) enron, (c) struc(funcat).] [sent-466, score-0.065]
97 [Figure: percentage of instances satisfying NAP (%) versus k; panel (b) pheno(GO).] [sent-594, score-0.104]
98 A survey of hierarchical classification across different application domains. [sent-623, score-0.115]
99 Fast and balanced: Efficient label tree learning for large scale object recognition. [sent-656, score-0.119]
100 Adapting non-hierarchical multilabel classification methods for hierarchical multilabel classification. [sent-694, score-0.579]
wordName wordTfidf (topN-words)
[('funcat', 0.451), ('snv', 0.322), ('mlnp', 0.306), ('mas', 0.242), ('nap', 0.242), ('multilabel', 0.232), ('leaf', 0.23), ('supernode', 0.209), ('masr', 0.161), ('hmc', 0.141), ('hierarchy', 0.138), ('node', 0.13), ('unassigned', 0.128), ('ypa', 0.128), ('hierarchical', 0.115), ('hbr', 0.097), ('mandatory', 0.097), ('go', 0.095), ('pa', 0.093), ('pheno', 0.081), ('nodes', 0.08), ('dag', 0.076), ('eisen', 0.071), ('classi', 0.071), ('metalabeler', 0.064), ('struc', 0.064), ('label', 0.063), ('br', 0.061), ('tree', 0.056), ('prediction', 0.053), ('pij', 0.051), ('child', 0.05), ('dags', 0.049), ('cellcycle', 0.048), ('derisi', 0.048), ('expr', 0.048), ('hom', 0.048), ('nmlnp', 0.048), ('spo', 0.048), ('proposition', 0.047), ('root', 0.046), ('seq', 0.043), ('wi', 0.042), ('labels', 0.041), ('water', 0.04), ('animate', 0.039), ('nested', 0.039), ('yi', 0.039), ('paths', 0.038), ('enron', 0.037), ('church', 0.037), ('parents', 0.036), ('gentile', 0.035), ('cation', 0.035), ('hierarchies', 0.035), ('multiclass', 0.034), ('composite', 0.034), ('musical', 0.034), ('insert', 0.032), ('butterfly', 0.032), ('cerri', 0.032), ('delicious', 0.032), ('malsar', 0.032), ('supernodes', 0.032), ('wipo', 0.032), ('wnode', 0.032), ('lc', 0.031), ('trees', 0.031), ('kong', 0.029), ('updating', 0.029), ('gene', 0.029), ('posterior', 0.029), ('inanimate', 0.028), ('unselected', 0.028), ('bst', 0.028), ('loss', 0.028), ('parent', 0.028), ('hierarchically', 0.028), ('hong', 0.028), ('yn', 0.027), ('path', 0.027), ('insect', 0.026), ('siblings', 0.026), ('taxonomies', 0.026), ('rajan', 0.026), ('satisfying', 0.026), ('labeled', 0.026), ('greedy', 0.025), ('misclassi', 0.025), ('multitask', 0.024), ('mistake', 0.023), ('delete', 0.023), ('tsoumakas', 0.023), ('negatives', 0.023), ('posteriori', 0.023), ('mining', 0.023), ('ci', 0.023), ('bioinformatics', 0.023), ('internal', 0.023), ('hf', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification
Author: Wei Bi, James T. Kwok
Abstract: In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. 1
2 0.14115936 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain
Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1
3 0.13255304 215 nips-2012-Minimizing Uncertainty in Pipelines
Author: Nilesh Dalvi, Aditya Parameswaran, Vibhor Rastogi
Abstract: In this paper, we consider the problem of debugging large pipelines by human labeling. We represent the execution of a pipeline using a directed acyclic graph of AND and OR nodes, where each node represents a data item produced by some operator in the pipeline. We assume that each operator assigns a confidence to each of its output data. We want to reduce the uncertainty in the output by issuing queries to a human, where a query consists of checking if a given data item is correct. In this paper, we consider the problem of asking the optimal set of queries to minimize the resulting output uncertainty. We perform a detailed evaluation of the complexity of the problem for various classes of graphs. We give efficient algorithms for the problem for trees, and show that, for a general dag, the problem is intractable. 1
4 0.12968615 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
Author: Siddharth Gopal, Yiming Yang, Bing Bai, Alexandru Niculescu-mizil
Abstract: A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. An even greater challenge is to do so in a manner that is computationally feasible for large scale problems. This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents; thereby encouraging classes nearby in the hierarchy to share similar model parameters. We present variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle largescale problems with hundreds of thousands of dimensions and tens of thousands of classes. We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach and shows improved performance over the other state-of-the-art hierarchical methods. 1
5 0.085068434 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies
Author: Sung J. Hwang, Kristen Grauman, Fei Sha
Abstract: When learning features for complex visual recognition problems, labeled image exemplars alone can be insufficient. While an object taxonomy specifying the categories’ semantic relationships could bolster the learning process, not all relationships are relevant to a given visual classification task, nor does a single taxonomy capture all ties that are relevant. In light of these issues, we propose a discriminative feature learning approach that leverages multiple hierarchical taxonomies representing different semantic views of the object categories (e.g., for animal classes, one taxonomy could reflect their phylogenic ties, while another could reflect their habitats). For each taxonomy, we first learn a tree of semantic kernels, where each node has a Mahalanobis kernel optimized to distinguish between the classes in its children nodes. Then, using the resulting semantic kernel forest, we learn class-specific kernel combinations to select only those relationships relevant to recognize each object class. To learn the weights, we introduce a novel hierarchical regularization term that further exploits the taxonomies’ structure. We demonstrate our method on challenging object recognition datasets, and show that interleaving multiple taxonomic views yields significant accuracy improvements.
6 0.083560847 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering
7 0.079959743 260 nips-2012-Online Sum-Product Computation Over Trees
8 0.077669129 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables
9 0.075563498 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
10 0.070164219 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
11 0.067843907 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
12 0.062366031 200 nips-2012-Local Supervised Learning through Space Partitioning
13 0.057150751 298 nips-2012-Scalable Inference of Overlapping Communities
14 0.056311455 180 nips-2012-Learning Mixtures of Tree Graphical Models
15 0.054091141 182 nips-2012-Learning Networks of Heterogeneous Influence
16 0.052253645 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
17 0.051340517 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
18 0.04843403 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
19 0.04811275 100 nips-2012-Discriminative Learning of Sum-Product Networks
20 0.046487842 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification
topicId topicWeight
[(0, 0.125), (1, 0.019), (2, -0.022), (3, -0.026), (4, -0.027), (5, -0.073), (6, 0.005), (7, 0.005), (8, -0.159), (9, 0.059), (10, -0.037), (11, 0.048), (12, 0.054), (13, 0.032), (14, -0.046), (15, -0.001), (16, -0.056), (17, 0.028), (18, 0.023), (19, 0.012), (20, -0.11), (21, 0.156), (22, 0.071), (23, 0.026), (24, 0.01), (25, -0.047), (26, 0.003), (27, -0.013), (28, -0.053), (29, -0.025), (30, -0.046), (31, -0.08), (32, 0.024), (33, -0.055), (34, -0.057), (35, 0.008), (36, 0.024), (37, -0.041), (38, -0.017), (39, 0.005), (40, 0.101), (41, 0.002), (42, 0.029), (43, 0.012), (44, 0.009), (45, 0.035), (46, -0.026), (47, 0.051), (48, 0.013), (49, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.92571759 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification
Author: Wei Bi, James T. Kwok
Abstract: In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. 1
2 0.80301857 215 nips-2012-Minimizing Uncertainty in Pipelines
Author: Nilesh Dalvi, Aditya Parameswaran, Vibhor Rastogi
Abstract: In this paper, we consider the problem of debugging large pipelines by human labeling. We represent the execution of a pipeline using a directed acyclic graph of AND and OR nodes, where each node represents a data item produced by some operator in the pipeline. We assume that each operator assigns a confidence to each of its output data. We want to reduce the uncertainty in the output by issuing queries to a human, where a query consists of checking if a given data item is correct. In this paper, we consider the problem of asking the optimal set of queries to minimize the resulting output uncertainty. We perform a detailed evaluation of the complexity of the problem for various classes of graphs. We give efficient algorithms for the problem for trees, and show that, for a general dag, the problem is intractable. 1
3 0.70422089 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
Author: Siddharth Gopal, Yiming Yang, Bing Bai, Alexandru Niculescu-mizil
Abstract: A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. An even greater challenge is to do so in a manner that is computationally feasible for large scale problems. This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents; thereby encouraging classes nearby in the hierarchy to share similar model parameters. We present variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle largescale problems with hundreds of thousands of dimensions and tens of thousands of classes. We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach and shows improved performance over the other state-of-the-art hierarchical methods. 1
4 0.6485216 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification
Author: Nicolò Cesa-bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella
Abstract: We present very efficient active learning algorithms for link classification in signed networks. Our algorithms are motivated by a stochastic model in which edge labels are obtained through perturbations of a initial sign assignment consistent with a two-clustering of the nodes. We provide a theoretical analysis within this model, showing that we can achieve an optimal (to whithin a constant factor) number of mistakes on any graph G = (V, E) such that |E| = Ω(|V |3/2 ) by querying O(|V |3/2 ) edge labels. More generally, we show an algorithm that achieves optimality to within a factor of O(k) by querying at most order of |V | + (|V |/k)3/2 edge labels. The running time of this algorithm is at most of order |E| + |V | log |V |. 1
5 0.64315832 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof
Abstract: In this paper we introduce Context-Sensitive Decision Forests - A new perspective to exploit contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which in combination with a priority based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-ofthe-art methods. 1 Introduction and Related Work In the last years, the random forest framework [1, 6] has become a very popular and powerful tool for classification and regression problems by exhibiting many appealing properties like inherent multi-class capability, robustness to label noise and reduced tendencies to overfitting [7]. They are considered to be close to an ideal learner [13], making them attractive in many areas of computer vision like image classification [5, 17], clustering [19], regression [8] or semantic segmentation [24, 15, 18]. In this work we show how the decision forest algorithm can be extended to include contextual information during learning and inference for classification and regression problems. We focus on applying random forests to object detection, i.e. the problem of localizing multiple instances of a given object class in a test image. This task has been previously addressed in random forests [9], where the trees were modified to learn a mapping between the appearance of an image patch and its relative position to the object category centroid (i.e. center voting information). During inference, the resulting Hough Forest not only performs classification on test samples but also casts probabilistic votes in a generalized Hough-voting space [3] that is subsequently used to obtain object center hypotheses. Ever since, a series of applications such as tracking and action recognition [10], body-joint position estimation [12] and multi-class object detection [22] have been presented. However, Hough Forests typically produce non-distinctive object hypotheses in the Hough space and hence there is the need to perform non-maximum suppression (NMS) for obtaining the final results. While this has been addressed in [4, 26], another shortcoming is that standard (Hough) forests treat samples in a completely independent way, i.e. there is no mechanism that encourages the classifier to perform consistent predictions. Within this work we are proposing that context information can be used to overcome the aforementioned problems. For example, training data for visual learning is often represented by images in form of a (regular) pixel grid topology, i.e. objects appearing in natural images can often be found in a specific context. The importance of contextual information was already highlighted in the 80’s with 1 Figure 1: Top row: Training image, label image, visualization of priority-based growing of tree (the lower, the earlier the consideration during training.). 
Bottom row: Inverted Hough image using [9] and breadth-first training after 6 levels (26 = 64 nodes), Inverted Hough image after growing 64 nodes using our priority queue, Inverted Hough image using priority queue shows distinctive peaks at the end of training. a pioneering work on relaxation labelling [14] and a later work with focus on inference tasks [20] that addressed the issue of learning within the same framework. More recently, contextual information has been used in the field of object class segmentation [21], however, mostly for high-level reasoning in random field models or to resolve contradicting segmentation results. The introduction of contextual information as additional features in low-level classifiers was initially proposed in the Auto-context [25] and Semantic Texton Forest [24] models. Auto-context shows a general approach for classifier boosting by iteratively learning from appearance and context information. In this line of research [18] augmented the feature space for an Entanglement Random Forest with a classification feature, that is consequently refined by the class posterior distributions according to the progress of the trained subtree. The training procedure is allowed to perform tests for specific, contextual label configurations which was demonstrated to significantly improve the segmentation results. However, the In this paper we are presenting Context-Sensitve Decision Forests - A novel and unified interpretation of Hough Forests in light of contextual sensitivity. Our work is inspired by Auto-Context and Entanglement Forests, but instead of providing only posterior classification results from an earlier level of the classifier construction during learning and testing, we additionally provide regression (voting) information as it is used in Hough Forests. The second core contribution of our work is related to how we grow the trees: Instead of training them in a depth- or breadth-first way, we propose a priority-based construction (which could actually consider depth- or breadth-first as particular cases). The priority is determined by the current training error, i.e. we first grow the parts of the tree where we experience higher error. To this end, we introduce a unified splitting criterion that estimates the joint error of classification and regression. The consequence of using our priority-based training are illustrated in Figure 1: Given the training image with corresponding label image (top row, images 1 and 2), the tree first tries to learn the foreground samples as shown in the color-coded plot (top row, image 3, colors correspond to index number of nodes in the tree). The effects on the intermediate prediction quality are shown in the bottom row for the regression case: The first image shows the regression quality after training a tree with 6 levels (26 = 64 nodes) in a breadth-first way while the second image shows the progress after growing 64 nodes according to the priority based training. Clearly, the modes for the center hypotheses are more distinctive which in turn yields to more accurate intermediate regression information that can be used for further tree construction. Our third contribution is a new family of split functions that allows to learn from training images containing multiple training instances as shown for the pedestrians in the example. 
We introduce a test that checks the centroid compatibility for pairs of training samples taken from the context, based on the intermediate classification and regression derived as described before. To assess our contributions, we performed several experiments on the challenging TUD pedestrian data set [2], yielding a significant improvement of 9% in the recall at 90% precision rate in comparison to standard Hough Forests, when learning from crowded pedestrian images. 2 2 Context-Sensitive Decision Trees This section introduces the general idea behind the context-sensitive decision forest without references to specific applications. Only in Section 3 we show a particular application to the problem of object detection. After showing some basic notational conventions that are used in the paper, we provide a section that revisits the random forest framework for classification and regression tasks from a joint perspective, i.e. a theory allowing to consider e.g. [1, 11] and [9] in a unified way. Starting from this general view we finally introduce the context-sensitive forests in 2.2. Notations. In the paper we denote vectors using boldface lowercase (e.g. d, u, v) and sets by using uppercase calligraphic (e.g. X , Y) symbols. The sets of real, natural and integer numbers are denoted with R, N and Z as usually. We denote by 2X the power set of X and by 1 [P ] the indicator function returning 1 or 0 according to whether the proposition P is true or false. Moreover, with P(Y) we denote the set of probability distributions having Y as sample space and we implicitly assume that some σ-algebra is defined on Y. We denote by δ(x) the Dirac delta function. Finally, Ex∼Q [f (x)] denotes the expectation of f (x) with respect to x sampled according to distribution Q. 2.1 Random Decision Forests for joint classification and regression A (binary) decision tree is a tree-structured predictor1 where, starting from the root, a sample is routed until it reaches a leaf where the prediction takes place. At each internal node of the tree the decision is taken whether the sample should be forwarded to the left or right child, according to a binary-valued function. In formal terms, let X denote the input space, let Y denote the output space and let T dt be the set of decision trees. In its simplest form a decision tree consists of a single node (a leaf ) and is parametrized by a probability distribution Q ∈ P(Y) which represents the posterior probability of elements in Y given any data sample reaching the leaf. We denote this (admittedly rudimentary) tree as L F (Q) ∈ T td . Otherwise, a decision tree consists of a node with a left and a right sub-tree. This node is parametrized by a split function φ : X → {0, 1}, which determines whether to route a data sample x ∈ X reaching it to the left decision sub-tree tl ∈ T dt (if φ(x) = 0) or to the right one tr ∈ T dt (if φ(x) = 1). We denote such a tree as N D (φ, tl , tr ) ∈ T td . Finally, a decision forest is an ensemble F ⊆ T td of decision trees which makes a prediction about a data sample by averaging over the single predictions gathered from all trees. Inference. Given a decision tree t ∈ T dt , the associated posterior probability of each element in Y given a sample x ∈ X is determined by finding the probability distribution Q parametrizing the leaf that is reached by x when routed along the tree. 
This is compactly presented with the following definition of P (y|x, t), which is inductive in the structure of t: if t = L F (Q) Q(y) P (y | x, t ) = P (y | x, tl ) if t = N D (φ, tl , tr ) and φ(x) = 0 (1) P (y | x, tr ) if t = N D (φ, tl , tr ) and φ(x) = 1 . Finally, the combination of the posterior probabilities derived from the trees in a forest F ⊆ T dt can be done by an averaging operation [6], yielding a single posterior probability for the whole forest: P (y|x, F) = 1 |F| P (y|x, t) . (2) t∈F Randomized training. A random forest is created by training a set of random decision trees independently on random subsets of the training data D ⊆ X ×Y. The training procedure for a single decision tree heuristically optimizes a set of parameters like the tree structure, the split functions at the internal nodes and the density estimates at the leaves in order to reduce the prediction error on the training data. In order to prevent overfitting problems, the search space of possible split functions is limited to a random set and a minimum number of training samples is required to grow a leaf node. During the training procedure, each new node is fed with a set of training samples Z ⊆ D. If some stopping condition holds, depending on Z, the node becomes a leaf and a density on Y is estimated based on Z. Otherwise, an internal node is grown and a split function is selected from a pool of random ones in a way to minimize some sort of training error on Z. The selected split function induces a partition 1 we use the term predictor because we will jointly consider classification and regression. 3 of Z into two sets, which are in turn becoming the left and right childs of the current node where the training procedure is continued, respectively. We will now write this training procedure in more formal terms. To this end we introduce a function π(Z) ∈ P(Y) providing a density on Y estimated from the training data Z ⊆ D and a loss function L(Z | Q) ∈ R penalizing wrong predictions on the training samples in Z, when predictions are given according to a distribution Q ∈ P(Y). The loss function L can be further decomposed in terms of a loss function (·|Q) : Y → R acting on each sample of the training set: L(Z | Q) = (y | Q) . (3) (x,y)∈Z Also, let Φ(Z) be a set of split functions randomly generated for a training set Z and given a split φ function φ ∈ Φ(Z), we denote by Zlφ and Zr the sets identified by splitting Z according to φ, i.e. Zlφ = {(x, y) ∈ Z : φ(x) = 0} and φ Zr = {(x, y) ∈ Z : φ(x) = 1} . We can now summarize the training procedure in terms of a recursive function g : 2X ×Y → T , which generates a random decision tree from a training set given as argument: g(Z) = L F (π(Z)) ND if some stopping condition holds φ φ, g(Zlφ ), g(Zr ) otherwise . (4) Here, we determine the optimal split function φ in the pool Φ(Z) as the one minimizing the loss we incur as a result of the node split: φ φ ∈ arg min L(Zlφ ) + L(Zr ) : φ ∈ Φ(Z) (5) where we compactly write L(Z) for L(Z|π(Z)), i.e. the loss on Z obtained with predictions driven by π(Z). A typical split function selection criterion commonly adopted for classification and regression is information gain. The equivalent counterpart in terms of loss can be obtained by using a log-loss, i.e. (y|Q) = − log(Q(y)). A further widely used criterion is based on Gini impurity, which can be expressed in this setting by using (y|Q) = 1 − Q(y). 
Finally, the stopping condition that is used in (4) to determine whether to create a leaf or to continue branching the tree typically consists in checking |Z|, i.e. the number of training samples at the node, or the loss L(Z) are below some given thresholds, or if a maximum depth is reached. 2.2 Context-sensitive decision forests A context-sensitive (CS) decision tree is a decision tree in which split functions are enriched with the ability of testing contextual information of a sample, before taking a decision about where to route it. We generate contextual information at each node of a decision tree by exploiting a truncated version of the same tree as a predictor. This idea is shared with [18], however, we introduce some novelties by tackling both, classification and regression problems in a joint manner and by leaving a wider flexibility in the tree truncation procedure. We denote the set of CS decision trees as T . The main differences characterizing a CS decision tree t ∈ T compared with a standard decision tree are the following: a) every node (leaves and internal nodes) of t has an associated probability distribution Q ∈ P(Y) representing the posterior probability of an element in Y given any data sample reaching it; b) internal nodes are indexed with distinct natural numbers n ∈ N in a way to preserve the property that children nodes have a larger index compared to their parent node; c) the split function at each internal node, denoted by ϕ(·|t ) : X → {0, 1}, is bound to a CS decision tree t ∈ T , which is a truncated version of t and can be used to compute intermediate, contextual information. Similar to Section 2.1 we denote by L F (Q) ∈ T the simplest CS decision tree consisting of a single leaf node parametrized by the distribution Q, while we denote by N D (n, Q, ϕ, tl , tr ) ∈ T , the rest of the trees consisting of a node having a left and a right sub-tree, denoted by tl , tr ∈ T respectively, and being parametrized by the index n, a probability distribution Q and the split function ϕ as described above. As shown in Figure 2, the truncation of a CS decision tree at each node is obtained by exploiting the indexing imposed on the internal nodes of the tree. Given a CS decision tree t ∈ T and m ∈ N, 4 1 1 4 2 3 6 2 5 4 3 (b) The truncated version t(<5) (a) A CS decision tree t Figure 2: On the left, we find a CS decision tree t, where only the internal nodes are indexed. On the right, we see the truncated version t(<5) of t, which is obtained by converting to leaves all nodes having index ≥ 5 (we marked with colors the corresponding node transformations). we denote by t( < τ 2 In the experiments conducted, we never exceeded 10 iterations for finding a mode. 6 (8) where Pj = P (·|(u + hj , I), t), with j = 1, 2, are the posterior probabilities obtained from tree t given samples at position u+h1 and u+h2 of image I, respectively. Please note that this test should not be confused with the regression split criterion in [9], which tries to partition the training set in a way to group examples with similar voting direction and length. Besides the novel context-sensitive split function we employ also standard split functions performing tests on X as defined in [24]. 4 Experiments To assess our proposed approach, we have conducted several experiments on the task of pedestrian detection. Detecting pedestrians is very challenging for Hough-voting based methods as they typically exhibit strong articulations of feet and arms, yielding to non-distinctive hypotheses in the Hough space. 
We evaluated our method on the TUD pedestrian data base [2] in two different ways: First, we show our detection results with training according to the standard protocol using 400 training images (where each image contains a single annotation of a pedestrian) and evaluation on the Campus and Crossing scenes, respectively (Section 4.1). With this experiment we show the improvement over state-of-the-art approaches when learning can be performed with simultaneous knowledge about context information. In a second variation (Section 4.2), we use the images of the Crossing scene (201 images) as a training set. Most images of this scene contain more than four persons with strong overlap and mutual occlusions. However, instead of using the original annotation which covers only pedestrians with at least 50% overlap (1008 bounding boxes), we use the more accurate, pixel-wise ground truth annotations of [23] for the entire scene that includes all persons and consists of 1215 bounding boxes. Please note that this annotation is even more detailed than the one presented in [4] with 1018 bounding boxes. The purpose of the second experiment is to show that our context-sensitive forest can exploit the availability of multiple training instances significantly better than state-of-the-art. The most related work and therefore also the baseline in our experiments is the Hough Forest [9]. To guarantee a fair comparison, we use the same training parameters for [9] and our context sensitive forest: We trained 20 trees and the training data (including horizontally flipped images) was sampled homogeneously per category per image. The patch size was fixed to 30 × 30 and we performed 1600 node tests for finding the best split function parameters per node. The trees were stopped growing when < 7 samples were available. As image features, we used the the first 16 feature channels provided in the publicly available Hough Forest code of [9]. In order to obtain the object detection hypotheses from the Hough space, we use the same Non-maximum suppression (NMS) technique in all our experiments as suggested in [9]. To evaluate the obtained hypotheses, we use the standard PASAL-VOC criterion which requires the mutual overlap between ground truth and detected bounding boxes to be ≥ 50%. The additional parameter of (7) was fixed to σ = 7. 4.1 Evaluation using standard protocol training set The standard training set contains 400 images where each image comes with a single pedestrian annotation. For our experiments, we rescaled the images by a factor of 0.5 and doubled the training image set by including also the horizontally flipped images. We randomly chose 125 training samples per image for foreground and background, resulting in 2 · 400 · 2 · 125 = 200k training samples per tree. For additional comparisons, we provide the results presented in the recent work on joint object detection and segmentation of [23], from which we also provide evaluation results of the Implicit Shape Model (ISM) [16]. However, please note that the results of [23] are based on a different baseline implementation. Moreover, we show the results of [4] when using the provided code and configuration files from the first authors homepage. Unfortunately, we could not reproduce the results of the original paper. First, we discuss the results obtained on the Campus scene. This data set consists of 71 images showing walking pedestrians at severe scale differences and partial occlusions. 
The ground truth we use has been released with [4] and contains a total number of 314 pedestrians. Figure 3, first row, plot 1 shows the precision-recall curves when using 3 scales (factors 0.3, 0.4, 0.55) for our baseline [9] (blue), results from re-evaluating [4] (cyan, 5 scales), [23] (green) and our ContextSensitive Forest without and with using the priority queue based tree construction (red/magenta). In case of not using the priority queue, we trained the trees according to a breadth-first way. We obtain a performance boost of ≈ 6% in recall at a precision of 90% when using both, context information and the priority based construction of our forest. The second plot in the first row of Figure 3 shows the results when the same forests are tested on the Crossing scene, using the more detailed ground 7 TUD Campus (3 scales) TUD−Crossing (3 scales) 0.9 0.8 0.8 0.7 0.7 0.6 0.6 Precision 1 0.9 Precision 1 0.5 0.4 0.3 0.2 0.1 0 0 0.5 0.4 0.3 Baseline Hough Forest Barinova et al. CVPR’10, 5 scales Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue Riemenschneider et al. ECCV’12 0.1 0.2 0.3 0.4 0.5 Recall 0.6 0.7 0.8 0.2 0.1 0.9 0 0 1 Baseline Hough Forest Barinova et al. CVPR’10 Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue Riemenschneider et al. ECCV’12 (1 scale) Leibe et al. IJCV’08 (1 scale) 0.1 TUD Campus (3 scales) 0.3 0.4 0.5 Recall 0.6 0.7 0.8 0.9 1 0.9 1 1 0.9 0.8 0.8 0.7 0.7 0.6 0.6 Precision 1 0.9 Precision 0.2 TUD Campus (5 scales) 0.5 0.4 0.3 0 0 0.4 0.3 0.2 0.1 0.5 0.2 Baseline Hough Forest Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue 0.1 0.2 0.3 0.4 0.5 Recall 0.6 0.7 0.8 0.1 0.9 1 0 0 Baseline Hough Forest Proposed Context−Sensitive, No Priority Queue Proposed Context−Sensitive, With Priority Queue 0.1 0.2 0.3 0.4 0.5 Recall 0.6 0.7 0.8 Figure 3: Precision-Recall Curves for detections, Top row: Standard training (400 images), evaluation on Campus and Crossing (3 scales). Bottom row: Training on Crossing annotations of [23], evaluation on Campus, 3 and 5 scales. Right images: Qualitative examples for Campus (top 2) and Crossing (bottom 2) scenes. (green) correctly found by our method (blue) ground truth (red) wrong association (cyan) missed detection. truth annotations. The data set shows walking pedestrians (Figure 3, right side, last 2 images) with a smaller variation in scale compared to the Campus scene but with strong mutual occlusions and overlaps. The improvement with respect to the baseline is lower (≈ 2% gain at a precision of 90%) and we find similar developments of the curves. However, this comes somewhat expectedly as the training data does not properly reflect the occlusions we actually want to model. 4.2 Evaluation on Campus scene using Crossing scene as training set In our next experiment we trained the forests (same parameters) on the novel annotations of [23] for the Crossing scene. Please note that this reduces the training set to only 201 images (we did not include the flipped images). Qualitative detection results are shown in Figure 3, right side, images 1 and 2. From the first precison-recall curve in the second row of Figure 3 we can see, that the margin between the baseline and our proposed method could be clearly improved (gain of ≈ 9% recall at precision 90%) when evaluating on the same 3 scales. 
With evaluation on 5 scales (factors 0.34, 0.42, 0.51, 0.65, 0.76) we found a strong increase in the recall, however, at the cost of loosing 2 − 3% of precision below a recall of 60%, as illustrated in the second plot of row 2 in Figure 3. While our method is able to maintain a precision above 90% up to a recall of ≈ 83%, the baseline implementation drops already at a recall of ≈ 20%. 5 Conclusions In this work we have presented Context-Sensitive Decision Forests with application to the object detection problem. Our new forest has the ability to access intermediate prediction (classification and regression) information about all samples of the training set and can therefore learn from contextual information throughout the growing process. This is in contrast to existing random forest methods used for object detection which typically treat training samples in an independent manner. Moreover, we have introduced a novel splitting criterion together with a mode isolation technique, which allows us to (a) perform a priority-driven way of tree growing and (b) install novel context-based test functions to check for mutual object centroid agreements. In our experimental results on pedestrian detection we demonstrated superior performance with respect to state-of-the-art methods and additionally found that our new algorithm can significantly better exploit training data containing multiple training objects. Acknowledgements. Peter Kontschieder acknowledges financial support of the Austrian Science Fund (FWF) from project ’Fibermorph’ with number P22261-N22. 8 References [1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 1997. [2] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In (CVPR), 2008. [3] D. H. Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern Recognition, 13(2), 1981. [4] O. Barinova, V. Lempitsky, and P. Kohli. On detection of multiple object instances using hough transforms. In (CVPR), 2010. [5] A. Bosch, A. Zisserman, and X. Mu˜oz. Image classification using random forests and ferns. In (ICCV), n 2007. [6] L. Breiman. Random forests. In Machine Learning, 2001. [7] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. In Foundations and Trends in Computer Graphics and Vision, volume 7, pages 81–227, 2012. [8] A. Criminisi, J. Shotton, D. Robertson, and E. Konukoglu. Regression forests for efficient anatomy detection and localization in CT scans. In MICCAI-MCV Workshop, 2010. [9] J. Gall and V. Lempitsky. Class-specific hough forests for object detection. In (CVPR), 2009. [10] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. (PAMI), 2011. [11] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 2006. [12] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In (ICCV), 2011. [13] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2009. [14] R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling. (PAMI), 5(3):267–287, 1983. [15] P. Kontschieder, S. Rota Bul` , H. Bischof, and M. Pelillo. Structured class-labels in random forests for o semantic image labelling. In (ICCV), 2011. 
6 0.62854213 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering
7 0.58737254 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
8 0.57540846 182 nips-2012-Learning Networks of Heterogeneous Influence
9 0.55374551 260 nips-2012-Online Sum-Product Computation Over Trees
10 0.53528845 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
11 0.51390636 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
12 0.51002884 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
13 0.50610715 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables
14 0.50166303 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees
15 0.48511568 180 nips-2012-Learning Mixtures of Tree Graphical Models
16 0.48067605 100 nips-2012-Discriminative Learning of Sum-Product Networks
17 0.47909683 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies
18 0.47472969 298 nips-2012-Scalable Inference of Overlapping Communities
19 0.4559716 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
20 0.45365378 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
topicId topicWeight
[(0, 0.071), (38, 0.095), (42, 0.016), (54, 0.017), (55, 0.03), (74, 0.047), (76, 0.096), (80, 0.125), (91, 0.342), (92, 0.042)]
simIndex simValue paperId paperTitle
1 0.72872001 46 nips-2012-Assessing Blinding in Clinical Trials
Author: Ognjen Arandjelovic
Abstract: The interaction between the patient’s expected outcome of an intervention and the inherent effects of that intervention can have extraordinary effects. Thus in clinical trials an effort is made to conceal the nature of the administered intervention from the participants in the trial i.e. to blind it. Yet, in practice perfect blinding is impossible to ensure or even verify. The current standard is to follow up the trial with an auxiliary questionnaire, which allows trial participants to express their belief concerning the assigned intervention and which is used to compute a measure of the extent of blinding in the trial. If the estimated extent of blinding exceeds a threshold the trial is deemed sufficiently blinded; otherwise, the trial is deemed to have failed. In this paper we make several important contributions. Firstly, we identify a series of fundamental problems of the aforesaid practice and discuss them in the context of the most commonly used blinding measures. Secondly, motivated by the highlighted problems, we formulate a novel method for handling imperfectly blinded trials. We too adopt a post-trial feedback questionnaire but interpret the collected data using an original approach, fundamentally different from those previously proposed. Unlike previous approaches, ours is void of any ad hoc free parameters, is robust to small changes in auxiliary data and is not predicated on any strong assumptions used to interpret participants’ feedback. 1
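The "current standard" criticized in this abstract computes a scalar blinding measure from the post-trial questionnaire and compares it against a threshold. The toy Python sketch below illustrates what such a computation can look like (a chance-corrected correct-guess rate per trial arm); it is neither a published blinding index nor the method proposed in the paper, and all names and numbers are made up for illustration.

def toy_blinding_measure(correct, incorrect, dont_know):
    """Chance-corrected correct-guess rate for one trial arm.

    Returns a value in [-1, 1]: 0 means guessing at chance level,
    1 means everyone identified their assignment, and negative values
    mean systematic wrong guessing. Toy illustration only.
    """
    guesses = correct + incorrect
    total = guesses + dont_know
    if guesses == 0:
        return 0.0
    p_correct = correct / guesses
    # Scale by the fraction of participants who ventured a guess at all.
    return (2.0 * p_correct - 1.0) * (guesses / total)

# Example: 60 of 100 participants guessed correctly, 30 incorrectly, 10 abstained.
score = toy_blinding_measure(60, 30, 10)
print(round(score, 3))  # 0.3; a pre-chosen threshold would then declare "blinded" or not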
same-paper 2 0.71034646 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification
Author: Wei Bi, James T. Kwok
Abstract: In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. 1
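The abstract describes maximizing the joint posterior of all node labels with a simple greedy algorithm. The Python sketch below is not the authors' algorithm; it is a toy greedy procedure under strong simplifying assumptions (a tree-structured hierarchy, additive per-node log-odds, and a fixed number k of predicted leaves), meant only to illustrate how a greedy leaf selection that respects the hierarchy might look.

import math

def toy_greedy_mlnp(parent, log_odds, leaves, k):
    """Greedily pick k leaves of a tree so every prediction path ends at a leaf.

    parent:   dict child -> parent (the root maps to None)
    log_odds: dict node -> log-odds that the node label is positive
    leaves:   iterable of leaf nodes
    k:        number of leaves to return
    Toy sketch only; assumes node scores simply add up.
    """
    def path_to_root(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path

    selected_nodes, selected_leaves = set(), []
    for _ in range(k):
        best_leaf, best_gain = None, -math.inf
        for leaf in leaves:
            if leaf in selected_leaves:
                continue
            gain = sum(log_odds[n] for n in path_to_root(leaf)
                       if n not in selected_nodes)
            if gain > best_gain:
                best_leaf, best_gain = leaf, gain
        if best_leaf is None:
            break
        selected_leaves.append(best_leaf)
        selected_nodes.update(path_to_root(best_leaf))
    return selected_leaves

# Tiny example: root r with children a, b; leaves a1, a2 under a and b1 under b.
parent = {"r": None, "a": "r", "b": "r", "a1": "a", "a2": "a", "b1": "b"}
log_odds = {"r": 0.0, "a": 1.2, "b": -0.5, "a1": 0.8, "a2": 0.6, "b1": 0.9}
print(toy_greedy_mlnp(parent, log_odds, ["a1", "a2", "b1"], k=2))  # ['a1', 'a2']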
3 0.58213174 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer
Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R-MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1
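To illustrate the general idea of driving exploration by an empirical estimate of learning progress, the toy Python class below keeps a short window of prediction errors per state-action pair and uses the recent drop in error as an exploration bonus. The class name, window size, and the optimistic default for rarely-visited pairs are assumptions made for illustration; this is not the estimator proposed in the paper.

from collections import defaultdict, deque

class LearningProgressBonus:
    """Toy exploration bonus: the recent drop in empirical prediction error.

    For each (state, action) pair we keep a short window of model prediction
    errors; the bonus is how much the average error of the newer half of the
    window improved over the older half. Generic illustration only.
    """
    def __init__(self, window=10):
        self.errors = defaultdict(lambda: deque(maxlen=window))

    def update(self, state, action, prediction_error):
        self.errors[(state, action)].append(float(prediction_error))

    def bonus(self, state, action):
        errs = list(self.errors[(state, action)])
        if len(errs) < 4:
            return 1.0  # stay optimistic for rarely-tried pairs
        half = len(errs) // 2
        older, newer = errs[:half], errs[half:]
        progress = sum(older) / len(older) - sum(newer) / len(newer)
        return max(progress, 0.0)

lp = LearningProgressBonus()
for e in [0.9, 0.8, 0.7, 0.5, 0.4, 0.3]:
    lp.update("s0", "a0", e)
print(round(lp.bonus("s0", "a0"), 3))  # 0.4; errors are shrinking, so keep exploring here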
4 0.50173295 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
Author: Anima Anandkumar, Ragupathyraj Valluvan
Abstract: Graphical model selection refers to the problem of estimating the unknown graph structure given observations at the nodes in the model. We consider a challenging instance of this problem when some of the nodes are latent or hidden. We characterize conditions for tractable graph estimation and develop efficient methods with provable guarantees. We consider the class of Ising models Markov on locally tree-like graphs, which are in the regime of correlation decay. We propose an efficient method for graph estimation, and establish its structural consistency when the number of samples n scales as n = Ω(θmin^(−δη(η+1)−2) log p), where θmin is the minimum edge potential, δ is the depth (i.e., distance from a hidden node to the nearest observed nodes), and η is a parameter which depends on the minimum and maximum node and edge potentials in the Ising model. The proposed method is practical to implement and provides flexibility to control the number of latent variables and the cycle lengths in the output graph. We also present necessary conditions for graph estimation by any method and show that our method nearly matches the lower bound on sample requirements. Keywords: Graphical model selection, latent variables, quartet methods, locally tree-like graphs. 1
5 0.49805629 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
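As a rough illustration of the idea of stacking linear classifiers and feeding a random projection of each layer's predictions back into the features, here is a toy Python sketch. It uses a least-squares linear model in place of a linear SVM and a single random projection per layer; the recursion, parameter names, and scale factor are assumptions made for illustration and do not reproduce the authors' exact formulation.

import numpy as np

def stacked_linear_layers(X, y, n_layers=3, scale=0.1, seed=0):
    """Toy stack of linear classifiers with random-projection feedback.

    Each layer fits a linear model by least squares on +/-1 labels; its
    scores are passed through tanh, randomly projected back to the input
    dimension, and added to the original features before the next layer
    is fit. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    X_cur = X.astype(float)
    for _ in range(n_layers):
        A = np.hstack([X_cur, np.ones((len(X_cur), 1))])  # add a bias column
        w, *_ = np.linalg.lstsq(A, y, rcond=None)         # linear "classifier"
        scores = A @ w
        R = rng.normal(size=(1, X.shape[1]))              # random projection
        X_cur = X + scale * np.tanh(scores)[:, None] @ R  # feed scores back in
    return scores

# Two overlapping Gaussian blobs, labels in {-1, +1}.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
scores = stacked_linear_layers(X, y)
print("training accuracy:", np.mean(np.sign(scores) == y))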
6 0.49749047 200 nips-2012-Local Supervised Learning through Space Partitioning
7 0.49462837 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
8 0.49354851 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
9 0.491974 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
10 0.49108446 65 nips-2012-Cardinality Restricted Boltzmann Machines
11 0.49056348 251 nips-2012-On Lifting the Gibbs Sampling Algorithm
12 0.48987511 168 nips-2012-Kernel Latent SVM for Visual Recognition
13 0.48886213 100 nips-2012-Discriminative Learning of Sum-Product Networks
14 0.48784727 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models
15 0.48782232 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
16 0.4873578 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
17 0.48697534 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
18 0.48637095 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
19 0.48506001 274 nips-2012-Priors for Diversity in Generative Latent Variable Models
20 0.48450842 192 nips-2012-Learning the Dependency Structure of Latent Factors