nips nips2010 nips2010-135 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. [sent-3, score-0.224]
2 We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. [sent-5, score-1.208]
3 We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. [sent-6, score-0.605]
4 Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. [sent-7, score-1.332]
5 Current multi-class applications such as web advertising [6], textual document categorization [11] or image annotation [12] have tens or hundreds of thousands of classes, and these datasets are still growing. [sent-9, score-0.288]
6 Our approach rests on two main ideas: firstly, an algorithm for learning a label tree: each node makes a prediction of the subset of labels to be considered by its children, thus decreasing the number of labels k at a logarithmic rate until a prediction is reached. [sent-21, score-0.788]
7 We provide a novel algorithm that both learns the sets of labels at each node, and the predictors at the nodes to optimize the overall tree loss, and show that this approach is superior to existing tree-based approaches [7, 6] which typically lose accuracy compared to O(kd) approaches. [sent-22, score-0.86]
8 Balanced label trees have O(d log k) complexity, as the predictor at each node is still only O(d). (Algorithm 1, the Label Tree Prediction Algorithm, takes as input a test example x and the tree parameters T.) [sent-23, score-0.686]
9 Our second main idea is to learn an embedding of the labels into a space of dimension d_e that again still optimizes the overall tree loss. [sent-30, score-1.079]
10 Hence, we are required at test time to: (1) map the test example into the label embedding space with cost O(d·d_e) and then (2) predict using the label tree, resulting in our overall cost O(d_e(log k + d)). [sent-31, score-1.8]
11 We also show that our label embedding approach outperforms other recently proposed label embedding approaches such as compressed sensing [17]. [sent-32, score-1.674]
12 Label trees are discussed and label tree learning algorithms are proposed in Section 2. [sent-34, score-1.054]
13 2 Label Trees. A label tree is a tree T = (N, E, F, L) with n + 1 indexed nodes N = {0, ..., n}, [sent-39, score-1.51]
14 a set of edges E = {(p_1, c_1), ..., (p_|E|, c_|E|)} which are ordered pairs of parent and child node indices, label predictors F = {f_1, ..., f_n}, and label sets L = {ℓ_0, ..., ℓ_n}. [sent-42, score-0.629]
15 The label sets indicate the set of labels to which a point should belong if it arrives at the given node, and progress from generic to specific along the tree, i.e. [sent-51, score-0.498]
16 the root label set contains all classes, |ℓ_0| = k, and each child label set is a subset of its parent label set, with ℓ_p = ∪_{(p,c)∈E} ℓ_c. [sent-53, score-1.355]
17 We differentiate between disjoint label trees, where there are only k leaf nodes, one per class, and hence any two nodes i and j at the same depth cannot share any labels, ℓ_i ∩ ℓ_j = ∅, and joint label trees that can have more than k leaf nodes. [sent-54, score-1.275]
18 Classifying an example with the label tree is achieved by applying Algorithm 1. [sent-55, score-0.928]
19 Prediction begins at the root node (s = 0) and for each edge leading to a child (s, c) ∈ E one computes the score of the label predictor f_c(x), which predicts whether the example x belongs to the set of labels ℓ_c. [sent-56, score-0.753]
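A minimal sketch of this prediction procedure (Algorithm 1), assuming the tree is stored as a children list per node, one linear predictor weight vector per node, and one class per leaf; the function and variable names here are illustrative, not the authors' implementation.

    import numpy as np

    def label_tree_predict(x, children, weights, leaf_label, root=0):
        """Walk a label tree: at each node follow the child c with the
        highest label-predictor score f_c(x) = w_c . x until a leaf is
        reached, then return that leaf's class."""
        s = root
        while children[s]:                                   # internal node
            scores = [np.dot(weights[c], x) for c in children[s]]
            s = children[s][int(np.argmax(scores))]          # best child
        return leaf_label[s]                                 # leaf: single class

For a balanced disjoint tree this visits O(log k) nodes at O(d) each, matching the O(d log k) complexity quoted above.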
20 Instances of label trees have been used in the literature before with various methods for choosing the parameters (N, E, F, L). [sent-59, score-0.511]
21 Our goal is to provide an algorithm to learn these parameters to optimize the overall empirical loss (called the tree loss) as accurately as possible for a given tree size (speed). [sent-61, score-1.241]
22 We can define the tree loss we wish to minimize as: R(f_tree) = ∫ I(f_tree(x) ≠ y) dP(x, y) = ∫ max_{i ∈ B(x) = {b_1(x), ..., b_{D(x)}(x)}} I(y ∉ ℓ_i) dP(x, y)   (1) [sent-62, score-0.654]
23 where I is the indicator function, b_j(x) = argmax_{c : (b_{j−1}(x), c) ∈ E} f_c(x) is the index of the winning (“best”) node at depth j, b_0(x) = 0, and D(x) is the depth in the tree of the final prediction for x. [sent-65, score-0.929]
24 The tree loss measures an intermediate loss of 1 for each prediction at each depth j of the label tree where the true label is not in the label set ℓ_{b_j(x)}. [sent-68, score-2.543]
25 Hence, any algorithm wishing to optimize the overall tree loss should train all the nodes jointly with respect to this maximum. [sent-70, score-0.734]
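To make the tree loss of equation (1) concrete, the sketch below evaluates the per-example term by following the greedy path b_1(x), ..., b_{D(x)}(x) and taking the maximum of the intermediate 0/1 losses; the data layout (children, weights, label_sets holding ℓ_c) is assumed, not prescribed.

    import numpy as np

    def tree_loss_example(x, y, children, weights, label_sets, root=0):
        """max_j I(y not in ell_{b_j(x)}) along the greedy prediction path."""
        loss, s = 0, root
        while children[s]:
            scores = [np.dot(weights[c], x) for c in children[s]]
            s = children[s][int(np.argmax(scores))]          # b_j(x)
            if y not in label_sets[s]:                       # intermediate miss
                loss = 1
        return loss

Averaging this quantity over the training set gives the empirical tree loss Remp discussed next.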
26 We will now describe how we propose to learn the parameters T of our label tree. [sent-71, score-0.417]
27 In the next subsection we show how to minimize the tree loss for a given fixed tree (N, E and L are fixed, F is to be learned). [sent-72, score-1.197]
28 1 Learning with a Fixed Label Tree. Let us suppose we are given a fixed label tree (N, E, L) chosen in advance. [sent-75, score-0.928]
29 Our goal is simply to minimize the tree loss (1) over the variables F, given training data {(x_i, y_i)}, i = 1, ..., m. [sent-76, score-0.71]
30 The number of errors counted by the approximation cannot be less than the empirical tree loss Remp: whenever the approximation incurs zero loss on a particular example, Remp is also zero on that example. [sent-83, score-0.701]
31 Relaxation 2: Tree Loss Optimization (Joint convex problem). We propose a tighter minimization of the tree loss with the following objective: minimize (1/m) ∑_{i=1}^{m} ξ_i^α, subject to constraints in which the slack variable ξ_i of each example is shared across all depths of the tree. [sent-94, score-0.658]
32 When α is close to zero, the shared slack variables simply count a single error if any of the predictions at any depth of the tree are incorrect, so this is very close to the true optimization of the tree loss. [sent-100, score-1.156]
33 This is measured by checking, out of all the nodes that share the same parent, if the one containing the true label in its label set is highest ranked. [sent-101, score-0.809]
34 2 Learning Label Tree Structures The previous section shows how to optimize the label predictors F while the nodes N , edges E and label sets L which specify the structure of the tree are fixed in advance. [sent-106, score-1.395]
35 However, we want to be able to learn specific tree structures dependent on our prediction problem such that we minimize the overall tree loss. Algorithm 2 (Learning the Label Tree Structure) begins by training k One-vs-Rest classifiers f̄_1, ..., f̄_k. [sent-107, score-0.65]
36 Learn the parameters f of the tree by minimizing (4) subject to constraints (2) and (3). [sent-118, score-0.572]
37 The key to the generalization ability of a particular choice of tree structure is the learnability of the label sets. [sent-123, score-0.928]
38 If some classes are often confused but are in different label sets, the functions f may not be easily learnable, and the overall tree loss will hence be poor. [sent-124, score-1.119]
39 For example, for an image labeling task, a decision in the tree between two label sets, one containing tiger and jaguar labels versus one containing frog and toad labels, is presumably more learnable than (tiger, frog) vs. (jaguar, toad). [sent-125, score-1.315]
40 In the following, we consider a learning strategy for disjoint label trees (the methods in the previous section were for both joint and disjoint trees). [sent-127, score-0.579]
41 We begin by noticing that Remp can be rewritten as: Remp(f_tree) = (1/m) ∑_{i=1}^{m} max_j ( I(y_i ∈ ℓ_j) ∑_{ȳ ∉ ℓ_j} C(x_i, ȳ) ), where C(x_i, ȳ) = I(f_tree(x_i) = ȳ) is the confusion of labeling example x_i (with true label y_i) with label ȳ instead. [sent-128, score-1.015]
42 That is, the tree loss for a given example is 1 if there is a node j in the tree containing y_i, but we predict a different node at the same depth, leading to a prediction not in the label set of j. [sent-129, score-1.945]
43 Intuitively, the confusion of predicting node i instead of j comes about because of the class confusion between the labels y ∈ ℓ_i and the labels y ∈ ℓ_j. [sent-130, score-0.637]
44 Hence, to provide the smallest tree loss we want to group together, into the same label set, labels that are likely to be confused at test time. [sent-131, score-1.126]
45 Unfortunately we do not know the confusion matrix of a particular tree without training it first, but as a proxy we can use the class confusion matrix of a surrogate classifier with the supposition that the matrices will be highly correlated. [sent-132, score-0.897]
46 The main idea is to recursively partition the labels into label sets between which there is little confusion (measuring confusion using One-vs-Rest as a surrogate classifier) solving at each step a graph cut problem where standard spectral clustering is applied [20, 21]. [sent-134, score-0.815]
47 (To obtain logarithmic speedups the tree has to be balanced; one could also enforce this constraint directly in the k-means step.) [sent-136, score-0.577]
48 The results in Section 5 show that our learnt trees outperform random structures and in fact match the accuracy of not using a tree at all, while being orders of magnitude faster. [sent-137, score-0.839]
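A sketch in the spirit of Algorithm 2 under these stated assumptions: a class-confusion matrix C from a surrogate One-vs-Rest classifier is already available, scikit-learn's spectral clustering handles the graph-cut step, the symmetrized affinity (C + C^T)/2 is used, and no rebalancing is applied; none of these choices are taken from the authors' code.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def build_label_sets(C, labels, n_children=2):
        """Recursively split `labels` into label sets with little mutual
        confusion, using A = (C + C^T)/2 restricted to `labels` as the
        affinity for spectral clustering."""
        if len(labels) <= n_children:
            return [{'labels': [l], 'children': []} for l in labels]   # leaves
        A = C[np.ix_(labels, labels)]
        A = 0.5 * (A + A.T) + 1e-6            # symmetrize, keep graph connected
        assign = SpectralClustering(n_clusters=n_children,
                                    affinity='precomputed').fit_predict(A)
        children = []
        for c in range(n_children):
            subset = [labels[i] for i in np.where(assign == c)[0]]
            children.append({'labels': subset,
                             'children': build_label_sets(C, subset, n_children)})
        return children

For instance, build_label_sets(C, list(range(k)), n_children=200) gives a shallow layout roughly similar to the depth-2, 200-internal-node trees used in the experiments; as noted above, one would also balance the clusters (e.g. via a constrained k-means step) so the tree depth stays logarithmic in k.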
49 3 Label Embeddings. An orthogonal angle of attack on large multi-class problems is to employ shared representations for the labelings, which we term label embeddings. [sent-138, score-0.385]
50 Given the indicator vector φ(y) = (0, ..., 0, 1, 0, ..., 0), which is a k-dimensional vector with a 1 in the y-th position and 0 otherwise, we would like to find a linear embedding E(y) = V φ(y), where V is a d_e × k matrix, assuming that labels y ∈ {1, ..., k}. [sent-145, score-0.46]
51 Without a tree structure, multi-class classification is then achieved with: f_embed(x) = argmax_{i=1,...,k} (W x)⊤ V φ(i), where W maps inputs into the d_e-dimensional embedding space. [sent-149, score-0.543]
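A minimal sketch of this flat embedding classifier, assuming a learned input map W (d_e × d) and label embedding matrix V (d_e × k); the names are illustrative.

    import numpy as np

    def embed_predict(x, W, V):
        """f_embed(x) = argmax_i (W x)^T V phi(i); with a 1-of-k phi,
        V phi(i) is the i-th column of V, so all k scores come from one matvec."""
        z = W.dot(x)                       # map x into the d_e-dim label space
        return int(np.argmax(V.T.dot(z)))  # score each of the k classes

The cost is O(d_e(d + k)): when d_e is much smaller than d this is the significant saving noted below, although the score step is still linear in k.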
52 This method, unlike label trees, is unfortunately still linear with respect to k. [sent-155, score-0.385]
53 If the embedding dimension d_e is much smaller than d, this gives a significant saving. [sent-157, score-0.375]
54 For example, the method of compressed sensing [17] has a similar form to (5), but the matrix V is not learnt but chosen randomly, and only W is learnt. [sent-159, score-0.231]
55 In the next section we will show how we can train such models so that the matrix V captures the semantic similarity between classes, which can improve generalization performance over random choices of V in an analogous way to the improvement of label trees over random trees. [sent-160, score-0.54]
56 Subsequently, we will show how to combine label embeddings with label trees to gain the advantages of both approaches. [sent-161, score-1.004]
57 Sequence of Convex Problems. Firstly, we consider learning the label embedding by solving a sequence of convex problems using the following method. [sent-164, score-0.796]
58 Then, find the label embedding vectors V_i that minimize ∑_{i,j=1}^{k} A_ij ||V_i − V_j||², where A = ½(C̄ + C̄⊤) is the symmetrized confusion matrix, subject to the constraint V D V⊤ = I, where D_ii = ∑_j A_ij (to prevent trivial solutions); this is the same problem solved by Laplacian Eigenmaps [4]. [sent-171, score-0.985]
59 We then obtain an embedding matrix V where similar classes i and j should have small distance between their vectors Vi and Vj . [sent-172, score-0.443]
60 To do this, we can then train a convex multi-class classifier utilizing the label embedding V: minimize γ||W||_FRO + (1/m) ∑_{i=1}^{m} ξ_i, where ||·||_FRO denotes the Frobenius norm. [sent-174, score-0.857]
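A sketch of the first (embedding) step of this sequential convex procedure: form the symmetrized confusion affinity A = (C̄ + C̄⊤)/2 and solve the Laplacian-eigenmaps generalized eigenproblem so that the returned V satisfies V D V⊤ = I. The eigen-solver, the small ridge on D, and the convention of dropping the trivial constant eigenvector are assumptions of this sketch, not the authors' exact recipe; the second step (training W with the Frobenius-regularized margin objective while V is held fixed) is not shown.

    import numpy as np
    from scipy.linalg import eigh

    def confusion_label_embedding(C_bar, d_e):
        """Return a d_e x k matrix V whose columns V_i minimize
        sum_ij A_ij ||V_i - V_j||^2 subject to V D V^T = I
        (Laplacian eigenmaps on the class-confusion graph)."""
        A = 0.5 * (C_bar + C_bar.T)          # symmetrized confusion matrix
        d = A.sum(axis=1) + 1e-8             # small ridge keeps D invertible
        D = np.diag(d)
        L = D - A                            # graph Laplacian
        _, vecs = eigh(L, D)                 # generalized eigvecs, ascending order
        V = vecs[:, 1:d_e + 1].T             # skip the trivial constant eigenvector
        return V

Given this V, the convex problem above is solved for W alone; alternatively one can switch to the non-convex joint objective over W and V mentioned in the experiments.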
61 2 Learning Label Embedding Trees In this work, we also propose to combine the use of embeddings and label trees to obtain the advantages of both approaches, which we call the label embedding tree. [sent-189, score-1.379]
62 At test time, the resulting label embedding tree prediction is given in Algorithm 3. [sent-190, score-1.38]
63 The label embedding tree has potentially O(d_e(d + log(k))) testing speed, depending on the structure of the tree (e.g. whether it is balanced). [sent-191, score-1.92]
64 Algorithm 3 (Label Embedding Tree Prediction): start at the root node, then repeatedly traverse to the highest-scoring child, s = argmax_{c:(s,c)∈E} f_c(x) = argmax_{c:(s,c)∈E} z⊤E(c), where z is the test example mapped into the label embedding space, until a leaf is reached. [sent-197, score-0.216]
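A sketch of the prediction step just described (Algorithm 3), under the assumption that each tree node c stores its embedded label representation E(c) = V φ(c) as a d_e-dimensional vector; the data layout is hypothetical.

    import numpy as np

    def label_embedding_tree_predict(x, W, E, children, leaf_label, root=0):
        """Embed x once (z = W x, cost O(d * d_e)), then traverse the tree
        scoring children by z . E(c), cost O(d_e) per edge."""
        z = W.dot(x)
        s = root
        while children[s]:
            scores = [np.dot(z, E[c]) for c in children[s]]
            s = children[s][int(np.argmax(scores))]
        return leaf_label[s]

For a balanced tree this realizes the O(d_e(d + log k)) test cost stated above.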
65 To learn a label embedding tree we propose the following minimization problem: minimize γ||W||_FRO + (1/m) ∑_{i=1}^{m} ξ_i, subject to the constraints (W x_i)⊤ V φ(r) ≥ (W x_i)⊤ V φ(s) − ξ_i, ∀ r, s : y_i ∈ ℓ_r ∧ y_i ∉ ℓ_s ∧ (∃p : (p, r) ∈ E ∧ (p, s) ∈ E), ||V_i|| ≤ 1, ξ_i ≥ 0, i = 1, ..., m. [sent-201, score-1.548]
66 Learning the tree structure for these models can still be achieved using Algorithm 2. [sent-206, score-0.543]
67 Common multi-class strategies include one-versus-rest, one-versus-one, label ranking and Decision Directed Acyclic Graph (DDAG). [sent-209, score-0.385]
68 The approach of [24] can also yield a tree whose depth (and hence test cost) is logarithmic in k. [sent-221, score-0.681]
69 Filter tree [7] and Conditional Probability Tree (CPT) [6] are logarithmic approaches that have been introduced recently with motivations similar to ours, i.e. efficient prediction when the number of classes is very large. [sent-223, score-0.605]
70 Filter tree considers a random binary tree in which each leaf is associated with a class and each node is associated with a binary classifier. [sent-226, score-1.302]
71 At each node, the node classifier decides whether the example is directed to the right or to the left subtree, each of which is associated with half of the labels of the parent node. [sent-228, score-0.247]
72 Finally, the label of the reached leaf is predicted. [sent-229, score-0.44]
73 Conditional Probability Tree (CPT) relies on a similar paradigm but builds the tree during training. [sent-230, score-0.543]
74 Hence, CPT builds the tree greedily: when a new class is encountered, it is added by splitting an existing leaf. [sent-232, score-0.591]
75 In our case, we consider that the set of classes is available prior to training and propose to tessellate the class label sets such that the node classifiers are likely to achieve high generalization performance. [sent-233, score-0.614]
76 Finally, we should mention that a related active area of research involves partitioning the feature space rather than the label space, e.g. decision trees. [sent-235, score-0.385]
77 Label embedding is another key aspect of our work when it comes to efficiently handling thousands of classes. [sent-238, score-0.375]
78 Compressed sensing based approaches [17] do propose to embed class labels, but rely on a random projection for embedding the vector representing class memberships, with the added advantage of handling problems for which multiple classes are active for a given example. [sent-242, score-0.667]
79 However, relying on a random projection does not allow the class embedding to capture the relations between classes. [sent-243, score-0.423]
80 Finally, the authors of [2] do propose an embedding approach over class labels, but it is not clear to us if their approach is scalable to our setting. [sent-245, score-0.423]
81 5 Experimental Study We consider three datasets: one publicly available image annotation dataset and two proprietary datasets based on images and textual descriptions of products. [sent-246, score-0.304]
82 We consider two tasks: predicting the label given the textual description, and predicting the label given the image. [sent-258, score-0.84]
83 Flat versus Tree Learning Approaches In Table 2 we compare label tree predictor training methods from Section 2. [sent-263, score-0.956]
84 In all cases we considered disjoint trees of depth 2 with 200 internal nodes. [sent-266, score-0.23]
85 The results show that learnt structure performs better than random structure and that tree loss optimization is superior to independent optimization. [sent-267, score-0.769]
86 The combination of Learnt Label Tree structure and Tree Loss Optimization for the label predictors is the only method that is comparable to or better than One-vs-Rest while being around 60× faster to compute at test time. [sent-270, score-0.462]
87 For ImageNet one could wonder how well using WordNet (a graph of human annotated label similarities) to build a tree would perform instead. [sent-271, score-0.928]
88 We constructed a matrix C for Algorithm 2 where Cij = 1 if there is an edge in the WordNet graph, and 0 otherwise, and used that to learn a label tree as before, obtaining 0. [sent-272, score-0.96]
89 This is better than a random tree but not as good as using the confusion matrix, implying that the best tree to use is the one adapted to the supervised task of interest. [sent-274, score-1.239]
90 Table 2 (Flat versus Tree Learning Results): test set accuracies for various tree and non-tree methods on three datasets. [sent-279, score-0.543]
91 Embedding and Embedding Tree Approaches. In Table 3 we compare several label embedding methods: (i) the convex and non-convex methods from Section 5; (ii) compressed sensing; and (iii) the label embedding tree from Section 3. [sent-313, score-2.164]
92 In all cases we fixed the embedding dimension d_e = 100. [sent-315, score-0.375]
93 The results show that the random embeddings given by compressed sensing are inferior to learnt embeddings, and that Non-Convex Embedding is superior to Sequential Convex Embedding, presumably because the overall loss, which depends on both W and V, is jointly optimized. [sent-316, score-0.612]
94 Note that we do not detail results on the product descriptions task because no speed-up is gained there from embedding, as the sparsity is already so high; however, the methods still gave good test accuracy. [sent-318, score-0.53]
95 Finally, we combine embedding and label tree learning using the “Label Embedding Tree” of Section 3. [sent-322, score-1.303]
96 Moreover, memory usage of this method (and other embedding methods) is significantly less than One-vs-Rest. [sent-324, score-0.414]
97 6 Conclusion. We have introduced an approach for fast multi-class classification that learns label embedding trees by (approximately) optimizing the overall tree loss. [sent-325, score-1.473]
98 Our approach obtained orders of magnitude speedup compared to One-vs-Rest while yielding as good or better accuracy, and outperformed other tree-based or embedding approaches. [sent-326, score-0.404]
99 Laplacian eigenmaps and spectral techniques for embedding and clustering. [sent-355, score-0.443]
100 The effects of training set size on decision tree complexity. [sent-456, score-0.543]
wordName wordTfidf (topN-words)
[('tree', 0.543), ('label', 0.385), ('embedding', 0.375), ('confusion', 0.153), ('mb', 0.138), ('cpt', 0.132), ('trees', 0.126), ('imagenet', 0.117), ('node', 0.113), ('embeddings', 0.108), ('ftree', 0.107), ('learnt', 0.105), ('labels', 0.085), ('remp', 0.081), ('loss', 0.079), ('testing', 0.074), ('classi', 0.072), ('textual', 0.07), ('kd', 0.07), ('depth', 0.07), ('wordnet', 0.069), ('classes', 0.068), ('filter', 0.066), ('compressed', 0.065), ('ro', 0.061), ('sensing', 0.061), ('fc', 0.059), ('yi', 0.056), ('leaf', 0.055), ('nl', 0.054), ('images', 0.053), ('grangier', 0.052), ('thousand', 0.051), ('ers', 0.05), ('dent', 0.05), ('parent', 0.049), ('class', 0.048), ('document', 0.047), ('multiclass', 0.046), ('argmax', 0.046), ('langford', 0.046), ('documents', 0.045), ('er', 0.045), ('relaxation', 0.045), ('product', 0.044), ('root', 0.044), ('overall', 0.044), ('prediction', 0.043), ('annotation', 0.043), ('predictors', 0.043), ('ddag', 0.043), ('jaguar', 0.043), ('symmetrized', 0.043), ('toad', 0.043), ('taxonomy', 0.042), ('balanced', 0.042), ('superior', 0.042), ('descriptions', 0.041), ('spectral', 0.039), ('embed', 0.039), ('nodes', 0.039), ('child', 0.039), ('none', 0.039), ('memory', 0.039), ('traverses', 0.038), ('flat', 0.038), ('accuracy', 0.036), ('convex', 0.036), ('image', 0.036), ('xi', 0.036), ('speed', 0.035), ('vi', 0.035), ('samy', 0.035), ('frog', 0.035), ('logarithmic', 0.034), ('disjoint', 0.034), ('test', 0.034), ('advertising', 0.032), ('learnable', 0.032), ('traverse', 0.032), ('validation', 0.032), ('children', 0.032), ('minimize', 0.032), ('learn', 0.032), ('bj', 0.031), ('cj', 0.031), ('proprietary', 0.031), ('categorization', 0.03), ('datasets', 0.03), ('dekel', 0.029), ('eigenmaps', 0.029), ('orders', 0.029), ('subject', 0.029), ('train', 0.029), ('grows', 0.029), ('beygelzimer', 0.028), ('arrives', 0.028), ('tiger', 0.028), ('predictor', 0.028), ('approaches', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
2 0.20416433 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
3 0.20294636 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
4 0.19803558 94 nips-2010-Feature Set Embedding for Incomplete Data
Author: David Grangier, Iain Melvin
Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. 1
5 0.17261559 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
Author: Umar Syed, Ben Taskar
Abstract: We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines. 1
6 0.16050199 144 nips-2010-Learning Efficient Markov Networks
7 0.13982476 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
8 0.13341242 177 nips-2010-Multitask Learning without Label Correspondences
9 0.125545 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
10 0.12305571 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
11 0.12178835 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
12 0.11881509 223 nips-2010-Rates of convergence for the cluster tree
13 0.11521532 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
14 0.10729864 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
15 0.10448495 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
16 0.10265755 280 nips-2010-Unsupervised Kernel Dimension Reduction
17 0.10265654 228 nips-2010-Reverse Multi-Label Learning
18 0.10236255 27 nips-2010-Agnostic Active Learning Without Constraints
19 0.10062923 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
20 0.098992713 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
topicId topicWeight
[(0, 0.272), (1, 0.103), (2, 0.081), (3, -0.131), (4, -0.106), (5, 0.01), (6, -0.062), (7, -0.057), (8, 0.041), (9, -0.039), (10, -0.001), (11, 0.077), (12, -0.051), (13, 0.091), (14, 0.093), (15, -0.156), (16, -0.055), (17, -0.109), (18, -0.144), (19, -0.184), (20, -0.067), (21, -0.092), (22, 0.106), (23, -0.288), (24, -0.002), (25, 0.106), (26, 0.024), (27, 0.169), (28, -0.057), (29, -0.095), (30, 0.116), (31, 0.023), (32, -0.054), (33, 0.059), (34, 0.066), (35, -0.008), (36, 0.068), (37, -0.067), (38, -0.021), (39, 0.125), (40, 0.104), (41, 0.017), (42, -0.034), (43, -0.09), (44, -0.004), (45, -0.055), (46, -0.027), (47, 0.031), (48, 0.052), (49, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.97579086 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
2 0.68707532 144 nips-2010-Learning Efficient Markov Networks
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
3 0.64396644 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristics. The superior performance of MDRTs are demonstrated on both synthetic and real datasets. 1
4 0.62340331 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
Author: Umar Syed, Ben Taskar
Abstract: We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines. 1
5 0.61760575 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
6 0.59317315 94 nips-2010-Feature Set Embedding for Incomplete Data
7 0.52611524 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
8 0.50419772 177 nips-2010-Multitask Learning without Label Correspondences
9 0.49405286 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
10 0.49012494 228 nips-2010-Reverse Multi-Label Learning
11 0.48434907 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
12 0.48308203 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
13 0.47678837 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
14 0.4756414 1 nips-2010-(RF)^2 -- Random Forest Random Field
15 0.45274663 108 nips-2010-Graph-Valued Regression
16 0.44419345 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
17 0.42566907 290 nips-2010-t-logistic regression
18 0.42543653 151 nips-2010-Learning from Candidate Labeling Sets
19 0.41124758 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
20 0.40201241 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
topicId topicWeight
[(13, 0.077), (17, 0.016), (27, 0.063), (30, 0.107), (35, 0.036), (45, 0.281), (48, 0.1), (50, 0.039), (52, 0.033), (60, 0.048), (77, 0.032), (78, 0.033), (90, 0.059)]
simIndex simValue paperId paperTitle
1 0.95166278 124 nips-2010-Inductive Regularized Learning of Kernel Functions
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
same-paper 2 0.95039952 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
3 0.93984491 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(H·R_n² + √(H·L*)·R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L* = 0) case and Õ(RH/n + √(L*·RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1
4 0.93779749 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
Author: Gowtham Bellala, Suresh Bhavnani, Clayton Scott
Abstract: Generalized Binary Search (GBS) is a well known greedy algorithm for identifying an unknown object while minimizing the number of “yes” or “no” questions posed about that object, and arises in problems such as active learning and active diagnosis. Here, we provide a coding-theoretic interpretation for GBS and show that GBS can be viewed as a top-down algorithm that greedily minimizes the expected number of queries required to identify an object. This interpretation is then used to extend GBS in two ways. First, we consider the case where the objects are partitioned into groups, and the objective is to identify only the group to which the object belongs. Then, we consider the case where the cost of identifying an object grows exponentially in the number of queries. In each case, we present an exact formula for the objective function involving Shannon or Rényi entropy, and develop a greedy algorithm for minimizing it. 1
5 0.93481588 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
Author: Umar Syed, Ben Taskar
Abstract: We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines. 1
6 0.93480635 290 nips-2010-t-logistic regression
7 0.93476814 63 nips-2010-Distributed Dual Averaging In Networks
8 0.93398786 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
9 0.93388534 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
10 0.93378812 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
11 0.93342346 27 nips-2010-Agnostic Active Learning Without Constraints
12 0.932643 222 nips-2010-Random Walk Approach to Regret Minimization
13 0.93176538 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
14 0.93158251 280 nips-2010-Unsupervised Kernel Dimension Reduction
15 0.93153667 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
16 0.93114585 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
17 0.93102324 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
18 0.9305588 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
19 0.93012559 265 nips-2010-The LASSO risk: asymptotic results and real world examples
20 0.92935532 196 nips-2010-Online Markov Decision Processes under Bandit Feedback