nips nips2010 nips2010-135 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. [sent-3, score-0.224]
2 We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. [sent-5, score-1.208]
3 We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. [sent-6, score-0.605]
4 Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. [sent-7, score-1.332]
5 Current multi-class applications such as web advertising [6], textual document categorization [11] or image annotation [12] have tens or hundreds of thousands of classes, and these datasets are still growing. [sent-9, score-0.288]
6 Our approach rests on two main ideas: firstly, an algorithm for learning a label tree: each node makes a prediction of the subset of labels to be considered by its children, thus decreasing the number of labels k at a logarithmic rate until a prediction is reached. [sent-21, score-0.788]
7 We provide a novel algorithm that both learns the sets of labels at each node, and the predictors at the nodes to optimize the overall tree loss, and show that this approach is superior to existing tree-based approaches [7, 6] which typically lose accuracy compared to O(kd) approaches. [sent-22, score-0.86]
8 Balanced label trees have O(d log k) complexity, as the predictor at each node is still only O(d). (Algorithm 1, the Label Tree Prediction Algorithm, takes as input a test example x and the tree parameters T.) [sent-23, score-0.686]
9 Our second main idea is to learn an embedding of the labels into a space of dimension d_e that again still optimizes the overall tree loss. [sent-30, score-1.079]
10 Hence, we are required at test time to: (1) map the test example into the label embedding space with cost O(d·d_e) and then (2) predict using the label tree, resulting in our overall cost O(d_e(log k + d)). [sent-31, score-1.8]
11 We also show that our label embedding approach outperforms other recently proposed label embedding approaches such as compressed sensing [17]. [sent-32, score-1.674]
12 Label trees are discussed and label tree learning algorithms are proposed in Section 2. [sent-34, score-1.054]
13 2 Label Trees. A label tree is a tree T = (N, E, F, L) with n + 1 indexed nodes N = {0, ..., n}, [sent-39, score-1.51]
14 a set of edges E = {(p_1, c_1), ..., (p_|E|, c_|E|)} which are ordered pairs of parent and child node indices, label predictors F = {f_1, ..., f_n}, and label sets L = {ℓ_0, ..., ℓ_n}. [sent-42, score-0.629]
15 The label sets indicate the set of labels to which a point should belong if it arrives at the given node, and progress from generic to specific along the tree, i.e. [sent-51, score-0.498]
16 the root label set contains all classes, |ℓ_0| = k, and each child label set is a subset of its parent label set, with ℓ_p = ∪_{(p,c)∈E} ℓ_c. [sent-53, score-1.355]
17 We differentiate between disjoint label trees, where there are only k leaf nodes, one per class, and hence any two nodes i and j at the same depth cannot share any labels, ℓ_i ∩ ℓ_j = ∅, and joint label trees that can have more than k leaf nodes. [sent-54, score-1.275]
18 Classifying an example with the label tree is achieved by applying Algorithm 1. [sent-55, score-0.928]
19 Prediction begins at the root node (s = 0) and for each edge leading to a child (s, c) ∈ E one computes the score of the label predictor f_c(x), which predicts whether the example x belongs to the set of labels ℓ_c. [sent-56, score-0.753]
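A minimal sketch of this prediction procedure (Algorithm 1), assuming the tree is stored as a children list per node, one linear predictor weight vector per node, and one class per leaf; the function and variable names here are illustrative, not the authors' implementation.

    import numpy as np

    def label_tree_predict(x, children, weights, leaf_label, root=0):
        """Walk a label tree: at each node follow the child c with the
        highest label-predictor score f_c(x) = w_c . x until a leaf is
        reached, then return that leaf's class."""
        s = root
        while children[s]:                                   # internal node
            scores = [np.dot(weights[c], x) for c in children[s]]
            s = children[s][int(np.argmax(scores))]          # best child
        return leaf_label[s]                                 # leaf: single class

For a balanced disjoint tree this visits O(log k) nodes at O(d) each, matching the O(d log k) complexity quoted above.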
20 Instances of label trees have been used in the literature before with various methods for choosing the parameters (N, E, F, L). [sent-59, score-0.511]
21 Our goal is to provide an algorithm to learn these parameters to optimize the overall empirical loss (called the tree loss) as accurately as possible for a given tree size (speed). [sent-61, score-1.241]
22 We can define the tree loss we wish to minimize as: R(f_tree) = ∫ I(f_tree(x) ≠ y) dP(x, y) = ∫ max_{i ∈ B(x) = {b_1(x), ..., b_{D(x)}(x)}} I(y ∉ ℓ_i) dP(x, y)   (1) [sent-62, score-0.654]
23 where I is the indicator function, b_j(x) = argmax_{c : (b_{j−1}(x), c) ∈ E} f_c(x) is the index of the winning (“best”) node at depth j, b_0(x) = 0, and D(x) is the depth in the tree of the final prediction for x. [sent-65, score-0.929]
24 The tree loss measures an intermediate loss of 1 for each prediction at each depth j of the label tree where the true label is not in the label set ℓ_{b_j(x)}. [sent-68, score-2.543]
25 Hence, any algorithm wishing to optimize the overall tree loss should train all the nodes jointly with respect to this maximum. [sent-70, score-0.734]
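To make the tree loss of equation (1) concrete, the sketch below evaluates the per-example term by following the greedy path b_1(x), ..., b_{D(x)}(x) and taking the maximum of the intermediate 0/1 losses; the data layout (children, weights, label_sets holding ℓ_c) is assumed, not prescribed.

    import numpy as np

    def tree_loss_example(x, y, children, weights, label_sets, root=0):
        """max_j I(y not in ell_{b_j(x)}) along the greedy prediction path."""
        loss, s = 0, root
        while children[s]:
            scores = [np.dot(weights[c], x) for c in children[s]]
            s = children[s][int(np.argmax(scores))]          # b_j(x)
            if y not in label_sets[s]:                       # intermediate miss
                loss = 1
        return loss

Averaging this quantity over the training set gives the empirical tree loss Remp discussed next.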
26 We will now describe how we propose to learn the parameters T of our label tree. [sent-71, score-0.417]
27 In the next subsection we show how to minimize the tree loss for a given fixed tree (N, E and L are fixed, F is to be learned). [sent-72, score-1.197]
28 1 Learning with a Fixed Label Tree. Let us suppose we are given a fixed label tree (N, E, L) chosen in advance. [sent-75, score-0.928]
29 Our goal is simply to minimize the tree loss (1) over the variables F, given training data {(x_i, y_i)}, i = 1, ..., m. [sent-76, score-0.71]
30 The number of errors counted by the approximation cannot be less than the empirical tree loss Remp: whenever the approximation incurs zero loss on a particular example, Remp is also zero on that example. [sent-83, score-0.701]
31 Relaxation 2: Tree Loss Optimization (Joint convex problem). We propose a tighter minimization of the tree loss with the following objective: minimize (1/m) ∑_{i=1}^{m} ξ_i^α, subject to constraints in which the slack variable ξ_i of each example is shared across all depths of the tree. [sent-94, score-0.658]
32 When α is close to zero, the shared slack variables simply count a single error if any of the predictions at any depth of the tree are incorrect, so this is very close to the true optimization of the tree loss. [sent-100, score-1.156]
33 This is measured by checking, out of all the nodes that share the same parent, if the one containing the true label in its label set is highest ranked. [sent-101, score-0.809]
34 2 Learning Label Tree Structures The previous section shows how to optimize the label predictors F while the nodes N , edges E and label sets L which specify the structure of the tree are fixed in advance. [sent-106, score-1.395]
35 However, we want to be able to learn specific tree structures dependent on our prediction problem such that we minimize the overall tree loss. Algorithm 2 (Learning the Label Tree Structure) begins by training k One-vs-Rest classifiers f̄_1, ..., f̄_k. [sent-107, score-0.65]
36 Learn the parameters f of the tree by minimizing (4) subject to constraints (2) and (3). [sent-118, score-0.572]
37 The key to the generalization ability of a particular choice of tree structure is the learnability of the label sets. [sent-123, score-0.928]
38 If some classes are often confused but are in different label sets, the functions f may not be easily learnable, and the overall tree loss will hence be poor. [sent-124, score-1.119]
39 For example, for an image labeling task, a decision in the tree between two label sets, one containing tiger and jaguar labels versus one containing frog and toad labels, is presumably more learnable than (tiger, frog) vs. (jaguar, toad). [sent-125, score-1.315]
40 In the following, we consider a learning strategy for disjoint label trees (the methods in the previous section were for both joint and disjoint trees). [sent-127, score-0.579]
41 We begin by noticing that Remp can be rewritten as: Remp(f_tree) = (1/m) ∑_{i=1}^{m} max_j ( I(y_i ∈ ℓ_j) ∑_{ȳ ∉ ℓ_j} C(x_i, ȳ) ), where C(x_i, ȳ) = I(f_tree(x_i) = ȳ) is the confusion of labeling example x_i (with true label y_i) with label ȳ instead. [sent-128, score-1.015]
42 That is, the tree loss for a given example is 1 if there is a node j in the tree containing y_i, but we predict a different node at the same depth, leading to a prediction not in the label set of j. [sent-129, score-1.945]
43 Intuitively, the confusion of predicting node i instead of j comes about because of the class confusion between the labels y ∈ ℓ_i and the labels y ∈ ℓ_j. [sent-130, score-0.637]
44 Hence, to provide the smallest tree loss we want to group together, into the same label set, labels that are likely to be confused at test time. [sent-131, score-1.126]
45 Unfortunately we do not know the confusion matrix of a particular tree without training it first, but as a proxy we can use the class confusion matrix of a surrogate classifier with the supposition that the matrices will be highly correlated. [sent-132, score-0.897]
46 The main idea is to recursively partition the labels into label sets between which there is little confusion (measuring confusion using One-vs-Rest as a surrogate classifier) solving at each step a graph cut problem where standard spectral clustering is applied [20, 21]. [sent-134, score-0.815]
47 (To obtain logarithmic speedups the tree has to be balanced; one could also enforce this constraint directly in the k-means step.) [sent-136, score-0.577]
48 The results in Section 5 show that our learnt trees outperform random structures and in fact match the accuracy of not using a tree at all, while being orders of magnitude faster. [sent-137, score-0.839]
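A sketch in the spirit of Algorithm 2 under these stated assumptions: a class-confusion matrix C from a surrogate One-vs-Rest classifier is already available, scikit-learn's spectral clustering handles the graph-cut step, the symmetrized affinity (C + C^T)/2 is used, and no rebalancing is applied; none of these choices are taken from the authors' code.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def build_label_sets(C, labels, n_children=2):
        """Recursively split `labels` into label sets with little mutual
        confusion, using A = (C + C^T)/2 restricted to `labels` as the
        affinity for spectral clustering."""
        if len(labels) <= n_children:
            return [{'labels': [l], 'children': []} for l in labels]   # leaves
        A = C[np.ix_(labels, labels)]
        A = 0.5 * (A + A.T) + 1e-6            # symmetrize, keep graph connected
        assign = SpectralClustering(n_clusters=n_children,
                                    affinity='precomputed').fit_predict(A)
        children = []
        for c in range(n_children):
            subset = [labels[i] for i in np.where(assign == c)[0]]
            children.append({'labels': subset,
                             'children': build_label_sets(C, subset, n_children)})
        return children

For instance, build_label_sets(C, list(range(k)), n_children=200) gives a shallow layout roughly similar to the depth-2, 200-internal-node trees used in the experiments; as noted above, one would also balance the clusters (e.g. via a constrained k-means step) so the tree depth stays logarithmic in k.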
49 3 Label Embeddings. An orthogonal angle of attack on large multi-class problems is to employ shared representations for the labelings, which we term label embeddings. [sent-138, score-0.385]
50 Given the indicator vector φ(y) = (0, ..., 0, 1, 0, ..., 0), which is a k-dimensional vector with a 1 in the y-th position and 0 otherwise, we would like to find a linear embedding E(y) = V φ(y), where V is a d_e × k matrix, assuming that labels y ∈ {1, ..., k}. [sent-145, score-0.46]
51 Without a tree structure, multi-class classification is then achieved with: f_embed(x) = argmax_{i=1,...,k} (W x)⊤ V φ(i), where W maps inputs into the d_e-dimensional embedding space. [sent-149, score-0.543]
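A minimal sketch of this flat embedding classifier, assuming a learned input map W (d_e × d) and label embedding matrix V (d_e × k); the names are illustrative.

    import numpy as np

    def embed_predict(x, W, V):
        """f_embed(x) = argmax_i (W x)^T V phi(i); with a 1-of-k phi,
        V phi(i) is the i-th column of V, so all k scores come from one matvec."""
        z = W.dot(x)                       # map x into the d_e-dim label space
        return int(np.argmax(V.T.dot(z)))  # score each of the k classes

The cost is O(d_e(d + k)): when d_e is much smaller than d this is the significant saving noted below, although the score step is still linear in k.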
52 This method, unlike label trees, is unfortunately still linear with respect to k. [sent-155, score-0.385]
53 If the embedding dimension d_e is much smaller than d, this gives a significant saving. [sent-157, score-0.375]
54 For example, the method of compressed sensing [17] has a similar form to (5), but the matrix V is not learnt but chosen randomly, and only W is learnt. [sent-159, score-0.231]
55 In the next section we will show how we can train such models so that the matrix V captures the semantic similarity between classes, which can improve generalization performance over random choices of V in an analogous way to the improvement of label trees over random trees. [sent-160, score-0.54]
56 Subsequently, we will show how to combine label embeddings with label trees to gain the advantages of both approaches. [sent-161, score-1.004]
57 Sequence of Convex Problems. Firstly, we consider learning the label embedding by solving a sequence of convex problems using the following method. [sent-164, score-0.796]
58 Then, find the label embedding vectors V_i that minimize ∑_{i,j=1}^{k} A_ij ||V_i − V_j||², where A = ½(C̄ + C̄⊤) is the symmetrized confusion matrix, subject to the constraint V D V⊤ = I, where D_ii = ∑_j A_ij (to prevent trivial solutions); this is the same problem solved by Laplacian Eigenmaps [4]. [sent-171, score-0.985]
59 We then obtain an embedding matrix V where similar classes i and j should have small distance between their vectors Vi and Vj . [sent-172, score-0.443]
60 To do this, we can then train a convex multi-class classifier utilizing the label embedding V: minimize γ||W||_FRO + (1/m) ∑_{i=1}^{m} ξ_i, where ||·||_FRO denotes the Frobenius norm. [sent-174, score-0.857]
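A sketch of the first (embedding) step of this sequential convex procedure: form the symmetrized confusion affinity A = (C̄ + C̄⊤)/2 and solve the Laplacian-eigenmaps generalized eigenproblem so that the returned V satisfies V D V⊤ = I. The eigen-solver, the small ridge on D, and the convention of dropping the trivial constant eigenvector are assumptions of this sketch, not the authors' exact recipe; the second step (training W with the Frobenius-regularized margin objective while V is held fixed) is not shown.

    import numpy as np
    from scipy.linalg import eigh

    def confusion_label_embedding(C_bar, d_e):
        """Return a d_e x k matrix V whose columns V_i minimize
        sum_ij A_ij ||V_i - V_j||^2 subject to V D V^T = I
        (Laplacian eigenmaps on the class-confusion graph)."""
        A = 0.5 * (C_bar + C_bar.T)          # symmetrized confusion matrix
        d = A.sum(axis=1) + 1e-8             # small ridge keeps D invertible
        D = np.diag(d)
        L = D - A                            # graph Laplacian
        _, vecs = eigh(L, D)                 # generalized eigvecs, ascending order
        V = vecs[:, 1:d_e + 1].T             # skip the trivial constant eigenvector
        return V

Given this V, the convex problem above is solved for W alone; alternatively one can switch to the non-convex joint objective over W and V mentioned in the experiments.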
61 2 Learning Label Embedding Trees In this work, we also propose to combine the use of embeddings and label trees to obtain the advantages of both approaches, which we call the label embedding tree. [sent-189, score-1.379]
62 At test time, the resulting label embedding tree prediction is given in Algorithm 3. [sent-190, score-1.38]
63 The label embedding tree has potentially O(d_e(d + log(k))) testing speed, depending on the structure of the tree (e.g. whether it is balanced). [sent-191, score-1.92]
64 Algorithm 3 (Label Embedding Tree Prediction): start at the root node, then repeatedly traverse to the highest-scoring child, s = argmax_{c:(s,c)∈E} f_c(x) = argmax_{c:(s,c)∈E} z⊤E(c), where z is the test example mapped into the label embedding space, until a leaf is reached. [sent-197, score-0.216]
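A sketch of the prediction step just described (Algorithm 3), under the assumption that each tree node c stores its embedded label representation E(c) = V φ(c) as a d_e-dimensional vector; the data layout is hypothetical.

    import numpy as np

    def label_embedding_tree_predict(x, W, E, children, leaf_label, root=0):
        """Embed x once (z = W x, cost O(d * d_e)), then traverse the tree
        scoring children by z . E(c), cost O(d_e) per edge."""
        z = W.dot(x)
        s = root
        while children[s]:
            scores = [np.dot(z, E[c]) for c in children[s]]
            s = children[s][int(np.argmax(scores))]
        return leaf_label[s]

For a balanced tree this realizes the O(d_e(d + log k)) test cost stated above.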
65 To learn a label embedding tree we propose the following minimization problem: minimize γ||W||_FRO + (1/m) ∑_{i=1}^{m} ξ_i, subject to the constraints (W x_i)⊤ V φ(r) ≥ (W x_i)⊤ V φ(s) − ξ_i, ∀ r, s : y_i ∈ ℓ_r ∧ y_i ∉ ℓ_s ∧ (∃p : (p, r) ∈ E ∧ (p, s) ∈ E), ||V_i|| ≤ 1, ξ_i ≥ 0, i = 1, ..., m. [sent-201, score-1.548]
66 Learning the tree structure for these models can still be achieved using Algorithm 2. [sent-206, score-0.543]
67 Common multi-class strategies include one-versus-rest, one-versus-one, label ranking and Decision Directed Acyclic Graph (DDAG). [sent-209, score-0.385]
68 The approach of [24] can also yield a tree whose depth (and hence test cost) is logarithmic in k. [sent-221, score-0.681]
69 Filter tree [7] and Conditional Probability Tree (CPT) [6] are logarithmic approaches that have been introduced recently with motivations similar to ours, i.e. efficient prediction when the number of classes is very large. [sent-223, score-0.605]
70 Filter tree considers a random binary tree in which each leaf is associated with a class and each node is associated with a binary classifier. [sent-226, score-1.302]
71 At each node, the node classifier decides whether the example is directed to the right or to the left subtree, each of which is associated with half of the labels of the parent node. [sent-228, score-0.247]
72 Finally, the label of the reached leaf is predicted. [sent-229, score-0.44]
73 Conditional Probability Tree (CPT) relies on a similar paradigm but builds the tree during training. [sent-230, score-0.543]
74 Hence, CPT builds the tree greedily: when a new class is encountered, it is added by splitting an existing leaf. [sent-232, score-0.591]
75 In our case, we consider that the set of classes is available prior to training and propose to tessellate the class label sets such that the node classifiers are likely to achieve high generalization performance. [sent-233, score-0.614]
76 Finally, we should mention that a related active area of research involves partitioning the feature space rather than the label space, e.g. decision trees. [sent-235, score-0.385]
77 Label embedding is another key aspect of our work when it comes to efficiently handling thousands of classes. [sent-238, score-0.375]
78 Compressed sensing based approaches [17] do propose to embed class labels, but rely on a random projection for embedding the vector representing class memberships, with the added advantage of handling problems for which multiple classes are active for a given example. [sent-242, score-0.667]
79 However, relying on a random projection does not allow the class embedding to capture the relations between classes. [sent-243, score-0.423]
80 Finally, the authors of [2] do propose an embedding approach over class labels, but it is not clear to us if their approach is scalable to our setting. [sent-245, score-0.423]
81 5 Experimental Study We consider three datasets: one publicly available image annotation dataset and two proprietary datasets based on images and textual descriptions of products. [sent-246, score-0.304]
82 We consider two tasks: predicting the label given the textual description, and predicting the label given the image. [sent-258, score-0.84]
83 Flat versus Tree Learning Approaches In Table 2 we compare label tree predictor training methods from Section 2. [sent-263, score-0.956]
84 In all cases we considered disjoint trees of depth 2 with 200 internal nodes. [sent-266, score-0.23]
85 The results show that learnt structure performs better than random structure and that tree loss optimization is superior to independent optimization. [sent-267, score-0.769]
86 The combination of Learnt Label Tree structure and Tree Loss Optimization for the label predictors is the only method that is comparable to or better than One-vs-Rest while being around 60× faster to compute at test time. [sent-270, score-0.462]
87 For ImageNet one could wonder how well using WordNet (a graph of human annotated label similarities) to build a tree would perform instead. [sent-271, score-0.928]
88 We constructed a matrix C for Algorithm 2 where Cij = 1 if there is an edge in the WordNet graph, and 0 otherwise, and used that to learn a label tree as before, obtaining 0. [sent-272, score-0.96]
89 This is better than a random tree but not as good as using the confusion matrix, implying that the best tree to use is the one adapted to the supervised task of interest. [sent-274, score-1.239]
90 Table 2 (Flat versus Tree Learning Results): test set accuracies for various tree and non-tree methods on three datasets. [sent-279, score-0.543]
91 Embedding and Embedding Tree Approaches. In Table 3 we compare several label embedding methods: (i) the convex and non-convex methods from Section 5; (ii) compressed sensing; and (iii) the label embedding tree from Section 3. [sent-313, score-2.164]
92 In all cases we fixed the embedding dimension d_e = 100. [sent-315, score-0.375]
93 The results show that the random embeddings given by compressed sensing are inferior to learnt embeddings, and that Non-Convex Embedding is superior to Sequential Convex Embedding, presumably because the overall loss, which depends on both W and V, is jointly optimized. [sent-316, score-0.612]
94 Note that we do not detail results on the product descriptions task because no speed-up is gained there from embedding, as the sparsity is already so high; however, the methods still gave good test accuracy. [sent-318, score-0.53]
95 Finally, we combine embedding and label tree learning using the “Label Embedding Tree” of Section 3. [sent-322, score-1.303]
96 Moreover, memory usage of this method (and other embedding methods) is significantly less than One-vs-Rest. [sent-324, score-0.414]
97 6 Conclusion. We have introduced an approach for fast multi-class classification that learns label embedding trees by (approximately) optimizing the overall tree loss. [sent-325, score-1.473]
98 Our approach obtained orders of magnitude speedup compared to One-vs-Rest while yielding as good or better accuracy, and outperformed other tree-based or embedding approaches. [sent-326, score-0.404]
99 Laplacian eigenmaps and spectral techniques for embedding and clustering. [sent-355, score-0.443]
100 The effects of training set size on decision tree complexity. [sent-456, score-0.543]
wordName wordTfidf (topN-words)
[('tree', 0.543), ('label', 0.385), ('embedding', 0.375), ('confusion', 0.153), ('mb', 0.138), ('cpt', 0.132), ('trees', 0.126), ('imagenet', 0.117), ('node', 0.113), ('embeddings', 0.108), ('ftree', 0.107), ('learnt', 0.105), ('labels', 0.085), ('remp', 0.081), ('loss', 0.079), ('testing', 0.074), ('classi', 0.072), ('textual', 0.07), ('kd', 0.07), ('depth', 0.07), ('wordnet', 0.069), ('classes', 0.068), ('filter', 0.066), ('compressed', 0.065), ('ro', 0.061), ('sensing', 0.061), ('fc', 0.059), ('yi', 0.056), ('leaf', 0.055), ('nl', 0.054), ('images', 0.053), ('grangier', 0.052), ('thousand', 0.051), ('ers', 0.05), ('dent', 0.05), ('parent', 0.049), ('class', 0.048), ('document', 0.047), ('multiclass', 0.046), ('argmax', 0.046), ('langford', 0.046), ('documents', 0.045), ('er', 0.045), ('relaxation', 0.045), ('product', 0.044), ('root', 0.044), ('overall', 0.044), ('prediction', 0.043), ('annotation', 0.043), ('predictors', 0.043), ('ddag', 0.043), ('jaguar', 0.043), ('symmetrized', 0.043), ('toad', 0.043), ('taxonomy', 0.042), ('balanced', 0.042), ('superior', 0.042), ('descriptions', 0.041), ('spectral', 0.039), ('embed', 0.039), ('nodes', 0.039), ('child', 0.039), ('none', 0.039), ('memory', 0.039), ('traverses', 0.038), ('flat', 0.038), ('accuracy', 0.036), ('convex', 0.036), ('image', 0.036), ('xi', 0.036), ('speed', 0.035), ('vi', 0.035), ('samy', 0.035), ('frog', 0.035), ('logarithmic', 0.034), ('disjoint', 0.034), ('test', 0.034), ('advertising', 0.032), ('learnable', 0.032), ('traverse', 0.032), ('validation', 0.032), ('children', 0.032), ('minimize', 0.032), ('learn', 0.032), ('bj', 0.031), ('cj', 0.031), ('proprietary', 0.031), ('categorization', 0.03), ('datasets', 0.03), ('dekel', 0.029), ('eigenmaps', 0.029), ('orders', 0.029), ('subject', 0.029), ('train', 0.029), ('grows', 0.029), ('beygelzimer', 0.028), ('arrives', 0.028), ('tiger', 0.028), ('predictor', 0.028), ('approaches', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
2 0.20416433 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
3 0.20294636 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
4 0.19803558 94 nips-2010-Feature Set Embedding for Incomplete Data
Author: David Grangier, Iain Melvin
Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. 1
5 0.17261559 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
Author: Umar Syed, Ben Taskar
Abstract: We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines. 1
6 0.16050199 144 nips-2010-Learning Efficient Markov Networks
7 0.13982476 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
8 0.13341242 177 nips-2010-Multitask Learning without Label Correspondences
9 0.125545 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
10 0.12305571 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
11 0.12178835 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
12 0.11881509 223 nips-2010-Rates of convergence for the cluster tree
13 0.11521532 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
14 0.10729864 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
15 0.10448495 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
16 0.10265755 280 nips-2010-Unsupervised Kernel Dimension Reduction
17 0.10265654 228 nips-2010-Reverse Multi-Label Learning
18 0.10236255 27 nips-2010-Agnostic Active Learning Without Constraints
19 0.10062923 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
20 0.098992713 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
topicId topicWeight
[(0, 0.272), (1, 0.103), (2, 0.081), (3, -0.131), (4, -0.106), (5, 0.01), (6, -0.062), (7, -0.057), (8, 0.041), (9, -0.039), (10, -0.001), (11, 0.077), (12, -0.051), (13, 0.091), (14, 0.093), (15, -0.156), (16, -0.055), (17, -0.109), (18, -0.144), (19, -0.184), (20, -0.067), (21, -0.092), (22, 0.106), (23, -0.288), (24, -0.002), (25, 0.106), (26, 0.024), (27, 0.169), (28, -0.057), (29, -0.095), (30, 0.116), (31, 0.023), (32, -0.054), (33, 0.059), (34, 0.066), (35, -0.008), (36, 0.068), (37, -0.067), (38, -0.021), (39, 0.125), (40, 0.104), (41, 0.017), (42, -0.034), (43, -0.09), (44, -0.004), (45, -0.055), (46, -0.027), (47, 0.031), (48, 0.052), (49, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.97579086 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
2 0.68707532 144 nips-2010-Learning Efficient Markov Networks
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
3 0.64396644 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristics. The superior performance of MDRTs are demonstrated on both synthetic and real datasets. 1
4 0.62340331 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
Author: Umar Syed, Ben Taskar
Abstract: We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines. 1
5 0.61760575 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
6 0.59317315 94 nips-2010-Feature Set Embedding for Incomplete Data
7 0.52611524 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
8 0.50419772 177 nips-2010-Multitask Learning without Label Correspondences
9 0.49405286 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
10 0.49012494 228 nips-2010-Reverse Multi-Label Learning
11 0.48434907 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
12 0.48308203 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
13 0.47678837 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
14 0.4756414 1 nips-2010-(RF)^2 -- Random Forest Random Field
15 0.45274663 108 nips-2010-Graph-Valued Regression
16 0.44419345 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
17 0.42566907 290 nips-2010-t-logistic regression
18 0.42543653 151 nips-2010-Learning from Candidate Labeling Sets
19 0.41124758 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
20 0.40201241 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
topicId topicWeight
[(13, 0.077), (17, 0.016), (27, 0.063), (30, 0.107), (35, 0.036), (45, 0.281), (48, 0.1), (50, 0.039), (52, 0.033), (60, 0.048), (77, 0.032), (78, 0.033), (90, 0.059)]
simIndex simValue paperId paperTitle
1 0.95166278 124 nips-2010-Inductive Regularized Learning of Kernel Functions
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
same-paper 2 0.95039952 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
3 0.93984491 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(H·R_n² + √(H·L*)·R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L* = 0) case and Õ(RH/n + √(L*·RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1
4 0.93779749 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
Author: Gowtham Bellala, Suresh Bhavnani, Clayton Scott
Abstract: Generalized Binary Search (GBS) is a well known greedy algorithm for identifying an unknown object while minimizing the number of “yes” or “no” questions posed about that object, and arises in problems such as active learning and active diagnosis. Here, we provide a coding-theoretic interpretation for GBS and show that GBS can be viewed as a top-down algorithm that greedily minimizes the expected number of queries required to identify an object. This interpretation is then used to extend GBS in two ways. First, we consider the case where the objects are partitioned into groups, and the objective is to identify only the group to which the object belongs. Then, we consider the case where the cost of identifying an object grows exponentially in the number of queries. In each case, we present an exact formula for the objective function involving Shannon or Rényi entropy, and develop a greedy algorithm for minimizing it. 1
5 0.93481588 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
Author: Umar Syed, Ben Taskar
Abstract: We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines. 1
6 0.93480635 290 nips-2010-t-logistic regression
7 0.93476814 63 nips-2010-Distributed Dual Averaging In Networks
8 0.93398786 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
9 0.93388534 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
10 0.93378812 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
11 0.93342346 27 nips-2010-Agnostic Active Learning Without Constraints
12 0.932643 222 nips-2010-Random Walk Approach to Regret Minimization
13 0.93176538 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
14 0.93158251 280 nips-2010-Unsupervised Kernel Dimension Reduction
15 0.93153667 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
16 0.93114585 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
17 0.93102324 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
18 0.9305588 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
19 0.93012559 265 nips-2010-The LASSO risk: asymptotic results and real world examples
20 0.92935532 196 nips-2010-Online Markov Decision Processes under Bandit Feedback