nips nips2012 nips2012-226 knowledge-graph by maker-knowledge-mining

226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications


Source: pdf

Author: Amit Daniely, Sivan Sabato, Shai Shalev-Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. [sent-2, score-0.241]

2 We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . [sent-5, score-0.432]

3 We analyze both the estimation error and the approximation error of these methods. [sent-6, score-0.372]

4 Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. [sent-8, score-0.462]

5 Introduction: In this work we consider multiclass prediction, the problem of classifying objects into one of several possible target classes. [sent-10, score-0.2]

6 The centrality of the multiclass learning problem has spurred the development of various approaches for tackling the task. [sent-23, score-0.2]

7 Perhaps the most straightforward approach is a reduction from multiclass classification to binary classification. [sent-24, score-0.32]

8 For example, the One-vs-All (OvA) method is based on a reduction of the multiclass problem into k binary problems, each of which discriminates between one class and all the rest of the classes. [sent-25, score-0.476]
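As an illustration of the OvA reduction just described, here is a minimal sketch (not from the paper): the synthetic data, the perceptron-style halfspace learner, and all function names are hypothetical stand-ins for whatever binary learner is actually used.

```python
import numpy as np

def train_halfspace(X, y, epochs=50):
    """Toy perceptron learner for one binary problem; labels y are in {-1, +1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append 1 to absorb the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * np.dot(w, xi) <= 0:          # mistake-driven update
                w += yi * xi
    return w

def train_ova(X, y, k):
    """One binary halfspace per class: class i vs. all the rest."""
    return [train_halfspace(X, np.where(y == i, 1, -1)) for i in range(k)]

def predict_ova(halfspaces, x):
    """Resolve the k binary votes by taking the largest margin."""
    xb = np.append(x, 1.0)
    return int(np.argmax([np.dot(w, xb) for w in halfspaces]))

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in centers])
y = np.repeat(np.arange(3), 20)
print(predict_ova(train_ova(X, y, k=3), np.array([3.8, 0.2])))  # typically prints 1
```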

9 A tree-based classifier (TC) is another reduction in which the prediction is obtained by traversing a binary tree, where at each node of the tree a binary classifier is used to decide on the rest of the path (see, for example, Beygelzimer et al.). [sent-32, score-0.358]

10 While halfspaces are our primary focus, many of our results hold for any underlying binary hypothesis class of VC dimension d + 1. [sent-37, score-0.637]

11 Other, more direct approaches to multiclass classification over Rd have also been proposed. [sent-38, score-0.2]

12 In this paper we analyze the Multiclass SVM (MSVM) formulation of Crammer and Singer [2001], in which each hypothesis is of the form hW(x) = argmaxi∈[k] (W x̄)i, where W is a k × (d + 1) matrix and (W x̄)i is the i-th element of the vector W x̄ ∈ Rk. [sent-41, score-0.282]
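To make the hypothesis form concrete, here is a minimal sketch of the prediction rule hW(x) = argmaxi (W x̄)i, where x̄ appends a constant 1 to x; the matrix W below is an arbitrary illustration, not a trained model.

```python
import numpy as np

def msvm_predict(W, x):
    """h_W(x) = argmax_i (W x_bar)_i, with x_bar = (x, 1) in R^(d+1)."""
    x_bar = np.append(x, 1.0)
    return int(np.argmax(W @ x_bar))     # W has shape k x (d+1)

W = np.array([[ 1.0,  0.0, -2.0],        # k = 3 classes, d = 2
              [-1.0,  1.0,  0.0],
              [ 0.0, -1.0,  1.0]])
print(msvm_predict(W, np.array([3.0, 0.5])))  # scores (1.0, -2.5, 0.5) -> class 0
```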

13 The error of a multiclass predictor h : Rd → [k] is defined to be the probability that h(x) ≠ y, where (x, y) is sampled from the underlying distribution D over Rd × [k], namely, Err(h) = P(x,y)∼D [h(x) ≠ y]. [sent-43, score-0.361]

14 More precisely, each method corresponds to a hypothesis class, H, which contains the multiclass predictors that may be returned by the method. [sent-46, score-0.461]

15 For example, the hypothesis class of MSVM is H = {x → argmaxi∈[k] (W x̄)i : W ∈ Rk×(d+1) }. [sent-47, score-0.307]

16 A learning algorithm receives a sample S = ((x1, y1), . . . , (xm, ym)) drawn i.i.d. according to D, and returns a multiclass predictor, which we denote by A(S) ∈ H. [sent-51, score-0.229]

17 A learning algorithm is called an Empirical Risk Minimizer (ERM) if it returns a hypothesis in H that minimizes the empirical error on the sample. [sent-52, score-0.343]

18 We denote by h∗ a hypothesis in H with minimal error, that is, h∗ ∈ argminh∈H Err(h). [sent-53, score-0.241]

19 When analyzing the error of A(S), it is convenient to decompose this error as a sum of the approximation error and the estimation error: Err(A(S)) = Err(h∗) + (Err(A(S)) − Err(h∗)). [sent-54, score-0.433]

20 • The approximation error is the minimum error achievable by a predictor in the hypothesis class, H. [sent-55, score-0.709]

21 The approximation error does not depend on the sample size, and is determined solely by the allowed hypothesis class. [sent-56, score-0.449]

22 • The estimation error of an algorithm is the difference between the error of the classifier that the algorithm chose based on the sample and the approximation error. [sent-57, score-0.331]

23 This error exists both for statistical reasons, since the sample may not be large enough to determine the best hypothesis, and for algorithmic reasons, since the learning algorithm may not output the best possible hypothesis given the sample. [sent-58, score-0.371]

24 For the ERM algorithm, the estimation error can be bounded from above by order of √(C(H)/m), where C(H) is a complexity measure of H (analogous to the VC dimension) and m is the sample size. [sent-59, score-0.22]
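A rough back-of-the-envelope illustration of this bound, ignoring constants and log factors; the complexity orders C(H) ≈ dk for OvA/TC/MSVM and C(H) ≈ dk² for AP are the ones reported later in this summary, and the specific values of d, k, and m are made up.

```python
import math

def estimation_error_bound(C, m):
    """Order-of-magnitude estimation error ~ sqrt(C(H)/m); values >= 1 are vacuous."""
    return math.sqrt(C / m)

d, k = 100, 50
for name, C in [("OvA/TC/MSVM, C ~ dk", d * k), ("All Pairs, C ~ dk^2", d * k * k)]:
    for m in (10_000, 1_000_000):
        print(f"{name:22s} m={m:9d}  bound ~ {estimation_error_bound(C, m):.3f}")
```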

25 A similar term also bounds the estimation error from below for any algorithm. [sent-60, score-0.184]

26 Thus C(H) is an estimate of the best achievable estimation error for the class. [sent-61, score-0.181]

27 Hence, in this case the difference in prediction performance between the two methods will be dominated by the approximation error and by the success of the learning algorithm in approaching the best possible estimation error. [sent-65, score-0.251]

28 For the approximation error we will provide even stronger results, by comparing the approximation error of classes for any distribution. [sent-68, score-0.45]

29 Note that, when comparing different hypothesis classes over the same distribution, the Bayes error is constant. [sent-71, score-0.433]

30 Given two hypothesis classes, H and H′, we say that H essentially contains H′ if for any distribution, the approximation error of H is at most the approximation error of H′. [sent-75, score-0.651]

31 H strictly contains H′ if, in addition, there is a distribution for which the approximation error of H is strictly smaller than that of H′. [sent-76, score-0.2]

32 • The estimation errors of OvA, MSVM, and TC are all roughly the same, in the sense that C(H) = Θ̃(dk) for all of the corresponding hypothesis classes. [sent-79, score-0.29]

33 • We prove that the hypothesis class of MSVM essentially contains the hypothesis classes of both OvA and TC. [sent-85, score-0.688]

34 We show that whenever d ≪ k, for any distribution D, with high probability over the choice of a random permutation, the approximation error of the resulting tree would be close to 1/2. [sent-91, score-0.304]

35 • We show that if d ≪ k, for any distribution D, the approximation error of ECOC with a randomly generated code matrix is likely to be close to 1/2. [sent-93, score-0.261]

36 • We show that the hypothesis class of AP essentially contains the hypothesis class of MSVM (hence also that of OvA and TC), and that there can be a substantial gap in the containment. [sent-94, score-0.664]

37 Therefore, as expected, the relative performance of AP and MSVM depends on the well-known trade-off between estimation error and approximation error. [sent-95, score-0.229]

38 When d is not much smaller than k (e.g. d = k), the prediction success of these methods can be similar, while TC has the advantage of having a testing run-time of d log(k), compared to the testing run-time of dk for OvA and MSVM. [sent-99, score-0.287]
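A quick arithmetic check of the test-time claim, with made-up values of d and k:

```python
import math

d, k = 100, 1024
tc_cost   = d * math.ceil(math.log2(k))  # one halfspace test per level of a balanced tree
flat_cost = d * k                        # OvA / MSVM evaluate all k halfspaces (or rows of W)
print(tc_cost, flat_cost)                # 1000 vs 102400: roughly a factor of k/log(k)
```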

39 Allwein et al. [2000] analyzed the multiclass error of ECOC as a function of the binary error. [sent-103, score-0.396]

40 Indeed, our analysis reveals that the underlying binary problems would be too hard if d ≪ k and the code is randomly generated. [sent-105, score-0.205]

41 Here again we show that the regret values of the underlying binary classifiers are likely to be very large whenever d ≪ k and the leaves of the tree are associated with labels in a random way. [sent-117, score-0.347]

42 Daniely et al. [2011] analyzed the properties of multiclass learning with various ERM learners, and also provided some bounds on the estimation error of multiclass SVM and of trees. [sent-123, score-0.584]

43 In this paper we improve these bounds, derive new bounds for other classes, and analyze the approximation error of the classes. [sent-124, score-0.254]

44 Definitions and Preliminaries: We first formally define the hypothesis classes that we analyze in this paper. [sent-125, score-0.372]

45 Though NP-hard in general, solving the ERM problem with respect to L can be done efficiently in the realizable case (namely, whenever there exists a hypothesis with zero empirical error on the sample). [sent-127, score-0.428]

46 Tree-based classifiers (TC): A tree-based multiclass classifier is a full binary tree whose leaves are associated with class labels and whose internal nodes are associated with binary classifiers. [sent-128, score-0.614]

47 Formally, a tree for k classes is a full binary tree T together with a bijection λ : leaf(T ) → [k], which associates a label to each of the leaves. [sent-133, score-0.455]

48 Given a mapping C : N(T) → H, define a multiclass predictor, hC : X → [k], by setting hC(x) = λ(v) where v is the last node of the root-to-leaf path v1, . . . [sent-137, score-0.2]

49 Also, let Htrees = ∪T HT, where the union is over all trees T for k classes. [sent-144, score-0.188]
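A minimal sketch of the prediction rule hC defined above; the node layout and the sign convention (go right on a positive halfspace score) are illustrative choices, not taken from the paper.

```python
import numpy as np

class Node:
    """Internal node: halfspace w (bias absorbed); leaf: class label lambda(v)."""
    def __init__(self, w=None, left=None, right=None, label=None):
        self.w, self.left, self.right, self.label = w, left, right, label

def tree_predict(root, x):
    """h_C(x) = lambda(v) for the leaf v reached by following the binary classifiers."""
    x_bar = np.append(x, 1.0)
    node = root
    while node.label is None:                 # still at an internal node
        node = node.right if np.dot(node.w, x_bar) > 0 else node.left
    return node.label

# a tree for k = 3 classes: first split {0} vs {1, 2}, then {1} vs {2}
leaf0, leaf1, leaf2 = Node(label=0), Node(label=1), Node(label=2)
inner = Node(w=np.array([0.0, 1.0, -2.0]), left=leaf1, right=leaf2)
root  = Node(w=np.array([1.0, 0.0, -2.0]), left=leaf0, right=inner)
print(tree_predict(root, np.array([5.0, 3.0])))  # goes right, right -> label 2
```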

50 If H is the class of linear separators over Rd , then for any tree T the ERM problem with respect to HT can be solved efficiently in the realizable case. [sent-145, score-0.251]

51 Given a code M, and the result of l binary classifiers represented by a vector u ∈ {−1, 1}l, the code selects a label via M̃ : {−1, 1}l → [k], defined by M̃(u) = argmaxi∈[k] Σj Mij uj. [sent-149, score-0.294]

52 Given binary classifiers h1, . . . , hl, one for each column in the code matrix, the code assigns to the instance x ∈ X the label M̃(h1(x), . . . , hl(x)). [sent-153, score-0.276]
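A small sketch of the decoding rule just described, M̃(u) = argmaxi Σj Mij uj; the code matrix M and the vector of binary outputs u below are arbitrary illustrations.

```python
import numpy as np

def ecoc_decode(M, u):
    """Pick the row of the code matrix that agrees most with the binary outputs u."""
    return int(np.argmax(M @ u))   # argmax_i sum_j M_ij * u_j

M = np.array([[ 1,  1, -1],        # k = 4 classes, l = 3 binary problems
              [ 1, -1,  1],
              [-1,  1,  1],
              [-1, -1, -1]])
u = np.array([1, -1, 1])           # outputs of the l binary classifiers on some x
print(ecoc_decode(M, u))           # row 1 matches u exactly -> class 1
```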

53 Then, the hypothesis class of AP is HAP = HMAP, where MAP denotes the All-Pairs code matrix. [sent-170, score-0.664]

54 Our analysis of the estimation error is based on results that bound the sample complexity of multiclass learning. [sent-171, score-0.42]

55 The first term on the right-hand side of (2) is the approximation error of H. [sent-176, score-0.18]

56 Therefore, the sample complexity is the number of examples required to ensure that the estimation error of A is at most ε (with high probability). [sent-177, score-0.22]

57 To bound the sample complexity of a hypothesis class we rely on upper and lower bounds on the sample complexity in terms of two generalizations of the VC dimension for multiclass problems, called the Graph dimension and the Natarajan dimension and denoted dG (H) and dN (H). [sent-179, score-0.819]

58 (Daniely et al. [2011]) For every hypothesis class H, and for every ERM rule, Ω((dN(H) + ln(1/δ))/ε²) ≤ mH(ε, δ) ≤ mERM(ε, δ) ≤ O((min{dN(H) ln(|Y|), dG(H)} + ln(1/δ))/ε²). We note that the constants in the O, Ω notations are universal. [sent-184, score-0.347]

59 In Section 3.1 we analyze the sample complexity of the different hypothesis classes. [sent-186, score-0.351]

60 We provide lower bounds on the Natarajan dimensions of the various hypothesis classes, thus concluding, in light of Theorem 2. [sent-187, score-0.274]

61 We also provide upper bounds on the graph dimensions of these hypothesis classes, yielding, by the same theorem, an upper bound on the estimation error of ERM. [sent-189, score-0.445]

62 In Section 3.2 we analyze the approximation error of the different hypothesis classes. [sent-191, score-0.462]

63 These methods rely on an underlying hypothesis class of binary classifiers. [sent-201, score-0.431]

64 While our main focus is the case in which the binary hypothesis class is halfspaces over Rd, the upper bounds on the sample complexity we derive below hold for any binary hypothesis class of VC dimension d + 1. [sent-202, score-1.11]

65 For every binary hypothesis class of VC dimension d + 1, and for any tree T , dG (HT ) ≤ dG (Htrees ) ≤ O(dk log(dk)). [sent-205, score-0.566]

66 If the underlying hypothesis class is halfspaces over Rd , then also d(k − 1) ≤ dN (HT ) ≤ dG (HT ) ≤ dG (Htrees ) ≤ O(dk log(dk)). [sent-206, score-0.496]

67 Further, it was shown that if H is the set of halfspaces over Rd, then Ω(dk log(k)) ≤ dN(HT). [sent-211, score-0.424]

68 For every M ∈ Rk×l and every binary hypothesis class of VC dimension d, dG (HM ) ≤ O(dl log(dl)). [sent-215, score-0.488]

69 Moreover, if M ∈ {±1}k×l and the underlying hypothesis class is halfspaces over Rd , then d · δ(M )/2 ≤ dN (HM ) ≤ dG (HM ) ≤ O(dl log(dl)) . [sent-216, score-0.496]

70 For any binary hypothesis class of VC dimension d, dG(HOvA) ≤ O(dk log(dk)) and dG(HAP) ≤ O(dk² log(dk)). [sent-221, score-0.448]

71 If the underlying hypothesis class is halfspaces over Rd, we also have: d(k − 1) ≤ dN(HOvA) ≤ dG(HOvA) ≤ O(dk log(dk)). [sent-222, score-0.496]

72 Approximation error: We first show that the class L essentially contains HOvA and HT for any tree T, assuming, of course, that H is the class of halfspaces in Rd. [sent-224, score-0.541]

73 Any embedding into a higher dimension that allows HOvA or HT (for some tree T for k classes) to essentially contain L necessarily embeds into a dimension of at least Ω̃(dk). [sent-234, score-0.222]

74 The next theorem shows that the approximation error of AP is better than that of MSVM (and hence also better than OvA and TC). [sent-235, score-0.207]

75 This is expected as the sample complexity of AP is considerably higher, and therefore we face the usual trade-off between approximation and estimation error. [sent-236, score-0.196]

76 Let H ⊆ {±1}X be any hypothesis class of VC-dimension d, let µ ∈ (0, 1/2], and let D be any distribution over X × [k] such that ∀i, P(x,y)∼D (y = i) ≤ 10/k. [sent-249, score-0.307]

77 Then, for any ν > 0, if k ≥ C · (d + ln(1/δ))/ν², then with probability of at least 1 − δ over the choice of φ, the approximation error of H with respect to Dφ will be at least µ − ν. [sent-252, score-0.18]

78 Let (T, λ) be a tree for k classes such that λ : leaf(T ) → [k] is chosen uniformly at random. [sent-258, score-0.188]

79 Let H ⊆ {±1}X be a hypothesis class of VC-dimension d, let ν > 0, and let D be any distribution over X × [k] such that ∀i, P(x,y)∼D (y = i) ≤ 10/k. [sent-260, score-0.307]

80 Then, for k ≥ C · (d + ln(1/δ))/ν², with probability of at least 1 − δ over the choice of λ, the approximation error of HT with respect to D is at least µ − ν. [sent-261, score-0.18]

81 Let H ⊆ {±1}X be a hypothesis class of VC-dimension d, let ν > 0, and let D be any distribution over X × [k] such that ∀i, P(x,y)∼D (y = i) ≤ 10/k. [sent-266, score-0.415]

82 Then, for k ≥ C · (dl log(dl) + ln(1/δ))/ν², with probability of at least 1 − δ over the choice of λ, the approximation error of HM with respect to D is at least 1/2 − ν. [sent-267, score-0.18]

83 Note that the first corollary holds even if only the top level of the binary tree is balanced and splits the labels randomly to the left and the right sub-trees. [sent-268, score-0.271]

84 Moreover, most current theoretical analyses of ECOC estimate the error of the learned multiclass hypothesis in terms of the average error of the binary classifiers. [sent-276, score-0.739]

85 It is not hard to see that for every φ : [d + 1] → {±1}, the approximation error of the class of halfspaces with respect to Dφ is zero. [sent-285, score-0.425]

86 Thus, in order to ensure a large approximation error for every distribution, the number of classes must be at least linear in the dimension, so in this sense, the lemma is tight. [sent-286, score-0.317]

87 7 The technique for the lower bound on dN (L(W)) when W is the class of halfspaces in Rd is more involved, and quite general. [sent-309, score-0.225]

88 We consider a binary hypothesis class G ⊆ {±1}[d]×[l] which consists of functions having an arbitrary behaviour over [d] × {i}, and a very uniform behaviour on other inputs (such as mapping all other inputs to a constant). [sent-310, score-0.401]

89 Finally, we show that the class of halfspaces is richer than G, in the sense that the inputs to G can be mapped to points in Rd such that the functions of G can be mapped to halfspaces. [sent-313, score-0.225]

90 To prove the approximation error lower bounds stated in Section 3. [sent-315, score-0.239]

91 For these hypotheses, we show that with high probability over the choice of label mapping, the approximation error on the sample is high. [sent-319, score-0.246]

92 A union bound on the finite set of possible hypotheses shows that the approximation error on the distribution will be high, with high probability over the choice of the label mapping. [sent-320, score-0.25]

93 This is certainly true if the hypothesis class of MSVM, L, has a zero approximation error (the realizable case), since the ERM is then solvable with respect to L. [sent-322, score-0.568]

94 Nonetheless, for each method there are reasonable heuristics to approximate the ERM, which should work well when the approximation error is small. [sent-326, score-0.205]

95 However, variations in the optimality of algorithms for different hypothesis classes should also be taken into account in this analysis. [sent-328, score-0.331]

96 Finally, when the number of examples is much larger than dk², the analysis implies that it is better to choose the AP approach. [sent-331, score-0.265]

97 In the leftmost graph below, there are two classes in R2 , and the approximation error of all algorithms is zero. [sent-333, score-0.29]

98 Here, both MSVM and OvA have a zero approximation error, but the error of TC and of ECOC with a random code will most likely be large. [sent-335, score-0.261]

99 Reducing multiclass to binary: A unifying approach for margin classifiers. [sent-348, score-0.2]

100 On the algorithmic implementation of multiclass kernel-based vector machines. [sent-375, score-0.2]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('msvm', 0.418), ('ecoc', 0.337), ('ova', 0.289), ('dk', 0.265), ('hypothesis', 0.241), ('tc', 0.237), ('multiclass', 0.2), ('dg', 0.178), ('halfspaces', 0.159), ('hova', 0.146), ('erm', 0.13), ('vc', 0.125), ('dn', 0.118), ('err', 0.111), ('daniely', 0.109), ('hap', 0.109), ('dl', 0.108), ('ht', 0.107), ('ap', 0.106), ('error', 0.102), ('tree', 0.098), ('binary', 0.094), ('classes', 0.09), ('allwein', 0.089), ('code', 0.081), ('hm', 0.08), ('approximation', 0.078), ('hl', 0.076), ('rd', 0.074), ('htrees', 0.073), ('hw', 0.07), ('classi', 0.066), ('class', 0.066), ('realizable', 0.059), ('ers', 0.058), ('crammer', 0.055), ('inclusions', 0.055), ('estimation', 0.049), ('dimension', 0.047), ('rk', 0.046), ('ln', 0.045), ('er', 0.045), ('argmaxi', 0.044), ('beygelzimer', 0.044), ('codes', 0.042), ('natarajan', 0.042), ('complexity', 0.041), ('analyze', 0.041), ('rumelhart', 0.04), ('hc', 0.038), ('label', 0.038), ('regret', 0.037), ('bijection', 0.037), ('leaf', 0.036), ('leaves', 0.035), ('bounds', 0.033), ('sabato', 0.032), ('hypotheses', 0.032), ('singer', 0.03), ('achievable', 0.03), ('rifkin', 0.03), ('underlying', 0.03), ('essentially', 0.03), ('predictor', 0.029), ('sample', 0.028), ('separators', 0.028), ('corollary', 0.027), ('weston', 0.027), ('labels', 0.027), ('lemma', 0.027), ('theorem', 0.027), ('whenever', 0.026), ('svm', 0.026), ('reduction', 0.026), ('break', 0.026), ('stated', 0.026), ('traverse', 0.025), ('balanced', 0.025), ('heuristics', 0.025), ('dietterich', 0.024), ('jerusalem', 0.024), ('et', 0.024), ('ma', 0.024), ('correcting', 0.024), ('learnability', 0.024), ('corollaries', 0.024), ('hebrew', 0.024), ('log', 0.023), ('amit', 0.023), ('certainly', 0.022), ('strict', 0.022), ('prediction', 0.022), ('ld', 0.022), ('mh', 0.022), ('symmetry', 0.021), ('shai', 0.021), ('vi', 0.021), ('graph', 0.02), ('every', 0.02), ('contains', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai Shalev-Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1

2 0.14033276 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

3 0.12736414 227 nips-2012-Multiclass Learning with Simplex Coding

Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine

Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1

4 0.11347025 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

Author: Koby Crammer, Yishay Mansour

Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1

5 0.11030393 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

Author: Chi Jin, Liwei Wang

Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.

6 0.10937089 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

7 0.09769278 271 nips-2012-Pointwise Tracking the Optimal Regression Function

8 0.087379634 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

9 0.087025478 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

10 0.081062734 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

11 0.079159699 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

12 0.076837748 291 nips-2012-Reducing statistical time-series problems to binary classification

13 0.070843831 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

14 0.069650985 182 nips-2012-Learning Networks of Heterogeneous Influence

15 0.063652612 242 nips-2012-Non-linear Metric Learning

16 0.060476832 180 nips-2012-Learning Mixtures of Tree Graphical Models

17 0.059395757 344 nips-2012-Timely Object Recognition

18 0.059256837 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

19 0.058408339 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

20 0.057575185 195 nips-2012-Learning visual motion in recurrent neural networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.165), (1, 0.017), (2, 0.017), (3, -0.029), (4, 0.079), (5, -0.043), (6, 0.007), (7, 0.111), (8, -0.122), (9, -0.008), (10, 0.061), (11, 0.053), (12, 0.072), (13, -0.006), (14, -0.043), (15, -0.034), (16, -0.072), (17, 0.067), (18, -0.014), (19, 0.115), (20, -0.097), (21, 0.078), (22, 0.014), (23, 0.051), (24, -0.01), (25, -0.074), (26, -0.039), (27, -0.036), (28, -0.013), (29, -0.104), (30, -0.017), (31, 0.011), (32, -0.081), (33, -0.113), (34, -0.007), (35, 0.008), (36, -0.093), (37, 0.016), (38, -0.056), (39, -0.056), (40, 0.016), (41, -0.012), (42, -0.027), (43, -0.018), (44, -0.018), (45, 0.036), (46, -0.074), (47, -0.06), (48, -0.044), (49, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94309801 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai Shalev-Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1

2 0.76505059 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

Author: Chi Jin, Liwei Wang

Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.

3 0.71229923 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

Author: Koby Crammer, Yishay Mansour

Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1

4 0.70948225 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

5 0.67476535 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

Author: Wei Bi, James T. Kwok

Abstract: In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. 1

6 0.6675418 271 nips-2012-Pointwise Tracking the Optimal Regression Function

7 0.65982431 361 nips-2012-Volume Regularization for Binary Classification

8 0.61550933 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

9 0.609519 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

10 0.58929402 227 nips-2012-Multiclass Learning with Simplex Coding

11 0.58532751 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

12 0.57722235 291 nips-2012-Reducing statistical time-series problems to binary classification

13 0.5496037 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

14 0.54254407 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

15 0.53542703 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

16 0.52163613 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

17 0.50128925 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

18 0.49408707 215 nips-2012-Minimizing Uncertainty in Pipelines

19 0.49053553 197 nips-2012-Learning with Recursive Perceptual Representations

20 0.47823313 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.02), (21, 0.022), (30, 0.036), (32, 0.016), (38, 0.182), (39, 0.014), (42, 0.057), (45, 0.018), (54, 0.013), (55, 0.013), (74, 0.045), (76, 0.114), (80, 0.084), (87, 0.02), (89, 0.158), (92, 0.056), (98, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88110238 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

Author: Borja Balle, Mehryar Mohri

Abstract: Many tasks in text and speech processing and computational biology require estimating functions mapping strings to real numbers. A broad class of such functions can be defined by weighted automata. Spectral methods based on the singular value decomposition of a Hankel matrix have been recently proposed for learning a probability distribution represented by a weighted automaton from a training sample drawn according to this same target distribution. In this paper, we show how spectral methods can be extended to the problem of learning a general weighted automaton from a sample generated by an arbitrary distribution. The main obstruction to this approach is that, in general, some entries of the Hankel matrix may be missing. We present a solution to this problem based on solving a constrained matrix completion problem. Combining these two ingredients, matrix completion and spectral method, a whole new family of algorithms for learning general weighted automata is obtained. We present generalization bounds for a particular algorithm in this family. The proofs rely on a joint stability analysis of matrix completion and spectral learning. 1

same-paper 2 0.86845404 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai Shalev-Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1

3 0.86467671 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

Author: Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, David Haglin

Abstract: Large-scale ℓ1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for ℓ1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale ℓ1-regularization problems. 1

4 0.81892395 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

Author: Elad Hazan, Zohar Karnin

Abstract: We present a simplex algorithm for linear programming in a linear classification formulation. The paramount complexity parameter in linear classification problems is called the margin. We prove that for margin values of practical interest our simplex variant performs a polylogarithmic number of pivot steps in the worst case, and its overall running time is near linear. This is in contrast to general linear programming, for which no sub-polynomial pivot rule is known. 1

5 0.81544238 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

Author: Aharon Birnbaum, Shai Shalev-Shwartz

Abstract: Given α, ϵ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α)L∗γ + ϵ, where L∗γ is the optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in time exp(Õ(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model. 1

6 0.81513488 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

7 0.81460822 227 nips-2012-Multiclass Learning with Simplex Coding

8 0.81183952 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

9 0.80988812 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

10 0.80801678 358 nips-2012-Value Pursuit Iteration

11 0.80768156 292 nips-2012-Regularized Off-Policy TD-Learning

12 0.80765557 187 nips-2012-Learning curves for multi-task Gaussian process regression

13 0.80764395 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

14 0.80726546 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

15 0.80687732 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

16 0.80585462 267 nips-2012-Perceptron Learning of SAT

17 0.80585343 324 nips-2012-Stochastic Gradient Descent with Only One Projection

18 0.80447012 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

19 0.8039 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

20 0.80356705 285 nips-2012-Query Complexity of Derivative-Free Optimization