nips nips2012 nips2012-226 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz
Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The Hebrew University, Jerusalem, Israel. Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. [sent-2, score-0.241]
2 We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . [sent-5, score-0.432]
3 We analyze both the estimation error and the approximation error of these methods. [sent-6, score-0.372]
4 Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. [sent-8, score-0.462]
5 1 Introduction: In this work we consider multiclass prediction, the problem of classifying objects into one of several possible target classes. [sent-10, score-0.2]
6 The centrality of the multiclass learning problem has spurred the development of various approaches for tackling the task. [sent-23, score-0.2]
7 Perhaps the most straightforward approach is a reduction from multiclass classification to binary classification. [sent-24, score-0.32]
8 For example, the One-vs-All (OvA) method is based on a reduction of the multiclass problem into k binary problems, each of which discriminates between one class and all the rest of the classes. [sent-25, score-0.476]
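A minimal sketch of the OvA reduction, for concreteness; the choice of logistic regression as the underlying binary learner is an illustrative assumption, since the analysis here is stated for an arbitrary binary class of VC dimension d + 1 (e.g. halfspaces):

```python
# One-vs-All sketch: train k binary classifiers (class i vs. the rest) and predict
# the class whose classifier is most confident. Logistic regression stands in for
# an arbitrary binary hypothesis class such as halfspaces.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ova(X, y, k):
    return [LogisticRegression(max_iter=1000).fit(X, (y == i).astype(int)) for i in range(k)]

def predict_ova(classifiers, X):
    scores = np.column_stack([c.decision_function(X) for c in classifiers])
    return scores.argmax(axis=1)
```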
9 A tree-based classifier (TC) is another reduction in which the prediction is obtained by traversing a binary tree, where at each node of the tree a binary classifier is used to decide on the rest of the path (see for example Beygelzimer et al. [sent-32, score-0.358]
10 While halfspaces are our primary focus, many of our results hold for any underlying binary hypothesis class of VC dimension d + 1. [sent-37, score-0.637]
11 Other, more direct approaches to multiclass classification over Rd have also been proposed. [sent-38, score-0.2]
12 In this paper we analyze the Multiclass SVM (MSVM) formulation of Crammer and Singer [2001], in which each hypothesis is of the form hW (x) = argmaxi∈[k] (W x̄)i , where W is a k × (d + 1) matrix and (W x̄)i is the i'th element of the vector W x̄ ∈ Rk . [sent-41, score-0.282]
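A hedged illustration of this hypothesis form; training of W (e.g. by the Crammer-Singer optimization) is not shown, and the reading of x̄ as x with an appended constant coordinate is an assumption suggested by the k × (d + 1) shape of W:

```python
# MSVM-style prediction h_W(x) = argmax_i (W xbar)_i with W of shape (k, d+1).
import numpy as np

def msvm_predict(W, X):
    Xbar = np.hstack([X, np.ones((X.shape[0], 1))])  # append the constant coordinate
    return (Xbar @ W.T).argmax(axis=1)
```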
13 The error of a multiclass predictor h : Rd → [k] is defined to be the probability that h(x) ≠ y, where (x, y) is sampled from the underlying distribution D over Rd × [k], namely, Err(h) = P(x,y)∼D [h(x) ≠ y]. [sent-43, score-0.361]
14 More precisely, each method corresponds to a hypothesis class, H, which contains the multiclass predictors that may be returned by the method. [sent-46, score-0.461]
15 For example, the hypothesis class of MSVM is H = {x → argmaxi∈[k] (W x̄)i : W ∈ Rk×(d+1) }. [sent-47, score-0.307]
16 A learning algorithm receives a sample S = ((x1 , y1 ), . . . , (xm , ym )) drawn i.i.d. according to D, and returns a multiclass predictor which we denote by A(S) ∈ H. [sent-51, score-0.229]
17 A learning algorithm is called an Empirical Risk Minimizer (ERM) if it returns a hypothesis in H that minimizes the empirical error on the sample. [sent-52, score-0.343]
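A minimal sketch of the ERM rule; restricting to a finite pool of candidate predictors is an illustrative simplification of the classes H considered here:

```python
# ERM over a finite pool: return the hypothesis with the smallest empirical error
# on the sample.
import numpy as np

def empirical_error(h, X, y):
    return np.mean(h(X) != y)

def erm(hypotheses, X, y):
    return min(hypotheses, key=lambda h: empirical_error(h, X, y))
```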
18 We denote by h∗ a hypothesis in H with minimal error, that is, h∗ ∈ argminh∈H Err(h). [sent-53, score-0.241]
19 When analyzing the error of A(S), it is convenient to decompose this error as a sum of approximation error and estimation error: Err(A(S)) = Err(h∗ ) + (Err(A(S)) − Err(h∗ )). (1) [sent-54, score-0.433]
20 • The approximation error is the minimum error achievable by a predictor in the hypothesis class, H. [sent-55, score-0.709]
21 The approximation error does not depend on the sample size, and is determined solely by the allowed hypothesis class. [sent-56, score-0.449]
22 • The estimation error of an algorithm is the difference between the error of the classifier that the algorithm chose based on the sample and the approximation error. [sent-57, score-0.331]
23 This error exists both for statistical reasons, since the sample may not be large enough to determine the best hypothesis, and for algorithmic reasons, since the learning algorithm may not output the best possible hypothesis given the sample. [sent-58, score-0.371]
24 For the ERM algorithm, the estimation error can be bounded from above by order of √(C(H)/m), where C(H) is a complexity measure of H (analogous to the VC dimension) and m is the sample size. [sent-59, score-0.22]
25 A similar term also bounds the estimation error from below for any algorithm. [sent-60, score-0.184]
26 Thus C(H) is an estimate of the best achievable estimation error for the class. [sent-61, score-0.181]
27 Hence, in this case the difference in prediction performance between the two methods will be dominated by the approximation error and by the success of the learning algorithm in approaching the best possible estimation error. [sent-65, score-0.251]
28 For the approximation error we will provide even stronger results, by comparing the approximation error of classes for any distribution. [sent-68, score-0.45]
29 Note that, when comparing different hypothesis classes over the same distribution, the Bayes error is constant. [sent-71, score-0.433]
30 Given two hypothesis classes, H and H′ , we say that H essentially contains H′ if for any distribution, the approximation error of H is at most the approximation error of H′ . [sent-75, score-0.651]
31 H strictly contains H′ if, in addition, there is a distribution for which the approximation error of H is strictly smaller than that of H′ . [sent-76, score-0.2]
32 • The estimation errors of OvA, MSVM, and TC are all roughly the same, in the sense that C(H) = Θ̃(dk) for all of the corresponding hypothesis classes. [sent-79, score-0.29]
33 • We prove that the hypothesis class of MSVM essentially contains the hypothesis classes of both OvA and TC. [sent-85, score-0.688]
34 We show that whenever d ≪ k, for any distribution D, with high probability over the choice of a random permutation, the approximation error of the resulting tree would be close to 1/2. [sent-91, score-0.304]
35 • We show that if d ≪ k, for any distribution D, the approximation error of ECOC with a randomly generated code matrix is likely to be close to 1/2. [sent-93, score-0.261]
36 • We show that the hypothesis class of AP essentially contains the hypothesis class of MSVM (hence also that of OvA and TC), and that there can be a substantial gap in the containment. [sent-94, score-0.664]
37 Therefore, as expected, the relative performance of AP and MSVM depends on the well-known trade-off between estimation error and approximation error. [sent-95, score-0.229]
38 In such cases (e.g., when d = k), the prediction success of these methods can be similar, while TC has the advantage of having a testing run-time of d log(k), compared to the testing run-time of dk for OvA and MSVM. [sent-99, score-0.287]
39 Allwein et al. [2000] analyzed the multiclass error of ECOC as a function of the binary error. [sent-103, score-0.396]
40 Indeed, our analysis reveals that the underlying binary problems would be too hard if d ≪ k and the code is randomly generated. [sent-105, score-0.205]
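A small simulation sketch of this phenomenon. The point-mass classes, the single random ±1 column, and the use of logistic regression as a proxy for the best halfspace are illustrative assumptions rather than the construction used here; the observed binary error should nonetheless be close to 1/2 when d ≪ k:

```python
# When d << k and the classes are split into +1/-1 at random (one column of a random
# code matrix), the induced binary problem is nearly unseparable by halfspaces.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, k = 5, 2000
prototypes = rng.standard_normal((k, d))   # one point mass per class
signs = rng.choice([-1, 1], size=k)        # a random +-1 label per class

clf = LogisticRegression(max_iter=1000).fit(prototypes, signs)
err = np.mean(clf.predict(prototypes) != signs)
print(f"training error of the random dichotomy: {err:.2f}")  # typically close to 0.5
```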
41 Here again we show that the regret values of the underlying binary classifiers are likely to be very large whenever d ≪ k and the leaves of the tree are associated with labels in a random way. [sent-117, score-0.347]
42 Daniely et al. [2011] analyzed the properties of multiclass learning with various ERM learners, and also provided some bounds on the estimation error of multiclass SVM and of trees. [sent-123, score-0.584]
43 In this paper we improve these bounds, derive new bounds for other classes, and also analyze the approximation error of the classes. [sent-124, score-0.254]
44 2 Definitions and Preliminaries We first formally define the hypothesis classes that we analyze in this paper. [sent-125, score-0.372]
45 Though NP-hard in general, solving the ERM problem with respect to L can be done efficiently in the realizable case (namely, whenever there exists a hypothesis with zero empirical error on the sample). [sent-127, score-0.428]
46 Tree-based classifiers (TC): A tree-based multiclass classifier is a full binary tree whose leaves are associated with class labels and whose internal nodes are associated with binary classifiers. [sent-128, score-0.614]
47 Formally, a tree for k classes is a full binary tree T together with a bijection λ : leaf(T ) → [k], which associates a label to each of the leaves. [sent-133, score-0.455]
48 Given a mapping C : N (T ) → H, define a multiclass predictor, hC : X → [k], by setting hC (x) = λ(v), where v is the last node of the root-to-leaf path v1 , . . . that starts at the root and, at each internal node vi , continues to the child determined by the prediction of the binary classifier C(vi ) on x. [sent-137, score-0.2]
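A minimal sketch of prediction with a tree-based classifier; the Leaf/Node containers are illustrative, and each internal node holds a binary classifier C(v) returning ±1:

```python
# Tree-based classifier (TC) prediction: traverse from the root, querying the binary
# classifier at each internal node, until a leaf (and hence a label) is reached.
class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, classifier, left, right):
        self.classifier = classifier   # callable mapping x to -1 or +1
        self.left, self.right = left, right

def tc_predict(tree, x):
    node = tree
    while isinstance(node, Node):
        node = node.left if node.classifier(x) < 0 else node.right
    return node.label
```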
49 Also, let Htrees = ∪T HT , where the union is taken over all trees T for k classes. [sent-144, score-0.188]
50 If H is the class of linear separators over Rd , then for any tree T the ERM problem with respect to HT can be solved efficiently in the realizable case. [sent-145, score-0.251]
51 Given a code M , and the result of the l binary classifiers represented by a vector u ∈ {−1, 1}l , the code selects a label via M̃ : {−1, 1}l → [k], defined by M̃ (u) = λ(arg maxi∈[k] Σj Mij uj ). [sent-149, score-0.294]
52 Given binary classifiers h1 , . . . , hl , one for each column in the code matrix, the code assigns to the instance x ∈ X the label M̃ (h1 (x), . . . , hl (x)). [sent-153, score-0.276]
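A sketch of ECOC decoding with a code matrix M ∈ {−1, 1}k×l; the classifiers hj are assumed to be given (e.g. halfspaces trained on the l induced binary problems), and the label mapping λ from the definition above is taken to be the identity:

```python
# ECOC decoding: apply the l binary classifiers and output the row of M with the
# largest inner product with the resulting +-1 vector (the best-matching codeword).
import numpy as np

def ecoc_predict(M, binary_classifiers, x):
    u = np.array([h(x) for h in binary_classifiers])  # vector in {-1, +1}^l
    return int(np.argmax(M @ u))
```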
53 Then, the hypothesis class of AP is HAP = HMAP , where MAP is the All-Pairs code matrix. [sent-170, score-0.307]
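For completeness, a sketch of the standard All-Pairs code matrix in the Allwein et al. framework; the ternary {−1, 0, +1} code with k(k−1)/2 columns is the usual construction and is assumed here to match MAP:

```python
# All-Pairs code matrix: column (i, j) labels class i as +1, class j as -1, and
# marks every other class as 0 (ignored by that binary problem).
import numpy as np
from itertools import combinations

def all_pairs_code(k):
    pairs = list(combinations(range(k), 2))
    M = np.zeros((k, len(pairs)), dtype=int)
    for col, (i, j) in enumerate(pairs):
        M[i, col], M[j, col] = 1, -1
    return M
```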
54 Our analysis of the estimation error is based on results that bound the sample complexity of multiclass learning. [sent-171, score-0.42]
55 The first term on the right-hand side of (2) is the approximation error of H. [sent-176, score-0.18]
56 Therefore, the sample complexity is the number of examples required to ensure that the estimation error of A is at most ε (with high probability). [sent-177, score-0.22]
57 To bound the sample complexity of a hypothesis class we rely on upper and lower bounds on the sample complexity in terms of two generalizations of the VC dimension for multiclass problems, called the Graph dimension and the Natarajan dimension and denoted dG (H) and dN (H). [sent-179, score-0.819]
58 (Daniely et al. [2011]) For every hypothesis class H, and for every ERM rule, Ω((dN (H) + ln(1/δ))/ε²) ≤ mH (ε, δ) ≤ mERM (ε, δ) ≤ O((min{dN (H) ln(|Y|), dG (H)} + ln(1/δ))/ε²). We note that the constants in the O, Ω notations are universal. [sent-184, score-0.347]
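To make the use of this bound concrete, the following schematic instantiation plugs in a class over Y = [k] with Natarajan dimension of order dk, as established below for MSVM, OvA and TC; the constants and exact logarithmic factors are illustrative:

```latex
% Schematic instantiation of the sample-complexity bound for d_N(H) = Theta(dk), |Y| = k.
\[
  m_{\mathrm{ERM}}(\varepsilon,\delta)
    \le O\!\left(\frac{dk\,\ln k + \ln(1/\delta)}{\varepsilon^{2}}\right),
  \qquad
  m_{H}(\varepsilon,\delta)
    \ge \Omega\!\left(\frac{dk + \ln(1/\delta)}{\varepsilon^{2}}\right).
\]
```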
59 In Section 3.1 we analyze the sample complexity of the different hypothesis classes. [sent-186, score-0.351]
60 We provide lower bounds on the Natarajan dimensions of the various hypothesis classes, thus concluding, in light of Theorem 2.1, corresponding lower bounds on the sample complexity of learning these classes. [sent-187, score-0.274]
61 We also provide upper bounds on the graph dimensions of these hypothesis classes, yielding, by the same theorem, an upper bound on the estimation error of ERM. [sent-189, score-0.445]
62 In Section 3.2 we analyze the approximation error of the different hypothesis classes. [sent-191, score-0.462]
63 These methods rely on an underlying hypothesis class of binary classifiers. [sent-201, score-0.431]
64 While our main focus is the case in which the binary hypothesis class is halfspaces over Rd , the upper bounds on the sample complexity we derive below hold for any binary hypothesis class of VC dimension d + 1. [sent-202, score-1.11]
65 For every binary hypothesis class of VC dimension d + 1, and for any tree T , dG (HT ) ≤ dG (Htrees ) ≤ O(dk log(dk)). [sent-205, score-0.566]
66 If the underlying hypothesis class is halfspaces over Rd , then also d(k − 1) ≤ dN (HT ) ≤ dG (HT ) ≤ dG (Htrees ) ≤ O(dk log(dk)). [sent-206, score-0.496]
67 Further, it was shown that if H is the set of halfspaces over Rd , then Ω(dk log(k)) ≤ dN (HT ). [sent-211, score-0.424]
68 For every M ∈ Rk×l and every binary hypothesis class of VC dimension d, dG (HM ) ≤ O(dl log(dl)). [sent-215, score-0.488]
69 Moreover, if M ∈ {±1}k×l and the underlying hypothesis class is halfspaces over Rd , then d · δ(M )/2 ≤ dN (HM ) ≤ dG (HM ) ≤ O(dl log(dl)) . [sent-216, score-0.496]
70 For any binary hypothesis class of VC dimension d, dG (HOvA ) ≤ O(dk log(dk)) and dG (HAP ) ≤ O(dk² log(dk)). [sent-221, score-0.448]
71 If the underlying hypothesis class is halfspaces over Rd we also have: d(k − 1) ≤ dN (HOvA ) ≤ dG (HOvA ) ≤ O(dk log(dk)). [sent-222, score-0.496]
72 Approximation error: We first show that the class L essentially contains HOvA and HT for any tree T , assuming, of course, that H is the class of halfspaces in Rd . [sent-224, score-0.541]
73 Any embedding into a higher dimension that allows HOvA or HT (for some tree T for k classes) to essentially contain L necessarily embeds into a dimension of at least Ω̃(dk). [sent-234, score-0.222]
74 The next theorem shows that the approximation error of AP is better than that of MSVM (and hence also better than OvA and TC). [sent-235, score-0.207]
75 This is expected as the sample complexity of AP is considerably higher, and therefore we face the usual trade-off between approximation and estimation error. [sent-236, score-0.196]
76 Let H ⊆ {±1}X be any hypothesis class of VC-dimension d, let µ ∈ (0, 1/2], and let D be any distribution over X × [k] such that ∀i, P(x,y)∼D (y = i) ≤ 10/k. [sent-249, score-0.307]
77 Then, for any ν > 0, if k ≥ C · (d + ln(1/δ))/ν², then with probability of at least 1 − δ over the choice of φ, the approximation error of H with respect to Dφ will be at least µ − ν. [sent-252, score-0.18]
78 Let (T, λ) be a tree for k classes such that λ : leaf(T ) → [k] is chosen uniformly at random. [sent-258, score-0.188]
79 Let H ⊆ {±1}X be a hypothesis class of VC-dimension d, let ν > 0, and let D be any distribution over X × [k] such that ∀i, P(x,y)∼D (y = i) ≤ 10/k. [sent-260, score-0.307]
80 Then, for k ≥ C · (d + ln(1/δ))/ν², with probability of at least 1 − δ over the choice of λ, the approximation error of HT with respect to D is at least µ − ν. [sent-261, score-0.18]
81 Let H ⊆ {±1}X be a hypothesis class of VC-dimension d, let ν > 0, and let D be any distribution over X × [k] such that ∀i, P(x,y)∼D (y = i) ≤ 10/k. [sent-266, score-0.415]
82 Then, for k ≥ C · (dl log(dl) + ln(1/δ))/ν², with probability of at least 1 − δ over the choice of λ, the approximation error of HM with respect to D is at least 1/2 − ν. [sent-267, score-0.18]
83 Note that the first corollary holds even if only the top level of the binary tree is balanced and splits the labels randomly to the left and the right sub-trees. [sent-268, score-0.271]
84 Moreover, most current theoretical analyses of ECOC estimate the error of the learned multiclass hypothesis in terms of the average error of the binary classifiers. [sent-276, score-0.739]
85 It is not hard to see that for every φ : [d + 1] → {±1}, the approximation error of the class of halfspaces with respect to Dφ is zero. [sent-285, score-0.425]
86 Thus, in order to ensure a large approximation error for every distribution, the number of classes must be at least linear in the dimension, so in this sense, the lemma is tight. [sent-286, score-0.317]
87 7 The technique for the lower bound on dN (L(W)) when W is the class of halfspaces in Rd is more involved, and quite general. [sent-309, score-0.225]
88 We consider a binary hypothesis class G ⊆ {±1}[d]×[l] which consists of functions having an arbitrary behaviour over [d] × {i}, and a very uniform behaviour on other inputs (such as mapping all other inputs to a constant). [sent-310, score-0.401]
89 Finally, we show that the class of halfspaces is richer than G, in the sense that the inputs to G can be mapped to points in Rd such that the functions of G can be mapped to halfspaces. [sent-313, score-0.225]
90 To prove the approximation error lower bounds stated in Section 3.2, we argue as follows. [sent-315, score-0.239]
91 For these hypotheses, we show that with high probability over the choice of label mapping, the approximation error on the sample is high. [sent-319, score-0.246]
92 A union bound on the finite set of possible hypotheses shows that the approximation error on the distribution will be high, with high probability over the choice of the label mapping. [sent-320, score-0.25]
93 This is certainly true if the hypothesis class of MSVM, L, has a zero approximation error (the realizable case), since the ERM is then solvable with respect to L. [sent-322, score-0.568]
94 Nonetheless, for each method there are reasonable heuristics to approximate the ERM, which should work well when the approximation error is small. [sent-326, score-0.205]
95 However, variations in the optimality of algorithms for different hypothesis classes should also be taken into account in this analysis. [sent-328, score-0.331]
96 Finally, when the number of examples is much larger than dk², the analysis implies that it is better to choose the AP approach. [sent-331, score-0.265]
97 In the leftmost graph below, there are two classes in R2 , and the approximation error of all algorithms is zero. [sent-333, score-0.29]
98 Here, both MSVM and OvA have a zero approximation error, but the error of TC and of ECOC with a random code will most likely be large. [sent-335, score-0.261]
99 Reducing multiclass to binary: A unifying approach for margin classifiers. [sent-348, score-0.2]
100 On the algorithmic implementation of multiclass kernel-based vector machines. [sent-375, score-0.2]
wordName wordTfidf (topN-words)
[('msvm', 0.418), ('ecoc', 0.337), ('ova', 0.289), ('dk', 0.265), ('hypothesis', 0.241), ('tc', 0.237), ('multiclass', 0.2), ('dg', 0.178), ('halfspaces', 0.159), ('hova', 0.146), ('erm', 0.13), ('vc', 0.125), ('dn', 0.118), ('err', 0.111), ('daniely', 0.109), ('hap', 0.109), ('dl', 0.108), ('ht', 0.107), ('ap', 0.106), ('error', 0.102), ('tree', 0.098), ('binary', 0.094), ('classes', 0.09), ('allwein', 0.089), ('code', 0.081), ('hm', 0.08), ('approximation', 0.078), ('hl', 0.076), ('rd', 0.074), ('htrees', 0.073), ('hw', 0.07), ('classi', 0.066), ('class', 0.066), ('realizable', 0.059), ('ers', 0.058), ('crammer', 0.055), ('inclusions', 0.055), ('estimation', 0.049), ('dimension', 0.047), ('rk', 0.046), ('ln', 0.045), ('er', 0.045), ('argmaxi', 0.044), ('beygelzimer', 0.044), ('codes', 0.042), ('natarajan', 0.042), ('complexity', 0.041), ('analyze', 0.041), ('rumelhart', 0.04), ('hc', 0.038), ('label', 0.038), ('regret', 0.037), ('bijection', 0.037), ('leaf', 0.036), ('leaves', 0.035), ('bounds', 0.033), ('sabato', 0.032), ('hypotheses', 0.032), ('singer', 0.03), ('achievable', 0.03), ('rifkin', 0.03), ('underlying', 0.03), ('essentially', 0.03), ('predictor', 0.029), ('sample', 0.028), ('separators', 0.028), ('corollary', 0.027), ('weston', 0.027), ('labels', 0.027), ('lemma', 0.027), ('theorem', 0.027), ('whenever', 0.026), ('svm', 0.026), ('reduction', 0.026), ('break', 0.026), ('stated', 0.026), ('traverse', 0.025), ('balanced', 0.025), ('heuristics', 0.025), ('dietterich', 0.024), ('jerusalem', 0.024), ('et', 0.024), ('ma', 0.024), ('correcting', 0.024), ('learnability', 0.024), ('corollaries', 0.024), ('hebrew', 0.024), ('log', 0.023), ('amit', 0.023), ('certainly', 0.022), ('strict', 0.022), ('prediction', 0.022), ('ld', 0.022), ('mh', 0.022), ('symmetry', 0.021), ('shai', 0.021), ('vi', 0.021), ('graph', 0.02), ('every', 0.02), ('contains', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz
Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1
2 0.14033276 200 nips-2012-Local Supervised Learning through Space Partitioning
Author: Joseph Wang, Venkatesh Saligrama
Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.
3 0.12736414 227 nips-2012-Multiclass Learning with Simplex Coding
Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine
Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1
4 0.11347025 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
Author: Koby Crammer, Yishay Mansour
Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1
5 0.11030393 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
Author: Chi Jin, Liwei Wang
Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.
6 0.10937089 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning
7 0.09769278 271 nips-2012-Pointwise Tracking the Optimal Regression Function
8 0.087379634 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
9 0.087025478 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
10 0.081062734 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
11 0.079159699 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
12 0.076837748 291 nips-2012-Reducing statistical time-series problems to binary classification
13 0.070843831 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme
14 0.069650985 182 nips-2012-Learning Networks of Heterogeneous Influence
15 0.063652612 242 nips-2012-Non-linear Metric Learning
16 0.060476832 180 nips-2012-Learning Mixtures of Tree Graphical Models
17 0.059395757 344 nips-2012-Timely Object Recognition
18 0.059256837 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation
19 0.058408339 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
20 0.057575185 195 nips-2012-Learning visual motion in recurrent neural networks
topicId topicWeight
[(0, 0.165), (1, 0.017), (2, 0.017), (3, -0.029), (4, 0.079), (5, -0.043), (6, 0.007), (7, 0.111), (8, -0.122), (9, -0.008), (10, 0.061), (11, 0.053), (12, 0.072), (13, -0.006), (14, -0.043), (15, -0.034), (16, -0.072), (17, 0.067), (18, -0.014), (19, 0.115), (20, -0.097), (21, 0.078), (22, 0.014), (23, 0.051), (24, -0.01), (25, -0.074), (26, -0.039), (27, -0.036), (28, -0.013), (29, -0.104), (30, -0.017), (31, 0.011), (32, -0.081), (33, -0.113), (34, -0.007), (35, 0.008), (36, -0.093), (37, 0.016), (38, -0.056), (39, -0.056), (40, 0.016), (41, -0.012), (42, -0.027), (43, -0.018), (44, -0.018), (45, 0.036), (46, -0.074), (47, -0.06), (48, -0.044), (49, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.94309801 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz
Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1
2 0.76505059 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
Author: Chi Jin, Liwei Wang
Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.
3 0.71229923 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
Author: Koby Crammer, Yishay Mansour
Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1
4 0.70948225 200 nips-2012-Local Supervised Learning through Space Partitioning
Author: Joseph Wang, Venkatesh Saligrama
Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.
5 0.67476535 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification
Author: Wei Bi, James T. Kwok
Abstract: In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The proposed method consistently outperforms other hierarchical and flat multilabel classification methods. 1
6 0.6675418 271 nips-2012-Pointwise Tracking the Optimal Regression Function
7 0.65982431 361 nips-2012-Volume Regularization for Binary Classification
8 0.61550933 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
9 0.609519 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
10 0.58929402 227 nips-2012-Multiclass Learning with Simplex Coding
11 0.58532751 81 nips-2012-Context-Sensitive Decision Forests for Object Detection
12 0.57722235 291 nips-2012-Reducing statistical time-series problems to binary classification
13 0.5496037 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification
14 0.54254407 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
15 0.53542703 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
16 0.52163613 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
17 0.50128925 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data
18 0.49408707 215 nips-2012-Minimizing Uncertainty in Pipelines
19 0.49053553 197 nips-2012-Learning with Recursive Perceptual Representations
20 0.47823313 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
topicId topicWeight
[(0, 0.02), (21, 0.022), (30, 0.036), (32, 0.016), (38, 0.182), (39, 0.014), (42, 0.057), (45, 0.018), (54, 0.013), (55, 0.013), (74, 0.045), (76, 0.114), (80, 0.084), (87, 0.02), (89, 0.158), (92, 0.056), (98, 0.031)]
simIndex simValue paperId paperTitle
1 0.88110238 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion
Author: Borja Balle, Mehryar Mohri
Abstract: Many tasks in text and speech processing and computational biology require estimating functions mapping strings to real numbers. A broad class of such functions can be defined by weighted automata. Spectral methods based on the singular value decomposition of a Hankel matrix have been recently proposed for learning a probability distribution represented by a weighted automaton from a training sample drawn according to this same target distribution. In this paper, we show how spectral methods can be extended to the problem of learning a general weighted automaton from a sample generated by an arbitrary distribution. The main obstruction to this approach is that, in general, some entries of the Hankel matrix may be missing. We present a solution to this problem based on solving a constrained matrix completion problem. Combining these two ingredients, matrix completion and spectral method, a whole new family of algorithms for learning general weighted automata is obtained. We present generalization bounds for a particular algorithm in this family. The proofs rely on a joint stability analysis of matrix completion and spectral learning. 1
same-paper 2 0.86845404 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz
Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1
3 0.86467671 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent
Author: Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, David Haglin
Abstract: Large-scale `1 -regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for `1 -regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale `1 -regularization problems. 1
4 0.81892395 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
Author: Elad Hazan, Zohar Karnin
Abstract: We present a simplex algorithm for linear programming in a linear classification formulation. The paramount complexity parameter in linear classification problems is called the margin. We prove that for margin values of practical interest our simplex variant performs a polylogarithmic number of pivot steps in the worst case, and its overall running time is near linear. This is in contrast to general linear programming, for which no sub-polynomial pivot rule is known. 1
5 0.81544238 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
Author: Aharon Birnbaum, Shai S. Shwartz
Abstract: Given α, ϵ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α) L∗γ + ϵ, where L∗γ is the optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in time exp(Õ(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model. 1
6 0.81513488 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
7 0.81460822 227 nips-2012-Multiclass Learning with Simplex Coding
8 0.81183952 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
9 0.80988812 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
10 0.80801678 358 nips-2012-Value Pursuit Iteration
11 0.80768156 292 nips-2012-Regularized Off-Policy TD-Learning
12 0.80765557 187 nips-2012-Learning curves for multi-task Gaussian process regression
13 0.80764395 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem
14 0.80726546 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
15 0.80687732 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
16 0.80585462 267 nips-2012-Perceptron Learning of SAT
17 0.80585343 324 nips-2012-Stochastic Gradient Descent with Only One Projection
18 0.80447012 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
19 0.8039 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
20 0.80356705 285 nips-2012-Query Complexity of Derivative-Free Optimization