jmlr jmlr2005 jmlr2005-56 knowledge-graph by maker-knowledge-mining

56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels


Source: pdf

Author: Roni Khardon, Rocco A. Servedio

Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such an SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length ω̃(√n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. [sent-7, score-0.367]

2 We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such an SRM scheme. [sent-9, score-0.379]

3 We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length ω̃(√n) then the maximum margin hypothesis will fail on the uniform distribution as well. [sent-10, score-0.721]

4 Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. [sent-11, score-0.282]

5 Introduction Maximum margin algorithms, notably the Support Vector Machines (SVM) introduced by Boser et al. [sent-13, score-0.257]

6 The first idea is to learn using the linear separator which achieves the maximum margin on the training data rather than an arbitrary consistent linear threshold hypothesis. [sent-18, score-0.42]
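For reference, the separator described in the preceding sentence is the solution of the standard hard-margin optimization problem; this is the textbook formulation, not text quoted from the paper:

```latex
% Hard-margin SVM over a labeled sample S = {(x_i, b_i)}:
\min_{W,\,\theta} \; \|W\|^2
\quad \text{subject to} \quad
b_i\,(W \cdot x_i - \theta) \;\ge\; 1 \quad \text{for all } (x_i, b_i) \in S.
% The solution attains geometric margin 1/\|W\|, the largest achievable by
% any linear threshold hypothesis consistent with S.
```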

7 Given a kernel K which corresponds to some expanded feature space, the SVM hypothesis h is (an implicit representation of) the maximum margin linear threshold hypothesis over this expanded feature space rather than the original feature space. [sent-21, score-0.903]

8 SVM theory (see, e.g., Shawe-Taylor and Cristianini, 2000) implies that if the kernel K is efficiently computable then it is possible to efficiently construct this maximum margin hypothesis h and that h itself is efficiently computable. [sent-24, score-0.473]

9 Several on-line algorithms have also been proposed which iteratively construct large margin hypotheses in the feature space (see, e. [sent-27, score-0.284]

10 In particular, convergence bounds based on the maximum margin of the classifier on the observed data have been obtained by Shawe-Taylor et al. [sent-32, score-0.337]

11 In this paper we analyze the performance of maximum margin algorithms when used with Boolean kernels to learn DNF formulas. [sent-43, score-0.379]

12 Since linear threshold elements can represent disjunctions, one can naturally view any DNF formula as a linear threshold function over this expanded feature space. [sent-48, score-0.266]

13 It is thus natural to ask whether the Kk kernel maximum margin learning algorithms are good algorithms for learning DNF. [sent-49, score-0.365]

14 The fastest known algorithm for PAC learning DNF is due to Klivans and Servedio (2001); it works by explicitly expanding each example into a feature space of monotone conjunctions and explicitly learning a consistent linear threshold function over this expanded feature space. [sent-51, score-0.441]

15 Since the Kk kernel enables us to do such expansions implicitly in a computationally efficient way, it is natural to investigate whether the Kk -kernel maximum margin algorithm yields a computationally efficient algorithm for PAC learning DNF. [sent-52, score-0.365]

16 2. Discussion of the Problem and Previous Work. Recall that a polynomial size sample is sufficient for PAC learning any concept class where each concept in the class has a polynomial size description. [sent-54, score-0.246]

17 This Boolean kernel is similar to the well known polynomial kernel in that all monomials of length up to k are represented. [sent-57, score-0.323]

18 The main difference is that the polynomial kernel assigns weights to monomials which depend on certain binomial coefficients; thus the weights of different monomials can differ by an exponential factor. [sent-58, score-0.439]
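To make the comparison concrete, here is a small sketch. The closed form used for the Boolean kernel below (summing over all monotone conjunctions of size 1 to k, each with weight 1) is the standard one consistent with the description above rather than a formula quoted from the paper, and the function names are mine.

```python
from math import comb

def boolean_kernel(x, y, k):
    """Number of monotone conjunctions of size 1..k satisfied by both 0/1
    vectors x and y: sum_{i=1}^{k} C(|x AND y|, i). Every conjunction has
    the same weight (1)."""
    s = sum(xi & yi for xi, yi in zip(x, y))  # |x AND y|
    return sum(comb(s, i) for i in range(1, k + 1))

def polynomial_kernel(x, y, k):
    """Standard polynomial kernel (1 + x.y)^k; expanding it shows that the
    implicit monomial weights involve binomial coefficients, so different
    monomials can be weighted exponentially differently (as noted above)."""
    return (1 + sum(xi * yi for xi, yi in zip(x, y))) ** k

x = [1, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]
print(boolean_kernel(x, y, 2))    # |x AND y| = 2, so C(2,1) + C(2,2) = 3
print(polynomial_kernel(x, y, 2))  # (1 + 2)^2 = 9
```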

19 Shawe-Taylor et al. (1998) and Shawe-Taylor and Cristianini (2000) have introduced convergence bounds for maximum margin learners. [sent-64, score-0.337]

20 These bounds are independent of the dimension of the expanded feature space but they depend on the L2 norm of examples in this space, as well as the margin obtained on the sample. [sent-65, score-0.432]

21 In particular they depend on R/δ where δ is the margin and R bounds the L2 norm of examples. [sent-66, score-0.292]

22 It is instructive to consider applying these results in our setting, where we assume for concreteness that we are learning a function given by one k-monomial T , and that we are using the Kk monotone kernel with the maximum margin algorithm. [sent-67, score-0.497]

23 only one weight is non-zero and the (non-normalized) margin obtained is 1. [sent-70, score-0.296]

24 Seen in another way, we can normalize the examples to have a maximum norm of 1, but then the normalized margin obtained is Θ(n^{−k/2}). [sent-72, score-0.334]
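A rough version of that calculation, written out under the setting just described (the target is a single k-monomial T and all weight is placed on the corresponding feature):

```latex
% Unnormalized: weight 1 on the single feature T and 0 elsewhere gives margin 1.
% An input x with up to n ones activates at most
%   \rho_k(n) \;=\; \sum_{i=1}^{k} \binom{n}{i}
% coordinates of \phi_k(x), so \|\phi_k(x)\|_2 \le \sqrt{\rho_k(n)}.
% For k = O(1) we have \rho_k(n) = \Theta(n^k), so the norm is \Theta(n^{k/2});
% rescaling examples to unit norm shrinks the margin by the same factor,
% giving the normalized margin \Theta(n^{-k/2}) quoted above.
```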

25 (2003) gives bounds on the margin (or the dimension required) for concrete concept classes. [sent-84, score-0.319]

26 It is worth noting that the notion of embedding used in these results is slightly stronger than the requirement in the upper bounds, in that the embedding and margin are for all the examples (or a large fraction of the instance space) and not just for a small sample. [sent-86, score-0.289]

27 However, such upper bounds do not imply that the Kk kernel maximum margin algorithm must have poor generalization error if run with a smaller sample. [sent-89, score-0.426]

28 Analogously, in this paper we perform detailed algorithm-specific analysis for the Kk kernel maximum margin algorithms. [sent-96, score-0.365]

29 The current paper differs in several ways from this earlier work: we study the maximum margin algorithm rather than Perceptron, we consider PAC learning from a random sample rather than online learning, and we analyze the Kk kernels for all 1 ≤ k ≤ n. [sent-102, score-0.376]

30 3. Our Results. In this paper we study the kernels corresponding to all monotone monomials of length up to k, which we denote by Kk. [sent-107, score-0.289]

31 In addition to unaugmented maximum margin algorithms we also consider a natural scheme of structural risk minimization (SRM) that can be used with maximum margin algorithms over this family of Boolean kernels. [sent-109, score-0.604]

32 For any k = 1, . . . , n, the Kk kernel maximum margin algorithm cannot PAC learn t(n)-term DNF. [sent-127, score-0.404]

33 For every k ∈ {1, . . . , n}, the Kk maximum margin hypothesis has error larger than ε (with overwhelmingly high probability over the choice of a polynomial size random sample from D). [sent-132, score-0.524]

34 Note that this result implies that the Kk maximum margin algorithms fail even when combined with SRM regardless of the cost function. [sent-133, score-0.337]

35 This is simply because the maximum margin hypothesis has error > ε for all k, and hence the final SRM hypothesis must also have error > ε. [sent-134, score-0.518]

36 There is a distribution D over {0, 1}^n such that for any k = ω(1) the Kk maximum margin hypothesis has error at least 1/2 − 2^{−n^{Ω(1)}} (with overwhelmingly high probability over the choice of a polynomial size random sample from D). [sent-140, score-0.524]

37 Note that for any k = Θ(1), standard bounds on maximum margin algorithms show that the Kk kernel algorithm can learn f (x) = x1 from a polynomial size sample. [sent-142, score-0.517]
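Connecting this remark to the R/δ bound quoted earlier is a short back-of-the-envelope argument (mine, not a statement from the paper):

```latex
% Target f(x) = x_1: put all weight on the single feature {x_1}, so the
% unnormalized margin is \delta = 1, while every example satisfies
%   \|\phi_k(x)\|_2^2 \;\le\; \rho_k(n) \;=\; \sum_{i=1}^{k} \binom{n}{i} \;\le\; n^k,
% i.e. R \le n^{k/2}.  The convergence bounds scale with (R/\delta)^2 \le n^k,
% which is polynomial in n whenever k = \Theta(1).
```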

38 Given these strong negative results for PAC learning under arbitrary distributions, we next consider the problem of PAC learning monotone DNF under the uniform distribution. [sent-143, score-0.221]

39 There is an O(t(n)^{1/3})-term monotone DNF over t(n) relevant variables such that for all k < t(n) the Kk maximum margin hypothesis has error at least ε (with probability 1 over the choice of a random sample from the uniform distribution). [sent-148, score-0.637]

40 For any k = ω̃(√n), the Kk maximum margin hypothesis will have error 1/2 − 2^{−Ω(n)} with probability at least 0. [sent-151, score-0.41]

41 It is of significant interest to characterize the performance of the Kk maximum margin algorithm under the uniform distribution for these intermediate values of k; a discussion of this point is given in Section 5. [sent-154, score-0.361]

42 The margin of h on (z, b) ∈ R^N × {−1, 1} is mh(z, b) = b(W · z − θ)/||W||. [sent-164, score-0.373]

43 The maximum margin classifier for S is the linear threshold function h(x) = sign(W · x − θ) such that mh(S) = max_{W′ ∈ R^N, θ′ ∈ R} min_{(x,b) ∈ S} b(W′ · x − θ′)/||W′|| (1). [sent-172, score-0.497]

44 The quantity (1) is called the margin of S and is denoted mS. [sent-173, score-0.257]

45 If mS > 0 then the maximum margin classifier for S is unique (see, e. [sent-175, score-0.302]
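A small sketch of these definitions (illustrative only; the variable names are mine):

```python
import math

def margin_on_point(W, theta, z, b):
    """m_h(z, b) = b * (W.z - theta) / ||W||."""
    norm = math.sqrt(sum(w * w for w in W))
    return b * (sum(w * zi for w, zi in zip(W, z)) - theta) / norm

def margin_on_sample(W, theta, S):
    """m_h(S): the smallest margin of h over the labeled sample S."""
    return min(margin_on_point(W, theta, z, b) for z, b in S)

# The margin m_S of S itself is the maximum of margin_on_sample over all
# (W, theta); the maximizer is the maximum margin classifier of equation (1).
S = [([1.0, 1.0], +1), ([0.0, 0.0], -1)]
print(margin_on_sample([1.0, 1.0], 1.0, S))  # both points at margin 1/sqrt(2)
```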

46 We refer to the following learning algorithm as the K-maximum margin learner: • The algorithm takes as input a sample S = {(xi, bi)}_{i=1,...,m}. [sent-188, score-0.351]

47 If these conditions do not hold then the maximum margin hypothesis is not defined. [sent-193, score-0.41]

48 • The algorithm’s hypothesis is h : {0, 1}n → {−1, 1}, h(x) = sign(W · φ(x) − θ) where sign(W · x − θ) is the maximum margin classifier for φ(S). [sent-200, score-0.41]

49 SVM theory tells us that if K(x, y) can be computed in poly(n) time then the K-maximum margin learning algorithm runs in poly(n, m) =poly(n) time and the output hypothesis h(x) can be evaluated in poly(n, m) =poly(n) time (see, e. [sent-202, score-0.365]
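As a sketch of how such a learner could be run in practice, the following uses scikit-learn's SVC with a callable Gram-matrix kernel and a large C to approximate the hard margin. The library choice, the closed-form kernel (the same one as in the earlier sketch), and the toy data are my assumptions, not part of the paper.

```python
import numpy as np
from math import comb
from sklearn.svm import SVC

def make_boolean_gram(k):
    """Callable kernel K_k(X, Y): for each pair of 0/1 rows, the number of
    monotone conjunctions of size 1..k that both rows satisfy."""
    def kernel(X, Y):
        shared = np.asarray(X, dtype=int) @ np.asarray(Y, dtype=int).T  # |x AND y|
        gram = np.zeros(shared.shape, dtype=float)
        for i in range(1, k + 1):
            gram += np.vectorize(comb)(shared, i)
        return gram
    return kernel

# Toy data: the label is x_1 (a one-term DNF) over n = 10 Boolean features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = np.where(X[:, 0] == 1, 1, -1)

# A large C approximates the hard-margin (maximum margin) classifier.
clf = SVC(kernel=make_boolean_gram(k=3), C=1e6)
clf.fit(X, y)
print("training accuracy:", (clf.predict(X) == y).mean())
```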

50 Our goal is to analyze the PAC learning ability of various kernel maximum margin learning algorithms. [sent-205, score-0.365]

51 Applying this framework to the maximum margin learner, we assume that the sample S is drawn by taking IID samples from D and providing the label according to the target function f . [sent-211, score-0.363]

52 Note that the number of nonempty monotone conjunctions (i. [sent-214, score-0.227]

53 the components of φk (x) are all monotone conjunctions of the desired size. [sent-222, score-0.227]
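For a concrete picture of these components, here is an explicit (deliberately inefficient) expansion of φk; it only spells out which coordinates exist, since the whole point of the kernel is to avoid materializing them:

```python
from itertools import combinations

def phi_k(x, k):
    """Explicit feature map: one 0/1 coordinate per nonempty monotone
    conjunction of at most k variables; the coordinate indexed by a set T
    is 1 exactly when x_i = 1 for every i in T."""
    n = len(x)
    features = {}
    for size in range(1, k + 1):
        for T in combinations(range(n), size):
            features[T] = int(all(x[i] for i in T))
    return features

f = phi_k([1, 0, 1], k=2)
print(len(f))                          # C(3,1) + C(3,2) = 6 coordinates
print([T for T, v in f.items() if v])  # [(0,), (2,), (0, 2)]
```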

54 Distribution-Free Non-Learnability. We give a DNF and a distribution such that the maximum margin algorithm using the k-monomials kernel fails to learn, for all 1 ≤ k ≤ n. [sent-241, score-0.396]

55 A polynomial threshold function is defined by a multivariate polynomial p(x1 , . [sent-246, score-0.235]

56 A simple but useful observation is that any hypothesis output by the Kk kernel maximum margin algorithm must be a polynomial threshold function of degree at most k. [sent-255, score-0.671]

57 Minsky and Papert (1968) (see also Klivans and Servedio, 2001) gave the following lower bound on polynomial threshold function degree for DNF: Theorem 5. Any polynomial threshold function for f(x) in Equation (2) must have degree at least . [sent-256, score-0.428]
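The observation in sentence 56 can be written out directly from the definitions:

```latex
% Each coordinate of \phi_k(x) is a monotone conjunction, i.e. a product
% \prod_{i \in T} x_i with 1 \le |T| \le k.  Hence any hypothesis of the form
%   h(x) = \mathrm{sign}\Big(\textstyle\sum_{T:\,1 \le |T| \le k} W_T \prod_{i \in T} x_i \;-\; \theta\Big)
% is the sign of a multivariate polynomial of degree at most k, i.e. a
% polynomial threshold function of degree at most k.  Degree lower bounds for
% the target DNF therefore constrain what such a hypothesis can represent
% when k is small.
```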

58 For small values of k the result is representation based and does not depend on the sample drawn: Lemma 6. If the maximum margin algorithm uses the kernel Kk for k below the degree bound of Theorem 5 when learning f(x) under D then its hypothesis has error greater than ε = 1/(4 · 2^{t(n)}) = 1/(4n). [sent-259, score-0.509]

59 Since we are using the kernel Kk, the hypothesis h is some polynomial threshold function of degree at most k which has error τ ≤ 1/(2 · 2^{t(n)}) under D. [sent-261, score-0.369]

60 Under this setting of variables the hypothesis is a degree-k polynomial threshold function on the first t(n) variables. [sent-263, score-0.265]

61 For larger values of k (in fact for all k = ω(1)) we show that with high probability the maximum margin hypothesis will overfit the sample. [sent-267, score-0.41]

62 As a result of these facts, one can find a simple hypothesis with relatively large margin by using all the structure from the examples, i. [sent-271, score-0.365]

63 It is hard in general to analyze the maximum margin hypothesis directly, and in particular it does not necessarily follow the overfitting scheme of the simple hypothesis. [sent-275, score-0.41]

64 However, our analysis uses the simple hypothesis to infer some properties of the maximum margin hypothesis and through this provide error bounds for it. [sent-276, score-0.553]

65 Since the maximum margin algorithm uses m = poly(n) = n^{Ω(1)} many examples (see Section 2), the first condition fails with probability 2^{−m} = 2^{−n^{Ω(1)}} as well. [sent-288, score-0.365]

66 Then the maximum margin mS satisfies mS ≥ Mh ≡ 1/ρk(.01n^{2/3}). [sent-305, score-0.302]

67 Proof: We exhibit an explicit linear threshold function h which has margin at least Mh on the data set. [sent-309, score-0.336]

68 • θ is the value that gives the maximum margin on φk (S) for this W , i. [sent-311, score-0.302]

69 Putting these conditions together, we get that the margin of h on the sample is at least 1/ρk(.01n^{2/3}). [sent-327, score-0.293]

70 It is instructive to use a rough calculation and compare the margin obtained to the one calculated in the introduction. [sent-331, score-0.257]

71 Lemma 12: If S is a D-typical sample, then the threshold θ in the maximum margin classifier for S is at least Mh. [sent-333, score-0.381]

72 Proof Let h(x) = sign(W · φ(x) − θ) be the maximum margin hypothesis. [sent-334, score-0.302]

73 Since ||W|| = 1 we have θ = θ/||W|| = mh(φk(0^n), −1) ≥ mh(S) ≥ Mh, where the second equality holds because W · φ(0^n) = 0 and the last inequality is by Lemma 11. [sent-335, score-0.262]

74 Lemma 13: If the maximum margin algorithm uses the kernel Kk for k = ω(1) when learning f(x) under D then with probability 1 − 2^{−n^{Ω(1)}} its hypothesis has error greater than ε = 1/(4 · 2^{t(n)}) = 1/(4n). [sent-336, score-0.473]

75 Proof Let S be the sample used for learning and let h(x) = sign(W · φk (x) − θ) be the maximum margin hypothesis. [sent-337, score-0.338]

76 5) that the maximum margin weight vector W is a linear combination of the support vectors, i. [sent-341, score-0.341]

77 Together, Lemma 6 and Lemma 13 imply Result 1: Theorem 14. For any value of k, if the maximum margin algorithm (as defined in Section 2) uses the kernel Kk when learning f(x) under D then with probability 1 − 2^{−n^{Ω(1)}} its hypothesis has error greater than ε = 1/(4 · 2^{t(n)}) = 1/(4n). [sent-366, score-0.499]

78 It is easy to see that the previous arguments go through for this case and we get: Theorem 15. For k = ω(1), if the maximum margin algorithm uses the kernel Kk when learning f(x) = x1 under D then with probability 1 − 2^{−n^{Ω(1)}} its hypothesis has error at least ε = 1 − 2^{−n^{Ω(1)}}. [sent-373, score-0.473]

79 Now the proof of Lemma 12 shows that (4) is a lower bound on the threshold of the maximum margin classifier. [sent-392, score-0.413]

80 Therefore, each weight in φ(z) is represented by a weight in one of the intersections, and the value of the weight depends only on the monomial so it is the same in φ(z) and φ(z ∩ xi ). [sent-401, score-0.202]

81 This provided a lower bound (of 0) on the value of W · φk (x) for some negative example in the sample, and then we could argue that the value of θ in the maximum margin classifier must be at least as large as mS . [sent-418, score-0.364]

82 This event fails to occur only if each of the M− draws from B(n, 1/2) misses the left tail of weight at most 1/(4m). [sent-494, score-0.26]

83 The lemma implies that the left tail of weight at most 1/(4m) must have weight at least 1/(12m). [sent-505, score-0.236]
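To make the tail-weight bookkeeping concrete, this small sketch computes exact left tails of B(n, 1/2); the particular n and m below are made up purely for illustration:

```python
from math import comb

def binom_left_tail(n, t):
    """Pr[ B(n, 1/2) <= t ], computed exactly."""
    return sum(comb(n, j) for j in range(t + 1)) / 2 ** n

# Largest threshold t whose left tail still has weight at most 1/(4m).
n, m = 200, 50
cap = 1 / (4 * m)
t = max(j for j in range(n + 1) if binom_left_tail(n, j) <= cap)
print(t, binom_left_tail(n, t))  # the heaviest left tail not exceeding 1/(4m)
```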

84 Then the maximum margin mS satisfies mS ≥ 1 2 1 √ m ρk (u1 ) − mρk (. [sent-565, score-0.302]

85 • Set θ to be the value that gives the maximum margin on φk (S) for this W , i. [sent-572, score-0.302]

86 Now we can give a lower bound on the threshold θ for the maximum margin classifier. [sent-584, score-0.413]

87 Lemma 26 Let S be a labeled sample of size m which is U -typical and positive skewed, and let h(x) = sign(W · φk (x) − θ) be the maximum margin hypothesis for S. [sent-585, score-0.472]

88 Putting all of the pieces together, we have: Theorem 27. If the maximum margin algorithm uses the kernel Kk for k = ω(√n log^{3/2} n) when learning f(x) = x1 under the uniform distribution then with probability at least 0. [sent-591, score-0.478]

89 Arguing as in Remark 16 we get that the maximum margin is at least 1 uk − m(0. [sent-616, score-0.302]

90 We have studied the performance of the maximum margin algorithm with the Boolean kernels, giving negative results for several settings of the problem. [sent-626, score-0.332]

91 Our results indicate that the maximum margin algorithm can overfit even when learning simple target functions and using natural and expressive kernels for such functions, and even when combined with structural risk minimization. [sent-627, score-0.365]

92 This seems necessary for learning DNF; note that while one can use an exponential function to define a kernel with weighted monomials where the weight decays exponentially depending on the degree k, this implies that the margin for functions of high degree is exponentially small. [sent-629, score-0.56]
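One way to instantiate the remark about exponentially decaying monomial weights (my construction, not one taken from the paper) is to give the conjunction indexed by a set T the feature weight λ^{|T|} for some 0 < λ < 1:

```latex
% Feature map: \phi(x)_T = \lambda^{|T|} \prod_{i \in T} x_i over all nonempty
% monotone conjunctions T, with 0 < \lambda < 1.  The corresponding kernel is
%   K_\lambda(x, y) = \sum_{\emptyset \neq T \subseteq \mathrm{supp}(x \wedge y)} \lambda^{2|T|}
%                   = (1 + \lambda^2)^{|x \wedge y|} - 1,
% which is efficiently computable without any cutoff k.  But a target
% conjunction of size d now lives on a feature of weight \lambda^{d}, so the
% achievable margin for degree-d targets decays exponentially in d, which is
% the trade-off the sentence above points to.
```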

93 Many interesting variants of the basic maximum margin algorithm have been used in recent years, such as soft margin criteria and kernel regularization. [sent-632, score-0.622]

94 It is known (e.g., Verbeurgt, 1990) that when learning polynomial size DNF under the uniform distribution, conjunctions of length ω(log n) can be ignored with little effect. [sent-638, score-0.232]

95 As a first concrete problem in this scenario, one might consider the question of whether a k = Θ(log n) kernel maximum margin algorithm can efficiently PAC learn the target function f (x) = x1 . [sent-642, score-0.455]

96 For this problem it is easy to show that the naive hypothesis h constructed in our proofs achieves both a large margin and high accuracy. [sent-643, score-0.365]

97 Moreover, it is possible to show that with high probability the maximum margin hypothesis has a margin which is within a multiplicative factor of (1 + o(1)) of the margin achieved by h . [sent-644, score-0.924]

98 Finally, the kernel we have used is natural in terms of capturing all monomials of a certain length but there are other ways to capture natural kernels for Boolean problems. [sent-647, score-0.22]

99 (1993); Kushilevitz and Mansour (1993); Mansour (1995) but the algorithmic ideas are very different to the ones used by maximum margin algorithms. [sent-650, score-0.302]

100 Learning monotone log-term DNF formulas under the uniform distribution. [sent-944, score-0.222]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dnf', 0.429), ('kk', 0.401), ('poly', 0.27), ('margin', 0.257), ('pac', 0.173), ('aximum', 0.17), ('ervedio', 0.17), ('hardon', 0.17), ('oolean', 0.17), ('boolean', 0.161), ('monotone', 0.132), ('argin', 0.127), ('monomials', 0.119), ('mh', 0.116), ('khardon', 0.114), ('hypothesis', 0.108), ('ernels', 0.106), ('pr', 0.1), ('servedio', 0.099), ('srm', 0.095), ('conjunctions', 0.095), ('ms', 0.087), ('lgorithms', 0.083), ('expanded', 0.081), ('tail', 0.081), ('ces', 0.08), ('threshold', 0.079), ('polynomial', 0.078), ('lemma', 0.077), ('bits', 0.069), ('chernoff', 0.067), ('kernel', 0.063), ('event', 0.06), ('perceptron', 0.059), ('uniform', 0.059), ('klivans', 0.057), ('kushilevitz', 0.057), ('monomial', 0.057), ('log', 0.054), ('cristianini', 0.049), ('draws', 0.049), ('mansour', 0.048), ('maximum', 0.045), ('minsky', 0.043), ('rocco', 0.043), ('roni', 0.043), ('tufts', 0.043), ('lightest', 0.042), ('occam', 0.042), ('winnow', 0.042), ('fourier', 0.042), ('degree', 0.041), ('weight', 0.039), ('learn', 0.039), ('kernels', 0.038), ('wt', 0.038), ('blum', 0.037), ('sample', 0.036), ('bshouty', 0.036), ('bounds', 0.035), ('fail', 0.035), ('sign', 0.034), ('bound', 0.032), ('forster', 0.032), ('vm', 0.032), ('roth', 0.032), ('examples', 0.032), ('fails', 0.031), ('formulas', 0.031), ('weights', 0.03), ('negative', 0.03), ('bi', 0.03), ('holds', 0.03), ('nonzero', 0.029), ('contradicts', 0.029), ('hancock', 0.028), ('kowalczyk', 0.028), ('kucera', 0.028), ('papert', 0.028), ('prx', 0.028), ('sakai', 0.028), ('tarui', 0.028), ('verbeurgt', 0.028), ('weigh', 0.028), ('xi', 0.028), ('contributes', 0.028), ('concept', 0.027), ('ui', 0.027), ('feature', 0.027), ('question', 0.026), ('labeled', 0.026), ('imply', 0.026), ('substantial', 0.025), ('features', 0.025), ('um', 0.025), ('skewed', 0.025), ('succeed', 0.025), ('target', 0.025), ('misclassi', 0.024), ('erroneously', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

Author: Roni Khardon, Rocco A. Servedio

Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such an SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length ω̃(√n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions

2 0.10601156 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

Author: John Langford

Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds

3 0.090895452 29 jmlr-2005-Efficient Margin Maximizing with Boosting

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ , that explicitly maximizes the ν minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

4 0.070787363 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun

Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.

5 0.058311529 40 jmlr-2005-Inner Product Spaces for Bayesian Networks

Author: Atsuyoshi Nakamura, Michael Schmitt, Niels Schmitt, Hans Ulrich Simon

Abstract: Bayesian networks have become one of the major models used for statistical inference. We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. We focus on two-label classification tasks over the Boolean domain. As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. In particular, these bounds are tight up to a factor of 2. For some nontrivial cases of Bayesian networks we even determine the exact values of this dimension. We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. We also derive the bound 2Ω(n) for an artificial variant of this network, thereby demonstrating the limits of our approach and raising an interesting open question. As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. Keywords: Bayesian network, inner product space, embedding, linear arrangement, Euclidean dimension

6 0.051207706 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

7 0.04967168 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

8 0.041580357 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

9 0.04069598 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

10 0.04018968 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds

11 0.039440751 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

12 0.03905924 64 jmlr-2005-Semigroup Kernels on Measures

13 0.039034076 59 jmlr-2005-New Horn Revision Algorithms

14 0.031162707 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

15 0.029235182 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features

16 0.029218804 67 jmlr-2005-Stability of Randomized Learning Algorithms

17 0.029199123 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

18 0.028875455 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

19 0.028851165 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

20 0.028454656 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.181), (1, 0.018), (2, 0.167), (3, 0.165), (4, -0.093), (5, -0.177), (6, -0.036), (7, -0.091), (8, 0.094), (9, -0.04), (10, 0.112), (11, -0.043), (12, 0.198), (13, 0.105), (14, -0.087), (15, -0.321), (16, -0.014), (17, 0.0), (18, 0.182), (19, -0.309), (20, 0.087), (21, -0.091), (22, 0.07), (23, -0.081), (24, 0.127), (25, -0.047), (26, -0.011), (27, 0.007), (28, 0.006), (29, 0.039), (30, 0.077), (31, -0.248), (32, -0.176), (33, 0.02), (34, -0.026), (35, -0.043), (36, -0.045), (37, 0.071), (38, -0.048), (39, -0.008), (40, -0.101), (41, 0.029), (42, -0.054), (43, 0.191), (44, 0.237), (45, -0.004), (46, 0.053), (47, 0.054), (48, 0.007), (49, -0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95419365 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

Author: Roni Khardon, Rocco A. Servedio

Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such an SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length ω̃(√n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions

2 0.48904604 29 jmlr-2005-Efficient Margin Maximizing with Boosting

Author: Gunnar Rätsch, Manfred K. Warmuth

Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ , that explicitly maximizes the ν minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

3 0.44623998 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

Author: John Langford

Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds

4 0.29345009 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables

Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun

Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.

5 0.25286669 40 jmlr-2005-Inner Product Spaces for Bayesian Networks

Author: Atsuyoshi Nakamura, Michael Schmitt, Niels Schmitt, Hans Ulrich Simon

Abstract: Bayesian networks have become one of the major models used for statistical inference. We study the question whether the decisions computed by a Bayesian network can be represented within a low-dimensional inner product space. We focus on two-label classification tasks over the Boolean domain. As main results we establish upper and lower bounds on the dimension of the inner product space for Bayesian networks with an explicitly given (full or reduced) parameter collection. In particular, these bounds are tight up to a factor of 2. For some nontrivial cases of Bayesian networks we even determine the exact values of this dimension. We further consider logistic autoregressive Bayesian networks and show that every sufficiently expressive inner product space must have dimension at least Ω(n2 ), where n is the number of network nodes. We also derive the bound 2Ω(n) for an artificial variant of this network, thereby demonstrating the limits of our approach and raising an interesting open question. As a major technical contribution, this work reveals combinatorial and algebraic structures within Bayesian networks such that known methods for the derivation of lower bounds on the dimension of inner product spaces can be brought into play. Keywords: Bayesian network, inner product space, embedding, linear arrangement, Euclidean dimension

6 0.21171691 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

7 0.20987785 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

8 0.19778447 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

9 0.19657171 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds

10 0.18965918 64 jmlr-2005-Semigroup Kernels on Measures

11 0.18010771 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

12 0.16757019 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

13 0.16596417 59 jmlr-2005-New Horn Revision Algorithms

14 0.14836435 67 jmlr-2005-Stability of Randomized Learning Algorithms

15 0.14460959 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve

16 0.14437668 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

17 0.14196812 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

18 0.13366568 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

19 0.13245501 48 jmlr-2005-Learning the Kernel Function via Regularization

20 0.13114616 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.036), (17, 0.023), (19, 0.025), (36, 0.036), (37, 0.021), (42, 0.011), (43, 0.036), (47, 0.018), (52, 0.095), (59, 0.017), (70, 0.022), (82, 0.39), (88, 0.109), (90, 0.016), (92, 0.029), (94, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.61679834 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels

Author: Roni Khardon, Rocco A. Servedio

Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such an SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length ω̃(√n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions

2 0.39779449 64 jmlr-2005-Semigroup Kernels on Measures

Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert

Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space

3 0.37560299 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

Author: John Langford

Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds

4 0.37485909 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou

Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.

5 0.37404987 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson

Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming

6 0.36731958 11 jmlr-2005-Algorithmic Stability and Meta-Learning

7 0.36679035 71 jmlr-2005-Variational Message Passing

8 0.36631933 39 jmlr-2005-Information Bottleneck for Gaussian Variables

9 0.36315939 44 jmlr-2005-Learning Module Networks

10 0.3621453 3 jmlr-2005-A Classification Framework for Anomaly Detection

11 0.3606073 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

12 0.36051148 36 jmlr-2005-Gaussian Processes for Ordinal Regression

13 0.3593936 48 jmlr-2005-Learning the Kernel Function via Regularization

14 0.35852021 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints

15 0.3580614 5 jmlr-2005-A Generalization Error for Q-Learning

16 0.35584709 20 jmlr-2005-Clustering with Bregman Divergences

17 0.35515934 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

18 0.35413527 29 jmlr-2005-Efficient Margin Maximizing with Boosting

19 0.35395241 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

20 0.35379344 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors