jmlr jmlr2006 jmlr2006-92 knowledge-graph by maker-knowledge-mining

92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities


Source: pdf

Author: Adam R. Klivans, Rocco A. Servedio

Abstract: We consider two well-studied problems regarding attribute efficient learning: learning decision lists and learning parity functions. First, we give an algorithm for learning decision lists of length k over n variables using 2^{Õ(k^{1/3})} log n examples and time n^{Õ(k^{1/3})}. This is the first algorithm for learning decision lists that has both subexponential sample complexity and subexponential running time in the relevant parameters. Our approach establishes a relationship between attribute efficient learning and polynomial threshold functions and is based on a new construction of low degree, low weight polynomial threshold functions for decision lists. For a wide range of parameters our construction matches a lower bound due to Beigel for decision lists and gives an essentially optimal tradeoff between polynomial threshold function degree and weight. Second, we give an algorithm for learning an unknown parity function on k out of n variables using O(n^{1−1/k}) examples in poly(n) time. For k = o(log n) this yields a polynomial time algorithm with sample complexity o(n); this is the first polynomial time algorithm for learning parity on a superconstant number of variables with sublinear sample complexity. We also give a simple algorithm for learning an unknown length-k parity using O(k log n) examples in n^{k/2} time, which improves on the naive n^k time bound of exhaustive search. Keywords: PAC learning, attribute efficiency, learning parity, decision lists, Winnow

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Computer Science, Columbia University, New York, NY 10027, USA. Editor: Dana Ron. Abstract: We consider two well-studied problems regarding attribute efficient learning: learning decision lists and learning parity functions. [sent-7, score-1.095]

2 First, we give an algorithm for learning decision lists of length k over n variables using 2^{Õ(k^{1/3})} log n examples and time n^{Õ(k^{1/3})}. [sent-8, score-0.604]

3 This is the first algorithm for learning decision lists that has both subexponential sample complexity and subexponential running time in the relevant parameters. [sent-9, score-0.587]

4 Our approach establishes a relationship between attribute efficient learning and polynomial threshold functions and is based on a new construction of low degree, low weight polynomial threshold functions for decision lists. [sent-10, score-1.272]

5 For a wide range of parameters our construction matches a lower bound due to Beigel for decision lists and gives an essentially optimal tradeoff between polynomial threshold function degree and weight. [sent-11, score-0.939]

6 Second, we give an algorithm for learning an unknown parity function on k out of n variables using O(n^{1−1/k}) examples in poly(n) time. [sent-12, score-0.415]

7 For k = o(log n) this yields a polynomial time algorithm with sample complexity o(n); this is the first polynomial time algorithm for learning parity on a superconstant number of variables with sublinear sample complexity. [sent-13, score-0.957]

8 We also give a simple algorithm for learning an unknown length-k parity using O(k log n) examples in n^{k/2} time, which improves on the naive n^k time bound of exhaustive search. [sent-14, score-0.572]

9 Keywords: PAC learning, attribute efficiency, learning parity, decision lists, Winnow 1. [sent-15, score-0.434]

10 , 1995; Blum and Langley, 1997; Valiant, 1999), is whether or not there exist attribute efficient algorithms for learning decision lists, which are essentially nested “if-then-else” statements (we give a precise definition in Section 2). [sent-34, score-0.434]

11 (1995) showed that for many concept classes (including decision lists) attribute efficient learnability in the standard n-attribute model is equivalent to learnability in the infinite attribute model. [sent-37, score-0.748]

12 Since simple classes such as disjunctions and conjunctions are attribute efficiently learnable (and hence learnable in the infinite attribute model), this motivated Blum (1990) to ask whether the richer class of decision lists is thus learnable as well. [sent-38, score-1.202]

13 More recently, Valiant (1999) relates the problem of learning decision lists attribute efficiently to questions about human learning abilities. [sent-41, score-0.68]

14 Another outstanding challenge in machine learning is to determine whether there exist attribute efficient algorithms for learning parity functions. [sent-42, score-0.692]

15 The parity function on a set of 0/1-valued variables x_{i_1}, . . . [sent-43, score-0.415]

16 As with decision lists, a simple PAC learning algorithm is known for the class of parity functions but no attribute efficient algorithm is known. [sent-47, score-0.849]

17 1 Our Results We give the first learning algorithm for decision lists that is subexponential in both sample complexity (in the relevant parameters k and log n) and running time (in the relevant parameter k). [sent-49, score-0.646]

18 Our results demonstrate for the first time that it is possible to simultaneously avoid the “worst case” in both sample complexity and running time, and thus suggest that it may perhaps be possible to learn decision lists attribute efficiently. [sent-50, score-0.816]

19 Our main learning result for decision lists is: Theorem 1 There is an algorithm which learns length-k decision lists over {0, 1}^n with mistake bound 2^{Õ(k^{1/3})} log n and time n^{Õ(k^{1/3})}. [sent-51, score-1.129]

20 We prove Theorem 1 in two parts; first we generalize the Winnow algorithm for learning linear threshold functions to learn polynomial threshold functions (PTFs). [sent-54, score-0.479]

21 This reduces the decision list learning problem to a problem of representing decision lists with PTFs of low weight and low degree. [sent-60, score-0.894]

22 Then L is computed by a polynomial threshold function of degree Õ(k^{1/3}) and weight 2^{Õ(k^{1/3})}. [sent-62, score-0.538]

23 Beigel (1994) gave a lower bound showing that any degree-d PTF for a certain decision list must have weight 2^{Ω(n/d^2)}. [sent-66, score-0.584]

24 For parity functions, we give an O(n^4) time algorithm which can PAC learn an unknown parity on k variables out of n using Õ(n^{1−1/k}) examples. [sent-68, score-0.894]

25 To our knowledge this is the first algorithm for learning parity on a superconstant number of variables with sublinear sample complexity. [sent-69, score-0.511]

26 We also describe a simple algorithm, due to Dan Spielman, for learning an unknown parity on k variables using O(k log n) examples and Õ(n^{k/2}) time. [sent-73, score-0.518]

27 The algorithm can learn any decision list of length k in O(kn^2) time using O(kn) examples. [sent-79, score-0.467]

28 This “halving algorithm,” proposed in various forms in Angluin (1988); Barzdin and Freivald (1972); Mitchell (1982), can learn decision lists of length k using only O(k log n) examples, but the running time is n^{Θ(k)}. [sent-81, score-0.692]

29 • Several researchers (Blum, 1996; Valiant, 1999) have observed that Winnow can learn length-k decision lists from 2^{O(k)} log n examples in time 2^{O(k)} · n log n. [sent-84, score-0.673]

30 This follows from the fact that any decision list of length k can be expressed as a linear threshold function with integer coefficients of magnitude 2^{Θ(k)}. [sent-85, score-0.607]

31 • Finally, several researchers have considered the special case of learning a length-k decision list in which the output bits of the list have at most D alternations. [sent-86, score-0.525]

32 Little previous work has been published on learning parity functions attribute efficiently in the PAC model. [sent-90, score-0.692]

33 The standard PAC learning algorithm for parity (based on solving a system of linear equations) is due to Helmbold et al. [sent-91, score-0.415]

34 Several authors have considered learning parity attribute efficiently in a model where the learner is allowed to make membership queries. [sent-93, score-0.692]

35 (1995) give a randomized polynomial time membership-query algorithm for learning parity on k variables using only O(k log n) examples, and these results were later refined by Uehara et al. [sent-96, score-0.729]

36 In Section 3 we show how to reduce the decision list learning problem to a problem of finding suitable PTF representations of decision lists (Theorem 2). [sent-100, score-0.732]

37 In Section 4 we give our PTF construction for decision lists (Theorem 3). [sent-101, score-0.453]

38 In Section 6 we give our results on learning parity functions, and we conclude in Section 7. [sent-103, score-0.415]

39 Our main interests are the classes of decision lists and parity functions. [sent-110, score-0.818]

40 A parity function of length k is defined by a set of variables S ⊂ {x1 , . [sent-119, score-0.489]

41 The parity function χS (x) takes value 1 (−1) on inputs which set an even (odd) number of variables in S to 1. [sent-123, score-0.415]

42 We say that a learning algorithm A for C in the mistake-bound model is attribute efficient if the mistake bound of A on any concept f ∈ C is polynomial in size( f ). [sent-125, score-0.697]

43 In particular, the description length of a length k decision list (parity) is O(k log n), and thus we would ideally like to have poly(n)-time algorithms which learn decision lists (parities) of length k with a mistake bound of poly(k, log n). [sent-126, score-1.396]

44 These transformations essentially preserve the running time of the mistake bound algorithm, and the sample size required by the PAC algorithm is essentially the mistake bound. [sent-129, score-0.458]

45 Thus, positive results for mistake bound learning, such as those we give for decision lists in this paper, directly yield corresponding positive results for the PAC model. [sent-130, score-0.599]

46 Finally, our results for decision lists are achieved by a careful analysis of polynomial threshold functions. [sent-131, score-0.716]

47 If the sign of p(x) equals f(x) for every x ∈ {0, 1}^n, then we say that p is a polynomial threshold function (PTF) of degree d and weight W for f. [sent-134, score-0.566]
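
The definition above lends itself to a direct, brute-force sanity check. The sketch below (Python, with a hypothetical representation of polynomials as dictionaries mapping monomials to integer coefficients) verifies that a candidate polynomial is a PTF for f over {0,1}^n and reports its degree and weight; it only illustrates the definition and is not one of the paper's algorithms.

```python
from itertools import product

def is_ptf_for(poly, f, n):
    """Check whether sign(p(x)) = f(x) for every x in {0,1}^n.

    poly: dict mapping tuples of variable indices (monomials) to integer
          coefficients, e.g. {(): 1, (0,): 4, (1,): -2}.
    f:    target Boolean function, f(x) in {-1, +1} for a 0/1 tuple x.
    Returns (is_ptf, degree, weight), where weight = sum of |coefficients|.
    """
    degree = max((len(mon) for mon in poly), default=0)
    weight = sum(abs(c) for c in poly.values())
    for x in product((0, 1), repeat=n):
        val = sum(c * all(x[i] for i in mon) for mon, c in poly.items())
        if val == 0 or (1 if val > 0 else -1) != f(x):
            return False, degree, weight
    return True, degree, weight

# The length-2 decision list "if x1 then +1, else if x2 then -1, else +1"
# is computed by the degree-1, weight-7 PTF  4*x1 - 2*x2 + 1.
dl = lambda x: 1 if x[0] else (-1 if x[1] else 1)
print(is_ptf_for({(0,): 4, (1,): -2, (): 1}, dl, 2))   # (True, 1, 7)
```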

48 Expanded-Winnow: Learning Polynomial Threshold Functions Littlestone (1988) introduced the online Winnow algorithm and showed that it can attribute efficiently learn Boolean conjunctions, disjunctions, and low weight linear threshold functions. [sent-136, score-0.579]

49 The following theorem due to Littlestone (1988) gives a mistake bound for Winnow for linear threshold functions: Theorem 4 Let f(x) be the linear threshold function sign(∑_{i=1}^{n} w_i x_i − θ) over inputs x ∈ {0, 1}^n, where θ and w_1, . . . [sent-140, score-0.494]
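
For reference, here is a minimal sketch of the Winnow update rule in the multiplicative form used by Littlestone; the threshold n, the all-ones initialization, and the promotion factor 2 are the textbook choices and are assumptions of this sketch rather than details taken from the excerpt.

```python
def winnow(examples, n, alpha=2.0):
    """Online Winnow over {0,1}^n with labels in {-1,+1}.

    Predict +1 iff sum_i w_i * x_i >= n; on each mistake, multiply (for a
    false negative) or divide (for a false positive) by alpha the weights
    of the coordinates that were set to 1.  Returns (weights, mistakes).
    """
    w = [1.0] * n
    mistakes = 0
    for x, y in examples:            # x: 0/1 list of length n, y: -1 or +1
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else -1
        if pred != y:
            mistakes += 1
            for i in range(n):
                if x[i]:
                    w[i] = w[i] * alpha if y == 1 else w[i] / alpha
    return w, mistakes
```

For a disjunction of r out of n variables this update rule makes O(r log n) mistakes, which is the attribute efficient behaviour the surrounding sentences refer to.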

50 We will use a generalization of the Winnow algorithm, which we call Expanded-Winnow, to learn polynomial threshold functions of degree at most d. [sent-146, score-0.468]

51 Thus the hypothesis which Winnow maintains – a linear threshold function over the space of expanded features – is a polynomial threshold function of degree d over the original n variables x_1, . . . [sent-155, score-0.598]
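
The expansion step is easy to make concrete: map each input to the vector of all monomials of degree at most d and run Winnow on the expanded vectors. The sketch below (hypothetical helper name) is one way to realize that mapping; the expanded dimension is roughly n^d, which is where the n^{O(d)} running time comes from.

```python
from itertools import combinations

def expand(x, d):
    """Map x in {0,1}^n to the 0/1 values of all monomials of degree <= d.

    A degree-d PTF over the original variables is a linear threshold
    function over this expanded vector, so Winnow can be run on it directly.
    """
    n = len(x)
    feats = []
    for deg in range(1, d + 1):
        for subset in combinations(range(n), deg):
            feats.append(1 if all(x[i] for i in subset) else 0)
    return feats

# Expanded-Winnow: feed (expand(x, d), y) pairs to an ordinary Winnow learner.
```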

52 Theorem 2, which follows directly from Theorem 4, summarizes the performance of Expanded-Winnow: Theorem 2 Let C be a class of Boolean functions over {0, 1}^n with the property that each f ∈ C has a polynomial threshold function of degree at most d and weight at most W. [sent-159, score-0.538]

53 Theorem 2 shows that the degree of a polynomial threshold function strongly affects Expanded-Winnow’s running time, and the weight of a polynomial threshold function strongly affects its sample complexity. [sent-161, score-0.923]

54 We first give a simple construction of a degree h, weight 2^{O(k/h+h)} PTF for L which is based on breaking the list L into sublists. [sent-174, score-0.447]

55 The set B_h of modified decision lists is defined as follows: each function in B_h is a decision list (ℓ_1, b_1), (ℓ_2, b_2), . . . [sent-177, score-0.732]

56 Thus the only difference between a modified decision list f ∈ B_h and a normal decision list of length h is that the final output value is 0 rather than b_{h+1} ∈ {−1, +1}. [sent-184, score-0.732]

57 For any h < k we have that L is computed by a polynomial threshold function of degree h and weight 2^{O(k/h+h)}. [sent-209, score-0.538]

58 Then L is computed by a polynomial threshold function of degree k^{1/2} and weight 2^{O(k^{1/2})}. [sent-222, score-0.538]

59 2 Inner Approximator In this section we construct low degree, low weight polynomials which approximate (in the L∞ norm) the modified decision lists from the previous subsection. [sent-225, score-0.623]

60 Moreover, the polynomials we construct are exactly correct on inputs which “fall off the end”: Theorem 8 Let f ∈ B_h be a modified decision list of length h. [sent-226, score-0.461]

61 As in Klivans and Servedio (2004), here C_d(x) is the dth Chebyshev polynomial of the first kind (a univariate polynomial of degree d). [sent-246, score-0.489]
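
The reason Chebyshev polynomials appear here is their extremal behaviour: C_d is bounded by 1 in absolute value on [−1, 1] but grows very rapidly just outside that interval, which gives the amplification the inner approximator needs. The snippet below only illustrates this numerically via the standard three-term recurrence; the printed values are illustrative and are not bounds claimed by the paper.

```python
def chebyshev(d, x):
    """Evaluate the degree-d Chebyshev polynomial of the first kind at x,
    using the recurrence C_0 = 1, C_1 = x, C_d = 2*x*C_{d-1} - C_{d-2}."""
    prev, cur = 1.0, float(x)
    if d == 0:
        return prev
    for _ in range(2, d + 1):
        prev, cur = cur, 2 * x * cur - prev
    return cur

d = 20
# Bounded by 1 (up to floating point) on [-1, 1] ...
print(max(abs(chebyshev(d, -1 + 2 * i / 1000)) for i in range(1001)))
# ... but already large slightly outside the interval.
print(chebyshev(d, 1 + 1.0 / d))
```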

62 Thus each P_i(x) can be written as P̃_i(x)/(h^{√h} N_1)^{2 log h}, where P̃_i(x) is an integer polynomial of weight h^{O(√h log h)}. [sent-268, score-0.9]

63 It follows that p(x) equals p̃(x)/C, where C is an integer which is at most 2^{O(h^{1/2} log^2 h)} and p̃ is a polynomial with integer coefficients and weight 2^{O(h^{1/2} log^2 h)}. [sent-269, score-0.443]

64 We thus have: Corollary 9 Let f ∈ B_h be a modified decision list of length h. [sent-270, score-0.563]

65 Then there is an integer polynomial p(x) of degree 2√h log h and weight 2^{O(h^{1/2} log^2 h)}, and an integer C = 2^{O(h^{1/2} log^2 h)}, such that • for every input x ∈ {0, 1}^h we have |p(x) − C f(x)| ≤ C/h. [sent-271, score-0.821]

66 3 Composing the Constructions In this section we combine the two constructions from the previous subsections to obtain our main polynomial threshold construction: Theorem 10 Let L be a decision list of length k. [sent-275, score-0.751]

67 Then for any h < k, L is computed by a polynomial threshold function of degree O(h^{1/2} log h) and weight 2^{O(k/h + h^{1/2} log^2 h)}. [sent-276, score-0.744]

68 We begin with the outer construction: from the note following Claim 5 we have that L(x) = sign(C ∑_{i=1}^{k/h} 3^{k/h−i+1} f_i(x) + b_{k+1}), where C is the value from Corollary 9 and each f_i is a modified decision list of length h computing the restriction of L to its ith block as defined in Subsection 4. [sent-281, score-0.526]
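
The geometric weights 3^{k/h−i+1} are what make the outer construction work: the first block that "fires" contributes a term that strictly dominates the sum of all later terms plus the default output. The sketch below illustrates only this combination step, with C set to 1 and each block f_i evaluated exactly (it returns its output if some literal in its block fires and 0 otherwise); the low-degree approximation of the f_i from Corollary 9 is deliberately left out.

```python
import random

def eval_decision_list(pairs, default, x):
    """pairs: list of ((var, negated), out) with out in {-1,+1}; a literal
    fires when x[var] != negated.  Returns the decision list's value on x."""
    for (var, neg), out in pairs:
        if x[var] != neg:
            return out
    return default

def outer_construction_sign(pairs, default, x, h):
    """Split the list into blocks of length h and combine the blocks' exact
    outputs f_i(x) in {-1, 0, +1} with weights 3^(k/h - i + 1), as in the
    formula above (with C = 1)."""
    blocks = [pairs[j:j + h] for j in range(0, len(pairs), h)]
    total = default
    for i, block in enumerate(blocks, start=1):
        fi = 0
        for (var, neg), out in block:
            if x[var] != neg:
                fi = out
                break
        total += 3 ** (len(blocks) - i + 1) * fi
    return 1 if total > 0 else -1

# Check on random inputs that the combined sign agrees with the list itself.
pairs = [((v, random.choice([False, True])), random.choice([-1, 1])) for v in range(12)]
default = random.choice([-1, 1])
for _ in range(1000):
    x = [random.randint(0, 1) for _ in range(12)]
    assert outer_construction_sign(pairs, default, x, h=3) == eval_decision_list(pairs, default, x)
```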

69 Finally, our degree and weight bounds from Corollary 9 imply that the degree of H(x) is O(h^{1/2} log h) and the weight of H(x) is 2^{O(k/h)+O(h^{1/2} log^2 h)}, and the theorem is proved. [sent-293, score-0.702]

70 Taking h = k^{2/3}/log^{4/3} k in the above theorem we obtain our main result on representing decision lists as polynomial threshold functions: Theorem 3 Let L be a decision list of length k. [sent-294, score-1.165]

71 Then L is computed by a polynomial threshold function of degree k^{1/3} log^{1/3} k and weight 2^{O(k^{1/3} log^{4/3} k)}. [sent-295, score-0.641]

72 Theorem 3 immediately implies that Expanded-Winnow can learn decision lists of length k using 2^{Õ(k^{1/3})} log n examples and time n^{Õ(k^{1/3})}. [sent-296, score-0.644]

73 An r-decision list is like a standard decision list but each pair is now of the form (C_i, b_i) where C_i is a conjunction of at most r literals and as before b_i = ±1. [sent-302, score-0.621]

74 Then for any h < k, L is computed by a polynomial threshold function of degree O(r h^{1/2} log h) and weight 2^{r+O(k/h + h^{1/2} log^2 h)}. [sent-305, score-0.744]

75 By Theorem 10 there is a polynomial threshold function of degree O(h^{1/2} log h) and weight 2^{O(k/h + h^{1/2} log^2 h)} over the variables C_1, . . . [sent-310, score-0.744]

76 Each such interpolating polynomial has degree r and integer coefficients of total magnitude at most 2^r, and the corollary follows. [sent-315, score-0.443]

77 Corollary 12 There is an algorithm for learning r-decision lists over {0, 1}^n which, when learning an r-decision list of length k, has mistake bound 2^{Õ(r+k^{1/3})} log n and runs in time n^{Õ(r k^{1/3})}. [sent-316, score-0.815]

78 Now we can apply Corollary 12 to obtain a tradeoff between running time and sample complexity for learning decision trees: Theorem 13 Let D be a decision tree of size s over n variables. [sent-317, score-0.438]

79 Lower Bounds for Decision Lists Here we observe that our construction from Theorem 10 is essentially optimal in terms of the tradeoff it achieves between polynomial threshold function degree and weight. [sent-322, score-0.506]

80 At the heart of his construction is a proof that any low degree PTF for a particular decision list called the ODDMAXBIT_n function must have large weights: Definition 14 The ODDMAXBIT_n function on input x = x_1, . . . [sent-324, score-0.52]
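
The definition is truncated in the extracted sentence above; informally, ODDMAXBIT_n is the canonical "alternating" decision list, whose value depends on the parity of the position of the first 1 bit of x. The sketch below is a hedged illustration only: the scanning direction and the convention on the all-zeros input are assumptions here, not details taken from the excerpt.

```python
def oddmaxbit(x):
    """Alternating decision list: output +1 or -1 according to whether the
    (1-based) index of the first 1 bit of x is odd or even.
    The value on the all-zeros input is an assumed convention."""
    for i, xi in enumerate(x, start=1):
        if xi:
            return 1 if i % 2 == 1 else -1
    return -1  # assumed default when x is all zeros
```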

81 Our Theorem 10 gives an upper bound which matches Beigel’s lower bound (up to logarithmic factors) for all d = O(n^{1/3}): Observation 16 For any d = O(n^{1/3}) there is a polynomial threshold function of degree d and weight 2^{Õ(n/d^2)} which computes ODDMAXBIT_n. [sent-338, score-0.598]

82 Note that since the ODDMAXBIT_n function has a polynomial size DNF, Beigel’s lower bound gives a polynomial size DNF f such that any degree Õ(n^{1/3}) polynomial threshold function for f must have weight 2^{Ω̃(n^{1/3})}. [sent-341, score-0.832]

83 Learning Parity Functions Recall that the standard algorithm for learning parity functions works by viewing a set of m labelled examples as a set of m linear equations over GF(2). [sent-344, score-0.443]
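
For concreteness, here is a minimal sketch of that standard algorithm: view each labelled example as one linear equation over GF(2) and solve by Gaussian elimination. Labels are encoded as 0/1 here (1 meaning an odd number of relevant variables are set); note that the returned solution can have weight up to min(m, n), which is exactly the issue raised in the next sentences.

```python
def solve_parity_gf2(X, y):
    """Find any parity consistent with the m examples (X[i], y[i]).

    X: list of m 0/1 vectors of length n; y: list of m labels in {0,1}.
    Returns a 0/1 vector s with X s = y (mod 2), or None if inconsistent.
    """
    m, n = len(X), len(X[0])
    rows = [(list(xi), yi) for xi, yi in zip(X, y)]
    pivots, r = [], 0
    for c in range(n):                       # Gauss-Jordan elimination over GF(2)
        pivot = next((i for i in range(r, m) if rows[i][0][c]), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(m):
            if i != r and rows[i][0][c]:
                rows[i] = ([a ^ b for a, b in zip(rows[i][0], rows[r][0])],
                           rows[i][1] ^ rows[r][1])
        pivots.append((r, c))
        r += 1
    if any(yi == 1 and not any(xi) for xi, yi in rows):
        return None                          # equation 0 = 1: no consistent parity
    s = [0] * n
    for i, c in pivots:                      # free variables are set to 0
        s[c] = rows[i][1]
    return s
```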

84 Even though there exists a solution of weight at most k (since the target parity is of length k), Gaussian elimination applied to a system of m equations in n variables over GF(2) may yield a solution of weight as large as min(m, n). [sent-346, score-0.738]

85 Thus this standard algorithm and analysis give an O(n) sample complexity bound for learning a parity of length at most k. [sent-347, score-0.543]

86 1 A Polynomial Time Algorithm We now describe a simple poly(n)-time algorithm for PAC learning an unknown length-k parity using Õ(n^{1−1/k}) examples (for a formal definition of the PAC model we refer the reader to the book by Kearns and Vazirani, 1994). [sent-349, score-0.415]

87 Theorem 17 The class of all parity functions on at most k variables is PAC learnable in O(n^4) time using O(n^{1−1/k} log n) examples. [sent-351, score-0.603]

88 The hypothesis output by the learning algorithm is a parity function on O(n^{1−1/k}) variables. [sent-352, score-0.415]

89 Let H be the set of all parity functions of length at most ℓ. [sent-356, score-0.489]

90 If the system has a solution, output the corresponding parity (of length at most ℓ = n^{1−1/k}) as the hypothesis. [sent-365, score-0.489]

91 Let V be the set of k relevant variables on which the unknown parity function depends. [sent-371, score-0.415]

92 The hypothesis output by the learning algorithm is a parity function on at most k variables. [sent-377, score-0.415]

93 Proof By Occam’s Razor we need only show that, given a set of m = O(k log n) labelled examples, a consistent length-k parity can be found in Õ(n^{k/2}) time. [sent-378, score-0.546]

94 After sorting these two lists of vectors, which takes Õ(n^{k/2}) time, we scan through them in parallel in time linear in the length of the lists and find a pair of vectors v_{S_1} from the first list and v_{S_2 ∪ {x_{n+1}}} from the second list which are the same. [sent-404, score-0.934]

95 The set S_1 ∪ S_2 is then a consistent parity of length k. [sent-406, score-0.489]
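
The meet-in-the-middle search sketched in the last few sentences can be written down almost directly. In the sketch below (labels again encoded 0/1, with 1 meaning odd) a dictionary plays the role of the sorted lists, the labels are XORed into the second list of vectors in place of the extra variable x_{n+1}, and the consistent parity is returned as a symmetric difference, which equals S_1 ∪ S_2 when the two halves are disjoint as in the description above. The exact split of subset sizes between the two halves is an assumption of this sketch.

```python
from itertools import combinations

def spielman_parity(X, y, k):
    """Meet-in-the-middle search for a parity of length <= k consistent with
    the examples (X[i], y[i]): enumerate the ~n^(k/2) 'half parities' on each
    side instead of the ~n^k candidates of exhaustive search."""
    n = len(X[0])
    half1, half2 = (k + 1) // 2, k // 2

    def column(S):
        # vector of values of the parity on S over every example
        return tuple(sum(x[i] for i in S) % 2 for x in X)

    seen = {}
    for r in range(half1 + 1):               # first list: v_{S1}
        for S1 in combinations(range(n), r):
            seen.setdefault(column(S1), S1)
    for r in range(half2 + 1):               # second list: v_{S2} with labels folded in
        for S2 in combinations(range(n), r):
            v = tuple((c + yi) % 2 for c, yi in zip(column(S2), y))
            if v in seen:
                S1 = seen[v]
                return sorted(set(S1) ^ set(S2))
    return None
```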

96 As a first step, one might attempt to extend the tradeoffs we achieve: is it possible to learn decision lists of length k in n^{k^{1/2}} time from poly(k, log n) examples? [sent-409, score-0.644]

97 Another goal is to extend our results for decision lists to broader concept classes. [sent-410, score-0.44]

98 (1992) have given a linear threshold function over {−1, 1}^n for which any polynomial threshold function must have weight 2^{Ω(n^{1/2})} regardless of its degree. [sent-413, score-0.549]

99 Moreover Krause and Pudlak (1998) have shown that any Boolean function which has a polynomial threshold function over {0, 1}^n of weight w has a polynomial threshold function over {−1, 1}^n of weight n^2 w^4. [sent-414, score-0.846]

100 For parity functions many questions remain as well: can we learn parity functions on k = Θ(log n) variables in polynomial time using a sublinear number of examples? [sent-416, score-1.109]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('parity', 0.415), ('attribute', 0.277), ('klivans', 0.261), ('lists', 0.246), ('ptf', 0.218), ('winnow', 0.21), ('polynomial', 0.187), ('servedio', 0.185), ('list', 0.172), ('mistake', 0.166), ('decision', 0.157), ('blum', 0.151), ('boolean', 0.135), ('bk', 0.132), ('threshold', 0.126), ('beigel', 0.123), ('pac', 0.121), ('ervedio', 0.116), ('livans', 0.116), ('ptfs', 0.116), ('degree', 0.115), ('weight', 0.11), ('log', 0.103), ('oddmaxbitn', 0.102), ('oward', 0.102), ('dnf', 0.099), ('bh', 0.077), ('fficient', 0.077), ('poly', 0.075), ('length', 0.074), ('valiant', 0.066), ('learnable', 0.061), ('polynomials', 0.058), ('littlestone', 0.058), ('donnell', 0.058), ('parities', 0.058), ('spielman', 0.058), ('cd', 0.055), ('approximator', 0.051), ('integer', 0.05), ('construction', 0.05), ('xn', 0.048), ('running', 0.048), ('ti', 0.047), ('theorem', 0.046), ('bi', 0.046), ('pi', 0.045), ('expanded', 0.044), ('literal', 0.044), ('dhagat', 0.044), ('myhill', 0.044), ('nevo', 0.044), ('subexponential', 0.044), ('superconstant', 0.044), ('learn', 0.04), ('corollary', 0.038), ('rocco', 0.037), ('conjunctions', 0.037), ('hellerstein', 0.037), ('krause', 0.037), ('gf', 0.037), ('concept', 0.037), ('fi', 0.036), ('constructions', 0.035), ('chebyshev', 0.033), ('coef', 0.031), ('xh', 0.03), ('earning', 0.03), ('cients', 0.03), ('bound', 0.03), ('barzdin', 0.029), ('cbk', 0.029), ('expandedwinnow', 0.029), ('golding', 0.029), ('goldmann', 0.029), ('kautz', 0.029), ('negated', 0.029), ('uehara', 0.029), ('elimination', 0.029), ('magnitude', 0.028), ('sign', 0.028), ('xik', 0.028), ('literals', 0.028), ('sublinear', 0.028), ('tradeoff', 0.028), ('qi', 0.028), ('labelled', 0.028), ('claim', 0.027), ('br', 0.026), ('low', 0.026), ('trees', 0.025), ('disjunctions', 0.025), ('interpolating', 0.025), ('intersections', 0.025), ('adam', 0.025), ('vazirani', 0.025), ('sample', 0.024), ('time', 0.024), ('bits', 0.024), ('outer', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities

Author: Adam R. Klivans, Rocco A. Servedio

Abstract: We consider two well-studied problems regarding attribute efficient learning: learning decision lists and learning parity functions. First, we give an algorithm for learning decision lists of length k over n variables using 2^{Õ(k^{1/3})} log n examples and time n^{Õ(k^{1/3})}. This is the first algorithm for learning decision lists that has both subexponential sample complexity and subexponential running time in the relevant parameters. Our approach establishes a relationship between attribute efficient learning and polynomial threshold functions and is based on a new construction of low degree, low weight polynomial threshold functions for decision lists. For a wide range of parameters our construction matches a lower bound due to Beigel for decision lists and gives an essentially optimal tradeoff between polynomial threshold function degree and weight. Second, we give an algorithm for learning an unknown parity function on k out of n variables using O(n^{1−1/k}) examples in poly(n) time. For k = o(log n) this yields a polynomial time algorithm with sample complexity o(n); this is the first polynomial time algorithm for learning parity on a superconstant number of variables with sublinear sample complexity. We also give a simple algorithm for learning an unknown length-k parity using O(k log n) examples in n^{k/2} time, which improves on the naive n^k time bound of exhaustive search. Keywords: PAC learning, attribute efficiency, learning parity, decision lists, Winnow

2 0.2084886 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

Author: Paul W. Goldberg

Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification

3 0.057595219 53 jmlr-2006-Learning a Hidden Hypergraph

Author: Dana Angluin, Jiang Chen

Abstract: We consider the problem of learning a hypergraph using edge-detecting queries. In this model, the learner may query whether a set of vertices induces an edge of the hidden hypergraph or not. We show that an r-uniform hypergraph with m edges and n vertices is learnable with O(2^{4r} m · poly(r, log n)) queries with high probability. The queries can be made in O(min(2^r (log m + r)^2, (log m + r)^3)) rounds. We also give an algorithm that learns an almost uniform hypergraph of dimension r using O(2^{O((1+∆/2) r)} · m^{1+∆/2} · poly(log n)) queries with high probability, where ∆ is the difference between the maximum and the minimum edge sizes. This upper bound matches our lower bound of Ω((m/(1+∆/2))^{1+∆/2}) for this class of hypergraphs in terms of dependence on m. The queries can also be made in O((1 + ∆) · min(2^r (log m + r)^2, (log m + r)^3)) rounds. Keywords: query learning, hypergraph, multiple round algorithm, sampling, chemical reaction network

4 0.048091862 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers

5 0.046741452 25 jmlr-2006-Distance Patterns in Structural Similarity

Author: Thomas Kämpke

Abstract: Similarity of edge labeled graphs is considered in the sense of minimum squared distance between corresponding values. Vertex correspondences are established by isomorphisms if both graphs are of equal size and by subisomorphisms if one graph has fewer vertices than the other. Best fit isomorphisms and subisomorphisms amount to solutions of quadratic assignment problems and are computed exactly as well as approximately by minimum cost flow, linear assignment relaxations and related graph algorithms. Keywords: assignment problem, best approximation, branch and bound, inexact graph matching, model data base

6 0.046035111 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

7 0.042625658 39 jmlr-2006-Inductive Synthesis of Functional Programs: An Explanation Based Generalization Approach     (Special Topic on Inductive Programming)

8 0.040187288 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

9 0.039419342 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

10 0.037871782 68 jmlr-2006-On the Complexity of Learning Lexicographic Strategies

11 0.036805011 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

12 0.035757892 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

13 0.034635525 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines

14 0.034207184 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

15 0.032170046 45 jmlr-2006-Learning Coordinate Covariances via Gradients

16 0.031794928 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

17 0.029760465 12 jmlr-2006-Active Learning with Feedback on Features and Instances

18 0.028918082 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

19 0.028823335 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

20 0.028139636 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.166), (1, 0.003), (2, -0.047), (3, -0.072), (4, -0.074), (5, 0.13), (6, 0.144), (7, -0.126), (8, -0.159), (9, 0.08), (10, 0.034), (11, 0.006), (12, 0.226), (13, 0.051), (14, -0.51), (15, 0.028), (16, -0.105), (17, 0.154), (18, 0.003), (19, -0.032), (20, -0.024), (21, 0.008), (22, -0.06), (23, -0.039), (24, 0.078), (25, 0.189), (26, 0.081), (27, -0.021), (28, 0.073), (29, 0.027), (30, 0.054), (31, 0.011), (32, 0.099), (33, 0.078), (34, 0.058), (35, -0.103), (36, -0.068), (37, -0.013), (38, -0.022), (39, -0.004), (40, 0.037), (41, 0.013), (42, -0.068), (43, -0.002), (44, -0.016), (45, -0.093), (46, 0.086), (47, -0.09), (48, 0.055), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96623921 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities

Author: Adam R. Klivans, Rocco A. Servedio

Abstract: We consider two well-studied problems regarding attribute efficient learning: learning decision lists and learning parity functions. First, we give an algorithm for learning decision lists of length k over n variables using 2^{Õ(k^{1/3})} log n examples and time n^{Õ(k^{1/3})}. This is the first algorithm for learning decision lists that has both subexponential sample complexity and subexponential running time in the relevant parameters. Our approach establishes a relationship between attribute efficient learning and polynomial threshold functions and is based on a new construction of low degree, low weight polynomial threshold functions for decision lists. For a wide range of parameters our construction matches a lower bound due to Beigel for decision lists and gives an essentially optimal tradeoff between polynomial threshold function degree and weight. Second, we give an algorithm for learning an unknown parity function on k out of n variables using O(n^{1−1/k}) examples in poly(n) time. For k = o(log n) this yields a polynomial time algorithm with sample complexity o(n); this is the first polynomial time algorithm for learning parity on a superconstant number of variables with sublinear sample complexity. We also give a simple algorithm for learning an unknown length-k parity using O(k log n) examples in n^{k/2} time, which improves on the naive n^k time bound of exhaustive search. Keywords: PAC learning, attribute efficiency, learning parity, decision lists, Winnow

2 0.80328673 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

Author: Paul W. Goldberg

Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification

3 0.2695227 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

Author: Pieter Abbeel, Daphne Koller, Andrew Y. Ng

Abstract: We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded degree can be learned in polynomial time and from a polynomial number of training examples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks. In addition to our main result, we show that the sample complexity of parameter learning in graphical models has an O(1) dependence on the number of variables in the model when using the KL-divergence normalized by the number of variables as the performance criterion.1 Keywords: probabilistic graphical models, parameter and structure learning, factor graphs, Markov networks, Bayesian networks

4 0.26375565 53 jmlr-2006-Learning a Hidden Hypergraph

Author: Dana Angluin, Jiang Chen

Abstract: We consider the problem of learning a hypergraph using edge-detecting queries. In this model, the learner may query whether a set of vertices induces an edge of the hidden hypergraph or not. We show that an r-uniform hypergraph with m edges and n vertices is learnable with O(2^{4r} m · poly(r, log n)) queries with high probability. The queries can be made in O(min(2^r (log m + r)^2, (log m + r)^3)) rounds. We also give an algorithm that learns an almost uniform hypergraph of dimension r using O(2^{O((1+∆/2) r)} · m^{1+∆/2} · poly(log n)) queries with high probability, where ∆ is the difference between the maximum and the minimum edge sizes. This upper bound matches our lower bound of Ω((m/(1+∆/2))^{1+∆/2}) for this class of hypergraphs in terms of dependence on m. The queries can also be made in O((1 + ∆) · min(2^r (log m + r)^2, (log m + r)^3)) rounds. Keywords: query learning, hypergraph, multiple round algorithm, sampling, chemical reaction network

5 0.25581571 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

Author: Ron Begleiter, Ran El-Yaniv

Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-ofthe-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds

6 0.22942013 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

7 0.22022583 68 jmlr-2006-On the Complexity of Learning Lexicographic Strategies

8 0.19104967 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

9 0.18664809 39 jmlr-2006-Inductive Synthesis of Functional Programs: An Explanation Based Generalization Approach     (Special Topic on Inductive Programming)

10 0.17345196 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines

11 0.17168306 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

12 0.1691341 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

13 0.15875131 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

14 0.15834218 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution

15 0.15703197 49 jmlr-2006-Learning Parts-Based Representations of Data

16 0.14896019 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

17 0.14667587 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

18 0.14065525 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification

19 0.13847999 67 jmlr-2006-On Representing and Generating Kernels by Fuzzy Equivalence Relations

20 0.13428402 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.013), (36, 0.127), (45, 0.02), (50, 0.055), (63, 0.04), (66, 0.422), (68, 0.011), (76, 0.032), (78, 0.017), (79, 0.01), (81, 0.028), (84, 0.019), (90, 0.04), (91, 0.026), (96, 0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75519294 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities

Author: Adam R. Klivans, Rocco A. Servedio

Abstract: We consider two well-studied problems regarding attribute efficient learning: learning decision lists and learning parity functions. First, we give an algorithm for learning decision lists of length k over n variables using 2^{Õ(k^{1/3})} log n examples and time n^{Õ(k^{1/3})}. This is the first algorithm for learning decision lists that has both subexponential sample complexity and subexponential running time in the relevant parameters. Our approach establishes a relationship between attribute efficient learning and polynomial threshold functions and is based on a new construction of low degree, low weight polynomial threshold functions for decision lists. For a wide range of parameters our construction matches a lower bound due to Beigel for decision lists and gives an essentially optimal tradeoff between polynomial threshold function degree and weight. Second, we give an algorithm for learning an unknown parity function on k out of n variables using O(n^{1−1/k}) examples in poly(n) time. For k = o(log n) this yields a polynomial time algorithm with sample complexity o(n); this is the first polynomial time algorithm for learning parity on a superconstant number of variables with sublinear sample complexity. We also give a simple algorithm for learning an unknown length-k parity using O(k log n) examples in n^{k/2} time, which improves on the naive n^k time bound of exhaustive search. Keywords: PAC learning, attribute efficiency, learning parity, decision lists, Winnow

2 0.35030109 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

Author: Paul W. Goldberg

Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification

3 0.34032011 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

Author: Magnus Ekdahl, Timo Koski

Abstract: In many pattern recognition/classification problems the true class conditional model and class probabilities are approximated for reasons of reducing complexity and/or of statistical estimation. The approximated classifier is expected to have worse performance, here measured by the probability of correct classification. We present an analysis valid in general, and easily computable formulas for estimating the degradation in probability of correct classification when compared to the optimal classifier. An example of an approximation is the Naïve Bayes classifier. We show that the performance of the Naïve Bayes depends on the degree of functional dependence between the features and labels. We provide a sufficient condition for zero loss of performance, too. Keywords: Bayesian networks, naïve Bayes, plug-in classifier, Kolmogorov distance of variation, variational learning

4 0.33160639 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

Author: Sayan Mukherjee, Qiang Wu

Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification

5 0.32958078 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets

Author: Janez Demšar

Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams. Keywords: comparative studies, statistical methods, Wilcoxon signed ranks test, Friedman test, multiple comparisons tests

6 0.32847688 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving     (Special Topic on Inductive Programming)

7 0.32713801 53 jmlr-2006-Learning a Hidden Hypergraph

8 0.3263545 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

9 0.32544166 70 jmlr-2006-Online Passive-Aggressive Algorithms

10 0.32494831 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

11 0.32391369 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

12 0.32147878 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting     (Special Topic on Inductive Programming)

13 0.32098919 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

14 0.31780803 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation

15 0.31765637 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

16 0.31701449 66 jmlr-2006-On Model Selection Consistency of Lasso

17 0.31659672 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

18 0.31640032 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)

19 0.31634402 44 jmlr-2006-Large Scale Transductive SVMs

20 0.31571001 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation