jmlr jmlr2006 jmlr2006-81 knowledge-graph by maker-knowledge-mining

81 jmlr-2006-Some Discriminant-Based PAC Algorithms


Source: pdf

Author: Paul W. Goldberg

Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. [sent-6, score-0.264]

2 For each class, find a discriminant function that maps elements of the input domain to real values. [sent-18, score-0.195]

3 These functions can be used to label any element x of the domain with the class label whose associated discriminant function takes the largest value on x. [sent-19, score-0.409]

4 The discriminant functions are usually estimates of the probability densities of points in each class, weighted by the class prior (relative frequency of that class), in which case we are using the Bayes classifier. [sent-20, score-0.26]
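Sentence 4 describes discriminant functions as class-prior-weighted density estimates combined via the Bayes classifier. As a purely illustrative sketch (not from the paper; the 1-D Gaussian model and all function names are assumptions), this can look like:

```python
import math

# Illustrative sketch: build a Bayes classifier from per-class density
# estimates, here 1-D Gaussians fitted separately to each label class.
def fit_gaussian(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, max(var, 1e-12)  # guard against zero variance

def density(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_classify(x, params0, params1, prior1):
    # Discriminant functions: class-prior-weighted density estimates;
    # the predicted label is the one whose discriminant is larger.
    f0 = (1 - prior1) * density(x, *params0)
    f1 = prior1 * density(x, *params1)
    return 1 if f1 > f0 else 0

p0 = fit_gaussian([0.1, -0.2, 0.0, 0.3, -0.1])  # observations labeled 0
p1 = fit_gaussian([5.0, 5.2, 4.8, 5.1, 4.9])    # observations labeled 1
print(bayes_classify(0.05, p0, p1, 0.5))  # 0
print(bayes_classify(5.05, p0, p1, 0.5))  # 1
```

Note that each density is fitted from one label class in isolation, which is the restriction the paper studies.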

5 If it is possible to obtain good estimates of the probability distributions that generated the label classes, then (for reasons we explain below) these are often more useful than just an accurate classification rule. [sent-21, score-0.172]

6 By contrast, discriminant functions are constructed from individual label classes in isolation. [sent-30, score-0.287]

7 It seems clearly “easier” to find a good classifier by considering all data together (so as to apply empirical risk minimisation), than by insisting that each label class must be independently converted into a discriminant function. [sent-31, score-0.317]

8 In contrast with decision boundaries, we obtain for a domain element x, the values of the probability densities of label classes at x, which provide a conditional distribution over the class label of x. [sent-34, score-0.249]

9 We usually cannot assign a sensible label to such instances; however, they may at least be recognised as a result of all class label distributions having very small weight around such an instance. [sent-43, score-0.259]

10 It is not hard to verify (see (Palmer and Goldberg, 2005)) that when we have good estimates of the class label distributions (in a sense described below in Section 1. [sent-62, score-0.167]

11 We say that a set C of functions from X to {0, 1} (the “concept class”) is PAC learnable if there exists an algorithm A that for any t ∈ C , with probability at least 1 − δ outputs h : X −→ {0, 1} having misclassification rate at most ε. [sent-73, score-0.245]

12 Thus Prx∼Dℓ (x) = (1/pℓ) Prx∼D (x) for t(x) = ℓ, and Prx∼Dℓ (x) = 0 for t(x) = 1 − ℓ. For any probability distribution P over X (for example D or Dℓ), an algorithm with access to P may in unit time draw an unlabeled sample from P. [sent-81, score-0.252]

13 (The two-button version conceals the class priors and only gives the algorithm access to the distribution as restricted to each class label. [sent-84, score-0.18]

14 For x ∈ X let h(x) = 1 if f1 (x) > f0 (x), h(x) = 0 if f1 (x) < f0 (x), and h(x) undefined if f1 (x) = f0 (x). If A takes time polynomial in ε−1 and δ−1 , and h is PAC with respect to ε and δ, then we will say that A PAC-learns via discriminant functions. [sent-90, score-0.269]
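The rule in sentence 14 transcribes directly into code; the sketch below is a literal reading of that definition (the example discriminant functions are made up):

```python
# Combine two discriminant functions f1, f0 into the classifier h of the
# definition: h(x) = 1 if f1(x) > f0(x), 0 if f1(x) < f0(x), undefined on ties.
def combine(f1, f0):
    def h(x):
        a, b = f1(x), f0(x)
        if a > b:
            return 1
        if a < b:
            return 0
        return None  # h(x) undefined when f1(x) = f0(x)
    return h

h = combine(lambda x: x, lambda x: 1 - x)
print(h(0.9), h(0.1), h(0.5))  # 1 0 None
```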

15 It is clear that if A can be found such that the resulting h is PAC, then we have PAC learnability in the two-button setting, and hence standard PAC learnability. [sent-91, score-0.267]

16 This is with a view to getting strong positive results, and also to maximizing the potential for negative results (PAC learnable problems that are hard within the restricted setting). [sent-93, score-0.244]

17 One could devise less severe restrictions to capture the general idea of learning via discriminant functions. [sent-94, score-0.292]

18 We also consider the following slightly less severe variant related to POSEX learnability as introduced in (Denis, 1998), in which A has access to D (in (Denis, 1998) the “EX” oracle), in addition to D1 (in (Denis, 1998) the “POS” oracle). [sent-96, score-0.415]

19 This is formalized as Definition 2, and it turns out that we can be much more specific about learnability in this sense, namely it is intermediate between PAC learnability with uniform misclassification noise and basic PAC learnability. [sent-97, score-0.597]

20 For x ∈ X let h(x) = 1 if f1 (x) > f0 (x), h(x) = 0 if f1 (x) < f0 (x), and h(x) undefined if f1 (x) = f0 (x). If A takes time polynomial in ε−1 and δ−1 , and h is PAC with respect to ε and δ, then we will say that A PAC-learns via discriminant functions with access to D. [sent-101, score-0.364]

21 286 S OME D ISCRIMINANT-BASED PAC A LGORITHMS POSEX learnability requires that f1 be a 0/1-valued function that constitutes a PAC hypothesis. [sent-102, score-0.267]

22 Let us conclude this section by considering alternative restrictions to PAC learnability that capture the informal notion of learning via discriminant functions. [sent-108, score-0.506]

23 , 1991) showed that various alternative definitions of PAC learnability are equivalent in this sense. [sent-119, score-0.267]

24 Examples of distinctions that have been found between different restrictions to the framework include learning with a restricted focus of attention (Ben-David and Dichterman, 1998, 1994), which is shown to be more severe than learnability in the presence of uniform misclassification noise. [sent-120, score-0.395]

25 Learnability with Statistical Queries (Kearns, 1998) is also known to be at least as severe as learnability with uniform misclassification noise. [sent-121, score-0.351]

26 Perhaps the most important result of this kind is the equivalence between PAC learnability and “weak” PAC learnability found by (Schapire, 1990) which led to the development of boosting techniques. [sent-122, score-0.534]

27 The paper (Blum, 1994) exhibits a concept class that distinguishes PAC learnability from mistake-bound learning, and that is of interest here since we use the same concept class (in Section 3.3) [sent-123, score-0.479]

28 to show that our restriction of PAC learnability is likewise distinct from mistake-bound learnability. [sent-124, score-0.312]

29 Specifically we have: Observation 1 If a concept class and its complement are POSEX learnable, then they are learnable under Definition 2. [sent-126, score-0.327]

30 In Section 2 we show that if a concept class is learnable in the presence of noise, then it and its complement are POSEX learnable, and hence by the above observation, learnable under Definition 2. [sent-137, score-0.537]

31 Rather, it is to construct a discriminant function in such a way that we expect it to work well in conjunction with the corresponding discriminant function constructed on data with the opposite class label. [sent-150, score-0.42]

32 We show that PAC learnability with uniform misclassification noise implies PAC learnability with discriminant functions and access to D. [sent-154, score-0.887]

33 In Section 3.2 we distinguish the setting from learnability from positive data only (or negative data only) by studying the class of unions of intervals on the real line. [sent-161, score-0.4]

34 Learning under this sort of assumption is in a sense intermediate between learning with access to D (Definition 2) and learning via discriminant functions in the sense of Definition 1. [sent-169, score-0.29]

35 Given the class prior pℓ (the probability that a random instance has label ℓ) we can generate random samples from a fixed distribution that is a mixture of Dℓ and D, such that they have uniform misclassification rate, which allows Aη to be used. [sent-178, score-0.188]

36 Proposition 3 Let Aη be an algorithm with parameter η, 0 ≤ η < 1/2, that has access to labeled data, where elements of X are distributed according to D, with a uniform label noise rate of η. [sent-184, score-0.284]
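Proposition 3 assumes an oracle producing labeled examples with uniform label noise. As an illustrative sketch (names and the example target are hypothetical, not from the paper), such an oracle can be simulated by flipping each correct label independently with probability η:

```python
import random

# Sketch of a uniformly noisy example oracle: each label is flipped
# independently with probability eta, irrespective of the instance.
def noisy_oracle(draw_x, target, eta, rng=random.Random(0)):
    def example():
        x = draw_x(rng)
        y = target(x)
        if rng.random() < eta:
            y = 1 - y  # uniform misclassification noise
        return x, y
    return example

ex = noisy_oracle(lambda r: r.random(), lambda x: int(x > 0.5), eta=0.2)
sample = [ex() for _ in range(10000)]
err = sum(y != int(x > 0.5) for x, y in sample) / len(sample)
print(round(err, 1))  # 0.2
```

The empirical flip rate concentrates around η, which is the property the reduction in the proof relies on.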

37 Proof We may assume that the concept class C is closed under complementation, since if C is learnable with misclassification noise then its closure under complementation is also learnable under misclassification noise. [sent-192, score-0.595]

38 For m = 1, . . . , N do: (a) Let cm be a “fair coin” random variable; cm = 0 or 1 with probability 1/2; let ℓm be a 0/1 random variable, ℓm = 1 with probability pℓ /(2pℓ + 1). [sent-201, score-0.427]

39 (b) If cm = 1, sample x from D and let (x, ℓm ) ∈ S. [sent-202, score-0.224]

40 (c) If cm = 0, sample x from Dℓ and let (x, ℓ) ∈ S. [sent-203, score-0.224]

41 Note that Pr( j = 0 | t(x) = 1) = Pr( j = 0 | t(x) = 1 ∧ cm = 0) Pr(cm = 0 | t(x) = 1) + Pr( j = 0 | t(x) = 1 ∧ cm = 1) Pr(cm = 1 | t(x) = 1). [sent-211, score-0.332]

42 Pr( j = 0 | t(x) = 1 ∧ cm = 0) = 0, since if cm = 0 then Step (1c) assigns label 1. [sent-212, score-0.45]

43 Hence Pr( j = 0 | t(x) = 1) = Pr( j = 0 | t(x) = 1 ∧ cm = 1) Pr(cm = 1 | t(x) = 1). [sent-213, score-0.166]

44 When cm = 1, we have j = ℓm where ℓm = 1 with probability p1 /(2p1 + 1), so Pr( j = 0 | t(x) = 1 ∧ cm = 1) = 1 − p1 /(2p1 + 1) = (p1 + 1)/(2p1 + 1). (2) [sent-214, score-0.367]

45 Pr(cm = 1 | t(x) = 1) = Pr(cm = 1) Pr(t(x) = 1 | cm = 1) / Pr(t(x) = 1) = ((1/2) p1 )/((1 + p1 )/2) = p1 /(1 + p1 ). (3) [sent-215, score-0.166]

46 Now consider the case that t(x) = 0 (where (x, j) is the labeled example constructed on the m-th iteration of A1 ); note that Pr( j = 1 | t(x) = 0) = Pr( j = 1 | t(x) = 0 ∧ cm = 0) Pr(cm = 0 | t(x) = 0) + Pr( j = 1 | t(x) = 0 ∧ cm = 1) Pr(cm = 1 | t(x) = 0). [sent-218, score-0.366]

47 When cm = 1 we have j = ℓm where ℓm = 1 with probability p1 /(2p1 + 1), so Pr( j = 1 | t(x) = 0 ∧ cm = 1) = p1 /(2p1 + 1). (5) [sent-222, score-0.367]

48 (c) If t(x) = 1, then if rm < ((p1 + 1)/(2p1 + 1)) · (p1 /(1 + p1 )) label x incorrectly, else label x correctly. [sent-230, score-0.237]

49 (d) If t(x) = 0, then if rm < p1 /(2p1 + 1) label x incorrectly, else label x correctly. [sent-231, score-0.237]

50 (c) If rm < p1 /(2p1 + 1) label x incorrectly, else label x correctly. [sent-239, score-0.237]

51 For ℓ ∈ {0, 1}, the Algorithm of Figure 2, given access to Dℓ and D, with probability at least 1 − δ outputs (in polynomial time) fℓ with Prx∼D ( fℓ (x) ≠ t(x)) ≤ ε for ℓ = 1, and Prx∼D ( fℓ (x) ≠ 1 − t(x)) ≤ ε for ℓ = 0. [sent-265, score-0.179]

52 As a consequence we have learnability in the sense of Definition 2, since when we derive classifier h from f1 and f0 , the error of h is at most the sum of the errors of f1 and f0 . [sent-267, score-0.267]

53 Learning via Discriminant Functions without Access to D We exhibit algorithms that show that learnability in the sense of Definition 1 is distinct from various well-known restrictions of PAC learnability. [sent-313, score-0.311]

54 1 Parity Functions The following result distinguishes our learning setting from learnability with uniform misclassification noise, or learnability with a restricted focus of attention. [sent-323, score-0.591]

55 Theorem 5 The class of parity functions is PAC learnable via discriminant functions. [sent-327, score-0.575]
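The extract does not reproduce the discriminant construction behind Theorem 5. The standard ingredient for learning a parity function from labeled examples, Gaussian elimination over GF(2), can be sketched as follows (an illustrative aside, not the paper's algorithm):

```python
# Solve a linear system over GF(2): find s with <row, s> mod 2 == rhs[i]
# for each example row. A consistent solution recovers a parity function.
def solve_gf2(rows, rhs):
    n = len(rows[0])
    rows = [r[:] for r in rows]  # work on copies
    rhs = rhs[:]
    pivot_cols = []
    r = 0
    for c in range(n):
        piv = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        rhs[r], rhs[piv] = rhs[piv], rhs[r]
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
                rhs[i] ^= rhs[r]
        pivot_cols.append(c)
        r += 1
    s = [0] * n
    for i, c in enumerate(pivot_cols):
        s[c] = rhs[i]
    return s

# Hidden parity on bits {0, 2} of 4-bit strings:
target = lambda x: (x[0] + x[2]) % 2
xs = [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1], [1,1,1,1]]
s = solve_gf2(xs, [target(x) for x in xs])
print(s)  # [1, 0, 1, 0]
```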

56 We show that the class of all such functions is learnable by discriminant functions in time polynomial in ε−1 , δ−1 and k. [sent-343, score-0.484]

57 Learnability via discriminant functions is thus distinct from learnability from positive examples only, or from negative examples only. [sent-345, score-0.496]

58 Theorem 6 The class of unions of k intervals on the real line is learnable via discriminant functions. [sent-346, score-0.504]

59 Proof Construct discriminant functions f0 and f1 as follows. [sent-347, score-0.195]

60 Given an (unlabeled) sample, and a point x ∈ IR, our discriminant function maps x to the negation of its distance to its nearest neighbor in the sample. [sent-348, score-0.22]
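Sentences 59-60 describe the Theorem 6 construction directly: each discriminant function maps x to the negated distance to its nearest neighbor in the corresponding sample. A minimal sketch (the sample values are made up):

```python
# Nearest-neighbor discriminant from the proof of Theorem 6:
# f(x) = -(distance from x to its nearest neighbor in the sample).
def nn_discriminant(sample):
    def f(x):
        return -min(abs(x - s) for s in sample)
    return f

# Target: union of intervals [0,1] ∪ [3,4]; one sample per label class.
f1 = nn_discriminant([0.2, 0.7, 3.1, 3.8])   # positive examples
f0 = nn_discriminant([1.5, 2.0, 2.6, 4.9])   # negative examples
h = lambda x: 1 if f1(x) > f0(x) else 0
print(h(0.5), h(2.2), h(3.5))  # 1 0 1
```

Each function sees only one label class, yet the pair combines into a sensible classifier, which is the point of the definition.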

61 3 Distinguishing the Model from the Mistake-bound Setting In (Blum, 1994), Blum exhibits a concept class that is PAC learnable, but is not (in polynomial time) learnable using membership and equivalence queries, assuming that one-way functions exist. [sent-373, score-0.352]

62 In this section we show that the concept class is PAC learnable via discriminant functions in the sense of Definition 1. [sent-374, score-0.498]

63 For a bit string x let LSB[x] denote the rightmost bit of x. [sent-384, score-0.225]

64 For bit string s of length k, G(s) is a bit string of length 2k, and we define the following notation. [sent-386, score-0.238]

65 • cs is the indicator function of {xs^i : i ∈ {0, 1}^k and LSB[Gi1 ···ik (s)] = 1}, where for i = i1 · · · ik , xs^i = i ◦ Gi1 (s) ◦ Gi2 (Gi1 (s)) ◦ Gi3 (Gi1 i2 (s)) ◦ · · · [sent-398, score-0.275]

66 We use k = √n − 1 so that the length of xs^i is always less than n, and we can then “pad it out” to a length of exactly n using the string 0w on the right-hand side. [sent-403, score-0.283]

67 Note that for any fixed s, a bit string of length n of the form xs^i is determined entirely by i, its first k bits. [sent-404, score-0.34]

68 We will let the index of a bit string of length n refer to its first k bits, viewed as a binary number (to give the natural ordering on indices). [sent-405, score-0.173]

69 Algorithm Compute-Forward (Figure 3) shows how to take any positive example xs^i, together with an index j > i, and construct the pair (xs^j, cs(xs^j)) in polynomial time. [sent-410, score-0.545]

70 Let zi be the correctly labeled example (xs^i, cs(xs^i)). [sent-412, score-0.28]

71 Let zi be the incorrectly labeled example (xs^i, 1 − cs(xs^i)). [sent-414, score-0.305]

72 From (Blum, 1994) we know that C is not learnable (in time polynomial in n) in the mistake-bound model. [sent-424, score-0.259]

73 We noted that Algorithm Compute-Forward, given a positive example xs^i and j > i, produces a correctly-labeled example (xs^j, cs(xs^j)). [sent-426, score-0.467]

74 Theorem 9 The concept class of (Blum, 1994) is learnable via discriminant functions. [sent-428, score-0.498]

75 Proof We use the algorithm of Figure 4 to construct discriminant functions. [sent-429, score-0.195]

76 Specifically, A1 and A0 memorize certain examples. Algorithm Compute-Forward (Blum, 1994): on input xs^i and j > i, 1.

77 Extract from xs the portions: u = Gi1 (s) ◦ Gi2 (Gi1 (s)) ◦ Gi3 (Gi1 i2 (s)) ◦ . [sent-439, score-0.221]

78 In particular, A1 (and possibly also A0 ) just retains xs^m and xs^M, since for any unlabeled x, the label assigned to it is computed (in polynomial time) using xs^m and xs^M. [sent-451, score-0.531]

79 (In the case of A0 , the sample S0 may fail the test Consistency-Check, in which case no examples are memorized. [sent-452, score-0.395]

80 Suppose for a contradiction that A0 gives a value of 1 to positive example xs^j. [sent-457, score-0.221]

81 Then A0 must have in its collection an unlabeled example xs^M, and it must predict that xs^j is positive. [sent-458, score-0.752]

82 A0 ensures that f0 (x) ≥ 1/2 for all x ∈ S0 . [sent-460, score-0.221]

83 Call these elements xs^m and xs^M respectively; since Consistency-Check has been passed, they are unique. [sent-468, score-0.442]

84 Otherwise, if (xs^j, 1) = Compute-Forward(xs^m, j) and furthermore (xs^M, 1) = Compute-Forward(xs^j, M), then let f(xs^j) = 1. [sent-473, score-0.467]

85 Figure 4: Assigning values to unlabeled data for the concept class of (Blum, 1994). [sent-480, score-0.182]

86 Theorem 10 Linear separators in the plane are learnable via discriminant functions. [sent-488, score-0.494]

87 Figure 5: Assigning values to unlabeled data for linear separators in the plane. Proof Figure 5 shows the algorithm we use to construct discriminant functions; it is not hard to check that the steps can be carried out in polynomial time. [sent-505, score-0.422]
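The details of Figure 5 are not included in this extract. Purely as an illustrative sketch in a similar spirit (an assumption, not the paper's algorithm), one per-class discriminant for planar data scores a point by the negated distance to the convex hull of that class's sample:

```python
# Illustrative convex-hull discriminant for points in the plane (a sketch,
# not the construction of Figure 5). Each class contributes one function.
def cross(o, a, b):
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def convex_hull(pts):
    # Andrew's monotone-chain convex hull, returned counter-clockwise.
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def seg_dist(p, a, b):
    # Euclidean distance from point p to segment ab.
    ax, ay = b[0]-a[0], b[1]-a[1]
    t = ((p[0]-a[0])*ax + (p[1]-a[1])*ay) / max(ax*ax + ay*ay, 1e-12)
    t = max(0.0, min(1.0, t))
    cx, cy = a[0] + t*ax, a[1] + t*ay
    return ((p[0]-cx)**2 + (p[1]-cy)**2) ** 0.5

def hull_discriminant(sample):
    hull = convex_hull(sample)
    def f(p):
        if all(cross(hull[i], hull[(i+1) % len(hull)], p) >= 0
               for i in range(len(hull))):
            return 0.0   # inside (or on) the hull
        return -min(seg_dist(p, hull[i], hull[(i+1) % len(hull)])
                    for i in range(len(hull)))
    return f

f1 = hull_discriminant([(0, 0), (1, 0), (0, 1), (1, 1)])  # class-1 sample
f0 = hull_discriminant([(3, 0), (4, 0), (3, 1), (4, 1)])  # class-0 sample
h = lambda p: 1 if f1(p) > f0(p) else 0
print(h((0.5, 0.5)), h((3.5, 0.5)))  # 1 0
```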

88 Recall that a monomial is a boolean function consisting of the conjunction of a set of literals (where a literal is either a boolean attribute or its negation). [sent-546, score-0.21]

89 Despite the simplicity of this class of functions, we have not resolved its learnability under the restriction of Definition 1, even for monotone (negation-free) monomials. [sent-547, score-0.342]

90 In view of the importance of the concept class of monomials, we consider whether they are learnable given that the input distribution D belongs to a given class of probability distributions. [sent-553, score-0.368]

91 This situation is intermediate between knowing D exactly (in which case by Theorem 4 the problem would be solved since monomials are learnable in the presence of uniform misclassification noise (Angluin and Laird, 1988)) and the distribution-independent setting. [sent-554, score-0.346]

92 Figure 7: Algorithm for learning monomials. Theorem 11 Monomials over the boolean domain are learnable via discriminant functions, provided that the input distribution D is known to be a product distribution. [sent-561, score-0.546]

93 Proof We show that the algorithm given in Figure 7 constructs discriminant functions which, when combined to get h according to Definition 1, ensure that h is PAC. [sent-566, score-0.195]

94 For j ∈ I , ψ1^j(x) = ψ0^j(x) (for a product distribution D, the value of an irrelevant attribute is selected independently of the label class of a bit string). [sent-608, score-0.227]
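The Figure 7 algorithm itself is not reproduced in this extract. For contrast, the classic positive-examples-only rule for monotone (negation-free) monomials can be sketched as below; this is a standard textbook algorithm, not the paper's product-distribution method, and the example data are made up:

```python
# Classic consistency rule for monotone monomials, learned from positive
# examples only: keep exactly the attributes that are 1 in every positive.
def learn_monotone_monomial(positives, n):
    relevant = [j for j in range(n) if all(x[j] == 1 for x in positives)]
    def h(x):
        return int(all(x[j] == 1 for j in relevant))
    return h, relevant

# Target monomial: x0 AND x2, over 4 boolean attributes.
pos = [(1, 0, 1, 0), (1, 1, 1, 0), (1, 0, 1, 1)]
h, relevant = learn_monotone_monomial(pos, 4)
print(relevant)                     # [0, 2]
print(h((1, 1, 1, 1)), h((0, 1, 1, 1)))  # 1 0
```

Like a discriminant function, this rule is built from one label class in isolation, which is why monomials are a natural test case for the restriction of Definition 1.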

95 Despite those limitations, our positive results have distinguished learnability subject to this constraint from various other constraints on PAC learnability that have been studied in the past. [sent-622, score-0.534]

96 Clearly, the main open question raised by this paper is to elucidate the relationship between learnability via discriminant functions (Definition 1), and basic PAC learnability. [sent-623, score-0.462]

97 We have a relatively good understanding of learnability subject to the slightly less severe constraint of Definition 2. [sent-625, score-0.32]

98 Namely, it is intermediate between learnability with uniform misclassification noise, and standard PAC learnability. [sent-626, score-0.298]

99 We showed (in Section 3.1) how to learn parity functions using the more severe constraint of Definition 1. [sent-628, score-0.193]

100 PAC classification based on PAC estimates of label class distributions. [sent-783, score-0.565]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('prx', 0.476), ('pac', 0.443), ('learnability', 0.267), ('xs', 0.221), ('learnable', 0.21), ('pr', 0.208), ('discriminant', 0.195), ('cm', 0.166), ('oldberg', 0.15), ('parity', 0.14), ('ome', 0.117), ('blum', 0.114), ('posex', 0.1), ('misclassi', 0.099), ('access', 0.095), ('jr', 0.095), ('label', 0.092), ('unlabeled', 0.089), ('lgorithms', 0.075), ('monomials', 0.073), ('boolean', 0.068), ('ir', 0.065), ('blumer', 0.064), ('concept', 0.063), ('string', 0.062), ('bit', 0.057), ('separators', 0.053), ('severe', 0.053), ('ordinary', 0.053), ('ap', 0.052), ('complementation', 0.05), ('dnn', 0.05), ('lsb', 0.05), ('kearns', 0.05), ('hull', 0.05), ('polynomial', 0.049), ('attribute', 0.048), ('ch', 0.046), ('distributions', 0.045), ('restriction', 0.045), ('restrictions', 0.044), ('polygon', 0.043), ('denis', 0.043), ('gf', 0.043), ('unions', 0.038), ('cryan', 0.038), ('csb', 0.038), ('gid', 0.038), ('letouzey', 0.038), ('warwick', 0.038), ('hypothesis', 0.036), ('haussler', 0.036), ('plane', 0.036), ('probability', 0.035), ('negative', 0.034), ('labeled', 0.034), ('sample', 0.033), ('unde', 0.032), ('intersects', 0.032), ('strings', 0.032), ('noise', 0.032), ('goldberg', 0.032), ('irn', 0.032), ('intervals', 0.031), ('uniform', 0.031), ('classi', 0.03), ('class', 0.03), ('nition', 0.03), ('index', 0.029), ('ik', 0.029), ('mansour', 0.029), ('generator', 0.029), ('allwein', 0.029), ('rm', 0.028), ('suppose', 0.028), ('bits', 0.027), ('union', 0.027), ('boundary', 0.026), ('monomial', 0.026), ('occam', 0.026), ('distinguishes', 0.026), ('id', 0.026), ('assigns', 0.026), ('eliminated', 0.026), ('cs', 0.025), ('frieze', 0.025), ('gir', 0.025), ('guruswami', 0.025), ('negation', 0.025), ('palmer', 0.025), ('let', 0.025), ('priors', 0.025), ('oracle', 0.025), ('sn', 0.025), ('incorrectly', 0.025), ('multiclass', 0.025), ('da', 0.024), ('rightmost', 0.024), ('complement', 0.024), ('regions', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

Author: Paul W. Goldberg

Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification

2 0.2084886 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities

Author: Adam R. Klivans, Rocco A. Servedio

Abstract: We consider two well-studied problems regarding attribute efficient learning: learning decision lists and learning parity functions. First, we give an algorithm for learning decision lists of length ˜ 1/3 ˜ 1/3 k over n variables using 2O(k ) log n examples and time nO(k ) . This is the first algorithm for learning decision lists that has both subexponential sample complexity and subexponential running time in the relevant parameters. Our approach establishes a relationship between attribute efficient learning and polynomial threshold functions and is based on a new construction of low degree, low weight polynomial threshold functions for decision lists. For a wide range of parameters our construction matches a lower bound due to Beigel for decision lists and gives an essentially optimal tradeoff between polynomial threshold function degree and weight. Second, we give an algorithm for learning an unknown parity function on k out of n variables using O(n1−1/k ) examples in poly(n) time. For k = o(log n) this yields a polynomial time algorithm with sample complexity o(n); this is the first polynomial time algorithm for learning parity on a superconstant number of variables with sublinear sample complexity. We also give a simple algorithm for learning an unknown length-k parity using O(k log n) examples in nk/2 time, which improves on the naive nk time bound of exhaustive search. Keywords: PAC learning, attribute efficiency, learning parity, decision lists, Winnow

3 0.10617184 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability

4 0.076871224 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L 2 1 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n − 2 ) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2 . Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

5 0.068949819 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete-time, fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation 1. 
Introduction and Statement of the Problem We consider an optimal control problem with continuous state (xt ∈ IRd )t≥0 whose state dynamics is defined according to the controlled differential equation: dxt = f (xt , ut ), dt (1) where the control (ut )t≥0 is a Lebesgue measurable function with values in a control space U. Note that the state-dynamics f may also depend on time, but we omit this dependency in the notation, for simplicity. We intend to maximize a functional J that depends on the trajectory (xt )0≤t≤T over a finite-time horizon T > 0. For simplicity, in the paper, we illustrate the case of a terminal reward c 2006 Rémi Munos. M UNOS only: J(x; (ut )t≥0 ) := r(xT ), (2) where r : IRd → IR is the reward function. Extension to the case of general functional of the kind J(x; (ut )t≥0 ) = Z T 0 r(t, xt )dt + R(xT ), (3) with r and R being current and terminal reward functions, would easily follow, as indicated in Remark 1. The optimal control problem of finding a control (ut )t≥0 that maximizes the functional is replaced by a parametric optimization problem for which we search for a good feed-back control law in a given class of parameterized policies {πα : [0, T ] × IRd → U}α , where α ∈ IRm is the parameter. The control ut ∈ U (or action) at time t is ut = πα (t, xt ), and we may write the dynamics of the resulting feed-back system as dxt = fα (xt ), (4) dt where fα (xt ) := f (x, πα (t, x)). In the paper, we will make the assumption that fα is C 2 , with bounded derivatives. Let us define the performance measure V (α) := J(x; πα (t, xt )t≥0 ), where its dependency with respect to (w.r.t.) the parameter α is emphasized. One may also consider an average performance measure according to some distribution µ for the initial state: V (α) := E[J(x; πα (t, xt )t≥0 )|x ∼ µ]. 
In order to find a local maximum of V (α), one may perform a local search, such as a gradient ascent method α ← α + η∇αV (α), (5) with an adequate step η (see for example (Polyak, 1987; Kushner and Yin, 1997)). The computation of the gradient ∇αV (α) is the object of this paper. A first method would be to approximate the gradient by a finite-difference quotient for each of the m components of the parameter: V (α + εei ) −V (α) , ε for some small value of ε (we use the notation ∂α instead of ∇α to indicate that it is a singledimensional derivative). This finite-difference method requires the simulation of m + 1 trajectories to compute an approximation of the true gradient. When the number of parameters is large, this may be computationally expensive. However, this simple method may be efficient if the number of parameters is relatively small. In the rest of the paper we will not consider this approach, and will aim at computing the gradient using one trajectory only. ∂αi V (α) ≃ 772 P OLICY G RADIENT IN C ONTINUOUS T IME Pathwise estimation of the gradient. We now illustrate that if the decision-maker has access to a model of the state dynamics, then a pathwise derivation would directly lead to the policy gradient. Indeed, let us define the gradient of the state with respect to the parameter: zt := ∇α xt (i.e. zt is defined as a d × m-matrix whose (i, j)-component is the derivative of the ith component of xt w.r.t. α j ). Our smoothness assumption on fα allows to differentiate the state dynamics (4) w.r.t. α, which provides the dynamics on (zt ): dzt = ∇α fα (xt ) + ∇x fα (xt )zt , dt (6) where the coefficients ∇α fα and ∇x fα are, respectively, the derivatives of f w.r.t. the parameter (matrix of size d × m) and the state (matrix of size d × d). The initial condition for z is z0 = 0. When the reward function r is smooth (i.e. continuously differentiable), one may apply a pathwise differentiation to derive a gradient formula (see e.g. 
(Bensoussan, 1988), or (Yang and Kushner, 1991) for an extension to the stochastic case):

∇_α V(α) = ∇_x r(x_T) z_T.   (7)

Remark 1 In the more general setting of a functional (3), the gradient is deduced (by linearity) from the above formula:

∇_α V(α) = ∫_0^T ∇_x r(t, x_t) z_t dt + ∇_x R(x_T) z_T.

What is known to the agent? The decision maker (call it the agent) who intends to design a good controller for the dynamical system may or may not know a model of the state dynamics f. If the dynamics is known, the state gradient z_t = ∇_α x_t may be computed from (6) along the trajectory, and the gradient of the performance measure w.r.t. the parameter α is deduced at time T from (7), which allows one to perform the gradient ascent step (5). However, in this paper we consider a Reinforcement Learning (Sutton and Barto, 1998) setting in which the state dynamics is unknown to the agent, but we still assume that the state is fully observable. The agent knows only the response of the system to its control. To be more precise, the information available to the agent at time t is its own control policy π_α and the trajectory (x_s)_{0≤s≤t} up to time t. At time T, the agent receives the reward r(x_T) and, in this paper, we assume that the gradient ∇r(x_T) is available to the agent. From this point of view, it seems impossible to derive the state gradient z_t from (6), since ∇_α f and ∇_x f are unknown. The term ∇_x f(x_t) may be approximated by a least-squares method from the observation of past states (x_s)_{s≤t}, as will be explained in Section 3.2. However, the term ∇_α f(x_t) cannot be calculated analogously. In this paper, we introduce the idea of using stochastic policies to approximate the state (x_t) and the state gradient (z_t) by discrete-time stochastic processes (X_t^∆) and (Z_t^∆) (with ∆ being some discretization time-step). We show how Z_t^∆ can be computed without knowledge of ∇_α f, using only information available to the agent.
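Equations (6) and (7) can be checked numerically on a toy scalar system (our own example, not from the paper: dx/dt = −αx with terminal reward r(x) = x, so the exact gradient is −T e^{−αT}):

```python
import math

def pathwise_gradient(alpha, x0=1.0, T=1.0, dt=1e-3):
    """Jointly Euler-integrate the state x and its sensitivity z = dx/dalpha.
    For f(x) = -alpha * x, Eq. (6) reads dz/dt = grad_alpha f + grad_x f * z
    with grad_alpha f = -x, grad_x f = -alpha, and z_0 = 0."""
    x, z = x0, 0.0
    for _ in range(int(round(T / dt))):
        dx = -alpha * x
        dz = -x + (-alpha) * z
        x, z = x + dt * dx, z + dt * dz
    # Terminal reward r(x) = x, so grad_x r(x_T) = 1 and Eq. (7) gives:
    return 1.0 * z

g = pathwise_gradient(0.5)  # close to -exp(-0.5), the exact value -T*e^{-alpha*T}
```

This is the model-based case: computing dz requires grad_alpha f and grad_x f, which is precisely what the agent lacks in the reinforcement-learning setting described above.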
We prove the convergence (with probability one) of the gradient estimate ∇_x r(X_T^∆) Z_T^∆, derived from the stochastic processes, to ∇_α V(α) when ∆ → 0. Here, almost sure convergence is obtained using the concentration-of-measure phenomenon (Talagrand, 1996; Ledoux, 2001).

Figure 1: A trajectory (X_{t_n}^∆)_{0≤n≤N} and the state-dynamics vector f_α of the continuous process (x_t)_{0≤t≤T}.

Likelihood ratio method? It is worth mentioning that this strong convergence result contrasts with the usual likelihood-ratio method (also called the score method) in discrete time (see, e.g., Reiman and Weiss, 1986; Glynn, 1987; or, more recently in the reinforcement-learning literature, Williams, 1992; Sutton et al., 2000; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003), for which the policy gradient estimate is subject to variance explosion when the discretization time-step ∆ tends to 0. The intuitive reason for this problem is that the number of decisions before getting the reward grows to infinity when ∆ → 0 (the variance of likelihood-ratio estimates being usually linear in the number of decisions). Let us illustrate this problem on a simple two-dimensional process. Consider the deterministic continuous process (x_t)_{0≤t≤1} defined by the state dynamics:

dx_t/dt = f_α := (α, 1−α)′,   (8)

(0 < α < 1) with initial condition x_0 = (0, 0)′ (where ′ denotes the transpose operator). The performance measure V(α) is the reward at the terminal state at time T = 1, with the reward function being the first coordinate of the state, r((x, y)′) := x. Thus V(α) = r(x_{T=1}) = α and its derivative is ∇_α V(α) = 1.
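A short Monte-Carlo sketch (ours, not from the paper) illustrates the variance explosion on this example: discretize with N = 1/∆ Bernoulli(α) moves and form the standard likelihood-ratio (score-function) gradient estimate r(X_N) Σ_n ∇_α log π_α(U_n). Its mean is the true gradient 1, but its variance grows linearly with N:

```python
import numpy as np

rng = np.random.default_rng(0)

def lr_gradient_samples(alpha, N, n_runs=20000):
    """Likelihood-ratio estimates of dV/dalpha for the 2-D example:
    N Bernoulli(alpha) moves of length Delta = 1/N; the reward is the
    first coordinate of the terminal state, so V(alpha) = alpha."""
    U = rng.random((n_runs, N)) < alpha            # True = move right
    reward = U.sum(axis=1) / N                     # x-coordinate of X_N
    # d/dalpha log pi_alpha(U_n) = 1/alpha if U_n else -1/(1 - alpha)
    score = (U / alpha - (~U) / (1.0 - alpha)).sum(axis=1)
    return reward * score                          # unbiased: mean is 1

est_N10 = lr_gradient_samples(0.5, N=10)
est_N1000 = lr_gradient_samples(0.5, N=1000)      # far larger variance
```

For α = 0.5 the variance of a single estimate is roughly N, so refining the discretization makes the estimator dramatically noisier even though it stays unbiased.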
Let (X_{t_n}^∆)_{0≤n≤N} ∈ IR² be a discrete-time stochastic process (the discrete times being {t_n = n∆}_{n=0…N} with discretization time-step ∆ = 1/N) that starts from the initial state X_0^∆ = x_0 = (0, 0)′ and makes N random moves of length ∆ towards the right (action u1) or the top (action u2) (see Figure 1), according to the stochastic policy (i.e., the probability of choosing the actions in each state x) π_α(u1|x) = α, π_α(u2|x) = 1 − α. The process is thus defined according to the dynamics:

X_{t_{n+1}}^∆ = X_{t_n}^∆ + (U_n, 1 − U_n)′ ∆,   (9)

where (U_n)_{0≤n<N} are independent Bernoulli random variables with P(U_n = 1) = α (U_n = 1 for a move to the right, U_n = 0 for a move to the top).

[…]

For all N > 0, there exists a constant C that does not depend on N such that d_n ≤ C/N. Thus we may take D² = C²/N. Now, from the previous paragraph, ||E[X_N] − x_N|| ≤ e(N), with e(N) → 0 when N → ∞. This means that ||h − E[h]|| + e(N) ≥ ||X_N − x_N||, thus P(||h − E[h]|| ≥ ε + e(N)) ≥ P(||X_N − x_N|| ≥ ε), and we deduce from (31) that

P(||X_N − x_N|| ≥ ε) ≤ 2 e^{−N(ε+e(N))²/(2C²)}.

Thus, for all ε > 0, the series ∑_{N≥0} P(||X_N − x_N|| ≥ ε) converges. Now, from the Borel-Cantelli lemma, we deduce that for all ε > 0, there exists N_ε such that for all N ≥ N_ε, ||X_N − x_N|| < ε, which proves the almost sure convergence of X_N to x_N as N → ∞ (i.e., X_T^∆ → x_T almost surely as ∆ → 0).

Appendix C. Proof of Proposition 8

First, note that Q_t is a symmetric, non-negative matrix, since it may be rewritten as

(1/n_t) ∑_{s∈S(t)} (X_{s+} − X̄)(X_{s+} − X̄)′

(bars denote empirical averages over the sample set S(t)). In solving the least-squares problem (21), we deduce b = ∆X̄ − A X̄ ∆, thus

min_{A,b} (1/n_t) ∑_{s∈S(t)} ||∆X_s − b − A(X_s + ½∆X_s)∆||²
  = min_A (1/n_t) ∑_{s∈S(t)} ||∆X_s − ∆X̄ − A(X_{s+} − X̄)∆||²
  ≤ (1/n_t) ∑_{s∈S(t)} ||∆X_s − ∆X̄ − ∇_x f(X̄, u_t)(X_{s+} − X̄)∆||².   (32)

Now, since X_s = X̄ + O(∆), one may obtain, as in (19) and (20) (by replacing X_t with X̄), that

∆X_s − ∆X̄ − ∇_x f(X̄, u_t)(X_{s+} − X̄)∆ = O(∆³).   (33)

We deduce from (32) and (33) that

(1/n_t) ∑_{s∈S(t)} ||[∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t)](X_{s+} − X̄)∆||² = O(∆⁶),

where ∇̂_x f denotes the least-squares estimate of ∇_x f. By developing each component,

∑_{i=1}^d [∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t)]_{row i} Q_t [∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t)]′_{row i} = O(∆⁴).
Now, from the definition of ν(∆), for all vectors u ∈ IR^d, u′ Q_t u ≥ ν(∆) ||u||², thus

ν(∆) ||∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t)||² = O(∆⁴).

Condition (23) yields ∇̂_x f(X_t, u_t) = ∇_x f(X̄, u_t) + o(1), and since ∇_x f(X_t, u_t) = ∇_x f(X̄, u_t) + O(∆), we deduce

lim_{∆→0} ∇̂_x f(X_t, u_t) = ∇_x f(X_t, u_t).

References

J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.

A. Bensoussan. Perturbation Methods in Optimal Control. Wiley/Gauthier-Villars Series in Modern Applied Mathematics. John Wiley & Sons Ltd., Chichester, 1988. Translated from the French by C. Tomson.

A. Bogdanov. Optimal control of a double inverted pendulum on a cart. Technical Report CSE-04-006, CSEE, OGI School of Science and Engineering, OHSU, 2004.

P. W. Glynn. Likelihood ratio gradient estimation: an overview. In A. Thesen, H. Grant, and W. D. Kelton, editors, Proceedings of the 1987 Winter Simulation Conference, pages 366–375, 1987.

E. Gobet and R. Munos. Sensitivity analysis using Itô-Malliavin calculus and martingales. Application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005.

G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd edition. Johns Hopkins University Press, Baltimore, MD, 1996.

R. E. Kalman, P. L. Falb, and M. A. Arbib. Topics in Mathematical System Theory. McGraw-Hill, New York, 1969.

P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer-Verlag, 1995.

H. J. Kushner and G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, Berlin and New York, 1997.

S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006.

M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, Providence, RI, 2001.

P. Marbach and J. N. Tsitsiklis. Approximate gradient methods in policy-space optimization of Markov reward processes.
Discrete Event Dynamic Systems, 13:111–148, 2003.

B. T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987.

M. I. Reiman and A. Weiss. Sensitivity analysis via likelihood ratios. In J. Wilson, J. Henriksen, and S. Roberts, editors, Proceedings of the 1986 Winter Simulation Conference, pages 285–289, 1986.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.

M. Talagrand. A new look at independence. Annals of Probability, 24:1–34, 1996.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

J. Yang and H. J. Kushner. A Monte Carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM Journal on Control and Optimization, 29(5):1216–1249, 1991.

6 0.05755328 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

7 0.057135481 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

8 0.046377562 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

9 0.043942444 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

10 0.043587558 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

11 0.041578505 12 jmlr-2006-Active Learning with Feedback on Features and Instances

12 0.039834451 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

13 0.039609041 44 jmlr-2006-Large Scale Transductive SVMs

14 0.038844042 45 jmlr-2006-Learning Coordinate Covariances via Gradients

15 0.037037376 53 jmlr-2006-Learning a Hidden Hypergraph

16 0.032568384 68 jmlr-2006-On the Complexity of Learning Lexicographic Strategies

17 0.031516086 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis

18 0.031071465 25 jmlr-2006-Distance Patterns in Structural Similarity

19 0.030942826 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

20 0.0295191 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.199), (1, 0.003), (2, -0.048), (3, -0.117), (4, -0.071), (5, 0.056), (6, 0.102), (7, -0.054), (8, -0.228), (9, 0.043), (10, -0.081), (11, 0.077), (12, 0.231), (13, 0.039), (14, -0.53), (15, 0.015), (16, -0.109), (17, 0.165), (18, 0.001), (19, -0.019), (20, 0.084), (21, -0.073), (22, -0.12), (23, -0.024), (24, 0.166), (25, 0.087), (26, -0.006), (27, 0.105), (28, -0.015), (29, 0.065), (30, -0.028), (31, -0.023), (32, 0.032), (33, 0.027), (34, 0.101), (35, 0.008), (36, -0.082), (37, 0.07), (38, -0.013), (39, -0.022), (40, -0.048), (41, -0.02), (42, -0.053), (43, -0.022), (44, -0.078), (45, -0.087), (46, 0.017), (47, 0.095), (48, 0.007), (49, -0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95327353 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

Author: Paul W. Goldberg

Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification

2 0.78238469 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities

Author: Adam R. Klivans, Rocco A. Servedio

Abstract: We consider two well-studied problems regarding attribute-efficient learning: learning decision lists and learning parity functions. First, we give an algorithm for learning decision lists of length k over n variables using 2^{Õ(k^{1/3})} log n examples and time n^{Õ(k^{1/3})}. This is the first algorithm for learning decision lists that has both subexponential sample complexity and subexponential running time in the relevant parameters. Our approach establishes a relationship between attribute-efficient learning and polynomial threshold functions and is based on a new construction of low-degree, low-weight polynomial threshold functions for decision lists. For a wide range of parameters our construction matches a lower bound due to Beigel for decision lists and gives an essentially optimal tradeoff between polynomial threshold function degree and weight. Second, we give an algorithm for learning an unknown parity function on k out of n variables using O(n^{1−1/k}) examples in poly(n) time. For k = o(log n) this yields a polynomial-time algorithm with sample complexity o(n); this is the first polynomial-time algorithm for learning parity on a superconstant number of variables with sublinear sample complexity. We also give a simple algorithm for learning an unknown length-k parity using O(k log n) examples in n^{k/2} time, which improves on the naive n^k time bound of exhaustive search. Keywords: PAC learning, attribute efficiency, learning parity, decision lists, Winnow

3 0.35222793 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L₂-diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^{−1/2}) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlós-Major-Tusnády type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer at a rate strictly faster than n^{−1/2}. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

4 0.30324239 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability

5 0.21217205 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

Author: Tonatiuh Peña Centeno, Neil D. Lawrence

Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.

6 0.17956465 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

7 0.16127452 75 jmlr-2006-Policy Gradient in Continuous Time

8 0.1596323 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

9 0.15599194 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

10 0.15098658 44 jmlr-2006-Large Scale Transductive SVMs

11 0.14975199 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)

12 0.14802304 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

13 0.14375305 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

14 0.14309673 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

15 0.1430846 48 jmlr-2006-Learning Minimum Volume Sets

16 0.1427834 53 jmlr-2006-Learning a Hidden Hypergraph

17 0.1390966 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

18 0.13896616 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines

19 0.13631974 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

20 0.1360185 12 jmlr-2006-Active Learning with Feedback on Features and Instances


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.018), (35, 0.01), (36, 0.596), (45, 0.014), (50, 0.044), (63, 0.038), (66, 0.019), (68, 0.012), (76, 0.013), (78, 0.012), (81, 0.028), (84, 0.011), (90, 0.027), (91, 0.022), (96, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98997921 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving     (Special Topic on Inductive Programming)

Author: Pat Langley, Dongkyu Choi

Abstract: In this paper, we propose a new representation for physical control – teleoreactive logic programs – along with an interpreter that uses them to achieve goals. In addition, we present a new learning method that acquires recursive forms of these structures from traces of successful problem solving. We report experiments in three different domains that demonstrate the generality of this approach. In closing, we review related work on learning complex skills and discuss directions for future research on this topic. Keywords: teleoreactive control, logic programs, problem solving, skill learning

2 0.98963428 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets

Author: Janez Demšar

Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams. Keywords: comparative studies, statistical methods, Wilcoxon signed ranks test, Friedman test, multiple comparisons tests

same-paper 3 0.9731006 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

Author: Paul W. Goldberg

Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification

4 0.67089283 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

Author: Magnus Ekdahl, Timo Koski

Abstract: In many pattern recognition/classification problem the true class conditional model and class probabilities are approximated for reasons of reducing complexity and/or of statistical estimation. The approximated classifier is expected to have worse performance, here measured by the probability of correct classification. We present an analysis valid in general, and easily computable formulas for estimating the degradation in probability of correct classification when compared to the optimal classifier. An example of an approximation is the Na¨ve Bayes classifier. We show that the perforı mance of the Na¨ve Bayes depends on the degree of functional dependence between the features ı and labels. We provide a sufficient condition for zero loss of performance, too. Keywords: Bayesian networks, na¨ve Bayes, plug-in classifier, Kolmogorov distance of variation, ı variational learning

5 0.64637411 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification

Author: Ting Liu, Andrew W. Moore, Alexander Gray

Abstract: This paper is about non-approximate acceleration of high-dimensional nonparametric operations such as k nearest neighbor classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the data points close to the query, but merely need to answer questions about the properties of that set of data points. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. This is applicable to many algorithms in nonparametric statistics, memory-based learning and kernel-based learning. But for clarity, this paper concentrates on pure k-NN classification. We introduce new ball-tree algorithms that on real-world data sets give accelerations from 2-fold to 100-fold compared against highly optimized traditional ball-tree-based k-NN . These results include data sets with up to 106 dimensions and 105 records, and demonstrate non-trivial speed-ups while giving exact answers. keywords: ball-tree, k-NN classification

6 0.63392973 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

7 0.63279969 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis

8 0.62689799 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting     (Special Topic on Inductive Programming)

9 0.62675285 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities

10 0.6165809 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

11 0.6118201 27 jmlr-2006-Ensemble Pruning Via Semi-definite Programming     (Special Topic on Machine Learning and Optimization)

12 0.60943884 47 jmlr-2006-Learning Image Components for Object Recognition

13 0.60781336 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

14 0.60553259 53 jmlr-2006-Learning a Hidden Hypergraph

15 0.59839654 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation

16 0.5939039 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

17 0.5900113 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

18 0.57557946 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

19 0.57353961 62 jmlr-2006-MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals

20 0.57251507 49 jmlr-2006-Learning Parts-Based Representations of Data