jmlr jmlr2006 jmlr2006-73 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniil Ryabko
Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over under a weaker assumption: namely, conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label is above some positive threshold. We find a broad class of learning algorithms for which estimates of the probability of classification error achieved under the classical i.i.d. assumption can be generalized to similar estimates for the case of conditionally i.i.d. examples.
Reference: text
sentIndex sentText sentNum sentScore
1 Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). [sent-6, score-0.139]
2 Namely, the results hold under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label is above some positive threshold. [sent-10, score-0.18]
3 A predictor is learning to classify the objects, based only on examples of labelled objects. [sent-21, score-0.182]
4 The task is to construct the best predictor for the labels, based on the data observed. [sent-32, score-0.167]
5 A predictor is constructed based on the first set and then is used to classify the objects from the second. [sent-37, score-0.2]
6 In the online setting a predictor starts by classifying the first object with zero knowledge; then it is given the correct label, learns from it, and proceeds to the next object, and so on. [sent-38, score-0.184]
7 Traditionally (e.g., in Vapnik, 1998) a more general situation is considered: the labels are drawn according to some probability distribution P(y|x). [sent-41, score-0.081]
8 The labels y ∈ Y are drawn according to some unknown (but fixed) distribution over the set of all infinite sequences of labels. [sent-78, score-0.081]
9 For each label y the corresponding object x ∈ X is generated according to some (unknown but fixed) probability distribution P(x|y). [sent-81, score-0.102]
10 The difference from the i.i.d. model is that in the conditional model the distribution of labels is made primal; having done that, we can relax the requirement of independence of objects to conditional independence. [sent-89, score-0.191]
11 Moreover, we provide a tool for obtaining estimates of the probability of error of a predictor in the conditional model from estimates of the probability of error in the i.i.d. model. [sent-103, score-0.264]
12 The general theorems about extending results on the performance of a predictor to the conditional model are illustrated with two classes of predictors. [sent-107, score-0.188]
13 First, we extend weak consistency results concerning partitioning and nearest neighbour estimates from the i.i.d. model to the conditional model. [sent-108, score-0.214]
14 Second, we use some results of Vapnik-Chervonenkis theory to estimate the performance in the conditional model (on a finite amount of data) of predictors minimizing empirical risk, and also obtain some strong consistency results. [sent-112, score-0.12]
15 The only assumption on a predictor under which it works in the conditional model as well as in the i.i.d. model is, in effect, tolerance to data. [sent-114, score-0.188]
16 This assumption on a predictor should be valid in the i.i.d. model. [sent-119, score-0.149]
17 In other words, we expect the distribution of symptoms to be determined only by the fact that the patient is ill or that (s)he is not. [sent-144, score-0.088]
18 Note, however, that there are certain types of dependencies between sets of symptoms which can violate our condition, for example, if a family comes for diagnostics together. [sent-145, score-0.1]
19 Thus, pattern recognition methods are used in conjunction with sequence prediction algorithms, and here our results can be considered a further theoretical justification of the use of the pattern recognition component. [sent-157, score-0.094]
20 There are also several approaches in which different types of assumptions on the joint distribution of objects and labels are made; the authors then construct a predictor, or a class of predictors, designed to work well under the assumptions made. [sent-166, score-0.262]
21 Each example zi := (xi, yi) consists of an object xi ∈ X and a label yi := η(xi) ∈ Y, where X is a measurable space called the object space, Y := {0, 1} is called the label space, and η : X → Y is some deterministic function. [sent-186, score-0.209]
22 Note that the first condition means that objects are conditionally independent given labels (on conditional independence see Dawid, 1979). [sent-216, score-0.184]
23 Under the conditions (1) and (2) we say that objects are conditionally independent and identically distributed (conditionally i. [sent-217, score-0.101]
24 Assume that we have some individual sequence (yn)n∈N of labels and two probability distributions P0 and P1 on X, such that there exist sets X0 and X1 in X with P1(X1) = P0(X0) = 1 and P0(X1) = P1(X0) = 0 (i.e., P0 and P1 have disjoint supports). [sent-226, score-0.086]
25 Each example xn ∈ X is drawn according to the distribution Pyn ; examples are drawn independently of each other. [sent-229, score-0.142]
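To make the data model concrete, here is a minimal simulation sketch (in Python) of the conditionally i.i.d. setting just described: an arbitrary individual label sequence, and objects drawn independently from P0 or P1 according to the label. The particular distributions and the label sequence are illustrative assumptions, not taken from the paper; they only need to have disjoint supports, as required above.

```python
import numpy as np

def sample_conditionally_iid(labels, rng):
    """Draw one object per label: x_n ~ P_{y_n}, independently across n.

    P0 and P1 are illustrative choices with disjoint supports
    (negative vs. positive half-line), matching P0(X1) = P1(X0) = 0.
    """
    xs = []
    for y in labels:
        if y == 0:
            xs.append(-rng.exponential(1.0))  # P0: supported on (-inf, 0]
        else:
            xs.append(rng.exponential(1.0))   # P1: supported on [0, +inf)
    return np.array(xs)

rng = np.random.default_rng(0)
# An arbitrary individual label sequence (not i.i.d.; note the long runs),
# with each label occurring at a rate bounded away from zero.
labels = np.array([0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1] * 10)
objects = sample_conditionally_iid(labels, rng)
```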
26 A predictor is a measurable function Γn := Γ(x1, y1, . . . , xn, yn, xn+1) taking values in Y. [sent-230, score-0.186]
27 More formally, a predictor is a family of such functions indexed by n. [sent-233, score-0.185]
28 The probability of error of a predictor Γ on each step n is defined as errn(Γ, P, z1, . . . , zn) := P{x ∈ X : Γ(z1, . . . , zn, x) ≠ η(x)}. [sent-234, score-0.719]
29 Here zi, 1 ≤ i ≤ n, are fixed and the probability is taken over zn+1. [sent-240, score-0.216]
30 We will sometimes omit some of the arguments of errn when it can cause no confusion; in particular, we will often use the short notation P(errn(Γ, Z1, . . . , Zn) > ε). [sent-241, score-0.551]
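Since errn is a probability over the next example only (the training sample z1, . . . , zn is held fixed), it can be approximated by Monte Carlo whenever fresh objects can be sampled. The sketch below is an illustrative, assumption-laden example rather than code from the paper: `predictor` stands for an already-trained rule Γ(z1, . . . , zn, ·), `eta` for the deterministic labelling function η, and `sample_object` for a sampler from the object distribution.

```python
import numpy as np

def estimate_err_n(predictor, eta, sample_object, n_test=10_000, rng=None):
    """Monte Carlo estimate of err_n: the probability, over a fresh object x,
    that the (already trained) predictor disagrees with the true label eta(x)."""
    rng = rng or np.random.default_rng()
    xs = sample_object(n_test, rng)              # fresh objects, e.g. x ~ P_y
    mistakes = sum(int(predictor(x) != eta(x)) for x in xs)
    return mistakes / n_test
```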
31 For a pair of distributions P0 and P1 and any δ ∈ (0, 1/2) define ▽δ(P0, P1, n, ε) := sup_{p∈[δ,1−δ]} Pp^∞(errn(Γ) > ε), (3) that is, we consider the supremum of the probability of error over all class label probabilities. [sent-251, score-0.127]
32 For a predictor Γ and a distribution P on X define ∆(P, n, z1, . . . , zn), the tolerance of Γ to changes in the data. [sent-252, score-0.167]
33 Tolerance to data means, in effect, that in any typical large portion of the data there is no small portion whose change strongly affects the probability of error. [sent-273, score-0.083]
34 The same notational convention will be applied to ∆ and ∆̄ as to errn. [sent-301, score-0.606]
35 Naturally, such notions arise when there is a need to study the behaviour of a predictor when some of the training examples are removed. [sent-306, score-0.149]
36 A predictor developed to work in the off-line setting should be, loosely speaking, tolerant to small changes in a training sample. [sent-308, score-0.149]
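As a rough illustration of this notion (not the paper's formal quantity ∆, whose exact definition involves a supremum over replacements of at most κn examples), the sketch below probes how much a learning rule's estimated error can change when a few training examples are swapped for other examples consistent with the same labelling; the helper `train_and_err` and the random-search strategy are assumptions made for the example.

```python
import numpy as np

def tolerance_probe(train_and_err, xs, ys, pool_xs, pool_ys, k, n_trials=200, rng=None):
    """Crude probe of tolerance to data: largest observed change in the error
    estimate when at most k training examples are replaced by examples drawn
    from a pool labelled by the same deterministic rule.

    train_and_err(xs, ys) -> float: trains a predictor on (xs, ys) and returns
    its (estimated) probability of error; supplied by the caller.
    """
    rng = rng or np.random.default_rng()
    base = train_and_err(xs, ys)
    worst, n = 0.0, len(ys)
    for _ in range(n_trials):
        xs2, ys2 = np.array(xs, copy=True), np.array(ys, copy=True)
        idx = rng.choice(n, size=k, replace=False)             # positions to replace
        sub = rng.choice(len(pool_ys), size=k, replace=False)  # replacement examples
        xs2[idx], ys2[idx] = pool_xs[sub], pool_ys[sub]
        worst = max(worst, abs(train_and_err(xs2, ys2) - base))
    return worst
```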
37 The next theorem shows under which conditions this property of a predictor can be utilized. [sent-309, score-0.149]
38 For any predictor Γ and any ε > 0 we have P(errn(Γ) > ε) ≤ Cn αn^{−1} ▽δ(P0, P1, n + κn, δε/2) + ∆δ(P0, P1, n + κn, δε/2) + (1 − Cn), (6) and P(errn(Γ) > ε) ≤ Cn αn^{−1} ▽δ(P0, P1, n, δε/2) + ∆̄δ(P0, P1, n, δε/2) + (1 − Cn). (7) [sent-318, score-0.149]
39 The theorem says that if we know with some confidence Cn that the rate of occurrence of each label is not less than some (small) δ, and have some bounds on the error rate and tolerance to data of a predictor in the i.i.d. model, then we can bound the probability of error of the predictor in the conditional model. [sent-319, score-0.266]
40 First we fix some individual sample of n labels (without objects) and consider the behaviour of the predictor conditional on this sample. [sent-325, score-0.232]
41 Fixing the labels allows us to pass from the conditional i.i.d. model to the i.i.d. case. [sent-326, score-0.083]
42 Then, using tolerance to data, we compare the behaviour of the predictor on any two different samples that are both typical for a certain i.i.d. distribution. [sent-333, score-0.094]
43 Thus we have a tool for estimating the performance of a predictor on each finite step n. [sent-341, score-0.149]
44 We say that the rates of occurrence of labels are bounded from below if there exists δ, 0 < δ < 1/2, such that lim_{n→∞} P(p(n) ∈ [δ, 1 − δ]) = 1. [sent-346, score-0.095]
45 Let Γ be a predictor such that lim_{n→∞} ▽δ(P0, P1, n, ε) = 0 (9) and either lim_{n→∞} ∆δ(P0, P1, n, ε) = 0 (10) or lim_{n→∞} ∆̄δ(P0, P1, n, ε) = 0 (11) for any ε > 0. [sent-349, score-0.209]
46 In Section 3 we show how this statement can be applied to prove weak consistency of some classical nonparametric predictors in the conditional model. [sent-354, score-0.123]
47 Application to Classical Nonparametric Predictors. In this section we will consider two types of classical nonparametric predictors: partitioning and nearest neighbour classifiers. [sent-356, score-0.217]
48 The nearest neighbour predictor assigns to a new object xn+1 the label of its nearest neighbour among x1, . . . , xn. [sent-357, score-0.518]
49 Formally, Γ(x1, y1, . . . , xn, yn, xn+1) := yj, where j := argmin_{i=1,...,n} ||xi − xn+1||. [sent-363, score-0.185]
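For concreteness, a minimal sketch of this one-nearest-neighbour rule, assuming the objects are vectors in R^d and that the distance is Euclidean (an illustrative choice):

```python
import numpy as np

def nearest_neighbour_predict(xs, ys, x_new):
    """1-NN rule: return y_j, where j = argmin_i ||x_i - x_new||.

    xs: observed objects x_1..x_n (any array reshapeable to (n, d));
    ys: their labels y_1..y_n.
    """
    xs = np.asarray(xs, dtype=float).reshape(len(ys), -1)
    x_new = np.asarray(x_new, dtype=float).reshape(1, -1)
    j = int(np.argmin(np.linalg.norm(xs - x_new, axis=1)))
    return int(ys[j])
```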
50 Theorem 3 Let Γ be the nearest neighbour classifier. [sent-374, score-0.152]
51 A partitioning predictor on each step n partitions the object space X = R^d, d ∈ N, into disjoint cells A^1_n, A^2_n, . . . [sent-378, score-0.247]
52 It then takes a majority vote within the cell of the new object: Γ(z1, . . . , zn, x) := 0 if Σ_{i=1}^n I_{yi=1} I_{xi∈A(x)} ≤ Σ_{i=1}^n I_{yi=0} I_{xi∈A(x)}, and 1 otherwise, where A(x) denotes the cell containing x. [sent-384, score-0.215]
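A minimal sketch of such a rule with cubic cells of side h (the cell shape and the value of h are illustrative assumptions); ties go to label 0, exactly as in the displayed definition:

```python
import numpy as np

def partition_cell(x, h):
    """Index of the cubic cell of side h containing the point x in R^d."""
    return tuple(np.floor(np.atleast_1d(np.asarray(x, dtype=float)) / h).astype(int))

def partitioning_predict(xs, ys, x_new, h=1.0):
    """Majority vote over the cell A(x_new): 0 if ones <= zeros, else 1."""
    cell = partition_cell(x_new, h)
    ones = zeros = 0
    for x, y in zip(xs, ys):
        if partition_cell(x, h) == cell:
            if y == 1:
                ones += 1
            else:
                zeros += 1
    return 0 if ones <= zeros else 1
```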
53 It is known (see, e.g., Devroye, Györfi, and Lugosi, 1996) that a partitioning predictor is weakly consistent, provided certain regularity conditions on the size of the cells hold. [sent-388, score-0.188]
54 More precisely, let Γ be a partitioning predictor such that diam(A(X)) → 0 in probability and N(X) → ∞ in probability. [sent-389, score-0.207]
55 We generalise this result to the case of conditionally i.i.d. examples. [sent-391, score-0.089]
56 Theorem 4 Let Γ be a partitioning predictor such that diam(A(X)) → 0 in probability and N(X) → ∞ in probability, for any distribution generating i.i.d. examples. [sent-395, score-0.225]
57 Observe that we only generalise results concerning weak consistency of (one) nearest neighbour and non-data-dependent partitioning rules. [sent-400, score-0.253]
58 However, we do not aim to generalise state-of-the-art results in nonparametric classification, but rather to illustrate that weak consistency results can be extended to the conditional model. [sent-406, score-0.127]
59 Application to Empirical Risk Minimisation. In this section we show how to estimate the performance of a predictor minimising empirical risk (over a certain class of functions) using Theorem 1. [sent-408, score-0.174]
60 In the theory of empirical risk minimisation this function is approximated by the function ϕ*_n := argmin_{ϕ∈C} errn(ϕ), where errn(ϕ) := Σ_{i=1}^n I_{ϕ(Xi)≠Yi} is the empirical error functional, based on a sample (Xi, Yi), i = 1, . . . , n. [sent-417, score-1.119]
61 Thus Γ(z1, . . . , zn, xn+1) := ϕ*_n(xn+1) is a predictor minimising empirical risk over the class of functions C. [sent-424, score-0.371]
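A minimal sketch of empirical risk minimisation over a finite class C of decision functions; the class of one-dimensional threshold rules used below is purely an illustrative choice of C, not the class studied in the paper:

```python
import numpy as np

def erm_predict(xs, ys, x_new, candidate_class):
    """Empirical risk minimisation: pick phi* in C minimising
    err_n(phi) = sum_i I{phi(x_i) != y_i}, then output phi*(x_new)."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    best_phi, best_err = None, None
    for phi in candidate_class:
        err = int(np.sum(phi(xs) != ys))
        if best_err is None or err < best_err:
            best_phi, best_err = phi, err
    return int(best_phi(np.atleast_1d(np.asarray(x_new)))[0])

# Illustrative class C: threshold rules on the real line.
thresholds = np.linspace(-3.0, 3.0, 61)
C = [lambda x, t=t: (x > t).astype(int) for t in thresholds]
# Example usage: label = erm_predict(objects, labels, 0.7, C)
```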
62 Theorem 5 Let C be a class of decision functions and let Γ be a predictor which for each n ∈ N minimises errn over C on the observed examples (z1, . . . , zn). [sent-442, score-0.733]
63 Thus, if we have bounds on the VC dimension of some class of classifiers, we can obtain bounds on the performance of predictors minimising empirical error for the conditional model. [sent-451, score-0.122]
64 Here Γ is a predictor which on each trial n minimises empirical risk over C_{k_n}, and P is any distribution satisfying (1), (2) and Σ_{n=1}^∞ (1 − Cn) < ∞. [sent-459, score-0.218]
65 Naturally, a question arises whether our conditions on the distributions and on the predictors are necessary, or whether they can be further generalised in the same direction. [sent-473, score-0.098]
66 There exist a distribution (with Cn = 1 for any δ ∈ (0, 1/2) and n > 1/(1/2 − δ)) and a predictor Γ such that Pp(errn > 0) ≤ 2^{1−n} for any p ∈ [δ, 1 − δ], while P(errn = 1) = 1 for n > 1. [sent-479, score-0.149]
67 We define the predictor Γ as follows: Γn := 1 − xn if |#{i < n : yi = 0} − n/2| ≤ 1, and Γn := xn otherwise. [sent-493, score-0.418]
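For concreteness, the counterexample predictor just defined can be written out directly; the sketch assumes, as the construction suggests, that the objects themselves are 0/1-valued and that n is taken to be the index of the object being classified:

```python
def counterexample_predictor(ys, x_new):
    """Gamma_n := 1 - x_n if |#{i < n : y_i = 0} - n/2| <= 1, else x_n."""
    n = len(ys) + 1                       # index of the object being classified
    zeros = sum(1 for y in ys if y == 0)
    return 1 - x_new if abs(zeros - n / 2) <= 1 else x_new
```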
68 In particular, the assumption (8) might appear redundant: if the rate of occurrence of some label tends to zero, can we just ignore this label without affecting the asymptotic results? [sent-500, score-0.091]
69 Remark 9 There exists a distribution P on X^∞ which satisfies (1) and (2) but for which the nearest neighbour predictor is not consistent, i.e., errn(Γ) does not tend to zero. [sent-502, score-0.319]
70 The nearest neighbour predictor is consistent for any i.i.d. distribution. [sent-509, score-0.325]
71 The quantities in (16) and (17) are defined as suprema over label distributions: in particular, ∆δ(P0, P1, n, ε) := sup_q ∆(Pq, n, ε), (17) where the supremums are with respect to all distributions q such that min_{a∈Y} q(a) ≥ δ. [sent-524, score-0.097]
72 The sequence κn is such that n|pn − p| ≤ κn from some n on, almost surely, for any p ∈ (0, 1) and any probability distribution P on X such that P(y = 1) = p, where pn is the empirical frequency of the corresponding label among Y1, . . . , Yn. [sent-530, score-0.144]
73 err^1_n(Γ, Z1, . . . , Zn) := P(Γ(Z1, . . . , Zn, Xn+1) ≠ Yn+1 | Yn+1 = 1), with the same notational convention as used with the definition of errn(Γ). [sent-549, score-0.748]
74 In words, for each y ∈ Y = {0, 1} we define err^y_n as the probability of the set of all x ∈ X such that Γ makes an error on the n'th trial, given that Yn+1 = y and fixed z1, . . . , zn. [sent-550, score-0.255]
75 For a sequence of labels y = (y1, . . . , yn, . . . ) define pn(y) := (1/n) #{i ≤ n : yi = 0}, for n > 1. [sent-560, score-0.169]
76 That is, having fixed the labels, we consider the probability over the objects only. [sent-590, score-0.255]
77 Fix some n > 1, some y ∈ Y, and a sequence y^1 ∈ Y^∞ such that δ ≤ pn(y^1) ≤ 1 − δ and P((Y1, . . . [sent-592, score-0.087]
78 (Displayed identity expressing the probability of error given the fixed label sequence y^1 as the i.i.d. probability Pp(err^y_n(Γ, X1, y^1_1, . . . , Xn, y^1_n, Xn+1) > ε).) [sent-607, score-0.236]
79 Clearly, for δ ≤ p ≤ 1 − δ we have errn(Γ, Pp) ≤ max_{y∈Y} err^y_n(Γ, Pp), and if errn(Γ, Pp) < ε then err^y_n(Γ, Pp) < ε/δ for each y ∈ Y. [sent-614, score-1.338]
80-85 (Displayed chain of inequalities from the proof of Theorem 1: err^y_n(Γ) under Pp on a sample with label sequence y^1 is compared with err^y_n(Γ) on a ghost sample with label sequence y^2; a union bound splits the difference into two terms at level ε/2, the first bounded via the maximum change over replacements of at most κn examples (tolerance to data) and the second via a bound on n + κn examples, yielding (18) and (19).) [sent-633, sent-642, sent-648, sent-659, sent-691, sent-697]
86 Finally, as y^1 was chosen arbitrarily among sequences y ∈ Y^∞ such that δ ≤ pn(y^1) ≤ 1 − δ, from (18) and (19) we obtain (6) and (7). [sent-727, score-0.106]
87 We show that the tolerance term ∆̄(P, n, Z1, . . . , Zn) tends to 0 for the nearest neighbour and partitioning predictors, and apply Corollary 2. [sent-736, score-0.191]
88 Omitting z1, . . . , zn from the notation, we have errn(Γ, Pp) ≤ err^0_n(Γ, Pp) + err^1_n(Γ, Pp) and ∆̄(Pp, n) ≤ ∆̄^0(Pp, n) + ∆̄^1(Pp, n). [sent-793, score-0.748]
89 So, E^∞(err^1_n(Γ, Pp)) ≤ E^∞(err^1_n(Γ, Pδ/2) | pn = δ/2) + Pp(pn ≤ δ/2), (22) where as usual pn := (1/n) #{i ≤ n : yi = 1}. [sent-797, score-0.108]
90 Moreover, it is easy to see that E^∞(err^1_n(Γ, Pδ/2) | npn = n(δ/2)) ≤ E^∞(err^1_n(Γ, Pδ/2) | |n(δ/2) − npn| ≤ κn/2) + E^∞(∆̄^1(Pδ/2, n)), which in turn is bounded in terms of E^∞(err^1_n(Γ, Pδ/2)) and E^∞(∆̄^1(Pδ/2, n)). [sent-799, score-0.087]
91 Again, we observe that D will not decrease if an arbitrary (possibly large) portion of training examples labelled with ones is replaced with an arbitrary (but consistent with η) portion of the same size of examples labelled with zeros. [sent-830, score-0.154]
92 . . . (z1, . . . , zn), with κn in the definition replaced by 2κn. [sent-837, score-0.197]
93 With x1, . . . , xn implicit, define Bn(x) := Pp{t ∈ X : t and x have the same nearest neighbour among x1, . . . , xn}. [sent-875, score-0.276]
94 Let Bn := E(Bn(X)), and note that E^∞(Bn) = 1/n, where the expectation is taken over X1, . . . , Xn. [sent-878, score-0.124]
95 The number of points among x1, . . . , xn for which a given point x0 is the nearest neighbour is not greater than a constant γ which depends only on the space X (see Devroye, Györfi, and Lugosi, 1996, Chapter 11). [sent-924, score-0.276]
96 For any measurable sets B ⊂ X^n and A ⊂ X define D(B, A) := E^∞ sup_{j≤κn; π:{1,. . . [sent-940, score-0.092]
97 Define (as a function of x1, . . . , xn) the union of all cells An for which E(|ηn(X) − η(X)| | X ∈ An) ≤ ε. [sent-990, score-0.148]
98 Clearly, |errn (ϕ× ) − errn (ϕ∗ )| ≤ κn , as κn is the maximal number of errors which can be made on the difference of the two samples. [sent-1028, score-0.551]
99 Moreover, P^n(|err(ϕ*_n) − err(ϕ×_n)| > ε) ≤ P^n(|err(ϕ*_n) − (1/n) errn(ϕ*_n)| > ε/2) + P^n(|(1/n) errn(ϕ×_n) − err(ϕ×_n)| > ε/2 − κn/n). Observe that P^n(sup_{ϕ∈C} |(1/n) errn(ϕ) − err(ϕ)| > ε) ≤ 8 S(C, n) e^{−nε²/32}, (27) see Devroye, Györfi, and Lugosi (1996, Chapter 12). [sent-1029, score-1.653]
100 So far we have proven (12) and (13); (14) and (15) can be proven analogously, only for the case η ∈ C we have P^n(sup_{ϕ∈C} |(1/n) errn(ϕ) − err(ϕ)| > ε) ≤ S(C, n) e^{−nε} instead of (27), and err(ϕPp, Pp) = 0. [sent-1034, score-0.551]
wordName wordTfidf (topN-words)
[('pp', 0.581), ('errn', 0.551), ('erry', 0.236), ('err', 0.207), ('zn', 0.197), ('ryabko', 0.167), ('predictor', 0.149), ('xn', 0.124), ('neighbour', 0.096), ('ndependent', 0.089), ('onditionally', 0.089), ('pn', 0.087), ('ecognition', 0.075), ('cn', 0.069), ('kulkarni', 0.067), ('lugosi', 0.066), ('yn', 0.061), ('devroye', 0.059), ('predictors', 0.058), ('tolerance', 0.056), ('nearest', 0.056), ('sup', 0.055), ('objects', 0.051), ('nobel', 0.05), ('conditionally', 0.05), ('fix', 0.049), ('morvai', 0.049), ('gy', 0.047), ('labels', 0.044), ('bn', 0.042), ('algoet', 0.039), ('generalise', 0.039), ('partitioning', 0.039), ('conditional', 0.039), ('mp', 0.038), ('measurable', 0.037), ('symptoms', 0.037), ('object', 0.035), ('ill', 0.033), ('minimises', 0.033), ('vazirani', 0.033), ('labelled', 0.033), ('portion', 0.032), ('occurrence', 0.031), ('label', 0.03), ('daniil', 0.03), ('erryn', 0.03), ('ixi', 0.03), ('posner', 0.03), ('py', 0.03), ('nonparametric', 0.026), ('na', 0.026), ('diam', 0.025), ('minimising', 0.025), ('blumer', 0.025), ('consistent', 0.024), ('pattern', 0.024), ('cells', 0.024), ('np', 0.024), ('dependencies', 0.024), ('distributions', 0.023), ('consistency', 0.023), ('recognition', 0.023), ('ergodic', 0.022), ('letters', 0.022), ('vapnik', 0.021), ('yi', 0.021), ('text', 0.021), ('vc', 0.02), ('lim', 0.02), ('aldous', 0.02), ('errm', 0.02), ('gamarnik', 0.02), ('ghost', 0.02), ('ncn', 0.02), ('npn', 0.02), ('yakowitz', 0.02), ('kearns', 0.02), ('estimations', 0.019), ('valiant', 0.019), ('probability', 0.019), ('xm', 0.019), ('sequences', 0.019), ('markov', 0.019), ('distribution', 0.018), ('weaker', 0.018), ('cell', 0.018), ('kn', 0.018), ('task', 0.018), ('haussler', 0.017), ('mpm', 0.017), ('mina', 0.017), ('overviews', 0.017), ('iyi', 0.017), ('zeger', 0.017), ('generalised', 0.017), ('irrational', 0.017), ('minimisation', 0.017), ('reserved', 0.017), ('universal', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
Author: Daniil Ryabko
Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over under a weaker assumption: namely, conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label is above some positive threshold. We find a broad class of learning algorithms for which estimates of the probability of classification error achieved under the classical i.i.d. assumption can be generalized to similar estimates for the case of conditionally i.i.d. examples.
2 0.23247567 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller
Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the "non-Gaussian subspace" within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low dimensional non-Gaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.
3 0.078990154 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
Author: Rémi Munos
Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(ρ^N) (with ρ < 1) up to a threshold that depends on the approximation error ||V − A V||, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. A V = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is computing the sensitivity ∂αV of the performance measure V with respect to some parameter α of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.
4 0.077830993 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
Author: Andrea Caponnetto, Alexander Rakhlin
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^{−1/2}) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n^{−1/2}. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes
5 0.052675381 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
Author: Ron Begleiter, Ran El-Yaniv
Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman's alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds
6 0.043631442 66 jmlr-2006-On Model Selection Consistency of Lasso
7 0.039831974 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
8 0.038032282 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
9 0.033399034 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
10 0.031822473 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
11 0.031794928 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
12 0.0295191 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
13 0.028958317 48 jmlr-2006-Learning Minimum Volume Sets
14 0.027074656 83 jmlr-2006-Sparse Boosting
15 0.025751982 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
16 0.024505718 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
17 0.023749452 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification
18 0.022316545 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
19 0.021357629 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
20 0.020579724 13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies
topicId topicWeight
[(0, 0.132), (1, -0.003), (2, -0.077), (3, -0.164), (4, -0.077), (5, 0.146), (6, -0.168), (7, 0.041), (8, -0.087), (9, 0.462), (10, 0.058), (11, -0.015), (12, 0.137), (13, -0.283), (14, 0.028), (15, -0.06), (16, 0.008), (17, -0.22), (18, 0.21), (19, -0.188), (20, 0.093), (21, 0.039), (22, 0.07), (23, 0.049), (24, 0.05), (25, -0.033), (26, -0.092), (27, -0.028), (28, 0.007), (29, 0.037), (30, -0.005), (31, 0.017), (32, 0.004), (33, -0.036), (34, -0.137), (35, -0.007), (36, -0.05), (37, 0.072), (38, -0.017), (39, -0.02), (40, 0.058), (41, 0.015), (42, 0.013), (43, 0.003), (44, 0.048), (45, -0.003), (46, 0.005), (47, 0.004), (48, -0.007), (49, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.95772529 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
Author: Daniil Ryabko
Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over under a weaker assumption: namely, conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label is above some positive threshold. We find a broad class of learning algorithms for which estimates of the probability of classification error achieved under the classical i.i.d. assumption can be generalized to similar estimates for the case of conditionally i.i.d. examples.
2 0.79518807 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller
Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the "non-Gaussian subspace" within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low dimensional non-Gaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.
3 0.29357976 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
Author: Andrea Caponnetto, Alexander Rakhlin
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^{−1/2}) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n^{−1/2}. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes
4 0.22891325 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
Author: Ron Begleiter, Ran El-Yaniv
Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman's alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds
5 0.21040756 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
Author: Rémi Munos
Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(ρ^N) (with ρ < 1) up to a threshold that depends on the approximation error ||V − A V||, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. A V = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is computing the sensitivity ∂αV of the performance measure V with respect to some parameter α of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.
6 0.19076771 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
7 0.18532667 66 jmlr-2006-On Model Selection Consistency of Lasso
8 0.13827161 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
9 0.12595062 48 jmlr-2006-Learning Minimum Volume Sets
10 0.12203688 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
11 0.11960774 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
12 0.11677614 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
13 0.11231966 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
14 0.10895696 83 jmlr-2006-Sparse Boosting
16 0.10329471 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
17 0.091234788 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification
18 0.090980135 3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection
19 0.090084046 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
20 0.086115509 49 jmlr-2006-Learning Parts-Based Representations of Data
topicId topicWeight
[(8, 0.026), (36, 0.079), (42, 0.398), (45, 0.05), (50, 0.051), (61, 0.016), (63, 0.033), (78, 0.014), (81, 0.071), (84, 0.025), (90, 0.02), (91, 0.023), (96, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.70878196 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
Author: Daniil Ryabko
Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over under a weaker assumption: namely, conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label is above some positive threshold. We find a broad class of learning algorithms for which estimates of the probability of classification error achieved under the classical i.i.d. assumption can be generalized to similar estimates for the case of conditionally i.i.d. examples.
Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor
Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs
3 0.32789609 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
Author: Francis R. Bach, David Heckerman, Eric Horvitz
Abstract: Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. We consider the alternative of varying the asymmetry of the cost function used for training. We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. In addition, we present a path-following algorithm for the support vector machine (SVM) that can compute efficiently the entire ROC curve, and that has the same computational complexity as training a single classifier. Finally, we provide a theoretical analysis of the relationship between the asymmetric cost model assumed when training a classifier and the cost model assumed in applying the classifier. In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. Keywords: support vector machines, receiver operating characteristic (ROC) analysis, linear classification
4 0.32773188 66 jmlr-2006-On Model Selection Consistency of Lasso
Author: Peng Zhao, Bin Yu
Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency
5 0.32452032 48 jmlr-2006-Learning Minimum Volume Sets
Author: Clayton D. Scott, Robert D. Nowak
Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency, an oracle inequality, and rates of convergence. The proposed estimators are illustrated with histogram and decision tree set estimation rules. Keywords: minimum volume sets, anomaly detection, statistical learning theory, uniform deviation bounds, sample complexity, universal consistency
6 0.31507605 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
7 0.31183416 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
8 0.30802155 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
9 0.30761909 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
10 0.30585617 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error
11 0.30558133 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
12 0.30447575 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
13 0.29941273 53 jmlr-2006-Learning a Hidden Hypergraph
14 0.29894045 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
15 0.29808721 44 jmlr-2006-Large Scale Transductive SVMs
16 0.29781312 65 jmlr-2006-Nonparametric Quantile Estimation
17 0.29744345 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
19 0.29531044 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
20 0.29506576 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates