jmlr jmlr2006 jmlr2006-73 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniil Ryabko
Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over under a weaker assumption: namely, conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label is above some positive threshold. We find a broad class of learning algorithms for which estimates of the probability of classification error achieved under the classical i.i.d. assumption can be generalized to similar estimates for the case of conditionally i.i.d. examples.
Reference: text
sentIndex sentText sentNum sentScore
1 Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). [sent-6, score-0.139]
2 Namely, the results hold under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label is above some positive threshold. [sent-10, score-0.18]
3 A predictor is learning to classify the objects, based only on examples of labelled objects. [sent-21, score-0.182]
4 The task is to construct the best predictor for the labels, based on the data observed. [sent-32, score-0.167]
5 A predictor is constructed based on the first set and then is used to classify the objects from the second. [sent-37, score-0.2]
6 In the online setting a predictor starts by classifying the first object with zero knowledge; then it is given the correct label, learns from it, and proceeds to the next object, and so on. [sent-38, score-0.184]
7 Traditionally (e.g., in Vapnik, 1998) a more general situation is considered: the labels are drawn according to some probability distribution P(y|x). [sent-41, score-0.081]
8 The labels y ∈ Y are drawn according to some unknown (but fixed) distribution over the set of all infinite sequences of labels. [sent-78, score-0.081]
9 For each label y the corresponding object x ∈ X is generated according to some (unknown but fixed) probability distribution P(x|y). [sent-81, score-0.102]
10 The difference from the i.i.d. model is that in the conditional model the distribution of labels is made primal; having done that, we can relax the requirement of independence of objects to conditional independence. [sent-89, score-0.191]
11 Moreover, we provide a tool for obtaining estimates of the probability of error of a predictor in the conditional model from estimates of the probability of error in the i.i.d. model. [sent-103, score-0.264]
12 The general theorems about extending results on the performance of a predictor to the conditional model are illustrated with two classes of predictors. [sent-107, score-0.188]
13 First, we extend weak consistency results concerning partitioning and nearest neighbour estimates from the i.i.d. model to the conditional model. [sent-108, score-0.214]
14 Second, we use some results of Vapnik-Chervonenkis theory to estimate the performance in the conditional model (on a finite amount of data) of predictors minimizing empirical risk, and also obtain some strong consistency results. [sent-112, score-0.12]
15 The only assumption on a predictor under which it works in the conditional model as well as in the i.i.d. model is, in effect, tolerance to data. [sent-114, score-0.188]
16 This assumption on a predictor should be valid in the i.i.d. model. [sent-119, score-0.149]
17 In other words, we expect the distribution of symptoms to be determined only by the fact that the patient is ill or that (s)he is not. [sent-144, score-0.088]
18 Note, however, that there are certain types of dependencies between sets of symptoms which can violate our condition, for example, if a family comes for diagnostics together. [sent-145, score-0.1]
19 Thus, pattern recognition methods are used in conjunction with sequence prediction algorithms, and here our results can be considered a further theoretical justification of the use of the pattern recognition component. [sent-157, score-0.094]
20 There are also several approaches in which different types of assumptions on the joint distribution of objects and labels are made; the authors then construct a predictor, or a class of predictors, designed to work well under the assumptions made. [sent-166, score-0.262]
21 Each example zi := (xi, yi) consists of an object xi ∈ X and a label yi := η(xi) ∈ Y, where X is a measurable space called the object space, Y := {0, 1} is called the label space, and η : X → Y is some deterministic function. [sent-186, score-0.209]
22 Note that the first condition means that objects are conditionally independent given labels (on conditional independence see Dawid, 1979). [sent-216, score-0.184]
23 Under the conditions (1) and (2) we say that objects are conditionally independent and identically distributed (conditionally i. [sent-217, score-0.101]
24 Assume that we have some individual sequence (yn)n∈N of labels and two probability distributions P0 and P1 on X, such that there exist sets X0 and X1 in X with P1(X1) = P0(X0) = 1 and P0(X1) = P1(X0) = 0 (i.e., P0 and P1 have disjoint supports). [sent-226, score-0.086]
25 Each example xn ∈ X is drawn according to the distribution Pyn ; examples are drawn independently of each other. [sent-229, score-0.142]
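To make the data model concrete, here is a minimal simulation sketch (in Python) of the conditionally i.i.d. setting just described: an arbitrary individual label sequence, and objects drawn independently from P0 or P1 according to the label. The particular distributions and the label sequence are illustrative assumptions, not taken from the paper; they only need to have disjoint supports, as required above.

```python
import numpy as np

def sample_conditionally_iid(labels, rng):
    """Draw one object per label: x_n ~ P_{y_n}, independently across n.

    P0 and P1 are illustrative choices with disjoint supports
    (negative vs. positive half-line), matching P0(X1) = P1(X0) = 0.
    """
    xs = []
    for y in labels:
        if y == 0:
            xs.append(-rng.exponential(1.0))  # P0: supported on (-inf, 0]
        else:
            xs.append(rng.exponential(1.0))   # P1: supported on [0, +inf)
    return np.array(xs)

rng = np.random.default_rng(0)
# An arbitrary individual label sequence (not i.i.d.; note the long runs),
# with each label occurring at a rate bounded away from zero.
labels = np.array([0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1] * 10)
objects = sample_conditionally_iid(labels, rng)
```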
26 A predictor is a measurable function Γn := Γ(x1, y1, . . . , xn, yn, xn+1) taking values in Y. [sent-230, score-0.186]
27 More formally, a predictor is a family of such functions indexed by n. [sent-233, score-0.185]
28 The probability of error of a predictor Γ on each step n is defined as errn(Γ, P, z1, . . . , zn) := P{x ∈ X : Γ(z1, . . . , zn, x) ≠ η(x)}. [sent-234, score-0.719]
29 Here zi, 1 ≤ i ≤ n, are fixed and the probability is taken over zn+1. [sent-240, score-0.216]
30 We will sometimes omit some of the arguments of errn when it can cause no confusion; in particular, we will often use the short notation P(errn(Γ, Z1, . . . , Zn) > ε). [sent-241, score-0.551]
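Since errn is a probability over the next example only (the training sample z1, . . . , zn is held fixed), it can be approximated by Monte Carlo whenever fresh objects can be sampled. The sketch below is an illustrative, assumption-laden example rather than code from the paper: `predictor` stands for an already-trained rule Γ(z1, . . . , zn, ·), `eta` for the deterministic labelling function η, and `sample_object` for a sampler from the object distribution.

```python
import numpy as np

def estimate_err_n(predictor, eta, sample_object, n_test=10_000, rng=None):
    """Monte Carlo estimate of err_n: the probability, over a fresh object x,
    that the (already trained) predictor disagrees with the true label eta(x)."""
    rng = rng or np.random.default_rng()
    xs = sample_object(n_test, rng)              # fresh objects, e.g. x ~ P_y
    mistakes = sum(int(predictor(x) != eta(x)) for x in xs)
    return mistakes / n_test
```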
31 For a pair of distributions P0 and P1 and any δ ∈ (0, 1/2) define ▽δ(P0, P1, n, ε) := sup_{p∈[δ,1−δ]} Pp^∞(errn(Γ) > ε), (3) that is, we consider the supremum of the probability of error over all class label probabilities. [sent-251, score-0.127]
32 For a predictor Γ and a distribution P on X define ∆(P, n, z1, . . . , zn), the tolerance of Γ to changes in the data. [sent-252, score-0.167]
33 Tolerance to data means, in effect, that in any typical large portion of the data there is no small portion whose change strongly affects the probability of error. [sent-273, score-0.083]
34 The same notational convention will be applied to ∆ and ∆̄ as to errn. [sent-301, score-0.606]
35 Naturally, such notions arise when there is a need to study the behaviour of a predictor when some of the training examples are removed. [sent-306, score-0.149]
36 A predictor developed to work in the off-line setting should be, loosely speaking, tolerant to small changes in a training sample. [sent-308, score-0.149]
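As a rough illustration of this notion (not the paper's formal quantity ∆, whose exact definition involves a supremum over replacements of at most κn examples), the sketch below probes how much a learning rule's estimated error can change when a few training examples are swapped for other examples consistent with the same labelling; the helper `train_and_err` and the random-search strategy are assumptions made for the example.

```python
import numpy as np

def tolerance_probe(train_and_err, xs, ys, pool_xs, pool_ys, k, n_trials=200, rng=None):
    """Crude probe of tolerance to data: largest observed change in the error
    estimate when at most k training examples are replaced by examples drawn
    from a pool labelled by the same deterministic rule.

    train_and_err(xs, ys) -> float: trains a predictor on (xs, ys) and returns
    its (estimated) probability of error; supplied by the caller.
    """
    rng = rng or np.random.default_rng()
    base = train_and_err(xs, ys)
    worst, n = 0.0, len(ys)
    for _ in range(n_trials):
        xs2, ys2 = np.array(xs, copy=True), np.array(ys, copy=True)
        idx = rng.choice(n, size=k, replace=False)             # positions to replace
        sub = rng.choice(len(pool_ys), size=k, replace=False)  # replacement examples
        xs2[idx], ys2[idx] = pool_xs[sub], pool_ys[sub]
        worst = max(worst, abs(train_and_err(xs2, ys2) - base))
    return worst
```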
37 The next theorem shows under which conditions this property of a predictor can be utilized. [sent-309, score-0.149]
38 For any predictor Γ and any ε > 0 we have P(errn(Γ) > ε) ≤ Cn αn^{−1} ▽δ(P0, P1, n + κn, δε/2) + ∆δ(P0, P1, n + κn, δε/2) + (1 − Cn), (6) and P(errn(Γ) > ε) ≤ Cn αn^{−1} ▽δ(P0, P1, n, δε/2) + ∆̄δ(P0, P1, n, δε/2) + (1 − Cn). (7) [sent-318, score-0.149]
39 The theorem says that if we know with some confidence Cn that the rate of occurrence of each label is not less than some (small) δ, and have some bounds on the error rate and tolerance to data of a predictor in the i.i.d. model, then we can bound the probability of error of the predictor in the conditional model. [sent-319, score-0.266]
40 First we fix some individual sample of n labels (without objects) and consider the behaviour of the predictor conditional on this sample. [sent-325, score-0.232]
41 Fixing the labels allows us to pass from the conditional i.i.d. model to the i.i.d. case. [sent-326, score-0.083]
42 Then, using tolerance to data, we compare the behaviour of the predictor on any two different samples that are both typical for a certain i.i.d. distribution. [sent-333, score-0.094]
43 Thus we have a tool for estimating the performance of a predictor on each finite step n. [sent-341, score-0.149]
44 We say that the rates of occurrence of labels are bounded from below if there exists δ, 0 < δ < 1/2, such that lim_{n→∞} P(p(n) ∈ [δ, 1 − δ]) = 1. [sent-346, score-0.095]
45 Let Γ be a predictor such that lim_{n→∞} ▽δ(P0, P1, n, ε) = 0 (9) and either lim_{n→∞} ∆δ(P0, P1, n, ε) = 0 (10) or lim_{n→∞} ∆̄δ(P0, P1, n, ε) = 0 (11) for any ε > 0. [sent-349, score-0.209]
46 In Section 3 we show how this statement can be applied to prove weak consistency of some classical nonparametric predictors in the conditional model. [sent-354, score-0.123]
47 Application to Classical Nonparametric Predictors. In this section we will consider two types of classical nonparametric predictors: partitioning and nearest neighbour classifiers. [sent-356, score-0.217]
48 The nearest neighbour predictor assigns to a new object xn+1 the label of its nearest neighbour among x1, . . . , xn. [sent-357, score-0.518]
49 Formally, Γ(x1, y1, . . . , xn, yn, xn+1) := yj, where j := argmin_{i=1,...,n} ||xi − xn+1||. [sent-363, score-0.185]
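For concreteness, a minimal sketch of this one-nearest-neighbour rule, assuming the objects are vectors in R^d and that the distance is Euclidean (an illustrative choice):

```python
import numpy as np

def nearest_neighbour_predict(xs, ys, x_new):
    """1-NN rule: return y_j, where j = argmin_i ||x_i - x_new||.

    xs: observed objects x_1..x_n (any array reshapeable to (n, d));
    ys: their labels y_1..y_n.
    """
    xs = np.asarray(xs, dtype=float).reshape(len(ys), -1)
    x_new = np.asarray(x_new, dtype=float).reshape(1, -1)
    j = int(np.argmin(np.linalg.norm(xs - x_new, axis=1)))
    return int(ys[j])
```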
50 Theorem 3 Let Γ be the nearest neighbour classifier. [sent-374, score-0.152]
51 A partitioning predictor on each step n partitions the object space X = R^d, d ∈ N, into disjoint cells A^1_n, A^2_n, . . . [sent-378, score-0.247]
52 It then takes a majority vote within the cell of the new object: Γ(z1, . . . , zn, x) := 0 if Σ_{i=1}^n I_{yi=1} I_{xi∈A(x)} ≤ Σ_{i=1}^n I_{yi=0} I_{xi∈A(x)}, and 1 otherwise, where A(x) denotes the cell containing x. [sent-384, score-0.215]
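A minimal sketch of such a rule with cubic cells of side h (the cell shape and the value of h are illustrative assumptions); ties go to label 0, exactly as in the displayed definition:

```python
import numpy as np

def partition_cell(x, h):
    """Index of the cubic cell of side h containing the point x in R^d."""
    return tuple(np.floor(np.atleast_1d(np.asarray(x, dtype=float)) / h).astype(int))

def partitioning_predict(xs, ys, x_new, h=1.0):
    """Majority vote over the cell A(x_new): 0 if ones <= zeros, else 1."""
    cell = partition_cell(x_new, h)
    ones = zeros = 0
    for x, y in zip(xs, ys):
        if partition_cell(x, h) == cell:
            if y == 1:
                ones += 1
            else:
                zeros += 1
    return 0 if ones <= zeros else 1
```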
53 It is known (see, e.g., Devroye, Györfi, and Lugosi, 1996) that a partitioning predictor is weakly consistent, provided certain regularity conditions on the size of the cells hold. [sent-388, score-0.188]
54 More precisely, let Γ be a partitioning predictor such that diam(A(X)) → 0 in probability and N(X) → ∞ in probability. [sent-389, score-0.207]
55 We generalise this result to the case of conditionally i.i.d. examples. [sent-391, score-0.089]
56 Theorem 4 Let Γ be a partitioning predictor such that diam(A(X)) → 0 in probability and N(X) → ∞ in probability, for any distribution generating i.i.d. examples. [sent-395, score-0.225]
57 Observe that we only generalise results concerning weak consistency of (one) nearest neighbour and non-data-dependent partitioning rules. [sent-400, score-0.253]
58 However, we do not aim to generalise state-of-the-art results in nonparametric classification, but rather to illustrate that weak consistency results can be extended to the conditional model. [sent-406, score-0.127]
59 Application to Empirical Risk Minimisation. In this section we show how to estimate the performance of a predictor minimising empirical risk (over a certain class of functions) using Theorem 1. [sent-408, score-0.174]
60 In the theory of empirical risk minimisation this function is approximated by the function ϕ*_n := argmin_{ϕ∈C} errn(ϕ), where errn(ϕ) := Σ_{i=1}^n I_{ϕ(Xi)≠Yi} is the empirical error functional, based on a sample (Xi, Yi), i = 1, . . . , n. [sent-417, score-1.119]
61 Thus Γ(z1, . . . , zn, xn+1) := ϕ*_n(xn+1) is a predictor minimising empirical risk over the class of functions C. [sent-424, score-0.371]
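A minimal sketch of empirical risk minimisation over a finite class C of decision functions; the class of one-dimensional threshold rules used below is purely an illustrative choice of C, not the class studied in the paper:

```python
import numpy as np

def erm_predict(xs, ys, x_new, candidate_class):
    """Empirical risk minimisation: pick phi* in C minimising
    err_n(phi) = sum_i I{phi(x_i) != y_i}, then output phi*(x_new)."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    best_phi, best_err = None, None
    for phi in candidate_class:
        err = int(np.sum(phi(xs) != ys))
        if best_err is None or err < best_err:
            best_phi, best_err = phi, err
    return int(best_phi(np.atleast_1d(np.asarray(x_new)))[0])

# Illustrative class C: threshold rules on the real line.
thresholds = np.linspace(-3.0, 3.0, 61)
C = [lambda x, t=t: (x > t).astype(int) for t in thresholds]
# Example usage: label = erm_predict(objects, labels, 0.7, C)
```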
62 Theorem 5 Let C be a class of decision functions and let Γ be a predictor which for each n ∈ N minimises errn over C on the observed examples (z1, . . . , zn). [sent-442, score-0.733]
63 Thus, if we have bounds on the VC dimension of some class of classifiers, we can obtain bounds on the performance of predictors minimising empirical error for the conditional model. [sent-451, score-0.122]
64 Here Γ is a predictor which on each trial n minimises empirical risk over C_{k_n}, and P is any distribution satisfying (1), (2) and Σ_{n=1}^∞ (1 − Cn) < ∞. [sent-459, score-0.218]
65 Naturally, a question arises whether our conditions on the distributions and on the predictors are necessary, or whether they can be further generalised in the same direction. [sent-473, score-0.098]
66 There exist a distribution (with Cn = 1 for any δ ∈ (0, 1/2) and n > 1/(1/2 − δ)) and a predictor Γ such that Pp(errn > 0) ≤ 2^{1−n} for any p ∈ [δ, 1 − δ], while P(errn = 1) = 1 for n > 1. [sent-479, score-0.149]
67 We define the predictor Γ as follows: Γn := 1 − xn if |#{i < n : yi = 0} − n/2| ≤ 1, and Γn := xn otherwise. [sent-493, score-0.418]
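For concreteness, the counterexample predictor just defined can be written out directly; the sketch assumes, as the construction suggests, that the objects themselves are 0/1-valued and that n is taken to be the index of the object being classified:

```python
def counterexample_predictor(ys, x_new):
    """Gamma_n := 1 - x_n if |#{i < n : y_i = 0} - n/2| <= 1, else x_n."""
    n = len(ys) + 1                       # index of the object being classified
    zeros = sum(1 for y in ys if y == 0)
    return 1 - x_new if abs(zeros - n / 2) <= 1 else x_new
```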
68 In particular, the assumption (8) might appear redundant: if the rate of occurrence of some label tends to zero, can we just ignore this label without affecting the asymptotic results? [sent-500, score-0.091]
69 Remark 9 There exists a distribution P on X^∞ which satisfies (1) and (2) but for which the nearest neighbour predictor is not consistent, i.e., errn(Γ) does not tend to zero. [sent-502, score-0.319]
70 The nearest neighbour predictor is consistent for any i.i.d. distribution. [sent-509, score-0.325]
71 The quantities in (16) and (17) are defined as suprema over label distributions: in particular, ∆δ(P0, P1, n, ε) := sup_q ∆(Pq, n, ε), (17) where the supremums are with respect to all distributions q such that min_{a∈Y} q(a) ≥ δ. [sent-524, score-0.097]
72 The sequence κn is such that n|pn − p| ≤ κn from some n on, almost surely, for any p ∈ (0, 1) and any probability distribution P on X such that P(y = 1) = p, where pn is the empirical frequency of the corresponding label among Y1, . . . , Yn. [sent-530, score-0.144]
73 err^1_n(Γ, Z1, . . . , Zn) := P(Γ(Z1, . . . , Zn, Xn+1) ≠ Yn+1 | Yn+1 = 1), with the same notational convention as used with the definition of errn(Γ). [sent-549, score-0.748]
74 In words, for each y ∈ Y = {0, 1} we define err^y_n as the probability of the set of all x ∈ X such that Γ makes an error on the n'th trial, given that Yn+1 = y and fixed z1, . . . , zn. [sent-550, score-0.255]
75 For a sequence of labels y = (y1, . . . , yn, . . . ) define pn(y) := (1/n) #{i ≤ n : yi = 0}, for n > 1. [sent-560, score-0.169]
76 That is, having fixed the labels, we consider the probability over the objects only. [sent-590, score-0.255]
77 Fix some n > 1, some y ∈ Y, and a sequence y^1 ∈ Y^∞ such that δ ≤ pn(y^1) ≤ 1 − δ and P((Y1, . . . [sent-592, score-0.087]
78 (Displayed identity expressing the probability of error given the fixed label sequence y^1 as the i.i.d. probability Pp(err^y_n(Γ, X1, y^1_1, . . . , Xn, y^1_n, Xn+1) > ε).) [sent-607, score-0.236]
79 Clearly, for δ ≤ p ≤ 1 − δ we have errn(Γ, Pp) ≤ max_{y∈Y} err^y_n(Γ, Pp), and if errn(Γ, Pp) < ε then err^y_n(Γ, Pp) < ε/δ for each y ∈ Y. [sent-614, score-1.338]
80-85 (Displayed chain of inequalities from the proof of Theorem 1: err^y_n(Γ) under Pp on a sample with label sequence y^1 is compared with err^y_n(Γ) on a ghost sample with label sequence y^2; a union bound splits the difference into two terms at level ε/2, the first bounded via the maximum change over replacements of at most κn examples (tolerance to data) and the second via a bound on n + κn examples, yielding (18) and (19).) [sent-633, sent-642, sent-648, sent-659, sent-691, sent-697]
86 Finally, as y^1 was chosen arbitrarily among sequences y ∈ Y^∞ such that δ ≤ pn(y^1) ≤ 1 − δ, from (18) and (19) we obtain (6) and (7). [sent-727, score-0.106]
87 We show that the tolerance term ∆̄(P, n, Z1, . . . , Zn) tends to 0 for the nearest neighbour and partitioning predictors, and apply Corollary 2. [sent-736, score-0.191]
88 Omitting z1, . . . , zn from the notation, we have errn(Γ, Pp) ≤ err^0_n(Γ, Pp) + err^1_n(Γ, Pp) and ∆̄(Pp, n) ≤ ∆̄^0(Pp, n) + ∆̄^1(Pp, n). [sent-793, score-0.748]
89 So, E^∞(err^1_n(Γ, Pp)) ≤ E^∞(err^1_n(Γ, Pδ/2) | pn = δ/2) + Pp(pn ≤ δ/2), (22) where as usual pn := (1/n) #{i ≤ n : yi = 1}. [sent-797, score-0.108]
90 Moreover, it is easy to see that E^∞(err^1_n(Γ, Pδ/2) | npn = n(δ/2)) ≤ E^∞(err^1_n(Γ, Pδ/2) | |n(δ/2) − npn| ≤ κn/2) + E^∞(∆̄^1(Pδ/2, n)), which in turn is bounded in terms of E^∞(err^1_n(Γ, Pδ/2)) and E^∞(∆̄^1(Pδ/2, n)). [sent-799, score-0.087]
91 Again, we observe that D will not decrease if an arbitrary (possibly large) portion of training examples labelled with ones is replaced with an arbitrary (but consistent with η) portion of the same size of examples labelled with zeros. [sent-830, score-0.154]
92 . . . (z1, . . . , zn), with κn in the definition replaced by 2κn. [sent-837, score-0.197]
93 With x1, . . . , xn implicit, define Bn(x) := Pp{t ∈ X : t and x have the same nearest neighbour among x1, . . . , xn}. [sent-875, score-0.276]
94 Let Bn := E(Bn(X)), and note that E^∞(Bn) = 1/n, where the expectation is taken over X1, . . . , Xn. [sent-878, score-0.124]
95 The number of points among x1, . . . , xn for which a given point x0 is the nearest neighbour is not greater than a constant γ which depends only on the space X (see Devroye, Györfi, and Lugosi, 1996, Chapter 11). [sent-924, score-0.276]
96 For any measurable sets B ⊂ X^n and A ⊂ X define D(B, A) := E^∞ sup_{j≤κn; π:{1,. . . [sent-940, score-0.092]
97 Define (as a function of x1, . . . , xn) the union of all cells An for which E(|ηn(X) − η(X)| | X ∈ An) ≤ ε. [sent-990, score-0.148]
98 Clearly, |errn (ϕ× ) − errn (ϕ∗ )| ≤ κn , as κn is the maximal number of errors which can be made on the difference of the two samples. [sent-1028, score-0.551]
99 Moreover, P^n(|err(ϕ*_n) − err(ϕ×_n)| > ε) ≤ P^n(|err(ϕ*_n) − (1/n) errn(ϕ*_n)| > ε/2) + P^n(|(1/n) errn(ϕ×_n) − err(ϕ×_n)| > ε/2 − κn/n). Observe that P^n(sup_{ϕ∈C} |(1/n) errn(ϕ) − err(ϕ)| > ε) ≤ 8 S(C, n) e^{−nε²/32}, (27) see Devroye, Györfi, and Lugosi (1996, Chapter 12). [sent-1029, score-1.653]
100 So far we have proven (12) and (13); (14) and (15) can be proven analogously, only for the case η ∈ C we have P^n(sup_{ϕ∈C} |(1/n) errn(ϕ) − err(ϕ)| > ε) ≤ S(C, n) e^{−nε} instead of (27), and err(ϕPp, Pp) = 0. [sent-1034, score-0.551]
wordName wordTfidf (topN-words)
[('pp', 0.581), ('errn', 0.551), ('erry', 0.236), ('err', 0.207), ('zn', 0.197), ('ryabko', 0.167), ('predictor', 0.149), ('xn', 0.124), ('neighbour', 0.096), ('ndependent', 0.089), ('onditionally', 0.089), ('pn', 0.087), ('ecognition', 0.075), ('cn', 0.069), ('kulkarni', 0.067), ('lugosi', 0.066), ('yn', 0.061), ('devroye', 0.059), ('predictors', 0.058), ('tolerance', 0.056), ('nearest', 0.056), ('sup', 0.055), ('objects', 0.051), ('nobel', 0.05), ('conditionally', 0.05), ('fix', 0.049), ('morvai', 0.049), ('gy', 0.047), ('labels', 0.044), ('bn', 0.042), ('algoet', 0.039), ('generalise', 0.039), ('partitioning', 0.039), ('conditional', 0.039), ('mp', 0.038), ('measurable', 0.037), ('symptoms', 0.037), ('object', 0.035), ('ill', 0.033), ('minimises', 0.033), ('vazirani', 0.033), ('labelled', 0.033), ('portion', 0.032), ('occurrence', 0.031), ('label', 0.03), ('daniil', 0.03), ('erryn', 0.03), ('ixi', 0.03), ('posner', 0.03), ('py', 0.03), ('nonparametric', 0.026), ('na', 0.026), ('diam', 0.025), ('minimising', 0.025), ('blumer', 0.025), ('consistent', 0.024), ('pattern', 0.024), ('cells', 0.024), ('np', 0.024), ('dependencies', 0.024), ('distributions', 0.023), ('consistency', 0.023), ('recognition', 0.023), ('ergodic', 0.022), ('letters', 0.022), ('vapnik', 0.021), ('yi', 0.021), ('text', 0.021), ('vc', 0.02), ('lim', 0.02), ('aldous', 0.02), ('errm', 0.02), ('gamarnik', 0.02), ('ghost', 0.02), ('ncn', 0.02), ('npn', 0.02), ('yakowitz', 0.02), ('kearns', 0.02), ('estimations', 0.019), ('valiant', 0.019), ('probability', 0.019), ('xm', 0.019), ('sequences', 0.019), ('markov', 0.019), ('distribution', 0.018), ('weaker', 0.018), ('cell', 0.018), ('kn', 0.018), ('task', 0.018), ('haussler', 0.017), ('mpm', 0.017), ('mina', 0.017), ('overviews', 0.017), ('iyi', 0.017), ('zeger', 0.017), ('generalised', 0.017), ('irrational', 0.017), ('minimisation', 0.017), ('reserved', 0.017), ('universal', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
Author: Daniil Ryabko
Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over under a weaker assumption: namely, conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label is above some positive threshold. We find a broad class of learning algorithms for which estimates of the probability of classification error achieved under the classical i.i.d. assumption can be generalized to similar estimates for the case of conditionally i.i.d. examples.
2 0.23247567 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller
Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the "non-Gaussian subspace" within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low dimensional non-Gaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.
3 0.078990154 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
Author: Rémi Munos
Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(ρ^N) (with ρ < 1) up to a threshold that depends on the approximation error ||V − A V||, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. A V = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is computing the sensitivity ∂αV of the performance measure V with respect to some parameter α of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.
4 0.077830993 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
Author: Andrea Caponnetto, Alexander Rakhlin
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^{−1/2}) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n^{−1/2}. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes
5 0.052675381 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
Author: Ron Begleiter, Ran El-Yaniv
Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman's alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds
6 0.043631442 66 jmlr-2006-On Model Selection Consistency of Lasso
7 0.039831974 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
8 0.038032282 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
9 0.033399034 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
10 0.031822473 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
11 0.031794928 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
12 0.0295191 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
13 0.028958317 48 jmlr-2006-Learning Minimum Volume Sets
14 0.027074656 83 jmlr-2006-Sparse Boosting
15 0.025751982 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
16 0.024505718 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
17 0.023749452 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification
18 0.022316545 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
19 0.021357629 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
20 0.020579724 13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies
topicId topicWeight
[(0, 0.132), (1, -0.003), (2, -0.077), (3, -0.164), (4, -0.077), (5, 0.146), (6, -0.168), (7, 0.041), (8, -0.087), (9, 0.462), (10, 0.058), (11, -0.015), (12, 0.137), (13, -0.283), (14, 0.028), (15, -0.06), (16, 0.008), (17, -0.22), (18, 0.21), (19, -0.188), (20, 0.093), (21, 0.039), (22, 0.07), (23, 0.049), (24, 0.05), (25, -0.033), (26, -0.092), (27, -0.028), (28, 0.007), (29, 0.037), (30, -0.005), (31, 0.017), (32, 0.004), (33, -0.036), (34, -0.137), (35, -0.007), (36, -0.05), (37, 0.072), (38, -0.017), (39, -0.02), (40, 0.058), (41, 0.015), (42, 0.013), (43, 0.003), (44, 0.048), (45, -0.003), (46, 0.005), (47, 0.004), (48, -0.007), (49, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.95772529 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
Author: Daniil Ryabko
Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over under a weaker assumption: namely, conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label is above some positive threshold. We find a broad class of learning algorithms for which estimates of the probability of classification error achieved under the classical i.i.d. assumption can be generalized to similar estimates for the case of conditionally i.i.d. examples.
2 0.79518807 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller
Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the "non-Gaussian subspace" within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low dimensional non-Gaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.
3 0.29357976 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
Author: Andrea Caponnetto, Alexander Rakhlin
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^{−1/2}) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n^{−1/2}. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes
4 0.22891325 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
Author: Ron Begleiter, Ran El-Yaniv
Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman's alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds
5 0.21040756 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
Author: Rémi Munos
Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(ρ^N) (with ρ < 1) up to a threshold that depends on the approximation error ||V − A V||, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. A V = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is computing the sensitivity ∂αV of the performance measure V with respect to some parameter α of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.
6 0.19076771 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
7 0.18532667 66 jmlr-2006-On Model Selection Consistency of Lasso
8 0.13827161 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
9 0.12595062 48 jmlr-2006-Learning Minimum Volume Sets
10 0.12203688 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
11 0.11960774 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
12 0.11677614 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
13 0.11231966 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
14 0.10895696 83 jmlr-2006-Sparse Boosting
16 0.10329471 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
17 0.091234788 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification
18 0.090980135 3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection
19 0.090084046 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
20 0.086115509 49 jmlr-2006-Learning Parts-Based Representations of Data
topicId topicWeight
[(8, 0.026), (36, 0.079), (42, 0.398), (45, 0.05), (50, 0.051), (61, 0.016), (63, 0.033), (78, 0.014), (81, 0.071), (84, 0.025), (90, 0.02), (91, 0.023), (96, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.70878196 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
Author: Daniil Ryabko
Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over under a weaker assumption: namely, conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label is above some positive threshold. We find a broad class of learning algorithms for which estimates of the probability of classification error achieved under the classical i.i.d. assumption can be generalized to similar estimates for the case of conditionally i.i.d. examples.
Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor
Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs
3 0.32789609 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
Author: Francis R. Bach, David Heckerman, Eric Horvitz
Abstract: Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. We consider the alternative of varying the asymmetry of the cost function used for training. We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. In addition, we present a path-following algorithm for the support vector machine (SVM) that can compute efficiently the entire ROC curve, and that has the same computational complexity as training a single classifier. Finally, we provide a theoretical analysis of the relationship between the asymmetric cost model assumed when training a classifier and the cost model assumed in applying the classifier. In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. Keywords: support vector machines, receiver operating characteristic (ROC) analysis, linear classification
4 0.32773188 66 jmlr-2006-On Model Selection Consistency of Lasso
Author: Peng Zhao, Bin Yu
Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency
5 0.32452032 48 jmlr-2006-Learning Minimum Volume Sets
Author: Clayton D. Scott, Robert D. Nowak
Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency, an oracle inequality, and rates of convergence. The proposed estimators are illustrated with histogram and decision tree set estimation rules. Keywords: minimum volume sets, anomaly detection, statistical learning theory, uniform deviation bounds, sample complexity, universal consistency
6 0.31507605 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
7 0.31183416 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
8 0.30802155 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
9 0.30761909 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
10 0.30585617 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error
11 0.30558133 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
12 0.30447575 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
13 0.29941273 53 jmlr-2006-Learning a Hidden Hypergraph
14 0.29894045 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
15 0.29808721 44 jmlr-2006-Large Scale Transductive SVMs
16 0.29781312 65 jmlr-2006-Nonparametric Quantile Estimation
17 0.29744345 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
19 0.29531044 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
20 0.29506576 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates