jmlr jmlr2006 jmlr2006-48 knowledge-graph by maker-knowledge-mining

48 jmlr-2006-Learning Minimum Volume Sets


Source: pdf

Author: Clayton D. Scott, Robert D. Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency, an oracle inequality, and rates of convergence. The proposed estimators are illustrated with histogram and decision tree set estimation rules. Keywords: minimum volume sets, anomaly detection, statistical learning theory, uniform deviation bounds, sample complexity, universal consistency

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. [sent-7, score-0.257]

2 We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. [sent-10, score-0.283]

3 The proposed estimators are illustrated with histogram and decision tree set estimation rules. [sent-14, score-0.203]

4 Keywords: minimum volume sets, anomaly detection, statistical learning theory, uniform deviation bounds, sample complexity, universal consistency 1. [sent-15, score-0.277]

5 Introduction. Given a probability measure P and a reference measure µ, the minimum volume set (MV-set) with mass at least 0 < α < 1 is G∗ = arg min{µ(G) : P(G) ≥ α, G measurable}. [sent-16, score-0.329]
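
As a concrete illustration of this definition (our example, not taken from the paper), consider a finite sample space with µ equal to the counting measure, so that "volume" is simply the number of outcomes: the MV-set is then obtained by collecting the most probable outcomes until their total P-mass reaches α. A minimal Python sketch, with all names chosen here for illustration:

```python
# Minimal sketch (ours): MV-set for a finite sample space with mu = counting
# measure, i.e., volume = number of outcomes. Collect the most probable
# outcomes until their total P-mass reaches alpha.

def minimum_volume_set(pmf, alpha):
    """pmf: dict mapping outcome -> probability under P. Returns a set of outcomes."""
    assert 0.0 < alpha < 1.0
    chosen, mass = set(), 0.0
    for outcome, p in sorted(pmf.items(), key=lambda kv: -kv[1]):
        if mass >= alpha:
            break
        chosen.add(outcome)
        mass += p
    return chosen

if __name__ == "__main__":
    pmf = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
    print(minimum_volume_set(pmf, alpha=0.8))  # {'a', 'b'}: mass 0.8 with only two outcomes
```

The chosen outcomes are (up to ties) exactly those whose probability exceeds a threshold, which makes the connection to density level sets in the following sentences concrete.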

6 Applications of minimum volume sets include outlier/anomaly detection, determining highest posterior density or multivariate confidence regions, tests for multimodality, and clustering. [sent-20, score-0.242]

7 Section 5 introduces a tuning parameter to the proposed rules that allows the user to affect the tradeoff between volume error and mass error without sacrificing theoretical properties. [sent-36, score-0.287]

8 Using empirical process theory, he establishes consistency results and rates of convergence for minimum volume sets which depend on the entropy of the class of candidate sets. [sent-46, score-0.26]

9 2 Connection to Density Level Sets. The MV-set estimation problem is closely related to density level set estimation (Tsybakov, 1997; Ben-David and Lindenbaum, 1997; Cuevas and Rodriguez-Casal, 2003; Steinwart et al. [sent-65, score-0.19]

10 , 2005; Vert and Vert, 2005) and excess mass estimation problems (Nolan, 1991; Müller and Sawitzki, 1991; Polonik, 1995). [sent-66, score-0.224]

11 Indeed, it is well known that density level sets are minimum volume sets (Nunez-Garcia et al. [sent-67, score-0.276]

12 The main difference between density level sets and MV-sets is that the former require the specification of a density level of interest, rather than the specification of the mass α to be enclosed. [sent-69, score-0.365]

13 Since the density is in general unknown, it seems that specifying α is much more reasonable and intuitive than setting a density level for problems like anomaly detection. [sent-70, score-0.194]

14 The choice of c does not change the minimum volume set, but it does affect the γ level set. [sent-72, score-0.196]

15 Algorithms for density level set estimation can be split into two categories, implicit plug-in methods and explicit set estimation methods. [sent-77, score-0.19]

16 (2001) considers plug-in rules for density level set estimation problems and establishes upper bounds on the rate of convergence for such estimators in certain cases. [sent-80, score-0.256]

17 The problem of estimating a density support set, the zero level set, is a special minimum volume set (i.e., the case α = 1). [sent-81, score-0.276]

18 While consistency and rate of convergence results for plug-in methods typically make global smoothness assumptions on the density, explicit methods make assumptions on the density at or near the level of interest. [sent-85, score-0.208]

19 Vert and Vert (2005) study the one-class support vector machine (SVM) and show that it produces a consistent density level set estimator, based on the fact that consistent density estimators produce consistent plug-in level set estimators. [sent-90, score-0.277]

20 Willett and Nowak (2005, 2006) propose a level set estimator based on decision trees, which is applicable to density level set estimation as well as regression level set estimation, and related dyadic partitioning schemes are developed by Klemelä (2004) to estimate the support set of a density. [sent-91, score-0.428]

21 Denote by f the density of P with respect to µ (when it exists), γ > 0 a level of the density, and α ∈ (0, 1) a user-specified mass constraint. [sent-114, score-0.251]

22 A minimum volume set, G∗_α, is a minimizer of (1) when it exists. [sent-116, score-0.18]

23 Definition 2 We say φ is a (distribution free) complexity penalty for G if and only if for all distributions P and all δ ∈ (0, 1), P^n({ S : sup_{G∈G} ( |P̂(G) − P(G)| − φ(G, S, δ) ) > 0 }) ≤ δ. [sent-134, score-0.214]

24 Theorem 3 If φ is a complexity penalty for G, then P^n( P(Ĝ_G,α) < α − 2φ(Ĝ_G,α, S, δ) or µ(Ĝ_G,α) > µ_G,α ) ≤ δ. [sent-140, score-0.174]

25 , 1996), we know that φ is a complexity penalty for G , and therefore Theorem 3 applies. [sent-157, score-0.174]

26 Thus, for any fixed ε > 0, the probability of being within 2ε of the target mass α and being less than the target volume µG ,α approaches one exponentially fast as the sample size increases. [sent-161, score-0.287]

27 By Chernoff’s bound together with the union bound, the penalty φ defined in (7) is a penalty for G. [sent-169, score-0.198]
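
To make the shape of such penalties concrete, here is a hedged sketch (ours) of an Occam-style penalty of the form √((c(G) log 2 + log(2/δ)) / (2n)), where c(G) is the codelength of G in a prefix code; this form is consistent with the histogram penalty quoted in sentence 35 below (where c(G) = k^d), but the exact statement of the paper's Equation (7) should be checked against the original.

```python
# Hedged sketch (ours): an Occam-style complexity penalty
#   phi(G, delta) = sqrt((c(G) * log 2 + log(2/delta)) / (2n)),
# where c(G) is the codelength of G in a prefix code over a countable class.
# Chernoff's bound plus a union bound with weights 2^{-c(G)} yields Definition 2.
import math

def occam_penalty(codelength_bits, n, delta):
    return math.sqrt((codelength_bits * math.log(2) + math.log(2.0 / delta)) / (2.0 * n))

# Example: a histogram class on [0,1]^2 with k = 8 bins per side needs k^d = 64 bits
# (one bit per cell: in or out); n = 1000 samples, confidence 1 - delta = 0.95.
print(occam_penalty(codelength_bits=64, n=1000, delta=0.05))  # ~0.155
```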

28 2.3 The Rademacher Penalty for Sets. The Rademacher penalty was originally studied in the context of classification by Koltchinskii (2001) and Bartlett et al. [sent-176, score-0.174]

29 Thus we are able to define the conditional Rademacher penalty φ(G, S, δ) = 2ρ(G, S) + √( 2 log(2/δ) / n ). [sent-197, score-0.174]

30 By the above Proposition, this is a complexity penalty according to Definition 2. [sent-198, score-0.174]

31 The conditional Rademacher penalty is studied further in Section 7 and in Appendix E, where it is shown that ρ(G , S) can be computed efficiently for sets based on a fixed partition of X (such as histograms and trees). [sent-199, score-0.24]
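
Sentences 94-95 below rewrite the data-dependent part of this penalty as a sum over the cells of the partition, which is what makes it computable. The following sketch (our notation; whether the sum equals ρ or 2ρ should be checked against the paper) estimates the per-cell terms by Monte Carlo over the Rademacher signs:

```python
# Hedged sketch (ours): for a class G of unions of cells of a fixed partition,
# the data-dependent Rademacher term decomposes as
#   sum_A psi(A),  psi(A) = (1/n) E_sigma | sum_{i: X_i in A} sigma_i |,
# estimated here per cell by Monte Carlo over the Rademacher signs sigma_i.
import numpy as np

def rademacher_term(cell_counts, n, n_mc=2000, seed=0):
    """cell_counts: number of sample points falling in each partition cell."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for n_A in cell_counts:
        if n_A == 0:
            continue
        sigma = rng.choice([-1.0, 1.0], size=(n_mc, int(n_A)))
        total += np.abs(sigma.sum(axis=1)).mean() / n
    return total

# Example: n = 1000 points spread over a 4 x 4 histogram partition, combined
# with the confidence term from sentence 29.
counts = np.random.default_rng(1).multinomial(1000, [1.0 / 16] * 16)
phi = rademacher_term(counts, n=1000) + np.sqrt(2.0 * np.log(2.0 / 0.05) / 1000)
print(phi)
```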

32 Yet one may wonder whether convergence of the volume and mass to their optimal values implies convergence to the MV-set (when it is unique) in any sense. [sent-213, score-0.307]

33 Consistency. A minimum volume set estimator is consistent if its volume and mass tend to the optimal values µ∗_α and α as n → ∞. [sent-225, score-0.452]

34 We refer to the left-hand side of (8) as the excess volume of the class G and the left-hand side of (9) as the missing mass of Ĝ_G,α. [sent-234, score-0.306]

35 The penalty for G ∈ G^k is then φ_k(G, S, δ) = √( (k^d log 2 + log(2/δ)) / (2n) ). [sent-269, score-0.23]

36 Thus the conditions for consistency of histograms for minimum volume set estimation are exactly parallel to the conditions for consistency of histogram rules for classification (Devroye et al. [sent-272, score-0.396]

37 First, unlike with density level sets, there may not be a unique MV-set (imagine the case where the density of P has a plateau). [sent-279, score-0.194]

38 This follows from the bound 1 − α ≤ γ_α · (µ(X) − µ∗_α) on the mass outside the minimum volume set. [sent-321, score-0.299]

39 The oracle inequality says that MV-SRM performs about as well as the set chosen by an oracle to optimize the tradeoff between excess volume and missing mass. [sent-323, score-0.34]

40 A penalty for G can be obtained by defining, for G ∈ G^k, φ(G, S, δ) = φ_k(G, S, δ·2^−k), where φ_k is the penalty from Equation (13). [sent-326, score-0.348]

41 Then φ is a penalty for G because φ_k is a penalty for G^k, and by applying the union bound and the fact ∑_{k≥1} 2^−k ≤ 1. [sent-327, score-0.372]

42 As with VC classes, we obtain a penalty for G by defining, for G ∈ G^k, φ(G, S, δ) = φ_k(G, S, δ·2^−k), where φ_k is the penalty from Equation (14). [sent-336, score-0.348]
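
A small sketch (ours) of this confidence-splitting construction, using the histogram penalty as reconstructed in sentence 35 for φ_k:

```python
# Hedged sketch (ours) of sentences 40-42: a per-class penalty phi_k valid on
# G^k, called with confidence delta * 2^{-k}, gives a penalty valid
# simultaneously over G = union_k G^k (union bound, sum_k 2^{-k} <= 1).
import math

def phi_k(k, d, n, delta):
    # Histogram penalty as reconstructed in sentence 35; verify against Eq. (13).
    return math.sqrt((k ** d * math.log(2) + math.log(2.0 / delta)) / (2.0 * n))

def phi_srm(k, d, n, delta):
    return phi_k(k, d, n, delta * 2.0 ** (-k))  # spend confidence delta * 2^{-k} on class k

print([round(phi_srm(k, d=2, n=10000, delta=0.05), 4) for k in (2, 4, 8, 16)])
```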

43 Damping the Penalty. In Theorem 3, the reader may have noticed that MV-ERM does not equitably balance the excess volume (µ(Ĝ_G,α) relative to its optimal value) with the missing mass (P(Ĝ_G,α) relative to α). [sent-340, score-0.306]

44 In this section we introduce variants of MV-ERM and MV-SRM that allow the total error to be shared between the volume and mass, instead of all of the error residing in the mass term. [sent-344, score-0.257]

45 Assume that the penalty is independent of G ∈ G and of the sample S, although it can depend on n and δ. [sent-353, score-0.204]

46 Observe that computing Ĝ_ν,α is equivalent to computing the MV-ERM estimate on each G^k at the level α(k, ν) = α + (1 − ν)ε(n, δ), and then minimizing the penalized volume over these MV-ERM estimates. [sent-372, score-0.179]

47 Here γα(k,ν) is the density level corresponding to the MV-set with mass α(k, ν). [sent-379, score-0.251]

48 Furthermore, damping the penalty by ν has the effect of decreasing the stochastic mass error and adding a stochastic error to the volume. [sent-388, score-0.344]

49 The improved balancing of volume and mass error is confirmed by our experiments in Section 7. [sent-390, score-0.257]

50 To obtain these rates we apply MV-SRM to sets based on a special family of decision trees called dyadic decision trees (DDTs) (Scott and Nowak, 2006). [sent-397, score-0.313]

51 The estimators proposed therein are based on dyadic partitioning schemes similar in spirit to the DDTs studied here. [sent-401, score-0.201]

52 In contrast, DDT methods are capable of attaining near minimax rates for all density level sets whose boundaries belong to certain Hölder smoothness classes, regardless of whether or not there is a discontinuity at the given level. [sent-403, score-0.237]

53 Assumption A3 essentially requires the boundary of the minimum volume set G∗_α to have Lipschitz smoothness, and thus one would expect the optimal rate… Figure 2: A dyadic decision tree (right) with the associated recursive dyadic partition (left) in d = 2 dimensions. [sent-415, score-0.552]

54 Scott and Nowak (2006) demonstrate that dyadic decision trees (DDTs) offer a computationally feasible classifier that also achieves optimal rates of convergence (for standard classification) under a wide range of conditions. [sent-424, score-0.267]

55 A dyadic decision tree is a decision tree that divides the input space by means of axis-orthogonal dyadic splits. [sent-429, score-0.432]

56 Since G L is countable (in fact, finite), one approach is to devise a prefix code for G L and apply the penalty in Section 2. [sent-453, score-0.197]

57 Instead, we employ a different penalty which has the advantage that it leads to minimax optimal rates of convergence. [sent-455, score-0.263]

58 Introduce the notation ⟦A⟧ = (3 + log2 d) j(A), which may be thought of as the codelength of A in a prefix code for A^L, and define the minimax penalty φ(G_T) := ∑_{A∈π(T)} √( 8 max( P̂(A), (⟦A⟧ log 2 + log(2/δ))/n ) · (⟦A⟧ log 2 + log(2/δ))/n ). [sent-456, score-0.318]

59 The MV-SRM procedure over G L with the above penalty leads to an optimal rate of convergence for the box-counting class. [sent-467, score-0.199]

60 Then, begin incorporating cells into the estimate one cell at a time, starting with the most populated, until the empirical mass constraint is satisfied. [sent-524, score-0.231]
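
A hedged sketch (ours) of this greedy histogram rule on [0,1]^d; the relaxed constraint P̂(G) ≥ α − φ follows the MV-ERM formulation summarized above, and details such as the exact constraint and tie-breaking should be checked against the paper:

```python
# Hedged sketch (ours) of the greedy histogram rule in sentence 60: bin the
# sample on a regular k^d grid over [0,1]^d, then add cells from most to least
# populated until the relaxed empirical mass constraint P_hat(G) >= alpha - phi
# is met.
import numpy as np

def histogram_mv_erm(X, alpha, k, phi=0.0):
    """X: (n, d) array of samples in [0,1]^d. Returns (cells, volume, empirical mass)."""
    n, d = X.shape
    idx = np.minimum((X * k).astype(int), k - 1)       # cell index of each sample
    keys, counts = np.unique(idx, axis=0, return_counts=True)
    order = np.argsort(-counts)                        # most populated cells first
    chosen, mass = [], 0.0
    for j in order:
        if mass >= alpha - phi:
            break
        chosen.append(tuple(keys[j]))
        mass += counts[j] / n
    volume = len(chosen) * k ** (-d)                   # Lebesgue volume of the union
    return chosen, volume, mass

# Example: 2-D Gaussian-ish data squashed into [0,1]^2.
rng = np.random.default_rng(0)
X = np.clip(0.5 + 0.12 * rng.standard_normal((2000, 2)), 0.0, 1.0)
cells, vol, mass = histogram_mv_erm(X, alpha=0.9, k=16)
print(len(cells), vol, mass)
```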

61 Both penalties are defined via φ(G, S, δ) = φ_k(G, S, δ·2^−k) for G ∈ G^k, where φ_k is a penalty for G^k. [sent-527, score-0.22]

62 Figure 6 depicts the penalized volume of the MV-ERM estimates (ν = 1) as a function of the resolution k, where 1/k is the sidelength of the histogram cell. [sent-554, score-0.305]

63 2 Dyadic Decision Trees. Implementing MV-SRM for dyadic decision trees is much more challenging than for histograms. [sent-562, score-0.223]

64 In this case the error is more evenly distributed between mass and volume, whereas in the former case all the error is in the mass term. [sent-593, score-0.274]

65 [Figure legend: curves “occam” and “rademacher”; vertical axis: average penalized volume of the MV-ERM solution.] [sent-595, score-0.417]

66 Figure 6: The penalized volume of the MV-ERM estimates Ĝ_G^k,α as a function of the resolution parameter k, where 1/k is the sidelength of the histogram cell. [sent-601, score-0.305]

67 Clearly, the Occam’s razor bound is smaller than the Rademacher penalty (look at the right side of the plot), to which we may attribute its improved performance (see Figure 5). [sent-604, score-0.198]

68 [Figure: average resolution (1/binwidth) of the MV-SRM estimate for the Occam and Rademacher penalties versus sample size; companion panel: symmetric difference as a function of sample size.] [sent-605, score-0.362]

69 Or, suppose that instead of binary dyadic splits with arbitrary orientation, one only considers “quadsplits” whereby every parent node has 2^d children (in fact, this is the tree structure used for our experiments below). [sent-628, score-0.214]

70 We refer to the first penalty as the minimax penalty, since it is inspired by the minimax optimal penalty in (19): ψ_mm(A) is the corresponding per-cell term of (19), scaled down by a factor of 0.01. [sent-645, score-0.488]

71 The factor 0.01 is needed since otherwise the penalty is too large to yield meaningful results. The second penalty is based on the Rademacher penalty (see Section 2.3). [sent-648, score-0.348]

72 To obtain a penalty for all G L = ∪π∈ΠL G π , we apply the union bound over all π ∈ ΠL and replace δ by δ|ΠL |−1 . [sent-654, score-0.198]

73 Per cell, ψ_Rad(A) = √( P̂(A)/n ) (24). The third penalty is referred to as the modified Rademacher penalty and is given by ψ_mRad(A) = √( (P̂(A) + µ(A))/n ) (25). [sent-658, score-0.348]

74 The basic Rademacher is proportional to the square-root of the empirical P mass and the modified Rademacher is proportional to the square-root of the total mass (empirical plus reference). [sent-660, score-0.293]
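
A small sketch (ours) contrasting the two per-cell scalings described here; the paper's (24)-(25) may include additional constant factors:

```python
# Hedged sketch (ours) of the per-cell penalty scalings in sentences 73-74:
#   psi_rad(A)  ~ sqrt(P_hat(A) / n)            -- empirical mass only
#   psi_mrad(A) ~ sqrt((P_hat(A) + mu(A)) / n)  -- empirical plus reference mass
import math

def psi_rad(p_hat, n):
    return math.sqrt(p_hat / n)

def psi_mrad(p_hat, mu, n):
    return math.sqrt((p_hat + mu) / n)

# An empty cell, a sparse cell, and a dense cell, each with mu(A) = 1/64:
for p_hat in (0.0, 0.002, 0.10):
    print(p_hat, psi_rad(p_hat, n=1000), psi_mrad(p_hat, mu=1.0 / 64, n=1000))
```

Note that the modified penalty stays strictly positive even on empty cells because of the µ(A) term, whereas the basic Rademacher penalty vanishes there.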

75 Figures 8, 9, and 10 depict the minimum volume set estimates based on each of the three penalties, and for sample sizes of 100, 1000, and 10000. [sent-674, score-0.214]

76 In addition to the minimum volume set estimates based on a single tree, we also show the estimates based on voting over shifted partitions. [sent-676, score-0.266]

77 Visual inspection of the resulting minimum volume set estimates (which were “typical” results selected at random) reveals some of the characteristics of the different penalties and their behaviors as a function of the sample size. [sent-681, score-0.26]

78 The (down-weighted) minimax penalty results in set estimates quite similar to those resulting from the modified Rademacher. [sent-684, score-0.266]

79 Overall, the modified Rademacher penalty coupled with voting over multiple shifted trees appears to perform best in our experiments. [sent-689, score-0.282]

80 Conclusions. In this paper we propose two rules, MV-ERM and MV-SRM, for estimation of minimum volume sets. [sent-692, score-0.2]

81 These theoretical results are illustrated with histograms and dyadic decision trees. [sent-697, score-0.219]

82 Our estimators, results, and proof techniques for minimum volume sets bear a strong resemblance to existing estimators, results, and proof techniques for supervised classification. [sent-698, score-0.196]

83 (MM) (Rad) (mRad) (MM’) (Rad’) (mRad’) Figure 8: Minimum volume set estimates based on dyadic quadtrees for α = 0. [sent-701, score-0.318]

84 Reconstructions based on MM = minimax penalty (23), Rad = Rademacher penalty (24), and mRad = modified Rademacher penalty (25), and MM’, Rad’, and mRad’ denote the analogous estimates based on voting over multiple trees at different shifts. [sent-703, score-0.701]

85 (MM) (Rad) (mRad) (MM’) (Rad’) (mRad’) Figure 9: Minimum volume set estimates based on dyadic quadtrees for α = 0. [sent-704, score-0.318]

86 Reconstructions based on MM = minimax penalty (23), Rad = Rademacher penalty (24), and mRad = modified Rademacher penalty (25), and MM’, Rad’, and mRad’ denote the analogous estimates based on voting over multiple trees at different shifts. [sent-706, score-0.701]

87 (MM) (Rad) (mRad) (MM’) (Rad’) (mRad’) Figure 10: Minimum volume set estimates based on dyadic quadtrees for α = 0. [sent-707, score-0.318]

88 Reconstructions based on MM = minimax penalty (23), Rad = Rademacher penalty (24), and mRad = modified Rademacher penalty (25), and MM’, Rad’, and mRad’ denote the analogous estimates based on voting over multiple trees at different shifts. [sent-709, score-0.701]

89 Then the minimum volume set with mass α is the acceptance region of the most powerful test of size 1 − α for testing H0 : X ∼ P versus H1 : X ∼ µ. [sent-711, score-0.299]

90 The problem of learning minimum volume sets stands halfway between these two: For one class the true distribution is known (the reference measure), but for the other only training samples are available. [sent-713, score-0.192]

91 Minimum volume set estimation based on Neyman-Pearson classification offers a distinct advantage over the rules studied in this paper. [sent-720, score-0.188]

92 Indeed, our algorithms for histograms and dyadic decision trees take advantage of the fact that the reference measure µ is easily evaluated for these special types of sets. [sent-721, score-0.297]

93 (2005), who sample from µ so as to reduce density level set estimation to cost-sensitive classification. [sent-726, score-0.182]

94 The conditional Rademacher penalty may be rewritten as follows: (2/n) E_(σi) sup_{G∈G} ∑_{i=1}^n σ_i I(X_i ∈ G) = (2/n) E_(σi) sup { ∑_{A∈π} ℓ(A) ∑_{i:X_i∈A} σ_i : ℓ(A) ∈ {0,1}, A ∈ π } = (2/n) ∑_{A∈π} E_(σi) sup_{ℓ(A)} ℓ(A) ∑_{i:X_i∈A} σ_i =: ∑_{A∈π} ψ(A). [sent-832, score-0.254]

95 Thus the penalty is additive (modulo the delta term). [sent-833, score-0.235]

96 This is the penalty used in the histogram experiments (after the delta term is included). [sent-837, score-0.247]

97 A more convenient and intuitive penalty may be obtained by bounding ψ(A) = (1/n) E_(σi) | ∑_{i:X_i∈A} σ_i | ≤ (1/n) ( E_(σi) ( ∑_{i:X_i∈A} σ_i )² )^{1/2} = (1/n) ( ∑_{i:X_i∈A} σ_i² )^{1/2} = √( P̂(A)/n ), where the inequality is Jensen’s. [sent-838, score-0.203]
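
A quick numerical sanity check (ours) of this Jensen step, comparing (1/n) E|∑_{i:X_i∈A} σ_i| with √(P̂(A)/n):

```python
# Quick numerical check (ours) of the Jensen step in sentence 97:
#   (1/n) E | sum_{i: X_i in A} sigma_i |  <=  sqrt(P_hat(A) / n).
import numpy as np

rng = np.random.default_rng(0)
n, n_A = 1000, 137                        # n_A sample points fall in cell A
sigma = rng.choice([-1.0, 1.0], size=(50000, n_A))
lhs = np.abs(sigma.sum(axis=1)).mean() / n
rhs = np.sqrt((n_A / n) / n)
print(lhs, rhs, bool(lhs <= rhs))         # roughly 0.0093 <= 0.0117
```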

98 This is the “Rademacher” penalty employed in the dyadic decision tree experiments. [sent-843, score-0.39]

99 Level sets and minimum volume sets of probability density functions. [sent-1033, score-0.242]

100 Measuring mass concentrations and estimating density contour clusters – an excess mass approach. [sent-1038, score-0.403]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gg', 0.746), ('rademacher', 0.211), ('penalty', 0.174), ('cott', 0.16), ('owak', 0.16), ('dyadic', 0.152), ('mass', 0.137), ('rad', 0.136), ('inimum', 0.129), ('mrad', 0.128), ('volume', 0.12), ('ets', 0.115), ('gn', 0.112), ('scott', 0.103), ('nowak', 0.102), ('vc', 0.087), ('density', 0.08), ('ddts', 0.08), ('polonik', 0.08), ('mm', 0.073), ('oracle', 0.071), ('minimax', 0.07), ('gt', 0.065), ('pn', 0.063), ('occam', 0.061), ('sidelength', 0.056), ('histogram', 0.052), ('excess', 0.049), ('estimators', 0.049), ('cells', 0.049), ('trees', 0.048), ('penalties', 0.046), ('earning', 0.045), ('histograms', 0.044), ('minimum', 0.042), ('tree', 0.041), ('sup', 0.04), ('voting', 0.039), ('estimation', 0.038), ('log', 0.037), ('inf', 0.037), ('hn', 0.036), ('risk', 0.036), ('consistency', 0.035), ('level', 0.034), ('smoothness', 0.034), ('universally', 0.033), ('damping', 0.033), ('estimator', 0.033), ('steinwart', 0.032), ('cannon', 0.032), ('cuevas', 0.032), ('klemel', 0.032), ('willett', 0.032), ('lebesgue', 0.032), ('resolution', 0.03), ('vert', 0.03), ('erm', 0.03), ('sample', 0.03), ('rules', 0.03), ('reference', 0.03), ('classi', 0.03), ('inequality', 0.029), ('lagrangian', 0.028), ('uniform', 0.026), ('cell', 0.026), ('convergence', 0.025), ('penalized', 0.025), ('union', 0.024), ('razor', 0.024), ('ddt', 0.024), ('infg', 0.024), ('quadtrees', 0.024), ('walther', 0.024), ('devroye', 0.024), ('leaf', 0.024), ('universal', 0.024), ('decision', 0.023), ('countable', 0.023), ('pre', 0.023), ('estimates', 0.022), ('blanchard', 0.022), ('partition', 0.022), ('en', 0.021), ('converse', 0.021), ('shifted', 0.021), ('delta', 0.021), ('node', 0.021), ('theorem', 0.02), ('reconstructions', 0.02), ('uniqueness', 0.02), ('box', 0.02), ('quantile', 0.019), ('kd', 0.019), ('rates', 0.019), ('empirical', 0.019), ('adaptively', 0.018), ('minimizer', 0.018), ('bisection', 0.018), ('proof', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 48 jmlr-2006-Learning Minimum Volume Sets

Author: Clayton D. Scott, Robert D. Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency, an oracle inequality, and rates of convergence. The proposed estimators are illustrated with histogram and decision tree set estimation rules. Keywords: minimum volume sets, anomaly detection, statistical learning theory, uniform deviation bounds, sample complexity, universal consistency

2 0.075329408 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

Author: Régis Vert, Jean-Philippe Vert

Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation

3 0.063774057 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^−1/2) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n^−1/2. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

4 0.050006259 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

Author: Guillaume Lecué

Abstract: In this paper we prove the optimality of an aggregation procedure. We prove lower bounds for aggregation of model selection type of M density estimators for the Kullback-Leibler divergence (KL), the Hellinger’s distance and the L1 -distance. The lower bound, with respect to the KL distance, can be achieved by the on-line type estimate suggested, among others, by Yang (2000a). Combining these results, we state that log M/n is an optimal rate of aggregation in the sense of Tsybakov (2003), where n is the sample size. Keywords: aggregation, optimal rates, Kullback-Leibler divergence

5 0.043089963 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

Author: Di-Rong Chen, Tao Sun

Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition

6 0.042102054 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

7 0.04092114 65 jmlr-2006-Nonparametric Quantile Estimation

8 0.038204417 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets

9 0.038045559 68 jmlr-2006-On the Complexity of Learning Lexicographic Strategies

10 0.037343685 3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection

11 0.036774404 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

12 0.036672533 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

13 0.035158642 12 jmlr-2006-Active Learning with Feedback on Features and Instances

14 0.032638449 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

15 0.032481473 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

16 0.030197058 93 jmlr-2006-Universal Kernels

17 0.029587453 31 jmlr-2006-Exact 1-Norm Support Vector Machines Via Unconstrained Convex Differentiable Minimization     (Special Topic on Machine Learning and Optimization)

18 0.028958317 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

19 0.028246084 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

20 0.027443329 15 jmlr-2006-Bayesian Network Learning with Parameter Constraints     (Special Topic on Machine Learning and Optimization)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.153), (1, -0.054), (2, -0.031), (3, -0.114), (4, -0.085), (5, 0.136), (6, -0.077), (7, -0.011), (8, -0.005), (9, -0.006), (10, -0.024), (11, -0.041), (12, 0.033), (13, 0.037), (14, 0.021), (15, -0.132), (16, -0.022), (17, 0.028), (18, -0.02), (19, 0.071), (20, -0.228), (21, -0.201), (22, 0.036), (23, -0.089), (24, 0.005), (25, -0.086), (26, -0.02), (27, 0.159), (28, -0.14), (29, -0.042), (30, 0.071), (31, 0.06), (32, -0.073), (33, -0.259), (34, -0.153), (35, -0.011), (36, 0.146), (37, -0.089), (38, -0.182), (39, -0.113), (40, -0.161), (41, -0.098), (42, -0.014), (43, 0.229), (44, -0.08), (45, -0.145), (46, 0.06), (47, -0.125), (48, -0.083), (49, 0.001)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94378132 48 jmlr-2006-Learning Minimum Volume Sets

Author: Clayton D. Scott, Robert D. Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency, an oracle inequality, and rates of convergence. The proposed estimators are illustrated with histogram and decision tree set estimation rules. Keywords: minimum volume sets, anomaly detection, statistical learning theory, uniform deviation bounds, sample complexity, universal consistency

2 0.41240579 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

Author: Régis Vert, Jean-Philippe Vert

Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation

3 0.41176397 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^−1/2) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n^−1/2. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

4 0.31528953 68 jmlr-2006-On the Complexity of Learning Lexicographic Strategies

Author: Michael Schmitt, Laura Martignon

Abstract: Fast and frugal heuristics are well studied models of bounded rationality. Psychological research has proposed the take-the-best heuristic as a successful strategy in decision making with limited resources. Take-the-best searches for a sufficiently good ordering of cues (or features) in a task where objects are to be compared lexicographically. We investigate the computational complexity of finding optimal cue permutations for lexicographic strategies and prove that the problem is NP-complete. It follows that no efficient (that is, polynomial-time) algorithm computes optimal solutions, unless P = NP. We further analyze the complexity of approximating optimal cue permutations for lexicographic strategies. We show that there is no efficient algorithm that approximates the optimum to within any constant factor, unless P = NP. The results have implications for the complexity of learning lexicographic strategies from examples. They show that learning them in polynomial time within the model of agnostic probably approximately correct (PAC) learning is impossible, unless RP = NP. We further consider greedy approaches for building lexicographic strategies and determine upper and lower bounds for the performance ratio of simple algorithms. Moreover, we present a greedy algorithm that performs provably better than take-the-best. Tight bounds on the sample complexity for learning lexicographic strategies are also given in this article. Keywords: bounded rationality, fast and frugal heuristic, PAC learning, NP-completeness, hardness of approximation, greedy method

5 0.2821584 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

Author: Ron Begleiter, Ran El-Yaniv

Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds

6 0.27408248 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets

7 0.24216011 31 jmlr-2006-Exact 1-Norm Support Vector Machines Via Unconstrained Convex Differentiable Minimization     (Special Topic on Machine Learning and Optimization)

8 0.22805069 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

9 0.22274636 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

10 0.21300939 12 jmlr-2006-Active Learning with Feedback on Features and Instances

11 0.21290378 16 jmlr-2006-Bounds for Linear Multi-Task Learning

12 0.21098451 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

13 0.21097076 65 jmlr-2006-Nonparametric Quantile Estimation

14 0.19604282 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

15 0.19223551 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

16 0.1911259 3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection

17 0.19026493 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

18 0.17869899 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

19 0.17524977 93 jmlr-2006-Universal Kernels

20 0.162071 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting     (Special Topic on Inductive Programming)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.017), (35, 0.012), (36, 0.072), (45, 0.422), (50, 0.087), (61, 0.02), (63, 0.042), (68, 0.015), (76, 0.022), (78, 0.016), (79, 0.01), (81, 0.051), (84, 0.015), (90, 0.034), (91, 0.024), (96, 0.057)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96958166 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

Author: Shalabh Bhatnagar, Vivek S. Borkar, Madhukar Akarapu

Abstract: We study the problem of long-run average cost control of Markov chains conditioned on a rare event. In a related recent work, a simulation based algorithm for estimating performance measures associated with a Markov chain conditioned on a rare event has been developed. We extend ideas from this work and develop an adaptive algorithm for obtaining, online, optimal control policies conditioned on a rare event. Our algorithm uses three timescales or step-size schedules. On the slowest timescale, a gradient search algorithm for policy updates that is based on one-simulation simultaneous perturbation stochastic approximation (SPSA) type estimates is used. Deterministic perturbation sequences obtained from appropriate normalized Hadamard matrices are used here. The fast timescale recursions compute the conditional transition probabilities of an associated chain by obtaining solutions to the multiplicative Poisson equation (for a given policy estimate). Further, the risk parameter associated with the value function for a given policy estimate is updated on a timescale that lies in between the two scales above. We briefly sketch the convergence analysis of our algorithm and present a numerical application in the setting of routing multiple flows in communication networks. Keywords: Markov decision processes, optimal control conditioned on a rare event, simulation based algorithms, SPSA with deterministic perturbations, reinforcement learning

same-paper 2 0.84819865 48 jmlr-2006-Learning Minimum Volume Sets

Author: Clayton D. Scott, Robert D. Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency, an oracle inequality, and rates of convergence. The proposed estimators are illustrated with histogram and decision tree set estimation rules. Keywords: minimum volume sets, anomaly detection, statistical learning theory, uniform deviation bounds, sample complexity, universal consistency

3 0.52018803 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

Author: Ron Begleiter, Ran El-Yaniv

Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds

4 0.51871651 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

Author: Rémi Munos

Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(ρ^N) (with ρ < 1) up to a threshold that depends on the approximation error V − A V, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. A V = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is computing the sensitivity ∂_α V of the performance measure V with respect to some parameter α of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.

5 0.48752114 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

6 0.48240903 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

7 0.45836452 66 jmlr-2006-On Model Selection Consistency of Lasso

8 0.44307339 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

9 0.43819329 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

10 0.43613976 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

11 0.43435317 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

12 0.41206563 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

13 0.41091764 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

14 0.40995112 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

15 0.39485911 65 jmlr-2006-Nonparametric Quantile Estimation

16 0.3947171 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities

17 0.38992909 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

18 0.38757125 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

19 0.38564876 83 jmlr-2006-Sparse Boosting

20 0.38444105 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs