nips nips2009 nips2009-94 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ingo Steinwart, Andreas Christmann
Abstract: We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes. [sent-7, score-0.291]
2 To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. [sent-8, score-0.286]
3 Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) [sent-9, score-0.234]
4 processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case. [sent-12, score-0.186]
5 A set of natural and widely accepted notions for describing such weak dependencies is given by mixing concepts such as α-, β-, and φ-mixing, since a) they offer a generalization to i.i.d. [sent-24, score-0.101]
6 processes that is satisfied by various types of stochastic processes including Markov chains and many time series models, and b) they quantify the dependence in a conceptually simple way that is accessible to various types of analysis. [sent-27, score-0.13]
7 Because of these features, the machine learning community is currently in the process of appreciating and accepting these notions as the increasing number of articles in this direction shows. [sent-28, score-0.047]
8 Probably the first work in this direction goes back to Yu [20], whose techniques for β-mixing processes inspired subsequent work such as [18, 10, 11], while the analysis of specific learning algorithms probably started with [9, 5, 8]. [sent-29, score-0.073]
9 More recently, [7] established consistency of regularized boosting algorithms learning from β-mixing processes, while [15] established consistency of support vector machines (SVMs) learning from α-mixing processes, which constitute the largest class of mixing processes. [sent-30, score-0.314]
10 For the latter, [21] established generalization bounds for empirical risk minimization (ERM) and [19, 17] analyzed least squares support vector machines (LS-SVMs). [sent-31, score-0.186]
11 In this work, we establish a general oracle inequality for generic regularized learning algorithms and α-mixing observations by combining a Bernstein inequality for such processes [9] with localization ideas developed for i.i.d. processes. [sent-32, score-0.448]
12 To illustrate this oracle inequality, we then use it to show learning rates for some algorithms including ERM over finite sets and LS-SVMs. [sent-38, score-0.209]
13 For ERM, our rates essentially match those of the i.i.d. case if one replaces the number of observations with the “effective number of observations”, while, for LS-SVMs, our rates are at least quite close to the recently obtained optimal rates [16] for i.i.d. observations. [sent-42, score-0.205]
14 However, the latter difference is not surprising when considering the fact that [16] used heavy machinery from [sent-46, score-0.059]
15 empirical process theory such as Talagrand’s inequality and localized Rademacher averages, while our results only use a lightweight argument based on Bernstein’s inequality. [sent-53, score-0.084]
16 2 Definitions, Results, and Examples Let X be a measurable space and Y ⊂ R be closed. [sent-54, score-0.043]
17 i.e., for all measurable A ⊂ X × Y, we have P(A) = µ({ω ∈ Ω : Z_i(ω) ∈ A}).   (1) [sent-77, score-0.043]
18 To learn from stationary processes whose components are not independent, [15] suggests that it is necessary to replace the independence assumption by a notion that still guarantees certain concentration inequalities. [sent-78, score-0.083]
19 Throughout this work, we assume that the process Z is geometrically α-mixing, that is, α(Z, µ, n) ≤ c exp(−b n^γ), n ≥ 1,   (2) for some constants b > 0, c ≥ 0, and γ > 0. [sent-86, score-0.068]
20 i.i.d. processes satisfy (2) for c = 0 and all b, γ > 0. [sent-90, score-0.076]
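As a quick numerical illustration of condition (2) (a minimal sketch; the constants b, c, and γ below are arbitrary placeholders, not values taken from the paper), the following Python snippet evaluates the geometric decay bound and confirms that the i.i.d. choice c = 0 makes it vanish for every n:

    import numpy as np

    def alpha_mixing_bound(n, b, c, gamma):
        """Right-hand side of condition (2): c * exp(-b * n**gamma)."""
        n = np.asarray(n, dtype=float)
        return c * np.exp(-b * n**gamma)

    ns = np.array([1, 10, 100, 1000])
    print(alpha_mixing_bound(ns, b=0.5, c=2.0, gamma=1.0))  # geometric decay towards 0
    print(alpha_mixing_bound(ns, b=0.5, c=0.0, gamma=1.0))  # i.i.d. case: identically 0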
21 An extensive and thorough account of mixing concepts, including stronger mixing notions such as β- and φ-mixing, is provided by [3]. [sent-99, score-0.142]
22 To this end, we assume that we have a hypothesis set F consisting of bounded measurable functions f : X → R that is pre-compact with respect to the supremum norm ‖·‖_∞, i.e. [sent-101, score-0.043]
23 , for all ε > 0, the covering numbers N(F, ‖·‖_∞, ε) := inf{ n ≥ 1 : ∃ f_1, … [sent-103, score-0.048]
24 , f_n ∈ F such that F ⊂ ⋃_{i=1}^{n} B(f_i, ε) } are finite, where B(f_i, ε) := { f ∈ ℓ_∞(X) : ‖f − f_i‖_∞ ≤ ε } denotes the ε-ball with center f_i in the space ℓ_∞(X) of bounded functions f : X → R. [sent-106, score-0.058]
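For intuition about these covering numbers, here is a small Python sketch (an illustrative construction, not taken from the paper) that builds a greedy internal ε-net for a finite family of functions evaluated on a grid; its size is an upper bound on N(F, ‖·‖_∞, ε) for that finite family:

    import numpy as np

    def greedy_covering_number(F_values, eps):
        """Greedy upper bound on N(F, ||.||_inf, eps) for a finite family F,
        given as an array of shape (num_functions, num_grid_points)."""
        remaining = list(range(len(F_values)))
        centers = []
        while remaining:
            i = remaining[0]                  # take an uncovered function as a new center
            centers.append(i)
            remaining = [j for j in remaining
                         if np.max(np.abs(F_values[j] - F_values[i])) > eps]
        return len(centers)

    # toy family f_c(x) = c * x on [0, 1]; the sup-distance between f_c and f_{c'} is |c - c'|
    xs = np.linspace(0.0, 1.0, 50)
    F_vals = np.array([c * xs for c in np.linspace(0.0, 1.0, 101)])
    print(greedy_covering_number(F_vals, eps=0.1))  # about 10 centers for this greedy scheme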
25 For example, if Y := {−1, 1} and L is a convex, margin-based loss represented by ϕ : R → [0, ∞), that is L(y, t) = ϕ(yt) for all y ∈ Y and t ∈ R, then L can be clipped, if and only if ϕ has a global minimum, see [13, Lemma 2. [sent-111, score-0.071]
26 In particular, the hinge loss, the least squares loss for classification, and the squared hinge loss can be clipped, but the logistic loss for classification and the AdaBoost loss cannot be clipped. [sent-113, score-0.429]
27 On the other hand, [12] established a simple technique, which is similar to inserting a small amount of noise into the labeling process, to construct a clippable modification of an arbitrary convex, margin-based loss. [sent-114, score-0.104]
28 Moreover, if Y := [−M, M ] and L is a convex, distance-based loss represented by some ψ : R → [0, ∞), that is L(y, t) = ψ(y − t) for all y ∈ Y and t ∈ R, then L can be clipped whenever ψ(0) = 0, see again [13, Lemma 2. [sent-115, score-0.196]
29 In particular, the least squares loss and the pinball loss used for quantile regression can be clipped, if the space of labels Y is bounded. [sent-117, score-0.3]
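To make the clipping operation concrete, here is a minimal Python sketch (illustrative only): clipping a prediction to [−M, M] and evaluating the least squares and pinball losses before and after clipping; with |y| ≤ M the clipped prediction never has a larger loss.

    import numpy as np

    def clip_at(t, M):
        """Clip a prediction t to the interval [-M, M]."""
        return np.clip(t, -M, M)

    def least_squares_loss(y, t):
        return (y - t) ** 2

    def pinball_loss(y, t, tau=0.5):
        """Pinball (quantile) loss at level tau."""
        r = y - t
        return np.maximum(tau * r, (tau - 1.0) * r)

    y, t, M = 0.7, 3.2, 1.0
    print(least_squares_loss(y, t), least_squares_loss(y, clip_at(t, M)))  # 6.25 vs 0.09
    print(pinball_loss(y, t), pinball_loss(y, clip_at(t, M)))              # 1.25 vs 0.15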
30 Given a loss function L and an f : X → R, we often use the notation L ◦ f for the function (x, y) → L(x, y, f (x)). [sent-118, score-0.071]
31 Given a regularizer Υ : F → [0, ∞), a clippable loss, and an accuracy δ ≥ 0, we consider learning methods that, for all n ≥ 1, produce a decision function f_{D_n,Υ} ∈ F satisfying Υ(f_{D_n,Υ}) + R_{L,D_n}(f̄_{D_n,Υ}) ≤ inf_{f ∈ F} ( Υ(f) + R_{L,D_n}(f) ) + δ.   (4) [sent-122, score-0.132]
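A minimal Python sketch of a learning method of the form (4) over a finite candidate set (purely illustrative: the candidates, regularizer, and loss are placeholders, the clipping step is omitted, and the infimum is attained exactly, so δ = 0):

    import numpy as np

    def empirical_risk(loss, f, X, y):
        return float(np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)]))

    def regularized_erm(candidates, regularizer, loss, X, y):
        """Return a candidate minimizing Upsilon(f) + R_{L,D_n}(f), i.e. (4) with delta = 0."""
        objective = [regularizer(f) + empirical_risk(loss, f, X, y) for f in candidates]
        return candidates[int(np.argmin(objective))]

    rng = np.random.default_rng(0)
    X = np.linspace(0.0, 1.0, 20)
    y = 0.3 + 0.1 * rng.standard_normal(20)
    candidates = [lambda x, c=c: c for c in np.linspace(-1.0, 1.0, 41)]  # constant predictors
    f_hat = regularized_erm(candidates, regularizer=lambda f: 0.0,
                            loss=lambda yi, t: (yi - t) ** 2, X=X, y=y)
    print(f_hat(0.0))  # close to 0.3, the mean of the labels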
32 The following theorem, which is our main result, establishes an oracle inequality for methods (4), when the training data is generated by Z. [sent-124, score-0.217]
33 1 Let L : X × Y × R → [0, ∞) be a loss that can be clipped at M > 0 and that satisfies L(x, y, 0) ≤ 1, L(x, y, t) ≤ B, and |L(x, y, t) − L(x, y, t′)| ≤ |t − t′|   (5) for all (x, y) ∈ X × Y and t, t′ ∈ [−M, M], where B > 0 is some constant. [sent-126, score-0.176]
34 Assume that there exist a Bayes decision function f*_{L,P} and constants ϑ ∈ [0, 1] and V ≥ B^{2−ϑ} such that E_P( L ◦ f̄ − L ◦ f*_{L,P} )² ≤ V · ( E_P( L ◦ f̄ − L ◦ f*_{L,P} ) )^ϑ, f ∈ F,   (6) where F is a hypothesis set and L ◦ f denotes the function (x, y) → L(x, y, f(x)). [sent-128, score-0.032]
35 Before we illustrate this theorem by a few examples, let us briefly discuss the variance bound (6). [sent-131, score-0.087]
36 For example, if Y = [−M, M] and L is the least squares loss, then it is well-known that (6) is satisfied for V := 16M² and ϑ = 1, see e.g. [sent-132, score-0.077]
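A standard way to see this (a sketch of the usual argument, using |y| ≤ M, |f(x)| ≤ M, |f*_{L,P}(x)| ≤ M, and the identity E_P(L ◦ f − L ◦ f*_{L,P}) = ‖f − f*_{L,P}‖²_{L₂(P_X)} for the least squares loss) is:

    \begin{align*}
    L \circ f - L \circ f^*_{L,P}
      &= (y - f(x))^2 - (y - f^*_{L,P}(x))^2
       = \bigl(f^*_{L,P}(x) - f(x)\bigr)\,\bigl(2y - f(x) - f^*_{L,P}(x)\bigr), \\
    \mathbb{E}_P\bigl(L \circ f - L \circ f^*_{L,P}\bigr)^2
      &\le 16 M^2\, \mathbb{E}_{P_X}\bigl(f - f^*_{L,P}\bigr)^2
       = 16 M^2\, \mathbb{E}_P\bigl(L \circ f - L \circ f^*_{L,P}\bigr),
    \end{align*}

so that (6) indeed holds with V := 16M² and ϑ = 1.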
37 Moreover, under some assumptions on the distribution P , [14] established a variance bound of the form (6) for the so-called pinball loss used for quantile regression. [sent-136, score-0.23]
38 In addition, for the hinge loss, (6) is satisfied for ϑ := q/(q + 1), if Tsybakov’s noise assumption holds for q, see [13, Theorem 8. [sent-137, score-0.052]
39 Finally, based on [2], [12] established a variance bound with ϑ = 1 for the earlier mentioned clippable modifications of strictly convex, twice continuously differentiable margin-based loss functions. [sent-139, score-0.2]
40 However, a closer look reveals that the constant B only bounds functions of the form L ◦ f̄, while B₀ bounds the function L ◦ f₀ for an unclipped f₀ ∈ F. [sent-142, score-0.04]
41 1 provides, by some simple estimates, the oracle inequality R_{L,P}(f̄_{D_n,Υ}) − R*_{L,P} < 3 ( inf_{f ∈ F} R_{L,P}(f) − R*_{L,P} ) + 3 ( 36 c_σ V (τ + ln|F|) / n^α )^{1/(2−ϑ)} + 4 B c_B τ / n^α. [sent-154, score-0.368]
42 Besides constants, this oracle inequality is an exact analogue to the standard oracle inequality for ERM learning from i.i.d. processes. [sent-155, score-0.4]
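The following Python sketch simply evaluates the right-hand side of the finite-hypothesis-set oracle inequality displayed above; the Bernstein constants c_σ and c_B come from [9] and are not specified in this excerpt, so they, like all numeric inputs below, are placeholders:

    import numpy as np

    def finite_class_bound(approx_err, n, alpha, card_F, tau, V, theta, B, c_sigma, c_B):
        """Right-hand side of the displayed oracle inequality for a finite hypothesis set."""
        n_eff = n ** alpha  # the "effective number of observations"
        stochastic = (36.0 * c_sigma * V * (tau + np.log(card_F)) / n_eff) ** (1.0 / (2.0 - theta))
        return 3.0 * approx_err + 3.0 * stochastic + 4.0 * B * c_B * tau / n_eff

    for n in (10**3, 10**4, 10**5):  # the bound shrinks as the effective sample size n**alpha grows
        print(n, finite_class_bound(approx_err=0.01, n=n, alpha=0.8, card_F=1000,
                                    tau=3.0, V=16.0, theta=1.0, B=4.0,
                                    c_sigma=1.0, c_B=1.0))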
43 1 and additionally assume that there exist constants a > 0 and p ∈ (0, 1] such that ln N(F, ‖·‖_∞, ε) ≤ a ε^{−2p}, ε > 0. [sent-164, score-0.167]
44 Then there is c_{p,ϑ} > 0 only depending on p and ϑ such that the inequality of Theorem 2. [sent-165, score-0.084]
45 For the learning rates considered in the following examples, the exact value of c_{p,ϑ} is of no importance. [sent-167, score-0.093]
46 SVMs with the hinge loss or the pinball loss, and regularized boosting algorithms. [sent-172, score-0.248]
47 case and to [7] for a consistency result in the case of geometrically β-mixing observations. [sent-178, score-0.061]
48 Unfortunately, a detailed exposition of the learning rates resulting from Corollary 2. [sent-179, score-0.093]
49 3 for all these algorithms is clearly out of the scope of this paper, and hence we will only discuss learning rates for LS-SVMs. [sent-180, score-0.093]
50 However, the only reason we picked LS-SVMs is that they are one of the few methods for which both rates for learning from α-mixing processes and optimal rates in the i.i.d. case are known. [sent-181, score-0.243]
51 Given a regularization parameter λ > 0 and the least squares loss L(y, t) := (y − t)², the LS-SVM finds the unique solution f_{D_n,λ} = arg min_{f ∈ H} ( λ ‖f‖²_H + R_{L,D_n}(f) ). [sent-188, score-0.148]
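Up to the parametrization of the regularization term, this optimization problem is kernel ridge regression: assuming the empirical risk is the averaged squared loss, the representer theorem gives f_{D_n,λ} = Σ_i α_i k(·, x_i) with α = (K + λ n I)^{-1} y. A minimal numpy sketch (the Gaussian kernel and all numbers are chosen only for illustration):

    import numpy as np

    def gaussian_kernel(A, B, width=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * width ** 2))

    def ls_svm_fit(X, y, lam, width=1.0):
        """Closed-form LS-SVM solution: alpha = (K + lam * n * I)^{-1} y."""
        n = len(X)
        K = gaussian_kernel(X, X, width)
        alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
        return lambda X_new: gaussian_kernel(X_new, X, width) @ alpha

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(200, 1))
    y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(200)
    f = ls_svm_fit(X, y, lam=1e-3)
    print(np.mean((f(X) - np.sin(3.0 * X[:, 0])) ** 2))  # small in-sample error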
52 To describe the approximation properties of H, we further need the approximation error function A(λ) := inf_{f ∈ H} ( λ ‖f‖²_H + R_{L,P}(f) − R*_{L,P} ), λ > 0. [sent-189, score-0.033]
53 4 (Rates for least squares SVMs) Let X be a compact metric space, Y = [−1, 1], and Z and P as above. [sent-191, score-0.077]
54 Furthermore, let L be the least squares loss and H be the RKHS of a continuous kernel k over X. [sent-192, score-0.163]
55 Assume that the closed unit ball B_H of H satisfies ln N(B_H, ‖·‖_∞, ε) ≤ a ε^{−2p}, ε > 0,   (7) where a > 0 and p ∈ (0, 1] are some constants. [sent-193, score-0.135]
56 3 applied to F := λ^{−1/2} B_H shows that the LS-SVM using λ_n := n^{−αρ/β} learns with rate n^{−αρ}. [sent-197, score-0.027]
57 Let us compare this rate with other recent results: [17] establishes the learning rate n^{−2β/(β+3)}, whenever (2) is satisfied for some α. [sent-198, score-0.091]
58 However, a closer look shows that it depends on the confidence level 1 − 3Ce^{−τ} by a factor of e^τ rather than by the factor of τ appearing in our analysis, and hence these rates are not comparable. [sent-200, score-0.093]
59 Moreover, in the case α = 1, our rates are still faster whenever p ∈ (0, 1/3], which is e. [sent-201, score-0.113]
60 Moreover, [19] has recently established the rate n^{−αβ/(2p+1)},   (8) which is faster than ours if and only if β > (1+p)/(1+2p). [sent-207, score-0.08]
61 In particular, for highly smooth kernels such as the Gaussian RBF kernels, where p can be chosen arbitrarily close to 0, their rate is never faster. [sent-208, score-0.042]
62 1] for a generic description of this technique, can also be applied to our oracle inequality. [sent-212, score-0.138]
63 Due to space constraints and the fact that these rates require knowing α and β, we skip a detailed exposition. [sent-214, score-0.093]
64 5 (Almost optimal rates for least squares SVMs) Consider the situation of Example 2. [sent-218, score-0.17]
65 (9) As in [16], we can then bound B₀ ≤ λ^{(β−1)p}, and hence the SVM using λ_n := n^{−α/(β+2pβ+p)} learns with rate n^{−αβ/(β+2pβ+p)}, compared to the optimal rate n^{−β/(β+p)} in the i.i.d. case. [sent-220, score-0.079]
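To make this comparison concrete, a short Python sketch computing both rate exponents for a few illustrative parameter values (the values of α, β, and p are placeholders):

    def mixing_rate_exponent(alpha, beta, p):
        """Exponent of the rate n^{-alpha*beta/(beta + 2*p*beta + p)} obtained above."""
        return alpha * beta / (beta + 2.0 * p * beta + p)

    def iid_optimal_exponent(beta, p):
        """Exponent of the optimal i.i.d. rate n^{-beta/(beta + p)} from [16]."""
        return beta / (beta + p)

    for alpha, beta, p in [(1.0, 1.0, 0.1), (0.8, 1.0, 0.1), (1.0, 0.5, 0.5)]:
        print(alpha, beta, p,
              round(mixing_rate_exponent(alpha, beta, p), 3),
              round(iid_optimal_exponent(beta, p), 3))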
66 Unfortunately, we do not know whether the extra term 2ds/m is an artifact of our proof techniques, which are relatively lightweight compared to the heavy machinery used in the i.i.d. case. [sent-231, score-0.06]
67 Similarly, we do not know whether the Bernstein inequality for α-mixing processes used here, see Theorem 3. [sent-235, score-0.084]
68 1, is optimal, but it is the best inequality we could find in the literature. [sent-236, score-0.084]
69 However, if there is, or will be, a better version of this inequality, our oracle inequalities can be easily improved, since our techniques only require a generic form of Bernstein’s inequality. [sent-237, score-0.153]
70 6 In the examples above, the rates were achieved by picking particular regularization sequences that depend on both α and β, which, in turn, are almost never known in practice. [sent-239, score-0.112]
71 Fortunately, there exists an easy way to achieve the above rates without such knowledge. [sent-240, score-0.093]
72 3 for LS-SVMs shows that the learning rates of the Examples 2. [sent-244, score-0.093]
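One generic way of achieving such adaptivity in practice is to pick λ from a finite grid by a simple (time-ordered) training/validation split; the sketch below only illustrates this device under that assumption and is not code from the paper (ls_svm_fit refers to the earlier hypothetical LS-SVM sketch):

    import numpy as np

    def select_lambda_by_validation(X, y, lambda_grid, fit, loss, split=0.5):
        """Fit one model per lambda on the first part of the (time-ordered) sample
        and return the lambda with the smallest validation risk on the rest."""
        m = int(split * len(X))
        best_lam, best_risk = None, np.inf
        for lam in lambda_grid:
            f = fit(X[:m], y[:m], lam)
            risk = float(np.mean(loss(y[m:], f(X[m:]))))
            if risk < best_risk:
                best_lam, best_risk = lam, risk
        return best_lam

    # usage with the LS-SVM sketch above:
    # lam_hat = select_lambda_by_validation(X, y, np.logspace(-6, 0, 13),
    #                                       fit=ls_svm_fit,
    #                                       loss=lambda y_true, t: (y_true - t) ** 2)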
73 3 Proofs In the following, ⌊t⌋ denotes the largest integer n satisfying n ≤ t, and similarly, ⌈t⌉ denotes the smallest integer n satisfying n ≥ t. [sent-249, score-0.048]
74 The key result we need to prove the oracle inequality of Theorem 2. [sent-250, score-0.2]
75 1 is the following Bernstein type inequality for geometrically α-mixing processes, which was established in [9, Theorem 4. [sent-251, score-0.173]
76 Furthermore, let h : X × Y → R be a bounded measurable function for which there exist constants B > 0 and σ ≥ 0 such that E_P h = 0, E_P h² ≤ σ², and ‖h‖_∞ ≤ B. [sent-254, score-0.09]
77 Then, for all a, b ≥ 0, we have (qa)^{2/q} (q′b)^{2/q′} ≤ (a + b)² and ab ≤ a^q/q + b^{q′}/q′. [sent-266, score-0.031]
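Both elementary inequalities follow from Young's inequality; assuming, as in the omitted part of the lemma, that q, q′ > 1 are conjugate exponents with 1/q + 1/q′ = 1, a short verification reads:

    \begin{align*}
    (qa)^{1/q} (q'b)^{1/q'} &\le \frac{qa}{q} + \frac{q'b}{q'} = a + b
        && \text{(weighted AM--GM / Young)}, \\
    \text{hence}\quad (qa)^{2/q} (q'b)^{2/q'} &\le (a + b)^2, \\
    ab &\le \frac{a^q}{q} + \frac{b^{q'}}{q'}
        && \text{(Young's inequality)}.
    \end{align*}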
78 1: For f : X → R we define h_f := L ◦ f − L ◦ f*_{L,P}. [sent-268, score-0.165]
79 Inequality (11) applied to h := (h_{f₀} − h_{f̄₀}) − E_P(h_{f₀} − h_{f̄₀}) thus shows that E_{D_n}(h_{f₀} − h_{f̄₀}) − E_P(h_{f₀} − h_{f̄₀}) < √( τ c_σ B₀ E_P(h_{f₀} − h_{f̄₀}) / n^α ) + c_B B₀ τ / n^α holds with probability µ not less than 1 − Ce^{−τ}. [sent-273, score-0.033]
80 Moreover, using √(ab) ≤ a/2 + b/2, we find √( n^{−α} τ c_σ B₀ E_P(h_{f₀} − h_{f̄₀}) ) ≤ E_P(h_{f₀} − h_{f̄₀}) + n^{−α} c_σ B₀ τ/4, and consequently we have with probability µ not less than 1 − Ce^{−τ} that E_{D_n}(h_{f₀} − h_{f̄₀}) < 2 E_P(h_{f₀} − h_{f̄₀}) + 7c_B B₀ τ / (4n^α). [sent-274, score-0.07]
81 Since E_P h_{f̄₀} ≥ 0, this inequality also holds for ϑ = 0, and hence (11) shows that we have E_{D_n} h_{f̄₀} − E_P h_{f̄₀} < E_P h_{f̄₀} + ( c_σ V τ / n^α )^{1/(2−ϑ)} + 2c_B B τ / n^α   (15) with probability µ not less than 1 − Ce^{−τ}. [sent-279, score-0.117]
82 By combining this estimate with (14) and (13), we now obtain that with probability µ not less than 1 − 2Ce^{−τ} we have E_{D_n} h_{f₀} − E_P h_{f₀} < E_P h_{f₀} + ( c_σ V τ / n^α )^{1/(2−ϑ)} + 2c_B B τ / n^α + 7c_B B₀ τ / (4n^α),   (16) i.e. [sent-280, score-0.033]
83 , we have established a bound on the second term in (12). [sent-282, score-0.078]
84 Let us first consider the case n^α < 3c_B(τ + ln|C|). [sent-284, score-0.135]
85 It thus remains to consider the case n^α ≥ 3c_B(τ + ln|C|). [sent-286, score-0.135]
86 To establish a non-trivial bound on the term E_P h_{f̄_D} − E_{D_n} h_{f̄_D} in (12), we define the functions g_{f,r} := ( E_P h_{f̄} − h_{f̄} ) / ( E_P h_{f̄} + r ), f ∈ F, where r > 0 is a real number to be fixed later. [sent-287, score-0.52]
87 For f ∈ F, we then have ‖g_{f,r}‖_∞ ≤ 2Br^{−1}, and for ϑ > 0, q := 2/(2−ϑ), q′ := 2/ϑ, a := r, and b := E_P h_{f̄} ≠ 0, the first inequality of Lemma 3. [sent-288, score-0.249]
88 2 yields E_P g²_{f,r} ≤ E_P h²_{f̄} / ( E_P h_{f̄} + r )² ≤ (2 − ϑ)^{2−ϑ} ϑ^ϑ E_P h²_{f̄} / ( 4 r^{2−ϑ} ( E_P h_{f̄} )^ϑ ) ≤ V r^{ϑ−2}.   (17) [sent-289, score-0.33]
89 Moreover, for ϑ ∈ (0, 1] and E_P h_{f̄} = 0, we have E_P h²_{f̄} = 0 by the variance bound (6), which in turn implies E_P g²_{f,r} ≤ V r^{ϑ−2}. [sent-290, score-0.19]
90 Combining this with (18), we obtain, with probability µ not less than 1 − C e^{−τ}, E_P h_{f̄_{D_n,Υ}} − E_{D_n} h_{f̄_{D_n,Υ}} < ( E_P h_{f̄} + ε + r ) ( c_σ V (τ + ln|C|) / ( n^α r^{2−ϑ} ) + 2c_B B (τ + ln|C|) / ( n^α r ) ) + 2ε. [sent-295, score-0.45]
91 To this end, we first observe that for r := ( 36 c_σ V (τ + ln|C|) / n^α )^{1/(2−ϑ)}, we obtain, since 6 ≤ 36^{1/(2−ϑ)}, ( c_σ V τ / n^α )^{1/(2−ϑ)} ≤ r/6 cσ V (τ + ln |C|) 1 ≤ . [sent-298, score-0.27]
92 9 EP hfDn ,Υ + ε + r ¯ r 7cB B0 τ + + + 2ε + δ 3 4nα 3 holds with probability µ not less than 1 − 3Ce−τ . [sent-300, score-0.033]
93 Consequently, we have Υ(f_{D_n,Υ}) + E_P h_{f̄_{D_n,Υ}} < 3Υ(f₀) + 3E_P h_{f₀} + ( 36 c_σ V (τ + ln|C|) / n^α )^{1/(2−ϑ)} + 4c_B B₀ τ / n^α + 4ε + 2δ, i.e. [sent-301, score-0.135]
94 3: The result follows from minimizing the right-hand side of the oracle inequality of Theorem 2. [sent-305, score-0.2]
95 On the rate of convergence of regularized boosting classifiers. [sent-320, score-0.11]
96 Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. [sent-359, score-0.134]
97 Estimating conditional quantiles with the help of the pinball loss. [sent-418, score-0.06]
98 Learning rates of regularized regression for exponentially strongly mixing sequence. [sent-454, score-0.201]
99 Rates of convergence for empirical processes of stationary mixing sequences. [sent-461, score-0.141]
100 The performance bounds of learning machines based on exponentially strongly mixing sequences. [sent-468, score-0.095]
wordName wordTfidf (topN-words)
[('ep', 0.581), ('fdn', 0.462), ('edn', 0.359), ('hfdn', 0.325), ('hf', 0.165), ('ln', 0.135), ('oracle', 0.116), ('clipped', 0.105), ('rates', 0.093), ('inequality', 0.084), ('cb', 0.078), ('steinwart', 0.075), ('loss', 0.071), ('zi', 0.067), ('dn', 0.064), ('erm', 0.064), ('cp', 0.062), ('pinball', 0.06), ('squares', 0.058), ('mixing', 0.058), ('processes', 0.057), ('established', 0.053), ('clippable', 0.051), ('regularized', 0.05), ('bernstein', 0.049), ('theorem', 0.047), ('measurable', 0.043), ('moreover', 0.039), ('consequently', 0.039), ('satis', 0.038), ('svms', 0.037), ('bh', 0.036), ('geometrically', 0.036), ('bayreuth', 0.034), ('hfd', 0.034), ('hush', 0.034), ('ingo', 0.034), ('zin', 0.034), ('hinge', 0.034), ('corollary', 0.033), ('inf', 0.033), ('boosting', 0.033), ('constants', 0.032), ('ce', 0.032), ('rademacher', 0.03), ('alamos', 0.03), ('fi', 0.029), ('rate', 0.027), ('stationary', 0.026), ('notions', 0.026), ('consistency', 0.025), ('bound', 0.025), ('chapter', 0.025), ('los', 0.024), ('satisfying', 0.024), ('lemma', 0.024), ('regularizer', 0.024), ('machinery', 0.023), ('fd', 0.023), ('generic', 0.022), ('mohri', 0.021), ('quantile', 0.021), ('articles', 0.021), ('whenever', 0.02), ('heavy', 0.02), ('bounds', 0.02), ('satisfy', 0.019), ('rkhs', 0.019), ('picking', 0.019), ('risk', 0.019), ('least', 0.019), ('holds', 0.018), ('combining', 0.018), ('nancial', 0.018), ('px', 0.018), ('establishes', 0.017), ('accepted', 0.017), ('machines', 0.017), ('proof', 0.017), ('localization', 0.017), ('chains', 0.016), ('editors', 0.016), ('ab', 0.016), ('latter', 0.016), ('probably', 0.016), ('weakly', 0.016), ('bartlett', 0.016), ('brie', 0.016), ('kernels', 0.015), ('let', 0.015), ('blanchard', 0.015), ('sharpness', 0.015), ('aq', 0.015), ('christmann', 0.015), ('klivans', 0.015), ('lozano', 0.015), ('wonder', 0.015), ('inequalities', 0.015), ('less', 0.015), ('covering', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 94 nips-2009-Fast Learning from Non-i.i.d. Observations
Author: Ingo Steinwart, Andreas Christmann
Abstract: We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case. 1
2 0.10771108 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar
Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1
3 0.10203169 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems
Author: Marco Cuturi, Jean-philippe Vert, Alexandre D'aspremont
Abstract: We propose new methodologies to detect anomalies in discrete-time processes taking values in a probability space. These methods are based on the inference of functionals whose evaluations on successive states visited by the process are stationary and have low autocorrelations. Deviations from this behavior are used to flag anomalies. The candidate functionals are estimated in a subspace of a reproducing kernel Hilbert space associated with the original probability space considered. We provide experimental results on simulated datasets which show that these techniques compare favorably with other algorithms.
4 0.10134234 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
Author: Marcel V. Gerven, Botond Cseke, Robert Oostenveld, Tom Heskes
Abstract: We introduce a novel multivariate Laplace (MVL) distribution as a sparsity promoting prior for Bayesian source localization that allows the specification of constraints between and within sources. We represent the MVL distribution as a scale mixture that induces a coupling between source variances instead of their means. Approximation of the posterior marginals using expectation propagation is shown to be very efficient due to properties of the scale mixture representation. The computational bottleneck amounts to computing the diagonal elements of a sparse matrix inverse. Our approach is illustrated using a mismatch negativity paradigm for which MEG data and a structural MRI have been acquired. We show that spatial coupling leads to sources which are active over larger cortical areas as compared with an uncoupled prior. 1
5 0.098953158 78 nips-2009-Efficient Moments-based Permutation Tests
Author: Chunxiao Zhou, Huixia J. Wang, Yongmei M. Wang
Abstract: In this paper, we develop an efficient moments-based permutation test approach to improve the test s computational efficiency by approximating the permutation distribution of the test statistic with Pearson distribution series. This approach involves the calculation of the first four moments of the permutation distribution. We propose a novel recursive method to derive these moments theoretically and analytically without any permutation. Experimental results using different test statistics are demonstrated using simulated data and real data. The proposed strategy takes advantage of nonparametric permutation tests and parametric Pearson distribution approximation to achieve both accuracy and efficiency. 1 In t ro d u c t i o n Permutation tests are flexible nonparametric alternatives to parametric tests in small samples, or when the distribution of a test statistic is unknown or mathematically intractable. In permutation tests, except exchangeability, no other statistical assumptions are required. The p-values can be obtained by using the permutation distribution. Permutation tests are appealing in many biomedical studies, which often have limited observations with unknown distribution. They have been used successfully in structural MR image analysis [1, 2, 3], in functional MR image analysis [4], and in 3D face analysis [5]. There are three common approaches to construct the permutation distribution [6, 7, 8]: (1) exact permutation enumerating all possible arrangements; (2) approximate permutation based on random sampling from all possible permutations; (3) approximate permutation using the analytical moments of the exact permutation distribution under the null hypothesis. The main disadvantage of the exact permutation is the computational cost, due to the factorial increase in the number of permutations with the increasing number of subjects. The second technique often gives inflated type I errors caused by random sampling. When a large number of repeated tests are needed, the random permutation strategy is also computationally expensive to achieve satisfactory accuracy. Regarding the third approach, the exact permutation distribution may not have moments or moments with tractability. In most applications, it is not the existence but the derivation of moments that limits the third approach. To the best of our knowledge, there is no systematic and efficient way to derive the moments of the permutation distribution. Recently, Zhou [3] proposed a solution by converting the permutation of data to that of the statistic coefficients that are symmetric to the permutation. Since the test statistic coefficients usually have simple presentations, it is easier to track the permutation of the test statistic coefficients than that of data. However, this method requires the derivation of the permutation for each specific test statistic, which is not accessible to practical users. In this paper, we propose a novel strategy by employing a general theoretical method to derive the moments of the permutation distribution of any weighted v-statistics, for both univariate and multivariate data. We note that any moments of the permutation distribution for weighted v-statistics [9] can be considered as a summation of the product of data function term and index function term over a high dimensional index set and all possible permutations. 
Our key idea is to divide the whole index set into several permutation equivalent (see Definition 2) index subsets such that the summation of the data/index function term over all permutations is invariant within each subset and can be calculated without conducting any permutation. Then we can obtain the moments by summing up several subtotals. The proposed method can be extended to equivalent weighted v-statistics by replacing them with monotonic weighted v-statistics. This is due to the fact that only the order of test statistics of all permutations matters for obtaining the p-values, so that the monotonic weighted v-statistics shares the same p-value with the original test statistic. Given the first four moments, the permutation distribution can be well fitted by Pearson distribution series. The p-values are then obtained without conducting any real permutation. For multiple comparison of two-group difference, given the sample size n1 = 21 and n2 = 21, the number of tests m = 2,000, we need to conduct m×(n1 + n2 )!/n1 !/n2 ! 1.1×1015 permutations for the exact permutation test. Even for 20,000 random permutations per test, we still need m×20,000 4×107 permutations. Alternatively, our moments-based permutation method using Pearson distribution approximation only involves the calculation of the first four analytically-derived moments of exact permutation distributions to achieve high accuracy (see section 3). Instead of calculating test statistics in factorial scale with exact permutation, our moments-based permutation only requires computation of polynomial order. For example, the computational cost for univariate mean difference test statistic and modified multivariate Hotelling's T2 test statistics [8] are O(n) and O(n3 ), respectively, where n = n 1 + n2 . 2 M e t h o d o lo g y In this section, we shall mainly discuss how to calculate the moments of the permutation distribution for weighted v-statistics. For other test statistics, a possible solution is to replace them with their equivalent weighted v-statistics by monotonic transforms. The detailed discussion about equivalent test statistics can be found in [7, 8, 10]. 2.1 C o m p ut a t i o n a l c h a l l e n g e Let us first look at a toy example. Suppose we have a two-group univariate data x = ( x1 , L , xn1 , xn1 +1 ,L , xn1 + n2 ) , where the first n1 elements are in group A and the rest, n2 ,are in group B. For comparison of the two groups, the hypothesis is typically constructed as: H 0 : m A = mB vs. H a : m A ¹ m B , where m A , m B are the population means of the groups A n1 n i =1 i = n1 +1 and B, respectively. Define x A = å xi / n1 and xB = å xi / n2 as the sample means of two groups, where n=n1+n2. We choose the univariate group mean difference statistic, i.e., n T ( x) = x A - xB = å w(i ) xi , i =1 where the index function as the test w(i ) = 1/ n1 , if i Î {1, L , n1} and w(i ) = -1/ n2 , if i Î {n1 + 1, L, n} . Then the total number of all possible permutations of {1, L, n} is n!. To calculate the fourth moment of the permutation distribution, n n n n n 1 1 4 å ( å w(i ) xp (i ) ) = å å å å å w(i1 ) w(i2 ) w(i3 ) w(i4 )xp ( i1 ) xp ( i2 ) xp ( i3 ) xp ( i4 ) , n ! p ÎS n i =1 n ! p ÎS n i1 =1 i2 =1 i3 =1 i4 =1 where is the permutation operator and the symmetric group Sn [11] includes all distinct permutations. The above example shows that the moment calculation can be considered as a summation over all possible permutations and a large index set. 
It is noticeable that the computational challenge here is to go through the factorial level permutations and polynomial level indices. Ep (T 4 ( x ))= 2.2 P a r t i t i o n t h e i n de x s e t In this paper, we assume that the test statistic T can be expressed as a weighted v-statistic of n n i1 =1 id =1 degree d [9], that is, T ( x) = å L å w(i1 , L , id ) h( xi1 ,L , xid ) , where x = ( x1 , x 2 , L , x n ) T is a data with n observations, and w is a symmetric index function. h is a symmetric data function, i.e., invariant under permutation of (i1 ,L , id ) . Though the symmetry property is not required for our method, it helps reduce the computational cost. Here, each observation xk can be either univariate or multivariate. In the above toy example, d=1 and h is the identity function. Therefore, the r-th moment of the test statistic from the permutated data is: Ep (T r ( x)) = Ep ( å w(i1 ,L , id )h( xp (i1 ) ,L , xp ( id ) ))r i1 ,i2 ,L,id = Ep [ å i1(1) ,L, id (1) , r r k =1 k =1 { Õ w(i1(k ) ,L, id ( k ) ) Õ h( xp (i ( k ) ) ,L, xp (i ( k ) ) )}] . L i1( r ) ,L,id ( r ) d 1 Then we can exchange the summation order of permutations and that of indices, Ep (T r ( x)) = å i1(1) ,L, id (1) , L i1( r ) ,L,id ( r ) r r k =1 k =1 {( Õ w(i1( k ) ,L , id ( k ) )) Ep ( Õ h( xp (i ( k ) ) ,L, xp (i ( k ) ) ))}. d 1 Thus any moment of permutation distribution can be considered as a summation of the product of data function term and index function term over a high dimensional index set and all possible permutations. Since all possible permutations map any index value between 1 and n to all possible index r values from 1 to n with equal probability, Ep ( Õ h( xp (i ( k ) ) ,L , xp (i ( k ) ) )) , the summation of k =1 1 d data function over all permutations is only related to the equal/unequal relationship among indices. It is natural to divide the whole index set U = {i1 ,L , id }r = {(i1(1) , L , id (1) ), L (r ) , ( i1 , L , id (r ) r )} into the union of disjoint index subsets, in which Ep ( Õ h( xp (i ( k ) ) ,L , xp (i ( k ) ) )) k =1 1 d is invariant. Definition 1. Since h is a symmetric function, two index elements (i1 ,L , id ) and ( j1 ,L , jd ) are said to be equivalent if they are the same up to the order. For example, for d = 3, (1, 4, 5) = (1,5,4) = (4,1,5) = (4,5,1) = (5,1,4) = (5,4,1). Definition 2. Two indices {(i1(1) , L , id (1) ), L , (i1( r ) , L , id ( r ) )} and {( j1(1) , L , jd (1) ), L , ( j1( r ) , L , jd ( r ) )} are said to be permutation equivalent/ if there exists a permutation p Î Sn such that {(p (i1(1) ), L , p (id (1) )), L , (p (i1( r ) ), L , p (id ( r ) ))} = {( j1(1) , L , jd (1) ), L , ( j1( r ) , L , jd ( r ) )} . Here
6 0.085911199 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
7 0.071686745 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs
8 0.065448612 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
9 0.063114651 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
10 0.057151366 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
11 0.05581589 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
12 0.054321341 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
13 0.054088019 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
14 0.047956016 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation
15 0.046150949 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm
16 0.045623712 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
17 0.044705112 14 nips-2009-A Parameter-free Hedging Algorithm
18 0.04441946 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
19 0.044415928 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
20 0.042857748 112 nips-2009-Human Rademacher Complexity
topicId topicWeight
[(0, -0.12), (1, 0.111), (2, 0.029), (3, 0.041), (4, 0.013), (5, -0.05), (6, -0.003), (7, -0.042), (8, 0.001), (9, -0.014), (10, 0.043), (11, -0.017), (12, -0.031), (13, -0.033), (14, 0.068), (15, -0.013), (16, -0.015), (17, -0.013), (18, 0.129), (19, -0.011), (20, 0.01), (21, 0.07), (22, -0.11), (23, -0.071), (24, -0.062), (25, -0.067), (26, -0.053), (27, 0.012), (28, 0.02), (29, 0.079), (30, 0.006), (31, -0.011), (32, -0.043), (33, -0.064), (34, -0.135), (35, -0.073), (36, 0.121), (37, -0.015), (38, 0.014), (39, -0.092), (40, -0.026), (41, -0.052), (42, -0.136), (43, -0.076), (44, -0.053), (45, 0.067), (46, -0.012), (47, -0.065), (48, -0.154), (49, 0.122)]
simIndex simValue paperId paperTitle
same-paper 1 0.93126673 94 nips-2009-Fast Learning from Non-i.i.d. Observations
Author: Ingo Steinwart, Andreas Christmann
Abstract: We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case. 1
2 0.50347412 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan
Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1
3 0.4719032 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems
Author: Marco Cuturi, Jean-philippe Vert, Alexandre D'aspremont
Abstract: We propose new methodologies to detect anomalies in discrete-time processes taking values in a probability space. These methods are based on the inference of functionals whose evaluations on successive states visited by the process are stationary and have low autocorrelations. Deviations from this behavior are used to flag anomalies. The candidate functionals are estimated in a subspace of a reproducing kernel Hilbert space associated with the original probability space considered. We provide experimental results on simulated datasets which show that these techniques compare favorably with other algorithms.
4 0.43786415 78 nips-2009-Efficient Moments-based Permutation Tests
Author: Chunxiao Zhou, Huixia J. Wang, Yongmei M. Wang
Abstract: In this paper, we develop an efficient moments-based permutation test approach to improve the test s computational efficiency by approximating the permutation distribution of the test statistic with Pearson distribution series. This approach involves the calculation of the first four moments of the permutation distribution. We propose a novel recursive method to derive these moments theoretically and analytically without any permutation. Experimental results using different test statistics are demonstrated using simulated data and real data. The proposed strategy takes advantage of nonparametric permutation tests and parametric Pearson distribution approximation to achieve both accuracy and efficiency. 1 In t ro d u c t i o n Permutation tests are flexible nonparametric alternatives to parametric tests in small samples, or when the distribution of a test statistic is unknown or mathematically intractable. In permutation tests, except exchangeability, no other statistical assumptions are required. The p-values can be obtained by using the permutation distribution. Permutation tests are appealing in many biomedical studies, which often have limited observations with unknown distribution. They have been used successfully in structural MR image analysis [1, 2, 3], in functional MR image analysis [4], and in 3D face analysis [5]. There are three common approaches to construct the permutation distribution [6, 7, 8]: (1) exact permutation enumerating all possible arrangements; (2) approximate permutation based on random sampling from all possible permutations; (3) approximate permutation using the analytical moments of the exact permutation distribution under the null hypothesis. The main disadvantage of the exact permutation is the computational cost, due to the factorial increase in the number of permutations with the increasing number of subjects. The second technique often gives inflated type I errors caused by random sampling. When a large number of repeated tests are needed, the random permutation strategy is also computationally expensive to achieve satisfactory accuracy. Regarding the third approach, the exact permutation distribution may not have moments or moments with tractability. In most applications, it is not the existence but the derivation of moments that limits the third approach. To the best of our knowledge, there is no systematic and efficient way to derive the moments of the permutation distribution. Recently, Zhou [3] proposed a solution by converting the permutation of data to that of the statistic coefficients that are symmetric to the permutation. Since the test statistic coefficients usually have simple presentations, it is easier to track the permutation of the test statistic coefficients than that of data. However, this method requires the derivation of the permutation for each specific test statistic, which is not accessible to practical users. In this paper, we propose a novel strategy by employing a general theoretical method to derive the moments of the permutation distribution of any weighted v-statistics, for both univariate and multivariate data. We note that any moments of the permutation distribution for weighted v-statistics [9] can be considered as a summation of the product of data function term and index function term over a high dimensional index set and all possible permutations. 
Our key idea is to divide the whole index set into several permutation equivalent (see Definition 2) index subsets such that the summation of the data/index function term over all permutations is invariant within each subset and can be calculated without conducting any permutation. Then we can obtain the moments by summing up several subtotals. The proposed method can be extended to equivalent weighted v-statistics by replacing them with monotonic weighted v-statistics. This is due to the fact that only the order of test statistics of all permutations matters for obtaining the p-values, so that the monotonic weighted v-statistics shares the same p-value with the original test statistic. Given the first four moments, the permutation distribution can be well fitted by Pearson distribution series. The p-values are then obtained without conducting any real permutation. For multiple comparison of two-group difference, given the sample size n1 = 21 and n2 = 21, the number of tests m = 2,000, we need to conduct m×(n1 + n2 )!/n1 !/n2 ! 1.1×1015 permutations for the exact permutation test. Even for 20,000 random permutations per test, we still need m×20,000 4×107 permutations. Alternatively, our moments-based permutation method using Pearson distribution approximation only involves the calculation of the first four analytically-derived moments of exact permutation distributions to achieve high accuracy (see section 3). Instead of calculating test statistics in factorial scale with exact permutation, our moments-based permutation only requires computation of polynomial order. For example, the computational cost for univariate mean difference test statistic and modified multivariate Hotelling's T2 test statistics [8] are O(n) and O(n3 ), respectively, where n = n 1 + n2 . 2 M e t h o d o lo g y In this section, we shall mainly discuss how to calculate the moments of the permutation distribution for weighted v-statistics. For other test statistics, a possible solution is to replace them with their equivalent weighted v-statistics by monotonic transforms. The detailed discussion about equivalent test statistics can be found in [7, 8, 10]. 2.1 C o m p ut a t i o n a l c h a l l e n g e Let us first look at a toy example. Suppose we have a two-group univariate data x = ( x1 , L , xn1 , xn1 +1 ,L , xn1 + n2 ) , where the first n1 elements are in group A and the rest, n2 ,are in group B. For comparison of the two groups, the hypothesis is typically constructed as: H 0 : m A = mB vs. H a : m A ¹ m B , where m A , m B are the population means of the groups A n1 n i =1 i = n1 +1 and B, respectively. Define x A = å xi / n1 and xB = å xi / n2 as the sample means of two groups, where n=n1+n2. We choose the univariate group mean difference statistic, i.e., n T ( x) = x A - xB = å w(i ) xi , i =1 where the index function as the test w(i ) = 1/ n1 , if i Î {1, L , n1} and w(i ) = -1/ n2 , if i Î {n1 + 1, L, n} . Then the total number of all possible permutations of {1, L, n} is n!. To calculate the fourth moment of the permutation distribution, n n n n n 1 1 4 å ( å w(i ) xp (i ) ) = å å å å å w(i1 ) w(i2 ) w(i3 ) w(i4 )xp ( i1 ) xp ( i2 ) xp ( i3 ) xp ( i4 ) , n ! p ÎS n i =1 n ! p ÎS n i1 =1 i2 =1 i3 =1 i4 =1 where is the permutation operator and the symmetric group Sn [11] includes all distinct permutations. The above example shows that the moment calculation can be considered as a summation over all possible permutations and a large index set. 
It is noticeable that the computational challenge here is to go through the factorial level permutations and polynomial level indices. Ep (T 4 ( x ))= 2.2 P a r t i t i o n t h e i n de x s e t In this paper, we assume that the test statistic T can be expressed as a weighted v-statistic of n n i1 =1 id =1 degree d [9], that is, T ( x) = å L å w(i1 , L , id ) h( xi1 ,L , xid ) , where x = ( x1 , x 2 , L , x n ) T is a data with n observations, and w is a symmetric index function. h is a symmetric data function, i.e., invariant under permutation of (i1 ,L , id ) . Though the symmetry property is not required for our method, it helps reduce the computational cost. Here, each observation xk can be either univariate or multivariate. In the above toy example, d=1 and h is the identity function. Therefore, the r-th moment of the test statistic from the permutated data is: Ep (T r ( x)) = Ep ( å w(i1 ,L , id )h( xp (i1 ) ,L , xp ( id ) ))r i1 ,i2 ,L,id = Ep [ å i1(1) ,L, id (1) , r r k =1 k =1 { Õ w(i1(k ) ,L, id ( k ) ) Õ h( xp (i ( k ) ) ,L, xp (i ( k ) ) )}] . L i1( r ) ,L,id ( r ) d 1 Then we can exchange the summation order of permutations and that of indices, Ep (T r ( x)) = å i1(1) ,L, id (1) , L i1( r ) ,L,id ( r ) r r k =1 k =1 {( Õ w(i1( k ) ,L , id ( k ) )) Ep ( Õ h( xp (i ( k ) ) ,L, xp (i ( k ) ) ))}. d 1 Thus any moment of permutation distribution can be considered as a summation of the product of data function term and index function term over a high dimensional index set and all possible permutations. Since all possible permutations map any index value between 1 and n to all possible index r values from 1 to n with equal probability, Ep ( Õ h( xp (i ( k ) ) ,L , xp (i ( k ) ) )) , the summation of k =1 1 d data function over all permutations is only related to the equal/unequal relationship among indices. It is natural to divide the whole index set U = {i1 ,L , id }r = {(i1(1) , L , id (1) ), L (r ) , ( i1 , L , id (r ) r )} into the union of disjoint index subsets, in which Ep ( Õ h( xp (i ( k ) ) ,L , xp (i ( k ) ) )) k =1 1 d is invariant. Definition 1. Since h is a symmetric function, two index elements (i1 ,L , id ) and ( j1 ,L , jd ) are said to be equivalent if they are the same up to the order. For example, for d = 3, (1, 4, 5) = (1,5,4) = (4,1,5) = (4,5,1) = (5,1,4) = (5,4,1). Definition 2. Two indices {(i1(1) , L , id (1) ), L , (i1( r ) , L , id ( r ) )} and {( j1(1) , L , jd (1) ), L , ( j1( r ) , L , jd ( r ) )} are said to be permutation equivalent/ if there exists a permutation p Î Sn such that {(p (i1(1) ), L , p (id (1) )), L , (p (i1( r ) ), L , p (id ( r ) ))} = {( j1(1) , L , jd (1) ), L , ( j1( r ) , L , jd ( r ) )} . Here
5 0.4266938 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
Author: Matthias Hein
Abstract: Motivated by recent developments in manifold-valued regression we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output including several robust versions. Depending on the choice of the output space and the metric the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output and even some cases of structured output learning. In this paper we focus on the case of regression with manifold-valued input and output. We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output and illustrate the robustness properties of the estimators with experiments. 1
6 0.40635231 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization
7 0.38693088 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
8 0.38526842 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
9 0.38314077 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation
10 0.37616721 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
11 0.37425005 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
12 0.36906821 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
13 0.36681437 238 nips-2009-Submanifold density estimation
14 0.3526476 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
15 0.35029906 71 nips-2009-Distribution-Calibrated Hierarchical Classification
16 0.3481738 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
17 0.34734285 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
18 0.33555272 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
19 0.33376676 180 nips-2009-On the Convergence of the Concave-Convex Procedure
20 0.32592663 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis
topicId topicWeight
[(24, 0.098), (25, 0.075), (35, 0.043), (36, 0.102), (39, 0.024), (58, 0.109), (61, 0.021), (71, 0.05), (86, 0.064), (91, 0.013), (96, 0.28)]
simIndex simValue paperId paperTitle
1 0.81233525 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence
Author: Wei Bian, Dacheng Tao
Abstract: In this paper, we study the manifold regularization for the Sliced Inverse Regression (SIR). The manifold regularization improves the standard SIR in two aspects: 1) it encodes the local geometry for SIR and 2) it enables SIR to deal with transductive and semi-supervised learning problems. We prove that the proposed graph Laplacian based regularization is convergent at rate root-n. The projection directions of the regularized SIR are optimized by using a conjugate gradient method on the Grassmann manifold. Experimental results support our theory.
2 0.7707001 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
Author: Baback Moghaddam, Emtiyaz Khan, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call “Neighborhood Fusion” (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.
same-paper 3 0.76806104 94 nips-2009-Fast Learning from Non-i.i.d. Observations
Author: Ingo Steinwart, Andreas Christmann
Abstract: We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case. 1
4 0.76073784 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies
Author: Arno Onken, Steffen Grünewälder, Klaus Obermayer
Abstract: The linear correlation coefficient is typically used to characterize and analyze dependencies of neural spike counts. Here, we show that the correlation coefficient is in general insufficient to characterize these dependencies. We construct two neuron spike count models with Poisson-like marginals and vary their dependence structure using copulas. To this end, we construct a copula that allows to keep the spike counts uncorrelated while varying their dependence strength. Moreover, we employ a network of leaky integrate-and-fire neurons to investigate whether weakly correlated spike counts with strong dependencies are likely to occur in real networks. We find that the entropy of uncorrelated but dependent spike count distributions can deviate from the corresponding distribution with independent components by more than 25 % and that weakly correlated but strongly dependent spike counts are very likely to occur in biological networks. Finally, we introduce a test for deciding whether the dependence structure of distributions with Poissonlike marginals is well characterized by the linear correlation coefficient and verify it for different copula-based models. 1
5 0.70178974 154 nips-2009-Modeling the spacing effect in sequential category learning
Author: Hongjing Lu, Matthew Weiden, Alan L. Yuille
Abstract: We develop a Bayesian sequential model for category learning. The sequential model updates two category parameters, the mean and the variance, over time. We define conjugate temporal priors to enable closed form solutions to be obtained. This model can be easily extended to supervised and unsupervised learning involving multiple categories. To model the spacing effect, we introduce a generic prior in the temporal updating stage to capture a learning preference, namely, less change for repetition and more change for variation. Finally, we show how this approach can be generalized to efficiently perform model selection to decide whether observations are from one or multiple categories.
6 0.59968287 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
7 0.594414 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
8 0.58773881 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
9 0.58730644 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
10 0.58696079 147 nips-2009-Matrix Completion from Noisy Entries
11 0.58624423 55 nips-2009-Compressed Least-Squares Regression
12 0.5856207 122 nips-2009-Label Selection on Graphs
13 0.58521289 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
14 0.58442163 180 nips-2009-On the Convergence of the Concave-Convex Procedure
15 0.58438605 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
16 0.58298719 71 nips-2009-Distribution-Calibrated Hierarchical Classification
17 0.57959723 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
18 0.57920438 78 nips-2009-Efficient Moments-based Permutation Tests
19 0.57899493 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
20 0.57894486 97 nips-2009-Free energy score space