nips nips2010 nips2010-205 knowledge-graph by maker-knowledge-mining

205 nips-2010-Permutation Complexity Bound on Out-Sample Error


Source: pdf

Author: Malik Magdon-Ismail

Abstract: We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based (like the maximum discrepancy) on dependent sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. [sent-3, score-0.732]

2 The permutation complexity is based (like the maximum discrepancy) on dependent sampling. [sent-4, score-0.606]

3 We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated. [sent-5, score-0.616]

4 1 Introduction. Assume a standard setting with data $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $(x_i, y_i)$ are sampled iid from the joint distribution $p(x, y)$ on $\mathbb{R}^d \times \{\pm 1\}$. [sent-6, score-0.32]

5 We assume the 0-1 loss, so the in-sample error is $e_{in}(h) = \frac{1}{2n}\sum_{i=1}^{n}\left(1 - y_i h(x_i)\right)$. [sent-8, score-0.525]

6 The out-sample error is $e_{out}(h) = \frac{1}{2}E\left[(1 - y\,h(x))\right]$; the expectation is over the joint distribution $p(x, y)$. [sent-9, score-0.296]

7 To do so, we will bound $|e_{out}(h) - e_{in}(h)|$ uniformly over $H$ for all distributions $p(x, y)$; however, the bound itself will depend on the data, and hence the distribution. [sent-11, score-0.482]

8 The data dependent permutation complexity for $H$ is defined by: $P_H(n, D) = E_\pi\left[\max_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_{\pi_i} h(x_i)\right]$. [sent-13, score-0.501]

9 Here, $\pi$ is a uniformly random permutation on $\{1, \ldots, n\}$. [sent-14, score-0.474]

10 $P_H(n, D)$ is an intuitively plausible measure of the complexity of a model, measuring its ability to correlate with a random permutation of the target values. [sent-18, score-0.557]
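Since the expectation over all $n!$ permutations is intractable, $P_H(n, D)$ is naturally approximated by Monte Carlo over a few random permutations. A minimal sketch in Python, using a toy class of one-dimensional threshold stumps; the class, data, and function names (`stump_correlation`, `permutation_complexity`) are our illustrative choices, not the paper's:

```python
import random

def stump_correlation(x, y):
    """max over threshold stumps h(x) = s*sign(x - t) of (1/n) sum_i y_i h(x_i).
    Enumerates the n+1 distinct cut positions and both orientations s = +/-1."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    best = -1.0
    for cut in range(n + 1):  # the first `cut` sorted points are labeled -1
        corr = sum(-y[order[i]] if i < cut else y[order[i]] for i in range(n)) / n
        best = max(best, corr, -corr)  # -corr covers the flipped orientation
    return best

def permutation_complexity(x, y, n_perms=200, seed=0):
    """Monte Carlo estimate of P_H(n, D) = E_pi max_h (1/n) sum_i y_{pi_i} h(x_i)."""
    rng = random.Random(seed)
    total = 0.0
    yp = list(y)
    for _ in range(n_perms):
        rng.shuffle(yp)  # a uniformly random permutation of the targets
        total += stump_correlation(x, yp)
    return total / n_perms
```

For the stump class the inner maximization is exact (enumerate the $n+1$ thresholds and both orientations); for richer classes it would itself be an empirical risk minimization, as discussed later in the paper.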

11 Analogously, we may define the bootstrap complexity, using the bootstrap distribution $B$ on $y$, where each sample $y_i^B$ is independent and uniformly random over $y_1, \ldots, y_n$: [sent-23, score-0.653]

12 $B_H(n, D) = E_B\left[\max_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_i^B h(x_i)\right]$. [sent-26, score-0.187]

13 The maximum discrepancy complexity measure $\Delta_H(n, D)$ is similar to the Rademacher complexity, with the expectation over $r$ restricted to those $r$ satisfying $\sum_{i=1}^{n} r_i = 0$: $\Delta_H(n, D) = E_r\left[\max_{h\in H}\frac{1}{n}\sum_{i=1}^{n} r_i y_i h(x_i)\right]$. [sent-32, score-0.839]

14 When $\bar{y} = 0$, the permutation complexity is the maximum discrepancy; the permutation complexity is to maximum discrepancy as the bootstrap complexity is to the Rademacher complexity. [sent-33, score-1.535]

15 The permutation complexity maintains a little more information regarding the distribution. [sent-34, score-0.557]
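The four measures mentioned so far differ only in how the $\pm 1$ sequence multiplying $h(x_i)$ is drawn. A hedged sketch that makes the comparison concrete, representing a finite hypothesis class by its output vectors on the sample (our simplification for illustration; none of these names come from the paper):

```python
import random

def complexity(H_outputs, ys, n_draws=500, mode="perm", seed=1):
    """Estimate E[max_h (1/n) sum_i s_i h(x_i)], where s is drawn by `mode`:
    'perm' - s is a uniformly random permutation of y       (permutation)
    'boot' - s_i are iid uniform draws from y               (bootstrap)
    'rad'  - s_i are iid uniform +/-1                       (Rademacher)
    'disc' - s has sum(s) == 0, i.e. balanced signs         (max discrepancy; n even)
    H_outputs: list of tuples h = (h(x_1), ..., h(x_n)) in {-1,+1}^n."""
    rng = random.Random(seed)
    n = len(ys)
    half = [1] * (n // 2) + [-1] * (n - n // 2)
    total = 0.0
    for _ in range(n_draws):
        if mode == "perm":
            s = rng.sample(ys, n)                          # random permutation of y
        elif mode == "boot":
            s = [rng.choice(ys) for _ in range(n)]         # iid draws from y
        elif mode == "rad":
            s = [rng.choice((-1, 1)) for _ in range(n)]    # iid signs
        else:  # 'disc'
            s = rng.sample(half, n)                        # random zero-sum signs
        total += max(sum(si * hi for si, hi in zip(s, h)) / n for h in H_outputs)
    return total / n_draws
```

For balanced $y$ ($\bar{y} = 0$) the 'perm' and 'disc' draws have the same distribution, matching the remark above that the permutation complexity then equals the maximum discrepancy.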

16 Indeed we prove a uniform bound very similar to the uniform bound obtained using the Rademacher complexity. Theorem 1 With probability at least $1 - \delta$, for every $h \in H$, $e_{out}(h) \le e_{in}(h) + P_H(n, D) + 13\sqrt{\frac{1}{2n}\ln\frac{6}{\delta}}$. [sent-35, score-0.371]

17 The probability in this theorem is with respect to the data distribution. [sent-36, score-0.591]

18 Using our same proof technique, one can also obtain a similar uniform bound with the bootstrap complexity, where the samples are independent but drawn according to the data. [sent-38, score-0.363]

19 The proof starts with the standard ghost sample and symmetrization argument. [sent-39, score-0.207]

20 We then need to handle the data dependent sampling in the complexity measure, and this is done by introducing a second ghost data set to govern the sampling. [sent-40, score-0.32]

21 The crucial aspect about sampling according to a second ghost data set is that the samples are now independent of the data; this is acceptable, provided the two methods of sampling are close enough; this is what constitutes the meat of the proof given in Section 2. [sent-41, score-0.255]

22 For a given permutation $\pi$, one can compute $\max_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_{\pi_i} h(x_i)$ using empirical risk minimization; however, computing the expectation over all permutations exactly is an exponential task and is not feasible. [sent-43, score-0.672]

23 Fortunately, we can establish that the permutation complexity is concentrated around its expectation, which means that in principle a single permutation suffices to compute the permutation complexity. [sent-44, score-1.484]

24 Theorem 2 For an absolute constant $c \le 6 + \sqrt{2/\ln 2}$, with probability at least $1 - \delta$, $P_H(n, D) \le \sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_{\pi_i} h(x_i) + c\sqrt{\frac{1}{2n}\ln\frac{3}{\delta}}$. [sent-46, score-0.444]

25 It is easy to show concentration for the bootstrap complexity about its expectation – this follows from McDiarmid’s inequality because the samples are independent. [sent-50, score-0.475]

26 The complication with the permutation complexity is that the samples are not independent. [sent-51, score-0.557]

27 Nevertheless, we can show the concentration indirectly by first relating the two complexities for any data set, and then using the concentration of the bootstrap complexity (see Section 2. [sent-52, score-0.492]

28 For a single random permutation, with probability at least $1 - \delta$, $e_{out}(h) \le e_{in}(h) + \sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_{\pi_i} h(x_i) + O\left(\sqrt{\frac{1}{n}\ln\frac{1}{\delta}}\right)$. [sent-55, score-0.883]

29 Asymptotically, one random permutation suffices; in practice, one should average over a few. [sent-56, score-0.452]
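The concentration claim is easy to probe numerically: single-permutation estimates cluster tightly around their mean, so averaging a handful is already stable. A sketch with an artificial class of fixed random classifiers (illustrative only; not the paper's experimental setup):

```python
import random
import statistics

def perm_estimate(H_outputs, ys, rng):
    """sup_h (1/n) sum_i y_{pi_i} h(x_i) for one random permutation pi."""
    n = len(ys)
    s = rng.sample(ys, n)  # a uniformly random permutation of the labels
    return max(sum(si * hi for si, hi in zip(s, h)) / n for h in H_outputs)

rng = random.Random(0)
n = 100
ys = [1] * 50 + [-1] * 50
# 20 fixed "classifiers", each represented by its +/-1 outputs on the sample
H = [tuple(rng.choice((-1, 1)) for _ in range(n)) for _ in range(20)]

estimates = [perm_estimate(H, ys, rng) for _ in range(300)]
mean = statistics.mean(estimates)
spread = statistics.pstdev(estimates)  # small: single permutations concentrate
```

Averaging a few of these `estimates` reduces the already-small spread further, which is the practical recipe suggested in the text.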

30 Indeed, a permutation based validation estimate for model selection has been extensively tested (see Magdon-Ismail and Mertsalov (2010) for details); for classification, this permutation estimate is the permutation complexity after removing a bias term. [sent-57, score-1.529]

31 We restate those results here, comparing model selection using the permutation estimate versus using the Rademacher complexity (using real data sets from the UCI Machine Learning repository (Asuncion and Newman, 2007)). [sent-59, score-0.584]

32 The permutation complexity appears to dominate most of the time (especially when $n$ is small); and, when it fails to dominate, it is as good as or only slightly worse than the Rademacher estimate. [sent-146, score-0.557]

33 Similarly, (see Lemma 5), the bootstrap and permutation complexities are equal, asymptotically. [sent-149, score-0.721]

34 An intuition for why the permutation complexity performs relatively well is that it maintains more of the true data distribution. [sent-151, score-0.557]

35 Indeed, the permutation method for validation was found to work well empirically, even in regression (Magdon-Ismail and Mertsalov, 2010); however, our permutation complexity bound only applies to classification. [sent-152, score-1.109]

36 Can the permutation complexity bound be extended beyond classification to (for example) regression with bounded loss? [sent-154, score-0.616]

37 The permutation complexity displays a bias for severely unbalanced data; can this bias be removed? [sent-155, score-0.557]

38 The permutation complexity uses a “sampled” data set on which to compute the complexity; other than this superficial similarity, the estimates are inherently different. [sent-164, score-0.557]

39 The most celebrated uniform bound on generalization error is the distribution independent bound of Vapnik-Chervonenkis (VC-bound) (Vapnik and Chervonenkis, 1971). [sent-166, score-0.208]

40 Several data dependent bounds have already been mentioned, which can typically be estimated in-sample via optimization: maximum discrepancy (Bartlett et al. [sent-170, score-0.162]

41 In practice, this penalty for the complexity of model selection is ignored (as in Bartlett et al. [sent-176, score-0.154]

42 (2005) show concentration for a permutation based test of significance for the improved performance of a more complex model, using the Rademacher complexity. [sent-182, score-0.511]

43 We directly give a uniform bound for the out-sample error in terms of a permutation complexity, answering a question posed in (Golland et al. [sent-183, score-0.602]

44 , 2005) which asks whether there is a direct link between permutation statistics and generalization errors. [sent-184, score-0.452]

45 Indeed, Magdon-Ismail and Mertsalov (2010) construct a permutation estimate for validation which they empirically test in both classification and regression problems. [sent-185, score-0.493]

46 For classification, their estimate is related to the permutation complexity. [sent-186, score-0.452]

47 (2002) give a uniform bound using the maximum discrepancy which is in some sense a uniform bound based on a sampling without replacement (dependent sampling); however, the sampling distribution is fixed, independent of the data. [sent-189, score-0.435]

48 The discrepancy automatically drops out from using the ghost sample; this does not happen with data dependent permutation sampling, which is where the difficulty lies. [sent-193, score-0.723]

49 We will adapt the standard ghost sample approach in VC-type proofs and the symmetrization trick in Giné and Zinn (1984), which has greatly simplified VC-style proofs. [sent-195, score-0.174]

50 Our main bounding tool will be McDiarmid's inequality: Lemma 1 (McDiarmid (1989)) Let $X_i \in A_i$ be independent; suppose $f: \prod_i A_i \to \mathbb{R}$ satisfies $\sup_{x_1,\ldots,x_n,\,x_i'}\left|f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x_i',\ldots,x_n)\right| \le c_i$ for every $i$. [sent-197, score-0.18]

51 Then, with probability at least $1 - \delta$, $f \le Ef + \sqrt{\frac{1}{2}\sum_{i=1}^{n} c_i^2 \ln\frac{1}{\delta}}$; we also obtain $Ef \le f + \sqrt{\frac{1}{2}\sum_{i=1}^{n} c_i^2 \ln\frac{1}{\delta}}$. [sent-216, score-0.161]

52 1 Permutation Complexity. The out-sample permutation complexity of a model is: $P_H(n) = E_D\left[P_H(n, D)\right] = E_{D,\pi}\left[\max_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_{\pi_i} h(x_i)\right]$, where the expectation is over the data $D = (x_1, y_1), \ldots, (x_n, y_n)$. [sent-219, score-0.602]

53 Proof: For any permutation $\pi$ and every $h \in H$, the sum $\sum_{i=1}^{n} y_{\pi_i} h(x_i)$ changes by at most 4 in going from $D$ to $D'$; thus, the maximum over $h \in H$ changes by at most 4. [sent-225, score-0.51]

54 Corollary 1 With probability at least $1 - \delta$, $P_H(n) \le P_H(n, D) + 4\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$. [sent-227, score-0.161]

55 Since $e_{in}(h) = \frac{1}{2}\left(1 - \frac{1}{n}\sum_{i=1}^{n} y_i h(x_i)\right)$, the empirical risk minimizer $g^\pi$ on the permuted targets $y_\pi$ can be used to compute $P_H(n, D)$ for a particular permutation $\pi$. [sent-228, score-0.99]
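Concretely, since $e_{in}(h) = \frac{1}{2}\left(1 - \frac{1}{n}\sum_i y_i h(x_i)\right)$, minimizing the in-sample error on the permuted targets is the same as maximizing the permuted correlation: $\max_h \frac{1}{n}\sum_i y_{\pi_i} h(x_i) = 1 - 2\min_h e_{in}^\pi(h)$. A sketch verifying this identity on a finite class (our illustrative stand-in for a real ERM; the names are ours):

```python
import random

def ein(h, ys):
    """0-1 in-sample error e_in(h) = (1/2n) sum_i (1 - y_i h(x_i))."""
    return sum(1 - yi * hi for yi, hi in zip(ys, h)) / (2 * len(ys))

def max_correlation(H_outputs, ys):
    """max_h (1/n) sum_i y_i h(x_i) over a finite class of output vectors."""
    return max(sum(yi * hi for yi, hi in zip(ys, h)) / len(ys) for h in H_outputs)

rng = random.Random(3)
n = 40
y = [rng.choice((-1, 1)) for _ in range(n)]
H = [tuple(rng.choice((-1, 1)) for _ in range(n)) for _ in range(8)]
y_perm = rng.sample(y, n)  # permuted targets y_pi

# ERM on the permuted targets yields the complexity term:
# max_h (1/n) sum_i y_{pi_i} h(x_i) = 1 - 2 * min_h e_in^pi(h)
erm_value = min(ein(h, y_perm) for h in H)
assert abs(max_correlation(H, y_perm) - (1 - 2 * erm_value)) < 1e-12
```

The identity holds for any labels in $\{\pm 1\}$, which is why a single ERM run per permutation suffices to evaluate the complexity estimate.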

56 2 Bounding the Out-Sample Error. To bound $\sup_{h\in H}\{e_{out}(h) - e_{in}(h)\}$, we first use the standard ghost sample and symmetrization arguments typical of modern generalization error proofs (see for example Bartlett and Mendelson (2002); Shawe-Taylor and Cristianini (2004)). [sent-230, score-0.598]

57 Lemma 3 With probability at least $1 - \delta$: $\sup_{h\in H}\{e_{out}(h) - e_{in}(h)\} \le E_{D,D'}\left[\sup_{h\in H}\frac{1}{2n}\sum_{i=1}^{n} r_i''\left(y_i h(x_i) - y_i' h(x_i')\right)\right] + \sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$. [sent-235, score-1.186]

58 Proof: We proceed as in the proof of the maximum discrepancy bound in Section 1.1: [sent-236, score-0.183]

59 (a) $\sup_{h\in H}\{e_{out}(h) - e_{in}(h)\} \le E_{D,D'}\left[\sup_{h\in H}\frac{1}{2n}\sum_{i=1}^{n}\left(y_i h(x_i) - y_i' h(x_i')\right)\right] + \sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$; (b) $= E_{D,D'}\left[\sup_{h\in H}\frac{1}{2n}\sum_{i=1}^{n} r_i''\left(y_i h(x_i) - y_i' h(x_i')\right)\right] + \sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$. [sent-237, score-1.759]

60 1 Generating Permutations with ±1 Sequences. Fix $y$; for a given permutation $\pi$, define a corresponding ±1 sequence $r^\pi$ by $r_i^\pi = y_{\pi_i} y_i$; then $y_{\pi_i} = r_i^\pi y_i$. [sent-242, score-1.21]

61 This defines the set $S_y$ of such sequences (there may be repetitions, as two different permutations may result in the same sequence of ±1 values); we thus have a mapping from permutations to the ±1 sequences in $S_y$. [sent-252, score-0.308]

62 For $r = r^\pi$ with $\pi$ a uniformly random permutation, the componentwise product $r \cdot y$ is uniform over the permutations of $y$. [sent-254, score-0.185]

63 Similarly, we can define $S_{y'}$, the generator of permutations on $y'$. [sent-256, score-0.164]
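The correspondence $r_i^\pi = y_{\pi_i} y_i$ (so that $y_{\pi_i} = r_i^\pi y_i$, using $y_i^2 = 1$) is easy to check in code. A small sketch; `perm_to_signs` is our name for the mapping:

```python
import random

def perm_to_signs(y, pi):
    """The sequence r^pi with r_i = y_{pi_i} * y_i. Since y_i is in {-1,+1},
    y_{pi_i} = r_i * y_i, so the componentwise product r . y recovers y_pi."""
    return [y[pi[i]] * y[i] for i in range(len(y))]

rng = random.Random(7)
y = [rng.choice((-1, 1)) for _ in range(10)]
pi = list(range(10))
rng.shuffle(pi)  # a random permutation of the indices

r = perm_to_signs(y, pi)
recovered = [ri * yi for ri, yi in zip(r, y)]  # componentwise product r . y

assert all(ri in (-1, 1) for ri in r)
assert recovered == [y[pi[i]] for i in range(10)]
```

Different permutations can map to the same sequence (e.g. swapping two equal labels), which is the source of the repetitions noted above.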

64 We can overcome this by introducing a second ghost sample $D''$ to "approximately" generate the permutations for $y, y'$, ultimately allowing us to prove the main result. [sent-258, score-0.297]

65 Theorem 3 With probability at least $1 - 5\delta$, $\sup_{h\in H}\{e_{out}(h) - e_{in}(h)\} \le P_H(n) + 9\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$. We obtain Theorem 1 by combining Theorem 3 with Corollary 1. [sent-259, score-0.655]

66 2 Proof of Theorem 3. Let $D''$ be a second, independent ghost sample, and $S_{y''}$ the generator of permutations for $y''$. [sent-262, score-0.316]

67 $\sup_{\pi}\sup_{h\in H}\frac{1}{2n}\sum_{i=1}^{n} r_i''(\pi)\left(y_i h(x_i) - y_i' h(x_i')\right)$, (1) where each permutation $\pi$ induces a particular sequence $r''(\pi) \in S_{y''}$ (previously we used $r_i''$, which is now $r_i''(\pi)$). [sent-265, score-1.421]

68 Consider the sequences $r, r'$ corresponding to the permutations on $y$ and $y'$. [sent-266, score-0.169]

69 The next lemma will ultimately relate the expectation over permutations in the second ghost data set to the permutations over $D, D'$. [sent-267, score-0.54]

70 This lemma says that the permutation generating sets $S_{y''}$, $S_{y'}$, and $S_y$ are essentially equivalent. [sent-271, score-0.53]

71 For a given permutation $\pi$, we map $r''(\pi) \in S_{y''}$ to $r(\pi) \in S_y$. [sent-275, score-0.452]

72 This mapping is clearly bijective since every permutation corresponds uniquely to a sequence in $S_y$ (and $S_{y''}$). [sent-276, score-0.452]

73 Let $r_i = y_{\pi_i} y_i$ and $r_i'' = y''_{\pi_i} y''_i$. [sent-277, score-0.758]

74 Thus, for any $r''$ and any $h \in H$, we observe that $\frac{1}{2n}\left|\sum_{i=1}^{n}\left(r_i - r_i(r'')\right) y_i h(x_i)\right| \le \frac{1}{2n}\sum_{i=1}^{n}\left|r_i - r_i(r'')\right|\,\left|y_i h(x_i)\right| = \frac{1}{2n}\sum_{i=1}^{n}\left|r_i - r_i(r'')\right| \le \frac{2|k - k''|}{n}$. [sent-280, score-0.817]

75 Also, $\sum_{i=1}^{n}\left(y_i - y_i''\right) = 2(k - k'')$, and so $\frac{1}{2n}\sum_{i=1}^{n}\left(r_i - r_i(r'')\right) y_i h(x_i) \le \frac{1}{n}\left|\sum_{i=1}^{n}\left(y_i - y_i''\right)\right| = \frac{1}{n}\left|\sum_{i=1}^{n} z_i\right|$, where $z_i = y_i - y_i''$. [sent-281, score-0.793]

76 Since $z_i \in \{0, \pm 2\}$, if you change one of the $z_i$, $f$ changes by at most $\frac{4}{n}$, and so the conditions hold to apply McDiarmid's inequality to $f$. [sent-287, score-0.164]

77 Thus, using the symmetry of $z_i$, with probability at least $1 - 2\delta$, $\left|\frac{1}{n}\sum_{i=1}^{n} z_i\right| \le \sqrt{\frac{8}{n}\ln\frac{1}{\delta}}$. [sent-288, score-0.255]

78 We can rewrite the internal summand in the expression of Equation (1) using the equality $r_i''\left(y_i h(x_i) - y_i' h(x_i')\right) = \left(r_i'' - r_i(r'') + r_i(r'')\right) y_i h(x_i) - \left(r_i'' - r_i'(r'') + r_i'(r'')\right) y_i' h(x_i')$. [sent-290, score-1.255]

79 Using Lemma 4, we can, with probability at least $1 - 2\delta$, bound the term which involves $\left(r_i'' - r_i(r'')\right)$ in Equation (1); and, similarly, with probability at least $1 - 2\delta$, we bound the term involving $\left(r_i'' - r_i'(r'')\right)$. [sent-291, score-0.616]

80 $\sup_{\pi}\sup_{h\in H}\frac{1}{2n}\sum_{i=1}^{n}\left(r_i(r'') y_i h(x_i) - r_i'(r'') y_i' h(x_i')\right) + 2\sqrt{\frac{8}{n}\ln\frac{1}{\delta}}$, where $r''(\pi)$ cycles through the sequences in $S_{y''}$. [sent-293, score-0.532]

81 $\sup_{\pi}\sup_{h\in H}\frac{1}{2n}\sum_{i=1}^{n} y'_{\pi_i} h(x_i') + 2\sqrt{\frac{8}{n}\ln\frac{1}{\delta}}$; using this in Lemma 3, with probability at least $1 - 5\delta$, $\sup_{h\in H}\{e_{out}(h) - e_{in}(h)\} \le P_H(n) + 9\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$. [sent-299, score-0.938]

82 (ii) The same proof technique can be used to get a bootstrap complexity bound; the result is similar. [sent-302, score-0.363]

83 (iii) One could bound $P_H$ for VC function classes, showing that this data dependent bound is asymptotically no worse than a VC-type bound. [sent-303, score-0.204]

84 Bounding permutation complexity on specific domains could follow the methods in Bartlett and Mendelson (2002). [sent-304, score-0.557]

85 3 Estimating $P_H(n, D)$ Using a Single Permutation. We now prove Theorem 2, which states that one can essentially estimate $P_H(n, D)$ (an average over all permutations) by $\sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_{\pi_i} h(x_i)$, using just a single randomly selected permutation $\pi$. [sent-306, score-0.452]

86 Our proof is indirect: we will link $P_H$ to the bootstrap complexity $B_H$. [sent-307, score-0.363]

87 The bootstrap complexity is concentrated via an easy application of McDiarmid’s inequality, which will ultimately allow us to conclude that the permutation estimate is also concentrated. [sent-308, score-0.832]

88 The bootstrap distribution $B$ constructs a random sequence $y^B$ of $n$ independent uniform samples from $y_1, \ldots, y_n$. [sent-309, score-0.292]

89 The key requirement is that the $y_i^B$ are independent samples. [sent-312, score-0.208]

90 Proof: Let $k$ be the number of $y_i$ which are $+1$; we condition on $\kappa$, the number of $+1$'s in the bootstrap sample. [sent-315, score-0.385]

91 Since $y^F_{\pi_i}$ differs from $y_{\pi_i}$ in exactly $|k - \kappa|$ positions, $\left|\sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y^F_{\pi_i} h(x_i) - \sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_{\pi_i} h(x_i)\right| \le \frac{2|k - \kappa|}{n}$; thus, $\sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y^F_{\pi_i} h(x_i) \le \sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_{\pi_i} h(x_i) + \frac{2|k - \kappa|}{n}$, and the result follows since $E_\kappa[|k - \kappa|]$ is $O(\sqrt{n})$. [sent-319, score-0.456]

92 In addition to furthering our cause toward the proof of Theorem 2, Lemma 5 is interesting in its own right, because it says that permutation and bootstrap sampling are asymptotically similar. [sent-322, score-0.801]

93 The nice thing about the bootstrap estimate is that the expectation is over independent $y_1^B, \ldots, y_n^B$. [sent-323, score-0.291]

94 Since the bootstrap complexity changes by at most $\frac{2}{n}$ if you change one sample, McDiarmid's inequality gives: Lemma 6 For a random bootstrap sample $B$, with probability at least $1 - \delta$, $B_H(n, D) \le \sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_i^B h(x_i) + 2\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$. [sent-327, score-1.057]

95 Now, generate a random permutation $y_\pi$, and flip (as appropriate) a randomly selected $|k - \kappa|$ entries, where $k$ is the number of $+1$'s in $y$. [sent-331, score-0.452]

96 If we apply McDiarmid's inequality to the function which equals the number of $+1$'s, we immediately get that with probability at least $1 - 2\delta$, $|\kappa - k| \le \left(\frac{n}{2}\ln\frac{1}{\delta}\right)^{1/2}$. [sent-332, score-0.202]

97 Thus, with probability at least $1 - 2\delta$, $y^B$ differs from $y_\pi$ in at most $\left(2n\ln\frac{1}{\delta}\right)^{1/2}$ positions. [sent-333, score-0.161]

98 Each flip changes the sum by at most 2 (the complexity by at most $\frac{2}{n}$), hence, with probability at least $1 - 2\delta$, $\sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_i^B h(x_i) \le \sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_{\pi_i} h(x_i) + 4\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$. [sent-334, score-0.759]
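The construction in this step can be simulated directly: draw a bootstrap sample, take a random permutation of $y$, and flip $|k - \kappa|$ randomly chosen entries so the $+1$ counts match. A sketch under the $\pm 1$-label setting; all variable names are ours:

```python
import random

rng = random.Random(11)
n = 30
y = [1] * 18 + [-1] * 12  # labels with k = 18 positives
k = y.count(1)

yB = [rng.choice(y) for _ in range(n)]  # bootstrap sample: iid uniform from y
kappa = yB.count(1)                      # number of +1's in the bootstrap sample

# Start from a random permutation of y and flip |k - kappa| entries so the
# +1 count matches the bootstrap sample's.
yp = rng.sample(y, n)                    # a uniformly random permutation of y
flips = abs(k - kappa)
target = 1 if kappa > k else -1          # the sign we need more of
idx = [i for i, v in enumerate(yp) if v != target]
for i in rng.sample(idx, flips):
    yp[i] = target

assert yp.count(1) == kappa              # counts now match the bootstrap sample
```

Each flip moves the sum $\sum_i y_{\pi_i} h(x_i)$ by at most 2, which is exactly why the bound above pays $4\sqrt{(1/2n)\ln(1/\delta)}$ for the (at most $\sqrt{2n\ln(1/\delta)}$) flips.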

99 We conclude that for a random permutation $\pi$, with probability at least $1 - 3\delta$, $B_H(n, D) \le \sup_{h\in H}\frac{1}{n}\sum_{i=1}^{n} y_{\pi_i} h(x_i) + 6\sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$. [sent-335, score-0.765]

100 We have not only established that $P_H$ is concentrated; we have also established a general connection between the permutation and bootstrap based estimates. [sent-337, score-0.677]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('permutation', 0.452), ('ein', 0.342), ('sy', 0.332), ('ph', 0.269), ('eout', 0.228), ('bootstrap', 0.225), ('ri', 0.219), ('mcdiarmid', 0.186), ('yi', 0.16), ('sup', 0.152), ('rademacher', 0.146), ('permutations', 0.139), ('ghost', 0.131), ('ln', 0.131), ('complexity', 0.105), ('bh', 0.092), ('discrepancy', 0.091), ('bartlett', 0.09), ('mendelson', 0.07), ('xi', 0.068), ('fromont', 0.065), ('mertsalov', 0.065), ('koltchinskii', 0.064), ('bound', 0.059), ('lemma', 0.059), ('concentration', 0.059), ('golland', 0.057), ('panchenko', 0.057), ('yb', 0.053), ('elomaa', 0.049), ('wherry', 0.049), ('dependent', 0.049), ('education', 0.047), ('zi', 0.047), ('uniform', 0.046), ('vapnik', 0.046), ('expectation', 0.045), ('complexities', 0.044), ('inen', 0.043), ('chervonenkis', 0.043), ('symmetrization', 0.043), ('replacement', 0.043), ('lugosi', 0.041), ('inequality', 0.041), ('validation', 0.041), ('penalties', 0.041), ('eb', 0.039), ('lozano', 0.039), ('nobel', 0.039), ('maxh', 0.039), ('massart', 0.039), ('asymptotically', 0.037), ('ari', 0.037), ('risk', 0.036), ('psychology', 0.036), ('sampling', 0.035), ('cross', 0.035), ('proof', 0.033), ('cureton', 0.033), ('katzell', 0.033), ('mosier', 0.033), ('wiklund', 0.033), ('zinn', 0.033), ('sequences', 0.03), ('least', 0.03), ('changes', 0.029), ('craven', 0.029), ('bounding', 0.028), ('supremum', 0.028), ('yn', 0.027), ('selection', 0.027), ('ultimately', 0.027), ('wahba', 0.026), ('suph', 0.026), ('gin', 0.026), ('ip', 0.025), ('larson', 0.025), ('generator', 0.025), ('negation', 0.025), ('akaike', 0.025), ('disagree', 0.023), ('newman', 0.023), ('error', 0.023), ('concentrated', 0.023), ('measurement', 0.022), ('et', 0.022), ('symposium', 0.022), ('uniformly', 0.022), ('stone', 0.021), ('iii', 0.021), ('theorem', 0.021), ('hypothesis', 0.021), ('independent', 0.021), ('asuncion', 0.021), ('episodes', 0.021), ('efron', 0.021), ('shen', 0.02), ('penalization', 0.019), ('says', 0.019), ('cristianini', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 205 nips-2010-Permutation Complexity Bound on Out-Sample Error

Author: Malik Magdon-Ismail

Abstract: We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based (like the maximum discrepancy) on dependent sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.

2 0.14712951 243 nips-2010-Smoothness, Low Noise and Fast Rates

Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We establish an excess risk bound of $\tilde{O}\left(H R_n^2 + \sqrt{H L^*}\, R_n\right)$ for ERM with an $H$-smooth loss function and a hypothesis class with Rademacher complexity $R_n$, where $L^*$ is the best risk achievable by the hypothesis class. For typical hypothesis classes where $R_n = \sqrt{R/n}$, this translates to a learning rate of $\tilde{O}(RH/n)$ in the separable ($L^* = 0$) case and $\tilde{O}\left(RH/n + \sqrt{L^* RH/n}\right)$ more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.

3 0.13118608 9 nips-2010-A New Probabilistic Model for Rank Aggregation

Author: Tao Qin, Xiubo Geng, Tie-yan Liu

Abstract: This paper is concerned with rank aggregation, which aims to combine multiple input rankings to get a better ranking. A popular approach to rank aggregation is based on probabilistic models on permutations, e.g., the Luce model and the Mallows model. However, these models have their limitations in either poor expressiveness or high computational complexity. To avoid these limitations, in this paper, we propose a new model, which is defined with a coset-permutation distance, and models the generation of a permutation as a stagewise process. We refer to the new model as coset-permutation distance based stagewise (CPS) model. The CPS model has rich expressiveness and can therefore be used in versatile applications, because many different permutation distances can be used to induce the coset-permutation distance. The complexity of the CPS model is low because of the stagewise decomposition of the permutation probability and the efficient computation of most coset-permutation distances. We apply the CPS model to supervised rank aggregation, derive the learning and inference algorithms, and empirically study their effectiveness and efficiency. Experiments on public datasets show that the derived algorithms based on the CPS model can achieve state-ofthe-art ranking accuracy, and are much more efficient than previous algorithms.

4 0.11077109 115 nips-2010-Identifying Dendritic Processing

Author: Aurel A. Lazar, Yevgeniy Slutskiy

Abstract: In system identification both the input and the output of a system are available to an observer and an algorithm is sought to identify parameters of a hypothesized model of that system. Here we present a novel formal methodology for identifying dendritic processing in a neural circuit consisting of a linear dendritic processing filter in cascade with a spiking neuron model. The input to the circuit is an analog signal that belongs to the space of bandlimited functions. The output is a time sequence associated with the spike train. We derive an algorithm for identification of the dendritic processing filter and reconstruct its kernel with arbitrary precision. 1

5 0.11040877 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

Author: Martha White, Adam White

Abstract: The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agentenvironment interaction rarely satisfies traditional assumptions. Samples of valueestimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking. 1

6 0.10907853 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

7 0.097016767 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis

8 0.092625193 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

9 0.082299933 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

10 0.068538591 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

11 0.068260573 257 nips-2010-Structured Determinantal Point Processes

12 0.067114569 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics

13 0.064694703 217 nips-2010-Probabilistic Multi-Task Feature Selection

14 0.064619489 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

15 0.061643403 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models

16 0.058076195 191 nips-2010-On the Theory of Learnining with Privileged Information

17 0.0572046 27 nips-2010-Agnostic Active Learning Without Constraints

18 0.056441393 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities

19 0.054688796 22 nips-2010-Active Estimation of F-Measures

20 0.053140547 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.146), (1, 0.006), (2, 0.121), (3, 0.031), (4, 0.013), (5, 0.08), (6, -0.078), (7, -0.011), (8, -0.015), (9, 0.003), (10, -0.003), (11, -0.084), (12, -0.072), (13, 0.008), (14, -0.049), (15, 0.024), (16, 0.088), (17, -0.016), (18, 0.032), (19, -0.0), (20, 0.051), (21, -0.06), (22, -0.004), (23, -0.019), (24, 0.009), (25, 0.078), (26, -0.09), (27, -0.133), (28, 0.108), (29, 0.063), (30, 0.048), (31, -0.047), (32, -0.036), (33, -0.137), (34, 0.053), (35, 0.014), (36, 0.152), (37, -0.002), (38, 0.013), (39, 0.067), (40, -0.088), (41, 0.009), (42, -0.014), (43, -0.01), (44, -0.097), (45, -0.15), (46, -0.093), (47, 0.001), (48, -0.036), (49, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93633699 205 nips-2010-Permutation Complexity Bound on Out-Sample Error

Author: Malik Magdon-Ismail

Abstract: We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based (like the maximum discrepancy) on dependent sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.

2 0.74424136 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics

Author: Thomas Peel, Sandrine Anthoine, Liva Ralaivola

Abstract: We present original empirical Bernstein inequalities for U-statistics with bounded symmetric kernels q. They are expressed with respect to empirical estimates of either the variance of q or the conditional variance that appears in the Bernsteintype inequality for U-statistics derived by Arcones [2]. Our result subsumes other existing empirical Bernstein inequalities, as it reduces to them when U-statistics of order 1 are considered. In addition, it is based on a rather direct argument using two applications of the same (non-empirical) Bernstein inequality for U-statistics. We discuss potential applications of our new inequalities, especially in the realm of learning ranking/scoring functions. In the process, we exhibit an efficient procedure to compute the variance estimates for the special case of bipartite ranking that rests on a sorting argument. We also argue that our results may provide test set bounds and particularly interesting empirical racing algorithms for the problem of online learning of scoring functions. 1

3 0.58086669 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

Author: Sivan Sabato, Nathan Srebro, Naftali Tishby

Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1

4 0.57993877 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma

Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. 1

5 0.57564682 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis

Author: Hariharan Narayanan, Sanjoy Mitter

Abstract: The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is independent of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable. Whether the known lower bound of $O\left(\frac{k}{\epsilon^2} + \frac{\log(1/\delta)}{\epsilon^2}\right)$ for the sample complexity of Empirical Risk minimization on $k$-means applied to data in a unit ball of arbitrary dimension is tight, has been an open question since 1997 [3]. Here $\epsilon$ is the desired bound on the error and $\delta$ is a bound on the probability of failure. We improve the best currently known upper bound [14] of $O\left(\frac{k^2}{\epsilon^2} + \frac{\log(1/\delta)}{\epsilon^2}\right)$ to $O\left(\frac{k}{\epsilon^2}\min\left(k, \frac{\log^4(k/\epsilon)}{\epsilon^2}\right) + \frac{\log(1/\delta)}{\epsilon^2}\right)$. Based on these results, we devise a simple algorithm for $k$-means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension.

6 0.57342303 191 nips-2010-On the Theory of Learning with Privileged Information

7 0.55327612 243 nips-2010-Smoothness, Low Noise and Fast Rates

8 0.54528755 9 nips-2010-A New Probabilistic Model for Rank Aggregation

9 0.51452821 220 nips-2010-Random Projection Trees Revisited

10 0.44883588 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars

11 0.41317573 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

12 0.41025358 257 nips-2010-Structured Determinantal Point Processes

13 0.40800351 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

14 0.40638465 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities

15 0.40551782 216 nips-2010-Probabilistic Inference and Differential Privacy

16 0.40049073 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification

17 0.39808837 108 nips-2010-Graph-Valued Regression

18 0.39600539 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

19 0.39004004 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

20 0.38895771 142 nips-2010-Learning Bounds for Importance Weighting


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.026), (27, 0.035), (30, 0.063), (45, 0.126), (50, 0.037), (52, 0.019), (60, 0.036), (77, 0.021), (90, 0.528)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79608625 205 nips-2010-Permutation Complexity Bound on Out-Sample Error

Author: Malik Magdon-Ismail

Abstract: We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based (like the maximum discrepancy) on dependent sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.
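The permutation complexity defined above, P_H(n, D) = E_π max_{h∈H} (1/n) Σ_i y_{π_i} h(x_i), lends itself to a direct Monte Carlo estimate over random permutations, which is what the concentration result makes meaningful. The sketch below does this for a small illustrative class of 1-D threshold classifiers; the class, the data model, and all names are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def permutation_complexity(xs, ys, hypotheses, n_perms=200, seed=0):
    """Monte Carlo estimate of E_pi max_h (1/n) sum_i y_{pi(i)} h(x_i).

    `hypotheses` is a finite list of +/-1-valued functions, an
    illustrative stand-in for the class H."""
    rng = np.random.default_rng(seed)
    n = len(ys)
    # Precompute h(x_i) for every hypothesis: shape (|H|, n).
    H = np.array([[h(x) for x in xs] for h in hypotheses], dtype=float)
    total = 0.0
    for _ in range(n_perms):
        y_perm = ys[rng.permutation(n)]  # dependent resampling of the labels
        total += (H @ y_perm).max() / n  # max over H of the correlation
    return total / n_perms

# Toy data: 1-D inputs with threshold labels.
rng = np.random.default_rng(1)
xs = rng.uniform(-1, 1, size=50)
ys = np.where(xs > 0.1, 1.0, -1.0)
# A small illustrative class of threshold classifiers.
thresholds = np.linspace(-1, 1, 21)
hyps = [lambda x, t=t: 1.0 if x > t else -1.0 for t in thresholds]

print(round(permutation_complexity(xs, ys, hyps), 3))
```

Note the contrast with a Rademacher average: instead of independent random signs, the labels themselves are permuted, so the "resampled" labels are dependent, as the abstract emphasizes.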

2 0.77095079 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models

Author: Congcong Li, Adarsh Kowdle, Ashutosh Saxena, Tsuhan Chen

Abstract: In many machine learning domains (such as scene understanding), several related sub-tasks (such as scene categorization, depth estimation, object detection) operate on the same raw data and provide correlated outputs. Each of these tasks is often notoriously hard, and state-of-the-art classifiers already exist for many sub-tasks. It is desirable to have an algorithm that can capture such correlation without requiring any changes to the inner workings of any classifier. We propose Feedback Enabled Cascaded Classification Models (FE-CCM), which maximizes the joint likelihood of the sub-tasks, while requiring only a ‘black-box’ interface to the original classifier for each sub-task. We use a two-layer cascade of classifiers, which are repeated instantiations of the original ones, with the output of the first layer fed into the second layer as input. Our training method involves a feedback step that allows later classifiers to provide earlier classifiers information about what error modes to focus on. We show that our method significantly improves performance in all the sub-tasks in two different domains: (i) scene understanding, where we consider depth estimation, scene categorization, event categorization, object detection, geometric labeling and saliency detection, and (ii) robotic grasping, where we consider grasp point detection and object classification.
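The cascade wiring described in the abstract, repeated instantiations of per-task classifiers with first-layer outputs fed to the second layer, can be sketched with least-squares "classifiers" standing in for the black boxes. With linear stand-ins the fed-back outputs are collinear with the raw features and add nothing numerically; the point of the sketch is the wiring, not a performance gain, and the data model is invented.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares stand-in for a black-box classifier (illustrative;
    FE-CCM wraps arbitrary, typically nonlinear, classifiers)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
# Two correlated sub-tasks sharing the same raw input.
y1 = np.sign(X[:, 0] + 0.3 * X[:, 1] + 0.2 * rng.normal(size=500))
y2 = np.sign(X[:, 0] - 0.3 * X[:, 2] + 0.2 * rng.normal(size=500))

# Layer 1: independent classifiers on the raw features.
f1, f2 = fit_linear(X, y1), fit_linear(X, y2)
# Layer 2: repeated instantiations that also see the other task's output.
X2 = np.hstack([X, f1(X)[:, None], f2(X)[:, None]])
g1, g2 = fit_linear(X2, y1), fit_linear(X2, y2)
acc1 = float((np.sign(g1(X2)) == y1).mean())
acc2 = float((np.sign(g2(X2)) == y2).mean())
print(round(acc1, 2), round(acc2, 2))
```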

3 0.7401576 250 nips-2010-Spectral Regularization for Support Estimation

Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo

Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels make it possible to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient in proving the universal consistency of the spectral estimators, and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state-of-the-art machine learning algorithms for density support estimation.

4 0.73128772 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models

Author: Jun Zhu, Li-jia Li, Li Fei-fei, Eric P. Xing

Abstract: Upstream supervised topic models have been widely used for complicated scene understanding. However, existing maximum likelihood estimation (MLE) schemes can make prediction-model learning independent of latent topic discovery and result in an imbalanced prediction rule for scene classification. This paper presents a joint max-margin and max-likelihood learning method for upstream scene understanding models, in which latent topic discovery and prediction model estimation are closely coupled and well-balanced. The optimization problem is efficiently solved with a variational EM procedure, which iteratively solves an online loss-augmented SVM. We demonstrate the advantages of the large-margin approach on both an 8-category sports dataset and the 67-class MIT indoor scene dataset for scene categorization.

5 0.62216073 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems

Author: Han Liu, Xi Chen

Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable to large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets.
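A minimal sketch of the dyadic constraint underlying trees like the MDRTs above, in one dimension: cells are split only at dyadic midpoints, and each cell predicts a training mean. The penalized, sparsity-adaptive construction of the paper is omitted here; the fixed depth, data model, and names are illustrative assumptions.

```python
import numpy as np

def dyadic_predict(x_train, y_train, x_test, depth=4):
    """One-dimensional dyadic regression sketch: [0, 1) is partitioned
    into 2**depth equal dyadic cells and each cell predicts the training
    mean inside it (falling back to the global mean for empty cells).
    The paper's penalized depth/variable selection is not implemented."""
    n_cells = 2 ** depth
    cell_tr = np.minimum((x_train * n_cells).astype(int), n_cells - 1)
    cell_te = np.minimum((x_test * n_cells).astype(int), n_cells - 1)
    means = np.full(n_cells, y_train.mean())
    for c in range(n_cells):
        if np.any(cell_tr == c):
            means[c] = y_train[cell_tr == c].mean()
    return means[cell_te]

rng = np.random.default_rng(4)
x = rng.uniform(size=2000)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=2000)
yhat = dyadic_predict(x, y, x, depth=5)
mse = float(np.mean((yhat - np.sin(2 * np.pi * x)) ** 2))
print(round(mse, 3))  # small: the piecewise-constant fit tracks the sine
```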

6 0.61598337 127 nips-2010-Inferring Stimulus Selectivity from the Spatial Structure of Neural Network Dynamics

7 0.53201139 175 nips-2010-Multiparty Differential Privacy via Aggregation of Locally Trained Classifiers

8 0.51598346 186 nips-2010-Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification

9 0.47522098 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers

10 0.47273493 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case

11 0.45734364 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

12 0.44862974 282 nips-2010-Variable margin losses for classifier design

13 0.44605333 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

14 0.44207805 117 nips-2010-Identifying graph-structured activation patterns in networks

15 0.4339003 79 nips-2010-Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces

16 0.43145251 24 nips-2010-Active Learning Applied to Patient-Adaptive Heartbeat Classification

17 0.43124995 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis

18 0.43022308 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs

19 0.42699057 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

20 0.42554474 220 nips-2010-Random Projection Trees Revisited