nips nips2013 nips2013-242 knowledge-graph by maker-knowledge-mining

242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality


Source: pdf

Author: Ilya O. Tolstikhin, Yevgeny Seldin

Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. [sent-6, score-0.433]

2 We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. [sent-7, score-1.185]

3 The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. [sent-9, score-0.433]

4 PAC-Bayesian analysis provides concentration inequalities for the divergence between expected and empirical loss of randomized prediction rules. [sent-13, score-0.555]

5 For a hypothesis space H, a randomized prediction rule associated with a distribution ρ over H operates by picking a hypothesis from H at random according to ρ each time it has to make a prediction. [sent-14, score-0.303]

6 If ρ is a delta-distribution we recover classical prediction rules that pick a single hypothesis h ∈ H. [sent-15, score-0.134]

7 Otherwise, the prediction strategy resembles Bayesian prediction from the posterior distribution, with the distinction that ρ does not have to be the Bayes posterior. [sent-16, score-0.176]

8 Importantly, many PAC-Bayesian inequalities hold for all posterior distributions ρ simultaneously (with high probability over a random draw of a training set). [sent-17, score-0.351]

9 Ideally, we prefer to derive new algorithms that find the posterior distribution ρ that minimizes the PAC-Bayesian bound on the expected loss. [sent-19, score-0.317]

10 However, we can also use PAC-Bayesian bounds in order to estimate the expected loss of posterior distributions ρ that were found by other algorithms, such as empirical risk minimization, regularized empirical risk minimization, Bayesian posteriors, and so forth. [sent-20, score-0.655]

11 There are two forms of PAC-Bayesian inequalities that are currently known to be the tightest, depending on the situation. [sent-22, score-0.123]

12 One is the PAC-Bayes-kl inequality of Seeger [7] and the other is the PAC-Bayes-Bernstein inequality of Seldin et al. [8]. [sent-23, score-0.58]

13 However, the PAC-Bayes-Bernstein inequality is expressed in terms of the true expected variance, which is rarely accessible in practice. [sent-26, score-0.388]

14 Therefore, in order to apply the PAC-Bayes-Bernstein inequality we need an upper bound on the expected variance (or, more precisely, on the average of the expected variances of the losses of each hypothesis h ∈ H weighted according to the randomized prediction rule ρ). [sent-27, score-1.156]

15 In fact, for the binary loss this result cannot be significantly improved (see Section 3). [sent-29, score-0.144]

16 However, when the loss is not binary it may be possible to obtain a tighter bound on the variance, which will lead to a tighter bound on the loss than the PAC-Bayes-kl inequality. [sent-30, score-0.906]

17 In [6] a deterministic upper bound on the variance of importance-weighted sampling combined with the PAC-Bayes-Bernstein inequality yielded an order of magnitude improvement relative to an application of the PAC-Bayes-kl inequality to the same problem. [sent-33, score-0.927]

18 We note that the bound on the variance used by Seldin et al. [sent-34, score-0.304]

19 In this work we derive the PAC-Bayes-Empirical-Bernstein bound, in which the expected average variance of the loss weighted by ρ is replaced by the weighted average of the empirical variance of the loss. [sent-37, score-0.572]

20 Bounding the expected variance by the empirical variance is generally tighter than bounding it by the empirical loss. [sent-38, score-0.762]

21 Therefore, the PAC-Bayes-Empirical-Bernstein bound is generally tighter than the PAC-Bayes-kl bound, although the exact comparison also depends on the divergence between the posterior and the prior and the sample size. [sent-39, score-0.466]

22 In Section 5 we provide an empirical comparison of the two bounds on several synthetic and UCI datasets. [sent-40, score-0.213]

23 In the first step we combine the PAC-Bayesian bounding technique with the Empirical Bernstein inequality [9] and derive a PAC-Bayesian bound on the variance. [sent-42, score-0.604]

24 The PAC-Bayesian bound on the variance bounds the divergence between averages [weighted by ρ] of expected and empirical variances of the losses of hypotheses in H and holds with high probability for all averaging distributions ρ simultaneously. [sent-43, score-0.903]

25 In the second step the PAC-Bayesian bound on the variance is substituted into the PAC-Bayes-Bernstein inequality yielding the PAC-Bayes-Empirical-Bernstein bound. [sent-44, score-0.594]

26 We are given a training sample S = {(Xi, Yi)}_{i=1}^n drawn according to an unknown distribution D on the product space X × Y, a loss function ℓ : Y² → [0, 1], and a hypothesis class H. [sent-53, score-0.285]

27 We use ℓ_h(X, Y) = ℓ(Y, h(X)) to denote the loss of a hypothesis h on a pair (X, Y). [sent-55, score-0.245]

28 For a fixed hypothesis h ∈ H we denote its expected loss by L(h) = E_{(X,Y)∼D}[ℓ_h(X, Y)], its empirical loss by Ln(h) = (1/n) Σ_{i=1}^n ℓ_h(Xi, Yi), and the variance of its loss by V(h) = Var_{(X,Y)∼D}[ℓ_h(X, Y)] = E_{(X,Y)∼D}[(ℓ_h(X, Y) − E_{(X,Y)∼D}[ℓ_h(X, Y)])²]. [sent-56, score-0.828]

29 We define the Gibbs regression rule Gρ associated with a distribution ρ over H in the following way: for each point X the Gibbs regression rule draws a hypothesis h according to ρ and applies it to X. [sent-57, score-0.279]

30 The expected loss of the Gibbs regression rule is denoted by L(Gρ) = E_{h∼ρ}[L(h)] and the empirical loss by Ln(Gρ) = E_{h∼ρ}[Ln(h)]. [sent-58, score-0.539]
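
As an illustration only, here is a minimal sketch of a Gibbs regression rule over a finite hypothesis set; the function names and the representation of ρ as a probability vector are our own assumptions, not the paper's.

```python
import numpy as np

def gibbs_predict(X, hypotheses, rho, rng):
    """Gibbs rule G_rho: for each input, draw h ~ rho and apply it to that input."""
    idx = rng.choice(len(hypotheses), size=len(X), p=rho)
    return np.array([hypotheses[i](x) for i, x in zip(idx, X)])

def empirical_gibbs_loss(X, Y, hypotheses, rho, loss):
    """L_n(G_rho) = E_{h~rho}[L_n(h)]: rho-weighted average of the empirical losses."""
    Ln = np.array([np.mean([loss(y, h(x)) for x, y in zip(X, Y)]) for h in hypotheses])
    return float(np.dot(rho, Ln))
```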

31 We use KL(ρ‖π) = E_{h∼ρ}[ln(ρ(h)/π(h))] to denote the Kullback-Leibler divergence between two probability distributions [10]. [sent-59, score-0.501]

32 2 PAC-Bayes-kl bound. Before presenting our results we review several existing PAC-Bayesian bounds. [sent-63, score-0.202]

33 The result in Theorem 1 was presented by Maurer [11, Theorem 5] and is one of the tightest known concentration bounds on the expected loss of Gibbs regression rule. [sent-64, score-0.434]

34 Theorem 1 generalizes (and slightly tightens) the PAC-Bayes-kl inequality of Seeger [7, Theorem 1] from binary to arbitrary loss functions bounded in the [0, 1] interval. [sent-65, score-0.473]

35 For any fixed probability distribution π over H, for any n ≥ 8 and δ > 0, with probability greater than 1 − δ over a random draw of a sample S, for all distributions ρ over H simultaneously: kl(Ln(Gρ) ‖ L(Gρ)) ≤ (KL(ρ‖π) + ln(2√n/δ)) / n. (1) [sent-67, score-0.864]

36 Since by Pinsker's inequality |p − q| ≤ √(kl(q‖p)/2), Theorem 1 directly implies (up to minor factors) the more explicit PAC-Bayesian bound of McAllester [12]: L(Gρ) ≤ Ln(Gρ) + √((KL(ρ‖π) + ln(2√n/δ)) / (2n)), (2) which holds with probability greater than 1 − δ for all ρ simultaneously. [sent-68, score-0.964]

37 We note that kl is easy to invert numerically and for small values of Ln (Gρ ) (less than 1/4) the implicit bound in (1) is significantly tighter than the explicit bound in (2). [sent-69, score-0.723]
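
Since kl(q‖p) is increasing in p for p ≥ q, the implicit bound in (1) can be inverted numerically by binary search. A minimal sketch, assuming the reconstructed form of (1) above; the helper names are ours, not the paper's.

```python
import math

def kl_bernoulli(q, p, eps=1e-12):
    """kl(q||p) between Bernoulli distributions with means q and p."""
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def kl_upper_inverse(q, budget, tol=1e-9):
    """Largest p in [q, 1] with kl(q||p) <= budget, found by binary search."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def pb_kl_bound(emp_loss, kl_rho_pi, n, delta):
    """Evaluate the implicit PB-kl bound (1) on L(G_rho)."""
    budget = (kl_rho_pi + math.log(2.0 * math.sqrt(n) / delta)) / n
    return kl_upper_inverse(emp_loss, budget)
```

For small empirical losses the inverted value is noticeably below the explicit relaxation (2), which matches the remark above.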

38 This can be seen from another relaxation suggested by McAllester [2], which follows from (1) by the inequality p ≤ q + √(2q·kl(q‖p)) + 2·kl(q‖p) for p < q: L(Gρ) ≤ Ln(Gρ) + √(2·Ln(Gρ)·(KL(ρ‖π) + ln(2√n/δ)) / n) + 2·(KL(ρ‖π) + ln(2√n/δ)) / n. (3) [sent-70, score-1.108]

39 From inequality (3) we clearly see that inequality (1) achieves a “fast convergence rate” or, in other words, when L(Gρ) is zero (or small compared to 1/√n) the bound converges at the rate of 1/n rather than 1/√n as a function of n. [sent-71, score-0.751]

40 Seldin et al. [8] introduced a general technique for combining PAC-Bayesian analysis with concentration-of-measure inequalities and derived the PAC-Bayes-Bernstein bound cited below. [sent-75, score-0.364]

41 Furthermore, the result holds if Eρ[V(h)] is replaced by an upper bound V̄(ρ), as long as V̄(ρ) ≤ 1 for all ρ. [sent-84, score-0.241]

42 First, we note that Seldin et al. [8] worked with cumulative losses and variances, whereas we work with normalized losses and variances, which means that their losses and variances differ by a multiplicative factor of n from our definitions. [sent-88, score-0.375]

43 Second, we note that the statement on the possibility of replacing Eρ [V(h)] by an upper bound is not part of [8, Theorem 8], but it is mentioned and analyzed explicitly in the text. [sent-89, score-0.214]

44 Since 1/4 is a trivial upper bound on the variance of a random variable bounded in the [0, 1] interval, the requirement is not a limitation. [sent-91, score-0.386]

45 Finally, we note that since we are working with “one-sided” variables (namely, the loss is bounded in the [0, 1] interval rather than the “two-sided” [−1, 1] interval, which was considered in [8]) the variance is bounded by 1/4 (rather than 1), which leads to a slight improvement in the value of ν1. [sent-92, score-0.396]
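
The 1/4 bound is a one-line calculation; a sketch for a loss ℓ_h taking values in [0, 1]:

```latex
V(h) = E[\ell_h^2] - E[\ell_h]^2 \le E[\ell_h] - E[\ell_h]^2 = L(h)\,(1 - L(h)) \le \tfrac{1}{4},
```

with equality in the first step when the loss is {0,1}-valued; a variable bounded only in [−1, 1] satisfies just V ≤ 1.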

46 Since in reality we rarely have access to the expected variance Eρ[V(h)], the tightness of Theorem 2 entirely depends on the tightness of the upper bound V̄(ρ). [sent-93, score-0.525]

47 If we use the trivial upper bound Eρ[V(h)] ≤ 1/4 the result is roughly equivalent to (2), which is inferior to Theorem 1. [sent-94, score-0.247]

48 Design of a tighter upper bound on Eρ[V(h)] that holds for all ρ simultaneously is the subject of the following section. [sent-95, score-0.463]

49 3 Main Results. The key result of our paper is a PAC-Bayesian bound on the average expected variance Eρ[V(h)] given in terms of the average empirical variance Eρ[Vn(h)] = E_{h∼ρ}[Vn(h)], where Vn(h) = (1/(n − 1)) Σ_{i=1}^n (ℓ_h(Xi, Yi) − Ln(h))² (5) is an unbiased estimate of the variance V(h). [sent-96, score-0.732]
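
A minimal sketch of the unbiased estimate (5) and its ρ-average for a finite hypothesis class; the array layout is our own assumption.

```python
import numpy as np

def empirical_variance(losses):
    """V_n(h) from (5): (1/(n-1)) * sum_i (l_h(X_i, Y_i) - L_n(h))^2."""
    return float(np.var(np.asarray(losses, dtype=float), ddof=1))  # ddof=1 gives the 1/(n-1) normalization

def average_empirical_variance(loss_matrix, rho):
    """E_rho[V_n(h)] for a finite H; loss_matrix has shape (|H|, n)."""
    Vn = np.var(np.asarray(loss_matrix, dtype=float), axis=1, ddof=1)
    return float(np.dot(rho, Vn))
```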

50 The bound is given in Theorem 3 and it holds with high probability for all distributions ρ simultaneously. [sent-97, score-0.252]

51 Substitution of this bound into Theorem 2 yields the PAC-Bayes-Empirical-Bernstein inequality given in Theorem 4. [sent-98, score-0.461]

52 Thus, the PAC-Bayes-Empirical-Bernstein inequality is based on two subsequent applications of the PAC-Bayesian bounding technique. [sent-99, score-0.391]

53 3.1 PAC-Bayesian bound on the variance. Theorem 3 is based on an application of the PAC-Bayesian bounding technique to the difference Eρ[V(h)] − Eρ[Vn(h)]. [sent-101, score-0.447]

54 We note that Vn(h) is a second-order U-statistic [13] and Theorem 3 provides an interesting example of combining PAC-Bayesian analysis with concentration inequalities for self-bounding functions. [sent-102, score-0.151]

55 Note that (6) closely resembles the explicit bound on L(Gρ ) in (3). [sent-105, score-0.202]

56 If the empirical variance Vn(h) is close to zero, the impact of the second term of the bound (which scales with 1/√n) is relatively small and we obtain a “fast convergence rate” of Eρ[Vn(h)] to Eρ[V(h)]. [sent-106, score-0.399]

57 Finally, we note that the impact of c2 on ln ν2 is relatively small and so c2 can be taken very close to 1. [sent-107, score-0.409]

58 3.2 PAC-Bayes-Empirical-Bernstein bound. Theorem 3 controls the average variance Eρ[V(h)] for all posterior distributions ρ simultaneously. [sent-109, score-0.473]

59 By taking δ1 = δ2 = δ/2 we have the claims of Theorems 2 and 3 holding simultaneously with probability greater than 1 − δ. [sent-110, score-0.151]

60 Substitution of the bound on Eρ[V(h)] from Theorem 3 into the PAC-Bayes-Bernstein inequality in Theorem 2 yields the main result of our paper, the PAC-Bayes-Empirical-Bernstein inequality, which controls the loss of the Gibbs regression rule Eρ[L(h)] for all posterior distributions ρ simultaneously. [sent-111, score-0.863]

61 We also note that when the loss is bounded in the [0, 1] interval we have Vn(h) ≤ (n/(n − 1))Ln(h) (since ℓ_h(X, Y)² ≤ ℓ_h(X, Y)). [sent-118, score-0.224]
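
Spelled out, the calculation behind this claim for losses in [0, 1] is:

```latex
V_n(h) = \frac{1}{n-1}\sum_{i=1}^n \big(\ell_h(X_i,Y_i) - L_n(h)\big)^2
       = \frac{n}{n-1}\Big(\frac{1}{n}\sum_{i=1}^n \ell_h(X_i,Y_i)^2 - L_n(h)^2\Big)
       \le \frac{n}{n-1}\big(L_n(h) - L_n(h)^2\big) \le \frac{n}{n-1}\,L_n(h).
```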

62 Therefore, the PB-EB bound is never much worse than the PB-kl bound and if the empirical variance is small compared to the empirical loss it can be much tighter. [sent-119, score-0.809]

63 We note that for the binary loss (ℓ(y, y′) ∈ {0, 1}) we have V(h) = L(h)(1 − L(h)) and in this case the empirical variance cannot be significantly smaller than the empirical loss and PB-EB does not provide an advantage over PB-kl. [sent-120, score-0.611]

64 We also note that the unrelaxed version of the PB-kl inequality in equation (1) has better behavior for very small sample sizes and in such cases PB-kl can be tighter than PB-EB even when the empirical variance is small. [sent-121, score-0.732]

65 Roughly, when the average empirical variance is below 0.7·Ln(Gρ) the PB-EB inequality can be significantly tighter than the PB-kl bound and otherwise it is comparable (except for very small sample sizes). [sent-123, score-0.639]

66 For any function fn : H × (X × Y)^n → R and for any distribution π over H, such that π is independent of S, with probability greater than 1 − δ over a random draw of S, for all distributions ρ over H simultaneously: Eρ[fn(h, S)] ≤ KL(ρ‖π) + ln(1/δ) + ln(Eπ E_{S′∼D^n}[e^{fn(h, S′)}]). (8) [sent-130, score-1.067]

67 The smart part is to choose fn(h, S) so that we get the quantities of interest on the left-hand side of (8) and at the same time are able to bound the last term on the right-hand side of (8). [sent-131, score-0.248]
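
The deterministic core of (8) is the change-of-measure (Donsker-Varadhan) inequality Eρ[f(h)] ≤ KL(ρ‖π) + ln Eπ[e^{f(h)}]; a small numeric sanity check on a finite hypothesis set (all names and values here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 5                                    # size of a toy finite hypothesis class
pi = np.full(H, 1.0 / H)                 # prior over hypotheses
rho = rng.dirichlet(np.ones(H))          # an arbitrary posterior
f = rng.normal(size=H)                   # any real-valued function of h

lhs = float(np.dot(rho, f))
kl = float(np.sum(rho * np.log(rho / pi)))
rhs = kl + float(np.log(np.dot(pi, np.exp(f))))
assert lhs <= rhs + 1e-9                 # change of measure holds for every rho
```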

69 (a) n = 1000, (b) n = 4000 (axes: average empirical loss and average sample variance). Figure 1: The ratio of the gap between PB-EB and Ln(Gρ) to the gap between PB-kl and Ln(Gρ) for different values of n, Eρ[Vn(h)], and Ln(Gρ).

70 PB-EB is tighter below the dashed line with label 1. [sent-154, score-0.138]

71 We take fn(h, S) = λ(nV(h) − nVn(h)) − λ²(n/(n − 1))V(h) and substitute fn and the bound on its moment generating function in (9) into (8).

72 To complete the proof it is left to optimize the bound with respect to λ. [sent-158, score-0.206]

73 Since it is impossible to minimize the bound simultaneously for all ρ with a single value of λ, we follow the technique suggested by Seldin et al. [8] [sent-159, score-0.297]

74 and take a grid of λ-s in the form of a geometric progression and apply a union bound over this grid. [sent-161, score-0.209]

75 Then, for each ρ we pick the value of λ from the grid that is closest to the value of λ that minimizes the bound for the corresponding ρ. [sent-162, score-0.171]

76 (The approximation of the optimal λ by the closest λ from the grid is behind the factor c2 in the bound, and the ln ν2 factor is the result of the union bound over the grid of λ-s.) [sent-163, score-0.827]
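
A minimal sketch of the grid construction just described; the endpoints and the value of c2 below are placeholders, not the paper's exact choices.

```python
import numpy as np

def lambda_grid(lam_min, lam_max, c2):
    """Geometric grid {lam_min * c2^k, k = 0, 1, ...} covering [lam_min, lam_max]."""
    k = int(np.ceil(np.log(lam_max / lam_min) / np.log(c2))) + 1
    return lam_min * c2 ** np.arange(k)

def closest_on_grid(lam_star, grid):
    """Approximate the per-rho optimal lambda by the nearest grid point; the union
    bound over the grid is what produces the ln(nu2) term."""
    return grid[np.argmin(np.abs(grid - lam_star))]

grid = lambda_grid(1e-3, 1.0, c2=1.05)   # placeholder endpoints and spacing
lam = closest_on_grid(0.137, grid)       # 0.137 stands in for the per-rho optimal lambda
```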

77 By our choice of δ1 = δ2 = δ/2 the upper bounds of Theorems 2 and 3 hold simultaneously with probability greater than 1 − δ. [sent-166, score-0.261]

78 In Figures 1.a and 1.b we examine the ratio of the complexity parts of the two bounds, (PB-EB − Ln(Gρ)) / (PB-kl − Ln(Gρ)), where PB-EB denotes the value of the PB-EB bound in equation (7) and PB-kl denotes the value of the PB-kl bound in equation (1). [sent-171, score-0.445]

79 As we already said, the advantage of the PB-EB inequality over the PB-kl inequality is most prominent in regression (for classification with zero-one loss it is roughly comparable to PB-kl). [sent-182, score-0.772]

80 Below we provide regression experiments with L1 loss on synthetic data and three datasets from the UCI repository [15]. [sent-183, score-0.306]

81 We use the PB-EB and PB-kl bounds to bound the loss of a regularized empirical risk minimization algorithm. [sent-184, score-0.504]

82 In all our experiments the inputs Xi lie in a d-dimensional unit ball centered at the origin (‖Xi‖₂ ≤ 1) and the outputs Y take values in [−0.5, 0.5]. [sent-185, score-0.157]

83 The hypothesis class HW is defined as HW = {h_w(X) = ⟨w, X⟩ : w ∈ R^d, ‖w‖₂ ≤ 0.5}. [sent-188, score-0.243]

84 This construction ensures that the L1 regression loss ℓ(y, y′) = |y − y′| is bounded in the [0, 1] interval. [sent-190, score-0.231]
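
With the radii as reconstructed above (‖X‖₂ ≤ 1, |Y| ≤ 0.5, ‖w‖₂ ≤ 0.5), the boundedness claim is a one-line check:

```latex
|y - \langle w, x\rangle| \le |y| + \|w\|_2 \,\|x\|_2 \le 0.5 + 0.5 \cdot 1 = 1 .
```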

85 The posterior distribution ρ_ŵ is taken to be a uniform distribution on a d-dimensional ball of a fixed radius centered at the weight vector ŵ, where ŵ is the solution of the following minimization problem: ŵ = arg min_w (1/n) Σ_{i=1}^n |Yi − ⟨w, Xi⟩| + λ*‖w‖₂. [sent-192, score-0.231]

86 We use binary search in order to find the minimal (non-negative) λ*, such that the posterior ρ_ŵ is supported by HW (meaning that the ball of that radius around ŵ is within the ball of radius 0.5). [sent-195, score-0.319]
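
A minimal sketch of this pipeline under the assumptions above; the solver (plain subgradient descent), the step sizes, and the search interval are placeholders, and it assumes ‖ŵ(λ*)‖₂ shrinks as λ* grows.

```python
import numpy as np

def fit_l1_regularized(X, Y, lam, steps=5000, lr=0.05):
    """Subgradient descent on (1/n) * sum_i |Y_i - <w, X_i>| + lam * ||w||_2."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(steps):
        r = X @ w - Y
        g = X.T @ np.sign(r) / n
        nw = np.linalg.norm(w)
        if nw > 0:
            g = g + lam * w / nw
        w = w - (lr / np.sqrt(t + 1.0)) * g
    return w

def smallest_feasible_lambda(X, Y, radius, ball=0.5, tol=1e-3):
    """Binary search for the minimal lam such that the ball of the given radius
    around w_hat(lam) stays inside the ball of radius `ball`."""
    lo, hi = 0.0, 10.0                      # placeholder search interval
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        w = fit_l1_regularized(X, Y, mid)
        if np.linalg.norm(w) + radius <= ball:
            hi = mid                        # feasible: try a smaller lam
        else:
            lo = mid
    return hi
```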

87 The sigmoid function creates a mismatch between the data generating distribution and the linear hypothesis class. [sent-216, score-0.132]

88 This results in a small empirical variance of the loss Vn(h) and a medium to high empirical loss Ln(h). [sent-218, score-0.611]

89 We see that except for very small sample sizes (n < 1000) the PB-EB bound outperforms the PB-kl bound. [sent-229, score-0.247]

90 Inferior performance for very small sample sizes is a result of domination of the O(1/n) term in the PB-EB bound (7). [sent-230, score-0.247]

91 5.2 UCI datasets. We compare our PAC-Bayes-Empirical-Bernstein inequality (7) with the PAC-Bayes-kl inequality (1) on three UCI regression datasets: Wine Quality, Parkinsons Telemonitoring, and Concrete Compressive Strength. [sent-233, score-0.657]

92 6 Conclusions and future work. We derived a new PAC-Bayesian bound that controls the convergence of averages of empirical variances of losses of hypotheses in H to averages of expected variances of losses of hypotheses in H simultaneously for all averaging distributions ρ. [sent-238, score-1.078]

93 This bound is an interesting example of a combination [sent-239, score-0.171]

96 Panel (d): d = 6, ‖w0‖0 = 3 (x-axis: sample size). Figure 2: The values of the PAC-Bayes-kl and PAC-Bayes-Empirical-Bernstein bounds together with the test and train errors on synthetic data. [sent-256, score-0.19]

97 of the PAC-Bayesian bounding technique with concentration inequalities for self-bounding functions. [sent-282, score-0.294]

98 We also demonstrated an empirical advantage of the new PAC-Bayes-Empirical-Bernstein inequality over the PAC-Bayes-kl inequality on several synthetic and real-life regression datasets. [sent-284, score-0.774]

99 Another interesting direction would be to decrease the last term in the bound in Theorem 3, as it is done in the PAC-Bayes-kl inequality. [sent-287, score-0.171]

100 This can probably be achieved by deriving a PAC-Bayes-kl inequality for the variance. [sent-288, score-0.29]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ln', 0.409), ('vn', 0.383), ('inequality', 0.29), ('seldin', 0.276), ('kl', 0.243), ('bound', 0.171), ('pb', 0.155), ('loss', 0.144), ('hw', 0.142), ('tighter', 0.138), ('variance', 0.133), ('eh', 0.112), ('hypothesis', 0.101), ('bounding', 0.101), ('uci', 0.099), ('losses', 0.096), ('empirical', 0.095), ('yevgeny', 0.094), ('nvn', 0.093), ('variances', 0.087), ('simultaneously', 0.084), ('inequalities', 0.083), ('theorem', 0.083), ('eb', 0.083), ('posterior', 0.079), ('fn', 0.077), ('seeger', 0.071), ('ball', 0.069), ('concentration', 0.068), ('expected', 0.067), ('bounds', 0.067), ('greater', 0.067), ('laviolette', 0.054), ('pacbayesian', 0.054), ('john', 0.054), ('distributions', 0.054), ('bernstein', 0.054), ('radius', 0.051), ('synthetic', 0.051), ('draw', 0.051), ('centred', 0.05), ('russian', 0.05), ('parkinsons', 0.05), ('francois', 0.05), ('xi', 0.048), ('regression', 0.048), ('maurer', 0.047), ('gibbs', 0.046), ('train', 0.045), ('yi', 0.045), ('upper', 0.043), ('technique', 0.042), ('substitution', 0.041), ('interval', 0.041), ('rule', 0.041), ('sample', 0.04), ('tightness', 0.04), ('tightest', 0.04), ('substitute', 0.04), ('nv', 0.04), ('bounded', 0.039), ('grid', 0.038), ('divergence', 0.038), ('averages', 0.036), ('ratio', 0.036), ('controls', 0.036), ('sizes', 0.036), ('australian', 0.036), ('proof', 0.035), ('illustrative', 0.035), ('shorthand', 0.035), ('repository', 0.034), ('brevity', 0.034), ('pac', 0.034), ('inferior', 0.033), ('prediction', 0.033), ('hypotheses', 0.032), ('centered', 0.032), ('resembles', 0.031), ('mcallester', 0.031), ('rarely', 0.031), ('sigmoid', 0.031), ('presenting', 0.031), ('datasets', 0.029), ('origin', 0.029), ('andreas', 0.028), ('dn', 0.028), ('risk', 0.027), ('joy', 0.027), ('osokin', 0.027), ('thankful', 0.027), ('telemonitoring', 0.027), ('laureate', 0.027), ('kullbackleibler', 0.027), ('cantly', 0.027), ('inputs', 0.027), ('holds', 0.027), ('randomized', 0.027), ('ai', 0.027), ('test', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality

Author: Ilya O. Tolstikhin, Yevgeny Seldin

Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1

2 0.25345826 324 nips-2013-The Fast Convergence of Incremental PCA

Author: Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund

Abstract: We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both. 1

3 0.14810802 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima

Author: Po-Ling Loh, Martin J. Wainwright

Abstract: We establish theoretical results concerning local optima of regularized M estimators, where both loss and penalty functions are allowed to be nonconvex. Our results show that as long as the loss satisfies restricted strong convexity and the penalty satisfies suitable regularity conditions, any local optimum of the composite objective lies within statistical precision of the true parameter vector. Our theory covers a broad class of nonconvex objective functions, including corrected versions of the Lasso for errors-in-variables linear models and regression in generalized linear models using nonconvex regularizers such as SCAD and MCP. On the optimization side, we show that a simple adaptation of composite gradient descent may be used to compute a global optimum up to the statistical precision stat in log(1/ stat ) iterations, the fastest possible rate for any first-order method. We provide simulations to illustrate the sharpness of our theoretical predictions. 1

4 0.1443319 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

Author: Justin Domke, Xianghang Liu

Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.

5 0.12937176 230 nips-2013-Online Learning with Costly Features and Labels

Author: Navid Zolghadr, Gabor Bartok, Russell Greiner, András György, Csaba Szepesvari

Abstract: This paper introduces the online probing problem: In each round, the learner is able to purchase the values of a subset of feature values. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying to see the loss function that he is evaluated against. Either way, the learner pays for both the errors of his predictions and also whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether the learner can observe the label for free or not. We provide algorithms and upper and lower bounds on the regret for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem. 1

6 0.12513146 143 nips-2013-Integrated Non-Factorized Variational Inference

7 0.12400406 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends

8 0.12235564 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

9 0.11512452 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits

10 0.10164367 325 nips-2013-The Pareto Regret Frontier

11 0.098982722 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

12 0.096224546 269 nips-2013-Regression-tree Tuning in a Streaming Setting

13 0.09472049 171 nips-2013-Learning with Noisy Labels

14 0.093110286 158 nips-2013-Learning Multiple Models via Regularized Weighting

15 0.089334808 289 nips-2013-Scalable kernels for graphs with continuous attributes

16 0.086159185 98 nips-2013-Documents as multiple overlapping windows into grids of counts

17 0.085215062 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

18 0.080423005 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

19 0.079099372 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions

20 0.077249661 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.208), (1, 0.017), (2, 0.098), (3, -0.035), (4, 0.043), (5, 0.113), (6, 0.044), (7, 0.016), (8, -0.01), (9, 0.026), (10, 0.049), (11, 0.04), (12, 0.018), (13, -0.046), (14, 0.108), (15, -0.059), (16, -0.03), (17, 0.035), (18, -0.027), (19, 0.051), (20, -0.055), (21, 0.001), (22, 0.029), (23, 0.104), (24, -0.253), (25, 0.068), (26, -0.147), (27, 0.061), (28, -0.108), (29, -0.042), (30, -0.049), (31, -0.074), (32, -0.113), (33, -0.014), (34, -0.006), (35, -0.017), (36, -0.069), (37, -0.131), (38, 0.127), (39, -0.084), (40, 0.064), (41, 0.02), (42, 0.274), (43, -0.046), (44, 0.078), (45, -0.043), (46, 0.019), (47, 0.019), (48, 0.066), (49, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96447861 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality

Author: Ilya O. Tolstikhin, Yevgeny Seldin

Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1

2 0.76396561 324 nips-2013-The Fast Convergence of Incremental PCA

Author: Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund

Abstract: We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both. 1

3 0.60362279 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

Author: Samory Kpotufe, Vikas Garg

Abstract: We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown H¨ lder-continuity of the regression function at x. The result holds with o high probability simultaneously at all points x in a general metric space X of unknown structure. 1

4 0.59829324 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends

Author: Matus Telgarsky, Sanjoy Dasgupta

Abstract: Suppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with p 4 bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as mmin{ 1/4, 1/2+2/p} . The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of k-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for k-means instances possessing some cluster structure. 1

5 0.5707348 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima

Author: Po-Ling Loh, Martin J. Wainwright

Abstract: We establish theoretical results concerning local optima of regularized M estimators, where both loss and penalty functions are allowed to be nonconvex. Our results show that as long as the loss satisfies restricted strong convexity and the penalty satisfies suitable regularity conditions, any local optimum of the composite objective lies within statistical precision of the true parameter vector. Our theory covers a broad class of nonconvex objective functions, including corrected versions of the Lasso for errors-in-variables linear models and regression in generalized linear models using nonconvex regularizers such as SCAD and MCP. On the optimization side, we show that a simple adaptation of composite gradient descent may be used to compute a global optimum up to the statistical precision stat in log(1/ stat ) iterations, the fastest possible rate for any first-order method. We provide simulations to illustrate the sharpness of our theoretical predictions. 1

6 0.4967736 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

7 0.4927597 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

8 0.48513985 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression

9 0.48398513 225 nips-2013-One-shot learning and big data with n=2

10 0.47077185 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions

11 0.46396425 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

12 0.45353183 269 nips-2013-Regression-tree Tuning in a Streaming Setting

13 0.45158243 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

14 0.45132318 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors

15 0.4479754 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits

16 0.44679096 98 nips-2013-Documents as multiple overlapping windows into grids of counts

17 0.44420573 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

18 0.43142068 252 nips-2013-Predictive PAC Learning and Process Decompositions

19 0.42966527 230 nips-2013-Online Learning with Costly Features and Labels

20 0.4273988 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.018), (16, 0.026), (33, 0.184), (34, 0.108), (41, 0.016), (46, 0.149), (49, 0.019), (56, 0.198), (70, 0.044), (79, 0.014), (85, 0.052), (89, 0.03), (93, 0.054), (95, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9362452 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

Author: Abhradeep Guha Thakurta, Adam Smith

Abstract: We give differentially private algorithms for a large class of online learning algorithms, in both the full information and bandit settings. Our algorithms aim to minimize a convex loss function which is a sum of smaller convex loss terms, one for each data point. To design our algorithms, we modify the popular mirror descent approach, or rather a variant called follow the approximate leader. The technique leads to the first nonprivate algorithms for private online learning in the bandit setting. In the full information setting, our algorithms improve over the regret bounds of previous work (due to Dwork, Naor, Pitassi and Rothblum (2010) and Jain, Kothari and Thakurta (2012)). In many cases, our algorithms (in both settings) match the dependence on the input length, T , of the optimal nonprivate regret bounds up to logarithmic factors in T . Our algorithms require logarithmic space and update time. 1

same-paper 2 0.92185074 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality

Author: Ilya O. Tolstikhin, Yevgeny Seldin

Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1

3 0.91883129 318 nips-2013-Structured Learning via Logistic Regression

Author: Justin Domke

Abstract: A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem is “smoothed” through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where an “oracle” exists to minimize a logistic loss.

4 0.88856953 249 nips-2013-Polar Operators for Structured Sparse Estimation

Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans

Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1

5 0.88792253 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

Author: Yuchen Zhang, John Duchi, Michael Jordan, Martin J. Wainwright

Abstract: We establish lower bounds on minimax risks for distributed statistical estimation under a communication budget. Such lower bounds reveal the minimum amount of communication required by any procedure to achieve the centralized minimax-optimal rates for statistical estimation. We study two classes of protocols: one in which machines send messages independently, and a second allowing for interactive communication. We establish lower bounds for several problems, including various types of location models, as well as for parameter estimation in regression models. 1

6 0.88325465 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization

7 0.88275093 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models

8 0.88201433 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

9 0.88195896 230 nips-2013-Online Learning with Costly Features and Labels

10 0.88168067 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

11 0.88051873 355 nips-2013-Which Space Partitioning Tree to Use for Search?

12 0.88039976 344 nips-2013-Using multiple samples to learn mixture models

13 0.88026953 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

14 0.8799718 224 nips-2013-On the Sample Complexity of Subspace Learning

15 0.87942123 95 nips-2013-Distributed Exploration in Multi-Armed Bandits

16 0.87867194 233 nips-2013-Online Robust PCA via Stochastic Optimization

17 0.87841117 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends

18 0.87756383 137 nips-2013-High-Dimensional Gaussian Process Bandits

19 0.87565053 60 nips-2013-Buy-in-Bulk Active Learning

20 0.87536657 247 nips-2013-Phase Retrieval using Alternating Minimization