nips nips2007 nips2007-82 knowledge-graph by maker-knowledge-mining

82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization


Source: pdf

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of f -divergences, which turns the estimation into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization XuanLong Nguyen SAMSI & Duke University Martin J. [sent-1, score-0.579]

2 Jordan UC Berkeley Abstract We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. [sent-3, score-0.504]

3 Our method is based on a variational characterization of f -divergences, which turns the estimation into a penalized convex risk minimization problem. [sent-4, score-0.326]

4 We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. [sent-5, score-0.142]

5 Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature. [sent-6, score-0.066]

6 1 Introduction. An important class of “distances” between multivariate probability distributions P and Q is the Ali-Silvey or f -divergences [1, 6]. [sent-7, score-0.049]

7 These divergences, to be defined formally in the sequel, are all of the form Dφ(P, Q) = ∫ φ(dQ/dP) dP, where φ is a convex function of the likelihood ratio. [sent-8, score-0.099]
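As a quick numerical illustration of this definition (a made-up three-point example, not taken from the paper), choosing φ(t) = −log t recovers the KL divergence:

```python
import numpy as np

# Discrete sanity check of D_phi(P, Q) = sum_x p(x) * phi(q(x)/p(x))
# with phi(t) = -log t, which recovers KL(P || Q).  (Illustrative only.)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

phi = lambda t: -np.log(t)
d_phi = np.sum(p * phi(q / p))
kl = np.sum(p * np.log(p / q))
print(d_phi, kl)   # both equal KL(P || Q)
```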

8 This family, including the Kullback-Leibler (KL) divergence and the variational distance as special cases, plays an important role in various learning problems, including classification, dimensionality reduction, feature selection and independent component analysis. [sent-9, score-0.259]

9 Our starting point is a variational characterization of f -divergences, which allows our problem to be tackled via an M -estimation procedure. [sent-15, score-0.083]

10 Specifically, the likelihood ratio function dP/dQ and the divergence functional Dφ (P, Q) can be estimated by solving a convex minimization problem over a function class. [sent-16, score-0.341]

11 In this paper, we estimate the likelihood ratio and the KL divergence by optimizing a penalized convex risk. [sent-17, score-0.403]

12 In particular, we restrict the estimate to a bounded subset of a reproducing kernel Hilbert Space (RKHS) [17]. [sent-18, score-0.099]

13 The resulting estimator is nonparametric, in that it entails no strong assumptions on the form of P and Q, except that the likelihood ratio function is assumed to belong to the RKHS. [sent-20, score-0.181]

14 The key to our analysis is a basic inequality relating a performance metric (the Hellinger distance) of our estimator to the suprema of two empirical processes (with respect to P and Q) defined on a function class of density ratios. [sent-22, score-0.205]

15 Convergence rates are then obtained using techniques for analyzing nonparametric M -estimators from empirical process theory [20]. [sent-23, score-0.074]

16 The variational representation of divergences has been derived independently and exploited by several authors [5, 11, 14]. [sent-25, score-0.138]

17 Broniatowski and Keziou [5] studied testing and estimation problems based on dual representations of f -divergences, but working in a parametric setting as opposed to the nonparametric framework considered here. [sent-26, score-0.146]

18 Another link is to the problem of estimating integral functionals of a single density, with the Shannon entropy being a well-known example, which has been studied extensively dating back to early work [9, 13] as well as the more recent work [3, 4, 12]. [sent-29, score-0.232]

19 [22] proposed an algorithm for estimating the KL divergence for continuous distributions, which exploits histogram-based estimation of the likelihood ratio by building data-dependent partitions of equivalent (empirical) Q-measure. [sent-32, score-0.324]

20 The estimator was empirically shown to outperform direct plug-in methods, but no theoretical results on its convergence rate were provided. [sent-33, score-0.128]

21 In Section 3, we describe an estimation procedure based on penalized risk minimization and the accompanying convergence rate analysis. [sent-38, score-0.237]

22 In Section 6, we illustrate the behavior of our estimator and compare it to other methods via simulations. [sent-44, score-0.084]

23 2 Background. We begin by defining f -divergences, and then provide a variational representation of the f -divergence, which we later exploit to develop an M -estimator. [sent-45, score-0.062]

24 The class of Ali-Silvey or f -divergences [6, 1] are “distances” of the form Dφ(P, Q) = ∫ p0 φ(q0/p0) dµ, (1) where φ : R → R̄ is a convex function. [sent-47, score-0.1]

25 Different choices of φ result in many divergences that play important roles in information theory and statistics, including the variational distance, Hellinger distance, KL divergence and so on (see, e. [sent-48, score-0.316]

26 As an important example, the Kullback-Leibler (KL) divergence between P and Q is given by DK(P, Q) = ∫ p0 log(p0/q0) dµ, corresponding to the choice φ(t) = −log(t) for t > 0 and +∞ otherwise. [sent-51, score-0.148]

27 Variational representation: Since φ is a convex function, by Legendre-Fenchel convex duality [16] we can write φ(u) = sup_{v∈R} (uv − φ*(v)), where φ* is the convex conjugate of φ. [sent-52, score-0.247]

28 As a result, Dφ(P, Q) = ∫ p0 sup_f (f q0/p0 − φ*(f)) dµ = sup_f [ ∫ f dQ − ∫ φ*(f) dP ], where the supremum is taken over all measurable functions f : X → R, and ∫ f dP denotes the expectation of f under distribution P. [sent-53, score-0.135]

29 Denoting by ∂φ the subdifferential [16] of the convex function φ, it can be shown that the supremum will be achieved for functions f such that q0 /p0 ∈ ∂φ∗ (f ), where q0 , p0 and f are evaluated at any x ∈ X . [sent-54, score-0.149]

30 By convex duality [16], this is true if f ∈ ∂φ(q0 /p0 ) for any x ∈ X . [sent-55, score-0.097]

31 Letting F be any class of functions from X to R, there holds: Dφ(P, Q) ≥ sup_{f∈F} [ ∫ f dQ − ∫ φ*(f) dP ], (2) with equality if F ∩ ∂φ(q0/p0) ≠ ∅. [sent-57, score-0.106]

32 The convex dual of φ is φ*(v) = sup_u (uv − φ(u)) = −1 − log(−v) if v < 0, and +∞ otherwise. [sent-59, score-0.123]

33 By Lemma 1, DK(P, Q) = sup_{f<0} [ ∫ f dQ − ∫ (−1 − log(−f)) dP ] = sup_{g>0} [ ∫ log g dP − ∫ g dQ ] + 1. (3) [sent-60, score-0.244]

34 In addition, the supremum is attained at g = p0/q0. [sent-61, score-0.092]
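As a concrete illustration of the representation (3) (not taken from the paper: the distributions, sample sizes, and function names below are invented for the example), the following sketch evaluates the variational objective at the true ratio g = p0/q0 for two univariate Gaussians and checks it against the closed-form KL divergence:

```python
import numpy as np

rng = np.random.default_rng(0)

# P = N(0, 1), Q = N(1, 2^2); closed-form KL(P || Q) for reference
mu_p, s_p, mu_q, s_q = 0.0, 1.0, 1.0, 2.0
kl_closed = np.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2) - 0.5

def log_ratio(x):
    """log g0(x) = log p0(x) - log q0(x) for the two Gaussians above."""
    lp = -0.5 * ((x - mu_p) / s_p) ** 2 - np.log(s_p)
    lq = -0.5 * ((x - mu_q) / s_q) ** 2 - np.log(s_q)
    return lp - lq

n = 200_000
x_from_p = rng.normal(mu_p, s_p, n)
x_from_q = rng.normal(mu_q, s_q, n)

# Objective of (3) at g = p0/q0:  int log g dP - int g dQ + 1  ->  KL(P || Q)
obj = log_ratio(x_from_p).mean() - np.exp(log_ratio(x_from_q)).mean() + 1.0
print(f"objective at g0: {obj:.3f}   closed-form KL: {kl_closed:.3f}")
```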

35 3 Penalized M-estimation of KL divergence and the density ratio Let X1 , . [sent-62, score-0.266]

36 Our goal is to develop an estimator of the KL divergence and the density ratio g0 = p0/q0 based on the samples {Xi}_{i=1}^n and {Yi}_{i=1}^n. [sent-75, score-0.35]

37 The variational representation in Lemma 1 motivates the following estimator of the KL divergence. [sent-76, score-0.146]

38 We then compute D̂K = sup_{g∈G} [ ∫ log g dPn − ∫ g dQn + 1 ], (4) where ∫ · dPn and ∫ · dQn denote expectations under the empirical measures Pn and Qn, respectively. [sent-78, score-0.163]

39 If the supremum is attained at gn , then gn serves as an estimator of the density ratio g0 = p0 /q0 . [sent-79, score-1.096]

40 The estimation procedure involves solving the following program: ĝn = argmin_{g∈G} [ ∫ g dQn − ∫ log g dPn + (λn/2) I²(g) ], (6) where λn > 0 is a regularization parameter. [sent-84, score-0.532]

41 The minimizing argument ĝn is plugged into (4) to obtain an estimate of the KL divergence D̂K. [sent-85, score-0.572]

42 For estimating the density ratio, various metrics are possible. [sent-87, score-0.095]

43 Viewing g0 = p0/q0 as a density function with respect to the Q measure, one useful metric is the (generalized) Hellinger distance: h²_Q(g0, g) := (1/2) ∫ (g0^{1/2} − g^{1/2})² dQ. [sent-88, score-0.072]
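For instance, this metric can be approximated by Monte Carlo when both ratio functions can be evaluated on a sample drawn from Q; the helper below is a hypothetical illustration, not part of the paper's code.

```python
import numpy as np

def hellinger_sq_mc(g0_on_q, g_on_q):
    """Monte Carlo estimate of h_Q^2(g0, g) = 0.5 * E_Q[(sqrt(g0) - sqrt(g))^2],
    given the values of g0 and g evaluated at points sampled from Q."""
    g0_on_q, g_on_q = np.asarray(g0_on_q), np.asarray(g_on_q)
    return 0.5 * np.mean((np.sqrt(g0_on_q) - np.sqrt(g_on_q)) ** 2)
```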

44 Finally, a condition on the bracket entropy of G [21]: for some 0 < γ < 2, H^B_δ(GM, L2(Q)) = O((M/δ)^γ) for any δ > 0. (10) [sent-94, score-0.06]

45 (a) Under assumptions (8), (9) and (10), and letting λn → 0 so that λn^{−1} = OP(n^{2/(2+γ)} (1 + I(g0))), then under P: hQ(g0, ĝn) = OP(λn^{1/2})(1 + I(g0)), and I(ĝn) = OP(1 + I(g0)). [sent-96, score-0.452]

46 Our algorithm involves solving program (6), for some choice of function class G. [sent-99, score-0.051]

47 In our implementation, relevant function classes are taken to be a reproducing kernel Hilbert space induced by a Gaussian kernel. [sent-100, score-0.076]

48 As a reproducing kernel Hilbert space, any function g ∈ H can be expressed as an inner product g(x) = ⟨w, Φ(x)⟩, where ‖g‖_H = ‖w‖_H. [sent-106, score-0.076]

49 A kernel used in our simulation is the Gaussian kernel: K(x, y) := exp(−‖x − y‖²/σ), where ‖·‖ is the Euclidean norm. [sent-107, score-0.057]

50 Let G := H, and let the complexity measure be I(g) = ‖g‖_H. [sent-109, score-0.066]

51 (6) becomes: min_w J := min_w (1/n) Σ_{i=1}^n ⟨w, Φ(xi)⟩ − (1/n) Σ_{j=1}^n log⟨w, Φ(yj)⟩ + (λn/2) ‖w‖²_H, (12) where {xi} and {yj} are realizations of empirical data drawn from Q and P, respectively. [sent-111, score-0.082]

52 The log function is extended to take the value −∞ for negative arguments. [sent-112, score-0.082]
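The paper works with the dual of (12), derived next; as a rough alternative, the sketch below minimizes the primal objective (6) directly, with g parameterized by a Gaussian-kernel expansion over the pooled sample so that I(g)² = cᵀKc. This is only an illustrative sketch under stated assumptions (the helper names, the nonnegativity restriction on the coefficients, the solver, and the default λ and σ are all invented here), not the authors' implementation, and it is practical only for small samples.

```python
import numpy as np
from scipy.optimize import minimize

def gauss_kernel(A, B, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||A[i] - B[j]||^2 / sigma)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def fit_ratio(x_q, y_p, lam=0.1, sigma=1.0):
    """Sketch of program (6)/(12): estimate g ~ p0/q0 from x_q ~ Q, y_p ~ P.

    x_q and y_p have shape (n, d).  g is parameterized as a kernel expansion
    g(.) = sum_k c_k K(z_k, .) over the pooled points z, with I(g)^2 = c^T K c.
    For simplicity the coefficients are kept nonnegative, which keeps g > 0
    (a further approximation relative to the paper's formulation).
    """
    z = np.vstack([x_q, y_p])
    K_zz = gauss_kernel(z, z, sigma)
    K_qz = gauss_kernel(x_q, z, sigma)      # g evaluated at the Q-sample
    K_pz = gauss_kernel(y_p, z, sigma)      # g evaluated at the P-sample

    def objective(c):
        g_q, g_p = K_qz @ c, K_pz @ c
        return g_q.mean() - np.log(g_p).mean() + 0.5 * lam * c @ K_zz @ c

    def grad(c):
        g_p = K_pz @ c
        return (K_qz.mean(axis=0)
                - K_pz.T @ (1.0 / g_p) / len(y_p)
                + lam * K_zz @ c)

    c0 = np.full(len(z), 1.0 / len(z))      # strictly positive start
    res = minimize(objective, c0, jac=grad, method="L-BFGS-B",
                   bounds=[(1e-12, None)] * len(z))
    c = res.x
    kl_hat = np.log(K_pz @ c).mean() - (K_qz @ c).mean() + 1.0   # plug into (4)
    return c, z, kl_hat
```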

53 min_w J has the following dual form: min_w J = −min_{α>0} [ Σ_{j=1}^n (−1/n − (1/n) log nαj) + (1/(2λn)) Σ_{i,j} αi αj K(yi, yj) + (1/(2λn n²)) Σ_{i,j} K(xi, xj) − (1/(λn n)) Σ_{i,j} αj K(xi, yj) ]. [sent-114, score-0.159]

54 Proof. Let ψi(w) := (1/n) ⟨w, Φ(xi)⟩, ϕj(w) := −(1/n) log⟨w, Φ(yj)⟩, and Ω(w) := (λn/2) ‖w‖²_H. [sent-115, score-0.301]

55 We have min_w J = −max_w (⟨0, w⟩ − J(w)) = −J*(0) = −min_{ui,vj} [ Σ_{i=1}^n ψi*(ui) + Σ_{j=1}^n ϕj*(vj) + Ω*(−Σ_{i=1}^n ui − Σ_{j=1}^n vj) ], where the last line is due to the inf-convolution theorem [16]. [sent-117, score-0.139]

56 Simple calculations yield: ϕj*(v) = −1/n − (1/n) log nαj if v = −αj Φ(yj), and +∞ otherwise; ψi*(u) = 0 if u = (1/n) Φ(xi), and +∞ otherwise; Ω*(v) = (1/(2λn)) ‖v‖²_H. [sent-118, score-0.082]

57 So, min_w J = −min_{α>0} [ Σ_{j=1}^n (−1/n − (1/n) log nαj) + (1/(2λn)) ‖ Σ_{j=1}^n αj Φ(yj) − (1/n) Σ_{i=1}^n Φ(xi) ‖²_H ], which implies the lemma immediately. [sent-119, score-0.163]

58 If α̂ is the solution of the dual formulation, it is not difficult to show that the optimal w is attained at ŵ = (1/λn) ( Σ_{j=1}^n α̂j Φ(yj) − (1/n) Σ_{i=1}^n Φ(xi) ). [sent-120, score-0.086]

59 For an RKHS based on a Gaussian kernel, the entropy condition (10) holds for any γ > 0 [23]. [sent-121, score-0.064]

60 Thus, by Theorem 2(a), ‖ŵ‖_H = ‖ĝn‖_H = OP(‖g0‖_H), so the penalty term λn ‖ŵ‖²_H vanishes at the same rate as λn. [sent-123, score-0.401]

61 We have arrived at the following estimator for the KL divergence: D̂K = 1 + Σ_{j=1}^n (−1/n − (1/n) log nα̂j) = (1/n) Σ_{j=1}^n −log nα̂j. [sent-124, score-0.204]
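Assuming a dual solution α̂ is available (for example from a generic convex solver applied to the dual above; the function and variable names here are illustrative, not the authors' code), the fitted ratio at new points and the KL estimate follow directly from the two displayed formulas:

```python
import numpy as np

def gauss_kernel(A, B, sigma):
    # K[i, j] = exp(-||A[i] - B[j]||^2 / sigma), as in the earlier sketch
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / sigma)

def ratio_and_kl_from_dual(alpha_hat, x_q, y_p, x_new, lam, sigma):
    """Evaluate g_hat(x_new) = <w_hat, Phi(x_new)> and the KL estimate, using
    w_hat = (1/lam) (sum_j alpha_j Phi(y_j) - (1/n) sum_i Phi(x_i)) and
    D_hat_K = (1/n) sum_j -log(n alpha_j).  Equal sample sizes are assumed."""
    g_new = (alpha_hat @ gauss_kernel(y_p, x_new, sigma)
             - gauss_kernel(x_q, x_new, sigma).mean(axis=0)) / lam
    kl_hat = np.mean(-np.log(len(alpha_hat) * alpha_hat))
    return g_new, kl_hat
```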

62 Alternatively, we could set log G to be the RKHS, letting g(x) = exp⟨w, Φ(x)⟩ and letting I(g) = ‖log g‖_H = ‖w‖_H. [sent-126, score-0.266]

63 Theorem 2 is not applicable in this case, because condition (9) no longer holds, but this choice nonetheless seems reasonable and worth investigating, because in effect we have a far richer function class which might improve the bias of our estimator when the true density ratio is not very smooth. [sent-127, score-0.248]

64 A derivation similar to the previous case yields the following convex program: min_w J := min_w (1/n) Σ_{i=1}^n e^{⟨w, Φ(xi)⟩} − (1/n) Σ_{j=1}^n ⟨w, Φ(yj)⟩ + (λn/2) ‖w‖²_H = −min_{α>0} [ Σ_{i=1}^n (αi log(nαi) − αi) + (1/(2λn)) ‖ Σ_{i=1}^n αi Φ(xi) − (1/n) Σ_{j=1}^n Φ(yj) ‖²_H ]. [sent-128, score-0.198]

65 j=1 Letting α be the solution of the above convex program, the KL divergence can be estimated by: ˆ n ˆ DK = 1 + n . [sent-129, score-0.223]

66 5 Proof of Theorem 2. We now sketch out the proof of the main theorem. [sent-130, score-0.194]

67 If ĝn is an estimate of g0 using (6), then: (1/4) h²_Q(g0, ĝn) + (λn/2) I²(ĝn) ≤ −∫ (ĝn − g0)/2 d(Qn − Q) + ∫ log((ĝn + g0)/(2g0)) d(Pn − P) + (λn/2) I²(g0). Proof. [sent-132, score-0.825]

68 Define dl(g0, g) := ∫ (g − g0) dQ − ∫ log(g/g0) dP. [sent-133, score-0.629]

69 Note that for x > 0, (1/2) log x ≤ √x − 1. [sent-134, score-0.082]

70 As a result, for any g, dl is related to hQ as follows: dl(g0, g) ≥ ∫ (g − g0) dQ − 2 ∫ (g^{1/2} g0^{−1/2} − 1) dP [sent-136, score-0.128]

71 = ∫ (g − g0) dQ − 2 ∫ (g^{1/2} g0^{1/2} − g0) dQ = ∫ (g^{1/2} − g0^{1/2})² dQ = 2 h²_Q(g0, g). [sent-137, score-0.082]

72 By the definition (6) of our estimator, we have: ∫ ĝn dQn − ∫ log ĝn dPn + (λn/2) I²(ĝn) ≤ ∫ g0 dQn − ∫ log g0 dPn + (λn/2) I²(g0). [sent-138, score-0.401]

73 Both sides (modulo the regularization term I²) are convex functionals of g. [sent-139, score-0.215]

74 By Jensen’s inequality, if F is a convex function, then F ((u + v)/2) − F (v) ≤ (F (u) − F (v))/2. [sent-140, score-0.075]

75 We obtain: ∫ (ĝn + g0)/2 dQn − ∫ log((ĝn + g0)/2) dPn + (λn/4) I²(ĝn) ≤ ∫ g0 dQn − ∫ log g0 dPn + (λn/4) I²(g0). Rearranging, ∫ (ĝn − g0)/2 d(Qn − Q) − ∫ log((ĝn + g0)/(2g0)) d(Pn − P) + (λn/4) I²(ĝn) [sent-141, score-1.85]

76 ≤ −∫ (ĝn − g0)/2 dQ + ∫ log((g0 + ĝn)/(2g0)) dP + (λn/4) I²(g0) = −dl(g0, (g0 + ĝn)/2) + (λn/4) I²(g0) ≤ −2 h²_Q(g0, (g0 + ĝn)/2) + (λn/4) I²(g0) ≤ −(1/8) h²_Q(g0, ĝn) + (λn/4) I²(g0), where the last inequality is a standard result for the (generalized) Hellinger distance (cf. [sent-142, score-1.657]

77 Let us now proceed to part (a) of the theorem. [sent-144, score-0.483]

78 Define fg := log((g + g0)/(2g0)), and let FM := {fg | g ∈ GM}. [sent-145, score-0.15]

79 Since fg is a Lipschitz function of g, conditions (8) and (10) imply that H^B_δ(FM, L2(P)) = O((M/δ)^γ). [sent-146, score-0.068]

80 By a result of [20], using the distance metric d2(g0, g) = ‖g − g0‖_{L2(Q)}, the following is true under Q (and so true under P as well, since dP/dQ is bounded from above): sup_{g∈G} |∫ (g − g0) d(Qn − Q)| / [ n^{−1/2} d2(g0, g)^{1−γ/2} (1 + I(g) + I(g0))^{γ/2} ∨ n^{−2/(2+γ)} (1 + I(g) + I(g0)) ] = OP(1). (14) [sent-148, score-0.137]

81 In the same vein, we obtain that under the P measure: sup_{g∈G} |∫ fg d(Pn − P)| / [ n^{−1/2} d2(g0, g)^{1−γ/2} (1 + I(g) + I(g0))^{γ/2} ∨ n^{−2/(2+γ)} (1 + I(g) + I(g0)) ] = OP(1). (15) [sent-149, score-0.137]

82 From (15) and (14), we obtain the following: (1/4) h²_Q(g0, ĝn) + (λn/2) I²(ĝn) ≤ λn I(g0)²/2 + OP( n^{−1/2} hQ(g0, ĝn)^{1−γ/2} (1 + I(ĝn) + I(g0))^{1/2+γ/4} ∨ n^{−2/(2+γ)} (1 + I(ĝn) + I(g0)) ). [sent-152, score-0.399]

83 To simplify notation, let ĥ = hQ(g0, ĝn), Î = I(ĝn), and I0 = I(g0). [sent-154, score-0.401]

84 Both scenarios conclude the proof if we set λn^{−1} = OP(n^{2/(γ+2)} (1 + I0)). [sent-158, score-0.05]

85 Both scenarios conclude the proof if we set λn^{−1} = OP(n^{2/(γ+2)} (1 + I0)). [sent-162, score-0.05]

86 6 Simulation results. In this section, we describe the results of various simulations that demonstrate the practical viability of our estimators, as well as their convergence behavior. [sent-172, score-0.064]

87 We experimented with our estimators using various choices of P and Q, including Gaussian, beta, mixture of Gaussians, and multivariate Gaussian distributions. [sent-173, score-0.108]

88 We use M1 to denote the method in which G is the RKHS, and M2 for the method in which log G is the RKHS. [sent-180, score-0.082]
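A minimal driver (a hypothetical setup, not the paper's experimental protocol) that exercises the two sketches above on a pair of univariate Gaussians, assuming fit_ratio and fit_log_ratio from the earlier code are in scope:

```python
import numpy as np

rng = np.random.default_rng(1)
x_q = rng.normal(1.0, 2.0, size=(100, 1))    # sample from Q = N(1, 4)
y_p = rng.normal(0.0, 1.0, size=(100, 1))    # sample from P = N(0, 1)

_, _, kl_m1 = fit_ratio(x_q, y_p, lam=0.05, sigma=1.0)      # M1 sketch
_, kl_m2 = fit_log_ratio(x_q, y_p, lam=0.05, sigma=1.0)     # M2 sketch

# Closed-form KL(P || Q) for this pair: log 2 + (1 + 1)/8 - 1/2 = 0.443
print(f"M1: {kl_m1:.3f}   M2: {kl_m2:.3f}   true: 0.443")
```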

89 Results of estimating KL divergences for various choices of probability distributions. [sent-244, score-0.156]

90 In all plots, the X-axis is the number of data points plotted on a log scale, and the Y-axis is the estimated value. [sent-245, score-0.082]

91 In the first two, our estimators M1 and M2 appear to have a faster convergence rate than WKV. [sent-255, score-0.078]

92 The WKV estimator performs very well in the third example, but rather badly in the fourth example. [sent-256, score-0.084]

93 Again, M1 has the best convergence rates in all examples. [sent-258, score-0.069]

94 The M2 estimator does not converge in the last example, suggesting that the underlying function class exhibits very strong bias. [sent-259, score-0.109]

95 The WKV methods have weak convergence rates despite different choices of the partition sizes. [sent-260, score-0.099]

96 A general class of coefficients of divergence of one distribution from another. [sent-267, score-0.173]

97 Estimating integrated squared density derivatives: Sharp best order of convergence estimates. [sent-284, score-0.089]

98 Estimation of entropy and other functionals of a multivariate density. [sent-329, score-0.204]

99 Nonparametric estimation of the likelihood ratio and divergence functionals. [sent-370, score-0.294]

100 Some inequalities for information divergence and related measures of discrimination. [sent-387, score-0.148]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wkv', 0.55), ('gn', 0.401), ('op', 0.382), ('kl', 0.207), ('dq', 0.164), ('dpn', 0.16), ('divergence', 0.148), ('functionals', 0.14), ('dk', 0.123), ('dqn', 0.115), ('hq', 0.109), ('dp', 0.095), ('yj', 0.093), ('hellinger', 0.092), ('gm', 0.088), ('estimator', 0.084), ('log', 0.082), ('rkhs', 0.081), ('sup', 0.081), ('nguyen', 0.08), ('divergences', 0.076), ('convex', 0.075), ('ratio', 0.073), ('qn', 0.068), ('fg', 0.068), ('dl', 0.064), ('variational', 0.062), ('penalized', 0.06), ('supremum', 0.054), ('lemma', 0.052), ('letting', 0.051), ('nonparametric', 0.049), ('estimation', 0.049), ('dual', 0.048), ('pn', 0.047), ('broniatowski', 0.046), ('gdqn', 0.046), ('density', 0.045), ('convergence', 0.044), ('surrogate', 0.044), ('reproducing', 0.041), ('entropy', 0.04), ('hilbert', 0.039), ('risk', 0.038), ('attained', 0.038), ('xi', 0.037), ('kernel', 0.035), ('estimators', 0.034), ('min', 0.033), ('wang', 0.033), ('wainwright', 0.032), ('shannon', 0.032), ('replicating', 0.032), ('uc', 0.031), ('fm', 0.031), ('proof', 0.03), ('estimating', 0.03), ('ui', 0.03), ('choices', 0.03), ('minw', 0.029), ('uv', 0.029), ('distance', 0.029), ('metric', 0.027), ('der', 0.027), ('berkeley', 0.027), ('program', 0.026), ('princeton', 0.025), ('class', 0.025), ('rates', 0.025), ('vj', 0.025), ('beta', 0.025), ('likelihood', 0.024), ('multivariate', 0.024), ('van', 0.024), ('inequality', 0.024), ('derivation', 0.024), ('holds', 0.024), ('estimate', 0.023), ('integral', 0.022), ('duality', 0.022), ('simulation', 0.022), ('minimization', 0.021), ('theorem', 0.021), ('characterization', 0.021), ('noting', 0.021), ('worth', 0.021), ('plots', 0.02), ('allocated', 0.02), ('studia', 0.02), ('subdifferential', 0.02), ('decentralized', 0.02), ('argming', 0.02), ('arrived', 0.02), ('bracket', 0.02), ('bulk', 0.02), ('duke', 0.02), ('marie', 0.02), ('mator', 0.02), ('scenarios', 0.02), ('various', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of f -divergences, which turns the estimation into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature. 1

2 0.10605985 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

Author: Moulines Eric, Francis R. Bach, Zaïd Harchaoui

Abstract: We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. Asymptotic null distributions under null hypothesis are derived, and consistency against fixed alternatives is assessed. Finally, experimental evidence of the performance of the proposed approach on both artificial and real datasets is provided. 1

3 0.07934998 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely based on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning. 1

4 0.068759367 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data

Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis

Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.

5 0.063649744 101 nips-2007-How SVMs can estimate quantiles and the median

Author: Andreas Christmann, Ingo Steinwart

Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to · 1 . This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . 1

6 0.062498763 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l \alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections

7 0.060204625 99 nips-2007-Hierarchical Penalization

8 0.055510141 47 nips-2007-Collapsed Variational Inference for HDP

9 0.055227365 213 nips-2007-Variational Inference for Diffusion Processes

10 0.053999819 159 nips-2007-Progressive mixture rules are deviation suboptimal

11 0.053778201 214 nips-2007-Variational inference for Markov jump processes

12 0.050802723 43 nips-2007-Catching Change-points with Lasso

13 0.050748073 185 nips-2007-Stable Dual Dynamic Programming

14 0.050702434 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

15 0.049845267 140 nips-2007-Neural characterization in partially observed populations of spiking neurons

16 0.049619742 84 nips-2007-Expectation Maximization and Posterior Constraints

17 0.047200058 36 nips-2007-Better than least squares: comparison of objective functions for estimating linear-nonlinear models

18 0.047132373 40 nips-2007-Bundle Methods for Machine Learning

19 0.043856513 187 nips-2007-Structured Learning with Approximate Inference

20 0.042012118 61 nips-2007-Convex Clustering with Exemplar-Based Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.144), (1, 0.002), (2, -0.047), (3, 0.043), (4, 0.0), (5, -0.058), (6, -0.013), (7, -0.02), (8, -0.111), (9, 0.033), (10, 0.073), (11, 0.005), (12, 0.041), (13, 0.005), (14, 0.029), (15, -0.02), (16, 0.114), (17, 0.135), (18, -0.042), (19, -0.046), (20, 0.058), (21, -0.049), (22, 0.028), (23, -0.055), (24, 0.041), (25, 0.007), (26, -0.008), (27, -0.073), (28, 0.033), (29, -0.015), (30, -0.033), (31, 0.081), (32, 0.099), (33, 0.006), (34, 0.022), (35, 0.176), (36, -0.047), (37, 0.005), (38, 0.055), (39, -0.092), (40, -0.082), (41, -0.011), (42, 0.013), (43, 0.049), (44, 0.054), (45, -0.178), (46, 0.045), (47, -0.014), (48, -0.063), (49, -0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93631548 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of f -divergences, which turns the estimation into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature. 1

2 0.70247555 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

Author: Tony Jebara, Yingbo Song, Kapil Thadani

Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.

3 0.59748614 159 nips-2007-Progressive mixture rules are deviation suboptimal

Author: Jean-yves Audibert

Abstract: We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule g satisfies ˆ log |G| (1) ER(ˆ) ≤ ming∈G R(g) + Cst g , n where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution.This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is √ no better than Cst / n: it fails to achieve the expected Cst /n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. 1

4 0.59366125 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

Author: Tim V. Erven, Steven D. Rooij, Peter Grünwald

Abstract: Bayesian model averaging, model selection and their approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the catch-up phenomenon as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch-distribution, a modification of the Bayesian model averaging distribution. We prove that in many situations model selection and prediction based on the switch-distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efficient algorithm. 1

5 0.5880785 101 nips-2007-How SVMs can estimate quantiles and the median

Author: Andreas Christmann, Ingo Steinwart

Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to · 1 . This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . 1

6 0.55558372 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

7 0.46975017 43 nips-2007-Catching Change-points with Lasso

8 0.45830214 179 nips-2007-SpAM: Sparse Additive Models

9 0.4522379 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation

10 0.40881452 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

11 0.40675581 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data

12 0.39279747 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l \alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections

13 0.38891637 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

14 0.37998262 47 nips-2007-Collapsed Variational Inference for HDP

15 0.37534717 7 nips-2007-A Kernel Statistical Test of Independence

16 0.36370751 196 nips-2007-The Infinite Gamma-Poisson Feature Model

17 0.35966054 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

18 0.35607165 40 nips-2007-Bundle Methods for Machine Learning

19 0.355223 214 nips-2007-Variational inference for Markov jump processes

20 0.34533995 200 nips-2007-The Tradeoffs of Large Scale Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.048), (13, 0.036), (16, 0.022), (18, 0.014), (21, 0.065), (31, 0.013), (34, 0.025), (35, 0.026), (47, 0.072), (49, 0.029), (83, 0.143), (85, 0.025), (87, 0.012), (90, 0.085), (98, 0.296)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74300468 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of f -divergences, which turns the estimation into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature. 1

2 0.73163128 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

Author: Max Welling, Ian Porteous, Evgeniy Bart

Abstract: A general modeling framework is proposed that unifies nonparametric-Bayesian models, topic-models and Bayesian networks. This class of infinite state Bayes nets (ISBN) can be viewed as directed networks of ‘hierarchical Dirichlet processes’ (HDPs) where the domain of the variables can be structured (e.g. words in documents or features in images). We show that collapsed Gibbs sampling can be done efficiently in these models by leveraging the structure of the Bayes net and using the forward-filtering-backward-sampling algorithm for junction trees. Existing models, such as nested-DP, Pachinko allocation, mixed membership stochastic block models as well as a number of new models are described as ISBNs. Two experiments have been performed to illustrate these ideas. 1

3 0.68034351 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu

Abstract: We consider the problem of Support Vector Machine transduction, which involves a combinatorial problem with exponential computational complexity in the number of unlabeled examples. Although several studies are devoted to Transductive SVM, they suffer either from the high computation complexity or from the solutions of local optimum. To address this problem, we propose solving Transductive SVM via a convex relaxation, which converts the NP-hard problem to a semi-definite programming. Compared with the other SDP relaxation for Transductive SVM, the proposed algorithm is computationally more efficient with the number of free parameters reduced from O(n2 ) to O(n) where n is the number of examples. Empirical study with several benchmark data sets shows the promising performance of the proposed algorithm in comparison with other state-of-the-art implementations of Transductive SVM. 1

4 0.5589816 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1

5 0.5567894 156 nips-2007-Predictive Matrix-Variate t Models

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrixvariate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on a toy data and EachMovie dataset show a good predictive accuracy of the model. 1

6 0.55435365 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

7 0.55249119 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

8 0.55199713 187 nips-2007-Structured Learning with Approximate Inference

9 0.55182028 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

10 0.55102313 7 nips-2007-A Kernel Statistical Test of Independence

11 0.55094326 128 nips-2007-Message Passing for Max-weight Independent Set

12 0.55023271 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

13 0.54813743 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

14 0.54754102 185 nips-2007-Stable Dual Dynamic Programming

15 0.54733479 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

16 0.54687411 49 nips-2007-Colored Maximum Variance Unfolding

17 0.54678595 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

18 0.54552495 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

19 0.54475462 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

20 0.54411125 175 nips-2007-Semi-Supervised Multitask Learning