nips nips2012 nips2012-95 knowledge-graph by maker-knowledge-mining

95 nips-2012-Density-Difference Estimation


Source: pdf

Author: Masashi Sugiyama, Takafumi Kanamori, Taiji Suzuki, Marthinus D. Plessis, Song Liu, Ichiro Takeuchi

Abstract: We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small estimation error incurred in the first stage can cause a big error in the second stage. In this paper, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a non-parametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. We then show how the proposed density-difference estimator can be utilized in L2 -distance approximation. Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. [sent-2, score-0.177]

2 However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small estimation error incurred in the first stage can cause a big error in the second stage. [sent-3, score-0.167]

3 In this paper, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. [sent-4, score-0.285]

4 We derive a non-parametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. [sent-5, score-0.121]

5 We then show how the proposed density-difference estimator can be utilized in L2 -distance approximation. [sent-6, score-0.095]

6 Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection. [sent-7, score-0.107]

7 To cope with this problem, it would be more appropriate to directly estimate the target quantity in a single-shot process without separately estimating the two elements. [sent-9, score-0.089]

8 More recently, a problem of estimating the ratio of two probability densities was tackled in a similar fashion [2, 3]: The ratio of two probability densities is directly estimated without going through separate estimation of the two probability densities. [sent-11, score-0.237]

9 In this paper, we further explore this line of research, and propose a method for directly estimating the difference between two probability densities in a single-shot process. [sent-12, score-0.142]

10 Density differences would be more desirable than density ratios because density ratios can diverge to infinity even under a mild condition (e. [sent-13, score-0.252]

11 , two Gaussians [4]), whereas density differences are always finite as long as each density is bounded. [sent-15, score-0.218]

12 Density differences can be used for solving various machine learning tasks such as class-balance estimation under class-prior change [5] and change-point detection in time series [6]. [sent-16, score-0.142]

13 For this density-difference estimation problem, we propose a single-shot method, called the least-squares density-difference (LSDD) estimator, that directly estimates the density difference without separately estimating two densities. [sent-17, score-0.32]

14 We derive a finite-sample error bound for the LSDD estimator and show that it achieves the optimal convergence rate in a non-parametric setup. [sent-19, score-0.156]

15 We then apply LSDD to L2 -distance estimation and show that it is more accurate than the difference of KDEs, which tends to severely under-estimate the L2 -distance [7]. [sent-20, score-0.164]

16 Because the L2 -distance is more robust against outliers than the Kullback-Leibler divergence [8], the proposed L2 -distance estimator can lead to the paradigm of robust distribution comparison. [sent-21, score-0.112]

17 We experimentally demonstrate the usefulness of LSDD in semi-supervised class-prior estimation and unsupervised change detection. [sent-22, score-0.151]

18 2 Density-Difference Estimation In this section, we propose a single-shot method for estimating the difference between two probability densities from samples, and analyze its theoretical properties. [sent-23, score-0.142]

19 Problem Formulation and Naive Approach: First, we formulate the problem of density-difference estimation. [sent-24, score-0.063]

20 Suppose that we are given two sets of independent and identically distributed samples X := {x_i}_{i=1}^{n} and X' := {x'_i}_{i'=1}^{n'} from probability distributions on R^d with densities p(x) and p'(x), respectively. [sent-25, score-0.075]

21 Our goal is to estimate the density difference, f(x) := p(x) − p'(x), from the samples X and X'. [sent-26, score-0.129]

22 A naive approach to density-difference estimation is to use kernel density estimators (KDEs). [sent-27, score-0.288]

23 However, we argue that the KDE-based density-difference estimator is not the best approach because of its two-step nature. [sent-28, score-0.095]

24 Intuitively, good density estimators tend to be smooth and thus the difference between such smooth density estimators tends to be over-smoothed as a density-difference estimator [9]. [sent-29, score-0.486]
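
As a concrete illustration of this two-step baseline, here is a minimal sketch that estimates the two densities separately with kernel density estimators and then subtracts them; the helper name and the default bandwidth rule (scipy's Scott factor) are our choices, not the paper's.

```python
# Minimal sketch of the naive two-step KDE baseline (names and bandwidth rule are ours).
import numpy as np
from scipy.stats import gaussian_kde

def kde_difference(X, Xp):
    """X, Xp: samples of shape (n, d) and (n', d). Returns a callable that
    estimates p(x) - p'(x) at query points Z of shape (m, d)."""
    kde_p = gaussian_kde(X.T)    # scipy expects data of shape (d, n)
    kde_pp = gaussian_kde(Xp.T)
    return lambda Z: kde_p(Z.T) - kde_pp(Z.T)
```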

25 To overcome this weakness, we give a single-shot procedure of directly estimating the density difference f(x) without separately estimating the densities p(x) and p'(x). [sent-30, score-0.34]

26 Least-Squares Density-Difference Estimation: In our proposed approach, we fit a density-difference model g(x) to the true density-difference function f(x) under the squared loss: argmin_g ∫ (g(x) − f(x))² dx. [sent-31, score-0.151]

27 We use the following Gaussian kernel model as g(x): g(x) = Σ_{ℓ=1}^{n+n'} θ_ℓ exp(−‖x − c_ℓ‖²/(2σ²)),  (1)  where (c_1, . . .

28 Finally, a density-difference estimator f̂(x), which we call the least-squares density-difference (LSDD) estimator, is given as f̂(x) = Σ_{ℓ=1}^{n+n'} θ̂_ℓ exp(−‖x − c_ℓ‖²/(2σ²)). [sent-56, score-0.115]
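
The following sketch shows one way the LSDD fit could be implemented. It assumes the regularized least-squares solution θ̂ = (H + λI)⁻¹ ĥ, with H computed in closed form for the Gaussian kernel model (1) and ĥ built from empirical kernel means; this reading of the garbled formulas, and all names (lsdd_fit, sigma, lam), are ours rather than the paper's exact estimator.

```python
# Hedged sketch of an LSDD fit under the reading described above; not the paper's code.
import numpy as np

def _sq_dists(A, B):
    # pairwise squared Euclidean distances between rows of A and rows of B
    return np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)

def lsdd_fit(X, Xp, sigma=1.0, lam=1e-3):
    C = np.vstack([X, Xp])                      # Gaussian centers c_1, ..., c_{n+n'}
    d = C.shape[1]
    # Assumed closed form: H[l, l'] = (pi sigma^2)^(d/2) exp(-||c_l - c_l'||^2 / (4 sigma^2))
    H = (np.pi * sigma ** 2) ** (d / 2) * np.exp(-_sq_dists(C, C) / (4 * sigma ** 2))
    # h[l] = (1/n) sum_i k(x_i, c_l) - (1/n') sum_j k(x'_j, c_l)
    Phi_X = np.exp(-_sq_dists(X, C) / (2 * sigma ** 2))
    Phi_Xp = np.exp(-_sq_dists(Xp, C) / (2 * sigma ** 2))
    h = Phi_X.mean(axis=0) - Phi_Xp.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(C)), h)   # ridge-type analytic solution
    def f_hat(Z):
        # density-difference estimate at the rows of Z
        return np.exp(-_sq_dists(Z, C) / (2 * sigma ** 2)) @ theta
    return theta, H, h, f_hat
```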

29 Non-Parametric Error Bound: Here, we theoretically analyze an estimation error of LSDD. [sent-57, score-0.083]

30 We assume n' = n, and let Hγ be the reproducing kernel Hilbert space (RKHS) corresponding to the Gaussian kernel with width γ: kγ(x, x') = exp(−‖x − x'‖²/γ²). [sent-58, score-0.164]

31 Let us consider a slightly modified LSDD estimator that is more suitable for non-parametric error analysis¹: f̂ := argmin_{g∈Hγ} [ ‖g‖²_{L2(Rd)} − 2( (1/n) Σ_{i=1}^{n} g(x_i) − (1/n) Σ_{i=1}^{n} g(x'_i) ) + λ‖g‖²_{Hγ} ]. [sent-59, score-0.164]

32 Suppose also that the density difference f = p − p' is a member of a Besov space with regularity α. [sent-62, score-0.191]

33 That is, f ∈ B^α_{2,∞}, where B^α_{2,∞} is the Besov space with regularity α, and ‖f‖_{B^α_{2,∞}} := ‖f‖_{L2(Rd)} + sup_{t>0} ( t^{−α} ω_{r,L2(Rd)}(f, t) ) < c for r = ⌊α⌋ + 1, where ⌊α⌋ denotes the largest integer less than or equal to α and ω_{r,L2(Rd)} is the r-th modulus of smoothness (see [10] for the definitions). [sent-63, score-0.046]

34 Then, for all ρ, ρ' > 0, there exists a constant K > 0 depending on M, c, ρ, and ρ' such that, for all n ≥ 1 and τ ≥ 1, the density-difference estimator f̂ with an appropriate choice of γ and λ satisfies ‖f̂ − f‖²_{L2(Rd)} + λ‖f̂‖²_{Hγ} ≤ K( n^{−2α/(2α+d)+ρ} + τ n^{−1+ρ'} ) with probability not less than 1 − 4e^{−τ}. [sent-68, score-0.095]

35 ¹ More specifically, the regularizer is changed from the squared ℓ2-norm of the parameters to the squared RKHS-norm of the learned function, which is necessary to establish consistency. [sent-69, score-0.072]

36 Therefore, the densitydifference estimator with a Gaussian kernel achieves the optimal learning rate by appropriately choosing the regularization parameter and the Gaussian width. [sent-72, score-0.279]

37 Because the learning rate depends on α, the LSDD estimator has adaptivity to the smoothness of the true function. [sent-73, score-0.154]

38 It is known that, if the naive KDE with a Gaussian kernel is used for estimating a probability density with regularity α > 2, the optimal learning rate cannot be achieved [11, 12]. [sent-74, score-0.333]

39 To achieve the optimal rate by KDE, we should choose a kernel function specifically tailored to each regularity α [13]. [sent-75, score-0.14]

40 However, such a kernel function is not non-negative, and it is difficult to implement in practice. [sent-76, score-0.059]

41 On the other hand, our LSDD estimator can always achieve the optimal learning rate for a Gaussian kernel without regard to regularity α. [sent-77, score-0.258]

42 , the kernel width σ and the regularization parameter λ). [sent-81, score-0.112]

43 , all samples without X_t and X'_t), and compute its hold-out error for X_t and X'_t as CV(t) := ∫ f̂_t(x)² dx − (2/|X_t|) Σ_{x∈X_t} f̂_t(x) + (2/|X'_t|) Σ_{x'∈X'_t} f̂_t(x'), where |X| denotes the number of elements in the set X. [sent-86, score-0.207]
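
A hedged sketch of how this hold-out criterion might be computed and averaged over folds; the fold-splitting details and the `fit` interface (returning theta, H, h, f_hat as in the lsdd_fit sketch above) are assumptions of ours, not the paper's exact protocol.

```python
# Hedged sketch of cross-validation with the hold-out criterion CV(t); interface assumed.
import numpy as np

def lsdd_cv(X, Xp, fit, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds_X = np.array_split(rng.permutation(len(X)), n_folds)
    folds_Xp = np.array_split(rng.permutation(len(Xp)), n_folds)
    scores = []
    for t in range(n_folds):
        tr_X = np.setdiff1d(np.arange(len(X)), folds_X[t])
        tr_Xp = np.setdiff1d(np.arange(len(Xp)), folds_Xp[t])
        theta, H, _, f_hat = fit(X[tr_X], Xp[tr_Xp])
        quad = theta @ H @ theta   # integral of f_hat_t(x)^2 for the Gaussian-kernel model
        cv_t = quad - 2.0 * f_hat(X[folds_X[t]]).mean() + 2.0 * f_hat(Xp[folds_Xp[t]]).mean()
        scores.append(cv_t)
    return float(np.mean(scores))
```

Candidate (σ, λ) pairs could then be ranked by, e.g., lsdd_cv(X, Xp, lambda A, B: lsdd_fit(A, B, sigma, lam)) and the minimizer selected.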

44 For an equivalent expression L2(p, p') = ∫ f(x)p(x)dx − ∫ f(x')p'(x')dx', if we replace f(x) with an LSDD estimator f̂(x) and approximate the expectations by empirical averages, we obtain L2(p, p') ≈ ĥ⊤θ̂. [sent-93, score-0.112]

45 Similarly, for another expression L2(p, p') = ∫ f(x)² dx, replacing f(x) with an LSDD estimator f̂(x) gives L2(p, p') ≈ θ̂⊤Hθ̂. [sent-94, score-0.095]

46 To explain the reason, let us consider a generalized L2-distance estimator of the form βĥ⊤θ̂ + (1 − β)θ̂⊤Hθ̂, where β is a real scalar. [sent-96, score-0.095]

47 If the regularization parameter λ (≥ 0) is small, this can be expressed as βĥ⊤θ̂ + (1 − β)θ̂⊤Hθ̂ = ĥ⊤H⁻¹ĥ − λ(2 − β)ĥ⊤H⁻²ĥ + o_p(λ),  (4)  where o_p denotes the probabilistic order. [sent-97, score-0.075]

48 This can be naturally interpreted through a lower bound of L2(p, p') obtained by Legendre-Fenchel convex duality [14]: L2(p, p') = sup_g [ 2( ∫ g(x)p(x)dx − ∫ g(x')p'(x')dx' ) − ∫ g(x)² dx ], where the supremum is attained at g = f. [sent-112, score-0.068]
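
To make the two plug-in expressions and the generalized combination concrete, here is a short sketch; the specific choice β = 2, which makes the λ-term of Eq. (4) vanish, is our reading of the garbled text rather than a confirmed recommendation from the paper.

```python
# Hedged sketch of LSDD-based L2-distance approximations; beta = 2 is our assumed choice.
import numpy as np

def l2_distance_estimates(theta, H, h, beta=2.0):
    plug_in_h = h @ theta            # from L2(p, p') = int f(x)p(x)dx - int f(x')p'(x')dx'
    plug_in_q = theta @ H @ theta    # from L2(p, p') = int f(x)^2 dx
    combined = beta * plug_in_h + (1.0 - beta) * plug_in_q  # beta = 2 cancels the O(lambda) term in Eq. (4)
    return plug_in_h, plug_in_q, combined
```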

49 If the expectations are replaced by empirical estimators and the Gaussian kernel model (1) is used as g, the above optimization problem is reduced to the LSDD objective function without regularization (see Eq. [sent-113, score-0.133]

50 4 Experiments In this section, we experimentally demonstrate the usefulness of LSDD. [sent-119, score-0.05]

51 Illustration: Let N (x; μ, Σ) be the multi-dimensional normal density with mean vector μ and variance-covariance matrix Σ with respect to x, and let p(x) = N (x; (μ, 0, . [sent-125, score-0.109]

52 We compare LSDD with KDEi (KDE with two Gaussian widths chosen independently by least-squares cross-validation [15]) and KDEj (KDE with two Gaussian widths chosen jointly to minimize the LSDD criterion [9]). [sent-133, score-0.044]

53 Figure 1 depicts density-difference estimation results obtained by LSDD, KDEi, and KDEj for μ = 0 (i. [sent-135, score-0.079]

54 The figure shows that LSDD and KDEj give accurate estimates of the density difference f (x) = 0. [sent-138, score-0.193]

55 On the other hand, the estimate obtained by KDEi is rather fluctuated, although both densities are reasonably well approximated by KDEs. [sent-139, score-0.055]

56 This illustrates an advantage of directly estimating the density difference without going through separate estimation of each density. [sent-140, score-0.253]

57 KDEi and KDEj give the same estimation result for this dataset, which slightly underestimates the peaks. [sent-145, score-0.128]

58 Figure 3 depicts the mean and standard error of estimated L2 -distances over 1000 runs as functions of mean μ. [sent-152, score-0.067]

59 When d = 1 (Figure 3(a)), the LSDD-based L2 -distance estimator gives the most accurate estimates of the true L2 -distance, whereas the KDEi-based L2 -distance estimator slightly underestimates the true L2 -distance when μ is large. [sent-153, score-0.357]

60 This is caused by the fact that KDE tends to provide smooth density estimates (see Figure 2(b) again): Such smooth density estimates are accurate as density estimates, but the difference of smooth density estimates yields a small L2 distance estimate [7]. [sent-154, score-0.675]

61 The KDEj-based L2 -distance estimator tends to improve this drawback of KDEi, but it still slightly underestimates the true L2 -distance when μ is large. [sent-155, score-0.227]

62 When d = 5 (Figure 3(b)), the KDE-based L2 -distance estimators even severely underestimate the true L2 -distance when μ is large. [sent-156, score-0.069]

63 On the other hand, the LSDD-based L2 -distance estimator still gives reasonably accurate estimates of the true L2 -distance even when d = 5. [sent-157, score-0.167]

64 However, we note that LSDD also slightly underestimates the true L2 -distance when μ is large, because slight underestimation tends to yield smaller variance and thus such stabilized solutions are more accurate in terms of the bias-variance trade-off. [sent-158, score-0.151]

65 Semi-Supervised Class-Balance Estimation: In real-world pattern recognition tasks, changes in class balance between the training and test phases are often observed. [sent-159, score-0.074]

66 Panel (c) KDEj. Figure 1: Estimation of density difference when μ = 0. [sent-193, score-0.074]

67 Panels (a) LSDD, (b) KDEi, (c) KDEj. Figure 2: Estimation of density difference when μ = 0. [sent-217, score-0.145]

68 Panel (b) d = 5. Figure 3: L2 -distance estimation by LSDD, KDEi, and KDEj for n = n' = 200 as functions of the Gaussian mean μ. [sent-251, score-0.057]

69 training produces significant estimation bias because the class balance in the training dataset does not properly reflect that of the test dataset. [sent-253, score-0.154]

70 Our goal is to learn the class balance of a test dataset in a semi-supervised learning setup where unlabeled test samples are provided in addition to labeled training samples [16]. [sent-255, score-0.137]

71 The class balance in the test set can be estimated by matching a mixture of class-wise training input densities, q_test(x; π) := π p_train(x|y = +1) + (1 − π) p_train(x|y = −1), to the test input density p_test(x) [5], where π ∈ [0, 1] is a mixing coefficient to learn. [sent-256, score-0.282]

72 Here, we use the L2 -distance estimated by LSDD and the difference of KDEs for this distribution matching. [sent-258, score-0.055]

73 Note that, when LSDD is used to estimate the L2 -distance, separate estimation of p_train(x|y = ±1) is not involved, but the difference between p_test(x) and q_test(x; π) is directly estimated. [sent-259, score-0.252]
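
One way to realize this matching is a simple grid search over π, comparing the test sample against a π-weighted resample of the class-wise training sets with an L2-distance estimator; the resampling trick and all names below are our simplification, not necessarily the paper's exact procedure.

```python
# Hedged sketch of semi-supervised class-balance estimation by L2-distance matching.
import numpy as np

def estimate_class_prior(X_pos, X_neg, X_test, l2_estimate, grid=None, seed=0):
    """l2_estimate(A, B) is assumed to return an estimate of the L2 distance
    between the densities underlying samples A and B (e.g., LSDD-based)."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    n = len(X_test)
    best_pi, best_d = None, np.inf
    for pi in grid:
        n_pos = int(round(pi * n))
        idx_pos = rng.choice(len(X_pos), size=n_pos, replace=True)
        idx_neg = rng.choice(len(X_neg), size=n - n_pos, replace=True)
        X_mix = np.vstack([X_pos[idx_pos], X_neg[idx_neg]])  # crude sample from q_test(.; pi)
        d = l2_estimate(X_mix, X_test)
        if d < best_d:
            best_pi, best_d = pi, d
    return best_pi
```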

74 edu/ml/), where we randomly choose 10 labeled training samples from each class and 50 unlabeled test samples following true class-prior π ∗ = 0. [sent-263, score-0.079]

75 Figure 6 plots the mean and standard error of the squared difference between the true and estimated class balances, and the misclassification error by a weighted ℓ2-regularized least-squares classifier [17] with weighted cross-validation [18], over 1000 runs. [sent-270, score-0.167]

76 The results show that LSDD tends to provide better class-balance estimates than the KDEi-based, the KDEj-based, and the EM-based methods [5], which are translated into lower classification errors. [sent-271, score-0.066]

77 6 Unsupervised Change Detection: The objective of change detection is to discover abrupt property changes behind time-series data. [sent-272, score-0.085]

78 Let Y(t) be a set of r retrospective subsequence samples starting at time t: Y(t) := {Y (t), Y (t + 1), . [sent-278, score-0.056]

79 Our strategy is to compute a certain dissimilarity measure between two consecutive segments Y(t) and Y(t+r), and use it as the plausibility of change points (see Figure 5). [sent-282, score-0.062]

80 As a dissimilarity measure, we use the L2 -distance estimated by LSDD and the Kullback-Leibler (KL) divergence estimated by the KL importance estimation procedure (KLIEP) [2, 3]. [sent-283, score-0.128]

81 The top graphs in Figure 7(a) display the original time-series (true change points were manually annotated) and change scores obtained by KLIEP and LSDD. [sent-291, score-0.121]
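
A hedged sketch of the sliding-window change score described above: each segment set stacks r consecutive length-k subsequences, and consecutive segment sets are compared with a sample-based dissimilarity (e.g., the LSDD-based L2 distance or the KLIEP-based KL divergence); the window handling and names are our assumptions.

```python
# Hedged sketch of sliding-window change scoring over a 1-D series; settings are ours.
import numpy as np

def change_scores(series, dissimilarity, k=10, r=50):
    """series: 1-D array. Y(t) is the length-k subsequence starting at t;
    the segment set stacks Y(t), ..., Y(t+r-1) as r samples in R^k."""
    def segment(t):
        return np.stack([series[s:s + k] for s in range(t, t + r)])
    n_scores = len(series) - (2 * r + k - 1)
    return np.array([dissimilarity(segment(t), segment(t + r)) for t in range(n_scores)])
```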

82 The graphs show that the LSDD-based change score indicates the existence of change points more clearly than the KLIEP-based change score. [sent-292, score-0.175]

83 The top graphs in Figure 7(b) display the original time-series for a sequence of actions “jog”, “stay”, “stair down”, “stay”, and “stair up” (there exists 4 change points at time 540, 1110, 1728, and 2286) and the change scores obtained by KLIEP and LSDD. [sent-298, score-0.121]

84 The results show that LSDD tends to outperform other methods and is comparable to state-of-the-art native change-detection methods. [sent-307, score-0.058]

85 5 Conclusions In this paper, we proposed a method for directly estimating the difference between two probability density functions without density estimation. [sent-308, score-0.305]

86 The proposed method, called the least-squares density-difference (LSDD), was derived within the framework of kernel least-squares estimation, and its solution can be computed analytically in a computationally efficient and stable manner. [sent-309, score-0.139]

87 Furthermore, LSDD is equipped with cross-validation, and thus all tuning parameters such as the kernel width and the regularization parameter can be systematically and objectively optimized. [sent-310, score-0.18]

88 We also proposed an L2 -distance estimator based on LSDD, which nicely cancels a bias caused by regularization. [sent-312, score-0.095]

89 Through experiments on class-prior estimation and change-point detection, the usefulness of the proposed LSDD was demonstrated. [sent-313, score-0.088]

90 Acknowledgments: We would like to thank Wittawat Jitkrittum for his comments and Zaïd Harchaoui for providing us a program code of kernel change-point detection. [sent-314, score-0.076]

91 [Figure residue: y-axis label "Class balance squared error" repeated across panels; legend entries LSDD, KDEi.] [sent-373, score-0.363]

92 [Figure residue: schematic labels Y(t), Y(t+1), ..., Y(t+r−1), y(t), y(t+r), p_test(x), p_train(x|y = −1); axis label "Class balance squared error".] [sent-376, score-0.169]

93 Panels (c) German dataset, (d) Statlogheart dataset. Figure 6: Results of semi-supervised class-balance estimation. [sent-389, score-0.046]

94 [Figure residue: KLIEP score and LSDD score panels plotted against Time.] [sent-395, score-0.104]

95 Panels (a) Speech data, (b) Accelerometer data. Figure 7: Results of unsupervised change detection. [sent-461, score-0.024]

96 Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. [sent-508, score-0.218]

97 Robust and efficient estimation by minimising a density power divergence. [sent-518, score-0.166]

98 On the best obtainable asymptotic rates of convergence in estimation of a density function at a point. [sent-534, score-0.166]

99 On the estimation of a probability density function and mode. [sent-543, score-0.166]

100 A unifying framework for detecting outliers and change points from non-stationary time series data. [sent-580, score-0.061]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lsdd', 0.825), ('kdei', 0.238), ('kdej', 0.206), ('kliep', 0.19), ('density', 0.109), ('kde', 0.096), ('estimator', 0.095), ('kcp', 0.079), ('kfd', 0.079), ('ptrain', 0.079), ('mext', 0.077), ('sst', 0.07), ('dx', 0.068), ('densitydifference', 0.063), ('misclassification', 0.063), ('balance', 0.059), ('kernel', 0.059), ('estimation', 0.057), ('underestimates', 0.056), ('densities', 0.055), ('auc', 0.054), ('japan', 0.051), ('estimating', 0.051), ('hasc', 0.048), ('ptest', 0.048), ('regularity', 0.046), ('kakenhi', 0.044), ('change', 0.044), ('xt', 0.043), ('kdes', 0.042), ('detection', 0.041), ('ar', 0.04), ('separately', 0.038), ('tends', 0.037), ('subsequence', 0.036), ('squared', 0.036), ('difference', 0.036), ('rate', 0.035), ('roc', 0.034), ('naive', 0.033), ('besov', 0.032), ('censrec', 0.032), ('nagoya', 0.032), ('qtest', 0.032), ('usefulness', 0.031), ('ft', 0.031), ('estimators', 0.03), ('estimates', 0.029), ('systematically', 0.029), ('stair', 0.028), ('argmin', 0.028), ('regularization', 0.027), ('rd', 0.027), ('score', 0.026), ('error', 0.026), ('kawahara', 0.026), ('width', 0.026), ('cn', 0.025), ('op', 0.024), ('se', 0.024), ('objectively', 0.024), ('true', 0.024), ('regard', 0.023), ('dataset', 0.023), ('svm', 0.023), ('si', 0.023), ('gaussian', 0.022), ('widths', 0.022), ('sugiyama', 0.022), ('depicts', 0.022), ('native', 0.021), ('exp', 0.02), ('stage', 0.02), ('samples', 0.02), ('smooth', 0.02), ('tokyo', 0.019), ('accurate', 0.019), ('experimentally', 0.019), ('hall', 0.019), ('estimated', 0.019), ('dissimilarity', 0.018), ('covariate', 0.018), ('false', 0.017), ('ratios', 0.017), ('cv', 0.017), ('outliers', 0.017), ('expectations', 0.017), ('analytically', 0.017), ('program', 0.017), ('graphs', 0.017), ('xi', 0.016), ('misclassi', 0.016), ('display', 0.016), ('stay', 0.016), ('class', 0.015), ('incurred', 0.015), ('slightly', 0.015), ('tuning', 0.015), ('severely', 0.015), ('importance', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 95 nips-2012-Density-Difference Estimation

Author: Masashi Sugiyama, Takafumi Kanamori, Taiji Suzuki, Marthinus D. Plessis, Song Liu, Ichiro Takeuchi

Abstract: We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small estimation error incurred in the first stage can cause a big error in the second stage. In this paper, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a non-parametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. We then show how the proposed density-difference estimator can be utilized in L2 -distance approximation. Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection.

2 0.10228717 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

3 0.06337864 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

Author: Han Liu, Larry Wasserman, John D. Lafferty

Abstract: We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. 1

4 0.058342852 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

Author: Azadeh Khaleghi, Daniil Ryabko

Abstract: The problem of multiple change point estimation is considered for sequences with unknown number of change points. A consistency framework is suggested that is suitable for highly dependent time-series, and an asymptotically consistent algorithm is proposed. In order for the consistency to be established the only assumption required is that the data is generated by stationary ergodic time-series distributions. No modeling, independence or parametric assumptions are made; the data are allowed to be dependent and the dependence can be of arbitrary form. The theoretical results are complemented with experimental evaluations. 1

5 0.049642999 16 nips-2012-A Polynomial-time Form of Robust Regression

Author: Ozlem Aslan, Dale Schuurmans, Yao-liang Yu

Abstract: Despite the variety of robust regression methods that have been developed, current regression formulations are either NP-hard, or allow unbounded response to even a single leverage point. We present a general formulation for robust regression—Variational M-estimation—that unifies a number of robust regression methods while allowing a tractable approximation strategy. We develop an estimator that requires only polynomial-time, while achieving certain robustness and consistency guarantees. An experimental evaluation demonstrates the effectiveness of the new estimation approach compared to standard methods. 1

6 0.048181869 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

7 0.044119317 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

8 0.043937795 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

9 0.04312792 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

10 0.041892413 324 nips-2012-Stochastic Gradient Descent with Only One Projection

11 0.038242251 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

12 0.038192872 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

13 0.037027918 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

14 0.036405433 37 nips-2012-Affine Independent Variational Inference

15 0.036301594 188 nips-2012-Learning from Distributions via Support Measure Machines

16 0.034965884 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

17 0.03485566 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

18 0.034668732 317 nips-2012-Smooth-projected Neighborhood Pursuit for High-dimensional Nonparanormal Graph Estimation

19 0.034037914 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

20 0.033394616 294 nips-2012-Repulsive Mixtures


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.105), (1, 0.013), (2, 0.02), (3, 0.008), (4, 0.023), (5, 0.003), (6, -0.003), (7, 0.022), (8, -0.023), (9, -0.067), (10, 0.005), (11, -0.029), (12, 0.015), (13, -0.051), (14, 0.011), (15, -0.06), (16, 0.078), (17, -0.024), (18, -0.042), (19, -0.018), (20, 0.035), (21, -0.056), (22, 0.073), (23, -0.035), (24, -0.048), (25, -0.008), (26, 0.045), (27, 0.031), (28, 0.051), (29, -0.039), (30, 0.02), (31, -0.029), (32, -0.031), (33, -0.041), (34, -0.013), (35, -0.001), (36, 0.02), (37, 0.029), (38, 0.001), (39, 0.039), (40, -0.013), (41, -0.046), (42, -0.04), (43, -0.039), (44, 0.025), (45, -0.021), (46, -0.057), (47, -0.001), (48, 0.001), (49, -0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91425002 95 nips-2012-Density-Difference Estimation

Author: Masashi Sugiyama, Takafumi Kanamori, Taiji Suzuki, Marthinus D. Plessis, Song Liu, Ichiro Takeuchi

Abstract: We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small estimation error incurred in the first stage can cause a big error in the second stage. In this paper, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a non-parametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. We then show how the proposed density-difference estimator can be utilized in L2 -distance approximation. Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection.

2 0.83866823 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

3 0.76481253 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

Author: Han Liu, Larry Wasserman, John D. Lafferty

Abstract: We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. 1

4 0.71648943 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

Author: Ping Li, Cun-hui Zhang

Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.

5 0.70672297 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch

Abstract: We propose an efficient, generalized, nonparametric, statistical KolmogorovSmirnov test for detecting distributional change in high-dimensional data. To implement the test, we introduce a novel, hierarchical, minimum-volume sets estimator to represent the distributions to be tested. Our work is motivated by the need to detect changes in data streams, and the test is especially efficient in this context. We provide the theoretical foundations of our test and show its superiority over existing methods. 1

6 0.67857319 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

7 0.6172511 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

8 0.60462511 222 nips-2012-Multi-Task Averaging

9 0.59001756 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

10 0.55283636 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

11 0.50844866 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

12 0.4905903 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

13 0.48547083 16 nips-2012-A Polynomial-time Form of Robust Regression

14 0.48077315 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

15 0.47985989 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

16 0.4739444 317 nips-2012-Smooth-projected Neighborhood Pursuit for High-dimensional Nonparanormal Graph Estimation

17 0.46227583 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

18 0.46003494 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

19 0.43506598 305 nips-2012-Selective Labeling via Error Bound Minimization

20 0.43447566 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.022), (17, 0.013), (21, 0.018), (38, 0.104), (42, 0.028), (54, 0.015), (55, 0.394), (74, 0.041), (76, 0.148), (80, 0.062), (92, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87047571 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1

2 0.86901265 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

Author: Francisco Ruiz, Isabel Valera, Carlos Blanco, Fernando Pérez-Cruz

Abstract: The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) database contains a large amount of information, regarding the way of life, medical conditions, etc., of a representative sample of the U.S. population. In this paper, we are interested in seeking the hidden causes behind the suicide attempts, for which we propose to model the subjects using a nonparametric latent model based on the Indian Buffet Process (IBP). Due to the nature of the data, we need to adapt the observation model for discrete random variables. We propose a generative model in which the observations are drawn from a multinomial-logit distribution given the IBP matrix. The implementation of an efficient Gibbs sampler is accomplished using the Laplace approximation, which allows integrating out the weighting factors of the multinomial-logit likelihood model. Finally, the experiments over the NESARC database show that our model properly captures some of the hidden causes that model suicide attempts. 1

3 0.8652184 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

Author: Francesco Dinuzzo, Bernhard Schölkopf

Abstract: The representer theorem is a property that lies at the foundation of regularization theory and kernel methods. A class of regularization functionals is said to admit a linear representer theorem if every member of the class admits minimizers that lie in the finite dimensional subspace spanned by the representers of the data. A recent characterization states that certain classes of regularization functionals with differentiable regularization term admit a linear representer theorem for any choice of the data if and only if the regularization term is a radial nondecreasing function. In this paper, we extend such result by weakening the assumptions on the regularization term. In particular, the main result of this paper implies that, for a sufficiently large family of regularization functionals, radial nondecreasing functions are the only lower semicontinuous regularization terms that guarantee existence of a representer theorem for any choice of the data. 1

4 0.85524803 155 nips-2012-Human memory search as a random walk in a semantic network

Author: Joseph L. Austerweil, Joshua T. Abbott, Thomas L. Griffiths

Abstract: The human mind has a remarkable ability to store a vast amount of information in memory, and an even more remarkable ability to retrieve these experiences when needed. Understanding the representations and algorithms that underlie human memory search could potentially be useful in other information retrieval settings, including internet search. Psychological studies have revealed clear regularities in how people search their memory, with clusters of semantically related items tending to be retrieved together. These findings have recently been taken as evidence that human memory search is similar to animals foraging for food in patchy environments, with people making a rational decision to switch away from a cluster of related information as it becomes depleted. We demonstrate that the results that were taken as evidence for this account also emerge from a random walk on a semantic network, much like the random web surfer model used in internet search engines. This offers a simpler and more unified account of how people search their memory, postulating a single process rather than one process for exploring a cluster and one process for switching between clusters. 1

5 0.79846823 211 nips-2012-Meta-Gaussian Information Bottleneck

Author: Melanie Rey, Volker Roth

Abstract: We present a reformulation of the information bottleneck (IB) problem in terms of copula, using the equivalence between mutual information and negative copula entropy. Focusing on the Gaussian copula we extend the analytical IB solution available for the multivariate Gaussian case to distributions with a Gaussian dependence structure but arbitrary marginal densities, also called meta-Gaussian distributions. This opens new possibles applications of IB to continuous data and provides a solution more robust to outliers. 1

6 0.73223168 215 nips-2012-Minimizing Uncertainty in Pipelines

same-paper 7 0.72572809 95 nips-2012-Density-Difference Estimation

8 0.59606469 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

9 0.5913952 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

10 0.58907068 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

11 0.58239877 231 nips-2012-Multiple Operator-valued Kernel Learning

12 0.57629025 193 nips-2012-Learning to Align from Scratch

13 0.57154936 298 nips-2012-Scalable Inference of Overlapping Communities

14 0.57087696 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks

15 0.5649091 188 nips-2012-Learning from Distributions via Support Measure Machines

16 0.55962348 210 nips-2012-Memorability of Image Regions

17 0.55077589 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction

18 0.54875284 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines

19 0.54359096 345 nips-2012-Topic-Partitioned Multinetwork Embeddings

20 0.54046267 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task