nips nips2012 nips2012-117 knowledge-graph by maker-knowledge-mining

117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation


Source: pdf

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Ensemble weighted kernel estimators for multivariate entropy estimation Kumar Sricharan, Alfred O. [sent-1, score-0.971]

2 Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. [sent-3, score-0.397]

3 Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. [sent-4, score-0.478]

4 However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. [sent-5, score-0.718]

5 In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). [sent-6, score-1.647]

6 This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. [sent-8, score-1.021]

7 1 Introduction. Non-linear entropy functionals of a multivariate density f of the form ∫ g(f(x), x) f(x) dx arise in applications including machine learning, signal processing, mathematical statistics, and statistical communication theory. [sent-9, score-0.475]
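
Two standard examples that fit this form (an illustrative aside, not text from the paper) are the Shannon differential entropy and the moment functional from which the Rényi α-entropy mentioned later is obtained:

```latex
\[
H(f) = -\int f(x)\,\log f(x)\,dx
\quad\text{(Shannon entropy: } g(f,x) = -\log f(x)\text{)},
\]
\[
G_\alpha(f) = \int f(x)^{\alpha}\,dx
\quad\text{(take } g(f,x) = f(x)^{\alpha-1}\text{)},
\qquad
h_\alpha(f) = \frac{1}{1-\alpha}\,\log G_\alpha(f).
\]
```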

8 Several estimators of entropy measures have been proposed for general multivariate densities f . [sent-16, score-0.575]

9 These include consistent estimators based on histograms [10, 2], kernel density plug-in estimators, entropic graphs [5, 20], gap estimators [24] and nearest neighbor distances [8, 18, 19]. [sent-17, score-1.058]

10 Kernel density plug-in estimators [1, 6, 11, 15, 12] are simple, easy to implement, computationally fast and therefore widely used for estimation of entropy [2, 23, 14, 4, 13]. [sent-18, score-0.701]

11 However, these estimators suffer from mean squared error (MSE) rates which typically grow with feature dimension d as O(T −γ/d ), where T is the number of samples and γ is a positive rate parameter. [sent-19, score-0.407]

12 In this paper, we propose a novel weighted ensemble kernel density plug-in estimator of entropy Ĝ_w that achieves parametric MSE rates of O(T^{-1}) when the feature density is smooth. [sent-20, score-1.409]

13 The estimator is constructed as a weighted convex combination Ĝ_w = Σ_{l∈l̄} w(l) Ĝ_{k(l)} of individual kernel density plug-in estimators Ĝ_{k(l)} with respect to the weights {w(l); l ∈ l̄}. Here, l̄ is a vector of indices {l_1, . . . , l_L}. [sent-21, score-1.32]

14 The individual kernel estimators Ĝ_{k(l)} are similar to the data-split kernel estimator of Györfi and van der Meulen [11], and have slow MSE rates of convergence of order O(T^{-1/(1+d)}). [sent-25, score-1.001]

15 It is shown that the weights {w(l); l ∈ l̄} can be chosen so as to significantly improve the rate of MSE convergence of the weighted estimator Ĝ_w. [sent-28, score-0.699]

16 In fact, our ensemble averaging method can improve the MSE of Ĝ_w to the parametric rate O(T^{-1}). [sent-29, score-0.307]

17 Birgé and Massart [3] show that for a density f in a Hölder smoothness class with s derivatives, the minimax MSE rate for estimation of a smooth functional is T^{-2γ}, where γ = min{1/2, 4s/(4s + d)}. [sent-38, score-0.297]

18 The kernel estimators proposed in this paper require higher order smoothness conditions on the density, i. [sent-40, score-0.496]

19 While there exist other estimators [17, 7] that achieve parametric MSE rates of O(1/T) when s > d/4, these estimators are more difficult to implement than kernel density estimators, which are a staple of many toolboxes in machine learning, pattern recognition, and statistics. [sent-43, score-1.004]

20 The proposed ensemble weighted estimator is a simple weighted combination of off-the-shelf kernel density estimators. [sent-44, score-1.276]

21 We formally describe the kernel plug-in entropy estimators for entropy estimation in Section 2 and discuss the MSE convergence properties of these estimators. [sent-47, score-1.03]

22 In particular, we establish that these estimators have an MSE rate which decays as O(T^{-1/(1+d)}). [sent-48, score-0.349]

23 Next, we propose the weighted ensemble of kernel entropy estimators in Section 3. [sent-49, score-1.075]

24 ...(3.4) and show that the resultant optimally weighted estimator has an MSE of O(T^{-1}). [sent-51, score-0.586]

25 We present simulation results in Section 4 that illustrate the superior performance of this ensemble entropy estimator in the context of (i) estimation of the Panter-Dite distortion-rate factor [9] and (ii) testing the probability distribution of a random sample. [sent-52, score-0.95]

26 We denote the expectation operator by the symbol E, the variance operator as V[X] = E[(X − E[X])2 ], and the bias of an estimator by B. [sent-55, score-0.494]

27 2 Entropy estimation. This paper focuses on the estimation of general non-linear functionals G(f) of d-dimensional multivariate densities f with known support S = [a, b]^d, where G(f) has the form G(f) = ∫ g(f(x), x) f(x) dµ(x) (2.1). [sent-56, score-0.254]

28 2.1 Plug-in estimators of entropy. A truncated kernel density estimator with uniform kernel is defined below. [sent-71, score-1.647]

29 Our proposed weighted ensemble method applies to other types of kernels as well but we specialize to uniform kernels as it makes the derivations clearer. [sent-72, score-0.379]

30 Define the truncated kernel bin region for each X ∈ S to be S_k(X) = {Y ∈ S : ||X − Y||_1 ≤ d_k/2}, and the volume of the truncated kernel bins to be V_k(X) = ∫_{S_k(X)} dz. [sent-74, score-0.752]

31 The truncated kernel density estimator is defined as f̂_k(X) = l_k(X)/(M V_k(X)) (2.2), where l_k(X) = Σ_{i=1}^{M} 1{X_i ∈ S_k(X)} is the number of density-estimation samples that fall in S_k(X). [sent-77, score-0.939]

32 The plug-in estimator of the density functional is constructed using a data splitting approach as follows. [sent-79, score-0.585]

33 In the first stage, we estimate the kernel density estimate ˆk at the N points {X1 , . [sent-87, score-0.384]

34 Also define a standard kernel density estimator with uniform kernel f̃_k(X), which is identical to f̂_k(X) except that the volume V_k(X) is always set to be V_k(X) = k/M. [sent-99, score-0.973]

35 The estimator G̃_k is identical to the estimator of Györfi and van der Meulen [11]. [sent-102, score-0.902]
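
A minimal sketch of the estimators defined above, written only from the definitions quoted in this summary. Everything not stated there is an assumption of the sketch: the bandwidth convention d_k = (k/M)^{1/d} (so the untruncated bin volume is k/M), an axis-aligned box neighborhood used in place of the ℓ1 ball so the clipped volume is easy to compute, a 50/50 data split, and the plug-in form Ĝ_k = (1/N) Σ_i g(f̂_k(X_i), X_i).

```python
import numpy as np

def truncated_uniform_plugin(X, g, k, a=0.0, b=1.0, frac=0.5):
    """Data-split plug-in estimate of G(f) = E[g(f(X), X)] with a truncated uniform kernel.

    Illustrative sketch only; see the lead-in for the assumptions it makes
    (box bins instead of the l1 ball, bandwidth d_k = (k/M)**(1/d), 50/50 split).
    """
    T, d = X.shape
    M = int(frac * T)                        # M samples build the density estimate
    X_M, X_N = X[:M], X[M:]                  # data split: estimate on X_M, evaluate on X_N
    dk = (k / M) ** (1.0 / d)                # kernel width: untruncated bin volume is k/M

    vals = []
    for x in X_N:
        lo = np.maximum(x - dk / 2.0, a)     # bin S_k(x), clipped to the support [a, b]^d
        hi = np.minimum(x + dk / 2.0, b)
        l_k = np.sum(np.all((X_M >= lo) & (X_M <= hi), axis=1))  # samples inside S_k(x)
        V_k = np.prod(hi - lo)               # volume of the truncated bin
        f_hat = l_k / (M * V_k)              # truncated uniform-kernel density estimate
        vals.append(g(f_hat, x))
    return float(np.mean(vals))              # plug-in estimate of G(f)

# Example: Shannon entropy with g(f, x) = -log(f) I(f > 0) + I(f = 0), as used in Section 4.2
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(2000, 3))          # f uniform on [0,1]^3, true entropy 0
    g_shannon = lambda f, x: -np.log(f) if f > 0 else 1.0
    print(truncated_uniform_plugin(X, g_shannon, k=40))
```

Fixing V_k at k/M instead of the clipped volume turns the same sketch into the standard (untruncated) estimator G̃_k described in the previous sentence.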

36 Assumptions: We make a number of technical assumptions that will allow us to obtain tight MSE convergence rates for the kernel density estimators defined above. [sent-106, score-0.697]

37 Assume that the kernel bandwidth satisfies k = k_0 M^β for any rate constant 0 < β < 1, and assume that M, N and T are linearly related through the proportionality constant α_frac with 0 < α_frac < 1, M = α_frac T and N = (1 − α_frac) T. [sent-110, score-0.717]

38 The bias of the plug-in estimators Ĝ_k, G̃_k is given by B(Ĝ_k) = Σ_{i∈I} c_{1,i} (k/M)^{i/d} + c_2/k + o((k/M)^{1/d} + 1/k), with B(G̃_k) of the same leading-order form. [sent-134, score-0.345]

39 The variance of the plug-in estimators Ĝ_k, G̃_k is given by V(Ĝ_k) = c_4/N + c_5/M + o(1/M + 1/N) and V(G̃_k) = c_4/N + c_5/M + o(1/M + 1/N). [sent-136, score-0.323]

40 Optimal MSE rate: From Theorem 1, k → ∞ and k/M → 0 are required for the estimators Ĝ_k and G̃_k to be unbiased. [sent-144, score-0.349]

41 Likewise, from Theorem 2, N → ∞ and M → ∞ are required for the variance of the estimator to converge to 0. [sent-145, score-0.443]

42 The optimal MSE for the estimators Ĝ_k and G̃_k is therefore achieved for the choice of k = O(M^{1/(1+d)}), and is given by O(T^{-2/(1+d)}). [sent-153, score-0.294]
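
A hedged reading of where that choice of k comes from, keeping only the leading i = 1 bias term from the expansion above and treating the constants as fixed:

```latex
\[
\mathrm{MSE}(\hat G_k) \;\approx\; \Bigl( c_1 \bigl(\tfrac{k}{M}\bigr)^{1/d} + \tfrac{c_2}{k} \Bigr)^{2}
\;+\; \tfrac{c_4}{N} + \tfrac{c_5}{M},
\qquad
\bigl(\tfrac{k}{M}\bigr)^{1/d} \approx \tfrac{1}{k}
\;\Longrightarrow\;
k \approx M^{1/(1+d)} .
\]
```

With this choice the squared bias is O(M^{-2/(1+d)}), and since M ∝ T while the variance terms are already O(1/T), the overall MSE is O(T^{-2/(1+d)}), which is the rate quoted in the sentence above.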

43 Our goal is to reduce the estimator MSE to O(T^{-1}). [sent-155, score-0.414]

44 3 Ensemble estimators. For a positive integer L > d, choose l̄ = {l_1, . . . , l_L} to be a set of positive real numbers and set the kernel parameters k(l) = l√M. [sent-157, score-0.294]

45 Define the weighted ensemble estimator Ĝ_w = Σ_{l∈l̄} w(l) Ĝ_{k(l)}. [sent-164, score-0.77]

46 The bias of the ensemble estimator is B(Ĝ_w) = Σ_{i∈I} c_{1,i} γ_w(i) M^{-i/(2d)} + O(1/√T). [sent-174, score-0.649]

47 ...and the Cauchy-Schwarz inequality, the entries of Σ̄_L can be bounded, and the variance of the weighted estimator Ĝ_w can then be bounded as follows: V[Ĝ_w] = V[Σ_{l∈l̄} w_l Ĝ_{k(l)}] = w′Σ̄_L w / T ≤ λ_max(Σ̄_L) ||w||_2^2 / T. (3.3) [sent-180, score-0.586]

48 We seek a weight vector w that (i) ensures that the bias of the weighted estimator is O(T^{-1/2}) and (ii) has low ℓ2 norm ||w||_2 in order to limit the contribution of the variance of the weighted estimator. [sent-182, score-0.838]
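
A minimal sketch of the weight selection this describes. Two details are assumed here rather than taken from the extracted text: the bias terms to be cancelled are indexed by i = 1, ..., d−1 with γ_w(i) = Σ_{l∈l̄} w(l) l^{i/d} (consistent with k(l) = l√M and the M^{-i/(2d)} bias expression above), and the weights are required to sum to one. Under those assumptions the sought w is the minimum-ℓ2-norm solution of a small linear system, which a pseudo-inverse returns directly; the paper itself states this as a convex optimization problem that can be solved offline.

```python
import numpy as np

def ensemble_weights(l_values, d):
    """Minimum-norm weights for the weighted ensemble estimator (illustrative sketch).

    Assumed constraints (see the lead-in): sum_l w(l) = 1 and
    gamma_w(i) = sum_l w(l) * l**(i/d) = 0 for i = 1, ..., d-1, which removes the
    slowly decaying M**(-i/(2d)) bias terms.  Among all feasible w we return the one
    with smallest ||w||_2, keeping the variance bound lambda_max * ||w||_2^2 / T small.
    """
    l = np.asarray(l_values, dtype=float)
    L = l.size
    # Constraint system C w = b: first row is the sum-to-one condition,
    # the remaining d-1 rows zero out the i/d bias exponents.
    C = np.vstack([np.ones(L)] + [l ** (i / d) for i in range(1, d)])
    b = np.zeros(d)
    b[0] = 1.0
    return np.linalg.pinv(C) @ b          # minimum-||w||_2 solution of C w = b

def weighted_ensemble(G_k_values, w):
    """Combine precomputed plug-in estimates G_hat_{k(l)} into G_hat_w = sum_l w(l) G_hat_{k(l)}."""
    return float(np.dot(w, G_k_values))

if __name__ == "__main__":
    w = ensemble_weights(l_values=[1, 2, 3, 4, 5, 6], d=4)   # L = 6 > d = 4, as required
    print(np.round(w, 3), "sum =", round(float(w.sum()), 3))
```

The resulting weights depend only on l̄ and d, not on the data or the underlying density, which is why the abstract can describe the optimization as something that can be performed offline.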

49 ...(3.4), the estimator variance V[Ĝ_{w*}] is of order O(η(d)/T). [sent-209, score-0.443]

50 While we have illustrated the weighted ensemble method only in the context of kernel estimators, this method can be applied to any general ensemble of estimators that satisfy bias and variance conditions C . [sent-211, score-1.116]

51 4 Experiments. We illustrate the superior performance of the proposed weighted ensemble estimator for two applications: (i) estimation of the Panter-Dite rate distortion factor, and (ii) estimation of entropy to test for randomness of a random sample. [sent-214, score-1.2]

52 The Panter-Dite factor corresponds to the ... [sentence interrupted by the Figure 1 axis labels and legend; the estimators compared are listed in the Figure 1 caption below] [sent-217, score-1.257]

54 From the figure, we see that the proposed weighted estimator has the fastest MSE rate of convergence wrt sample size T . [sent-220, score-0.719]


57 From the figure, we see that the MSE of the proposed weighted estimator has the slowest rate of growth with increasing dimension d. [sent-224, score-0.666]

58 Figure 1: Variation of MSE of Panter-Dite factor estimates using standard kernel plug-in estimator [12], truncated kernel plug-in estimator (2.3), [sent-225, score-1.449]

59 histogram plug-in estimator [11], k-NN estimator [19], entropic graph estimator [6,21] and the weighted ensemble estimator (3. [sent-226, score-1.735]

60 The Panter-Dite factor is directly related to the Rényi α-entropy, for which several other estimators have been proposed. [sent-230, score-0.319]

61 ...(3.4), and in addition the following popular entropy estimators: (iv) histogram plug-in estimator [10], (v) k-nearest neighbor (k-NN) entropy estimator [18] and (vi) entropic k-NN graph estimator [5, 20]. [sent-232, score-1.853]

62 Variation of MSE with sample size T: The MSE results of these different estimators are shown in Fig. [sent-238, score-0.294]

63 It is clear from the figure that the proposed ensemble estimator Ĝ_w has significantly ... [sentence interrupted by Figure 2; it continues at sentence 68 below] [sent-240, score-0.598]

64 [Figure 2 axis labels and legend: Hypothesis index; True entropy; Standard kernel plug-in estimate; Truncated kernel plug-in estimate; Weighted plug-in estimate] [sent-241, score-0.702]

65 (a) Entropy estimates for random samples corresponding to hypothesis H0 and H1. (b) Histogram envelopes of entropy estimates. [sent-270, score-0.35]

66 Figure 2: Entropy estimates using the standard kernel plug-in estimator, truncated kernel plug-in estimator and the weighted estimator, for random samples corresponding to hypothesis H0 and H1. [sent-272, score-1.239]

67 The weighted estimator provided better discrimination ability by suppressing the bias, at the cost of some additional variance. [sent-273, score-0.645]

68 faster rate of convergence while the MSE of the rest of the estimators, including the truncated kernel plug-in estimator, have similar, slow rates of convergence. [sent-274, score-0.502]

69 Variation of MSE with dimension d: The MSE results of these different estimators are shown in Fig. [sent-278, score-0.319]

70 For the standard kernel plug-in estimator and truncated kernel plug-in estimator, the MSE varies exponentially with d as expected. [sent-280, score-0.975]

71 The MSE of the histogram and k-NN estimators increase at a similar rate, indicating that these estimators suffer from the curse of dimensionality as well. [sent-281, score-0.668]

72 The MSE of the weighted estimator on the other hand increases at a slower rate, which is in agreement with our theory that the MSE is O(η(d)/T ) and observing that η(d) is an increasing function of d. [sent-282, score-0.586]

73 Also observe that the MSE of the weighted estimator is significantly smaller than the MSE of the other estimators for all dimensions d > 3. [sent-283, score-0.899]

74 4.2 Distribution testing. In this section, Shannon differential entropy is estimated using the function g(f, x) = −log(f) I(f > 0) + I(f = 0) and used as a test statistic to test for the underlying probability distribution of a random sample. [sent-285, score-0.278]

75 The true entropy, and entropy estimates using Ĝ_k, G̃_k and Ĝ_w for the cases corresponding to each of the 500 instances of hypothesis H0 and H1, are shown in Fig. [sent-294, score-0.315]

76 From this figure, it is apparent that the weighted estimator provides better discrimination ability by suppressing the bias, at the cost of some additional variance. [sent-296, score-0.645]

77 Furthermore, we quantitatively measure the discriminative ability of the different estimators using the deflection statistic ds = |µ1 − µ0|/√(σ0² + σ1²), where µ0 and σ0 (respectively µ1 and σ1) are the mean and standard deviation of the entropy estimates under hypothesis H0 (respectively H1). [sent-299, score-0.326]
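
A small self-contained sketch of how the two sets of entropy estimates can be scored; the variable names, the Gaussian placeholder data, and the simple threshold sweep are illustrative, not from the paper:

```python
import numpy as np

def deflection(stats_h0, stats_h1):
    """Deflection statistic d_s = |mu1 - mu0| / sqrt(sigma0^2 + sigma1^2)."""
    mu0, mu1 = np.mean(stats_h0), np.mean(stats_h1)
    s0, s1 = np.std(stats_h0), np.std(stats_h1)
    return abs(mu1 - mu0) / np.sqrt(s0 ** 2 + s1 ** 2)

def roc_auc(stats_h0, stats_h1):
    """Empirical ROC/AUC for the rule 'declare H1 when the statistic exceeds a threshold'."""
    thresholds = np.sort(np.concatenate([stats_h0, stats_h1]))[::-1]
    tpr = np.array([(stats_h1 > t).mean() for t in thresholds])   # true-positive rate
    fpr = np.array([(stats_h0 > t).mean() for t in thresholds])   # false-positive rate
    fpr = np.concatenate(([0.0], fpr, [1.0]))                     # close the curve at (0,0) and (1,1)
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    return float(np.trapz(tpr, fpr))                              # area under the ROC curve

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    est_h0 = rng.normal(0.00, 0.05, 500)    # placeholder entropy estimates for the 500 H0 trials
    est_h1 = rng.normal(0.10, 0.05, 500)    # placeholder entropy estimates for the 500 H1 trials
    print("deflection:", round(deflection(est_h0, est_h1), 3),
          "AUC:", round(roc_auc(est_h0, est_h1), 3))
```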


79 (a) ROC curves corresponding to entropy estimates obtained using the standard and truncated kernel plug-in estimators and the weighted estimator. [sent-324, score-1.226]


81 (b) Variation of AUC curves vs. δ (= a0 − a1, b0 − b1) corresponding to the Neyman-Pearson omniscient test and entropy estimates using the standard and truncated kernel plug-in estimator and the weighted estimator. [sent-333, score-1.284]

82 The weighted estimator uniformly outperforms the individual plug-in estimators. [sent-335, score-0.607]

83 89 for the standard kernel plug-in estimator, truncated kernel plug-in estimator and the weighted estimator respectively. [sent-340, score-1.561]

84 The receiver operating curves (ROC) for this test using these three different estimators is shown in Fig. [sent-341, score-0.317]

85 Figure 4 shows that the weighted estimator uniformly and significantly outperforms the individual plug-in estimators and is closest to the performance of the omniscient Neyman-Pearson likelihood test. [sent-351, score-0.959]

86 5 Conclusions. A novel method of weighted ensemble estimation was proposed in this paper. [sent-353, score-0.408]

87 This method combines slowly converging individual estimators to produce a new estimator with faster MSE rate of convergence. [sent-354, score-0.784]

88 In this paper, we applied weighted ensembles to improve the MSE of a set of uniform kernel density estimators with different kernel width parameters. [sent-355, score-1.048]

89 We showed by theory and in simulation that the improved ensemble estimator achieves the parametric MSE convergence rate O(T^{-1}). [sent-356, score-0.738]

90 The superior performance of the weighted ensemble entropy estimator was verified in the context of two important problems: (i) estimation of the Panter-Dite factor and (ii) non-parametric hypothesis testing. [sent-358, score-1.156]

91 A nonparametric estimation of the entropy for absolutely continuous distributions (Corresp.). [sent-362, score-0.301]

92 Selection of a MCMC simulation strategy via an entropy convergence criterion. [sent-380, score-0.259]

93 Geodesic entropic graphs for dimension and entropy estimation in manifold learning. [sent-387, score-0.389]

94 Best asymptotic normality of the kernel density entropy estimator for smooth densities. [sent-394, score-0.99]

95 Uniform in bandwidth estimation of integral functionals of the density function. [sent-401, score-0.334]

96 A new class of random vector entropy estimators and its applications in testing statistical hypotheses. [sent-409, score-0.54]

97 Density-free convergence properties of various estimators of entropy. [sent-419, score-0.33]

98 An entropy estimate based on a kernel density estimation. [sent-428, score-0.582]

99 Estimation of entropy and other functionals of a multivariate density. [sent-462, score-0.343]

100 A class of Rényi information estimators for multidimensional densities. [sent-479, score-0.294]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('estimator', 0.414), ('mse', 0.403), ('gk', 0.369), ('estimators', 0.294), ('entropy', 0.223), ('kernel', 0.202), ('plug', 0.201), ('ensemble', 0.184), ('gw', 0.182), ('weighted', 0.172), ('truncated', 0.157), ('density', 0.132), ('rac', 0.096), ('functionals', 0.092), ('entropic', 0.089), ('erent', 0.073), ('gy', 0.065), ('di', 0.062), ('nyi', 0.062), ('omniscient', 0.058), ('hypothesis', 0.057), ('rate', 0.055), ('auc', 0.055), ('roc', 0.052), ('estimation', 0.052), ('leonenko', 0.051), ('bias', 0.051), ('su', 0.05), ('parametric', 0.049), ('histogram', 0.048), ('xn', 0.048), ('derivatives', 0.044), ('supf', 0.044), ('der', 0.044), ('vk', 0.043), ('wrt', 0.042), ('functional', 0.039), ('bandwidth', 0.039), ('afin', 0.039), ('birge', 0.039), ('kopt', 0.039), ('muelen', 0.039), ('proportionality', 0.037), ('convergence', 0.036), ('null', 0.036), ('estimates', 0.035), ('dk', 0.034), ('beirlant', 0.034), ('sricharan', 0.034), ('lk', 0.034), ('rates', 0.033), ('variation', 0.033), ('curse', 0.032), ('shannon', 0.032), ('statistic', 0.032), ('ection', 0.031), ('suppressing', 0.031), ('pl', 0.03), ('realizations', 0.03), ('densities', 0.03), ('van', 0.03), ('please', 0.03), ('variance', 0.029), ('superior', 0.029), ('multivariate', 0.028), ('sk', 0.028), ('discrimination', 0.028), ('neighbor', 0.028), ('hero', 0.027), ('ii', 0.027), ('alternate', 0.026), ('nonparametric', 0.026), ('fu', 0.026), ('arbor', 0.026), ('estimate', 0.025), ('dimension', 0.025), ('factor', 0.025), ('scandinavian', 0.024), ('testing', 0.023), ('partial', 0.023), ('ensembles', 0.023), ('uniform', 0.023), ('curves', 0.023), ('michigan', 0.022), ('ann', 0.022), ('weights', 0.022), ('ll', 0.022), ('convex', 0.021), ('coded', 0.021), ('individual', 0.021), ('iii', 0.02), ('randomness', 0.019), ('smooth', 0.019), ('gure', 0.019), ('nearest', 0.019), ('observe', 0.019), ('integral', 0.019), ('slow', 0.019), ('averaging', 0.019), ('quantization', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

2 0.39465889 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

Author: Ping Li, Cun-hui Zhang

Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.

3 0.14899227 16 nips-2012-A Polynomial-time Form of Robust Regression

Author: Ozlem Aslan, Dale Schuurmans, Yao-liang Yu

Abstract: Despite the variety of robust regression methods that have been developed, current regression formulations are either NP-hard, or allow unbounded response to even a single leverage point. We present a general formulation for robust regression—Variational M-estimation—that unifies a number of robust regression methods while allowing a tractable approximation strategy. We develop an estimator that requires only polynomial-time, while achieving certain robustness and consistency guarantees. An experimental evaluation demonstrates the effectiveness of the new estimation approach compared to standard methods. 1

4 0.14622642 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

Author: Han Liu, Larry Wasserman, John D. Lafferty

Abstract: We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. 1

5 0.14044088 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

Author: Evan Archer, Il M. Park, Jonathan W. Pillow

Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions. 1

6 0.1269466 231 nips-2012-Multiple Operator-valued Kernel Learning

7 0.11567621 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

8 0.10228717 95 nips-2012-Density-Difference Estimation

9 0.091094092 200 nips-2012-Local Supervised Learning through Space Partitioning

10 0.086731337 142 nips-2012-Generalization Bounds for Domain Adaptation

11 0.084917516 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

12 0.084104925 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

13 0.082389683 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

14 0.082222998 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

15 0.077322975 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

16 0.077066906 222 nips-2012-Multi-Task Averaging

17 0.074039489 279 nips-2012-Projection Retrieval for Classification

18 0.072735876 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

19 0.071302347 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

20 0.070774175 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.175), (1, 0.037), (2, 0.051), (3, -0.057), (4, 0.037), (5, 0.06), (6, 0.011), (7, 0.136), (8, -0.056), (9, -0.144), (10, -0.028), (11, -0.048), (12, 0.065), (13, -0.164), (14, 0.006), (15, -0.22), (16, 0.227), (17, -0.031), (18, -0.069), (19, -0.057), (20, 0.095), (21, -0.022), (22, 0.287), (23, -0.016), (24, -0.059), (25, -0.03), (26, 0.17), (27, 0.052), (28, 0.192), (29, -0.024), (30, 0.119), (31, -0.074), (32, 0.083), (33, 0.047), (34, -0.005), (35, -0.081), (36, 0.115), (37, 0.128), (38, -0.087), (39, 0.011), (40, 0.055), (41, -0.113), (42, -0.032), (43, -0.036), (44, 0.072), (45, -0.002), (46, -0.09), (47, -0.016), (48, 0.048), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98057002 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

2 0.92508835 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections

Author: Ping Li, Cun-hui Zhang

Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.

3 0.81026322 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

Author: Han Liu, Larry Wasserman, John D. Lafferty

Abstract: We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. 1

4 0.74403578 95 nips-2012-Density-Difference Estimation

Author: Masashi Sugiyama, Takafumi Kanamori, Taiji Suzuki, Marthinus D. Plessis, Song Liu, Ichiro Takeuchi

Abstract: We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small estimation error incurred in the first stage can cause a big error in the second stage. In this paper, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a non-parametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. We then show how the proposed density-difference estimator can be utilized in L2 -distance approximation. Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection.

5 0.67117172 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

Author: Evan Archer, Il M. Park, Jonathan W. Pillow

Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions. 1

6 0.62581486 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

7 0.61057639 222 nips-2012-Multi-Task Averaging

8 0.54297858 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

9 0.49977034 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

10 0.48773691 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

11 0.46219608 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

12 0.45859247 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

13 0.42934284 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

14 0.40325513 16 nips-2012-A Polynomial-time Form of Robust Regression

15 0.40060797 145 nips-2012-Gradient Weights help Nonparametric Regressors

16 0.39847451 231 nips-2012-Multiple Operator-valued Kernel Learning

17 0.38269475 279 nips-2012-Projection Retrieval for Classification

18 0.37779352 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

19 0.36665618 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

20 0.36517903 338 nips-2012-The Perturbed Variation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.014), (17, 0.012), (21, 0.055), (30, 0.181), (38, 0.194), (39, 0.017), (42, 0.057), (54, 0.012), (55, 0.034), (74, 0.06), (76, 0.145), (80, 0.079), (92, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92242986 71 nips-2012-Co-Regularized Hashing for Multimodal Data

Author: Yi Zhen, Dit-Yan Yeung

Abstract: Hashing-based methods provide a very promising approach to large-scale similarity search. To obtain compact hash codes, a recent trend seeks to learn the hash functions from data automatically. In this paper, we study hash function learning in the context of multimodal data. We propose a novel multimodal hash function learning method, called Co-Regularized Hashing (CRH), based on a boosted coregularization framework. The hash functions for each bit of the hash codes are learned by solving DC (difference of convex functions) programs, while the learning for multiple bits proceeds via a boosting procedure so that the bias introduced by the hash functions can be sequentially minimized. We empirically compare CRH with two state-of-the-art multimodal hash function learning methods on two publicly available data sets. 1

2 0.88327241 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

Author: Henrik Ohlsson, Allen Yang, Roy Dong, Shankar Sastry

Abstract: While compressive sensing (CS) has been one of the most vibrant research fields in the past few years, most development only applies to linear models. This limits its application in many areas where CS could make a difference. This paper presents a novel extension of CS to the phase retrieval problem, where intensity measurements of a linear system are used to recover a complex sparse signal. We propose a novel solution using a lifting technique – CPRL, which relaxes the NP-hard problem to a nonsmooth semidefinite program. Our analysis shows that CPRL inherits many desirable properties from CS, such as guarantees for exact recovery. We further provide scalable numerical solvers to accelerate its implementation. 1

same-paper 3 0.86415917 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

Author: Kumar Sricharan, Alfred O. Hero

Abstract: The problem of estimation of entropy functionals of probability densities has received much attention in the information theory, machine learning and statistics communities. Kernel density plug-in estimators are simple, easy to implement and widely used for estimation of entropy. However, for large feature dimension d, kernel plug-in estimators suffer from the curse of dimensionality: the MSE rate of convergence is glacially slow - of order O(T −γ/d ), where T is the number of samples, and γ > 0 is a rate parameter. In this paper, it is shown that for sufficiently smooth densities, an ensemble of kernel plug-in estimators can be combined via a weighted convex combination, such that the resulting weighted estimator has a superior parametric MSE rate of convergence of order O(T −1 ). Furthermore, it is shown that these optimal weights can be determined by solving a convex optimization problem which does not require training data or knowledge of the underlying density, and therefore can be performed offline. This novel result is remarkable in that, while each of the individual kernel plug-in estimators belonging to the ensemble suffer from the curse of dimensionality, by appropriate ensemble averaging we can achieve parametric convergence rates. 1

4 0.82238126 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1

5 0.81489372 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

Author: Xue-xin Wei, Alan Stocker

Abstract: A common challenge for Bayesian models of perception is the fact that the two fundamental Bayesian components, the prior distribution and the likelihood function, are formally unconstrained. Here we argue that a neural system that emulates Bayesian inference is naturally constrained by the way it represents sensory information in populations of neurons. More specifically, we show that an efficient coding principle creates a direct link between prior and likelihood based on the underlying stimulus distribution. The resulting Bayesian estimates can show biases away from the peaks of the prior distribution, a behavior seemingly at odds with the traditional view of Bayesian estimation, yet one that has been reported in human perception. We demonstrate that our framework correctly accounts for the repulsive biases previously reported for the perception of visual orientation, and show that the predicted tuning characteristics of the model neurons match the reported orientation tuning properties of neurons in primary visual cortex. Our results suggest that efficient coding is a promising hypothesis in constraining Bayesian models of perceptual inference. 1 Motivation Human perception is not perfect. Biases have been observed in a large number of perceptual tasks and modalities, of which the most salient ones constitute many well-known perceptual illusions. It has been suggested, however, that these biases do not reflect a failure of perception but rather an observer’s attempt to optimally combine the inherently noisy and ambiguous sensory information with appropriate prior knowledge about the world [13, 4, 14]. This hypothesis, which we will refer to as the Bayesian hypothesis, has indeed proven quite successful in providing a normative explanation of perception at a qualitative and, more recently, quantitative level (see e.g. [15]). A major challenge in forming models based on the Bayesian hypothesis is the correct selection of two main components: the prior distribution (belief) and the likelihood function. This has encouraged some to criticize the Bayesian hypothesis altogether, claiming that arbitrary choices for these components always allow for unjustified post-hoc explanations of the data [1]. We do not share this criticism, referring to a number of successful attempts to constrain prior beliefs and likelihood functions based on principled grounds. For example, prior beliefs have been defined as the relative distribution of the sensory variable in the environment in cases where these statistics are relatively easy to measure (e.g. local visual orientations [16]), or where it can be assumed that subjects have learned them over the course of the experiment (e.g. time perception [17]). Other studies have constrained the likelihood function according to known noise characteristics of neurons that are crucially involved in the specific perceptual process (e.g motion tuned neurons in visual cor∗ http://www.sas.upenn.edu/ astocker/lab 1 world neural representation efficient encoding percept Bayesian decoding Figure 1: Encoding-decoding framework. A stimulus representing a sensory variable θ elicits a firing rate response R = {r1 , r2 , ..., rN } in a population of N neurons. The perceptual task is to generate a ˆ good estimate θ(R) of the presented value of the sensory variable based on this population response. Our framework assumes that encoding is efficient, and decoding is Bayesian based on the likelihood p(R|θ), the prior p(θ), and a squared-error loss function. tex [18]). 
However, we agree that finding appropriate constraints is generally difficult and that prior beliefs and likelihood functions have been often selected on the basis of mathematical convenience. Here, we propose that the efficient coding hypothesis [19] offers a joint constraint on the prior and likelihood function in neural implementations of Bayesian inference. Efficient coding provides a normative description of how neurons encode sensory information, and suggests a direct link between measured perceptual discriminability, neural tuning characteristics, and environmental statistics [11]. We show how this link can be extended to a full Bayesian account of perception that includes perceptual biases. We validate our model framework against behavioral as well as neural data characterizing the perception of visual orientation. We demonstrate that we can account not only for the reported perceptual biases away from the cardinal orientations, but also for the specific response characteristics of orientation-tuned neurons in primary visual cortex. Our work is a novel proposal of how two important normative hypotheses in perception science, namely efficient (en)coding and Bayesian decoding, might be linked. 2 Encoding-decoding framework We consider perception as an inference process that takes place along the simplified neural encodingdecoding cascade illustrated in Fig. 11 . 2.1 Efficient encoding Efficient encoding proposes that the tuning characteristics of a neural population are adapted to the prior distribution p(θ) of the sensory variable such that the population optimally represents the sensory variable [19]. Different definitions of “optimally” are possible, and may lead to different results. Here, we assume an efficient representation that maximizes the mutual information between the sensory variable and the population response. With this definition and an upper limit on the total firing activity, the square-root of the Fisher Information must be proportional to the prior distribution [12, 21]. In order to constrain the tuning curves of individual neurons in the population we also impose a homogeneity constraint, requiring that there exists a one-to-one mapping F (θ) that transforms the ˜ physical space with units θ to a homogeneous space with units θ = F (θ) in which the stimulus distribution becomes uniform. This defines the mapping as θ F (θ) = p(χ)dχ , (1) −∞ which is the cumulative of the prior distribution p(θ). We then assume a neural population with identical tuning curves that evenly tiles the stimulus range in this homogeneous space. The population provides an efficient representation of the sensory variable θ according to the above constraints [11]. ˜ The tuning curves in the physical space are obtained by applying the inverse mapping F −1 (θ). Fig. 2 1 In the context of this paper, we consider ‘inferring’, ‘decoding’, and ‘estimating’ as synonymous. 2 stimulus distribution d samples # a Fisher information discriminability and average firing rates and b firing rate [ Hz] efficient encoding F likelihood function F -1 likelihood c symmetric asymmetric homogeneous space physical space Figure 2: Efficient encoding constrains the likelihood function. a) Prior distribution p(θ) derived from stimulus statistics. b) Efficient coding defines the shape of the tuning curves in the physical space by transforming a set of homogeneous neurons using a mapping F −1 that is the inverse of the cumulative of the prior p(θ) (see Eq. (1)). 
c) As a result, the likelihood shape is constrained by the prior distribution showing heavier tails on the side of lower prior density. d) Fisher information, discrimination threshold, and average firing rates are all uniform in the homogeneous space. illustrates the applied efficient encoding scheme, the mapping, and the concept of the homogeneous space for the example of a symmetric, exponentially decaying prior distribution p(θ). The key idea here is that by assuming efficient encoding, the prior (i.e. the stimulus distribution in the world) directly constrains the likelihood function. In particular, the shape of the likelihood is determined by the cumulative distribution of the prior. As a result, the likelihood is generally asymmetric, as shown in Fig. 2, exhibiting heavier tails on the side of the prior with lower density. 2.2 Bayesian decoding Let us consider a population of N sensory neurons that efficiently represents a stimulus variable θ as described above. A stimulus θ0 elicits a specific population response that is characterized by the vector R = [r1 , r2 , ..., rN ] where ri is the spike-count of the ith neuron over a given time-window τ . Under the assumption that the variability in the individual firing rates is governed by a Poisson process, we can write the likelihood function over θ as N p(R|θ) = (τ fi (θ))ri −τ fi (θ) e , ri ! i=1 (2) ˆ with fi (θ) describing the tuning curve of neuron i. We then define a Bayesian decoder θLSE as the estimator that minimizes the expected squared-error between the estimate and the true stimulus value, thus θp(R|θ)p(θ)dθ ˆ θLSE (R) = , (3) p(R|θ)p(θ)dθ where we use Bayes’ rule to appropriately combine the sensory evidence with the stimulus prior p(θ). 3 Bayesian estimates can be biased away from prior peaks Bayesian models of perception typically predict perceptual biases toward the peaks of the prior density, a characteristic often considered a hallmark of Bayesian inference. This originates from the 3 a b prior attraction prior prior attraction likelihood repulsion! likelihood c prior prior repulsive bias likelihood likelihood mean posterior mean posterior mean Figure 3: Bayesian estimates biased away from the prior. a) If the likelihood function is symmetric, then the estimate (posterior mean) is, on average, shifted away from the actual value of the sensory variable θ0 towards the prior peak. b) Efficient encoding typically leads to an asymmetric likelihood function whose normalized mean is away from the peak of the prior (relative to θ0 ). The estimate is determined by a combination of prior attraction and shifted likelihood mean, and can exhibit an overall repulsive bias. c) If p(θ0 ) < 0 and the likelihood is relatively narrow, then (1/p(θ)2 ) > 0 (blue line) and the estimate is biased away from the prior peak (see Eq. (6)). common approach of choosing a parametric description of the likelihood function that is computationally convenient (e.g. Gaussian). As a consequence, likelihood functions are typically assumed to be symmetric (but see [23, 24]), leaving the bias of the Bayesian estimator to be mainly determined by the shape of the prior density, i.e. leading to biases toward the peak of the prior (Fig. 3a). In our model framework, the shape of the likelihood function is constrained by the stimulus prior via efficient neural encoding, and is generally not symmetric for non-flat priors. It has a heavier tail on the side with lower prior density (Fig. 3b). 
The intuition is that due to the efficient allocation of neural resources, the side with smaller prior density will be encoded less accurately, leading to a broader likelihood function on that side. The likelihood asymmetry pulls the Bayes’ least-squares estimate away from the peak of the prior while at the same time the prior pulls it toward its peak. Thus, the resulting estimation bias is the combination of these two counter-acting forces - and both are determined by the prior! 3.1 General derivation of the estimation bias In the following, we will formally derive the mean estimation bias b(θ) of the proposed encodingdecoding framework. Specifically, we will study the conditions for which the bias is repulsive i.e. away from the peak of the prior density. ˆ We first re-write the estimator θLSE (3) by replacing θ with the inverse of its mapping to the homo−1 ˜ geneous space, i.e., θ = F (θ). The motivation for this is that the likelihood in the homogeneous space is symmetric (Fig. 2). Given a value θ0 and the elicited population response R, we can write the estimator as ˜ ˜ ˜ ˜ θp(R|θ)p(θ)dθ F −1 (θ)p(R|F −1 (θ))p(F −1 (θ))dF −1 (θ) ˆ θLSE (R) = = . ˜ ˜ ˜ p(R|θ)p(θ)dθ p(R|F −1 (θ))p(F −1 (θ))dF −1 (θ) Calculating the derivative of the inverse function and noting that F is the cumulative of the prior density, we get 1 1 1 ˜ ˜ ˜ ˜ ˜ ˜ dθ = dθ. dF −1 (θ) = (F −1 (θ)) dθ = dθ = −1 (θ)) ˜ F (θ) p(θ) p(F ˆ Hence, we can simplify θLSE (R) as ˆ θLSE (R) = ˜ ˜ ˜ F −1 (θ)p(R|F −1 (θ))dθ . ˜ ˜ p(R|F −1 (θ))dθ With ˜ K(R, θ) = ˜ p(R|F −1 (θ)) ˜ ˜ p(R|F −1 (θ))dθ 4 we can further simplify the notation and get ˆ θLSE (R) = ˜ ˜ ˜ F −1 (θ)K(R, θ)dθ . (4) ˆ ˜ In order to get the expected value of the estimate, θLSE (θ), we marginalize (4) over the population response space S, ˆ ˜ ˜ ˜ ˜ θLSE (θ) = p(R)F −1 (θ)K(R, θ)dθdR S = F −1 ˜ (θ)( ˜ ˜ p(R)K(R, θ)dR)dθ = ˜ ˜ ˜ F −1 (θ)L(θ)dθ, S where we define ˜ L(θ) = ˜ p(R)K(R, θ)dR. S ˜ ˜ ˜ It follows that L(θ)dθ = 1. Due to the symmetry in this space, it can be shown that L(θ) is ˜0 . Intuitively, L(θ) can be thought as the normalized ˜ symmetric around the true stimulus value θ average likelihood in the homogeneous space. We can then compute the expected bias at θ0 as b(θ0 ) = ˜ ˜ ˜ ˜ F −1 (θ)L(θ)dθ − F −1 (θ0 ) (5) ˜ This is expression is general where F −1 (θ) is defined as the inverse of the cumulative of an arbitrary ˜ prior density p(θ) (see Eq. (1)) and the dispersion of L(θ) is determined by the internal noise level. ˜ ˜ Assuming the prior density to be smooth, we expand F −1 in a neighborhood (θ0 − h, θ0 + h) that is larger than the support of the likelihood function. Using Taylor’s theorem with mean-value forms of the remainder, we get 1 ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ F −1 (θ) = F −1 (θ0 ) + F −1 (θ0 ) (θ − θ0 ) + F −1 (θx ) (θ − θ0 )2 , 2 ˜ ˜ ˜ with θx lying between θ0 and θ. By applying this expression to (5), we find ˜ θ0 +h b(θ0 ) = = 1 2 ˜ θ0 −h 1 −1 ˜ ˜ ˜ ˜ ˜ 1 F (θx )θ (θ − θ0 )2 L(θ)dθ = ˜ 2 2 ˜ θ0 +h −( ˜ θ0 −h p(θx )θ ˜ ˜ 2 ˜ ˜ 1 )(θ − θ0 ) L(θ)dθ = p(θx )3 4 ˜ θ0 +h 1 ˜ − θ0 )2 L(θ)dθ ˜ ˜ ˜ ( ) ˜(θ ˜ p(F −1 (θx )) θ ( 1 ˜ ˜ ˜ ˜ ) (θ − θ0 )2 L(θ)dθ. p(θx )2 θ ˜ θ0 −h ˜ θ0 +h ˜ θ0 −h In general, there is no simple rule to judge the sign of b(θ0 ). However, if the prior is monotonic ˜ ˜ on the interval F −1 ((θ0 − h, θ0 + h)), then the sign of ( p(θ1 )2 ) is always the same as the sign of x 1 1 ( p(θ0 )2 ) . 
Also, if the likelihood is sufficiently narrow we can approximate ( p(θ1 )2 ) by ( p(θ0 )2 ) , x and therefore approximate the bias as b(θ0 ) ≈ C( 1 ) , p(θ0 )2 (6) where C is a positive constant. The result is quite surprising because it states that as long as the prior is monotonic over the support of the likelihood function, the expected estimation bias is always away from the peaks of the prior! 3.2 Internal (neural) versus external (stimulus) noise The above derivation of estimation bias is based on the assumption that all uncertainty about the sensory variable is caused by neural response variability. This level of internal noise depends on the response magnitude, and thus can be modulated e.g. by changing stimulus contrast. This contrastcontrolled noise modulation is commonly exploited in perceptual studies (e.g. [18]). Internal noise will always lead to repulsive biases in our framework if the prior is monotonic. If internal noise is low, the likelihood is narrow and thus the bias is small. Increasing internal noise leads to increasingly 5 larger biases up to the point where the likelihood becomes wide enough such that monotonicity of the prior over the support of the likelihood is potentially violated. Stimulus noise is another way to modulate the noise level in perception (e.g. random-dot motion stimuli). Such external noise, however, has a different effect on the shape of the likelihood function as compared to internal noise. It modifies the likelihood function (2) by convolving it with the noise kernel. External noise is frequently chosen as additive and symmetric (e.g. zero-mean Gaussian). It is straightforward to prove that such symmetric external noise does not lead to a change in the mean of the likelihood, and thus does not alter the repulsive effect induced by its asymmetry. However, by increasing the overall width of the likelihood, the attractive influence of the prior increases, resulting in an estimate that is closer to the prior peak than without external noise2 . 4 Perception of visual orientation We tested our framework by modelling the perception of visual orientation. Our choice was based on the fact that i) we have pretty good estimates of the prior distribution of local orientations in natural images, ii) tuning characteristics of orientation selective neurons in visual cortex are wellstudied (monkey/cat), and iii) biases in perceived stimulus orientation have been well characterized. We start by creating an efficient neural population based on measured prior distributions of local visual orientation, and then compare the resulting tuning characteristics of the population and the predicted perceptual biases with reported data in the literature. 4.1 Efficient neural model population for visual orientation Previous studies measured the statistics of the local orientation in large sets of natural images and consistently found that the orientation distribution is multimodal, peaking at the two cardinal orientations as shown in Fig. 4a [16, 20]. We assumed that the visual system’s prior belief over orientation p(θ) follows this distribution and approximate it formally as p(θ) ∝ 2 − | sin(θ)| (black line in Fig. 4b) . (7) Based on this prior distribution we defined an efficient neural representation for orientation. We assumed a population of model neurons (N = 30) with tuning curves that follow a von-Mises distribution in the homogeneous space on top of a constant spontaneous firing rate (5 Hz). 
4 Perception of visual orientation

We tested our framework by modelling the perception of visual orientation. This choice was based on the facts that i) we have good estimates of the prior distribution of local orientations in natural images, ii) the tuning characteristics of orientation-selective neurons in visual cortex are well studied (monkey/cat), and iii) biases in perceived stimulus orientation have been well characterized. We start by creating an efficient neural population based on measured prior distributions of local visual orientation, and then compare the resulting tuning characteristics of the population and the predicted perceptual biases with data reported in the literature.

4.1 Efficient neural model population for visual orientation

Previous studies measured the statistics of local orientation in large sets of natural images and consistently found that the orientation distribution is multimodal, peaking at the two cardinal orientations, as shown in Fig. 4a [16, 20]. We assumed that the visual system's prior belief over orientation p(θ) follows this distribution and approximated it formally as
$$p(\theta) \propto 2 - |\sin(\theta)| \qquad \text{(black line in Fig. 4b)}. \qquad (7)$$
Based on this prior distribution we defined an efficient neural representation for orientation. We assumed a population of model neurons (N = 30) with tuning curves that follow a von Mises distribution in the homogeneous space, on top of a constant spontaneous firing rate (5 Hz). We then applied the inverse transformation $F^{-1}(\tilde{\theta})$ to all these tuning curves to obtain the corresponding tuning curves in the physical space (Fig. 4b, red curves), where F(θ) is the cumulative of the prior (7). The concentration parameter of the von Mises tuning curves was set to κ ≈ 1.6 in the homogeneous space in order to match the measured average tuning width (∼ 32 deg) of neurons in area V1 of the macaque [9].

4.2 Predicted tuning characteristics of neurons in primary visual cortex

The orientation tuning characteristics of our model population match neurophysiological data of neurons in primary visual cortex (V1) well. Efficient encoding predicts that the distribution of the neurons' preferred orientations follows the prior, with more neurons tuned to cardinal than to oblique orientations by a factor of approximately 1.5. A similar ratio has been found for neurons in area V1 of monkey/cat [9, 10]. The tuning widths of the model neurons vary between 25 and 42 deg depending on their preferred tuning (see Fig. 4c), matching the measured tuning-width ratio of 0.6 between neurons tuned to cardinal versus oblique orientations [9].

An important prediction of our model is that most of the tuning curves should be asymmetric. Such asymmetries have indeed been reported for the orientation tuning of neurons in area V1 [6, 7, 8]. We computed the asymmetry index for our model population as defined in previous studies [6, 7] and plotted it as a function of the preferred tuning of each neuron (Fig. 4d). The overall asymmetry index of our model population is 1.24 ± 0.11, which approximately matches the measured value for neurons in area V1 of the cat (1.26 ± 0.06) [6]. The model also predicts that neurons tuned to the cardinal and oblique orientations should be more symmetric than those tuned to orientations in between. Finally, neurons with tuning widths at the lower and upper end of the range are predicted to exhibit less asymmetry than neurons whose widths lie in between these extremes (illustrated in Fig. 4e). These last two predictions have not been tested yet.

Figure 4: Tuning characteristics of model neurons. a) Distribution of local orientations in natural images, replotted from [16]. b) Prior used in the model (black) and predicted tuning curves according to efficient coding (red). c) Tuning width as a function of preferred orientation. d) Tuning curves of cardinal and oblique neurons are more symmetric than those tuned to orientations in between. e) Both narrowly and broadly tuned neurons show less asymmetry than neurons with tuning widths in between.
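A compact sketch of this construction follows; N, κ, and the 5 Hz spontaneous rate are taken from the text above, while the response gain, the grid, the exact von Mises parameterization, and the width measure are illustrative assumptions and may differ from the ones used for Fig. 4.

```python
# Sketch: von Mises tuning curves, uniformly spaced in the homogeneous space, mapped
# back to orientation through the inverse cumulative of the prior (Eq. 7).
import numpy as np

N, kappa, baseline, gain = 30, 1.6, 5.0, 30.0          # neurons, concentration, Hz, Hz
theta = np.linspace(-np.pi / 2, np.pi / 2, 3601)       # orientation (radians)
p = 2.0 - np.abs(np.sin(theta))                        # prior of Eq. (7)
p /= np.trapz(p, theta)
F = np.concatenate(([0.0], np.cumsum(0.5 * (p[1:] + p[:-1]) * np.diff(theta))))

centers = (np.arange(N) + 0.5) / N                     # preferred values in [0, 1)
rates = baseline + gain * np.exp(
    kappa * (np.cos(2 * np.pi * (F[None, :] - centers[:, None])) - 1.0))

prefs = theta[np.argmax(rates, axis=1)]                # preferred orientations = F^{-1}(centers)
# crude tuning width: orientation range where the driven response exceeds half its peak
widths = np.array([np.trapz(((r - baseline) > 0.5 * gain).astype(float), theta)
                   for r in rates])

i_card = np.argmin(np.abs(prefs))                      # neuron closest to the cardinal at 0
i_obl = np.argmin(np.abs(np.abs(prefs) - np.pi / 4))   # neuron closest to an oblique
print("widths: cardinal %.1f deg, oblique %.1f deg"
      % (np.degrees(widths[i_card]), np.degrees(widths[i_obl])))
# More neurons prefer orientations near the prior peak, and their tuning is narrower
# there, the qualitative pattern reported in Section 4.2.
```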
4.3 Predicted perceptual biases

Our model framework also provides specific predictions for the expected perceptual biases. Humans show systematic biases in the perceived orientation of visual stimuli such as arrays of Gabor patches (Fig. 5a,d). Two types of biases can be distinguished. First, perceived orientations show an absolute bias away from the cardinal orientations, and thus away from the peaks of the orientation prior [2, 3]. We refer to these biases as absolute because they are typically measured by adjusting a noise-free reference until it matches the orientation of the test stimulus. Interestingly, these repulsive absolute biases increase as the external stimulus noise decreases (see Fig. 5b). Second, the relative bias between the perceived overall orientations of a high-noise and a low-noise stimulus is toward the cardinal orientations, as shown in Fig. 5c, and thus toward the peak of the prior distribution [3, 16].

The predicted perceptual biases of our model are shown in Fig. 5e,f. We computed the likelihood function according to (2) and used the prior in (7). External noise was modeled by convolving the stimulus likelihood function with a Gaussian (different widths for different noise levels). The predictions match both the reported absolute biases away from the cardinal orientations and the relative biases toward them well. Note that our model framework correctly accounts for the fact that less external noise leads to larger absolute biases (see also the discussion in Section 3.2).

Figure 5: Biases in perceived orientation: human data vs. model prediction. a,d) Low- and high-noise orientation stimuli of the type used in [3, 16]. b) Humans show absolute biases in perceived orientation that are directed away from the cardinal orientations. Data replotted from [2] (pink squares) and [3] (green (black) triangles: bias for low (high) external noise). c) Relative bias between stimuli with different external noise levels (high minus low). Data replotted from [3] (blue triangles) and [16] (red circles). e,f) Model predictions for absolute and relative bias.
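A sketch of this prediction pipeline is shown below; the internal and external noise widths and the test orientation are illustrative assumptions rather than the settings used to generate Fig. 5, so only the qualitative pattern should be compared.

```python
# Sketch: internal noise gives a likelihood that is Gaussian in the homogeneous space;
# external noise additionally convolves the stimulus-space likelihood with a symmetric
# Gaussian; the estimate is the posterior mean (BLS).
import numpy as np

theta = np.linspace(-np.pi / 2, np.pi / 2, 7201)
p = 2.0 - np.abs(np.sin(theta))                      # prior of Eq. (7)
p /= np.trapz(p, theta)
F = np.concatenate(([0.0], np.cumsum(0.5 * (p[1:] + p[:-1]) * np.diff(theta))))

def expected_bias(theta0, sigma_int, sigma_ext):
    """Bias of the posterior-mean estimate for a stimulus at theta0 (radians)."""
    like = np.exp(-0.5 * ((F - np.interp(theta0, theta, F)) / sigma_int) ** 2)
    if sigma_ext > 0:                                # symmetric external (stimulus) noise
        g = np.exp(-0.5 * (theta / sigma_ext) ** 2)
        like = np.convolve(like, g / g.sum(), mode="same")
    post = like * p                                  # combine with the prior, then take the mean
    post /= np.trapz(post, theta)
    return np.trapz(theta * post, theta) - theta0

theta0 = np.deg2rad(25.0)
b_low = expected_bias(theta0, 0.04, np.deg2rad(1.0))     # low external noise
b_high = expected_bias(theta0, 0.04, np.deg2rad(15.0))   # high external noise
print("low: %+.2f deg  high: %+.2f deg  relative (high - low): %+.2f deg"
      % tuple(np.rad2deg([b_low, b_high, b_high - b_low])))
# With these assumed widths, the low-noise bias is positive (repulsive, away from the
# cardinal at 0), and the relative bias is negative, i.e. toward the cardinal,
# consistent with the relative bias shown in Fig. 5f.
```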
5 Discussion

We have presented a modeling framework for perception that combines efficient (en)coding and Bayesian decoding. Efficient coding imposes constraints on the tuning characteristics of a population of neurons according to the stimulus distribution (prior). It thus establishes a direct link between prior and likelihood, and provides clear constraints on the latter for a Bayesian observer model of perception. We have shown that the resulting likelihoods are in general asymmetric, with heavier tails away from the prior peaks. We demonstrated that such asymmetric likelihoods can lead to the counter-intuitive prediction that a Bayesian estimator is biased away from the peaks of the prior distribution. Interestingly, such repulsive biases have been reported for human perception of visual orientation, yet a principled and consistent explanation of their existence has been missing so far. Here, we suggest that these counter-intuitive biases directly follow from the asymmetries in the likelihood function induced by efficient neural encoding of the stimulus. The good match between our model predictions and the measured perceptual biases and orientation tuning characteristics of neurons in primary visual cortex provides further support for our framework. Previous work has suggested that there might be a link between stimulus statistics, neuronal tuning characteristics, and perceptual behavior based on efficient coding principles, yet none of these studies has recognized the importance of the resulting likelihood asymmetries [16, 11]. We have demonstrated here that such asymmetries can be crucial in explaining perceptual data, even though the resulting estimates appear "anti-Bayesian" at first sight (see also models of sensory adaptation [23]).

Note that we do not provide a neural implementation of the Bayesian inference step. However, we and others have proposed various neural decoding schemes that can approximate Bayes' least-squares estimation using efficient coding [26, 25, 22]. It is also worth pointing out that our estimator is set to minimize the total squared error, and that other choices of the loss function (e.g. a MAP estimator) could lead to different predictions. Our framework is general and should be directly applicable to other modalities. In particular, it might provide a new explanation for perceptual biases that are hard to reconcile with traditional Bayesian approaches [5].

Acknowledgments

We thank M. Jogan and A. Tank for helpful comments on the manuscript. This work was partially supported by grant ONR N000141110744.

References

[1] M. Jones and B. C. Love. Bayesian fundamentalism or enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34:169–231, 2011.
[2] D. P. Andrews. Perception of contours in the central fovea. Nature, 205:1218–1220, 1965.
[3] A. Tomassini, M. J. Morgan, and J. A. Solomon. Orientation uncertainty reduces perceived obliquity. Vision Res., 50:541–547, 2010.
[4] W. S. Geisler and D. Kersten. Illusions, perception and Bayes. Nature Neuroscience, 5(6):508–510, 2002.
[5] M. O. Ernst. Perceptual learning: inverting the size-weight illusion. Current Biology, 19:R23–R25, 2009.
[6] G. H. Henry, B. Dreher, and P. O. Bishop. Orientation specificity of cells in cat striate cortex. J. Neurophysiol., 37(6):1394–1409, 1974.
[7] D. Rose and C. Blakemore. An analysis of orientation selectivity in the cat's visual cortex. Exp. Brain Res., 20(1):1–17, 1974.
[8] N. V. Swindale. Orientation tuning curves: empirical description and estimation of parameters. Biol. Cybern., 78(1):45–56, 1998.
[9] R. L. De Valois, E. W. Yund, and N. Hepler. The orientation and direction selectivity of cells in macaque visual cortex. Vision Res., 22:531–544, 1982.
[10] B. Li, M. R. Peterson, and R. D. Freeman. The oblique effect: a neural basis in the visual cortex. J. Neurophysiol., 90:204–217, 2003.
[11] D. Ganguli and E. P. Simoncelli. Implicit encoding of prior probabilities in optimal neural populations. In Advances in Neural Information Processing Systems 23, pages 658–666, 2011.
[12] M. D. McDonnell and N. G. Stocks. Maximally Informative Stimuli and Tuning Curves for Sigmoidal Rate-Coding Neurons and Populations. Phys. Rev. Lett., 101(5):058103, 2008.
[13] H. Helmholtz. Treatise on Physiological Optics (transl.). Thoemmes Press, Bristol, U.K., 2000. Original publication 1867.
[14] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5(6):598–604, June 2002.
[15] D. C. Knill and W. Richards, editors. Perception as Bayesian Inference. Cambridge University Press, 1996.
[16] A. R. Girshick, M. S. Landy, and E. P. Simoncelli. Cardinal rules: visual orientation perception reflects knowledge of environmental statistics. Nat. Neurosci., 14(7):926–932, July 2011.
[17] M. Jazayeri and M. N. Shadlen. Temporal context calibrates interval timing. Nature Neuroscience, 13(8):914–916, 2010.
[18] A. A. Stocker and E. P. Simoncelli. Noise characteristics and prior expectations in human visual speed perception. Nature Neuroscience, pages 578–585, April 2006.
[19] H. B. Barlow. Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217–234. MIT Press, Cambridge, MA, 1961.
[20] D. M. Coppola, H. R. Purves, A. N. McCoy, and D. Purves. The distribution of oriented contours in the real world. Proc. Natl. Acad. Sci. U S A, 95(7):4002–4006, 1998.
[21] N. Brunel and J.-P. Nadal. Mutual information, Fisher information and population coding. Neural Computation, 10(7):1731–1757, 1998.
[22] X.-X. Wei and A. A. Stocker. Bayesian inference with efficient neural population codes. In Lecture Notes in Computer Science, Artificial Neural Networks and Machine Learning - ICANN 2012, Lausanne, Switzerland, volume 7552, pages 523–530, 2012.
[23] A. A. Stocker and E. P. Simoncelli. Sensory adaptation within a Bayesian framework for perception. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1291–1298. MIT Press, Cambridge, MA, 2006. Oral presentation.
[24] D. C. Knill. Robust cue integration: A Bayesian model and evidence from cue-conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7(7):1–24, 2007.
[25] D. Ganguli. Efficient coding and Bayesian inference with neural populations. PhD thesis, Center for Neural Science, New York University, New York, NY, September 2012.
[26] B. Fischer. Bayesian estimates from heterogeneous population codes. In Proc. IEEE Intl. Joint Conf. on Neural Networks. IEEE, 2010.
