nips nips2008 nips2008-102 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lev Faivishevsky, Jacob Goldberger
Abstract: In this paper we introduce the MeanNN approach for estimation of main information theoretic measures such as differential entropy, mutual information and divergence. As opposed to other nonparametric approaches the MeanNN results in smooth differentiable functions of the data samples with clear geometrical interpretation. Then we apply the proposed estimators to the ICA problem and obtain a smooth expression for the mutual information that can be analytically optimized by gradient descent methods. The improved performance of the proposed ICA algorithm is demonstrated on several test examples in comparison with state-of-the-art techniques.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper we introduce the MeanNN approach for estimation of main information theoretic measures such as differential entropy, mutual information and divergence. [sent-5, score-0.327]
2 As opposed to other nonparametric approaches the MeanNN results in smooth differentiable functions of the data samples with clear geometrical interpretation. [sent-6, score-0.121]
3 Then we apply the proposed estimators to the ICA problem and obtain a smooth expression for the mutual information that can be analytically optimized by gradient descent methods. [sent-7, score-0.444]
4 We are given observed samples x = {x1 , . . . , xn } and our goal is to recover the linear transformation A and the sources s1 , . . . , sn . [sent-15, score-0.087]
5 Given the minimal statement of the problem, it has been shown [6] that one can recover the original sources up to a scaling and a permutation provided that at most one of the underlying sources is Gaussian and the rest are non-Gaussian. [sent-19, score-0.094]
6 Upon pre-whitening the observed data, the problem reduces to a search over rotation matrices in order to recover the source and mixing matrix in the sense described above [10]. [sent-20, score-0.236]
7 Working with W = A−1 as the parametrization, one readily obtains a gradient or fixed-point algorithm that yields an estimate Ŵ and provides estimates of the latent sources Ŝ = Ŵ X [10]. [sent-23, score-0.104]
8 The problem is then, obviously, to find a suitable contrast function, i.e., a criterion that measures the dependence among the estimated components. [sent-26, score-0.079]
9 The earliest ICA algorithms were based on contrast functions defined in terms of expectations of a single fixed nonlinear function, chosen in an ad-hoc manner [5]. [sent-29, score-0.079]
10 More sophisticated algorithms have been obtained by careful choice of a single fixed nonlinear function, such that the expectations of this function yield a robust approximation to the mutual information [9]. [sent-30, score-0.127]
11 Maximizing the likelihood in the semiparametric ICA model is essentially equivalent to minimizing the mutual information between the components of the estimate Ŝ = Ŵ X [4]. [sent-31, score-0.183]
12 The usage of the mutual information as a contrast function to be minimized in estimating the ICA model is well motivated, quite apart from the link to maximum likelihood [6]. [sent-32, score-0.232]
13 Several modern approaches rely on k-nearest neighbor estimates of entropy and mutual information [12, 16]. [sent-34, score-0.458]
14 Recently the Vasicek estimator [17] for the differential entropy of 1D random variables, based on k-nearest neighbors statistics, was applied to ICA [8, 13]. [sent-35, score-0.618]
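As a point of reference, here is a minimal sketch (our own, not from the paper) of the Vasicek spacing estimator mentioned above, for the differential entropy of a 1D sample; the function name and the default spacing m ≈ √n are assumptions for illustration:

import numpy as np

def vasicek_entropy(x, m=None):
    # Vasicek (1976) spacing estimator for 1D differential entropy:
    # H ≈ (1/n) * sum_i log( n/(2m) * (x_(i+m) - x_(i-m)) ),
    # with the order statistics clamped at the sample boundaries.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    if m is None:
        m = max(1, int(round(np.sqrt(n))))  # common heuristic, assumed here
    upper = x[np.minimum(np.arange(n) + m, n - 1)]
    lower = x[np.maximum(np.arange(n) - m, 0)]
    return np.mean(np.log(n / (2.0 * m) * (upper - lower)))

Because the estimate is built from sorted order statistics, it is not a smooth function of the sample values, which is exactly the difficulty discussed next.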
15 In addition, ICA was studied using another recently introduced MI estimator [16]. [sent-36, score-0.259]
16 However, the derivatives of estimators based on order statistics can hardly be computed, and therefore the optimization of such numerical criteria cannot rely on gradient techniques. [sent-37, score-0.347]
17 Also, the resulting numerical criteria tend to depend non-smoothly on the sample values. [sent-38, score-0.085]
18 The optimization therefore has to evaluate the contrast function over a whole grid of candidate parameters. [sent-39, score-0.139]
19 In addition, such estimators do not make optimal use of all the information contained in the samples of the random vectors. [sent-40, score-0.198]
20 Therefore they require a significant artificial enlargement of the data set by a technique called data augmentation [13], which replaces each data point in the sample with an R-tuple (R is usually 30) of points generated by a statistical procedure with ad-hoc parameters. [sent-41, score-0.098]
21 An alternative is Fourier filtering of the evaluated MI estimates [16]. [sent-42, score-0.145]
22 In the present paper we propose new smooth estimators for the differential entropy, the mutual information and the divergence. [sent-43, score-0.459]
23 The estimators are obtained by a novel approach that averages the k-nearest neighbor statistics over all possible values of the order k. [sent-44, score-0.25]
24 The estimators are smooth, and their derivatives can easily be calculated analytically, thus enabling fast gradient-based optimization techniques. [sent-45, score-0.295]
25 The estimators provide a novel geometrical interpretation for the entropy. [sent-47, score-0.186]
26 When applied to the ICA problem, the proposed estimator leads to the most precise results currently known for many distributions. [sent-48, score-0.259]
27 The rest of the paper is organized as follows: Section 2 reviews the kNN approach for the entropy and divergence estimation, Section 3 introduces the mean estimator for the differential entropy, the mutual information and the divergence. [sent-49, score-0.795]
28 Section 4 describes the application of the proposed estimators to the ICA problem, and Section 5 describes the numerical experiments conducted. [sent-50, score-0.182]
29 2 kNN Estimators for the Differential Entropy We review the nearest neighbor technique for the Shannon entropy estimation. [sent-51, score-0.404]
30 The differential entropy of X is defined as: H(X) = − ∫ f (x) log f (x) dx (1) We describe the derivation of the Shannon differential entropy estimate of [11, 18]. [sent-52, score-0.799]
31 If one had unbiased estimators for log f (xi ), one would arrive at an unbiased estimator for the entropy. [sent-58, score-0.537]
32 We will estimate log f (xi ) by considering the probability density function Pik (ε) for the distance ε between xi and its k-th nearest neighbor (the probability is computed over the positions of all other n − 1 points, with xi kept fixed). [sent-59, score-0.428]
33 The probability Pik (ε)dε is equal to the chance that there is one point within distance r ∈ [ε, ε + dε] from xi , that there are k − 1 other points at smaller distances, and that the remaining n − k − 1 points have larger distances from xi . [sent-60, score-0.207]
34 Denote the mass of the ε-ball centered at xi by pi (ε), i.e. pi (ε) = ∫_{‖x−xi‖<ε} f (x) dx. [sent-61, score-0.144]
35 Hence, the expected value of the function log pi (ε) according to the distribution Pik (ε) is: E_{Pik (ε)} (log pi (ε)) = ∫_0^∞ Pik (ε) log pi (ε) dε = k (n−1 choose k) ∫_0^1 p^{k−1} (1 − p)^{n−k−1} log p dp = ψ(k) − ψ(n) (3) where ψ(x) is the digamma function (the logarithmic derivative of the gamma function). [sent-69, score-0.456]
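A quick numerical sanity check of Eq. (3), added here purely for illustration: the integral is the expectation of log p under a Beta(k, n − k) law (the distribution of the ball mass at the k-th neighbor distance), so a Monte Carlo average should match ψ(k) − ψ(n):

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
n, k = 100, 3
# the ball mass p_i(eps) at the k-th neighbor distance is Beta(k, n-k) distributed
p = rng.beta(k, n - k, size=200_000)
print(np.mean(np.log(p)), digamma(k) - digamma(n))   # both approximately -3.68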
36 The expectation is taken over the positions of all other n − 1 points, with xi kept fixed. [sent-71, score-0.096]
37 Assuming that f (x) is almost constant in the entire ε-ball around xi , we obtain: pi (ε) ≈ cd ε^d f (xi ) (4) [sent-72, score-0.216]
38 Here d is the dimension of x and cd is the volume of the d-dimensional unit ball (cd = π^{d/2} /Γ(1 + d/2) for the Euclidean norm). [sent-73, score-0.072]
39 Plugging Eq. (4) into Eq. (3), we obtain: − log f (xi ) ≈ ψ(n) − ψ(k) + log(cd ) + d E(log ε) (5) which finally leads to the unbiased kNN estimator for the differential entropy [11]: Hk (X) = ψ(n) − ψ(k) + log(cd ) + (d/n) Σ_{i=1}^{n} log εi (6) where εi is the distance from xi to its k-th nearest neighbor. [sent-76, score-0.952]
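A minimal numerical sketch of the estimator in Eq. (6), assuming Euclidean distances; the helper name knn_entropy is ours:

import numpy as np
from scipy.special import digamma, gammaln
from scipy.spatial import cKDTree

def knn_entropy(X, k=1):
    # H_k = psi(n) - psi(k) + log(c_d) + (d/n) * sum_i log(eps_i), Eq. (6),
    # where eps_i is the distance from x_i to its k-th nearest neighbor.
    X = np.atleast_2d(X)
    n, d = X.shape
    # the query returns the point itself at distance 0, hence ask for k + 1 neighbors
    eps = cKDTree(X).query(X, k=k + 1)[0][:, k]
    log_cd = (d / 2.0) * np.log(np.pi) - gammaln(1.0 + d / 2.0)  # log volume of the unit d-ball
    return digamma(n) - digamma(k) + log_cd + (d / n) * np.sum(np.log(eps))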
40 An alternative proof of the asymptotic unbiasedness and consistency of the kNN estimator can be found in [15]. [sent-77, score-0.329]
41 A similar approach can be used to obtain a kNN estimator for the Kullback-Leibler divergence [19]. [sent-78, score-0.309]
42 By definition the divergence is given by: D(p‖q) = ∫ p(x) log ( p(x)/q(x) ) dx (7) The distance of xi to its nearest neighbor in {xj }_{j≠i} is defined as ρn (i) = min_{j≠i} d(xi , xj ) (8) We also define the distance of xi to its nearest neighbor in {yj }: νm (i) = min_{j=1,...,m} d(xi , yj ) (9) [sent-90, score-0.698]
43 Then the estimator of [19] is given by: D̂_{n,m} = (d/n) Σ_{i=1}^{n} log ( νm (i) / ρn (i) ) + log ( m/(n − 1) ) (10) The authors established asymptotic unbiasedness and mean-square consistency of the estimator (10). [sent-93, score-0.75]
44 The same proofs can be applied to obtain a k-nearest neighbor version of the estimator: D̂^k_{n,m} = (d/n) Σ_{i=1}^{n} log ( ν^k_m (i) / ρ^k_n (i) ) + log ( m/(n − 1) ) (11) Being non-parametric, the kNN estimators (6), (11) rely on order statistics. [sent-94, score-0.443]
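A corresponding sketch of Eqs. (10)-(11), again under the Euclidean metric and with a function name of our choosing; X holds samples from p and Y holds samples from q:

import numpy as np
from scipy.spatial import cKDTree

def knn_divergence(X, Y, k=1):
    # D = (d/n) * sum_i log( nu_k(i) / rho_k(i) ) + log( m/(n-1) ), Eq. (11);
    # Eq. (10) is the special case k = 1.
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    n, d = X.shape
    m = Y.shape[0]
    rho = cKDTree(X).query(X, k=k + 1)[0][:, k]   # skip the zero self-distance
    nu = cKDTree(Y).query(X, k=k)[0]
    if k > 1:
        nu = nu[:, k - 1]
    return (d / n) * np.sum(np.log(nu / rho)) + np.log(m / (n - 1.0))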
45 This makes the analytical calculation of the gradient practically infeasible. [sent-95, score-0.11]
46 It also leads to a certain lack of smoothness of the estimator value as a function of the sample coordinates. [sent-96, score-0.284]
47 One should also mention that finding the k nearest neighbors is a computationally intensive problem. [sent-97, score-0.171]
48 It becomes necessary to use involved approximate nearest neighbor techniques for large data sets. [sent-98, score-0.178]
49 3 The MeanNN Entropy Estimator We propose a novel approach for entropy estimation as a function of the sample coordinates. [sent-99, score-0.282]
50 It is based on the fact that the kNN estimator (6) is valid for every k. [sent-100, score-0.259]
51 Therefore the differential entropy can also be extracted from a mean of several estimators corresponding to different values of k. [sent-101, score-0.504]
52 Next we consider all possible values of the order statistic k from 1 to n − 1: Hmean = (1/(n − 1)) Σ_{k=1}^{n−1} Hk = log(cd ) + ψ(n) + (1/(n − 1)) Σ_{k=1}^{n−1} ( −ψ(k) + (d/n) Σ_{i=1}^{n} log ε_{i,k} ) (12) where ε_{i,k} is the distance from xi to its k-th nearest neighbor. [sent-102, score-0.332]
53 Exchanging the order of summation, the last sum adds, for each sample point xi , the sum of the logs of its distances to all its nearest neighbors in the sample. [sent-105, score-0.313]
54 This is of course equivalent to the sum of the logs of its distances to all other points in the sample set. [sent-106, score-0.167]
55 Hence the mean estimator (12) for the differential entropy can be written as: Hmean = const + (d/(n(n − 1))) Σ_{i≠j} log ‖xi − xj ‖ (13) where the constant depends only on the sample size and dimensionality. [sent-107, score-0.838]
56 We dub this estimator the MeanNN estimator for differential entropy. [sent-108, score-0.392]
57 It follows that the differential entropy (approximation) has a clear geometric meaning. [sent-109, score-0.359]
58 It is proportional to the log of the product of the distances between every two points in a random i.i.d. sample. [sent-110, score-0.142]
59 This is an intuitive observation, since higher entropy leads to a larger scattering of the samples; the pairwise distances then grow, resulting in a larger product of all distances. [sent-114, score-0.287]
60 Moreover, the MeanNN estimator (13) is a smooth function of the sample coordinates. [sent-115, score-0.338]
61 The asymptotic unbiasedness and consistency of the estimator follow from the same properties of the kNN estimator (6). [sent-117, score-0.588]
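A minimal sketch of Eqs. (12)-(13) (Euclidean norm assumed; the function name is ours, and the constant written out explicitly collects log(cd ) + ψ(n) − mean_k ψ(k) from Eq. (12)):

import numpy as np
from scipy.special import digamma, gammaln

def meannn_entropy(X):
    # H_mean = const + d/(n(n-1)) * sum_{i != j} log ||x_i - x_j||, Eq. (13)
    X = np.atleast_2d(X)
    n, d = X.shape
    log_cd = (d / 2.0) * np.log(np.pi) - gammaln(1.0 + d / 2.0)
    const = log_cd + digamma(n) - np.mean(digamma(np.arange(1, n)))
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    off_diag = ~np.eye(n, dtype=bool)
    return const + d / (n * (n - 1.0)) * np.sum(np.log(dist[off_diag]))

Because the expression is an explicit smooth function of the coordinates, its derivatives with respect to the data (or to any parameters the data depend on) are available in closed form, which is what the ICA algorithm of Section 4 exploits.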
62 In this case the entropy may be calculated analytically as H = log µ + 1. [sent-119, score-0.343]
63 We compared the performance of the MeanNN estimator with the k-nearest neighbor estimator (6) for various values of k. [sent-120, score-0.623]
64 One may see that the mean square error of the MeanNN estimator is the same as, or worse than, that of the traditional kNN estimators. [sent-122, score-0.259]
65 However, the standard deviation of the estimator values is smallest for the MeanNN estimator. [sent-123, score-0.259]
66 In such cases the most important characteristic of an estimator is its monotonic dependence on the estimated value; predicting the exact value of the entropy is less important. [sent-125, score-0.485]
67 Therefore one may conclude that MeanNN is better suited for the optimization of entropy-based numerical criteria. [sent-126, score-0.295]
68 The mean square error of the entropy estimate and the STD of the estimator values are compared for MeanNN and for kNN with several k (e.g. 4NN); the numerical values of Table 1 are not reproduced in this extract. [sent-129, score-0.516]
69 Table 1: Performance of the MeanNN entropy estimator in comparison with kNN entropy estimators. [sent-135, score-0.711]
70 If the product of all distances inside one sample is small in comparison with the product of the pairwise distances between the samples, then one concludes that the divergence is large, and vice versa. [sent-138, score-0.197]
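The paper's MeanNN divergence estimator itself is not reproduced in this excerpt; purely to illustrate the geometric reading above, the following is our own sketch, obtained by averaging Eq. (11) over k in the spirit of Eq. (13), and its constants should be checked against the paper before any serious use:

import numpy as np

def meannn_divergence_sketch(X, Y):
    # Illustration only (not the paper's exact equation): the divergence is large
    # when X's within-sample log-distances are small relative to the cross
    # log-distances from X to Y.
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    n, d = X.shape
    m = Y.shape[0]
    cross = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    within = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    off_diag = ~np.eye(n, dtype=bool)
    return (d / (n * m)) * np.log(cross).sum() \
        - (d / (n * (n - 1.0))) * np.log(within[off_diag]).sum() \
        + np.log(m / (n - 1.0))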
71 4 The MeanNN ICA Algorithm As many approaches do, we will use as a contrast function the mutual information between the estimated components, which can be written as J(Y ) = Σ_{t=1}^{d} H(Yt ) − H(X1 , . . . , Xd ) − log(|W |) (17) [sent-139, score-0.079]
72 In particular, the change in the entropy of the joint distribution under a linear transformation is simply the logarithm of the Jacobian of the transformation. [sent-152, score-0.226]
73 As we will assume the X’s to be pre-whitened, W will be restricted to rotation matrices; therefore log(|W |) = 0 and the minimization of J(Y ) reduces to finding Ŵ = arg min_W ( H(Y1 ) + . . . + H(Yd ) ). [sent-153, score-0.151]
74 Writing W in terms of its rows (w1 , . . . , wd ), we can explicitly write the minimization as a function of W : Ŵ = arg min_W Σ_{t=1}^{d} H(wt X) (19) Then we can plug the MeanNN entropy estimator into this expression. [sent-159, score-0.485]
75 In this parametrization a rotation matrix W ∈ R^{d×d} is represented by a product of d(d − 1)/2 plane rotations: W = Π_{s=1}^{d−1} Π_{t=s+1}^{d} Gst (21) where Gst is a rotation matrix corresponding to a rotation in the (s, t) plane by an angle λst . [sent-163, score-0.72]
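A sketch of the parametrization in Eq. (21); the sign convention of each plane rotation is our assumption:

import numpy as np

def rotation_from_angles(lambdas, d):
    # Build W as a product of d(d-1)/2 Givens rotations G_st, Eq. (21);
    # lambdas is a flat array of the angles lambda_st ordered by (s, t), s < t.
    W = np.eye(d)
    idx = 0
    for s in range(d - 1):
        for t in range(s + 1, d):
            G = np.eye(d)
            c, sn = np.cos(lambdas[idx]), np.sin(lambdas[idx])
            G[s, s], G[t, t] = c, c
            G[s, t], G[t, s] = -sn, sn
            W = W @ G
            idx += 1
    return W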
76 The contrast function S(W ) and its gradient ∂S/∂λst may in theory suffer from discontinuities if a row wt is perpendicular to a vector xi − xj . [sent-165, score-0.309]
77 One thousand pairs of random numbers x and y are mixed as x′ = x cos φ + y sin φ, y′ = −x sin φ + y cos φ, with a random angle φ common to all pairs (i.e., a common random rotation). [sent-173, score-0.32]
78 We applied conjugate gradient methods for the optimization of the contrast function (25) with ε = 1/n = 0.001. [sent-176, score-0.214]
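As a toy 2-D illustration of this experiment (reusing the meannn_entropy sketch above; we replace the paper's conjugate-gradient optimizer by a bounded 1-D search, since in two dimensions there is a single rotation angle, and for d > 2 one would instead parametrize W with the Givens sketch and use gradients):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=(1000, 2))               # two independent sources
phi = rng.uniform(0, np.pi / 2)                       # unknown mixing angle
A = np.array([[np.cos(phi), np.sin(phi)],
              [-np.sin(phi), np.cos(phi)]])
X = S @ A.T
X = (X - X.mean(0)) / X.std(0)     # roughly white here, since the sources have equal variance

def contrast(theta):
    # sum of the marginal MeanNN entropies of the rotated data, Eqs. (13) and (19)
    W = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    Y = X @ W.T
    return meannn_entropy(Y[:, [0]]) + meannn_entropy(Y[:, [1]])

res = minimize_scalar(contrast, bounds=(0, np.pi / 2), method="bounded")
W = np.array([[np.cos(res.x), np.sin(res.x)],
              [-np.sin(res.x), np.cos(res.x)]])
# W @ A should be close to a signed permutation matrix, reflecting the usual
# scaling/permutation ambiguity of ICA.
print(np.round(W @ A, 2))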
79 To assess the quality of the estimator Â (or, equivalently, of the back transformation Ŵ = Â−1 ), we use the Amari performance index Perr from [1]. [sent-179, score-0.259]
80 Perr = (1/(2d)) Σ_{i,j=1}^{d} ( |pij | / max_k |pik | + |pij | / max_k |pkj | ) − 1 (27) where pij = (A^{−1} Â)_{ij} . [sent-180, score-0.114]
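A small sketch of the index in Eq. (27); the function name is ours:

import numpy as np

def amari_index(A, A_hat):
    # P_err = 1/(2d) * sum_{i,j} ( |p_ij|/max_k|p_ik| + |p_ij|/max_k|p_kj| ) - 1
    # with P = A^{-1} @ A_hat; it is zero exactly when P is a scaled permutation.
    P = np.abs(np.linalg.solve(A, A_hat))
    d = P.shape[0]
    rows = (P / P.max(axis=1, keepdims=True)).sum()
    cols = (P / P.max(axis=0, keepdims=True)).sum()
    return (rows + cols) / (2.0 * d) - 1.0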
81 For the first two techniques, which utilize information theoretic measures assessed by order statistics, it is highly recommended to use dataset augmentation. [sent-183, score-0.089]
82 The proposed method gives smooth results without any additional augmentation due to its smooth nature (see Eq. [sent-185, score-0.143]
83 We compare the different contrast functions, based on different order statistics estimators, over a grid of possible rotation angles for the mixture of two exponentially distributed random variables (case e). [sent-255, score-0.361]
84 The contrast function corresponding to the order statistic k = 10 generally coincides with the MILCA approach. [sent-256, score-0.104]
85 Also, the contrast function corresponding to the order statistic k = 30 generally coincides with the RADICAL method. [sent-257, score-0.104]
86 One may see that the MeanNN ICA contrast function leads to a much more robust prediction of the rotation angle. [sent-258, score-0.23]
87 One should mention that the gradient-based optimization enables obtaining the global optimum with high precision, as opposed to the MILCA and RADICAL schemes, which utilize subspace grid optimization. [sent-259, score-0.285]
88 Applying gradient-based optimization schemes also brings a computational advantage. [sent-260, score-0.139]
89 The number of required function evaluations was limited to 20, as opposed to 150 evaluations for the grid optimization schemes MILCA and RADICAL. [sent-261, score-0.111]
90 Contrast function dependence on a rotation angle for different entropy estimators. [sent-279, score-0.415]
91 For that purpose we chose D (generally different) distributions at random, then mixed them by a random rotation and ran the compared ICA algorithms to recover the rotation matrix. [sent-283, score-0.334]
92 (1000 samples, 10 repetitions.) 6 Conclusion We proposed a novel approach for the estimation of main information theoretic measures such as differential entropy, mutual information and divergence. [sent-299, score-0.327]
93 The estimators are smooth differentiable functions with a clear geometrical meaning. [sent-300, score-0.373]
94 The first group is based on exact entropy estimation, which usually leads to high performance, as demonstrated by MILCA and RADICAL. [sent-304, score-0.226]
95 The drawback of such estimators is the lack of an analytical gradient and, therefore, numerical difficulties in the optimization. [sent-305, score-0.264]
96 The second group applies criteria other than entropy, which benefit from easy gradient calculation (KernelICA). [sent-306, score-0.308]
97 It provides a contrast function based on an accurate entropy estimation whose gradient is given analytically; therefore it may be readily optimized. [sent-309, score-0.454]
98 Finally we mention that the proposed estimation method may further be applied to various problems in the field of machine learning and beyond. [sent-310, score-0.07]
99 An information-maximization approach to blind separation and blind deconvolution. [sent-328, score-0.181]
100 Blind separation of mixtures of independent signals through a quasi-maximum likelihood approach. [sent-353, score-0.074]
wordName wordTfidf (topN-words)
[('meannn', 0.521), ('ica', 0.307), ('estimator', 0.259), ('gst', 0.239), ('entropy', 0.226), ('milca', 0.195), ('knn', 0.163), ('st', 0.152), ('pik', 0.152), ('rotation', 0.151), ('estimators', 0.145), ('differential', 0.133), ('guv', 0.13), ('mutual', 0.127), ('radical', 0.122), ('hmean', 0.108), ('neighbor', 0.105), ('kernelica', 0.087), ('wqr', 0.087), ('gradient', 0.082), ('yd', 0.081), ('log', 0.081), ('contrast', 0.079), ('cos', 0.076), ('blind', 0.074), ('xi', 0.073), ('nearest', 0.073), ('cd', 0.072), ('pi', 0.071), ('aug', 0.065), ('sin', 0.065), ('amari', 0.061), ('distances', 0.061), ('angles', 0.058), ('smooth', 0.054), ('utilize', 0.053), ('pij', 0.052), ('rotations', 0.051), ('divergence', 0.05), ('unbiasedness', 0.046), ('nonsymmetric', 0.043), ('perr', 0.043), ('xjr', 0.043), ('xj', 0.041), ('mixtures', 0.041), ('geometrical', 0.041), ('mention', 0.039), ('angle', 0.038), ('givens', 0.038), ('xir', 0.038), ('enlargement', 0.038), ('numerical', 0.037), ('theoretic', 0.036), ('analytically', 0.036), ('augmentation', 0.035), ('wq', 0.035), ('gbauer', 0.035), ('kraskov', 0.035), ('hyvarinen', 0.035), ('wt', 0.034), ('separation', 0.033), ('multidimensional', 0.033), ('semiparametric', 0.033), ('recover', 0.032), ('optimization', 0.032), ('sources', 0.031), ('estimation', 0.031), ('shannon', 0.031), ('vm', 0.031), ('maxk', 0.031), ('qr', 0.029), ('mi', 0.029), ('hardly', 0.028), ('grid', 0.028), ('mixing', 0.028), ('signal', 0.027), ('intensive', 0.027), ('parametrization', 0.027), ('veri', 0.027), ('component', 0.027), ('opposed', 0.026), ('unbiased', 0.026), ('usage', 0.026), ('student', 0.025), ('coincides', 0.025), ('schemes', 0.025), ('matrix', 0.025), ('sample', 0.025), ('gaussians', 0.025), ('multiplied', 0.025), ('asymptotic', 0.024), ('yj', 0.024), ('xn', 0.024), ('components', 0.023), ('criteria', 0.023), ('kept', 0.023), ('double', 0.022), ('obtains', 0.022), ('hk', 0.022), ('conjugate', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
Author: Lev Faivishevsky, Jacob Goldberger
Abstract: In this paper we introduce the MeanNN approach for estimation of main information theoretic measures such as differential entropy, mutual information and divergence. As opposed to other nonparametric approaches the MeanNN results in smooth differentiable functions of the data samples with clear geometrical interpretation. Then we apply the proposed estimators to the ICA problem and obtain a smooth expression for the mutual information that can be analytically optimized by gradient descent methods. The improved performance of the proposed ICA algorithm is demonstrated on several test examples in comparison with state-ofthe-art techniques. 1
2 0.20812751 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
Author: Fernando Pérez-Cruz
Abstract: We analyze the estimation of information theoretic measures of continuous random variables such as: differential entropy, mutual information or KullbackLeibler divergence. The objective of this paper is two-fold. First, we prove that the information theoretic measure estimates using the k-nearest-neighbor density estimation with fixed k converge almost surely, even though the k-nearest-neighbor density estimation with fixed k does not converge to its true measure. Second, we show that the information theoretic measure estimates do not converge for k growing linearly with the number of samples. Nevertheless, these nonconvergent estimates can be used for solving the two-sample problem and assessing if two random variables are independent. We show that the two-sample and independence tests based on these nonconvergent estimates compare favorably with the maximum mean discrepancy test and the Hilbert Schmidt independence criterion. 1
3 0.14917809 232 nips-2008-The Conjoint Effect of Divisive Normalization and Orientation Selectivity on Redundancy Reduction
Author: Fabian H. Sinz, Matthias Bethge
Abstract: Bandpass filtering, orientation selectivity, and contrast gain control are prominent features of sensory coding at the level of V1 simple cells. While the effect of bandpass filtering and orientation selectivity can be assessed within a linear model, contrast gain control is an inherently nonlinear computation. Here we employ the class of Lp elliptically contoured distributions to investigate the extent to which the two features—orientation selectivity and contrast gain control—are suited to model the statistics of natural images. Within this framework we find that contrast gain control can play a significant role for the removal of redundancies in natural images. Orientation selectivity, in contrast, has only a very limited potential for redundancy reduction. 1
4 0.13780133 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
Author: Siwei Lyu, Eero P. Simoncelli
Abstract: We consider the problem of transforming a signal to a representation in which the components are statistically independent. When the signal is generated as a linear transformation of independent Gaussian or non-Gaussian sources, the solution may be computed using a linear transformation (PCA or ICA, respectively). Here, we consider a complementary case, in which the source is non-Gaussian but elliptically symmetric. Such a source cannot be decomposed into independent components using a linear transform, but we show that a simple nonlinear transformation, which we call radial Gaussianization (RG), is able to remove all dependencies. We apply this methodology to natural signals, demonstrating that the joint distributions of nearby bandpass filter responses, for both sounds and images, are closer to being elliptically symmetric than linearly transformed factorial sources. Consistent with this, we demonstrate that the reduction in dependency achieved by applying RG to either pairs or blocks of bandpass filter responses is significantly greater than that achieved by PCA or ICA.
5 0.13068435 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
6 0.11615456 234 nips-2008-The Infinite Factorial Hidden Markov Model
7 0.1010971 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
8 0.098403879 107 nips-2008-Influence of graph construction on graph-based clustering measures
9 0.097945273 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
10 0.069922306 12 nips-2008-Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes
11 0.068242088 112 nips-2008-Kernel Measures of Independence for non-iid Data
12 0.065421484 113 nips-2008-Kernelized Sorting
13 0.062906213 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
14 0.055580247 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
15 0.054738816 136 nips-2008-Model selection and velocity estimation using novel priors for motion patterns
16 0.054576166 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
17 0.053261563 62 nips-2008-Differentiable Sparse Coding
18 0.049742594 78 nips-2008-Exact Convex Confidence-Weighted Learning
19 0.048335314 178 nips-2008-Performance analysis for L\ 2 kernel classification
20 0.048036732 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor
topicId topicWeight
[(0, -0.167), (1, 0.002), (2, -0.002), (3, 0.072), (4, 0.067), (5, 0.012), (6, 0.01), (7, 0.094), (8, 0.05), (9, 0.02), (10, 0.017), (11, -0.075), (12, 0.115), (13, 0.077), (14, -0.261), (15, -0.097), (16, 0.125), (17, 0.184), (18, 0.032), (19, 0.194), (20, -0.125), (21, 0.139), (22, 0.04), (23, 0.114), (24, -0.014), (25, -0.048), (26, 0.098), (27, -0.062), (28, 0.099), (29, -0.09), (30, -0.023), (31, 0.022), (32, -0.026), (33, -0.077), (34, -0.072), (35, -0.009), (36, -0.14), (37, 0.101), (38, 0.011), (39, -0.008), (40, 0.045), (41, 0.029), (42, 0.095), (43, -0.064), (44, -0.043), (45, -0.086), (46, -0.093), (47, 0.035), (48, -0.006), (49, -0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.95402795 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
Author: Lev Faivishevsky, Jacob Goldberger
Abstract: In this paper we introduce the MeanNN approach for estimation of main information theoretic measures such as differential entropy, mutual information and divergence. As opposed to other nonparametric approaches the MeanNN results in smooth differentiable functions of the data samples with clear geometrical interpretation. Then we apply the proposed estimators to the ICA problem and obtain a smooth expression for the mutual information that can be analytically optimized by gradient descent methods. The improved performance of the proposed ICA algorithm is demonstrated on several test examples in comparison with state-ofthe-art techniques. 1
2 0.7514255 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
Author: Fernando Pérez-Cruz
Abstract: We analyze the estimation of information theoretic measures of continuous random variables such as: differential entropy, mutual information or KullbackLeibler divergence. The objective of this paper is two-fold. First, we prove that the information theoretic measure estimates using the k-nearest-neighbor density estimation with fixed k converge almost surely, even though the k-nearest-neighbor density estimation with fixed k does not converge to its true measure. Second, we show that the information theoretic measure estimates do not converge for k growing linearly with the number of samples. Nevertheless, these nonconvergent estimates can be used for solving the two-sample problem and assessing if two random variables are independent. We show that the two-sample and independence tests based on these nonconvergent estimates compare favorably with the maximum mean discrepancy test and the Hilbert Schmidt independence criterion. 1
3 0.5762639 54 nips-2008-Covariance Estimation for High Dimensional Data Vectors Using the Sparse Matrix Transform
Author: Guangzhi Cao, Charles Bouman
Abstract: Covariance estimation for high dimensional vectors is a classically difficult problem in statistical analysis and machine learning. In this paper, we propose a maximum likelihood (ML) approach to covariance estimation, which employs a novel sparsity constraint. More specifically, the covariance is constrained to have an eigen decomposition which can be represented as a sparse matrix transform (SMT). The SMT is formed by a product of pairwise coordinate rotations known as Givens rotations. Using this framework, the covariance can be efficiently estimated using greedy minimization of the log likelihood function, and the number of Givens rotations can be efficiently computed using a cross-validation procedure. The resulting estimator is positive definite and well-conditioned even when the sample size is limited. Experiments on standard hyperspectral data sets show that the SMT covariance estimate is consistently more accurate than both traditional shrinkage estimates and recently proposed graphical lasso estimates for a variety of different classes and sample sizes. 1
4 0.56464064 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
Author: Siwei Lyu, Eero P. Simoncelli
Abstract: We consider the problem of transforming a signal to a representation in which the components are statistically independent. When the signal is generated as a linear transformation of independent Gaussian or non-Gaussian sources, the solution may be computed using a linear transformation (PCA or ICA, respectively). Here, we consider a complementary case, in which the source is non-Gaussian but elliptically symmetric. Such a source cannot be decomposed into independent components using a linear transform, but we show that a simple nonlinear transformation, which we call radial Gaussianization (RG), is able to remove all dependencies. We apply this methodology to natural signals, demonstrating that the joint distributions of nearby bandpass filter responses, for both sounds and images, are closer to being elliptically symmetric than linearly transformed factorial sources. Consistent with this, we demonstrate that the reduction in dependency achieved by applying RG to either pairs or blocks of bandpass filter responses is significantly greater than that achieved by PCA or ICA.
5 0.53350455 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
Author: Iain Murray, Ruslan Salakhutdinov
Abstract: We present a simple new Monte Carlo algorithm for evaluating probabilities of observations in complex latent variable models, such as Deep Belief Networks. While the method is based on Markov chains, estimates based on short runs are formally unbiased. In expectation, the log probability of a test set will be underestimated, and this could form the basis of a probabilistic bound. The method is much cheaper than gold-standard annealing-based methods and only slightly more expensive than the cheapest Monte Carlo methods. We give examples of the new method substantially improving simple variational bounds at modest extra cost. 1
7 0.50613916 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
8 0.47557712 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
9 0.42769063 234 nips-2008-The Infinite Factorial Hidden Markov Model
10 0.41973394 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling
11 0.40132612 113 nips-2008-Kernelized Sorting
12 0.39468348 178 nips-2008-Performance analysis for L\ 2 kernel classification
13 0.36323655 233 nips-2008-The Gaussian Process Density Sampler
14 0.30065337 8 nips-2008-A general framework for investigating how far the decoding process in the brain can be simplified
15 0.29133996 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
16 0.29011789 196 nips-2008-Relative Margin Machines
17 0.28915408 78 nips-2008-Exact Convex Confidence-Weighted Learning
18 0.28597334 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
19 0.27473965 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
20 0.264272 112 nips-2008-Kernel Measures of Independence for non-iid Data
topicId topicWeight
[(6, 0.061), (7, 0.106), (12, 0.049), (26, 0.306), (28, 0.165), (57, 0.073), (59, 0.019), (63, 0.017), (71, 0.014), (77, 0.052), (83, 0.049)]
simIndex simValue paperId paperTitle
1 0.74341959 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph
Author: Deli Zhao, Xiaoou Tang
Abstract: Detecting underlying clusters from large-scale data plays a central role in machine learning research. In this paper, we tackle the problem of clustering complex data of multiple distributions and multiple scales. To this end, we develop an algorithm named Zeta l-links (Zell) which consists of two parts: Zeta merging with a similarity graph and an initial set of small clusters derived from local l-links of samples. More specifically, we propose to structurize a cluster using cycles in the associated subgraph. A new mathematical tool, Zeta function of a graph, is introduced for the integration of all cycles, leading to a structural descriptor of a cluster in determinantal form. The popularity character of a cluster is conceptualized as the global fusion of variations of such a structural descriptor by means of the leave-one-out strategy in the cluster. Zeta merging proceeds, in the hierarchical agglomerative fashion, according to the maximum incremental popularity among all pairwise clusters. Experiments on toy data clustering, imagery pattern clustering, and image segmentation show the competitive performance of Zell. The 98.1% accuracy, in the sense of the normalized mutual information (NMI), is obtained on the FRGC face data of 16028 samples and 466 facial clusters. 1
same-paper 2 0.72924048 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
Author: Lev Faivishevsky, Jacob Goldberger
Abstract: In this paper we introduce the MeanNN approach for estimation of main information theoretic measures such as differential entropy, mutual information and divergence. As opposed to other nonparametric approaches the MeanNN results in smooth differentiable functions of the data samples with clear geometrical interpretation. Then we apply the proposed estimators to the ICA problem and obtain a smooth expression for the mutual information that can be analytically optimized by gradient descent methods. The improved performance of the proposed ICA algorithm is demonstrated on several test examples in comparison with state-ofthe-art techniques. 1
3 0.59287804 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
4 0.59208542 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
5 0.59062511 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
Author: Yen-yu Lin, Tyng-luh Liu, Chiou-shann Fuh
Abstract: In solving complex visual learning tasks, adopting multiple descriptors to more precisely characterize the data has been a feasible way for improving performance. These representations are typically high dimensional and assume diverse forms. Thus finding a way to transform them into a unified space of lower dimension generally facilitates the underlying tasks, such as object recognition or clustering. We describe an approach that incorporates multiple kernel learning with dimensionality reduction (MKL-DR). While the proposed framework is flexible in simultaneously tackling data in various feature representations, the formulation itself is general in that it is established upon graph embedding. It follows that any dimensionality reduction techniques explainable by graph embedding can be generalized by our method to consider data in multiple feature representations.
6 0.59029138 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
7 0.58844876 118 nips-2008-Learning Transformational Invariants from Natural Movies
8 0.58837587 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
9 0.58740783 138 nips-2008-Modeling human function learning with Gaussian processes
10 0.58712536 194 nips-2008-Regularized Learning with Networks of Features
11 0.58667576 4 nips-2008-A Scalable Hierarchical Distributed Language Model
12 0.58592904 66 nips-2008-Dynamic visual attention: searching for coding length increments
13 0.58592391 200 nips-2008-Robust Kernel Principal Component Analysis
14 0.58547539 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
15 0.58506584 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
16 0.58367503 60 nips-2008-Designing neurophysiology experiments to optimally constrain receptive field models along parametric submanifolds
17 0.58357799 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
18 0.58354664 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
19 0.58335727 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
20 0.58325195 99 nips-2008-High-dimensional support union recovery in multivariate regression