nips nips2005 nips2005-66 knowledge-graph by maker-knowledge-mining

66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization


Source: pdf

Author: Maxim Raginsky, Svetlana Lazebnik

Abstract: We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. [sent-2, score-1.039]

2 The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. [sent-3, score-0.593]

3 Introduction The goal of nonlinear dimensionality reduction (NLDR) [1, 2, 3] is to find low-dimensional manifold descriptions of high-dimensional data. [sent-6, score-0.247]

4 Most NLDR schemes require a good estimate of the intrinsic dimensionality of the data to be available in advance. [sent-7, score-0.326]

5 A number of existing methods for estimating the intrinsic dimension (e.g. [sent-8, score-0.341]

6 [3, 4, 5]) rely on the fact that, for data uniformly distributed on a d-dimensional compact smooth submanifold of IR^D, the probability of a small ball of radius ǫ around any point on the manifold is Θ(ǫ^d). [sent-10, score-0.309]

7 In this paper, we connect this argument with the notion of quantization dimension [6, 7], which relates the intrinsic dimension of a manifold (a topological property) to the asymptotic optimal quantization error for distributions on the manifold (an operational property). [sent-11, score-1.338]

8 However, to the best of our knowledge, it has not been previously used for dimension estimation in manifold learning. [sent-13, score-0.338]

9 The definition of quantization dimension leads to a family of dimensionality estimation algorithms, parametrized by the distortion exponent r ∈ [1, ∞), yielding in the limit of r = ∞ a scheme equivalent to Kégl's recent technique based on packing numbers [4]. [sent-14, score-0.88]

10 To date, many theoretical aspects of intrinsic dimensionality estimation remain poorly understood. [sent-15, score-0.327]

11 data: they compute the dimensionality estimate from a fixed training sequence (typically, the entire dataset of interest), whereas we show that an independent test sequence is necessary to avoid overfitting. [sent-20, score-0.394]

12 In addition, using the framework of high-rate vector quantization allows us to analyze the performance of our scheme in the presence of noise. [sent-21, score-0.247]

13 Quantization-based estimation of intrinsic dimension Let us begin by introducing the definitions and notation used in the rest of the paper. [sent-23, score-0.367]

14 A D-dimensional k-point vector quantizer [6] is a measurable map Qk : IR^D → C, where C = {y1, . . . [sent-24, score-0.516]

15 , yk} ⊂ IR^D is called the codebook and the yi's are called the codevectors. [sent-27, score-0.172]

16 The sets Ri ≜ {x ∈ IR^D : Qk(x) = yi}, 1 ≤ i ≤ k, are called the quantizer cells (or partition regions). [sent-29, score-0.516]

17 The quantizer performance on a random vector X distributed according to a probability distribution µ (denoted X ∼ µ) is measured by the average rth-power distortion δr(Qk|µ) ≜ Eµ‖X − Qk(X)‖^r, r ∈ [1, ∞), where ‖·‖ is the Euclidean norm on IR^D. [sent-30, score-0.628]

18 In the sequel, we will often find it more convenient to work with the quantizer error er(Qk|µ) ≜ δr(Qk|µ)^(1/r). [sent-31, score-0.674]
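
As a concrete illustration of the flattened formulas above (a minimal sketch in numpy, not the authors' code; all function and variable names are my own), a k-point quantizer can be represented by its codebook, and the empirical rth-power distortion δr and error er = δr^(1/r) computed on a finite sample:

```python
import numpy as np

def quantize(X, codebook):
    """Map each row of X (n x D) to its nearest codevector in codebook (k x D)."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook[d2.argmin(axis=1)]

def quantizer_error(X, codebook, r=2.0):
    """Empirical rth-power distortion delta_r and quantizer error e_r = delta_r**(1/r)."""
    norms = np.linalg.norm(X - quantize(X, codebook), axis=1)
    delta_r = np.mean(norms ** r)
    return delta_r ** (1.0 / r)
```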

19 Then the performance achieved by an optimal k-point quantizer on X is δr*(k|µ) ≜ inf_{Qk ∈ Qk} δr(Qk|µ), or equivalently, er*(k|µ) ≜ δr*(k|µ)^(1/r). [sent-33, score-0.585]

20 When the quantizer rate is high, the partition cells can be well approximated by D-dimensional balls around the codevectors. [sent-37, score-0.614]

21 This is referred to as the high-rate (or high-resolution) approximation, and motivates the definition of the quantization dimension of order r: dr(µ) ≜ −lim_{k→∞} log k / log er*(k|µ). [sent-39, score-0.455]

22 The theory of high-rate quantization confirms that, for a regular µ supported on the manifold M, dr(µ) exists for all 1 ≤ r ≤ ∞ and equals the intrinsic dimension of M [7, 6]. [sent-40, score-0.787]

23 This definition immediately suggests an empirical procedure for estimating the intrinsic dimension of a manifold from a set of samples. [sent-44, score-0.508]

24 Briefly, we select a range k1 ≤ k ≤ k2 of codebook sizes for which the high-rate approximation holds (see Sec. [sent-53, score-0.196]

25 3 for implementation details), and design a sequence of quantizers {Q̂k}, k1 ≤ k ≤ k2, that give us good approximations êr(k|µ) to the optimal error er*(k|µ) over the chosen range of k. [sent-54, score-0.189]

26 Then an estimate of the intrinsic dimension is obtained by plotting log k vs. [sent-55, score-0.419]

27 −log êr(k|µ) and measuring the slope of the plot over the chosen range of k (because the high-rate approximation holds, the plot is linear). [sent-56, score-0.301]
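
The whole loop can be sketched as follows (again illustrative only: a plain Lloyd/k-means iteration stands in for the quantizer design used in the paper, `quantizer_error` is the helper sketched earlier, and the codebook sizes `ks` are whatever range satisfies the high-rate approximation):

```python
def lloyd(X, k, iters=50, seed=0):
    """Plain Lloyd (k-means) iteration: a stand-in design for the r = 2 quantizer."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave codevectors of empty cells unchanged
                codebook[j] = X[labels == j].mean(axis=0)
    return codebook

def dimension_estimate(X_train, X_test, ks, r=2.0):
    """Slope of log k vs. -log e_r(k), measured on an independent test sequence."""
    log_k = np.log(np.asarray(ks, dtype=float))
    neg_log_e = np.array([-np.log(quantizer_error(X_test, lloyd(X_train, k), r))
                          for k in ks])
    slope, _ = np.polyfit(neg_log_e, log_k, 1)   # slope ~ intrinsic dimension d
    return slope
```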

28 the normalized surface measure on M is regular of dimension d. [sent-63, score-0.188]

29 A less biased estimate is given by the performance of Q̂k on a test sequence independent from the training set. [sent-68, score-0.261]

30 Provided m is sufficiently large, the law of large numbers guarantees that the empirical average êr(Q̂k|µtest) = ((1/m) Σ_{i=1}^{m} ‖Zi − Q̂k(Zi)‖^r)^(1/r) will be a good estimate of the test error er(Q̂k|µ). [sent-76, score-0.434]

31 Using learning-theoretic formalism [8], one can show that the test error of an empirically optimal quantizer is a strongly consistent estimate of er*(k|µ). [sent-77, score-0.868]

32 The optimum e∞*(k|µ) = inf_{Qk ∈ Qk} e∞(Qk|µ) has an interesting interpretation as the smallest covering radius of the most parsimonious covering of supp(µ) by k or fewer balls of equal radii [6]. [sent-85, score-0.254]

33 Let us describe how the r = ∞ case is equivalent to dimensionality estimation using packing numbers [4]. [sent-86, score-0.376]

34 The covering number NM(ǫ) of a manifold M ⊂ IR^D is defined as the size of the smallest covering of M by balls of radius ǫ > 0, while the packing number PM(ǫ) is the cardinality of the maximal set S ⊂ M with ‖x − y‖ ≥ ǫ for all distinct x, y ∈ S. [sent-87, score-0.574]

35 If d is the dimension of M, then NM(ǫ) = Θ(ǫ^(−d)) for small enough ǫ, leading to the definition of the capacity dimension: dcap(M) ≜ −lim_{ǫ→0} log NM(ǫ) / log ǫ. [sent-88, score-0.24]

36 If this limit exists, then it equals the intrinsic dimension of M . [sent-89, score-0.339]

37 Alternatively, Kégl [4] suggests using the easily proved inequality NM(ǫ) ≤ PM(ǫ) ≤ NM(ǫ/2) to express the capacity dimension in terms of packing numbers as dcap(M) = −lim_{ǫ→0} log PM(ǫ) / log ǫ. [sent-90, score-0.527]

38 Then it is not hard to show that −log PM(2ǫk)/log 2ǫk ≤ −log k/log e∞*(k|µ) < −log PM(ǫk)/log ǫk for all k ≥ k0. [sent-95, score-0.318]

39 (i.e., in the high-rate regime) the ratio −log k/log e∞*(k|µ) can be approximated increasingly finely both from below and from above by quantities involving the packing numbers PM(ǫk) and PM(2ǫk) and converging to the common value dcap(M). [sent-98, score-0.406]

40 This demonstrates that the r = ∞ case of our scheme is numerically equivalent to Kégl's method based on packing numbers. [sent-99, score-0.232]

41 For a finite training set, the r = ∞ case requires us to find an empirically optimal k-point quantizer w.r.t. [sent-100, score-0.715]

42 In theory, this worst-case quantizer is completely insensitive to variations in sampling density, since the optimal error e∞*(k|µ) is the same for all µ with the same support. [sent-106, score-0.635]

43 This will cause the empirically designed quantizer to be matched to the noisy distribution ν, whereas our aim is to estimate optimal quantizer performance on the original clean data. [sent-111, score-1.234]

44 It is a natural measure of quantizer mismatch, i.e., [sent-113, score-0.516]

45 the difference in performance that results from using a quantizer matched to ν on data distributed according to µ [9]. [sent-115, score-0.539]

46 It is possible to show (details omitted for lack of space) that for an empirically optimal k-point quantizer Q*_{k,r} trained on n samples of ν, |er(Q*_{k,r}|ν) − er*(k|µ)| ≤ 2ρ̄r(νn, ν) + ρ̄r(µ, ν). [sent-120, score-0.606]

47 Thus, provided the training set is sufficiently large, the distortion estimation error is controlled by ρ̄r(µ, ν). [sent-122, score-0.337]

48 The magnitude of the bound, and hence the worst-case sensitivity of the estimation procedure to noise, is controlled by the noise variance, by the extrinsic dimension, and by the distortion exponent. [sent-127, score-0.266]

49 The factor involving the gamma functions grows without bound both as D → ∞ and as r → ∞, which suggests that the susceptibility of our algorithm to noise increases with the extrinsic dimension of the data and with the distortion exponent. [sent-128, score-0.386]

50 [Figure 2 panel axes and legends: Rate (log k), −log(Error), training/test error and fits.] [sent-141, score-0.23]

52 Figure 2: (a) The swiss roll (20,000 samples). [sent-153, score-0.211]

53 negative log of the quantizer error (log-log curves), together with parametric curves fitted using linear least squares (see text). [sent-155, score-0.807]

54 Thus, we were compelled to also implement a greedy algorithm reminiscent of Kégl's algorithm for estimating the packing number [4]: supposing that k − 1 codevectors have already been selected, the kth one is chosen to be the sample point with the largest distance from the nearest codevector. [sent-167, score-0.291]

55 Because this is the point that gives the worst-case error for codebook size k −1, adding it to the codebook lowers the error. [sent-168, score-0.432]
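
A sketch of this greedy selection (illustrative, not the authors' implementation): each new codevector is the sample farthest from the current codebook, i.e. the point attaining the worst-case (r = ∞) error, and the running distance array directly gives that worst-case training error.

```python
def greedy_codebook(X, k):
    """Select k codevectors from the samples X (n x D) by farthest-point insertion."""
    chosen = [0]                                   # start from an arbitrary sample
    dist = np.linalg.norm(X - X[0], axis=1)        # distance to nearest chosen codevector
    while len(chosen) < k:
        nxt = int(dist.argmax())                   # worst-case point for the current codebook
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen], dist.max()                   # codebook and its r = infinity training error
```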

56 For the experiment shown in Figure 3, the training error curves produced by this greedy algorithm were on average 21% higher than those of the Lloyd algorithm, but the test curves were only 8% higher. [sent-170, score-0.561]

57 In many cases, the two test curves are visually almost coincident (Figure 1). [sent-171, score-0.207]

58 The lowest rate is 5 bits, and the highest rate is chosen as log(n/2), where n is the size of the training set. [sent-179, score-0.191]

59 The high-rate approximation suggests the asymptotic form Θ(k^(−1/d)) for the quantizer error as a function of codebook size k. [sent-180, score-0.839]

60 To validate this approximation, we use linear least squares to fit curves of the form a + b k^(−1/2) to the r = 2 training and test distortion curves for the swiss roll. [sent-181, score-0.568]
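
Since the model is linear in the unknown constants a and b once k^(−1/2) (more generally k^(−1/d)) is used as the regressor, a single least-squares solve suffices; a sketch with hypothetical arrays `ks` and `errors` holding the measured codebook sizes and error values:

```python
def fit_high_rate_curve(ks, errors, d=2):
    """Least-squares fit of errors ~ a + b * k**(-1/d); returns (a, b)."""
    ks = np.asarray(ks, dtype=float)
    A = np.column_stack([np.ones_like(ks), ks ** (-1.0 / d)])
    (a, b), *rest = np.linalg.lstsq(A, np.asarray(errors, dtype=float), rcond=None)
    return a, b
```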

61 the negative logarithm of the training and test error (“log-log curves” in the following). [sent-189, score-0.322]

62 Note that the additive constant for the training error is negative, reflecting the fact that the training error of the empirical quantizer is identically zero when n = k (each sample becomes a codevector). [sent-190, score-0.97]

63 On the other hand, the test error has a positive additive constant as a consequence of quantizer suboptimality. [sent-191, score-0.741]

64 Significantly, the fit deteriorates as n/k → 1, as the average number of training samples per quantizer cell becomes too small to sustain the exponentially slow decay required for the high-rate approximation. [sent-192, score-0.676]

65 2(c) shows the slopes of the training and test log-log curves, obtained by fitting a line to each successive set of 10 points. [sent-194, score-0.278]

66 These slopes are, in effect, rate-dependent dimensionality estimates for the dataset. [sent-195, score-0.26]
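
A sketch of these rate-dependent estimates, assuming `log_k` and `neg_log_e` hold the coordinates of a log-log curve as in the earlier snippet; the window of 10 points follows the description above:

```python
def rate_dependent_slopes(log_k, neg_log_e, window=10):
    """Slope of the log-log curve over each successive window of points."""
    slopes = []
    for i in range(len(log_k) - window + 1):
        s, _ = np.polyfit(neg_log_e[i:i + window], log_k[i:i + window], 1)
        slopes.append(s)
    return slopes   # rate-dependent dimensionality estimates
```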

67 Note that the training slope is always below the test slope; this is a consequence of the “optimism” of the training error and the “pessimism” of the test error (as reflected in the additive constants of the parametric fits). [sent-196, score-0.751]

68 The shapes of the two slope curves are typical of many “well-behaved” datasets. [sent-197, score-0.201]

69 At low rates, both the training and the test slopes are close to the extrinsic dimension, reflecting the global geometry of the dataset. [sent-198, score-0.355]

70 As rate increases, the local manifold structure is revealed, and the slope yields its intrinsic dimension. [sent-199, score-0.45]

71 However, as n/k → 1, the quantizer begins to “see” isolated samples instead of the manifold structure. [sent-200, score-0.709]

72 Thus, the training slope begins to fall to zero, and the test slope rises, reflecting the failure of the quantizer to generalize to the test set. [sent-201, score-1.021]

73 For most datasets in our experiments, a good intrinsic dimensionality estimate is given by the first minimum of the test slope where the line-fitting residual is sufficiently low (marked by a diamond in Fig. [sent-202, score-0.524]

74 For completeness, we also report the slope of the training curve at the same rate (note that the training curve may not have local minima because of its tendency to fall as the rate increases). [sent-204, score-0.45]

75 Interestingly, some datasets yield several well-defined dimensionality estimates at different rates. [sent-205, score-0.193]

76 Accordingly, the log-log plot of the test error (Fig. [sent-208, score-0.219]

77 2(e)) has two distinct linear parts, yielding dimension estimates of 2. [sent-209, score-0.258]

78 1 that the high-rate approximation for regular probability distributions is based on the assumption that the intersection of each quantizer cell with the manifold is a d-dimensional neighborhood of that manifold. [sent-215, score-0.725]

79 Because we compute our dimensionality estimate at a rate for which this approximation is valid, we know that the empirically optimal quantizer at this rate partitions the data into clusters that are locally d-dimensional. [sent-216, score-0.893]

80 Thus, our dimensionality estimation procedure is also useful for finding a clustering of the data that respects the intrinsic neighborhood structure of the manifold from which it is sampled. [sent-217, score-0.469]
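
As a side effect, the cell assignments of the designed quantizer are themselves such a clustering; a minimal sketch using the hypothetical codebook representation from the snippets above:

```python
def quantizer_clusters(X, codebook):
    """Cluster label of each sample = index of its quantizer cell."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```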

81 2(c), we obtain two distinct dimensionality estimates of 2 and 1 at rates 6. [sent-219, score-0.218]

82 To ascertain the effect of noise and extrinsic dimension on our method, we have embedded the swiss roll in dimensions 4 to 8 by zero-padding the coordinates and applying a random orthogonal matrix, and added isotropic zero-mean Gaussian noise in the high-dimensional space, with σ = 0. [sent-226, score-0.411]

83 3(a) shows the maximum differences between the noisy and the noiseless test error curves for each combination of D and σ, and the bottom part shows the corresponding values of the Wasserstein bound σ√D for comparison. [sent-236, score-0.413]

84 For each value of σ, the test error of the empirically designed quantizer differs from the noiseless case by O(√D), while, for a fixed D, the difference [sent-237, score-1.036]

85 Figure 3: (a) Top: empirically observed differences between noisy and noiseless test curves; bottom: theoretically derived bound σ√D. [sent-279, score-0.22]

86 (b) Height plot of dimension estimates for the r = 2 algorithm as a function of D and σ. [sent-280, score-0.262]

87 Note that the training estimates are consistently lower than the test estimates: the average difference is 0. [sent-284, score-0.299]

88 of the noisy and noiseless test errors grows as O(σ). [sent-289, score-0.193]

89 As predicted by the bound, the additive constant in the parametric form of the test error increases with σ, resulting in larger slopes of the log-log curve and therefore higher dimension estimates. [sent-290, score-0.486]

90 3(b) and (c), which show training and test dimensionality estimates for r = 2 and r = ∞, respectively. [sent-292, score-0.404]

91 The r = ∞ estimates are much less stable than those for r = 2 because the r = ∞ (worst-case) error is controlled by outliers and often stays constant over a range of rates. [sent-293, score-0.176]

92 The piecewise-constant shape of the test error curves (see Fig. [sent-294, score-0.295]

93 The rest of the table shows the estimates obtained from the training and test curves of the r = 2 quantizer and the (greedy) r = ∞ quantizer. [sent-300, score-0.92]

94 The relatively high values of the dimension estimates reflect the many degrees of freedom found in handwritten digits, including different scale, slant and thickness of the strokes, as well as the presence of topological features (i. [sent-303, score-0.287]

95 For the face dataset, the different dimensionality estimates range from 4. [sent-307, score-0.216]

96 Handwritten digits (MNIST data set): table with columns # samples, Regression, r = 2 train, r = 2 test, r = ∞ train, r = ∞ test. [sent-320, score-0.374]

97 Discussion We have demonstrated an approach to intrinsic dimensionality estimation based on high-rate vector quantization. [sent-378, score-0.327]

98 In the future we plan to integrate our proposed method into a unified package of quantization-based algorithms for estimating the intrinsic dimension of the data, obtaining its dimension-reduced manifold representation, and compressing the low-dimensional data [11]. [sent-383, score-0.483]

99 Asymptotic quantization error of continuous signals and the quantization dimension. [sent-426, score-0.5]

100 4 Interestingly, Brand [3] reports an intrinsic dimension estimate of 3 for this data set. [sent-458, score-0.366]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('quantizer', 0.516), ('qk', 0.486), ('quantization', 0.206), ('packing', 0.191), ('codebook', 0.172), ('intrinsic', 0.171), ('pm', 0.15), ('dimension', 0.145), ('manifold', 0.142), ('ird', 0.133), ('training', 0.109), ('dimensionality', 0.105), ('curves', 0.105), ('test', 0.102), ('slope', 0.096), ('distortion', 0.089), ('estimates', 0.088), ('error', 0.088), ('extrinsic', 0.077), ('lloyd', 0.077), ('wasserstein', 0.077), ('er', 0.07), ('nm', 0.067), ('gl', 0.067), ('supp', 0.067), ('slopes', 0.067), ('empirically', 0.059), ('swiss', 0.058), ('dcap', 0.057), ('toroidal', 0.057), ('balls', 0.057), ('spiral', 0.053), ('noiseless', 0.053), ('roll', 0.053), ('log', 0.053), ('greedy', 0.052), ('estimation', 0.051), ('samples', 0.051), ('mnist', 0.051), ('estimate', 0.05), ('enclosing', 0.05), ('ball', 0.05), ('covering', 0.047), ('regular', 0.043), ('ecting', 0.042), ('scheme', 0.041), ('rate', 0.041), ('train', 0.04), ('radius', 0.039), ('digits', 0.039), ('asymptotic', 0.039), ('beckman', 0.038), ('lazebnik', 0.038), ('loglog', 0.038), ('maxim', 0.038), ('nldr', 0.038), ('raginsky', 0.038), ('svetlana', 0.038), ('noisy', 0.038), ('inf', 0.038), ('additive', 0.035), ('compact', 0.033), ('optimal', 0.031), ('handwritten', 0.031), ('plot', 0.029), ('numbers', 0.029), ('dataset', 0.028), ('bound', 0.027), ('curve', 0.027), ('faces', 0.027), ('fit', 0.027), ('dr', 0.027), ('locally', 0.026), ('smallest', 0.026), ('embedded', 0.026), ('noise', 0.026), ('consistency', 0.026), ('distinct', 0.025), ('decoder', 0.025), ('estimating', 0.025), ('empirical', 0.025), ('clean', 0.024), ('id', 0.024), ('lim', 0.024), ('approximation', 0.024), ('december', 0.023), ('frey', 0.023), ('sequel', 0.023), ('converging', 0.023), ('topological', 0.023), ('negative', 0.023), ('distributed', 0.023), ('sensitivity', 0.023), ('face', 0.023), ('limit', 0.023), ('nearest', 0.023), ('formalism', 0.022), ('smooth', 0.022), ('parametric', 0.022), ('gamma', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

Author: Maxim Raginsky, Svetlana Lazebnik

Abstract: We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise.

2 0.1137518 161 nips-2005-Radial Basis Function Network for Multi-task Learning

Author: Xuejun Liao, Lawrence Carin

Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions. 1

3 0.10665233 138 nips-2005-Non-Local Manifold Parzen Windows

Author: Yoshua Bengio, Hugo Larochelle, Pascal Vincent

Abstract: To escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. With this objective, we present a non-local non-parametric density estimator. It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models.

4 0.085001305 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning

Author: Jorge Silva, Jorge Marques, João Lemos

Abstract: There has been a surge of interest in learning non-linear manifold models to approximate high-dimensional data. Both for computational complexity reasons and for generalization capability, sparsity is a desired feature in such models. This usually means dimensionality reduction, which naturally implies estimating the intrinsic dimension, but it can also mean selecting a subset of the data to use as landmarks, which is especially important because many existing algorithms have quadratic complexity in the number of observations. This paper presents an algorithm for selecting landmarks, based on LASSO regression, which is well known to favor sparse approximations because it uses regularization with an l1 norm. As an added benefit, a continuous manifold parameterization, based on the landmarks, is also found. Experimental results with synthetic and real data illustrate the algorithm. 1

5 0.078183688 58 nips-2005-Divergences, surrogate loss functions and experimental design

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1

6 0.071504459 135 nips-2005-Neuronal Fiber Delineation in Area of Edema from Diffusion Weighted MRI

7 0.063465364 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

8 0.062850475 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

9 0.057555672 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

10 0.056356929 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

11 0.053504419 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

12 0.052790519 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

13 0.051763661 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

14 0.051523283 136 nips-2005-Noise and the two-thirds power Law

15 0.050699394 30 nips-2005-Assessing Approximations for Gaussian Process Classification

16 0.050494879 47 nips-2005-Consistency of one-class SVM and related algorithms

17 0.048070014 117 nips-2005-Learning from Data of Variable Quality

18 0.047855567 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

19 0.047278989 37 nips-2005-Benchmarking Non-Parametric Statistical Tests

20 0.046996929 162 nips-2005-Rate Distortion Codes in Sensor Networks: A System-level Analysis


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.168), (1, 0.042), (2, -0.038), (3, -0.018), (4, 0.027), (5, 0.042), (6, 0.032), (7, -0.072), (8, 0.081), (9, -0.049), (10, 0.035), (11, 0.001), (12, 0.094), (13, 0.011), (14, 0.019), (15, 0.071), (16, -0.079), (17, 0.09), (18, -0.049), (19, 0.071), (20, 0.092), (21, 0.052), (22, -0.027), (23, 0.185), (24, -0.073), (25, 0.093), (26, -0.2), (27, 0.045), (28, 0.121), (29, 0.03), (30, 0.024), (31, 0.042), (32, 0.132), (33, -0.042), (34, 0.018), (35, -0.003), (36, 0.017), (37, -0.006), (38, -0.007), (39, 0.027), (40, 0.018), (41, -0.012), (42, 0.117), (43, -0.051), (44, -0.009), (45, -0.153), (46, -0.01), (47, -0.119), (48, -0.005), (49, -0.147)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92996913 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

Author: Maxim Raginsky, Svetlana Lazebnik

Abstract: We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise.

2 0.63018012 138 nips-2005-Non-Local Manifold Parzen Windows

Author: Yoshua Bengio, Hugo Larochelle, Pascal Vincent

Abstract: To escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. With this objective, we present a non-local non-parametric density estimator. It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models.

3 0.6131686 161 nips-2005-Radial Basis Function Network for Multi-task Learning

Author: Xuejun Liao, Lawrence Carin

Abstract: We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions. 1

4 0.52947408 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

Author: Gilles Blanchard, Masashi Sugiyama, Motoaki Kawanabe, Vladimir Spokoiny, Klaus-Robert Müller

Abstract: We propose a new linear method for dimension reduction to identify nonGaussian components in high dimensional data. Our method, NGCA (non-Gaussian component analysis), uses a very general semi-parametric framework. In contrast to existing projection methods we define what is uninteresting (Gaussian): by projecting out uninterestingness, we can estimate the relevant non-Gaussian subspace. We show that the estimation error of finding the non-Gaussian components tends to zero at a parametric rate. Once NGCA components are identified and extracted, various tasks can be applied in the data analysis process, like data visualization, clustering, denoising or classification. A numerical study demonstrates the usefulness of our method. 1

5 0.46126452 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

Author: Zhenyue Zhang, Hongyuan Zha

Abstract: We propose a fast manifold learning algorithm based on the methodology of domain decomposition. Starting with the set of sample points partitioned into two subdomains, we develop the solution of the interface problem that can glue the embeddings on the two subdomains into an embedding on the whole domain. We provide a detailed analysis to assess the errors produced by the gluing process using matrix perturbation theory. Numerical examples are given to illustrate the efficiency and effectiveness of the proposed methods. 1

6 0.45863187 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning

7 0.45447749 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

8 0.42774123 117 nips-2005-Learning from Data of Variable Quality

9 0.42299572 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression

10 0.41694325 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

11 0.41557682 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

12 0.41535676 19 nips-2005-Active Learning for Misspecified Models

13 0.39401224 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

14 0.38150278 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

15 0.37575996 112 nips-2005-Learning Minimum Volume Sets

16 0.3586486 189 nips-2005-Tensor Subspace Analysis

17 0.35227337 37 nips-2005-Benchmarking Non-Parametric Statistical Tests

18 0.35169601 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

19 0.35077566 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

20 0.34102976 162 nips-2005-Rate Distortion Codes in Sensor Networks: A System-level Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.059), (10, 0.035), (11, 0.011), (27, 0.024), (31, 0.039), (34, 0.084), (39, 0.022), (40, 0.256), (55, 0.036), (65, 0.046), (69, 0.066), (73, 0.043), (77, 0.018), (88, 0.092), (91, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88091725 33 nips-2005-Bayesian Sets

Author: Zoubin Ghahramani, Katherine A. Heller

Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a modelbased concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1

same-paper 2 0.78188634 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

Author: Maxim Raginsky, Svetlana Lazebnik

Abstract: We introduce a technique for dimensionality estimation based on the notion of quantization dimension, which connects the asymptotic optimal quantization error for a probability distribution on a manifold to its intrinsic dimension. The definition of quantization dimension yields a family of estimation algorithms, whose limiting case is equivalent to a recent method based on packing numbers. Using the formalism of high-rate vector quantization, we address issues of statistical consistency and analyze the behavior of our scheme in the presence of noise.

3 0.5589788 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning

Author: Jorge Silva, Jorge Marques, João Lemos

Abstract: There has been a surge of interest in learning non-linear manifold models to approximate high-dimensional data. Both for computational complexity reasons and for generalization capability, sparsity is a desired feature in such models. This usually means dimensionality reduction, which naturally implies estimating the intrinsic dimension, but it can also mean selecting a subset of the data to use as landmarks, which is especially important because many existing algorithms have quadratic complexity in the number of observations. This paper presents an algorithm for selecting landmarks, based on LASSO regression, which is well known to favor sparse approximations because it uses regularization with an l1 norm. As an added benefit, a continuous manifold parameterization, based on the landmarks, is also found. Experimental results with synthetic and real data illustrate the algorithm. 1

4 0.54851252 30 nips-2005-Assessing Approximations for Gaussian Process Classification

Author: Malte Kuss, Carl E. Rasmussen

Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate.

5 0.54686844 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

Author: Amir Navot, Lavi Shpigelman, Naftali Tishby, Eilon Vaadia

Abstract: We present a non-linear, simple, yet effective, feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.

6 0.54680043 74 nips-2005-Faster Rates in Regression via Active Learning

7 0.54637784 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

8 0.54542023 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface

9 0.54441702 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

10 0.54144657 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

11 0.54138696 102 nips-2005-Kernelized Infomax Clustering

12 0.54095036 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

13 0.54090935 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

14 0.54035628 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

15 0.53820759 50 nips-2005-Convex Neural Networks

16 0.53803283 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

17 0.53801221 144 nips-2005-Off-policy Learning with Options and Recognizers

18 0.53755617 177 nips-2005-Size Regularized Cut for Data Clustering

19 0.53717703 98 nips-2005-Infinite latent feature models and the Indian buffet process

20 0.53655362 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs