nips nips2001 nips2001-155 knowledge-graph by maker-knowledge-mining

155 nips-2001-Quantizing Density Estimators


Source: pdf

Author: Peter Meinicke, Helge Ritter

Abstract: We suggest a nonparametric framework for unsupervised learning of projection models in terms of density estimation on quantized sample spaces. The objective is not to optimally reconstruct the data but instead the quantizer is chosen to optimally reconstruct the density of the data. For the resulting quantizing density estimator (QDE) we present a general method for parameter estimation and model selection. We show how projection sets which correspond to traditional unsupervised methods like vector quantization or PCA appear in the new framework. For a principal component quantizer we present results on synthetic and realworld data, which show that the QDE can improve the generalization of the kernel density estimator although its estimate is based on significantly lower-dimensional projection indices of the data.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 de Abstract We suggest a nonparametric framework for unsupervised learning of projection models in terms of density estimation on quantized sample spaces. [sent-5, score-0.868]

2 The objective is not to optimally reconstruct the data but instead the quantizer is chosen to optimally reconstruct the density of the data. [sent-6, score-0.458]

3 For the resulting quantizing density estimator (QDE) we present a general method for parameter estimation and model selection. [sent-7, score-0.491]

4 We show how projection sets which correspond to traditional unsupervised methods like vector quantization or PCA appear in the new framework. [sent-8, score-0.85]

5 For a principal component quantizer we present results on synthetic and realworld data, which show that the QDE can improve the generalization of the kernel density estimator although its estimate is based on significantly lower-dimensional projection indices of the data. [sent-9, score-0.886]

6 These alternative representations usually reflect some important properties of the underlying distribution, and they typically try to exploit some redundancy in the data. [sent-11, score-0.084]

7 In that way many unsupervised methods aim at a complexity-reduced representation of the data, like the most common approaches, namely vector quantization (VQ) and principal component analysis (PCA). [sent-12, score-0.735]

8 Both approaches can be viewed as specific kinds of quantization, which is a basic mechanism of complexity reduction. [sent-13, score-0.066]

9 The objective of our approach to unsupervised learning is to achieve a suitable quantization of the data space which allows for an optimal reconstruction of the underlying density from a finite sample. [sent-14, score-0.897]

10 In that way we consider unsupervised learning as density estimation on a quantized sample space and the resulting estimator will be referred to as quantizing density estimator (QDE). [sent-15, score-1.181]

11 The construction of a QDE first requires specifying a suitable class of parametrized quantization functions, and then selecting from this set a certain function with good generalization properties. [sent-16, score-0.675]

12 While the first point is common to unsupervised learning, the latter point is addressed in a density estimation framework where we tackle the model selection problem in a data-driven and nonparametric way. [sent-17, score-0.559]

13 It is often overlooked that modern Bayesian approaches to unsupervised learning and model selection are almost always based on some strong assumptions about the data distribution. [sent-18, score-0.289]

14 Unfortunately these assumptions usually cannot be inferred from human knowledge about the data domain and therefore the model building process is usually driven by computational considerations. [sent-19, score-0.115]

15 Although our approach can be interpreted in terms of a generative model of the data, in contrast to most other generative models (see [10] for an overview), the present approach is nonparametric, since no specific assumptions about the functional form of the data distribution have to be made. [sent-20, score-0.059]

16 In that way our approach compares well with other quantization methods, like principal curves and surfaces [4, 13, 6], which only have to make rather general assumptions about the underlying distribution. [sent-21, score-0.553]

17 The QDE approach can utilize these methods as specific quantization techniques and shows a practical way to further automate the construction of unsupervised learning machines. [sent-22, score-0.689]

18 2 Quantization by Density Estimation We will now explain how the QDE may be derived from a generalization of the kernel density estimator (KDE), one of the most popular methods for nonparametric density estimation [12, 11]. [sent-23, score-0.869]

19 If we construct a kernel density estimator on the basis of a quantized sample, we have the following estimator (1): a kernel density estimate in which every data point x_i is replaced by its quantized value q(x_i), i.e. (1/N) Σ_i K_h(x − q(x_i)). [sent-24, score-0.76]
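
The following minimal Python sketch is our own illustration of estimator (1), not code from the paper; the sample, the fixed quantization levels and the bandwidth are arbitrary choices. It builds an ordinary Gaussian KDE and the same KDE evaluated on a quantized version of a 1D sample.

import numpy as np

def gaussian_kde(eval_points, sample, bandwidth):
    # standard Gaussian kernel density estimate evaluated at eval_points
    diffs = eval_points[:, None] - sample[None, :]
    kernels = np.exp(-0.5 * (diffs / bandwidth) ** 2) / (np.sqrt(2.0 * np.pi) * bandwidth)
    return kernels.mean(axis=1)

def quantizing_kde(eval_points, sample, levels, bandwidth):
    # the same estimate on the quantized sample: every data point is replaced
    # by its nearest quantization level before the kernels are placed
    nearest = np.argmin(np.abs(sample[:, None] - levels[None, :]), axis=1)
    return gaussian_kde(eval_points, levels[nearest], bandwidth)

rng = np.random.default_rng(0)
x = rng.normal(size=200)                # raw 1D sample
q_levels = np.linspace(-3.0, 3.0, 7)    # hypothetical fixed quantization levels
grid = np.linspace(-4.0, 4.0, 50)
p_kde = gaussian_kde(grid, x, bandwidth=0.3)
p_qde = quantizing_kde(grid, x, q_levels, bandwidth=0.3)

In the paper the levels are not fixed a priori but estimated, which is what the following sentences describe.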

20 For a fixed non-zero kernel bandwidth the parameters of the quantization function may be determined by nonparametric maximum likelihood (ML) estimation, as will be introduced in the next section. [sent-30, score-1.099]

21 Clearly, we would prefer a quantizer which does not decrease the performance of the estimator on unseen and unquantized data. [sent-32, score-0.369]

22 To get an idea of how to select a suitable quantization function let us consider an example from a 1D data space. [sent-33, score-0.531]

23 In one dimension a natural projection set can be specified by a set of quantization levels on the real line. [sent-34, score-0.642]

24 For a fixed kernel bandwidth, we can now perform maximum likelihood estimation of the level coordinates. [sent-37, score-0.243]

25 In that way we obtain a maximum likelihood estimator of the form (4), a mixture of kernels centered on the quantization levels. [sent-38, score-0.218]

26 In (4) each level is weighted by a count of the number of data points which are quantized to that level. [sent-39, score-0.247]
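
A direct transcription of this count-weighted form (again a sketch under our own naming; the paper's symbols are not reproduced here) could look as follows.

import numpy as np

def qde_from_counts(eval_points, levels, counts, bandwidth):
    # estimator of the form (4): Gaussian kernels placed on the quantization
    # levels, each weighted by the number of data points mapped to that level
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    diffs = eval_points[:, None] - levels[None, :]
    kernels = np.exp(-0.5 * (diffs / bandwidth) ** 2) / (np.sqrt(2.0 * np.pi) * bandwidth)
    return (kernels * counts[None, :]).sum(axis=1) / n

Evaluating this function is cheaper than the full KDE whenever the number of effective levels is much smaller than the sample size, which is the point of the quantized representation.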

27 In this case, the question remains how to choose the number of quantization levels. [sent-40, score-0.463]

28 From a different starting point, the authors in [3] proposed the same functional form of a nonparametric ML density estimator with respect to Gaussian kernels of equal width centered on variable positions. [sent-41, score-0.501]

29 As with the traditional Gaussian KDE (fixed kernel centers on data points), for consistency of the estimator the bandwidth has to be decreased as the sample size increases. [sent-42, score-0.853]

30 In [3] the authors reported that for a fixed non-zero bandwidth, ML-estimation of the kernel centers always resulted in a smaller number of actually distinct centers. [sent-43, score-0.177]

31 Therefore the resulting estimator had the form of (4), where the sum runs over the distinct centers and the weights count the number of kernels coinciding at each center. [sent-46, score-0.296]

32 The optimum number of effective quantization levels for a given bandwidth therefore arises as an automatic byproduct of ML estimation. [sent-47, score-0.883]

33 Finally one has to choose an appropriate kernel width, which implicitly determines the complexity of the quantizer. [sent-48, score-0.128]

34 The bandwidth selection problem has been tackled in the domain of kernel density estimation for some time and many approaches have been proposed (see e. [sent-49, score-0.794]

35 In the next section we will adopt the method of likelihood cross-validation to find a practical answer to the bandwidth selection problem. [sent-52, score-0.482]

36 3 General Learning Scheme By applying the method of sieves as proposed in [3], for a fixed non-zero bandwidth we can estimate the parameters of the quantization function via maximization of the log-likelihood. [sent-53, score-0.871]

37 For consistency of the resulting density estimator the bandwidth has to be decreased as the sample size increases, since asymptotically the estimator must converge to a mixture of delta functions centered on the data points. [sent-57, score-1.094]

38 Thus, for decreasing bandwidth, the quantization function of the QDE must converge to the identity function. [sent-58, score-0.482]

39 the QDE must converge to the kernel density estimator. [sent-60, score-0.29]

40 For a fixed bandwidth, maximization of the likelihood can be achieved by applying the EM-algorithm [2], which provides a convenient optimization scheme, especially for Gaussian kernels. [sent-61, score-0.155]

41 The E-Step (5) and M-Step (6) are iterated from a suitable initial parameter vector until sufficient convergence. [sent-68, score-0.059]

42 Thereby the E-Step quantity is the posterior probability that a data point has been “generated” by a given mixture component. [sent-69, score-0.222]

43 For further insight one may realize that the M-Step requires solving a constrained optimization problem. [sent-70, score-0.103]

44 Therefore it may be convenient not to maximize but only to increase the log-likelihood at the M-Step which then corresponds to an application of the generalized EM-algorithm. [sent-73, score-0.072]
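
The following sketch shows the fixed-bandwidth EM iteration for the unconstrained one-dimensional case (our own reading of the scheme; the initialisation with one kernel per data point, the merging tolerance and the function names are assumptions, not taken from the paper).

import numpy as np

def em_quantization_levels(sample, bandwidth, n_iter=100):
    # fixed-bandwidth ML estimation of the level positions by EM; one kernel is
    # initially placed on every data point, following the sieves estimator of [3]
    x = np.asarray(sample, dtype=float)
    q = x.copy()
    for _ in range(n_iter):
        # E-Step: responsibility of level k for data point i
        logits = -0.5 * ((x[:, None] - q[None, :]) / bandwidth) ** 2
        logits -= logits.max(axis=1, keepdims=True)
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-Step: every level moves to its responsibility-weighted mean
        q = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    # levels that have (numerically) collapsed onto each other are merged;
    # their multiplicities give the counts appearing in (4)
    levels, counts = np.unique(np.round(q, 6), return_counts=True)
    return levels, counts

This reproduces the behaviour reported in [3]: for a sufficiently large bandwidth many of the initial centers coincide after convergence, so the effective number of levels falls out of the estimation automatically.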

45 Without (8) unconstrained maximization according to (7) yields another class of interesting learning schemes which for reasons of space will not be considered in this paper. [sent-74, score-0.086]

46 1 Bandwidth Selection It is easy to see that the kernel bandwidth cannot be determined by ML-estimation since maximization of the likelihood would drive the bandwidth towards zero. [sent-77, score-0.927]

47 For selection of the kernel bandwidth, we therefore apply the method of likelihood cross-validation (see e. [sent-78, score-0.203]

48 The cross-validated log-likelihood is then maximized with respect to the kernel bandwidth. [sent-83, score-0.107]

49 For a Gaussian kernel, the idea is to maximize (9); an appropriate EM scheme then requires the M-Step update rule (10). [sent-84, score-0.556]

50 In an overall optimization scheme one may now alternate between estimating the quantization parameters and the bandwidth, or alternatively one may estimate both by likelihood cross-validation. [sent-86, score-0.191]
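
A common concrete instance of likelihood cross-validation is the leave-one-out score sketched below (the paper may use a different CV variant; the candidate grid and names are ours).

import numpy as np

def loo_log_likelihood(sample, bandwidth):
    # leave-one-out cross-validated log-likelihood of a Gaussian KDE
    x = np.asarray(sample, dtype=float)
    n = len(x)
    diffs = x[:, None] - x[None, :]
    k = np.exp(-0.5 * (diffs / bandwidth) ** 2) / (np.sqrt(2.0 * np.pi) * bandwidth)
    np.fill_diagonal(k, 0.0)            # each point is left out of its own estimate
    return float(np.log(k.sum(axis=1) / (n - 1) + 1e-300).sum())

def select_bandwidth(sample, candidates=np.logspace(-2, 1, 30)):
    scores = [loo_log_likelihood(sample, h) for h in candidates]
    return float(candidates[int(np.argmax(scores))])

Unlike straight ML on the training set, this score does not degenerate as the bandwidth goes to zero.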

51 4 Projection Sets in Multidimensions By the specification of a certain class of quantization functions we can incorporate domain knowledge into the density estimation process, in order to improve generalization. [sent-87, score-0.768]

52 Thereby the idea is to reduce the variance of the density estimator by reducing the variation of the quantized training set. [sent-88, score-0.482]

53 The price is an increase of the bias which requires a careful selection of the set of admissible quantization functions. [sent-89, score-0.534]

54 We will now show how to utilize existing methods for unsupervised learning within the current density estimation framework. [sent-91, score-0.458]

55 Because many unsupervised methods can be stated in terms of finding optimal projection sets, it is straightforward to apply the corresponding types of quantization functions within the current framework. [sent-92, score-0.794]

56 Thus in the following we shall consider specific parametrizations of the general projection set (3) which correspond to traditional unsupervised learning methods. [sent-93, score-0.365]

57 1 Vector Quantization Vector quantization (VQ) is a standard technique among unsupervised methods and it is easily incorporated into the current density estimation framework by straightforward generalization of the one-dimensional quantizer in section 2 to the multi-dimensional case. [sent-95, score-1.054]

58 Again with a fixed kernel bandwidth ML estimation yields a certain number of distinct (“effective”) quantization levels, similar to maximum entropy clustering [9, 1]. [sent-96, score-1.096]

59 The projection set of a vector quantizer can be parametrized according to a general basis function representation [7], with a vector of basis functions for each component (11). [sent-97, score-0.458]

60 The QDE on the basis of a vector quantizer can be expected to generalize well if some cluster structure is present within the data. [sent-100, score-0.175]
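
In multiple dimensions the corresponding quantization function simply maps every point to its nearest reference vector; a sketch follows (the codebook handling and names are ours, and in practice the reference vectors would be estimated by the EM scheme above rather than given).

import numpy as np

def vq_quantize(X, codebook):
    # vector quantizer: replace each row of X by its nearest codebook vector
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook[np.argmin(d2, axis=1)]

A QDE is then obtained by building the kernel density estimate on vq_quantize(X, codebook) instead of on X itself.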

61 In multi-dimensional spaces the data are often concentrated in certain regions which allows for a sparse representation by some reference vectors well-positioned in those regions. [sent-101, score-0.077]

62 An alternative approach has been proposed in [14] where the application of the support vector formalism to density estimation results in a sparse representation of the data distribution. [sent-102, score-0.304]

63 2 Principal Component Analysis A linear affine parametrization of the projection set yields candidate functions of the form (12), i.e. orthogonal projections onto an affine subspace specified by a shift vector and a projection matrix. [sent-104, score-0.21]

64 The PCA approach reflects our knowledge that in most high-dimensional data spaces, the data are concentrated around some manifold of lower dimensionality. [sent-105, score-0.102]

65 With a Gaussian kernel of fixed bandwidth the constrained optimization problem at the M-Step takes a convenient form which facilitates further analysis of the learning algorithm. [sent-107, score-0.558]

66 From (7) and (8) it follows that one has to maximize a certain objective function, up to an additive constant. [sent-108, score-0.05]

67 Here the projection matrix has orthogonal columns which span the subspace of the projection manifold. [sent-110, score-0.212]

68 From the consideration of the corresponding stationarity conditions one finds that the sample mean is an estimator of the shift vector. [sent-111, score-0.239]

69 Estimation of the projection matrix then requires maximizing a trace criterion. [sent-112, score-0.073]

70 The solution comes with a dimensionality given by (15) and (16), which complements a recent result about parametric dimensionality estimation with respect to a factor model with isotropic Gaussian noise [8]. [sent-118, score-0.176]

71 For the QDE, the two limiting cases of zero and infinite bandwidth are of particular interest. [sent-119, score-0.028]

72 For sufficiently small bandwidth the corresponding matrix becomes positive definite. [sent-123, score-0.028]
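
For the PCA case the quantization function reduces to an orthogonal projection onto an affine subspace; the sketch below uses the sample mean as shift vector and the top-d principal directions as the projection matrix (the dimensionality d would be chosen by the criterion above; function and variable names are our own).

import numpy as np

def pca_quantize(X, d):
    # PCA quantizer: project every point onto the affine subspace spanned by
    # the top-d principal directions around the sample mean
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)   # rows of vt: principal directions
    W = vt[:d].T                                        # shape (D, d), orthonormal columns
    return mu + Xc @ W @ W.T

Feeding pca_quantize(X, d) into the kernel density estimator gives a PCA-based QDE of the kind evaluated in the experiments below.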

73 3 Independent Component Analysis The PCA method provides a rather coarse quantization scheme since it only decides between one-level and no quantization for each subspace dimension. [sent-127, score-1.027]

74 A natural refinement would therefore be to allow for a certain number of effective quantization levels for each component. [sent-128, score-0.544]

75 Such an approach may be viewed as a nonparametric variant of independent component analysis (ICA). [sent-129, score-0.17]

76 The idea is to quantize each coordinate axis separately, which yields a multi-dimensional quantization grid according to (18). [sent-130, score-0.506]

77 Thereby the components of each parameter vector contain the quantization levels of the corresponding coordinate axis, associated with its direction. [sent-132, score-0.52]

78 There are strong similarities with a parametric ICA model which has been suggested in [10], where source densities have been mixtures of delta functions and additive noise has been isotropic Gaussian. [sent-134, score-0.093]
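
A sketch of such a per-axis quantization grid is given below (the directions and level sets are assumed given here; in the QDE framework they would be estimated, and the names are hypothetical).

import numpy as np

def grid_quantize(X, directions, level_sets):
    # per-axis quantization grid: snap each coordinate along the given
    # (orthonormal) directions to its nearest level and map back; components
    # orthogonal to the span of the directions are projected away
    Z = X @ directions                      # coordinates along the chosen axes
    Zq = np.empty_like(Z)
    for j, levels in enumerate(level_sets):
        levels = np.asarray(levels, dtype=float)
        idx = np.argmin(np.abs(Z[:, [j]] - levels[None, :]), axis=1)
        Zq[:, j] = levels[idx]
    return Zq @ directions.T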

79 Other unsupervised learning methods which correspond to different projection sets, like principal curves or multilayer perceptrons (see [7] for an overview), can be incorporated into the QDE framework as well and will be considered elsewhere. [sent-135, score-0.43]

80 5 Experiments In the following experiments we investigated the PCA based QDE with Gaussian kernel and compared the generalization performance with that of the “non-quantizing” KDE. [sent-136, score-0.142]

81 All parameters, including the bandwidth of the KDE, were estimated by likelihood crossvalidation. [sent-137, score-0.412]

82 In the first experiment we sampled 100 points from a stretched and rotated uniform distribution with support on a rectangle. [sent-138, score-0.035]

83 With an automatically selected 1D subspace the PCA-QDE could improve the performance of the KDE. [sent-141, score-0.069]

84 The estimated density functions are depicted in figure 1, where grey-values are proportional to the estimated density evaluated on a grid. [sent-143, score-0.19]

85 In a second experiment we trained PCA-QDEs with high-dimensional real-world data (digit images) which had been derived from the MNIST database of handwritten digits (http://www. [sent-145, score-0.03]

86 Again the PCA-QDE improved the generalization performance of the KDE although the QDE decided to remove about 40 “redundant” dimensions per digit class. [sent-151, score-0.092]
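
The comparison score in these experiments is the average test log-likelihood; a trivial sketch of how the reported differences could be computed (density_estimate stands for any callable density estimate, e.g. lambda t: gaussian_kde(t, x_train, h) from the sketches above):

import numpy as np

def avg_test_log_likelihood(density_estimate, test_points):
    # mean log-density assigned to held-out points; the table reports the
    # difference of this quantity between the (PCA-)QDE and the KDE
    return float(np.mean(np.log(density_estimate(test_points) + 1e-300)))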

87 Table 1: Results on the digit data for different digit classes ’0’ … [sent-152, score-0.144]

88 ’9’ (first row); second row: difference between average log-likelihoods of (PCA-)QDE and KDE on the test set; third row: optimal subspace dimensionality of the QDE. [sent-155, score-0.096]

89 6 Conclusion The QDE offers a nonparametric approach to unsupervised learning of quantization functions which can be viewed as a generalization of the kernel density estimator. [sent-165, score-0.869]

90 While the KDE is directly constructed from the given data set, the QDE first creates a quantized representation of the data. [sent-166, score-0.177]

91 Unlike traditional quantization methods which minimize the associated reconstruction error of the data points, the QDE adjusts the quantizer to optimize an estimate of the data density. [sent-167, score-0.76]

92 This feature allows for a convenient model selection procedure, since the complexity of the quantizer can be controlled by the kernel bandwidth, which in turn can be selected in a data-driven way. [sent-168, score-0.373]

93 For a practical realization we outlined EM-schemes for parameter estimation and bandwidth selection. [sent-169, score-0.508]

94 As an illustration, we discussed examples with different projection sets which correspond to VQ, PCA and ICA methods. [sent-170, score-0.164]

95 We presented experiments which demonstrate that the bias imposed by the quantization can lead to an improved generalization as compared to the “non-quantizing” KDE. [sent-171, score-0.498]

96 This suggests that QDEs offer a promising approach to unsupervised learning that allows bias to be controlled without the usually rather strong distributional assumptions of the Bayesian approach. [sent-172, score-0.218]

97 Empirical risk approximation: A statistical learning theory of data clustering. [sent-178, score-0.051]

98 Maximum likelihood from incomplete data via the EM algorithm. [sent-190, score-0.077]

99 Nonparametric maximum likelihood estimation by the method of sieves. [sent-193, score-0.136]

100 A brief survey of bandwidth selection for density estimation. [sent-207, score-0.578]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qde', 0.533), ('quantization', 0.463), ('bandwidth', 0.365), ('kde', 0.174), ('estimator', 0.171), ('density', 0.164), ('quantizer', 0.154), ('quantized', 0.147), ('projection', 0.143), ('unsupervised', 0.14), ('nonparametric', 0.117), ('kernel', 0.107), ('estimation', 0.089), ('bielefeld', 0.088), ('subspace', 0.069), ('quantizing', 0.067), ('parametrized', 0.065), ('pca', 0.064), ('principal', 0.061), ('digit', 0.057), ('vq', 0.053), ('helge', 0.053), ('selection', 0.049), ('sample', 0.047), ('likelihood', 0.047), ('centers', 0.046), ('ph', 0.044), ('ml', 0.044), ('meinicke', 0.044), ('unquantized', 0.044), ('maximization', 0.043), ('convenient', 0.042), ('traditional', 0.04), ('ica', 0.039), ('vtr', 0.039), ('suitable', 0.038), ('xed', 0.036), ('levels', 0.036), ('overview', 0.035), ('stretched', 0.035), ('neuroinformatics', 0.035), ('generalization', 0.035), ('delta', 0.034), ('isotropic', 0.033), ('realization', 0.033), ('thereby', 0.032), ('scheme', 0.032), ('nite', 0.031), ('compression', 0.031), ('peter', 0.03), ('speci', 0.03), ('data', 0.03), ('maximize', 0.03), ('assumptions', 0.029), ('kernels', 0.029), ('limiting', 0.028), ('exploit', 0.028), ('component', 0.028), ('usually', 0.028), ('dimensionality', 0.027), ('certain', 0.026), ('functions', 0.026), ('gaussian', 0.026), ('counting', 0.026), ('viewed', 0.025), ('decreased', 0.024), ('distinct', 0.024), ('optimization', 0.023), ('indices', 0.023), ('optimally', 0.023), ('consistency', 0.023), ('requires', 0.022), ('yields', 0.022), ('offers', 0.022), ('reconstruct', 0.022), ('incorporated', 0.022), ('methods', 0.022), ('american', 0.022), ('utilize', 0.022), ('row', 0.021), ('vector', 0.021), ('concentrated', 0.021), ('multivariate', 0.021), ('axis', 0.021), ('practical', 0.021), ('complexity', 0.021), ('learning', 0.021), ('tr', 0.021), ('manifold', 0.021), ('reconstruction', 0.021), ('correspond', 0.021), ('objective', 0.02), ('approaches', 0.02), ('centered', 0.02), ('effective', 0.019), ('converge', 0.019), ('gurewitz', 0.019), ('forschungsgemeinschaft', 0.019), ('fdc', 0.019), ('parametrization', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 155 nips-2001-Quantizing Density Estimators

Author: Peter Meinicke, Helge Ritter

Abstract: We suggest a nonparametric framework for unsupervised learning of projection models in terms of density estimation on quantized sample spaces. The objective is not to optimally reconstruct the data but instead the quantizer is chosen to optimally reconstruct the density of the data. For the resulting quantizing density estimator (QDE) we present a general method for parameter estimation and model selection. We show how projection sets which correspond to traditional unsupervised methods like vector quantization or PCA appear in the new framework. For a principal component quantizer we present results on synthetic and realworld data, which show that the QDE can improve the generalization of the kernel density estimator although its estimate is based on significantly lower-dimensional projection indices of the data.

2 0.13909672 164 nips-2001-Sampling Techniques for Kernel Methods

Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf

Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like in expectation.

3 0.09581387 74 nips-2001-Face Recognition Using Kernel Methods

Author: Ming-Hsuan Yang

Abstract: Principal Component Analysis and Fisher Linear Discriminant methods have demonstrated their success in face detection, recognition, and tracking. The representation in these subspace methods is based on second order statistics of the image set, and does not address higher order statistical dependencies such as the relationships among three or more pixels. Recently Higher Order Statistics and Independent Component Analysis (ICA) have been used as informative low dimensional representations for visual recognition. In this paper, we investigate the use of Kernel Principal Component Analysis and Kernel Fisher Linear Discriminant for learning low dimensional representations for face recognition, which we call Kernel Eigenface and Kernel Fisherface methods. While Eigenface and Fisherface methods aim to find projection directions based on the second order correlation of samples, Kernel Eigenface and Kernel Fisherface methods provide generalizations which take higher order correlations into account. We compare the performance of kernel methods with Eigenface, Fisherface and ICA-based methods for face recognition with variation in pose, scale, lighting and expression. Experimental results show that kernel methods provide better representations and achieve lower error rates for face recognition. 1 Motivation and Approach Subspace methods have been applied successfully in numerous visual recognition tasks such as face localization, face recognition, 3D object recognition, and tracking. In particular, Principal Component Analysis (PCA) [20] [13] ,and Fisher Linear Discriminant (FLD) methods [6] have been applied to face recognition with impressive results. While PCA aims to extract a subspace in which the variance is maximized (or the reconstruction error is minimized), some unwanted variations (due to lighting, facial expressions, viewing points, etc.) may be retained (See [8] for examples). It has been observed that in face recognition the variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to the changes in face identity [1]. Therefore, while the PCA projections are optimal in a correlation sense (or for reconstruction

4 0.087961905 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

Author: Michael Collins, S. Dasgupta, Robert E. Schapire

Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.

5 0.084597692 119 nips-2001-Means, Correlations and Bounds

Author: Martijn Leisink, Bert Kappen

Abstract: The partition function for a Boltzmann machine can be bounded from above and below. We can use this to bound the means and the correlations. For networks with small weights, the values of these statistics can be restricted to non-trivial regions (i.e. a subset of [-1 , 1]). Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results. 1

6 0.082863785 58 nips-2001-Covariance Kernels from Bayesian Generative Models

7 0.078021809 84 nips-2001-Global Coordination of Local Linear Models

8 0.074228801 139 nips-2001-Online Learning with Kernels

9 0.073997773 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines

10 0.07224144 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

11 0.068544857 8 nips-2001-A General Greedy Approximation Algorithm with Applications

12 0.068217039 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

13 0.068146266 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images

14 0.067346357 170 nips-2001-Spectral Kernel Methods for Clustering

15 0.066269159 21 nips-2001-A Variational Approach to Learning Curves

16 0.063557774 136 nips-2001-On the Concentration of Spectral Properties

17 0.0632599 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

18 0.063160762 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

19 0.059764031 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

20 0.059503857 134 nips-2001-On Kernel-Target Alignment


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.186), (1, 0.068), (2, -0.034), (3, -0.096), (4, 0.032), (5, 0.003), (6, 0.018), (7, 0.06), (8, 0.03), (9, -0.017), (10, -0.042), (11, 0.002), (12, 0.038), (13, -0.103), (14, -0.052), (15, -0.023), (16, 0.001), (17, 0.032), (18, -0.061), (19, 0.113), (20, 0.005), (21, -0.016), (22, 0.084), (23, -0.1), (24, -0.1), (25, 0.114), (26, 0.002), (27, 0.092), (28, -0.062), (29, -0.059), (30, -0.086), (31, -0.01), (32, -0.018), (33, -0.051), (34, -0.003), (35, 0.077), (36, 0.064), (37, 0.009), (38, 0.045), (39, 0.09), (40, 0.016), (41, -0.162), (42, -0.022), (43, 0.169), (44, 0.067), (45, -0.148), (46, 0.001), (47, -0.018), (48, 0.075), (49, -0.144)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92440194 155 nips-2001-Quantizing Density Estimators

Author: Peter Meinicke, Helge Ritter

Abstract: We suggest a nonparametric framework for unsupervised learning of projection models in terms of density estimation on quantized sample spaces. The objective is not to optimally reconstruct the data but instead the quantizer is chosen to optimally reconstruct the density of the data. For the resulting quantizing density estimator (QDE) we present a general method for parameter estimation and model selection. We show how projection sets which correspond to traditional unsupervised methods like vector quantization or PCA appear in the new framework. For a principal component quantizer we present results on synthetic and realworld data, which show that the QDE can improve the generalization of the kernel density estimator although its estimate is based on significantly lower-dimensional projection indices of the data.

2 0.61320615 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

Author: Michael Collins, S. Dasgupta, Robert E. Schapire

Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.

3 0.60058701 74 nips-2001-Face Recognition Using Kernel Methods

Author: Ming-Hsuan Yang

Abstract: Principal Component Analysis and Fisher Linear Discriminant methods have demonstrated their success in face detection, recognition, and tracking. The representation in these subspace methods is based on second order statistics of the image set, and does not address higher order statistical dependencies such as the relationships among three or more pixels. Recently Higher Order Statistics and Independent Component Analysis (ICA) have been used as informative low dimensional representations for visual recognition. In this paper, we investigate the use of Kernel Principal Component Analysis and Kernel Fisher Linear Discriminant for learning low dimensional representations for face recognition, which we call Kernel Eigenface and Kernel Fisherface methods. While Eigenface and Fisherface methods aim to find projection directions based on the second order correlation of samples, Kernel Eigenface and Kernel Fisherface methods provide generalizations which take higher order correlations into account. We compare the performance of kernel methods with Eigenface, Fisherface and ICA-based methods for face recognition with variation in pose, scale, lighting and expression. Experimental results show that kernel methods provide better representations and achieve lower error rates for face recognition. 1 Motivation and Approach Subspace methods have been applied successfully in numerous visual recognition tasks such as face localization, face recognition, 3D object recognition, and tracking. In particular, Principal Component Analysis (PCA) [20] [13] ,and Fisher Linear Discriminant (FLD) methods [6] have been applied to face recognition with impressive results. While PCA aims to extract a subspace in which the variance is maximized (or the reconstruction error is minimized), some unwanted variations (due to lighting, facial expressions, viewing points, etc.) may be retained (See [8] for examples). It has been observed that in face recognition the variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to the changes in face identity [1]. Therefore, while the PCA projections are optimal in a correlation sense (or for reconstruction

4 0.5519371 164 nips-2001-Sampling Techniques for Kernel Methods

Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf

Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like in expectation.

5 0.47061598 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons

Author: Shun-ichi Amari, Hyeyoung Park, Tomoko Ozeki

Abstract: Singularities are ubiquitous in the parameter space of hierarchical models such as multilayer perceptrons. At singularities, the Fisher information matrix degenerates, and the Cramer-Rao paradigm does no more hold, implying that the classical model selection theory such as AIC and MDL cannot be applied. It is important to study the relation between the generalization error and the training error at singularities. The present paper demonstrates a method of analyzing these errors both for the maximum likelihood estimator and the Bayesian predictive distribution in terms of Gaussian random fields, by using simple models. 1

6 0.44217208 84 nips-2001-Global Coordination of Local Linear Models

7 0.43703988 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

8 0.40875894 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models

9 0.40803605 58 nips-2001-Covariance Kernels from Bayesian Generative Models

10 0.4048717 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation

11 0.40298986 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

12 0.39419076 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance

13 0.39395273 120 nips-2001-Minimax Probability Machine

14 0.38901269 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines

15 0.38518366 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images

16 0.37650982 113 nips-2001-Learning a Gaussian Process Prior for Automatically Generating Music Playlists

17 0.36994758 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

18 0.35932565 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

19 0.34671888 21 nips-2001-A Variational Approach to Learning Curves

20 0.34210634 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.053), (17, 0.032), (19, 0.048), (27, 0.128), (30, 0.067), (38, 0.025), (59, 0.054), (70, 0.259), (72, 0.091), (79, 0.037), (83, 0.022), (91, 0.101)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84924293 31 nips-2001-Algorithmic Luckiness

Author: Ralf Herbrich, Robert C. Williamson

Abstract: In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. 1

same-paper 2 0.83105397 155 nips-2001-Quantizing Density Estimators

Author: Peter Meinicke, Helge Ritter

Abstract: We suggest a nonparametric framework for unsupervised learning of projection models in terms of density estimation on quantized sample spaces. The objective is not to optimally reconstruct the data but instead the quantizer is chosen to optimally reconstruct the density of the data. For the resulting quantizing density estimator (QDE) we present a general method for parameter estimation and model selection. We show how projection sets which correspond to traditional unsupervised methods like vector quantization or PCA appear in the new framework. For a principal component quantizer we present results on synthetic and realworld data, which show that the QDE can improve the generalization of the kernel density estimator although its estimate is based on significantly lower-dimensional projection indices of the data.

3 0.73038626 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

Author: Michael Collins, S. Dasgupta, Robert E. Schapire

Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.

4 0.63353491 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

Author: Mário Figueiredo

Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.

5 0.62976289 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

6 0.62574387 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

7 0.62346202 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

8 0.62162912 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

9 0.62149024 8 nips-2001-A General Greedy Approximation Algorithm with Applications

10 0.62094164 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

11 0.61877155 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

12 0.61702836 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

13 0.61598355 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

14 0.61593044 60 nips-2001-Discriminative Direction for Kernel Classifiers

15 0.61577642 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

16 0.61491793 185 nips-2001-The Method of Quantum Clustering

17 0.61468536 114 nips-2001-Learning from Infinite Data in Finite Time

18 0.61458462 58 nips-2001-Covariance Kernels from Bayesian Generative Models

19 0.61372817 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

20 0.61357808 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation