nips nips2000 nips2000-60 knowledge-graph by maker-knowledge-mining

60 nips-2000-Gaussianization


Source: pdf

Author: Scott Saobing Chen, Ramesh A. Gopinath

Abstract: High dimensional data modeling is difficult mainly because the so-called

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: High dimensional data modeling is difficult mainly because of the so-called "curse of dimensionality". [sent-6, score-0.079]

2 We propose a technique called "Gaussianization" for high dimensional density estimation, which alleviates the curse of dimensionality by exploiting the independence structures in the data. [sent-7, score-0.441]

3 Gaussianization is motivated from recent developments in the statistics literature: projection pursuit, independent component analysis and Gaussian mixture models with semi-tied covariances. [sent-8, score-0.175]

4 We propose an iterative Gaussianization procedure which converges weakly: at each iteration, the data is first transformed to the least dependent coordinates and then each coordinate is marginally Gaussianized by univariate techniques. [sent-9, score-1.031]

5 Gaussianization offers density estimation sharper than traditional kernel methods and radial basis function methods. [sent-10, score-0.184]

6 Gaussianization can be viewed as an efficient solution of nonlinear independent component analysis and high dimensional projection pursuit. [sent-11, score-0.228]

7 In the statistics literature, the univariate problem is well-understood and well-studied. [sent-13, score-0.312]

8 Techniques such as (variable) kernel methods, radial basis function methods, Gaussian mixture models, etc., can be applied successfully to obtain univariate density estimates. [sent-14, score-0.487]

9 However, the high dimensional problem is very challenging, mainly due to the so-called "curse of dimensionality". [sent-15, score-0.093]

10 In high dimensional space, data samples are often sparsely distributed: it requires very large neighborhoods to achieve sufficient counts, or the number of samples has to grow exponentially with the dimension in order to achieve sufficient coverage of the sampling space. [sent-16, score-0.212]

11 As a result, direct extension of univariate techniques can be highly biased, because they are neighborhood-based. [sent-17, score-0.332]

12 In this paper, we attempt to overcome the curse of dimensionality by exploiting independence structures in the data. [sent-18, score-0.187]

13 We advocate the notion that Independence lifts the curse of dimensionality! [sent-19, score-0.098]

14 Indeed, if the dimensions are independent, then there is no curse of dimensionality since the high dimensional problem can be reduced to univariate problems along each dimension. [sent-20, score-0.557]
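
To make this point concrete, here is a minimal sketch (mine, not from the paper) of how density estimation simplifies once the coordinates are assumed independent: the joint density is just the product of D univariate Gaussian-mixture fits. The helper names and the component count are illustrative choices.

```python
# Hypothetical illustration: when the coordinates are independent, the joint
# density factorizes, so D univariate fits replace one D-dimensional fit.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_independent_density(X, n_components=4):
    """Fit one univariate Gaussian mixture per coordinate."""
    return [GaussianMixture(n_components).fit(X[:, d:d + 1])
            for d in range(X.shape[1])]

def joint_log_density(models, X):
    """Under independence, the joint log-density is the sum of the marginals."""
    return sum(m.score_samples(X[:, d:d + 1]) for d, m in enumerate(models))
```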

15 For natural data sets which do not have independent dimensions, we would like to construct transforms such that after the transformation, the dimensions become independent. [sent-21, score-0.131]

16 We propose a technique called "Gaussianization" which finds and exploits independence structures in the data for high dimensional density estimation. [sent-22, score-0.305]

17 We propose an iterative procedure which converges weakly in probability: at each iteration, the data is first transformed to the least dependent coordinates and then each coordinate is marginally Gaussianized by univariate techniques which are based on univariate density estimation. [sent-24, score-1.466]

18 At each iteration, the coordinates become less dependent in terms of the mutual information, and the transformed data samples become more Gaussian in terms of the Kullback-Leibler divergence. [sent-25, score-0.389]

19 In fact, at each iteration, as long as the data is linearly transformed to less dependent coordinates, the convergence result still holds. [sent-26, score-0.256]

20 Our convergence proof of Gaussianization is highly related to Huber's convergence proof of projection pursuit [4]. [sent-27, score-0.216]

21 Algorithmically, each Gaussianization iteration amounts to performing linear independent component analysis. [sent-28, score-0.249]

22 Since the assumption of linear independent component analysis may not be valid, the resulting linear transform does not necessarily make the coordinates independent; however, it does make the coordinates as independent as possible. [sent-29, score-0.486]

23 Therefore the engine of our algorithm is linear independent component analysis. [sent-30, score-0.116]

24 We propose an efficient EM algorithm which jointly estimates the linear transform and the marginal univariate Gaussianization transform at each iteration. [sent-31, score-0.801]

25 However, we apply the variational method in the M-step, as in the semi-tied covariance algorithm proposed for Gaussian mixture models by Gales (1999) [3]. [sent-33, score-0.066]

26 Denote by φ(·) the probability density function of the standard normal N(0, I); denote by φ(·, μ, Σ) the density of the Gaussian N(μ, Σ). [sent-36, score-0.127]

27 1 Univariate Gaussianization Univariate Gaussianization exists uniquely and can be derived from univariate density estimation. [sent-40, score-0.441]

28 We assume that the density function of X is strictly positive and differentiable. [sent-42, score-0.11]

29 Throughout the paper, we shall assume that univariate density estimation and univariate Gaussianization can be solved by univariate Gaussian mixture models. [sent-45, score-1.157]
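
As an illustration of how such a univariate Gaussianization can be built from a univariate density estimate, the sketch below computes T(x) = Φ^{-1}(F(x)), where F is the CDF of a fitted Gaussian mixture. It is a hedged approximation of the idea rather than the authors' implementation; the function name, component count and clipping constant are my choices.

```python
# Univariate Gaussianization T(x) = Phi^{-1}(F(x)), with F the CDF of a
# fitted univariate Gaussian mixture; a sketch, not the paper's estimator.
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def univariate_gaussianize(x, n_components=4, eps=1e-6):
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components).fit(x)
    w = gmm.weights_
    mu = gmm.means_.ravel()
    sd = np.sqrt(gmm.covariances_.ravel())            # 1-D 'full' covariances
    F = np.sum(w * norm.cdf((x - mu) / sd), axis=1)   # mixture CDF F(x)
    F = np.clip(F, eps, 1.0 - eps)                    # keep Phi^{-1} finite
    return norm.ppf(F)                                # marginally ~ N(0, 1)
```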

30 2 High Dimensional Gaussianization However, the existence of high dimensional Gaussianization is non-trivial. [sent-47, score-0.117]

31 We first marginally Gaussianize the first coordinate X1 and keep the second coordinate X2 unchanged; the transformed variable will have the following density: p(x1, x2) = p(x1)p(x2|x1) = φ(x1)p(x2|x1). [sent-52, score-0.561]

32 We then marginally Gaussianize each conditional density p(·|x1) for each x1. [sent-53, score-0.259]

33 Notice that the marginal Gaussianization is different for different x1: T_{x1}(x2) = Φ^{-1}(F_{x1}(x2)). [sent-54, score-0.066]

34 Once all the conditional densities are marginally Gaussianized, we achieve joint Gaussianization: p(x1, x2) = p(x1)p(x2|x1) = φ(x1)φ(x2). [sent-55, score-0.189]

35 The existence of high dimensional Gaussianization can be proved by similar construction. [sent-56, score-0.133]

36 The above construction, however, is not practical since the marginal Gaussianization of the conditional densities p(X2 = x2 | X1 = x1) requires estimation of the conditional densities given every x1, which is impossible with finite samples. [sent-57, score-0.189]

37 In the following sections, we shall develop an iterative Gaussianization algorithm that is practical and also can be proved to converge weakly. [sent-58, score-0.163]

38 High-dimensional Gaussianization is unique up to any invertible transforms which preserve the measure on R^D induced by the standard Gaussian distribution. [sent-59, score-0.127]

39 Examples of such transforms are orthogonal linear transforms and certain nontrivial nonlinear transforms. [sent-60, score-0.164]

40 We assume that there exists a linear transform A ∈ R^{D×D} such that the transformed variable Y = (Y1, ..., YD)^T = AX has independent coordinates: p(y) = p(y1) ··· p(yD). [sent-67, score-0.346]

41 In this case, Gaussianization is reduced to linear ICA: we can first find the linear transformation A by linear independent component analysis, and then Gaussianize each individual dimension of Y by univariate Gaussianization. [sent-76, score-0.478]
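
A sketch of this special case is given below; scikit-learn's FastICA is used only as a stand-in for the EM-based linear ICA described in this paper, the function name is mine, and `univariate_gaussianize` refers to the earlier univariate sketch.

```python
# Special case under the linear ICA assumption: one linear ICA step followed
# by univariate Gaussianization of each recovered coordinate.  FastICA is a
# stand-in for the EM-based ICA of this paper; univariate_gaussianize is the
# earlier univariate sketch.
import numpy as np
from sklearn.decomposition import FastICA

def gaussianize_under_linear_ica(X, n_components_per_dim=4):
    ica = FastICA(random_state=0)
    Y = ica.fit_transform(X)                      # approximately independent Y = AX
    Z = np.column_stack([univariate_gaussianize(Y[:, d], n_components_per_dim)
                         for d in range(Y.shape[1])])
    return Z, ica
```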

42 We parametrize the marginal Gaussianization by univariate Gaussian mixtures (2). [sent-77, score-0.403]

43 This amounts to modeling the coordinates of the transformed variable by univariate Gaussian mixtures: p(yd) = Σ_i π_{d,i} φ(yd; μ_{d,i}, σ²_{d,i}). [sent-78, score-0.636]

44 We would like to jointly optimize both the linear transform A and the marginal Gaussianization parameters (π, μ, σ). [sent-79, score-0.276]

45 We point out that modeling the coordinates after the linear transform as non-Gaussian distributions, for which we assume univariate Gaussian mixtures are adequate, leads to ICA, whereas modeling them as single Gaussians leads to PCA. [sent-82, score-0.642]

46 The means μ_{d,i} can be entirely determined by the linear transform A. [sent-88, score-0.173]

47 However, updating the linear transform A and the variances σ²_{d,i} does not have a closed-form solution and has to be solved iteratively by numerical methods. [sent-89, score-0.212]

48 Attias (1999) [1] proposed to optimize Q via gradient descent: at each iteration, one fixes the linear transform and computes the Gaussian mixture parameters, then fixes the Gaussian mixture parameters and updates the linear transform via gradient descent using the so-called natural gradient. [sent-90, score-0.468]

49 We propose an iterative algorithm as in Gales (1999) [3] for the M-step which does not involve gradient descent and the nuisance and instability caused by the step size parameter. [sent-91, score-0.172]

50 At each iteration, we fix the linear transform A and update the variances σ²_{d,i}; we then fix σ²_{d,i} and update each row of A with all the other rows of A fixed: updating each row amounts to solving a system of linear equations. [sent-92, score-0.311]

51 Our iterative scheme guarantees that the auxiliary function Q is increased at every iteration. [sent-93, score-0.127]

52 Notice that each iteration in our M-step updates the rows of the linear matrix A by solving D linear equations. [sent-94, score-0.216]

53 Although our iterative scheme may be slightly more expensive per iteration than standard numerical optimization techniques such as Attias' algorithm, in practice it converges after very few iterations, as observed in Gales (1999) [3]. [sent-95, score-0.342]

54 In fact, in our experiments, our algorithm converges much faster than Attias's algorithm. [sent-97, score-0.062]

55 Furthermore, our algorithm is stable since each iteration is guaranteed to increase the likelihood. [sent-98, score-0.166]

56 Therefore we can stop each M-step after, say, one iteration of updating the parameters; even though the auxiliary function is not fully optimized, it is guaranteed to improve. [sent-101, score-0.191]

57 Attias (1999) [1] reported faster convergence of the generalized EM algorithm than the standard EM algorithm. [sent-103, score-0.066]

58 4 Iterative Gaussianization In this section we develop an iterative algorithm which Gaussianizes arbitrary random variables. [sent-104, score-0.125]

59 At each iteration, the data is first transformed to the least dependent coordinates and then each coordinate is marginally Gaussianized by univariate techniques which are based on univariate density estimation. [sent-105, score-1.313]

60 We shall show that transforming the data into the least dependent coordinates can be achieved by linear independent component analysis. [sent-106, score-0.348]

61 We define the negentropy¹ of a random variable X = (X1, ..., XD)^T as the Kullback-Leibler divergence between X and the standard Gaussian distribution. [sent-108, score-0.198]

62 We define the marginal negentropy to be J_M(X) = Σ_{d=1}^{D} J(Xd). [sent-109, score-0.202]

63 One can show that the negentropy can be decomposed as the sum of the marginal negentropy and the mutual information: J(X) = J_M(X) + I(X). [sent-110, score-0.338]
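
For completeness, this decomposition follows from splitting the Kullback-Leibler divergence against the product of standard-normal marginals; the identity below is a standard step written out by me, not taken from the paper, with D(·||·) denoting the Kullback-Leibler divergence.

```latex
J(X) \;=\; D\!\left(p_X \,\middle\|\, \prod_{d=1}^{D}\phi\right)
\;=\; \underbrace{\sum_{d=1}^{D} D\!\left(p_{X_d}\,\middle\|\,\phi\right)}_{J_M(X)}
\;+\; \underbrace{D\!\left(p_X \,\middle\|\, \prod_{d=1}^{D} p_{X_d}\right)}_{I(X)} .
```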

64 Gaussianization is equivalent to finding an invertible transform T(·) such that the negentropy of the transformed variable vanishes: J(T(X)) = 0. [sent-111, score-0.504]

65 For an arbitrary random variable X ∈ R^D, we propose the following iterative Gaussianization algorithm. [sent-112, score-0.169]

66 At each iteration, (A) linearly transform the data: Y^(k) = A X^(k). [sent-114, score-0.135]

67 1 We are abusing the terminology slightly: normally the negentropy of a random variable is defined to be the Kullback-Leibler distance between itself and the Gaussian variable with the same mean and covariance. [sent-115, score-0.226]

(B) Nonlinearly transform the data by marginal Gaussianization: X^(k+1) = Ψ_{π,μ,σ}(Y^(k)), where the marginal Gaussianization Ψ_{π,μ,σ}(·) can be derived from univariate Gaussian mixtures (2). [sent-116, score-0.218]

The parameters are chosen by minimizing the negentropy of the transformed variable X^(k+1): min over (A, π, μ, σ) of J(Ψ_{π,μ,σ}(A X^(k))). (3) [sent-120, score-0.665]

Thus, after each iteration, the transformed variable becomes as close as possible to the standard Gaussian in the Kullback-Leibler distance. [sent-125, score-0.19]
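
A compact sketch of one possible implementation of these two steps is given below. FastICA again stands in for the least-dependent linear transform of step (A), `univariate_gaussianize` is the earlier univariate sketch, and the function name and default iteration count are my choices, so this is an approximation of the procedure rather than the paper's EM-based algorithm.

```python
# Hedged sketch of the iterative algorithm: step (A) uses FastICA as a
# stand-in for the least-dependent linear transform, step (B) applies the
# univariate_gaussianize routine from the earlier sketch to each coordinate.
import numpy as np
from sklearn.decomposition import FastICA

def iterative_gaussianization(X, n_iter=8, n_components_per_dim=4):
    Xk = np.asarray(X, dtype=float)
    fitted = []                                   # keep the per-iteration ICA fits
    for k in range(n_iter):
        ica = FastICA(random_state=k)
        Yk = ica.fit_transform(Xk)                # (A) least dependent coordinates
        Xk = np.column_stack(                     # (B) marginal Gaussianization
            [univariate_gaussianize(Yk[:, d], n_components_per_dim)
             for d in range(Yk.shape[1])])
        fitted.append(ica)
    return Xk, fitted
```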

72 First, the problem of minimizing the negentropy (3) is equivalent to the maximum likelihood problem for Gaussianization with the linear ICA assumption in Section 3, and thus can be solved by the same efficient EM algorithm. [sent-126, score-0.174]

73 Second, since the data X^(k) might not satisfy the linear ICA assumption, the optimal linear transform might not transform X^(k) into independent coordinates. [sent-127, score-0.392]

74 However, it does transform X^(k) into the least dependent coordinates, since marginal Gaussianization removes the marginal negentropy while leaving the mutual information unchanged: J(X^(k+1)) = J_M(Ψ(A X^(k))) + I(Ψ(A X^(k))) = I(A X^(k)). [sent-128, score-0.222]

75 Therefore our iterative algorithm can be viewed as follows. [sent-130, score-0.125]

76 At each iteration, the data is linearly transformed to the least dependent coordinates and then each coordinate is marginally Gaussianized. [sent-131, score-0.587]

77 In practice, after the first iteration, the algorithm finds linear transforms which are almost orthogonal. [sent-132, score-0.134]

78 Therefore, in practice, one can also view each iteration as linearly transforming the data to the most marginally non-Gaussian coordinates and then marginally Gaussianizing each coordinate. [sent-133, score-0.633]

79 For the sake of simplicity, we assume that we can achieve perfect marginal Gaussianization Ψ(·) by Ψ_{π,μ,σ}(·). [sent-134, score-0.085]

80 Thus it suffices to analyze the ideal iterative Gaussianization X^(k+1) = Ψ(A_k X^(k)), where A_k = argmin_A J(Ψ(A X^(k))) = argmin_A I(A X^(k)). [sent-140, score-0.157]

81 The density function of X^(k) converges pointwise to the density function of the standard normal. [sent-143, score-0.273]

82 Figure 1: Iterative Gaussianization on a synthetic circular data set. We point out that our iterative algorithm can be relaxed as follows. [sent-202, score-0.199]

83 At each iteration, the data can be linearly transformed into coordinates which are less dependent, instead of coordinates which are the least dependent: I(X^(k)) − I(A_k X^(k)) ≥ ε [ I(X^(k)) − inf_A I(A X^(k)) ], where the constant ε > 0. [sent-203, score-0.464]

84 We can show that this relaxed algorithm still converges weakly. [sent-204, score-0.081]

85 5 Examples We demonstrate the process of our iterative Gaussianization algorithm through a very difficult two dimensional synthetic data set. [sent-205, score-0.22]

86 The true underlying variable is circularly distributed: in the polar coordinate system, the angle is uniformly distributed; the radius follows a mixture of four non-overlapping Gaussians. [sent-206, score-0.147]
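
An illustrative generator for data of this kind might look as follows; the specific radii, noise scale and function name are my choices, not taken from the paper.

```python
# Synthetic circular data: uniform angle, radius drawn from a mixture of four
# well-separated Gaussians.  The radii and noise scale below are illustrative.
import numpy as np

def make_circular_data(n=2000, radii=(1.0, 2.0, 3.0, 4.0), sigma=0.08, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    r = rng.choice(radii, size=n) + sigma * rng.standard_normal(n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])
```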

87 Figure 4 displays the transformed data set at each iteration. [sent-212, score-0.145]

88 Clearly we see that the transformed data gradually become standard Gaussian. [sent-213, score-0.162]

89 Let X^(0) = X; assume that the iterative Gaussianization procedure converges after K iterations. [sent-214, score-0.135]

90 Since the transforms at each iteration are invertible, we can then compute the Jacobian and obtain a density estimate for X. [sent-217, score-0.349]
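
Concretely, with the notation Y^(k) = A_k X^(k) and X^(k+1) = Ψ_k(Y^(k)) used above, the change-of-variables formula gives the density estimate below; this is a standard identity spelled out by me for completeness, not an additional result of the paper.

```latex
\hat{p}_X(x) \;=\; \phi\!\left(x^{(K)}\right)
  \prod_{k=0}^{K-1}\left(\bigl|\det A_k\bigr|
  \prod_{d=1}^{D}\Psi_{k,d}'\!\left(y_d^{(k)}\right)\right),
\qquad y^{(k)} = A_k\,x^{(k)},\quad x^{(k+1)} = \Psi_k\!\left(y^{(k)}\right),\quad x^{(0)} = x .
```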

91 Figure 4 compares the Gaussianization density estimate (8 iterations) and Gaussian mixture density estimate (40 Gaussians). [sent-219, score-0.26]

92 Clearly we see that the Gaussianization density estimate recovers the four-ring circular structure; however, the Gaussian mixture estimate lacks resolution. [sent-220, score-0.172]

93 6 Discussion Gaussianization is closely connected with the exploratory projection pursuit algorithm proposed by Friedman (1987) [2]. [sent-221, score-0.216]

94 In fact we argue that our iterative Gaussianization procedure can easily be constrained to provide an efficient parametric solution of high dimensional projection pursuit. [sent-222, score-0.291]

95 If we constrain the linear transform at each iteration to be orthogonal, and marginally Gaussianize only the first ℓ coordinates of the transformed variable, then the iterative Gaussianization algorithm achieves ℓ-dimensional projection pursuit. [sent-224, score-1.021]
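
One hedged way to realize this constrained variant in code is sketched below: the ICA unmixing matrix is orthogonalized via a QR factorization (my implementation choice, not specified here) and only the first ℓ coordinates are marginally Gaussianized, again reusing `univariate_gaussianize` from the earlier sketch.

```python
# Hedged sketch of the constrained, projection-pursuit-style iteration: the
# ICA unmixing matrix is orthogonalized via QR (a choice of this sketch) and
# only the first ell coordinates are marginally Gaussianized.
import numpy as np
from sklearn.decomposition import FastICA

def projection_pursuit_step(Xk, ell, n_components_per_dim=4, seed=0):
    W = FastICA(random_state=seed).fit(Xk).components_   # unmixing matrix
    Q, _ = np.linalg.qr(W.T)                             # orthogonal linear transform
    Yk = Xk @ Q
    Zk = Yk.copy()
    for d in range(ell):                                 # Gaussianize first ell coords
        Zk[:, d] = univariate_gaussianize(Yk[:, d], n_components_per_dim)
    return Zk, Q
```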

96 The bottleneck of Friedman's high dimensional projection pursuit is to find the jointly most non-Gaussian projection and to jointly Gaussianize that projection. [sent-225, score-0.42]

97 In contrast, our algorithm finds the most marginally non-Gaussian projection and marginally Gaussianizes that projection; it can be computed by an efficient EM algorithm. [sent-226, score-0.395]

98 We argue that Gaussianization density estimation indeed alleviates the problem of the curse of dimensionality. [sent-227, score-0.283]

99 We are currently experimenting with application of Gaussianization density estimation in automatic speech and speaker recognition. [sent-230, score-0.159]

100 Lippman, "Nonparametric multivariate density estimation: a comparative study", IEEE Transactions on Signal Processing, vol. [sent-252, score-0.129]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gaussianization', 0.786), ('univariate', 0.312), ('iteration', 0.14), ('negentropy', 0.136), ('transform', 0.135), ('marginally', 0.133), ('coordinates', 0.132), ('transformed', 0.128), ('ax', 0.114), ('density', 0.11), ('iterative', 0.099), ('curse', 0.098), ('pursuit', 0.087), ('projection', 0.083), ('attias', 0.082), ('gaussianize', 0.076), ('gaussianized', 0.076), ('marginal', 0.066), ('coordinate', 0.062), ('dimensional', 0.062), ('invertible', 0.06), ('dependent', 0.06), ('transforms', 0.05), ('estimation', 0.049), ('gales', 0.047), ('jm', 0.047), ('yd', 0.046), ('huber', 0.045), ('variable', 0.045), ('gaussian', 0.042), ('mixture', 0.04), ('linear', 0.038), ('jointly', 0.037), ('converges', 0.036), ('em', 0.036), ('dimensionality', 0.034), ('high', 0.031), ('fixes', 0.03), ('hwang', 0.03), ('optintization', 0.03), ('ica', 0.029), ('independent', 0.029), ('auxiliary', 0.028), ('linearly', 0.028), ('least', 0.027), ('estimates', 0.027), ('friedman', 0.026), ('orthogonal', 0.026), ('alleviates', 0.026), ('algorithm', 0.026), ('mixtures', 0.025), ('radial', 0.025), ('gaussians', 0.025), ('propose', 0.025), ('independence', 0.025), ('iterations', 0.024), ('existence', 0.024), ('pp', 0.024), ('parametrization', 0.024), ('component', 0.023), ('convergence', 0.023), ('updating', 0.023), ('descent', 0.022), ('differential', 0.022), ('shall', 0.022), ('circular', 0.022), ('infinity', 0.022), ('transaction', 0.022), ('samples', 0.022), ('fix', 0.021), ('densities', 0.021), ('exploratory', 0.02), ('argmin', 0.02), ('finds', 0.02), ('notice', 0.02), ('techniques', 0.02), ('dimensions', 0.02), ('comparative', 0.019), ('relaxed', 0.019), ('jacobian', 0.019), ('amounts', 0.019), ('derived', 0.019), ('achieve', 0.019), ('counts', 0.018), ('ideal', 0.018), ('xt', 0.018), ('cumulative', 0.018), ('weakly', 0.018), ('standard', 0.017), ('data', 0.017), ('annals', 0.017), ('proved', 0.016), ('synthetic', 0.016), ('easily', 0.016), ('variances', 0.016), ('conditional', 0.016), ('become', 0.015), ('structures', 0.015), ('exploiting', 0.015), ('ny', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 60 nips-2000-Gaussianization

Author: Scott Saobing Chen, Ramesh A. Gopinath

Abstract: High dimensional data modeling is difficult mainly because the so-called

2 0.10950267 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding

Author: Cynthia Archer, Todd K. Leen

Abstract: We establish a principled framework for adaptive transform coding. Transform coders are often constructed by concatenating an ad hoc choice of transform with suboptimal bit allocation and quantizer design. Instead, we start from a probabilistic latent variable model in the form of a mixture of constrained Gaussian mixtures. From this model we derive a transform coding algorithm, which is a constrained version of the generalized Lloyd algorithm for vector quantizer design. A byproduct of our derivation is the introduction of a new transform basis, which unlike other transforms (PCA, DCT, etc.) is explicitly optimized for coding. Image compression experiments show adaptive transform coders designed with our algorithm improve compressed image signal-to-noise ratio up to 3 dB compared to global transform coding and 0.5 to 2 dB compared to other adaptive transform coders. 1

3 0.084940359 51 nips-2000-Factored Semi-Tied Covariance Matrices

Author: Mark J. F. Gales

Abstract: A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. This new model form is evaluated on a large vocabulary speech recognition task. It is shown that using this factored form of covariance modelling reduces the word error rate.

4 0.070652045 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

Author: Lucas C. Parra, Clay Spence, Paul Sajda

Abstract: We present evidence that several higher-order statistical properties of natural images and signals can be explained by a stochastic model which simply varies scale of an otherwise stationary Gaussian process. We discuss two interesting consequences. The first is that a variety of natural signals can be related through a common model of spherically invariant random processes, which have the attractive property that the joint densities can be constructed from the one dimensional marginal. The second is that in some cases the non-stationarity assumption and only second order methods can be explicitly exploited to find a linear basis that is equivalent to independent components obtained with higher-order methods. This is demonstrated on spectro-temporal components of speech. 1

5 0.049146526 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

Author: Martin J. Wainwright, Erik B. Sudderth, Alan S. Willsky

Abstract: We present the embedded trees algorithm, an iterative technique for estimation of Gaussian processes defined on arbitrary graphs. By exactly solving a series of modified problems on embedded spanning trees, it computes the conditional means with an efficiency comparable to or better than other techniques. Unlike other methods, the embedded trees algorithm also computes exact error covariances. The error covariance computation is most efficient for graphs in which removing a small number of edges reveals an embedded tree. In this context, we demonstrate that sparse loopy graphs can provide a significant increase in modeling power relative to trees, with only a minor increase in estimation complexity. 1

6 0.048603747 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

7 0.045549367 36 nips-2000-Constrained Independent Component Analysis

8 0.044470362 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

9 0.043388776 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

10 0.042453606 27 nips-2000-Automatic Choice of Dimensionality for PCA

11 0.041257352 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

12 0.038579881 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems

13 0.038321223 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

14 0.037740782 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach

15 0.036962934 121 nips-2000-Sparse Kernel Principal Component Analysis

16 0.036828239 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach

17 0.036671251 74 nips-2000-Kernel Expansions with Unlabeled Examples

18 0.036440425 33 nips-2000-Combining ICA and Top-Down Attention for Robust Speech Recognition

19 0.035823878 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

20 0.035641637 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.12), (1, -0.001), (2, 0.072), (3, 0.052), (4, 0.009), (5, 0.021), (6, -0.07), (7, -0.015), (8, -0.028), (9, -0.027), (10, -0.151), (11, 0.007), (12, 0.05), (13, -0.007), (14, -0.09), (15, 0.034), (16, -0.074), (17, 0.03), (18, -0.048), (19, 0.136), (20, 0.007), (21, 0.102), (22, -0.04), (23, 0.118), (24, -0.159), (25, -0.101), (26, -0.078), (27, 0.142), (28, 0.163), (29, 0.027), (30, -0.108), (31, -0.12), (32, -0.078), (33, 0.119), (34, 0.016), (35, 0.003), (36, -0.181), (37, 0.003), (38, 0.044), (39, -0.011), (40, 0.143), (41, -0.085), (42, -0.122), (43, -0.148), (44, 0.109), (45, -0.064), (46, 0.053), (47, -0.072), (48, -0.07), (49, 0.092)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95772594 60 nips-2000-Gaussianization

Author: Scott Saobing Chen, Ramesh A. Gopinath

Abstract: High dimensional data modeling is difficult mainly because the so-called

2 0.72131187 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding

Author: Cynthia Archer, Todd K. Leen

Abstract: We establish a principled framework for adaptive transform coding. Transform coders are often constructed by concatenating an ad hoc choice of transform with suboptimal bit allocation and quantizer design. Instead, we start from a probabilistic latent variable model in the form of a mixture of constrained Gaussian mixtures. From this model we derive a transform coding algorithm, which is a constrained version of the generalized Lloyd algorithm for vector quantizer design. A byproduct of our derivation is the introduction of a new transform basis, which unlike other transforms (PCA, DCT, etc.) is explicitly optimized for coding. Image compression experiments show adaptive transform coders designed with our algorithm improve compressed image signal-to-noise ratio up to 3 dB compared to global transform coding and 0.5 to 2 dB compared to other adaptive transform coders. 1

3 0.53793871 51 nips-2000-Factored Semi-Tied Covariance Matrices

Author: Mark J. F. Gales

Abstract: A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. This new model form is evaluated on a large vocabulary speech recognition task. It is shown that using this factored form of covariance modelling reduces the word error rate.

4 0.39341974 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

Author: Lucas C. Parra, Clay Spence, Paul Sajda

Abstract: We present evidence that several higher-order statistical properties of natural images and signals can be explained by a stochastic model which simply varies scale of an otherwise stationary Gaussian process. We discuss two interesting consequences. The first is that a variety of natural signals can be related through a common model of spherically invariant random processes, which have the attractive property that the joint densities can be constructed from the one dimensional marginal. The second is that in some cases the non-stationarity assumption and only second order methods can be explicitly exploited to find a linear basis that is equivalent to independent components obtained with higher-order methods. This is demonstrated on spectro-temporal components of speech. 1

5 0.29746389 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

Author: Sepp Hochreiter, Michael Mozer

Abstract: The goal of many unsupervised learning procedures is to bring two probability distributions into alignment. Generative models such as Gaussian mixtures and Boltzmann machines can be cast in this light, as can recoding models such as ICA and projection pursuit. We propose a novel sample-based error measure for these classes of models, which applies even in situations where maximum likelihood (ML) and probability density estimation-based formulations cannot be applied, e.g., models that are nonlinear or have intractable posteriors. Furthermore, our sample-based error measure avoids the difficulties of approximating a density function. We prove that with an unconstrained model, (1) our approach converges on the correct solution as the number of samples goes to infinity, and (2) the expected solution of our approach in the generative framework is the ML solution. Finally, we evaluate our approach via simulations of linear and nonlinear models on mixture of Gaussians and ICA problems. The experiments show the broad applicability and generality of our approach. 1

6 0.28434238 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach

7 0.27876812 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

8 0.27178872 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

9 0.2679776 36 nips-2000-Constrained Independent Component Analysis

10 0.2627711 85 nips-2000-Mixtures of Gaussian Processes

11 0.26067704 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems

12 0.23954481 148 nips-2000-`N-Body' Problems in Statistical Learning

13 0.23649341 22 nips-2000-Algorithms for Non-negative Matrix Factorization

14 0.2088926 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

15 0.202281 25 nips-2000-Analysis of Bit Error Probability of Direct-Sequence CDMA Multiuser Demodulators

16 0.19927914 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

17 0.18498237 118 nips-2000-Smart Vision Chip Fabricated Using Three Dimensional Integration Technology

18 0.18382931 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

19 0.18131827 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice

20 0.17850652 111 nips-2000-Regularized Winnow Methods


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.293), (10, 0.017), (17, 0.131), (32, 0.051), (33, 0.054), (54, 0.021), (55, 0.027), (62, 0.049), (65, 0.018), (67, 0.063), (76, 0.047), (79, 0.02), (81, 0.015), (90, 0.037), (97, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82829547 8 nips-2000-A New Model of Spatial Representation in Multimodal Brain Areas

Author: Sophie Denève, Jean-René Duhamel, Alexandre Pouget

Abstract: Most models of spatial representations in the cortex assume cells with limited receptive fields that are defined in a particular egocentric frame of reference. However, cells outside of primary sensory cortex are either gain modulated by postural input or partially shifting. We show that solving classical spatial tasks, like sensory prediction, multi-sensory integration, sensory-motor transformation and motor control requires more complicated intermediate representations that are not invariant in one frame of reference. We present an iterative basis function map that performs these spatial tasks optimally with gain modulated and partially shifting units, and tests it against neurophysiological and neuropsychological data.

same-paper 2 0.78712183 60 nips-2000-Gaussianization

Author: Scott Saobing Chen, Ramesh A. Gopinath

Abstract: High dimensional data modeling is difficult mainly because the so-called

3 0.70916384 52 nips-2000-Fast Training of Support Vector Classifiers

Author: Fernando Pérez-Cruz, Pedro Luis Alarcón-Diana, Angel Navia-Vázquez, Antonio Artés-Rodríguez

Abstract: In this communication we present a new algorithm for solving Support Vector Classifiers (SVC) with large training data sets. The new algorithm is based on an Iterative Re-Weighted Least Squares procedure which is used to optimize the SVC. Moreover, a novel sample selection strategy for the working set is presented, which randomly chooses the working set among the training samples that do not fulfill the stopping criteria. The validity of both proposals, the optimization procedure and sample selection strategy, is shown by means of computer experiments using well-known data sets.

4 0.50459701 122 nips-2000-Sparse Representation for Gaussian Process Models

Author: Lehel Csató, Manfred Opper

Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.

5 0.49952891 74 nips-2000-Kernel Expansions with Unlabeled Examples

Author: Martin Szummer, Tommi Jaakkola

Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.

6 0.49751359 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

7 0.49409413 37 nips-2000-Convergence of Large Margin Separable Linear Classification

8 0.49212843 79 nips-2000-Learning Segmentation by Random Walks

9 0.49185109 94 nips-2000-On Reversing Jensen's Inequality

10 0.49039483 133 nips-2000-The Kernel Gibbs Sampler

11 0.48721606 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

12 0.48611614 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

13 0.48470098 4 nips-2000-A Linear Programming Approach to Novelty Detection

14 0.48441294 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

15 0.48405334 111 nips-2000-Regularized Winnow Methods

16 0.48392919 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

17 0.48133412 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

18 0.48031491 21 nips-2000-Algorithmic Stability and Generalization Performance

19 0.479509 130 nips-2000-Text Classification using String Kernels

20 0.47934714 36 nips-2000-Constrained Independent Component Analysis