nips nips2002 nips2002-111 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Trevor Hastie, Rob Tibshirani
Abstract: We present a simple, direct approach for solving the ICA problem, using density estimation and maximum likelihood. Given a candidate orthogonal frame, we model each of the coordinates using a semi-parametric density estimate based on cubic splines. Since our estimates have two continuous derivatives, we can easily run a second-order search for the frame parameters. Our method performs very favorably when compared to state-of-the-art techniques.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We present a simple, direct approach for solving the ICA problem, using density estimation and maximum likelihood. [sent-3, score-0.253]
2 Given a candidate orthogonal frame, we model each of the coordinates using a semi-parametric density estimate based on cubic splines. [sent-4, score-0.376]
3 Since our estimates have two continuous derivatives, we can easily run a second-order search for the frame parameters. [sent-5, score-0.11]
4 1 Introduction: Independent component analysis (ICA) is a popular enhancement over principal component analysis (PCA) and factor analysis. [sent-7, score-0.08]
5 In its simplest form, we observe a random vector $X \in \mathbb{R}^p$ assumed to arise from a linear mixing of a latent random source vector $S \in \mathbb{R}^p$: (1) $X = AS$; the components $S_j$, $j = 1, \ldots, p$, of $S$ are assumed to be independently distributed. [sent-8, score-0.119]
6 In the classic cocktail-party example, several speakers talk independently in a room, and microphones around the room record a mix of the sounds. [sent-14, score-0.028]
7 Hence the second-order moments $\mathrm{Cov}(X) = AA^T = A^* A^{*T}$ do not contain enough information to distinguish these two situations. [sent-19, score-0.092]
8 The factor analysis model typically has fewer than p components, and includes an error component for each variable. [sent-21, score-0.04]
9 Two facts are clear: • Since a multivariate Gaussian distribution is completely determined by its first and second moments, this model would not be able to distinguish $A$ and $A^*$ (see the numerical check after this list). [sent-23, score-0.087]
10 • If we hope to recover the original $A$, at least $p - 1$ of the components of $S$ will have to be non-Gaussian. [sent-26, score-0.045]
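To see the identifiability problem numerically, here is a small check (an illustrative sketch, not from the paper) that an orthogonal rotation of the mixing matrix leaves the covariance of X unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
A = rng.standard_normal((p, p))                    # some mixing matrix
R, _ = np.linalg.qr(rng.standard_normal((p, p)))   # any orthogonal rotation
A_star = A @ R                                     # an alternative mixing matrix

# With Cov(S) = I, Cov(X) = A A^T, and the rotated candidate agrees exactly:
print(np.allclose(A @ A.T, A_star @ A_star.T))     # True
```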
11 Then the joint density of $S$ is (2) $f_S(s) = \prod_{j=1}^{p} f_j(s_j)$, and since $A$ is orthogonal, the joint density of $X$ is (3) $f_X(x) = \prod_{j=1}^{p} f_j(a_j^T x)$, where $a_j$ is the $j$th column of $A$. [sent-33, score-0.623]
12 Equation (3) follows from $S = A^T X$ due to the orthogonality of $A$, and the fact that the determinant of this multivariate transformation is 1. [sent-34, score-0.08]
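The following sketch generates data from model (1) with an orthogonal mixing matrix and checks that the joint log-density of X factors as in (3); the uniform source density and all names here are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 2, 10000

# Independent non-Gaussian sources: uniform on [-sqrt(3), sqrt(3)] (mean 0, variance 1).
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(N, p))
A, _ = np.linalg.qr(rng.standard_normal((p, p)))   # orthogonal mixing matrix
X = S @ A.T                                        # model (1): X = A S

def log_fj(s):
    # log-density of the uniform source; -inf outside its support
    return np.where(np.abs(s) <= np.sqrt(3), -np.log(2 * np.sqrt(3)), -np.inf)

# Equation (3): since |det A| = 1, log f_X(x) = sum_j log f_j(a_j^T x).
log_fx = log_fj(X @ A).sum(axis=1)
print(np.isfinite(log_fx).all())                   # True: S = A^T X recovers the sources
```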
13 In this paper we fit the model (3) directly using semi-parametric maximum likelihood. [sent-35, score-0.069]
14 We represent each of the densities $f_j$ by an exponentially tilted Gaussian density (Efron & Tibshirani 1996). [sent-36, score-0.311]
15 (4) $f_j(s_j) = \phi(s_j)\,e^{g_j(s_j)}$, where $\phi$ is the standard univariate Gaussian density, and $g_j$ is a smooth function, restricted so that $f_j$ integrates to 1. [sent-37, score-0.241]
16 We represent each of the functions $g_j$ by a cubic smoothing spline, a rich class of smooth functions whose roughness is controlled by a penalty functional. [sent-38, score-0.503]
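A minimal sketch of the tilted Gaussian family (4), with g represented as a cubic spline on a fixed knot grid; the made-up spline coefficients and the quadrature renormalization enforcing the unit integral are implementation conveniences, not the paper's fitting procedure.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.stats import norm

# A cubic spline tilt g on a coarse knot grid; in the paper the coefficients
# are fit by penalized maximum likelihood, here they are simply made up.
knots = np.linspace(-4, 4, 9)
g = CubicSpline(knots, [0.0, 0.1, 0.3, 0.2, -0.4, 0.2, 0.3, 0.1, 0.0])

s = np.linspace(-6, 6, 2001)
ds = s[1] - s[0]
f_unnorm = norm.pdf(s) * np.exp(g(s))   # phi(s) * exp(g(s)), equation (4)

# Enforce the constraint that phi(s) exp(g(s)) integrates to 1, here by
# renormalizing on the quadrature grid (a convenience, not the paper's method).
f = f_unnorm / (f_unnorm.sum() * ds)
print((f * ds).sum())                   # ~1.0
```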
17 These choices lead to an attractive and effective semi-parametric implementation of ICA: • Given $A$, each of the components $f_j$ in (3) can be estimated separately by maximum likelihood. [sent-39, score-0.045]
18 • The components $g_j$ represent departures from Gaussianity, and the expected log-likelihood ratio between model (3) and the Gaussian density is given by $E_X \sum_j g_j(a_j^T X)$, a flexible contrast function. [sent-41, score-0.655]
19 • Since the first and second derivatives of each of the estimated $g_j$ are immediately available, second-order methods are available for estimating the orthogonal matrix $A$. [sent-42, score-0.318]
20 We use the fixed point algorithms described in Hyvarinen & Oja (1999). [sent-43, score-0.117]
21 • Our representation of the $g_j$ as smoothing splines casts the estimation problem as density estimation in a reproducing kernel Hilbert space, an infinite family of smooth functions. [sent-44, score-0.77]
22 This makes it directly comparable with the "Kernel ICA" approach of Bach & Jordan (2001), with the advantage that we have O(N) algorithms available for the computation of our contrast function, and its first two derivatives. [sent-45, score-0.035]
23 Given observations $x_1, \ldots, x_N$, we fit the model (3),(4) by maximum penalized likelihood. [sent-50, score-0.259]
24 We then maximize the criterion (5) $\sum_{j=1}^{p} \big[ \frac{1}{N} \sum_{i=1}^{N} \{ \log \phi(a_j^T x_i) + g_j(a_j^T x_i) \} - \lambda_j \int g_j''(s)^2\,ds \big]$ subject to (6) $a_j^T a_k = \delta_{jk}$ for all $j, k$, and (7) $\int \phi(s) e^{g_j(s)}\,ds = 1$ for all $j$. For fixed $a_j$, and hence $s_{ij} = a_j^T x_i$, the solutions for $g_j$ are known to be cubic splines with knots at each of the unique values of $s_{ij}$ (Silverman 1986). [sent-52, score-0.625]
25 The p terms decouple for fixed $a_j$, leaving us with p separate penalized density estimation problems. [sent-53, score-0.56]
26 We fit the functions $g_j$ and directions $a_j$ by optimizing (5) in an alternating fashion, as described in Algorithm 1. [sent-54, score-0.291]
27 In step (a), we find the optimal $g_j$ for fixed $A$; in step (b), given the $g_j$, we take one step toward the optimal $A$. [sent-55, score-0.148]
Algorithm 1 (Product Density ICA algorithm):
1. Initialize $A$ (e.g., a random Gaussian matrix followed by orthogonalization).
2. Alternate until convergence of $A$, using the Amari metric (16): [sent-58, score-0.037]
(a) Given $A$, optimize (5) with respect to the $g_j$ (separately for each j), using the penalized density estimation Algorithm 2. [sent-62, score-0.474]
(b) Given $g_j$, $j = 1, \ldots, p$, perform one step of the fixed point Algorithm 3 towards finding the optimal $A$. [sent-66, score-0.148]
31 In this sense Algorithm 1 can be seen to be maximizing the profile penalized log-likelihood with respect to $A$. [sent-68, score-0.222]
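A skeleton of Algorithm 1, assuming pre-whitened data; the helpers fit_tilt_spline, fixed_point_step, and amari_metric are sketched after Algorithm 2, Algorithm 3, and equation (16) below, and all names, defaults, and the stopping rule are illustrative rather than the paper's API.

```python
import numpy as np

def prodenica(X, n_iter=50, tol=1e-3):
    """Alternating Product Density ICA (Algorithm 1) on pre-whitened X (N x p)."""
    N, p = X.shape
    rng = np.random.default_rng(0)
    A, _ = np.linalg.qr(rng.standard_normal((p, p)))  # step 1: random orthogonal start
    for _ in range(n_iter):                           # step 2: alternate to convergence
        S = X @ A                                     # current coordinates s_ij = a_j^T x_i
        # (a) p decoupled penalized density fits for the tilt functions g_j
        tilts = [fit_tilt_spline(S[:, j]) for j in range(p)]
        # (b) one fixed-point step for A; monitor movement with the Amari metric (16)
        A_new = fixed_point_step(X, A, tilts)
        if amari_metric(A, A_new) < tol:
            A = A_new
            break
        A = A_new
    return A, tilts
```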
32 2.1 Penalized density estimation: We focus on a single coordinate, with N observations $s_i$ (where $s_i = a_k^T x_i$ for some k). [sent-73, score-0.253]
33 Silverman (1982) shows that one can incorporate the integration constraint by using the modified criterion (without a Lagrange multiplier): (9) $\frac{1}{N} \sum_{i=1}^{N} \{ \log \phi(s_i) + g(s_i) \} - \int \phi(s) e^{g(s)}\,ds - \lambda \int g''(s)^2\,ds$. [sent-78, score-0.044]
34 We construct a fine grid of L values $s_\ell^*$ in increments $\Delta$ covering the observed values $s_i$, and let (10) $y_\ell^* = \#\{ s_i \in (s_\ell^* - \Delta/2,\ s_\ell^* + \Delta/2) \} / N$. Typically we pick L to be 1000, which is more than adequate. [sent-80, score-0.028]
35 The criterion (9) is then approximated by a sum over the grid: (11) $\sum_{\ell=1}^{L} \big[ y_\ell^* \{ \log \phi(s_\ell^*) + g(s_\ell^*) \} - \Delta\, \phi(s_\ell^*) e^{g(s_\ell^*)} \big] - \lambda \int g''(s)^2\,ds$. [sent-82, score-0.222] 36 This last expression can be seen to be proportional to a penalized Poisson log-likelihood with response $y_\ell^*/\Delta$ and penalty parameter $\lambda/\Delta$, and mean $\mu(s) = \phi(s) e^{g(s)}$. [sent-83, score-0.041]
37 This is a generalized additive model (Hastie & Tibshirani 1990), with an offset term $\log \phi(s)$, and can be fit using a Newton algorithm in O(L) operations. [sent-84, score-0.1]
38 As with other GAMs, the Newton algorithm is conveniently re-expressed as an iteratively reweighted penalized least squares regression problem, which we give in Algorithm 2. [sent-85, score-0.315]
39 Algorithm 2 (iteratively reweighted penalized least squares algorithm for fitting the tilted Gaussian spline density model): [sent-86, score-0.811]
40 (a) Compute the fitted mean $\mu(s_\ell^*) = \phi(s_\ell^*) e^{g(s_\ell^*)}$ and weights $w_\ell = \mu(s_\ell^*)$. (b) Form the working response $z_\ell = g(s_\ell^*) + \{ y_\ell^*/\Delta - \mu(s_\ell^*) \} / \mu(s_\ell^*)$. (c) Update g by solving the weighted penalized least squares problem (13). This amounts to fitting a weighted smoothing spline to the pairs $(s_\ell^*, z_\ell)$ with weights $w_\ell$ and tuning parameter $2\lambda/\Delta$. [sent-94, score-0.496]
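A sketch of Algorithm 2 on the binned data of (10). For clarity this version uses a dense second-difference penalty in place of the O(L) B-spline machinery described below, and adds a crude renormalization and tail safeguard; the penalty scaling and defaults are assumptions.

```python
import numpy as np
from scipy.stats import norm

def fit_tilt_spline(s, L=1000, lam=1e-4, n_irls=20):
    """Penalized Poisson fit of the tilt g on a fine grid (sketch of Algorithm 2)."""
    edges = np.linspace(s.min() - 0.5, s.max() + 0.5, L + 1)
    grid = 0.5 * (edges[:-1] + edges[1:])              # grid values s*_l
    delta = grid[1] - grid[0]                          # increment Delta
    y = np.histogram(s, bins=edges)[0] / len(s)        # equation (10): proportions y*_l

    # Discrete stand-in for the roughness penalty lambda * int g''(s)^2 ds
    D = np.diff(np.eye(L), n=2, axis=0)
    P = (lam / delta) * (D.T @ D) / delta**3

    phi = norm.pdf(grid)
    g = np.zeros(L)
    for _ in range(n_irls):
        mu = phi * np.exp(g)                           # mean mu(s) = phi(s) e^{g(s)}
        z = g + (y / delta - mu) / mu                  # working response
        z = np.clip(z, g - 10.0, g + 10.0)             # crude safeguard in the far tails
        W = np.diag(mu)                                # working weights w_l = mu(s*_l)
        g = np.linalg.solve(W + P, W @ z)              # penalized WLS update (13)
        g -= np.log((phi * np.exp(g)).sum() * delta)   # re-enforce the unit integral (7)
    return grid, g
```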
41 Despite the large number of knots and hence basis functions, the local support of the B-spline basis functions allows the solution to (13) to be obtained in O(L) computations. [sent-96, score-0.161]
42 • The first and second derivatives of $\hat g$ are immediately available, and are used in the second-order search for the direction $a_j$ in Algorithm 1. [sent-97, score-0.233]
43 • As an alternative to choosing a value for $\lambda$, we can control the amount of smoothing through the effective number of parameters, given by the trace of the linear operator matrix implicit in (13) (Hastie & Tibshirani 1990). [sent-98, score-0.083]
44 • It can also be shown that because of the form of (9), the resulting density inherits the mean and variance of the data (0 and 1); details will be given in a longer version of this paper. [sent-99, score-0.198]
45 2.2 A fixed point method for finding the orthogonal frame: For fixed functions $g_j$, the penalty term in (5) does not play a role in the search for $A$. [sent-101, score-0.462]
46 Since all of the columns $a_j$ of any $A$ under consideration are mutually orthogonal and of unit norm, the Gaussian component $\sum_{j=1}^{p} \log \phi(a_j^T x_i)$ does not depend on $A$, so it suffices to maximize the empirical contrast (14) $C(A) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{p} \hat g_j(a_j^T x_i)$. [sent-102, score-0.341]
47 C(A) has the form of a sum of contrast functions for detecting departures from Gaussianity. [sent-104, score-0.138]
48 Hyvarinen, Karhunen & Oja (2001) refer to the expected log-likelihood ratio as the negentropy, and use simple contrast functions to approximate it in their FastICA algorithm. [sent-105, score-0.102]
49 Our regularized approach can be seen as a way to construct a flexible contrast function adaptively, using a large set of basis functions. [sent-106, score-0.142]
50 Since we have first and second derivatives available for each $g_j$, we can mimic exactly the fast fixed point algorithm developed in Hyvarinen & Oja (1999). [sent-118, score-0.406]
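A sketch of the symmetric fixed-point step (Algorithm 3) in the FastICA style, with the fitted tilts supplying g' and g''; derivatives are taken by finite differences on the fitted grid and the symmetric orthogonalization uses an SVD, both of which are implementation choices assumed here.

```python
import numpy as np

def fixed_point_step(X, A, tilts):
    """One symmetric fixed-point update of the frame A; tilts[j] = (grid_j, gvals_j)."""
    N, p = X.shape
    A_new = np.empty_like(A)
    for j in range(p):
        grid, gvals = tilts[j]
        g1 = np.gradient(gvals, grid)                  # g'_j on the grid
        g2 = np.gradient(g1, grid)                     # g''_j on the grid
        s = X @ A[:, j]                                # current coordinates a_j^T x_i
        gp, gpp = np.interp(s, grid, g1), np.interp(s, grid, g2)
        # FastICA-style update: a_j <- E{X g'_j(a_j^T X)} - E{g''_j(a_j^T X)} a_j
        A_new[:, j] = (X * gp[:, None]).mean(axis=0) - gpp.mean() * A[:, j]
    U, _, Vt = np.linalg.svd(A_new, full_matrices=False)
    return U @ Vt                                      # symmetric orthogonalization
```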
51 Figure 1 shows the optimization criterion $C(A)$ in (14), as well as the two criteria used to approximate negentropy in FastICA by Hyvarinen et al. (2001). [sent-120, score-0.234]
52 $G_1$ and $G_2$ refer to the two functions used to define negentropy in FastICA. [sent-138, score-0.229]
53 In the left example the independent components are uniformly distributed; in the right, a mixture of Gaussians. [sent-139, score-0.045]
54 In the left plot, all the procedures found the correct frame; in the right plot, only the spline-based approach was successful. [sent-140, score-0.15]
55 3 Comparisons with FastICA: In this section we evaluate the performance of the product density approach (ProDenICA) by mimicking some of the simulations performed by Bach & Jordan (2001) to demonstrate their Kernel ICA approach. [sent-142, score-0.238]
56 Here we compare ProDenICA only with FastICA; a future expanded version of this paper will include comparisons with other ICA procedures as well. [sent-143, score-0.07]
57 The left panel in Figure 2 shows the 18 distributions used as a basis of comparison. [sent-144, score-0.078]
58 For each distribution, we generated a pair of independent components (N = 1024) and a random mixing matrix in $\mathbb{R}^{2 \times 2}$. [sent-146, score-0.119]
59 We used our Splus implementation of the FastICA algorithm, using the negentropy criterion based on the nonlinearity $G_1(s) = \log \cosh(s)$, and the symmetric orthogonalization scheme as in Algorithm 3 (Hyvarinen et al. 2001). [sent-148, score-0.328]
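For reference, a sketch of the FastICA negentropy approximation with $G_1(s) = \log \cosh(s)$; the Monte Carlo estimate of the Gaussian baseline $E\,G_1(\nu)$ is a convenience assumed here.

```python
import numpy as np

def negentropy_G1(s, rng=np.random.default_rng(2)):
    """FastICA approximation J(s) ~ [E G1(s) - E G1(nu)]^2, with nu ~ N(0, 1)."""
    G1 = lambda u: np.log(np.cosh(u))
    nu = rng.standard_normal(100_000)
    return (G1(np.asarray(s)).mean() - G1(nu).mean()) ** 2
```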
60 Each of the algorithms delivers an orthogonal mixing matrix $A$ (the data were pre-whitened), which is available for comparison with the generating orthogonalized mixing matrix $A_0$. [sent-154, score-0.301]
61 The Amari metric is (16) $d(A_0, A) = \frac{1}{2p} \sum_{i=1}^{p} \Big( \frac{\sum_{j=1}^{p} |r_{ij}|}{\max_j |r_{ij}|} - 1 \Big) + \frac{1}{2p} \sum_{j=1}^{p} \Big( \frac{\sum_{i=1}^{p} |r_{ij}|}{\max_i |r_{ij}|} - 1 \Big)$, where $r_{ij} = (A_0 A^{-1})_{ij}$. [sent-157, score-0.028]
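A direct transcription of the Amari metric (16), assuming square matrices; it is zero exactly when A matches $A_0$ up to permutation and sign of the columns.

```python
import numpy as np

def amari_metric(A0, A):
    """Amari distance (16) between the generating frame A0 and an estimate A."""
    p = A0.shape[0]
    R = np.abs(A0 @ np.linalg.inv(A))                  # |r_ij|, r = A0 A^{-1}
    row = (R.sum(axis=1) / R.max(axis=1) - 1).sum()    # row term of (16)
    col = (R.sum(axis=0) / R.max(axis=0) - 1).sum()    # column term of (16)
    return (row + col) / (2 * p)
```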
62 The right panel in Figure 2 shows boxplots of the pairwise differences $\{d(A_0, A_F) - d(A_0, A_P)\} \times 100$, where the subscripts denote FastICA ($F$) and ProDenICA ($P$). [sent-158, score-0.134]
63 4) for ProDenICA (Bach & Jordan (2001) report averages of 6. [sent-164, score-0.028]
64 Figure 2: The left panel shows eighteen distributions used for comparisons. [sent-171, score-0.219]
65 These include the "t", uniform, exponential, mixtures of exponentials, and symmetric and asymmetric Gaussian mixtures. [sent-172, score-0.065]
66 The right panel shows boxplots of the improvement of ProDenICA over FastICA in each case, using the Amari metric, based on 30 simulations in $\mathbb{R}^2$. [sent-173, score-0.174]
67 6) for ProDenICA (Bach & Jordan (2001) report averages of 19 for FastICA, and 13 and 9 for their two KernelICA methods). [sent-180, score-0.028]
68 4 Discussion: The ICA model stipulates that after a suitable orthogonal transformation, the data are independently distributed. [sent-181, score-0.088]
69 Our model delivers estimates of both the mixing matrix $A$ and the densities of the independent components. [sent-183, score-0.139]
70 Many approaches to ICA, including FastICA, are based on minimizing approximations to entropy. [sent-184, score-0.04]
71 The argument, given in detail in Hyvarinen et al. [sent-185, score-0.028]
72 (2001) and reproduced in Hastie, Tibshirani & Friedman (2001), starts with minimizing the mutual information, the KL divergence between the full density and its independence version. [sent-186, score-0.301]
73 FastICA uses very simple approximations based on a single (or a small number of) non-linear contrast functions, which work well for a variety of situations, but not at all well for the more complex Gaussian mixtures. [sent-187, score-0.178]
74 The log-likelihood for the spline-based product-density model can be seen as a direct estimate of the mutual information; it uses the empirical distribution of the observed data to represent their joint density, and the product-density model to represent the independence density. [sent-188, score-0.135]
75 This approach works well in both the simple and complex situations automatically, at a very modest increase in computational effort. [sent-189, score-0.036]
76 As a side benefit, the form of our tilted Gaussian density estimate allows our log-likelihood criterion to be interpreted as an estimate of negentropy, a measure of departure from the Gaussian. [sent-190, score-0.355]
77 Bach & Jordan (2001) combine a nonparametric density approach (via reproducing kernel Hilbert function spaces) with a complex measure of independence based on the maximal correlation. [sent-191, score-0.336]
78 They motivate their independence measures as approximations to the mutual information. [sent-193, score-0.143]
79 Since the smoothing splines are exactly function estimates in an RKHS, our method shares this flexibility with their Kernel approach (and is in fact a "Kernel" method). [sent-194, score-0.16]
80 Our objective function, however, is a much simpler estimate of the mutual information. [sent-195, score-0.054]
81 In the simulations we have performed so far, it seems we achieve comparable accuracy. [sent-196, score-0.04]
82 Bach, F. & Jordan, M. (2001), 'Kernel independent component analysis', Technical Report UCB/CSD-01-1166, Computer Science Division, University of California, Berkeley. [sent-199, score-0.04]
83 Efron, B. & Tibshirani, R. (1996), 'Using specially designed exponential families for density estimation', Annals of Statistics 24(6), 2431-2461. [sent-202, score-0.198]
84 Hyvarinen, A. & Oja, E. (1999), 'Independent component analysis: Algorithms and applications', Neural Networks. [sent-216, score-0.04]
85 Silverman, B. W. (1982), 'On the estimation of a probability density function by the maximum penalized likelihood method', Annals of Statistics 10(3), 795-810. [sent-222, score-0.443]