nips nips2013 nips2013-117 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: James R. Voss, Luis Rademacher, Mikhail Belkin
Abstract: The performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise. In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows: 1. We develop and implement an efficient, Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. 2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. The algorithm is based on a special form of gradient iteration (different from gradient descent). We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case. 1 Introduction and Related Works In the Blind Signal Separation setting, it is assumed that observed data is drawn from an unknown distribution. The goal is to recover the latent signals under some appropriate structural assumption. A prototypical setting is the so-called cocktail party problem: in a room, there are d people speaking simultaneously and d microphones, with each microphone capturing a superposition of the voices. The objective is to recover the speech of each individual speaker. The simplest modeling assumption is to consider each speaker as producing a signal that is a random variable independent of the others, and to take the superposition to be a linear transformation independent of time. This leads to the following formalization: We observe samples from a random vector x distributed according to the equation x = As + b + η where A is a linear mixing matrix, b ∈ Rd is a constant vector, s is a latent random vector with independent coordinates, and η is an unknown random noise independent 1 of s. For simplicity, we assume A ∈ Rd×d is square and of full rank. The latent components of s are viewed as containing the information describing the makeup of the observed signal (voices of individual speakers in the cocktail party setting). The goal of Independent Component Analysis is to approximate the matrix A in order to recover the latent signal s. In practice, most methods ignore the noise term, leaving the simpler problem of recovering the mixing matrix A when x = As is observed. Arguably the two most widely used ICA algorithms are FastICA [13] and JADE [6]. Both of these algorithms are based on a two step process: (1) The data is centered and whitened, that is, made to have identity covariance matrix. This is typically done using principal component analysis (PCA) and rescaling the appropriate components. In the noiseless case this procedure orthogonalizes and rescales the independent components and thus recovers A up to an unknown orthogonal matrix R. (2) Recover the orthogonal matrix R. Most practical ICA algorithms differ only in the second step. In FastICA, various objective functions are used to perform a projection pursuit style algorithm which recovers the columns of R one at a time. JADE uses a fourth-cumulant based technique to simultaneously recover all columns of R. Step 1 of ICA is affected by the addition of a Gaussian noise. Even if the noise is white (has a scalar times identity covariance matrix) the PCA-based whitening procedure can no longer guarantee the whitening of the underlying independent components. Hence, the second step of the process is no longer justified. This failure may be even more significant if the noise is not white, which is likely to be the case in many practical situations. Recent theoretical developments (see, [2] and [3]) consider the case where the noise η is an arbitrary (not necessarily white) additive Gaussian variable drawn independently from s. In [2], it was observed that certain cumulant-based techniques for ICA can still be applied for the second step if the underlying signals can be orthogonalized.1 Orthogonalization of the latent signals (quasi-orthogonalization) is a significantly less restrictive condition as it does not force the underlying signal to have identity covariance (as in whitening in the noiseless case). In the noisy setting, the usual PCA cannot achieve quasi-orthogonalization as it will whiten the mixed signal, but not the underlying components. In [3], we show how quasi-orthogonalization can be achieved in a noise-invariant way through a method based on the fourth-order cumulant tensor. However, a direct implementation of that method requires estimating the full fourth-order cumulant tensor, which is computationally challenging even in relatively low dimensions. In this paper we derive a practical version of that algorithm based on directional Hessians of the fourth univariate cumulant, thus reducing the complexity dependence on the data dimensionality from d4 to d3 , and also allowing for a fully vectorized implementation. We also develop a fast and very simple gradient iteration (not to be confused with gradient descent) algorithm, GI-ICA, which is compatible with the quasi-orthogonalization step and can be shown to have convergence of order r − 1, when implemented using a univariate cumulant of order r. For the cumulant of order four, commonly used in practical applications, we obtain cubic convergence. We show how these convergence rates follow directly from the properties of the cumulants, which sheds some light on the somewhat surprising cubic convergence seen in fourth-order based ICA methods [13, 18, 22]. The update step has complexity O(N d) where N is the number of samples, giving a total algorithmic complexity of O(N d3 ) for step 1 and O(N d2 t) for step 2, where t is the number of iterations for convergence in the gradient iteration. Interestingly, while the techniques are quite different, our gradient iteration algorithm turns out to be closely related to Fast ICA in the noiseless setting, in the case when the data is whitened and the cumulants of order three or four are used. Thus, GI-ICA can be viewed as a generalization (and a conceptual simplification) of Fast ICA for more general quasi-orthogonalized data. We present experimental results showing superior performance in the case of data contaminated by Gaussian noise and very competitive performance for clean data. We also note that the GIICA algorithms are fast in practice, allowing us to process (decorrelate and detect the independent 1 This process of orthogonalizing the latent signals was called quasi-whitening in [2] and later in [3]. However, this conflicts with the definition of quasi-whitening given in [12] which requires the latent signals to be whitened. To avoid the confusion we will use the term quasi-orthogonalization for the process of orthogonalizing the latent signals. 2 components) 100 000 points in dimension 5 in well under a second on a standard desktop computer. Our Matlab implementation of GI-ICA is available for download at http://sourceforge. net/projects/giica/. Finally, we observe that our method is partially compatible with the robust cumulants introduced in [20]. We briefly discuss how GI-ICA can be extended using these noise-robust techniques for ICA to reduce the impact of sparse noise. The paper is organized as follows. In section 2, we discuss the relevant properties of cumulants, and discuss results from prior work which allows for the quasi-orthogonalization of signals with non-zero fourth cumulant. In section 3, we discuss the connection between the fourth-order cumulant tensor method for quasi-orthogonalization discussed in section 2 with Hessian-based techniques seen in [2] and [11]. We use this connection to create a more computationally efficient and practically implementable version of the quasi-orthogonalization algorithm discussed in section 2. In section 4, we discuss new, fast, projection-pursuit style algorithms for the second step of ICA which are compatible with quasi-orthogonalization. In order to simplify the presentation, all algorithms are stated in an abstract form as if we have exact knowledge of required distribution parameters. Section 5 discusses the estimators of required distribution parameters to be used in practice. Section 6 discusses numerical experiments demonstrating the applicability of our techniques. Related Work. The name Independent Component Analysis refers to a broad range of algorithms addressing the blind signal separation problem as well as its variants and extensions. There is an extensive literature on ICA in the signal processing and machine learning communities due to its applicability to a variety of important practical situations. For a comprehensive introduction see the books [8, 14]. In this paper we develop techniques for dealing with noisy data by introducing new and more efficient techniques for quasi-orthogonalization and subsequent component recovery. The quasi-orthogonalization step was introduced in [2], where the authors proposed an algorithm for the case when the fourth cumulants of all independent components are of the same sign. A general algorithm with complete theoretical analysis was provided in [3]. That algorithm required estimating the full fourth-order cumulant tensor. We note that Hessian based techniques for ICA were used in [21, 2, 11], with [11] and [2] using the Hessian of the fourth-order cumulant. The papers [21] and [11] proposed interesting randomized one step noise-robust ICA algorithms based on the cumulant generating function and the fourth cumulant respectively in primarily theoretical settings. The gradient iteration algorithm proposed is closely related to the work [18], which provides a gradient-based algorithm derived from the fourth moment with cubic convergence to learn an unknown parallelepiped in a cryptographic setting. For the special case of the fourth cumulant, the idea of gradient iteration has appeared in the context of FastICA with a different justification, see e.g. [16, Equation 11 and Theorem 2]. We also note the work [12], which develops methods for Gaussian noise-invariant ICA under the assumption that the noise parameters are known. Finally, there are several papers that considered the problem of performing PCA in a noisy framework. [5] gives a provably robust algorithm for PCA under a sparse noise model. [4] performs PCA robust to white Gaussian noise, and [9] performs PCA robust to white Gaussian noise and sparse noise. 2 Using Cumulants to Orthogonalize the Independent Components Properties of Cumulants: Cumulants are similar to moments and can be expressed in terms of certain polynomials of the moments. However, cumulants have additional properties which allow independent random variables to be algebraically separated. We will be interested in the fourth order multi-variate cumulants, and univariate cumulants of arbitrary order. Denote by Qx the fourth order cumulant tensor for the random vector x. So, (Qx )ijkl is the cross-cumulant between the random variables xi , xj , xk , and xl , which we alternatively denote as Cum(xi , xj , xk , xl ). Cumulant tensors are symmetric, i.e. (Qx )ijkl is invariant under permutations of indices. Multivariate cumulants have the following properties (written in the case of fourth order cumulants): 1. (Multilinearity) Cum(αxi , xj , xk , xl ) = α Cum(xi , xj , xk , xl ) for random vector x and scalar α. If y is a random variable, then Cum(xi +y, xj , xk , xl ) = Cum(xi , xj , xk , xl )+Cum(y, xj , xk , xl ). 2. (Independence) If xi and xj are independent random variables, then Cum(xi , xj , xk , xl ) = 0. When x and y are independent, Qx+y = Qx + Qy . 3. (Vanishing Gaussian) Cumulants of order 3 and above are zero for Gaussian random variables. 3 The first order cumulant is the mean, and the second order multivariate cumulant is the covariance matrix. We will denote by κr (x) the order-r univariate cumulant, which is equivalent to the crosscumulant of x with itself r times: κr (x) := Cum(x, x, . . . , x) (where x appears r times). Univariate r-cumulants are additive for independent random variables, i.e. κr (x + y) = κr (x) + κr (y), and homogeneous of degree r, i.e. κr (αx) = αr κr (x). Quasi-Orthogonalization Using Cumulant Tensors. Recalling our original notation, x = As + b + η gives the generative ICA model. We define an operation of fourth-order tensors on matrices: For Q ∈ Rd×d×d×d and M ∈ Rd×d , Q(M ) is the matrix such that d d Q(M )ij := Qijkl mlk . (1) k=1 l=1 We can use this operation to orthogonalize the latent random signals. Definition 2.1. A matrix W is called a quasi-orthogonalization matrix if there exists an orthogonal matrix R and a nonsingular diagonal matrix D such that W A = RD. We will need the following results from [3]. Here we use Aq to denote the q th column of A. Lemma 2.2. Let M ∈ Rd×d be an arbitrary matrix. Then, Qx (M ) = ADAT where D is a diagonal matrix with entries dqq = κ4 (sq )AT M Aq . q Theorem 2.3. Suppose that each component of s has non-zero fourth cumulant. Let M = Qx (I), and let C = Qx (M −1 ). Then C = ADAT where D is a diagonal matrix with entries dqq = 1/ Aq 2 . In particular, C is positive definite, and for any factorization BB T of C, B −1 is a quasi2 orthogonalization matrix. 3 Quasi-Orthogonalization using Cumulant Hessians We have seen in Theorem 2.3 a tensor-based method which can be used to quasi-orthogonalize observed data. However, this method na¨vely requires the estimation of O(d4 ) terms from data. ı There is a connection between the cumulant Hessian-based techniques used in ICA [2, 11] and the tensor-based technique for quasi-orthogonalization described in Theorem 2.3 that allows the tensor-method to be rewritten using a series of Hessian operations. We make this connection precise below. The Hessian version requires only O(d3 ) terms to be estimated from data and simplifies the computation to consist of matrix and vector operations. Let Hu denote the Hessian operator with respect to a vector u ∈ Rd . The following lemma connects Hessian methods with our tensor-matrix operation (a special case is discussed in [2, Section 2.1]). Lemma 3.1. Hu (κ4 (uT x)) = ADAT where dqq = 12(uT Aq )2 κ4 (sq ). In Lemma 3.1, the diagonal entries can be rewritten as dqq = 12κ4 (sq )(AT (uuT )Aq ). By comq paring with Lemma 2.2, we see that applying Qx against a symmetric, rank one matrix uuT can be 1 rewritten in terms of the Hessian operations: Qx (uuT ) = 12 Hu (κ4 (uT x)). This formula extends to arbitrary symmetric matrices by the following Lemma. Lemma 3.2. Let M be a symmetric matrix with eigen decomposition U ΛU T such that U = d 1 (u1 , u2 , . . . , ud ) and Λ = diag(λ1 , λ2 , . . . , λd ). Then, Qx (M ) = 12 i=1 λi Hui κ4 (uT x). i The matrices I and M −1 in Theorem 2.3 are symmetric. As such, the tensor-based method for quasi-orthogonalization can be rewritten using Hessian operations. This is done in Algorithm 1. 4 Gradient Iteration ICA In the preceding sections, we discussed techniques to quasi-orthogonalize data. For this section, we will assume that quasi-orthogonalization is accomplished, and discuss deflationary approaches that can quickly recover the directions of the independent components. Let W be a quasiorthogonalization matrix. Then, define y := W x = W As + W η. Note that since η is Gaussian noise, so is W η. There exists a rotation matrix R and a diagonal matrix D such that W A = RD. Let ˜ := Ds. The coordinates of ˜ are still independent random variables. Gaussian noise makes s s recovering the scaling matrix D impossible. We aim to recover the rotation matrix R. 4 Algorithm 1 Hessian-based algorithm to generate a quasi-orthogonalization matrix. 1: function F IND Q UASI O RTHOGONALIZATION M ATRIX(x) d 1 2: Let M = 12 i=1 Hu κ4 (uT x)|u=ei . See Equation (4) for the estimator. T 3: Let U ΛU give the eigendecomposition of M −1 d 4: Let C = i=1 λi Hu κ4 (uT x)|u=Ui . See Equation (4) for the estimator. 5: Factorize C as BB T . 6: return B −1 7: end function To see why recovery of D is impossible, we note that a white Gaussian random variable η 1 has independent components. It is impossible to distinguish between the case where η 1 is part of the signal, i.e. W A(s + η 1 ) + W η, and the case where Aη 1 is part of the additive Gaussian noise, i.e. W As + W (Aη 1 + η), when s, η 1 , and η are drawn independently. In the noise-free ICA setting, the latent signal is typically assumed to have identity covariance, placing the scaling information in the columns of A. The presence of additive Gaussian noise makes recovery of the scaling information impossible since the latent signals become ill-defined. Following the idea popularized in FastICA, we will discuss a deflationary technique to recover the columns of R one at a time. Fast Recovery of a Single Independent Component. In the deflationary approach, a function f is fixed that acts upon a directional vector u ∈ Rd . Based on some criterion (typically maximization or minimization of f ), an iterative optimization step is performed until convergence. This technique was popularized in FastICA, which is considered fast for the following reasons: 1. As an approximate Newton method, FastICA requires computation of u f and a quick-tocompute estimate of (Hu (f ))−1 at each iterative step. Due to the estimate, the computation runs in O(N d) time, where N is the number of samples. 2. The iterative step in FastICA has local quadratic order convergence using arbitrary functions, and global cubic-order convergence when using the fourth cumulant [13]. We note that cubic convergence rates are not unique to FastICA and have been seen using gradient descent (with the correct step-size) when choosing f as the fourth moment [18]. Our proposed deflationary algorithm will be comparable with FastICA in terms of computational complexity, and the iterative step will take on a conceptually simpler form as it only relies on u κr . We provide a derivation of fast convergence rates that relies entirely on the properties of cumulants. As cumulants are invariant with respect to the additive Gaussian noise, the proposed methods will be admissible for both standard and noisy ICA. While cumulants are essentially unique with the additivity and homogeneity properties [17] when no restrictions are made on the probability space, the preprocessing step of ICA gives additional structure (like orthogonality and centering), providing additional admissible functions. In particular, [20] designs “robust cumulants” which are only minimally effected by sparse noise. Welling’s robust cumulants have versions of the additivity and homogeneity properties, and are consistent with our update step. For this reason, we will state our results in greater generality. Let G be a function of univariate random variables that satisfies the additivity, degree-r (r ≥ 3) homogeneity, and (for the noisy case) the vanishing Gaussians properties of cumulants. Then for a generic choice of input vector v, Algorithm 2 will demonstrate order r−1 convergence. In particular, if G is κ3 , then we obtain quadratic convergence; and if G is κ4 , we obtain cubic convergence. Lemma 4.1 helps explain why this is true. Lemma 4.1. v G(v · y) = r d i=1 (v · Ri )r−1 G(˜i )Ri . s If we consider what is happening in the basis of the columns of R, then up to some multiplicative constant, each coordinate is raised to the r − 1 power and then renormalized during each step of Algorithm 2. This ultimately leads to the order r − 1 convergence. Theorem 4.2. If for a unit vector input v to Algorithm 2 h = arg maxi |(v · Ri )r−2 G(˜i )| has a s unique answer, then v has order r − 1 convergence to Rh up to sign. In particular, if the following conditions are met: (1) There exists a coordinate random variable si of s such that G(si ) = 0. (2) v inputted into Algorithm 2 is chosen uniformly at random from the unit sphere S d−1 . Then Algorithm 2 converges to a column of R (up to sign) almost surely, and convergence is of order r − 1. 5 Algorithm 2 A fast algorithm to recover a single column of R when v is drawn generically from the unit sphere. Equations (2) and (3) provide k-statistic based estimates of v κ3 and v κ4 , which can be used as practical choices of v G on real data. 1: function GI-ICA(v, y) 2: repeat 3: v ← v G(vT y) 4: v ← v/ v 2 5: until Convergence return v 6: end function ˜ Algorithm 3 Algorithm for ICA in the presence of Gaussian noise. A recovers A up to column order and scaling. RT W is the demixing matrix for the observed random vector x. function G AUSSIAN ROBUST ICA(G, x) W = F IND Q UASI O RTHOGONALIZATION M ATRIX(x) y = Wx R columns = ∅ for i = 1 to d do Draw v from S d−1 ∩ span(R columns)⊥ uniformly at random. R columns = R columns ∪ {GI-ICA(v, y)} end for Construct a matrix R using the elements of R columns as columns. ˜ = RT y s ˜ A = (RT W )−1 ˜ s return A, ˜ end function By convergence up to sign, we include the possibility that v oscillates between Rh and −Rh on alternating steps. This can occur if G(˜i ) < 0 and r is odd. Due to space limitations, the proof is s omitted. Recovering all Independent Components. As a Corollary to Theorem 4.2 we get: Corollary 4.3. Suppose R1 , R2 , . . . , Rk are known for some k < d. Suppose there exists i > k such that G(si ) = 0. If v is drawn uniformly at random from S d−1 ∩ span(R1 , . . . , Rk )⊥ where S d−1 denotes the unit sphere in Rd , then Algorithm 2 with input v converges to a new column of R almost surely. Since the indexing of R is arbitrary, Corollary 4.3 gives a solution to noisy ICA, in Algorithm 3. In practice (not required by the theory), it may be better to enforce orthogonality between the columns of R, by orthogonalizing v against previously found columns of R at the end of each step in Algorithm 2. We expect the fourth or third cumulant function will typically be chosen for G. 5 Time Complexity Analysis and Estimation of Cumulants To implement Algorithms 1 and 2 requires the estimation of functions from data. We will limit our discussion to estimation of the third and fourth cumulants, as lower order cumulants are more statistically stable to estimate than higher order cumulants. κ3 is useful in Algorithm 2 for nonsymmetric distributions. However, since κ3 (si ) = 0 whenever si is a symmetric distribution, it is plausible that κ3 would not recover all columns of R. When s is suspected of being symmetric, it is prudent to use κ4 for G. Alternatively, one can fall back to κ4 from κ3 when κ3 is detected to be near 0. Denote by z (1) , z (2) , . . . , z (N ) the observed samples of a random variable z. Given a sample, each cumulant can be estimated in an unbiased fashion by its k-statistic. Denote by kr (z (i) ) the kN 1 statistic sample estimate of κr (z). Letting mr (z (i) ) := N i=1 (z (i) − z )r give the rth sample ¯ central moment, then N 2 m3 (z (i) ) (N + 1)m4 (z (i) ) − 3(N − 1)m2 (z (i) )2 k3 (z (i) ) := , k4 (z (i) ) := N 2 (N − 1)(N − 2) (N − 1)(N − 2)(N − 3) 6 gives the third and fourth k-statistics [15]. However, we are interested in estimating the gradients (for Algorithm 2) and Hessians (for Algorithm 1) of the cumulants rather than the cumulants themselves. The following Lemma shows how to obtain unbiased estimates: Lemma 5.1. Let z be a d-dimensional random vector with finite moments up to order r. Let z(i) be α an iid sample of z. Let α ∈ Nd be a multi-index. Then ∂u kr (u · z(i) ) is an unbiased estimate for α ∂u κr (u · z). If we mean-subtract (via the sample mean) all observed random variables, then the resulting estimates are: N u k3 (u · y) = (N − 1)−1 (N − 2)−1 3N (u · y(i) )2 y(i) (2) i=1 u k4 (u · y) = N2 (N − 1)(N − 2)(N − 3) −12 Hu k4 (u · x) = N −1 − N2 N −1 N2 N +1 N N N ((u · y(i) ))3 y(i) i=1 N (u · y(i) )2 i=1 12N 2 (N − 1)(N − 2)(N − 3) N 4 (u · y(i) )y(i) (3) i=1 N +1 N N 2N − 2 (u · x(i) )2 (xxT )(i) − N2 i=1 i=1 N ((u · x(i) ))2 (xxT )(i) (4) i=1 N (u · x(i) )x(i) i=1 T N (u · x(i) )x(i) i=1 Using (4) to estimate Hu κ4 (uT x) from data when implementing Algorithm 1, the resulting quasiorthogonalization algorithm runs in O(N d3 ) time. Using (2) or (3) to estimate u G(vT y) (with G chosen to be κ3 or κ4 respectively) when implementing Algorithm 2 gives an update step that runs in O(N d) time. If t bounds the number of iterations to convergence in Algorithm 2, then O(N d2 t) steps are required to recover all columns of R once quasi-orthogonalization has been achieved. 6 Simulation Results In Figure 1, we compare our algorithms to the baselines JADE [7] and versions of FastICA [10], using the code made available by the authors. Except for the choice of the contrast function for FastICA the baselines were run using default settings. All tests were done using artificially generated data. In implementing our algorithms (available at [19]), we opted to enforce orthogonality during the update step of Algorithm 2 with previously found columns of R. In Figure 1, comparison on five distributions indicates that each of the independent coordinates was generated from a distinct distribution among the Laplace distribution, the Bernoulli distribution with parameter 0.5, the tdistribution with 5 degrees of freedom, the exponential distribution, and the continuous uniform distribution. Most of these distributions are symmetric, making GI-κ3 inadmissible. When generating data for the ICA algorithm, we generate a random mixing matrix A with condition number 10 (minimum singular value 1 and maximum singular value 10), and intermediate singular values chosen uniformly at random. The noise magnitude indicates the strength of an additive white Gaussian noise. We define 100% noise magnitude to mean variance 10, with 25% noise and 50% noise indicating variances 2.5 and 5 respectively. Performance was measured using the Amari Index ˆ introduced in [1]. Let B denote the approximate demixing matrix returned by an ICA algorithm, |m | n n ˆ and let M = BA. Then, the Amari index is given by: E := i=1 j=1 maxk ij ik | − 1 + |m n j=1 n i=1 |mij | maxk |mkj | − 1 . The Amari index takes on values between 0 and the dimensionality d. It can be roughly viewed as the distance of M from the nearest scaled permutation matrix P D (where P is a permutation matrix and D is a diagonal matrix). From the noiseles data, we see that quasi-orthogonalization requires more data than whitening in order to provide accurate results. Once sufficient data is provided, all fourth order methods (GI-κ4 , JADE, and κ4 -FastICA) perform comparably. The difference between GI-κ4 and κ4 -FastICA is not 7 ICA Comparison on 5 distributions (d=5, noisless data) ICA Comparison on 5 distributions (d=5, 25% noise magnitude) 1.00 ICA Comparison on 5 distributions (d=5, 50% noise magnitude) 1.00 1.00 GI−κ4 (quasi−orthogonal) κ4−FastICA κ4−FastICA κ4−FastICA log cosh−FastICA JADE log cosh−FastICA JADE log cosh−FastICA JADE 0.10 0.01 100 0.10 0.01 100 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, noisless data) 10.00 Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) Amari Index GI−κ4 (white) 0.10 0.01 100 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, 25% noise magnitude) 10.00 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, 50% noise magnitude) 10.00 GI−κ4 (white) GI−κ4 (quasi−orthogonal) κ4−FastICA log cosh−FastICA JADE log cosh−FastICA JADE 0.10 0.01 100 0.10 1000 10000 Number of Samples 100000 κ4−FastICA log cosh−FastICA JADE 1.00 Amari Index 1.00 Amari Index Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) κ4−FastICA 1.00 GI−κ4 (white) GI−κ4 (quasi−orthogonal) 0.01 100 0.10 1000 10000 Number of Samples 100000 0.01 100 1000 10000 Number of Samples 100000 Figure 1: Comparison of ICA algorithms under various levels of noise. White and quasi-orthogonal refer to the choice of the first step of ICA. All baseline algorithms use whitening. Reported Amari indices denote the mean Amari index over 50 runs on different draws of both A and the data. d gives the data dimensionality, with two copies of each distribution used when d = 10. statistically significant over 50 runs with 100 000 samples. We note that GI-κ4 under whitening and κ4 -FastICA have the same update step (up to a slightly different choice of estimators), with GI-κ4 differing to allow for quasi-orthogonalization. Where provided, the error bars give a 2σ confidence interval on the mean Amari index. In all cases, error bars for our algorithms are provided, and error bars for the baseline algorithms are provided when they do not hinder readability. It is clear that all algorithms degrade with the addition of Gaussian noise. However, GI-κ4 under quasi-orthogonalization degrades far less when given sufficient samples. For this reason, the quasi-orthogonalized GI-κ4 outperforms all other algorithms (given sufficient samples) including the log cosh-FastICA, which performs best in the noiseless case. Contrasting the performance of GIκ4 under whitening with itself under quasi-orthogonalization, it is clear that quasi-orthogonalization is necessary to be robust to Gaussian noise. Run times were indeed reasonably fast. For 100 000 samples on the varied distributions (d = 5) with 50% Gaussian noise magnitude, GI-κ4 (including the orthogonalization step) had an average running time2 of 0.19 seconds using PCA whitening, and 0.23 seconds under quasi-orthogonalization. The corresponding average number of iterations to convergence per independent component (at 0.0001 error) were 4.16 and 4.08. In the following table, we report the mean number of steps to convergence (per independent component) over the 50 runs for the 50% noise distribution (d = 5), and note that once sufficiently many samples were taken, the number of steps to convergence becomes remarkably small. Number of data pts whitening+GI-κ4 : mean num steps quasi-orth.+GI-κ4 : mean num steps 7 500 11.76 213.92 1000 5.92 65.95 5000 4.99 4.48 10000 4.59 4.36 Acknowledgments This work was supported by NSF grant IIS 1117707. 2 Using a standard desktop with an i7-2600 3.4 GHz CPU and 16 GB RAM. 8 50000 4.35 4.06 100000 4.16 4.08 References [1] S. Amari, A. Cichocki, H. H. Yang, et al. A new learning algorithm for blind signal separation. Advances in neural information processing systems, pages 757–763, 1996. [2] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384–2392, 2012. [3] M. Belkin, L. Rademacher, and J. Voss. Blind signal separation in the presence of Gaussian noise. In JMLR W&CP;, volume 30: COLT, pages 270–287, 2013. [4] C. M. Bishop. Variational principal components. Proc. Ninth Int. Conf. on Articial Neural Networks. ICANN, 1:509–514, 1999. [5] E. J. Cand` s, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? CoRR, e abs/0912.3599, 2009. [6] J. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362–370. IET, 1993. [7] J.-F. Cardoso and A. Souloumiac. Matlab JADE for real-valued data v 1.8. http:// perso.telecom-paristech.fr/˜cardoso/Algo/Jade/jadeR.m, 2005. [Online; accessed 8-May-2013]. [8] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010. [9] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. Image Processing, IEEE Transactions on, 20(12):3419–3430, 2011. [10] H. G¨ vert, J. Hurri, J. S¨ rel¨ , and A. Hyv¨ rinen. Matlab FastICA v 2.5. http:// a a a a research.ics.aalto.fi/ica/fastica/code/dlcode.shtml, 2005. [Online; accessed 1-May-2013]. [11] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In ITCS, pages 11–20, 2013. [12] A. Hyv¨ rinen. Independent component analysis in the presence of Gaussian noise by maxia mizing joint likelihood. Neurocomputing, 22(1-3):49–67, 1998. [13] A. Hyv¨ rinen. Fast and robust fixed-point algorithms for independent component analysis. a IEEE Transactions on Neural Networks, 10(3):626–634, 1999. [14] A. Hyv¨ rinen and E. Oja. Independent component analysis: Algorithms and applications. a Neural Networks, 13(4-5):411–430, 2000. [15] J. F. Kenney and E. S. Keeping. Mathematics of Statistics, part 2. van Nostrand, 1962. [16] H. Li and T. Adali. A class of complex ICA algorithms based on the kurtosis cost function. IEEE Transactions on Neural Networks, 19(3):408–420, 2008. [17] L. Mafttner. What are cumulants. Documenta Mathematica, 4:601–622, 1999. [18] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. J. Cryptology, 22(2):139–160, 2009. [19] J. Voss, L. Rademacher, and M. Belkin. Matlab GI-ICA implementation. sourceforge.net/projects/giica/, 2013. [Online]. http:// [20] M. Welling. Robust higher order statistics. In Tenth International Workshop on Artificial Intelligence and Statistics, pages 405–412, 2005. [21] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000. [22] V. Zarzoso and P. Comon. How fast is FastICA. EUSIPCO, 2006. 9
Reference: text
sentIndex sentText sentNum sentScore
1 We develop and implement an efficient, Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. [sent-16, score-0.417]
2 We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. [sent-18, score-0.213]
3 We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. [sent-20, score-0.072]
4 We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case. [sent-21, score-0.082]
5 The goal is to recover the latent signals under some appropriate structural assumption. [sent-23, score-0.13]
6 A prototypical setting is the so-called cocktail party problem: in a room, there are d people speaking simultaneously and d microphones, with each microphone capturing a superposition of the voices. [sent-24, score-0.084]
7 The simplest modeling assumption is to consider each speaker as producing a signal that is a random variable independent of the others, and to take the superposition to be a linear transformation independent of time. [sent-26, score-0.154]
8 The latent components of s are viewed as containing the information describing the makeup of the observed signal (voices of individual speakers in the cocktail party setting). [sent-29, score-0.146]
9 The goal of Independent Component Analysis is to approximate the matrix A in order to recover the latent signal s. [sent-30, score-0.17]
10 In practice, most methods ignore the noise term, leaving the simpler problem of recovering the mixing matrix A when x = As is observed. [sent-31, score-0.106]
11 This is typically done using principal component analysis (PCA) and rescaling the appropriate components. [sent-34, score-0.073]
12 In the noiseless case this procedure orthogonalizes and rescales the independent components and thus recovers A up to an unknown orthogonal matrix R. [sent-35, score-0.187]
13 JADE uses a fourth-cumulant based technique to simultaneously recover all columns of R. [sent-39, score-0.107]
14 Even if the noise is white (has a scalar times identity covariance matrix) the PCA-based whitening procedure can no longer guarantee the whitening of the underlying independent components. [sent-41, score-0.472]
15 This failure may be even more significant if the noise is not white, which is likely to be the case in many practical situations. [sent-43, score-0.07]
16 Recent theoretical developments (see, [2] and [3]) consider the case where the noise η is an arbitrary (not necessarily white) additive Gaussian variable drawn independently from s. [sent-44, score-0.103]
17 In [2], it was observed that certain cumulant-based techniques for ICA can still be applied for the second step if the underlying signals can be orthogonalized. [sent-45, score-0.071]
18 1 Orthogonalization of the latent signals (quasi-orthogonalization) is a significantly less restrictive condition as it does not force the underlying signal to have identity covariance (as in whitening in the noiseless case). [sent-46, score-0.333]
19 In [3], we show how quasi-orthogonalization can be achieved in a noise-invariant way through a method based on the fourth-order cumulant tensor. [sent-48, score-0.313]
20 However, a direct implementation of that method requires estimating the full fourth-order cumulant tensor, which is computationally challenging even in relatively low dimensions. [sent-49, score-0.313]
21 In this paper we derive a practical version of that algorithm based on directional Hessians of the fourth univariate cumulant, thus reducing the complexity dependence on the data dimensionality from d4 to d3 , and also allowing for a fully vectorized implementation. [sent-50, score-0.164]
22 We also develop a fast and very simple gradient iteration (not to be confused with gradient descent) algorithm, GI-ICA, which is compatible with the quasi-orthogonalization step and can be shown to have convergence of order r − 1, when implemented using a univariate cumulant of order r. [sent-51, score-0.558]
23 For the cumulant of order four, commonly used in practical applications, we obtain cubic convergence. [sent-52, score-0.37]
24 We show how these convergence rates follow directly from the properties of the cumulants, which sheds some light on the somewhat surprising cubic convergence seen in fourth-order based ICA methods [13, 18, 22]. [sent-53, score-0.137]
25 The update step has complexity O(N d) where N is the number of samples, giving a total algorithmic complexity of O(N d3 ) for step 1 and O(N d2 t) for step 2, where t is the number of iterations for convergence in the gradient iteration. [sent-54, score-0.154]
26 Interestingly, while the techniques are quite different, our gradient iteration algorithm turns out to be closely related to Fast ICA in the noiseless setting, in the case when the data is whitened and the cumulants of order three or four are used. [sent-55, score-0.491]
27 We present experimental results showing superior performance in the case of data contaminated by Gaussian noise and very competitive performance for clean data. [sent-57, score-0.07]
28 We also note that the GIICA algorithms are fast in practice, allowing us to process (decorrelate and detect the independent 1 This process of orthogonalizing the latent signals was called quasi-whitening in [2] and later in [3]. [sent-58, score-0.205]
29 However, this conflicts with the definition of quasi-whitening given in [12] which requires the latent signals to be whitened. [sent-59, score-0.084]
30 To avoid the confusion we will use the term quasi-orthogonalization for the process of orthogonalizing the latent signals. [sent-60, score-0.089]
31 Finally, we observe that our method is partially compatible with the robust cumulants introduced in [20]. [sent-64, score-0.456]
32 In section 2, we discuss the relevant properties of cumulants, and discuss results from prior work which allows for the quasi-orthogonalization of signals with non-zero fourth cumulant. [sent-67, score-0.165]
33 In section 3, we discuss the connection between the fourth-order cumulant tensor method for quasi-orthogonalization discussed in section 2 with Hessian-based techniques seen in [2] and [11]. [sent-68, score-0.313]
34 The name Independent Component Analysis refers to a broad range of algorithms addressing the blind signal separation problem as well as its variants and extensions. [sent-75, score-0.148]
35 In this paper we develop techniques for dealing with noisy data by introducing new and more efficient techniques for quasi-orthogonalization and subsequent component recovery. [sent-78, score-0.072]
36 The quasi-orthogonalization step was introduced in [2], where the authors proposed an algorithm for the case when the fourth cumulants of all independent components are of the same sign. [sent-79, score-0.565]
37 That algorithm required estimating the full fourth-order cumulant tensor. [sent-81, score-0.313]
38 The papers [21] and [11] proposed interesting randomized one step noise-robust ICA algorithms based on the cumulant generating function and the fourth cumulant respectively in primarily theoretical settings. [sent-83, score-0.774]
39 The gradient iteration algorithm proposed is closely related to the work [18], which provides a gradient-based algorithm derived from the fourth moment with cubic convergence to learn an unknown parallelepiped in a cryptographic setting. [sent-84, score-0.316]
40 For the special case of the fourth cumulant, the idea of gradient iteration has appeared in the context of FastICA with a different justification, see e. [sent-85, score-0.154]
41 [5] gives a provably robust algorithm for PCA under a sparse noise model. [sent-90, score-0.112]
42 [4] performs PCA robust to white Gaussian noise, and [9] performs PCA robust to white Gaussian noise and sparse noise. [sent-91, score-0.348]
43 However, cumulants have additional properties which allow independent random variables to be algebraically separated. [sent-93, score-0.417]
44 We will be interested in the fourth order multi-variate cumulants, and univariate cumulants of arbitrary order. [sent-94, score-0.541]
45 Denote by Qx the fourth order cumulant tensor for the random vector x. [sent-95, score-0.434]
46 So, (Qx )ijkl is the cross-cumulant between the random variables xi , xj , xk , and xl , which we alternatively denote as Cum(xi , xj , xk , xl ). [sent-96, score-0.36]
47 Multivariate cumulants have the following properties (written in the case of fourth order cumulants): 1. [sent-100, score-0.498]
48 (Multilinearity) Cum(αxi , xj , xk , xl ) = α Cum(xi , xj , xk , xl ) for random vector x and scalar α. [sent-101, score-0.36]
49 If y is a random variable, then Cum(xi +y, xj , xk , xl ) = Cum(xi , xj , xk , xl )+Cum(y, xj , xk , xl ). [sent-102, score-0.54]
50 (Independence) If xi and xj are independent random variables, then Cum(xi , xj , xk , xl ) = 0. [sent-104, score-0.261]
51 3 The first order cumulant is the mean, and the second order multivariate cumulant is the covariance matrix. [sent-108, score-0.626]
52 Univariate r-cumulants are additive for independent random variables, i. [sent-113, score-0.073]
53 (1) k=1 l=1 We can use this operation to orthogonalize the latent random signals. [sent-121, score-0.077]
54 A matrix W is called a quasi-orthogonalization matrix if there exists an orthogonal matrix R and a nonsingular diagonal matrix D such that W A = RD. [sent-124, score-0.225]
55 Then, Qx (M ) = ADAT where D is a diagonal matrix with entries dqq = κ4 (sq )AT M Aq . [sent-130, score-0.136]
56 Suppose that each component of s has non-zero fourth cumulant. [sent-133, score-0.167]
57 Then C = ADAT where D is a diagonal matrix with entries dqq = 1/ Aq 2 . [sent-135, score-0.136]
58 ı There is a connection between the cumulant Hessian-based techniques used in ICA [2, 11] and the tensor-based technique for quasi-orthogonalization described in Theorem 2. [sent-140, score-0.313]
59 Hu (κ4 (uT x)) = ADAT where dqq = 12(uT Aq )2 κ4 (sq ). [sent-149, score-0.074]
60 1, the diagonal entries can be rewritten as dqq = 12κ4 (sq )(AT (uuT )Aq ). [sent-151, score-0.133]
61 For this section, we will assume that quasi-orthogonalization is accomplished, and discuss deflationary approaches that can quickly recover the directions of the independent components. [sent-170, score-0.086]
62 There exists a rotation matrix R and a diagonal matrix D such that W A = RD. [sent-174, score-0.098]
63 Gaussian noise makes s s recovering the scaling matrix D impossible. [sent-177, score-0.106]
64 6: return B −1 7: end function To see why recovery of D is impossible, we note that a white Gaussian random variable η 1 has independent components. [sent-185, score-0.137]
65 In the noise-free ICA setting, the latent signal is typically assumed to have identity covariance, placing the scaling information in the columns of A. [sent-191, score-0.174]
66 The presence of additive Gaussian noise makes recovery of the scaling information impossible since the latent signals become ill-defined. [sent-192, score-0.187]
67 Following the idea popularized in FastICA, we will discuss a deflationary technique to recover the columns of R one at a time. [sent-193, score-0.134]
68 The iterative step in FastICA has local quadratic order convergence using arbitrary functions, and global cubic-order convergence when using the fourth cumulant [13]. [sent-201, score-0.541]
69 We note that cubic convergence rates are not unique to FastICA and have been seen using gradient descent (with the correct step-size) when choosing f as the fourth moment [18]. [sent-202, score-0.279]
70 We provide a derivation of fast convergence rates that relies entirely on the properties of cumulants. [sent-204, score-0.072]
71 As cumulants are invariant with respect to the additive Gaussian noise, the proposed methods will be admissible for both standard and noisy ICA. [sent-205, score-0.47]
72 While cumulants are essentially unique with the additivity and homogeneity properties [17] when no restrictions are made on the probability space, the preprocessing step of ICA gives additional structure (like orthogonality and centering), providing additional admissible functions. [sent-206, score-0.525]
73 Welling’s robust cumulants have versions of the additivity and homogeneity properties, and are consistent with our update step. [sent-208, score-0.505]
74 Let G be a function of univariate random variables that satisfies the additivity, degree-r (r ≥ 3) homogeneity, and (for the noisy case) the vanishing Gaussians properties of cumulants. [sent-210, score-0.094]
75 s If we consider what is happening in the basis of the columns of R, then up to some multiplicative constant, each coordinate is raised to the r − 1 power and then renormalized during each step of Algorithm 2. [sent-218, score-0.088]
76 5 Algorithm 2 A fast algorithm to recover a single column of R when v is drawn generically from the unit sphere. [sent-226, score-0.078]
77 R columns = R columns ∪ {GI-ICA(v, y)} end for Construct a matrix R using the elements of R columns as columns. [sent-232, score-0.219]
78 In practice (not required by the theory), it may be better to enforce orthogonality between the columns of R, by orthogonalizing v against previously found columns of R at the end of each step in Algorithm 2. [sent-251, score-0.233]
79 We expect the fourth or third cumulant function will typically be chosen for G. [sent-252, score-0.434]
80 We will limit our discussion to estimation of the third and fourth cumulants, as lower order cumulants are more statistically stable to estimate than higher order cumulants. [sent-254, score-0.498]
81 However, since κ3 (si ) = 0 whenever si is a symmetric distribution, it is plausible that κ3 would not recover all columns of R. [sent-256, score-0.165]
82 Given a sample, each cumulant can be estimated in an unbiased fashion by its k-statistic. [sent-263, score-0.313]
83 Letting mr (z (i) ) := N i=1 (z (i) − z )r give the rth sample ¯ central moment, then N 2 m3 (z (i) ) (N + 1)m4 (z (i) ) − 3(N − 1)m2 (z (i) )2 k3 (z (i) ) := , k4 (z (i) ) := N 2 (N − 1)(N − 2) (N − 1)(N − 2)(N − 3) 6 gives the third and fourth k-statistics [15]. [sent-265, score-0.121]
84 However, we are interested in estimating the gradients (for Algorithm 2) and Hessians (for Algorithm 1) of the cumulants rather than the cumulants themselves. [sent-266, score-0.754]
85 If t bounds the number of iterations to convergence in Algorithm 2, then O(N d2 t) steps are required to recover all columns of R once quasi-orthogonalization has been achieved. [sent-275, score-0.147]
86 In implementing our algorithms (available at [19]), we opted to enforce orthogonality during the update step of Algorithm 2 with previously found columns of R. [sent-279, score-0.123]
87 The noise magnitude indicates the strength of an additive white Gaussian noise. [sent-284, score-0.235]
88 We define 100% noise magnitude to mean variance 10, with 25% noise and 50% noise indicating variances 2. [sent-285, score-0.245]
89 It can be roughly viewed as the distance of M from the nearest scaled permutation matrix P D (where P is a permutation matrix and D is a diagonal matrix). [sent-291, score-0.098]
90 From the noiseles data, we see that quasi-orthogonalization requires more data than whitening in order to provide accurate results. [sent-292, score-0.12]
91 Once sufficient data is provided, all fourth order methods (GI-κ4 , JADE, and κ4 -FastICA) perform comparably. [sent-293, score-0.121]
92 The difference between GI-κ4 and κ4 -FastICA is not 7 ICA Comparison on 5 distributions (d=5, noisless data) ICA Comparison on 5 distributions (d=5, 25% noise magnitude) 1. [sent-294, score-0.107]
93 We note that GI-κ4 under whitening and κ4 -FastICA have the same update step (up to a slightly different choice of estimators), with GI-κ4 differing to allow for quasi-orthogonalization. [sent-321, score-0.147]
94 Contrasting the performance of GIκ4 under whitening with itself under quasi-orthogonalization, it is clear that quasi-orthogonalization is necessary to be robust to Gaussian noise. [sent-327, score-0.162]
95 For 100 000 samples on the varied distributions (d = 5) with 50% Gaussian noise magnitude, GI-κ4 (including the orthogonalization step) had an average running time2 of 0. [sent-329, score-0.115]
96 The corresponding average number of iterations to convergence per independent component (at 0. [sent-332, score-0.126]
97 In the following table, we report the mean number of steps to convergence (per independent component) over the 50 runs for the 50% noise distribution (d = 5), and note that once sufficiently many samples were taken, the number of steps to convergence becomes remarkably small. [sent-336, score-0.19]
98 Blind signal separation in the presence of Gaussian noise. [sent-371, score-0.078]
99 Independent component analysis in the presence of Gaussian noise by maxia mizing joint likelihood. [sent-440, score-0.116]
100 Fast and robust fixed-point algorithms for independent component analysis. [sent-444, score-0.128]
wordName wordTfidf (topN-words)
[('fastica', 0.446), ('ica', 0.397), ('cumulants', 0.377), ('cumulant', 0.313), ('jade', 0.204), ('qx', 0.166), ('amari', 0.147), ('cum', 0.131), ('fourth', 0.121), ('whitening', 0.12), ('gi', 0.102), ('cosh', 0.098), ('quasi', 0.098), ('white', 0.097), ('xl', 0.082), ('aq', 0.082), ('hu', 0.08), ('ationary', 0.074), ('dqq', 0.074), ('hessians', 0.074), ('blind', 0.07), ('noise', 0.07), ('hessian', 0.069), ('columns', 0.061), ('xk', 0.057), ('cubic', 0.057), ('dreese', 0.056), ('noiseless', 0.056), ('ut', 0.056), ('orthogonal', 0.055), ('gaussian', 0.053), ('pca', 0.051), ('adat', 0.049), ('orthogonalizing', 0.049), ('uut', 0.049), ('signal', 0.048), ('recover', 0.046), ('component', 0.046), ('columbus', 0.045), ('orthogonalization', 0.045), ('signals', 0.044), ('univariate', 0.043), ('hyv', 0.043), ('homogeneity', 0.043), ('additivity', 0.043), ('robust', 0.042), ('xj', 0.041), ('independent', 0.04), ('convergence', 0.04), ('latent', 0.04), ('neil', 0.039), ('index', 0.039), ('oh', 0.037), ('ohio', 0.037), ('atrix', 0.037), ('cardoso', 0.037), ('noisless', 0.037), ('orthogonalize', 0.037), ('parallelepiped', 0.037), ('quasiorthogonalization', 0.037), ('rthogonalization', 0.037), ('uasi', 0.037), ('voss', 0.037), ('compatible', 0.037), ('matrix', 0.036), ('rh', 0.036), ('magnitude', 0.035), ('labs', 0.035), ('avenue', 0.035), ('sq', 0.035), ('orthogonality', 0.035), ('invariant', 0.034), ('rewritten', 0.033), ('rd', 0.033), ('gradient', 0.033), ('ijkl', 0.033), ('cocktail', 0.033), ('additive', 0.033), ('lemma', 0.033), ('fast', 0.032), ('demixing', 0.03), ('ind', 0.03), ('separation', 0.03), ('rademacher', 0.03), ('symmetric', 0.03), ('matlab', 0.029), ('xxt', 0.028), ('moment', 0.028), ('si', 0.028), ('num', 0.027), ('popularized', 0.027), ('principal', 0.027), ('step', 0.027), ('superposition', 0.026), ('diagonal', 0.026), ('noisy', 0.026), ('identity', 0.025), ('party', 0.025), ('whitened', 0.025), ('vanishing', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
Author: James R. Voss, Luis Rademacher, Mikhail Belkin
Abstract: The performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise. In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows: 1. We develop and implement an efficient, Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. 2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. The algorithm is based on a special form of gradient iteration (different from gradient descent). We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case. 1 Introduction and Related Works In the Blind Signal Separation setting, it is assumed that observed data is drawn from an unknown distribution. The goal is to recover the latent signals under some appropriate structural assumption. A prototypical setting is the so-called cocktail party problem: in a room, there are d people speaking simultaneously and d microphones, with each microphone capturing a superposition of the voices. The objective is to recover the speech of each individual speaker. The simplest modeling assumption is to consider each speaker as producing a signal that is a random variable independent of the others, and to take the superposition to be a linear transformation independent of time. This leads to the following formalization: We observe samples from a random vector x distributed according to the equation x = As + b + η where A is a linear mixing matrix, b ∈ Rd is a constant vector, s is a latent random vector with independent coordinates, and η is an unknown random noise independent 1 of s. For simplicity, we assume A ∈ Rd×d is square and of full rank. The latent components of s are viewed as containing the information describing the makeup of the observed signal (voices of individual speakers in the cocktail party setting). The goal of Independent Component Analysis is to approximate the matrix A in order to recover the latent signal s. In practice, most methods ignore the noise term, leaving the simpler problem of recovering the mixing matrix A when x = As is observed. Arguably the two most widely used ICA algorithms are FastICA [13] and JADE [6]. Both of these algorithms are based on a two step process: (1) The data is centered and whitened, that is, made to have identity covariance matrix. This is typically done using principal component analysis (PCA) and rescaling the appropriate components. In the noiseless case this procedure orthogonalizes and rescales the independent components and thus recovers A up to an unknown orthogonal matrix R. (2) Recover the orthogonal matrix R. Most practical ICA algorithms differ only in the second step. In FastICA, various objective functions are used to perform a projection pursuit style algorithm which recovers the columns of R one at a time. JADE uses a fourth-cumulant based technique to simultaneously recover all columns of R. Step 1 of ICA is affected by the addition of a Gaussian noise. Even if the noise is white (has a scalar times identity covariance matrix) the PCA-based whitening procedure can no longer guarantee the whitening of the underlying independent components. Hence, the second step of the process is no longer justified. This failure may be even more significant if the noise is not white, which is likely to be the case in many practical situations. Recent theoretical developments (see, [2] and [3]) consider the case where the noise η is an arbitrary (not necessarily white) additive Gaussian variable drawn independently from s. In [2], it was observed that certain cumulant-based techniques for ICA can still be applied for the second step if the underlying signals can be orthogonalized.1 Orthogonalization of the latent signals (quasi-orthogonalization) is a significantly less restrictive condition as it does not force the underlying signal to have identity covariance (as in whitening in the noiseless case). In the noisy setting, the usual PCA cannot achieve quasi-orthogonalization as it will whiten the mixed signal, but not the underlying components. In [3], we show how quasi-orthogonalization can be achieved in a noise-invariant way through a method based on the fourth-order cumulant tensor. However, a direct implementation of that method requires estimating the full fourth-order cumulant tensor, which is computationally challenging even in relatively low dimensions. In this paper we derive a practical version of that algorithm based on directional Hessians of the fourth univariate cumulant, thus reducing the complexity dependence on the data dimensionality from d4 to d3 , and also allowing for a fully vectorized implementation. We also develop a fast and very simple gradient iteration (not to be confused with gradient descent) algorithm, GI-ICA, which is compatible with the quasi-orthogonalization step and can be shown to have convergence of order r − 1, when implemented using a univariate cumulant of order r. For the cumulant of order four, commonly used in practical applications, we obtain cubic convergence. We show how these convergence rates follow directly from the properties of the cumulants, which sheds some light on the somewhat surprising cubic convergence seen in fourth-order based ICA methods [13, 18, 22]. The update step has complexity O(N d) where N is the number of samples, giving a total algorithmic complexity of O(N d3 ) for step 1 and O(N d2 t) for step 2, where t is the number of iterations for convergence in the gradient iteration. Interestingly, while the techniques are quite different, our gradient iteration algorithm turns out to be closely related to Fast ICA in the noiseless setting, in the case when the data is whitened and the cumulants of order three or four are used. Thus, GI-ICA can be viewed as a generalization (and a conceptual simplification) of Fast ICA for more general quasi-orthogonalized data. We present experimental results showing superior performance in the case of data contaminated by Gaussian noise and very competitive performance for clean data. We also note that the GIICA algorithms are fast in practice, allowing us to process (decorrelate and detect the independent 1 This process of orthogonalizing the latent signals was called quasi-whitening in [2] and later in [3]. However, this conflicts with the definition of quasi-whitening given in [12] which requires the latent signals to be whitened. To avoid the confusion we will use the term quasi-orthogonalization for the process of orthogonalizing the latent signals. 2 components) 100 000 points in dimension 5 in well under a second on a standard desktop computer. Our Matlab implementation of GI-ICA is available for download at http://sourceforge. net/projects/giica/. Finally, we observe that our method is partially compatible with the robust cumulants introduced in [20]. We briefly discuss how GI-ICA can be extended using these noise-robust techniques for ICA to reduce the impact of sparse noise. The paper is organized as follows. In section 2, we discuss the relevant properties of cumulants, and discuss results from prior work which allows for the quasi-orthogonalization of signals with non-zero fourth cumulant. In section 3, we discuss the connection between the fourth-order cumulant tensor method for quasi-orthogonalization discussed in section 2 with Hessian-based techniques seen in [2] and [11]. We use this connection to create a more computationally efficient and practically implementable version of the quasi-orthogonalization algorithm discussed in section 2. In section 4, we discuss new, fast, projection-pursuit style algorithms for the second step of ICA which are compatible with quasi-orthogonalization. In order to simplify the presentation, all algorithms are stated in an abstract form as if we have exact knowledge of required distribution parameters. Section 5 discusses the estimators of required distribution parameters to be used in practice. Section 6 discusses numerical experiments demonstrating the applicability of our techniques. Related Work. The name Independent Component Analysis refers to a broad range of algorithms addressing the blind signal separation problem as well as its variants and extensions. There is an extensive literature on ICA in the signal processing and machine learning communities due to its applicability to a variety of important practical situations. For a comprehensive introduction see the books [8, 14]. In this paper we develop techniques for dealing with noisy data by introducing new and more efficient techniques for quasi-orthogonalization and subsequent component recovery. The quasi-orthogonalization step was introduced in [2], where the authors proposed an algorithm for the case when the fourth cumulants of all independent components are of the same sign. A general algorithm with complete theoretical analysis was provided in [3]. That algorithm required estimating the full fourth-order cumulant tensor. We note that Hessian based techniques for ICA were used in [21, 2, 11], with [11] and [2] using the Hessian of the fourth-order cumulant. The papers [21] and [11] proposed interesting randomized one step noise-robust ICA algorithms based on the cumulant generating function and the fourth cumulant respectively in primarily theoretical settings. The gradient iteration algorithm proposed is closely related to the work [18], which provides a gradient-based algorithm derived from the fourth moment with cubic convergence to learn an unknown parallelepiped in a cryptographic setting. For the special case of the fourth cumulant, the idea of gradient iteration has appeared in the context of FastICA with a different justification, see e.g. [16, Equation 11 and Theorem 2]. We also note the work [12], which develops methods for Gaussian noise-invariant ICA under the assumption that the noise parameters are known. Finally, there are several papers that considered the problem of performing PCA in a noisy framework. [5] gives a provably robust algorithm for PCA under a sparse noise model. [4] performs PCA robust to white Gaussian noise, and [9] performs PCA robust to white Gaussian noise and sparse noise. 2 Using Cumulants to Orthogonalize the Independent Components Properties of Cumulants: Cumulants are similar to moments and can be expressed in terms of certain polynomials of the moments. However, cumulants have additional properties which allow independent random variables to be algebraically separated. We will be interested in the fourth order multi-variate cumulants, and univariate cumulants of arbitrary order. Denote by Qx the fourth order cumulant tensor for the random vector x. So, (Qx )ijkl is the cross-cumulant between the random variables xi , xj , xk , and xl , which we alternatively denote as Cum(xi , xj , xk , xl ). Cumulant tensors are symmetric, i.e. (Qx )ijkl is invariant under permutations of indices. Multivariate cumulants have the following properties (written in the case of fourth order cumulants): 1. (Multilinearity) Cum(αxi , xj , xk , xl ) = α Cum(xi , xj , xk , xl ) for random vector x and scalar α. If y is a random variable, then Cum(xi +y, xj , xk , xl ) = Cum(xi , xj , xk , xl )+Cum(y, xj , xk , xl ). 2. (Independence) If xi and xj are independent random variables, then Cum(xi , xj , xk , xl ) = 0. When x and y are independent, Qx+y = Qx + Qy . 3. (Vanishing Gaussian) Cumulants of order 3 and above are zero for Gaussian random variables. 3 The first order cumulant is the mean, and the second order multivariate cumulant is the covariance matrix. We will denote by κr (x) the order-r univariate cumulant, which is equivalent to the crosscumulant of x with itself r times: κr (x) := Cum(x, x, . . . , x) (where x appears r times). Univariate r-cumulants are additive for independent random variables, i.e. κr (x + y) = κr (x) + κr (y), and homogeneous of degree r, i.e. κr (αx) = αr κr (x). Quasi-Orthogonalization Using Cumulant Tensors. Recalling our original notation, x = As + b + η gives the generative ICA model. We define an operation of fourth-order tensors on matrices: For Q ∈ Rd×d×d×d and M ∈ Rd×d , Q(M ) is the matrix such that d d Q(M )ij := Qijkl mlk . (1) k=1 l=1 We can use this operation to orthogonalize the latent random signals. Definition 2.1. A matrix W is called a quasi-orthogonalization matrix if there exists an orthogonal matrix R and a nonsingular diagonal matrix D such that W A = RD. We will need the following results from [3]. Here we use Aq to denote the q th column of A. Lemma 2.2. Let M ∈ Rd×d be an arbitrary matrix. Then, Qx (M ) = ADAT where D is a diagonal matrix with entries dqq = κ4 (sq )AT M Aq . q Theorem 2.3. Suppose that each component of s has non-zero fourth cumulant. Let M = Qx (I), and let C = Qx (M −1 ). Then C = ADAT where D is a diagonal matrix with entries dqq = 1/ Aq 2 . In particular, C is positive definite, and for any factorization BB T of C, B −1 is a quasi2 orthogonalization matrix. 3 Quasi-Orthogonalization using Cumulant Hessians We have seen in Theorem 2.3 a tensor-based method which can be used to quasi-orthogonalize observed data. However, this method na¨vely requires the estimation of O(d4 ) terms from data. ı There is a connection between the cumulant Hessian-based techniques used in ICA [2, 11] and the tensor-based technique for quasi-orthogonalization described in Theorem 2.3 that allows the tensor-method to be rewritten using a series of Hessian operations. We make this connection precise below. The Hessian version requires only O(d3 ) terms to be estimated from data and simplifies the computation to consist of matrix and vector operations. Let Hu denote the Hessian operator with respect to a vector u ∈ Rd . The following lemma connects Hessian methods with our tensor-matrix operation (a special case is discussed in [2, Section 2.1]). Lemma 3.1. Hu (κ4 (uT x)) = ADAT where dqq = 12(uT Aq )2 κ4 (sq ). In Lemma 3.1, the diagonal entries can be rewritten as dqq = 12κ4 (sq )(AT (uuT )Aq ). By comq paring with Lemma 2.2, we see that applying Qx against a symmetric, rank one matrix uuT can be 1 rewritten in terms of the Hessian operations: Qx (uuT ) = 12 Hu (κ4 (uT x)). This formula extends to arbitrary symmetric matrices by the following Lemma. Lemma 3.2. Let M be a symmetric matrix with eigen decomposition U ΛU T such that U = d 1 (u1 , u2 , . . . , ud ) and Λ = diag(λ1 , λ2 , . . . , λd ). Then, Qx (M ) = 12 i=1 λi Hui κ4 (uT x). i The matrices I and M −1 in Theorem 2.3 are symmetric. As such, the tensor-based method for quasi-orthogonalization can be rewritten using Hessian operations. This is done in Algorithm 1. 4 Gradient Iteration ICA In the preceding sections, we discussed techniques to quasi-orthogonalize data. For this section, we will assume that quasi-orthogonalization is accomplished, and discuss deflationary approaches that can quickly recover the directions of the independent components. Let W be a quasiorthogonalization matrix. Then, define y := W x = W As + W η. Note that since η is Gaussian noise, so is W η. There exists a rotation matrix R and a diagonal matrix D such that W A = RD. Let ˜ := Ds. The coordinates of ˜ are still independent random variables. Gaussian noise makes s s recovering the scaling matrix D impossible. We aim to recover the rotation matrix R. 4 Algorithm 1 Hessian-based algorithm to generate a quasi-orthogonalization matrix. 1: function F IND Q UASI O RTHOGONALIZATION M ATRIX(x) d 1 2: Let M = 12 i=1 Hu κ4 (uT x)|u=ei . See Equation (4) for the estimator. T 3: Let U ΛU give the eigendecomposition of M −1 d 4: Let C = i=1 λi Hu κ4 (uT x)|u=Ui . See Equation (4) for the estimator. 5: Factorize C as BB T . 6: return B −1 7: end function To see why recovery of D is impossible, we note that a white Gaussian random variable η 1 has independent components. It is impossible to distinguish between the case where η 1 is part of the signal, i.e. W A(s + η 1 ) + W η, and the case where Aη 1 is part of the additive Gaussian noise, i.e. W As + W (Aη 1 + η), when s, η 1 , and η are drawn independently. In the noise-free ICA setting, the latent signal is typically assumed to have identity covariance, placing the scaling information in the columns of A. The presence of additive Gaussian noise makes recovery of the scaling information impossible since the latent signals become ill-defined. Following the idea popularized in FastICA, we will discuss a deflationary technique to recover the columns of R one at a time. Fast Recovery of a Single Independent Component. In the deflationary approach, a function f is fixed that acts upon a directional vector u ∈ Rd . Based on some criterion (typically maximization or minimization of f ), an iterative optimization step is performed until convergence. This technique was popularized in FastICA, which is considered fast for the following reasons: 1. As an approximate Newton method, FastICA requires computation of u f and a quick-tocompute estimate of (Hu (f ))−1 at each iterative step. Due to the estimate, the computation runs in O(N d) time, where N is the number of samples. 2. The iterative step in FastICA has local quadratic order convergence using arbitrary functions, and global cubic-order convergence when using the fourth cumulant [13]. We note that cubic convergence rates are not unique to FastICA and have been seen using gradient descent (with the correct step-size) when choosing f as the fourth moment [18]. Our proposed deflationary algorithm will be comparable with FastICA in terms of computational complexity, and the iterative step will take on a conceptually simpler form as it only relies on u κr . We provide a derivation of fast convergence rates that relies entirely on the properties of cumulants. As cumulants are invariant with respect to the additive Gaussian noise, the proposed methods will be admissible for both standard and noisy ICA. While cumulants are essentially unique with the additivity and homogeneity properties [17] when no restrictions are made on the probability space, the preprocessing step of ICA gives additional structure (like orthogonality and centering), providing additional admissible functions. In particular, [20] designs “robust cumulants” which are only minimally effected by sparse noise. Welling’s robust cumulants have versions of the additivity and homogeneity properties, and are consistent with our update step. For this reason, we will state our results in greater generality. Let G be a function of univariate random variables that satisfies the additivity, degree-r (r ≥ 3) homogeneity, and (for the noisy case) the vanishing Gaussians properties of cumulants. Then for a generic choice of input vector v, Algorithm 2 will demonstrate order r−1 convergence. In particular, if G is κ3 , then we obtain quadratic convergence; and if G is κ4 , we obtain cubic convergence. Lemma 4.1 helps explain why this is true. Lemma 4.1. v G(v · y) = r d i=1 (v · Ri )r−1 G(˜i )Ri . s If we consider what is happening in the basis of the columns of R, then up to some multiplicative constant, each coordinate is raised to the r − 1 power and then renormalized during each step of Algorithm 2. This ultimately leads to the order r − 1 convergence. Theorem 4.2. If for a unit vector input v to Algorithm 2 h = arg maxi |(v · Ri )r−2 G(˜i )| has a s unique answer, then v has order r − 1 convergence to Rh up to sign. In particular, if the following conditions are met: (1) There exists a coordinate random variable si of s such that G(si ) = 0. (2) v inputted into Algorithm 2 is chosen uniformly at random from the unit sphere S d−1 . Then Algorithm 2 converges to a column of R (up to sign) almost surely, and convergence is of order r − 1. 5 Algorithm 2 A fast algorithm to recover a single column of R when v is drawn generically from the unit sphere. Equations (2) and (3) provide k-statistic based estimates of v κ3 and v κ4 , which can be used as practical choices of v G on real data. 1: function GI-ICA(v, y) 2: repeat 3: v ← v G(vT y) 4: v ← v/ v 2 5: until Convergence return v 6: end function ˜ Algorithm 3 Algorithm for ICA in the presence of Gaussian noise. A recovers A up to column order and scaling. RT W is the demixing matrix for the observed random vector x. function G AUSSIAN ROBUST ICA(G, x) W = F IND Q UASI O RTHOGONALIZATION M ATRIX(x) y = Wx R columns = ∅ for i = 1 to d do Draw v from S d−1 ∩ span(R columns)⊥ uniformly at random. R columns = R columns ∪ {GI-ICA(v, y)} end for Construct a matrix R using the elements of R columns as columns. ˜ = RT y s ˜ A = (RT W )−1 ˜ s return A, ˜ end function By convergence up to sign, we include the possibility that v oscillates between Rh and −Rh on alternating steps. This can occur if G(˜i ) < 0 and r is odd. Due to space limitations, the proof is s omitted. Recovering all Independent Components. As a Corollary to Theorem 4.2 we get: Corollary 4.3. Suppose R1 , R2 , . . . , Rk are known for some k < d. Suppose there exists i > k such that G(si ) = 0. If v is drawn uniformly at random from S d−1 ∩ span(R1 , . . . , Rk )⊥ where S d−1 denotes the unit sphere in Rd , then Algorithm 2 with input v converges to a new column of R almost surely. Since the indexing of R is arbitrary, Corollary 4.3 gives a solution to noisy ICA, in Algorithm 3. In practice (not required by the theory), it may be better to enforce orthogonality between the columns of R, by orthogonalizing v against previously found columns of R at the end of each step in Algorithm 2. We expect the fourth or third cumulant function will typically be chosen for G. 5 Time Complexity Analysis and Estimation of Cumulants To implement Algorithms 1 and 2 requires the estimation of functions from data. We will limit our discussion to estimation of the third and fourth cumulants, as lower order cumulants are more statistically stable to estimate than higher order cumulants. κ3 is useful in Algorithm 2 for nonsymmetric distributions. However, since κ3 (si ) = 0 whenever si is a symmetric distribution, it is plausible that κ3 would not recover all columns of R. When s is suspected of being symmetric, it is prudent to use κ4 for G. Alternatively, one can fall back to κ4 from κ3 when κ3 is detected to be near 0. Denote by z (1) , z (2) , . . . , z (N ) the observed samples of a random variable z. Given a sample, each cumulant can be estimated in an unbiased fashion by its k-statistic. Denote by kr (z (i) ) the kN 1 statistic sample estimate of κr (z). Letting mr (z (i) ) := N i=1 (z (i) − z )r give the rth sample ¯ central moment, then N 2 m3 (z (i) ) (N + 1)m4 (z (i) ) − 3(N − 1)m2 (z (i) )2 k3 (z (i) ) := , k4 (z (i) ) := N 2 (N − 1)(N − 2) (N − 1)(N − 2)(N − 3) 6 gives the third and fourth k-statistics [15]. However, we are interested in estimating the gradients (for Algorithm 2) and Hessians (for Algorithm 1) of the cumulants rather than the cumulants themselves. The following Lemma shows how to obtain unbiased estimates: Lemma 5.1. Let z be a d-dimensional random vector with finite moments up to order r. Let z(i) be α an iid sample of z. Let α ∈ Nd be a multi-index. Then ∂u kr (u · z(i) ) is an unbiased estimate for α ∂u κr (u · z). If we mean-subtract (via the sample mean) all observed random variables, then the resulting estimates are: N u k3 (u · y) = (N − 1)−1 (N − 2)−1 3N (u · y(i) )2 y(i) (2) i=1 u k4 (u · y) = N2 (N − 1)(N − 2)(N − 3) −12 Hu k4 (u · x) = N −1 − N2 N −1 N2 N +1 N N N ((u · y(i) ))3 y(i) i=1 N (u · y(i) )2 i=1 12N 2 (N − 1)(N − 2)(N − 3) N 4 (u · y(i) )y(i) (3) i=1 N +1 N N 2N − 2 (u · x(i) )2 (xxT )(i) − N2 i=1 i=1 N ((u · x(i) ))2 (xxT )(i) (4) i=1 N (u · x(i) )x(i) i=1 T N (u · x(i) )x(i) i=1 Using (4) to estimate Hu κ4 (uT x) from data when implementing Algorithm 1, the resulting quasiorthogonalization algorithm runs in O(N d3 ) time. Using (2) or (3) to estimate u G(vT y) (with G chosen to be κ3 or κ4 respectively) when implementing Algorithm 2 gives an update step that runs in O(N d) time. If t bounds the number of iterations to convergence in Algorithm 2, then O(N d2 t) steps are required to recover all columns of R once quasi-orthogonalization has been achieved. 6 Simulation Results In Figure 1, we compare our algorithms to the baselines JADE [7] and versions of FastICA [10], using the code made available by the authors. Except for the choice of the contrast function for FastICA the baselines were run using default settings. All tests were done using artificially generated data. In implementing our algorithms (available at [19]), we opted to enforce orthogonality during the update step of Algorithm 2 with previously found columns of R. In Figure 1, comparison on five distributions indicates that each of the independent coordinates was generated from a distinct distribution among the Laplace distribution, the Bernoulli distribution with parameter 0.5, the tdistribution with 5 degrees of freedom, the exponential distribution, and the continuous uniform distribution. Most of these distributions are symmetric, making GI-κ3 inadmissible. When generating data for the ICA algorithm, we generate a random mixing matrix A with condition number 10 (minimum singular value 1 and maximum singular value 10), and intermediate singular values chosen uniformly at random. The noise magnitude indicates the strength of an additive white Gaussian noise. We define 100% noise magnitude to mean variance 10, with 25% noise and 50% noise indicating variances 2.5 and 5 respectively. Performance was measured using the Amari Index ˆ introduced in [1]. Let B denote the approximate demixing matrix returned by an ICA algorithm, |m | n n ˆ and let M = BA. Then, the Amari index is given by: E := i=1 j=1 maxk ij ik | − 1 + |m n j=1 n i=1 |mij | maxk |mkj | − 1 . The Amari index takes on values between 0 and the dimensionality d. It can be roughly viewed as the distance of M from the nearest scaled permutation matrix P D (where P is a permutation matrix and D is a diagonal matrix). From the noiseles data, we see that quasi-orthogonalization requires more data than whitening in order to provide accurate results. Once sufficient data is provided, all fourth order methods (GI-κ4 , JADE, and κ4 -FastICA) perform comparably. The difference between GI-κ4 and κ4 -FastICA is not 7 ICA Comparison on 5 distributions (d=5, noisless data) ICA Comparison on 5 distributions (d=5, 25% noise magnitude) 1.00 ICA Comparison on 5 distributions (d=5, 50% noise magnitude) 1.00 1.00 GI−κ4 (quasi−orthogonal) κ4−FastICA κ4−FastICA κ4−FastICA log cosh−FastICA JADE log cosh−FastICA JADE log cosh−FastICA JADE 0.10 0.01 100 0.10 0.01 100 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, noisless data) 10.00 Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) Amari Index GI−κ4 (white) 0.10 0.01 100 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, 25% noise magnitude) 10.00 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, 50% noise magnitude) 10.00 GI−κ4 (white) GI−κ4 (quasi−orthogonal) κ4−FastICA log cosh−FastICA JADE log cosh−FastICA JADE 0.10 0.01 100 0.10 1000 10000 Number of Samples 100000 κ4−FastICA log cosh−FastICA JADE 1.00 Amari Index 1.00 Amari Index Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) κ4−FastICA 1.00 GI−κ4 (white) GI−κ4 (quasi−orthogonal) 0.01 100 0.10 1000 10000 Number of Samples 100000 0.01 100 1000 10000 Number of Samples 100000 Figure 1: Comparison of ICA algorithms under various levels of noise. White and quasi-orthogonal refer to the choice of the first step of ICA. All baseline algorithms use whitening. Reported Amari indices denote the mean Amari index over 50 runs on different draws of both A and the data. d gives the data dimensionality, with two copies of each distribution used when d = 10. statistically significant over 50 runs with 100 000 samples. We note that GI-κ4 under whitening and κ4 -FastICA have the same update step (up to a slightly different choice of estimators), with GI-κ4 differing to allow for quasi-orthogonalization. Where provided, the error bars give a 2σ confidence interval on the mean Amari index. In all cases, error bars for our algorithms are provided, and error bars for the baseline algorithms are provided when they do not hinder readability. It is clear that all algorithms degrade with the addition of Gaussian noise. However, GI-κ4 under quasi-orthogonalization degrades far less when given sufficient samples. For this reason, the quasi-orthogonalized GI-κ4 outperforms all other algorithms (given sufficient samples) including the log cosh-FastICA, which performs best in the noiseless case. Contrasting the performance of GIκ4 under whitening with itself under quasi-orthogonalization, it is clear that quasi-orthogonalization is necessary to be robust to Gaussian noise. Run times were indeed reasonably fast. For 100 000 samples on the varied distributions (d = 5) with 50% Gaussian noise magnitude, GI-κ4 (including the orthogonalization step) had an average running time2 of 0.19 seconds using PCA whitening, and 0.23 seconds under quasi-orthogonalization. The corresponding average number of iterations to convergence per independent component (at 0.0001 error) were 4.16 and 4.08. In the following table, we report the mean number of steps to convergence (per independent component) over the 50 runs for the 50% noise distribution (d = 5), and note that once sufficiently many samples were taken, the number of steps to convergence becomes remarkably small. Number of data pts whitening+GI-κ4 : mean num steps quasi-orth.+GI-κ4 : mean num steps 7 500 11.76 213.92 1000 5.92 65.95 5000 4.99 4.48 10000 4.59 4.36 Acknowledgments This work was supported by NSF grant IIS 1117707. 2 Using a standard desktop with an i7-2600 3.4 GHz CPU and 16 GB RAM. 8 50000 4.35 4.06 100000 4.16 4.08 References [1] S. Amari, A. Cichocki, H. H. Yang, et al. A new learning algorithm for blind signal separation. Advances in neural information processing systems, pages 757–763, 1996. [2] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384–2392, 2012. [3] M. Belkin, L. Rademacher, and J. Voss. Blind signal separation in the presence of Gaussian noise. In JMLR W&CP;, volume 30: COLT, pages 270–287, 2013. [4] C. M. Bishop. Variational principal components. Proc. Ninth Int. Conf. on Articial Neural Networks. ICANN, 1:509–514, 1999. [5] E. J. Cand` s, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? CoRR, e abs/0912.3599, 2009. [6] J. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362–370. IET, 1993. [7] J.-F. Cardoso and A. Souloumiac. Matlab JADE for real-valued data v 1.8. http:// perso.telecom-paristech.fr/˜cardoso/Algo/Jade/jadeR.m, 2005. [Online; accessed 8-May-2013]. [8] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010. [9] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. Image Processing, IEEE Transactions on, 20(12):3419–3430, 2011. [10] H. G¨ vert, J. Hurri, J. S¨ rel¨ , and A. Hyv¨ rinen. Matlab FastICA v 2.5. http:// a a a a research.ics.aalto.fi/ica/fastica/code/dlcode.shtml, 2005. [Online; accessed 1-May-2013]. [11] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In ITCS, pages 11–20, 2013. [12] A. Hyv¨ rinen. Independent component analysis in the presence of Gaussian noise by maxia mizing joint likelihood. Neurocomputing, 22(1-3):49–67, 1998. [13] A. Hyv¨ rinen. Fast and robust fixed-point algorithms for independent component analysis. a IEEE Transactions on Neural Networks, 10(3):626–634, 1999. [14] A. Hyv¨ rinen and E. Oja. Independent component analysis: Algorithms and applications. a Neural Networks, 13(4-5):411–430, 2000. [15] J. F. Kenney and E. S. Keeping. Mathematics of Statistics, part 2. van Nostrand, 1962. [16] H. Li and T. Adali. A class of complex ICA algorithms based on the kurtosis cost function. IEEE Transactions on Neural Networks, 19(3):408–420, 2008. [17] L. Mafttner. What are cumulants. Documenta Mathematica, 4:601–622, 1999. [18] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. J. Cryptology, 22(2):139–160, 2009. [19] J. Voss, L. Rademacher, and M. Belkin. Matlab GI-ICA implementation. sourceforge.net/projects/giica/, 2013. [Online]. http:// [20] M. Welling. Robust higher order statistics. In Tenth International Workshop on Artificial Intelligence and Statistics, pages 405–412, 2005. [21] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000. [22] V. Zarzoso and P. Comon. How fast is FastICA. EUSIPCO, 2006. 9
2 0.072113499 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
Author: Tzu-Kuo Huang, Jeff Schneider
Abstract: Learning dynamic models from observed data has been a central issue in many scientific studies or engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such Alzheimer’s, or certain biological processes. Existing methods for learning dynamic model from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method [2] to provably recover firstorder Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. Preliminary simulation results confirm our theoretical findings. 1
3 0.070047744 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
4 0.066893935 188 nips-2013-Memory Limited, Streaming PCA
Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain
Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1
5 0.064905025 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
Author: Yuhong Guo
Abstract: Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints. 1
6 0.06013713 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
7 0.056924772 91 nips-2013-Dirty Statistical Models
8 0.054039683 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
9 0.053233188 171 nips-2013-Learning with Noisy Labels
10 0.050315723 193 nips-2013-Mixed Optimization for Smooth Functions
11 0.049847614 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
12 0.049353573 233 nips-2013-Online Robust PCA via Stochastic Optimization
13 0.04888545 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
14 0.046980128 232 nips-2013-Online PCA for Contaminated Data
15 0.046664022 318 nips-2013-Structured Learning via Logistic Regression
16 0.045415491 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
17 0.045290668 59 nips-2013-Blind Calibration in Compressed Sensing using Message Passing Algorithms
18 0.043586433 9 nips-2013-A Kernel Test for Three-Variable Interactions
19 0.043173928 186 nips-2013-Matrix factorization with binary components
20 0.041944876 75 nips-2013-Convex Two-Layer Modeling
topicId topicWeight
[(0, 0.137), (1, 0.046), (2, 0.042), (3, 0.053), (4, -0.015), (5, -0.004), (6, -0.029), (7, 0.032), (8, -0.011), (9, -0.009), (10, -0.003), (11, 0.018), (12, 0.011), (13, 0.002), (14, -0.0), (15, 0.016), (16, -0.009), (17, 0.005), (18, -0.029), (19, -0.009), (20, 0.004), (21, -0.005), (22, -0.002), (23, 0.002), (24, 0.034), (25, -0.02), (26, 0.01), (27, 0.036), (28, 0.028), (29, -0.016), (30, -0.007), (31, 0.033), (32, -0.013), (33, 0.062), (34, -0.003), (35, -0.08), (36, -0.023), (37, -0.014), (38, 0.028), (39, 0.051), (40, 0.004), (41, -0.02), (42, 0.015), (43, -0.021), (44, 0.015), (45, 0.047), (46, -0.015), (47, -0.017), (48, -0.059), (49, -0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.8841204 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
Author: James R. Voss, Luis Rademacher, Mikhail Belkin
Abstract: The performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise. In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows: 1. We develop and implement an efficient, Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. 2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. The algorithm is based on a special form of gradient iteration (different from gradient descent). We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case. 1 Introduction and Related Works In the Blind Signal Separation setting, it is assumed that observed data is drawn from an unknown distribution. The goal is to recover the latent signals under some appropriate structural assumption. A prototypical setting is the so-called cocktail party problem: in a room, there are d people speaking simultaneously and d microphones, with each microphone capturing a superposition of the voices. The objective is to recover the speech of each individual speaker. The simplest modeling assumption is to consider each speaker as producing a signal that is a random variable independent of the others, and to take the superposition to be a linear transformation independent of time. This leads to the following formalization: We observe samples from a random vector x distributed according to the equation x = As + b + η where A is a linear mixing matrix, b ∈ Rd is a constant vector, s is a latent random vector with independent coordinates, and η is an unknown random noise independent 1 of s. For simplicity, we assume A ∈ Rd×d is square and of full rank. The latent components of s are viewed as containing the information describing the makeup of the observed signal (voices of individual speakers in the cocktail party setting). The goal of Independent Component Analysis is to approximate the matrix A in order to recover the latent signal s. In practice, most methods ignore the noise term, leaving the simpler problem of recovering the mixing matrix A when x = As is observed. Arguably the two most widely used ICA algorithms are FastICA [13] and JADE [6]. Both of these algorithms are based on a two step process: (1) The data is centered and whitened, that is, made to have identity covariance matrix. This is typically done using principal component analysis (PCA) and rescaling the appropriate components. In the noiseless case this procedure orthogonalizes and rescales the independent components and thus recovers A up to an unknown orthogonal matrix R. (2) Recover the orthogonal matrix R. Most practical ICA algorithms differ only in the second step. In FastICA, various objective functions are used to perform a projection pursuit style algorithm which recovers the columns of R one at a time. JADE uses a fourth-cumulant based technique to simultaneously recover all columns of R. Step 1 of ICA is affected by the addition of a Gaussian noise. Even if the noise is white (has a scalar times identity covariance matrix) the PCA-based whitening procedure can no longer guarantee the whitening of the underlying independent components. Hence, the second step of the process is no longer justified. This failure may be even more significant if the noise is not white, which is likely to be the case in many practical situations. Recent theoretical developments (see, [2] and [3]) consider the case where the noise η is an arbitrary (not necessarily white) additive Gaussian variable drawn independently from s. In [2], it was observed that certain cumulant-based techniques for ICA can still be applied for the second step if the underlying signals can be orthogonalized.1 Orthogonalization of the latent signals (quasi-orthogonalization) is a significantly less restrictive condition as it does not force the underlying signal to have identity covariance (as in whitening in the noiseless case). In the noisy setting, the usual PCA cannot achieve quasi-orthogonalization as it will whiten the mixed signal, but not the underlying components. In [3], we show how quasi-orthogonalization can be achieved in a noise-invariant way through a method based on the fourth-order cumulant tensor. However, a direct implementation of that method requires estimating the full fourth-order cumulant tensor, which is computationally challenging even in relatively low dimensions. In this paper we derive a practical version of that algorithm based on directional Hessians of the fourth univariate cumulant, thus reducing the complexity dependence on the data dimensionality from d4 to d3 , and also allowing for a fully vectorized implementation. We also develop a fast and very simple gradient iteration (not to be confused with gradient descent) algorithm, GI-ICA, which is compatible with the quasi-orthogonalization step and can be shown to have convergence of order r − 1, when implemented using a univariate cumulant of order r. For the cumulant of order four, commonly used in practical applications, we obtain cubic convergence. We show how these convergence rates follow directly from the properties of the cumulants, which sheds some light on the somewhat surprising cubic convergence seen in fourth-order based ICA methods [13, 18, 22]. The update step has complexity O(N d) where N is the number of samples, giving a total algorithmic complexity of O(N d3 ) for step 1 and O(N d2 t) for step 2, where t is the number of iterations for convergence in the gradient iteration. Interestingly, while the techniques are quite different, our gradient iteration algorithm turns out to be closely related to Fast ICA in the noiseless setting, in the case when the data is whitened and the cumulants of order three or four are used. Thus, GI-ICA can be viewed as a generalization (and a conceptual simplification) of Fast ICA for more general quasi-orthogonalized data. We present experimental results showing superior performance in the case of data contaminated by Gaussian noise and very competitive performance for clean data. We also note that the GIICA algorithms are fast in practice, allowing us to process (decorrelate and detect the independent 1 This process of orthogonalizing the latent signals was called quasi-whitening in [2] and later in [3]. However, this conflicts with the definition of quasi-whitening given in [12] which requires the latent signals to be whitened. To avoid the confusion we will use the term quasi-orthogonalization for the process of orthogonalizing the latent signals. 2 components) 100 000 points in dimension 5 in well under a second on a standard desktop computer. Our Matlab implementation of GI-ICA is available for download at http://sourceforge. net/projects/giica/. Finally, we observe that our method is partially compatible with the robust cumulants introduced in [20]. We briefly discuss how GI-ICA can be extended using these noise-robust techniques for ICA to reduce the impact of sparse noise. The paper is organized as follows. In section 2, we discuss the relevant properties of cumulants, and discuss results from prior work which allows for the quasi-orthogonalization of signals with non-zero fourth cumulant. In section 3, we discuss the connection between the fourth-order cumulant tensor method for quasi-orthogonalization discussed in section 2 with Hessian-based techniques seen in [2] and [11]. We use this connection to create a more computationally efficient and practically implementable version of the quasi-orthogonalization algorithm discussed in section 2. In section 4, we discuss new, fast, projection-pursuit style algorithms for the second step of ICA which are compatible with quasi-orthogonalization. In order to simplify the presentation, all algorithms are stated in an abstract form as if we have exact knowledge of required distribution parameters. Section 5 discusses the estimators of required distribution parameters to be used in practice. Section 6 discusses numerical experiments demonstrating the applicability of our techniques. Related Work. The name Independent Component Analysis refers to a broad range of algorithms addressing the blind signal separation problem as well as its variants and extensions. There is an extensive literature on ICA in the signal processing and machine learning communities due to its applicability to a variety of important practical situations. For a comprehensive introduction see the books [8, 14]. In this paper we develop techniques for dealing with noisy data by introducing new and more efficient techniques for quasi-orthogonalization and subsequent component recovery. The quasi-orthogonalization step was introduced in [2], where the authors proposed an algorithm for the case when the fourth cumulants of all independent components are of the same sign. A general algorithm with complete theoretical analysis was provided in [3]. That algorithm required estimating the full fourth-order cumulant tensor. We note that Hessian based techniques for ICA were used in [21, 2, 11], with [11] and [2] using the Hessian of the fourth-order cumulant. The papers [21] and [11] proposed interesting randomized one step noise-robust ICA algorithms based on the cumulant generating function and the fourth cumulant respectively in primarily theoretical settings. The gradient iteration algorithm proposed is closely related to the work [18], which provides a gradient-based algorithm derived from the fourth moment with cubic convergence to learn an unknown parallelepiped in a cryptographic setting. For the special case of the fourth cumulant, the idea of gradient iteration has appeared in the context of FastICA with a different justification, see e.g. [16, Equation 11 and Theorem 2]. We also note the work [12], which develops methods for Gaussian noise-invariant ICA under the assumption that the noise parameters are known. Finally, there are several papers that considered the problem of performing PCA in a noisy framework. [5] gives a provably robust algorithm for PCA under a sparse noise model. [4] performs PCA robust to white Gaussian noise, and [9] performs PCA robust to white Gaussian noise and sparse noise. 2 Using Cumulants to Orthogonalize the Independent Components Properties of Cumulants: Cumulants are similar to moments and can be expressed in terms of certain polynomials of the moments. However, cumulants have additional properties which allow independent random variables to be algebraically separated. We will be interested in the fourth order multi-variate cumulants, and univariate cumulants of arbitrary order. Denote by Qx the fourth order cumulant tensor for the random vector x. So, (Qx )ijkl is the cross-cumulant between the random variables xi , xj , xk , and xl , which we alternatively denote as Cum(xi , xj , xk , xl ). Cumulant tensors are symmetric, i.e. (Qx )ijkl is invariant under permutations of indices. Multivariate cumulants have the following properties (written in the case of fourth order cumulants): 1. (Multilinearity) Cum(αxi , xj , xk , xl ) = α Cum(xi , xj , xk , xl ) for random vector x and scalar α. If y is a random variable, then Cum(xi +y, xj , xk , xl ) = Cum(xi , xj , xk , xl )+Cum(y, xj , xk , xl ). 2. (Independence) If xi and xj are independent random variables, then Cum(xi , xj , xk , xl ) = 0. When x and y are independent, Qx+y = Qx + Qy . 3. (Vanishing Gaussian) Cumulants of order 3 and above are zero for Gaussian random variables. 3 The first order cumulant is the mean, and the second order multivariate cumulant is the covariance matrix. We will denote by κr (x) the order-r univariate cumulant, which is equivalent to the crosscumulant of x with itself r times: κr (x) := Cum(x, x, . . . , x) (where x appears r times). Univariate r-cumulants are additive for independent random variables, i.e. κr (x + y) = κr (x) + κr (y), and homogeneous of degree r, i.e. κr (αx) = αr κr (x). Quasi-Orthogonalization Using Cumulant Tensors. Recalling our original notation, x = As + b + η gives the generative ICA model. We define an operation of fourth-order tensors on matrices: For Q ∈ Rd×d×d×d and M ∈ Rd×d , Q(M ) is the matrix such that d d Q(M )ij := Qijkl mlk . (1) k=1 l=1 We can use this operation to orthogonalize the latent random signals. Definition 2.1. A matrix W is called a quasi-orthogonalization matrix if there exists an orthogonal matrix R and a nonsingular diagonal matrix D such that W A = RD. We will need the following results from [3]. Here we use Aq to denote the q th column of A. Lemma 2.2. Let M ∈ Rd×d be an arbitrary matrix. Then, Qx (M ) = ADAT where D is a diagonal matrix with entries dqq = κ4 (sq )AT M Aq . q Theorem 2.3. Suppose that each component of s has non-zero fourth cumulant. Let M = Qx (I), and let C = Qx (M −1 ). Then C = ADAT where D is a diagonal matrix with entries dqq = 1/ Aq 2 . In particular, C is positive definite, and for any factorization BB T of C, B −1 is a quasi2 orthogonalization matrix. 3 Quasi-Orthogonalization using Cumulant Hessians We have seen in Theorem 2.3 a tensor-based method which can be used to quasi-orthogonalize observed data. However, this method na¨vely requires the estimation of O(d4 ) terms from data. ı There is a connection between the cumulant Hessian-based techniques used in ICA [2, 11] and the tensor-based technique for quasi-orthogonalization described in Theorem 2.3 that allows the tensor-method to be rewritten using a series of Hessian operations. We make this connection precise below. The Hessian version requires only O(d3 ) terms to be estimated from data and simplifies the computation to consist of matrix and vector operations. Let Hu denote the Hessian operator with respect to a vector u ∈ Rd . The following lemma connects Hessian methods with our tensor-matrix operation (a special case is discussed in [2, Section 2.1]). Lemma 3.1. Hu (κ4 (uT x)) = ADAT where dqq = 12(uT Aq )2 κ4 (sq ). In Lemma 3.1, the diagonal entries can be rewritten as dqq = 12κ4 (sq )(AT (uuT )Aq ). By comq paring with Lemma 2.2, we see that applying Qx against a symmetric, rank one matrix uuT can be 1 rewritten in terms of the Hessian operations: Qx (uuT ) = 12 Hu (κ4 (uT x)). This formula extends to arbitrary symmetric matrices by the following Lemma. Lemma 3.2. Let M be a symmetric matrix with eigen decomposition U ΛU T such that U = d 1 (u1 , u2 , . . . , ud ) and Λ = diag(λ1 , λ2 , . . . , λd ). Then, Qx (M ) = 12 i=1 λi Hui κ4 (uT x). i The matrices I and M −1 in Theorem 2.3 are symmetric. As such, the tensor-based method for quasi-orthogonalization can be rewritten using Hessian operations. This is done in Algorithm 1. 4 Gradient Iteration ICA In the preceding sections, we discussed techniques to quasi-orthogonalize data. For this section, we will assume that quasi-orthogonalization is accomplished, and discuss deflationary approaches that can quickly recover the directions of the independent components. Let W be a quasiorthogonalization matrix. Then, define y := W x = W As + W η. Note that since η is Gaussian noise, so is W η. There exists a rotation matrix R and a diagonal matrix D such that W A = RD. Let ˜ := Ds. The coordinates of ˜ are still independent random variables. Gaussian noise makes s s recovering the scaling matrix D impossible. We aim to recover the rotation matrix R. 4 Algorithm 1 Hessian-based algorithm to generate a quasi-orthogonalization matrix. 1: function F IND Q UASI O RTHOGONALIZATION M ATRIX(x) d 1 2: Let M = 12 i=1 Hu κ4 (uT x)|u=ei . See Equation (4) for the estimator. T 3: Let U ΛU give the eigendecomposition of M −1 d 4: Let C = i=1 λi Hu κ4 (uT x)|u=Ui . See Equation (4) for the estimator. 5: Factorize C as BB T . 6: return B −1 7: end function To see why recovery of D is impossible, we note that a white Gaussian random variable η 1 has independent components. It is impossible to distinguish between the case where η 1 is part of the signal, i.e. W A(s + η 1 ) + W η, and the case where Aη 1 is part of the additive Gaussian noise, i.e. W As + W (Aη 1 + η), when s, η 1 , and η are drawn independently. In the noise-free ICA setting, the latent signal is typically assumed to have identity covariance, placing the scaling information in the columns of A. The presence of additive Gaussian noise makes recovery of the scaling information impossible since the latent signals become ill-defined. Following the idea popularized in FastICA, we will discuss a deflationary technique to recover the columns of R one at a time. Fast Recovery of a Single Independent Component. In the deflationary approach, a function f is fixed that acts upon a directional vector u ∈ Rd . Based on some criterion (typically maximization or minimization of f ), an iterative optimization step is performed until convergence. This technique was popularized in FastICA, which is considered fast for the following reasons: 1. As an approximate Newton method, FastICA requires computation of u f and a quick-tocompute estimate of (Hu (f ))−1 at each iterative step. Due to the estimate, the computation runs in O(N d) time, where N is the number of samples. 2. The iterative step in FastICA has local quadratic order convergence using arbitrary functions, and global cubic-order convergence when using the fourth cumulant [13]. We note that cubic convergence rates are not unique to FastICA and have been seen using gradient descent (with the correct step-size) when choosing f as the fourth moment [18]. Our proposed deflationary algorithm will be comparable with FastICA in terms of computational complexity, and the iterative step will take on a conceptually simpler form as it only relies on u κr . We provide a derivation of fast convergence rates that relies entirely on the properties of cumulants. As cumulants are invariant with respect to the additive Gaussian noise, the proposed methods will be admissible for both standard and noisy ICA. While cumulants are essentially unique with the additivity and homogeneity properties [17] when no restrictions are made on the probability space, the preprocessing step of ICA gives additional structure (like orthogonality and centering), providing additional admissible functions. In particular, [20] designs “robust cumulants” which are only minimally effected by sparse noise. Welling’s robust cumulants have versions of the additivity and homogeneity properties, and are consistent with our update step. For this reason, we will state our results in greater generality. Let G be a function of univariate random variables that satisfies the additivity, degree-r (r ≥ 3) homogeneity, and (for the noisy case) the vanishing Gaussians properties of cumulants. Then for a generic choice of input vector v, Algorithm 2 will demonstrate order r−1 convergence. In particular, if G is κ3 , then we obtain quadratic convergence; and if G is κ4 , we obtain cubic convergence. Lemma 4.1 helps explain why this is true. Lemma 4.1. v G(v · y) = r d i=1 (v · Ri )r−1 G(˜i )Ri . s If we consider what is happening in the basis of the columns of R, then up to some multiplicative constant, each coordinate is raised to the r − 1 power and then renormalized during each step of Algorithm 2. This ultimately leads to the order r − 1 convergence. Theorem 4.2. If for a unit vector input v to Algorithm 2 h = arg maxi |(v · Ri )r−2 G(˜i )| has a s unique answer, then v has order r − 1 convergence to Rh up to sign. In particular, if the following conditions are met: (1) There exists a coordinate random variable si of s such that G(si ) = 0. (2) v inputted into Algorithm 2 is chosen uniformly at random from the unit sphere S d−1 . Then Algorithm 2 converges to a column of R (up to sign) almost surely, and convergence is of order r − 1. 5 Algorithm 2 A fast algorithm to recover a single column of R when v is drawn generically from the unit sphere. Equations (2) and (3) provide k-statistic based estimates of v κ3 and v κ4 , which can be used as practical choices of v G on real data. 1: function GI-ICA(v, y) 2: repeat 3: v ← v G(vT y) 4: v ← v/ v 2 5: until Convergence return v 6: end function ˜ Algorithm 3 Algorithm for ICA in the presence of Gaussian noise. A recovers A up to column order and scaling. RT W is the demixing matrix for the observed random vector x. function G AUSSIAN ROBUST ICA(G, x) W = F IND Q UASI O RTHOGONALIZATION M ATRIX(x) y = Wx R columns = ∅ for i = 1 to d do Draw v from S d−1 ∩ span(R columns)⊥ uniformly at random. R columns = R columns ∪ {GI-ICA(v, y)} end for Construct a matrix R using the elements of R columns as columns. ˜ = RT y s ˜ A = (RT W )−1 ˜ s return A, ˜ end function By convergence up to sign, we include the possibility that v oscillates between Rh and −Rh on alternating steps. This can occur if G(˜i ) < 0 and r is odd. Due to space limitations, the proof is s omitted. Recovering all Independent Components. As a Corollary to Theorem 4.2 we get: Corollary 4.3. Suppose R1 , R2 , . . . , Rk are known for some k < d. Suppose there exists i > k such that G(si ) = 0. If v is drawn uniformly at random from S d−1 ∩ span(R1 , . . . , Rk )⊥ where S d−1 denotes the unit sphere in Rd , then Algorithm 2 with input v converges to a new column of R almost surely. Since the indexing of R is arbitrary, Corollary 4.3 gives a solution to noisy ICA, in Algorithm 3. In practice (not required by the theory), it may be better to enforce orthogonality between the columns of R, by orthogonalizing v against previously found columns of R at the end of each step in Algorithm 2. We expect the fourth or third cumulant function will typically be chosen for G. 5 Time Complexity Analysis and Estimation of Cumulants To implement Algorithms 1 and 2 requires the estimation of functions from data. We will limit our discussion to estimation of the third and fourth cumulants, as lower order cumulants are more statistically stable to estimate than higher order cumulants. κ3 is useful in Algorithm 2 for nonsymmetric distributions. However, since κ3 (si ) = 0 whenever si is a symmetric distribution, it is plausible that κ3 would not recover all columns of R. When s is suspected of being symmetric, it is prudent to use κ4 for G. Alternatively, one can fall back to κ4 from κ3 when κ3 is detected to be near 0. Denote by z (1) , z (2) , . . . , z (N ) the observed samples of a random variable z. Given a sample, each cumulant can be estimated in an unbiased fashion by its k-statistic. Denote by kr (z (i) ) the kN 1 statistic sample estimate of κr (z). Letting mr (z (i) ) := N i=1 (z (i) − z )r give the rth sample ¯ central moment, then N 2 m3 (z (i) ) (N + 1)m4 (z (i) ) − 3(N − 1)m2 (z (i) )2 k3 (z (i) ) := , k4 (z (i) ) := N 2 (N − 1)(N − 2) (N − 1)(N − 2)(N − 3) 6 gives the third and fourth k-statistics [15]. However, we are interested in estimating the gradients (for Algorithm 2) and Hessians (for Algorithm 1) of the cumulants rather than the cumulants themselves. The following Lemma shows how to obtain unbiased estimates: Lemma 5.1. Let z be a d-dimensional random vector with finite moments up to order r. Let z(i) be α an iid sample of z. Let α ∈ Nd be a multi-index. Then ∂u kr (u · z(i) ) is an unbiased estimate for α ∂u κr (u · z). If we mean-subtract (via the sample mean) all observed random variables, then the resulting estimates are: N u k3 (u · y) = (N − 1)−1 (N − 2)−1 3N (u · y(i) )2 y(i) (2) i=1 u k4 (u · y) = N2 (N − 1)(N − 2)(N − 3) −12 Hu k4 (u · x) = N −1 − N2 N −1 N2 N +1 N N N ((u · y(i) ))3 y(i) i=1 N (u · y(i) )2 i=1 12N 2 (N − 1)(N − 2)(N − 3) N 4 (u · y(i) )y(i) (3) i=1 N +1 N N 2N − 2 (u · x(i) )2 (xxT )(i) − N2 i=1 i=1 N ((u · x(i) ))2 (xxT )(i) (4) i=1 N (u · x(i) )x(i) i=1 T N (u · x(i) )x(i) i=1 Using (4) to estimate Hu κ4 (uT x) from data when implementing Algorithm 1, the resulting quasiorthogonalization algorithm runs in O(N d3 ) time. Using (2) or (3) to estimate u G(vT y) (with G chosen to be κ3 or κ4 respectively) when implementing Algorithm 2 gives an update step that runs in O(N d) time. If t bounds the number of iterations to convergence in Algorithm 2, then O(N d2 t) steps are required to recover all columns of R once quasi-orthogonalization has been achieved. 6 Simulation Results In Figure 1, we compare our algorithms to the baselines JADE [7] and versions of FastICA [10], using the code made available by the authors. Except for the choice of the contrast function for FastICA the baselines were run using default settings. All tests were done using artificially generated data. In implementing our algorithms (available at [19]), we opted to enforce orthogonality during the update step of Algorithm 2 with previously found columns of R. In Figure 1, comparison on five distributions indicates that each of the independent coordinates was generated from a distinct distribution among the Laplace distribution, the Bernoulli distribution with parameter 0.5, the tdistribution with 5 degrees of freedom, the exponential distribution, and the continuous uniform distribution. Most of these distributions are symmetric, making GI-κ3 inadmissible. When generating data for the ICA algorithm, we generate a random mixing matrix A with condition number 10 (minimum singular value 1 and maximum singular value 10), and intermediate singular values chosen uniformly at random. The noise magnitude indicates the strength of an additive white Gaussian noise. We define 100% noise magnitude to mean variance 10, with 25% noise and 50% noise indicating variances 2.5 and 5 respectively. Performance was measured using the Amari Index ˆ introduced in [1]. Let B denote the approximate demixing matrix returned by an ICA algorithm, |m | n n ˆ and let M = BA. Then, the Amari index is given by: E := i=1 j=1 maxk ij ik | − 1 + |m n j=1 n i=1 |mij | maxk |mkj | − 1 . The Amari index takes on values between 0 and the dimensionality d. It can be roughly viewed as the distance of M from the nearest scaled permutation matrix P D (where P is a permutation matrix and D is a diagonal matrix). From the noiseles data, we see that quasi-orthogonalization requires more data than whitening in order to provide accurate results. Once sufficient data is provided, all fourth order methods (GI-κ4 , JADE, and κ4 -FastICA) perform comparably. The difference between GI-κ4 and κ4 -FastICA is not 7 ICA Comparison on 5 distributions (d=5, noisless data) ICA Comparison on 5 distributions (d=5, 25% noise magnitude) 1.00 ICA Comparison on 5 distributions (d=5, 50% noise magnitude) 1.00 1.00 GI−κ4 (quasi−orthogonal) κ4−FastICA κ4−FastICA κ4−FastICA log cosh−FastICA JADE log cosh−FastICA JADE log cosh−FastICA JADE 0.10 0.01 100 0.10 0.01 100 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, noisless data) 10.00 Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) Amari Index GI−κ4 (white) 0.10 0.01 100 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, 25% noise magnitude) 10.00 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, 50% noise magnitude) 10.00 GI−κ4 (white) GI−κ4 (quasi−orthogonal) κ4−FastICA log cosh−FastICA JADE log cosh−FastICA JADE 0.10 0.01 100 0.10 1000 10000 Number of Samples 100000 κ4−FastICA log cosh−FastICA JADE 1.00 Amari Index 1.00 Amari Index Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) κ4−FastICA 1.00 GI−κ4 (white) GI−κ4 (quasi−orthogonal) 0.01 100 0.10 1000 10000 Number of Samples 100000 0.01 100 1000 10000 Number of Samples 100000 Figure 1: Comparison of ICA algorithms under various levels of noise. White and quasi-orthogonal refer to the choice of the first step of ICA. All baseline algorithms use whitening. Reported Amari indices denote the mean Amari index over 50 runs on different draws of both A and the data. d gives the data dimensionality, with two copies of each distribution used when d = 10. statistically significant over 50 runs with 100 000 samples. We note that GI-κ4 under whitening and κ4 -FastICA have the same update step (up to a slightly different choice of estimators), with GI-κ4 differing to allow for quasi-orthogonalization. Where provided, the error bars give a 2σ confidence interval on the mean Amari index. In all cases, error bars for our algorithms are provided, and error bars for the baseline algorithms are provided when they do not hinder readability. It is clear that all algorithms degrade with the addition of Gaussian noise. However, GI-κ4 under quasi-orthogonalization degrades far less when given sufficient samples. For this reason, the quasi-orthogonalized GI-κ4 outperforms all other algorithms (given sufficient samples) including the log cosh-FastICA, which performs best in the noiseless case. Contrasting the performance of GIκ4 under whitening with itself under quasi-orthogonalization, it is clear that quasi-orthogonalization is necessary to be robust to Gaussian noise. Run times were indeed reasonably fast. For 100 000 samples on the varied distributions (d = 5) with 50% Gaussian noise magnitude, GI-κ4 (including the orthogonalization step) had an average running time2 of 0.19 seconds using PCA whitening, and 0.23 seconds under quasi-orthogonalization. The corresponding average number of iterations to convergence per independent component (at 0.0001 error) were 4.16 and 4.08. In the following table, we report the mean number of steps to convergence (per independent component) over the 50 runs for the 50% noise distribution (d = 5), and note that once sufficiently many samples were taken, the number of steps to convergence becomes remarkably small. Number of data pts whitening+GI-κ4 : mean num steps quasi-orth.+GI-κ4 : mean num steps 7 500 11.76 213.92 1000 5.92 65.95 5000 4.99 4.48 10000 4.59 4.36 Acknowledgments This work was supported by NSF grant IIS 1117707. 2 Using a standard desktop with an i7-2600 3.4 GHz CPU and 16 GB RAM. 8 50000 4.35 4.06 100000 4.16 4.08 References [1] S. Amari, A. Cichocki, H. H. Yang, et al. A new learning algorithm for blind signal separation. Advances in neural information processing systems, pages 757–763, 1996. [2] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384–2392, 2012. [3] M. Belkin, L. Rademacher, and J. Voss. Blind signal separation in the presence of Gaussian noise. In JMLR W&CP;, volume 30: COLT, pages 270–287, 2013. [4] C. M. Bishop. Variational principal components. Proc. Ninth Int. Conf. on Articial Neural Networks. ICANN, 1:509–514, 1999. [5] E. J. Cand` s, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? CoRR, e abs/0912.3599, 2009. [6] J. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362–370. IET, 1993. [7] J.-F. Cardoso and A. Souloumiac. Matlab JADE for real-valued data v 1.8. http:// perso.telecom-paristech.fr/˜cardoso/Algo/Jade/jadeR.m, 2005. [Online; accessed 8-May-2013]. [8] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010. [9] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. Image Processing, IEEE Transactions on, 20(12):3419–3430, 2011. [10] H. G¨ vert, J. Hurri, J. S¨ rel¨ , and A. Hyv¨ rinen. Matlab FastICA v 2.5. http:// a a a a research.ics.aalto.fi/ica/fastica/code/dlcode.shtml, 2005. [Online; accessed 1-May-2013]. [11] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In ITCS, pages 11–20, 2013. [12] A. Hyv¨ rinen. Independent component analysis in the presence of Gaussian noise by maxia mizing joint likelihood. Neurocomputing, 22(1-3):49–67, 1998. [13] A. Hyv¨ rinen. Fast and robust fixed-point algorithms for independent component analysis. a IEEE Transactions on Neural Networks, 10(3):626–634, 1999. [14] A. Hyv¨ rinen and E. Oja. Independent component analysis: Algorithms and applications. a Neural Networks, 13(4-5):411–430, 2000. [15] J. F. Kenney and E. S. Keeping. Mathematics of Statistics, part 2. van Nostrand, 1962. [16] H. Li and T. Adali. A class of complex ICA algorithms based on the kurtosis cost function. IEEE Transactions on Neural Networks, 19(3):408–420, 2008. [17] L. Mafttner. What are cumulants. Documenta Mathematica, 4:601–622, 1999. [18] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. J. Cryptology, 22(2):139–160, 2009. [19] J. Voss, L. Rademacher, and M. Belkin. Matlab GI-ICA implementation. sourceforge.net/projects/giica/, 2013. [Online]. http:// [20] M. Welling. Robust higher order statistics. In Tenth International Workshop on Artificial Intelligence and Statistics, pages 405–412, 2005. [21] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000. [22] V. Zarzoso and P. Comon. How fast is FastICA. EUSIPCO, 2006. 9
2 0.65221179 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
Author: Siwei Lyu, Xin Wang
Abstract: Nonnegative matrix factorization (NMF) is a popular data analysis method, the objective of which is to approximate a matrix with all nonnegative components into the product of two nonnegative matrices. In this work, we describe a new simple and efficient algorithm for multi-factor nonnegative matrix factorization (mfNMF) problem that generalizes the original NMF problem to more than two factors. Furthermore, we extend the mfNMF algorithm to incorporate a regularizer based on the Dirichlet distribution to encourage the sparsity of the components of the obtained factors. Our sparse mfNMF algorithm affords a closed form and an intuitive interpretation, and is more efficient in comparison with previous works that use fix point iterations. We demonstrate the effectiveness and efficiency of our algorithms on both synthetic and real data sets. 1
3 0.64712763 73 nips-2013-Convex Relaxations for Permutation Problems
Author: Fajwel Fogel, Rodolphe Jenatton, Francis Bach, Alexandre D'Aspremont
Abstract: Seriation seeks to reconstruct a linear order between variables using unsorted similarity information. It has direct applications in archeology and shotgun gene sequencing for example. We prove the equivalence between the seriation and the combinatorial 2-SUM problem (a quadratic minimization problem over permutations) over a class of similarity matrices. The seriation problem can be solved exactly by a spectral algorithm in the noiseless case and we produce a convex relaxation for the 2-SUM problem to improve the robustness of solutions in a noisy setting. This relaxation also allows us to impose additional structural constraints on the solution, to solve semi-supervised seriation problems. We present numerical experiments on archeological data, Markov chains and gene sequences. 1
4 0.63952601 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle
Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1
5 0.63534689 186 nips-2013-Matrix factorization with binary components
Author: Martin Slawski, Matthias Hein, Pavlo Lutsik
Abstract: Motivated by an application in computational biology, we consider low-rank matrix factorization with {0, 1}-constraints on one of the factors and optionally convex constraints on the second one. In addition to the non-convexity shared with other matrix factorization schemes, our problem is further complicated by a combinatorial constraint set of size 2m·r , where m is the dimension of the data points and r the rank of the factorization. Despite apparent intractability, we provide − in the line of recent work on non-negative matrix factorization by Arora et al. (2012)− an algorithm that provably recovers the underlying factorization in the exact case with O(mr2r + mnr + r2 n) operations for n datapoints. To obtain this result, we use theory around the Littlewood-Offord lemma from combinatorics.
6 0.63153195 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
7 0.63034207 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
8 0.62290162 59 nips-2013-Blind Calibration in Compressed Sensing using Message Passing Algorithms
9 0.61928666 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
10 0.61676574 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
11 0.61483139 247 nips-2013-Phase Retrieval using Alternating Minimization
12 0.6011802 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
13 0.60002333 284 nips-2013-Robust Spatial Filtering with Beta Divergence
14 0.59937412 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
15 0.59662569 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing
16 0.59390646 188 nips-2013-Memory Limited, Streaming PCA
17 0.59015018 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
18 0.58854711 327 nips-2013-The Randomized Dependence Coefficient
19 0.56528014 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
20 0.553702 265 nips-2013-Reconciling "priors" & "priors" without prejudice?
topicId topicWeight
[(16, 0.044), (33, 0.131), (34, 0.109), (41, 0.039), (49, 0.035), (56, 0.105), (67, 0.302), (70, 0.028), (85, 0.037), (89, 0.04), (93, 0.027)]
simIndex simValue paperId paperTitle
1 0.77070153 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation
Author: Bogdan Savchynskyy, Jörg Hendrik Kappes, Paul Swoboda, Christoph Schnörr
Abstract: We consider energy minimization for undirected graphical models, also known as the MAP-inference problem for Markov random fields. Although combinatorial methods, which return a provably optimal integral solution of the problem, made a significant progress in the past decade, they are still typically unable to cope with large-scale datasets. On the other hand, large scale datasets are often defined on sparse graphs and convex relaxation methods, such as linear programming relaxations then provide good approximations to integral solutions. We propose a novel method of combining combinatorial and convex programming techniques to obtain a global solution of the initial combinatorial problem. Based on the information obtained from the solution of the convex relaxation, our method confines application of the combinatorial solver to a small fraction of the initial graphical model, which allows to optimally solve much larger problems. We demonstrate the efficacy of our approach on a computer vision energy minimization benchmark. 1
2 0.76248205 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli
Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a na¨ve approach ı to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1
same-paper 3 0.74416101 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
Author: James R. Voss, Luis Rademacher, Mikhail Belkin
Abstract: The performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise. In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows: 1. We develop and implement an efficient, Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. 2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. The algorithm is based on a special form of gradient iteration (different from gradient descent). We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case. 1 Introduction and Related Works In the Blind Signal Separation setting, it is assumed that observed data is drawn from an unknown distribution. The goal is to recover the latent signals under some appropriate structural assumption. A prototypical setting is the so-called cocktail party problem: in a room, there are d people speaking simultaneously and d microphones, with each microphone capturing a superposition of the voices. The objective is to recover the speech of each individual speaker. The simplest modeling assumption is to consider each speaker as producing a signal that is a random variable independent of the others, and to take the superposition to be a linear transformation independent of time. This leads to the following formalization: We observe samples from a random vector x distributed according to the equation x = As + b + η where A is a linear mixing matrix, b ∈ Rd is a constant vector, s is a latent random vector with independent coordinates, and η is an unknown random noise independent 1 of s. For simplicity, we assume A ∈ Rd×d is square and of full rank. The latent components of s are viewed as containing the information describing the makeup of the observed signal (voices of individual speakers in the cocktail party setting). The goal of Independent Component Analysis is to approximate the matrix A in order to recover the latent signal s. In practice, most methods ignore the noise term, leaving the simpler problem of recovering the mixing matrix A when x = As is observed. Arguably the two most widely used ICA algorithms are FastICA [13] and JADE [6]. Both of these algorithms are based on a two step process: (1) The data is centered and whitened, that is, made to have identity covariance matrix. This is typically done using principal component analysis (PCA) and rescaling the appropriate components. In the noiseless case this procedure orthogonalizes and rescales the independent components and thus recovers A up to an unknown orthogonal matrix R. (2) Recover the orthogonal matrix R. Most practical ICA algorithms differ only in the second step. In FastICA, various objective functions are used to perform a projection pursuit style algorithm which recovers the columns of R one at a time. JADE uses a fourth-cumulant based technique to simultaneously recover all columns of R. Step 1 of ICA is affected by the addition of a Gaussian noise. Even if the noise is white (has a scalar times identity covariance matrix) the PCA-based whitening procedure can no longer guarantee the whitening of the underlying independent components. Hence, the second step of the process is no longer justified. This failure may be even more significant if the noise is not white, which is likely to be the case in many practical situations. Recent theoretical developments (see, [2] and [3]) consider the case where the noise η is an arbitrary (not necessarily white) additive Gaussian variable drawn independently from s. In [2], it was observed that certain cumulant-based techniques for ICA can still be applied for the second step if the underlying signals can be orthogonalized.1 Orthogonalization of the latent signals (quasi-orthogonalization) is a significantly less restrictive condition as it does not force the underlying signal to have identity covariance (as in whitening in the noiseless case). In the noisy setting, the usual PCA cannot achieve quasi-orthogonalization as it will whiten the mixed signal, but not the underlying components. In [3], we show how quasi-orthogonalization can be achieved in a noise-invariant way through a method based on the fourth-order cumulant tensor. However, a direct implementation of that method requires estimating the full fourth-order cumulant tensor, which is computationally challenging even in relatively low dimensions. In this paper we derive a practical version of that algorithm based on directional Hessians of the fourth univariate cumulant, thus reducing the complexity dependence on the data dimensionality from d4 to d3 , and also allowing for a fully vectorized implementation. We also develop a fast and very simple gradient iteration (not to be confused with gradient descent) algorithm, GI-ICA, which is compatible with the quasi-orthogonalization step and can be shown to have convergence of order r − 1, when implemented using a univariate cumulant of order r. For the cumulant of order four, commonly used in practical applications, we obtain cubic convergence. We show how these convergence rates follow directly from the properties of the cumulants, which sheds some light on the somewhat surprising cubic convergence seen in fourth-order based ICA methods [13, 18, 22]. The update step has complexity O(N d) where N is the number of samples, giving a total algorithmic complexity of O(N d3 ) for step 1 and O(N d2 t) for step 2, where t is the number of iterations for convergence in the gradient iteration. Interestingly, while the techniques are quite different, our gradient iteration algorithm turns out to be closely related to Fast ICA in the noiseless setting, in the case when the data is whitened and the cumulants of order three or four are used. Thus, GI-ICA can be viewed as a generalization (and a conceptual simplification) of Fast ICA for more general quasi-orthogonalized data. We present experimental results showing superior performance in the case of data contaminated by Gaussian noise and very competitive performance for clean data. We also note that the GIICA algorithms are fast in practice, allowing us to process (decorrelate and detect the independent 1 This process of orthogonalizing the latent signals was called quasi-whitening in [2] and later in [3]. However, this conflicts with the definition of quasi-whitening given in [12] which requires the latent signals to be whitened. To avoid the confusion we will use the term quasi-orthogonalization for the process of orthogonalizing the latent signals. 2 components) 100 000 points in dimension 5 in well under a second on a standard desktop computer. Our Matlab implementation of GI-ICA is available for download at http://sourceforge. net/projects/giica/. Finally, we observe that our method is partially compatible with the robust cumulants introduced in [20]. We briefly discuss how GI-ICA can be extended using these noise-robust techniques for ICA to reduce the impact of sparse noise. The paper is organized as follows. In section 2, we discuss the relevant properties of cumulants, and discuss results from prior work which allows for the quasi-orthogonalization of signals with non-zero fourth cumulant. In section 3, we discuss the connection between the fourth-order cumulant tensor method for quasi-orthogonalization discussed in section 2 with Hessian-based techniques seen in [2] and [11]. We use this connection to create a more computationally efficient and practically implementable version of the quasi-orthogonalization algorithm discussed in section 2. In section 4, we discuss new, fast, projection-pursuit style algorithms for the second step of ICA which are compatible with quasi-orthogonalization. In order to simplify the presentation, all algorithms are stated in an abstract form as if we have exact knowledge of required distribution parameters. Section 5 discusses the estimators of required distribution parameters to be used in practice. Section 6 discusses numerical experiments demonstrating the applicability of our techniques. Related Work. The name Independent Component Analysis refers to a broad range of algorithms addressing the blind signal separation problem as well as its variants and extensions. There is an extensive literature on ICA in the signal processing and machine learning communities due to its applicability to a variety of important practical situations. For a comprehensive introduction see the books [8, 14]. In this paper we develop techniques for dealing with noisy data by introducing new and more efficient techniques for quasi-orthogonalization and subsequent component recovery. The quasi-orthogonalization step was introduced in [2], where the authors proposed an algorithm for the case when the fourth cumulants of all independent components are of the same sign. A general algorithm with complete theoretical analysis was provided in [3]. That algorithm required estimating the full fourth-order cumulant tensor. We note that Hessian based techniques for ICA were used in [21, 2, 11], with [11] and [2] using the Hessian of the fourth-order cumulant. The papers [21] and [11] proposed interesting randomized one step noise-robust ICA algorithms based on the cumulant generating function and the fourth cumulant respectively in primarily theoretical settings. The gradient iteration algorithm proposed is closely related to the work [18], which provides a gradient-based algorithm derived from the fourth moment with cubic convergence to learn an unknown parallelepiped in a cryptographic setting. For the special case of the fourth cumulant, the idea of gradient iteration has appeared in the context of FastICA with a different justification, see e.g. [16, Equation 11 and Theorem 2]. We also note the work [12], which develops methods for Gaussian noise-invariant ICA under the assumption that the noise parameters are known. Finally, there are several papers that considered the problem of performing PCA in a noisy framework. [5] gives a provably robust algorithm for PCA under a sparse noise model. [4] performs PCA robust to white Gaussian noise, and [9] performs PCA robust to white Gaussian noise and sparse noise. 2 Using Cumulants to Orthogonalize the Independent Components Properties of Cumulants: Cumulants are similar to moments and can be expressed in terms of certain polynomials of the moments. However, cumulants have additional properties which allow independent random variables to be algebraically separated. We will be interested in the fourth order multi-variate cumulants, and univariate cumulants of arbitrary order. Denote by Qx the fourth order cumulant tensor for the random vector x. So, (Qx )ijkl is the cross-cumulant between the random variables xi , xj , xk , and xl , which we alternatively denote as Cum(xi , xj , xk , xl ). Cumulant tensors are symmetric, i.e. (Qx )ijkl is invariant under permutations of indices. Multivariate cumulants have the following properties (written in the case of fourth order cumulants): 1. (Multilinearity) Cum(αxi , xj , xk , xl ) = α Cum(xi , xj , xk , xl ) for random vector x and scalar α. If y is a random variable, then Cum(xi +y, xj , xk , xl ) = Cum(xi , xj , xk , xl )+Cum(y, xj , xk , xl ). 2. (Independence) If xi and xj are independent random variables, then Cum(xi , xj , xk , xl ) = 0. When x and y are independent, Qx+y = Qx + Qy . 3. (Vanishing Gaussian) Cumulants of order 3 and above are zero for Gaussian random variables. 3 The first order cumulant is the mean, and the second order multivariate cumulant is the covariance matrix. We will denote by κr (x) the order-r univariate cumulant, which is equivalent to the crosscumulant of x with itself r times: κr (x) := Cum(x, x, . . . , x) (where x appears r times). Univariate r-cumulants are additive for independent random variables, i.e. κr (x + y) = κr (x) + κr (y), and homogeneous of degree r, i.e. κr (αx) = αr κr (x). Quasi-Orthogonalization Using Cumulant Tensors. Recalling our original notation, x = As + b + η gives the generative ICA model. We define an operation of fourth-order tensors on matrices: For Q ∈ Rd×d×d×d and M ∈ Rd×d , Q(M ) is the matrix such that d d Q(M )ij := Qijkl mlk . (1) k=1 l=1 We can use this operation to orthogonalize the latent random signals. Definition 2.1. A matrix W is called a quasi-orthogonalization matrix if there exists an orthogonal matrix R and a nonsingular diagonal matrix D such that W A = RD. We will need the following results from [3]. Here we use Aq to denote the q th column of A. Lemma 2.2. Let M ∈ Rd×d be an arbitrary matrix. Then, Qx (M ) = ADAT where D is a diagonal matrix with entries dqq = κ4 (sq )AT M Aq . q Theorem 2.3. Suppose that each component of s has non-zero fourth cumulant. Let M = Qx (I), and let C = Qx (M −1 ). Then C = ADAT where D is a diagonal matrix with entries dqq = 1/ Aq 2 . In particular, C is positive definite, and for any factorization BB T of C, B −1 is a quasi2 orthogonalization matrix. 3 Quasi-Orthogonalization using Cumulant Hessians We have seen in Theorem 2.3 a tensor-based method which can be used to quasi-orthogonalize observed data. However, this method na¨vely requires the estimation of O(d4 ) terms from data. ı There is a connection between the cumulant Hessian-based techniques used in ICA [2, 11] and the tensor-based technique for quasi-orthogonalization described in Theorem 2.3 that allows the tensor-method to be rewritten using a series of Hessian operations. We make this connection precise below. The Hessian version requires only O(d3 ) terms to be estimated from data and simplifies the computation to consist of matrix and vector operations. Let Hu denote the Hessian operator with respect to a vector u ∈ Rd . The following lemma connects Hessian methods with our tensor-matrix operation (a special case is discussed in [2, Section 2.1]). Lemma 3.1. Hu (κ4 (uT x)) = ADAT where dqq = 12(uT Aq )2 κ4 (sq ). In Lemma 3.1, the diagonal entries can be rewritten as dqq = 12κ4 (sq )(AT (uuT )Aq ). By comq paring with Lemma 2.2, we see that applying Qx against a symmetric, rank one matrix uuT can be 1 rewritten in terms of the Hessian operations: Qx (uuT ) = 12 Hu (κ4 (uT x)). This formula extends to arbitrary symmetric matrices by the following Lemma. Lemma 3.2. Let M be a symmetric matrix with eigen decomposition U ΛU T such that U = d 1 (u1 , u2 , . . . , ud ) and Λ = diag(λ1 , λ2 , . . . , λd ). Then, Qx (M ) = 12 i=1 λi Hui κ4 (uT x). i The matrices I and M −1 in Theorem 2.3 are symmetric. As such, the tensor-based method for quasi-orthogonalization can be rewritten using Hessian operations. This is done in Algorithm 1. 4 Gradient Iteration ICA In the preceding sections, we discussed techniques to quasi-orthogonalize data. For this section, we will assume that quasi-orthogonalization is accomplished, and discuss deflationary approaches that can quickly recover the directions of the independent components. Let W be a quasiorthogonalization matrix. Then, define y := W x = W As + W η. Note that since η is Gaussian noise, so is W η. There exists a rotation matrix R and a diagonal matrix D such that W A = RD. Let ˜ := Ds. The coordinates of ˜ are still independent random variables. Gaussian noise makes s s recovering the scaling matrix D impossible. We aim to recover the rotation matrix R. 4 Algorithm 1 Hessian-based algorithm to generate a quasi-orthogonalization matrix. 1: function F IND Q UASI O RTHOGONALIZATION M ATRIX(x) d 1 2: Let M = 12 i=1 Hu κ4 (uT x)|u=ei . See Equation (4) for the estimator. T 3: Let U ΛU give the eigendecomposition of M −1 d 4: Let C = i=1 λi Hu κ4 (uT x)|u=Ui . See Equation (4) for the estimator. 5: Factorize C as BB T . 6: return B −1 7: end function To see why recovery of D is impossible, we note that a white Gaussian random variable η 1 has independent components. It is impossible to distinguish between the case where η 1 is part of the signal, i.e. W A(s + η 1 ) + W η, and the case where Aη 1 is part of the additive Gaussian noise, i.e. W As + W (Aη 1 + η), when s, η 1 , and η are drawn independently. In the noise-free ICA setting, the latent signal is typically assumed to have identity covariance, placing the scaling information in the columns of A. The presence of additive Gaussian noise makes recovery of the scaling information impossible since the latent signals become ill-defined. Following the idea popularized in FastICA, we will discuss a deflationary technique to recover the columns of R one at a time. Fast Recovery of a Single Independent Component. In the deflationary approach, a function f is fixed that acts upon a directional vector u ∈ Rd . Based on some criterion (typically maximization or minimization of f ), an iterative optimization step is performed until convergence. This technique was popularized in FastICA, which is considered fast for the following reasons: 1. As an approximate Newton method, FastICA requires computation of u f and a quick-tocompute estimate of (Hu (f ))−1 at each iterative step. Due to the estimate, the computation runs in O(N d) time, where N is the number of samples. 2. The iterative step in FastICA has local quadratic order convergence using arbitrary functions, and global cubic-order convergence when using the fourth cumulant [13]. We note that cubic convergence rates are not unique to FastICA and have been seen using gradient descent (with the correct step-size) when choosing f as the fourth moment [18]. Our proposed deflationary algorithm will be comparable with FastICA in terms of computational complexity, and the iterative step will take on a conceptually simpler form as it only relies on u κr . We provide a derivation of fast convergence rates that relies entirely on the properties of cumulants. As cumulants are invariant with respect to the additive Gaussian noise, the proposed methods will be admissible for both standard and noisy ICA. While cumulants are essentially unique with the additivity and homogeneity properties [17] when no restrictions are made on the probability space, the preprocessing step of ICA gives additional structure (like orthogonality and centering), providing additional admissible functions. In particular, [20] designs “robust cumulants” which are only minimally effected by sparse noise. Welling’s robust cumulants have versions of the additivity and homogeneity properties, and are consistent with our update step. For this reason, we will state our results in greater generality. Let G be a function of univariate random variables that satisfies the additivity, degree-r (r ≥ 3) homogeneity, and (for the noisy case) the vanishing Gaussians properties of cumulants. Then for a generic choice of input vector v, Algorithm 2 will demonstrate order r−1 convergence. In particular, if G is κ3 , then we obtain quadratic convergence; and if G is κ4 , we obtain cubic convergence. Lemma 4.1 helps explain why this is true. Lemma 4.1. v G(v · y) = r d i=1 (v · Ri )r−1 G(˜i )Ri . s If we consider what is happening in the basis of the columns of R, then up to some multiplicative constant, each coordinate is raised to the r − 1 power and then renormalized during each step of Algorithm 2. This ultimately leads to the order r − 1 convergence. Theorem 4.2. If for a unit vector input v to Algorithm 2 h = arg maxi |(v · Ri )r−2 G(˜i )| has a s unique answer, then v has order r − 1 convergence to Rh up to sign. In particular, if the following conditions are met: (1) There exists a coordinate random variable si of s such that G(si ) = 0. (2) v inputted into Algorithm 2 is chosen uniformly at random from the unit sphere S d−1 . Then Algorithm 2 converges to a column of R (up to sign) almost surely, and convergence is of order r − 1. 5 Algorithm 2 A fast algorithm to recover a single column of R when v is drawn generically from the unit sphere. Equations (2) and (3) provide k-statistic based estimates of v κ3 and v κ4 , which can be used as practical choices of v G on real data. 1: function GI-ICA(v, y) 2: repeat 3: v ← v G(vT y) 4: v ← v/ v 2 5: until Convergence return v 6: end function ˜ Algorithm 3 Algorithm for ICA in the presence of Gaussian noise. A recovers A up to column order and scaling. RT W is the demixing matrix for the observed random vector x. function G AUSSIAN ROBUST ICA(G, x) W = F IND Q UASI O RTHOGONALIZATION M ATRIX(x) y = Wx R columns = ∅ for i = 1 to d do Draw v from S d−1 ∩ span(R columns)⊥ uniformly at random. R columns = R columns ∪ {GI-ICA(v, y)} end for Construct a matrix R using the elements of R columns as columns. ˜ = RT y s ˜ A = (RT W )−1 ˜ s return A, ˜ end function By convergence up to sign, we include the possibility that v oscillates between Rh and −Rh on alternating steps. This can occur if G(˜i ) < 0 and r is odd. Due to space limitations, the proof is s omitted. Recovering all Independent Components. As a Corollary to Theorem 4.2 we get: Corollary 4.3. Suppose R1 , R2 , . . . , Rk are known for some k < d. Suppose there exists i > k such that G(si ) = 0. If v is drawn uniformly at random from S d−1 ∩ span(R1 , . . . , Rk )⊥ where S d−1 denotes the unit sphere in Rd , then Algorithm 2 with input v converges to a new column of R almost surely. Since the indexing of R is arbitrary, Corollary 4.3 gives a solution to noisy ICA, in Algorithm 3. In practice (not required by the theory), it may be better to enforce orthogonality between the columns of R, by orthogonalizing v against previously found columns of R at the end of each step in Algorithm 2. We expect the fourth or third cumulant function will typically be chosen for G. 5 Time Complexity Analysis and Estimation of Cumulants To implement Algorithms 1 and 2 requires the estimation of functions from data. We will limit our discussion to estimation of the third and fourth cumulants, as lower order cumulants are more statistically stable to estimate than higher order cumulants. κ3 is useful in Algorithm 2 for nonsymmetric distributions. However, since κ3 (si ) = 0 whenever si is a symmetric distribution, it is plausible that κ3 would not recover all columns of R. When s is suspected of being symmetric, it is prudent to use κ4 for G. Alternatively, one can fall back to κ4 from κ3 when κ3 is detected to be near 0. Denote by z (1) , z (2) , . . . , z (N ) the observed samples of a random variable z. Given a sample, each cumulant can be estimated in an unbiased fashion by its k-statistic. Denote by kr (z (i) ) the kN 1 statistic sample estimate of κr (z). Letting mr (z (i) ) := N i=1 (z (i) − z )r give the rth sample ¯ central moment, then N 2 m3 (z (i) ) (N + 1)m4 (z (i) ) − 3(N − 1)m2 (z (i) )2 k3 (z (i) ) := , k4 (z (i) ) := N 2 (N − 1)(N − 2) (N − 1)(N − 2)(N − 3) 6 gives the third and fourth k-statistics [15]. However, we are interested in estimating the gradients (for Algorithm 2) and Hessians (for Algorithm 1) of the cumulants rather than the cumulants themselves. The following Lemma shows how to obtain unbiased estimates: Lemma 5.1. Let z be a d-dimensional random vector with finite moments up to order r. Let z(i) be α an iid sample of z. Let α ∈ Nd be a multi-index. Then ∂u kr (u · z(i) ) is an unbiased estimate for α ∂u κr (u · z). If we mean-subtract (via the sample mean) all observed random variables, then the resulting estimates are: N u k3 (u · y) = (N − 1)−1 (N − 2)−1 3N (u · y(i) )2 y(i) (2) i=1 u k4 (u · y) = N2 (N − 1)(N − 2)(N − 3) −12 Hu k4 (u · x) = N −1 − N2 N −1 N2 N +1 N N N ((u · y(i) ))3 y(i) i=1 N (u · y(i) )2 i=1 12N 2 (N − 1)(N − 2)(N − 3) N 4 (u · y(i) )y(i) (3) i=1 N +1 N N 2N − 2 (u · x(i) )2 (xxT )(i) − N2 i=1 i=1 N ((u · x(i) ))2 (xxT )(i) (4) i=1 N (u · x(i) )x(i) i=1 T N (u · x(i) )x(i) i=1 Using (4) to estimate Hu κ4 (uT x) from data when implementing Algorithm 1, the resulting quasiorthogonalization algorithm runs in O(N d3 ) time. Using (2) or (3) to estimate u G(vT y) (with G chosen to be κ3 or κ4 respectively) when implementing Algorithm 2 gives an update step that runs in O(N d) time. If t bounds the number of iterations to convergence in Algorithm 2, then O(N d2 t) steps are required to recover all columns of R once quasi-orthogonalization has been achieved. 6 Simulation Results In Figure 1, we compare our algorithms to the baselines JADE [7] and versions of FastICA [10], using the code made available by the authors. Except for the choice of the contrast function for FastICA the baselines were run using default settings. All tests were done using artificially generated data. In implementing our algorithms (available at [19]), we opted to enforce orthogonality during the update step of Algorithm 2 with previously found columns of R. In Figure 1, comparison on five distributions indicates that each of the independent coordinates was generated from a distinct distribution among the Laplace distribution, the Bernoulli distribution with parameter 0.5, the tdistribution with 5 degrees of freedom, the exponential distribution, and the continuous uniform distribution. Most of these distributions are symmetric, making GI-κ3 inadmissible. When generating data for the ICA algorithm, we generate a random mixing matrix A with condition number 10 (minimum singular value 1 and maximum singular value 10), and intermediate singular values chosen uniformly at random. The noise magnitude indicates the strength of an additive white Gaussian noise. We define 100% noise magnitude to mean variance 10, with 25% noise and 50% noise indicating variances 2.5 and 5 respectively. Performance was measured using the Amari Index ˆ introduced in [1]. Let B denote the approximate demixing matrix returned by an ICA algorithm, |m | n n ˆ and let M = BA. Then, the Amari index is given by: E := i=1 j=1 maxk ij ik | − 1 + |m n j=1 n i=1 |mij | maxk |mkj | − 1 . The Amari index takes on values between 0 and the dimensionality d. It can be roughly viewed as the distance of M from the nearest scaled permutation matrix P D (where P is a permutation matrix and D is a diagonal matrix). From the noiseles data, we see that quasi-orthogonalization requires more data than whitening in order to provide accurate results. Once sufficient data is provided, all fourth order methods (GI-κ4 , JADE, and κ4 -FastICA) perform comparably. The difference between GI-κ4 and κ4 -FastICA is not 7 ICA Comparison on 5 distributions (d=5, noisless data) ICA Comparison on 5 distributions (d=5, 25% noise magnitude) 1.00 ICA Comparison on 5 distributions (d=5, 50% noise magnitude) 1.00 1.00 GI−κ4 (quasi−orthogonal) κ4−FastICA κ4−FastICA κ4−FastICA log cosh−FastICA JADE log cosh−FastICA JADE log cosh−FastICA JADE 0.10 0.01 100 0.10 0.01 100 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, noisless data) 10.00 Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) Amari Index GI−κ4 (white) 0.10 0.01 100 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, 25% noise magnitude) 10.00 1000 10000 100000 Number of Samples ICA Comparison on 5 distributions (d=10, 50% noise magnitude) 10.00 GI−κ4 (white) GI−κ4 (quasi−orthogonal) κ4−FastICA log cosh−FastICA JADE log cosh−FastICA JADE 0.10 0.01 100 0.10 1000 10000 Number of Samples 100000 κ4−FastICA log cosh−FastICA JADE 1.00 Amari Index 1.00 Amari Index Amari Index GI−κ4 (white) GI−κ4 (quasi−orthogonal) κ4−FastICA 1.00 GI−κ4 (white) GI−κ4 (quasi−orthogonal) 0.01 100 0.10 1000 10000 Number of Samples 100000 0.01 100 1000 10000 Number of Samples 100000 Figure 1: Comparison of ICA algorithms under various levels of noise. White and quasi-orthogonal refer to the choice of the first step of ICA. All baseline algorithms use whitening. Reported Amari indices denote the mean Amari index over 50 runs on different draws of both A and the data. d gives the data dimensionality, with two copies of each distribution used when d = 10. statistically significant over 50 runs with 100 000 samples. We note that GI-κ4 under whitening and κ4 -FastICA have the same update step (up to a slightly different choice of estimators), with GI-κ4 differing to allow for quasi-orthogonalization. Where provided, the error bars give a 2σ confidence interval on the mean Amari index. In all cases, error bars for our algorithms are provided, and error bars for the baseline algorithms are provided when they do not hinder readability. It is clear that all algorithms degrade with the addition of Gaussian noise. However, GI-κ4 under quasi-orthogonalization degrades far less when given sufficient samples. For this reason, the quasi-orthogonalized GI-κ4 outperforms all other algorithms (given sufficient samples) including the log cosh-FastICA, which performs best in the noiseless case. Contrasting the performance of GIκ4 under whitening with itself under quasi-orthogonalization, it is clear that quasi-orthogonalization is necessary to be robust to Gaussian noise. Run times were indeed reasonably fast. For 100 000 samples on the varied distributions (d = 5) with 50% Gaussian noise magnitude, GI-κ4 (including the orthogonalization step) had an average running time2 of 0.19 seconds using PCA whitening, and 0.23 seconds under quasi-orthogonalization. The corresponding average number of iterations to convergence per independent component (at 0.0001 error) were 4.16 and 4.08. In the following table, we report the mean number of steps to convergence (per independent component) over the 50 runs for the 50% noise distribution (d = 5), and note that once sufficiently many samples were taken, the number of steps to convergence becomes remarkably small. Number of data pts whitening+GI-κ4 : mean num steps quasi-orth.+GI-κ4 : mean num steps 7 500 11.76 213.92 1000 5.92 65.95 5000 4.99 4.48 10000 4.59 4.36 Acknowledgments This work was supported by NSF grant IIS 1117707. 2 Using a standard desktop with an i7-2600 3.4 GHz CPU and 16 GB RAM. 8 50000 4.35 4.06 100000 4.16 4.08 References [1] S. Amari, A. Cichocki, H. H. Yang, et al. A new learning algorithm for blind signal separation. Advances in neural information processing systems, pages 757–763, 1996. [2] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384–2392, 2012. [3] M. Belkin, L. Rademacher, and J. Voss. Blind signal separation in the presence of Gaussian noise. In JMLR W&CP;, volume 30: COLT, pages 270–287, 2013. [4] C. M. Bishop. Variational principal components. Proc. Ninth Int. Conf. on Articial Neural Networks. ICANN, 1:509–514, 1999. [5] E. J. Cand` s, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? CoRR, e abs/0912.3599, 2009. [6] J. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362–370. IET, 1993. [7] J.-F. Cardoso and A. Souloumiac. Matlab JADE for real-valued data v 1.8. http:// perso.telecom-paristech.fr/˜cardoso/Algo/Jade/jadeR.m, 2005. [Online; accessed 8-May-2013]. [8] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010. [9] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. Image Processing, IEEE Transactions on, 20(12):3419–3430, 2011. [10] H. G¨ vert, J. Hurri, J. S¨ rel¨ , and A. Hyv¨ rinen. Matlab FastICA v 2.5. http:// a a a a research.ics.aalto.fi/ica/fastica/code/dlcode.shtml, 2005. [Online; accessed 1-May-2013]. [11] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In ITCS, pages 11–20, 2013. [12] A. Hyv¨ rinen. Independent component analysis in the presence of Gaussian noise by maxia mizing joint likelihood. Neurocomputing, 22(1-3):49–67, 1998. [13] A. Hyv¨ rinen. Fast and robust fixed-point algorithms for independent component analysis. a IEEE Transactions on Neural Networks, 10(3):626–634, 1999. [14] A. Hyv¨ rinen and E. Oja. Independent component analysis: Algorithms and applications. a Neural Networks, 13(4-5):411–430, 2000. [15] J. F. Kenney and E. S. Keeping. Mathematics of Statistics, part 2. van Nostrand, 1962. [16] H. Li and T. Adali. A class of complex ICA algorithms based on the kurtosis cost function. IEEE Transactions on Neural Networks, 19(3):408–420, 2008. [17] L. Mafttner. What are cumulants. Documenta Mathematica, 4:601–622, 1999. [18] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. J. Cryptology, 22(2):139–160, 2009. [19] J. Voss, L. Rademacher, and M. Belkin. Matlab GI-ICA implementation. sourceforge.net/projects/giica/, 2013. [Online]. http:// [20] M. Welling. Robust higher order statistics. In Tenth International Workshop on Artificial Intelligence and Statistics, pages 405–412, 2005. [21] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000. [22] V. Zarzoso and P. Comon. How fast is FastICA. EUSIPCO, 2006. 9
4 0.65636379 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation
Author: John Duchi, Martin J. Wainwright, Michael Jordan
Abstract: We provide a detailed study of the estimation of probability distributions— discrete and continuous—in a stringent setting in which data is kept private even from the statistician. We give sharp minimax rates of convergence for estimation in these locally private settings, exhibiting fundamental trade-offs between privacy and convergence rate, as well as providing tools to allow movement along the privacy-statistical efficiency continuum. One of the consequences of our results is that Warner’s classical work on randomized response is an optimal way to perform survey sampling while maintaining privacy of the respondents. 1
5 0.57644439 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
Author: Zhuo Wang, Alan Stocker, Daniel Lee
Abstract: In many neural systems, information about stimulus variables is often represented in a distributed manner by means of a population code. It is generally assumed that the responses of the neural population are tuned to the stimulus statistics, and most prior work has investigated the optimal tuning characteristics of one or a small number of stimulus variables. In this work, we investigate the optimal tuning for diffeomorphic representations of high-dimensional stimuli. We analytically derive the solution that minimizes the L2 reconstruction loss. We compared our solution with other well-known criteria such as maximal mutual information. Our solution suggests that the optimal weights do not necessarily decorrelate the inputs, and the optimal nonlinearity differs from the conventional equalization solution. Results illustrating these optimal representations are shown for some input distributions that may be relevant for understanding the coding of perceptual pathways. 1
6 0.57611257 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
7 0.5756281 173 nips-2013-Least Informative Dimensions
8 0.5754258 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
9 0.57510364 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
11 0.5740782 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
12 0.57384241 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
13 0.57382739 350 nips-2013-Wavelets on Graphs via Deep Learning
14 0.57346535 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
15 0.57226866 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
16 0.57176125 318 nips-2013-Structured Learning via Logistic Regression
17 0.57175547 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
18 0.5713771 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
19 0.57123435 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
20 0.57027227 294 nips-2013-Similarity Component Analysis