nips nips2006 nips2006-194 knowledge-graph by maker-knowledge-mining

194 nips-2006-Towards a general independent subspace analysis

Source: pdf

Author: Fabian J. Theis

Abstract: The increasingly popular independent component analysis (ICA) may only be applied to data following the generative ICA model in order to guarantee algorithmindependent and theoretically valid results. Subspace ICA models generalize the assumption of component independence to independence between groups of components. They are attractive candidates for dimensionality reduction methods, however are currently limited by the assumption of equal group sizes or less general semi-parametric models. By introducing the concept of irreducible independent subspaces or components, we present a generalization to a parameter-free mixture model. Moreover, we relieve the condition of at-most-one-Gaussian by including previous results on non-Gaussian component analysis. After introducing this general model, we discuss joint block diagonalization with unknown block sizes, on which we base a simple extension of JADE to algorithmically perform the subspace analysis. Simulations conﬁrm the feasibility of the algorithm. 1 Independent subspace analysis A random vector Y is called an independent component of the random vector X, if there exists an invertible matrix A and a decomposition X = A(Y, Z) such that Y and Z are stochastically independent. The goal of a general independent subspace analysis (ISA) or multidimensional independent component analysis is the decomposition of an arbitrary random vector X into independent components. If X is to be decomposed into one-dimensional components, this coincides with ordinary independent component analysis (ICA). Similarly, if the independent components are required to be of the same dimension k, then this is denoted by multidimensional ICA of ﬁxed group size k or simply k-ISA. So 1-ISA is equivalent to ICA. 1.1 Why extend ICA? An important structural aspect in the search for decompositions is the knowledge of the number of solutions i.e. the indeterminacies of the problem. Without it, the result of any ICA or ISA algorithm cannot be compared with other solutions, so for instance blind source separation (BSS) would be impossible. Clearly, given an ISA solution, invertible transforms in each component (scaling matrices L) as well as permutations of components of the same dimension (permutation matrices P) give again an ISA of X. And indeed, in the special case of ICA, scaling and permutation are already all indeterminacies given that at most one Gaussian is contained in X [6]. This is one of the key theoretical results in ICA, allowing the usage of ICA for solving BSS problems and hence stimulating many applications. It has been shown that also for k-ISA, scalings and permutations as above are the only indeterminacies [11], given some additional rather weak restrictions to the model. However, a serious drawback of k-ISA (and hence of ICA) lies in the fact that the requirement ﬁxed group-size k does not allow us to apply this analysis to an arbitrary random vector. Indeed, crosstalking error 4 3 2 1 0 FastICA JADE Extended Infomax Figure 1: Applying ICA to a random vector X = AS that does not fulﬁll the ICA model; here S is chosen to consist of a two-dimensional and a one-dimensional irreducible component. Shown are the statistics over 100 runs of the Amari error of the random original and the reconstructed mixing matrix using the three ICA-algorithms FastICA, JADE and Extended Infomax. Clearly, the original mixing matrix could not be reconstructed in any of the experiments. However, interestingly, the latter two algorithms do indeed ﬁnd an ISA up to permutation, which will be explained in section 3. theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. If this is the case, uniqueness up to permutation and scaling holds as noted above; however if k-ISA is applied to any random vector, a decomposition into groups that are only ‘as independent as possible’ cannot be unique and depends on the contrast and the algorithm. In the literature, ICA is often applied to ﬁnd representations fulﬁlling the independence condition as well as possible, however care has to be taken; the strong uniqueness result is not valid any more, and the results may depend on the algorithm as illustrated in ﬁgure 1. This work aims at ﬁnding an ISA model that allows applicability to any random vector. After reviewing previous approaches, we will provide such a model together with a corresponding uniqueness result and a preliminary algorithm. 1.2 Previous approaches to ISA for dependent component analysis Generalizations of the ICA model that are to include dependencies of multiple one-dimensional components have been studied for quite some time. ISA in the terminology of multidimensional ICA has ﬁrst been introduced by Cardoso [4] using geometrical motivations. His model as well as the related but independently proposed factorization of multivariate function classes [9] is quite general, however no identiﬁability results were presented, and applicability to an arbitrary random vector was unclear; later, in the special case of equal group sizes (k-ISA) uniqueness results have been extended from the ICA theory [11]. Algorithmic enhancements in this setting have been recently studied by [10]. Moreover, if the observation contain additional structures such as spatial or temporal structures, these may be used for the multidimensional separation [13]. Hyv¨ rinen and Hoyer presented a special case of k-ISA by combining it with invariant feature suba space analysis [7]. They model the dependence within a k-tuple explicitly and are therefore able to propose more efﬁcient algorithms without having to resort to the problematic multidimensional density estimation. A related relaxation of the ICA assumption is given by topographic ICA [8], where dependencies between all components are assumed and modelled along a topographic structure (e.g. a 2-dimensional grid). Bach and Jordan [2] formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. For the latter, they propose to use a tree-structure as employed by their tree dependepent component analysis. Together with inter-cluster independence, this implies a search for a transformation of the mixtures into a forest i.e. a set of disjoint trees. However, the above models are all semi-parametric and hence not fully blind. In the following, no additional structures are necessary for the separation. 1.3 General ISA Deﬁnition 1.1. A random vector S is said to be irreducible if it contains no lower-dimensional independent component. An invertible matrix W is called a (general) independent subspace analysis of X if WX = (S1 , . . . , Sk ) with pairwise independent, irreducible random vectors Si . Note that in this case, the Si are independent components of X. The idea behind this deﬁnition is that in contrast to ICA and k-ISA, we do not ﬁx the size of the groups Si in advance. Of course, some restriction is necessary, otherwise no decomposition would be enforced at all. This restriction is realized by allowing only irreducible components. The advantage of this formulation now is that it can clearly be applied to any random vector, although of course a trivial decomposition might be the result in the case of an irreducible random vector. Obvious indeterminacies of an ISA of X are, as mentioned above, scalings i.e. invertible transformations within each Si and permutation of Si of the same dimension1. These are already all indeterminacies as shown by the following theorem, which extends previous results in the case of ICA [6] and k-ISA [11], where also the additional slight assumptions on square-integrability i.e. on existing covariance have been made. Theorem 1.2. Given a random vector X with existing covariance and no Gaussian independent component, then an ISA of X exists and is unique except for scaling and permutation. Existence holds trivially but uniqueness is not obvious. Due to the limited space, we only give a short sketch of the proof in the following. The uniqueness result can easily be formulated as a subspace extraction problem, and theorem 1.2 follows readily from Lemma 1.3. Let S = (S1 , . . . , Sk ) be a square-integrable decomposition of S into irreducible independent components Si . If X is an irreducible component of S, then X ∼ Si for some i. Here the equivalence relation ∼ denotes equality except for an invertible transformation. The following two lemmata each give a simpliﬁcation of lemma 1.3 by ordering the components Si according to their dimensions. Some care has to be taken when showing that lemma 1.5 implies lemma 1.4. Lemma 1.4. Let S and X be deﬁned as in lemma 1.3. In addition assume that dim Si = dim X for i ≤ l and dim Si < dim X for i > l. Then X ∼ Si for some i ≤ l. Lemma 1.5. Let S and X be deﬁned as in lemma 1.4, and let l = 1 and k = 2. Then X ∼ S1 . In order to prove lemma 1.5 (and hence the theorem), it is sufﬁcient to show the following lemma: Lemma 1.6. Let S = (S1 , S2 ) with S1 irreducible and m := dim S1 > dim S2 =: n. If X = AS is again irreducible for some m × (m + n)-matrix A, then (i) the left m × m-submatrix of A is invertible, and (ii) if X is an independent component of S, the right m × n-submatrix of A vanishes. (i) follows after some linear algebra, and is necessary to show the more difﬁcult part (ii). For this, we follow the ideas presented in [12] using factorization of the joint characteristic function of S. 1.4 Dealing with Gaussians In the previous section, Gaussians had to be excluded (or at most one was allowed) in order to avoid additional indeterminacies. Indeed, any orthogonal transformation of two decorrelated hence independent Gaussians is again independent, so clearly such a strong identiﬁcation result would not be possible. Recently, a general decomposition model dealing with Gaussians was proposed in the form of the socalled non-Gaussian subspace analysis (NGSA) [3]. It tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made. More precisely, given a random vector X, a factorization X = AS with an invertible matrix A, S = (SN , SG ) and SN a square-integrable m-dimensional random vector is called an m-decomposition of X if SN and SG are stochastically independent and SG is Gaussian. In this case, X is said to be mdecomposable. X is denoted to be minimally n-decomposable if X is not (n − 1)-decomposable. According to our previous notation, SN and SG are independent components of X. It has been shown that the subspaces of such decompositions are unique [12]: 1 Note that scaling here implies a basis change in the component Si , so for example in the case of a twodimensional source component, this might be rotation and sheering. In the example later in ﬁgure 3, these indeterminacies can easily be seen by comparing true and estimated sources. Theorem 1.7 (Uniqueness of NGSA). The mixing matrix A of a minimal decomposition is unique except for transformations in each of the two subspaces. Moreover, explicit algorithms can be constructed for identifying the subspaces [3]. This result enables us to generalize theorem 1.2and to get a general decomposition theorem, which characterizes solutions of ISA. Theorem 1.8 (Existence and Uniqueness of ISA). Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part. Proof. Existence is obvious. Uniqueness follows after ﬁrst applying theorem 1.7 to X and then theorem 1.2 to the non-Gaussian part. 2 Joint block diagonalization with unknown block-sizes Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE) or in BSS relying on second-order temporal decorrelation. The task of (real) joint diagonalization (JD) of a set of symmetric real n×n matrices M := {M1 , . . . , MK } is to ﬁnd an orthogonal matrix K ˆ ˆ ˆ E such that E⊤ Mk E is diagonal for all k = 1, . . . , K i.e. to minimizef (E) := k=1 E⊤ Mk E − ˆ ⊤ Mk E) 2 with respect to the orthogonal matrix E, where diagM(M) produces a matrix ˆ ˆ diagM(E F where all off-diagonal elements of M have been set to zero, and M 2 := tr(MM⊤ ) denotes the F squared Frobenius norm. The Frobenius norm is invariant under conjugation by an orthogonal maˆ ˆ⊤ ˆ 2 trix, so minimizing f is equivalent to maximizing g(E) := K k=1 diag(E Mk E) , where now diag(M) := (mii )i denotes the diagonal of M. For the actual minimization of f respectively maximization of g, we will use the common approach of Jacobi-like optimization by iterative applications of Givens rotation in two coordinates [5]. 2.1 Generalization to blocks In the following we will use a generalization of JD in order to solve ISA problems. Instead of fully diagonalizing all n × n matrices Mk ∈ M, in joint block diagonalization (JBD) of M we want to determine E such that E⊤ Mk E is block-diagonal. Depending on the application, we ﬁx the blockstructure in advance or try to determine it from M. We are not interested in the order of the blocks, so the block-structure is uniquely speciﬁed by ﬁxing a partition of n i.e. a way of writing n as a sum of positive integers, where the order of the addends is not signiﬁcant. So let2 n = m1 + . . . + mr with m1 ≤ m2 ≤ . . . ≤ mr and set m := (m1 , . . . , mr ) ∈ Nr . An n × n matrix is said to be m-block diagonal if it is of the form   D1 · · · 0  . .  .. .   . . . . 0 · · · Dr with arbitrary mi × mi matrices Di . As generalization of JD in the case of known the block structure, we can formulate the joint mK ˆ ˆ⊤ ˆ block diagonalization (m-JBD) problem as the minimization of f m (E) := k=1 E Mk E − m ˆ⊤ ˆ 2 with respect to the orthogonal matrix E, where diagMm (M) produces a ˆ diagM (E Mk E) F m-block diagonal matrix by setting all other elements of M to zero. In practice due to estimation errors, such E will not exist, so we speak of approximate JBD and imply minimizing some errormeasure on non-block-diagonality. Indeterminacies of any m-JBD are m-scaling i.e. multiplication by an m-block diagonal matrix from the right, and m-permutation deﬁned by a permutation matrix that only swaps blocks of the same size. Finally, we speak of general JBD if we search for a JBD but no block structure is given; instead it is to be determined from the matrix set. For this it is necessary to require a block 2 We do not use the convention from Ferrers graphs of specifying partitions in decreasing order, as a visualization of increasing block-sizes seems to be preferable in our setting. structure of maximal length, otherwise trivial solutions or ‘in-between’ solutions could exist (and obviously contain high indeterminacies). Formally, E is said to be a (general) JBD of M if (E, m) = argmaxm | ∃E:f m (E)=0 |m|. In practice due to errors, a true JBD would always result in the trivial decomposition m = (n), so we deﬁne an approximate general JBD by requiring f m (E) < ǫ for some ﬁxed constant ǫ > 0 instead of f m (E) = 0. 2.2 JBD by JD A few algorithms to actually perform JBD have been proposed, see [1] and references therein. In the following we will simply perform joint diagonalization and then permute the columns of E to achieve block-diagonality — in experiments this turns out to be an efﬁcient solution to JBD [1]. This idea has been formulated in a conjecture [1] essentially claiming that a minimum of the JD cost function f already is a JBD i.e. a minimum of the function f m up to a permutation matrix. Indeed, in the conjecture it is required to use the Jacobi-update algorithm from [5], but this is not necessary, and we can prove the conjecture partially: We want to show that JD implies JBD up to permutation, i.e. if E is a minimum of f , then there exists a permutation P such that f m (EP) = 0 (given existence of a JBD of M). But of course f (EP) = f (E), so we will show why (certain) JBD solutions are minima of f . However, JD might have additional minima. First note that clearly not any JBD minimizes f , only those such that in each block of size mk , f (E) when restricted to the block is maximal over E ∈ O(mk ). We will call such a JBD block-optimal in the following. Theorem 2.1. Any block-optimal JBD of M (zero of f m ) is a local minimum of f . Proof. Let E ∈ O(n) be block-optimal with f m (E) = 0. We have to show that E is a local minimum of f or equivalently a local maximum of the squared diagonal sum g. After substituting each Mk by E⊤ Mk E, we may already assume that Mk is m-block diagonal, so we have to show that E = I is a local maximum of g. Consider the elementary Givens rotation Gij (ǫ) deﬁned for i < j and ǫ√ (−1, 1) as the orthogonal ∈ matrix, where all diagonal elements are 1 except for the two elements 1 − ǫ2 in rows i and j and with all off-diagonal elements equal to 0 except for the two elements ǫ and −ǫ at (i, j) and (j, i), respectively. It can be used to construct local coordinates of the d := n(n − 1)/2-dimensional manifold O(n) at I, simply by ι(ǫ12 , ǫ13 , . . . , ǫn−1,n ) := i < j. If i, j are in the same block of m, then h is locally maximal i.e. negative semi-deﬁnite at 0 in the direction ǫij because of block-optimality. Now assume i and j are from different blocks. After possible permutation, we may assume that j = i + 1 so that each matrix Mk ∈ M has (Mk )ij = (Mk )ji = 0, and ak := (Mk )ii , bk := (Mk )jj . Then Gij (ǫ)⊤ Mk Gij (ǫ) can be easily calculated at coordinates (i, i) to (j, j), and indeed entries on the diagonal other than at indices (i, i) and (j, j) are not changed, so diag(Gij (ǫ)⊤ Mk Gij (ǫ)) 2 2 − diag(Mk ) 2 = 2 = −2ak (ak − bk )ǫ + 2bk (ak − bk )ǫ + 2(ak − bk )2 ǫ4 = −2(a2 + b2 )ǫ2 + 2(ak − bk )2 ǫ4 . k k Hence h(0, . . . , 0, ǫij , 0, . . . , 0) − h(0) = −cǫ2 + dǫ4 with c = 2 K (a2 + b2 ) and d = ij ij k k=1 k K 2 k=1 (ak − bk )2 . Now either c = 0, then also d = 0 and h is constant zero in the direction ǫij . Or, more interestingly, c = 0, then c > 0 and therefore h is negative deﬁnite in the direction ǫij . Altogether we get a negative deﬁnite h at 0 except for ‘trivial directions’, and hence a local maximum at 0. 2.3 Recovering the permutation In order to perform JBD, we therefore only have to ﬁnd a JD E of M. What is left according to the above theorem is to ﬁnd a permutation matrix P such that EP block-diagonalizes M. In the case of known block-order m, we can employ similar techniques as used in [1, 10], which essentially ﬁnd P by some combinatorial optimization. 5 5 5 10 10 10 15 15 15 20 20 20 25 25 25 30 30 30 35 35 40 35 40 40 5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40 ˆ⊤ (a) (unknown) block diagonal M1 (b) E E w/o recovered permutation 5 10 15 20 25 30 35 40 ˆ⊤ (c) E E Figure 2: Performance of the proposed general JBD algorithm in the case of the (unknown) blockpartition 40 = 1 + 2 + 2 + 3 + 3 + 5 + 6 + 6 + 6 + 6 in the presence of noise with SNR of 5dB. The ˆ product E⊤ E of the inverse of the estimated block diagonalizer and the original one is an m-block diagonal matrix except for permutation within groups of the same sizes as claimed in section 2.2. In the case of unknown block-size, we propose to use the following simple permutation-recovery K algorithm: consider the mean diagonalized matrix D := K −1 k=1 E⊤ Mk E. Due to the assumption that M is m-block-diagonalizable (with unknown m), each E⊤ Mk E and hence also D must be m-block-diagonal except for a permutation P, so it must have the corresponding number of zeros in each column and row. In the approximate JBD case, thresholding with a threshold θ is necessary, whose choice is non-trivial. We propose using algorithm 1 to recover the permutation; we denote its resulting permuted matrix by P(D) when applied to the input D. P(D) is constructed from possibly thresholded D by iteratively permuting columns and rows in order to guarantee that all non-zeros of D are clustered along the diagonal as closely as possible. This recovers the permutation as well as the partition m of n. Algorithm 1: Block-diagonality permutation ﬁnder Input: (n × n)-matrix D Output: block-diagonal matrix P(D) := D′ such that D′ = PDPT for a permutation matrix P D′ ← D for i ← 1 to n do repeat if (j0 ← min{j|j ≥ i and d′ = 0 and d′ = 0}) exists then ij ji if (k0 ← min{k|k > j0 and (d′ = 0 or d′ = 0)}) exists then ik ki swap column j0 of D′ with column k0 swap row j0 of D′ with row k0 until no swap has occurred ; We illustrate the performance of the proposed JBD algorithm as follows: we generate a set of K = 100 m-block-diagonal matrices Dk of dimension 40 × 40 with m = (1, 2, 2, 3, 3, 5, 6, 6, 6, 6). They have been generated in blocks of size m with coefﬁcients chosen randomly uniform from [−1, 1], and symmetrized by Dk ← (Dk + D⊤ )/2. After that, they have been mixed by a random k orthogonal mixing matrix E ∈ O(40), i.e. Mk := EDk E⊤ + N, where N is a noise matrix with independent Gaussian entries such that the resulting signal-to-noise ratio is 5dB. Application of the JBD algorithm from above to {M1 , . . . , MK } with threshold θ = 0.1 correctly recovers the ˆ block sizes, and the estimated block diagonalizer E equals E up to m-scaling and permutation, as illustrated in ﬁgure 2. 3 SJADE — a simple algorithm for general ISA As usual by preprocessing of the observations X by whitening we may assume that Cov(X) = I. The indeterminacies allow scaling transformations in the sources, so without loss of generality let 5.5 6 5 5.5 5 1 5 2 4.5 4 3 5 4.5 4 3 4 2 5 4.5 4 3.5 4 3.5 6 3 1 2.5 0 5 3.5 3 7 3 2.5 8 4 3 2 2.5 2 5 4 2 1 1.5 7 8 9 10 11 12 13 14 2 3 4 5 (a) S2 6 7 8 9 1.5 3 3.5 4 (b) S3 4.5 5 5.5 6 6.5 10 1 0 7 (c) S4 5 9 3 2 0 1 14 3 4 5 6 7 8 9 10 ˆ (e) A−1 A −3 1 −3.5 4.5 2 (d) S5 13 250 0 −4 4 −1 −4.5 12 200 −5 −3 −6 −4 −6.5 11 150 −2 −5.5 3.5 −5 0 3 10 100 2.5 −1 −7 9 50 2 5 −2 4 3 −3 −7.5 2 −4 1.5 −6.5 −6 −5.5 −5 −4.5 −4 −3.5 −3 ˆ ˆ (f) (S1 , S2 ) −2.5 −2 0 −4 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 ˆ (g) histogram of S3 0 8 −4.5 −4 −3.5 −3 −2.5 −2 (h) S4 −1.5 −1 −0.5 −8 −7.5 −7 −6.5 −6 −5.5 −5 −4.5 (i) S5 −4 −3.5 −3 −2.5 1 −5 0 (j) S6 Figure 3: Example application of general ISA for unknown sizes m = (1, 2, 2, 2, 3). Shown are the ˆ scatter plots i.e. densities of the source components and the mixing-separating map A−1 A. also Cov(S) = I. Then I = Cov(X) = A Cov(S)A⊤ = AA⊤ so A is orthogonal. Due to the ISA assumptions, the fourth-order cross cumulants of the sources have to be trivial between different groups, and within the Gaussians. In order to ﬁnd transformations of the mixtures fulﬁlling this property, we follow the idea of the JADE algorithmbut now in the ISA setting. We perform JBD of the (whitened) contracted quadricovariance matrices deﬁned by Cij (X) := E X⊤ Eij XXX⊤ − Eij − E⊤ − tr(Eij )I. Here RX := Cov(X) and Eij is a set of eigen-matrices of Cij , 1 ≤ i, j ≤ n. ij One simple choice is to use n2 matrices Eij with zeros everywhere except 1 at index (i, j). More elaborate choices of eigen-matrices (with only n(n + 1)/2 or even n entries) are possible. The resulting algorithm, subspace-JADE (SJADE) not only performs NGCA by grouping Gaussians as one-dimensional components with trivial Cii ’s, but also automatically ﬁnds the subspace partition m using the general JBD algorithm from section 2.3. 4 Experimental results In a ﬁrst example, we consider a general ISA problem in dimension n = 10 with the unknown partition m = (1, 2, 2, 2, 3). In order to generate 2- and 3-dimensional irreducible random vectors, we decided to follow the nice visual ideas from [10] and to draw samples from a density following a known shape — in our case 2d-letters or 3d-geometrical shapes. The chosen sources densities are shown in ﬁgure 3(a-d). Another 1-dimensional source following a uniform distribution was constructed. Altogether 104 samples were used. The sources S were mixed by a mixing matrix A with coefﬁcients uniformly randomly sampled from [−1, 1] to give mixtures X = AS. The recovered ˆ mixing matrix A was then estimated using the above block-JADE algorithm with unknown block size; we observed that the method is quite sensitive to the choice of the threshold (here θ = 0.015). ˆ Figure 3(e) shows the composed mixing-separating system A−1 A; clearly the matrices are equal except for block permutation and scaling, which experimentally conﬁrms theorem 1.8. The algoˆ rithm found a partition m = (1, 1, 1, 2, 2, 3), so one 2d-source was misinterpreted as two 1d-sources, but by using previous knowledge combination of the correct two 1d-sources yields the original 2dˆ ˆ source. The resulting recovered sources S := A−1 X, ﬁgures 3(f-j), then equal the original sources except for permutation and scaling within the sources — which in the higher-dimensional cases implies transformations such as rotation of the underlying images or shapes. When applying ICA (1-ISA) to the above mixtures, we cannot expect to recover the original sources as explained in ﬁgure 1; however, some algorithms might recover the sources up to permutation. Indeed, SJADE equals JADE with additional permutation recovery because the joint block diagonalization is per- formed using joint diagonalization. This explains why JADE retrieves meaningful components even in this non-ICA setting as observed in [4]. In a second example, we illustrate how the algorithm deals with Gaussian sources i.e. how the subspace JADE also includes NGCA. For this we consider the case n = 5, m = (1, 1, 1, 2) and sources with two Gaussians, one uniform and a 2-dimensional irreducible component as before; 105 samples were drawn. We perform 100 Monte-Carlo simulations with random mixing matrix ˆ A, and apply SJADE with θ = 0.01. The recovered mixing matrix A is compared with A by 3 2 2 2 ˆ −1 A. Indeed, we get nearly taking the ad-hoc measure ι(P) := i=1 j=1 (pij + pji ) for P := A perfect recovery in 99 out of 100 runs, the median of ι(P) is very low with 0.0083. A single run diverges with ι(P ) = 3.48. In order to show that the algorithm really separates the Gaussian part from the other components, we compare the recovered source kurtoses. The median kurtoses are −0.0006 ± 0.02, −0.003 ± 0.3, −1.2 ± 0.3, −1.2 ± 0.2 and −1.6 ± 0.2. The ﬁrst two components have kurtoses close to zero, so they are the two Gaussians, whereas the third component has kurtosis of around −1.2, which equals the kurtosis of a uniform density. This conﬁrms the applicability of the algorithm in the general, noisy ISA setting. 5 Conclusion Previous approaches for independent subspace analysis were restricted either to ﬁxed group sizes or semi-parametric models. In neither case, general applicability to any kind of mixture data set was guaranteed, so blind source separation might fail. In the present contribution we introduce the concept of irreducible independent components and give an identiﬁability result for this general, parameter-free model together with a novel arbitrary-subspace-size algorithm based on joint block diagonalization. As in ICA, the main uniqueness theorem is an asymptotic result (but includes noisy case via NGCA). However in practice in the ﬁnite sample case, due to estimation errors the general joint block diagonality only approximately holds. Our simple solution in this contribution was to choose appropriate thresholds. But this choice is non-trivial, and adaptive methods are to be developed in future works. References [1] K. Abed-Meraim and A. Belouchrani. Algorithms for joint block diagonalization. In Proc. EUSIPCO 2004, pages 209–212, Vienna, Austria, 2004. [2] F.R. Bach and M.I. Jordan. Finding clusters in independent component analysis. In Proc. ICA 2003, pages 891–896, 2003. [3] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. M¨ ller. In search of non-gaussian u components of a high-dimensional distribution. JMLR, 7:247–282, 2006. [4] J.F. Cardoso. Multidimensional independent component analysis. In Proc. of ICASSP ’98, Seattle, 1998. [5] J.F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl., 17(1):161–164, January 1995. [6] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994. [7] A. Hyv¨ rinen and P.O. Hoyer. Emergence of phase and shift invariant features by decomposition of a natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000. [8] A. Hyv¨ rinen, P.O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computaa tion, 13(7):1525–1558, 2001. [9] J.K. Lin. Factorizing multivariate function classes. In Advances in Neural Information Processing Systems, volume 10, pages 563–569, 1998. [10] B. Poczos and A. L¨ rincz. Independent subspace analysis using k-nearest neighborhood distances. In o Proc. ICANN 2005, volume 3696 of LNCS, pages 163–168, Warsaw, Poland, 2005. Springer. [11] F.J. Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 84(5):951–956, 2004. [12] F.J. Theis and M. Kawanabe. Uniqueness of non-gaussian subspace analysis. In Proc. ICA 2006, pages 917–925, Charleston, USA, 2006. [13] R. Vollgraf and K. Obermayer. Multi-dimensional ICA to separate correlated sources. In Proc. NIPS 2001, pages 993–1000, 2001.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Towards a general independent subspace analysis Fabian J. [sent-1, score-0.173]

2 name Abstract The increasingly popular independent component analysis (ICA) may only be applied to data following the generative ICA model in order to guarantee algorithmindependent and theoretically valid results. [sent-4, score-0.131]

3 Subspace ICA models generalize the assumption of component independence to independence between groups of components. [sent-5, score-0.183]

4 By introducing the concept of irreducible independent subspaces or components, we present a generalization to a parameter-free mixture model. [sent-7, score-0.316]

5 After introducing this general model, we discuss joint block diagonalization with unknown block sizes, on which we base a simple extension of JADE to algorithmically perform the subspace analysis. [sent-9, score-0.643]

6 1 Independent subspace analysis A random vector Y is called an independent component of the random vector X, if there exists an invertible matrix A and a decomposition X = A(Y, Z) such that Y and Z are stochastically independent. [sent-11, score-0.502]

7 The goal of a general independent subspace analysis (ISA) or multidimensional independent component analysis is the decomposition of an arbitrary random vector X into independent components. [sent-12, score-0.501]

8 If X is to be decomposed into one-dimensional components, this coincides with ordinary independent component analysis (ICA). [sent-13, score-0.131]

9 Similarly, if the independent components are required to be of the same dimension k, then this is denoted by multidimensional ICA of ﬁxed group size k or simply k-ISA. [sent-14, score-0.19]

10 Without it, the result of any ICA or ISA algorithm cannot be compared with other solutions, so for instance blind source separation (BSS) would be impossible. [sent-21, score-0.105]

11 Clearly, given an ISA solution, invertible transforms in each component (scaling matrices L) as well as permutations of components of the same dimension (permutation matrices P) give again an ISA of X. [sent-22, score-0.338]

12 And indeed, in the special case of ICA, scaling and permutation are already all indeterminacies given that at most one Gaussian is contained in X [6]. [sent-23, score-0.459]

13 It has been shown that also for k-ISA, scalings and permutations as above are the only indeterminacies [11], given some additional rather weak restrictions to the model. [sent-25, score-0.232]

14 Indeed, crosstalking error 4 3 2 1 0 FastICA JADE Extended Infomax Figure 1: Applying ICA to a random vector X = AS that does not fulﬁll the ICA model; here S is chosen to consist of a two-dimensional and a one-dimensional irreducible component. [sent-27, score-0.219]

15 Shown are the statistics over 100 runs of the Amari error of the random original and the reconstructed mixing matrix using the three ICA-algorithms FastICA, JADE and Extended Infomax. [sent-28, score-0.103]

16 Clearly, the original mixing matrix could not be reconstructed in any of the experiments. [sent-29, score-0.103]

17 theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. [sent-31, score-0.24]

18 If this is the case, uniqueness up to permutation and scaling holds as noted above; however if k-ISA is applied to any random vector, a decomposition into groups that are only ‘as independent as possible’ cannot be unique and depends on the contrast and the algorithm. [sent-32, score-0.587]

19 In the literature, ICA is often applied to ﬁnd representations fulﬁlling the independence condition as well as possible, however care has to be taken; the strong uniqueness result is not valid any more, and the results may depend on the algorithm as illustrated in ﬁgure 1. [sent-33, score-0.181]

20 After reviewing previous approaches, we will provide such a model together with a corresponding uniqueness result and a preliminary algorithm. [sent-35, score-0.142]

21 2 Previous approaches to ISA for dependent component analysis Generalizations of the ICA model that are to include dependencies of multiple one-dimensional components have been studied for quite some time. [sent-37, score-0.121]

22 Moreover, if the observation contain additional structures such as spatial or temporal structures, these may be used for the multidimensional separation [13]. [sent-41, score-0.106]

23 A related relaxation of the ICA assumption is given by topographic ICA [8], where dependencies between all components are assumed and modelled along a topographic structure (e. [sent-44, score-0.123]

24 Bach and Jordan [2] formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. [sent-47, score-0.107]

25 A random vector S is said to be irreducible if it contains no lower-dimensional independent component. [sent-57, score-0.319]

26 An invertible matrix W is called a (general) independent subspace analysis of X if WX = (S1 , . [sent-58, score-0.323]

27 , Sk ) with pairwise independent, irreducible random vectors Si . [sent-61, score-0.219]

28 Note that in this case, the Si are independent components of X. [sent-62, score-0.116]

29 This restriction is realized by allowing only irreducible components. [sent-65, score-0.219]

30 The advantage of this formulation now is that it can clearly be applied to any random vector, although of course a trivial decomposition might be the result in the case of an irreducible random vector. [sent-66, score-0.331]

31 Obvious indeterminacies of an ISA of X are, as mentioned above, scalings i. [sent-67, score-0.205]

32 invertible transformations within each Si and permutation of Si of the same dimension1. [sent-69, score-0.398]

33 These are already all indeterminacies as shown by the following theorem, which extends previous results in the case of ICA [6] and k-ISA [11], where also the additional slight assumptions on square-integrability i. [sent-70, score-0.174]

34 Given a random vector X with existing covariance and no Gaussian independent component, then an ISA of X exists and is unique except for scaling and permutation. [sent-75, score-0.18]

35 The uniqueness result can easily be formulated as a subspace extraction problem, and theorem 1. [sent-78, score-0.299]

36 , Sk ) be a square-integrable decomposition of S into irreducible independent components Si . [sent-84, score-0.395]

37 If X is an irreducible component of S, then X ∼ Si for some i. [sent-85, score-0.287]

38 Here the equivalence relation ∼ denotes equality except for an invertible transformation. [sent-86, score-0.156]

39 In addition assume that dim Si = dim X for i ≤ l and dim Si < dim X for i > l. [sent-96, score-0.296]

40 Let S = (S1 , S2 ) with S1 irreducible and m := dim S1 > dim S2 =: n. [sent-106, score-0.367]

41 If X = AS is again irreducible for some m × (m + n)-matrix A, then (i) the left m × m-submatrix of A is invertible, and (ii) if X is an independent component of S, the right m × n-submatrix of A vanishes. [sent-107, score-0.35]

42 Indeed, any orthogonal transformation of two decorrelated hence independent Gaussians is again independent, so clearly such a strong identiﬁcation result would not be possible. [sent-112, score-0.114]

43 Recently, a general decomposition model dealing with Gaussians was proposed in the form of the socalled non-Gaussian subspace analysis (NGSA) [3]. [sent-113, score-0.17]

44 It tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made. [sent-114, score-0.259]

45 More precisely, given a random vector X, a factorization X = AS with an invertible matrix A, S = (SN , SG ) and SN a square-integrable m-dimensional random vector is called an m-decomposition of X if SN and SG are stochastically independent and SG is Gaussian. [sent-115, score-0.263]

46 According to our previous notation, SN and SG are independent components of X. [sent-118, score-0.116]

47 It has been shown that the subspaces of such decompositions are unique [12]: 1 Note that scaling here implies a basis change in the component Si , so for example in the case of a twodimensional source component, this might be rotation and sheering. [sent-119, score-0.245]

48 In the example later in ﬁgure 3, these indeterminacies can easily be seen by comparing true and estimated sources. [sent-120, score-0.174]

49 The mixing matrix A of a minimal decomposition is unique except for transformations in each of the two subspaces. [sent-123, score-0.26]

50 2and to get a general decomposition theorem, which characterizes solutions of ISA. [sent-126, score-0.087]

51 Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part. [sent-129, score-0.655]

52 2 Joint block diagonalization with unknown block-sizes Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE) or in BSS relying on second-order temporal decorrelation. [sent-135, score-0.421]

53 The task of (real) joint diagonalization (JD) of a set of symmetric real n×n matrices M := {M1 , . [sent-136, score-0.195]

54 , MK } is to ﬁnd an orthogonal matrix K ˆ ˆ ˆ E such that E⊤ Mk E is diagonal for all k = 1, . [sent-139, score-0.151]

55 to minimizef (E) := k=1 E⊤ Mk E − ˆ ⊤ Mk E) 2 with respect to the orthogonal matrix E, where diagM(M) produces a matrix ˆ ˆ diagM(E F where all off-diagonal elements of M have been set to zero, and M 2 := tr(MM⊤ ) denotes the F squared Frobenius norm. [sent-144, score-0.135]

56 The Frobenius norm is invariant under conjugation by an orthogonal maˆ ˆ⊤ ˆ 2 trix, so minimizing f is equivalent to maximizing g(E) := K k=1 diag(E Mk E) , where now diag(M) := (mii )i denotes the diagonal of M. [sent-145, score-0.132]

57 Instead of fully diagonalizing all n × n matrices Mk ∈ M, in joint block diagonalization (JBD) of M we want to determine E such that E⊤ Mk E is block-diagonal. [sent-149, score-0.369]

58 An n × n matrix is said to be m-block diagonal if it is of the form   D1 · · · 0  . [sent-164, score-0.137]

59 multiplication by an m-block diagonal matrix from the right, and m-permutation deﬁned by a permutation matrix that only swaps blocks of the same size. [sent-178, score-0.415]

60 Finally, we speak of general JBD if we search for a JBD but no block structure is given; instead it is to be determined from the matrix set. [sent-179, score-0.25]

61 For this it is necessary to require a block 2 We do not use the convention from Ferrers graphs of specifying partitions in decreasing order, as a visualization of increasing block-sizes seems to be preferable in our setting. [sent-180, score-0.198]

62 structure of maximal length, otherwise trivial solutions or ‘in-between’ solutions could exist (and obviously contain high indeterminacies). [sent-181, score-0.106]

63 In practice due to errors, a true JBD would always result in the trivial decomposition m = (n), so we deﬁne an approximate general JBD by requiring f m (E) < ǫ for some ﬁxed constant ǫ > 0 instead of f m (E) = 0. [sent-183, score-0.112]

64 In the following we will simply perform joint diagonalization and then permute the columns of E to achieve block-diagonality — in experiments this turns out to be an efﬁcient solution to JBD [1]. [sent-186, score-0.154]

65 a minimum of the function f m up to a permutation matrix. [sent-189, score-0.241]

66 if E is a minimum of f , then there exists a permutation P such that f m (EP) = 0 (given existence of a JBD of M). [sent-192, score-0.295]

67 First note that clearly not any JBD minimizes f , only those such that in each block of size mk , f (E) when restricted to the block is maximal over E ∈ O(mk ). [sent-195, score-0.627]

68 If i, j are in the same block of m, then h is locally maximal i. [sent-209, score-0.174]

69 After possible permutation, we may assume that j = i + 1 so that each matrix Mk ∈ M has (Mk )ij = (Mk )ji = 0, and ak := (Mk )ii , bk := (Mk )jj . [sent-213, score-0.17]

70 , 0) − h(0) = −cǫ2 + dǫ4 with c = 2 K (a2 + b2 ) and d = ij ij k k=1 k K 2 k=1 (ak − bk )2 . [sent-221, score-0.172]

71 3 Recovering the permutation In order to perform JBD, we therefore only have to ﬁnd a JD E of M. [sent-226, score-0.241]

72 What is left according to the above theorem is to ﬁnd a permutation matrix P such that EP block-diagonalizes M. [sent-227, score-0.33]

73 The ˆ product E⊤ E of the inverse of the estimated block diagonalizer and the original one is an m-block diagonal matrix except for permutation within groups of the same sizes as claimed in section 2. [sent-230, score-0.677]

74 Due to the assumption that M is m-block-diagonalizable (with unknown m), each E⊤ Mk E and hence also D must be m-block-diagonal except for a permutation P, so it must have the corresponding number of zeros in each column and row. [sent-233, score-0.32]

75 This recovers the permutation as well as the partition m of n. [sent-237, score-0.269]

76 After that, they have been mixed by a random k orthogonal mixing matrix E ∈ O(40), i. [sent-240, score-0.154]

77 Mk := EDk E⊤ + N, where N is a noise matrix with independent Gaussian entries such that the resulting signal-to-noise ratio is 5dB. [sent-242, score-0.105]

78 1 correctly recovers the ˆ block sizes, and the estimated block diagonalizer E equals E up to m-scaling and permutation, as illustrated in ﬁgure 2. [sent-247, score-0.387]

79 The indeterminacies allow scaling transformations in the sources, so without loss of generality let 5. [sent-249, score-0.267]

80 densities of the source components and the mixing-separating map A−1 A. [sent-298, score-0.092]

81 Due to the ISA assumptions, the fourth-order cross cumulants of the sources have to be trivial between different groups, and within the Gaussians. [sent-301, score-0.124]

82 ij One simple choice is to use n2 matrices Eij with zeros everywhere except 1 at index (i, j). [sent-305, score-0.142]

83 The resulting algorithm, subspace-JADE (SJADE) not only performs NGCA by grouping Gaussians as one-dimensional components with trivial Cii ’s, but also automatically ﬁnds the subspace partition m using the general JBD algorithm from section 2. [sent-307, score-0.243]

84 In order to generate 2- and 3-dimensional irreducible random vectors, we decided to follow the nice visual ideas from [10] and to draw samples from a density following a known shape — in our case 2d-letters or 3d-geometrical shapes. [sent-310, score-0.219]

85 The sources S were mixed by a mixing matrix A with coefﬁcients uniformly randomly sampled from [−1, 1] to give mixtures X = AS. [sent-314, score-0.21]

86 The recovered ˆ mixing matrix A was then estimated using the above block-JADE algorithm with unknown block size; we observed that the method is quite sensitive to the choice of the threshold (here θ = 0. [sent-315, score-0.347]

87 ˆ Figure 3(e) shows the composed mixing-separating system A−1 A; clearly the matrices are equal except for block permutation and scaling, which experimentally conﬁrms theorem 1. [sent-317, score-0.551]

88 The resulting recovered sources S := A−1 X, ﬁgures 3(f-j), then equal the original sources except for permutation and scaling within the sources — which in the higher-dimensional cases implies transformations such as rotation of the underlying images or shapes. [sent-320, score-0.672]

89 When applying ICA (1-ISA) to the above mixtures, we cannot expect to recover the original sources as explained in ﬁgure 1; however, some algorithms might recover the sources up to permutation. [sent-321, score-0.144]

90 Indeed, SJADE equals JADE with additional permutation recovery because the joint block diagonalization is per- formed using joint diagonalization. [sent-322, score-0.64]

91 For this we consider the case n = 5, m = (1, 1, 1, 2) and sources with two Gaussians, one uniform and a 2-dimensional irreducible component as before; 105 samples were drawn. [sent-327, score-0.359]

92 We perform 100 Monte-Carlo simulations with random mixing matrix ˆ A, and apply SJADE with θ = 0. [sent-328, score-0.103]

93 The recovered mixing matrix A is compared with A by 3 2 2 2 ˆ −1 A. [sent-330, score-0.142]

94 The ﬁrst two components have kurtoses close to zero, so they are the two Gaussians, whereas the third component has kurtosis of around −1. [sent-347, score-0.194]

95 5 Conclusion Previous approaches for independent subspace analysis were restricted either to ﬁxed group sizes or semi-parametric models. [sent-350, score-0.211]

96 In neither case, general applicability to any kind of mixture data set was guaranteed, so blind source separation might fail. [sent-351, score-0.14]

97 In the present contribution we introduce the concept of irreducible independent components and give an identiﬁability result for this general, parameter-free model together with a novel arbitrary-subspace-size algorithm based on joint block diagonalization. [sent-352, score-0.555]

98 As in ICA, the main uniqueness theorem is an asymptotic result (but includes noisy case via NGCA). [sent-353, score-0.189]

99 However in practice in the ﬁnite sample case, due to estimation errors the general joint block diagonality only approximately holds. [sent-354, score-0.22]

100 Emergence of phase and shift invariant features by decomposition of a natural images into independent feature subspaces. [sent-404, score-0.146]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('jbd', 0.465), ('isa', 0.407), ('mk', 0.279), ('permutation', 0.241), ('ica', 0.235), ('irreducible', 0.219), ('indeterminacies', 0.174), ('block', 0.174), ('uniqueness', 0.142), ('jade', 0.135), ('jd', 0.135), ('subspace', 0.11), ('invertible', 0.108), ('diagonalization', 0.108), ('bss', 0.078), ('sjade', 0.078), ('multidimensional', 0.074), ('dim', 0.074), ('si', 0.073), ('sources', 0.072), ('component', 0.068), ('gij', 0.068), ('bk', 0.066), ('eij', 0.064), ('independent', 0.063), ('ak', 0.062), ('sg', 0.061), ('cov', 0.061), ('lemma', 0.061), ('mixing', 0.061), ('decomposition', 0.06), ('diagm', 0.058), ('diagonal', 0.058), ('components', 0.053), ('ij', 0.053), ('gaussians', 0.052), ('trivial', 0.052), ('orthogonal', 0.051), ('transformations', 0.049), ('except', 0.048), ('theorem', 0.047), ('joint', 0.046), ('scaling', 0.044), ('swap', 0.043), ('matrix', 0.042), ('matrices', 0.041), ('ful', 0.041), ('dk', 0.041), ('independence', 0.039), ('source', 0.039), ('sn', 0.039), ('recovered', 0.039), ('cardoso', 0.039), ('diagonalizer', 0.039), ('fabian', 0.039), ('givens', 0.039), ('kurtoses', 0.039), ('ngca', 0.039), ('ngsa', 0.039), ('hyv', 0.039), ('rinen', 0.039), ('sizes', 0.038), ('said', 0.037), ('groups', 0.037), ('indeed', 0.036), ('mixtures', 0.035), ('topographic', 0.035), ('rotation', 0.035), ('applicability', 0.035), ('subspaces', 0.034), ('blind', 0.034), ('mr', 0.034), ('kurtosis', 0.034), ('theis', 0.034), ('speak', 0.034), ('blocks', 0.032), ('separation', 0.032), ('unknown', 0.031), ('scalings', 0.031), ('existence', 0.029), ('conjecture', 0.029), ('diag', 0.029), ('hoyer', 0.029), ('fastica', 0.029), ('partition', 0.028), ('cij', 0.027), ('permutations', 0.027), ('solutions', 0.027), ('ep', 0.027), ('bach', 0.026), ('lling', 0.026), ('stochastically', 0.026), ('exists', 0.025), ('decompositions', 0.025), ('recovery', 0.025), ('factorization', 0.024), ('gure', 0.024), ('necessary', 0.024), ('altogether', 0.024), ('invariant', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 194 nips-2006-Towards a general independent subspace analysis

Author: Fabian J. Theis

2 0.13499278 46 nips-2006-Blind source separation for over-determined delayed mixtures

Author: Lars Omlor, Martin Giese

Abstract: Blind source separation, i.e. the extraction of unknown sources from a set of given signals, is relevant for many applications. A special case of this problem is dimension reduction, where the goal is to approximate a given set of signals by superpositions of a minimal number of sources. Since in this case the signals outnumber the sources the problem is over-determined. Most popular approaches for addressing this problem are based on purely linear mixing models. However, many applications like the modeling of acoustic signals, EMG signals, or movement trajectories, require temporal shift-invariance of the extracted components. This case has only rarely been treated in the computational literature, and speciﬁcally for the case of dimension reduction almost no algorithms have been proposed. We present a new algorithm for the solution of this problem, which is based on a timefrequency transformation (Wigner-Ville distribution) of the generative model. We show that this algorithm outperforms classical source separation algorithms for linear mixtures, and also a related method for mixtures with delays. In addition, applying the new algorithm to trajectories of human gaits, we demonstrate that it is suitable for the extraction of spatio-temporal components that are easier to interpret than components extracted with other classical algorithms. 1

3 0.13342325 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

Author: J.t. Lindgren, Aapo Hyvärinen

Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive ﬁeld characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1

4 0.09418314 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1

5 0.066211537 98 nips-2006-Inferring Network Structure from Co-Occurrences

Author: Michael G. Rabbat, Mário Figueiredo, Robert Nowak

Abstract: We consider the problem of inferring the structure of a network from cooccurrence data: observations that indicate which nodes occur in a signaling pathway but do not directly reveal node order within the pathway. This problem is motivated by network inference problems arising in computational biology and communication systems, in which it is difﬁcult or impossible to obtain precise time ordering information. Without order information, every permutation of the activated nodes leads to a different feasible solution, resulting in combinatorial explosion of the feasible set. However, physical principles underlying most networked systems suggest that not all feasible solutions are equally likely. Intuitively, nodes that co-occur more frequently are probably more closely connected. Building on this intuition, we model path co-occurrences as randomly shufﬂed samples of a random walk on the network. We derive a computationally efﬁcient network inference algorithm and, via novel concentration inequalities for importance sampling estimators, prove that a polynomial complexity Monte Carlo version of the algorithm converges with high probability. 1

6 0.054195337 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data

7 0.052087024 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification

8 0.051648799 65 nips-2006-Denoising and Dimension Reduction in Feature Space

9 0.050606892 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

10 0.049545433 116 nips-2006-Learning from Multiple Sources

11 0.048152365 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

12 0.048142537 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments

13 0.047436055 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

14 0.047278944 181 nips-2006-Stability of $K$-Means Clustering

15 0.046409078 175 nips-2006-Simplifying Mixture Models through Function Approximation

16 0.046308845 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

17 0.045625784 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

18 0.044606097 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

19 0.043533266 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems

20 0.042380195 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.149), (1, 0.023), (2, 0.019), (3, 0.017), (4, -0.025), (5, -0.059), (6, 0.08), (7, 0.014), (8, -0.057), (9, 0.109), (10, 0.017), (11, -0.04), (12, 0.029), (13, -0.1), (14, 0.056), (15, -0.125), (16, 0.088), (17, 0.041), (18, 0.132), (19, -0.06), (20, -0.053), (21, -0.002), (22, 0.01), (23, 0.016), (24, 0.063), (25, 0.018), (26, 0.012), (27, -0.036), (28, -0.032), (29, 0.095), (30, 0.004), (31, -0.021), (32, 0.022), (33, 0.008), (34, 0.15), (35, -0.079), (36, -0.096), (37, -0.057), (38, -0.114), (39, 0.041), (40, 0.016), (41, -0.042), (42, 0.009), (43, 0.019), (44, -0.049), (45, -0.013), (46, -0.059), (47, 0.134), (48, -0.131), (49, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94837636 194 nips-2006-Towards a general independent subspace analysis

Author: Fabian J. Theis

2 0.68460578 46 nips-2006-Blind source separation for over-determined delayed mixtures

Author: Lars Omlor, Martin Giese

3 0.6021204 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

Author: J.t. Lindgren, Aapo Hyvärinen

4 0.51242399 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments

Author: Michael I. Mandel, Daniel P. Ellis, Tony Jebara

Abstract: We present a method for localizing and separating sound sources in stereo recordings that is robust to reverberation and does not make any assumptions about the source statistics. The method consists of a probabilistic model of binaural multisource recordings and an expectation maximization algorithm for ﬁnding the maximum likelihood parameters of that model. These parameters include distributions over delays and assignments of time-frequency regions to sources. We evaluate this method against two comparable algorithms on simulations of simultaneous speech from two or three sources. Our method outperforms the others in anechoic conditions and performs as well as the better of the two in the presence of reverberation. 1

5 0.44505784 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

6 0.40939134 116 nips-2006-Learning from Multiple Sources

7 0.37512034 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data

8 0.3652547 149 nips-2006-Nonnegative Sparse PCA

9 0.34489411 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis

10 0.32258525 129 nips-2006-Map-Reduce for Machine Learning on Multicore

11 0.31521896 181 nips-2006-Stability of $K$-Means Clustering

12 0.31378481 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

13 0.30308363 153 nips-2006-Online Clustering of Moving Hyperplanes

14 0.30192271 175 nips-2006-Simplifying Mixture Models through Function Approximation

15 0.29671669 57 nips-2006-Conditional mean field

16 0.29487303 117 nips-2006-Learning on Graph with Laplacian Regularization

17 0.29445639 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

18 0.29377705 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems

19 0.29339617 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

20 0.28992 98 nips-2006-Inferring Network Structure from Co-Occurrences

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.121), (3, 0.02), (7, 0.052), (9, 0.056), (20, 0.013), (22, 0.086), (44, 0.075), (57, 0.062), (65, 0.043), (69, 0.028), (71, 0.013), (92, 0.332)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8344928 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

Author: Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers

Abstract: Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the speciﬁc words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being speciﬁc to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a speciﬁc word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-sematic indexing) or that only match documents at the speciﬁc word level (such as TF-IDF). 1 Introduction and Motivation Reducing high-dimensional data vectors to robust and interpretable lower-dimensional representations has a long and successful history in data analysis, including recent innovations such as latent semantic indexing (LSI) (Deerwester et al, 1994) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). These types of techniques have found broad application in modeling of sparse high-dimensional count data such as the “bag of words” representations for documents or transaction data for Web and retail applications. Approaches such as LSI and LDA have both been shown to be useful for “object matching” in their respective latent spaces. In information retrieval for example, both a query and a set of documents can be represented in the LSI or topic latent spaces, and the documents can be ranked in terms of how well they match the query based on distance or similarity in the latent space. The mapping to latent space represents a generalization or abstraction away from the sparse set of observed words, to a “higher-level” semantic representation in the latent space. These abstractions in principle lead to better generalization on new data compared to inferences carried out directly in the original sparse high-dimensional space. The capability of these models to provide improved generalization has been demonstrated empirically in a number of studies (e.g., Deerwester et al 1994; Hofmann 1999; Canny 2004; Buntine et al, 2005). However, while this type of generalization is broadly useful in terms of inference and prediction, there are situations where one can over-generalize. Consider trying to match the following query to a historical archive of news articles: election + campaign + Camejo. The query is intended to ﬁnd documents that are about US presidential campaigns and also about Peter Camejo (who ran as vice-presidential candidate alongside independent Ralph Nader in 2004). LSI and topic models are likely to highly rank articles that are related to presidential elections (even if they don’t necessarily contain the words election or campaign). However, a potential problem is that the documents that are highly ranked by LSI or topic models need not include any mention of the name Camejo. The reason is that the combination of words in this query is likely to activate one or more latent variables related to the concept of presidential campaigns. However, once this generalization is made the model has “lost” the information about the speciﬁc word Camejo and it will only show up in highly ranked documents if this word happens to frequently occur in these topics (unlikely in this case given that this candidate received relatively little media coverage compared to the coverage given to the candidates from the two main parties). But from the viewpoint of the original query, our preference would be to get documents that are about the general topic of US presidential elections with the speciﬁc constraint that they mention Peter Camejo. Word-based retrieval techniques, such as the widely-used term-frequency inverse-documentfrequency (TF-IDF) method, have the opposite problem in general. They tend to be overly speciﬁc in terms of matching words in the query to documents. In general of course one would like to have a balance between generality and speciﬁcity. One ad hoc approach is to combine scores from a general method such as LSI with those from a more speciﬁc method such as TF-IDF in some manner, and indeed this technique has been proposed in information retrieval (Vogt and Cottrell, 1999). Similarly, in the ad hoc LDA approach (Wei and Croft, 2006), the LDA model is linearly combined with document-speciﬁc word distributions to capture both general as well as speciﬁc information in documents. However, neither method is entirely satisfactory since it is not clear how to trade-off generality and speciﬁcity in a principled way. The contribution of this paper is a new graphical model based on latent topics that handles the tradeoff between generality and speciﬁcity in a fully probabilistic and automated manner. The model, which we call the special words with background (SWB) model, is an extension of the LDA model. The new model allows words in documents to be modeled as either originating from general topics, or from document-speciﬁc “special” word distributions, or from a corpus-wide background distribution. The idea is that words in a document such as election and campaign are likely to come from a general topic on presidential elections, whereas a name such as Camejo is much more likely to be treated as “non-topical” and speciﬁc to that document. Words in queries are automatically interpreted (in a probabilistic manner) as either being topical or special, in the context of each document, allowing for a data-driven document-speciﬁc trade-off between the beneﬁts of topic-based abstraction and speciﬁc word matching. Daum´ and Marcu (2006) independently proposed a probabilistic e model using similar concepts for handling different training and test distributions in classiﬁcation problems. Although we have focused primarily on documents in information retrieval in the discussion above, the model we propose can in principle be used on any large sparse matrix of count data. For example, transaction data sets where rows are individuals and columns correspond to items purchased or Web sites visited are ideally suited to this approach. The latent topics can capture broad patterns of population behavior and the “special word distributions” can capture the idiosyncracies of speciﬁc individuals. Section 2 reviews the basic principles of the LDA model and introduces the new SWB model. Section 3 illustrates how the model works in practice using examples from New York Times news articles. In Section 4 we describe a number of experiments with 4 different document sets, including perplexity experiments and information retrieval experiments, illustrating the trade-offs between generalization and speciﬁcity for different models. Section 5 contains a brief discussion and concluding comments. 2 A Topic Model for Special Words Figure 1(a) shows the graphical model for what we will refer to as the “standard topic model” or LDA. There are D documents and document d has Nd words. α and β are ﬁxed parameters of symmetric Dirichlet priors for the D document-topic multinomials represented by θ and the T topicword multinomials represented by φ. In the generative model, for each document d, the Nd words (a) β2 β1 α γ Ω ψ θ λ β0 z x φ w (b) α θ β z φ w Nd T T D Nd D Figure 1: Graphical models for (a) the standard LDA topic model (left) and (b) the proposed special words topic model with a background distribution (SWB) (right). are generated by drawing a topic t from the document-topic distribution p(z|θd ) and then drawing a word w from the topic-word distribution p(w|z = t, φt ). As shown in Grifﬁths and Steyvers (2004) the topic assignments z for each word token in the corpus can be efﬁciently sampled via Gibbs sampling (after marginalizing over θ and φ). Point estimates for the θ and φ distributions can be computed conditioned on a particular sample, and predictive distributions can be obtained by averaging over multiple samples. We will refer to the proposed model as the special words topic model with background distribution (SWB) (Figure 1(b)). SWB has a similar general structure to the LDA model (Figure 1(a)) but with additional machinery to handle special words and background words. In particular, associated with each word token is a latent random variable x, taking value x = 0 if the word w is generated via the topic route, value x = 1 if the word is generated as a special word (for that document) and value x = 2 if the word is generated from a background distribution speciﬁc for the corpus. The variable x acts as a switch: if x = 0, the previously described standard topic mechanism is used to generate the word, whereas if x = 1 or x = 2, words are sampled from a document-speciﬁc multinomial Ψ or a corpus speciﬁc multinomial Ω (with symmetric Dirichlet priors parametrized by β1 and β2 ) respectively. x is sampled from a document-speciﬁc multinomial λ, which in turn has a symmetric Dirichlet prior, γ. One could also use a hierarchical Bayesian approach to introduce another level of uncertainty about the Dirichlet priors (e.g., see Blei, Ng, and Jordan, 2003)—we have not investigated this option, primarily for computational reasons. In all our experiments, we set α = 0.1, β0 = β2 = 0.01, β1 = 0.0001 and γ = 0.3—all weak symmetric priors. The conditional probability of a word w given a document d can be written as: T p(w|z = t)p(z = t|d) + p(x = 1|d)p′ (w|d) + p(x = 2|d)p′′ (w) p(w|d) = p(x = 0|d) t=1 ′ where p (w|d) is the special word distribution for document d, and p′′ (w) is the background word distribution for the corpus. Note that when compared to the standard topic model the SWB model can explain words in three different ways, via topics, via a special word distribution, or via a background word distribution. Given the graphical model above, it is relatively straightforward to derive Gibbs sampling equations that allow joint sampling of the zi and xi latent variables for each word token wi , for xi = 0: p (xi = 0, zi = t |w, x−i , z−i , α, β0 , γ ) ∝ Nd0,−i + γ × Nd,−i + 3γ TD Ctd,−i + α t′ TD Ct′ d,−i + T α × WT Cwt,−i + β0 WT w ′ Cw ′ t,−i + W β0 and for xi = 1: p (xi = 1 |w, x−i , z−i , β1 , γ ) ∝ Nd1,−i + γ × Nd,−i + 3γ WD Cwd,−i + β1 w′ WD Cw′ d,−i + W β1 e mail krugman nytimes com memo to critics of the media s liberal bias the pinkos you really should be going after are those business reporters even i was startled by the tone of the jan 21 issue of investment news which describes itself as the weekly newspaper for financial advisers the headline was paul o neill s sweet deal the blurb was irs backs off closing loophole averting tax liability for execs and treasury chief it s not really news that the bush administration likes tax breaks for businessmen but two weeks later i learned from the wall street journal that this loophole is more than a tax break for businessmen it s a gift to biznesmen and it may be part of a larger pattern confused in the former soviet union the term biznesmen pronounced beeznessmen refers to the class of sudden new rich who emerged after the fall of communism and who generally got rich by using their connections to strip away the assets of public enterprises what we ve learned from enron and other players to be named later is that america has its own biznesmen and that we need to watch out for policies that make it easier for them to ply their trade it turns out that the sweet deal investment news was referring to the use of split premium life insurance policies to give executives largely tax free compensation you don t want to know the details is an even sweeter deal for executives of companies that go belly up it shields their wealth from creditors and even from lawsuits sure enough reports the wall street journal former enron c e o s kenneth lay and jeffrey skilling both had large split premium policies so what other pro biznes policies have been promulgated lately last year both houses of … john w snow was paid more than 50 million in salary bonus and stock in his nearly 12 years as chairman of the csx corporation the railroad company during that period the company s profits fell and its stock rose a bit more than half as much as that of the average big company mr snow s compensation amid csx s uneven performance has drawn criticism from union officials and some corporate governance specialists in 2000 for example after the stock had plunged csx decided to reverse a 25 million loan to him the move is likely to get more scrutiny after yesterday s announcement that mr snow has been chosen by president bush to replace paul o neill as the treasury secretary like mr o neill mr snow is an outsider on wall street but an insider in corporate america with long experience running an industrial company some wall street analysts who follow csx said yesterday that mr snow had ably led the company through a difficult period in the railroad industry and would make a good treasury secretary it s an excellent nomination said jill evans an analyst at j p morgan who has a neutral rating on csx stock i think john s a great person for the administration he as the c e o of a railroad has probably touched every sector of the economy union officials are less complimentary of mr snow s performance at csx last year the a f l c i o criticized him and csx for the company s decision to reverse the loan allowing him to return stock he had purchased with the borrowed money at a time when independent directors are in demand a corporate governance specialist said recently that mr snow had more business relationships with members of his own board than any other chief executive in addition mr snow is the third highest paid of 37 chief executives of transportation companies said ric marshall chief executive of the corporate library which provides specialized investment research into corporate boards his own compensation levels have been pretty high mr marshall said he could afford to take a public service job a csx program in 1996 allowed mr snow and other top csx executives to buy… Figure 2: Examples of two news articles with special words (as inferred by the model) shaded in gray. (a) upper, email article with several colloquialisms, (b) lower, article about CSX corporation. and for xi = 2: p (xi = 2 |w, x−i , z−i , β2 , γ ) ∝ Nd2,−i + γ × Nd,−i + 3γ W Cw,−i + β2 W w ′ Cw ′ ,−i + W β2 where the subscript −i indicates that the count for word token i is removed, Nd is the number of words in document d and Nd0 , Nd1 and Nd2 are the number of words in document d assigned to the W W W latent topics, special words and background component, respectively, CwtT , CwdD and Cw are the number of times word w is assigned to topic t, to the special-words distribution of document d, and to the background distribution, respectively, and W is the number of unique words in the corpus. Note that when there is not strong supporting evidence for xi = 0 (i.e., the conditional probability of this event is low), then the probability of the word being generated by the special words route, xi = 1, or background route, xi = 2 increases. One iteration of the Gibbs sampler corresponds to a sampling pass through all word tokens in the corpus. In practice we have found that around 500 iterations are often sufﬁcient for the in-sample perplexity (or log-likelihood) and the topic distributions to stabilize. We also pursued a variant of SWB, the special words (SW) model that excludes the background distribution Ω and has a symmetric Beta prior, γ, on λ (which in SW is a document-speciﬁc Bernoulli distribution). In all our SW model runs, we set γ = 0.5 resulting in a weak symmetric prior that is equivalent to adding one pseudo-word to each document. Experimental results (not shown) indicate that the ﬁnal word-topic assignments are not sensitive to either the value of the prior or the initial assignments to the latent variables, x and z. 3 Illustrative Examples We illustrate the operation of the SW model with a data set consisting of 3104 articles from the New York Times (NYT) with a total of 1,399,488 word tokens. This small set of NYT articles was formed by selecting all NYT articles that mention the word “Enron.” The SW topic model was run with T = 100 topics. In total, 10 Gibbs samples were collected from the model. Figure 2 shows two short fragments of articles from this NYT dataset. The background color of words indicates the probability of assigning words to the special words topic—darker colors are associated with higher probability that over the 10 Gibbs samples a word was assigned to the special topic. The words with gray foreground colors were treated as stopwords and were not included in the analysis. Figure 2(a) shows how intentionally misspelled words such as “biznesmen” and “beeznessmen” and rare Collection NIPS PATENTS AP FR # of Docs 1740 6711 10000 2500 Total # of Word Tokens 2,301,375 15,014,099 2,426,181 6,332,681 Median Doc Length 1310 1858 235.5 516 Mean Doc Length 1322.6 2237.2 242.6 2533.1 # of Queries N/A N/A 142 30 Table 1: General characteristics of document data sets used in experiments. NIPS set number results case problem function values paper approach large PATENTS .0206 .0167 .0153 .0123 .0118 .0108 .0102 .0088 .0080 .0079 fig end extend invent view shown claim side posit form .0647 .0372 .0267 .0246 .0214 .0191 .0189 .0177 .0153 .0128 AP tagnum itag requir includ section determin part inform addit applic FR .0416 .0412 .0381 .0207 .0189 .0134 .0112 .0105 .0096 .0086 nation sai presid polici issu call support need govern effort .0147 .0129 .0118 .0108 .0096 .0094 .0085 .0079 .0070 .0068 Figure 3: Examples of background distributions (10 most likely words) learned by the SWB model for 4 different document corpora. words such as “pinkos” are likely to be assigned to the special words topic. Figure 2(b) shows how a last name such as “Snow” and the corporation name “CSX” that are speciﬁc to the document are likely to be assigned to the special topic. The words “Snow” and “CSX” do not occur often in other documents but are mentioned several times in the example document. This combination of low document-frequency and high term-frequency within the document is one factor that makes these words more likely to be treated as “special” words. 4 Experimental Results: Perplexity and Precision We use 4 different document sets in our experiments, as summarized in Table 1. The NIPS and PATENTS document sets are used for perplexity experiments and the AP and FR data sets for retrieval experiments. The NIPS data set is available online1 and PATENTS, AP, and FR consist of documents from the U.S. Patents collection (TREC Vol-3), Associated Press news articles from 1998 (TREC Vol-2), and articles from the Federal Register (TREC Vol-1, 2) respectively. To create the sampled AP and FR data sets, all documents relevant to queries were included ﬁrst and the rest of the documents were chosen randomly. In the results below all LDA/SWB/SW models were ﬁt using T = 200 topics. Figure 3 demonstrates the background component learned by the SWB model on the 4 different document data sets. The background distributions learned for each set of documents are quite intuitive, with words that are commonly used across a broad range of documents within each corpus. The ratio of words assigned to the special words distribution and the background distribution are (respectively for each data set), 25%:10% (NIPS), 58%:5% (PATENTS), 11%:6% (AP), 50%:11% (FR). Of note is the fact that a much larger fraction of words are treated as special in collections containing long documents (NIPS, PATENTS, and FR) than in short “abstract-like” collections (such as AP)—this makes sense since short documents are more likely to contain general summary information while longer documents will have more speciﬁc details. 4.1 Perplexity Comparisons The NIPS and PATENTS documents sets do not have queries and relevance judgments, but nonetheless are useful for evaluating perplexity. We compare the predictive performance of the SW and SWB topic models with the standard topic model by computing the perplexity of unseen words in test documents. Perplexity of a test set under a model is deﬁned as follows: 1 From http://www.cs.toronto.edu/˜roweis/data.html 1900 550 LDA SW SWB 1800 500 1700 450 LDA SW SWB Perplexity Perplexity 1600 1500 1400 400 350 300 1300 250 1200 200 1100 1000 10 20 30 40 50 60 70 80 150 10 90 20 Percentage of Words Observed 30 40 50 60 70 80 90 Percentage of Words Observed Figure 4: Average perplexity of the two special words models and the standard topics model as a function of the percentage of words observed in test documents on the NIPS data set (left) and the PATENTS data set (right). Perplexity(wtest |Dtrain ) = exp − Dtest d=1 log p(wd |Dtrain ) Dtest d=1 Nd where wtest is a vector of words in the test data set, wd is a vector of words in document d of the test set, and Dtrain is the training set. For the SWB model, we approximate p(wd |Dtrain ) as follows: p(wd |Dtrain ) ≈ 1 S S p(wd |{Θs Φs Ψs Ωs λs }) s=1 where Θs , Φs , Ψs , Ωs and λs are point estimates from s = 1:S different Gibbs sampling runs. The probability of the words wd in a test document d, given its parameters, can be computed as follows: Nd p(wd |{Θs Φs Ψs Ωs λs }) = T s s φs i t θtd + λs ψwi d + λs Ωs i 3d w w 2d λs 1d i=1 t=1 where Nd is the number of words in test document d and wi is the ith word being predicted in the s s test document. θtd , φs i t , ψwi d , Ωs i and λs are point estimates from sample s. w w d When a fraction of words of a test document d is observed, a Gibbs sampler is run on the observed words to update the document-speciﬁc parameters, θd , ψd and λd and these updated parameters are used in the computation of perplexity. For the NIPS data set, documents from the last year of the data set were held out to compute perplexity (Dtest = 150), and for the PATENTS data set 500 documents were randomly selected as test documents. From the perplexity ﬁgures, it can be seen that once a small fraction of the test document words is observed (20% for NIPS and 10% for PATENTS), the SW and SWB models have signiﬁcantly lower perplexity values than LDA indicating that the SW and SWB models are using the special words “route” to better learn predictive models for individual documents. 4.2 Information Retrieval Results Returning to the point of capturing both speciﬁc and general aspects of documents as discussed in the introduction of the paper, we generated 500 queries of length 3-5 using randomly selected lowfrequency words from the NIPS corpus and then ranked documents relative to these queries using several different methods. Table 2 shows for the top k-ranked documents (k = 1, 10, 50, 100) how many of the retrieved documents contained at least one of the words in the query. Note that we are not assessing relevance here in a traditional information retrieval sense, but instead are assessing how Method TF-IDF LSI LDA SW SWB 1 Ret Doc 100.0 97.6 90.0 99.2 99.4 10 Ret Docs 100.0 82.7 80.6 97.1 96.6 50 Ret Docs 100.0 64.6 67.0 79.1 78.7 100 Ret Docs 100.0 54.3 58.7 67.3 67.2 Table 2: Percentage of retrieved documents containing at least one query word (NIPS corpus). AP MAP Method TF-IDF LSI LDA SW SWB Title .353 .286 .424 .466* .460* Pr@10d Desc .358 .387 .394 .430* .417 Concepts .498 .459 .498 .550* .549* Method TF-IDF LSI LDA SW SWB Title .406 .455 .478 .524* .513* Method TF-IDF LSI LDA SW SWB Title .300 .366 .428 .469 .462 Desc .434 .469 .463 .509* .495 Concepts .549 .523 .556 .599* .603* FR MAP Method TF-IDF LSI LDA SW SWB Title .268 .329 .344 .371 .373 Pr@10d Desc .272 .295 .271 .323* .328* Concepts .391 .399 .396 .448* .435 Desc .287 .327 .340 .407* .423* Concepts .483 .487 .487 .550* .523 *=sig difference wrt LDA Figure 5: Information retrieval experimental results. often speciﬁc query words occur in retrieved documents. TF-IDF has 100% matches, as one would expect, and the techniques that generalize (such as LSI and LDA) have far fewer exact matches. The SWB and SW models have more speciﬁc matches than either LDA or LSI, indicating that they have the ability to match at the level of speciﬁc words. Of course this is not of much utility unless the SWB and SW models can also perform well in terms of retrieving relevant documents (not just documents containing the query words), which we investigate next. For the AP and FR documents sets, 3 types of query sets were constructed from TREC Topics 1150, based on the T itle (short), Desc (sentence-length) and Concepts (long list of keywords) ﬁelds. Queries that have no relevance judgments for a collection were removed from the query set for that collection. The score for a document d relative to a query q for the SW and standard topic models can be computed as the probability of q given d (known as the query-likelihood model in the IR community). For the SWB topic model, we have T p(w|z = t)p(z = t|d) + p(x = 1|d)p′ (w|d) + p(x = 2|d)p′′ (w)] [p(x = 0|d) p(q|d) ≈ w∈q t=1 We compare SW and SWB models with the standard topic model (LDA), LSI and TF-IDF. The TFCW D D wd IDF score for a word w in a document d is computed as TF-IDF(w, d) = Nd × log2 Dw . For LSI, the TF-IDF weight matrix is reduced to a K-dimensional latent space using SVD, K = 200. A given query is ﬁrst mapped into the LSI latent space or the TF-IDF space (known as query folding), and documents are scored based on their cosine distances to the mapped queries. To measure the performance of each algorithm we used 2 metrics that are widely used in IR research: the mean average precision (MAP) and the precision for the top 10 documents retrieved (pr@10d). The main difference between the AP and FR documents is that the latter documents are considerably longer on average and there are fewer queries for the FR data set. Figure 5 summarizes the results, broken down by algorithm, query type, document set, and metric. The maximum score for each query experiment is shown in bold: in all cases (query-type/data set/metric) the SW or SWB model produced the highest scores. To determine statistical signiﬁcance, we performed a t-test at the 0.05 level between the scores of each of the SW and SWB models, and the scores of the LDA model (as LDA has the best scores overall among TF-IDF, LSI and LDA). Differences between SW and SWB are not signiﬁcant. In ﬁgure 5, we use the symbol * to indicate scores where the SW and SWB models showed a statistically signiﬁcant difference (always an improvement) relative to the LDA model. The differences for the “non-starred” query and metric scores of SW and SWB are not statistically signiﬁcant but nonetheless always favor SW and SWB over LDA. 5 Discussion and Conclusions Wei and Croft (2006) have recently proposed an ad hoc LDA approach that models p(q|d) as a weighted combination of a multinomial over the entire corpus (the background model), a multinomial over the document, and an LDA model. Wei and Croft showed that this combination provides excellent retrieval performance compared to other state-of-the-art IR methods. In a number of experiments (not shown) comparing the SWB and ad hoc LDA models we found that the two techniques produced comparable precision performance, with small but systematic performance gains being achieved by an ad hoc combination where the standard LDA model in ad hoc LDA was replaced with the SWB model. An interesting direction for future work is to investigate fully generative models that can achieve the performance of ad hoc approaches. In conclusion, we have proposed a new probabilistic model that accounts for both general and speciﬁc aspects of documents or individual behavior. The model extends existing latent variable probabilistic approaches such as LDA by allowing these models to take into account speciﬁc aspects of documents (or individuals) that are exceptions to the broader structure of the data. This allows, for example, documents to be modeled as a mixture of words generated by general topics and words generated in a manner speciﬁc to that document. Experimental results on information retrieval tasks indicate that the SWB topic model does not suffer from the weakness of techniques such as LSI and LDA when faced with very speciﬁc query words, nor does it suffer the limitations of TF-IDF in terms of its ability to generalize. Acknowledgements We thank Tom Grifﬁths for useful initial discussions about the special words model. This material is based upon work supported by the National Science Foundation under grant IIS-0083489. We acknowledge use of the computer clusters supported by NIH grant LM-07443-01 and NSF grant EIA-0321390 to Pierre Baldi and the Institute of Genomics and Bioinformatics. References Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003) Latent Dirichlet allocation, Journal of Machine Learning Research 3: 993-1022. Buntine, W., L¨ fstr¨ m, J., Perttu, S. and Valtonen, K. (2005) Topic-speciﬁc scoring of documents for relevant o o retrieval Workshop on Learning in Web Search: 22nd International Conference on Machine Learning, pp. 34-41. Bonn, Germany. Canny, J. (2004) GaP: a factor model for discrete data. Proceedings of the 27th Annual SIGIR Conference, pp. 122-129. Daum´ III, H., and Marcu, D. (2006) Domain Adaptation for Statistical classiﬁers. Journal of the Artiﬁcial e Intelligence Research, 26: 101-126. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407. Grifﬁths, T. L., and Steyvers, M. (2004) Finding scientiﬁc topics, Proceedings of the National Academy of Sciences, pp. 5228-5235. Hofmann, T. (1999) Probabilistic latent semantic indexing, Proceedings of the 22nd Annual SIGIR Conference, pp. 50-57. Vogt, C. and Cottrell, G. (1999) Fusion via a linear combination of scores. Information Retrieval, 1(3): 151173. Wei, X. and Croft, W.B. (2006) LDA-based document models for ad-hoc retrieval, Proceedings of the 29th SIGIR Conference, pp. 178-185.

same-paper 2 0.74093294 194 nips-2006-Towards a general independent subspace analysis

Author: Fabian J. Theis

3 0.69410431 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao

Abstract: Learning by imitation represents an important mechanism for rapid acquisition of new behaviors in humans and robots. A critical requirement for learning by imitation is the ability to handle uncertainty arising from the observation process as well as the imitator’s own dynamics and interactions with the environment. In this paper, we present a new probabilistic method for inferring imitative actions that takes into account both the observations of the teacher as well as the imitator’s dynamics. Our key contribution is a nonparametric learning method which generalizes to systems with very different dynamics. Rather than relying on a known forward model of the dynamics, our approach learns a nonparametric forward model via exploration. Leveraging advances in approximate inference in graphical models, we show how the learned forward model can be directly used to plan an imitating sequence. We provide experimental results for two systems: a biomechanical model of the human arm and a 25-degrees-of-freedom humanoid robot. We demonstrate that the proposed method can be used to learn appropriate motor inputs to the model arm which imitates the desired movements. A second set of results demonstrates dynamically stable full-body imitation of a human teacher by the humanoid robot. 1

4 0.51171982 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classiﬁcation problem in feature space is contained up to negligible error in a ﬁnite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efﬁcient implicit representations of the data to perform classiﬁcation. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classiﬁcation. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classiﬁcation results. 1

5 0.50930649 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its eﬃciency for real-world applications. First, it is computationally expensive and diﬃcult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It signiﬁcantly improves the computational eﬃciency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the eﬃciency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

6 0.50652075 20 nips-2006-Active learning for misspecified generalized linear models

7 0.50453144 175 nips-2006-Simplifying Mixture Models through Function Approximation

8 0.50357234 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

9 0.50208896 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

10 0.50188607 61 nips-2006-Convex Repeated Games and Fenchel Duality

11 0.49932069 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

12 0.49905375 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

13 0.49887007 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

14 0.49644423 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

15 0.49439424 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights

16 0.49308044 131 nips-2006-Mixture Regression for Covariate Shift

17 0.49224886 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

18 0.49163288 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

19 0.4914138 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

20 0.49135673 138 nips-2006-Multi-Task Feature Learning