nips nips2006 nips2006-76 knowledge-graph by maker-knowledge-mining

76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis


Source: pdf

Author: J.t. Lindgren, Aapo Hyvärinen

Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Emergence of conjunctive visual features by quadratic independent component analysis J. [sent-1, score-0.786]

2 Abstract In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. [sent-8, score-0.764]

3 Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. [sent-9, score-0.997]

4 These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. [sent-10, score-0.332]

5 In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. [sent-11, score-0.357]

6 Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. [sent-12, score-0.356]

7 1 Introduction Sparse coding of natural images has led to models that resemble the receptive fields in the primate primary visual cortex area V1 (see e. [sent-13, score-0.544]

8 In this paper we estimate quadratic models for natural images using Independent Component Analysis (ICA). [sent-19, score-0.54]

9 The used quadratic functions are a natural extension to linear functions (i. [sent-20, score-0.452]

10 Another wellknown interpretation of components in a quadratic model is as outputs of two-layer neural networks, which is based on an eigenvalue decomposition and will be discussed below. [sent-26, score-0.706]

11 Estimating a quadratic model for natural images with ICA, we report here the emergence of receptive field models that respond strongly only if the stimulus contains two features that are in a correct spatial arrangement. [sent-27, score-1.008]

12 With a heavy dimensionality reduction, the conjuncted features are mostly collinear (i. [sent-28, score-0.378]

13 prefer edges or bars), but with a smaller reduction, additional components emerge that appear to prefer more complex stimuli such as angles or corners. [sent-30, score-0.597]

14 We show that in both cases, the emerging components approximately operate by computing products between the outputs of two linear submodels that have V1 simple cell characteristics. [sent-31, score-0.507]

15 In section 2 we describe the quadratic ICA in detail. [sent-33, score-0.346]

16 If the independent components are sparse, this is equivalent to performing sparse coding (for an account of ICA, see e. [sent-40, score-0.349]

17 It has been proposed that ICA for quadratic models can be performed by first making a quadratic basis expansion on each x and then applying standard linear ICA [9]. [sent-43, score-0.768]

18 Let the new data vectors z ∈ Rn(n+1)/2+n in quadratic space be z = Φ([x1 , x2 , . [sent-44, score-0.395]

19 Then the columns wi of WT make up the quadratic components (cell models, polynomial filters) of the model. [sent-60, score-0.617]
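
A minimal sketch of this construction in Python, assuming one particular (but common) ordering of the quadratic space: all monomials $x_i x_j$ with $i \le j$, followed by the $n$ linear terms. The function name and that exact ordering are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def quadratic_expansion(x):
    """Map a patch x in R^n to the quadratic space R^(n(n+1)/2 + n):
    all monomials x_i * x_j with i <= j, followed by the n linear terms x_i."""
    x = np.asarray(x, dtype=float)
    n = x.size
    iu = np.triu_indices(n)            # index pairs (i, j) with i <= j
    quad = np.outer(x, x)[iu]          # n(n+1)/2 quadratic monomials
    return np.concatenate([quad, x])   # total length n(n+1)/2 + n

# A 9x9 patch gives n = 81, so z = quadratic_expansion(x) lives in R^3402.
z = quadratic_expansion(np.random.randn(81))
assert z.size == 81 * 82 // 2 + 81
```

Each estimated component is then a single vector $w_i$ in this space, and its response to a patch is $s_i = w_i^T z$.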

20 It is well known that the Hi can be represented in another form by eigenvalue decomposition, leading to the expression $s_i = \sum_{j=1}^{n} \alpha_j (v_j^T x)^2 + l_i^T x$ (5), where αj are the decreasingly sorted eigenvalues of Hi and vj the corresponding eigenvectors. [sent-62, score-0.347]

21 Eq. 5 can help to understand the model, since the individual eigenvectors can be interpreted as linear receptive fields. [sent-64, score-0.427]
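
Assuming the same monomial ordering as in the sketch above, a component vector $w$ can be folded back into the symmetric matrix $H$ and linear term $l$ of $s(x) = x^T H x + l^T x$ and then eigendecomposed as in eq. 5. The helper below is a hedged illustration, not the authors' code.

```python
import numpy as np

def unpack_component(w, n):
    """Fold a quadratic-space component w back into (H, l) with
    s(x) = x^T H x + l^T x, then eigendecompose H as in eq. 5."""
    m = n * (n + 1) // 2
    H = np.zeros((n, n))
    H[np.triu_indices(n)] = w[:m]
    H = 0.5 * (H + H.T)                  # H_ii = w_ii, H_ij = H_ji = w_ij / 2
    l = np.asarray(w[m:], dtype=float)
    alphas, V = np.linalg.eigh(H)        # eigenvalues in ascending order
    order = np.argsort(alphas)[::-1]     # re-sort decreasingly, as in the text
    return H, l, alphas[order], V[:, order]
```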

22 A quadratic function in this form is illustrated on the left in figure 1. [sent-65, score-0.346]

23 However, for our model estimated with quadratic ICA, many of the eigenvectors vj did not resemble V1 simple cell receptive fields. [sent-66, score-0.975]

24 Assume that the two eigenvalues of Hi which are largest in absolute value have opposite signs; we will refer to these as the dominant eigenvalues and denote them by α1 and αn . [sent-68, score-0.479]

25 Now, including just the two corresponding dominant eigenvectors and ignoring the linear term, we denote $v_+ = \sqrt{|\alpha_1|}\,v_1 + \sqrt{|\alpha_n|}\,v_n$ and $v_- = \sqrt{|\alpha_1|}\,v_1 - \sqrt{|\alpha_n|}\,v_n$ (6). [sent-70, score-0.423]

26 [Residue of the figure 1 network diagram: nodes computing $(v_1^T x)^2, \dots, (v_n^T x)^2$, the linear term $l^T x$, and the factor $(v_+^T x)$.] [sent-77, score-0.304]

27 Using the eigenvalue decomposition, quadratic forms can be interpreted as networks. [sent-79, score-0.461]

28 Left, the computation of a single component w, where the vi are the eigenvectors and the αi the eigenvalues of the matrix H. [sent-80, score-0.431]

29 Right, its product approximation, which is possible if the variance is concentrated on just two eigenvectors with eigenvalues of opposite signs. [sent-81, score-0.372]

30 Providing that the approximation is good, the intuition is that the component is essentially computing the product of the responses of two linear filters, analogous to a logical AND operation, or a conjunction. [sent-86, score-0.287]
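
For completeness, the identity behind this product interpretation, written out under the assumption $\alpha_1 > 0 > \alpha_n$ and a negligible linear term (it follows directly from eq. 6):

\[
(v_+^T x)(v_-^T x)
 = \bigl(\sqrt{|\alpha_1|}\,v_1^T x + \sqrt{|\alpha_n|}\,v_n^T x\bigr)\bigl(\sqrt{|\alpha_1|}\,v_1^T x - \sqrt{|\alpha_n|}\,v_n^T x\bigr)
 = \alpha_1 (v_1^T x)^2 + \alpha_n (v_n^T x)^2 ,
\]

which is exactly the two dominant terms of eq. 5; the output is therefore large only when both factors $(v_+^T x)$ and $(v_-^T x)$ are simultaneously large in magnitude.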

31 We will empirically show that the vectors v+ and v− have resemblance to V1 simple cell receptive fields for our model even if the respective two dominant eigenvectors have more complicated shapes. [sent-87, score-0.823]

32 3 Materials and methods In our experiments we used the natural image dataset provided by van Hateren and van der Schaaf [2]. [sent-88, score-0.291]

33 This whitening filter cuts the highest frequencies, and balances the frequency distribution otherwise by dampening the dominant low frequencies. [sent-97, score-0.285]

34 After preprocessing each image as a whole, we sampled 300, 000 small image patches from the images, each patch having a resolution of 9 × 9. [sent-101, score-0.306]

35 These patches then formed the data we used to estimate the quadratic ICA model. [sent-103, score-0.406]
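
A hedged sketch of this preprocessing stage. The extracts above only say that the whitening filter dampens the dominant low frequencies and cuts the highest ones; the specific $f \exp(-(f/f_0)^4)$ profile below is a common choice for such a filter and is an assumption here, as are the function names and the patch-sampling details.

```python
import numpy as np

def whiten_image(img, f0_frac=0.8):
    """Frequency-domain filter: amplify proportionally to spatial frequency,
    then roll off near the highest frequencies (assumed filter profile)."""
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    f = np.sqrt(fx ** 2 + fy ** 2)
    filt = f * np.exp(-(f / (f0_frac * f.max())) ** 4)
    return np.real(np.fft.ifft2(np.fft.fft2(img) * filt))

def sample_patches(images, n_patches=300_000, size=9, seed=0):
    """Draw random size x size patches from the whitened images."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_patches, size * size))
    for k in range(n_patches):
        img = images[rng.integers(len(images))]
        r = rng.integers(img.shape[0] - size + 1)
        c = rng.integers(img.shape[1] - size + 1)
        out[k] = img[r:r + size, c:c + size].ravel()
    return out
```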

36 The model fitting was done by transforming the data to the quadratic space using eq. [sent-104, score-0.346]

37 The input dimension in the quadratic space was Figure 2: The quadratic ICA components when the model size is very small (81 components). [sent-107, score-0.949]

38 Each quadruple displays the two dominant eigenvectors v1 and vn (top row), and the corresponding vectors v+ and v− (bottom row). [sent-108, score-0.586]

39 The components have been sorted by collinearity of the conjuncted features. [sent-110, score-0.479]

40 This resulted in estimation of 400 independent components (or second-order polynomial filters). [sent-113, score-0.305]
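
Putting the pieces together, a sketch of the estimation step: quadratic expansion of the patches, PCA dimensionality reduction in the quadratic space, and linear ICA on the reduced data. scikit-learn's FastICA here merely stands in for whichever ICA algorithm the authors used, and the helpers are the ones sketched earlier; none of this is the original implementation.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# patches: (300000, 81) array of whitened 9x9 patches (see sample_patches above)
Z = np.stack([quadratic_expansion(x) for x in patches])    # (300000, 3402)

pca = PCA(n_components=400)            # 81 for the heavily reduced model
Z_red = pca.fit_transform(Z)

ica = FastICA(n_components=400, max_iter=1000)
S = ica.fit_transform(Z_red)           # component activations on the data

# Express each component as a filter w_i in the quadratic space, so that
# s_i is (approximately) w_i applied to the centred expansion of a patch.
W = ica.components_ @ pca.components_              # (400, 3402)
H0, l0, alphas0, V0 = unpack_component(W[0], n=81)
```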

41 4 Results In general, interpreting quadratic models can be difficult, and several strategies have been proposed in the literature (see e. [sent-117, score-0.387]

42 Figure 2 shows the quadratic ICA components when a small model was estimated with only 81 components. [sent-124, score-0.64]

43 7, the dominant eigenvectors shown at the top row of each quadruple are equal to the two unit-norm stimuli that the component reacts most highly to (e. [sent-126, score-0.632]

44 fi/u/jtlindgr/stuff/ Figure 3: Quadratic ICA components picked from 10 bootstrap iterations with 400 components estimated on each run. [sent-133, score-0.515]

45 All 4000 components were ordered by collinearity of the conjuncted features, and a small sample of each tail is shown. [sent-134, score-0.404]

46 Top, some components that prefer conjunctions of two collinear features. [sent-136, score-0.586]

47 On the other hand, both vectors v+ and v− must respond to a stimulus with a non-zero value if the component is to respond strongly. [sent-140, score-0.432]

48 In the case of this small model size, many of the conjuncted features v+ and v− are collinear, and respond strongly to edge- or bar-like stimuli. [sent-141, score-0.325]

49 The feature conjunctions that are not collinear remain more unstructured and appear to react highly to blob-like stimuli. [sent-142, score-0.398]

50 However, both component types are quite different from ordinary linear detectors for edges, bars and blobs, since their conjunctive nature makes them much more selective. [sent-143, score-0.344]

51 With the higher dimensionality allowed, the diversity of the emerging components increased. [sent-145, score-0.293]

52 Figure 3 shows quadratic ICA components picked from 10 experiments repeated with different subsets of the input patches and different random seeds. [sent-146, score-0.663]

53 Now, in addition to collinear conjunctions (on the top in the image), we also get components that conjunct more orthogonal stimuli (on the bottom). [sent-147, score-0.704]

54 The latter components appear to respond favourably to intuitive visual concepts such as angles and corners. [sent-148, score-0.56]

55 In this case, the benefits of the decomposition into vectors v+ and v− become more apparent, as many of the receptive field models retain their resemblance to Gabor-like filters (as in e. [sent-149, score-0.37]

56 First, it turns out that the eigenvalue distributions decay fast for the quadratic forms Hi of the estimated components. [sent-154, score-0.596]

57 This is illustrated on the left in figure 4, which shows the mean sorted eigenvalues for the 400 components (for a model of 81 components, the figure was similar). [sent-155, score-0.438]

58 Since all the eigenvectors have equal norms, each eigenvalue determines the magnitude of the contribution of its respective eigenvector to the component output value. [sent-156, score-0.477]

59 Due to the fast decay of the eigenvalues, the two dominant eigenvectors are largely responsible for the component output, providing that the linear term l is insignificant (for some discussion on the linear term in quadratic models, see [12]). [sent-157, score-0.962]

60 Figure 4: The conjunctive nature of the components is due to the eigenvalues of the quadratic forms Hi typically emerging as heavily dominated by just two eigenvectors with opposite-sign eigenvalues. [sent-158, score-1.248]

61 Left, sorted eigenvalues of Hi averaged over all 400 components for both quadratic ICA and quadratic PCA. [sent-160, score-1.13]

62 Right, the relative mean square error of the product approximation for the 400 quadratic ICA components. [sent-162, score-0.422]

63 The components have been sorted by the error of approximation when Gabor functions have been used to model v+ and v− . [sent-163, score-0.335]

64 Here the quadratic part tended to dominate the component responses, which may be because the (co)variances were much larger for the quadratic dimensions of the space than for the linear ones. [sent-164, score-0.861]

65 Hence, the product approximation appears to capture the behaviour of the components rather well. [sent-171, score-0.372]
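
The extracts do not spell out the exact error measure, but a relative mean square error of the product approximation can be sketched as follows (whether the linear term is included is an assumption here):

```python
import numpy as np

def product_approx_rel_error(H, l, X):
    """Relative MSE between the full quadratic response and the
    two-eigenvector product approximation, on sample patches X (rows)."""
    alphas, V = np.linalg.eigh(H)            # ascending eigenvalues
    a1, v1 = alphas[-1], V[:, -1]            # largest (assumed positive)
    an, vn = alphas[0], V[:, 0]              # smallest (assumed negative)
    full = np.einsum('ij,jk,ik->i', X, H, X) + X @ l
    v_plus  = np.sqrt(abs(a1)) * v1 + np.sqrt(abs(an)) * vn
    v_minus = np.sqrt(abs(a1)) * v1 - np.sqrt(abs(an)) * vn
    approx = (X @ v_plus) * (X @ v_minus)
    return np.mean((full - approx) ** 2) / np.mean(full ** 2)
```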

66 To better understand the obtained error rates, we also fitted linear models to approximate the estimated quadratic cells using least-squares regression. [sent-174, score-0.624]

67 This revealed the quadratic components to have highly nonlinear behaviour. [sent-175, score-0.567]
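
The least-squares comparison can be sketched along the same lines, again only as an illustration of the idea rather than the authors' exact procedure:

```python
import numpy as np

def linear_fit_rel_error(H, l, X):
    """Relative MSE of the best affine fit to the quadratic response."""
    s = np.einsum('ij,jk,ik->i', X, H, X) + X @ l
    A = np.column_stack([X, np.ones(len(X))])      # affine design matrix
    coef, *_ = np.linalg.lstsq(A, s, rcond=None)
    return np.mean((s - A @ coef) ** 2) / np.mean(s ** 2)
```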

68 Since the product approximation only covers the two dominant eigenvectors, it is possible that the rest of the eigenvectors might code for interesting phenomena through further excitatory and inhibitory effects. [sent-177, score-0.535]

69 However, the quick decay of the eigenvalues in our estimated model should make any such effects rather minor. [sent-178, score-0.277]

70 Following the ideas and methods of [12], we explored the possibility that the nondominant eigenvectors coded for invariances of the component. [sent-179, score-0.313]

71 Finally, we performed some experiments to examine to what extent the method of quadratic ICA on the one hand, and the natural image input data on the other, are responsible for the reported results. [sent-184, score-0.529]

72 For example, it could be argued that the quadratic ICA components might be very similar to the quadratic PCA components. [sent-185, score-0.913]

73 It can be seen that the PCA components quickly lose resemblance to Gabor-like filters as the eigenvalues decrease. [sent-187, score-0.432]

74 Also, the conjunctive nature of the estimated features is not as clear for the PCA-based components. [sent-188, score-0.337]

75 This is shown on the left in Figure 5: Left, the two top rows show the vectors v+ and v− for the first 8 quadratic PCA components. [sent-189, score-0.395]

76 It can be seen that the PCA components quickly lose any resemblance to Gabor-like receptive fields of V1 simple cells. [sent-191, score-0.455]

77 Right, some typical quadratic ICA components in terms of the vectors v+ and v− when the model was estimated on white noise. [sent-192, score-0.689]

78 figure 4, demonstrating that on the average, the eigenvalues of the quadratic forms decay slower for the PCA components. [sent-194, score-0.611]

79 If the whole set of PCA components is studied (not shown), it can be seen that the components appear to change from low-pass filters to high-pass filters as the eigenvalues decrease. [sent-195, score-0.617]

80 Comparing figure 5 to figure 3, both outputs seem characteristic to the method applied, the differences resembling those observed when linear ICA and linear PCA are used to code natural image data. [sent-196, score-0.321]

81 To verify that the emerging component structures are not artifacts of the modelling methodology, we generated a dataset of artificial images of white noise, having the luminance distribution of the original dataset, but with no spatial or spectral structure. [sent-197, score-0.328]
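
One simple way to build such a control set (a guess at the construction, not necessarily the authors': permuting pixels keeps the luminance histogram but destroys all spatial and spectral structure):

```python
import numpy as np

def luminance_matched_noise(images, seed=0):
    """White-noise images whose pixels are random permutations of the
    original pixels: same luminance distribution, no spatial structure."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(img.ravel()).reshape(img.shape) for img in images]
```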

82 Some of the components estimated on random data are displayed on the right in figure 5 in terms of vectors v+ and v− . [sent-202, score-0.343]

83 5 Discussion In this paper, we specified a quadratic model for natural images and estimated its parameters with independent component analysis. [sent-203, score-0.702]

84 We reported the emergence of cell models exhibiting strongly nonlinear behaviour. [sent-204, score-0.306]

85 In particular, we demonstrated that the estimated cells were essentially computing products between outputs of two linear filters that had V1 simple cell characteristics. [sent-205, score-0.382]

86 Many of these feature conjunctions preferred two collinear features, and yet others corresponded to combinations of more orthogonal stimuli, reacting strongly to angles and corners. [sent-206, score-0.468]

87 Our results indicate that sparse coding of natural images might partially explain the development of angle- or corner-preferring cells in V2. [sent-207, score-0.342]

88 There has been some previous work describing quadratic models of natural image data (i. [sent-208, score-0.534]

89 Bartsch and Obermayer [13] report curvature detecting cells, but the patch size used and the number of components estimated were very small, making the results inconclusive. [sent-212, score-0.294]

90 Hashimoto [7] sought to replicate complex cell properties with an algorithm based on Kullback-Leibler divergences, and does not report conjunctive features or cells with preferences for angles or corners. [sent-213, score-0.561]

91 Instead, most of the estimated quadratic forms on static image data had only one dominant eigenvector. [sent-214, score-0.751]

92 Our work extends the previous research by reporting the emergence of conjunctive components that combine responses of V1-like linear filters. [sent-215, score-0.623]

93 , number of principal components retained) was seen to affect the feature diversity, and with only 81 components, the conjuncted features were mostly collinear, producing highly selective edge or bar detectors. [sent-219, score-0.408]

94 It is worthwhile to note that despite differences to previous work [13, 7, 9], invariances resembling complex cell behaviour did not emerge with our method either, although the class of quadratic models contains the classic energy-detector models of complex cells (fitted in e. [sent-222, score-0.869]

95 It could be that static images and optimization of sparsity alone may not work towards the emergence of invariances, or equivalently, behaviour resembling logical OR operation, unless the model is further constrained (for example as in [3]). [sent-225, score-0.376]

96 Optimizing model likelihood can also be preferable to optimizing output sparseness, but for quadratic ICA it is not clear how to construct a proper probabilistic model. [sent-226, score-0.346]

97 Finally, an important open question regarding the current work is to what extent the obtained conjunctive features reflect real structures present in the image data. [sent-227, score-0.34]

98 Independent component filters of natural images compared with simple cells in primary visual cortex. [sent-244, score-0.39]

99 Emergence of phase and shift invariant features by decomposition of a natural images into independent feature subspaces. [sent-254, score-0.284]

100 On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. [sent-303, score-0.572]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('quadratic', 0.346), ('ica', 0.331), ('components', 0.221), ('conjunctive', 0.213), ('dominant', 0.195), ('eigenvectors', 0.193), ('collinear', 0.191), ('receptive', 0.165), ('eigenvalues', 0.142), ('conjuncted', 0.136), ('conjunctions', 0.136), ('pca', 0.119), ('emergence', 0.114), ('hi', 0.113), ('xyz', 0.109), ('lters', 0.108), ('cell', 0.106), ('vn', 0.102), ('stimuli', 0.101), ('angles', 0.096), ('component', 0.096), ('cells', 0.095), ('preprocessing', 0.094), ('respond', 0.093), ('whitening', 0.09), ('bb', 0.086), ('images', 0.082), ('image', 0.076), ('zz', 0.076), ('dd', 0.076), ('sorted', 0.075), ('behaviour', 0.075), ('estimated', 0.073), ('lt', 0.073), ('hyv', 0.072), ('emerging', 0.072), ('natural', 0.071), ('favourably', 0.071), ('resemblance', 0.069), ('invariances', 0.065), ('resembling', 0.065), ('decay', 0.062), ('forms', 0.061), ('gure', 0.06), ('patches', 0.06), ('bartsch', 0.055), ('conjunct', 0.055), ('helsinki', 0.055), ('hijk', 0.055), ('nondominant', 0.055), ('onml', 0.055), ('macaque', 0.054), ('rinen', 0.054), ('eigenvalue', 0.054), ('van', 0.054), ('coding', 0.054), ('features', 0.051), ('polynomial', 0.05), ('resemble', 0.05), ('vectors', 0.049), ('quadruple', 0.047), ('hateren', 0.047), ('collinearity', 0.047), ('finland', 0.047), ('lgn', 0.047), ('decomposition', 0.046), ('gabor', 0.046), ('visual', 0.046), ('respective', 0.046), ('strongly', 0.045), ('intensity', 0.044), ('vj', 0.042), ('artifacts', 0.042), ('models', 0.041), ('logical', 0.04), ('responses', 0.04), ('sparse', 0.04), ('approximation', 0.039), ('outputs', 0.039), ('prefer', 0.038), ('monkey', 0.038), ('tended', 0.038), ('react', 0.038), ('product', 0.037), ('dataset', 0.036), ('excitatory', 0.036), ('lter', 0.036), ('frequencies', 0.036), ('input', 0.036), ('linear', 0.035), ('edges', 0.035), ('inhibitory', 0.035), ('primate', 0.035), ('emerge', 0.035), ('olshausen', 0.035), ('si', 0.034), ('understand', 0.034), ('products', 0.034), ('independent', 0.034), ('appear', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

Author: J.t. Lindgren, Aapo Hyvärinen

Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1

2 0.21956913 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1

3 0.16579896 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1

4 0.13342325 194 nips-2006-Towards a general independent subspace analysis

Author: Fabian J. Theis

Abstract: The increasingly popular independent component analysis (ICA) may only be applied to data following the generative ICA model in order to guarantee algorithmindependent and theoretically valid results. Subspace ICA models generalize the assumption of component independence to independence between groups of components. They are attractive candidates for dimensionality reduction methods, however are currently limited by the assumption of equal group sizes or less general semi-parametric models. By introducing the concept of irreducible independent subspaces or components, we present a generalization to a parameter-free mixture model. Moreover, we relieve the condition of at-most-one-Gaussian by including previous results on non-Gaussian component analysis. After introducing this general model, we discuss joint block diagonalization with unknown block sizes, on which we base a simple extension of JADE to algorithmically perform the subspace analysis. Simulations confirm the feasibility of the algorithm. 1

5 0.12489806 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

Author: Wolf Kienzle, Felix A. Wichmann, Matthias O. Franz, Bernhard Schölkopf

Abstract: This paper addresses the bottom-up influence of local image information on human eye movements. Most existing computational models use a set of biologically plausible linear filters, e.g., Gabor or Difference-of-Gaussians filters as a front-end, the outputs of which are nonlinearly combined into a real number that indicates visual saliency. Unfortunately, this requires many design parameters such as the number, type, and size of the front-end filters, as well as the choice of nonlinearities, weighting and normalization schemes etc., for which biological plausibility cannot always be justified. As a result, these parameters have to be chosen in a more or less ad hoc way. Here, we propose to learn a visual saliency model directly from human eye movement data. The model is rather simplistic and essentially parameter-free, and therefore contrasts recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity. Experimental results show that—despite the lack of any biological prior knowledge—our model performs comparably to existing approaches, and in fact learns image features that resemble findings from several previous studies. In particular, its maximally excitatory stimuli have center-surround structure, similar to receptive fields in the early human visual system. 1

6 0.12362319 16 nips-2006-A Theory of Retinal Population Coding

7 0.12241675 175 nips-2006-Simplifying Mixture Models through Function Approximation

8 0.11976909 46 nips-2006-Blind source separation for over-determined delayed mixtures

9 0.10552008 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

10 0.09835013 75 nips-2006-Efficient sparse coding algorithms

11 0.095182739 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

12 0.094334051 149 nips-2006-Nonnegative Sparse PCA

13 0.091773674 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm

14 0.085142165 80 nips-2006-Fundamental Limitations of Spectral Clustering

15 0.083746009 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

16 0.080682687 79 nips-2006-Fast Iterative Kernel PCA

17 0.076739967 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

18 0.069203071 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

19 0.067783795 126 nips-2006-Logistic Regression for Single Trial EEG Classification

20 0.066639259 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.24), (1, -0.038), (2, 0.105), (3, 0.055), (4, -0.096), (5, -0.156), (6, -0.04), (7, -0.046), (8, -0.039), (9, 0.198), (10, 0.073), (11, -0.014), (12, 0.091), (13, -0.251), (14, -0.011), (15, -0.074), (16, 0.093), (17, 0.19), (18, 0.108), (19, -0.073), (20, -0.007), (21, -0.073), (22, -0.003), (23, -0.087), (24, 0.025), (25, 0.053), (26, 0.021), (27, -0.01), (28, 0.021), (29, -0.039), (30, -0.14), (31, -0.002), (32, 0.0), (33, 0.075), (34, 0.054), (35, 0.0), (36, -0.071), (37, 0.047), (38, -0.055), (39, 0.025), (40, 0.015), (41, 0.045), (42, -0.031), (43, -0.044), (44, -0.007), (45, -0.077), (46, -0.005), (47, 0.096), (48, -0.131), (49, -0.096)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97248292 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

Author: J.t. Lindgren, Aapo Hyvärinen

Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1

2 0.67837995 149 nips-2006-Nonnegative Sparse PCA

Author: Ron Zass, Amnon Shashua

Abstract: We describe a nonnegative variant of the ”Sparse PCA” problem. The goal is to create a low dimensional representation from a collection of points which on the one hand maximizes the variance of the projected points and on the other uses only parts of the original coordinates, and thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness, where it enforces a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criteria, giving good results on large real world datasets. 1
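As a rough illustration of what a nonnegative sparse loading vector looks like in practice, the short Python sketch below extracts a single component by power iteration on the sample covariance, projecting onto the nonnegative orthant and soft-thresholding for sparsity at every step. This is a generic heuristic written for this listing, not the coordinate-descent scheme of the paper; the function name, the sparsity parameter and the toy data are all illustrative.

import numpy as np

def nonneg_sparse_component(X, sparsity=0.1, n_iter=200, seed=0):
    # One nonnegative, sparse pseudo-principal component via projected
    # power iteration (heuristic sketch, not the paper's algorithm).
    rng = np.random.default_rng(seed)
    C = np.cov(X, rowvar=False)                        # d x d sample covariance
    w = np.abs(rng.standard_normal(C.shape[0]))
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        w = C @ w                                      # power-iteration step
        w = np.maximum(w, 0.0)                         # nonnegativity projection
        w = np.maximum(w - sparsity * w.max(), 0.0)    # soft-threshold for sparsity
        norm = np.linalg.norm(w)
        if norm == 0.0:                                # threshold was too aggressive
            break
        w /= norm
    return w

X = np.random.randn(500, 20)                           # toy data
w = nonneg_sparse_component(X)
print(int((w > 0).sum()), "nonzero loadings, captured variance",
      float(w @ np.cov(X, rowvar=False) @ w))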

3 0.67641765 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
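The two-stage idea described in this abstract — run ICA, rectify the outputs, map them to an approximately normal distribution, and run ICA again — can be sketched in a few lines of Python with scikit-learn and SciPy. The rank-based gaussianization below is one reasonable choice of transform and the helper names are ours; the paper's exact transform and preprocessing may differ.

import numpy as np
from scipy.stats import norm, rankdata
from sklearn.decomposition import FastICA

def gaussianize_abs(U):
    # Map each column of |U| through its empirical ranks to a roughly
    # standard-normal variable (one possible gaussianizing transform).
    n = U.shape[0]
    ranks = np.apply_along_axis(rankdata, 0, np.abs(U))
    return norm.ppf(ranks / (n + 1.0))

def recursive_ica(X, n_components, n_layers=2, seed=0):
    # Apply ICA repeatedly, gaussianizing the rectified outputs between layers.
    Y, layers, S = X, [], None
    for _ in range(n_layers):
        ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
        S = ica.fit_transform(Y)
        layers.append(ica)
        Y = gaussianize_abs(S)          # input for the next layer
    return layers, S

X = np.random.laplace(size=(5000, 16))  # stand-in for whitened image patches
layers, S2 = recursive_ica(X, n_components=16)
print(S2.shape)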

4 0.65894324 194 nips-2006-Towards a general independent subspace analysis

Author: Fabian J. Theis

Abstract: The increasingly popular independent component analysis (ICA) may only be applied to data following the generative ICA model in order to guarantee algorithmindependent and theoretically valid results. Subspace ICA models generalize the assumption of component independence to independence between groups of components. They are attractive candidates for dimensionality reduction methods, however are currently limited by the assumption of equal group sizes or less general semi-parametric models. By introducing the concept of irreducible independent subspaces or components, we present a generalization to a parameter-free mixture model. Moreover, we relieve the condition of at-most-one-Gaussian by including previous results on non-Gaussian component analysis. After introducing this general model, we discuss joint block diagonalization with unknown block sizes, on which we base a simple extension of JADE to algorithmically perform the subspace analysis. Simulations confirm the feasibility of the algorithm. 1 Independent subspace analysis A random vector Y is called an independent component of the random vector X, if there exists an invertible matrix A and a decomposition X = A(Y, Z) such that Y and Z are stochastically independent. The goal of a general independent subspace analysis (ISA) or multidimensional independent component analysis is the decomposition of an arbitrary random vector X into independent components. If X is to be decomposed into one-dimensional components, this coincides with ordinary independent component analysis (ICA). Similarly, if the independent components are required to be of the same dimension k, then this is denoted by multidimensional ICA of fixed group size k or simply k-ISA. So 1-ISA is equivalent to ICA. 1.1 Why extend ICA? An important structural aspect in the search for decompositions is the knowledge of the number of solutions i.e. the indeterminacies of the problem. Without it, the result of any ICA or ISA algorithm cannot be compared with other solutions, so for instance blind source separation (BSS) would be impossible. Clearly, given an ISA solution, invertible transforms in each component (scaling matrices L) as well as permutations of components of the same dimension (permutation matrices P) give again an ISA of X. And indeed, in the special case of ICA, scaling and permutation are already all indeterminacies given that at most one Gaussian is contained in X [6]. This is one of the key theoretical results in ICA, allowing the usage of ICA for solving BSS problems and hence stimulating many applications. It has been shown that also for k-ISA, scalings and permutations as above are the only indeterminacies [11], given some additional rather weak restrictions to the model. However, a serious drawback of k-ISA (and hence of ICA) lies in the fact that the requirement fixed group-size k does not allow us to apply this analysis to an arbitrary random vector. Indeed, crosstalking error 4 3 2 1 0 FastICA JADE Extended Infomax Figure 1: Applying ICA to a random vector X = AS that does not fulfill the ICA model; here S is chosen to consist of a two-dimensional and a one-dimensional irreducible component. Shown are the statistics over 100 runs of the Amari error of the random original and the reconstructed mixing matrix using the three ICA-algorithms FastICA, JADE and Extended Infomax. Clearly, the original mixing matrix could not be reconstructed in any of the experiments. 
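The failure mode described around Figure 1 is easy to reproduce numerically: mix one dependent (irreducible) 2-D source with an independent 1-D source and apply plain ICA. The Python sketch below does this with scikit-learn's FastICA and scores the result with a standard Amari index of W·A, which is zero only for a scaled permutation; the ring-shaped 2-D source and the particular error normalization are our illustrative choices, not necessarily those used in the figure.

import numpy as np
from sklearn.decomposition import FastICA

def amari_index(P):
    # Standard Amari error: 0 iff P is a scaled permutation matrix.
    P = np.abs(P)
    n = P.shape[0]
    rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return float(rows.sum() + cols.sum()) / (2.0 * n * (n - 1))

rng = np.random.default_rng(0)
n = 10000
phi = rng.uniform(0.0, 2.0 * np.pi, n)          # dependent 2-D source on a ring
radius = 1.0 + 0.1 * rng.standard_normal(n)
s12 = np.c_[radius * np.cos(phi), radius * np.sin(phi)]
s3 = rng.laplace(size=(n, 1))                   # independent 1-D source
S = np.c_[s12, s3]
A = rng.uniform(-1.0, 1.0, (3, 3))              # random mixing matrix
X = S @ A.T

ica = FastICA(n_components=3, random_state=0, max_iter=1000)
ica.fit(X)
print("Amari error of W A:", amari_index(ica.components_ @ A))  # typically stays well above 0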
However, interestingly, the latter two algorithms do indeed find an ISA up to permutation, which will be explained in section 3. theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. If this is the case, uniqueness up to permutation and scaling holds as noted above; however if k-ISA is applied to any random vector, a decomposition into groups that are only ‘as independent as possible’ cannot be unique and depends on the contrast and the algorithm. In the literature, ICA is often applied to find representations fulfilling the independence condition as well as possible, however care has to be taken; the strong uniqueness result is not valid any more, and the results may depend on the algorithm as illustrated in figure 1. This work aims at finding an ISA model that allows applicability to any random vector. After reviewing previous approaches, we will provide such a model together with a corresponding uniqueness result and a preliminary algorithm. 1.2 Previous approaches to ISA for dependent component analysis Generalizations of the ICA model that are to include dependencies of multiple one-dimensional components have been studied for quite some time. ISA in the terminology of multidimensional ICA has first been introduced by Cardoso [4] using geometrical motivations. His model as well as the related but independently proposed factorization of multivariate function classes [9] is quite general, however no identifiability results were presented, and applicability to an arbitrary random vector was unclear; later, in the special case of equal group sizes (k-ISA) uniqueness results have been extended from the ICA theory [11]. Algorithmic enhancements in this setting have been recently studied by [10]. Moreover, if the observation contain additional structures such as spatial or temporal structures, these may be used for the multidimensional separation [13]. Hyv¨ rinen and Hoyer presented a special case of k-ISA by combining it with invariant feature suba space analysis [7]. They model the dependence within a k-tuple explicitly and are therefore able to propose more efficient algorithms without having to resort to the problematic multidimensional density estimation. A related relaxation of the ICA assumption is given by topographic ICA [8], where dependencies between all components are assumed and modelled along a topographic structure (e.g. a 2-dimensional grid). Bach and Jordan [2] formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. For the latter, they propose to use a tree-structure as employed by their tree dependepent component analysis. Together with inter-cluster independence, this implies a search for a transformation of the mixtures into a forest i.e. a set of disjoint trees. However, the above models are all semi-parametric and hence not fully blind. In the following, no additional structures are necessary for the separation. 1.3 General ISA Definition 1.1. A random vector S is said to be irreducible if it contains no lower-dimensional independent component. An invertible matrix W is called a (general) independent subspace analysis of X if WX = (S1 , . . . , Sk ) with pairwise independent, irreducible random vectors Si . Note that in this case, the Si are independent components of X. 
The idea behind this definition is that in contrast to ICA and k-ISA, we do not fix the size of the groups Si in advance. Of course, some restriction is necessary, otherwise no decomposition would be enforced at all. This restriction is realized by allowing only irreducible components. The advantage of this formulation now is that it can clearly be applied to any random vector, although of course a trivial decomposition might be the result in the case of an irreducible random vector. Obvious indeterminacies of an ISA of X are, as mentioned above, scalings i.e. invertible transformations within each Si and permutation of Si of the same dimension1. These are already all indeterminacies as shown by the following theorem, which extends previous results in the case of ICA [6] and k-ISA [11], where also the additional slight assumptions on square-integrability i.e. on existing covariance have been made. Theorem 1.2. Given a random vector X with existing covariance and no Gaussian independent component, then an ISA of X exists and is unique except for scaling and permutation. Existence holds trivially but uniqueness is not obvious. Due to the limited space, we only give a short sketch of the proof in the following. The uniqueness result can easily be formulated as a subspace extraction problem, and theorem 1.2 follows readily from Lemma 1.3. Let S = (S1 , . . . , Sk ) be a square-integrable decomposition of S into irreducible independent components Si . If X is an irreducible component of S, then X ∼ Si for some i. Here the equivalence relation ∼ denotes equality except for an invertible transformation. The following two lemmata each give a simplification of lemma 1.3 by ordering the components Si according to their dimensions. Some care has to be taken when showing that lemma 1.5 implies lemma 1.4. Lemma 1.4. Let S and X be defined as in lemma 1.3. In addition assume that dim Si = dim X for i ≤ l and dim Si < dim X for i > l. Then X ∼ Si for some i ≤ l. Lemma 1.5. Let S and X be defined as in lemma 1.4, and let l = 1 and k = 2. Then X ∼ S1 . In order to prove lemma 1.5 (and hence the theorem), it is sufficient to show the following lemma: Lemma 1.6. Let S = (S1 , S2 ) with S1 irreducible and m := dim S1 > dim S2 =: n. If X = AS is again irreducible for some m × (m + n)-matrix A, then (i) the left m × m-submatrix of A is invertible, and (ii) if X is an independent component of S, the right m × n-submatrix of A vanishes. (i) follows after some linear algebra, and is necessary to show the more difficult part (ii). For this, we follow the ideas presented in [12] using factorization of the joint characteristic function of S. 1.4 Dealing with Gaussians In the previous section, Gaussians had to be excluded (or at most one was allowed) in order to avoid additional indeterminacies. Indeed, any orthogonal transformation of two decorrelated hence independent Gaussians is again independent, so clearly such a strong identification result would not be possible. Recently, a general decomposition model dealing with Gaussians was proposed in the form of the socalled non-Gaussian subspace analysis (NGSA) [3]. It tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made. More precisely, given a random vector X, a factorization X = AS with an invertible matrix A, S = (SN , SG ) and SN a square-integrable m-dimensional random vector is called an m-decomposition of X if SN and SG are stochastically independent and SG is Gaussian. 
In this case, X is said to be m-decomposable. X is denoted to be minimally n-decomposable if X is not (n − 1)-decomposable. According to our previous notation, SN and SG are independent components of X. It has been shown that the subspaces of such decompositions are unique [12]: (Footnote 1: Note that scaling here implies a basis change in the component Si, so for example in the case of a two-dimensional source component, this might be rotation and shearing. In the example later in figure 3, these indeterminacies can easily be seen by comparing true and estimated sources.) Theorem 1.7 (Uniqueness of NGSA). The mixing matrix A of a minimal decomposition is unique except for transformations in each of the two subspaces. Moreover, explicit algorithms can be constructed for identifying the subspaces [3]. This result enables us to generalize theorem 1.2 and to get a general decomposition theorem, which characterizes solutions of ISA. Theorem 1.8 (Existence and Uniqueness of ISA). Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part. Proof. Existence is obvious. Uniqueness follows after first applying theorem 1.7 to X and then theorem 1.2 to the non-Gaussian part. 2 Joint block diagonalization with unknown block-sizes. Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE) or in BSS relying on second-order temporal decorrelation. The task of (real) joint diagonalization (JD) of a set of symmetric real n×n matrices M := {M1, . . . , MK} is to find an orthogonal matrix Ê such that Ê⊤ Mk Ê is diagonal for all k = 1, . . . , K, i.e. to minimize f(Ê) := Σ_{k=1}^K ‖Ê⊤ Mk Ê − diagM(Ê⊤ Mk Ê)‖²_F with respect to the orthogonal matrix Ê, where diagM(M) produces a matrix where all off-diagonal elements of M have been set to zero, and ‖M‖²_F := tr(MM⊤) denotes the squared Frobenius norm. The Frobenius norm is invariant under conjugation by an orthogonal matrix, so minimizing f is equivalent to maximizing g(Ê) := Σ_{k=1}^K ‖diag(Ê⊤ Mk Ê)‖², where now diag(M) := (mii)_i denotes the diagonal of M. For the actual minimization of f, respectively maximization of g, we will use the common approach of Jacobi-like optimization by iterative applications of Givens rotations in two coordinates [5]. 2.1 Generalization to blocks. In the following we will use a generalization of JD in order to solve ISA problems. Instead of fully diagonalizing all n × n matrices Mk ∈ M, in joint block diagonalization (JBD) of M we want to determine E such that E⊤ Mk E is block-diagonal. Depending on the application, we fix the block-structure in advance or try to determine it from M. We are not interested in the order of the blocks, so the block-structure is uniquely specified by fixing a partition of n, i.e. a way of writing n as a sum of positive integers, where the order of the addends is not significant. So let n = m1 + . . . + mr with m1 ≤ m2 ≤ . . . ≤ mr and set m := (m1, . . . , mr) ∈ N^r. An n × n matrix is said to be m-block diagonal if it is of the form diag(D1, . . . , Dr), i.e. block diagonal with arbitrary mi × mi matrices Di along the diagonal and zeros elsewhere.
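The Jacobi-like scheme referred to above — minimizing f, or equivalently maximizing g, by iterated Givens rotations over coordinate pairs — is compact enough to sketch directly. The Python sketch below is an illustration written for this listing, not the authors' implementation: for brevity each rotation angle is found by 1-D numerical search rather than by the closed-form update of Cardoso and Souloumiac [5], and all helper names are ours.

import numpy as np
from scipy.optimize import minimize_scalar

def givens(n, i, j, theta):
    # Rotation by theta in the (i, j) coordinate plane.
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = s, -s
    return G

def joint_diagonalize(Ms, sweeps=20):
    # Approximate joint diagonalizer E of symmetric matrices Ms, obtained by
    # Jacobi sweeps that greedily maximize g(E) = sum_k ||diag(E^T M_k E)||^2.
    n = Ms[0].shape[0]
    E = np.eye(n)
    Ms = [M.copy() for M in Ms]
    for _ in range(sweeps):
        for i in range(n - 1):
            for j in range(i + 1, n):
                def neg_g(theta, i=i, j=j):
                    G = givens(n, i, j, theta)
                    return -sum(np.sum(np.diag(G.T @ M @ G) ** 2) for M in Ms)
                theta = minimize_scalar(neg_g, bounds=(-np.pi / 4, np.pi / 4),
                                        method="bounded").x
                G = givens(n, i, j, theta)
                Ms = [G.T @ M @ G for M in Ms]
                E = E @ G
    return E, Ms   # E and the (approximately) diagonalized matrices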
As generalization of JD in the case of known block structure, we can formulate the joint m-block diagonalization (m-JBD) problem as the minimization of f^m(Ê) := Σ_{k=1}^K ‖Ê⊤ Mk Ê − diagM^m(Ê⊤ Mk Ê)‖²_F with respect to the orthogonal matrix Ê, where diagM^m(M) produces an m-block diagonal matrix by setting all other elements of M to zero. In practice due to estimation errors, such E will not exist, so we speak of approximate JBD and imply minimizing some error measure on non-block-diagonality. Indeterminacies of any m-JBD are m-scaling, i.e. multiplication by an m-block diagonal matrix from the right, and m-permutation, defined by a permutation matrix that only swaps blocks of the same size. Finally, we speak of general JBD if we search for a JBD but no block structure is given; instead it is to be determined from the matrix set. For this it is necessary to require a block structure of maximal length, otherwise trivial solutions or 'in-between' solutions could exist (and obviously contain high indeterminacies). (Footnote 2: We do not use the convention from Ferrers graphs of specifying partitions in decreasing order, as a visualization of increasing block-sizes seems to be preferable in our setting.) Formally, E is said to be a (general) JBD of M if (E, m) = argmax_{m | ∃E: f^m(E)=0} |m|. In practice due to errors, a true JBD would always result in the trivial decomposition m = (n), so we define an approximate general JBD by requiring f^m(E) < ǫ for some fixed constant ǫ > 0 instead of f^m(E) = 0. 2.2 JBD by JD. A few algorithms to actually perform JBD have been proposed, see [1] and references therein. In the following we will simply perform joint diagonalization and then permute the columns of E to achieve block-diagonality — in experiments this turns out to be an efficient solution to JBD [1]. This idea has been formulated in a conjecture [1] essentially claiming that a minimum of the JD cost function f already is a JBD, i.e. a minimum of the function f^m up to a permutation matrix. Indeed, in the conjecture it is required to use the Jacobi-update algorithm from [5], but this is not necessary, and we can prove the conjecture partially: We want to show that JD implies JBD up to permutation, i.e. if E is a minimum of f, then there exists a permutation P such that f^m(EP) = 0 (given existence of a JBD of M). But of course f(EP) = f(E), so we will show why (certain) JBD solutions are minima of f. However, JD might have additional minima. First note that clearly not any JBD minimizes f, only those such that in each block of size mk, f(E) when restricted to the block is maximal over E ∈ O(mk). We will call such a JBD block-optimal in the following. Theorem 2.1. Any block-optimal JBD of M (zero of f^m) is a local minimum of f. Proof. Let E ∈ O(n) be block-optimal with f^m(E) = 0. We have to show that E is a local minimum of f or equivalently a local maximum of the squared diagonal sum g. After substituting each Mk by E⊤ Mk E, we may already assume that Mk is m-block diagonal, so we have to show that E = I is a local maximum of g. Consider the elementary Givens rotation Gij(ǫ), defined for i < j and ǫ ∈ (−1, 1) as the orthogonal matrix where all diagonal elements are 1 except for the two elements √(1 − ǫ²) in rows i and j, and with all off-diagonal elements equal to 0 except for the two elements ǫ and −ǫ at (i, j) and (j, i), respectively. It can be used to construct local coordinates of the d := n(n − 1)/2-dimensional manifold O(n) at I, simply by ι(ǫ12, ǫ13, . . . , ǫ_{n−1,n}) := ∏_{i<j} Gij(ǫij); in these coordinates we write h := g ∘ ι.
If i, j are in the same block of m, then h is locally maximal, i.e. negative semi-definite at 0, in the direction ǫij because of block-optimality. Now assume i and j are from different blocks. After possible permutation, we may assume that j = i + 1 so that each matrix Mk ∈ M has (Mk)ij = (Mk)ji = 0, and ak := (Mk)ii, bk := (Mk)jj. Then Gij(ǫ)⊤ Mk Gij(ǫ) can be easily calculated at coordinates (i, i) to (j, j), and indeed entries on the diagonal other than at indices (i, i) and (j, j) are not changed, so ‖diag(Gij(ǫ)⊤ Mk Gij(ǫ))‖² − ‖diag(Mk)‖² = −2ak(ak − bk)ǫ² + 2bk(ak − bk)ǫ² + 2(ak − bk)²ǫ⁴ = −2(ak² + bk²)ǫ² + 2(ak − bk)²ǫ⁴. Hence h(0, . . . , 0, ǫij, 0, . . . , 0) − h(0) = −c ǫij² + d ǫij⁴ with c = 2 Σ_{k=1}^K (ak² + bk²) and d = 2 Σ_{k=1}^K (ak − bk)². Now either c = 0, then also d = 0 and h is constant zero in the direction ǫij. Or, more interestingly, c ≠ 0, then c > 0 and therefore h is negative definite in the direction ǫij. Altogether we get a negative definite h at 0 except for 'trivial directions', and hence a local maximum at 0. 2.3 Recovering the permutation. In order to perform JBD, we therefore only have to find a JD E of M. What is left according to the above theorem is to find a permutation matrix P such that EP block-diagonalizes M. In the case of known block-order m, we can employ similar techniques as used in [1, 10], which essentially find P by some combinatorial optimization. [Figure 2 shows three 40 × 40 matrix images: (a) the (unknown) block-diagonal M1, (b) Ê⊤E without the recovered permutation, (c) Ê⊤E.] Figure 2: Performance of the proposed general JBD algorithm in the case of the (unknown) block-partition 40 = 1 + 2 + 2 + 3 + 3 + 5 + 6 + 6 + 6 + 6 in the presence of noise with SNR of 5dB. The product Ê⊤E of the inverse of the estimated block diagonalizer and the original one is an m-block diagonal matrix except for permutation within groups of the same sizes, as claimed in section 2.2. In the case of unknown block-size, we propose to use the following simple permutation-recovery algorithm: consider the mean diagonalized matrix D := K⁻¹ Σ_{k=1}^K E⊤ Mk E. Due to the assumption that M is m-block-diagonalizable (with unknown m), each E⊤ Mk E and hence also D must be m-block-diagonal except for a permutation P, so it must have the corresponding number of zeros in each column and row. In the approximate JBD case, thresholding with a threshold θ is necessary, whose choice is non-trivial. We propose using algorithm 1 to recover the permutation; we denote its resulting permuted matrix by P(D) when applied to the input D. P(D) is constructed from possibly thresholded D by iteratively permuting columns and rows in order to guarantee that all non-zeros of D are clustered along the diagonal as closely as possible. This recovers the permutation as well as the partition m of n.
Algorithm 1: Block-diagonality permutation finder
Input: (n × n)-matrix D
Output: block-diagonal matrix P(D) := D′ such that D′ = PDP⊤ for a permutation matrix P
D′ ← D
for i ← 1 to n do
    repeat
        if (j0 ← min{j | j ≥ i and d′_ij = 0 and d′_ji = 0}) exists then
            if (k0 ← min{k | k > j0 and (d′_ik ≠ 0 or d′_ki ≠ 0)}) exists then
                swap column j0 of D′ with column k0
                swap row j0 of D′ with row k0
    until no swap has occurred
We illustrate the performance of the proposed JBD algorithm as follows: we generate a set of K = 100 m-block-diagonal matrices Dk of dimension 40 × 40 with m = (1, 2, 2, 3, 3, 5, 6, 6, 6, 6). They have been generated in blocks of size m with coefficients chosen randomly uniform from [−1, 1], and symmetrized by Dk ← (Dk + Dk⊤)/2. After that, they have been mixed by a random orthogonal mixing matrix E ∈ O(40), i.e. Mk := E Dk E⊤ + N, where N is a noise matrix with independent Gaussian entries such that the resulting signal-to-noise ratio is 5dB. Application of the JBD algorithm from above to {M1, . . . , MK} with threshold θ = 0.1 correctly recovers the block sizes, and the estimated block diagonalizer Ê equals E up to m-scaling and permutation, as illustrated in figure 2. 3 SJADE — a simple algorithm for general ISA. As usual, by preprocessing of the observations X by whitening we may assume that Cov(X) = I. The indeterminacies allow scaling transformations in the sources, so without loss of generality let also Cov(S) = I. Then I = Cov(X) = A Cov(S) A⊤ = AA⊤, so A is orthogonal. Due to the ISA assumptions, the fourth-order cross cumulants of the sources have to be trivial between different groups, and within the Gaussians. In order to find transformations of the mixtures fulfilling this property, we follow the idea of the JADE algorithm, but now in the ISA setting. We perform JBD of the (whitened) contracted quadricovariance matrices defined by Cij(X) := E[(X⊤ Eij X) XX⊤] − Eij − Eij⊤ − tr(Eij) I. Here RX := Cov(X), and the Eij, 1 ≤ i, j ≤ n, are a set of eigen-matrices of Cij. One simple choice is to use n² matrices Eij with zeros everywhere except 1 at index (i, j). More elaborate choices of eigen-matrices (with only n(n + 1)/2 or even n entries) are possible. The resulting algorithm, subspace-JADE (SJADE), not only performs NGCA by grouping Gaussians as one-dimensional components with trivial Cii's, but also automatically finds the subspace partition m using the general JBD algorithm from section 2.3. [Figure 3 panels (a)–(d): densities of the sources S2–S5; panel (e): the map Â⁻¹A; panels (f)–(j): the recovered sources, including a scatter plot of (Ŝ1, Ŝ2) and a histogram of Ŝ3.] Figure 3: Example application of general ISA for unknown sizes m = (1, 2, 2, 2, 3). Shown are the scatter plots, i.e. densities, of the source components and the mixing-separating map Â⁻¹A. 4 Experimental results. In a first example, we consider a general ISA problem in dimension n = 10 with the unknown partition m = (1, 2, 2, 2, 3).
In order to generate 2- and 3-dimensional irreducible random vectors, we decided to follow the nice visual ideas from [10] and to draw samples from a density following a known shape — in our case 2d-letters or 3d-geometrical shapes. The chosen sources densities are shown in figure 3(a-d). Another 1-dimensional source following a uniform distribution was constructed. Altogether 104 samples were used. The sources S were mixed by a mixing matrix A with coefficients uniformly randomly sampled from [−1, 1] to give mixtures X = AS. The recovered ˆ mixing matrix A was then estimated using the above block-JADE algorithm with unknown block size; we observed that the method is quite sensitive to the choice of the threshold (here θ = 0.015). ˆ Figure 3(e) shows the composed mixing-separating system A−1 A; clearly the matrices are equal except for block permutation and scaling, which experimentally confirms theorem 1.8. The algoˆ rithm found a partition m = (1, 1, 1, 2, 2, 3), so one 2d-source was misinterpreted as two 1d-sources, but by using previous knowledge combination of the correct two 1d-sources yields the original 2dˆ ˆ source. The resulting recovered sources S := A−1 X, figures 3(f-j), then equal the original sources except for permutation and scaling within the sources — which in the higher-dimensional cases implies transformations such as rotation of the underlying images or shapes. When applying ICA (1-ISA) to the above mixtures, we cannot expect to recover the original sources as explained in figure 1; however, some algorithms might recover the sources up to permutation. Indeed, SJADE equals JADE with additional permutation recovery because the joint block diagonalization is per- formed using joint diagonalization. This explains why JADE retrieves meaningful components even in this non-ICA setting as observed in [4]. In a second example, we illustrate how the algorithm deals with Gaussian sources i.e. how the subspace JADE also includes NGCA. For this we consider the case n = 5, m = (1, 1, 1, 2) and sources with two Gaussians, one uniform and a 2-dimensional irreducible component as before; 105 samples were drawn. We perform 100 Monte-Carlo simulations with random mixing matrix ˆ A, and apply SJADE with θ = 0.01. The recovered mixing matrix A is compared with A by 3 2 2 2 ˆ −1 A. Indeed, we get nearly taking the ad-hoc measure ι(P) := i=1 j=1 (pij + pji ) for P := A perfect recovery in 99 out of 100 runs, the median of ι(P) is very low with 0.0083. A single run diverges with ι(P ) = 3.48. In order to show that the algorithm really separates the Gaussian part from the other components, we compare the recovered source kurtoses. The median kurtoses are −0.0006 ± 0.02, −0.003 ± 0.3, −1.2 ± 0.3, −1.2 ± 0.2 and −1.6 ± 0.2. The first two components have kurtoses close to zero, so they are the two Gaussians, whereas the third component has kurtosis of around −1.2, which equals the kurtosis of a uniform density. This confirms the applicability of the algorithm in the general, noisy ISA setting. 5 Conclusion Previous approaches for independent subspace analysis were restricted either to fixed group sizes or semi-parametric models. In neither case, general applicability to any kind of mixture data set was guaranteed, so blind source separation might fail. 
In the present contribution we introduce the concept of irreducible independent components and give an identifiability result for this general, parameter-free model together with a novel arbitrary-subspace-size algorithm based on joint block diagonalization. As in ICA, the main uniqueness theorem is an asymptotic result (but includes the noisy case via NGCA). However, in practice in the finite sample case, due to estimation errors the general joint block diagonality only approximately holds. Our simple solution in this contribution was to choose appropriate thresholds. But this choice is non-trivial, and adaptive methods are to be developed in future works.
References
[1] K. Abed-Meraim and A. Belouchrani. Algorithms for joint block diagonalization. In Proc. EUSIPCO 2004, pages 209–212, Vienna, Austria, 2004.
[2] F.R. Bach and M.I. Jordan. Finding clusters in independent component analysis. In Proc. ICA 2003, pages 891–896, 2003.
[3] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. Müller. In search of non-Gaussian components of a high-dimensional distribution. JMLR, 7:247–282, 2006.
[4] J.F. Cardoso. Multidimensional independent component analysis. In Proc. of ICASSP '98, Seattle, 1998.
[5] J.F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl., 17(1):161–164, January 1995.
[6] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994.
[7] A. Hyvärinen and P.O. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.
[8] A. Hyvärinen, P.O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1525–1558, 2001.
[9] J.K. Lin. Factorizing multivariate function classes. In Advances in Neural Information Processing Systems, volume 10, pages 563–569, 1998.
[10] B. Poczos and A. Lőrincz. Independent subspace analysis using k-nearest neighborhood distances. In Proc. ICANN 2005, volume 3696 of LNCS, pages 163–168, Warsaw, Poland, 2005. Springer.
[11] F.J. Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 84(5):951–956, 2004.
[12] F.J. Theis and M. Kawanabe. Uniqueness of non-Gaussian subspace analysis. In Proc. ICA 2006, pages 917–925, Charleston, USA, 2006.
[13] R. Vollgraf and K. Obermayer. Multi-dimensional ICA to separate correlated sources. In Proc. NIPS 2001, pages 993–1000, 2001.
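To connect the SJADE construction above to code, the Python sketch below builds the whitened fourth-order cumulant matrices that get block-diagonalized, using the simple eigen-matrix choice Eij = ei ej⊤ mentioned in section 3, so that Cij reduces to E[xi xj x x⊤] − δij I − ei ej⊤ − ej ei⊤ for whitened, zero-mean data. It assumes the data have already been whitened and is meant only as an illustration, with our own helper names; the resulting matrices could then be fed to a joint (block) diagonalizer such as the Jacobi-style sketch shown earlier in this listing.

import numpy as np

def quadricovariance_matrices(X):
    # Contracted quadricovariance matrices C_ij of whitened, zero-mean data X
    # (rows are samples), with eigen-matrices E_ij = e_i e_j^T:
    #   C_ij = E[x_i x_j x x^T] - delta_ij I - e_i e_j^T - e_j e_i^T
    T, n = X.shape
    I = np.eye(n)
    mats = []
    for i in range(n):
        for j in range(n):
            M = (X * (X[:, i] * X[:, j])[:, None]).T @ X / T   # E[x_i x_j x x^T]
            if i == j:
                M = M - I
            M = M - np.outer(I[i], I[j]) - np.outer(I[j], I[i])
            mats.append(0.5 * (M + M.T))   # symmetrize against sampling noise
    return mats

# Typical use: whiten X, compute these matrices, then joint-block-diagonalize them
# and read the subspace partition off the recovered block structure.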

5 0.56228966 16 nips-2006-A Theory of Retinal Population Coding

Author: Eizaburo Doi, Michael S. Lewicki

Abstract: Efficient coding models predict that the optimal code for natural images is a population of oriented Gabor receptive fields. These results match response properties of neurons in primary visual cortex, but not those in the retina. Does the retina use an optimal code, and if so, what is it optimized for? Previous theories of retinal coding have assumed that the goal is to encode the maximal amount of information about the sensory signal. However, the image sampled by retinal photoreceptors is degraded both by the optics of the eye and by the photoreceptor noise. Therefore, de-blurring and de-noising of the retinal signal should be important aspects of retinal coding. Furthermore, the ideal retinal code should be robust to neural noise and make optimal use of all available neurons. Here we present a theoretical framework to derive codes that simultaneously satisfy all of these desiderata. When optimized for natural images, the model yields filters that show strong similarities to retinal ganglion cell (RGC) receptive fields. Importantly, the characteristics of receptive fields vary with retinal eccentricities where the optical blur and the number of RGCs are significantly different. The proposed model provides a unified account of retinal coding, and more generally, it may be viewed as an extension of the Wiener filter with an arbitrary number of noisy units. 1

6 0.54048592 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

7 0.48271048 65 nips-2006-Denoising and Dimension Reduction in Feature Space

8 0.45837522 75 nips-2006-Efficient sparse coding algorithms

9 0.43567309 46 nips-2006-Blind source separation for over-determined delayed mixtures

10 0.42407152 79 nips-2006-Fast Iterative Kernel PCA

11 0.41342875 29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing

12 0.4047336 129 nips-2006-Map-Reduce for Machine Learning on Multicore

13 0.38048592 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm

14 0.37463227 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis

15 0.36561 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

16 0.36166385 189 nips-2006-Temporal dynamics of information content carried by neurons in the primary visual cortex

17 0.35472509 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

18 0.35255784 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

19 0.34516197 175 nips-2006-Simplifying Mixture Models through Function Approximation

20 0.33923128 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.104), (3, 0.026), (7, 0.054), (9, 0.096), (18, 0.228), (20, 0.03), (22, 0.09), (44, 0.05), (57, 0.088), (65, 0.067), (69, 0.022), (71, 0.034), (90, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83456016 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

Author: J.t. Lindgren, Aapo Hyvärinen

Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1

2 0.65609103 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

3 0.64630759 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
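A minimal way to probe the claim in this abstract is to project data onto the leading kernel PCA components and watch where a linear classifier's accuracy saturates. The scikit-learn sketch below does exactly that on a toy two-moons problem; the kernel, its bandwidth and the component counts are illustrative choices, not values from the paper.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=600, noise=0.2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

kpca = KernelPCA(n_components=30, kernel="rbf", gamma=2.0).fit(Xtr)
Ztr, Zte = kpca.transform(Xtr), kpca.transform(Xte)

# accuracy of a linear classifier on the k leading kernel PCA components
for k in (1, 2, 5, 10, 30):
    clf = LogisticRegression(max_iter=1000).fit(Ztr[:, :k], ytr)
    print(k, "leading components -> test accuracy", round(clf.score(Zte[:, :k], yte), 3))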

4 0.64200217 158 nips-2006-PG-means: learning the number of clusters in data

Author: Yu Feng, Greg Hamerly

Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1
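The core test sketched in this abstract — project both the data and the fitted mixture onto a one-dimensional direction and check the fit there — can be written down compactly. The Python sketch below uses scikit-learn's GaussianMixture and a Kolmogorov–Smirnov test against the projected mixture CDF; the random projections, the Bonferroni-style correction and the thresholds are our illustrative choices, and the sketch ignores the subtleties of testing with estimated parameters that the full method has to handle.

import numpy as np
from scipy.stats import kstest, norm
from sklearn.mixture import GaussianMixture

def projections_accept_model(X, gmm, n_projections=12, alpha=0.01, seed=0):
    # True if no random 1-D projection rejects the fitted Gaussian mixture.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    for _ in range(n_projections):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        x1 = X @ v                                   # projected data
        mu = gmm.means_ @ v                          # projected component means
        sd = np.sqrt(np.einsum("i,kij,j->k", v, gmm.covariances_, v))
        def mixture_cdf(t, mu=mu, sd=sd, w=gmm.weights_):
            t = np.atleast_1d(np.asarray(t, dtype=float))
            return np.sum(w * norm.cdf((t[:, None] - mu) / sd), axis=1)
        if kstest(x1, mixture_cdf).pvalue < alpha / n_projections:
            return False
    return True

X = np.vstack([np.random.randn(300, 3) + m
               for m in ([0, 0, 0], [5, 0, 0], [0, 5, 0])])
for k in (1, 2, 3, 4):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, "components:", "accepted" if projections_accept_model(X, gmm) else "rejected")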

5 0.63948518 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1

6 0.63619643 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

7 0.63609719 79 nips-2006-Fast Iterative Kernel PCA

8 0.63430691 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

9 0.63287199 175 nips-2006-Simplifying Mixture Models through Function Approximation

10 0.63102102 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

11 0.62994969 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights

12 0.62986213 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

13 0.62905669 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons

14 0.62869614 61 nips-2006-Convex Repeated Games and Fenchel Duality

15 0.62840873 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

16 0.62664545 7 nips-2006-A Local Learning Approach for Clustering

17 0.62571859 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

18 0.6245625 194 nips-2006-Towards a general independent subspace analysis

19 0.62328297 154 nips-2006-Optimal Change-Detection and Spiking Neurons

20 0.62189561 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency