nips nips2006 nips2006-167 knowledge-graph by maker-knowledge-mining

167 nips-2006-Recursive ICA


Source: pdf

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA may be applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. [sent-5, score-0.467]

2 Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. [sent-6, score-0.116]

3 Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA may be applied again. [sent-7, score-0.176]

4 This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. [sent-8, score-0.138]

5 1 Introduction Linear implementations of Barlow’s efficient encoding hypothesis, such as ICA [1] and sparse coding [2], have been used to explain the very first layers of auditory and visual information processing in the cerebral cortex [1, 2, 3]. [sent-9, score-0.51]

6 Nevertheless, many interesting structures are nonlinear functions of the stimulus inputs, which are unlikely to be captured by a linear model. [sent-10, score-0.171]

7 For example, for natural images, it has been observed that there is still significant statistical dependency between the variances of the filter outputs [4]. [sent-11, score-0.276]

8 Several extensions of the linear ICA algorithm [5, 6, 7, 8] have been proposed to reduce such residual nonlinear redundancy, with an explicit or implicit aim of explaining higher perceptual layers, such as complex cells in V1. [sent-12, score-0.299]

9 In this paper, we propose a hierarchical redundancy reduction model in which the problem of modeling the residual nonlinear dependency is transformed into another LEE (linear efficient encoding) problem, as illustrated in Figure 1. [sent-14, score-0.59]

10 For example, fMRI studies have shown that removal of one sensory modality leads to neural reorganization of the remaining modalities [9], suggesting that the same principles must be at work across modalities. [sent-18, score-0.214]

11 Since the LEE model has been so successful in explaining the very first layer of perceptual information processing in the cerebral cortex, it seems reasonable to hypothesize that higher layers might also be explained by a LEE model. [sent-19, score-0.719]

12 The problem at hand is then how to transform the problem of modeling the residual nonlinear dependency into a LEE problem. [sent-20, score-0.282]

13 After that, we will derive the transformation function that “prepares” the output of ICA for its recursive application, and then test this model on natural images. [sent-23, score-0.188]

14 After the first layer of linear efficient encoding, sensory inputs X are now represented by S. [sent-26, score-0.666]

15 Then coordinate-wise nonlinear activation functions gi are applied to each dimension of S, so that the input of the next layer X = g(|S|) satisfies the input constraints imposed by the LEE model. [sent-28, score-0.802]

16 The statistical structure among dimensions of X is then extracted by the next layer of linear efficient encoding. [sent-29, score-0.505]

17 Barlow provided the insight that the statistical structure is measured by the redundancy of the stimuli and that completely independent stimuli cannot be distinguished from random noise [11]. [sent-31, score-0.493]

18 He also hypothesized that one way for the neural system to capture the statistical structure is to remove the redundancy in the sensory outputs. [sent-32, score-0.58]

19 This so-called redundancy reduction principle forms the foundation of ICA algorithms. [sent-33, score-0.322]

20 Algorithms following the sparse coding principle are also able to find interesting structures when applied to natural image patches [2]. [sent-34, score-0.398]

21 Later it was realized that although ICA and sparse coding algorithms started out from different principles and goals, their implementations can be summarized in the same Bayesian framework [12]. [sent-35, score-0.216]

22 Also, it is assumed that the features Sj are independent from each other, and that the marginal distribution of Sj is sparse. [sent-37, score-0.148]

23 For the sparse coding algorithm described in [2], although it started from the goal of finding sparse features, the algorithm’s implementation implicitly assumes the independence of Sj ’s. [sent-38, score-0.253]

24 For the infomax ICA algorithm [1], although it aimed at finding independent features, the algorithm’s implementation assumes a sparse marginal prior (p(Sj ) ∝ sech(Sj )). [sent-39, score-0.171]
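
As a rough illustration (not part of the paper), the following Python sketch shows the natural-gradient infomax update that such a sech marginal prior implies; the score function −d log p(s)/ds = tanh(s) is what enters the update. The data matrix X is assumed to be whitened, with one sample per column.

    import numpy as np

    def infomax_ica(X, n_iter=200, lr=0.01, seed=0):
        # Natural-gradient infomax ICA with a sech-like marginal prior.
        # X: (d, n) whitened data, one sample per column; returns W with S = W @ X.
        d, n = X.shape
        rng = np.random.default_rng(seed)
        W = np.eye(d) + 0.1 * rng.standard_normal((d, d))
        for _ in range(n_iter):
            S = W @ X
            # the score function of p(s) proportional to sech(s) is tanh(s)
            W += lr * (np.eye(d) - np.tanh(S) @ S.T / n) @ W
        return W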

25 A sparseness and independence assumption about the data distribution is appropriate because: (1) independence allows the system to capture the statistical structure of the stimuli, as described above, and (2) sparse distribution of the sensory outputs is energy-economic. [sent-42, score-0.588]

26 The linear efficient encoding model captures the important characteristics of sensory coding: capturing the statistical structure (independence) of sensory stimuli with minimum cost (sparseness). [sent-44, score-0.559]

27 As to the LEE model, there is a clear constraint on the marginal distribution of Xi . [sent-49, score-0.148]

28 Here we limit our study to those ICA algorithms that produce basis functions resembling the simple cells’ receptive fields when applied to natural image patches. [sent-50, score-0.352]

29 Such algorithms [1, 13, 15] typically adopt a symmetric and sparse marginal prior for Sj ’s that can be well approximated by a generalized Gaussian distribution. [sent-51, score-0.255]

30 In fact, if we apply linear filters resembling the receptive fields of simple cells to natural images, the distribution of the filter responses can be well approximated by a generalized Gaussian distribution. [sent-52, score-0.316]

31 In the above Bayesian framework, we assume that Sj ’s are independent and the marginal distribution of Sj is symmetric about zero. [sent-55, score-0.186]

32 A surprising fact about our perceptual system is that there does exist such a process that regularizes the marginal distribution of the sensory inputs. [sent-57, score-0.388]

33 The functional role of this process is generally described as removing pairwise redundancy, as natural images (as well as natural sounds) obey the 1/f power law in the frequency domain [16]. [sent-59, score-0.238]

34 However, as shown in Figure 2, it also regulates the marginal distribution of the input to follow a generalized-Gaussian-like distribution. [sent-60, score-0.148]

35 We believe that besides the functional role of removing second-order redundancy, whitening might also serve the role of formatting the sensory input for the cortex. [sent-62, score-0.218]

36 Figure 2: The distribution of pixel values of whitened images follows a generalized Gaussian distribution (see Section 2). [sent-64, score-0.184]

37 094, which means that the marginal distribution of the inputs to the LEE model is already very sparse. [sent-66, score-0.193]

38 For all the image patches we tried, the distribution of pixel values on whitened image patches can be well fitted by a generalized Gaussian distribution. [sent-68, score-0.375]

39 In this work, we will make the assumption that the marginal distribution of the inputs to the LEE model is a generalized Gaussian distribution, as this enables the LEE model to work more efficiently. [sent-71, score-0.239]
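
The claim that whitened pixel values are well fitted by a generalized Gaussian can be checked with a short sketch like the one below (not the authors' code); patches is a hypothetical (n_patches, n_pixels) array of mean-subtracted natural-image patches.

    import numpy as np
    from scipy.stats import gennorm, kurtosis

    def whitened_pixel_fit(patches):
        # ZCA-whiten the patches, then fit a generalized Gaussian to the pooled pixel values
        C = np.cov(patches, rowvar=False)
        evals, evecs = np.linalg.eigh(C)
        W_zca = evecs @ np.diag(1.0 / np.sqrt(evals + 1e-8)) @ evecs.T
        white = patches @ W_zca
        beta, loc, scale = gennorm.fit(white.ravel(), floc=0.0)
        return beta, scale, kurtosis(white.ravel())   # shape beta < 2 indicates a sparse marginal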

40 3 Reducing Residual Redundancy For the filter outputs S of a layer of LEE, we will first discard information that provides no interesting structure (i.e. [sent-73, score-0.64]

41 , redundancy), and find an activation function such that the marginal distribution obeys the input requirements of the next layer. [sent-75, score-0.384]

42 1 Discarding the Signs It has been argued that the signs of the filter outputs do not carry any redundancy [5]. [sent-77, score-0.481]

43 We have observed the usefulness of this process in a study of natural image statistics. [sent-79, score-0.134]

44 We applied the FastICA algorithm [15] to 20x20 natural image patches, and studied the joint distribution of the filter outputs. [sent-80, score-0.175]

45 As shown in the left plot of Figure 3, p(si|sj) = p(si|−sj), i.e. [sent-81, score-0.228]

46 the conditional probability of si given sj depends only on the absolute value of sj. [sent-83, score-0.55]

47 In other words, the signs of S do not provide any dependency among the dimensions. [sent-84, score-0.183]

48 By removing the sign and applying our transformation (described in the next section), the nonlinear dependency between the si ’s is exposed (see Figure 3, right). [sent-85, score-0.277]
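
As a quick numerical check of these two observations (a sketch under the assumption that S holds ICA filter responses, one component per row; it is not the authors' code), one can compare the correlation of two outputs with the correlation of their absolute values:

    import numpy as np

    def residual_dependency(S, i=0, j=1):
        # S: (n_components, n_samples) ICA filter responses on natural image patches
        raw = np.corrcoef(S[i], S[j])[0, 1]                        # close to zero after ICA
        rectified = np.corrcoef(np.abs(S[i]), np.abs(S[j]))[0, 1]  # typically clearly positive
        return raw, rectified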

49 Figure 3: Left: s1 and s2 are ICA filter responses on natural image patches. [sent-86, score-0.134]

50 2 Nonlinear Activation Function The only problem left is to find the coordinate-wise activation function gi for each dimension of S such that Xi = gi (|Si |) follows a generalized Gaussian distribution, as required by the next layer of LEE. [sent-90, score-0.801]

51 By doing so, we force the LEE model of the higher layer to set more Aj,i to nonzero values (so that the Central Limit Theorem takes effect, making Xi approximately Gaussian), which leads to more global structures at the higher layer. [sent-92, score-0.55]

52 We used two methods to find this activation function in our experiments. [sent-93, score-0.236]

53 This transformation can be seen as three consecutive steps: • Discard the sign: u ← |s|; u then has pdf g(u; σ, θ) = θ/(σΓ(1/θ)) exp{−(u/σ)^θ}, 0 ≤ u < ∞, and cdf G(u; σ, θ) = γ(1/θ, (u/σ)^θ)/Γ(1/θ), 0 ≤ u < ∞, where γ(·, ·) is the lower incomplete gamma function (mapping u through this cdf, the intermediate step, yields a uniformly distributed variable). [sent-98, score-0.173]

54 • Transform to a Gaussian distribution by applying the inverse cdf of N(0, 1): u ← F⁻¹(u). [sent-100, score-0.173]
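
A minimal sketch of this parametric activation function is given below, assuming σ and θ have already been obtained by fitting a generalized Gaussian to the layer's outputs; the clipping step is only a numerical safeguard and not part of the derivation.

    import numpy as np
    from scipy.special import gammainc   # regularized lower incomplete gamma
    from scipy.stats import norm

    def parametric_activation(s, sigma, theta):
        u = np.abs(s)                                       # step 1: discard the sign
        cdf = gammainc(1.0 / theta, (u / sigma) ** theta)   # step 2: cdf of |s|, a uniform variable
        cdf = np.clip(cdf, 1e-6, 1.0 - 1e-6)                # numerical safeguard
        return norm.ppf(cdf)                                # step 3: inverse cdf of N(0, 1)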

55 Nonparametric Activation Function When the number of samples N is sufficiently large, a non-parametric activation function works more efficiently. [sent-101, score-0.236]

56 For each sample s, cdf(|s|) is approximated by the ratio of its rank in the sorted list to N. [sent-103, score-0.132]

57 Note that since ui depends only on the rank order of |si|, the results would be the same if the signs are discarded by taking si² instead. [sent-105, score-0.101]
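
A corresponding sketch of the nonparametric version follows; dividing the ranks by N + 1 rather than N is a small adjustment (not from the paper) that keeps the inverse normal cdf finite.

    import numpy as np
    from scipy.stats import norm, rankdata

    def nonparametric_activation(s):
        u = np.abs(s)                       # the result is unchanged if s**2 is used instead
        cdf = rankdata(u) / (len(u) + 1.0)  # empirical cdf of |s| from the sample ranks
        return norm.ppf(cdf)                # map to a standard normal distribution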

58 4 Experiments on Natural Images To test the behavior of our model, we applied it to small patches taken from digitized natural images. [sent-106, score-0.134]

59 The nonparametric method was used to transform the marginal distribution of the outputs’ absolute values to a standard normal distribution. [sent-112, score-0.324]

60 Then the FastICA algorithm was applied again to retrieve 100 basis functions. [sent-113, score-0.112]
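
Putting the pieces together, the sketch below reproduces the overall recipe of this section with off-the-shelf tools; scikit-learn's FastICA stands in for the authors' implementation, the patch extraction and preprocessing are assumed, and the component counts (397, 100, 20) follow the text.

    import numpy as np
    from scipy.stats import norm, rankdata
    from sklearn.decomposition import FastICA

    def gaussianize_abs(S):
        # rank-based transform of |S| to a standard normal, applied per component (column)
        U = np.abs(S)
        ranks = np.apply_along_axis(rankdata, 0, U)
        return norm.ppf(ranks / (U.shape[0] + 1.0))

    def recursive_ica(patches, n_components=(397, 100, 20), seed=0):
        # patches: (n_samples, 400) whitened 20x20 natural-image patches (assumed input)
        X, layers = patches, []
        for k in n_components:
            ica = FastICA(n_components=k, random_state=seed, max_iter=500)
            S = ica.fit_transform(X)      # (n_samples, k) filter outputs of this layer
            layers.append(ica.mixing_)    # basis functions of this layer
            X = gaussianize_abs(S)        # prepared input for the next layer
        return layers

Calling recursive_ica(patches) returns one mixing matrix per layer; the columns of the first are the image basis functions, and those of later layers are weights over the previous layer's units.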

61 We adopted the visualization method employed by [12] to investigate what kind of structures the second layer units are fond of. [sent-114, score-0.763]

62 The basis functions are fitted to Gabor filter functions using a gradient descent algorithm [12]. [sent-115, score-0.128]
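
As a sketch of this fitting step (one common Gabor parameterization, with scipy's L-BFGS-B playing the role of the gradient descent in [12]; it is not the authors' code):

    import numpy as np
    from scipy.optimize import minimize

    def gabor(params, size=20):
        x0, y0, theta, freq, phase, sx, sy, amp = params
        y, x = np.mgrid[0:size, 0:size]
        xr = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
        yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
        envelope = np.exp(-0.5 * ((xr / sx) ** 2 + (yr / sy) ** 2))
        return amp * envelope * np.cos(2.0 * np.pi * freq * xr + phase)

    def fit_gabor(basis_function, size=20):
        # basis_function: flattened (size*size,) basis function from the first layer (assumed input)
        target = np.asarray(basis_function).reshape(size, size)
        loss = lambda p: np.sum((gabor(p, size) - target) ** 2)
        p0 = np.array([size / 2, size / 2, 0.0, 0.1, 0.0, 3.0, 3.0, float(target.max())])
        return minimize(loss, p0, method="L-BFGS-B").x   # center, orientation, frequency, phase, widths, amplitude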

63 The connection weights from a layer-2 unit to layer-1 units are shown in Figure 5, arranged by either the center or frequency/orientation of the fitted Gabor filters. [sent-116, score-0.273]

64 The layer-2 units are qualitatively similar to those found in [18]. [sent-117, score-0.273]

65 Some units welcome strong activation of layer-1 units within a certain orientation range but have no preference for locations, while others have a location preference but welcome activation of layer-1 units of all frequencies and orientations, and some develop a picky appetite for both. [sent-118, score-1.387]

66 Again, the nonparametric method was used to transform the marginal distribution of the absolute values of the outputs from the second layer to a standard normal distribution, and FastICA was applied to retrieve 20 basis functions. [sent-119, score-0.964]

67 We had no initial guess of what kind of statistical structure these third layer units might capture. [sent-120, score-0.82]

68 The activation map of a couple of these units, however, seemed to suggest that they might be tuned to respond to complicated textures. [sent-121, score-0.299]

69 We think that a larger database than merely 10 images, and larger image patches, would probably help produce cleaner high-level units. [sent-123, score-0.122]

70 This reduction in the number of units follows the example of [18]. [sent-128, score-0.304]

71 In general, there appears to be less information in later layers (as assessed by eigenvalue analysis), most likely due to the discarding of the sign. [sent-129, score-0.171]

72 Figure 4: A subset of the 397 ICA image basis functions. [sent-130, score-0.129]

73 The upper panel arranges the connection weights from layer-2 units to layer-1 units by the centers of the fitted Gabor filters. [sent-134, score-0.602]

74 For example, the leftmost unit prefers strong activation of layer-1 units located on the right and weak activation of layer-1 units on the left. [sent-137, score-1.123]

75 The lower panel arranges the connection weights by the frequencies and the orientations of the fitted Gabor filters. [sent-138, score-0.114]

76 The third leftmost unit welcomes strong activation of Gabor filters whose orientations are around 3π/4 but prefers no/little activation from those whose orientations are around π/4. [sent-140, score-0.735]

77 5 Discussion The key idea of our model is to transform the high-order residual redundancy to linear dependency that can be easily exploited again by the LEE model. [sent-141, score-0.516]

78 By using activation functions that are dependent on the marginal distribution of the outputs, a normal Gaussian interface is provided at every layer. [sent-142, score-0.468]

79 As the redundancy is reduced progressively along the layers, statistical structures are also captured to progressively higher orders. [sent-144, score-0.515]

80 Our simulation of a three layer Recursive ICA shows the effectiveness of our model. [sent-145, score-0.439]

81 The second layer, however, produces basis functions that qualitatively resemble those produced by a previous hierarchical generative model [7]. [sent-147, score-0.141]

82 This is remarkable given that our model is essentially a filtering model with no assumptions of underlying independent variables, but merely targeting redundancy reduction. [sent-148, score-0.291]

83 The advantage of our model is the theoretical simplicity of generalization to a third layer or more. [sent-149, score-0.481]

84 For the Karklin and Lewicki model, the assumption that the ultimate independent causal variables are two layers away from the images has to be reworked for a three layer system. [sent-150, score-0.593]

85 It is not clear how the variables at every layer should affect the next when an extra layer is added. [sent-151, score-0.878]

86 The energy function used at the first layer made it essentially a linear ICA algorithm; thus it also produces Gabor-like filters. [sent-154, score-0.484]

87 The first layer outputs are squared to discard the signs and then fed to the next layer. [sent-155, score-0.707]

88 The inputs for the second layer are thus all positive and bear a very different marginal distribution from those for the first layer. [sent-156, score-0.669]

89 The energy function is changed accordingly and the second layer is essentially doing nonnegative ICA. [sent-157, score-0.484]

90 The output of this layer, however, will all be positive, which makes discarding the signs no longer an effective way of exposing higher-order dependence. [sent-158, score-0.174]

91 Thus, to extend to another layer, new activation functions and a new energy function must be derived. [sent-159, score-0.311]

92 The third layer of our model produces some interesting results in that some units seem to have preferences for complicated textures (Figure 6). [sent-160, score-0.793]

93 Figure 6: Activation maps on two images (upper and lower panels, respectively) for two units per layer. [sent-162, score-0.329]

94 The second column from the left through the rightmost column show the activation maps of two units each from the first layer to the third, respectively. [sent-164, score-0.99]

95 The first layer units respond to small local edges, the second layer units respond to larger borders, and the third layer units seem to respond to large areas of texture. [sent-165, score-2.367]

96 Also, as units at the second layer have larger receptive fields than those at the first layer, it is reasonable to expect the third layer to bear even larger ones. [sent-166, score-1.306]

97 We believe that a wider range of visual structure will be picked up by the third layer units with a larger patch size on a larger training set. [sent-167, score-0.822]

98 Emergence of simple-cell receptive field properties by learning a sparse code for natural images. [sent-178, score-0.213]

99 A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. [sent-190, score-0.301]

100 A hierarchical bayesian model for learning non-linear statistical regularities in non-stationary natural signals. [sent-200, score-0.148]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('layer', 0.439), ('redundancy', 0.291), ('units', 0.273), ('ica', 0.273), ('activation', 0.236), ('sj', 0.228), ('sensory', 0.182), ('gabor', 0.158), ('cdf', 0.132), ('lee', 0.121), ('marginal', 0.107), ('signs', 0.101), ('layers', 0.098), ('outputs', 0.089), ('coding', 0.088), ('residual', 0.086), ('tted', 0.084), ('fastica', 0.082), ('dependency', 0.082), ('discard', 0.078), ('receptive', 0.076), ('recursive', 0.074), ('lter', 0.074), ('natural', 0.073), ('discarding', 0.073), ('karklin', 0.073), ('stimuli', 0.068), ('basis', 0.068), ('bruno', 0.066), ('leftmost', 0.066), ('sparse', 0.064), ('respond', 0.063), ('cerebral', 0.062), ('si', 0.061), ('image', 0.061), ('patches', 0.061), ('encoding', 0.061), ('osindero', 0.058), ('orientations', 0.058), ('perceptual', 0.058), ('nonlinear', 0.057), ('transform', 0.057), ('images', 0.056), ('arranges', 0.056), ('horace', 0.056), ('normal', 0.054), ('simon', 0.051), ('structures', 0.051), ('metabolism', 0.048), ('welcome', 0.048), ('barlow', 0.048), ('yan', 0.048), ('eero', 0.048), ('ggd', 0.048), ('generalized', 0.046), ('energy', 0.045), ('inputs', 0.045), ('retrieve', 0.044), ('hyvarinen', 0.044), ('aapo', 0.044), ('resembling', 0.044), ('whitened', 0.044), ('geoffrey', 0.044), ('hierarchical', 0.043), ('third', 0.042), ('michael', 0.042), ('transformation', 0.041), ('hypothesized', 0.041), ('deviate', 0.041), ('distribution', 0.041), ('gi', 0.04), ('textures', 0.039), ('lewicki', 0.039), ('progressively', 0.039), ('prefers', 0.039), ('survival', 0.039), ('symmetric', 0.038), ('independence', 0.037), ('bear', 0.037), ('auditory', 0.037), ('removing', 0.036), ('cells', 0.036), ('olshausen', 0.035), ('cortex', 0.034), ('visual', 0.034), ('structure', 0.034), ('absolute', 0.033), ('lters', 0.033), ('captured', 0.033), ('nonparametric', 0.032), ('statistical', 0.032), ('principles', 0.032), ('implementations', 0.032), ('explaining', 0.032), ('david', 0.031), ('reduction', 0.031), ('sparseness', 0.031), ('functions', 0.03), ('higher', 0.03), ('shape', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA may be applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1

2 0.29708162 88 nips-2006-Greedy Layer-Wise Training of Deep Networks

Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle

Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

3 0.21956913 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

Author: J.t. Lindgren, Aapo Hyvärinen

Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1

4 0.14720213 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

Author: Marc'aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann L. Cun

Abstract: We describe a novel unsupervised method for learning sparse, overcomplete features. The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. The model produces “stroke detectors” when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. Inference and learning are very fast, requiring no preprocessing, and no expensive sampling. Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. Finally, an extension of the method is described to learn topographical filter maps. 1

5 0.13817711 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis

Abstract: We propose a non-linear generative model for human motion data that uses an undirected model with binary latent variables and real-valued “visible” variables that represent joint angles. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. Such an architecture makes on-line inference efficient and allows us to use a simple approximate learning procedure. After training, the model finds a single set of parameters that simultaneously capture several different kinds of motion. We demonstrate the power of our approach by synthesizing various motion sequences and by performing on-line filling in of data lost during motion capture. Website: http://www.cs.toronto.edu/∼gwtaylor/publications/nips2006mhmublv/

6 0.13795148 75 nips-2006-Efficient sparse coding algorithms

7 0.12008224 46 nips-2006-Blind source separation for over-determined delayed mixtures

8 0.11961122 86 nips-2006-Graph-Based Visual Saliency

9 0.11803487 16 nips-2006-A Theory of Retinal Population Coding

10 0.10350916 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

11 0.10243677 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

12 0.09418314 194 nips-2006-Towards a general independent subspace analysis

13 0.082429022 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

14 0.068135366 81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding

15 0.067804977 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo

16 0.065870546 192 nips-2006-Theory and Dynamics of Perceptual Bistability

17 0.061317947 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests

18 0.061043445 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis

19 0.055805907 154 nips-2006-Optimal Change-Detection and Spiking Neurons

20 0.05557676 153 nips-2006-Online Clustering of Moving Hyperplanes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.217), (1, -0.069), (2, 0.147), (3, -0.093), (4, -0.028), (5, -0.122), (6, -0.047), (7, -0.065), (8, -0.065), (9, 0.264), (10, 0.05), (11, -0.101), (12, -0.076), (13, -0.182), (14, -0.203), (15, -0.021), (16, 0.066), (17, 0.265), (18, 0.222), (19, -0.202), (20, -0.089), (21, -0.057), (22, 0.092), (23, -0.056), (24, -0.041), (25, 0.058), (26, 0.1), (27, 0.025), (28, -0.029), (29, -0.031), (30, -0.014), (31, 0.027), (32, 0.06), (33, -0.007), (34, -0.087), (35, -0.051), (36, 0.108), (37, 0.049), (38, -0.042), (39, 0.132), (40, -0.044), (41, -0.102), (42, -0.016), (43, 0.083), (44, -0.074), (45, 0.04), (46, -0.026), (47, 0.043), (48, -0.052), (49, -0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96539187 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA may be applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1

2 0.72378927 88 nips-2006-Greedy Layer-Wise Training of Deep Networks

Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle

Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

3 0.67796892 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

Author: Marc'aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann L. Cun

Abstract: We describe a novel unsupervised method for learning sparse, overcomplete features. The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. The model produces “stroke detectors” when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. Inference and learning are very fast, requiring no preprocessing, and no expensive sampling. Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. Finally, an extension of the method is described to learn topographical filter maps. 1

4 0.62253785 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

Author: J.t. Lindgren, Aapo Hyvärinen

Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1

5 0.44591352 194 nips-2006-Towards a general independent subspace analysis

Author: Fabian J. Theis

Abstract: The increasingly popular independent component analysis (ICA) may only be applied to data following the generative ICA model in order to guarantee algorithmindependent and theoretically valid results. Subspace ICA models generalize the assumption of component independence to independence between groups of components. They are attractive candidates for dimensionality reduction methods, however are currently limited by the assumption of equal group sizes or less general semi-parametric models. By introducing the concept of irreducible independent subspaces or components, we present a generalization to a parameter-free mixture model. Moreover, we relieve the condition of at-most-one-Gaussian by including previous results on non-Gaussian component analysis. After introducing this general model, we discuss joint block diagonalization with unknown block sizes, on which we base a simple extension of JADE to algorithmically perform the subspace analysis. Simulations confirm the feasibility of the algorithm. 1 Independent subspace analysis A random vector Y is called an independent component of the random vector X, if there exists an invertible matrix A and a decomposition X = A(Y, Z) such that Y and Z are stochastically independent. The goal of a general independent subspace analysis (ISA) or multidimensional independent component analysis is the decomposition of an arbitrary random vector X into independent components. If X is to be decomposed into one-dimensional components, this coincides with ordinary independent component analysis (ICA). Similarly, if the independent components are required to be of the same dimension k, then this is denoted by multidimensional ICA of fixed group size k or simply k-ISA. So 1-ISA is equivalent to ICA. 1.1 Why extend ICA? An important structural aspect in the search for decompositions is the knowledge of the number of solutions i.e. the indeterminacies of the problem. Without it, the result of any ICA or ISA algorithm cannot be compared with other solutions, so for instance blind source separation (BSS) would be impossible. Clearly, given an ISA solution, invertible transforms in each component (scaling matrices L) as well as permutations of components of the same dimension (permutation matrices P) give again an ISA of X. And indeed, in the special case of ICA, scaling and permutation are already all indeterminacies given that at most one Gaussian is contained in X [6]. This is one of the key theoretical results in ICA, allowing the usage of ICA for solving BSS problems and hence stimulating many applications. It has been shown that also for k-ISA, scalings and permutations as above are the only indeterminacies [11], given some additional rather weak restrictions to the model. However, a serious drawback of k-ISA (and hence of ICA) lies in the fact that the requirement fixed group-size k does not allow us to apply this analysis to an arbitrary random vector. Indeed, crosstalking error 4 3 2 1 0 FastICA JADE Extended Infomax Figure 1: Applying ICA to a random vector X = AS that does not fulfill the ICA model; here S is chosen to consist of a two-dimensional and a one-dimensional irreducible component. Shown are the statistics over 100 runs of the Amari error of the random original and the reconstructed mixing matrix using the three ICA-algorithms FastICA, JADE and Extended Infomax. Clearly, the original mixing matrix could not be reconstructed in any of the experiments. 
However, interestingly, the latter two algorithms do indeed find an ISA up to permutation, which will be explained in section 3. theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. If this is the case, uniqueness up to permutation and scaling holds as noted above; however if k-ISA is applied to any random vector, a decomposition into groups that are only ‘as independent as possible’ cannot be unique and depends on the contrast and the algorithm. In the literature, ICA is often applied to find representations fulfilling the independence condition as well as possible, however care has to be taken; the strong uniqueness result is not valid any more, and the results may depend on the algorithm as illustrated in figure 1. This work aims at finding an ISA model that allows applicability to any random vector. After reviewing previous approaches, we will provide such a model together with a corresponding uniqueness result and a preliminary algorithm. 1.2 Previous approaches to ISA for dependent component analysis Generalizations of the ICA model that are to include dependencies of multiple one-dimensional components have been studied for quite some time. ISA in the terminology of multidimensional ICA has first been introduced by Cardoso [4] using geometrical motivations. His model as well as the related but independently proposed factorization of multivariate function classes [9] is quite general, however no identifiability results were presented, and applicability to an arbitrary random vector was unclear; later, in the special case of equal group sizes (k-ISA) uniqueness results have been extended from the ICA theory [11]. Algorithmic enhancements in this setting have been recently studied by [10]. Moreover, if the observation contain additional structures such as spatial or temporal structures, these may be used for the multidimensional separation [13]. Hyv¨ rinen and Hoyer presented a special case of k-ISA by combining it with invariant feature suba space analysis [7]. They model the dependence within a k-tuple explicitly and are therefore able to propose more efficient algorithms without having to resort to the problematic multidimensional density estimation. A related relaxation of the ICA assumption is given by topographic ICA [8], where dependencies between all components are assumed and modelled along a topographic structure (e.g. a 2-dimensional grid). Bach and Jordan [2] formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. For the latter, they propose to use a tree-structure as employed by their tree dependepent component analysis. Together with inter-cluster independence, this implies a search for a transformation of the mixtures into a forest i.e. a set of disjoint trees. However, the above models are all semi-parametric and hence not fully blind. In the following, no additional structures are necessary for the separation. 1.3 General ISA Definition 1.1. A random vector S is said to be irreducible if it contains no lower-dimensional independent component. An invertible matrix W is called a (general) independent subspace analysis of X if WX = (S1 , . . . , Sk ) with pairwise independent, irreducible random vectors Si . Note that in this case, the Si are independent components of X. 
The idea behind this definition is that in contrast to ICA and k-ISA, we do not fix the size of the groups Si in advance. Of course, some restriction is necessary, otherwise no decomposition would be enforced at all. This restriction is realized by allowing only irreducible components. The advantage of this formulation now is that it can clearly be applied to any random vector, although of course a trivial decomposition might be the result in the case of an irreducible random vector. Obvious indeterminacies of an ISA of X are, as mentioned above, scalings i.e. invertible transformations within each Si and permutation of Si of the same dimension1. These are already all indeterminacies as shown by the following theorem, which extends previous results in the case of ICA [6] and k-ISA [11], where also the additional slight assumptions on square-integrability i.e. on existing covariance have been made. Theorem 1.2. Given a random vector X with existing covariance and no Gaussian independent component, then an ISA of X exists and is unique except for scaling and permutation. Existence holds trivially but uniqueness is not obvious. Due to the limited space, we only give a short sketch of the proof in the following. The uniqueness result can easily be formulated as a subspace extraction problem, and theorem 1.2 follows readily from Lemma 1.3. Let S = (S1 , . . . , Sk ) be a square-integrable decomposition of S into irreducible independent components Si . If X is an irreducible component of S, then X ∼ Si for some i. Here the equivalence relation ∼ denotes equality except for an invertible transformation. The following two lemmata each give a simplification of lemma 1.3 by ordering the components Si according to their dimensions. Some care has to be taken when showing that lemma 1.5 implies lemma 1.4. Lemma 1.4. Let S and X be defined as in lemma 1.3. In addition assume that dim Si = dim X for i ≤ l and dim Si < dim X for i > l. Then X ∼ Si for some i ≤ l. Lemma 1.5. Let S and X be defined as in lemma 1.4, and let l = 1 and k = 2. Then X ∼ S1 . In order to prove lemma 1.5 (and hence the theorem), it is sufficient to show the following lemma: Lemma 1.6. Let S = (S1 , S2 ) with S1 irreducible and m := dim S1 > dim S2 =: n. If X = AS is again irreducible for some m × (m + n)-matrix A, then (i) the left m × m-submatrix of A is invertible, and (ii) if X is an independent component of S, the right m × n-submatrix of A vanishes. (i) follows after some linear algebra, and is necessary to show the more difficult part (ii). For this, we follow the ideas presented in [12] using factorization of the joint characteristic function of S. 1.4 Dealing with Gaussians In the previous section, Gaussians had to be excluded (or at most one was allowed) in order to avoid additional indeterminacies. Indeed, any orthogonal transformation of two decorrelated hence independent Gaussians is again independent, so clearly such a strong identification result would not be possible. Recently, a general decomposition model dealing with Gaussians was proposed in the form of the socalled non-Gaussian subspace analysis (NGSA) [3]. It tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made. More precisely, given a random vector X, a factorization X = AS with an invertible matrix A, S = (SN , SG ) and SN a square-integrable m-dimensional random vector is called an m-decomposition of X if SN and SG are stochastically independent and SG is Gaussian. 
In this case, X is said to be mdecomposable. X is denoted to be minimally n-decomposable if X is not (n − 1)-decomposable. According to our previous notation, SN and SG are independent components of X. It has been shown that the subspaces of such decompositions are unique [12]: 1 Note that scaling here implies a basis change in the component Si , so for example in the case of a twodimensional source component, this might be rotation and sheering. In the example later in figure 3, these indeterminacies can easily be seen by comparing true and estimated sources. Theorem 1.7 (Uniqueness of NGSA). The mixing matrix A of a minimal decomposition is unique except for transformations in each of the two subspaces. Moreover, explicit algorithms can be constructed for identifying the subspaces [3]. This result enables us to generalize theorem 1.2and to get a general decomposition theorem, which characterizes solutions of ISA. Theorem 1.8 (Existence and Uniqueness of ISA). Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part. Proof. Existence is obvious. Uniqueness follows after first applying theorem 1.7 to X and then theorem 1.2 to the non-Gaussian part. 2 Joint block diagonalization with unknown block-sizes Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE) or in BSS relying on second-order temporal decorrelation. The task of (real) joint diagonalization (JD) of a set of symmetric real n×n matrices M := {M1 , . . . , MK } is to find an orthogonal matrix K ˆ ˆ ˆ E such that E⊤ Mk E is diagonal for all k = 1, . . . , K i.e. to minimizef (E) := k=1 E⊤ Mk E − ˆ ⊤ Mk E) 2 with respect to the orthogonal matrix E, where diagM(M) produces a matrix ˆ ˆ diagM(E F where all off-diagonal elements of M have been set to zero, and M 2 := tr(MM⊤ ) denotes the F squared Frobenius norm. The Frobenius norm is invariant under conjugation by an orthogonal maˆ ˆ⊤ ˆ 2 trix, so minimizing f is equivalent to maximizing g(E) := K k=1 diag(E Mk E) , where now diag(M) := (mii )i denotes the diagonal of M. For the actual minimization of f respectively maximization of g, we will use the common approach of Jacobi-like optimization by iterative applications of Givens rotation in two coordinates [5]. 2.1 Generalization to blocks In the following we will use a generalization of JD in order to solve ISA problems. Instead of fully diagonalizing all n × n matrices Mk ∈ M, in joint block diagonalization (JBD) of M we want to determine E such that E⊤ Mk E is block-diagonal. Depending on the application, we fix the blockstructure in advance or try to determine it from M. We are not interested in the order of the blocks, so the block-structure is uniquely specified by fixing a partition of n i.e. a way of writing n as a sum of positive integers, where the order of the addends is not significant. So let2 n = m1 + . . . + mr with m1 ≤ m2 ≤ . . . ≤ mr and set m := (m1 , . . . , mr ) ∈ Nr . An n × n matrix is said to be m-block diagonal if it is of the form   D1 · · · 0  . .  .. .   . . . . 0 · · · Dr with arbitrary mi × mi matrices Di . 
As generalization of JD in the case of known the block structure, we can formulate the joint mK ˆ ˆ⊤ ˆ block diagonalization (m-JBD) problem as the minimization of f m (E) := k=1 E Mk E − m ˆ⊤ ˆ 2 with respect to the orthogonal matrix E, where diagMm (M) produces a ˆ diagM (E Mk E) F m-block diagonal matrix by setting all other elements of M to zero. In practice due to estimation errors, such E will not exist, so we speak of approximate JBD and imply minimizing some errormeasure on non-block-diagonality. Indeterminacies of any m-JBD are m-scaling i.e. multiplication by an m-block diagonal matrix from the right, and m-permutation defined by a permutation matrix that only swaps blocks of the same size. Finally, we speak of general JBD if we search for a JBD but no block structure is given; instead it is to be determined from the matrix set. For this it is necessary to require a block 2 We do not use the convention from Ferrers graphs of specifying partitions in decreasing order, as a visualization of increasing block-sizes seems to be preferable in our setting. structure of maximal length, otherwise trivial solutions or ‘in-between’ solutions could exist (and obviously contain high indeterminacies). Formally, E is said to be a (general) JBD of M if (E, m) = argmaxm | ∃E:f m (E)=0 |m|. In practice due to errors, a true JBD would always result in the trivial decomposition m = (n), so we define an approximate general JBD by requiring f m (E) < ǫ for some fixed constant ǫ > 0 instead of f m (E) = 0. 2.2 JBD by JD A few algorithms to actually perform JBD have been proposed, see [1] and references therein. In the following we will simply perform joint diagonalization and then permute the columns of E to achieve block-diagonality — in experiments this turns out to be an efficient solution to JBD [1]. This idea has been formulated in a conjecture [1] essentially claiming that a minimum of the JD cost function f already is a JBD i.e. a minimum of the function f m up to a permutation matrix. Indeed, in the conjecture it is required to use the Jacobi-update algorithm from [5], but this is not necessary, and we can prove the conjecture partially: We want to show that JD implies JBD up to permutation, i.e. if E is a minimum of f , then there exists a permutation P such that f m (EP) = 0 (given existence of a JBD of M). But of course f (EP) = f (E), so we will show why (certain) JBD solutions are minima of f . However, JD might have additional minima. First note that clearly not any JBD minimizes f , only those such that in each block of size mk , f (E) when restricted to the block is maximal over E ∈ O(mk ). We will call such a JBD block-optimal in the following. Theorem 2.1. Any block-optimal JBD of M (zero of f m ) is a local minimum of f . Proof. Let E ∈ O(n) be block-optimal with f m (E) = 0. We have to show that E is a local minimum of f or equivalently a local maximum of the squared diagonal sum g. After substituting each Mk by E⊤ Mk E, we may already assume that Mk is m-block diagonal, so we have to show that E = I is a local maximum of g. Consider the elementary Givens rotation Gij (ǫ) defined for i < j and ǫ√ (−1, 1) as the orthogonal ∈ matrix, where all diagonal elements are 1 except for the two elements 1 − ǫ2 in rows i and j and with all off-diagonal elements equal to 0 except for the two elements ǫ and −ǫ at (i, j) and (j, i), respectively. It can be used to construct local coordinates of the d := n(n − 1)/2-dimensional manifold O(n) at I, simply by ι(ǫ12 , ǫ13 , . . . , ǫn−1,n ) := i < j. 
If i, j are in the same block of m, then h is locally maximal i.e. negative semi-definite at 0 in the direction ǫij because of block-optimality. Now assume i and j are from different blocks. After possible permutation, we may assume that j = i + 1 so that each matrix Mk ∈ M has (Mk )ij = (Mk )ji = 0, and ak := (Mk )ii , bk := (Mk )jj . Then Gij (ǫ)⊤ Mk Gij (ǫ) can be easily calculated at coordinates (i, i) to (j, j), and indeed entries on the diagonal other than at indices (i, i) and (j, j) are not changed, so diag(Gij (ǫ)⊤ Mk Gij (ǫ)) 2 2 − diag(Mk ) 2 = 2 = −2ak (ak − bk )ǫ + 2bk (ak − bk )ǫ + 2(ak − bk )2 ǫ4 = −2(a2 + b2 )ǫ2 + 2(ak − bk )2 ǫ4 . k k Hence h(0, . . . , 0, ǫij , 0, . . . , 0) − h(0) = −cǫ2 + dǫ4 with c = 2 K (a2 + b2 ) and d = ij ij k k=1 k K 2 k=1 (ak − bk )2 . Now either c = 0, then also d = 0 and h is constant zero in the direction ǫij . Or, more interestingly, c = 0, then c > 0 and therefore h is negative definite in the direction ǫij . Altogether we get a negative definite h at 0 except for ‘trivial directions’, and hence a local maximum at 0. 2.3 Recovering the permutation In order to perform JBD, we therefore only have to find a JD E of M. What is left according to the above theorem is to find a permutation matrix P such that EP block-diagonalizes M. In the case of known block-order m, we can employ similar techniques as used in [1, 10], which essentially find P by some combinatorial optimization. 5 5 5 10 10 10 15 15 15 20 20 20 25 25 25 30 30 30 35 35 40 35 40 40 5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40 ˆ⊤ (a) (unknown) block diagonal M1 (b) E E w/o recovered permutation 5 10 15 20 25 30 35 40 ˆ⊤ (c) E E Figure 2: Performance of the proposed general JBD algorithm in the case of the (unknown) blockpartition 40 = 1 + 2 + 2 + 3 + 3 + 5 + 6 + 6 + 6 + 6 in the presence of noise with SNR of 5dB. The ˆ product E⊤ E of the inverse of the estimated block diagonalizer and the original one is an m-block diagonal matrix except for permutation within groups of the same sizes as claimed in section 2.2. In the case of unknown block-size, we propose to use the following simple permutation-recovery K algorithm: consider the mean diagonalized matrix D := K −1 k=1 E⊤ Mk E. Due to the assumption that M is m-block-diagonalizable (with unknown m), each E⊤ Mk E and hence also D must be m-block-diagonal except for a permutation P, so it must have the corresponding number of zeros in each column and row. In the approximate JBD case, thresholding with a threshold θ is necessary, whose choice is non-trivial. We propose using algorithm 1 to recover the permutation; we denote its resulting permuted matrix by P(D) when applied to the input D. P(D) is constructed from possibly thresholded D by iteratively permuting columns and rows in order to guarantee that all non-zeros of D are clustered along the diagonal as closely as possible. This recovers the permutation as well as the partition m of n. 
Algorithm 1: Block-diagonality permutation finder Input: (n × n)-matrix D Output: block-diagonal matrix P(D) := D′ such that D′ = PDPT for a permutation matrix P D′ ← D for i ← 1 to n do repeat if (j0 ← min{j|j ≥ i and d′ = 0 and d′ = 0}) exists then ij ji if (k0 ← min{k|k > j0 and (d′ = 0 or d′ = 0)}) exists then ik ki swap column j0 of D′ with column k0 swap row j0 of D′ with row k0 until no swap has occurred ; We illustrate the performance of the proposed JBD algorithm as follows: we generate a set of K = 100 m-block-diagonal matrices Dk of dimension 40 × 40 with m = (1, 2, 2, 3, 3, 5, 6, 6, 6, 6). They have been generated in blocks of size m with coefficients chosen randomly uniform from [−1, 1], and symmetrized by Dk ← (Dk + D⊤ )/2. After that, they have been mixed by a random k orthogonal mixing matrix E ∈ O(40), i.e. Mk := EDk E⊤ + N, where N is a noise matrix with independent Gaussian entries such that the resulting signal-to-noise ratio is 5dB. Application of the JBD algorithm from above to {M1 , . . . , MK } with threshold θ = 0.1 correctly recovers the ˆ block sizes, and the estimated block diagonalizer E equals E up to m-scaling and permutation, as illustrated in figure 2. 3 SJADE — a simple algorithm for general ISA As usual by preprocessing of the observations X by whitening we may assume that Cov(X) = I. The indeterminacies allow scaling transformations in the sources, so without loss of generality let 5.5 6 5 5.5 5 1 5 2 4.5 4 3 5 4.5 4 3 4 2 5 4.5 4 3.5 4 3.5 6 3 1 2.5 0 5 3.5 3 7 3 2.5 8 4 3 2 2.5 2 5 4 2 1 1.5 7 8 9 10 11 12 13 14 2 3 4 5 (a) S2 6 7 8 9 1.5 3 3.5 4 (b) S3 4.5 5 5.5 6 6.5 10 1 0 7 (c) S4 5 9 3 2 0 1 14 3 4 5 6 7 8 9 10 ˆ (e) A−1 A −3 1 −3.5 4.5 2 (d) S5 13 250 0 −4 4 −1 −4.5 12 200 −5 −3 −6 −4 −6.5 11 150 −2 −5.5 3.5 −5 0 3 10 100 2.5 −1 −7 9 50 2 5 −2 4 3 −3 −7.5 2 −4 1.5 −6.5 −6 −5.5 −5 −4.5 −4 −3.5 −3 ˆ ˆ (f) (S1 , S2 ) −2.5 −2 0 −4 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 ˆ (g) histogram of S3 0 8 −4.5 −4 −3.5 −3 −2.5 −2 (h) S4 −1.5 −1 −0.5 −8 −7.5 −7 −6.5 −6 −5.5 −5 −4.5 (i) S5 −4 −3.5 −3 −2.5 1 −5 0 (j) S6 Figure 3: Example application of general ISA for unknown sizes m = (1, 2, 2, 2, 3). Shown are the ˆ scatter plots i.e. densities of the source components and the mixing-separating map A−1 A. also Cov(S) = I. Then I = Cov(X) = A Cov(S)A⊤ = AA⊤ so A is orthogonal. Due to the ISA assumptions, the fourth-order cross cumulants of the sources have to be trivial between different groups, and within the Gaussians. In order to find transformations of the mixtures fulfilling this property, we follow the idea of the JADE algorithmbut now in the ISA setting. We perform JBD of the (whitened) contracted quadricovariance matrices defined by Cij (X) := E X⊤ Eij XXX⊤ − Eij − E⊤ − tr(Eij )I. Here RX := Cov(X) and Eij is a set of eigen-matrices of Cij , 1 ≤ i, j ≤ n. ij One simple choice is to use n2 matrices Eij with zeros everywhere except 1 at index (i, j). More elaborate choices of eigen-matrices (with only n(n + 1)/2 or even n entries) are possible. The resulting algorithm, subspace-JADE (SJADE) not only performs NGCA by grouping Gaussians as one-dimensional components with trivial Cii ’s, but also automatically finds the subspace partition m using the general JBD algorithm from section 2.3. 4 Experimental results In a first example, we consider a general ISA problem in dimension n = 10 with the unknown partition m = (1, 2, 2, 2, 3). 
In order to generate 2- and 3-dimensional irreducible random vectors, we decided to follow the nice visual ideas from [10] and to draw samples from a density following a known shape — in our case 2d-letters or 3d-geometrical shapes. The chosen sources densities are shown in figure 3(a-d). Another 1-dimensional source following a uniform distribution was constructed. Altogether 104 samples were used. The sources S were mixed by a mixing matrix A with coefficients uniformly randomly sampled from [−1, 1] to give mixtures X = AS. The recovered ˆ mixing matrix A was then estimated using the above block-JADE algorithm with unknown block size; we observed that the method is quite sensitive to the choice of the threshold (here θ = 0.015). ˆ Figure 3(e) shows the composed mixing-separating system A−1 A; clearly the matrices are equal except for block permutation and scaling, which experimentally confirms theorem 1.8. The algoˆ rithm found a partition m = (1, 1, 1, 2, 2, 3), so one 2d-source was misinterpreted as two 1d-sources, but by using previous knowledge combination of the correct two 1d-sources yields the original 2dˆ ˆ source. The resulting recovered sources S := A−1 X, figures 3(f-j), then equal the original sources except for permutation and scaling within the sources — which in the higher-dimensional cases implies transformations such as rotation of the underlying images or shapes. When applying ICA (1-ISA) to the above mixtures, we cannot expect to recover the original sources as explained in figure 1; however, some algorithms might recover the sources up to permutation. Indeed, SJADE equals JADE with additional permutation recovery because the joint block diagonalization is per- formed using joint diagonalization. This explains why JADE retrieves meaningful components even in this non-ICA setting as observed in [4]. In a second example, we illustrate how the algorithm deals with Gaussian sources i.e. how the subspace JADE also includes NGCA. For this we consider the case n = 5, m = (1, 1, 1, 2) and sources with two Gaussians, one uniform and a 2-dimensional irreducible component as before; 105 samples were drawn. We perform 100 Monte-Carlo simulations with random mixing matrix ˆ A, and apply SJADE with θ = 0.01. The recovered mixing matrix A is compared with A by 3 2 2 2 ˆ −1 A. Indeed, we get nearly taking the ad-hoc measure ι(P) := i=1 j=1 (pij + pji ) for P := A perfect recovery in 99 out of 100 runs, the median of ι(P) is very low with 0.0083. A single run diverges with ι(P ) = 3.48. In order to show that the algorithm really separates the Gaussian part from the other components, we compare the recovered source kurtoses. The median kurtoses are −0.0006 ± 0.02, −0.003 ± 0.3, −1.2 ± 0.3, −1.2 ± 0.2 and −1.6 ± 0.2. The first two components have kurtoses close to zero, so they are the two Gaussians, whereas the third component has kurtosis of around −1.2, which equals the kurtosis of a uniform density. This confirms the applicability of the algorithm in the general, noisy ISA setting. 5 Conclusion Previous approaches for independent subspace analysis were restricted either to fixed group sizes or semi-parametric models. In neither case, general applicability to any kind of mixture data set was guaranteed, so blind source separation might fail. 
5 Conclusion

Previous approaches to independent subspace analysis were restricted either to fixed group sizes or to semi-parametric models. In neither case was general applicability to arbitrary mixture data sets guaranteed, so blind source separation might fail. In the present contribution we introduce the concept of irreducible independent components and give an identifiability result for this general, parameter-free model, together with a novel arbitrary-subspace-size algorithm based on joint block diagonalization. As in ICA, the main uniqueness theorem is an asymptotic result (but it includes the noisy case via NGCA). In practice, however, in the finite-sample case the general joint block diagonality holds only approximately, due to estimation errors. Our simple solution in this contribution was to choose appropriate thresholds. But this choice is non-trivial, and adaptive methods should be developed in future work.

References

[1] K. Abed-Meraim and A. Belouchrani. Algorithms for joint block diagonalization. In Proc. EUSIPCO 2004, pages 209–212, Vienna, Austria, 2004.
[2] F.R. Bach and M.I. Jordan. Finding clusters in independent component analysis. In Proc. ICA 2003, pages 891–896, 2003.
[3] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. Müller. In search of non-Gaussian components of a high-dimensional distribution. JMLR, 7:247–282, 2006.
[4] J.F. Cardoso. Multidimensional independent component analysis. In Proc. of ICASSP '98, Seattle, 1998.
[5] J.F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl., 17(1):161–164, January 1995.
[6] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994.
[7] A. Hyvärinen and P.O. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.
[8] A. Hyvärinen, P.O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1525–1558, 2001.
[9] J.K. Lin. Factorizing multivariate function classes. In Advances in Neural Information Processing Systems, volume 10, pages 563–569, 1998.
[10] B. Póczos and A. Lőrincz. Independent subspace analysis using k-nearest neighborhood distances. In Proc. ICANN 2005, volume 3696 of LNCS, pages 163–168, Warsaw, Poland, 2005. Springer.
[11] F.J. Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 84(5):951–956, 2004.
[12] F.J. Theis and M. Kawanabe. Uniqueness of non-Gaussian subspace analysis. In Proc. ICA 2006, pages 917–925, Charleston, USA, 2006.
[13] R. Vollgraf and K. Obermayer. Multi-dimensional ICA to separate correlated sources. In Proc. NIPS 2001, pages 993–1000, 2001.

6 0.42328426 75 nips-2006-Efficient sparse coding algorithms

7 0.41758934 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

8 0.40427458 16 nips-2006-A Theory of Retinal Population Coding

9 0.34842661 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

10 0.34787303 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis

11 0.33398259 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

12 0.33106536 189 nips-2006-Temporal dynamics of information content carried by neurons in the primary visual cortex

13 0.3086338 129 nips-2006-Map-Reduce for Machine Learning on Multicore

14 0.30085194 46 nips-2006-Blind source separation for over-determined delayed mixtures

15 0.3006677 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

16 0.27788419 182 nips-2006-Statistical Modeling of Images with Fields of Gaussian Scale Mixtures

17 0.26804957 86 nips-2006-Graph-Based Visual Saliency

18 0.25157103 52 nips-2006-Clustering appearance and shape by learning jigsaws

19 0.24853604 179 nips-2006-Sparse Representation for Signal Classification

20 0.24630681 13 nips-2006-A Scalable Machine Learning Approach to Go


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.127), (3, 0.02), (7, 0.066), (9, 0.115), (10, 0.15), (18, 0.014), (20, 0.022), (22, 0.045), (44, 0.073), (57, 0.071), (64, 0.016), (65, 0.047), (69, 0.1), (71, 0.033), (90, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90038174 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1

2 0.82690948 193 nips-2006-Tighter PAC-Bayes Bounds

Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor

Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1

3 0.7873044 158 nips-2006-PG-means: learning the number of clusters in data

Author: Yu Feng, Greg Hamerly

Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1

4 0.77831966 7 nips-2006-A Local Learning Approach for Clustering

Author: Mingrui Wu, Bernhard Schölkopf

Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1

5 0.76512778 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints

Author: Graham Mcneill, Sethu Vijayakumar

Abstract: Correspondence algorithms typically struggle with shapes that display part-based variation. We present a probabilistic approach that matches shapes using independent part transformations, where the parts themselves are learnt during matching. Ideas from semi-supervised learning are used to bias the algorithm towards finding ‘perceptually valid’ part structures. Shapes are represented by unlabeled point sets of arbitrary size and a background component is used to handle occlusion, local dissimilarity and clutter. Thus, unlike many shape matching techniques, our approach can be applied to shapes extracted from real images. Model parameters are estimated using an EM algorithm that alternates between finding a soft correspondence and computing the optimal part transformations using Procrustes analysis.

6 0.75946206 65 nips-2006-Denoising and Dimension Reduction in Feature Space

7 0.75933379 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

8 0.7558623 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

9 0.75148481 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

10 0.74895078 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

11 0.74849254 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

12 0.74835527 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons

13 0.74359131 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

14 0.74059159 75 nips-2006-Efficient sparse coding algorithms

15 0.74043196 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

16 0.73718566 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons

17 0.73618257 80 nips-2006-Fundamental Limitations of Spectral Clustering

18 0.73602277 175 nips-2006-Simplifying Mixture Models through Function Approximation

19 0.73362166 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

20 0.73354924 117 nips-2006-Learning on Graph with Laplacian Regularization