nips nips2006 nips2006-149 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ron Zass, Amnon Shashua
Abstract: We describe a nonnegative variant of the "Sparse PCA" problem. The goal is to create a low-dimensional representation from a collection of points which, on the one hand, maximizes the variance of the projected points and, on the other, uses only parts of the original coordinates, thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness by enforcing a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criterion, giving good results on large real-world datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 Nonnegative Sparse PCA Ron Zass and Amnon Shashua ∗ Abstract We describe a nonnegative variant of the "Sparse PCA" problem. [sent-1, score-0.333]
2 The goal is to create a low-dimensional representation from a collection of points which, on the one hand, maximizes the variance of the projected points and, on the other, uses only parts of the original coordinates, thereby creating a sparse representation. [sent-2, score-0.348]
3 What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. [sent-3, score-0.427]
4 Adding nonnegativity contributes to sparseness by enforcing a partitioning of the original coordinates among the new axes. [sent-4, score-0.459]
5 We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criterion, giving good results on large real-world datasets. [sent-5, score-0.029]
6 1 Introduction Both nonnegative and sparse decompositions of data are desirable in domains where the underlying factors have a physical interpretation: In economics, sparseness increases the efficiency of a portfolio, while nonnegativity both increases its efficiency and reduces its risk [7]. [sent-6, score-1.118]
7 Principal Component Analysis (PCA) is a popular, widespread method of data decomposition with applications throughout science and engineering. [sent-9, score-0.04]
8 The decomposition performed by PCA is a linear combination of the input coordinates, where the coefficients of the combination (the principal vectors) span a low-dimensional subspace that corresponds to the directions of maximal variance in the data. [sent-10, score-0.603]
9 The maximum variance property provides a way to compress the data with minimal information loss. [sent-12, score-0.103]
10 In fact, the principal vectors provide the closest (in least squares sense) linear subspace to the data. [sent-13, score-0.411]
11 Second, the representation of the data in the projected space is uncorrelated, which is a useful property for subsequent statistical analysis. [sent-14, score-0.038]
12 Third, the PCA decomposition can be achieved via an eigenvalue decomposition of the data covariance matrix. [sent-15, score-0.08]
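As a point of reference, the eigendecomposition route to PCA can be sketched in a few lines of NumPy. This is a standard textbook computation rather than code from the paper; the function name and the convention that the data points are zero-mean columns of X are assumptions made for illustration:

```python
import numpy as np

def pca(X, k):
    """Return the k leading principal vectors of a d x n data matrix X
    whose columns are zero-mean data points, via an eigendecomposition
    of the d x d data covariance matrix."""
    d, n = X.shape
    C = (X @ X.T) / n                       # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest
    return eigvecs[:, order]                # d x k orthonormal principal basis
```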
13 Two particular drawbacks of PCA are the lack of sparseness of the principal vectors, i.e. [sent-16, score-0.536]
14 all the data coordinates participate in the linear combination, and the fact that the linear combination may mix both positive and negative weights, which might partly cancel each other. [sent-18, score-0.18]
15 The purpose of our work is to incorporate both nonnegativity and sparseness into PCA while maintaining the maximal variance property of PCA. [sent-19, score-0.637]
16 In other words, the goal is to find a collection of sparse nonnegative principal [sent-20, score-0.846]
17 vectors spanning a low-dimensional space that preserves as much as possible the variance of the data. [sent-21, score-0.146]
18 The references above can be divided into two paradigms: (i) adding L1 norm terms to the PCA formulation, as it is known that L1 approximates L0 much better than L2; (ii) relaxing a hard cardinality (L0 norm) constraint on the principal vectors. [sent-25, score-0.519]
19 In both cases the orthonormality of the principal vector set is severely compromised or even abandoned and it is left unclear to what degree the resulting principal basis explains most of the variance present in the data. [sent-26, score-0.786]
20 While the above methods do not deal with nonnegativity at all, other approaches focus on nonnegativity but are neutral to the variance of the resulting factors, and hence recover parts which are not necessarily informative. [sent-27, score-0.777]
21 A popular example is the Nonnegative Matrix Factorization (NMF) [10] and the sparse versions of it [6, 11, 5, 4] that seek the best reconstruction of the input using nonnegative (sparse) prototypes and weights. [sent-28, score-0.516]
22 An interesting direct byproduct of nonnegativity in PCA is that the coordinates split among the principal vectors. [sent-30, score-0.826]
23 This makes the principal vectors disjoint, where each coordinate is non-zero in at most one vector. [sent-31, score-0.461]
24 We can therefore view the principal vectors as parts. [sent-32, score-0.411]
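As a small illustration (not taken from the paper), assuming U is the d × k matrix of nonnegative principal vectors, the part to which each original coordinate belongs can be read off row by row when U is (nearly) disjoint:

```python
import numpy as np

def coordinate_parts(U, tol=1e-8):
    """Assign each original coordinate (row of U) to the principal
    vector (column) carrying its largest weight; rows that are
    numerically all zero are reported as unassigned (-1).  For a
    perfectly disjoint U each row has at most one non-zero entry,
    so the assignment is unambiguous."""
    U = np.asarray(U, dtype=float)
    parts = U.argmax(axis=1)
    parts[U.max(axis=1) <= tol] = -1
    return parts
```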
25 We then relax the disjoint property, since for most applications some overlap among parts is desired, and allow a limited overlap among the principal vectors. [sent-33, score-0.807]
26 We further introduce a "sparseness" term to the optimization criterion to cover situations where the part (or semi-part) decomposition is not sufficient to guarantee sparsity (such as when the dimension of the input space far exceeds the number of principal vectors). [sent-34, score-0.413]
27 An efficient coordinate descent algorithm for finding a local optimum is derived in Section 4. [sent-36, score-0.05]
28 We will later relax the disjoint property, as it is too restrictive for most applications. [sent-39, score-0.101]
29 Let x_1, . . . , x_n ∈ R^d form a zero-mean collection of data points, arranged as the columns of the matrix X ∈ R^{d×n}, and u_1, . . . [sent-43, score-0.046]
30 , u_k ∈ R^d be the desired principal vectors, arranged as the columns of the matrix U ∈ R^{d×k}. [sent-46, score-0.354]
31 Adding a nonnegativity constraint to PCA gives us the following optimization problem: max_U (1/2) ||U^T X||^2 s.t. U^T U = I, U ≥ 0. [sent-47, score-0.328]
32 Clearly, the combination of U^T U = I and U ≥ 0 entails that U is disjoint, meaning that each row of U contains at most one non-zero element: since all entries are nonnegative, u_i^T u_j = 0 for i ≠ j forces every product of corresponding entries to vanish, so no row can be non-zero in two columns. [sent-50, score-0.053]
33 While having disjoint principal components may be considered a kind of sparseness, it is too restrictive for most problems. [sent-51, score-0.419]
34 For example, a stock may be a part of more than one sector, genes are typically involved in several biological processes [1], a pixel may be shared among several image parts, and so forth. [sent-52, score-0.037]
35 We therefore wish to allow some overlap among the principal vectors. [sent-53, score-0.49]
36 The degree of coordinate overlap can be represented by an orthonormality distance measure which is nonnegative and vanishes iff U is orthonormal. [sent-54, score-0.567]
37 275–277) as a measure for orthonormality and the relaxed version of eqn. [sent-57, score-0.108]
38 3 Nonnegative Sparse PCA (NSPCA) While semi-disjoint principal components can be considered sparse when the number of coordinates is small, they may be too dense when the number of coordinates greatly exceeds the number of principal vectors. [sent-63, score-1.03]
39 In such a case, the average number of non-zero elements per principal vector would be high. [sent-64, score-0.354]
40 We therefore consider minimizing the number of non-zero elements directly, ||U||_{L0} = Σ_{i,j} δ_{u_ij}, where δ_x equals one if x is non-zero and zero otherwise. [sent-65, score-0.024]
41 U ≥ 0 where β ≥ 0 controls the amount of additional sparseness required. [sent-69, score-0.258]
42 The L0 norm can be relaxed by replacing it with an L1 term, and since U is nonnegative we obtain the relaxed sparseness term ||U||_{L1} = 1^T U 1, where 1 is a column vector with all elements equal to one. [sent-70, score-0.683]
43 The relaxed problem becomes max_U (1/2) ||U^T X||^2 − (α/2) ||I − U^T U||^2 − β 1^T U 1 s.t. U ≥ 0 (4). [sent-71, score-0.047]
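A minimal sketch of the relaxed criterion of eqn. 4 as a NumPy function; the function name is ours, and the exact scaling of the α term is an assumption carried over from the reconstruction above (the paper's constant may differ):

```python
import numpy as np

def nspca_objective(U, X, alpha, beta):
    """Relaxed NSPCA criterion to be maximized over U >= 0:
    projected variance, minus an orthonormality (overlap) penalty,
    minus an L1 sparseness term (scaling of alpha assumed)."""
    k = U.shape[1]
    variance = 0.5 * np.linalg.norm(U.T @ X, 'fro') ** 2
    overlap = 0.5 * alpha * np.linalg.norm(np.eye(k) - U.T @ U, 'fro') ** 2
    sparsity = beta * U.sum()               # equals 1^T U 1 when U >= 0
    return variance - overlap - sparsity
```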
44 Setting the derivative with respect to u_rs to zero we obtain a cubic equation, ∂f/∂u_rs = −α u_rs^3 + c_2 u_rs + c_1 = 0 (5). Evaluating eqn. [sent-82, score-0.77]
45 5 and zero, the nonnegative global maximum of f(u_rs) can be found (see Fig. [sent-84, score-0.333]
46 Note that as u_rs approaches ∞ the criterion goes to −∞, and since the function is continuous a nonnegative maximum must exist. [sent-86, score-0.713]
47 In order to find the global nonnegative maximum, the function has to be inspected at all nonnegative extrema (where the derivative is zero) and at u_rs = 0. [sent-88, score-1.043]
48 constrained objective function, as summarized below: Algorithm 1 Nonnegative Sparse PCA (NSPCA) • Start with an initial guess for U. [sent-89, score-0.023]
49 • Iterate over entries (r, s) of U until convergence: – Set the value of u_rs to the global nonnegative maximizer of eqn. [sent-90, score-0.685]
50 4 by evaluating it over all nonnegative roots of eqn. [sent-91, score-0.366]
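The following is a rough, illustrative sketch of Algorithm 1, not the paper's implementation: instead of using the closed-form constants c_1, c_2 of eqn. 5 (which are not spelled out in this excerpt), it recovers the quartic restriction of the objective to a single entry by exact polynomial interpolation, takes its cubic derivative numerically, and maximizes over the nonnegative stationary points and zero. The α scaling of the penalty is the same assumption as above, and the code favors clarity over efficiency:

```python
import numpy as np

def nspca(X, k, alpha, beta, n_sweeps=50, seed=0):
    """Coordinate-descent sketch of Nonnegative Sparse PCA (Algorithm 1).
    Each entry u_rs is set, in turn, to the nonnegative global maximizer
    of the objective restricted to that entry.  Assumes alpha > 0 so the
    restricted objective is bounded above."""
    d, _ = X.shape
    I = np.eye(k)

    def f(U):
        # Relaxed criterion of eqn. 4 (alpha scaling assumed).
        return (0.5 * np.linalg.norm(U.T @ X, 'fro') ** 2
                - 0.5 * alpha * np.linalg.norm(I - U.T @ U, 'fro') ** 2
                - beta * U.sum())

    rng = np.random.default_rng(seed)
    U = rng.random((d, k))                  # initial nonnegative guess
    nodes = np.arange(5.0)                  # 5 nodes determine a quartic exactly

    for _ in range(n_sweeps):
        for r in range(d):
            for s in range(k):
                # The objective restricted to u_rs is a quartic polynomial;
                # interpolate it instead of deriving c_1, c_2 analytically.
                values = []
                for t in nodes:
                    U[r, s] = t
                    values.append(f(U))
                quartic = np.polyfit(nodes, values, 4)
                cubic = np.polyder(quartic)            # derivative, cf. eqn. 5
                candidates = [z.real for z in np.roots(cubic)
                              if abs(z.imag) < 1e-10 and z.real >= 0.0]
                candidates.append(0.0)                 # always inspect u_rs = 0
                U[r, s] = max(candidates,
                              key=lambda t: np.polyval(quartic, t))
    return U
```

For example, U = nspca(X, k=10, alpha=5e8, beta=2e6) mirrors the parameter range reported for the face-parts experiment below, although suitable values of α and β depend on the scale of X.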
51 It is also worthwhile to compare this nonnegative coordinate-descent scheme with the nonnegative coordinate-descent scheme of Lee and Seung [10]. [sent-98, score-0.724]
52 Second, since it is multiplicative, the preservation of nonnegativity relies on the nonnegativity of the input, and therefore it cannot be applied to our problem, while our scheme can also be applied to NMF. [sent-101, score-0.685]
53 In other words, a practical aspect of the NSPCA algorithm is that it can handle general (not necessarily non-negative) input matrices — such as zero-mean covariance matrices. [sent-102, score-0.022]
54 5 Experiments We start by demonstrating the role of the α and β parameters in the task of extracting face parts. [sent-103, score-0.111]
55 We use the MIT CBCL Face Dataset #1 of 2429 aligned face images, 19 by 19 pixels each, a dataset that was extensively used to demonstrate the ability of Nonnegative Matrix Factorization (NMF) [10] methods. [sent-104, score-0.206]
56 We start with α = 2 × 10^7 and β = 0 to extract the 10 principal vectors in Fig. [sent-105, score-0.411]
57 2(a), and then increase α to 5 × 10^8 to get the principal vectors in Fig. [sent-106, score-0.436]
58 Note that as α increases the overlap among the principal vectors decreases and the holistic nature of some of the vectors in Fig. [sent-108, score-0.723]
59 The vectors also become sparser, but this is only a byproduct of their nonoverlapping nature. [sent-110, score-0.118]
60 3(a) shows the amount of overlap, ||I − U^T U||, as a function of α, showing a consistent drop in the overlap as α increases. [sent-112, score-0.363]
61 2(a), but set the value of β to 2 × 10^6 to get the factors in Fig. [sent-114, score-0.042]
62 The vectors become sparser as β increases, but this time the sparseness emerges from dropping the less informative pixels within the original vectors of Fig. [sent-116, score-0.536]
63 2(a), rather than a replacement of the holistic principal vectors with ones that are part-based in nature. [sent-117, score-0.458]
64 The number of non-zero elements in the principal vectors, ||U||_{L0}, is plotted as a function of β in Fig. [sent-118, score-0.406]
65 3(b), showing the increase in sparseness as β increases. [sent-119, score-0.247]
66 (a) (b) (c) (d) (e) (f) Figure 2: The role of α and β is demonstrated in the task of extracting ten image features using the MIT-CBCL Face Dataset #1. [sent-120, score-0.054]
67 In (b) we increase α to 5 × 10^8 while β stays zero, to get more localized parts that have a lower amount of overlap. [sent-122, score-0.154]
68 In (c) we reset α to 2 × 10^7 as in (a), but increase β to 2 × 10^6. [sent-123, score-0.025]
69 As we increase β, pixels that explain less variance are dropped from the factors, but the overlapping nature of the factors remains. [sent-124, score-0.215]
70 In (d) we show the ten leading principal components of PCA, in (e) the ten factors of NMF, and in (f) the leading principal vectors of GSPCA when allowing 55 active pixels per principal vector. [sent-127, score-1.218]
71 Next we study how the different dimensional reduction methods aid the generalization ability of SVM in the task of face detection. [sent-128, score-0.174]
72 To measure the generalization ability we use the Receiver Operating Characteristics (ROC) curve, a two dimensional graph measuring the classification ability of an algorithm over a dataset, showing the amount of true-positives as a function of the amount of false-positives. [sent-129, score-0.253]
73 The larger the area under this curve, the better the generalization. [sent-130, score-0.078]
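Since the ROC curve and the area under it are the yardstick used here, a simplified NumPy sketch of how both can be computed from classifier scores may be helpful; the function name and the (naive) handling of tied scores are our own simplifications, not taken from the paper:

```python
import numpy as np

def roc_curve_and_auc(scores, labels):
    """Compute ROC points (false-positive rate, true-positive rate) by
    sweeping a threshold over the classifier scores, and the area under
    the curve via the trapezoidal rule.  `labels` are 1 for positives
    (faces) and 0 for negatives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)              # sort by descending score
    labels = labels[order]
    tp = np.cumsum(labels)                   # true positives at each cut-off
    fp = np.cumsum(1 - labels)               # false positives at each cut-off
    tpr = np.concatenate(([0.0], tp / max(tp[-1], 1)))
    fpr = np.concatenate(([0.0], fp / max(fp[-1], 1)))
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc
```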
74 Again, we use the MIT CBCL Face Dataset #1, where 1000 face images and 2000 non-face images were used as a training set, and the rest of the dataset used as a test set. [sent-131, score-0.189]
75 The dimensional reduction was performed over the 1000 face images of the training set. [sent-132, score-0.134]
76 We run linear SVM on the ten features extracted by NSPCA when using different values of α and β, showing in Fig. [sent-133, score-0.108]
77 4(a) that as the principal factors become less overlapping (higher α) and sparser (higher β), the ROC curve is higher, meaning that SVM is able to generalize better. [sent-134, score-0.512]
78 Next, we compare the ROC curve produced by linear SVM when using the NSPCA extracted features (with α = 5 × 10^8 and β = 2 × 10^6) to the ones produced when using PCA and NMF (the principal vectors are displayed in Fig. [sent-135, score-0.484]
79 As a representative of the Sparse PCA methods we use the recent Greedy Sparse PCA (GSPCA) of [12] that shows comparable or better results to all other Sparse PCA methods (see the principal vectors in Fig. [sent-138, score-0.411]
80 4(b) shows that better generalization is achieved when using the NSPCA extracted features, and hence a more reliable face detection. [sent-141, score-0.163]
81 Since NSPCA is limited to nonnegative entries of the principal vectors, it can inherently explain less variance than Sparse PCA algorithms which are not constrained in that way, similarly to the fact that Sparse PCA algorithms can explain less variance than PCA. [sent-142, score-0.904]
82 While this limitation holds, NSPCA still manages to explain a large amount of the variance. [sent-143, score-0.122]
83 5, where we compare the amount of cumulative explained variance and cumulative cardinality of different Sparse PCA algorithms over the Pit Props dataset, a classic dataset used throughout the Sparse PCA literature. [sent-145, score-0.382]
84 In domains where nonnegativity is intrinsic to the problem, however, using NSPCA extracted features improves the generalization ability of learning algorithms, as we have demonstrated above for the face detection problem. [sent-146, score-0.515]
85 6 Summary Our method differs substantially from previous approaches to sparse PCA — a difference that begins with the definition of the problem itself. [sent-147, score-0.161]
86 In addition, most sparse PCA methods focus on the task of finding a single principal vector. [sent-153, score-0.491]
87 Our method, on the other hand, splits the coordinates among the different principal vectors, and therefore its input is the number of principal vectors, or parts, rather than the size of each part. [sent-154, score-0.813]
88 As a consequence, the natural way to use our algorithm is to search for all principal vectors together. [sent-155, score-0.411]
89 In that sense, it bears resemblance to the Nonnegative Matrix Factorization problem, from which our method departs significantly in that it focuses on informative parts, as it maximizes the variance. [sent-156, score-0.043]
90 Furthermore, the non-negativity of the output does not rely on having non-negative input matrices, thereby permitting zero-mean covariance matrices to be fed into the process, just as is done with PCA. [sent-157, score-0.022]
91 Sparse factorizations of gene expression guided by binding data. [sent-159, score-0.032]
92 A direct formulation for sparse PCA using semidefinite programming. [sent-165, score-0.161]
93 Learning non-negative sparse image codes by convex o programming. [sent-177, score-0.161]
94 A modified principal component technique based on the LASSO. [sent-202, score-0.352]
95 Learning the parts of objects by non-negative matrix factorization. [sent-209, score-0.056]
96 Spectral bounds for sparse PCA: Exact and greedy algorithms. [sent-219, score-0.161]
wordName wordTfidf (topN-words)
[('nspca', 0.352), ('urs', 0.352), ('nonnegative', 0.333), ('principal', 0.33), ('nonnegativity', 0.328), ('pca', 0.323), ('sparseness', 0.206), ('sparse', 0.161), ('overlap', 0.123), ('nmf', 0.112), ('coordinates', 0.094), ('gspca', 0.094), ('face', 0.09), ('vectors', 0.081), ('cardinality', 0.08), ('sparser', 0.074), ('dspca', 0.07), ('scotlass', 0.07), ('spca', 0.07), ('disjoint', 0.067), ('variance', 0.065), ('cbcl', 0.061), ('orthonormality', 0.061), ('parts', 0.056), ('cumulative', 0.054), ('dataset', 0.053), ('amount', 0.052), ('positives', 0.051), ('coordinate', 0.05), ('relaxed', 0.047), ('espca', 0.047), ('holistic', 0.047), ('props', 0.047), ('relaxing', 0.047), ('explain', 0.044), ('factors', 0.042), ('roc', 0.042), ('rs', 0.041), ('showing', 0.041), ('ian', 0.041), ('economics', 0.041), ('pcs', 0.041), ('pit', 0.041), ('uij', 0.041), ('decomposition', 0.04), ('generalization', 0.039), ('pixels', 0.039), ('curve', 0.039), ('factorization', 0.038), ('property', 0.038), ('among', 0.037), ('patrik', 0.037), ('byproduct', 0.037), ('adding', 0.036), ('cancel', 0.035), ('relax', 0.034), ('rd', 0.034), ('extracted', 0.034), ('ten', 0.033), ('roots', 0.033), ('gene', 0.032), ('orthogonality', 0.031), ('emerges', 0.031), ('const', 0.031), ('svm', 0.031), ('higher', 0.029), ('scheme', 0.029), ('multiplicative', 0.029), ('criteria', 0.028), ('jerusalem', 0.028), ('meaning', 0.027), ('norm', 0.026), ('limitation', 0.026), ('combination', 0.026), ('increase', 0.025), ('partly', 0.025), ('derivative', 0.025), ('increases', 0.024), ('classic', 0.024), ('ability', 0.024), ('elements', 0.024), ('arranged', 0.024), ('drop', 0.024), ('images', 0.023), ('maximizes', 0.023), ('constrained', 0.023), ('input', 0.022), ('component', 0.022), ('collection', 0.022), ('exceeds', 0.021), ('localized', 0.021), ('dimensional', 0.021), ('extracting', 0.021), ('tradeoff', 0.02), ('dordrecht', 0.02), ('departs', 0.02), ('schn', 0.02), ('zou', 0.02), ('amnon', 0.02), ('zass', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 149 nips-2006-Nonnegative Sparse PCA
Author: Ron Zass, Amnon Shashua
Abstract: We describe a nonnegative variant of the "Sparse PCA" problem. The goal is to create a low-dimensional representation from a collection of points which, on the one hand, maximizes the variance of the projected points and, on the other, uses only parts of the original coordinates, thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness by enforcing a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criterion, giving good results on large real-world datasets.
2 0.14999329 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
3 0.094334051 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
Author: J.t. Lindgren, Aapo Hyvärinen
Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1
4 0.086194195 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
Author: Robert Jenssen, Torbjørn Eltoft, Mark Girolami, Deniz Erdogmus
Abstract: We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi’s entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, by replacing kernel PCA by kernel MaxEnt as an intermediate step. This has a major impact on performance.
5 0.074049704 96 nips-2006-In-Network PCA and Anomaly Detection
Author: Ling Huang, Xuanlong Nguyen, Minos Garofalakis, Michael I. Jordan, Anthony Joseph, Nina Taft
Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.
6 0.072669387 179 nips-2006-Sparse Representation for Signal Classification
7 0.07123933 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
8 0.069343038 79 nips-2006-Fast Iterative Kernel PCA
9 0.061183207 186 nips-2006-Support Vector Machines on a Budget
10 0.058305964 46 nips-2006-Blind source separation for over-determined delayed mixtures
11 0.057198677 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
12 0.057183273 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
13 0.053117596 75 nips-2006-Efficient sparse coding algorithms
14 0.052207708 105 nips-2006-Large Margin Component Analysis
15 0.051837277 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
16 0.046053931 153 nips-2006-Online Clustering of Moving Hyperplanes
17 0.044600796 120 nips-2006-Learning to Traverse Image Manifolds
18 0.043886635 167 nips-2006-Recursive ICA
19 0.043356396 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
20 0.043319896 66 nips-2006-Detecting Humans via Their Pose
topicId topicWeight
[(0, -0.151), (1, 0.021), (2, 0.037), (3, 0.066), (4, -0.033), (5, -0.034), (6, -0.044), (7, -0.043), (8, -0.014), (9, 0.095), (10, -0.074), (11, 0.028), (12, 0.044), (13, -0.193), (14, 0.061), (15, -0.059), (16, -0.031), (17, 0.038), (18, 0.048), (19, 0.038), (20, 0.078), (21, -0.018), (22, 0.058), (23, -0.021), (24, 0.021), (25, -0.038), (26, -0.116), (27, 0.049), (28, 0.036), (29, -0.077), (30, -0.117), (31, -0.006), (32, -0.027), (33, -0.015), (34, -0.016), (35, 0.042), (36, -0.087), (37, -0.01), (38, -0.018), (39, 0.012), (40, 0.104), (41, 0.101), (42, -0.135), (43, -0.092), (44, -0.049), (45, -0.144), (46, 0.022), (47, 0.098), (48, -0.07), (49, -0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.96781814 149 nips-2006-Nonnegative Sparse PCA
Author: Ron Zass, Amnon Shashua
Abstract: We describe a nonnegative variant of the "Sparse PCA" problem. The goal is to create a low-dimensional representation from a collection of points which, on the one hand, maximizes the variance of the projected points and, on the other, uses only parts of the original coordinates, thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness by enforcing a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criterion, giving good results on large real-world datasets.
2 0.65156913 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
Author: J.t. Lindgren, Aapo Hyvärinen
Abstract: In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images. 1
3 0.59655255 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
4 0.5051102 79 nips-2006-Fast Iterative Kernel PCA
Author: Nicol N. Schraudolph, Simon Günter, S.v.n. Vishwanathan
Abstract: We introduce two methods to improve convergence of the Kernel Hebbian Algorithm (KHA) for iterative kernel PCA. KHA has a scalar gain parameter which is either held constant or decreased as 1/t, leading to slow convergence. Our KHA/et algorithm accelerates KHA by incorporating the reciprocal of the current estimated eigenvalues as a gain vector. We then derive and apply Stochastic MetaDescent (SMD) to KHA/et; this further speeds convergence by performing gain adaptation in RKHS. Experimental results for kernel PCA and spectral clustering of USPS digits as well as motion capture and image de-noising problems confirm that our methods converge substantially faster than conventional KHA. 1
5 0.48195317 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
Author: Robert Jenssen, Torbjørn Eltoft, Mark Girolami, Deniz Erdogmus
Abstract: We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi’s entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, by replacing kernel PCA by kernel MaxEnt as an intermediate step. This has a major impact on performance.
6 0.45846894 96 nips-2006-In-Network PCA and Anomaly Detection
7 0.44882146 129 nips-2006-Map-Reduce for Machine Learning on Multicore
8 0.42130071 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
9 0.41250175 179 nips-2006-Sparse Representation for Signal Classification
10 0.39780188 194 nips-2006-Towards a general independent subspace analysis
11 0.38859919 105 nips-2006-Large Margin Component Analysis
12 0.38273549 75 nips-2006-Efficient sparse coding algorithms
13 0.38253599 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
14 0.3612648 138 nips-2006-Multi-Task Feature Learning
15 0.33660713 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
16 0.33610266 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
17 0.33404106 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
18 0.33023709 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
19 0.3297326 153 nips-2006-Online Clustering of Moving Hyperplanes
20 0.32705161 186 nips-2006-Support Vector Machines on a Budget
topicId topicWeight
[(1, 0.089), (3, 0.014), (7, 0.04), (9, 0.533), (20, 0.011), (22, 0.042), (44, 0.039), (57, 0.075), (65, 0.042), (69, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.95068413 149 nips-2006-Nonnegative Sparse PCA
Author: Ron Zass, Amnon Shashua
Abstract: We describe a nonnegative variant of the "Sparse PCA" problem. The goal is to create a low-dimensional representation from a collection of points which, on the one hand, maximizes the variance of the projected points and, on the other, uses only parts of the original coordinates, thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness by enforcing a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criterion, giving good results on large real-world datasets.
2 0.93525314 197 nips-2006-Uncertainty, phase and oscillatory hippocampal recall
Author: Máté Lengyel, Peter Dayan
Abstract: Many neural areas, notably, the hippocampus, show structured, dynamical, population behavior such as coordinated oscillations. It has long been observed that such oscillations provide a substrate for representing analog information in the firing phases of neurons relative to the underlying population rhythm. However, it has become increasingly clear that it is essential for neural populations to represent uncertainty about the information they capture, and the substantial recent work on neural codes for uncertainty has omitted any analysis of oscillatory systems. Here, we observe that, since neurons in an oscillatory network need not only fire once in each cycle (or even at all), uncertainty about the analog quantities each neuron represents by its firing phase might naturally be reported through the degree of concentration of the spikes that it fires. We apply this theory to memory in a model of oscillatory associative recall in hippocampal area CA3. Although it is not well treated in the literature, representing and manipulating uncertainty is fundamental to competent memory; our theory enables us to view CA3 as an effective uncertainty-aware, retrieval system. 1
3 0.93448848 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
4 0.83469975 7 nips-2006-A Local Learning Approach for Clustering
Author: Mingrui Wu, Bernhard Schölkopf
Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1
5 0.58637607 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
6 0.58175468 158 nips-2006-PG-means: learning the number of clusters in data
7 0.55827171 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
8 0.53230006 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
9 0.53173411 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
10 0.53171486 80 nips-2006-Fundamental Limitations of Spectral Clustering
11 0.51616383 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
12 0.49717984 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
13 0.49505356 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
14 0.49220082 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
15 0.48465931 148 nips-2006-Nonlinear physically-based models for decoding motor-cortical population activity
16 0.48253393 65 nips-2006-Denoising and Dimension Reduction in Feature Space
17 0.47440335 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
18 0.47257599 179 nips-2006-Sparse Representation for Signal Classification
19 0.45194033 181 nips-2006-Stability of $K$-Means Clustering
20 0.45042875 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds