nips nips2011 nips2011-142 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of a small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models.
Reference: text
sentIndex sentText sentNum sentScore
1 Sparse PCA provides a linear combination of a small number of features that maximizes variance across data. [sent-5, score-0.033]
2 In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. [sent-7, score-0.204]
3 This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. [sent-8, score-0.348]
4 We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. [sent-9, score-0.597]
5 We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. [sent-10, score-0.077]
6 These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. [sent-11, score-0.077]
7 1 Introduction The sparse Principal Component Analysis (Sparse PCA) problem is a variant of the classical PCA problem; it trades off the explained variance along a normalized vector against the number of non-zero components of that vector. [sent-12, score-0.237]
8 Various researchers have proposed different formulations and algorithms for this problem, ranging from ad-hoc methods such as factor rotation techniques [3] and simple thresholding [4], to greedy algorithms [5, 6]. [sent-14, score-0.038]
9 The ℓ1-norm based semidefinite relaxation DSPCA, as introduced in [1], does guarantee global convergence and, as such, is an attractive alternative to local methods. [sent-17, score-0.047]
10 However, the first-order algorithm for solving DSPCA, as developed in [1], has a computational complexity of O(n^4 √(log n)), with n the number of features, which is too high for many large-scale data sets. [sent-19, score-0.192]
11 At first glance, this complexity estimate indicates that solving sparse PCA is much more expensive than PCA, since we can compute one principal component with a complexity of O(n^2). [sent-20, score-0.573]
12 In this paper we show that solving DSPCA is in fact computationally easier than PCA, and hence can be applied to very large-scale data sets. [sent-21, score-0.096]
13 Based on that formulation, we describe a safe feature elimination method that often leads to an important reduction in problem size, prior to solving the problem. [sent-23, score-0.561]
14 Then we develop a block coordinate ascent algorithm with a computational complexity of O(n^3) to solve DSPCA, which is much faster than the first-order algorithm proposed in [1]. [sent-24, score-0.637]
15 Finally, we observe that real data sets typically allow for a dramatic reduction in problem size as afforded by our safe feature elimination result. [sent-25, score-0.553]
16 Now the comparison between sparse PCA and PCA becomes O(n̂^3) vs. [sent-26, score-0.159]
17 O(n^2), with n̂ ≪ n, which can make sparse PCA surprisingly easier than PCA. [sent-28, score-0.204]
18 In Section 2, we review the ℓ1-norm based DSPCA formulation, relate it to an approximation of the ℓ0-norm based formulation, and highlight the safe feature elimination mechanism as a powerful pre-processing technique. [sent-29, score-0.51]
19 We use Section 3 to present our fast block coordinate ascent algorithm. [sent-30, score-0.558]
20 R(Y ) denotes the range of matrix Y , and Y † its pseudo-inverse. [sent-33, score-0.038]
21 The notation log refers to the extended-value function, with log x = −∞ if x ≤ 0. [sent-34, score-0.138]
22 Given an n × n positive-semidefinite matrix Σ, the “sparse PCA” problem introduced in [1] is: φ = max_Z { Tr ΣZ − λ‖Z‖_1 : Z ⪰ 0, Tr Z = 1 }, (1) where λ ≥ 0 is a parameter encouraging sparsity. [sent-36, score-0.105]
23 Without loss of generality we may assume that Σ ⪰ 0. [sent-37, score-0.042]
24 Problem (1) is in fact a relaxation of a PCA problem with a penalty on the cardinality of the variable: ψ = max_x { x^T Σ x − λ‖x‖_0 : ‖x‖_2 = 1 }, (2) where ‖x‖_0 denotes the cardinality (number of non-zero elements) of x. [sent-38, score-0.44]
25 This can be seen by first writing problem (2) as: ψ = max_Z { Tr ΣZ − λ√(‖Z‖_0) : Z ⪰ 0, Tr Z = 1, Rank(Z) = 1 }, where ‖Z‖_0 is the cardinality (number of non-zero elements) of Z. [sent-39, score-0.208]
26 Since ‖Z‖_1 ≤ √(‖Z‖_0) ‖Z‖_F = √(‖Z‖_0) when Tr Z = 1 and Rank(Z) = 1, we obtain the relaxation max_Z { Tr ΣZ − λ‖Z‖_1 : Z ⪰ 0, Tr Z = 1, Rank(Z) = 1 }; further dropping the rank constraint leads to problem (1). [sent-40, score-0.167]
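For readers who want to check the equivalence, the identities used above for a rank-one feasible point are collected below; this short verification is our addition and follows directly from writing Z = xx^T with a unit-norm x.

\[
Z = xx^T,\ \|x\|_2 = 1 \;\Longrightarrow\;
\operatorname{Tr}\Sigma Z = x^T\Sigma x,\qquad
\|Z\|_1 = \|x\|_1^2,\qquad
\|Z\|_F = \|x\|_2^2 = 1,\qquad
\|Z\|_0 = \|x\|_0^2,
\]
\[
\text{so}\quad \lambda\sqrt{\|Z\|_0} = \lambda\|x\|_0
\quad\text{and, by Cauchy--Schwarz,}\quad
\|Z\|_1 \le \sqrt{\|Z\|_0}\,\|Z\|_F .
\]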
27 By viewing problem (1) as a convex approximation to the non-convex problem (2), we can leverage the safe feature elimination theorem first presented in [6, 12] for problem (2): Theorem 2.1 Let Σ = A^T A, where A = (a_1, . . . , a_n) ∈ R^{m×n}. [sent-41, score-0.51]
28 Then ψ = max_{‖ξ‖_2 = 1} Σ_{i=1}^{n} ((a_i^T ξ)^2 − λ)_+ , and an optimal non-zero pattern corresponds to the indices i with λ < (a_i^T ξ)^2 at the optimum. [sent-47, score-0.067]
29 We observe that the i-th feature is absent at the optimum if (a_i^T ξ)^2 ≤ λ for every unit-norm ξ; hence, whenever Σ_ii = a_i^T a_i ≤ λ, we can safely remove feature i ∈ {1, . . . , n}. [sent-48, score-0.216]
30 First, if we are interested in solving problem (1) as a relaxation of problem (2), we calculate and rank all the feature variances, which takes O(nm) and O(n log n) time, respectively. [sent-53, score-0.207]
31 Then we can safely eliminate any feature with variance less than λ. [sent-54, score-0.155]
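For concreteness, a minimal NumPy sketch of this screening step is given below. The function name and the convention that the data matrix A stores one feature per column (so that Σ_ii = a_i^T a_i) are our assumptions for illustration, not part of the paper.

```python
import numpy as np

def safe_feature_elimination(A, lam):
    """Return the indices of features that survive the screening test.

    A   : (m, n) data matrix, one feature per column (assumed here).
    lam : sparsity penalty lambda in problem (1).

    A feature i can be safely removed whenever Sigma_ii = a_i^T a_i <= lam,
    since then (a_i^T xi)^2 <= Sigma_ii <= lam for every unit-norm xi.
    """
    variances = np.einsum("ij,ij->j", A, A)   # Sigma_ii for each column, O(nm)
    keep = np.flatnonzero(variances > lam)    # indices that cannot be eliminated
    return keep

# Example usage: reduce the covariance matrix before running sparse PCA.
# A = np.load("data.npy")                     # hypothetical data file
# keep = safe_feature_elimination(A, lam=0.5)
# Sigma_reduced = A[:, keep].T @ A[:, keep]
```

Only the surviving columns need to enter the covariance matrix that is passed to the sparse PCA solver.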
32 However, when looking for extremely sparse solutions, applying this safe feature elimination test with a large λ can dramatically reduce problem size and lead to huge computational savings, as will be demonstrated empirically in Section 4. [sent-56, score-0.669]
33 Third, in practice, when PCA is performed on large data sets, a similar variance-based criterion is routinely employed to bring problem sizes down to a manageable level. [sent-57, score-0.033]
34 This purely heuristic practice has a rigorous interpretation in the context of sparse PCA, as the above theorem states explicitly which features can be safely discarded. [sent-58, score-0.26]
35 3 Block Coordinate Ascent Algorithm The first-order algorithm developed in [1] to solve problem (1) has a computational complexity of O(n^4 √(log n)). [sent-59, score-0.181]
36 In this section, we develop a block coordinate ascent algorithm with better dependence on problem size (O(n^3)) that in practice converges much faster. [sent-61, score-0.591]
37 This algorithm appeared in the specific context of sparse covariance estimation in [13], and was extended to a large class of SDPs in [14]. [sent-64, score-0.23]
38 Precisely, it applies to problems of the form min_X { f(X) − β log det X : L ≤ X ≤ U, X ⪰ 0 }, (4) where X = X^T is an n × n matrix variable, L, U impose component-wise bounds on X, f is convex, and β > 0. [sent-65, score-0.215]
39 However, if we try to update the row/columns of Z in problem (1), the trace constraint will imply that we never modify the diagonal elements of Z. [sent-66, score-0.039]
40 Indeed at each step, we update only one diagonal element, and it is entirely fixed given all the other diagonal elements. [sent-67, score-0.078]
41 The authors in [14] propose an augmented Lagrangian method to deal with such constraints, with a complication due to the choice of appropriate penalty parameters. [sent-69, score-0.081]
42 In our case, we can apply a technique resembling the augmented Lagrangian technique, without this added complication. [sent-70, score-0.071]
43 This is due to the homogeneous nature of the objective function and of the conic constraint. [sent-71, score-0.051]
44 By the safe feature elimination result (Theorem 2.1), we can always assume without loss of generality that λ < σ_min := min_{1≤i≤n} Σ_ii. [sent-74, score-0.042]
45 We can express problem (1) as (1/2)φ^2 = max_X { Tr ΣX − λ‖X‖_1 − (1/2)(Tr X)^2 : X ⪰ 0 }. (5) [sent-76, score-0.067]
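The constant (1/2)φ^2 can be verified by restricting the homogenized problem to scaled feasible points of (1); the short calculation below is our own addition for clarity.

\[
X = tZ,\quad \operatorname{Tr} Z = 1,\ Z \succeq 0,\ t \ge 0
\;\Longrightarrow\;
\operatorname{Tr}\Sigma X - \lambda\|X\|_1 - \tfrac12(\operatorname{Tr}X)^2
= t\bigl(\operatorname{Tr}\Sigma Z - \lambda\|Z\|_1\bigr) - \tfrac12 t^2 .
\]

Maximizing first over Z gives tφ − t^2/2, and maximizing over t ≥ 0 then yields t = φ and the value φ^2/2, where φ > 0 holds because λ < σ_min.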
46 That is, we address the problem max_X { Tr ΣX − λ‖X‖_1 − (1/2)(Tr X)^2 + β log det X : X ⪰ 0 }, (6) where β > 0 is a penalty parameter. [sent-82, score-0.219]
47 Without loss of generality, we consider the problem of updating the last row/column of the matrix variable X. [sent-85, score-0.038]
48 Partition the latter and the covariance matrix Σ as X = [Y, y; y^T, x], Σ = [S, s; s^T, σ], where Y, S ∈ R^{(n−1)×(n−1)}, y, s ∈ R^{n−1}, and x, σ ∈ R. [sent-86, score-0.109]
49 The conic constraint X ⪰ 0 translates as y^T Y† y ≤ x, y ∈ R(Y), where R(Y) is the range of the matrix Y. [sent-89, score-0.089]
50 We obtain the sub-problem ψ := max_{x,y} { 2(y^T s − λ‖y‖_1) + (σ − λ)x − (1/2)(t + x)^2 + β log(x − y^T Y† y) : y ∈ R(Y) }, (7) with t := Tr Y. [sent-90, score-0.067]
51 In addition, defining g(z) := max_{y∈R(Y)} { y^T s − λ‖y‖_1 − (βz/2) y^T Y† y }, we have g(z) = max_{y∈R(Y)} { y^T s + min_{‖v‖_∞ ≤ λ} y^T v − (βz/2) y^T Y† y } = min_{‖v‖_∞ ≤ λ} max_{y∈R(Y)} { y^T (s + v) − (βz/2) y^T Y† y } = min_{‖u−s‖_∞ ≤ λ} max_{y∈R(Y)} { y^T u − (βz/2) y^T Y† y } = min_{‖u−s‖_∞ ≤ λ} (1/(2βz)) u^T Y u. [sent-94, score-0.544]
52 Putting all this together, with ψ̃ := ψ + β + (1/2)t^2 and c := σ − λ − t, we obtain the dual of problem (7): ψ̃ = min_{u,z} { (1/(βz)) u^T Y u − β log z + (1/2)(c + βz)^2 : z > 0, ‖u − s‖_∞ ≤ λ }, where the optimal primal variables are recovered via x = c + βz (8) and y = (1/(βz)) Y u. (9) [sent-96, score-0.138]
53 Since β is small, we can avoid large numbers in the above with the change of variable τ = βz: ψ̃ − β log β = min_{u,τ} { (1/τ) u^T Y u − β log τ + (1/2)(c + τ)^2 : τ > 0, ‖u − s‖_∞ ≤ λ }. [sent-97, score-0.207]
54 First, we solve the box-constrained QP R^2 := min_u { u^T Y u : ‖u − s‖_∞ ≤ λ }, (11) using a simple coordinate descent algorithm to exploit sparsity of Y. [sent-100, score-0.306]
55 Without loss of generality, we consider the problem of updating the first coordinate of u. [sent-101, score-0.197]
56 We obtain the subproblem min_η { Y_11 η^2 + 2(ŷ^T û) η : |η − s_1| ≤ λ }, where ŷ denotes the first column of Y with the diagonal element Y_11 removed and û denotes u with its first coordinate removed; we can solve for η analytically using the formula given below. [sent-103, score-0.156]
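A compact sketch of this coordinate descent loop is given below. It is our own illustration of the step just described, not the authors' exact implementation; the analytic update shown is the standard clipped minimizer of the one-dimensional quadratic, and the stopping rule and the handling of a zero diagonal entry are placeholder choices.

```python
import numpy as np

def box_qp_coordinate_descent(Y, s, lam, sweeps=50, tol=1e-8):
    """Coordinate descent sketch for  R^2 = min_u u^T Y u  s.t.  |u_j - s_j| <= lam.

    Y  : (n-1, n-1) PSD block of the current iterate.
    s  : (n-1,) off-diagonal block of Sigma.
    lam: the l1 penalty lambda.
    """
    n = len(s)
    u = s.copy()
    lo, hi = s - lam, s + lam
    for _ in range(sweeps):
        max_change = 0.0
        for j in range(n):
            rest = Y[j] @ u - Y[j, j] * u[j]            # sum_{k != j} Y_jk u_k
            if Y[j, j] > 1e-12:
                new = np.clip(-rest / Y[j, j], lo[j], hi[j])
            else:                                        # degenerate coordinate: linear term only
                new = lo[j] if rest > 0 else hi[j]
            max_change = max(max_change, abs(new - u[j]))
            u[j] = new
        if max_change < tol:
            break
    return u, float(u @ Y @ u)                           # (minimizer, R^2)
```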
57 Next, we set τ by solving the one-dimensional problem: min_{τ > 0} { R^2/τ − β log τ + (1/2)(c + τ)^2 }. [sent-105, score-0.12]
58 τ >0 τ 2 The above can be reduced to a bisection problem over τ , or by solving a polynomial equation of degree 3. [sent-106, score-0.145]
59 Once the above problem is solved, we can obtain the primal variables y, x, as follows. [sent-108, score-0.053]
60 Using formula (9), with βz = τ, we set y = (1/τ) Y u. [sent-109, score-0.047]
61 For the diagonal element x, we use formula (8): x = c + τ = σ − λ − t + τ . [sent-110, score-0.086]
62 Notation: for any symmetric matrix A ∈ R^{n×n}, let A\i\j denote the matrix produced by removing row i and column j. [sent-113, score-0.111]
63 Let A_j denote column j (or row j) of A with the diagonal element A_jj removed. [sent-114, score-0.074]
64 Therefore, the convergence result from [14] readily applies, and hence every limit point of our block coordinate ascent algorithm is the global optimizer. [sent-117, score-0.591]
65 The simple coordinate descent algorithm for solving problem (11) involves only vector products and can easily exploit sparsity in Y. [sent-118, score-0.248]
66 Therefore, our algorithm has a computational complexity of O(Kn^3), where K is the number of sweeps through columns. [sent-120, score-0.039]
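To make the overall structure concrete, here is a high-level sketch of one sweep over all columns, expressed with the hypothetical helpers box_qp_coordinate_descent and solve_tau sketched above; initialization, the choice of β, and the stopping criterion (governed by the parameter ρ in the paper's Algorithm 1) are left out and would need to follow the paper.

```python
import numpy as np

def bca_sweep(Sigma, X, lam, beta):
    """One illustrative sweep of block coordinate ascent on problem (6)."""
    n = Sigma.shape[0]
    for j in range(n):
        idx = np.array([k for k in range(n) if k != j])
        Y = X[np.ix_(idx, idx)]                 # (n-1) x (n-1) block held fixed in this step
        s, sigma = Sigma[idx, j], Sigma[j, j]
        t = np.trace(Y)
        c = sigma - lam - t
        u, R2 = box_qp_coordinate_descent(Y, s, lam)    # box QP (11)
        tau = solve_tau(R2, beta, c)                    # one-dimensional problem in tau
        y = (Y @ u) / tau                               # formula (9)
        X[idx, j] = y
        X[j, idx] = y
        X[j, j] = c + tau                               # formula (8)
    return X
```

After convergence, rescaling X to unit trace recovers (an approximation of) a solution of problem (1), since the homogenized problem (5) is maximized at a positive multiple of the solution of (1).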
67 Hence our algorithm has better dependence on the problem size compared to the O(n^4 √(log n)) required by the first-order algorithm developed in [1]. [sent-122, score-0.102]
68 Fig 1 shows that our algorithm converges much faster than the first-order algorithm. [sent-123, score-0.033]
69 On the left, both algorithms are run on a covariance matrix Σ = F^T F with F Gaussian. [sent-124, score-0.109]
70 On the right, the covariance matrix comes from a "spiked model" similar to that in [2], with Σ = uu^T + VV^T/m, where u ∈ R^n is the true sparse leading eigenvector, with Card(u) = 0.1n. [sent-125, score-0.268]
71 V ∈ R^{n×m} is a noise matrix with V_ij ∼ N(0, 1), and m is the number of observations. [sent-126, score-0.038]
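For reference, a covariance of this spiked form can be generated as follows; this snippet is our illustration of the stated setup, and the random seed and sparsity pattern are arbitrary.

```python
import numpy as np

def spiked_covariance(n, m, rng=None):
    """Sigma = u u^T + V V^T / m with a sparse leading eigenvector u, Card(u) = 0.1 n."""
    rng = np.random.default_rng(0) if rng is None else rng
    k = max(1, int(0.1 * n))
    u = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    u[support] = rng.standard_normal(k)
    u /= np.linalg.norm(u)                    # unit-norm sparse spike
    V = rng.standard_normal((n, m))           # noise matrix, V_ij ~ N(0, 1)
    return np.outer(u, u) + (V @ V.T) / m
```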
72 4 Numerical Examples In this section, we analyze two publicly available large data sets, the NYTimes news articles data and the PubMed abstracts data, available from the UCI Machine Learning Repository [16]. [sent-127, score-0.12]
73 Both text collections record word occurrences in bag-of-words form. [Algorithm 1: Block Coordinate Ascent Algorithm. Input: the covariance matrix Σ and a parameter ρ > 0.] [sent-128, score-0.109]
74 The NYTimes text collection contains 300,000 articles and has a dictionary of 102,660 unique words, resulting in a file of size 1 GB. [sent-132, score-0.143]
75 The even larger PubMed data set has 8,200,000 abstracts with 141,043 unique words in them, giving a file of size over 7 GB. [sent-133, score-0.173]
76 However, with the preprocessing technique presented in Section 2 and the block coordinate ascent algorithm developed in Section 3, we are able to perform sparse PCA analysis of these data, thanks also to the fact that the variances of words decrease drastically when we rank them, as shown in Fig 2. [sent-136, score-1.064]
77 Note that the feature elimination result only requires the computation of each feature’s variance, and that this task is easy to parallelize. [sent-137, score-0.313]
78 By doing sparse PCA analysis of these text data, we hope to find interpretable principal components that can be used to summarize and explore the large corpora. [sent-138, score-0.496]
79 Therefore, we set the target cardinality for each principal component to be 5. [sent-139, score-0.426]
80 The top 5 sparse principal components are shown in Table 1 for NYTimes and in Table 2 for PubMed. [sent-141, score-0.419]
81 Clearly the first principal component for NYTimes is about business, the second one about sports, the third about the U.S., [sent-142, score-0.285]
82 the fourth about politics, and the fifth about education. [sent-144, score-0.031]
83 Bear in mind that the NYTimes data from UCI Machine Learning Repository “have no class labels, and for copyright reasons no filenames or other document-level metadata” [16]. [sent-145, score-0.031]
84 The sparse principal components still unambiguously identify and perfectly correspond to the topics used by The New York Times itself to classify articles on its own website. [sent-146, score-0.485]
85 In the case of PubMed, our algorithm only needs to work on covariance matrices of order at most n = 1000, instead of the full-order (n = 141,043) covariance matrix. [sent-150, score-0.142]
86 Thus, at the values of the penalty parameter λ that a target cardinality of 5 commands, we observe a dramatic reduction in problem sizes, to about 150 ∼ 200 times smaller than the original sizes. [sent-151, score-0.261]
87 This motivates our conclusion that sparse PCA is, in a sense, easier than PCA itself. [sent-152, score-0.204]
88 5 Conclusion The safe feature elimination result, coupled with a fast block coordinate ascent algorithm, allows us to solve sparse PCA problems for very large-scale, real-life data sets. [sent-153, score-1.267]
89 The overall method works especially well when the target cardinality of the result is small, which is often the case in applications where interpretability by a human is key. [sent-154, score-0.179]
90 The algorithm we proposed has better computational complexity than, and in practice converges much faster than, the first-order algorithm developed in [1]. [sent-155, score-0.066]
91 Our experiments on text data also show that sparse PCA can be a promising approach to summarizing and organizing a large text corpus. [sent-156, score-0.313]
92 A direct formulation of sparse PCA using semidefinite programming. [sent-162, score-0.159]
93 High-dimensional analysis of semidefinite relaxations for sparse principal components. [sent-168, score-0.412]
94 Loadings and correlations in the interpretation of principal components. [sent-179, score-0.215]
95 Spectral bounds for sparse PCA: exact and greedy algorithms. [sent-185, score-0.159]
96 A modified principal component technique based on the LASSO. [sent-199, score-0.319]
97 Sparse principal component analysis via regularized low rank matrix approximation. [sent-209, score-0.376]
98 Generalized power method for sparse principal component analysis. [sent-219, score-0.444]
99 On the quality of a semidefinite programming bound for sparse principal component analysis. [sent-235, score-0.444]
100 Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. [sent-241, score-0.159]
wordName wordTfidf (topN-words)
[('pca', 0.344), ('nytimes', 0.286), ('elimination', 0.257), ('dspca', 0.252), ('ascent', 0.23), ('principal', 0.215), ('coordinate', 0.197), ('safe', 0.197), ('pc', 0.181), ('pubmed', 0.179), ('tr', 0.174), ('sparse', 0.159), ('cardinality', 0.141), ('block', 0.131), ('words', 0.119), ('el', 0.104), ('semide', 0.091), ('aspremont', 0.077), ('text', 0.077), ('variances', 0.073), ('scotlass', 0.072), ('spca', 0.072), ('covariance', 0.071), ('component', 0.07), ('log', 0.069), ('min', 0.069), ('max', 0.067), ('safely', 0.066), ('articles', 0.066), ('yt', 0.065), ('lagrangian', 0.064), ('bisection', 0.063), ('ghaoui', 0.058), ('berkeley', 0.057), ('feature', 0.056), ('abstracts', 0.054), ('zx', 0.054), ('word', 0.054), ('primal', 0.053), ('rank', 0.053), ('repository', 0.052), ('sdps', 0.051), ('conic', 0.051), ('uci', 0.051), ('solving', 0.051), ('rn', 0.048), ('relaxation', 0.047), ('formula', 0.047), ('jj', 0.046), ('easier', 0.045), ('components', 0.045), ('ut', 0.044), ('penalty', 0.044), ('dramatic', 0.043), ('generality', 0.042), ('business', 0.042), ('fig', 0.04), ('solve', 0.04), ('complexity', 0.039), ('seconds', 0.039), ('diagonal', 0.039), ('det', 0.039), ('matrix', 0.038), ('interpretability', 0.038), ('relaxations', 0.038), ('rotation', 0.038), ('optimum', 0.038), ('augmented', 0.037), ('rigorous', 0.035), ('row', 0.035), ('thanks', 0.035), ('cpu', 0.035), ('technique', 0.034), ('variance', 0.033), ('sizes', 0.033), ('developed', 0.033), ('converges', 0.033), ('duo', 0.031), ('carcinoma', 0.031), ('campaign', 0.031), ('commands', 0.031), ('mice', 0.031), ('politics', 0.031), ('infection', 0.031), ('season', 0.031), ('ajj', 0.031), ('cadima', 0.031), ('copyright', 0.031), ('haipeng', 0.031), ('jianhua', 0.031), ('jolliffe', 0.031), ('katya', 0.031), ('metadata', 0.031), ('moghaddam', 0.031), ('president', 0.031), ('shiqian', 0.031), ('spiked', 0.031), ('uut', 0.031), ('polynomial', 0.031), ('children', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
2 0.14998488 260 nips-2011-Sparse Features for PCA-Like Linear Regression
Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail
Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.
3 0.14746937 165 nips-2011-Matrix Completion for Multi-label Image Classification
Author: Ricardo S. Cabral, Fernando Torre, Joao P. Costeira, Alexandre Bernardino
Abstract: Recently, image categorization has been an active research topic due to the urgent need to retrieve and browse digital images via semantic keywords. This paper formulates image categorization as a multi-label classification problem using recent advances in matrix completion. Under this setting, classification of testing data is posed as a problem of completing unknown label entries on a data matrix that concatenates training and testing features with training labels. We propose two convex algorithms for matrix completion based on a Rank Minimization criterion specifically tailored to visual data, and prove its convergence properties. A major advantage of our approach w.r.t. standard discriminative classification methods for image categorization is its robustness to outliers, background noise and partial occlusions both in the feature and label space. Experimental validation on several datasets shows how our method outperforms state-of-the-art algorithms, while effectively capturing semantic concepts of classes. 1
4 0.14379653 68 nips-2011-Demixed Principal Component Analysis
Author: Wieland Brendel, Ranulfo Romo, Christian K. Machens
Abstract: In many experiments, the data points collected live in high-dimensional observation spaces, yet can be assigned a set of labels or parameters. In electrophysiological recordings, for instance, the responses of populations of neurons generally depend on mixtures of experimentally controlled parameters. The heterogeneity and diversity of these parameter dependencies can make visualization and interpretation of such data extremely difficult. Standard dimensionality reduction techniques such as principal component analysis (PCA) can provide a succinct and complete description of the data, but the description is constructed independent of the relevant task variables and is often hard to interpret. Here, we start with the assumption that a particularly informative description is one that reveals the dependency of the high-dimensional data on the individual parameters. We show how to modify the loss function of PCA so that the principal components seek to capture both the maximum amount of variance about the data, while also depending on a minimum number of parameters. We call this method demixed principal component analysis (dPCA) as the principal components here segregate the parameter dependencies. We phrase the problem as a probabilistic graphical model, and present a fast Expectation-Maximization (EM) algorithm. We demonstrate the use of this algorithm for electrophysiological data and show that it serves to demix the parameter-dependence of a neural population response. 1
5 0.13703927 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik
Abstract: The 1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.
6 0.13225037 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
7 0.11466209 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
8 0.11311071 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
9 0.11185341 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
10 0.091888897 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
11 0.089104079 261 nips-2011-Sparse Filtering
12 0.087658606 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
13 0.087473966 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
14 0.084742971 258 nips-2011-Sparse Bayesian Multi-Task Learning
15 0.083503924 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
16 0.081684552 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data
17 0.079704113 259 nips-2011-Sparse Estimation with Structured Dictionaries
18 0.078648426 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
19 0.073687337 276 nips-2011-Structured sparse coding via lateral inhibition
20 0.073587634 145 nips-2011-Learning Eigenvectors for Free
topicId topicWeight
[(0, 0.21), (1, 0.009), (2, -0.042), (3, -0.139), (4, -0.061), (5, 0.042), (6, 0.078), (7, 0.117), (8, -0.004), (9, 0.018), (10, 0.151), (11, 0.061), (12, -0.008), (13, -0.001), (14, 0.001), (15, -0.01), (16, -0.026), (17, 0.047), (18, 0.059), (19, -0.061), (20, 0.038), (21, 0.005), (22, 0.101), (23, -0.043), (24, -0.053), (25, 0.026), (26, -0.0), (27, 0.028), (28, -0.075), (29, 0.066), (30, -0.126), (31, -0.093), (32, 0.054), (33, 0.06), (34, 0.005), (35, 0.07), (36, -0.061), (37, -0.004), (38, -0.306), (39, -0.039), (40, 0.056), (41, -0.075), (42, -0.064), (43, -0.012), (44, 0.078), (45, 0.09), (46, 0.105), (47, -0.045), (48, 0.012), (49, 0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.95610273 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
2 0.68702334 260 nips-2011-Sparse Features for PCA-Like Linear Regression
Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail
Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.
3 0.63072199 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik
Abstract: The 1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.
4 0.62640363 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
Author: Yong Zhang, Zhaosong Lu
Abstract: In this paper we consider general rank minimization problems with rank appearing in either objective function or constraint. We first show that a class of matrix optimization problems can be solved as lower dimensional vector optimization problems. As a consequence, we establish that a class of rank minimization problems have closed form solutions. Using this result, we then propose penalty decomposition methods for general rank minimization problems. The convergence results of the PD methods have been shown in the longer version of the paper [19]. Finally, we test the performance of our methods by applying them to matrix completion and nearest low-rank correlation matrix problems. The computational results demonstrate that our methods generally outperform the existing methods in terms of solution quality and/or speed. 1
5 0.61436599 68 nips-2011-Demixed Principal Component Analysis
Author: Wieland Brendel, Ranulfo Romo, Christian K. Machens
Abstract: In many experiments, the data points collected live in high-dimensional observation spaces, yet can be assigned a set of labels or parameters. In electrophysiological recordings, for instance, the responses of populations of neurons generally depend on mixtures of experimentally controlled parameters. The heterogeneity and diversity of these parameter dependencies can make visualization and interpretation of such data extremely difficult. Standard dimensionality reduction techniques such as principal component analysis (PCA) can provide a succinct and complete description of the data, but the description is constructed independent of the relevant task variables and is often hard to interpret. Here, we start with the assumption that a particularly informative description is one that reveals the dependency of the high-dimensional data on the individual parameters. We show how to modify the loss function of PCA so that the principal components seek to capture both the maximum amount of variance about the data, while also depending on a minimum number of parameters. We call this method demixed principal component analysis (dPCA) as the principal components here segregate the parameter dependencies. We phrase the problem as a probabilistic graphical model, and present a fast Expectation-Maximization (EM) algorithm. We demonstrate the use of this algorithm for electrophysiological data and show that it serves to demix the parameter-dependence of a neural population response. 1
6 0.60488331 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
7 0.55562693 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
8 0.52341831 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
9 0.52001601 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
10 0.51746643 165 nips-2011-Matrix Completion for Multi-label Image Classification
11 0.50769562 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
12 0.50673139 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
13 0.49013314 259 nips-2011-Sparse Estimation with Structured Dictionaries
14 0.47926033 143 nips-2011-Learning Anchor Planes for Classification
15 0.47134376 14 nips-2011-A concave regularization technique for sparse mixture models
16 0.47028825 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
17 0.46707404 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems
18 0.4613983 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
19 0.45380333 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
20 0.44979343 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
topicId topicWeight
[(0, 0.013), (4, 0.037), (20, 0.012), (26, 0.014), (31, 0.026), (33, 0.019), (43, 0.046), (45, 0.655), (57, 0.018), (74, 0.041), (83, 0.028), (99, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99515915 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
2 0.99087822 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
3 0.9863109 28 nips-2011-Agnostic Selective Classification
Author: Yair Wiener, Ran El-Yaniv
Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this √ rejected region of the domain is shown to be diminishing at rate O(Bθ( 1/m)β ), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1
4 0.98445475 272 nips-2011-Stochastic convex optimization with bandit feedback
Author: Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, Alexander Rakhlin
Abstract: This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value f (x) at any query point x ∈ X . We demonstrate √ a generalization of the ellipsoid algorithm that √ incurs O(poly(d) T ) regret. Since any algorithm has regret at least Ω( T ) on this problem, our algorithm is optimal in terms of the scaling with T . 1
5 0.97855437 143 nips-2011-Learning Anchor Planes for Classification
Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari
Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1
6 0.97690266 293 nips-2011-Understanding the Intrinsic Memorability of Images
7 0.97461361 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
8 0.93283063 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
9 0.89599466 19 nips-2011-Active Classification based on Value of Classifier
10 0.88525647 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
11 0.88294077 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
12 0.87788773 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
13 0.87337124 220 nips-2011-Prediction strategies without loss
14 0.87311679 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
15 0.87007868 78 nips-2011-Efficient Methods for Overlapping Group Lasso
16 0.86904514 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
17 0.86812401 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
18 0.86075747 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
19 0.85559165 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
20 0.85470319 169 nips-2011-Maximum Margin Multi-Label Structured Prediction