nips nips2010 nips2010-45 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jacob Bien, Ya Xu, Michael W. Mahoney
Abstract: The CUR decomposition provides an approximation of a matrix X that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of X. In this regard, it appears to be similar to many sparse PCA methods. However, CUR takes a randomized algorithmic approach, whereas most sparse PCA methods are framed as convex optimization problems. In this paper, we try to understand CUR from a sparse optimization viewpoint. We show that CUR is implicitly optimizing a sparse regression objective and, furthermore, cannot be directly cast as a sparse PCA method. We also observe that the sparsity attained by CUR possesses an interesting structure, which leads us to formulate a sparse PCA method that achieves a CUR-like sparsity.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The CUR decomposition provides an approximation of a matrix X that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of X. [sent-7, score-0.242]
2 In this regard, it appears to be similar to many sparse PCA methods. [sent-8, score-0.051]
3 However, CUR takes a randomized algorithmic approach, whereas most sparse PCA methods are framed as convex optimization problems. [sent-9, score-0.123]
4 In this paper, we try to understand CUR from a sparse optimization viewpoint. [sent-10, score-0.07]
5 We show that CUR is implicitly optimizing a sparse regression objective and, furthermore, cannot be directly cast as a sparse PCA method. [sent-11, score-0.172]
6 We also observe that the sparsity attained by CUR possesses an interesting structure, which leads us to formulate a sparse PCA method that achieves a CUR-like sparsity. [sent-12, score-0.103]
7 1 Introduction CUR decompositions are a recently-popular class of randomized algorithms that approximate a data matrix X ∈ Rn×p by using only a small number of actual columns of X [12, 4]. [sent-13, score-0.201]
8 CUR decompositions are often described as SVD-like low-rank decompositions that have the additional advantage of being easily interpretable to domain scientists. [sent-14, score-0.143]
9 The motivation to produce a more interpretable lowrank decomposition is also shared by sparse PCA (SPCA) methods, which are optimization-based procedures that have been of interest recently in statistics and machine learning. [sent-15, score-0.086]
10 It is the purpose of this paper to understand CUR decompositions from a sparse optimization viewpoint, thereby elucidating the connection between CUR decompositions and the SPCA class of sparse optimization methods. [sent-20, score-0.264]
11 To do so, we begin by putting forth a combinatorial optimization problem (see (6) below) which CUR is implicitly approximately optimizing. [sent-21, score-0.066]
12 This formulation will highlight two interesting features of CUR: first, CUR attains a distinctive pattern of sparsity, which has practical implications from the SPCA viewpoint; and second, CUR is implicitly optimizing a regression-type objective. [sent-22, score-0.048]
13 Before doing so, recall that, given an input matrix X, Principal Component Analysis (PCA) seeks the k-dimensional hyperplane with the lowest reconstruction error. [sent-32, score-0.097]
14 That is, it computes a p × k orthogonal matrix W that minimizes ERR(W) = ||X − XWWT||F. (1) [sent-33, score-0.052]
15 Writing the SVD of X as UΣVT, the minimizer of (1) is given by Vk, the first k columns of V. [sent-34, score-0.066]
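A minimal numpy sketch of the two quantities just introduced — the reconstruction error ERR(W) in (1) and its minimizer Vk obtained from the SVD; the matrix sizes and the comparison matrix W_rand are illustrative assumptions, not from the paper.

```python
import numpy as np

def err(X, W):
    """Reconstruction error ERR(W) = ||X - X W W^T||_F for a p x k matrix W with orthonormal columns."""
    return np.linalg.norm(X - X @ W @ W.T, ord="fro")

rng = np.random.default_rng(0)
n, p, k = 100, 20, 3                                     # illustrative sizes
X = rng.standard_normal((n, p))

U, s, Vt = np.linalg.svd(X, full_matrices=False)         # X = U Sigma V^T
Vk = Vt[:k].T                                            # first k right singular vectors, p x k

W_rand, _ = np.linalg.qr(rng.standard_normal((p, k)))    # some other orthonormal p x k matrix
print(err(X, Vk), "<=", err(X, W_rand))                  # Vk attains the minimum of (1)
```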
16 In the data analysis setting, each column of V provides a particular linear combination of the columns of X. [sent-35, score-0.066]
17 In many applications, interpreting such factors is made much easier if they are comprised of only a small number of actual columns of X, which is equivalent to Vk only having a small number of nonzero elements. [sent-37, score-0.098]
18 1 CUR matrix decompositions CUR decompositions were proposed by Drineas and Mahoney [12, 4] to provide a low-rank approximation to a data matrix X by using only a small number of actual columns and/or rows of X. [sent-39, score-0.319]
19 Fast randomized variants [3], deterministic variants [5], Nyström-based variants [1, 11], and heuristic variants [17] have also been considered. [sent-40, score-0.107]
20 Given an arbitrary matrix X ∈ Rn×p and an integer k, there exists a randomized algorithm that chooses a random subset I ⊂ {1, . [sent-44, score-0.073]
21 , p} of size c = O(k log k log(1/δ)/ε^2) such that XI, the n × c submatrix containing those c columns of X, satisfies ||X − XI XI+ X||F = min_{B∈Rc×p} ||X − XI B||F ≤ (1 + ε)||X − Xk||F, (2) with probability at least 1 − δ, where Xk is the best rank k approximation to X. [sent-47, score-0.107]
22 2) Form I by randomly sampling c columns of X, using these normalized statistical leverage scores as an importance sampling distribution. [sent-49, score-0.112]
23 Let the p × k matrix Vk be the top-k right singular vectors of X. [sent-52, score-0.068]
24 These scores, proportional to the Euclidean norms of the rows of the top-k right singular vectors, define the relevant nonuniformity structure to be used to identify good (in the sense of Theorem 1) columns. [sent-57, score-0.081]
25 In addition, these scores are proportional to the diagonal elements of the projection matrix onto the top-k right singular subspace. [sent-58, score-0.091]
26 Thus, they generalize the so-called hat matrix [8], and they have a natural interpretation as capturing the “statistical leverage” or “influence” of a given column on the best low-rank fit of the data matrix [8, 12]. [sent-59, score-0.088]
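A sketch of the column-selection step just described, assuming a numpy environment: compute the normalized leverage scores from Vk (squared row norms, scaled to sum to one), sample c columns with those probabilities, and compare ||X − XI XI+ X||F with the best rank-k error of Theorem 1. The sizes are illustrative, and sampling without replacement is a simplification of the procedure of [12, 4]; Theorem 1 takes c = O(k log k log(1/δ)/ε^2) rather than the fixed c used here.

```python
import numpy as np

def leverage_scores(X, k):
    """Normalized statistical leverage scores (3): squared row norms of Vk divided by k."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T                                  # top-k right singular vectors, p x k
    return np.sum(Vk ** 2, axis=1) / k             # nonnegative, sums to 1

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 40))
k, c = 5, 15                                       # illustrative choices

pi = leverage_scores(X, k)
I = rng.choice(X.shape[1], size=c, replace=False, p=pi)            # sample c columns by leverage

XI = X[:, I]
cur_err = np.linalg.norm(X - XI @ np.linalg.pinv(XI) @ X, "fro")   # ||X - X_I X_I^+ X||_F

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Xk = (U[:, :k] * s[:k]) @ Vt[:k]                   # best rank-k approximation X_k
print(cur_err, "vs ||X - X_k||_F =", np.linalg.norm(X - Xk, "fro"))
```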
27 2 Regularized sparse PCA methods SPCA methods attempt to make PCA easier to interpret for domain experts by finding sparse approximations to the columns of V. [sent-61, score-0.168]
28 For SPCA, we only consider sparsity in the right singular vectors V and not in the left singular vectors U. [sent-64, score-0.12]
29 This is similar to considering only the choice of columns and not of both columns and rows in CUR. [sent-65, score-0.179]
30 [19] use the maximum variance interpretation of PCA and provide an optimization problem which explicitly encourages sparsity in V based on a Lasso constraint [18]. [sent-67, score-0.109]
31 [21] use the minimum reconstruction error interpretation of PCA to suggest a different approach to the SPCA problem; this formulation will be most relevant to our present purpose. [sent-71, score-0.061]
32 Then, the minimizing matrices A∗ and V∗k of (4) satisfy A∗(i) = si V(i) and V∗k(i) = si (Σ2ii/(Σ2ii + λ)) V(i), where si = 1 or −1. [sent-79, score-0.08]
33 That is, up to signs, A∗ consists of the top-k right singular vectors of X, and V∗k consists of those same vectors "shrunk" by a factor depending on the corresponding singular value. [sent-80, score-0.108]
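The closed form above is easy to check numerically. A small sketch (assuming numpy, and fixing A at Vk rather than jointly optimizing it, so that only the ridge-regularized W-subproblem is solved) compares the exact minimizer with the shrunk singular vectors; the sizes and λ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, lam = 50, 10, 3, 2.0
X = rng.standard_normal((n, p))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T

# With A fixed at Vk, the W minimizing ||X - X W A^T||_F^2 + lam*||W||_F^2 is
# W = (X^T X + lam*I)^{-1} X^T X Vk, i.e. column i of Vk shrunk by s_i^2 / (s_i^2 + lam).
W = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ X @ Vk)
W_shrunk = Vk * (s[:k] ** 2 / (s[:k] ** 2 + lam))
print(np.allclose(W, W_shrunk))   # True: the solution is Vk shrunk column by column
```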
34 This regularization tends to sparsify W element-wise, so that the solution V∗k gives a sparse approximation of Vk. [sent-85, score-0.088]
35 3 Expressing CUR as an optimization problem In this section, we present an optimization formulation of CUR. [sent-86, score-0.052]
36 That is, it achieves sparsity indirectly by randomly selecting c columns, and it does so in such a way that the reconstruction error is small with high probability (Theorem 1). [sent-89, score-0.099]
37 In this sense, CUR can be viewed as a randomized algorithm for approximately solving the following combinatorial optimization problem: min_{I⊂{1,...,p}, |I|=c} min_{B∈Rc×p} ||X − XI B||F. (6) [sent-92, score-0.085]
38 In words, this objective asks for the subset of c columns of X which best describes the entire matrix X. [sent-98, score-0.1]
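For intuition, (6) can be solved exactly by exhaustive search when p is tiny; a brute-force sketch under that assumption (illustrative only, since the search is exponential in p), using the closed-form inner minimizer B = XI+ X:

```python
import itertools
import numpy as np

def best_column_subset(X, c):
    """Exhaustive solution of (6): the c columns of X that best reconstruct the whole of X."""
    p = X.shape[1]
    best_I, best_err = None, np.inf
    for I in itertools.combinations(range(p), c):
        XI = X[:, list(I)]
        # The inner minimization over B has the closed form B = X_I^+ X.
        err = np.linalg.norm(X - XI @ np.linalg.pinv(XI) @ X, "fro")
        if err < best_err:
            best_I, best_err = I, err
    return best_I, best_err

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
print(best_column_subset(X, c=3))   # best 3-column subset and its reconstruction error
```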
39 This optimization problem is analogous to all-subsets multivariate regression [7], which is known to be NP-hard. [sent-100, score-0.05]
40 However, by using ideas from the optimization literature we can approximate this combinatorial problem as a regularized regression problem that is convex. [sent-101, score-0.077]
41 However, we can approximate the L0 constraint by a group lasso penalty, a well-known convex heuristic proposed by Yuan et al. [sent-108, score-0.115]
42 Thus, the combinatorial problem in (6) can be approximated by the following convex (and thus tractable) problem: Problem 1 (Group lasso regression: GL-REG). [sent-110, score-0.085]
43 where t is chosen to get c nonzero rows in B∗. [sent-114, score-0.063]
44 Since the rows of B are grouped together in the penalty Σ_{i=1}^p ||B(i)||2, the row vector B(i) will tend to be either dense or entirely zero. [sent-115, score-0.111]
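One way to solve a problem of this form is proximal gradient descent with row-wise (group) soft-thresholding. The sketch below, assuming numpy, uses the penalized version with a parameter λ in place of the constraint t (in practice λ would be tuned so that B∗ has exactly c nonzero rows); it is an illustrative solver and the data and λ are made up, not the authors' exact algorithm or settings.

```python
import numpy as np

def gl_reg(X, lam, n_iter=2000):
    """Penalized form of Problem 1 (GL-REG):
       minimize 0.5*||X - X B||_F^2 + lam * sum_i ||B_(i)||_2 via proximal gradient descent."""
    p = X.shape[1]
    B = np.zeros((p, p))
    step = 1.0 / np.linalg.norm(X.T @ X, 2)         # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        B = B - step * (X.T @ (X @ B - X))          # gradient step on 0.5*||X - XB||_F^2
        rn = np.linalg.norm(B, axis=1, keepdims=True)
        B = np.maximum(0.0, 1.0 - step * lam / np.maximum(rn, 1e-12)) * B   # group soft-threshold
    return B

rng = np.random.default_rng(0)
n, p, c = 50, 20, 3
# A matrix whose columns are driven by c underlying directions, plus a little noise.
X = rng.standard_normal((n, c)) @ rng.standard_normal((c, p)) + 0.1 * rng.standard_normal((n, p))
B = gl_reg(X, lam=2.0)
print("nonzero rows of B:", int(np.sum(np.linalg.norm(B, axis=1) > 1e-8)), "of", p)
```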
45 (Finally, as a side remark, note that our proposed GL-REG is strikingly similar to a recently proposed method for sparse inverse covariance estimation [6, 15].) [sent-117, score-0.051]
46 So far, we have established CUR’s connection to regression by showing that CUR can be thought of as an approximation algorithm for the sparse regression problem (7). [sent-119, score-0.127]
47 In this section, we discuss the relationship between regression and PCA, and we show that CUR cannot be directly cast as an SPCA method. [sent-120, score-0.05]
48 It is common to consider sparsity and/or rank constraints. [sent-127, score-0.079]
49 To illustrate the difference between the reconstruction errors (9) and (10) when extra constraints are imposed, consider the 2-dimensional toy example in Figure 1. [sent-129, score-0.063]
50 In this example, we compare regression with a row-sparsity constraint to PCA with both rank and sparsity constraints. [sent-130, score-0.128]
51 Finally, if W is further required to be sparse in the X(2) direction (as with SPCA methods), we get the rank-one, sparse projection represented by the green line in Figure 1 (right). [sent-135, score-0.102]
52 The two sets of dotted lines in each plot clearly differ, indicating that their corresponding reconstruction errors are different as well. [sent-136, score-0.077]
53 If such a VCUR existed, then clearly ERR(VCUR ) = ||X − XI XI+ X||F , and so CUR could be regarded as implicitly performing sparse PCA in the sense that (a) VCUR is sparse; and (b) by Theorem 1 (with high probability), ERR(VCUR ) ≤ (1 + ǫ)ERR(Vk ). [sent-143, score-0.071]
54 Thus, the existence of such a VCUR would cast CUR directly as a randomized approximation algorithm for SPCA. [sent-144, score-0.072]
55 However, the following theorem states that unless an unrealistic constraint on X holds, there does not exist a matrix VCUR for which ERR(VCUR ) = ||X − XI XI+ X||F . [sent-145, score-0.066]
56 Figure 1: Example of the difference in reconstruction errors (9) and (10) when additional constraints are imposed. [sent-153, score-0.063]
57 Left: regression with row-sparsity constraint (black) compared with PCA with low rank constraint (red). [sent-154, score-0.094]
58 Right: regression with row-sparsity constraint (black) compared with PCA with low rank and sparsity constraint (green). [sent-155, score-0.146]
59 5 CUR-type sparsity and the group lasso SPCA Although CUR cannot be directly cast as an SPCA-type method, in this section we propose a sparse PCA approach (which we call the group lasso SPCA or GL-SPCA) that accomplishes something very close to CUR. [sent-160, score-0.244]
60 Our proposal produces a V∗ that has rows that are entirely zero, and it is motivated by the following two observations about CUR. [sent-161, score-0.066]
61 First, following from the definition of the leverage scores (3), CUR chooses columns of X based on the norm of their corresponding rows of Vk . [sent-162, score-0.159]
62 Second, as we have noted in Section 4, if CUR could be expressed as a PCA method, its principal directions matrix “VCUR ” would have p − c rows that are entirely zero, corresponding to removing those columns of X. [sent-164, score-0.192]
63 [21] obtain a sparse V∗ by including in (4) an additional L1 penalty, giving the optimization problem (5). [sent-166, score-0.095]
64 Since the L1 penalty is on the entire matrix viewed as a vector, it encourages only unstructured sparsity. [sent-167, score-0.102]
65 Thus, the lasso penalty λ1||W||1 in (5) is replaced in (11) by a group lasso penalty λ1 Σ_{i=1}^p ||W(i)||2, where rows of W are grouped together so that each row of V∗ will tend to be either dense or entirely zero. [sent-173, score-0.241]
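A sketch of one way to carry out the alternating minimization for a criterion of the form (11), assuming numpy: the W-step applies proximal gradient with row-wise soft-thresholding to the group-lasso-penalized ridge objective, and the A-step uses the reduced-rank Procrustes rotation (as in Zou et al.'s SPCA). The update scheme, parameter values, and test matrix are illustrative assumptions rather than the authors' exact algorithm.

```python
import numpy as np

def gl_spca(X, k, lam=1.0, lam1=20.0, n_outer=50, n_inner=50):
    """Alternating minimization sketch for a GL-SPCA-style criterion:
       min_{A,W} ||X - X W A^T||_F^2 + lam*||W||_F^2 + lam1*sum_i ||W_(i)||_2,  s.t. A^T A = I_k."""
    XtX = X.T @ X
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T                                          # start from the ordinary PCA directions
    W = A.copy()
    step = 1.0 / (2.0 * (np.linalg.norm(XtX, 2) + lam))   # 1 / Lipschitz constant of the smooth part
    for _ in range(n_outer):
        # W-step: proximal gradient on the smooth part, then row-wise (group) soft-thresholding.
        for _ in range(n_inner):
            grad = 2.0 * (XtX @ W - XtX @ A + lam * W)
            W = W - step * grad
            rn = np.linalg.norm(W, axis=1, keepdims=True)
            W = np.maximum(0.0, 1.0 - step * lam1 / np.maximum(rn, 1e-12)) * W
        # A-step: with W fixed, the optimal A is the Procrustes rotation U V^T of X^T X W.
        U, _, Vt2 = np.linalg.svd(XtX @ W, full_matrices=False)
        A = U @ Vt2
    return W, A

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 40))
W, A = gl_spca(X, k=4)
print("rows of W that survive:", int(np.sum(np.linalg.norm(W, axis=1) > 1e-8)), "of", X.shape[1])
```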
66 Consider, for example, when there are a small number of informative columns in X and the rest are not important for the task at hand [12, 14]. [sent-182, score-0.066]
67 In such a case, we would expect that enforcing entire rows to be zero would lead to better identification of the signal columns; and this has been empirically observed in the application of CUR to DNA SNP analysis [14]. [sent-183, score-0.062]
68 The unstructured V∗ , by contrast, would not be able to “borrow strength” across all columns of V∗ to differentiate the signal columns from the noise columns. [sent-184, score-0.17]
69 On the other hand, requiring such structured sparsity is more restrictive and may not be desirable. [sent-185, score-0.052]
70 For example, in microarray analysis in which we have measured p genes on n patients, our goal may be to find several underlying factors. [sent-186, score-0.068]
71 Biologists have identified “pathways” of interconnected genes [16], and it would be desirable if each sparse factor could be identified with a different pathway (that is, a different set of genes). [sent-187, score-0.089]
72 Requiring all factors of V∗ to exclude the same p − c genes does not allow a different sparse subset of genes to be active in each factor. [sent-188, score-0.141]
73 We finish this section by pointing out that while most SPCA methods only enforce unstructured zeros in V∗ , the idea of having a structured sparsity in the PCA context has very recently been explored [9]. [sent-189, score-0.093]
74 In particular, we compare the randomized CUR algorithm of Mahoney and Drineas [12, 4] to our GL-REG (of Problem 1), and we compare the SPCA algorithm proposed by Zou et al. [sent-192, score-0.061]
75 1 Simulations We first consider synthetic examples of the form X = X̄ + E, where X̄ is the underlying signal matrix and E is a matrix of noise. [sent-197, score-0.083]
76 The noise matrix E has i.i.d. N(0, 1) entries, while the signal X̄ has one of the following forms: Case I) X̄ = [0n×(p−c); X∗] where the n × c matrix X∗ is the nonzero part of X̄. [sent-201, score-0.065]
77 In other words, X̄ has c nonzero columns and does not necessarily have a low-rank structure. [sent-202, score-0.082]
78 Here V is low-rank and sparse, but the sparsity is not structured (i. [sent-208, score-0.052]
79 A successful method attains low reconstruction error of the true signal X̄ and has high precision in correctly identifying the zeros in the underlying model. [sent-211, score-0.094]
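A rough sketch of how the Case I data and the two criteria might be set up, assuming numpy; the dimensions, noise level, and scoring convention are illustrative guesses, since the exact settings behind Table 1 are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, c = 100, 50, 10                       # illustrative sizes, not the paper's exact settings

# Case I: the signal Xbar has c nonzero columns (here the last c) and no imposed low-rank structure.
Xbar = np.hstack([np.zeros((n, p - c)), rng.standard_normal((n, c))])
X = Xbar + rng.standard_normal((n, p))      # observed data: signal plus i.i.d. N(0, 1) noise
true_zero_cols = set(range(p - c))          # columns that are zero in the underlying model

def evaluate(selected_cols, Xhat):
    """Reconstruction error against the true signal, and precision of the declared zero columns."""
    declared_zeros = set(range(p)) - set(selected_cols)
    precision = len(declared_zeros & true_zero_cols) / max(len(declared_zeros), 1)
    return np.linalg.norm(Xbar - Xhat, "fro"), precision
```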
80 In Cases II and III, the signal matrix has rank k = 10. [sent-216, score-0.076]
81 Table 1: Simulation results: the reconstruction errors and the percentages of correctly identified zeros (in parentheses). [sent-247, score-0.081]
82 As we would expect, since CUR only uses information in the top k singular vectors, it does slightly worse than GL -R EG in terms of precision when the underlying signal is not low-rank (Case I). [sent-249, score-0.049]
83 In addition, both methods perform poorly if the sparsity is not structured as in Case III. [sent-250, score-0.052]
84 Again, the group lasso method seems to work better in Case I. [sent-252, score-0.061]
85 2 Microarray example We next consider a microarray dataset of soft tissue tumors studied by Nielsen et al. [sent-255, score-0.075]
86 Since SPCA does not explicitly enforce row-sparsity, for a gene not to be used in the model, all of the (k = 4) columns of V∗ must exclude it. [sent-268, score-0.094]
87 The subgradient equations (using that ATA = Ik) are 2BT XT X(i) − 2AT XT X(i) + 2λ(B(i))T + λ1 si = 0 for i = 1, ..., p. (12) [sent-276, score-0.057]
88 Figure 2: Left: comparison of CUR (multiple runs) with GL-REG, plotting ERRreg(I) against the number of genes used; Right: comparison of GL-SPCA with SPCA (specifically, Zou et al.'s method), plotting ERR(V) against the number of genes used. [sent-279, score-0.076]
89 where the subgradient vectors si = (B(i))T/||B(i)||2 if B(i) ≠ 0, or ||si||2 ≤ 1 if B(i) = 0. [sent-281, score-0.095]
90 Let us define bi = Σ_{j≠i} (X(j)T X(i)) (B(j))T = BT XT X(i) − ||X(i)||2^2 (B(i))T, so that the subgradient equations can be written as bi + (||X(i)||2^2 + λ)(B(i))T − AT XT X(i) + (λ1/2)si = 0. (13) [sent-282, score-0.101]
91 First, if B(i) = 0, the subgradient equations (13) become bi − AT XT X(i) + (λ1 /2)si = 0. [sent-287, score-0.059]
92 To prove the other direction, recall that B(i) ≠ 0 implies si = (B(i))T/||B(i)||2. [sent-289, score-0.056]
93 By Claim 1, ||AT XT X(i) − bi||2 > λ1/2 implies that B(i) ≠ 0, which further implies si = (B(i))T/||B(i)||2. [sent-291, score-0.082]
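Putting the two directions together, the characterization being invoked (Claim 1) can be restated compactly, with bi as defined above; a hedged restatement in display form:

```latex
% Claim 1 (zero-row characterization from the group-lasso subgradient conditions), restated:
B^{(i)} = 0
\;\Longleftrightarrow\;
\bigl\| A^{T} X^{T} X^{(i)} - b_{i} \bigr\|_{2} \,\le\, \frac{\lambda_{1}}{2},
\qquad i = 1, \dots, p.
```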
94 8 Conclusion In this paper, we have elucidated several connections between two recently-popular matrix decomposition methods that adopt very different perspectives on obtaining interpretable low-rank matrix decompositions. [sent-293, score-0.103]
95 On the other hand, CUR methods operate by using randomness and approximation as computational resources to optimize approximately an intractable objective, thereby implicitly incorporating a form of regularization into the steps of the approximation algorithm. [sent-296, score-0.048]
96 A direct formulation for sparse PCA using semidefinite programming. [sent-316, score-0.065]
97 Applications of the lasso and grouped lasso to the estimation of sparse graphical models. [sent-343, score-0.159]
98 Less is more: Compact matrix decomposition for large sparse graphs. [sent-462, score-0.101]
99 A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. [sent-473, score-0.111]
100 Model selection and estimation in regression with grouped variables. [sent-478, score-0.051]
wordName wordTfidf (topN-words)
[('cur', 0.774), ('spca', 0.39), ('gl', 0.249), ('eg', 0.172), ('err', 0.122), ('vcur', 0.114), ('pca', 0.105), ('vk', 0.095), ('zou', 0.086), ('columns', 0.066), ('mahoney', 0.064), ('decompositions', 0.062), ('lcol', 0.057), ('sparsity', 0.052), ('sparse', 0.051), ('xi', 0.05), ('rows', 0.047), ('reconstruction', 0.047), ('bt', 0.047), ('xt', 0.046), ('drineas', 0.046), ('uvt', 0.046), ('xww', 0.046), ('lasso', 0.044), ('bi', 0.042), ('si', 0.04), ('stanford', 0.039), ('randomized', 0.039), ('genes', 0.038), ('matrix', 0.034), ('bien', 0.034), ('errreg', 0.034), ('singular', 0.034), ('regression', 0.031), ('microarray', 0.03), ('rp', 0.029), ('witten', 0.028), ('combinatorial', 0.027), ('rank', 0.027), ('principal', 0.026), ('penalty', 0.025), ('tibshirani', 0.025), ('svd', 0.025), ('ya', 0.024), ('wi', 0.024), ('leverage', 0.023), ('unstructured', 0.023), ('tissue', 0.023), ('nielsen', 0.023), ('nystr', 0.023), ('sparsify', 0.023), ('udvt', 0.023), ('xvcur', 0.023), ('xvv', 0.023), ('xwat', 0.023), ('scores', 0.023), ('et', 0.022), ('ik', 0.022), ('viewpoint', 0.021), ('xb', 0.02), ('jolliffe', 0.02), ('biologists', 0.02), ('hat', 0.02), ('encourages', 0.02), ('grouped', 0.02), ('implicitly', 0.02), ('entirely', 0.019), ('cast', 0.019), ('interpretable', 0.019), ('optimization', 0.019), ('zeros', 0.018), ('orthogonal', 0.018), ('constraint', 0.018), ('jacob', 0.018), ('variants', 0.017), ('subgradient', 0.017), ('aspremont', 0.017), ('group', 0.017), ('errors', 0.016), ('reg', 0.016), ('recall', 0.016), ('decomposition', 0.016), ('rn', 0.016), ('interpreting', 0.016), ('nonzero', 0.016), ('signal', 0.015), ('iii', 0.015), ('explains', 0.015), ('convex', 0.014), ('attains', 0.014), ('xu', 0.014), ('gene', 0.014), ('approximation', 0.014), ('exclude', 0.014), ('graduate', 0.014), ('claim', 0.014), ('plot', 0.014), ('formulation', 0.014), ('theorem', 0.014), ('rc', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 45 nips-2010-CUR from a Sparse Optimization Viewpoint
Author: Jacob Bien, Ya Xu, Michael W. Mahoney
Abstract: The CUR decomposition provides an approximation of a matrix X that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of X. In this regard, it appears to be similar to many sparse PCA methods. However, CUR takes a randomized algorithmic approach, whereas most sparse PCA methods are framed as convex optimization problems. In this paper, we try to understand CUR from a sparse optimization viewpoint. We show that CUR is implicitly optimizing a sparse regression objective and, furthermore, cannot be directly cast as a sparse PCA method. We also observe that the sparsity attained by CUR possesses an interesting structure, which leads us to formulate a sparse PCA method that achieves a CUR-like sparsity.
2 0.080380134 27 nips-2010-Agnostic Active Learning Without Constraints
Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu
Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification. 1
3 0.077775456 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
4 0.077278726 221 nips-2010-Random Projections for $k$-means Clustering
Author: Christos Boutsidis, Anastasios Zouzias, Petros Drineas
Abstract: This paper discusses the topic of dimensionality reduction for k-means clustering. We prove that any set of n points in d dimensions (rows in a matrix A ∈ Rn×d) can be projected into t = Ω(k/ε2) dimensions, for any ε ∈ (0, 1/3), in O(nd⌈ε−2 k/ log(d)⌉) time, such that with constant probability the optimal k-partition of the point set is preserved within a factor of 2 + ε. The projection is done by post-multiplying A with a d × t random matrix R having entries +1/√t or −1/√t with equal probability. A numerical implementation of our technique and experiments on a large face images dataset verify the speed and the accuracy of our theoretical results.
5 0.057855163 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
6 0.056544863 231 nips-2010-Robust PCA via Outlier Pursuit
7 0.053926479 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
8 0.051671468 265 nips-2010-The LASSO risk: asymptotic results and real world examples
9 0.049166683 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
10 0.045747157 117 nips-2010-Identifying graph-structured activation patterns in networks
11 0.044119533 217 nips-2010-Probabilistic Multi-Task Feature Selection
12 0.042290289 98 nips-2010-Functional form of motion priors in human motion perception
13 0.042133026 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
14 0.041804709 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
15 0.039608195 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
16 0.039193355 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
17 0.038520977 5 nips-2010-A Dirty Model for Multi-task Learning
18 0.036706168 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
19 0.036617443 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
20 0.036428783 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
topicId topicWeight
[(0, 0.104), (1, 0.02), (2, 0.055), (3, 0.056), (4, 0.034), (5, -0.079), (6, -0.008), (7, 0.051), (8, -0.059), (9, -0.029), (10, 0.051), (11, -0.005), (12, -0.003), (13, 0.044), (14, 0.034), (15, -0.014), (16, -0.001), (17, 0.019), (18, -0.001), (19, 0.029), (20, 0.056), (21, 0.039), (22, -0.003), (23, 0.012), (24, -0.025), (25, 0.01), (26, 0.008), (27, -0.052), (28, -0.002), (29, -0.081), (30, -0.002), (31, -0.022), (32, -0.064), (33, 0.033), (34, 0.053), (35, -0.04), (36, 0.009), (37, -0.059), (38, -0.01), (39, -0.0), (40, -0.009), (41, 0.039), (42, 0.07), (43, 0.033), (44, -0.002), (45, -0.023), (46, 0.066), (47, 0.001), (48, -0.009), (49, -0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.87588686 45 nips-2010-CUR from a Sparse Optimization Viewpoint
Author: Jacob Bien, Ya Xu, Michael W. Mahoney
Abstract: The CUR decomposition provides an approximation of a matrix X that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of X. In this regard, it appears to be similar to many sparse PCA methods. However, CUR takes a randomized algorithmic approach, whereas most sparse PCA methods are framed as convex optimization problems. In this paper, we try to understand CUR from a sparse optimization viewpoint. We show that CUR is implicitly optimizing a sparse regression objective and, furthermore, cannot be directly cast as a sparse PCA method. We also observe that the sparsity attained by CUR possesses an interesting structure, which leads us to formulate a sparse PCA method that achieves a CUR-like sparsity.
2 0.65775365 221 nips-2010-Random Projections for $k$-means Clustering
Author: Christos Boutsidis, Anastasios Zouzias, Petros Drineas
Abstract: This paper discusses the topic of dimensionality reduction for k-means clustering. We prove that any set of n points in d dimensions (rows in a matrix A ∈ Rn×d) can be projected into t = Ω(k/ε2) dimensions, for any ε ∈ (0, 1/3), in O(nd⌈ε−2 k/ log(d)⌉) time, such that with constant probability the optimal k-partition of the point set is preserved within a factor of 2 + ε. The projection is done by post-multiplying A with a d × t random matrix R having entries +1/√t or −1/√t with equal probability. A numerical implementation of our technique and experiments on a large face images dataset verify the speed and the accuracy of our theoretical results.
3 0.62667024 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
4 0.56652832 172 nips-2010-Multi-Stage Dantzig Selector
Author: Ji Liu, Peter Wonka, Jieping Ye
Abstract: We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ Rn×m (m ≫ n) and a noisy observation vector y ∈ Rn satisfying y = Xβ∗ + ε where ε is the noise vector following a Gaussian distribution N(0, σ2 I), how to recover the signal (or parameter vector) β∗ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β∗. We show that if X obeys a certain condition, then with a large probability the difference between the solution β̂ estimated by the proposed method and the true solution β∗ measured in terms of the lp norm (p ≥ 1) is bounded as ||β̂ − β∗||p ≤ (C(s − N)1/p √(log m) + ∆)σ, where C is a constant, s is the number of nonzero entries in β∗, ∆ is independent of m and is much smaller than the first term, and N is the number of entries of β∗ larger than a certain value in the order of O(σ√(log m)). The proposed method improves the estimation bound of the standard Dantzig selector approximately from Cs1/p √(log m)σ to C(s − N)1/p √(log m)σ, where the value N depends on the number of large entries in β∗. When N = s, the proposed algorithm achieves the oracle solution with a high probability. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector. 1
5 0.55077487 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
6 0.52164996 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints
7 0.51554108 148 nips-2010-Learning Networks of Stochastic Differential Equations
8 0.51471782 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
9 0.50838995 5 nips-2010-A Dirty Model for Multi-task Learning
10 0.50330043 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
11 0.49829003 231 nips-2010-Robust PCA via Outlier Pursuit
12 0.49749157 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
13 0.48935077 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
14 0.48366147 110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection
15 0.46053079 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
16 0.45537627 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
17 0.45370603 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
18 0.44825402 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
19 0.43887189 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
20 0.43695083 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
topicId topicWeight
[(13, 0.459), (27, 0.033), (30, 0.044), (35, 0.04), (45, 0.133), (50, 0.035), (52, 0.034), (60, 0.016), (77, 0.036), (78, 0.013), (79, 0.014), (90, 0.023)]
simIndex simValue paperId paperTitle
1 0.90913624 259 nips-2010-Subgraph Detection Using Eigenvector L1 Norms
Author: Benjamin Miller, Nadya Bliss, Patrick J. Wolfe
Abstract: When working with network datasets, the theoretical framework of detection theory for Euclidean vector spaces no longer applies. Nevertheless, it is desirable to determine the detectability of small, anomalous graphs embedded into background networks with known statistical properties. Casting the problem of subgraph detection in a signal processing context, this article provides a framework and empirical results that elucidate a “detection theory” for graph-valued data. Its focus is the detection of anomalies in unweighted, undirected graphs through L1 properties of the eigenvectors of the graph’s so-called modularity matrix. This metric is observed to have relatively low variance for certain categories of randomly-generated graphs, and to reveal the presence of an anomalous subgraph with reasonable reliability when the anomaly is not well-correlated with stronger portions of the background graph. An analysis of subgraphs in real network datasets confirms the efficacy of this approach. 1
2 0.85531235 192 nips-2010-Online Classification with Specificity Constraints
Author: Andrey Bernstein, Shie Mannor, Nahum Shimkin
Abstract: We consider the online binary classification problem, where we are given m classifiers. At each stage, the classifiers map the input to the probability that the input belongs to the positive class. An online classification meta-algorithm is an algorithm that combines the outputs of the classifiers in order to attain a certain goal, without having prior knowledge on the form and statistics of the input, and without prior knowledge on the performance of the given classifiers. In this paper, we use sensitivity and specificity as the performance metrics of the meta-algorithm. In particular, our goal is to design an algorithm that satisfies the following two properties (asymptotically): (i) its average false positive rate (fp-rate) is under some given threshold; and (ii) its average true positive rate (tp-rate) is not worse than the tp-rate of the best convex combination of the m given classifiers that satisfies fprate constraint, in hindsight. We show that this problem is in fact a special case of the regret minimization problem with constraints, and therefore the above goal is not attainable. Hence, we pose a relaxed goal and propose a corresponding practical online learning meta-algorithm that attains it. In the case of two classifiers, we show that this algorithm takes a very simple form. To our best knowledge, this is the first algorithm that addresses the problem of the average tp-rate maximization under average fp-rate constraints in the online setting. 1
same-paper 3 0.83762741 45 nips-2010-CUR from a Sparse Optimization Viewpoint
Author: Jacob Bien, Ya Xu, Michael W. Mahoney
Abstract: The CUR decomposition provides an approximation of a matrix X that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of X. In this regard, it appears to be similar to many sparse PCA methods. However, CUR takes a randomized algorithmic approach, whereas most sparse PCA methods are framed as convex optimization problems. In this paper, we try to understand CUR from a sparse optimization viewpoint. We show that CUR is implicitly optimizing a sparse regression objective and, furthermore, cannot be directly cast as a sparse PCA method. We also observe that the sparsity attained by CUR possesses an interesting structure, which leads us to formulate a sparse PCA method that achieves a CUR-like sparsity.
4 0.80117559 146 nips-2010-Learning Multiple Tasks using Manifold Regularization
Author: Arvind Agarwal, Samuel Gerber, Hal Daume
Abstract: We present a novel method for multitask learning (MTL) based on manifold regularization: assume that all task parameters lie on a manifold. This is the generalization of a common assumption made in the existing literature: task parameters share a common linear subspace. One proposed method uses the projection distance from the manifold to regularize the task parameters. The manifold structure and the task parameters are learned using an alternating optimization framework. When the manifold structure is fixed, our method decomposes across tasks which can be learnt independently. An approximation of the manifold regularization scheme is presented that preserves the convexity of the single task learning problem, and makes the proposed MTL framework efficient and easy to implement. We show the efficacy of our method on several datasets. 1
5 0.78728563 221 nips-2010-Random Projections for $k$-means Clustering
Author: Christos Boutsidis, Anastasios Zouzias, Petros Drineas
Abstract: This paper discusses the topic of dimensionality reduction for k-means clustering. We prove that any set of n points in d dimensions (rows in a matrix A ∈ Rn×d) can be projected into t = Ω(k/ε2) dimensions, for any ε ∈ (0, 1/3), in O(nd⌈ε−2 k/ log(d)⌉) time, such that with constant probability the optimal k-partition of the point set is preserved within a factor of 2 + ε. The projection is done by post-multiplying A with a d × t random matrix R having entries +1/√t or −1/√t with equal probability. A numerical implementation of our technique and experiments on a large face images dataset verify the speed and the accuracy of our theoretical results.
6 0.78500509 284 nips-2010-Variational bounds for mixed-data factor analysis
7 0.76616049 261 nips-2010-Supervised Clustering
8 0.67409331 262 nips-2010-Switched Latent Force Models for Movement Segmentation
9 0.66782302 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints
10 0.5984087 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
11 0.59557724 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
12 0.58069986 226 nips-2010-Repeated Games against Budgeted Adversaries
13 0.57330835 110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection
14 0.56303972 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
15 0.55877936 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
16 0.55822611 166 nips-2010-Minimum Average Cost Clustering
17 0.55493641 117 nips-2010-Identifying graph-structured activation patterns in networks
18 0.54654479 265 nips-2010-The LASSO risk: asymptotic results and real world examples
19 0.54626954 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
20 0.54597425 182 nips-2010-New Adaptive Algorithms for Online Classification