nips nips2011 nips2011-172 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh
Abstract: We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc., is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions: 1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. 2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. 3. We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition.
Reference: text
sentIndex sentText sentNum sentScore
1 School of Computer Science and Department of Statistics, Carnegie Mellon University. Abstract: We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. [sent-9, score-0.232]
2 This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc., is commonly referred to as biclustering or co-clustering. [sent-10, score-0.872]
3 We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. [sent-14, score-1.402]
4 We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. [sent-16, score-0.219]
5 We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition. [sent-18, score-1.111]
6 1 Introduction Biclustering is the problem of identifying a (typically) sparse set of relevant columns and rows in a large, noisy data matrix. [sent-19, score-0.219]
7 In these applications, the data matrix is often indexed by (object, feature) pairs and the goal is to identify clusters in this set of bipartite variables. [sent-22, score-0.141]
8 Thus, while clustering aims to identify global structure in the data, biclustering takes a more local approach by jointly clustering both objects and features. [sent-28, score-0.7]
9 In order to study, understand and compare biclustering algorithms we consider a simple theoretical model of biclustering [18, 17, 26]. [sent-30, score-1.148]
10 Notice that the magnitude of the signal for all the coordinates in the bicluster K1 × K2 is β/√(k1 k2). [sent-40, score-0.657]
11 The parameter β measures the strength of the signal and is the key quantity we will be studying. [sent-41, score-0.157]
12 We focus on the case of a single bicluster that appears as an elevated sub-matrix of size k1 × k2 with signal strength β, embedded in a large n1 × n2 data matrix with entries corrupted by additive Gaussian noise with variance σ². [sent-42, score-0.974]
13 Under this model, the biclustering problem is formulated as the problem of estimating the sets K1 and K2, based on a single noisy observation A of the unknown signal matrix βuv′. [sent-43, score-0.794]
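To make the model concrete, here is a minimal simulation sketch of the setting just described: a k1 × k2 elevated sub-matrix with signal strength β, hidden in an n1 × n2 matrix of Gaussian noise. This is an illustrative reading of model (1), not the authors' code; the function and variable names (simulate_bicluster, beta, sigma) are assumptions.

```python
import numpy as np

def simulate_bicluster(n1, n2, k1, k2, beta, sigma=1.0, seed=0):
    """Draw A = beta * u v' + sigma * noise, where u and v are unit-norm
    indicator vectors of the hidden row set K1 and column set K2."""
    rng = np.random.default_rng(seed)
    K1 = rng.choice(n1, size=k1, replace=False)      # hidden relevant rows
    K2 = rng.choice(n2, size=k2, replace=False)      # hidden relevant columns
    u = np.zeros(n1); u[K1] = 1.0 / np.sqrt(k1)      # sparse left factor
    v = np.zeros(n2); v[K2] = 1.0 / np.sqrt(k2)      # sparse right factor
    A = beta * np.outer(u, v) + sigma * rng.standard_normal((n1, n2))
    return A, set(K1.tolist()), set(K2.tolist())

# Each bicluster entry then has mean beta / sqrt(k1 * k2), matching the text.
A, K1, K2 = simulate_bicluster(n1=100, n2=100, k1=10, k2=10, beta=30.0)
```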
14 Biclustering is most subtle when the matrix is large with several irrelevant variables, the entries are highly noisy, and the bicluster is small as defined by a sparse set of rows/columns. [sent-44, score-0.627]
15 We provide a sharp characterization of the tuples (β, n1, n2, k1, k2, σ²) under which it is possible to recover the bicluster, and we study several common methods and establish the regimes under which they succeed. [sent-45, score-0.601]
16 We establish minimax lower and upper bounds for the following class of models. [sent-46, score-0.212]
17 We derive a lower bound that identifies tuples (β, n1, n2, k1, k2, σ²) under which we can recover the true biclustering from a noisy high-dimensional matrix. [sent-49, score-0.719]
18 We show that a combinatorial procedure based on the scan statistic achieves the minimax optimal limits; however, it is impractical, as it requires enumerating all possible sub-matrices of a given size in a large matrix. [sent-50, score-0.405]
19 We characterize the SNR (i.e., the relation between β and (n1, n2, k1, k2, σ²)) under which some computationally tractable procedures for biclustering, including element-wise thresholding, column/row average thresholding and sparse singular vector decomposition (SSVD), succeed with high probability. [sent-53, score-1.155]
20 We consider the detection of both small and large biclusters of weak activation, and show that at the minimax scaling the problem is surprisingly subtle (e. [sent-54, score-0.328]
21 Element-wise thresholding does not take advantage of any structure in the data matrix and hence does not achieve the minimax scaling for any bicluster size. [sent-61, score-0.913]
22 If the clusters are big enough, row/column averaging performs better than element-wise thresholding since it can take advantage of structure. [sent-62, score-0.352]
23 We also study a convex relaxation for sparse SVD, based on the DSPCA algorithm proposed by [11], which encourages the singular vectors of the matrix to be supported over a sparse set of variables. [sent-63, score-0.358]
24 However, despite the increasing popularity of this method, we show that it is only guaranteed to yield a sparse set of singular vectors when the SNR is quite high, equivalent to element-wise thresholding, and fails for stronger scalings of the SNR. [sent-64, score-0.266]
25 1.1 Related work Due to its practical importance and difficulty, biclustering has attracted considerable attention (for some recent surveys see [9, 27, 20, 22]). [sent-66, score-0.574]
26 Broadly, algorithms for biclustering can be categorized as either score-based searches or spectral algorithms. [sent-67, score-0.574]
27 Many of the proposed algorithms for identifying relevant clusters are based on heuristic searches whose goal is to identify large average sub-matrices or sub-matrices that are well fit by a two-way ANOVA model. [sent-68, score-0.182]
28 Therefore, A′A is a spiked covariance matrix, and it is possible to use the existing theoretical results on the first eigenvalue to characterize the left singular vector of A. [sent-76, score-0.248]
29 For biclustering applications, the assumption that exactly one of u or v is random is not justifiable; therefore, theoretical results for the spiked covariance model do not translate directly. [sent-78, score-0.687]
30 Our setup for the biclustering problem also falls in the framework of structured normal means multiple hypothesis testing problems, where for each entry in the matrix the hypotheses are that the entry has mean 0 versus an elevated mean. [sent-81, score-0.795]
31 The presence of a bicluster (sub-matrix), however, imposes structure on which elements are elevated concurrently. [sent-82, score-0.618]
32 For example, [5] consider the detection of elevated intervals and other parametric structures along an ordered line or grid; [4] consider detection of elevated connected paths in tree and lattice topologies; [3] considers nonparametric cluster structures in a regular grid. [sent-84, score-0.422]
33 In addition, [1] consider testing of different elevated structures in a general but known graph topology. [sent-85, score-0.183]
34 Our setup for the biclustering problem requires identification of an elevated submatrix in an unordered matrix. [sent-86, score-0.727]
35 However, computationally efficient procedures that achieve the minimax SNR thresholds are only known for a few of these problems. [sent-88, score-0.265]
36 Our results for biclustering have a similar flavor, in that the minimax threshold requires a combinatorial procedure whereas the computationally efficient procedures we investigate are often sub-optimal. [sent-89, score-1.026]
37 In Section 2, we provide a lower bound on the minimum signal strength needed for successfully identifying the bicluster. [sent-91, score-0.467]
38 Section 3 presents a combinatorial procedure which achieves the lower bound and hence is minimax optimal. [sent-92, score-0.352]
39 We investigate some computationally efficient procedures in Section 4. [sent-93, score-0.142]
40 2 Lower bound In this section, we derive a lower bound for the problem of identifying the correct bicluster, indexed by K1 and K2 , in model (1). [sent-96, score-0.175]
41 Intuitively, if either the signal-to-noise ratio β/σ or the cluster size is small, the minimum signal strength needed will be high, since it is harder to distinguish the bicluster from the noise. [sent-98, score-0.858]
42 The result can be interpreted in the following way: for any biclustering procedure, if β0 ≤ β_min, then there exists some element in the model class Θ(β0, k1, k2) such that the probability of incorrectly identifying the sets K1 and K2 is bounded away from zero. [sent-105, score-0.714]
43 3 Minimax optimal combinatorial procedure We now investigate a combinatorial procedure achieving the lower bound of Theorem 1, in the sense that, for any θ ∈ Θ(β_min, k1, k2), the probability of recovering the true bicluster (K1, K2) tends to one as n1 and n2 grow unbounded. [sent-112, score-0.889]
44 This scan procedure consists of enumerating all possible pairs of subsets of the row and column indexes of size k1 and k2, respectively, and choosing the one whose corresponding submatrix has the largest overall sum. [sent-113, score-0.23]
45 The estimated bicluster is the pair of subsets of sizes k1 and k2 achieving the highest score S(K̃1, K̃2) := Σ_{i∈K̃1} Σ_{j∈K̃2} a_ij, that is, ψ(A) := argmax_{(K̃1, K̃2)} S(K̃1, K̃2) subject to |K̃1| = k1, |K̃2| = k2 (6). [sent-115, score-0.492]
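For illustration, a brute-force version of the scan estimator in (6) can be written directly: enumerate all size-(k1, k2) row/column subsets and keep the pair with the largest submatrix sum. This sketch is only feasible for toy sizes, which is exactly the impracticality discussed below; the function name and structure are assumptions, not the authors' implementation.

```python
import itertools
import numpy as np

def scan_estimator(A, k1, k2):
    """Exhaustive scan: return the row/column index sets of sizes k1 and k2
    whose submatrix has the largest total sum (cost ~ C(n1,k1) * C(n2,k2))."""
    n1, n2 = A.shape
    best, best_score = None, -np.inf
    for rows in itertools.combinations(range(n1), k1):
        col_sums = A[list(rows), :].sum(axis=0)       # sums over candidate rows
        for cols in itertools.combinations(range(n2), k2):
            score = col_sums[list(cols)].sum()        # S(K1~, K2~)
            if score > best_score:
                best, best_score = (set(rows), set(cols)), score
    return best

# Only run this on very small matrices, e.g. a 12 x 12 matrix with k1 = k2 = 2.
```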
46 The following theorem determines the signal strength β required for the scan procedure to recover the bicluster. [sent-116, score-0.384]
47 Comparing to the lower bound in Theorem 1, we observe that the combinatorial procedure, using the decoder that searches over all possible clusters and chooses the one with the largest score, achieves the lower bound up to constants. [sent-120, score-0.399]
48 The combinatorial procedure requires the signal to be positive, but not necessarily constant throughout the bicluster. [sent-122, score-0.303]
49 In fact, it is easy to see that, provided the average signal in the bicluster is larger than that stipulated by the theorem, this procedure succeeds with high probability irrespective of how the signal is distributed across the bicluster. [sent-123, score-0.972]
50 Establishing minimax lower bounds and a procedure that adapts to unknown k1 and k2 is an open problem. [sent-125, score-0.249]
51 4 Computationally efficient biclustering procedures In this section we investigate the performance of various procedures for biclustering that, unlike the optimal scan statistic procedure studied in the previous section, are computationally tractable. [sent-126, score-0.932]
52 For each of these procedures, however, computational ease comes at the cost of suboptimal performance: recovery of the true bicluster is only possible if the signal strength β is much larger than the minimax signal strength of Theorem 1. [sent-127, score-1.05]
53 4.1 Element-wise thresholding The simplest procedure that we analyze is based on element-wise thresholding. [sent-129, score-0.302]
54 The bicluster is estimated as ψ_thr(A, τ) := {(i, j) ∈ [n1] × [n2] : |a_ij| ≥ τ} (8), where τ > 0 is a parameter. [sent-130, score-0.533]
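A sketch of the element-wise thresholding estimator (8) follows: keep every index pair whose entry exceeds τ in absolute value. The commented choice of τ is one plausible noise-driven setting and is an assumption, not a verbatim transcription of the theorem.

```python
import numpy as np

def threshold_estimator(A, tau):
    """Element-wise thresholding (8): the estimated bicluster is the set of
    index pairs (i, j) with |a_ij| >= tau."""
    rows, cols = np.nonzero(np.abs(A) >= tau)
    return set(zip(rows.tolist(), cols.tolist()))

# Illustrative threshold scaling with the noise level sigma (an assumption):
# tau = sigma * np.sqrt(2.0 * np.log(A.size))
```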
55 The following theorem characterizes the signal strength β required for element-wise thresholding to succeed in recovering the bicluster. [sent-131, score-0.691]
56 If β ≥ √(k1 k2) σ (√(2 log(k1 k2)) + √(2 log((n1 − k1)(n2 − k2) + k1(n2 − k2) + k2(n1 − k1)))), then ψ_thr(A, τ) recovers the bicluster with high probability. [sent-135, score-0.135]
57 Comparing Theorem 3 with the lower bound in Theorem 1, we observe that the signal strength needs to be O(max(√k1, √k2)) larger than the lower bound. [sent-137, score-0.412]
58 This is not surprising, since element-wise thresholding does not exploit the structure of the problem but assumes that the large elements of the matrix A are positioned randomly. [sent-138, score-0.237]
59 If β ≤ c √(k1 k2) σ (√(2 log(k1 k2)) + √(2 log((n1 − k1)(n2 − k2) + k1(n2 − k2) + k2(n1 − k1)))) for a small enough constant c, then thresholding will no longer recover the bicluster with high probability. [sent-141, score-0.81]
60 It is also worth noting that thresholding does not require the signal in the bicluster to be either constant or positive, provided it is larger in magnitude, at every entry, than the threshold specified in the theorem. [sent-142, score-0.888]
61 4.2 Row/Column averaging Next, we analyze a procedure based on column and row averaging. [sent-144, score-0.23]
62 When the bicluster is large, this procedure exploits the structure of the problem and outperforms both simple element-wise thresholding and the sparse SVD, which is discussed in the following section. [sent-145, score-0.838]
63 The averaging procedure works well only if the bicluster is “large”, as specified below, since otherwise the row or column average is dominated by the noise. [sent-146, score-0.688]
64 More precisely, the averaging procedure computes the average of each row and column of A and outputs the k1 rows and k2 columns with the largest average. [sent-147, score-0.241]
65 The bicluster is then estimated as ψ_avg(A) := {i ∈ [n1] : r_{r,i} is among the k1 largest row averages} × {j ∈ [n2] : r_{c,j} is among the k2 largest column averages} (9). [sent-149, score-0.492]
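A sketch of the averaging estimator (9): compute every row and column mean and keep the k1 rows and k2 columns with the largest means. Function and variable names are illustrative assumptions.

```python
import numpy as np

def averaging_estimator(A, k1, k2):
    """Row/column averaging (9): keep the k1 rows and k2 columns with the
    largest averages."""
    row_means = A.mean(axis=1)
    col_means = A.mean(axis=0)
    K1_hat = set(np.argsort(row_means)[-k1:].tolist())   # top-k1 rows
    K2_hat = set(np.argsort(col_means)[-k2:].tolist())   # top-k2 columns
    return K1_hat, K2_hat
```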
66 The following theorem characterizes the signal strength β required for the averaging procedure to succeed in recovering the bicluster. [sent-150, score-0.488]
67 Theorem 4 (stated for biclusters with k1 ∝ n1^{1/2+α} and k2 ∝ n2^{1/2+α} for some α > 0). [sent-151, score-0.154]
68 Comparing to Theorem 3, we observe that averaging requires a lower signal strength than element-wise thresholding when the bicluster is large, that is, k1 = Ω(√n1) and k2 = Ω(√n2). [sent-155, score-1.166]
69 Unless both k1 = O(n1) and k2 = O(n2), the procedure does not achieve the lower bound of Theorem 1; however, the procedure is simple and computationally efficient. [sent-156, score-0.251]
70 It is also not hard to show that this theorem is sharp in its characterization of the averaging procedure. [sent-157, score-0.178]
71 Further, unlike thresholding, averaging requires the signal to be positive in the bicluster. [sent-158, score-0.254]
72 It is interesting to note that a large bicluster can also be identified without assuming normality of the noise matrix. [sent-159, score-0.526]
73 4.3 Sparse singular value decomposition (SSVD) An alternate way to estimate K1 and K2 would be based on the singular value decomposition (SVD), i.e., on the leading singular vectors ũ and ṽ of A. [sent-162, score-0.25]
74 Unfortunately, such a method would perform poorly when the signal is weak and the dimensionality is high, since, due to the accumulation of noise, ũ and ṽ are poor estimates of u and v and do not exploit the fact that u and v are sparse. [sent-165, score-0.216]
75 In fact, it is now well understood [8] that SVD is strongly inconsistent when the signal strength is weak, i. [sent-166, score-0.322]
76 Based on the sparse singular vectors û and v̂, we estimate the bicluster as K̂1 = {j ∈ [n1] : ûj ≠ 0} and K̂2 = {j ∈ [n2] : v̂j ≠ 0} (12). [sent-175, score-0.704]
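The convex relaxation (11) analyzed in the text is a DSPCA-style semidefinite program; as a lightweight stand-in that only mimics its output format, the sketch below obtains approximately sparse singular vectors by soft-thresholded power iterations and then reads off their supports as in (12). This is an assumption-laden illustration, not the procedure whose theory is stated here, and the sparsity parameter lam is hypothetical.

```python
import numpy as np

def sparse_svd_supports(A, lam, n_iter=200):
    """Alternating power iterations with soft-thresholding to obtain sparse
    left/right vectors, then return their supports as the bicluster estimate.
    NOTE: an illustrative surrogate for the convex relaxation (11)."""
    n1, n2 = A.shape
    v = np.ones(n2) / np.sqrt(n2)
    u = np.zeros(n1)
    for _ in range(n_iter):
        u = A @ v
        u = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)   # soft-threshold
        if np.linalg.norm(u) > 0:
            u /= np.linalg.norm(u)
        v = A.T @ u
        v = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)   # soft-threshold
        if np.linalg.norm(v) > 0:
            v /= np.linalg.norm(v)
    K1_hat = set(np.flatnonzero(u).tolist())   # support of u-hat, as in (12)
    K2_hat = set(np.flatnonzero(v).tolist())   # support of v-hat
    return K1_hat, K2_hat
```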
77 The user-defined parameter controls the sparsity of the solution X̂ and, therefore, provided the solution is of rank one, it also controls the sparsity of the vectors û and v̂ and of the estimated bicluster. [sent-176, score-0.186]
78 The following theorem provides sufficient conditions for the solution X̂ to be of rank one and to recover the bicluster. [sent-177, score-0.148]
79 It is worth noting that SSVD correctly recovers signed vectors u and v under this signal strength. [sent-183, score-0.198]
80 If β² ≤ c σ² k1 k2 log max(n1 − k1, n2 − k2) (14), then the optimization problem (11) does not have a rank-1 solution that correctly recovers the sparsity pattern, with probability at least 1 − O(exp(−(√k1 + √k2)²)), for sufficiently large n1 and n2. [sent-190, score-0.146]
81 The signal strength needs to be of the same order as for the element-wise thresholding, which is somewhat surprising since from the formulation of the SSVD optimization problem it seems that the procedure uses the structure of the problem. [sent-193, score-0.387]
82 From numerical simulations in Section 5 we observe that although SSVD requires the same scaling as thresholding, it consistently performs slightly better at a fixed signal strength. [sent-194, score-0.225]
83 5 Simulation results We test the performance of the three computationally efficient procedures on synthetic data: thresholding, averaging and sparse SVD. [sent-195, score-0.288]
84 We plot the Hamming fraction (the Hamming distance between s_u and ŝ_u, rescaled to be between 0 and 1) against the rescaled sample size. [sent-200, score-0.17]
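A small helper for the evaluation metric just described, under the assumption that the curves report the fraction of coordinates on which the true and estimated support indicators disagree (the paper's exact normalization to [0, 1] may differ).

```python
import numpy as np

def hamming_fraction(true_support, est_support, n):
    """Fraction of the n coordinates on which the true and estimated support
    indicator vectors disagree (0 means perfect recovery)."""
    s_true = np.zeros(n, dtype=bool)
    s_true[list(true_support)] = True
    s_est = np.zeros(n, dtype=bool)
    s_est[list(est_support)] = True
    return float(np.mean(s_true != s_est))
```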
85 For thresholding and sparse SVD the rescaled scaling (x-axis) is β√k/√(log(n − k)), and for averaging the rescaled scaling (x-axis) is β√k n^α/√(log(n − k)). [sent-202, score-0.531]
86 We observe that there is a sharp threshold between success and failure of the algorithms, and the curves show good agreement with our theory. [sent-203, score-0.164]
87 We can make a direct comparison between thresholding and sparse SVD (since the curves are identically rescaled) to see that at least empirically sparse SVD succeeds at a smaller scaling constant than thresholding even though their asymptotic rates are identical. [sent-205, score-0.625]
89 Figure 1: Thresholding: Hamming fraction versus rescaled signal strength (curves for n = 100, 200, 300, 400, 500). [sent-219, score-0.513]
91 Figure 2: Averaging: Hamming fraction versus rescaled signal strength (curves for n = 100, 200, 300, 400, 500). [sent-230, score-0.513]
93 Figure 3: Sparse SVD: Hamming fraction versus rescaled signal strength (curves for n = 100, 200, 300, 400, 500). [sent-250, score-0.356]
94 6 Discussion In this paper, we analyze biclustering using a simple statistical model (1), where a sparse rank one matrix is perturbed with noise. [sent-251, score-0.765]
95 Using this model, we have characterized the minimal signal strength below which no procedure can succeed in recovering the bicluster. [sent-252, score-0.471]
96 However, it is still an open problem to find a computationally efficient procedure that is minimax optimal. [sent-254, score-0.26]
97 [2] analyze the convex relaxation procedure proposed in [11] for high-dimensional sparse PCA. [sent-257, score-0.211]
98 Under the minimax scaling for this problem, they show that, provided a rank-1 solution exists, it has the desired sparsity pattern (they were, however, not able to show that a rank-1 solution exists with high probability). [sent-258, score-0.258]
99 The singular values and vectors of low rank perturbations of large rectangular random matrices. [sent-312, score-0.179]
100 A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. [sent-443, score-0.167]
wordName wordTfidf (topN-words)
[('biclustering', 0.574), ('bicluster', 0.492), ('thresholding', 0.203), ('ssvd', 0.185), ('signal', 0.165), ('strength', 0.157), ('minimax', 0.144), ('hamming', 0.129), ('elevated', 0.126), ('svd', 0.124), ('singular', 0.101), ('spiked', 0.09), ('averaging', 0.089), ('rescaled', 0.085), ('fraction', 0.08), ('sparse', 0.078), ('identifying', 0.075), ('combinatorial', 0.073), ('procedures', 0.07), ('procedure', 0.065), ('snr', 0.063), ('theorem', 0.062), ('ery', 0.062), ('malware', 0.062), ('clusters', 0.06), ('principal', 0.055), ('scan', 0.054), ('succeed', 0.054), ('scalings', 0.054), ('computationally', 0.051), ('weak', 0.051), ('amini', 0.05), ('log', 0.047), ('emmanuel', 0.047), ('rank', 0.045), ('aij', 0.044), ('cluster', 0.044), ('enumerating', 0.042), ('biclusters', 0.041), ('thr', 0.041), ('decoder', 0.041), ('scaling', 0.04), ('lower', 0.04), ('genes', 0.037), ('clustering', 0.037), ('aarti', 0.036), ('testing', 0.035), ('gene', 0.035), ('matrix', 0.034), ('sparsity', 0.034), ('analyze', 0.034), ('relaxation', 0.034), ('tuples', 0.033), ('vectors', 0.033), ('signatures', 0.031), ('groups', 0.03), ('recovering', 0.03), ('bound', 0.03), ('drugs', 0.029), ('cand', 0.029), ('exhaustive', 0.029), ('detection', 0.029), ('threshold', 0.028), ('establish', 0.028), ('objects', 0.027), ('sharp', 0.027), ('submatrix', 0.027), ('statistic', 0.027), ('tr', 0.027), ('biological', 0.027), ('identi', 0.026), ('versus', 0.026), ('identify', 0.025), ('shen', 0.025), ('proteins', 0.025), ('banach', 0.025), ('decomposition', 0.024), ('ordered', 0.024), ('columns', 0.024), ('max', 0.023), ('subtle', 0.023), ('succeeds', 0.023), ('covariance', 0.023), ('recovery', 0.022), ('column', 0.022), ('handbook', 0.022), ('searches', 0.022), ('bipartite', 0.022), ('species', 0.022), ('structures', 0.022), ('investigate', 0.021), ('noisy', 0.021), ('recover', 0.021), ('rows', 0.021), ('establishing', 0.021), ('observe', 0.02), ('pca', 0.02), ('characterizes', 0.02), ('solution', 0.02), ('row', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
Author: Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh
Abstract: We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions 1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. 2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. 3. We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition. 1
2 0.11801832 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
3 0.097653143 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
Author: Nasser M. Nasrabadi, Trac D. Tran, Nam Nguyen
Abstract: This paper studies the problem of accurately recovering a sparse vector β from highly corrupted linear measurements y = Xβ + e + w where e is a sparse error vector whose nonzero entries may be unbounded and w is a bounded noise. We propose a so-called extended Lasso optimization which takes into consideration sparse prior information of both β and e . Our first result shows that the extended Lasso can faithfully recover both the regression and the corruption vectors. Our analysis is relied on a notion of extended restricted eigenvalue for the design matrix X. Our second set of results applies to a general class of Gaussian design matrix X with i.i.d rows N (0, Σ), for which we provide a surprising phenomenon: the extended Lasso can recover exact signed supports of both β and e from only Ω(k log p log n) observations, even the fraction of corruption is arbitrarily close to one. Our analysis also shows that this amount of observations required to achieve exact signed support is optimal. 1
4 0.091888897 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
5 0.078335926 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
6 0.068929031 259 nips-2011-Sparse Estimation with Structured Dictionaries
7 0.068838075 260 nips-2011-Sparse Features for PCA-Like Linear Regression
8 0.066045523 265 nips-2011-Sparse recovery by thresholded non-negative least squares
9 0.063048713 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
10 0.059740163 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
11 0.059110869 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
12 0.058152642 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements
13 0.055290826 209 nips-2011-Orthogonal Matching Pursuit with Replacement
14 0.053352814 145 nips-2011-Learning Eigenvectors for Free
15 0.052843761 225 nips-2011-Probabilistic amplitude and frequency demodulation
16 0.051143825 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
17 0.050620697 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
18 0.050392192 261 nips-2011-Sparse Filtering
19 0.049406253 258 nips-2011-Sparse Bayesian Multi-Task Learning
20 0.047119696 231 nips-2011-Randomized Algorithms for Comparison-based Search
topicId topicWeight
[(0, 0.157), (1, 0.006), (2, -0.039), (3, -0.112), (4, -0.042), (5, 0.042), (6, -0.006), (7, 0.042), (8, 0.057), (9, 0.018), (10, 0.056), (11, 0.084), (12, -0.001), (13, -0.053), (14, 0.06), (15, -0.03), (16, 0.019), (17, -0.06), (18, 0.041), (19, -0.012), (20, 0.034), (21, 0.009), (22, -0.001), (23, -0.007), (24, 0.006), (25, -0.014), (26, 0.018), (27, 0.063), (28, 0.023), (29, 0.008), (30, -0.032), (31, 0.043), (32, 0.047), (33, 0.048), (34, 0.005), (35, -0.043), (36, 0.01), (37, 0.03), (38, -0.103), (39, -0.06), (40, 0.01), (41, 0.041), (42, 0.009), (43, -0.082), (44, -0.002), (45, 0.043), (46, 0.033), (47, 0.035), (48, -0.035), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.93432271 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
Author: Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh
Abstract: We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions 1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. 2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. 3. We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition. 1
2 0.67373222 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
3 0.66088957 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
4 0.64695722 73 nips-2011-Divide-and-Conquer Matrix Factorization
Author: Lester W. Mackey, Michael I. Jordan, Ameet Talwalkar
Abstract: This work introduces Divide-Factor-Combine (DFC), a parallel divide-andconquer framework for noisy matrix factorization. DFC divides a large-scale matrix factorization task into smaller subproblems, solves each subproblem in parallel using an arbitrary base matrix factorization algorithm, and combines the subproblem solutions using techniques from randomized matrix approximation. Our experiments with collaborative filtering, video background modeling, and simulated data demonstrate the near-linear to super-linear speed-ups attainable with this approach. Moreover, our analysis shows that DFC enjoys high-probability recovery guarantees comparable to those of its base algorithm.
5 0.63701302 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements
Author: Andrew E. Waters, Aswin C. Sankaranarayanan, Richard Baraniuk
Abstract: We consider the problem of recovering a matrix M that is the sum of a low-rank matrix L and a sparse matrix S from a small set of linear measurements of the form y = A(M) = A(L + S). This model subsumes three important classes of signal recovery problems: compressive sensing, affine rank minimization, and robust principal component analysis. We propose a natural optimization problem for signal recovery under this model and develop a new greedy algorithm called SpaRCS to solve it. Empirically, SpaRCS inherits a number of desirable properties from the state-of-the-art CoSaMP and ADMiRA algorithms, including exponential convergence and efficient implementation. Simulation results with video compressive sensing, hyperspectral imaging, and robust matrix completion data sets demonstrate both the accuracy and efficacy of the algorithm. 1
6 0.63386428 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
7 0.62777686 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements
8 0.62548155 260 nips-2011-Sparse Features for PCA-Like Linear Regression
9 0.6134572 259 nips-2011-Sparse Estimation with Structured Dictionaries
10 0.60811567 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
11 0.60001683 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
12 0.59603208 209 nips-2011-Orthogonal Matching Pursuit with Replacement
13 0.56886995 264 nips-2011-Sparse Recovery with Brownian Sensing
14 0.55898267 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
15 0.55761266 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
16 0.5381937 198 nips-2011-On U-processes and clustering performance
17 0.5381577 265 nips-2011-Sparse recovery by thresholded non-negative least squares
18 0.5223549 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
19 0.5215677 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
20 0.51453024 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
topicId topicWeight
[(0, 0.038), (4, 0.052), (20, 0.045), (26, 0.031), (31, 0.09), (33, 0.02), (43, 0.066), (45, 0.15), (48, 0.012), (57, 0.031), (65, 0.015), (74, 0.054), (82, 0.214), (83, 0.037), (84, 0.016), (99, 0.043)]
simIndex simValue paperId paperTitle
1 0.86369854 59 nips-2011-Composite Multiclass Losses
Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson
Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1
same-paper 2 0.80005455 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
Author: Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh
Abstract: We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions 1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. 2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. 3. We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition. 1
3 0.75304735 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Author: Eric Moulines, Francis R. Bach
Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
4 0.70508444 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
Author: Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan Salakhutdinov
Abstract: We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial. 1
5 0.70490021 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Author: Zhen J. Xiang, Hao Xu, Peter J. Ramadge
Abstract: Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. 1
6 0.70351893 231 nips-2011-Randomized Algorithms for Comparison-based Search
7 0.70302987 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
8 0.70244557 80 nips-2011-Efficient Online Learning via Randomized Rounding
9 0.70051301 186 nips-2011-Noise Thresholds for Spectral Clustering
10 0.69854987 283 nips-2011-The Fixed Points of Off-Policy TD
11 0.69830948 271 nips-2011-Statistical Tests for Optimization Efficiency
12 0.69781327 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
13 0.69741392 263 nips-2011-Sparse Manifold Clustering and Embedding
14 0.69737679 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
15 0.69697559 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
16 0.69539285 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
17 0.69536102 258 nips-2011-Sparse Bayesian Multi-Task Learning
18 0.6949994 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
19 0.69489098 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
20 0.69488448 220 nips-2011-Prediction strategies without loss