nips nips2011 nips2011-260 knowledge-graph by maker-knowledge-mining

260 nips-2011-Sparse Features for PCA-Like Linear Regression


Source: pdf

Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail

Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. [sent-10, score-0.324]

2 The eigenfeatures are very useful in data analysis, including the regularization of linear regression. [sent-11, score-0.245]

3 We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. [sent-15, score-0.481]

4 The vector of regression weights w∗ ∈ Rd minimizes (over all w ∈ Rd) the RMS in-sample error E(w) = ( Σ_{i=1}^n (xi · w − yi)^2 )^{1/2} = ‖Xw − y‖_2. [sent-22, score-0.232]

5 In the above, X ∈ Rn×d is the data matrix whose rows are the vectors xi (i. [sent-23, score-0.166]

6 We will use the more convenient matrix formulation, namely: given X and y, we seek a vector w∗ that minimizes ‖Xw − y‖_2. [sent-28, score-0.187]
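
A minimal sketch of this least-squares setup on synthetic data (an editorial illustration, not code from the paper; the data and variable names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))   # data matrix: rows are the points x_i
y = rng.standard_normal(n)        # targets y_i

w_star = np.linalg.pinv(X) @ y                      # minimizes ||Xw - y||_2 over w in R^d
in_sample_error = np.linalg.norm(X @ w_star - y)    # E(w*) = ||Xw* - y||_2
```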

7 Popular regularization methods include, for example, the Lasso [28], Tikhonov regularization [17], and top-k PCA regression or truncated SVD regularization [21]. [sent-34, score-0.183]

8 Our focus is on top-k PCA regression which can be viewed as regression onto the top-k principal components, or, equivalently, the top-k eigenfeatures. [sent-36, score-0.399]

9 The eigenfeatures are the top-k right singular vectors of X and are arbitrary linear combinations of all available input features. [sent-37, score-0.454]

10 The question we tackle is whether one can efficiently extract sparse eigenfeatures (i. [sent-38, score-0.337]

11 , eigenfeatures that are linear combinations of only a small number of the available features) that have nearly the same performance as the top-k eigenfeatures. [sent-40, score-0.282]

12 right) singular vectors of X with singular values in the diagonal matrix Σ. [sent-58, score-0.408]

13 For k ≤ d, let Uk , Σk , and Vk contain only the top-k singular vectors and associated singular values. [sent-59, score-0.293]

14 The best rank-k reconstruction of X in the Frobenius norm can be obtained from this truncated singular value decomposition as Xk = Uk Σk Vk^T. [sent-60, score-0.228]

15 The k right singular vectors in Vk are called the top-k eigenfeatures. [sent-61, score-0.172]

16 The projections of the data points onto the top k eigenfeatures are obtained by projecting the xi’s onto the columns of Vk to obtain Fk = XVk = UΣV^T Vk = Uk Σk. [sent-62, score-0.55]

17 Each column of Fk contains a particular eigenfeature’s value for every data point and is a linear combination of the columns of X. [sent-64, score-0.368]

18 The top-k PCA regression uses Fk as the data matrix and y as the target vector to produce regression weights wk∗ = Fk^+ y. [sent-65, score-0.83]

19 The in-sample error of this k-dimensional regression is equal to ‖y − Fk wk∗‖_2 = ‖y − Fk Fk^+ y‖_2 = ‖y − Uk Σk Σk^{-1} Uk^T y‖_2 = ‖y − Uk Uk^T y‖_2. [sent-66, score-0.52]

20 The weights wk∗ are k-dimensional and cannot be applied to X, but the equivalent weights Vk wk∗ can be applied to X and they have the same in-sample error with respect to X: E(Vk wk∗) = ‖y − XVk wk∗‖_2 = ‖y − Fk wk∗‖_2 = ‖y − Uk Uk^T y‖_2. [sent-67, score-1.769]
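
The chain of identities above is easy to check numerically. The following sketch (an editorial illustration on synthetic data, not code from the paper) computes the top-k PCA regression weights and verifies that ‖y − Fk wk∗‖_2 = ‖y − XVk wk∗‖_2 = ‖y − Uk Uk^T y‖_2:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 100, 20, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, Vk = U[:, :k], Vt[:k, :].T            # top-k left and right singular vectors

Fk = X @ Vk                               # eigenfeatures; equals Uk * s[:k]
wk = np.linalg.pinv(Fk) @ y               # wk* = Fk^+ y (k-dimensional)
wk_d = Vk @ wk                            # equivalent d-dimensional weights Vk wk*

err_fk = np.linalg.norm(y - Fk @ wk)
err_x  = np.linalg.norm(y - X @ wk_d)
err_uk = np.linalg.norm(y - Uk @ (Uk.T @ y))
assert np.allclose([err_fk, err_x], err_uk)   # all three in-sample errors coincide
```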

21 Hence, we will refer to both wk∗ and Vk wk∗ as the top-k PCA regression weights and, for simplicity, we will overload wk∗ to refer to both of these weight vectors (the dimension will make it clear which one we are talking about). [sent-68, score-1.279]

22 We do not argue the merits of top-k PCA regression; we just note that top-k PCA regression is a common tool for regularizing regression. [sent-71, score-0.181]

23 Given X ∈ Rn×d , k (the number of target eigenfeatures for top-k PCA regression), and r > k (the sparsity parameter), we seek to extract a set of at most k sparse eigenfeatures V̂k which use at most r of the actual dimensions. [sent-73, score-0.434]

24 Let F̂k = X V̂k ∈ Rn×k denote the matrix whose columns are the k extracted sparse eigenfeatures, which are a linear combination of a set of at most r actual features. [sent-74, score-0.462]

25 Our goal is to obtain sparse features for which the vector of sparse regression weights ŵk = F̂k^+ y results in an in-sample error ‖y − F̂k F̂k^+ y‖_2 that is close to the top-k PCA regression error ‖y − Fk Fk^+ y‖_2. [sent-75, score-1.015]

26 Just as with top-k PCA regression, we can define the equivalent d-dimensional weights V̂k ŵk ; we will overload ŵk to refer to these weights as well. [sent-76, score-0.785]

27 The weight vector w becomes a weight matrix, W, where each column of W contains the weights from the regression of the corresponding column of Y onto the features. [sent-78, score-0.463]

28 Our first theorem argues that there exists a polynomial-time deterministic algorithm that constructs a feature matrix F̂k ∈ Rn×k , such that each feature (column of F̂k) is a linear combination of at most r actual features (columns) from X and results in small in-sample error. [sent-81, score-0.48]

29 Again, this should be contrasted with top-k PCA regression, which constructs a feature matrix Fk , such that each feature (column of Fk ) is a linear combination of all features (columns) in X. [sent-82, score-0.365]

30 Our theorems argue that the in-sample error of our features is almost as good as the in-sample error of top-k PCA regression, which uses dense features. [sent-83, score-0.178]

31 Let X ∈ Rn×d and Y ∈ Rn×ω be the input matrices in a multiple regression problem. [sent-85, score-0.221]

32 Let k > 0 be a target rank for top-k PCA regression on X and Y. [sent-86, score-0.245]

33 For any r > k, there exists an algorithm that constructs a feature matrix F̂k = X V̂k ∈ Rn×k , such that every column of F̂k is a linear combination of (the same) at most r columns of X, and ‖Y − X Ŵk‖_F = ‖Y − F̂k F̂k^+ Y‖_F ≤ ‖Y − X Wk∗‖_F + (1 + sqrt(9k/r)) · (‖X − Xk‖_F / σk(X)) · ‖Y‖_2. [sent-87, score-0.583]

34 The running time of the proposed algorithm is T(Vk) + O(ndk + nrk^2), where T(Vk) is the time required to compute the matrix Vk , the top-k right singular vectors of X. [sent-89, score-0.342]

35 Theorem 1 says that one can construct k features with sparsity O(k) and obtain a regression error comparable to that attained by the dense top-k PCA features, up to an additive term that is proportional to ∆k = ‖X − Xk‖_F /σk(X). [sent-90, score-0.362]

36 3) to select r columns of X and form the matrix C ∈ Rn×r . [sent-92, score-0.34]

37 In other words, ΠC,k(Y) is a rank-k matrix that minimizes ‖Y − ΠC,k(Y)‖_F over all rank-k matrices in the column-span of C. [sent-94, score-0.181]

38 Given ΠC,k(Y), the sparse eigenfeatures can be computed efficiently as follows: first, set Ψ = C^+ ΠC,k(Y). [sent-96, score-0.337]

39 The last equality follows because CC^+ projects onto the column span of C and ΠC,k(Y) is already in the column span of C. [sent-98, score-0.356]

40 Clearly, each column of F̂k is a linear combination of (the same) at most r columns of X (the columns in C). [sent-101, score-0.593]

41 The sparse features themselves can also be obtained because F̂k = X V̂k , so V̂k = X^+ F̂k. [sent-102, score-0.171]
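
A sketch of this construction, assuming a column subset C has already been chosen (an editorial illustration; the helper names are made up, and computing ΠC,k(Y) as Q (Q^T Y)_k with Q an orthonormal basis of span(C) is a standard identity rather than something stated in the extracted text):

```python
import numpy as np

def best_rank_k(M, k):
    """Best rank-k approximation of M in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def sparse_eigenfeatures(X, Y, cols, k):
    """Given r column indices `cols` of X, build F_hat_k and V_hat_k as described above.
    Y should be 2-D (n x omega); reshape a single target y to (n, 1)."""
    C = X[:, cols]                                  # n x r, the chosen columns of X
    Q, _ = np.linalg.qr(C)                          # orthonormal basis for span(C)
    Pi = Q @ best_rank_k(Q.T @ Y, k)                # Pi_{C,k}(Y): rank-k, inside span(C)
    Psi = np.linalg.pinv(C) @ Pi                    # Psi = C^+ Pi_{C,k}(Y)
    U_psi, s_psi, _ = np.linalg.svd(Psi, full_matrices=False)
    F_hat = (C @ U_psi[:, :k]) * s_psi[:k]          # F_hat_k = C U_Psi Sigma_Psi
    V_hat = np.linalg.pinv(X) @ F_hat               # V_hat_k = X^+ F_hat_k
    return F_hat, V_hat

# usage: sparse regression with the extracted features
# W_hat = np.linalg.pinv(F_hat) @ Y;  error = np.linalg.norm(Y - F_hat @ W_hat)
```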

42 To prove that F̂k is a good set of sparse features, we first relate the regression error from using F̂k to how well ΠC,k(Y) approximates Y. [sent-103, score-0.282]

43 ‖Y − ΠC,k(Y)‖_F = ‖Y − CΨ‖_F = ‖Y − C Uψ Σψ Vψ^T‖_F = ‖Y − F̂k Vψ^T‖_F ≥ ‖Y − F̂k F̂k^+ Y‖_F. The last inequality follows because F̂k^+ Y are the optimal regression weights for F̂k; the reverse inequality also holds because ΠC,k(Y) is the best rank-k approximation to Y in the column span of C. [sent-104, score-0.197]

44 The upshot of the above discussion is that if we can find a matrix C consisting of columns of X for which ‖Y − ΠC,k(Y)‖_F is small, then we immediately have good sparse eigenfeatures. [sent-108, score-0.432]

45 Indeed, all that remains to complete the proof of Theorem 1 is to bound ‖Y − ΠC,k(Y)‖_F for the columns C returned by Algorithm DSF-Select. [sent-109, score-0.225]

46 4) to select r columns of X and again form the matrix C ∈ Rn×r . [sent-111, score-0.34]

47 Let X ∈ Rn×d and Y ∈ Rn×ω be the input matrices in a multiple regression problem. [sent-115, score-0.221]

48 Let k > 0 be a target rank for top-k PCA regression on X and Y. [sent-116, score-0.245]

49 For any r > 144k ln(20k), there exists a randomized algorithm that constructs a feature matrix F̂k = X V̂k ∈ Rn×k , such that every column of F̂k is a linear combination of at most r columns of X, and, with probability at least . [sent-117, score-0.645]

50 3 Connections with prior work A variant of our problem is the identification of a matrix C consisting of a small number (say r) of columns of X such that the regression of Y onto C (as opposed to k features from C) gives small in-sample error. [sent-120, score-0.614]

51 This is the sparse approximation problem, where the number of non-zero weights in the regression vector is restricted to r. [sent-121, score-0.289]

52 An important difference between sparse approximation and sparse PCA regression is that our goal is not to minimize the error under a sparsity constraint, but to match the top-k PCA regularized regression under a sparsity constraint. [sent-124, score-0.595]

53 If X = Y (approximating X using the columns of X), then this is the column-based matrix reconstruction problem, which has received much attention in existing literature [16, 18, 14, 26, 5, 12, 20]. [sent-128, score-0.393]

54 Our goal is to provide sparse PCA features that are almost as good as the exact principal components. [sent-133, score-0.22]

55 The authors use a QR-like decomposition to select exactly k columns of X and compare their sparse solution vector ŵk with the top-k PCA regularized solution wk∗ . [sent-136, score-0.977]

56 They show that ‖ŵk − wk∗‖_2 ≤ sqrt(k(n − k) + 1) · (‖X − Xk‖_2 / σk(X)) · ∆, where ∆ = 2‖wk∗‖_2 + ‖y − Xŵk‖_2 /σk(X). [sent-137, score-0.99]

57 The importance of the right singular vectors in matrix reconstruction problems (including PCA) has been heavily studied in prior literature, going back to work by Jolliffe in 1972 [22]. [sent-140, score-0.34]

58 The idea of sampling columns from a matrix X with probabilities that are derived from Vk^T (as we do in Theorem 2) was introduced in [15] in order to construct coresets for regression problems by sampling data points (rows of the matrix X) as opposed to features (columns of the matrix X). [sent-141, score-1.029]

59 Finally, we note that our deterministic feature selection algorithm (Theorem 1) uses a sparsification tool developed in [2] for column based matrix reconstruction. [sent-143, score-0.36]

60 Both algorithms necessitate access to the right singular vectors of X, namely the matrix Vk ∈ Rd×k . [sent-146, score-0.328]

61 Our first algorithm (DSF-Select) is deterministic, while the second algorithm (RSF-Select) is randomized, requiring logarithmically more columns to guarantee the theoretical bounds. [sent-148, score-0.266]

62 Prior to describing our algorithms in detail, we will introduce useful notation on sampling and rescaling matrices as well as a matrix factorization lemma (Lemma 3) that will be critical in our proofs. [sent-149, score-0.462]

63 1 Sampling and rescaling matrices Let C ∈ Rn×r contain r columns of X ∈ Rn×d . [sent-151, score-0.425]

64 We can express the matrix C as C = XΩ, where the sampling matrix Ω ∈ Rd×r is equal to [ei1 , . . . , eir ], with the eij denoting standard basis vectors of Rd. [sent-152, score-0.295]

65 In our proofs, we will make use of S ∈ Rr×r , a diagonal rescaling matrix with positive entries on the diagonal. [sent-156, score-0.249]

66 Our column selection algorithms return a sampling and a rescaling matrix, so that XΩS contains a subset of rescaled columns from X. [sent-157, score-0.607]

67 The rescaling is benign since it does not affect the span of the columns of C = XΩ and thus the quantity of interest, namely ΠC,k (Y). [sent-158, score-0.445]
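
A minimal sketch of the sampling and rescaling matrices described above (an editorial illustration; the chosen indices and unit scales are arbitrary):

```python
import numpy as np

def sampling_rescaling(d, idx, scales):
    """Omega in R^{d x r} holds columns e_{i_1}, ..., e_{i_r} of the identity;
    S in R^{r x r} is diagonal with positive rescaling factors."""
    r = len(idx)
    Omega = np.zeros((d, r))
    Omega[idx, np.arange(r)] = 1.0
    return Omega, np.diag(scales)

rng = np.random.default_rng(2)
n, d, r = 50, 12, 4
X = rng.standard_normal((n, d))
idx = [3, 7, 0, 9]
Omega, S = sampling_rescaling(d, idx, np.ones(r))
assert np.allclose(X @ Omega, X[:, idx])   # C = X Omega picks exactly the chosen columns
```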

68 2 A structural result using matrix factorizations We now present a matrix reconstruction lemma that will be the starting point for our algorithms. [sent-160, score-0.365]

69 Let Y ∈ Rn×ω be a target matrix and let X ∈ Rn×d be the basis matrix that we will use in order to reconstruct Y. [sent-161, score-0.263]

70 More specifically, we seek a sparse reconstruction of Y from X, or, in other words, we would like to choose r ≪ d columns from X and form a matrix C ∈ Rn×r such that ‖Y − ΠC,k(Y)‖_F is small. [sent-162, score-0.516]

71 , Z^T Z = Ik ), and express the matrix X as follows: X = H Z^T + E, where H is some matrix in Rn×k and E ∈ Rn×d is the residual error of the factorization. [sent-165, score-0.168]

72 Let Ω ∈ Rd×r and S ∈ Rr×r be a sampling and a rescaling matrix respectively as defined in the previous section, and let C = XΩ ∈ Rn×r . [sent-167, score-0.314]

73 Using the above notation, if the rank of the matrix Z^T ΩS is equal to k, then ‖Y − ΠC,k(Y)‖_F ≤ ‖Y − HH^+ Y‖_F + ‖EΩS(Z^T ΩS)^+ H^+ Y‖_F. [sent-170, score-0.172]
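
The lemma can be sanity-checked numerically. The sketch below (an editorial illustration) instantiates Z = Vk, H = Uk Σk and E = X − Xk as in the corollary that follows, picks r random columns with S = I, and verifies the inequality:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, omega, k, r = 60, 30, 4, 3, 10
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, omega))

# Factor X = H Z^T + E with Z = Vk (orthonormal columns), H = Uk Sigma_k, E = X - Xk.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = Vt[:k, :].T
H = U[:, :k] * s[:k]
E = X - H @ Z.T

# Sampling matrix Omega (r random columns of the identity) and rescaling S = I.
idx = rng.choice(d, size=r, replace=False)
Omega = np.eye(d)[:, idx]
S = np.eye(r)
C = X @ Omega
assert np.linalg.matrix_rank(Z.T @ Omega @ S) == k   # rank condition of the lemma

# Left-hand side: ||Y - Pi_{C,k}(Y)||_F, with Pi_{C,k}(Y) = Q (Q^T Y)_k.
Q, _ = np.linalg.qr(C)
Uq, sq, Vqt = np.linalg.svd(Q.T @ Y, full_matrices=False)
Pi = Q @ (Uq[:, :k] * sq[:k]) @ Vqt[:k, :]
lhs = np.linalg.norm(Y - Pi)

# Right-hand side: ||Y - H H^+ Y||_F + ||E Omega S (Z^T Omega S)^+ H^+ Y||_F.
rhs = (np.linalg.norm(Y - H @ np.linalg.pinv(H) @ Y)
       + np.linalg.norm(E @ Omega @ S @ np.linalg.pinv(Z.T @ Omega @ S)
                        @ np.linalg.pinv(H) @ Y))
assert lhs <= rhs + 1e-8
```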

74 For our goals, the matrix C essentially contains a subset of r features from the data matrix X. [sent-172, score-0.309]

75 Recall that ΠC,k(Y) is the best rank-k approximation to Y within the column space of C; the difference Y − ΠC,k(Y) measures the error from performing regression using sparse eigenfeatures that are constructed as linear combinations of the columns of C. [sent-173, score-0.902]

76 Here, we focus on one extreme of this trade-off, namely choosing Z so that the (Frobenius) norm of the matrix E is minimized. [sent-178, score-0.182]

77 Using the above notation, if the rank of the matrix Vk^T ΩS is equal to k, then ‖Y − ΠC,k(Y)‖_F ≤ ‖Y − Uk Uk^T Y‖_F + ‖(X − Xk)ΩS(Vk^T ΩS)^+ Σk^{-1} Uk^T Y‖_F. [sent-183, score-0.172]

78 Run DetSampling to construct sampling and rescaling matrices Ω and S: [Ω, S] = DetSampling(Vk^T, E, r). [sent-201, score-0.296]

79 3 DSF-Select: Deterministic Sparse Feature Selection DSF-Select deterministically selects r columns of the matrix X to form the matrix C (see Table 1 and note that the matrix C = XΩ might contain duplicate columns which can be removed without any loss in accuracy). [sent-208, score-0.831]

80 In order to describe the algorithm, it is convenient to view these two matrices as two sets of column vectors, V^T = [v1 , . [sent-213, score-0.179]

81 , vd ] (satisfying Σ_{i=1}^d vi vi^T = Ik ) and A = [a1 , . [sent-216, score-0.168]

82 At every step τ , the algorithm selects a column ai such that U(ai) ≤ L(vi , B^{τ−1}, Lτ ); note that B^{τ−1} is a k × k matrix which is also updated at every step of the algorithm (see Table 1). [sent-227, score-0.306]

83 It is worth noting that in practical implementations of the proposed algorithm, there might exist multiple columns which satisfy the above requirement. [sent-229, score-0.225]

84 Run RandSampling to construct sampling and rescaling matrices Ω and S: [Ω, S] = RandSampling(Vk^T, r). [sent-242, score-0.296]

85 DetSampling with inputs V^T and A returns a sampling matrix Ω ∈ Rd×r and a rescaling matrix S ∈ Rr×r satisfying ‖(V^T ΩS)^+‖_2 ≤ (1 − sqrt(k/r))^{−1} and ‖AΩS‖_F ≤ ‖A‖_F. [sent-270, score-0.456]

86 4 RSF-Select: Randomized Sparse Feature Selection RSF-Select is a randomized algorithm that selects r columns of the matrix X in order to form the matrix C (see Table 2). [sent-273, score-0.553]

87 There are two main differences between RSF-Select and DSF-Select: first, RSF-Select only needs access to Vk^T and, second, RSF-Select uses a simple sampling procedure in order to select the columns of X to include in C. [sent-274, score-0.29]

88 This sampling procedure is described in algorithm RandSampling and essentially selects columns of X with probabilities that depend on the norms of the columns of Vk^T. [sent-275, score-0.606]

89 Thus, RandSampling first computes a set of probabilities that are proportional to the norms of the columns of Vk^T and then samples r columns of X in r independent, identical trials with replacement, where in each trial a column is sampled according to the computed probabilities. [sent-276, score-0.618]
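
A sketch of this sampling step (an editorial illustration; the use of squared column norms, i.e. leverage scores, and the 1/sqrt(r·p_i) rescaling are assumptions on my part, not details taken from the extracted text):

```python
import numpy as np

def rand_sampling(Vk, r, seed=None):
    """Sample r column indices of X i.i.d. with replacement, with probabilities
    proportional to the (squared) norms of the columns of Vk^T, i.e. the rows of Vk."""
    rng = np.random.default_rng(seed)
    p = np.sum(Vk ** 2, axis=1)            # squared row norms of Vk; they sum to k
    p = p / p.sum()
    idx = rng.choice(Vk.shape[0], size=r, replace=True, p=p)
    scales = 1.0 / np.sqrt(r * p[idx])     # a common rescaling choice (assumption)
    return idx, scales

# usage
rng = np.random.default_rng(4)
n, d, k, r = 80, 25, 5, 15
X = rng.standard_normal((n, d))
_, _, Vt = np.linalg.svd(X, full_matrices=False)
idx, scales = rand_sampling(Vt[:k, :].T, r, seed=5)
C = X[:, idx]                              # may contain duplicates, which can be removed
```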

90 In terms of running time, and assuming that the matrix Vk that contains the top k right singular vectors of X has already been computed, the proposed algorithm needs O(dk) time to compute the sampling probabilities and an additional O(d + r log r) time to sample r columns from X. [sent-278, score-0.659]

91 5 Experiments The goal of our experiments is to illustrate that our algorithms produce sparse features which perform as well in-sample as the top-k PCA regression. [sent-281, score-0.171]

92 Table 3 (numeric entries garbled in extraction): datasets with (n; d) — Arcene (100;10,000), I-sphere (351;34), LibrasMov (45;90), Madelon (2,000;500), HillVal (606;100), Spambase (4601;57). [sent-283, score-0.33]

93 Table 3 (continued): error columns for wk∗ (top-k PCA), ŵk^DSF, ŵk^RSF, and ŵk^rnd, with k = 5 and r = k + 1. [sent-295, score-1.06]

94 We used a five-fold cross-validation design with 1,000 random splits: we computed regression weights using 80% of the data and estimated the out-of-sample error on the remaining 20% of the data. [sent-384, score-0.232]
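
A sketch of this evaluation protocol (an editorial illustration; only top-k PCA regression and regression on r random columns are shown as stand-ins for the four methods in Table 3, and the helper name is made up):

```python
import numpy as np

def split_errors(X, y, k, r, n_splits=100, seed=0):
    """Repeated 80/20 random splits (the paper uses 1,000); returns mean in- and
    out-of-sample errors for top-k PCA regression and for r random columns."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    results = {"pca": [], "rnd": []}
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, te = perm[: int(0.8 * n)], perm[int(0.8 * n):]
        # top-k PCA regression fitted on the training split
        _, _, Vt = np.linalg.svd(X[tr], full_matrices=False)
        Vk = Vt[:k, :].T
        w_pca = Vk @ (np.linalg.pinv(X[tr] @ Vk) @ y[tr])
        # regression restricted to r random columns
        cols = rng.choice(d, size=r, replace=False)
        w_rnd = np.zeros(d)
        w_rnd[cols] = np.linalg.pinv(X[tr][:, cols]) @ y[tr]
        for name, w in (("pca", w_pca), ("rnd", w_rnd)):
            results[name].append((np.linalg.norm(X[tr] @ w - y[tr]),
                                  np.linalg.norm(X[te] @ w - y[te])))
    return {name: np.asarray(v).mean(axis=0) for name, v in results.items()}
```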

95 Table 3 shows the in- and out-of-sample error for four methods: top-k PCA regression, wk∗ ; r-sparse features regression using DSF-Select, ŵk^DSF ; r-sparse features regression using RSF-Select, ŵk^RSF ; and r-sparse features regression using r random columns, ŵk^rnd . [sent-386, score-2.127]

96 6 Discussion The top-k PCA regression constructs “features” without looking at the targets – it is target-agnostic. [sent-387, score-0.214]

97 We only explored one extreme choice for the factorization, namely the minimization of some norm of the matrix E. [sent-390, score-0.182]

98 As mentioned when we discussed our deterministic algorithm, it will often be the case that in some steps of the greedy selection process, multiple columns could satisfy the criterion for selection. [sent-393, score-0.316]

99 An improved approximation algorithm for the column subset selection problem. [sent-426, score-0.153]

100 Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. [sent-478, score-0.181]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vk', 0.417), ('wk', 0.33), ('fk', 0.33), ('eigenfeatures', 0.245), ('columns', 0.225), ('pca', 0.215), ('detsampling', 0.162), ('regression', 0.155), ('drineas', 0.15), ('rn', 0.147), ('xk', 0.143), ('rescaling', 0.134), ('boutsidis', 0.122), ('singular', 0.121), ('randsampling', 0.116), ('matrix', 0.115), ('column', 0.113), ('sparse', 0.092), ('uk', 0.086), ('lemma', 0.082), ('features', 0.079), ('hzt', 0.07), ('rnd', 0.07), ('rsf', 0.07), ('xvk', 0.07), ('mahoney', 0.067), ('matrices', 0.066), ('sampling', 0.065), ('ik', 0.064), ('randomized', 0.062), ('dsf', 0.061), ('vi', 0.06), ('constructs', 0.059), ('rank', 0.057), ('propack', 0.056), ('xw', 0.054), ('reconstruction', 0.053), ('rd', 0.051), ('ties', 0.051), ('deterministic', 0.051), ('vectors', 0.051), ('principal', 0.049), ('qr', 0.048), ('vd', 0.048), ('svd', 0.047), ('drk', 0.046), ('nnz', 0.046), ('petros', 0.046), ('frobenius', 0.045), ('stoc', 0.045), ('span', 0.045), ('ai', 0.042), ('weights', 0.042), ('overload', 0.041), ('rensselaer', 0.041), ('logarithmically', 0.041), ('feature', 0.041), ('namely', 0.041), ('soda', 0.04), ('selection', 0.04), ('onto', 0.04), ('sparsi', 0.039), ('siam', 0.038), ('polytechnic', 0.037), ('troy', 0.037), ('coresets', 0.037), ('combinations', 0.037), ('rr', 0.037), ('selects', 0.036), ('error', 0.035), ('dk', 0.034), ('malik', 0.033), ('lanczos', 0.033), ('cu', 0.033), ('tikhonov', 0.033), ('sparsity', 0.033), ('target', 0.033), ('necessitates', 0.032), ('construct', 0.031), ('seek', 0.031), ('combination', 0.03), ('return', 0.03), ('running', 0.029), ('dense', 0.029), ('vt', 0.029), ('theorem', 0.029), ('table', 0.029), ('truncated', 0.028), ('norms', 0.028), ('extraction', 0.028), ('cc', 0.028), ('rk', 0.027), ('satisfying', 0.027), ('probabilities', 0.027), ('breaking', 0.026), ('regularizing', 0.026), ('norm', 0.026), ('proceedings', 0.026), ('considerably', 0.026), ('compute', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 260 nips-2011-Sparse Features for PCA-Like Linear Regression

Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail

Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.

2 0.14998488 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data

Author: Youwei Zhang, Laurent E. Ghaoui

Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1

3 0.10747492 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

Author: Ioannis A. Gkioulekas, Todd Zickler

Abstract: We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. 1

4 0.10713975 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity

Author: Nir Ailon

Abstract: Given a set V of n elements we wish to linearly order them using pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(n poly(log n, ε−1 )) preference labels for a regret of ε times the optimal loss. This is strictly better, and often significantly better than what non-adaptive sampling could achieve. Our main result helps settle an open problem posed by learning-to-rank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? 1

5 0.10048315 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time

Author: Dan Garber, Elad Hazan

Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1

6 0.096545503 259 nips-2011-Sparse Estimation with Structured Dictionaries

7 0.090793766 239 nips-2011-Robust Lasso with missing and grossly corrupted observations

8 0.087388247 109 nips-2011-Greedy Model Averaging

9 0.087171659 240 nips-2011-Robust Multi-Class Gaussian Process Classification

10 0.084760197 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

11 0.084031537 261 nips-2011-Sparse Filtering

12 0.081518196 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

13 0.079879247 258 nips-2011-Sparse Bayesian Multi-Task Learning

14 0.078319006 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

15 0.077239759 198 nips-2011-On U-processes and clustering performance

16 0.076145016 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

17 0.072787218 168 nips-2011-Maximum Margin Multi-Instance Learning

18 0.072635286 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs

19 0.071995996 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation

20 0.07190498 68 nips-2011-Demixed Principal Component Analysis


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.209), (1, -0.007), (2, -0.049), (3, -0.128), (4, -0.069), (5, 0.063), (6, 0.045), (7, 0.122), (8, -0.001), (9, 0.003), (10, 0.028), (11, 0.018), (12, 0.029), (13, -0.034), (14, 0.037), (15, -0.073), (16, -0.059), (17, 0.013), (18, 0.044), (19, -0.052), (20, 0.016), (21, 0.04), (22, 0.083), (23, -0.047), (24, -0.105), (25, 0.018), (26, 0.065), (27, -0.007), (28, -0.014), (29, -0.115), (30, -0.078), (31, 0.021), (32, 0.021), (33, 0.026), (34, 0.07), (35, 0.114), (36, -0.083), (37, -0.009), (38, -0.085), (39, -0.108), (40, 0.025), (41, -0.037), (42, -0.1), (43, -0.133), (44, 0.129), (45, -0.022), (46, 0.044), (47, -0.064), (48, -0.107), (49, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94575512 260 nips-2011-Sparse Features for PCA-Like Linear Regression

Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail

Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.

2 0.71184003 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation

Author: Zhouchen Lin, Risheng Liu, Zhixun Su

Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve lowrank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n3 ) of the original ADM based method to O(rn2 ), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. 1

3 0.70763588 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data

Author: Youwei Zhang, Laurent E. Ghaoui

Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1

4 0.58901018 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

Author: Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, Aarti Singh

Abstract: We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions 1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest. 2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit. 3. We characterize the SNR required by several computationally tractable procedures for biclustering including element-wise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition. 1

5 0.5885092 73 nips-2011-Divide-and-Conquer Matrix Factorization

Author: Lester W. Mackey, Michael I. Jordan, Ameet Talwalkar

Abstract: This work introduces Divide-Factor-Combine (DFC), a parallel divide-andconquer framework for noisy matrix factorization. DFC divides a large-scale matrix factorization task into smaller subproblems, solves each subproblem in parallel using an arbitrary base matrix factorization algorithm, and combines the subproblem solutions using techniques from randomized matrix approximation. Our experiments with collaborative filtering, video background modeling, and simulated data demonstrate the near-linear to super-linear speed-ups attainable with this approach. Moreover, our analysis shows that DFC enjoys high-probability recovery guarantees comparable to those of its base algorithm.

6 0.56545043 68 nips-2011-Demixed Principal Component Analysis

7 0.56391549 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent

8 0.55969042 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

9 0.55789059 211 nips-2011-Penalty Decomposition Methods for Rank Minimization

10 0.5073275 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization

11 0.49968886 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo

12 0.4850271 4 nips-2011-A Convergence Analysis of Log-Linear Training

13 0.47518638 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity

14 0.46466473 259 nips-2011-Sparse Estimation with Structured Dictionaries

15 0.4566682 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing

16 0.4548457 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

17 0.45301363 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

18 0.45132792 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

19 0.44907433 239 nips-2011-Robust Lasso with missing and grossly corrupted observations

20 0.44102752 14 nips-2011-A concave regularization technique for sparse mixture models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.031), (4, 0.041), (20, 0.435), (26, 0.031), (31, 0.056), (33, 0.019), (43, 0.05), (45, 0.136), (57, 0.019), (65, 0.012), (74, 0.042), (83, 0.027), (99, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93907946 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection

Author: Joseph J. Lim, Antonio Torralba, Ruslan Salakhutdinov

Abstract: Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples. To overcome this lack of training data for certain classes, we propose a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes. Our model learns which training instances from other classes to borrow and how to transform the borrowed examples so that they become more similar to instances from the target class. Our experimental results demonstrate that our new object detector, with borrowed and transformed examples, improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset. 1

2 0.93811452 275 nips-2011-Structured Learning for Cell Tracking

Author: Xinghua Lou, Fred A. Hamprecht

Abstract: We study the problem of learning to track a large quantity of homogeneous objects such as cell tracking in cell culture study and developmental biology. Reliable cell tracking in time-lapse microscopic image sequences is important for modern biomedical research. Existing cell tracking methods are usually kept simple and use only a small number of features to allow for manual parameter tweaking or grid search. We propose a structured learning approach that allows to learn optimum parameters automatically from a training set. This allows for the use of a richer set of features which in turn affords improved tracking compared to recently reported methods on two public benchmark sequences. 1

3 0.93297023 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation

Author: Sungwoong Kim, Sebastian Nowozin, Pushmeet Kohli, Chang D. Yoo

Abstract: For many of the state-of-the-art computer vision algorithms, image segmentation is an important preprocessing step. As such, several image segmentation algorithms have been proposed, however, with certain reservation due to high computational load and many hand-tuning parameters. Correlation clustering, a graphpartitioning algorithm often used in natural language processing and document clustering, has the potential to perform better than previously proposed image segmentation algorithms. We improve the basic correlation clustering formulation by taking into account higher-order cluster relationships. This improves clustering in the presence of local boundary ambiguities. We first apply the pairwise correlation clustering to image segmentation over a pairwise superpixel graph and then develop higher-order correlation clustering over a hypergraph that considers higher-order relations among superpixels. Fast inference is possible by linear programming relaxation, and also effective parameter learning framework by structured support vector machine is possible. Experimental results on various datasets show that the proposed higher-order correlation clustering outperforms other state-of-the-art image segmentation algorithms.

4 0.92260355 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

Author: Samory Kpotufe

Abstract: Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. 1

same-paper 5 0.86224842 260 nips-2011-Sparse Features for PCA-Like Linear Regression

Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail

Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.

6 0.61800885 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

7 0.60173076 154 nips-2011-Learning person-object interactions for action recognition in still images

8 0.58829838 227 nips-2011-Pylon Model for Semantic Segmentation

9 0.57330573 303 nips-2011-Video Annotation and Tracking with Active Learning

10 0.56721294 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

11 0.56170017 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness

12 0.55568433 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

13 0.55345982 263 nips-2011-Sparse Manifold Clustering and Embedding

14 0.55172068 59 nips-2011-Composite Multiclass Losses

15 0.54620034 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition

16 0.54093498 25 nips-2011-Adaptive Hedge

17 0.54091442 22 nips-2011-Active Ranking using Pairwise Comparisons

18 0.54004937 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

19 0.53441018 180 nips-2011-Multiple Instance Filtering

20 0.53152913 109 nips-2011-Greedy Model Averaging