nips nips2009 nips2009-252 knowledge-graph by maker-knowledge-mining

252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem


Source: pdf

Author: Christos Boutsidis, Petros Drineas, Michael W. Mahoney

Abstract: We present a novel feature selection algorithm for the k-means clustering problem. Our algorithm is randomized and, assuming an accuracy parameter ϵ ∈ (0, 1), selects and appropriately rescales in an unsupervised manner Θ(k log(k/ϵ)/ϵ²) features from a dataset of arbitrary dimensions. We prove that, if we run any γ-approximate k-means algorithm (γ ≥ 1) on the features selected using our method, we can find a (1 + (1 + ϵ)γ)-approximate partition with high probability.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We present a novel feature selection algorithm for the k-means clustering problem. [sent-8, score-0.278]

2 Our algorithm is randomized and, assuming an accuracy parameter ϵ ∈ (0, 1), selects and appropriately rescales in an unsupervised manner Θ(k log(k/ϵ)/ϵ²) features from a dataset of arbitrary dimensions. [sent-9, score-0.119]

3 We prove that, if we run any γ-approximate k-means algorithm (γ ≥ 1) on the features selected using our method, we can find a (1 + (1 + ϵ)γ)-approximate partition with high probability. [sent-10, score-0.143]

4 This optimization objective is often called the k-means clustering objective. [sent-13, score-0.134]

5 In recent years, the high dimensionality of modern massive datasets has posed a considerable challenge to k-means clustering approaches. [sent-16, score-0.112]

6 First, the curse of dimensionality can make algorithms for k-means clustering very slow, and, second, the existence of many irrelevant features may not allow the identification of the relevant underlying structure in the data [14]. [sent-17, score-0.15]

7 Practitioners addressed such obstacles by introducing feature selection and feature extraction techniques. [sent-18, score-0.239]

8 Despite the significance of the problem, as well as the wealth of heuristic methods addressing it (see Section 3), there exist no provably accurate feature selection methods and extremely few provably accurate feature extraction methods for the k-means clustering objective (see Section 3.1). [sent-20, score-0.547]

9 Our work here addresses this shortcoming by presenting the first provably accurate feature selection algorithm for k-means clustering. [sent-22, score-0.253]

10 Our algorithm constructs a probability distribution for the feature space, and then selects a small number of features (roughly k log(k), where k is the number of clusters) with respect to the computed probabilities. [sent-23, score-0.162]

11 Then, we argue that running k-means clustering algorithms on the selected features returns a partition that is a constant-factor approximation to the optimal one. [sent-25, score-0.214]

12 We now formally define the k-means clustering problem using the so-called cluster indicator matrix. [sent-27, score-0.187]

13 Definition 1 [THE K-MEANS CLUSTERING PROBLEM] Given a matrix A ∈ Rn×d (representing n points – rows – described with respect to d features – columns) and a positive integer k denoting the number of clusters, find the n × k indicator matrix Xopt such that Xopt = arg min_{X∈X} ∥A − XX^T A∥_F². (1) [sent-31, score-0.283]

14 The optimal value of the k-means clustering objective is Fopt = min_{X∈X} ∥A − XX^T A∥_F² = ∥A − Xopt Xopt^T A∥_F². [sent-32, score-0.153]

15 We briefly expand on the notion of an n × k indicator matrix X. [sent-34, score-0.107]

16 The above definition of the k-means objective is exactly equivalent to the standard definition of k-means clustering as minimizing the sum of squared Euclidean distances of the points to their cluster centroids [28]. [sent-47, score-0.134]
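
As a quick numerical sanity check of this equivalence, the following sketch (illustrative only: the dimensions and the random partition are made up) builds the indicator matrix X of Definition 1 for an arbitrary partition of the rows of A and verifies that ∥A − XX^T A∥_F² equals the usual sum of squared distances of the points to their cluster centroids.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 4
A = rng.standard_normal((n, d))
# An arbitrary partition of the n rows; the first k labels guarantee no empty cluster.
labels = np.concatenate([np.arange(k), rng.integers(0, k, size=n - k)])

# Indicator matrix of Definition 1: X[i, j] = 1/sqrt(|cluster j|) if row i is in cluster j.
X = np.zeros((n, k))
for j in range(k):
    members = np.flatnonzero(labels == j)
    X[members, j] = 1.0 / np.sqrt(len(members))

matrix_objective = np.linalg.norm(A - X @ X.T @ A, "fro") ** 2

# Standard k-means objective: sum of squared distances of the rows to their centroids.
centroids = np.vstack([A[labels == j].mean(axis=0) for j in range(k)])
standard_objective = ((A - centroids[labels]) ** 2).sum()

assert np.isclose(matrix_objective, standard_objective)
```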

17 (Section 2: The feature selection algorithm and the quality-of-clustering results.) Algorithm 1 takes as inputs the matrix A ∈ Rn×d, the number of clusters k, and an accuracy parameter ϵ ∈ (0, 1). [sent-53, score-0.291]

18 It first computes the top-k right singular vectors of A (columns of Vk ∈ Rd×k ). [sent-54, score-0.098]

19 Using these vectors, it computes the so-called (normalized) leverage scores [4, 24]. [sent-55, score-0.162]

20 For i = 1, …, d, the i-th leverage score equals the square of the Euclidean norm of the i-th row of Vk (denoted by (Vk)(i)). [sent-58, score-0.110]

21 The i-th leverage score characterizes the importance of the i-th feature with respect to the k-means objective. [sent-59, score-0.156]

22 Notice that these scores (see the definition of the pi's in step 2 of Algorithm 1) form a probability distribution over the columns of A, since ∑i pi = 1. [sent-60, score-0.149]

23 Then, the algorithm chooses a sampling parameter r that is equal to the number of (rescaled) features that we want to select. [sent-61, score-0.097]

24 Finally, note that the running time of Algorithm 1 is dominated by the time required to compute the top-k right singular vectors of the matrix A, which is at most O(min{nd², n²d}). [sent-72, score-0.165]

25 Input: n × d matrix A (n points, d features), number of clusters k, parameter ϵ ∈ (0, 1). [sent-73, score-0.096]

26 Compute the top-k right singular vectors of A, denoted by Vk ∈ Rd×k . [sent-75, score-0.098]

27 Compute the (normalized) leverage scores pi, for i = 1, …, d. [sent-77, score-0.209]

28 In each of the r i.i.d. random trials: keep the i-th feature with probability pi and multiply it by the factor (r pi)^{−1/2}. [sent-89, score-0.111]

29 Algorithm 1: A randomized feature selection algorithm for the k-means clustering problem. [sent-93, score-0.316]
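
The following numpy sketch is one illustrative reading of Algorithm 1 as summarized above, not the authors' code; in particular, the function name leverage_score_feature_selection is made up, and the interpretation of the sampling step as r i.i.d. with-replacement trials is an assumption consistent with the d × r sampling matrix S described later in this summary.

```python
import numpy as np

def leverage_score_feature_selection(A, k, r, seed=None):
    """Select and rescale r features of A via leverage-score sampling (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape

    # Step 1: top-k right singular vectors of A (the d x k matrix Vk).
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k].T

    # Step 2: normalized leverage scores; they sum to 1 over the d columns of A.
    p = (Vk ** 2).sum(axis=1) / k

    # Step 3: r i.i.d. trials; in each trial feature i is picked with probability p_i
    # and the kept feature is rescaled by (r * p_i)^(-1/2).
    idx = rng.choice(d, size=r, replace=True, p=p)
    A_tilde = A[:, idx] / np.sqrt(r * p[idx])
    return A_tilde, idx
```

Here r is simply passed in by the caller and plays the role of Θ(k log(k/ϵ)/ϵ²).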

30 This metric of accuracy has been extensively used in the Theoretical Computer Science community in order to analyze approximation algorithms for the k-means clustering problem. [sent-95, score-0.112]

31 Then, an approximation algorithm for the k-means clustering problem would be applied on Ã in order to determine the partition of the rows of A. [sent-99, score-0.203]

32 Definition 2 [K-MEANS APPROXIMATION ALGORITHM] An algorithm is a "γ-approximation" for the k-means clustering problem (γ ≥ 1) if it takes inputs A and k, and returns an indicator matrix Xγ that satisfies, with probability at least 1 − δγ, ∥A − Xγ Xγ^T A∥_F² ≤ γ min_{X∈X} ∥A − XX^T A∥_F². [sent-101, score-0.290]

33 Theorem 1 (see Section 4 for its proof) is our main quality-of-approximation result for our feature selection algorithm. [sent-107, score-0.143]

34 Theorem 1 Let the n×d matrix A and the positive integer k be the inputs of the k-means clustering problem. [sent-108, score-0.234]

35 Let ϵ ∈ (0, 1), and run Algorithm 1 with inputs A, k, and ϵ in order to construct the n × r matrix Ã containing the selected features, where r = Θ(k log(k/ϵ)/ϵ²). [sent-109, score-0.118]

36 If we run any γ-approximation algorithm (γ ≥ 1) for the k-means clustering problem, whose failure probability is δγ, on inputs Ã and k, the resulting cluster indicator matrix X̃γ satisfies, with probability at least 0. [sent-110, score-0.306]
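
To illustrate the pipeline of Theorem 1, the sketch below redoes the leverage-score sampling inline and uses scikit-learn's KMeans as a stand-in for the γ-approximation algorithm of Definition 2 (a Lloyd-type heuristic carries no such guarantee, so this is only an illustration); both partitions are then evaluated on the original matrix A.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(A, labels, k):
    """Sum of squared distances of the rows of A to their cluster centroids."""
    return sum(((A[labels == j] - A[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k) if np.any(labels == j))

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 200))
k, r = 5, 60                                   # r stands in for Theta(k log(k/eps)/eps^2)

# Feature selection as in the Algorithm 1 sketch above.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
p = (Vt[:k].T ** 2).sum(axis=1) / k            # normalized leverage scores
idx = rng.choice(A.shape[1], size=r, replace=True, p=p)
A_tilde = A[:, idx] / np.sqrt(r * p[idx])

# A Lloyd-type heuristic stands in for the gamma-approximation algorithm of Definition 2.
labels_full = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A)
labels_sel = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A_tilde)

# As in Theorem 1, the partition found on A_tilde is evaluated on the original A.
print("cost on all features:      ", kmeans_cost(A, labels_full, k))
print("cost via selected features:", kmeans_cost(A, labels_sel, k))
```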

37 A large number of different techniques have appeared in prior work addressing feature selection in the context of both clustering and classification. [sent-114, score-0.338]

38 Popular feature selection techniques include Laplacian scores [16], Fisher scores [9], and constraint scores [33]. [sent-116, score-0.353]

39 In this section, we opt to discuss only a family of feature selection methods that are closely related to the leverage scores of our algorithm. [sent-117, score-0.305]

40 To the best of our knowledge, all previous feature selection methods come with no theoretical guarantees of the form that we describe here. [sent-118, score-0.143]

41 Given as input an n × d object-feature matrix A and a positive integer k, feature selection for Principal Components Analysis (PCA) corresponds to the task of identifying a subset of k columns from A that capture essentially the same information as do the top k principal components of A. [sent-119, score-0.332]

42 Four of them (called B1, B2, B3, and B4 in [18]) employ the Singular Value Decomposition of A in order to identify columns that are somehow correlated with its top k left singular vectors. [sent-121, score-0.149]

43 In particular, B3 employs exactly the leverage scores in order to greedily select the k columns corresponding to the highest scores; no theoretical results are reported. [sent-122, score-0.194]

44 Another approach employing the matrix of the top k right singular vectors of A and a Procrustes-type criterion appeared in [20]. [sent-124, score-0.266]

45 From an applications perspective, [30] employed the methods of [18] and [20] for gene selection in microarray data analysis. [sent-125, score-0.123]

46 From a complementary viewpoint, feature selection for clustering seeks to identify those features that have the most discriminative power among the set of all features. [sent-126, score-0.293]

47 Finally, note that employing the leverage scores in a randomized manner similar to Algorithm 1 has already been proven to be accurate for least-squares regression [8] and PCA [7, 2]. [sent-128, score-0.232]

48 Recall Definition 1, and notice that Xopt Xopt^T A is a matrix of rank at most k. [sent-131, score-0.117]

49 If the n × d matrix A is projected on the subspace spanned by its top k left singular vectors, then the resulting n × k matrix Â = Uk Σk corresponds to a mapping of the original d-dimensional space to the optimal k-dimensional space. [sent-135, score-0.234]

50 This process is equivalent to feature extraction: the top k left singular vectors (the columns of Uk ) correspond to the constructed features (Σk is a simple rescaling operator). [sent-136, score-0.336]

51 Prior to the work of [6], it was empirically known that running k-means clustering algorithms on the low-dimensional matrix Â was a viable alternative to clustering the high-dimensional matrix A. [sent-137, score-0.358]

52 The work of [6] formally argued that if we let the cluster indicator matrix X̂opt denote the optimal k-means partition on Â, i.e., [sent-138, score-0.142]

53 X̂opt = arg min_{X∈X} ∥Â − XX^T Â∥_F², (6) then using this partition on the rows of the original matrix A gives a 2-approximation to the optimal partition. [sent-140, score-0.196]
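
For comparison with feature selection, here is a minimal sketch of this SVD-based feature extraction baseline attributed to [6] in the summary: cluster the n × k matrix Â = Uk Σk and then evaluate the resulting partition on the original A (again, scikit-learn's KMeans is only a heuristic stand-in).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 200))
k = 5

# Feature extraction: project onto the top-k left singular vectors, A_hat = U_k * Sigma_k.
U, s, _ = np.linalg.svd(A, full_matrices=False)
A_hat = U[:, :k] * s[:k]                    # n x k matrix of constructed features

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A_hat)

# The partition found on A_hat is then used for the rows of the original matrix A.
centroids = np.vstack([A[labels == j].mean(axis=0) for j in range(k)])
cost_on_A = ((A - centroids[labels]) ** 2).sum()
print("k-means cost on A using the partition found on A_hat:", cost_on_A)
```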

54 On the positive side, an obvious advantage of feature selection vs. [sent-146, score-0.143]

55 feature extraction is the immediate interpretability of the former. [sent-147, score-0.096]

56 Let Uk ∈ Rn×k and Vk ∈ Rd×k contain the top k left (resp. right) singular vectors of A, and let Σk ∈ Rk×k be a diagonal matrix containing the top k singular values of A. [sent-154, score-0.265]

57 A+ denotes the pseudo-inverse of A and ||A+ ||2 = σmax (A+ ) = 1/σmin (A), where σmax (X) and σmin (X) denote the largest and the smallest non-zero singular values of a matrix X, respectively. [sent-157, score-0.149]

58 A useful property of matrix norms is that for any two matrices X and Y, ∥XY∥F ≤ ∥X∥F ∥Y∥2 and ∥XY∥F ≤ ∥X∥2 ∥Y∥F; this is a stronger version of the standard submultiplicativity property for matrix norms. [sent-158, score-0.180]
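
A throwaway numpy check of these two norm inequalities, and of the pseudo-inverse identity quoted just above (random matrices, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
Y = rng.standard_normal((4, 5))

fro_XY = np.linalg.norm(X @ Y, "fro")
assert fro_XY <= np.linalg.norm(X, "fro") * np.linalg.norm(Y, 2) + 1e-12
assert fro_XY <= np.linalg.norm(X, 2) * np.linalg.norm(Y, "fro") + 1e-12

# ||X^+||_2 = sigma_max(X^+) = 1 / sigma_min(X) for a full-rank X.
sigma_min = np.linalg.svd(X, compute_uv=False).min()
assert np.isclose(np.linalg.norm(np.linalg.pinv(X), 2), 1.0 / sigma_min)
```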

59 We call P a projector matrix if it is square and P² = P. [sent-159, score-0.127]

60 (Sampling and rescaling matrices.) We introduce a simple matrix formalism in order to conveniently represent the sampling and rescaling processes of Algorithm 1. [sent-166, score-0.262]

61 Let S be a d × r sampling matrix that is constructed as follows: S is initially empty. [sent-167, score-0.119]

62 For each of the r trials, in turn, if the i-th feature of A is selected by the random sampling process described in Algorithm 1, then ei (a column vector of all-zeros, except for its i-th entry, which is set to one) is appended to S. [sent-171, score-0.144]

63 Also, let D be an r × r diagonal rescaling matrix constructed as follows: D is initially an all-zeros matrix. [sent-172, score-0.153]

64 For each of the r trials, in turn, if the i-th feature of A is selected, then the next diagonal entry of D is set to 1/√(r pi). [sent-176, score-0.091]
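
The sketch below makes this formalism concrete (illustrative; the dimensions and r are arbitrary): it builds S and D explicitly, checks that ASD is exactly the matrix of selected and rescaled features, and prints the extreme singular values of Vk^T SD, which are the quantities that Lemma 1 below controls.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k, r = 200, 50, 4, 40
A = rng.standard_normal((n, d))

_, _, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k].T                                  # d x k, top-k right singular vectors
p = (Vk ** 2).sum(axis=1) / k                  # normalized leverage scores

idx = rng.choice(d, size=r, replace=True, p=p) # r i.i.d. trials over the d features

# Sampling matrix S (d x r) and rescaling matrix D (r x r), as described above.
S = np.zeros((d, r))
S[idx, np.arange(r)] = 1.0
D = np.diag(1.0 / np.sqrt(r * p[idx]))

# A S D is exactly the matrix of selected, rescaled features.
assert np.allclose(A @ S @ D, A[:, idx] / np.sqrt(r * p[idx]))

# The quantities controlled by Lemma 1: the singular values of Vk^T S D.
sv = np.linalg.svd(Vk.T @ S @ D, compute_uv=False)
print("largest / smallest singular value of Vk^T S D:", sv.max(), sv.min())
```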

65 (A preliminary lemma and sufficient conditions.) Lemma 1, presented below, gives upper and lower bounds for the largest and the smallest singular values of the matrix Vk^T SD, respectively. [sent-179, score-0.229]

66 Finally, it argues that the matrix ASD can be used to provide a very accurate approximation to the matrix Ak . [sent-181, score-0.166]

67 Lemma 1 provides four sufficient conditions for designing provably accurate feature selection algorithms for k-means clustering. [sent-182, score-0.23]

68 (4) given below, the results of Lemma 1 are sufficient to prove our main theorem; the rest of the arguments apply to all sampling and rescaling matrices S and D. [sent-184, score-0.143]

69 Any sampling matrix S and rescaling matrix D that satisfy bounds similar to those of Lemma 1 can be employed to design a provably accurate feature selection algorithm for k-means clustering. [sent-187, score-0.509]

70 If no rescaling is allowed in the selected features, the bottleneck in the approximation accuracy of a feature selection algorithm would be to find a sampling matrix S such that only ∥(Vk^T S)+∥2 is bounded from above. [sent-189, score-0.361]

71 It is worth emphasizing that the same factor ∥(Vk^T S)+∥2 appeared to be the bottleneck in the design of provably accurate column-based low-rank approximations (see, for example, Theorem 1. [sent-193, score-0.170]

72 It is evident from the above observations that other column sampling methods (see, for example, [17, 3, 2] and references therein), satisfying similar bounds to those of Lemma 1, immediately suggest themselves for the design of provably accurate feature selection algorithms for k-means clustering. [sent-197, score-0.304]

73 5 Lemma 1 Assume that the sampling matrix S and the rescaling matrix D are constructed using Algorithm 1 (see also Section 4. [sent-201, score-0.256]

74 For i = 1, …, d, Pr[y = yi] = pi, where yi = (1/√pi)(Vk)(i) is a realization of the random variable y. [sent-225, score-0.094]

75 This definition of y and the definition of the sampling and rescaling matrices S and D imply that Vk^T SDDS^T Vk = (1/r) ∑_{t=1}^{r} yt^T yt, where y1, …, yr are the r sampled realizations of y. [sent-226, score-0.125]

76 To prove the third statement, we only need to show that the k-th singular value of Vk^T SD is positive. [sent-239, score-0.100]

77 In eqn. (12) we replaced Aρ−k by A − Ak and used the fact that a projector matrix can be dropped without increasing a unitarily invariant norm such as the Frobenius matrix norm. [sent-255, score-0.197]

78 If the first three statements of the lemma hold w. [sent-256, score-0.109]

79 Finally, notice that the first three statements have the same failure probability 1/6 and the fourth statement fails w. [sent-260, score-0.115]

80 Since I − X̃γ X̃γ^T is a projector matrix, it can be dropped; we first bound the second term of eqn. (14). [sent-269, score-0.110]

81 In eqn. (16) we used Lemma 1, the triangle inequality, and the fact that I − Xγ Xγ^T is a projector matrix and can be dropped without increasing a unitarily invariant norm. [sent-276, score-0.239]

82 In eqn. (18) we replaced Xγ by Xopt and the factor γ appeared in the first term. [sent-281, score-0.104]

83 To better understand this step, notice that X̃γ gives a γ-approximation to the optimal k-means clustering of the matrix ASD, and any other n × k indicator matrix (for example, the matrix Xopt) satisfies ∥(I − X̃γ X̃γ^T)ASD∥_F² ≤ γ min_{X∈X} ∥(I − XX^T)ASD∥_F² ≤ γ ∥(I − Xopt Xopt^T)ASD∥_F². [sent-282, score-0.402]

84 In eqn. (19) we first introduced the k × k identity matrix Ik = (Vk^T SD)(Vk^T SD)+ (using rank(Vk^T SD) = k) and then we used submultiplicativity (see Section 4. [sent-284, score-0.094]

85 In eqn. (22) we replaced Ak by A Vk Vk^T and dropped I − Xopt Xopt^T from the second term (I − Xopt Xopt^T is a projector matrix and does not increase the Frobenius norm). [sent-292, score-0.198]

86 In eqn. (23) we dropped the projector matrix Vk Vk^T and used eqn. [sent-294, score-0.177]

87 Note that Theorem 1 fails only if Lemma 1 or the γ-approximation k-means clustering algorithm fail, which happens w. [sent-303, score-0.135]

88 [Figure 1 legend residue: ruyter; all leverage scores; best set (r = 30).] [sent-322, score-0.186]

89 Figure 1: Leverage scores for the NIPS dataset (x-axis: features, 0–7000). [sent-329, score-0.108]

90 We show that it selects the most relevant features (Figure 1) and that the clustering obtained after feature selection is performed is very accurate (Table 1). [sent-331, score-0.345]

91 Notice that only a small subset of features suffices to approximately reproduce the partition obtained when all features were kept. [sent-346, score-0.118]

92 In Figure 1 we plotted the distribution of the leverage scores for the 6314 terms (columns) of A; we also highlighted the features returned by Algorithm 1 when the sampling parameter r is set to 10k. [sent-347, score-0.252]

93 We observed that terms corresponding to the largest leverage scores had significant discriminative power. [sent-348, score-0.162]

94 In particular, ruyter appeared almost exclusively in documents of the first and third categories, hand appeared in documents of the third category, information appeared in documents of the first category, and code appeared in documents of the second and third categories only. [sent-349, score-0.46]

95 Influential observations, high leverage points, and outliers in linear regression. [sent-378, score-0.092]

96 Result analysis of the NIPS 2003 feature selection challenge. [sent-443, score-0.143]

97 A simple linear time (1 + ϵ)-approximation algorithm for k-means clustering in any dimensions. [sent-483, score-0.135]

98 Identifying critical variables of principal components for unsupervised feature selection. [sent-515, score-0.11]

99 Gene selection for microarray data analysis using principal component analysis. [sent-537, score-0.151]

100 Constraint score: A new filter method for feature selection with pairwise constraints. [sent-558, score-0.143]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vk', 0.646), ('xopt', 0.439), ('sd', 0.382), ('asd', 0.178), ('ak', 0.146), ('clustering', 0.112), ('fopt', 0.11), ('leverage', 0.092), ('appeared', 0.083), ('singular', 0.082), ('selection', 0.079), ('rescaling', 0.07), ('scores', 0.07), ('mahoney', 0.069), ('matrix', 0.067), ('feature', 0.064), ('lemma', 0.064), ('lloyd', 0.062), ('projector', 0.06), ('xx', 0.059), ('ik', 0.056), ('provably', 0.055), ('sdds', 0.055), ('dropped', 0.05), ('drineas', 0.048), ('pi', 0.047), ('principal', 0.046), ('statements', 0.045), ('partition', 0.042), ('unitarily', 0.041), ('indicator', 0.04), ('uk', 0.039), ('svd', 0.038), ('randomized', 0.038), ('features', 0.038), ('sampling', 0.036), ('cluster', 0.035), ('frobenius', 0.033), ('accurate', 0.032), ('extraction', 0.032), ('columns', 0.032), ('notice', 0.03), ('inputs', 0.029), ('clusters', 0.029), ('avk', 0.027), ('boutsidis', 0.027), ('cur', 0.027), ('polytechnic', 0.027), ('rensselaer', 0.027), ('rpi', 0.027), ('submultiplicativity', 0.027), ('submultiplicavity', 0.027), ('troy', 0.027), ('theorem', 0.027), ('rows', 0.026), ('integer', 0.026), ('microarray', 0.026), ('soda', 0.026), ('documents', 0.026), ('rescaled', 0.025), ('ruyter', 0.024), ('failure', 0.024), ('nition', 0.024), ('symposium', 0.023), ('annual', 0.023), ('algorithm', 0.023), ('focs', 0.022), ('selected', 0.022), ('objective', 0.022), ('column', 0.022), ('triangle', 0.021), ('nips', 0.021), ('replaced', 0.021), ('yy', 0.021), ('selects', 0.02), ('rank', 0.02), ('proof', 0.02), ('guyon', 0.02), ('ces', 0.019), ('min', 0.019), ('matrices', 0.019), ('algebra', 0.019), ('log', 0.018), ('prove', 0.018), ('notation', 0.018), ('patients', 0.018), ('top', 0.018), ('norm', 0.018), ('gene', 0.018), ('co', 0.017), ('somehow', 0.017), ('discarding', 0.017), ('constructs', 0.017), ('returned', 0.016), ('constructed', 0.016), ('suf', 0.016), ('bounds', 0.016), ('vectors', 0.016), ('surveys', 0.016), ('statement', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

Author: Christos Boutsidis, Petros Drineas, Michael W. Mahoney

Abstract: We present a novel feature selection algorithm for the k-means clustering problem. Our algorithm is randomized and, assuming an accuracy parameter ϵ ∈ (0, 1), selects and appropriately rescales in an unsupervised manner Θ(k log(k/ϵ)/ϵ²) features from a dataset of arbitrary dimensions. We prove that, if we run any γ-approximate k-means algorithm (γ ≥ 1) on the features selected using our method, we can find a (1 + (1 + ϵ)γ)-approximate partition with high probability.

2 0.11574566 54 nips-2009-Compositionality of optimal control laws

Author: Emanuel Todorov

Abstract: We present a theory of compositionality in stochastic optimal control, showing how task-optimal controllers can be constructed from certain primitives. The primitives are themselves feedback controllers pursuing their own agendas. They are mixed in proportion to how much progress they are making towards their agendas and how compatible their agendas are with the present task. The resulting composite control law is provably optimal when the problem belongs to a certain class. This class is rather general and yet has a number of unique properties – one of which is that the Bellman equation can be made linear even for non-linear or discrete dynamics. This gives rise to the compositionality developed here. In the special case of linear dynamics and Gaussian noise our framework yields analytical solutions (i.e. non-linear mixtures of LQG controllers) without requiring the final cost to be quadratic. More generally, a natural set of control primitives can be constructed by applying SVD to Green’s function of the Bellman equation. We illustrate the theory in the context of human arm movements. The ideas of optimality and compositionality are both very prominent in the field of motor control, yet they have been difficult to reconcile. Our work makes this possible.

3 0.092064753 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

Author: John Wright, Arvind Ganesh, Shankar Rao, Yigang Peng, Yi Ma

Abstract: Principal component analysis is a fundamental operation in computational data analysis, with myriad applications ranging from web search to bioinformatics to computer vision and image analysis. However, its performance and applicability in real scenarios are limited by a lack of robustness to outlying or corrupted observations. This paper considers the idealized “robust principal component analysis” problem of recovering a low rank matrix A from corrupted observations D = A + E. Here, the corrupted entries E are unknown and the errors can be arbitrarily large (modeling grossly corrupted observations common in visual and bioinformatic data), but are assumed to be sparse. We prove that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns by solving a simple convex program, for which we give a fast and provably convergent algorithm. Our result holds even when the rank of A grows nearly proportionally (up to a logarithmic factor) to the dimensionality of the observation space and the number of errors E grows in proportion to the total number of entries in the matrix. A by-product of our analysis is the first proportional growth results for the related problem of completing a low-rank matrix from a small fraction of its entries. Simulations and real-data examples corroborate the theoretical results, and suggest potential applications in computer vision. 1

4 0.08032345 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

Author: Samuel R. Bulò, Marcello Pelillo

Abstract: Hypergraph clustering refers to the process of extracting maximally coherent groups from a set of objects using high-order (rather than pairwise) similarities. Traditional approaches to this problem are based on the idea of partitioning the input data into a user-defined number of classes, thereby obtaining the clusters as a by-product of the partitioning process. In this paper, we provide a radically different perspective to the problem. In contrast to the classical approach, we attempt to provide a meaningful formalization of the very notion of a cluster and we show that game theory offers an attractive and unexplored perspective that serves well our purpose. Specifically, we show that the hypergraph clustering problem can be naturally cast into a non-cooperative multi-player “clustering game”, whereby the notion of a cluster is equivalent to a classical game-theoretic equilibrium concept. From the computational viewpoint, we show that the problem of finding the equilibria of our clustering game is equivalent to locally optimizing a polynomial function over the standard simplex, and we provide a discrete-time dynamics to perform this optimization. Experiments are presented which show the superiority of our approach over state-of-the-art hypergraph clustering techniques.

5 0.07475213 147 nips-2009-Matrix Completion from Noisy Entries

Author: Raghunandan Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced in [1], based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. 1

6 0.072245009 234 nips-2009-Streaming k-means approximation

7 0.071510755 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models

8 0.06290894 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

9 0.056630213 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

10 0.052736953 133 nips-2009-Learning models of object structure

11 0.050040673 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

12 0.048822995 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

13 0.047981169 157 nips-2009-Multi-Label Prediction via Compressed Sensing

14 0.047229271 137 nips-2009-Learning transport operators for image manifolds

15 0.047081571 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

16 0.045998506 223 nips-2009-Sparse Metric Learning via Smooth Optimization

17 0.044120517 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

18 0.040725145 21 nips-2009-Abstraction and Relational learning

19 0.040478073 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

20 0.039759818 122 nips-2009-Label Selection on Graphs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.131), (1, 0.042), (2, -0.013), (3, -0.001), (4, -0.017), (5, -0.022), (6, 0.019), (7, -0.051), (8, 0.041), (9, -0.002), (10, 0.005), (11, -0.042), (12, 0.031), (13, -0.017), (14, -0.058), (15, -0.04), (16, 0.091), (17, -0.099), (18, -0.103), (19, 0.007), (20, 0.031), (21, 0.02), (22, -0.013), (23, -0.027), (24, -0.002), (25, -0.051), (26, -0.105), (27, -0.097), (28, -0.065), (29, 0.009), (30, 0.083), (31, -0.03), (32, 0.01), (33, -0.0), (34, 0.123), (35, -0.078), (36, -0.002), (37, -0.044), (38, 0.027), (39, 0.029), (40, 0.0), (41, 0.109), (42, 0.083), (43, -0.069), (44, 0.005), (45, -0.013), (46, 0.082), (47, 0.136), (48, 0.218), (49, -0.143)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93225271 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

Author: Christos Boutsidis, Petros Drineas, Michael W. Mahoney

Abstract: We present a novel feature selection algorithm for the k-means clustering problem. Our algorithm is randomized and, assuming an accuracy parameter ϵ ∈ (0, 1), selects and appropriately rescales in an unsupervised manner Θ(k log(k/ϵ)/ϵ2 ) features from a dataset of arbitrary dimensions. We prove that, if we run any γ-approximate k-means algorithm (γ ≥ 1) on the features selected using our method, we can find a (1 + (1 + ϵ)γ)-approximate partition with high probability. 1

2 0.58456069 54 nips-2009-Compositionality of optimal control laws

Author: Emanuel Todorov

Abstract: We present a theory of compositionality in stochastic optimal control, showing how task-optimal controllers can be constructed from certain primitives. The primitives are themselves feedback controllers pursuing their own agendas. They are mixed in proportion to how much progress they are making towards their agendas and how compatible their agendas are with the present task. The resulting composite control law is provably optimal when the problem belongs to a certain class. This class is rather general and yet has a number of unique properties – one of which is that the Bellman equation can be made linear even for non-linear or discrete dynamics. This gives rise to the compositionality developed here. In the special case of linear dynamics and Gaussian noise our framework yields analytical solutions (i.e. non-linear mixtures of LQG controllers) without requiring the final cost to be quadratic. More generally, a natural set of control primitives can be constructed by applying SVD to Green’s function of the Bellman equation. We illustrate the theory in the context of human arm movements. The ideas of optimality and compositionality are both very prominent in the field of motor control, yet they have been difficult to reconcile. Our work makes this possible.

3 0.52314335 234 nips-2009-Streaming k-means approximation

Author: Nir Ailon, Ragesh Jaiswal, Claire Monteleoni

Abstract: We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very light-weight in terms of memory, and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method. 1

4 0.51596177 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

Author: Samuel R. Bulò, Marcello Pelillo

Abstract: Hypergraph clustering refers to the process of extracting maximally coherent groups from a set of objects using high-order (rather than pairwise) similarities. Traditional approaches to this problem are based on the idea of partitioning the input data into a user-defined number of classes, thereby obtaining the clusters as a by-product of the partitioning process. In this paper, we provide a radically different perspective to the problem. In contrast to the classical approach, we attempt to provide a meaningful formalization of the very notion of a cluster and we show that game theory offers an attractive and unexplored perspective that serves well our purpose. Specifically, we show that the hypergraph clustering problem can be naturally cast into a non-cooperative multi-player “clustering game”, whereby the notion of a cluster is equivalent to a classical game-theoretic equilibrium concept. From the computational viewpoint, we show that the problem of finding the equilibria of our clustering game is equivalent to locally optimizing a polynomial function over the standard simplex, and we provide a discrete-time dynamics to perform this optimization. Experiments are presented which show the superiority of our approach over state-of-the-art hypergraph clustering techniques.

5 0.466241 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models

Author: Arthur Gretton, Peter Spirtes, Robert E. Tillman

Abstract: The recently proposed additive noise model has advantages over previous directed structure learning approaches since it (i) does not assume linearity or Gaussianity and (ii) can discover a unique DAG rather than its Markov equivalence class. However, for certain distributions, e.g. linear Gaussians, the additive noise model is invertible and thus not useful for structure learning, and it was originally proposed for the two variable case with a multivariate extension which requires enumerating all possible DAGs. We introduce weakly additive noise models, which extends this framework to cases where the additive noise model is invertible and when additive noise is not present. We then provide an algorithm that learns an equivalence class for such models from data, by combining a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the Markov equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. 1

6 0.4351739 182 nips-2009-Optimal Scoring for Unsupervised Learning

7 0.43118504 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

8 0.40726569 147 nips-2009-Matrix Completion from Noisy Entries

9 0.4053492 81 nips-2009-Ensemble Nystrom Method

10 0.37989175 148 nips-2009-Matrix Completion from Power-Law Distributed Samples

11 0.36352488 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

12 0.36213145 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

13 0.35845372 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

14 0.35686278 67 nips-2009-Directed Regression

15 0.34575352 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

16 0.34406748 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction

17 0.34148261 177 nips-2009-On Learning Rotations

18 0.32804951 51 nips-2009-Clustering sequence sets for motif discovery

19 0.32199991 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

20 0.31816292 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.013), (21, 0.011), (24, 0.051), (25, 0.052), (35, 0.054), (36, 0.117), (37, 0.02), (39, 0.067), (47, 0.27), (58, 0.117), (71, 0.034), (81, 0.029), (86, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.78123033 46 nips-2009-Bilinear classifiers for visual recognition

Author: Hamed Pirsiavash, Deva Ramanan, Charless C. Fowlkes

Abstract: We describe an algorithm for learning bilinear SVMs. Bilinear classifiers are a discriminative variant of bilinear models, which capture the dependence of data on multiple factors. Such models are particularly appropriate for visual data that is better represented as a matrix or tensor, rather than a vector. Matrix encodings allow for more natural regularization through rank restriction. For example, a rank-one scanning-window classifier yields a separable filter. Low-rank models have fewer parameters and so are easier to regularize and faster to score at run-time. We learn low-rank models with bilinear classifiers. We also use bilinear classifiers for transfer learning by sharing linear factors between different classification tasks. Bilinear classifiers are trained with biconvex programs. Such programs are optimized with coordinate descent, where each coordinate step requires solving a convex program - in our case, we use a standard off-the-shelf SVM solver. We demonstrate bilinear SVMs on difficult problems of people detection in video sequences and action classification of video sequences, achieving state-of-the-art results in both. 1

same-paper 2 0.77830285 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

Author: Christos Boutsidis, Petros Drineas, Michael W. Mahoney

Abstract: We present a novel feature selection algorithm for the k-means clustering problem. Our algorithm is randomized and, assuming an accuracy parameter ϵ ∈ (0, 1), selects and appropriately rescales in an unsupervised manner Θ(k log(k/ϵ)/ϵ²) features from a dataset of arbitrary dimensions. We prove that, if we run any γ-approximate k-means algorithm (γ ≥ 1) on the features selected using our method, we can find a (1 + (1 + ϵ)γ)-approximate partition with high probability.

3 0.77724814 161 nips-2009-Nash Equilibria of Static Prediction Games

Author: Michael Brückner, Tobias Scheffer

Abstract: The standard assumption of identically distributed training and test data is violated when an adversary can exercise some control over the generation of the test data. In a prediction game, a learner produces a predictive model while an adversary may alter the distribution of input data. We study single-shot prediction games in which the cost functions of learner and adversary are not necessarily antagonistic. We identify conditions under which the prediction game has a unique Nash equilibrium, and derive algorithms that will find the equilibrial prediction models. In a case study, we explore properties of Nash-equilibrial prediction models for email spam filtering empirically. 1

4 0.59581256 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

Author: Novi Quadrianto, John Lim, Dale Schuurmans, Tibério S. Caetano

Abstract: We develop a convex relaxation of maximum a posteriori estimation of a mixture of regression models. Although our relaxation involves a semidefinite matrix variable, we reformulate the problem to eliminate the need for general semidefinite programming. In particular, we provide two reformulations that admit fast algorithms. The first is a max-min spectral reformulation exploiting quasi-Newton descent. The second is a min-min reformulation consisting of fast alternating steps of closed-form updates. We evaluate the methods against Expectation-Maximization in a real problem of motion segmentation from video data. 1

5 0.59375161 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

Author: Jaakko Luttinen, Alexander T. Ihler

Abstract: We present a probabilistic factor analysis model which can be used for studying spatio-temporal datasets. The spatial and temporal structure is modeled by using Gaussian process priors both for the loading matrix and the factors. The posterior distributions are approximated using the variational Bayesian framework. High computational cost of Gaussian process modeling is reduced by using sparse approximations. The model is used to compute the reconstructions of the global sea surface temperatures from a historical dataset. The results suggest that the proposed model can outperform the state-of-the-art reconstruction systems.

6 0.59285492 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

7 0.590473 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

8 0.59045875 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

9 0.5886761 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

10 0.58831704 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

11 0.58808088 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

12 0.58792782 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

13 0.58779627 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

14 0.58661246 104 nips-2009-Group Sparse Coding

15 0.58611995 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

16 0.58521664 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

17 0.58221126 76 nips-2009-Efficient Learning using Forward-Backward Splitting

18 0.58135676 157 nips-2009-Multi-Label Prediction via Compressed Sensing

19 0.58135557 70 nips-2009-Discriminative Network Models of Schizophrenia

20 0.58041185 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction