jmlr jmlr2013 jmlr2013-100 knowledge-graph by maker-knowledge-mining

100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization


Source: pdf

Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel

Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA. Editor: Inderjit Dhillon. Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. [sent-9, score-0.228]

2 To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. [sent-10, score-0.268]

3 Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. [sent-14, score-0.242]

4 Many clustering methods can be interpreted in terms of a matrix factorization problem. [sent-17, score-0.223]

5 For example, the popular k-means clustering algorithm attempts to solve the k-means problem: produce a clustering such that the sum of squared error between samples and the mean of their cluster is small (Hastie et al. [sent-18, score-0.353]

6 For n feature vectors gathered as the d-dimensional columns of a matrix X ∈ R^{d×n}, the k-means problem can be written as a matrix factorization: minimize over F ∈ R^{d×k} and G ∈ R^{k×n} the objective ‖X − FG‖_F^2 (1), subject to G ∈ {0, 1}^{k×n}. [sent-20, score-0.176]
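As an illustration of formulation (1), here is a minimal sketch in Python with numpy (all function and variable names are mine, not from the paper's code) that alternates between the hard-assignment step encoded in G and the centroid update encoded in F:

    import numpy as np

    def kmeans_factorization(X, k, n_iter=100, seed=0):
        """Approximate (1): minimize ||X - F G||_F^2 with G a hard assignment matrix."""
        d, n = X.shape
        rng = np.random.default_rng(seed)
        F = X[:, rng.choice(n, size=k, replace=False)]        # initial centroids, d x k
        for _ in range(n_iter):
            # Assignment step: each column of G indicates the nearest centroid.
            dists = ((X[:, None, :] - F[:, :, None]) ** 2).sum(axis=0)   # k x n
            G = np.zeros((k, n))
            G[dists.argmin(axis=0), np.arange(n)] = 1.0
            # Update step: columns of F become the means of the assigned samples.
            counts = np.maximum(G.sum(axis=1), 1.0)
            F = (X @ G.T) / counts
        return F, G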

7 The matrix F can be interpreted as a matrix whose columns are the k cluster centroids. [sent-23, score-0.251]

8 The soft k-means problem instead solves for a cluster probability matrix P that specifies the probability of each sample belonging to each of the clusters: minimize over F ∈ R^{d×k} and P ∈ R^{k×n} the objective ‖X − FP‖_F^2 (2), subject to P ≥ 0 and P^T 1_k = 1_n, where the inequality P ≥ 0 is entrywise. [sent-29, score-0.191]

9 Following their work, other researchers have posed different matrix factorization models, and attempted to explicitly solve the resulting matrix factorization problems to form clusterings. [sent-33, score-0.188]

10 Convex NMF restricts the columns of F (the cluster centroids) to be convex combinations of the columns of X: minimize over W ∈ R^{n×k}_+ and G ∈ R^{k×n}_+ the objective ‖X − XWG‖_F^2. [sent-42, score-0.211]

11 , 2010); for an input kernel matrix K, the kernel convex NMF solves: minimize over W ∈ R^{n×k}_+ and G ∈ R^{k×n}_+ the objective tr(K − 2KWG + G^T W^T KWG). [sent-44, score-0.166]

12 In this paper, we propose a new non-negative matrix factorization (NNMF) model for clustering from pairwise similarities between the samples. [sent-45, score-0.241]

13 Experimental results are presented in Section 5, and show that the LSD clustering model performs well in terms of both within-class similarity and misclassification rate. [sent-52, score-0.208]

14 Relax the ideal kernel’s assumption that each sample belongs to only one class, and let P ∈ [0, 1]k×n be interpreted as a soft cluster assignment matrix for n samples and k clusters. [sent-60, score-0.171]

15 We use the LSD model for clustering by solving a non-negative matrix factorization problem for the best cluster assignment matrix P. [sent-65, score-0.372]

16 That is, given a matrix K, we propose finding a scaling factor c and a cluster probability matrix P that best solve: minimize over c ∈ R_+ and P ∈ [0, 1]^{k×n} the objective ‖K − (1/c) P^T P‖_F^2 (3), subject to P^T 1_k = 1_n. [sent-66, score-0.223]
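A hedged sketch of how the objective in (3) and the resulting LSD clustering can be evaluated for a candidate factor P; the scaling c is taken as an input here rather than computed from Proposition 2's closed form, and the names are illustrative:

    import numpy as np

    def lsd_objective(K, P, c):
        """Frobenius error ||K - (1/c) P^T P||_F^2 from Equation (3)."""
        residual = K - (P.T @ P) / c
        return np.sum(residual ** 2)

    def lsd_clustering(P):
        """Assign sample j to the cluster i* that maximizes P[i*, j]."""
        return P.argmax(axis=0)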

17 LSD Clustering: Given a solution P̂ to (3), an LSD clustering is an assignment of samples to clusters such that if the jth sample is assigned to cluster i∗, then P̂_{i∗ j} ≥ P̂_{i j} for all i. [sent-67, score-0.262]

18 Proposition 2 gives a closed-form solution for such an optimal c, and Theorem 3 states that given such a c, the optimal matrix P is a rotation of the matrix √c Z. [sent-97, score-0.237]

19 Theorem 4 (Unique LSD Clustering) Let Hu be the subgroup of the rotation group SO(k) in R^k that leaves the vector u = (1/√k)[1, …, 1]^T invariant. [sent-143, score-0.161]

20 , n}, of rotation matrices that rotate all columns of a given LSD factor P into the probability simplex (LSD) ∆k ⊂ Rk . [sent-150, score-0.378]

21 The columns of P∗ are shown as points on the probability simplex for n = 15 samples. [sent-157, score-0.169]

22 The Y-shaped thick gray lines show the clustering decision regions, and separate the 15 points into three clusters marked by ‘o’s, ‘+’s, and ‘x’s. [sent-158, score-0.167]

23 One can rotate the columns of P∗ about the center of the simplex u to form a different probability matrix P̂ = R_u P∗, but the inner product does not change, P̂^T P̂ = P∗^T P∗ = K; thus any such rotation is another LSD solution. [sent-159, score-0.432]

24 One can rotate clockwise by at most angle βc (and anti-clockwise βac ) before a point crosses a cluster decision boundary, which changes the clustering. [sent-160, score-0.193]

25 That is, if one cannot rotate any point across a cluster boundary without pushing another point out of the probability simplex, then the LSD clustering is unique. [sent-163, score-0.304]

26 The simple k = 2 clustering can be used to produce a fast hierarchical binary-tree clustering, which we discuss in Section 3. [sent-176, score-0.163]

27 1), (ii) rotate the matrix factor to enforce the left-stochastic constraint, which puts each column of the matrix factor in the same plane as the probability simplex (see Section 3. [sent-197, score-0.325]

28 2) matrix comprising columns of Q projected onto the probability simplex (see Section 3. [sent-213, score-0.247]

29 2) group of k × k rotation matrices embedding of SO(k − 1) into SO(k) as the isotropy subgroup of u ∈ Rk (see Section 3. [sent-215, score-0.177]

30 The scaled similarity matrix is factorized as K′ ≈ M^T M, where M = [√λ_1 v_1 √λ_2 v_2 · · · √λ_k v_k]^T is a k × n real matrix comprising scaled eigenvectors corresponding to the top k eigenvalues. [sent-222, score-0.221]
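Step 1 can be sketched as follows, assuming the scaled similarity matrix K′ is symmetric; numpy's eigh returns eigenvalues in ascending order, so the top k are taken from the end, and small negative eigenvalues are clipped before taking square roots (a choice of mine, not spelled out in this summary):

    import numpy as np

    def top_k_factor(K_prime, k):
        """Return M (k x n) with K' approximately equal to M^T M, from the top k eigenpairs."""
        eigvals, eigvecs = np.linalg.eigh(K_prime)      # ascending eigenvalues
        idx = np.argsort(eigvals)[::-1][:k]             # indices of the k largest
        lam = np.clip(eigvals[idx], 0.0, None)          # guard against negative eigenvalues
        V = eigvecs[:, idx]                             # n x k
        return np.sqrt(lam)[:, None] * V.T              # rows are sqrt(lambda_i) * v_i^T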

31 Step 2: Enforce Left-Stochastic Constraint. The n columns of M can be seen as points in k-dimensional Euclidean space, and the objective of the LSD algorithm is to rotate these points to lie inside the probability simplex in R^k. [sent-227, score-0.279]

32 Let m denote the normal to the hyperplane H, and let M̃ denote the matrix obtained after projection of the columns of M onto a hyperplane perpendicular to m and 1/√k units away from the origin (see Figure 2(a) for an illustration and Algorithm 1 for details). [sent-230, score-0.234]
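One way to realize this projection: remove the component of each column of M along the unit normal m/‖m‖_2 and shift the result onto the hyperplane sitting 1/√k units from the origin. This is a sketch under that reading of the text; how m itself is chosen (Algorithm 1) is not reproduced here:

    import numpy as np

    def project_to_hyperplane(M, m):
        """Project each column of M onto the hyperplane with unit normal m/||m||_2
        located 1/sqrt(k) units from the origin."""
        k = M.shape[0]
        m_hat = m / np.linalg.norm(m)
        offsets = m_hat @ M - 1.0 / np.sqrt(k)          # signed distance of each column
        return M - np.outer(m_hat, offsets)             # columns now lie on the hyperplane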

33 Next, the columns of M̃ are rotated by a rotation matrix Rs that rotates the unit vector m/‖m‖_2 ∈ R^k about the origin to coincide with the unit vector u = (1/√k)[1, …, 1]^T. [sent-231, score-0.257]

34 The rotation matrix Rs is computed from a Givens rotation, as described in Subroutine 1. [sent-235, score-0.183]

35 This rotation matrix acts on M̃ to give Q = Rs M̃, whose columns are points in R^k that lie on the hyperplane containing the probability simplex. [sent-236, score-0.3]

36 Then, as shown in Figure 2b, we rotate the columns of Q about the normal u to best fit the points inside the probability simplex (some projection onto the simplex may be needed), mapping the red circles to black crosses. [sent-249, score-0.424]

37 For k = 2, we do not need Step 3 which finds the rotation about the normal to the simplex that rotates each column of the matrix factor into the simplex. [sent-253, score-0.346]

38 Step 3: Enforce Non-negativity Constraint. For k > 2, a final step is needed in which we rotate the columns of Q about u in the hyperplane containing the simplex to fit each column of Q into the simplex (see Figure 2(b) for an illustration). [sent-257, score-0.44]

39 m] (shift columns 1/√k units in the direction of m/‖m‖_2). Compute a rotation Rs = Rotate Givens(m, u) (see Subroutine 1 or, if k = 2, the formula in Section 3. [sent-264, score-0.177]

40 Note that it may not be possible to rotate each point into the simplex and therefore in such cases we would require a projection step onto the simplex after the algorithm for learning Ru has converged. [sent-267, score-0.346]
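The projection step mentioned here can use the standard sort-and-threshold algorithm for Euclidean projection onto the probability simplex; a hedged sketch (this summary does not specify which projection the paper uses, so this is one reasonable choice):

    import numpy as np

    def project_to_simplex(y):
        """Euclidean projection of y onto {x : x >= 0, sum(x) = 1}."""
        u = np.sort(y)[::-1]                            # sort in decreasing order
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u - css / (np.arange(len(y)) + 1.0) > 0)[0][-1]
        theta = css[rho] / (rho + 1.0)
        return np.maximum(y - theta, 0.0)

Applying it column by column to Q maps each sample's coordinates to a valid probability vector.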

41 Following Theorem 3(b), Subroutine 2 tries to find the rotation that best fits the columns of the matrix Q inside the probability simplex by solving the following problem: minimize over R ∈ SO(k) the objective ‖K′ − (RQ)_∆^T (RQ)_∆‖_F^2 (12), subject to Ru = u. [sent-269, score-0.387]

42 The optimization is over all k × k rotation matrices R that leave u invariant (the isotropy subgroup of u) ensuring that the columns of RQ stay on the hyperplane containing the probability simplex. [sent-271, score-0.279]

43 It is easy to check that any rotation matrix R that leaves a vector u ∈ Rk invariant can be written as ψu (g) for some rotation matrix g ∈ SO(k − 1). [sent-278, score-0.366]

44 We have thus exploited the invariance property of the isotropy subgroup to reduce our search space to the set of all (k − 1) × (k − 1) rotation matrices. [sent-279, score-0.177]
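One concrete way to realize such an embedding ψ_u (a sketch of the standard construction, not a transcription of the paper's Subroutine 2): complete u/‖u‖_2 to an orthonormal basis B whose last column is u/‖u‖_2, and conjugate the block-diagonal matrix diag(g, 1) by B; any R built this way satisfies Ru = u and det(R) = 1:

    import numpy as np

    def embed_rotation(g, u):
        """Map g in SO(k-1) to R in SO(k) with R u = u."""
        k = len(u)
        u_hat = u / np.linalg.norm(u)
        # Complete u_hat to an orthonormal basis of R^k (first column of Q is +/- u_hat).
        Q, _ = np.linalg.qr(np.column_stack([u_hat, np.eye(k)]))
        if Q[:, 0] @ u_hat < 0:
            Q[:, 0] = -Q[:, 0]                          # fix the sign chosen by QR
        B = np.column_stack([Q[:, 1:], Q[:, 0]])        # last basis vector is u_hat
        D = np.eye(k)
        D[: k - 1, : k - 1] = g                         # g acts on the complement of u_hat
        return B @ D @ B.T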

45 Let g(t) denote the estimate of the optimal rotation matrix at iteration t. [sent-282, score-0.183]

46 Note that xi represents the column qi after rotation by the current estimate ψ(g(t) ) and yi is the projection of xi onto the probability simplex. [sent-291, score-0.169]

47 Form the Givens rotation matrix by setting: (RG )11 = (RG )22 = uT m, (RG )21 = −(RG )12 = uT v. [sent-306, score-0.183]

48 Output: Rs (a rotation matrix such that Rs m/‖m‖_2 = u/‖u‖_2). [sent-308, score-0.183]
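A hedged sketch of the idea behind Subroutine 1: build a rotation that maps one unit vector onto another while acting only in the two-dimensional plane they span. The construction below is a standard one and is not claimed to match the paper's Rotate Givens line by line:

    import numpy as np

    def rotation_between(a, b):
        """Rotation R with R (a/||a||) = b/||b||, identity on the complement of span{a, b}."""
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        c = float(a @ b)                                # cosine of the rotation angle
        w = b - c * a
        s = np.linalg.norm(w)                           # sine of the rotation angle
        if s < 1e-12:
            return np.eye(len(a))                       # already aligned (antiparallel case not handled)
        w = w / s
        k = len(a)
        return (np.eye(k)
                + s * (np.outer(w, a) - np.outer(a, w))
                + (c - 1.0) * (np.outer(a, a) + np.outer(w, w)))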

49 q_n] ∈ R^{k×n} with columns lying in the probability simplex hyperplane; maximum number of batch iterations ITER. Initialize ψ(g_0) and R̂_u as k × k identity matrices. [sent-312, score-0.169]

50 Compute rotation matrix R_{ue} = Rotate Givens(u, e) ∈ R^{k×k}, where u = (1/√k)[1, …, 1]^T and e = [0, . [sent-313, score-0.183]

51 This top-down hierarchical LSD clustering requires running the LSD algorithm for k = 2 a total of k − 1 times. [sent-342, score-0.163]
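Stated generically, the top-down hierarchy looks like the sketch below, where two_way_split stands in for the k = 2 LSD step and the cluster chosen for splitting is the one with the smallest average within-cluster similarity (the criterion described later for the modified DIANA; the exact criterion used by hierarchical LSD is not restated in this summary, so treat this as an assumption):

    import numpy as np

    def hierarchical_clustering(K, k, two_way_split):
        """Top-down hierarchy: perform k - 1 binary splits of an n x n similarity matrix K."""
        n = K.shape[0]
        clusters = [np.arange(n)]
        while len(clusters) < k:
            scores = [K[np.ix_(c, c)].mean() for c in clusters]
            c = clusters.pop(int(np.argmin(scores)))    # split the least coherent cluster
            labels = two_way_split(K[np.ix_(c, c)])     # 0/1 labels on the sub-matrix
            clusters.append(c[labels == 0])
            clusters.append(c[labels == 1])
        assignment = np.empty(n, dtype=int)
        for i, c in enumerate(clusters):
            assignment[c] = i
        return assignment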

52 In Section 5, we show experimentally that for large k the runtime of hierarchical LSD clustering may be orders of magnitude faster than other clustering algorithms that produce similar within-cluster similarities. [sent-344, score-0.292]

53 Note that in the LSD algorithm, R is forced to be a rotation matrix (which is easy to do with a small variation of the first step). [sent-360, score-0.183]

54 Also, at the end of each iteration k, the data X is also updated as Xk+1 = Xk Rk , which means the rotation is applied to the data, and we look for further rotation to move our data points into C . [sent-361, score-0.258]

55 5 Related Clustering Algorithms The proposed rotation-based LSD has some similarities to spectral clustering (von Luxburg, 2006; Ng et al. [sent-364, score-0.174]

56 Our algorithm begins with an eigenvalue decomposition, and then we work with the eigenvaluescaled eigenvectors of the similarity matrix K. [sent-366, score-0.167]

57 Spectral clustering instead acts on the eigenvectors of the graph Laplacian of K (normalized spectral clustering acts on the eigenvectors of the normalized graph Laplacian of K). [sent-367, score-0.353]

58 For k > 2 clusters, Perron cluster analysis linearly maps their n × k eigenvector matrix to the probability simplex to form a soft cluster assignment. [sent-370, score-0.387]

59 In a somewhat similar step, we rotate an n × k matrix factorization to the probability simplex. [sent-371, score-0.174]

60 Experiments We compared the LSD and hierarchical LSD clustering to nine other clustering algorithms: kernel convex NMF (Ding et al. [sent-375, score-0.325]

61 , 2009), and the classic DIANA hierarchical clustering method (MacNaughton-Smith et al. [sent-378, score-0.163]

62 In addition, we compared with hierarchical variants of the other clustering algorithms using the same splitting strategy as used in the proposed hierarchical LSD. [sent-380, score-0.197]

63 Recall that k-means clustering iteratively assigns each of the samples to the cluster whose mean is the closest. [sent-391, score-0.224]

64 Similarly, when running the k-means algorithm as a subroutine of the spectral clustering variants, we used Matlab’s kmeans function with 100 random starts and chose the solution that best optimized the k-means objective, that is, within-cluster scatter. [sent-396, score-0.209]

65 We modified it to take a similarity matrix instead, as follows: At each iteration, we split the cluster with the smallest average within-cluster similarity. [sent-406, score-0.228]

66 First, we choose the point x1 ∈ C that has the smallest average similarity to all the other points in the cluster and place it in a new cluster Cnew and set Cold = C\{x1 }. [sent-408, score-0.269]

67 We continue this process until the expression in (16) is non-positive for all y ∈ Cold ; that is, until there are no points in the old cluster that have a larger average similarity to points in the new cluster, as compared to remaining points in the old cluster. [sent-412, score-0.174]
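The splitting procedure just described can be sketched as follows (a loose implementation of the text above; expression (16) itself is not reproduced in this summary, so the stopping rule is my reading of it):

    import numpy as np

    def diana_similarity_split(K, cluster):
        """Split one cluster: seed C_new with the least-similar point, then move points
        from C_old while they are more similar on average to C_new than to the rest of C_old."""
        cluster = list(cluster)
        avg_sim = [np.mean([K[i, j] for j in cluster if j != i]) for i in cluster]
        seed = cluster[int(np.argmin(avg_sim))]         # point least similar to the others
        c_new, c_old = [seed], [i for i in cluster if i != seed]
        moved = True
        while moved and len(c_old) > 1:
            moved = False
            for y in list(c_old):
                rest = [j for j in c_old if j != y]
                gain = np.mean(K[y, c_new]) - np.mean(K[y, rest])
                if gain > 0:                            # y prefers the new cluster
                    c_old.remove(y)
                    c_new.append(y)
                    moved = True
        return c_new, c_old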

68 2 Metrics There is no single approach to judge whether a clustering is “good,” as the goodness of the clustering depends on what one is looking for. [sent-417, score-0.258]

69 Average Within-Cluster Similarity. One common goal of clustering algorithms is to maximize the similarity between points within the same cluster, or equivalently, to minimize the similarity between points lying in different clusters. [sent-421, score-0.307]
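A straightforward way to compute this quantity from a similarity matrix K and a label vector; the paper's exact normalization (Equation (17)) is not restated here, so averaging over within-cluster pairs is one plausible reading:

    import numpy as np

    def avg_within_cluster_similarity(K, labels):
        """Average similarity over all distinct pairs of samples sharing a cluster label."""
        total, count = 0.0, 0
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            if len(idx) < 2:
                continue
            block = K[np.ix_(idx, idx)]
            total += (block.sum() - np.trace(block)) / 2.0   # off-diagonal pairs only
            count += len(idx) * (len(idx) - 1) // 2
        return total / count if count else 0.0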

70 (2010), the misclassification rate for a clustering is defined to be the smallest misclassification rate over all permutations of cluster labels. [sent-428, score-0.224]
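The minimum over all label permutations can be computed without enumerating k! permutations by solving a linear assignment problem on the confusion matrix; a sketch using scipy's Hungarian-algorithm routine:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def misclassification_rate(true_labels, cluster_labels):
        """Smallest misclassification rate over all matchings of clusters to classes."""
        classes = np.unique(true_labels)
        clusters = np.unique(cluster_labels)
        counts = np.array([[np.sum((cluster_labels == cl) & (true_labels == tc))
                            for tc in classes] for cl in clusters])
        row, col = linear_sum_assignment(-counts)       # maximize correctly matched samples
        return 1.0 - counts[row, col].sum() / len(true_labels)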

71 A clustering method asked to find 47 clusters might not pick up on the author-clustering, but might instead produce a clustering indicative of sub-genre. [sent-434, score-0.296]

72 Experimental results with the kernel convex NMF code were generally not as good with the full similarity matrix as with the nearest PSD matrix, as suggested by the theory. [sent-442, score-0.166]

73 3 Data Sets The proposed method acts on a similarity matrix, and thus most of the data sets used are specified as similarity matrices as described in the next subsection. [sent-455, score-0.158]

74 Natively Similarity Data Sets. We compared the clustering methods on eleven similarity data sets. [sent-463, score-0.228]

75 The similarity is the average of two humans’ judgement of the similarity between two sonar signals, on a scale of 1 to 5. [sent-471, score-0.176]

76 The similarity is a cosine similarity between the integral invariant signatures of the surface curves of the 945 sample faces. [sent-475, score-0.158]

77 The similarity is the Tversky similarity of 1556 binary features describing a webpage, which is negative for many pairs of webpages. [sent-479, score-0.158]

78 The similarity is the average of three humans' fine-grained judgement of the audio similarity of a pair of samples. [sent-483, score-0.158]

79 The similarity is calculated from the multi-spectral scale-invariant feature transform (MSIFT) descriptors (Brown and Susstrunk, 2011) of two images by taking the average distance d between all pairs of descriptors for the two images, and setting the similarity to e−d . [sent-486, score-0.158]

80 The similarity measures the Hamming similarity of sixteen votes in 1984 between any two politicians. [sent-499, score-0.158]

81 Natively Euclidean Data Sets. In order to also compare similarity-based clustering algorithms to the standard k-means clustering algorithm (Hastie et al. [sent-507, score-0.258]

82 The k-means algorithm computes the cluster means and cluster assignments using the features directly, whereas the similarity-based clustering algorithms use the RBF (radial basis function) kernel to infer the similarities between a pair of data points. [sent-511, score-0.37]

83 The bandwidth of the RBF kernel was tuned for the average within cluster similarity on a small held-out set. [sent-512, score-0.207]
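For these natively Euclidean data sets, an RBF similarity matrix can be built as in the sketch below; the exact parametrization of the bandwidth is not given in this summary, so the 2σ² form here is an assumption:

    import numpy as np

    def rbf_similarity(X, bandwidth):
        """K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)) for feature vectors in rows of X."""
        sq_norms = (X ** 2).sum(axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
        return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))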

84 Note that different bandwidths yield different similarity matrices and the resulting average within cluster similarities (computed using Equation (17)) are not directly comparable for two different values of bandwidths. [sent-514, score-0.192]

85 The runtimes for the LSD algorithms were similar to the kernel convex NMF and spectral clustering methods. [sent-533, score-0.207]

86 Because the other clustering algorithms do not have fast special cases for k = 2, hierarchical variants of these methods do not offer increased efficiency, but we compared to them for completeness. [sent-724, score-0.163]

87 For example, the hierarchical normalized spectral clustering is as good or tied with normalized spectral clustering on eight of the eleven similarity data sets. [sent-726, score-0.445]

88 This fact motivated a fast hierarchical LSD clustering for problems where the number of clusters desired is large. [sent-734, score-0.201]

89 For most data sets, the proposed LSD clustering and hierarchical LSD were top performers. [sent-735, score-0.163]

90 We showed that the set of possible LSD clusterings is related by rotations and gave conditions for when the LSD clustering is unique. [sent-736, score-0.171]

91 to different clusters to produce a clustering with a desired number of samples in each cluster. [sent-1114, score-0.167]

92 Similar to the LSD algorithm, this general iterative algorithm will start by factorizing the Gram matrix to obtain an initial set of vectors (we assume the correct dimension of the xi are known) and seek a rotation matrix that maps these vectors into the cone C . [sent-1250, score-0.237]

93 At every iteration, the algorithm will project the vectors onto the cone, then update the rotation matrix accordingly as discussed in Section 3. [sent-1251, score-0.207]

94 Then Theorem 3(a) implies that there exists a rotation matrix R̃ such that R̃ Q̃ = Z̃. [sent-1281, score-0.183]

95 , x_k) ∈ R^k | ∑_{j=1}^{k} x_j = 1} denote the hyperplane that contains the simplex ∆_k. [sent-1291, score-0.175]

96 Since we are interested in clusterings that are unique up to a permutation of labels, and the orthogonal group modulo reflections is isomorphic to the rotation group, all left-stochastic decompositions are characterized by the stabilizer subgroup of u, given by Hu = {R ∈ SO(k)|Ru = u}. [sent-1301, score-0.213]

97 Clearly, we cannot have a unique clustering in this case as we can arbitrarily rotate the columns of P about the centroid of the simplex without violating any LSD constraints and resulting in arbitrary LSD clusterings. [sent-1315, score-0.378]

98 To get a bound on P̂, we use a result by Stewart (1998) that states that Ẽ_k = R_E E_k + W_E, (23), where E_k, Ẽ_k are the first k columns of E, Ẽ respectively, R_E is an orthogonal matrix, and W_E is a k × n matrix that describes the perturbation of the eigenvectors. [sent-1329, score-0.199]

99 The 2-norm of the perturbation matrix is bounded by the 2-norm of the additive noise as ‖W_E‖_2 ≤ (C_1/|δ_k|) ‖W‖_2, where δ_k = λ̃_k − λ_{k+1} is the difference between the kth eigenvalue of the perturbed matrix K̃ and the (k+1)th eigenvalue of the original matrix K. [sent-1330, score-0.226]

100 Now, from (23), pre-multiplying Ẽ_k by Λ̃^{1/2} and post-multiplying by R_1 = R_E^T R_0 gives Λ̃^{1/2} Ẽ_k R_1 = Λ̃^{1/2} E_k R_0 + Λ̃^{1/2} W_E R_1, (24), where R_0 is the rotation matrix that gives P = Λ^{1/2} E_k R_0. [sent-1331, score-0.183]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lsd', 0.874), ('clustering', 0.129), ('rotation', 0.129), ('simplex', 0.121), ('lsdable', 0.111), ('nmf', 0.095), ('cluster', 0.095), ('fazel', 0.09), ('apila', 0.085), ('rora', 0.085), ('pt', 0.084), ('ru', 0.082), ('rotate', 0.08), ('similarity', 0.079), ('upta', 0.073), ('rk', 0.071), ('lustering', 0.061), ('nnmf', 0.058), ('arora', 0.057), ('matrix', 0.054), ('hyperplane', 0.054), ('conn', 0.053), ('columns', 0.048), ('perron', 0.048), ('perplexity', 0.045), ('factorization', 0.04), ('rs', 0.039), ('clusters', 0.038), ('subroutine', 0.038), ('perturbed', 0.038), ('ek', 0.037), ('diana', 0.037), ('misclassi', 0.035), ('hierarchical', 0.034), ('eigenvectors', 0.034), ('kernel', 0.033), ('givens', 0.032), ('msift', 0.032), ('ppt', 0.032), ('rhetoric', 0.032), ('weber', 0.032), ('rg', 0.032), ('patrol', 0.032), ('cold', 0.032), ('zz', 0.032), ('subgroup', 0.032), ('ding', 0.032), ('gi', 0.031), ('rt', 0.03), ('spectral', 0.027), ('perturbation', 0.026), ('facerec', 0.026), ('rotates', 0.026), ('tr', 0.026), ('hu', 0.024), ('onto', 0.024), ('rotations', 0.023), ('protein', 0.023), ('gupta', 0.023), ('cnew', 0.023), ('paatero', 0.023), ('soft', 0.022), ('rq', 0.022), ('auralson', 0.021), ('natively', 0.021), ('cm', 0.021), ('spec', 0.02), ('eleven', 0.02), ('minimize', 0.02), ('multiplicative', 0.02), ('rue', 0.019), ('svd', 0.019), ('voting', 0.019), ('clusterings', 0.019), ('stewart', 0.019), ('fusion', 0.019), ('frobenius', 0.018), ('similarities', 0.018), ('sonar', 0.018), ('runtimes', 0.018), ('clockwise', 0.018), ('eigenvalues', 0.018), ('orthogonal', 0.017), ('yeast', 0.017), ('rp', 0.017), ('cz', 0.016), ('column', 0.016), ('cazzanti', 0.016), ('isotropy', 0.016), ('kapila', 0.016), ('kpt', 0.016), ('stabilizer', 0.016), ('xrk', 0.016), ('ads', 0.016), ('inside', 0.015), ('qt', 0.015), ('kmeans', 0.015), ('gt', 0.015), ('lie', 0.015), ('circles', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel

Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive

2 0.091130331 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama

Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hardlabel MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method pos∗. A preliminary and shorter version has appeared in Proceedings of 14th International Conference on Artificial Intelligence and Statistics (Niu et al., 2011). The preliminary work was done when GN was studying at Department of Computer Science and Technology, Nanjing University, and BD was studying at Institute of Automation, Chinese Academy of Sciences. A Matlab implementation of maximum volume clustering is available from http://sugiyama-www.cs.titech.ac.jp/∼gang/software.html. c 2013 Gang Niu, Bo Dai, Lin Shang and Masashi Sugiyama. N IU , DAI , S HANG AND S UGIYAMA sesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error

3 0.044195026 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

Author: Markus Thom, Günther Palm

Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity

4 0.041869726 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

Author: Wei Pan, Xiaotong Shen, Binghui Liu

Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)

5 0.033301167 59 jmlr-2013-Large-scale SVD and Manifold Learning

Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley

Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nystr¨ m and Column sampling methods. We present a theoretical o comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the largescale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nystr¨ m approximation is superior to the Column sampling method for this o task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization

6 0.029026082 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

7 0.027767105 33 jmlr-2013-Dimension Independent Similarity Computation

8 0.027439605 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

9 0.024851479 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

10 0.023100562 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java

11 0.022767106 114 jmlr-2013-The Rate of Convergence of AdaBoost

12 0.021028634 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

13 0.019943362 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

14 0.019230274 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

15 0.019214258 22 jmlr-2013-Classifying With Confidence From Incomplete Information

16 0.019117296 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines

17 0.018992746 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

18 0.018620688 15 jmlr-2013-Bayesian Canonical Correlation Analysis

19 0.018315345 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model

20 0.018128928 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.106), (1, 0.032), (2, -0.012), (3, -0.02), (4, 0.026), (5, 0.006), (6, -0.003), (7, -0.029), (8, 0.055), (9, -0.048), (10, 0.124), (11, -0.075), (12, 0.01), (13, 0.058), (14, 0.008), (15, 0.023), (16, 0.026), (17, -0.05), (18, -0.241), (19, 0.118), (20, -0.029), (21, 0.258), (22, -0.07), (23, -0.058), (24, -0.098), (25, -0.062), (26, 0.059), (27, -0.014), (28, 0.057), (29, 0.116), (30, 0.077), (31, -0.079), (32, 0.134), (33, -0.149), (34, -0.039), (35, 0.01), (36, -0.147), (37, 0.146), (38, -0.058), (39, -0.115), (40, -0.083), (41, -0.093), (42, -0.022), (43, -0.043), (44, -0.003), (45, 0.169), (46, -0.009), (47, 0.039), (48, 0.118), (49, 0.074)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93780845 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel

Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive

2 0.67610604 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama

Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hardlabel MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method pos∗. A preliminary and shorter version has appeared in Proceedings of 14th International Conference on Artificial Intelligence and Statistics (Niu et al., 2011). The preliminary work was done when GN was studying at Department of Computer Science and Technology, Nanjing University, and BD was studying at Institute of Automation, Chinese Academy of Sciences. A Matlab implementation of maximum volume clustering is available from http://sugiyama-www.cs.titech.ac.jp/∼gang/software.html. c 2013 Gang Niu, Bo Dai, Lin Shang and Masashi Sugiyama. N IU , DAI , S HANG AND S UGIYAMA sesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error

3 0.44952688 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

Author: Wei Pan, Xiaotong Shen, Binghui Liu

Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)

4 0.37405559 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

Author: Markus Thom, Günther Palm

Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity

5 0.32524616 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java

Author: Niya Wang, Fan Meng, Li Chen, Subha Madhavan, Robert Clarke, Eric P. Hoffman, Jianhua Xuan, Yue Wang

Abstract: We describe a R-Java CAM (convex analysis of mixtures) package that provides comprehensive analytic functions and a graphic user interface (GUI) for blindly separating mixed nonnegative sources. This open-source multiplatform software implements recent and classic algorithms in the literature including Chan et al. (2008), Wang et al. (2010), Chen et al. (2011a) and Chen et al. (2011b). The CAM package offers several attractive features: (1) instead of using proprietary MATLAB, its analytic functions are written in R, which makes the codes more portable and easier to modify; (2) besides producing and plotting results in R, it also provides a Java GUI for automatic progress update and convenient visual monitoring; (3) multi-thread interactions between the R and Java modules are driven and integrated by a Java GUI, assuring that the whole CAM software runs responsively; (4) the package offers a simple mechanism to allow others to plug-in additional R-functions. Keywords: convex analysis of mixtures, blind source separation, affinity propagation clustering, compartment modeling, information-based model selection c 2013 Niya Wang, Fan Meng, Li Chen, Subha Madhavan, Robert Clarke, Eric P. Hoffman, Jianhua Xuan and Yue Wang. WANG , M ENG , C HEN , M ADHAVAN , C LARKE , H OFFMAN , X UAN AND WANG 1. Overview Blind source separation (BSS) has proven to be a powerful and widely-applicable tool for the analysis and interpretation of composite patterns in engineering and science (Hillman and Moore, 2007; Lee and Seung, 1999). BSS is often described by a linear latent variable model X = AS, where X is the observation data matrix, A is the unknown mixing matrix, and S is the unknown source data matrix. The fundamental objective of BSS is to estimate both the unknown but informative mixing proportions and the source signals based only on the observed mixtures (Child, 2006; Cruces-Alvarez et al., 2004; Hyvarinen et al., 2001; Keshava and Mustard, 2002). While many existing BSS algorithms can usefully extract interesting patterns from mixture observations, they often prove inaccurate or even incorrect in the face of real-world BSS problems in which the pre-imposed assumptions may be invalid. There is a family of approaches exploiting the source non-negativity, including the non-negative matrix factorization (NMF) (Gillis, 2012; Lee and Seung, 1999). This motivates the development of alternative BSS techniques involving exploitation of source nonnegative nature (Chan et al., 2008; Chen et al., 2011a,b; Wang et al., 2010). The method works by performing convex analysis of mixtures (CAM) that automatically identifies pure-source signals that reside at the vertices of the multifaceted simplex most tightly enclosing the data scatter, enabling geometrically-principled delineation of distinct source patterns from mixtures, with the number of underlying sources being suggested by the minimum description length criterion. Consider a latent variable model x(i) = As(i), where the observation vector x(i) = [x1 (i), ..., xM (i)]T can be expressed as a non-negative linear combination of the source vectors s(i) = [s1 (i), ..., sJ (i)]T , and A = [a1 , ..., aJ ] is the mixing matrix with a j being the jth column vector. This falls neatly within the definition of a convex set (Fig. 1) (Chen et al., 2011a): X= J J ∑ j=1 s j (i)a j |a j ∈ A, s j (i) ≥ 0, ∑ j=1 s j (i) = 1, i = 1, ..., N . 
Assume that the sources have at least one sample point whose signal is exclusively enriched in a particular source (Wang et al., 2010), we have shown that the vertex points of the observation simplex (Fig. 1) correspond to the column vectors of the mixing matrix (Chen et al., 2011b). Via a minimum-error-margin volume maximization, CAM identifies the optimum set of the vertices (Chen et al., 2011b; Wang et al., 2010). Using the samples attached to the vertices, compartment modeling (CM) (Chen et al., 2011a) obtains a parametric solution of A, nonnegative independent component analysis (nICA) (Oja and Plumbley, 2004) estimates A (and s) that maximizes the independency in s, and nonnegative well-grounded component analysis (nWCA) (Wang et al., 2010) finds the column vectors of A directly from the vertex cluster centers. Figure 1: Schematic and illustrative flowchart of R-Java CAM package. 2900 T HE CAM S OFTWARE IN R-JAVA In this paper we describe a newly developed R-Java CAM package whose analytic functions are written in R, while a graphic user interface (GUI) is implemented in Java, taking full advantages of both programming languages. The core software suite implements CAM functions and includes normalization, clustering, and data visualization. Multi-thread interactions between the R and Java modules are driven and integrated by a Java GUI, which not only provides convenient data or parameter passing and visual progress monitoring but also assures the responsive execution of the entire CAM software. 2. Software Design and Implementation The CAM package mainly consists of R and Java modules. The R module is a collection of main and helper functions, each represented by an R function object and achieving an independent and specific task (Fig. 1). The R module mainly performs various analytic tasks required by CAM: figure plotting, update, or error message generation. The Java module is developed to provide a GUI (Fig. 2). We adopt the model-view-controller (MVC) design strategy, and use different Java classes to separately perform information visualization and human-computer interaction. The Java module also serves as the software driver and integrator that use a multi-thread strategy to facilitate the interactions between the R and Java modules, such as importing raw data, passing algorithmic parameters, calling R scripts, and transporting results and messages. Figure 2: Interactive Java GUI supported by a multi-thread design strategy. 2.1 Analytic and Presentation Tasks Implemented in R The R module performs the CAM algorithm and facilitates a suite of subsequent analyses including CM, nICA, and nWCA. These tasks are performed by the three main functions: CAM-CM.R, CAM-nICA.R, and CAM-nWCA.R, which can be activated by the three R scripts: Java-runCAM-CM.R, Java-runCAM-ICA.R, and Java-runCAM-nWCA.R. The R module also performs auxiliary tasks including automatic R library installation, figure drawing, and result recording; and offers other standard methods such as nonnegative matrix factorization (Lee and Seung, 1999), Fast ICA (Hyvarinen et al., 2001), factor analysis (Child, 2006), principal component analysis, affinity propagation, k-means clustering, and expectation-maximization algorithm for learning standard finite normal mixture model. 2.2 Graphic User Interface Written in Java Swing The Java GUI module allows users to import data, select algorithms and parameters, and display results. 
The module encloses two packages: guiView contains classes for handling frames and 2901 WANG , M ENG , C HEN , M ADHAVAN , C LARKE , H OFFMAN , X UAN AND WANG Figure 3: Application of R-Java CAM to deconvolving dynamic medical image sequence. dialogs for managing user inputs; guiModel contains classes for representing result data sets and for interacting with the R script caller. Packaged as one jar file, the GUI module runs automatically. 2.3 Functional Interaction Between R and Java We adopt the open-source program RCaller (http://code.google.com/p/rcaller) to implement the interaction between R and Java modules (Fig. 2), supported by explicitly designed R scripts such as Java-runCAM-CM.R. Specifically, five featured Java classes are introduced to interact with R for importing data or parameters, running algorithms, passing on or recording results, displaying figures, and handing over error messages. The examples of these classes include guiModel.MyRCaller.java, guiModel.MyRCaller.readResults(), and guiView.MyRPlotViewer. 3. Case Studies and Experimental Results The CAM package has been successfully applied to various data types. Using dynamic contrastenhanced magnetic resonance imaging data set of an advanced breast cancer case (Chen, et al., 2011b),“double click” (or command lines under Ubuntu) activated execution of CAM-Java.jar reveals two biologically interpretable vascular compartments with distinct kinetic patterns: fast clearance in the peripheral “rim” and slow clearance in the inner “core”. These outcomes are consistent with previously reported intratumor heterogeneity (Fig. 3). Angiogenesis is essential to tumor development beyond 1-2mm3 . It has been widely observed that active angiogenesis is often observed in advanced breast tumors occurring in the peripheral “rim” with co-occurrence of inner-core hypoxia. This pattern is largely due to the defective endothelial barrier function and outgrowth blood supply. In another application to natural image mixtures, CAM algorithm successfully recovered the source images in a large number of trials (see Users Manual). 4. Summary and Acknowledgements We have developed a R-Java CAM package for blindly separating mixed nonnegative sources. The open-source cross-platform software is easy-to-use and effective, validated in several real-world applications leading to plausible scientific discoveries. The software is freely downloadable from http://mloss.org/software/view/437/. We intend to maintain and support this package in the future. This work was supported in part by the US National Institutes of Health under Grants CA109872, CA 100970, and NS29525. We thank T.H. Chan, F.Y. Wang, Y. Zhu, and D.J. Miller for technical discussions. 2902 T HE CAM S OFTWARE IN R-JAVA References T.H. Chan, W.K. Ma, C.Y. Chi, and Y. Wang. A convex analysis framework for blind separation of non-negative sources. IEEE Transactions on Signal Processing, 56:5120–5143, 2008. L. Chen, T.H. Chan, P.L. Choyke, and E.M. Hillman et al. Cam-cm: a signal deconvolution tool for in vivo dynamic contrast-enhanced imaging of complex tissues. Bioinformatics, 27:2607–2609, 2011a. L. Chen, P.L. Choyke, T.H. Chan, and C.Y. Chi et al. Tissue-specific compartmental analysis for dynamic contrast-enhanced mr imaging of complex tumors. IEEE Transactions on Medical Imaging, 30:2044–2058, 2011b. D. Child. The essentials of factor analysis. Continuum International, 2006. S.A. Cruces-Alvarez, Andrzej Cichocki, and Shun ichi Amari. 
From blind signal extraction to blind instantaneous signal separation: criteria, algorithms, and stability. IEEE Transactions on Neural Networks, 15:859–873, 2004. N. Gillis. Sparse and unique nonnegative matrix factorization through data preprocessing. Journal of Machine Learning Research, 13:3349–3386, 2012. E.M.C. Hillman and A. Moore. All-optical anatomical co-registration for molecular imaging of small animals using dynamic contrast. Nature Photonics, 1:526–530, 2007. A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley, New York, 2001. N. Keshava and J.F. Mustard. Spectral unmixing. IEEE Signal Processing Magazine, 19:44–57, 2002. D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999. E. Oja and M. Plumbley. Blind separation of positive sources by globally convergent gradient search. Neural Computation, 16:1811–1825, 2004. F.Y. Wang, C.Y. Chi, T.H. Chan, and Y. Wang. Nonnegative least-correlated component analysis for separation of dependent sources by volume maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:857–888, 2010. 2903

6 0.31791615 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

7 0.30215231 33 jmlr-2013-Dimension Independent Similarity Computation

8 0.23986867 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

9 0.18444766 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

10 0.17359205 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

11 0.17142226 15 jmlr-2013-Bayesian Canonical Correlation Analysis

12 0.17004238 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library

13 0.16922221 59 jmlr-2013-Large-scale SVD and Manifold Learning

14 0.16394773 90 jmlr-2013-Quasi-Newton Method: A New Direction

15 0.16212556 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

16 0.15761721 73 jmlr-2013-Multicategory Large-Margin Unified Machines

17 0.15242486 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

18 0.14307776 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

19 0.14224948 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

20 0.14131895 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.023), (5, 0.113), (6, 0.034), (10, 0.111), (14, 0.01), (20, 0.018), (23, 0.022), (44, 0.011), (68, 0.018), (70, 0.022), (71, 0.395), (75, 0.052), (85, 0.024), (87, 0.027), (89, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.76557887 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

Author: Robert Hable

Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN

same-paper 2 0.7139535 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel

Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive

3 0.70199198 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion

Author: Rami Mahdi, Jason Mezey

Abstract: Constraint-based learning of Bayesian networks (BN) from limited data can lead to multiple testing problems when recovering dense areas of the skeleton and to conflicting results in the orientation of edges. In this paper, we present a new constraint-based algorithm, light mutual min (LMM) for improved accuracy of BN learning from small sample data. LMM improves the assessment of candidate edges by using a ranking criterion that considers conditional independence on neighboring variables at both sides of an edge simultaneously. The algorithm also employs an adaptive relaxation of constraints that, selectively, allows some nodes not to condition on some neighbors. This relaxation aims at reducing the incorrect rejection of true edges connecting high degree nodes due to multiple testing. LMM additionally incorporates a new criterion for ranking v-structures that is used to recover the completed partially directed acyclic graph (CPDAG) and to resolve conflicting v-structures, a common problem in small sample constraint-based learning. Using simulated data, each of these components of LMM is shown to significantly improve network inference compared to commonly applied methods when learning from limited data, including more accurate recovery of skeletons and CPDAGs compared to the PC, MaxMin, and MaxMin hill climbing algorithms. A proof of asymptotic correctness is also provided for LMM for recovering the correct skeleton and CPDAG. Keywords: Bayesian networks, skeleton, constraint-based learning, mutual min

4 0.39606076 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

Author: Ery Arias-Castro, Bruno Pelletier

Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.

5 0.38721079 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer

Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation spaces should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modification to compensate for this effect. Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modified SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance. Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation c 2013 Wendelin B¨ hmer, Steffen Gr¨ new¨ lder, Yun Shen, Marek Musial and Klaus Obermayer. o u a ¨ ¨ ¨ B OHMER , G R UNEW ALDER , S HEN , M USIAL AND O BERMAYER

6 0.38328749 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

7 0.38267705 59 jmlr-2013-Large-scale SVD and Manifold Learning

8 0.381982 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

9 0.37966636 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

10 0.37891579 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

11 0.37819028 86 jmlr-2013-Parallel Vector Field Embedding

12 0.37770864 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

13 0.37721774 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

14 0.37607953 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

15 0.37535173 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

16 0.3747282 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models

17 0.37441319 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

18 0.37294286 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

19 0.37272173 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

20 0.37200075 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation