jmlr jmlr2013 jmlr2013-23 knowledge-graph by maker-knowledge-mining

23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty


Source: pdf

Author: Wei Pan, Xiaotong Shen, Binghui Liu

Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. [sent-6, score-0.13]

2 Here we formulate clustering as penalized regression with grouping pursuit. [sent-7, score-0.22]

3 Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP) 1. [sent-11, score-0.135]

4 In the absence of a class label, clustering analysis is also called unsupervised learning, as opposed to supervised learning that includes classification and regression. [sent-14, score-0.13]

5 Accordingly, approaches to clustering analysis are typically quite different from supervised learning. [sent-15, score-0.13]

6 In this paper we adopt a novel framework for clustering analysis by viewing it as a regression problem (Pelckmans et al. [sent-16, score-0.149]

7 We call our proposed method penalized regression-based clustering (PRclust). [sent-26, score-0.17]

8 An advantage of regarding clustering as a regression problem is its unification with regression, which in turn provides the opportunity to apply or modify many established results and techniques, such as model selection criteria, in regression to clustering. [sent-27, score-0.185]

9 In particular, a notoriously difficult model selection problem in clustering analysis is to determine the number of clusters; already numerous methods exist with new ones constantly emerging (Tibshirani et al. [sent-28, score-0.147]

10 In clustering analysis, due to the data-adaptive nature of model searches in finding clusters, it is unclear what is, or how to estimate df. [sent-33, score-0.13]

11 Again by formulating clustering as regression, we can adapt the use of GDF to our current context. [sent-36, score-0.13]

12 In spite of many advantages of formulating clustering analysis as a penalized regression problem, there are some challenges in its implementation. [sent-38, score-0.189]

13 In particular, with a desired non-smooth and non-convex penalty function, many existing algorithms for penalized regression are not suitable. [sent-39, score-0.141]

14 Hence, complementary to the K-means, PRclust is a potentially useful clustering tool. [sent-44, score-0.13]

15 Albeit not the focus here, we also propose GDF-based GCV as a general model selection criterion to determine the number of clusters in the above clustering approaches; a comparison with several existing methods demonstrates the promising performance of GCV. [sent-47, score-0.34]

16 , xip)′, we would like to conduct a cluster analysis; that is, we would like to identify group-memberships of the observations such that the within-group similarity and between-group dissimilarity are as strong as possible. [sent-58, score-0.119]

17 There are different ways of defining a similarity/dissimilarity between two observations, leading to various clustering approaches. [sent-59, score-0.13]

18 The main idea is that, for the purpose of clustering, we would like to strike a balance between minimizing the distance between the observations and their centroids and reducing the number of centroids via grouping some close centroids together. [sent-74, score-0.332]

19 As pointed out by a reviewer, the general idea with a convex Lq -norm as the fusion penalty has appeared in the literature (Pelckmans et al. [sent-75, score-0.146]

20 , Lasso) penalty for a small α ≤ τ, but imposes no further penalty for a large α > τ. [sent-84, score-0.164]

21 Again, to alleviate the bias of the usual convex L2-norm (or more generally, Lq-norm for q > 1) group penalty, we propose a novel and non-convex group penalty based on TLP, called group TLP or simply gTLP, defined as gTLP(µi − µj; τ) = TLP(||µi − µj||2; τ). [sent-88, score-0.167]
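
To make the penalty concrete, here is a minimal numerical sketch of TLP and its grouped version gTLP (an illustration only, not the authors' implementation). It assumes the common form TLP(α; τ) = min(|α|, τ), which matches the description above: Lasso-like growth for |α| ≤ τ and no further penalty beyond τ.

```python
import numpy as np

def tlp(alpha, tau):
    """Truncated Lasso penalty: grows like the Lasso for |alpha| <= tau, flat beyond tau."""
    return np.minimum(np.abs(alpha), tau)

def gtlp(mu_i, mu_j, tau):
    """Group TLP on a centroid difference: TLP applied to its L2 norm."""
    return tlp(np.linalg.norm(np.asarray(mu_i) - np.asarray(mu_j)), tau)

# toy check: the penalty stops growing once the centroid gap exceeds tau
print(gtlp([0.0, 0.0], [0.3, 0.4], tau=1.0))  # 0.5, still in the Lasso-like regime
print(gtlp([0.0, 0.0], [3.0, 4.0], tau=1.0))  # 1.0, capped at tau
```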

22 Since the objective function (2) is a sum of a differentiable and convex function and a convex penalty in µ and θ (while θ̂(m) is a known vector), the coordinate-wise descent algorithm will converge to its minimizer (Tseng, 2001). [sent-117, score-0.144]

23 We form clusters based on the θ̂ij's: for any two observations xi and xj, if θ̂ij = 0, they are declared to be in the same cluster. [sent-124, score-0.232]

24 We construct a graph G based on an adjacency matrix A = (aij) with elements aij = I(θ̂ij = 0); finding clusters is equivalent to finding connected subcomponents of G. [sent-125, score-0.226]
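
As a rough sketch of this step (assuming θ̂ij is summarized as an n × n matrix of estimated pairwise-difference norms, zero when a pair is fused), the clusters are simply the connected components of the zero pattern; the helper below uses SciPy's connected_components for that.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def clusters_from_theta(theta_hat, tol=0.0):
    """theta_hat: (n, n) matrix whose (i, j) entry holds ||theta_ij|| (0 when the pair is fused).
    Link i and j whenever the entry is (numerically) zero; clusters are the connected components."""
    adjacency = (np.abs(theta_hat) <= tol).astype(int)
    np.fill_diagonal(adjacency, 0)
    n_clusters, labels = connected_components(csr_matrix(adjacency), directed=False)
    return n_clusters, labels

# toy example: observations {0, 1} fused and {2, 3} fused, giving two clusters
theta = np.array([[0., 0., 2., 2.],
                  [0., 0., 2., 2.],
                  [2., 2., 0., 0.],
                  [2., 2., 0., 0.]])
print(clusters_from_theta(theta))  # (2, array([0, 0, 1, 1]))
```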

25 In contrast, by specifying K, the number of clusters, as the only tuning parameter, the K-means starts with some K initial centroids, then assigns each observation to a cluster before updating the centroid estimates. [sent-144, score-0.367]

26 Hence, PRclust differs from the K-means in that PRclust does not explicitly assign an observation to any cluster; clustering is implicitly done after the convergence of the algorithm. [sent-145, score-0.13]

27 For any threshold d, we can define an adjacency matrix A = (aij) with aij = I(dij < d); as in PRclust, we can define any connected subcomponent based on A as a cluster, resulting in a clustering method called HTclust, which is named a "connected components" algorithm in Ng et al. [sent-148, score-0.163]

28 HTclust is related to agglomerative hierarchical clustering: the latter also starts with each observation being its own cluster, then sequentially merges the two nearest clusters until only one cluster is left. [sent-151, score-0.303]
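
A minimal HTclust sketch (not the paper's code) makes this connection explicit: thresholding pairwise distances and taking connected components agrees with cutting a single-linkage dendrogram at the same height, as the appendix argument summarized near the end of this page claims (ties exactly at the threshold aside).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
d = 1.0  # distance threshold

# HTclust: link i and j whenever d_ij < d, then take connected components
D = squareform(pdist(X))
_, ht_labels = connected_components(csr_matrix(D < d), directed=False)

# single-linkage dendrogram cut at the same height
sl_labels = fcluster(linkage(X, method="single"), t=d, criterion="distance")

print(len(set(ht_labels)), len(set(sl_labels)))  # both find 2 clusters on this toy data
```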

29 4 Selecting Tuning Parameters or Number of Clusters A bonus with the regression approach to clustering is the potential application of many existing model selection methods for regression or supervised learning to clustering. [sent-156, score-0.185]

30 In our notation, GCV(df) = RSS/(np − df)^2 = [∑_{i=1}^n ∑_{k=1}^p (xik − µ̂ik)^2] / (np − df)^2, where df is the degrees of freedom used in estimating the µi's. [sent-161, score-0.228]
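
The GCV criterion itself is a one-line computation once df is available; a small helper, written as a sketch, is below (the difficult part, replacing the naive df with a GDF estimate, is handled separately).

```python
import numpy as np

def gcv(X, mu_hat, df):
    """GCV(df) = RSS / (n*p - df)^2 with RSS = sum_{i,k} (x_ik - mu_hat_ik)^2.
    X and mu_hat are (n, p) arrays of observations and their fitted centroids."""
    n, p = X.shape
    rss = float(np.sum((X - mu_hat) ** 2))
    return rss / (n * p - df) ** 2

# usage sketch: gcv(X, mu_hat, df=K * p) is the naive choice;
# plugging a GDF estimate into df gives the GCV(GDF) criterion discussed in the paper.
```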

31 For our problem, a naive treatment is to take df = Kp, the number of unknown parameters in the µi's, which however does not take into account the data-adaptive nature of estimating the µi's in clustering analysis. [sent-162, score-0.332]

32 In Step 3, we apply the same clustering algorithm (e. [sent-192, score-0.13]

33 The above method can be equally applied to the K-means method to select the number of clusters: we just need to apply the K-means with a fixed number of clusters, say K, in Step 3, then use the cluster centroid of observation xi as its estimated mean µi ; other steps remain the same. [sent-196, score-0.18]
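
For intuition about how a GDF could be obtained here, the sketch below follows the Monte Carlo perturbation idea usually attributed to Ye (1998): perturb the data with small Gaussian noise, refit the clustering-based means, and sum the per-entry sensitivities of the fitted values to the perturbations. The perturbation scale, the number of replicates, and the use of scikit-learn's KMeans are all assumptions for illustration, not details taken from this summary.

```python
import numpy as np
from sklearn.cluster import KMeans

def gdf_estimate(X, fit_means, sigma_t=0.2, T=50, seed=0):
    """Rough Monte Carlo sketch of generalized degrees of freedom (in the spirit of Ye, 1998).
    fit_means maps an (n, p) data matrix to the (n, p) fitted means of its rows.
    Perturb each entry with N(0, sigma_t^2) noise, refit, and sum the per-entry
    regression slopes of the fitted values on the perturbations that produced them."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    deltas = np.empty((T, n, p))
    fits = np.empty((T, n, p))
    for t in range(T):
        deltas[t] = rng.normal(scale=sigma_t, size=(n, p))
        fits[t] = fit_means(X + deltas[t])
    d_centered = deltas - deltas.mean(axis=0)
    f_centered = fits - fits.mean(axis=0)
    slopes = (d_centered * f_centered).sum(axis=0) / (d_centered ** 2).sum(axis=0)
    return float(slopes.sum())

def kmeans_fitted_means(K):
    """K-means with K clusters; each observation's fitted mean is its cluster centroid."""
    def fit(Z):
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(Z)
        return km.cluster_centers_[km.labels_]
    return fit

# usage sketch: df_hat = gdf_estimate(X, kmeans_fitted_means(K=3)); then gcv(X, mu_hat, df_hat)
```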

34 As a comparison, we also apply the Jump statistic to select the number of clusters for the K-means (Sugar and James, 2003). [sent-198, score-0.217]

35 Wang (2010) proposed a consistent estimator for the number of clusters based on clustering stability. [sent-201, score-0.323]

36 It is based on an intuitive idea: with the correct number of clusters, the clustering results should be most stable. [sent-202, score-0.13]

37 Case I: two convex clusters in two dimensions (Figure 1a). [sent-211, score-0.224]

38 We consider two somewhat overlapping clusters with the same spherical shape, which is ideal for the K-means. [sent-212, score-0.208]

39 Case II: two non-convex clusters in two dimensions (Figure 1b). [sent-216, score-0.193]

40 There were 2 clusters as two nested circles (distorted with some added noise), each with 100 observations (see the upper-left panel in Figure 3). [sent-218, score-0.215]

41 Case VI: three clusters in 2 dimensions with two spherically shaped clusters inside 3/4 of a perturbed circle (Figure 1c). [sent-271, score-0.444]

42 For comparison, we also applied Gaussian mixture-model based clustering as implemented in the R package mclust (Fraley and Raftery, 2006); for each data set, we fitted each of the 10 models corresponding to 10 different ways of parameterizing the mixture model, for K = 1, 2, . [sent-291, score-0.251]

43 Due to the conceptual similarity between our proposed PRclust and spectral clustering (Sclust), we also included the spectral clustering algorithm of Ng et al. [sent-295, score-0.324]

44 Fourth, for a specified number of clusters k, we stack the top k eigenvectors (corresponding to the k largest eigenvalues) of L column-wise to form an n × k matrix, say Zk; normalize each row of Zk to have unit L2-norm. [sent-303, score-0.193]
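
A compact sketch of the Sclust steps just described (Gaussian affinity, symmetric normalization, top-k eigenvectors, row normalization, then K-means on the rows), in the spirit of Ng et al. (2002); the exact affinity and the choice of γ in the paper may differ from this illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def sclust(X, k, gamma):
    """Spectral clustering in the style of Ng et al. (2002)."""
    # Gaussian affinity with a zero diagonal
    A = np.exp(-squareform(pdist(X, "sqeuclidean")) / (2.0 * gamma ** 2))
    np.fill_diagonal(A, 0.0)
    # symmetrically normalized affinity L = D^{-1/2} A D^{-1/2}
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))
    # stack the top-k eigenvectors column-wise and normalize each row to unit L2-norm
    _, eigvecs = np.linalg.eigh(L)
    Z = eigvecs[:, -k:]
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    # cluster the normalized rows with K-means
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
```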

45 To evaluate the performance of a clustering algorithm, we used the Rand index (Rand, 1971), adjusted Rand index (Hubert and Arabie, 1985) and Jaccard index (Jaccard, 1912), all measuring the agreement between estimated cluster memberships and the truth. [sent-313, score-0.279]
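
These three agreement indices can all be computed from pair counts; the sketch below uses the standard pair-counting definitions (scikit-learn's adjusted_rand_score gives the same adjusted Rand value) and is an illustration rather than the paper's evaluation code.

```python
import numpy as np
from itertools import combinations

def agreement_indices(truth, est):
    """Rand, adjusted Rand, and Jaccard indices from pair counts."""
    truth, est = np.asarray(truth), np.asarray(est)
    a = b = c = d = 0  # pairs: same/same, same/diff, diff/same, diff/diff
    for i, j in combinations(range(len(truth)), 2):
        same_t, same_e = truth[i] == truth[j], est[i] == est[j]
        if same_t and same_e:
            a += 1
        elif same_t:
            b += 1
        elif same_e:
            c += 1
        else:
            d += 1
    n_pairs = a + b + c + d
    rand = (a + d) / n_pairs
    jaccard = a / (a + b + c) if (a + b + c) > 0 else 1.0
    expected = (a + b) * (a + c) / n_pairs
    max_index = ((a + b) + (a + c)) / 2.0
    arand = (a - expected) / (max_index - expected) if max_index != expected else 1.0
    return rand, arand, jaccard

print(agreement_indices([0, 0, 1, 1], [0, 0, 1, 2]))  # (0.833..., 0.571..., 0.5)
```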

46 Simulation Results, Case I: For the K-means, we chose the number of clusters using Jump, CV1, CV2 and GCV statistics; for comparison, we also fixed the number of clusters around its true value. [sent-316, score-0.386]

47 Figure 2 shows how GDF and NDF changed with K, the number of clusters in the K-means algorithm, for the first simulated data set. [sent-320, score-0.216]

48 Since the two clusters were formed by observations drawn from two Normal distributions, as expected, the model-based clustering Mclust performed best. [sent-325, score-0.36]

49 PRclust with GCV(GDF) selecting its tuning parameters performed well too: the average number of clusters is close to the truth K0 = 2; the corresponding clustering results had high degrees of agreement with the truth, as evidenced by the high indices. [sent-337, score-0.421]

50 Table 1 also displays the frequencies of the number of clusters selected by GCV(GDF): for the overwhelming majority (98%), either the correct number of clusters K0 = 2 was selected, or a slightly larger K = 3 or 4 with very high agreement indices was chosen. [sent-338, score-0.325]

51 (2002), HTclust is not robust to outliers: since an “outlier” (lower left corner in Figure 1a) was farthest away from any other observations, it formed its own cluster while all others formed another cluster when the threshold d was chosen to yield two clusters. [sent-428, score-0.194]

52 Case II: Since each cluster was not spherically shaped, and more importantly, the two true cluster centroids completely overlapped with each other, the K-means would not work: it could not distinguish the two clusters. [sent-430, score-0.316]

53 to the cluster and its assigning a cluster membership of an observation based on its distance to the centroids; since the two clusters share the same centroid in truth, the K-means cannot distinguish the two clusters. [sent-434, score-0.429]

54 Note that the cluster memberships in PRclust are determined by the estimates of θij = µi − µj; due to the use of the ridge penalty with a fixed λ1 = 1, we might have θ̂ij = 0 but µ̂i ≠ µ̂j. [sent-437, score-0.209]

55 Since HTclust assigned the cluster-memberships according to the pair-wise distances among the observations, not the nearest distance of an observation to the centroids as done in the K-means, it also performed well. [sent-438, score-0.122]

56 Interestingly, an exception happened in four (out of 100) simulations: when Sclust could not correctly distinguish the two true clusters (with low agreement statistics) with K̂ = 2, it had a smaller GCV(GDF) statistic than that for K̂ = 1. [sent-441, score-0.228]

57 The results here suggest that, although Sclust may perform well for non-convex clusters with an appropriately chosen γ (as selected by the method of Ng et al.), ... [sent-442, score-0.193]

58 Case IV seems to be challenging with partially overlapping spherically shaped clusters of smaller cluster sizes: the number of clusters could be under- or over-selected by various methods. [sent-571, score-0.556]

59 In Case V, all performed perfectly except that the GCV(GDF) over-selected the number of clusters in the K-means and the two spectral clustering methods. [sent-574, score-0.37]

60 Note that the K-means implicitly assumes that all clusters share the same volume and spherical shape, and GCV also implicitly favors such clusters (with smaller within-cluster sum of squares, and thus a smaller GCV statistic). [sent-577, score-0.386]

61 On the other hand, the K-means with CV1 or CV2 and the two spectral clustering methods seemed to under-select the number of clusters, leading to lower agreement statistics. [sent-674, score-0.211]

62 For the K-means, we tried the number of clusters K = 1, 2, . [sent-802, score-0.193]

63 For the K-means, in agreement with the simulations, the Jump statistic selected perhaps too large a K̂ = 27, while GCV(GDF) selected K̂ = 9, perhaps due to the non-spherical shapes of the true clusters (Table 5). [sent-819, score-0.26]

64 Both the K-means with CV1 (or CV2) and Mclust selected K̂ = 2 and yielded the same clustering results. [sent-820, score-0.13]

65 In comparison, PRclust with GCV(GDF) yielded K̂ = 3 clusters with higher agreement indices than those of the K-means, Mclust and Sclust. [sent-822, score-0.228]

66 HTclust selected K̂ = 4 clusters with agreement indices less than but close to those of PRclust. [sent-823, score-0.228]

67 We also applied the K-means and Sclust with a fixed K = 2 or 3, and took the subset of the tuning parameter values yielding 2 or 3 clusters for HTclust and PRclust. [sent-824, score-0.228]

68 We applied PRclust2 to the earlier examples and obtained the following results: when all the clusters were convex, PRclust2 yielded results very similar to those of PRclust; otherwise, their results were different. [sent-837, score-0.193]

69 PRclust2 forms clusters based on the (approximate) equality of the µ̂i's, while PRclust clusters two observations i and j together if their µ̂i and µ̂j are close to each other, say ||µ̂i − µ̂j||2 < d0,ij, where the threshold d0,ij is possibly (i, j)-specific. [sent-842, score-0.408]

70 Hence, PRclust2 seems to be more rigid and greedy in forming clusters than PRclust. [sent-843, score-0.193]

71 More generally, one can apply the spectral clustering of Ng et al. based on the θ̂ij's obtained. [sent-848, score-0.162]

72 (2005) proposed using a fusion penalty based on the Lq-norm with the objective function (1/2) ∑_{i=1}^n ||xi − µi||_2^2 + λ ∑_{i<j} ||µi − µj||_q, and proposed an efficient quadratic convex programming-based computing method for q = 1. [sent-891, score-0.146]
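
For concreteness, a direct (and deliberately naive) evaluation of this Lq fusion-penalized objective is sketched below; efficiently minimizing it is the computational problem the cited papers address, and PRclust replaces the ∑_{i<j} ||µi − µj||_q term with the gTLP.

```python
import numpy as np
from itertools import combinations

def fusion_objective(X, mu, lam, q=2):
    """(1/2) sum_i ||x_i - mu_i||_2^2 + lam * sum_{i<j} ||mu_i - mu_j||_q,
    with X and mu given as (n, p) arrays."""
    fit = 0.5 * np.sum((X - mu) ** 2)
    fuse = sum(np.linalg.norm(mu[i] - mu[j], ord=q)
               for i, j in combinations(range(len(mu)), 2))
    return fit + lam * fuse
```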

73 (2011) recognized the importance of using a group penalty with q > 1, and applied the Matlab CVX package (Grant and Boyd, 2011) to solve the general convex programming problem for the group Lasso penalty with q = 2 (Yuan and Lin, 2006). [sent-893, score-0.231]

74 More importantly, overall the solution paths of all three PRclust-Lq were similar to each other, sharing the common feature that the estimated centroids were more and more biased towards the overall mean as the penalty parameter λ increased. [sent-901, score-0.175]

75 (2011) treated PRclust-Lq as a hierarchical clustering tool; none of the authors discussed the choice of the number of clusters. [sent-933, score-0.143]

76 The issue of an Lq-norm penalty yielding possibly severely biased estimates is well known in penalized regression, which partially motivated the development of non-convex penalties such as TLP (Shen et al. [sent-934, score-0.141]

77 (2011) has recognized the issue of the biased centroid estimates in PRclust-Lq and thus proposed a second stage to re-estimate the centroids after a clustering result is obtained. [sent-937, score-0.265]

78 When we applied the GCV(GDF) to select the number of clusters for PRclust-Lq in simulation Case I, as expected, it performed poorly. [sent-939, score-0.26]

79 For any d0 ≥ 0, similar to hierarchical clustering, we defined an adjacency matrix A = (aij) with aij = I(||µ̂i − µ̂j||2 ≤ d0); any two observations xi and xj were assigned to the same cluster if aij = 1. [sent-941, score-0.198]

80 Similarly, Mclust does not perform well for non-convex clusters (Table 2), but may have advantages with overlapping and ellipsoidal clusters as for simulation Case I (Table 1) and the iris data (Table 5). [sent-961, score-0.476]

81 (2002) in selecting the scale parameter γ, but the clustering result also critically depended on the specified k, the number of clusters, for which the GCV(GDF) might not perform well. [sent-966, score-0.13]

82 More generally, model selection is related to kernel learning in spectral clustering (Bach and Jordan, 2006). [sent-968, score-0.179]

83 It is currently an open problem whether the strengths of PRclust and spectral clustering can be combined. [sent-969, score-0.162]

84 K-median clustering is closely related to partitioning around medoids (PAM) of Kaufman and Rousseeuw (1990), and is more robust to outliers than is the K-means. [sent-975, score-0.13]

85 Alternatively, we can also relax the equal cluster volume assumption and use L(xi − µi) = (1/2)(xi − µi)′(xi − µi)/σi^2, where the observation-specific variances σi^2's have to be estimated through grouping pursuit, as for the observation-specific means/centroids µi's (Xie et al. [sent-979, score-0.128]

86 Among others, it might provide a computationally more efficient algorithm than the EM algorithm commonly adopted in mixture model-based clustering (Dempster et al. [sent-982, score-0.13]

87 This is a special and simple approach to more general graph-based clustering (Xu and Wunsch, 2005); other more sophisticated approaches may be borrowed or adapted. [sent-986, score-0.13]

88 We implemented a specific combination of PRclust and spectral clustering along with GCV(GDF) for model selection: we first applied PRclust, then used its output as the input to spectral clustering, but it did not show improvement over PRclust. [sent-987, score-0.194]

89 Other options exist; for example, as suggested by a reviewer, it might be more fruitful to replace the K-means in spectral clustering with PRclust. [sent-988, score-0.162]

90 In principle, we may add a penalty into our objective function for variable selection (Pan and Shen, 2007), which again requires a fast method to select more tuning parameters and is worth future investigation. [sent-996, score-0.158]

91 Perhaps the most interesting idea of our proposal is the view of regarding clustering analysis as a penalized regression problem, blurring the typical line drawn to distinguish clustering (or unsupervised learning) from regression and classification (i. [sent-997, score-0.338]

92 We prove the equivalence between HTclust and single-linkage hierarchical clustering (SL-Hclust). [sent-1024, score-0.143]

93 Both the HTclust and SL-Hclust form clusters sequentially and in a finite number of steps; we show that, in each step, the SL-Hclust gives the same clusters as those of HTclust if an appropriate threshold is chosen in the latter. [sent-1025, score-0.386]

94 In the next step, the SL-Hclust combines observations according to whether their distances satisfy dij ≤ d(1) to form clusters; in HTclust, if we use a threshold d0 = d(1) + ε with a tiny ε > 0, then it results in the same clusters as those of the SL-Hclust. [sent-1031, score-0.229]

95 If the clustering results of the two methods are the same in step k − 1 > 0 and if we use d0 = d(k) + ε in HTclust, then it leads to the same clusters as the SL-Hclust in step k. [sent-1032, score-0.323]

96 Finding the number of clusters in a data set: An information theoretic approach. [sent-1227, score-0.193]

97 Evaluation and comparison of gene clustering methods in microarray analysis. [sent-1235, score-0.13]

98 Estimating the number of clusters in a data set via the gap statistic. [sent-1253, score-0.193]

99 Consistent selection of the number of clusters via crossvalidation. [sent-1264, score-0.21]

100 Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. [sent-1276, score-0.13]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gdf', 0.517), ('prclust', 0.464), ('gcv', 0.435), ('htclust', 0.255), ('df', 0.202), ('clusters', 0.193), ('clustering', 0.13), ('mclust', 0.121), ('cluster', 0.097), ('sclust', 0.094), ('centroids', 0.093), ('tlp', 0.092), ('pan', 0.086), ('fixed', 0.084), ('penalty', 0.082), ('luster', 0.081), ('enalty', 0.069), ('freq', 0.062), ('rand', 0.059), ('jaccard', 0.057), ('arand', 0.054), ('hen', 0.054), ('nalysis', 0.048), ('gtlp', 0.047), ('nnn', 0.047), ('jump', 0.046), ('iu', 0.044), ('lasso', 0.043), ('centroid', 0.042), ('penalized', 0.04), ('jasa', 0.04), ('lq', 0.036), ('tuning', 0.035), ('shen', 0.035), ('agreement', 0.035), ('bic', 0.034), ('iris', 0.034), ('hocking', 0.034), ('lindsten', 0.034), ('ndf', 0.034), ('pelckmans', 0.034), ('fusion', 0.033), ('spectral', 0.032), ('grouping', 0.031), ('convex', 0.031), ('ik', 0.029), ('xik', 0.029), ('shaped', 0.029), ('spherically', 0.029), ('ng', 0.028), ('simulation', 0.028), ('tibshirani', 0.026), ('select', 0.024), ('simulated', 0.023), ('observations', 0.022), ('reviewer', 0.021), ('sugar', 0.02), ('regression', 0.019), ('penalties', 0.019), ('ye', 0.018), ('np', 0.018), ('group', 0.018), ('memberships', 0.017), ('minneapolis', 0.017), ('minnesota', 0.017), ('tep', 0.017), ('xi', 0.017), ('adjacency', 0.017), ('selection', 0.017), ('perhaps', 0.016), ('ai', 0.016), ('rss', 0.016), ('performed', 0.015), ('overlapping', 0.015), ('cv', 0.015), ('seemed', 0.014), ('cvx', 0.014), ('shrinkage', 0.014), ('distances', 0.014), ('ridge', 0.013), ('wu', 0.013), ('appl', 0.013), ('ban', 0.013), ('binghui', 0.013), ('eik', 0.013), ('elongated', 0.013), ('fraley', 0.013), ('gdfs', 0.013), ('hik', 0.013), ('mclachlan', 0.013), ('sepal', 0.013), ('subtype', 0.013), ('thalamuthu', 0.013), ('wunsch', 0.013), ('ellipsoidal', 0.013), ('lange', 0.013), ('nocedal', 0.013), ('degrees', 0.013), ('freedom', 0.013), ('hierarchical', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

Author: Wei Pan, Xiaotong Shen, Binghui Liu

Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)

2 0.12715061 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

Author: Wei Sun, Junhui Wang, Yixin Fang

Abstract: Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. Its effectiveness is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data. Keywords: kappa coefficient, penalized regression, selection consistency, stability, tuning

3 0.069955721 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama

Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hardlabel MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method pos∗. A preliminary and shorter version has appeared in Proceedings of 14th International Conference on Artificial Intelligence and Statistics (Niu et al., 2011). The preliminary work was done when GN was studying at Department of Computer Science and Technology, Nanjing University, and BD was studying at Institute of Automation, Chinese Academy of Sciences. A Matlab implementation of maximum volume clustering is available from http://sugiyama-www.cs.titech.ac.jp/∼gang/software.html. c 2013 Gang Niu, Bo Dai, Lin Shang and Masashi Sugiyama. N IU , DAI , S HANG AND S UGIYAMA sesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error

4 0.061953545 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

Author: Lauren A. Hannah, David B. Dunson

Abstract: We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is stable and computationally feasible for more than a few thousand observations. We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is a computationally efficient, consistent method for convex regression. We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model

5 0.047724407 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

6 0.0441061 86 jmlr-2013-Parallel Vector Field Embedding

7 0.041869726 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

8 0.041734129 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

9 0.031737801 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

10 0.028112732 22 jmlr-2013-Classifying With Confidence From Incomplete Information

11 0.026245194 106 jmlr-2013-Stationary-Sparse Causality Network Learning

12 0.025699686 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

13 0.023806721 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

14 0.023164377 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

15 0.023029281 73 jmlr-2013-Multicategory Large-Margin Unified Machines

16 0.022414865 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

17 0.022014963 76 jmlr-2013-Nonparametric Sparsity and Regularization

18 0.02188007 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

19 0.019791588 59 jmlr-2013-Large-scale SVD and Manifold Learning

20 0.018987719 90 jmlr-2013-Quasi-Newton Method: A New Direction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.108), (1, 0.024), (2, 0.013), (3, -0.017), (4, 0.098), (5, -0.015), (6, -0.019), (7, -0.023), (8, 0.179), (9, -0.066), (10, 0.0), (11, -0.164), (12, -0.092), (13, 0.084), (14, -0.043), (15, -0.115), (16, 0.087), (17, 0.253), (18, -0.149), (19, 0.135), (20, -0.06), (21, 0.244), (22, 0.021), (23, -0.326), (24, 0.019), (25, 0.033), (26, 0.059), (27, 0.044), (28, -0.041), (29, 0.071), (30, 0.073), (31, 0.183), (32, 0.001), (33, 0.042), (34, -0.042), (35, 0.047), (36, 0.054), (37, 0.04), (38, -0.003), (39, -0.03), (40, 0.023), (41, -0.048), (42, -0.012), (43, 0.058), (44, -0.093), (45, -0.008), (46, -0.006), (47, -0.07), (48, 0.096), (49, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9572292 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

Author: Wei Pan, Xiaotong Shen, Binghui Liu

Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)

2 0.64062893 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

Author: Wei Sun, Junhui Wang, Yixin Fang

Abstract: Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. Its effectiveness is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data. Keywords: kappa coefficient, penalized regression, selection consistency, stability, tuning

3 0.41808915 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

Author: Lauren A. Hannah, David B. Dunson

Abstract: We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is stable and computationally feasible for more than a few thousand observations. We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is a computationally efficient, consistent method for convex regression. We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model

4 0.40907517 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama

Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hardlabel MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method pos∗. A preliminary and shorter version has appeared in Proceedings of 14th International Conference on Artificial Intelligence and Statistics (Niu et al., 2011). The preliminary work was done when GN was studying at Department of Computer Science and Technology, Nanjing University, and BD was studying at Institute of Automation, Chinese Academy of Sciences. A Matlab implementation of maximum volume clustering is available from http://sugiyama-www.cs.titech.ac.jp/∼gang/software.html. c 2013 Gang Niu, Bo Dai, Lin Shang and Masashi Sugiyama. N IU , DAI , S HANG AND S UGIYAMA sesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error

5 0.39894879 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel

Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive

6 0.24165398 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

7 0.20431094 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

8 0.17871085 22 jmlr-2013-Classifying With Confidence From Incomplete Information

9 0.17843179 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis

10 0.17058262 106 jmlr-2013-Stationary-Sparse Causality Network Learning

11 0.15545651 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

12 0.15523124 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

13 0.14924102 86 jmlr-2013-Parallel Vector Field Embedding

14 0.14159876 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

15 0.13379386 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

16 0.13084449 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

17 0.13068128 73 jmlr-2013-Multicategory Large-Margin Unified Machines

18 0.12144239 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java

19 0.120918 66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

20 0.10920658 33 jmlr-2013-Dimension Independent Similarity Computation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (5, 0.075), (6, 0.035), (10, 0.074), (20, 0.017), (23, 0.021), (41, 0.013), (68, 0.039), (70, 0.014), (75, 0.539), (85, 0.012), (87, 0.016), (89, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98203331 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes

Author: Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen, Aki Vehtari

Abstract: The GPstuff toolbox is a versatile collection of Gaussian process models and computational tools required for Bayesian inference. The tools include, among others, various inference methods, sparse approximations and model assessment methods. Keywords: Gaussian process, Bayesian hierarchical model, nonparametric Bayes

2 0.95259547 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

Author: Lisha Chen, Andreas Buja

Abstract: Multidimensional scaling (MDS) is the art of reconstructing pointsets (embeddings) from pairwise distance data, and as such it is at the basis of several approaches to nonlinear dimension reduction and manifold learning. At present, MDS lacks a unifying methodology as it consists of a discrete collection of proposals that differ in their optimization criteria, called “stress functions”. To correct this situation we propose (1) to embed many of the extant stress functions in a parametric family of stress functions, and (2) to replace the ad hoc choice among discrete proposals with a principled parameter selection method. This methodology yields the following benefits and problem solutions: (a) It provides guidance in tailoring stress functions to a given data situation, responding to the fact that no single stress function dominates all others across all data situations; (b) the methodology enriches the supply of available stress functions; (c) it helps our understanding of stress functions by replacing the comparison of discrete proposals with a characterization of the effect of parameters on embeddings; (d) it builds a bridge to graph drawing, which is the related but not identical art of constructing embeddings from graphs. Keywords: multidimensional scaling, force-directed layout, cluster analysis, clustering strength, unsupervised learning, Box-Cox transformations

3 0.88537169 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

Author: Philémon Brakel, Dirk Stroobandt, Benjamin Schrauwen

Abstract: Imputing missing values in high dimensional time-series is a difficult problem. This paper presents a strategy for training energy-based graphical models for imputation directly, bypassing difficulties probabilistic approaches would face. The training strategy is inspired by recent work on optimization-based learning (Domke, 2012) and allows complex neural models with convolutional and recurrent structures to be trained for imputation tasks. In this work, we use this training strategy to derive learning rules for three substantially different neural architectures. Inference in these models is done by either truncated gradient descent or variational mean-field iterations. In our experiments, we found that the training methods outperform the Contrastive Divergence learning algorithm. Moreover, the training methods can easily handle missing values in the training data itself during learning. We demonstrate the performance of this learning scheme and the three models we introduce on one artificial and two real-world data sets. Keywords: neural networks, energy-based models, time-series, missing values, optimization

same-paper 4 0.86070895 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

Author: Wei Pan, Xiaotong Shen, Binghui Liu

Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)

5 0.79792142 21 jmlr-2013-Classifier Selection using the Predicate Depth

Author: Ran Gilad-Bachrach, Christopher J.C. Burges

Abstract: Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic. We present an alternative approach by defining a median of the distribution, which we show is both more robust, and has good generalization guarantees. We present algorithms to approximate this median. One contribution of this work is an efficient method for approximating the Tukey median. The Tukey median, which is often used for data visualization and outlier detection, is a special case of the family of medians we define: however, computing it exactly is exponentially slow in the dimension. Our algorithm approximates such medians in polynomial time while making weaker assumptions than those required by previous work. Keywords: classification, estimation, median, Tukey depth

6 0.63201666 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

7 0.57152802 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

8 0.53349507 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

9 0.51656598 120 jmlr-2013-Variational Algorithms for Marginal MAP

10 0.4876304 86 jmlr-2013-Parallel Vector Field Embedding

11 0.47604856 108 jmlr-2013-Stochastic Variational Inference

12 0.47555679 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

13 0.45921317 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

14 0.45121768 32 jmlr-2013-Differential Privacy for Functions and Functional Data

15 0.44530424 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

16 0.44164157 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

17 0.4408716 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

18 0.43981922 59 jmlr-2013-Large-scale SVD and Manifold Learning

19 0.4370091 22 jmlr-2013-Classifying With Confidence From Incomplete Information

20 0.42981181 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning