jmlr jmlr2013 jmlr2013-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama
Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hard-label MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method possesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error bound. (∗ A preliminary and shorter version has appeared in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (Niu et al., 2011). The preliminary work was done when GN was studying at the Department of Computer Science and Technology, Nanjing University, and BD was studying at the Institute of Automation, Chinese Academy of Sciences. A Matlab implementation of maximum volume clustering is available from http://sugiyama-www.cs.titech.ac.jp/∼gang/software.html. © 2013 Gang Niu, Bo Dai, Lin Shang and Masashi Sugiyama.)
Reference: text
sentIndex sentText sentNum sentScore
1 A Matlab implementation of maximum volume clustering is available from http://sugiyama-www.cs.titech.ac.jp/∼gang/software.html. [sent-20, score-0.413]
2 Secondly, the soft-label MVC method possesses a clustering error bound. [sent-27, score-0.397]
3 Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. [sent-28, score-0.839]
4 Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. [sent-29, score-0.391]
5 Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error bound 1. [sent-30, score-0.476]
6 Over the past decades, a large number of clustering algorithms have been developed. [sent-32, score-0.361]
7 For instance, k-means clustering (MacQueen, 1967; Hartigan and Wong, 1979; Girolami, 2002), spectral clustering (Shi and Malik, 2000; Meila and Shi, 2001; Ng et al. [sent-33, score-0.784]
8 , 2005; Xu and Schuurmans, 2005), dependence-maximization clustering (Song et al. [sent-35, score-0.361]
9 , 2007; Faivishevsky and Goldberger, 2010) and information-maximization clustering (Agakov and Barber, 2006; Gomes et al. [sent-36, score-0.361]
10 To the best of our knowledge, MMC, which partitions the data samples into two clusters based on the large margin principle (LMP) (Vapnik, 1982), is the first clustering approach that is directly connected to statistical learning theory (Vapnik, 1998). [sent-40, score-0.448]
11 In this paper, we introduce a novel discriminative clustering approach called maximum volume clustering (MVC), which serves as a prototype to partition the data samples into two clusters based on LVP. [sent-60, score-0.81]
12 Hence, MVC might be regarded as a natural extension of many existing clustering methods. [sent-83, score-0.361]
13 It suggests that under mild conditions, different locally optimal solutions to soft-label MVC would induce the same data partition, and thus the non-convex optimization of soft-label MVC seems like a convex one; • A clustering error bound for the soft-label MVC method. [sent-85, score-0.424]
14 Then, we propose the model and algorithms of MVC in Section 3, and show that they are closely related to several existing clustering methods in Section 4. [sent-92, score-0.361]
15 In Section 6, we derive the clustering error bound. [sent-94, score-0.397]
16 Due to our clustering model that will be defined as optimization (2) on page 2646, [h∗]i = 0 hardly happens in practice, and we simply assume [h∗]i ≠ 0 in our problem setting. [sent-119, score-0.388]
17 Maximum Volume Clustering In this section, we define our clustering model and propose two approximation algorithms. [sent-146, score-0.361]
18 (2005), we think of the binary clustering problem from a regularization viewpoint. [sent-149, score-0.389]
19 When the labels Yn are absent, a clustering method tries to minimize ϑ(Xn, y) over all possible assignments y ∈ {−1, +1}n for given Xn, that is, to solve the problem y∗ = arg min_{y∈{−1,+1}n} ϑ(Xn, y). [sent-155, score-0.361]
20 Generally speaking, ϑ(Xn, y∗) can be regarded as a measure of clustering quality. [sent-156, score-0.361]
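For intuition, this discrete problem can only be solved by brute force for very small n, since it enumerates all 2^n sign assignments; the hypothetical Python sketch below (with `theta` standing in for the quality measure ϑ(Xn, y), which is not specified here) makes the combinatorial nature explicit and motivates the relaxations developed for MVC.

```python
# Hypothetical illustration only: "theta" stands in for the clustering quality
# measure used by a given clustering method; it is not specified in this sketch.
import itertools
import numpy as np

def brute_force_clustering(X, theta):
    """Exhaustively minimize theta(X, y) over y in {-1, +1}^n (feasible only for tiny n)."""
    n = len(X)
    best_y, best_val = None, np.inf
    for labels in itertools.product([-1, 1], repeat=n):
        y = np.array(labels)
        val = theta(X, y)
        if val < best_val:
            best_y, best_val = y, val
    return best_y, best_val
```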
21 In our discriminative clustering model, we hope to use V (h) in Equation (1) as our regularization function. [sent-158, score-0.389]
22 More specifically, let us include a class balance constraint −b ≤ h⊤1n ≤ b with a user-specified class balance parameter b > 0 to prevent skewed clustering sizes. [sent-177, score-0.361]
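The constraint can be checked in one line for a candidate soft response vector h; the function below is a hypothetical illustration, not part of the paper's Matlab package.

```python
import numpy as np

def satisfies_class_balance(h, b):
    """True if -b <= h^T 1_n <= b, i.e., the soft responses are not too skewed."""
    return abs(np.sum(h)) <= b
```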
23 Generality MVC is a general framework and closely related to several existing clustering methods. [sent-236, score-0.361]
24 The primal problem of MVC-SL can in fact be reduced to the optimization problems of unnormalized spectral clustering (USC) (von Luxburg, 2007, p. [sent-237, score-0.516]
25 6), relaxed plain and kernel k-means clustering (Ding and He, 2004), and squared-loss mutual information based clustering (SMIC) (Sugiyama et al. [sent-238, score-0.722]
26 On the other hand, continuous solutions to the relaxations of k-means clustering (MacQueen, 1967; Hartigan and Wong, 1979) and kernel k-means clustering (Girolami, 2002) can be obtained by principal component analysis (PCA) and kernel PCA, respectively (Zha et al. [sent-257, score-0.722]
27 As a result, lim_{m→∞} h∗_m = v∗, and lim_{m→∞} (−2‖h∗_m‖1/γ_m + h∗_m⊤ Q h∗_m) = ε − v∗⊤ Cn K Cn v∗, where h∗_m is the solution to (15) with Q specified as above and γ_m = m, and v∗ is the solution to the relaxed kernel k-means clustering (Ding and He, 2004, Theorem 3. [sent-260, score-0.361]
28 In other words, plain k-means clustering and kernel k-means clustering after certain relaxations are special limit cases of MVC-SL. [sent-264, score-0.722]
29 When considering k-means algorithms that are referred to as certain iterative clustering algorithms rather than clustering models, by no means can they be special limit cases of MVC-SL. [sent-267, score-0.722]
30 Remark 2 After optimizing (17) and (18), SMIC adopts the post-processing that encloses α∗1 and α∗2 into posterior probabilities and enables the out-of-sample clustering ability for any x ∈ X even if x ∉ Xn (Sugiyama et al. [sent-277, score-0.361]
31 Finite Sample Stability The stability of the resulting clusters is important for clustering models whose non-convex primal problems are solved by randomized algorithms (e. [sent-280, score-0.438]
32 1 Definitions Definition 3 The Hamming clustering distance for two n-dimensional soft response vectors h and h′ is defined as dH(h, h′) := (1/2) min(‖sign(h) + sign(h′)‖1, ‖sign(h) − sign(h′)‖1). [sent-289, score-0.427]
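A minimal numpy sketch of this distance (an illustrative implementation, not from the paper's Matlab package); dividing it by n gives the clustering error rate used later in the experiments.

```python
import numpy as np

def hamming_clustering_distance(h, h_prime):
    """d_H(h, h') = 0.5 * min(||sign(h)+sign(h')||_1, ||sign(h)-sign(h')||_1)."""
    s, s_prime = np.sign(h), np.sign(h_prime)
    return 0.5 * min(np.abs(s + s_prime).sum(), np.abs(s - s_prime).sum())

# The distance is invariant to flipping all labels of one vector:
y      = np.array([ 1,  1, -1, -1,  1])
y_star = np.array([-1, -1,  1,  1, -1])
assert hamming_clustering_distance(y, y_star) == 0.0   # perfect clustering up to label swap
```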
33 The idea behind the irreducibility of Xn is simple: If xi is isolated, we cannot decide its cluster based on the information contained in Q no matter what binary clustering algorithm is used, unless we assign xi to one cluster and Xn \ xi to the other cluster. [sent-293, score-0.361]
34 We would like to remove such isolated samples and reduce the clustering of Xn to a better-defined problem. [sent-294, score-0.387]
35 To see this, let h+ = (∑k∈K δk ek + ∑k∉K δk ek)/√n and h− = (∑k∈K δk ek − ∑k∉K δk ek)/√n. [sent-332, score-0.388]
36 Clustering Error Bound In this section, we derive a clustering error bound for MVC-SL based on transductive Rademacher complexity (El-Yaniv and Pechyony, 2009). [sent-462, score-0.472]
37 It is extremely difficult, if possible at all, to evaluate clustering methods in an objective and domain-independent manner (von Luxburg et al. [sent-463, score-0.361]
38 However, when the goals and interests are clear, it makes sense to evaluate clustering results using classification benchmark data sets, where the class structure coincides with the desired cluster structure according to the goals and interests. [sent-465, score-0.391]
39 Here, we derive a data-dependent clustering error bound to guarantee the quality of this propagation of knowledge. [sent-468, score-0.397]
40 By Lemma 18 together with Theorem 2 of El-Yaniv and Pechyony (2009), we can immediately obtain the clustering error bound: Theorem 19 Assume that y ∗ is the ground truth partition of Xn , and L is a random set of size n′ chosen uniformly from the set {L | L ⊂ {1, . [sent-508, score-0.433]
41 The first term is a measure of the clustering error on Xn′ = {xi | i ∈ L } by the surrogate loss times the ratio n/n′ . [sent-518, score-0.397]
42 When considering the average clustering error measured by dH(h, y∗)/n, the order of the error bound is O(1/√n′) since the second term dominates the third and fourth terms. [sent-525, score-0.433]
43 We can use the theory of transductive Rademacher complexity to derive a clustering error bound for Algorithm 1, since it can be viewed as a transductive algorithm that ignores all revealed labels. [sent-527, score-0.547]
44 1 Maximum Margin Clustering Among existing clustering methods, maximum margin clustering (MMC) is closest to MVC. [sent-531, score-0.809]
45 The latter is more natural, since clustering is more transductive than inductive. [sent-534, score-0.436]
46 Subsequently, generalized maximum margin clustering (GMMC) (Valizadegan and Jin, 2007) relaxes the restriction that the original MMC only considers homogeneous hyperplanes and hence demands every possible clustering boundary to pass through the origin. [sent-566, score-0.809]
47 (24) where Cδ is a regularization parameter to control the trade-off between the clustering error and the margin. [sent-590, score-0.425]
48 Unlike MMC and GMMC, which rely on SDP, or IterSVR and CPMMC, which are non-convex, label-generation maximum margin clustering (LGMMC) (Li et al. [sent-601, score-0.448]
49 Moreover, MVC-SL has a clustering error bound, and to the best of our knowledge no MMC has such a result. [sent-616, score-0.397]
50 2 Spectral Clustering Spectral clustering (SC) (Shi and Malik, 2000; Meila and Shi, 2001; Ng et al. [sent-619, score-0.361]
51 SC algorithms include two steps: a spectral embedding step to unfold the manifold structure and embed the input data into a low-dimensional space in a geodesic manner, and then a k-means clustering step to carry out clustering using the embedded data. [sent-621, score-0.784]
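A minimal sketch of this two-step pipeline, assuming a precomputed symmetric similarity matrix W; the unnormalized Laplacian and scikit-learn's KMeans are illustrative choices, not the particular NSC variant used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, n_clusters=2):
    d = W.sum(axis=1)
    L = np.diag(d) - W                     # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)         # eigenpairs in ascending order of eigenvalue
    embedding = eigvecs[:, :n_clusters]    # step 1: spectral embedding
    km = KMeans(n_clusters=n_clusters, n_init=10)
    return km.fit_predict(embedding)       # step 2: k-means on the embedded data
```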
52 Globally optimal solutions to k-means clustering converge to a limit partition of the whole data space X , if the underlying distribution has a finite support, and the globally optimal solution 9. [sent-634, score-0.397]
53 On the other hand, MVC involves a combinatorial optimization, similarly to most clustering models and several semi-supervised learning models such as MMC. [sent-666, score-0.388]
54 This difficulty, caused by the integer feasible region, is intrinsic to the clustering problem and has nothing to do with the large volume approximation V(h). [sent-667, score-0.413]
55 1 Setup Seven clustering algorithms were included in our experiments: • Kernel k-means clustering (KM; Zha et al. [sent-672, score-0.722]
56 , 2002), • Normalized spectral clustering (NSC; Ng et al. [sent-673, score-0.423]
57 , 2002), • Maximum margin clustering (MMC; Xu et al. [sent-674, score-0.448]
58 , 2009), • Soft-label maximum volume clustering (MVC-SL), • Hard-label maximum volume clustering (MVC-HL). [sent-676, score-0.826]
59 In our experiments, the performance was measured by the clustering error rate (1/n) dH(y, y∗) = (1/(2n)) min(‖y + y∗‖1, ‖y − y∗‖1), where y is the label vector returned by the clustering algorithms and y∗ is the ground truth label vector. [sent-683, score-0.758]
60 2 Artificial Data Sets To begin with, we compare the clustering error and the computation time of all seven algorithms based on three artificial data sets. [sent-747, score-0.397]
61 Then, the regularization parameter C of MMC was the best value among {10^−3, 1, 10^3}, that is, we ran MMC three times using C = 10^−3, 1, 10^3 and recorded the best performance, since there is no uniformly effective model selection framework for clustering algorithms. [sent-752, score-0.389]
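This protocol amounts to a small grid search scored by the clustering error; a hypothetical sketch is given below, where `run_clustering` and `clustering_error` are placeholder callables rather than functions from the paper's code.

```python
def best_over_grid(X, y_star, run_clustering, clustering_error, grid=(1e-3, 1.0, 1e3)):
    """Run the method once per C and report the best (lowest) clustering error."""
    return min(clustering_error(run_clustering(X, C), y_star) for C in grid)
```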
62 The experimental results in terms of the means of the clustering error are reported in Figure 4. [sent-757, score-0.397]
63 Surprisingly, NSC was worse than KM on 2guassians, whereas MVC-SL, based on almost the same input Q = Lsym + In/n, had much lower clustering errors, which implies that the highly non-convex k-means step may be a bottleneck of NSC. [sent-761, score-0.361]
64 The worst-case computational complexities of MVC-HL and MMC made them extremely time-consuming, poorly scalable to medium or large sample sizes, and hence impractical despite their low mean clustering errors on 2guassians and 2circles. [sent-794, score-0.416]
65 It is interesting and surprising that both h0 and v0 were significantly inferior to v2 on 2circles when n = 100, but they still resulted in h∗ with the lowest mean clustering error. [sent-811, score-0.361]
66 Table 2: Means with standard errors of the clustering error (in %) on IDA benchmark data sets. [sent-919, score-0.455]
67 For each realization of each data set, we ignored the test data and tested five clustering algorithms using the training data, yet GMMC was not tested on the data sets Flare-solar, German, Image and Splice as it required a very long execution time when n ≥ 600. [sent-923, score-0.361]
68 Table 2 describes the means with standard errors of the clustering error rate by each algorithm on each data set. [sent-927, score-0.425]
69 The clustering errors of five algorithms exhibited large differences on five data sets, namely, Breast-cancer, Flare-solar, German, Ringnorm and Splice, among which MVC-SL was one of the best algorithms on three data sets, and LGMMC was one of the best algorithms on two data sets. [sent-930, score-0.389]
70 The clustering errors exhibited merely small differences on the other five data sets. [sent-931, score-0.389]
71 Moreover, the fully supervised SVM has a mean classification error obviously smaller than the lowest mean clustering error on the data sets German, Image and Splice, and larger than the lowest mean clustering error on the data sets Breast-cancer, Titanic and Twonorm. [sent-932, score-0.83]
72 It should not be surprising or confusing since the classification error is the out-of-sample test error on the test data whereas the clustering error is the in-sample test error on the same data to be clustered. [sent-933, score-0.505]
73 In fact, Ringnorm violates the underlying assumption when evaluating clustering results using classification data sets, that is, the class structure and the cluster structure must coincide with each other. [sent-935, score-0.361]
74 Instead of testing KM, NSC, LGMMC, GMMC and MVC-SL on all forty-five pairwise clustering tasks, a few challenging tasks were selected, namely, the pairs {1, 7}, {1, 9}, {8, 9}, {3, 5}, {3, 8}, {5, 8} of USPS and {1, 7}, {7, 9}, {8, 9}, {3, 5}, {3, 8}, {5, 8} of MNIST. [sent-940, score-0.412]
75 Figure 7 reports the means of the clustering error by each algorithm on each task. [sent-952, score-0.397]
76 Moreover, Table 3 summarizes the means with standard errors of the clustering error, in which each algorithm has 80 random samplings on each task. [sent-955, score-0.453]
77 Since the sample sizes here varied in a large range, we performed the paired t-test of the null hypothesis that the difference of the clustering error is from a Gaussian distribution with mean zero and unknown variance, against the alternative hypothesis that the mean is not zero. [sent-956, score-0.506]
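Such a test can be carried out with scipy's paired t-test; the error arrays below are made-up placeholders rather than values from the reported tables.

```python
from scipy import stats

errors_mvc_sl = [2.1, 3.4, 1.8, 5.0, 0.9]   # hypothetical per-sampling clustering errors (%)
errors_nsc    = [2.5, 3.9, 1.7, 6.2, 1.1]   # hypothetical per-sampling clustering errors (%)
t_stat, p_value = stats.ttest_rel(errors_mvc_sl, errors_nsc)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.3f}")
```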
78 7, such that the mean clustering errors of MVC-SL and NSC were less than two percent when n ≥ 100, and the hardest tasks were MNIST 7 vs. [sent-958, score-0.44]
79 In addition, according to Figure 7, the mean clustering errors of MVC-SL were basically non-increasing except in panel (f) USPS 5 vs. [sent-962, score-0.425]
80 8 Figure 7: Means of the clustering error (in %) on USPS and MNIST. [sent-982, score-0.397]
81 Table 3: Means with standard errors of the clustering error (in %) on USPS and MNIST. [sent-1115, score-0.425]
82 We prepared nine pairwise clustering tasks which included all tasks between the four multi-modal topics and all tasks between the three unimodal topics. [sent-1119, score-0.514]
83 Figure 8 reports the means of the clustering error by each algorithm on each task. [sent-1124, score-0.397]
84 In addition, Table 4 summarizes the means with standard errors of the clustering error, in which each algorithm has 80 random samplings on each task. [sent-1127, score-0.453]
85 Moreover, MVC-SL, NSC and GMMC usually outperformed KM and LGMMC, and Figure 8 also illustrates that the mean clustering errors of MVC-SL were basically non-increasing. [sent-1140, score-0.389]
86 Figure 8: Means of the clustering error (in %) on 20Newsgroups. [sent-1160, score-0.437]
87 Table 4: Means with standard errors of the clustering error (in %) on 20Newsgroups. [sent-1267, score-0.425]
88 Table 5: Means with standard errors of the clustering error (in %) on Isolet. [sent-1335, score-0.425]
89 Figure 9 reports the means of the clustering error by each algorithm on each task. [sent-1354, score-0.397]
90 Figure 9: Means of the clustering error (in %) on Isolet. [sent-1361, score-0.397]
91 In addition, Table 5 summarizes the means with standard errors of the clustering error, in which each algorithm has 80 random samplings on each task. [sent-1364, score-0.453]
92 D, such that the lowest mean clustering errors on B vs. [sent-1375, score-0.389]
93 D were almost three times larger than the lowest mean clustering error on T vs. [sent-1377, score-0.397]
94 Unlike the curves shown in Figures 7 and 8, the mean clustering errors of MVC-SL in Figure 9 were basically non-increasing only in panel (d) A vs. [sent-1379, score-0.425]
95 Conclusions We proposed a new discriminative clustering model called maximum volume clustering (MVC) to partition the data samples into two clusters based on the large volume principle. [sent-1385, score-0.862]
96 Then, we demonstrated that MVC includes the optimization problems of some well-known clustering methods as special limit cases, and discussed the finite sample stability and the clustering error bound of MVC-SL in great detail. [sent-1387, score-0.851]
97 Notice that any (cross-) validation technique using the clustering error, which is the in-sample test error on the same data to be clustered, simply does not work for model selection. [sent-1424, score-0.397]
98 In order to do model selection, a criterion other than the clustering error is necessary. [sent-1425, score-0.397]
99 Nevertheless, MVC-SL may work with high probability in practice when spectral clustering works. [sent-1440, score-0.423]
100 We argue that it may be the minimization of negative ℓ1 -norm in MVC-SL that has improved spectral clustering as shown in panel (c) of Figure 6. [sent-1441, score-0.459]
wordName wordTfidf (topN-words)
[('clustering', 0.361), ('mvc', 0.359), ('lgmmc', 0.277), ('gmmc', 0.25), ('mmc', 0.23), ('hq', 0.213), ('qh', 0.195), ('dai', 0.173), ('nsc', 0.166), ('ugiyama', 0.147), ('aximum', 0.147), ('lustering', 0.133), ('usps', 0.123), ('ek', 0.097), ('axisymmetric', 0.097), ('lun', 0.097), ('xn', 0.094), ('iu', 0.093), ('mnist', 0.093), ('lsym', 0.09), ('sci', 0.09), ('km', 0.089), ('margin', 0.087), ('vl', 0.085), ('hang', 0.083), ('sign', 0.076), ('transductive', 0.075), ('axisymmetry', 0.075), ('dh', 0.071), ('pechyony', 0.07), ('sqp', 0.067), ('rademacher', 0.066), ('samplings', 0.064), ('luxburg', 0.062), ('hk', 0.062), ('spectral', 0.062), ('ht', 0.061), ('ida', 0.058), ('misc', 0.058), ('rn', 0.054), ('axes', 0.053), ('talk', 0.053), ('similarity', 0.053), ('comp', 0.052), ('volume', 0.052), ('tasks', 0.051), ('anisotropic', 0.051), ('von', 0.05), ('rec', 0.048), ('sdp', 0.047), ('sugiyama', 0.047), ('eigenvectors', 0.044), ('diag', 0.043), ('sc', 0.043), ('bie', 0.043), ('soc', 0.04), ('shi', 0.04), ('laplacian', 0.039), ('stability', 0.039), ('valizadegan', 0.038), ('primal', 0.038), ('soft', 0.038), ('boggs', 0.037), ('eigenvalues', 0.037), ('partition', 0.036), ('hyperparameter', 0.036), ('error', 0.036), ('panel', 0.036), ('ding', 0.035), ('sec', 0.035), ('qi', 0.035), ('eigenvector', 0.034), ('cn', 0.033), ('alt', 0.033), ('subsequently', 0.033), ('ringnorm', 0.032), ('splice', 0.032), ('en', 0.031), ('ind', 0.03), ('tolle', 0.03), ('twin', 0.03), ('isolet', 0.03), ('benchmarks', 0.03), ('benchmark', 0.03), ('paired', 0.028), ('regularization', 0.028), ('unnormalized', 0.028), ('xu', 0.028), ('errors', 0.028), ('pt', 0.028), ('response', 0.028), ('cristianini', 0.027), ('optimization', 0.027), ('hypothesis', 0.027), ('equivalence', 0.027), ('sample', 0.027), ('usc', 0.027), ('suzuki', 0.027), ('qy', 0.027), ('isolated', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama
Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hard-label MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method possesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error bound
Author: Daniil Ryabko, Jérémie Mary
Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions
3 0.096777588 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
Author: Yu-Feng Li, Ivor W. Tsang, James T. Kwok, Zhi-Hua Zhou
Abstract: In this paper, we study the problem of learning from weakly labeled data, where labels of the training examples are incomplete. This includes, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also get stuck in local minimum. In this paper, we focus on SVMs and propose the W ELL SVM via a novel label generation strategy. This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. Moreover, the W ELL SVM can be solved via a sequence of SVM subproblems that are much more scalable than previous convex SDP relaxations. Experiments on three weakly labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning for locating regions of interest in content-based information retrieval; and (iii) clustering, clearly demonstrate improved performance, and W ELL SVM is also readily applicable on large data sets. Keywords: weakly labeled data, semi-supervised learning, multi-instance learning, clustering, cutting plane, convex relaxation
4 0.091130331 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel
Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive
5 0.069955721 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
Author: Wei Pan, Xiaotong Shen, Binghui Liu
Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)
6 0.054357603 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
7 0.053077403 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
8 0.050481912 59 jmlr-2013-Large-scale SVD and Manifold Learning
9 0.049409386 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
10 0.049035087 114 jmlr-2013-The Rate of Convergence of AdaBoost
11 0.047660094 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
12 0.04588747 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
13 0.045010757 90 jmlr-2013-Quasi-Newton Method: A New Direction
14 0.044502873 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
15 0.044116296 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
16 0.042294152 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
17 0.041363951 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
18 0.041216746 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
19 0.040256269 73 jmlr-2013-Multicategory Large-Margin Unified Machines
20 0.038559355 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
topicId topicWeight
[(0, -0.214), (1, 0.069), (2, 0.019), (3, -0.036), (4, 0.057), (5, -0.002), (6, -0.007), (7, -0.045), (8, 0.037), (9, -0.165), (10, 0.132), (11, -0.103), (12, 0.062), (13, 0.079), (14, 0.024), (15, 0.093), (16, 0.069), (17, -0.049), (18, -0.322), (19, 0.146), (20, -0.096), (21, 0.33), (22, -0.063), (23, -0.031), (24, -0.098), (25, -0.03), (26, 0.05), (27, -0.051), (28, 0.078), (29, 0.03), (30, -0.027), (31, -0.015), (32, 0.08), (33, -0.092), (34, 0.039), (35, -0.017), (36, -0.013), (37, 0.027), (38, -0.077), (39, 0.071), (40, 0.013), (41, -0.135), (42, 0.033), (43, 0.018), (44, 0.029), (45, 0.006), (46, 0.087), (47, 0.015), (48, 0.023), (49, -0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.94295001 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama
Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hard-label MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method possesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error bound
2 0.80202985 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel
Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive
Author: Daniil Ryabko, Jérémie Mary
Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions
4 0.54306817 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
Author: Wei Pan, Xiaotong Shen, Binghui Liu
Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)
5 0.44644696 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
Author: Yu-Feng Li, Ivor W. Tsang, James T. Kwok, Zhi-Hua Zhou
Abstract: In this paper, we study the problem of learning from weakly labeled data, where labels of the training examples are incomplete. This includes, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also get stuck in local minimum. In this paper, we focus on SVMs and propose the W ELL SVM via a novel label generation strategy. This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. Moreover, the W ELL SVM can be solved via a sequence of SVM subproblems that are much more scalable than previous convex SDP relaxations. Experiments on three weakly labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning for locating regions of interest in content-based information retrieval; and (iii) clustering, clearly demonstrate improved performance, and W ELL SVM is also readily applicable on large data sets. Keywords: weakly labeled data, semi-supervised learning, multi-instance learning, clustering, cutting plane, convex relaxation
6 0.31094041 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
7 0.2953251 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
8 0.27510989 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
9 0.26928937 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
10 0.25830483 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
11 0.2512908 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
12 0.25121227 73 jmlr-2013-Multicategory Large-Margin Unified Machines
13 0.23863848 33 jmlr-2013-Dimension Independent Similarity Computation
14 0.23817617 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
15 0.23654662 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization
16 0.23348677 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
17 0.23266001 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
18 0.22316632 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library
19 0.22169234 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
20 0.2163157 59 jmlr-2013-Large-scale SVD and Manifold Learning
topicId topicWeight
[(0, 0.025), (5, 0.097), (6, 0.028), (10, 0.579), (20, 0.011), (23, 0.019), (44, 0.011), (68, 0.018), (70, 0.02), (75, 0.031), (85, 0.029), (87, 0.016)]
simIndex simValue paperId paperTitle
1 0.98142481 96 jmlr-2013-Regularization-Free Principal Curve Estimation
Author: Samuel Gerber, Ross Whitaker
Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection
2 0.95166528 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
Author: Matthew J. Johnson, Alan S. Willsky
Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semiMarkov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model
same-paper 3 0.93503332 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama
Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hard-label MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method possesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error bound
4 0.83788854 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
Author: Ery Arias-Castro, Bruno Pelletier
Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.
5 0.62530667 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer
Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation spaces should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modification to compensate for this effect. Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modified SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance. Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation. © 2013 Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial and Klaus Obermayer.
6 0.61347276 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
7 0.59983832 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
8 0.59948987 86 jmlr-2013-Parallel Vector Field Embedding
9 0.59824038 59 jmlr-2013-Large-scale SVD and Manifold Learning
10 0.59698123 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis
11 0.5880425 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
12 0.58585471 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
13 0.58125234 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
14 0.57870215 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
15 0.57838893 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
16 0.57687104 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection
17 0.57592034 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
18 0.57437497 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
20 0.56902021 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds