nips nips2007 nips2007-65 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francis R. Bach, Zaïd Harchaoui
Abstract: We present a novel linear clustering framework (DIFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower-dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance to K-means, in particular in terms of robustness to noise. (2) It can be readily extended to nonlinear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 DIFFRAC : a discriminative and flexible framework for clustering Francis R. [sent-1, score-0.612]
2 Abstract We present a novel linear clustering framework (DIFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. [sent-6, score-1.071]
3 The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. [sent-7, score-0.249]
4 This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance to K-means, in particular in terms of robustness to noise. [sent-8, score-0.444]
5 (2) It can be readily extended to nonlinear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. [sent-9, score-0.833]
6 (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. [sent-10, score-0.408]
7 1 Introduction Many clustering frameworks have already been proposed, with numerous applications in machine learning, exploratory data analysis, computer vision and speech processing. [sent-12, score-0.364]
8 In this paper, we present a discriminative and flexible framework for clustering (DIFFRAC), which is aimed at alleviating some of those practical annoyances. [sent-14, score-0.612]
9 Our framework is based on a recent set of works [1, 2] that have used the support vector machine (SVM) cost function for linear classification as a clustering criterion, with the intuitive goal of looking for clusters which are most linearly separable. [sent-15, score-0.721]
10 This line of work has led to promising results; however, the large convex optimization problems that have to be solved prevent application to datasets larger than a few hundred data points. [sent-16, score-0.257]
11 1 In this paper, we consider the maximum value of the regularized linear regression on indicator matrices. [sent-17, score-0.149]
12 By choosing a square loss (instead of the hinge loss), we obtain a simple cost function which can be expressed in closed form and is amenable to specific efficient convex optimization algorithms that can deal with large datasets of size 10,000 to 50,000 data points. [sent-18, score-0.471]
13 Our cost function turns out to be a linear function of the “equivalence matrix” M , which is a square {0, 1}-matrix indexed by the data points, with value one for all pairs of data points that belong to the same cluster, and zero otherwise. [sent-19, score-0.353]
14 In Section 2, we present a derivation of our cost function and of the convex relaxations. [sent-22, score-0.241]
15 In Section 3, we show how the convex relaxed problem can be solved efficiently through a sequence of lower dimensional singular value decompositions, while in Section 4, we show how a priori knowledge can be incorporated into our framework. [sent-23, score-0.226]
16 2 Discriminative clustering framework In this section, we first assume that we are given n points x1 , . . . , xn in Rd . [sent-25, score-0.482]
17 We represent the partitions of {1, . . . , n} into k > 1 clusters by indicator matrices y ∈ {0, 1}n×k such that y1k = 1n , where 1k and 1n denote the constant vectors of all ones, of dimensions k and n. [sent-32, score-0.34]
18 1 Discriminative clustering cost Given y, we consider the regularized linear regression problem of y given X, which takes the form: min_{w ∈ Rd×k, b ∈ R1×k} (1/n) ||y − Xw − 1n b||_F^2 + κ tr(w⊤ w), (1) where the Frobenius norm is defined for any vector or rectangular matrix as ||A||_F^2 = tr(AA⊤) = tr(A⊤A). [sent-35, score-0.683]
19 The optimal value is then equal to J(y, X, κ) = tr(yy⊤ A(X, κ)), (2) where the n × n matrix A(X, κ) is defined as: A(X, κ) = (1/n) Πn (In − X(X⊤ Πn X + nκI)−1 X⊤) Πn , (3) with Πn = In − (1/n) 1n 1n⊤ the centering projection matrix. [sent-38, score-0.218]
20 Note that A(X, κ) ⪰ 0 and A(X, κ)1n = 0. Following [1] and [2], we are thus looking for a k-class indicator matrix y such that tr(yy⊤ A(X, κ)) is minimal, i.e. [sent-44, score-0.282]
21 for a partition such that the clusters are most linearly separated, where the separability of clusters is measured through the minimum of the discriminative cost with respect to all linear classifiers. [sent-46, score-0.693]
22 This combinatorial optimization is NP-hard in general [6], but efficient convex relaxations may be obtained, as presented in the next section. [sent-47, score-0.293]
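As a concrete illustration of Eqs. (1)-(3), here is a minimal NumPy sketch that forms A(X, κ) explicitly and evaluates the cost tr(yy⊤ A(X, κ)) for a given labelling. The function names and the toy data are ours, and the explicit n × n construction is only meant for small examples (Section 3 discusses how to avoid it at scale).

```python
import numpy as np

def centering_matrix(n):
    """Pi_n = I_n - (1/n) 1_n 1_n^T, the centering projection."""
    return np.eye(n) - np.ones((n, n)) / n

def diffrac_A(X, kappa):
    """A(X, kappa) = (1/n) Pi_n (I_n - X (X^T Pi_n X + n*kappa*I_d)^{-1} X^T) Pi_n  (Eq. 3)."""
    n, d = X.shape
    Pi = centering_matrix(n)
    inner = np.linalg.solve(X.T @ Pi @ X + n * kappa * np.eye(d), X.T)
    return Pi @ (np.eye(n) - X @ inner) @ Pi / n

def indicator_matrix(labels, k):
    """n x k {0,1} indicator matrix y with y 1_k = 1_n."""
    y = np.zeros((len(labels), k))
    y[np.arange(len(labels)), labels] = 1.0
    return y

def clustering_cost(labels, X, kappa, k):
    """J(y, X, kappa) = tr(y y^T A(X, kappa))  (Eq. 2), linear in M = y y^T."""
    y = indicator_matrix(labels, k)
    return float(np.trace(y @ y.T @ diffrac_A(X, kappa)))

# toy usage: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(3.0, 0.1, (20, 2))])
true_labels = np.array([0] * 20 + [1] * 20)
print(clustering_cost(true_labels, X, kappa=1e-2, k=2))
```

Shuffling the labels and re-evaluating should give a noticeably larger cost on this toy data, matching the intuition that the criterion prefers linearly separable partitions.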
23 2 Indicator and equivalence matrices The cost function defined in Eq. (2) depends on the indicator matrix y only through the equivalence matrix M = yy⊤. [sent-49, score-0.321]
24 We let Ek denote the set of “k-class equivalence matrices”, i.e. [sent-51, score-0.127]
25 the set of matrices M such that there exists a k-class indicator matrix y with M = yy⊤ . [sent-53, score-0.303]
26 There are many outer convex approximations of the discrete sets Ek , based on different properties of matrices in Ek , that were used in different contexts, such as maximum cut problems [6] or correlation clustering [7]. [sent-54, score-0.621]
27 We have the following usual properties of equivalence matrices (independent of k): if M ∈ Ek , then (a) M is positive semidefinite (denoted as M ⪰ 0), (b) M has nonnegative values (denoted as M ≥ 0), and (c) the diagonal of M is equal to 1n (denoted as diag(M ) = 1n ). [sent-55, score-0.298]
28 Moreover, if M corresponds to at most k clusters, we have M ⪰ (1/k) 1n 1n⊤ , which is a consequence of the convex outer approximation of [6] for the maximum k-cut problem. [sent-56, score-0.181]
29 We thus use the following convex outer approximation: Ck = {M ∈ Rn×n : M = M⊤ , diag(M ) = 1n , M ⪰ (1/k) 1n 1n⊤ , M ≥ 0}. Note that when k = 2, the constraints M ≥ 0 are implied by the other constraints. [sent-57, score-0.329]
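The following sketch (helper names ours) builds M = yy⊤ from a labelling and checks the constraints defining Ck numerically; it simply restates properties (a)-(c) and the (1/k) 1n 1n⊤ bound above.

```python
import numpy as np

def equivalence_matrix(labels):
    """M = y y^T: M_ij = 1 iff points i and j are in the same cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def in_Ck(M, k, tol=1e-8):
    """Numerically check the outer approximation C_k: symmetry, unit diagonal,
    entrywise nonnegativity, and M - (1/k) 1_n 1_n^T positive semidefinite."""
    n = M.shape[0]
    sym = np.allclose(M, M.T, atol=tol)
    unit_diag = np.allclose(np.diag(M), 1.0, atol=tol)
    nonneg = bool((M >= -tol).all())
    psd = np.linalg.eigvalsh(M - np.ones((n, n)) / k).min() >= -tol
    return sym and unit_diag and nonneg and psd

M = equivalence_matrix([0, 0, 1, 1, 2])
print(in_Ck(M, k=3))   # True: every 3-class equivalence matrix lies in C_3
```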
30 3 Minimum cluster sizes Given the discriminative nature of our cost function (and in particular that A(X, κ)1n = 0), the minimum value 0 is always obtained with M = 1n 1n⊤ , a matrix of rank one, equivalent to a single cluster. [sent-60, score-0.436]
31 Following [1], we impose a minimum size λ0 for each cluster, through row sums and eigenvalues: Row sums If M ∈ Ek , then M 1n ≥ λ0 1n and M 1n ≤ (n − (k − 1)λ0 )1n (the clusters must be smaller than n − (k − 1)λ0 if they are all larger than λ0 ); this is the same constraint as in [1]. [sent-62, score-0.178]
32 Eigenvalues When M ∈ Ek , the sizes of the clusters are exactly the k largest eigenvalues of M . [sent-63, score-0.22]
33 Thus, for a matrix in Ek , the minimum cluster size constraint is equivalent to Σ_{i=1}^n 1_{λi (M ) ≥ λ0} ≥ k, where λ1 (M ), . . . , λn (M ) denote the eigenvalues of M . [sent-64, score-0.239]
34 Functions of the form Φ(M ) = Σ_{i=1}^n φ(λi (M )) are referred to as spectral functions and are particularly interesting in machine learning and optimization, since Φ inherits from φ many of its properties, such as differentiability and convexity [8]. [sent-68, score-0.149]
35 The previous constraint can be seen as Φ(M ) ≥ k, with φ(λ) = 1_{λ ≥ λ0} , which is not concave and thus does not lead to a convex constraint. [sent-69, score-0.191]
36 Our final convex relaxation is thus to minimize tr(A(X, κ)M ) with respect to M ∈ Ck such that Φλ0 (M ) ≥ k, M 1n ≥ λ0 1n and M 1n ≤ (n − (k − 1)λ0 )1n , where Φλ0 (M ) = Σ_{i=1}^n min{λi (M )/λ0 , 1}. [sent-71, score-0.185]
37 The clustering results are empirically robust to the value of λ0 . [sent-72, score-0.364]
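A small sketch of the spectral term Φλ0 used above (helper name ours): on a true equivalence matrix the nonzero eigenvalues are the cluster sizes, so Φλ0(M) behaves like a concave surrogate for the number of clusters of size at least λ0.

```python
import numpy as np

def phi_lambda0(M, lam0):
    """Phi_{lambda0}(M) = sum_i min(lambda_i(M)/lambda0, 1) -- the concave
    spectral surrogate used for the minimum-cluster-size constraint."""
    eigs = np.linalg.eigvalsh(M)
    return float(np.minimum(eigs / lam0, 1.0).sum())

# On a k-class equivalence matrix the nonzero eigenvalues are the cluster sizes,
# so Phi roughly counts clusters of size >= lambda0 (here: 2 clusters of size 3).
labels = np.array([0, 0, 0, 1, 1, 1])
M = (labels[:, None] == labels[None, :]).astype(float)
print(phi_lambda0(M, lam0=2.0))   # approximately 2.0
```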
38 Indeed, in the unregularized case (κ = 0), we aim to minimize tr Πn (In − X(X ⊤ Πn X)−1 X ⊤ )Πn yy ⊤ . [sent-76, score-0.157]
39 The main differences between the two cost functions are that (1) we require an additional parameter, namely the minimum number of elements per cluster and (2) our cost function normalizes the data, while the K-means distortion measure normalizes the labels. [sent-78, score-0.483]
40 Note that using a discriminative criterion based on the square loss may lead to the masking problem [4], which can be dealt with in the usual way by using second-order polynomials or, equivalently, a polynomial kernel. [sent-81, score-0.241]
41 Indeed, using the matrix inversion lemma, we get: A(K, κ) = κ Πn (K̃ + nκIn )−1 Πn , (4) where K̃ = Πn K Πn is the “centered Gram matrix” of the points X. [sent-85, score-0.133]
42 We can thus apply our framework with any positive definite kernel [5]. [sent-86, score-0.153]
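A short sketch of Eq. (4) (function name ours), together with a numerical check that it matches the feature-space expression of Eq. (3) when K = XX⊤ is the linear kernel.

```python
import numpy as np

def diffrac_A_kernel(K, kappa):
    """A(K, kappa) = kappa * Pi_n (K_c + n*kappa*I_n)^{-1} Pi_n  (Eq. 4),
    where K_c = Pi_n K Pi_n is the centered Gram matrix."""
    n = K.shape[0]
    Pi = np.eye(n) - np.ones((n, n)) / n
    Kc = Pi @ K @ Pi
    return kappa * Pi @ np.linalg.solve(Kc + n * kappa * np.eye(n), Pi)

# check against the feature-space formula (Eq. 3) for a linear kernel
rng = np.random.default_rng(0)
n, d, kappa = 30, 4, 0.1
X = rng.normal(size=(n, d))
Pi = np.eye(n) - np.ones((n, n)) / n
A_feat = Pi @ (np.eye(n) - X @ np.linalg.solve(X.T @ Pi @ X + n * kappa * np.eye(d), X.T)) @ Pi / n
print(np.allclose(A_feat, diffrac_A_kernel(X @ X.T, kappa)))   # True
```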
43 6 Additional relaxations Our convex optimization problem can be further relaxed. [sent-88, score-0.24]
44 An interesting relaxation is obtained by (1) relaxing the constraints M ⪰ (1/k) 1n 1n⊤ into M ⪰ 0, (2) relaxing diag(M ) = 1n into tr M = n, [sent-89, score-0.692]
45 The clustering performance is plotted against the number of irrelevant dimensions, for regular K-means and our DIFFRAC approach (right panel, averaged over 50 replications with the standard deviation in dotted lines). [sent-95, score-0.447]
46 The clustering performance is measured by a metric between partitions defined in Section 5, which is always between 0 and 1. [sent-96, score-0.448]
47 and (3) removing the constraint M ≥ 0 and the constraints on the row sums. [sent-97, score-0.216]
48 A short calculation shows that this relaxation leads to an eigenvalue problem: let A = Σ_{i=1}^n ai ui ui⊤ be an eigenvalue decomposition of A, where a1 ≤ · · · ≤ an are the sorted eigenvalues. [sent-98, score-0.3]
49 The minimal value of the relaxed convex optimization problem is attained at M = λ0 Σ_{i=1}^j ui ui⊤ + (n − λ0 j) uj+1 uj+1⊤ , with j = ⌊n/λ0 ⌋. [sent-99, score-0.263]
50 This additional relaxation into an eigenvalue problem is the basis of our efficient optimization algorithm in Section 3. [sent-100, score-0.245]
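The extracted formula for this minimizer is partly ambiguous, so the sketch below encodes one consistent reading of it: the eigenvalues of A are sorted in increasing order, the first j = ⌊n/λ0⌋ eigenvectors receive weight λ0, and the next one receives the leftover mass so that tr M = n. The λ0 weights and the ascending ordering are our assumptions (reconstructed from the trace constraint), not a verbatim quote of the paper.

```python
import numpy as np

def relaxed_M(A, lam0):
    """One reading of the closed-form solution of the fully relaxed problem:
    weight lam0 on the j = floor(n/lam0) eigenvectors of A with the smallest
    eigenvalues, and the leftover mass n - lam0*j on the next eigenvector,
    so that tr(M) = n.  (Weights and ordering are our assumptions.)"""
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eigh(A)      # ascending eigenvalues
    j = min(int(np.floor(n / lam0)), n)
    M = lam0 * eigvecs[:, :j] @ eigvecs[:, :j].T
    if j < n:
        u = eigvecs[:, j]
        M += (n - lam0 * j) * np.outer(u, u)
    return M
```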
51 In the linear setting, since PCA has no clustering effects in general (see footnote 2), it is clear that the constraints that were removed are essential to the clustering performance. [sent-103, score-0.364]
52 In the kernel setting, experiments have shown that the most important constraint to keep in order to achieve the best embedding and clustering is the constraint diag(M ) = 1n . [sent-104, score-0.555]
53 However, the number of variables is O(n2 ) and thus the complexity of general purpose algorithms will be at least O(n7 ); this remains much too slow for medium scale problems, where the number of data points is between 1,000 and 10,000. [sent-107, score-0.111]
54 1 Optimization by partial dualization We saw earlier that by relaxing some of the constraints, we get back an eigenvalue problem. [sent-110, score-0.178]
55 Eigenvalue decompositions are among the most important tools in numerical algebra; algorithms and codes are heavily optimized for them, and it is thus advantageous to rely on a sequence of eigenvalue decompositions for large-scale algorithms. [sent-111, score-0.237]
56 We can dualize some constraints while keeping others; this leads to the following proposition. (Footnote 2: recent results show, however, that PCA does have a clustering effect when clusters are spherical Gaussians [11].) [sent-112, score-0.294]
57 Proposition 1 The solution of the convex optimization problem defined in Section 2. [sent-113, score-0.187]
58 The variables β1 , β2+ , β3+ , β4+ , (β5 , β6 ) correspond to the respective dualizations of the constraints diag(M ) = 1n , M 1n ≤ (n − (k − 1)λ0 )1n , M 1n ≥ λ0 1n , M ≥ 0, and M ⪰ (1/k) 1n 1n⊤ (the superscript + indicates dual variables constrained to be nonnegative, corresponding to inequality constraints). [sent-115, score-0.148]
59 The function J(B) = min_{M ⪰ 0, tr M = n, Φλ0 (M ) ≥ k} tr(BM) is a spectral convex function and may be computed in closed form through an eigenvalue decomposition. [sent-116, score-0.392]
60 Moreover, a subgradient may be easily computed, readily leading to a numerically efficient subgradient method in fewer dimensions than n2 . [sent-117, score-0.176]
61 Indeed, if we subsample the pointwise positivity constraint N ≥ 0 (so that β4 has only a size smaller than n^{1/2} × n^{1/2} ), then the set of dual variables β we are trying to maximize has linear size in n (instead of the primal variable M being quadratic in n). [sent-118, score-0.179]
62 More refined optimization schemes, based on smoothing of the spectral function J(B) by min_{M ⪰ 0, tr M = n, Φλ0 (M ) ≥ k} [tr(BM) + ε tr(M²)], are also used to speed up convergence (steepest descent of a smoothed function is generally faster than subgradient iterations) [13]. [sent-119, score-0.23]
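The sketch below is only a generic projected subgradient ascent template for a dual of this kind, with the value/subgradient oracle left abstract (in the paper's setting it would wrap the eigenvalue computation of J(B)); it is not the exact smoothed scheme of [13], and the function and variable names are ours.

```python
import numpy as np

def subgradient_ascent(value_and_subgrad, beta0, project, step0=1.0, iters=200):
    """Maximize a concave dual function given a value/subgradient oracle,
    using diminishing steps and projection onto the dual feasible set."""
    beta, best_val, best_beta = beta0.copy(), -np.inf, beta0.copy()
    for t in range(1, iters + 1):
        val, grad = value_and_subgrad(beta)
        if val > best_val:
            best_val, best_beta = val, beta.copy()
        beta = project(beta + (step0 / np.sqrt(t)) * grad)
    return best_beta, best_val

# toy usage: maximize -sum_i |beta_i - 1| over the nonnegative orthant
oracle = lambda b: (-np.abs(b - 1).sum(), -np.sign(b - 1))
beta, val = subgradient_ascent(oracle, np.zeros(5), lambda b: np.maximum(b, 0.0))
print(beta.round(2), val)   # close to [1 1 1 1 1] and 0
```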
63 2 Computational complexity The running time complexity can be split into initialization procedures and per-iteration complexity. [sent-121, score-0.12]
64 The per-iteration complexity depends directly on the cost of our eigenvalue problems, which are themselves linear in the cost of a matrix-vector operation with the matrix A (we only require a fixed small number of eigenvalues). [sent-122, score-0.384]
65 In all situations, we manage to keep a linear complexity in the number n of data points. [sent-123, score-0.086]
66 For linear kernels with dimension d, the complexity of initialization is O(d²n), while the complexity of each iteration is proportional to the cost of performing a matrix-vector operation with A, that is, O(dn). [sent-125, score-0.356]
67 For general kernels, the complexity of initialization is O(n³), while the complexity of each iteration is O(n²). [sent-126, score-0.12]
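For the linear case, the O(dn) matrix-vector product mentioned above can be implemented without ever forming the n × n matrix A; the sketch below (names ours) follows Eq. (3) directly, with a one-time O(d²n + d³) setup.

```python
import numpy as np

def make_A_matvec(X, kappa):
    """Return v -> A(X, kappa) v, computed in O(d n) per call by using the
    centered data and the d x d matrix X^T Pi_n X + n*kappa*I_d (Eq. 3),
    instead of forming the n x n matrix A."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)                          # Pi_n X
    C_inv = np.linalg.inv(Xc.T @ Xc + n * kappa * np.eye(d))

    def matvec(v):
        vc = v - v.mean()                            # Pi_n v
        return (vc - Xc @ (C_inv @ (Xc.T @ vc))) / n
    return matvec

# usage: apply A to a random vector without building A explicitly
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
Av = make_A_matvec(X, kappa=0.1)(rng.normal(size=1000))
```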
68 3 Rounding After the convex optimization, we obtain a low-rank matrix M ∈ Ck which is pointwise nonnegative with unit diagonal, of the form U U ⊤ where U ∈ Rn×m . [sent-129, score-0.248]
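The extracted text omits the rounding step itself; the sketch below is an illustrative stand-in (our assumption, in the spirit of spectral-clustering roundings, not necessarily the paper's exact procedure): embed the points via the top eigenvectors of M and run a small K-means on the rows.

```python
import numpy as np

def round_to_partition(M, k, n_eig=None, iters=100, seed=0):
    """Heuristic rounding of a relaxed solution M ~ U U^T into k clusters:
    take the top eigenvectors of M as an embedding and run K-means on its
    rows.  (Illustrative stand-in for the paper's rounding step.)"""
    n_eig = k if n_eig is None else n_eig
    eigvals, eigvecs = np.linalg.eigh(M)
    U = eigvecs[:, -n_eig:] * np.sqrt(np.maximum(eigvals[-n_eig:], 0.0))
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = U[labels == c].mean(axis=0)
    return labels
```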
69 Those two types of pairwise constraints (positive constraints Mij = 1 and negative constraints Mij = 0, discussed below) are linear in M and can thus easily be included in our convex formulation. [sent-134, score-0.318]
70 Moreover, we assume that the set of positive constraints is closed, i.e. [sent-136, score-0.2]
71 that if there is a path of positive constraints between two data points, then these two data points are already forming a pair in P+ . [sent-138, score-0.272]
72 If the set of positive pairs does not satisfy this assumption, a larger set of pairs can be obtained by transitive closure. [sent-139, score-0.118]
73 (Figure 2 plot residue: panels titled “20 % x n” and “40 % x n”; y-axis “clustering error”, x-axis “noise dimension”; curves for K-means and DIFFRAC.) [sent-140, score-0.555]
74 Figure 2: Comparison with K-means in the semi-supervised setting, with data taken from Figure 1: clustering performance (averaged over 50 replications, with standard deviations in dotted) vs. noise dimension. [sent-142, score-0.364]
75 Positive constraints Given our closure assumption on P+ , we get a partition of {1, . . . , n} into p groups. [sent-144, score-0.192]
76 The singletons in this partition correspond to data points that are not involved in any positive constraints, while the other subsets correspond to chunks of data points that must occur together in the final partition. [sent-148, score-0.311]
77 Forcing those groups is equivalent to considering M of the form M = P MP P⊤ , where P ∈ {0, 1}n×p is the indicator matrix of the p groups and MP is an equivalence matrix of size p. [sent-153, score-0.188]
78 Note that the positive constraint Mij = 1 is in fact turned into the equality of columns (and thus rows by symmetry) i and j of M , which is equivalent when M ∈ Ek , but much stronger for M ∈ Ck . [sent-154, score-0.12]
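A small sketch of the chunk construction above (helper and variable names ours): P is the n × p group-indicator matrix, and any equivalence matrix MP on the p groups induces a full equivalence matrix M = P MP P⊤ that automatically satisfies the positive constraints.

```python
import numpy as np

def chunk_indicator(chunks, n):
    """n x p {0,1} matrix P whose j-th column indicates the j-th group
    (chunks from the transitive closure of P+, singletons included)."""
    P = np.zeros((n, len(chunks)))
    for j, chunk in enumerate(chunks):
        P[list(chunk), j] = 1.0
    return P

# Example: {0,1,2} must be together, {3,4} must be together, point 5 is unconstrained.
chunks = [{0, 1, 2}, {3, 4}, {5}]
P = chunk_indicator(chunks, n=6)
labels_P = np.array([0, 0, 1])                   # cluster the 3 groups into 2 clusters
MP = (labels_P[:, None] == labels_P[None, :]).astype(float)
M = P @ MP @ P.T                                 # M = P M_P P^T on the original points
print(M[0, 2], M[0, 4], M[0, 5])                 # 1.0 1.0 0.0
```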
79 Positive constraints can be similarly included into K-means, to form a reduced weighted K-means problem, which is simpler than other approaches to dealing with positive constraints [17]. [sent-156, score-0.348]
80 In Figure 2, we compare constrained K-means and the DIFFRAC framework under the same setting as in Figure 1, with different numbers of randomly selected positive constraints. [sent-157, score-0.098]
81 Negative constraints After the chunks corresponding to positive constraints have been collapsed to one point, we extend the set of negative constraints to those collapsed points (if the constraints were originally consistent, the negative constraints can be uniquely extended). [sent-158, score-1.013]
82 5 Simulations In this section, we apply the DIFFRAC framework to various clustering problems and situations. [sent-163, score-0.41]
83 1 Clustering classification datasets We looked at the Isolet dataset (26 classes, 5,200 data points) from the UCI repository and the MNIST dataset of handwritten digits (10 classes, 5,000 data points). [sent-171, score-0.176]
84 For each of those datasets, we compare the performances of K-means, RCA [18] and DIFFRAC, for linear and Gaussian kernels (referred to as “rbf”), for a fixed value of the regularization parameter, with different levels of supervision. [sent-172, score-0.118]
85 Table 1: Comparison of K-means, RCA and linear DIFFRAC, using the clustering metric defined in Section 5 (averaged over 10 replications), for linear and “rbf” kernels and various levels of supervision. [sent-237, score-0.529]
86 Note that all algorithms work on the same data representation (linear or kernelized) and that differences are due to the underlying clustering frameworks. [sent-239, score-0.364]
87 A primary goal in semi-supervised learning is to take into account a large number of unlabelled points in order to dramatically reduce the number of labelled points required to achieve a competitive classification accuracy. [sent-248, score-0.436]
88 Hence, our experimental setting consists in observing how fast the classification accuracy collapses as the number of labelled points decreases. [sent-249, score-0.218]
89 The fewer labelled points a method needs to achieve decent classification accuracy, the more relevant it is for semi-supervised learning tasks. [sent-250, score-0.218]
90 As shown in Figure 3, our method yields competitive classification accuracy with very few labelled points on the three datasets. [sent-251, score-0.218]
91 Hence, for datasets exhibiting multi-class structure such as Text, DIFFRAC is better able to utilize unlabelled points, since it is based on a multi-class clustering algorithm rather than on algorithms based on binary SVMs, where multi-class extensions are currently unclear. [sent-254, score-0.553]
92 Thus, our experiments support the view that semi-supervised learning algorithms built on clustering algorithms, augmented with labelled data acting as hints on clusters, are worthy of further investigation and research. [sent-255, score-0.778]
93 6 Conclusion We have presented a discriminative framework for clustering based on the square loss and penalization through spectral functions of equivalence matrices. [sent-256, score-0.84]
94 Moreover, our discriminative framework should allow the use of existing methods for learning the kernel matrix from data [20]. [sent-258, score-0.309]
95 In particular, early experiments on estimating the number of clusters using variation rates of our discriminative costs are very promising. [sent-260, score-0.293]
96 (Figure 3 plot residue: x-axis “Number of labelled training points”; curves for DIFFRAC and LDS.) [sent-276, score-0.3]
97 Figure 3: Semi-supervised classification. [sent-277, score-0.218]
98 Fast SDP relaxations of graph cut clustering, transduction, and other combinatorial problems. [sent-289, score-0.106]
99 Nonlinear component analysis as a kernel eigenvalue problem. [sent-354, score-0.205]
100 Semidefinite clustering for image segmentation with a-priori knowledge. [sent-400, score-0.364]
wordName wordTfidf (topN-words)
[('iffrac', 0.446), ('clustering', 0.364), ('ek', 0.203), ('diffrac', 0.191), ('constraints', 0.148), ('discriminative', 0.147), ('clusters', 0.146), ('labelled', 0.146), ('equivalence', 0.127), ('lds', 0.127), ('convex', 0.123), ('eigenvalue', 0.119), ('cost', 0.118), ('rn', 0.108), ('spectral', 0.105), ('yy', 0.102), ('minm', 0.096), ('rca', 0.096), ('partitions', 0.084), ('replications', 0.083), ('matrices', 0.076), ('eigenvalues', 0.074), ('points', 0.072), ('card', 0.071), ('chunks', 0.071), ('kernels', 0.071), ('datasets', 0.07), ('constraint', 0.068), ('diag', 0.068), ('cluster', 0.065), ('optimization', 0.064), ('ssl', 0.064), ('trbm', 0.064), ('trm', 0.064), ('xw', 0.064), ('rounding', 0.064), ('pointwise', 0.064), ('indicator', 0.064), ('singular', 0.062), ('relaxation', 0.062), ('subgradient', 0.061), ('matrix', 0.061), ('relaxing', 0.059), ('decompositions', 0.059), ('outer', 0.058), ('kernel', 0.055), ('tra', 0.055), ('flexible', 0.055), ('ck', 0.055), ('tr', 0.055), ('augmented', 0.055), ('dimensions', 0.054), ('classi', 0.053), ('semide', 0.053), ('combinatorial', 0.053), ('relaxations', 0.053), ('positive', 0.052), ('rue', 0.051), ('centering', 0.051), ('normalizes', 0.051), ('simulations', 0.051), ('square', 0.051), ('bach', 0.05), ('unlabelled', 0.047), ('linear', 0.047), ('framework', 0.046), ('closed', 0.045), ('minimum', 0.045), ('partition', 0.044), ('referred', 0.044), ('icml', 0.043), ('usual', 0.043), ('bci', 0.042), ('initialization', 0.042), ('relaxed', 0.041), ('mp', 0.041), ('mij', 0.041), ('eigenvectors', 0.04), ('complexity', 0.039), ('collapsed', 0.039), ('bi', 0.038), ('transductive', 0.038), ('regularized', 0.038), ('looked', 0.036), ('cj', 0.036), ('sdp', 0.035), ('bk', 0.035), ('attained', 0.035), ('paris', 0.035), ('namely', 0.035), ('project', 0.034), ('constrain', 0.034), ('france', 0.034), ('apparently', 0.034), ('gram', 0.034), ('built', 0.034), ('investigation', 0.033), ('pairs', 0.033), ('indexed', 0.032), ('rbf', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
Author: Francis R. Bach, Zaïd Harchaoui
Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.
2 0.20472735 80 nips-2007-Ensemble Clustering using Semidefinite Programming
Author: Vikas Singh, Lopamudra Mukherjee, Jiming Peng, Jinhui Xu
Abstract: We consider the ensemble clustering problem where the task is to ‘aggregate’ multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than voting strategy based methods proposed in literature. Using this generalization, we first derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniques which can subsequently be relaxed to a polynomial time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. We discuss evaluations on clustering and image segmentation databases. 1
3 0.20215751 61 nips-2007-Convex Clustering with Exemplar-Based Models
Author: Danial Lashkari, Polina Golland
Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1
4 0.16301252 46 nips-2007-Cluster Stability for Finite Samples
Author: Ohad Shamir, Naftali Tishby
Abstract: Over the past few years, the notion of stability in data clustering has received growing attention as a cluster validation criterion in a sample-based framework. However, recent work has shown that as the sample size increases, any clustering model will usually become asymptotically stable. This led to the conclusion that stability is lacking as a theoretical and practical tool. The discrepancy between this conclusion and the success of stability in practice has remained an open question, which we attempt to address. Our theoretical approach is that stability, as used by cluster validation algorithms, is similar in certain respects to measures of generalization in a model-selection framework. In such cases, the model chosen governs the convergence rate of generalization bounds. By arguing that these rates are more important than the sample size, we are led to the prediction that stability-based cluster validation algorithms should not degrade with increasing sample size, despite the asymptotic universal stability. This prediction is substantiated by a theoretical analysis as well as some empirical results. We conclude that stability remains a meaningful cluster validation criterion over finite samples. 1
5 0.14179489 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
Author: Ronny Luss, Alexandre D'aspremont
Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.
6 0.13521762 63 nips-2007-Convex Relaxations of Latent Variable Training
7 0.13305224 58 nips-2007-Consistent Minimization of Clustering Objective Functions
8 0.11426925 70 nips-2007-Discriminative K-means for Clustering
9 0.11402169 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
10 0.10784312 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
11 0.10377467 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
12 0.09992066 112 nips-2007-Learning Monotonic Transformations for Classification
13 0.091777377 99 nips-2007-Hierarchical Penalization
14 0.088705115 84 nips-2007-Expectation Maximization and Posterior Constraints
15 0.083518356 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
16 0.082619213 109 nips-2007-Kernels on Attributed Pointsets with Applications
17 0.082322568 134 nips-2007-Multi-Task Learning via Conic Programming
18 0.080350004 62 nips-2007-Convex Learning with Invariances
19 0.080259062 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
20 0.078107648 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
topicId topicWeight
[(0, -0.258), (1, 0.044), (2, -0.155), (3, 0.1), (4, -0.016), (5, -0.104), (6, 0.056), (7, 0.1), (8, -0.181), (9, 0.121), (10, 0.051), (11, -0.048), (12, 0.126), (13, -0.214), (14, -0.156), (15, 0.212), (16, -0.026), (17, -0.109), (18, 0.044), (19, 0.08), (20, -0.136), (21, 0.012), (22, -0.1), (23, 0.073), (24, 0.009), (25, 0.05), (26, 0.027), (27, -0.017), (28, 0.05), (29, 0.016), (30, 0.067), (31, -0.032), (32, 0.014), (33, 0.081), (34, 0.026), (35, -0.015), (36, 0.115), (37, 0.093), (38, -0.009), (39, -0.033), (40, 0.032), (41, -0.032), (42, -0.106), (43, 0.001), (44, 0.064), (45, -0.077), (46, -0.027), (47, 0.097), (48, -0.017), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.96822584 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
Author: Francis R. Bach, Zaïd Harchaoui
Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.
2 0.81985724 80 nips-2007-Ensemble Clustering using Semidefinite Programming
Author: Vikas Singh, Lopamudra Mukherjee, Jiming Peng, Jinhui Xu
Abstract: We consider the ensemble clustering problem where the task is to ‘aggregate’ multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than voting strategy based methods proposed in literature. Using this generalization, we first derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniques which can subsequently be relaxed to a polynomial time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. We discuss evaluations on clustering and image segmentation databases. 1
3 0.79974145 70 nips-2007-Discriminative K-means for Clustering
Author: Jieping Ye, Zheng Zhao, Mingrui Wu
Abstract: We present a theoretical study on the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection and clustering in this framework is not well understood, due to the iterative nature of the algorithm. We show in this paper that this iterative subspace selection and clustering is equivalent to kernel K-means with a specific kernel Gram matrix. This provides significant and new insights into the nature of this subspace selection procedure. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter estimation procedure. We also present the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation. The connection between DisKmeans and several other clustering algorithms is also analyzed. The presented theories and algorithms are evaluated through experiments on a collection of benchmark data sets. 1
4 0.70419723 61 nips-2007-Convex Clustering with Exemplar-Based Models
Author: Danial Lashkari, Polina Golland
Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1
5 0.62344503 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
Author: Ronny Luss, Alexandre D'aspremont
Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.
6 0.58862293 46 nips-2007-Cluster Stability for Finite Samples
7 0.56411362 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
8 0.55595434 63 nips-2007-Convex Relaxations of Latent Variable Training
9 0.54734772 112 nips-2007-Learning Monotonic Transformations for Classification
10 0.54002851 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
11 0.53443182 58 nips-2007-Consistent Minimization of Clustering Objective Functions
12 0.53084785 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
13 0.45007569 24 nips-2007-An Analysis of Inference with the Universum
14 0.44897112 49 nips-2007-Colored Maximum Variance Unfolding
15 0.42264736 109 nips-2007-Kernels on Attributed Pointsets with Applications
16 0.41965964 99 nips-2007-Hierarchical Penalization
17 0.41408199 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
18 0.37581372 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
19 0.37514451 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
20 0.36799723 84 nips-2007-Expectation Maximization and Posterior Constraints
topicId topicWeight
[(5, 0.051), (13, 0.044), (16, 0.042), (19, 0.02), (21, 0.047), (31, 0.034), (34, 0.016), (35, 0.057), (47, 0.116), (49, 0.052), (59, 0.169), (83, 0.177), (85, 0.036), (87, 0.024), (90, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.86361057 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
Author: Francis R. Bach, Zaïd Harchaoui
Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.
2 0.7962656 115 nips-2007-Learning the 2-D Topology of Images
Author: Nicolas L. Roux, Yoshua Bengio, Pascal Lamblin, Marc Joliveau, Balázs Kégl
Abstract: We study the following question: is the two-dimensional structure of images a very strong prior or is it something that can be learned with a few examples of natural images? If someone gave us a learning task involving images for which the two-dimensional topology of pixels was not known, could we discover it automatically and exploit it? For example suppose that the pixels had been permuted in a fixed but unknown way, could we recover the relative two-dimensional location of pixels on images? The surprising result presented here is that not only the answer is yes, but that about as few as a thousand images are enough to approximately recover the relative locations of about a thousand pixels. This is achieved using a manifold learning algorithm applied to pixels associated with a measure of distributional similarity between pixel intensities. We compare different topologyextraction approaches and show how having the two-dimensional topology can be exploited.
3 0.79501939 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
4 0.78846478 134 nips-2007-Multi-Task Learning via Conic Programming
Author: Tsuyoshi Kato, Hisashi Kashima, Masashi Sugiyama, Kiyoshi Asai
Abstract: When we have several related tasks, solving them simultaneously is shown to be more effective than solving them individually. This approach is called multi-task learning (MTL) and has been studied extensively. Existing approaches to MTL often treat all the tasks as uniformly related to each other and the relatedness of the tasks is controlled globally. For this reason, the existing methods can lead to undesired solutions when some tasks are not highly related to each other, and some pairs of related tasks can have significantly different solutions. In this paper, we propose a novel MTL algorithm that can overcome these problems. Our method makes use of a task network, which describes the relation structure among tasks. This allows us to deal with intricate relation structures in a systematic way. Furthermore, we control the relatedness of the tasks locally, so all pairs of related tasks are guaranteed to have similar solutions. We apply the above idea to support vector machines (SVMs) and show that the optimization problem can be cast as a second order cone program, which is convex and can be solved efficiently. The usefulness of our approach is demonstrated through simulations with protein super-family classification and ordinal regression problems.
Author: Lars Buesing, Wolfgang Maass
Abstract: We show that under suitable assumptions (primarily linearization) a simple and perspicuous online learning rule for Information Bottleneck optimization with spiking neurons can be derived. This rule performs on common benchmark tasks as well as a rather complex rule that has previously been proposed [1]. Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis (PCA) with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals X that are related or are not related to some additional target signal YT . In a biological interpretation, this target signal YT (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals. 1
6 0.78658324 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
7 0.78571588 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
8 0.78302598 24 nips-2007-An Analysis of Inference with the Universum
9 0.78293836 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
10 0.78198934 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
11 0.78072828 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
12 0.77844828 43 nips-2007-Catching Change-points with Lasso
13 0.77720439 185 nips-2007-Stable Dual Dynamic Programming
14 0.77708179 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
15 0.77677262 16 nips-2007-A learning framework for nearest neighbor search
16 0.77649021 116 nips-2007-Learning the structure of manifolds using random projections
17 0.7761693 102 nips-2007-Incremental Natural Actor-Critic Algorithms
18 0.77613646 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
19 0.77579618 7 nips-2007-A Kernel Statistical Test of Independence
20 0.77490306 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations