nips nips2007 nips2007-70 knowledge-graph by maker-knowledge-mining

70 nips-2007-Discriminative K-means for Clustering


Source: pdf

Author: Jieping Ye, Zheng Zhao, Mingrui Wu

Abstract: We present a theoretical study on the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection and clustering in this framework is not well understood, due to the iterative nature of the algorithm. We show in this paper that this iterative subspace selection and clustering is equivalent to kernel K-means with a specific kernel Gram matrix. This provides significant and new insights into the nature of this subspace selection procedure. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter estimation procedure. We also present the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation. The connection between DisKmeans and several other clustering algorithms is also analyzed. The presented theories and algorithms are evaluated through experiments on a collection of benchmark data sets. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present a theoretical study on the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. [sent-7, score-0.491]

2 Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. [sent-8, score-0.155]

3 However, the inherent relationship between subspace selection and clustering in this framework is not well understood, due to the iterative nature of the algorithm. [sent-9, score-0.416]

4 We show in this paper that this iterative subspace selection and clustering is equivalent to kernel K-means with a specific kernel Gram matrix. [sent-10, score-0.582]

5 This provides significant and new insights into the nature of this subspace selection procedure. [sent-11, score-0.198]

6 Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter estimation procedure. [sent-12, score-0.236]

7 We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation. [sent-14, score-0.424]

8 The connection between DisKmeans and several other clustering algorithms is also analyzed. [sent-15, score-0.168]

9 A common practice is to project the data onto a low-dimensional subspace through unsupervised dimensionality reduction such as Principal Component Analysis (PCA) [9] and various manifold learning algorithms [1, 13] before the clustering. [sent-19, score-0.178]

10 However, the projection may not necessarily improve the separability of the data for clustering, due to the inherent separation between subspace selection (via dimensionality reduction) and clustering. [sent-20, score-0.222]

11 One natural way to overcome this limitation is to integrate dimensionality reduction and clustering in a joint framework. [sent-21, score-0.201]

12 Several recent works [5, 10, 16] incorporate supervised dimensionality reduction such as Linear Discriminant Analysis (LDA) [7] into the clustering framework, performing clustering and LDA dimensionality reduction simultaneously. [sent-22, score-0.402]

13 The algorithm, called Discriminative Clustering (DisCluster) in the following discussion, works in an iterative fashion, alternating between LDA subspace selection and clustering. [sent-23, score-0.261]

14 In this framework, clustering generates the class labels for LDA, while LDA provides the subspace for clustering. [sent-24, score-0.287]

15 Empirical results have shown the benefits of clustering in a low dimensional discriminative space rather than in the principal component space (generative). [sent-25, score-0.249]

16 However, the integration between subspace selection and clustering in DisCluster is not well understood, due to the intertwined and iterative nature of the algorithm. [sent-26, score-0.424]

17 In this paper, we analyze this discriminative clustering framework by studying several fundamental and important issues: (1) What do we really gain by performing clustering in a low dimensional discriminative space? [sent-27, score-0.458]

18 (2) What is the nature of its iterative process alternating between subspace selection and clustering? [sent-28, score-0.274]

19 The main contributions of this paper are summarized as follows: (1) We show that the LDA projection can be factored out from the integrated LDA subspace selection and clustering formulation. [sent-31, score-0.356]

20 DisKmeans is shown to be equivalent to kernel K-means, where discriminative subspace selection essentially constructs a kernel Gram matrix for clustering. [sent-33, score-0.514]

21 This provides new insights into the nature of this subspace selection procedure; (3) The DisKmeans algorithm is dependent on the value of the regularization parameter λ. [sent-34, score-0.225]

22 We propose an automatic parameter tuning process (model selection) for the estimation of λ; (4) We propose the nonlinear extension of DisKmeans using kernels. [sent-35, score-0.091]

23 We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation, resulting in a semidefinite program (SDP) [15]. [sent-36, score-0.424]

24 Denote X = [x1 , · · · , xn ] as the data matrix whose i-th column is given by xi . [sent-40, score-0.075]

25 Let F ∈ R^{n×k} be the cluster indicator matrix defined as follows: F = {f_{i,j}}_{n×k}, where f_{i,j} = 1 iff x_i ∈ C_j. [sent-42, score-0.17]

26 (1) We can define the weighted cluster indicator matrix as follows [4]: L = [L_1, L_2, · · · , L_k] = F (F^T F)^{-1/2}. [sent-43, score-0.185]

27 L_j = (0, . . . , 0, 1, . . . , 1, 0, . . . , 0)^T / √n_j, (3) where n_j is the sample size of the j-th cluster C_j. [sent-53, score-0.074]
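
To make Eqs. (1)-(3) concrete, here is a minimal NumPy sketch (our own illustration; the function name and the 0-based label encoding are not from the paper) that builds the indicator matrix F and the weighted indicator L = F (F^T F)^{-1/2} from a label vector.

```python
import numpy as np

def weighted_indicator(labels, k):
    """Build F and L = F (F^T F)^{-1/2} from 0-based cluster labels (a sketch)."""
    n = len(labels)
    F = np.zeros((n, k))
    F[np.arange(n), labels] = 1.0        # f_ij = 1 iff x_i belongs to cluster C_j
    sizes = F.sum(axis=0)                # n_j; assumes every cluster is non-empty
    L = F / np.sqrt(sizes)               # since F^T F = diag(n_1, ..., n_k)
    return F, L

# The columns of L are orthonormal: L^T L = I_k.
F, L = weighted_indicator(np.array([0, 0, 1, 2, 1]), k=3)
print(np.allclose(L.T @ L, np.eye(3)))   # True
```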

28 The within-cluster scatter, between-cluster scatter, and total scatter matrices are defined as follows [7]: S_w = Σ_{j=1}^k Σ_{x_i ∈ C_j} (x_i − µ_j)(x_i − µ_j)^T, S_b = Σ_{j=1}^k n_j µ_j µ_j^T = X L L^T X^T, S_t = X X^T. [sent-55, score-0.114]
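
As a quick illustration of these definitions, the sketch below computes S_w, S_b, and S_t with NumPy. It assumes, as the formulas S_t = X X^T and S_b = X L L^T X^T implicitly do, that X has been centered so that the global mean is zero; the function name is ours.

```python
import numpy as np

def scatter_matrices(X, labels, k):
    """Within-, between-, and total-scatter matrices for centered X (m x n, columns = samples)."""
    m, _ = X.shape
    St = X @ X.T                                   # total scatter
    Sw = np.zeros((m, m))
    Sb = np.zeros((m, m))
    for j in range(k):
        Xj = X[:, labels == j]
        mu_j = Xj.mean(axis=1, keepdims=True)
        Sw += (Xj - mu_j) @ (Xj - mu_j).T          # within-cluster scatter
        Sb += Xj.shape[1] * (mu_j @ mu_j.T)        # n_j * mu_j mu_j^T
    return Sw, Sb, St                              # S_t = S_w + S_b holds for centered data
```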

29 Since S_t = S_w + S_b, the optimal transformation matrix P is also given by maximizing the following objective function: trace((P^T S_t P)^{-1} P^T S_b P). [sent-59, score-0.274]

30 (5) For high-dimensional data, the estimation of the total scatter (covariance) matrix is often not reliable. [sent-60, score-0.079]

31 The regularization technique [6] is commonly applied to improve the estimation as follows: S̃_t = S_t + λI_m = X X^T + λI_m, (6) where I_m is the identity matrix of size m and λ > 0 is a regularization parameter. [sent-61, score-0.102]

32 In Discriminant Clustering (DisCluster) [5, 10, 16], the transformation matrix P and the weighted cluster indicator matrix L are computed by maximizing the following objective function: f(L, P) ≡ trace((P^T S̃_t P)^{-1} P^T S_b P) = trace((P^T (X X^T + λI_m) P)^{-1} P^T X L L^T X^T P). [sent-62, score-0.61]
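
A direct way to evaluate the objective in Eq. (7) for given X, L, P, and λ is sketched below. This is only an assumed helper (the paper gives no code); it is handy, for example, for checking that f(L, P) increases monotonically over DisCluster iterations.

```python
import numpy as np

def discluster_objective(X, L, P, lam):
    """f(L, P) = trace((P^T (X X^T + lam*I_m) P)^{-1} P^T X L L^T X^T P), Eq. (7)."""
    m = X.shape[0]                                  # X is m x n, columns are samples
    St_reg = X @ X.T + lam * np.eye(m)              # regularized total scatter, Eq. (6)
    A = P.T @ St_reg @ P
    B = P.T @ X @ L @ (L.T @ X.T @ P)
    return np.trace(np.linalg.solve(A, B))          # trace(A^{-1} B) without an explicit inverse
```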

33 (7) The algorithm works in an intertwined and iterative fashion, alternating between the computation of L for a given P and the computation of P for a given L. [sent-63, score-0.1]

34 Since trace(AB) = trace(BA) for any two matrices [8], for a given P, the objective function f(L, P) can be expressed as: f(L, P) = trace(L^T X^T P (P^T (X X^T + λI_m) P)^{-1} P^T X L). [sent-65, score-0.221]

35 (8) Note that L is not an arbitrary matrix, but a weighted cluster indicator matrix, as defined in Eq. [sent-66, score-0.116]

36 The optimal L can be computed by applying the gradient descent strategy [10] or by solving a kernel K-means problem [5, 16] with X^T P (P^T (X X^T + λI_m) P)^{-1} P^T X as the kernel Gram matrix [4]. [sent-68, score-0.242]
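
One alternation step can be sketched as follows: for a fixed L, the LDA step is a generalized eigenvalue problem, and the Gram matrix X^T P (P^T (X X^T + λI_m) P)^{-1} P^T X for the subsequent kernel K-means update of L can then be formed directly. The code is an assumed implementation, not the authors' (d denotes the target subspace dimension).

```python
import numpy as np
from scipy.linalg import eigh

def discluster_lda_step(X, L, lam, d):
    """LDA subspace-selection step of DisCluster for a fixed L (a sketch)."""
    m = X.shape[0]                                  # X is m x n, columns are samples
    St_reg = X @ X.T + lam * np.eye(m)              # regularized total scatter
    Sb = X @ L @ L.T @ X.T                          # between-cluster scatter
    mu, V = eigh(Sb, St_reg)                        # solves Sb p = mu * St_reg p
    P = V[:, np.argsort(mu)[::-1][:d]]              # top-d generalized eigenvectors
    K = X.T @ P @ np.linalg.inv(P.T @ St_reg @ P) @ P.T @ X   # Gram matrix for the L-update
    return P, K
```

The L-update is then a kernel K-means run on K; a self-contained kernel K-means helper is sketched after Eq. (15) below.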

37 Experiments [5, 10, 16] have shown the effectiveness of DisCluster in comparison with several other popular clustering algorithms. [sent-70, score-0.168]

38 However, the inherent relationship between subspace selection via LDA and clustering is not well understood, and there is need for further investigation. [sent-71, score-0.356]

39 We show in the next section that the iterative subspace selection and clustering in DisCluster is equivalent to kernel K-means with a specific kernel Gram matrix. [sent-72, score-0.582]

40 Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering. [sent-73, score-0.216]

41 (7): f(L, P) = trace((P^T (X X^T + λI_m) P)^{-1} P^T X L L^T X^T P). [sent-76, score-0.172]

42 (9) Here, P is a transformation matrix and L is a weighted cluster indicator matrix as in Eq. [sent-77, score-0.236]

43 It follows from the Representer Theorem [14] that the optimal transformation matrix P ∈ R^{m×d} can be expressed as P = XH, for some matrix H ∈ R^{n×d}. [sent-79, score-0.141]

44 It follows that f(L, P) = trace((H^T (GG + λG) H)^{-1} H^T G L L^T G H). [sent-81, score-0.193]

45 (10) We show that the matrix H can be factored out from the objective function in Eq. [sent-82, score-0.094]

46 Let G be the Gram matrix defined as above and λ > 0 be the regularization parameter. [sent-86, score-0.075]

47 Let L∗ and P ∗ be the optimal solution to the maximization of the objective function f (L, P ) in Eq. [sent-87, score-0.067]

48 Then L∗ solves the following maximization problem: L∗ = arg max_L trace(L^T (I_n − (I_n + (1/λ) G)^{-1}) L). (11) [sent-89, score-0.209]

49 Define the matrix Z as Z = U diag((Σ_t^2 + λΣ_t)^{-1/2} M, I_{n−t}), where diag(A, B) is a block diagonal matrix. [sent-95, score-0.096]

50 It follows that Z^T G L L^T G Z = diag(Σ̃, 0), Z^T (GG + λG) Z = diag(I_t, 0), (13) where Σ̃ = (Σ_R)^2 is diagonal with non-increasing diagonal entries. [sent-96, score-0.065]

51 It can be verified that f(L, P) ≤ trace(Σ̃) = trace((GG + λG)^+ G L L^T G) = trace(L^T G (GG + λG)^+ G L) = trace(L^T (I_n − (I_n + (1/λ) G)^{-1}) L), where the equality holds when P = XH and H consists of the first q columns of Z. [sent-97, score-0.516]

52 Computing the Weighted Cluster Matrix L: The weighted cluster indicator matrix L solving the maximization problem in Eq. [sent-99, score-0.215]

53 (11) can be computed by solving a kernel K-means problem [5] with the kernel Gram matrix given by G̃ = I_n − (I_n + (1/λ) G)^{-1}. [sent-100, score-0.242]

54 (15) Thus, DisCluster is equivalent to a kernel K-means problem. [sent-101, score-0.105]
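
This equivalence suggests a very short implementation of DisKmeans: form G̃ from Eq. (15) and run any kernel K-means routine on it. The sketch below uses the linear Gram matrix G = X^T X and a plain kernel K-means with random initial assignments; the function names and the initialization scheme are our choices, not the paper's.

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=100, seed=0):
    """Kernel K-means on an n x n Gram matrix K (minimal sketch)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)             # random initial assignment
    dK = np.diag(K)
    for _ in range(n_iter):
        dist = np.empty((n, k))
        for j in range(k):
            idx = np.flatnonzero(labels == j)
            if idx.size == 0:                       # re-seed an empty cluster
                idx = rng.integers(0, n, size=1)
            # squared feature-space distance of every point to the j-th cluster mean
            dist[:, j] = dK - 2 * K[:, idx].mean(axis=1) + K[np.ix_(idx, idx)].mean()
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def diskmeans(X, k, lam=1.0):
    """DisKmeans: kernel K-means with G_tilde = I_n - (I_n + G/lam)^{-1}, G = X^T X (Eq. (15))."""
    n = X.shape[1]                                  # X is m x n, columns are samples
    G = X.T @ X
    G_tilde = np.eye(n) - np.linalg.inv(np.eye(n) + G / lam)
    return kernel_kmeans(G_tilde, k)
```

In practice the initial assignment would be taken from a K-means run, as in the experiments below, rather than drawn at random.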

55 Constructing the Kernel Gram Matrix via Subspace Selection: The kernel Gram matrix in Eq. [sent-104, score-0.138]

56 (16) Recall that the original DisCluster algorithm involves alternating LDA subspace selection and clustering. [sent-106, score-0.214]

57 The analysis above shows that the LDA subspace selection in DisCluster essentially constructs a kernel Gram matrix for clustering. [sent-107, score-0.335]

58 This elucidates the nature of the subspace selection procedure in DisCluster. [sent-109, score-0.198]

59 The clustering algorithm is dramatically simplified by removing the iterative subspace selection. [sent-110, score-0.346]

60 The optimal L is thus given by solving the following maximization problem: arg max_L trace(L^T G L). [sent-117, score-0.223]

61 The solution is given by the standard K-means clustering [4, 5]. [sent-118, score-0.155]

62 Thus, the algorithm is equivalent to clustering in the (full) principal component space. [sent-123, score-0.19]
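
A quick numerical check of this limiting behavior (our own illustration): λ G̃ = λ (I_n − (I_n + G/λ)^{-1}) converges to G as λ → ∞, and scaling a Gram matrix by a positive constant does not change the kernel K-means assignment, so DisKmeans approaches the standard K-means objective arg max_L trace(L^T G L).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 50))                   # toy data: 10 features, 50 samples
G = X.T @ X
for lam in (1e2, 1e4, 1e6):
    G_tilde = np.eye(50) - np.linalg.inv(np.eye(50) + G / lam)
    print(lam, np.abs(lam * G_tilde - G).max())     # gap shrinks toward 0 as lam grows
```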

63 In this section, we show how to incorporate the automatic tuning of λ into the optimization framework, thus addressing issue (4) in Section 1. [sent-125, score-0.079]

64 (11) is equivalent to the minimization of the following function: trace(L^T (I_n + (1/λ) G)^{-1} L). [sent-127, score-0.187]

65 This leads to the following optimization problem: min_{L,λ} g(L, λ) ≡ trace(L^T (I_n + (1/λ) G)^{-1} L) + log det(I_n + (1/λ) G). [sent-131, score-0.21]

66 Let G be the Gram matrix defined above and let L be a given weighted cluster indicator matrix. [sent-136, score-0.164]

67 (12), and a_i be the i-th diagonal entry of the matrix U_1^T L L^T U_1. [sent-138, score-0.082]

68 It follows that log det(I_n + (1/λ) G) = log det(I_t + (1/λ) Σ_t) = Σ_{i=1}^t log(1 + σ_i/λ), (20) and trace(L^T (I_n + (1/λ) G)^{-1} L) = trace(L^T U_1 (I_t + (1/λ) Σ_t)^{-1} U_1^T L) + trace(L^T U_2 U_2^T L) = Σ_{i=1}^t (1 + σ_i/λ)^{-1} a_i + trace(L^T U_2 U_2^T L). (21) [sent-143, score-0.417]

69 The result follows as the second term in Eq. [sent-144, score-0.377]
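
The closed forms in Eqs. (20)-(21) make the λ-update cheap once G has been eigendecomposed: for a fixed L, the objective in Eq. (18) reduces to Σ_i (1 + σ_i/λ)^{-1} a_i + Σ_i log(1 + σ_i/λ) plus a λ-independent term. The sketch below selects λ from a user-supplied candidate grid; the grid search is our simplification of the paper's tuning procedure, which alternates such a λ-update with re-clustering.

```python
import numpy as np

def tune_lambda(G, L, lambdas):
    """Pick lambda minimizing Eq. (18) for a fixed L, via the eigendecomposition of G (a sketch)."""
    sig, U = np.linalg.eigh(G)                      # G = U diag(sig) U^T
    pos = sig > 1e-10
    U1, s = U[:, pos], sig[pos]                     # range-space part (sigma_i > 0)
    U2 = U[:, ~pos]                                 # null-space part
    a = np.sum((U1.T @ L) ** 2, axis=1)             # a_i = i-th diagonal of U1^T L L^T U1
    const = np.sum((U2.T @ L) ** 2)                 # trace(L^T U2 U2^T L), independent of lambda
    return min(lambdas,
               key=lambda lam: np.sum(a / (1.0 + s / lam)) + const
                               + np.sum(np.log1p(s / lam)))

# Hypothetical usage: lam = tune_lambda(X.T @ X, L, lambdas=10.0 ** np.arange(-3, 4))
```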

70 5 Kernel DisKmeans: Nonlinear Discriminative K-means using kernels. The DisKmeans algorithm can be easily extended to deal with nonlinear data using the kernel trick. [sent-151, score-0.114]

71 The nonlinear mapping can be implicitly specified by a symmetric kernel function K, which computes the inner product of the images of each data pair in the feature space. [sent-153, score-0.114]

72 For a given training data set {x_i}_{i=1}^n, the kernel Gram matrix G_K is defined as follows: G_K(i, j) = (φ(x_i), φ(x_j)). [sent-154, score-0.138]

73 For a given G_K, the weighted cluster matrix L = [L_1, · · · , L_k] in kernel DisKmeans is given by minimizing the following objective function: trace(L^T (I_n + (1/λ) G_K)^{-1} L) = Σ_{j=1}^k L_j^T (I_n + (1/λ) G_K)^{-1} L_j. [sent-155, score-0.247]

74 (22) The performance of kernel DisKmeans is dependent on the choice of the kernel Gram matrix. [sent-156, score-0.352]

75 Following [11], we assume that G_K is restricted to be a convex combination of a given set of kernel Gram matrices {G_i} as G_K = Σ_i θ_i G_i, where the coefficients {θ_i} satisfy Σ_i θ_i trace(G_i) = 1 and θ_i ≥ 0 ∀i. [sent-157, score-0.109]

76 Let G_K be constrained to be a convex combination of a given set of kernel matrices {G_i} as G_K = Σ_i θ_i G_i, satisfying the constraints defined above. [sent-160, score-0.109]

77 This leads to an iterative algorithm alternating between the computation of the kernel Gram matrix GK and the computation of the cluster indicator matrix L. [sent-167, score-0.357]

78 The parameter λ can also be incorporated into the SDP formulation by treating the identity matrix I_n as one of the kernel Gram matrices, as in [11]. [sent-168, score-0.208]

79 Note that unlike the kernel learning in [11], the class label information is not available in our formulation. [sent-170, score-0.09]
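
To make the constraint set concrete, the sketch below performs the kernel-weight update for a fixed L and two candidate kernels by directly evaluating the objective in Eq. (22) on a grid over the constrained simplex. The paper solves this step as an SDP; the grid search here is only a rough, assumed stand-in for that component, and all names are ours.

```python
import numpy as np

def update_kernel_weights(kernels, L, lam, grid=11):
    """Choose theta for G_K = theta_1*G_1 + theta_2*G_2 with fixed L (illustrative two-kernel case)."""
    G1, G2 = kernels
    n = G1.shape[0]

    def objective(GK):                              # Eq. (22), to be minimized over theta
        return np.trace(L.T @ np.linalg.inv(np.eye(n) + GK / lam) @ L)

    best_val, best_theta = np.inf, None
    for t in np.linspace(0.0, 1.0, grid):
        theta = np.array([t / np.trace(G1), (1 - t) / np.trace(G2)])   # sum_i theta_i*trace(G_i) = 1
        val = objective(theta[0] * G1 + theta[1] * G2)
        if val < best_val:
            best_val, best_theta = val, theta
    return best_theta
```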

80 We test these algorithms on eight benchmark data sets. [sent-175, score-0.096]

81 They are five UCI data sets [2]: banding, soybean, segment, satimage, pendigits; one biological data set: leukemia (http://www. [sent-176, score-0.183]

[Table residue with per-data-set statistics: banding 29 238 2, soybean 35 562 15, segment 19 2309 7, pendigits 16 10992 10.]

82 To make the results of different algorithms comparable, we first run K-means, and its clustering result is used to construct the set of k initial centroids for all experiments. [sent-190, score-0.155]

83 [Figure residue: clustering accuracy (ACC) on satimage as a function of λ for K-means, DisCluster, and DisKmeans.] [sent-251, score-0.086]

84 This justifies the development of an automatic parameter tuning process in Section 4. [sent-272, score-0.067]

85 We can also observe from the figure that when λ → ∞, the performance of DisKmeans approaches that of K-means on all eight benchmark data sets. [sent-273, score-0.065]

86 Figure 2 also shows that the tuning process is dependent on the initial value of λ due to its nonconvex optimization, and when λ → ∞, the effect of the tuning process becomes less pronounced. [sent-323, score-0.094]

87 Figure 3: Comparison of the trace value achieved by DisKmean and DisCluster. [sent-359, score-0.172]

88 The trace value of DisCluster is bounded from above by that of DisKmean. [sent-361, score-0.172]

89 DisKmean versus DisCluster: Figure 3 compares the trace value achieved by DisKmean and the trace value achieved in each iteration of DisCluster on 4 data sets for a fixed λ. [sent-362, score-0.344]

90 It is clear that the trace value of DisCluster increases in each iteration but is bounded from above by that of DisKmean. [sent-363, score-0.172]

91 This is consistent with our analysis in Section 3 that both algorithms optimize the same objective function, and DisKmean is a direct approach for the trace maximization without the iterative process. [sent-365, score-0.286]

92 7 Conclusion In this paper, we analyze the discriminative clustering (DisCluster) framework, which integrates subspace selection and clustering. [sent-371, score-0.414]

93 We show that the iterative subspace selection and clustering in DisCluster is equivalent to kernel K-means with a specific kernel Gram matrix. [sent-372, score-0.582]

94 We then propose the DisKmeans algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter tuning procedure. [sent-373, score-0.283]

95 The connection between DisKmeans and several other clustering algorithms is also studied. [sent-374, score-0.168]

96 Our preliminary studies have shown the effectiveness of Kernel DisKmeansλ in learning the kernel Gram matrix. [sent-377, score-0.103]

97 We plan to examine various semi-supervised learning techniques within the proposed framework and their effectiveness for clustering from both labeled and unlabeled data. [sent-382, score-0.168]

98 [Table residue: ACC results on the benchmark data sets (starting with banding), comparing DisCluster, DisKmeans with λ ∈ {10^-2, 10^-1, 10^0, 10^1} plus max and ave summaries, DisKmeansλ, LLE, and LEI.] [sent-388, score-0.185]

99 A unified view of kernel k-means, spectral clustering and graph partitioning. [sent-591, score-0.245]

100 Adaptive dimension reduction using discriminant analysis and k-means clustering. [sent-596, score-0.071]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('diskmeans', 0.734), ('discluster', 0.416), ('acc', 0.181), ('trace', 0.172), ('clustering', 0.155), ('subspace', 0.132), ('gram', 0.125), ('lt', 0.123), ('lda', 0.106), ('kernel', 0.09), ('pendigits', 0.086), ('satimage', 0.086), ('gk', 0.077), ('discriminative', 0.074), ('ave', 0.068), ('diskmean', 0.061), ('leukemia', 0.061), ('cluster', 0.058), ('sb', 0.057), ('gi', 0.056), ('orl', 0.054), ('nmi', 0.053), ('selection', 0.053), ('usps', 0.052), ('banding', 0.049), ('soybean', 0.049), ('sw', 0.049), ('matrix', 0.048), ('iterative', 0.047), ('tuning', 0.047), ('discriminant', 0.046), ('im', 0.045), ('gg', 0.043), ('irn', 0.043), ('cj', 0.042), ('lj', 0.041), ('xx', 0.038), ('maximization', 0.037), ('gllt', 0.037), ('xllt', 0.037), ('indicator', 0.037), ('segment', 0.036), ('sdp', 0.034), ('st', 0.033), ('irm', 0.032), ('scatter', 0.031), ('simultaneous', 0.031), ('objective', 0.03), ('alternating', 0.029), ('benchmark', 0.027), ('lle', 0.027), ('regularization', 0.027), ('xi', 0.027), ('diag', 0.026), ('det', 0.026), ('semide', 0.025), ('reduction', 0.025), ('arizona', 0.024), ('intertwined', 0.024), ('lei', 0.024), ('tempe', 0.024), ('xh', 0.024), ('ye', 0.024), ('svd', 0.024), ('nonlinear', 0.024), ('transformation', 0.024), ('incorporated', 0.022), ('diagonal', 0.022), ('dimensionality', 0.021), ('weighted', 0.021), ('follows', 0.021), ('principal', 0.02), ('eight', 0.02), ('automatic', 0.02), ('matrices', 0.019), ('tj', 0.019), ('stands', 0.019), ('observe', 0.018), ('understood', 0.018), ('laplacian', 0.017), ('orthogonal', 0.017), ('az', 0.016), ('factored', 0.016), ('inherent', 0.016), ('mutual', 0.016), ('nj', 0.016), ('equivalent', 0.015), ('lk', 0.015), ('gl', 0.015), ('zhao', 0.015), ('theories', 0.014), ('solving', 0.014), ('nature', 0.013), ('connection', 0.013), ('effectiveness', 0.013), ('embedding', 0.013), ('dramatically', 0.012), ('constructs', 0.012), ('ai', 0.012), ('optimization', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 70 nips-2007-Discriminative K-means for Clustering

Author: Jieping Ye, Zheng Zhao, Mingrui Wu

Abstract: We present a theoretical study on the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection and clustering in this framework is not well understood, due to the iterative nature of the algorithm. We show in this paper that this iterative subspace selection and clustering is equivalent to kernel K-means with a specific kernel Gram matrix. This provides significant and new insights into the nature of this subspace selection procedure. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter estimation procedure. We also present the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation. The connection between DisKmeans and several other clustering algorithms is also analyzed. The presented theories and algorithms are evaluated through experiments on a collection of benchmark data sets. 1

2 0.11426925 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

Author: Francis R. Bach, Zaïd Harchaoui

Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.

3 0.089678876 80 nips-2007-Ensemble Clustering using Semidefinite Programming

Author: Vikas Singh, Lopamudra Mukherjee, Jiming Peng, Jinhui Xu

Abstract: We consider the ensemble clustering problem where the task is to ‘aggregate’ multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than voting strategy based methods proposed in literature. Using this generalization, we first derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniques which can subsequently be relaxed to a polynomial time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. We discuss evaluations on clustering and image segmentation databases. 1

4 0.07673642 199 nips-2007-The Price of Bandit Information for Online Optimization

Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes

Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ R^n in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1

5 0.069989435 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion

Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of processors only sees of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.

6 0.066524006 46 nips-2007-Cluster Stability for Finite Samples

7 0.061886244 61 nips-2007-Convex Clustering with Exemplar-Based Models

8 0.056733787 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI

9 0.0550391 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

10 0.053247817 69 nips-2007-Discriminative Batch Mode Active Learning

11 0.049510527 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

12 0.04949604 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

13 0.049291942 118 nips-2007-Learning with Transformation Invariant Kernels

14 0.048977982 58 nips-2007-Consistent Minimization of Clustering Objective Functions

15 0.045740105 183 nips-2007-Spatial Latent Dirichlet Allocation

16 0.043536875 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

17 0.041575786 160 nips-2007-Random Features for Large-Scale Kernel Machines

18 0.040969636 109 nips-2007-Kernels on Attributed Pointsets with Applications

19 0.039064273 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

20 0.037540447 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.118), (1, 0.017), (2, -0.059), (3, 0.007), (4, 0.026), (5, -0.046), (6, 0.031), (7, 0.026), (8, -0.098), (9, 0.086), (10, -0.001), (11, 0.013), (12, 0.03), (13, -0.053), (14, -0.123), (15, 0.12), (16, 0.08), (17, -0.158), (18, 0.026), (19, 0.068), (20, -0.048), (21, -0.05), (22, 0.013), (23, 0.044), (24, 0.033), (25, 0.064), (26, -0.0), (27, -0.056), (28, -0.008), (29, 0.045), (30, 0.065), (31, -0.056), (32, 0.048), (33, -0.03), (34, -0.013), (35, -0.084), (36, 0.146), (37, -0.035), (38, 0.038), (39, 0.069), (40, 0.001), (41, 0.006), (42, -0.051), (43, 0.032), (44, 0.158), (45, -0.014), (46, -0.015), (47, 0.096), (48, -0.096), (49, -0.097)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9328962 70 nips-2007-Discriminative K-means for Clustering

Author: Jieping Ye, Zheng Zhao, Mingrui Wu

Abstract: We present a theoretical study on the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection and clustering in this framework is not well understood, due to the iterative nature of the algorithm. We show in this paper that this iterative subspace selection and clustering is equivalent to kernel K-means with a specific kernel Gram matrix. This provides significant and new insights into the nature of this subspace selection procedure. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter estimation procedure. We also present the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation. The connection between DisKmeans and several other clustering algorithms is also analyzed. The presented theories and algorithms are evaluated through experiments on a collection of benchmark data sets. 1

2 0.64495236 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

Author: Francis R. Bach, Zaïd Harchaoui

Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.

3 0.6066339 80 nips-2007-Ensemble Clustering using Semidefinite Programming

Author: Vikas Singh, Lopamudra Mukherjee, Jiming Peng, Jinhui Xu

Abstract: We consider the ensemble clustering problem where the task is to ‘aggregate’ multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than voting strategy based methods proposed in literature. Using this generalization, we first derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniques which can subsequently be relaxed to a polynomial time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. We discuss evaluations on clustering and image segmentation databases. 1

4 0.45734486 61 nips-2007-Convex Clustering with Exemplar-Based Models

Author: Danial Lashkari, Polina Golland

Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1

5 0.42183125 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

Author: Ronny Luss, Alexandre D'aspremont

Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.

6 0.39556566 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

7 0.37820789 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

8 0.37547085 46 nips-2007-Cluster Stability for Finite Samples

9 0.37229058 109 nips-2007-Kernels on Attributed Pointsets with Applications

10 0.3419607 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

11 0.31771785 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

12 0.31629902 24 nips-2007-An Analysis of Inference with the Universum

13 0.31593722 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

14 0.31528237 49 nips-2007-Colored Maximum Variance Unfolding

15 0.2908681 118 nips-2007-Learning with Transformation Invariant Kernels

16 0.2899448 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables

17 0.27643484 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI

18 0.27550352 58 nips-2007-Consistent Minimization of Clustering Objective Functions

19 0.27466208 199 nips-2007-The Price of Bandit Information for Online Optimization

20 0.27067682 156 nips-2007-Predictive Matrix-Variate t Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.036), (13, 0.045), (16, 0.031), (19, 0.014), (21, 0.037), (31, 0.033), (34, 0.023), (35, 0.039), (46, 0.018), (47, 0.088), (49, 0.019), (73, 0.298), (83, 0.115), (85, 0.017), (87, 0.029), (90, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.69275874 70 nips-2007-Discriminative K-means for Clustering

Author: Jieping Ye, Zheng Zhao, Mingrui Wu

Abstract: We present a theoretical study on the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection and clustering in this framework is not well understood, due to the iterative nature of the algorithm. We show in this paper that this iterative subspace selection and clustering is equivalent to kernel K-means with a specific kernel Gram matrix. This provides significant and new insights into the nature of this subspace selection procedure. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter estimation procedure. We also present the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation. The connection between DisKmeans and several other clustering algorithms is also analyzed. The presented theories and algorithms are evaluated through experiments on a collection of benchmark data sets. 1

2 0.68999904 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

Author: Bill Triggs, Jakob J. Verbeek

Abstract: Conditional Random Fields (CRFs) are an effective tool for a variety of different data segmentation and labeling tasks including visual scene interpretation, which seeks to partition images into their constituent semantic-level regions and assign appropriate class labels to each region. For accurate labeling it is important to capture the global context of the image as well as local information. We introduce a CRF based scene labeling model that incorporates both local features and features aggregated over the whole image or large sections of it. Secondly, traditional CRF learning requires fully labeled datasets which can be costly and troublesome to produce. We introduce a method for learning CRFs from datasets with many unlabeled nodes by marginalizing out the unknown labels so that the log-likelihood of the known ones can be maximized by gradient ascent. Loopy Belief Propagation is used to approximate the marginals needed for the gradient and log-likelihood calculations and the Bethe free-energy approximation to the log-likelihood is monitored to control the step size. Our experimental results show that effective models can be learned from fragmentary labelings and that incorporating top-down aggregate features significantly improves the segmentations. The resulting segmentations are compared to the state-of-the-art on three different image datasets. 1

3 0.49984595 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1

4 0.49573487 115 nips-2007-Learning the 2-D Topology of Images

Author: Nicolas L. Roux, Yoshua Bengio, Pascal Lamblin, Marc Joliveau, Balázs Kégl

Abstract: We study the following question: is the two-dimensional structure of images a very strong prior or is it something that can be learned with a few examples of natural images? If someone gave us a learning task involving images for which the two-dimensional topology of pixels was not known, could we discover it automatically and exploit it? For example suppose that the pixels had been permuted in a fixed but unknown way, could we recover the relative two-dimensional location of pixels on images? The surprising result presented here is that not only the answer is yes, but that about as few as a thousand images are enough to approximately recover the relative locations of about a thousand pixels. This is achieved using a manifold learning algorithm applied to pixels associated with a measure of distributional similarity between pixel intensities. We compare different topologyextraction approaches and show how having the two-dimensional topology can be exploited.

5 0.49326852 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

Author: Lars Buesing, Wolfgang Maass

Abstract: We show that under suitable assumptions (primarily linearization) a simple and perspicuous online learning rule for Information Bottleneck optimization with spiking neurons can be derived. This rule performs on common benchmark tasks as well as a rather complex rule that has previously been proposed [1]. Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis (PCA) with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals X that are related or are not related to some additional target signal YT . In a biological interpretation, this target signal YT (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals. 1

6 0.49190184 189 nips-2007-Supervised Topic Models

7 0.4916172 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

8 0.49126321 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

9 0.49055031 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

10 0.48874566 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

11 0.488107 134 nips-2007-Multi-Task Learning via Conic Programming

12 0.48809245 84 nips-2007-Expectation Maximization and Posterior Constraints

13 0.48705757 86 nips-2007-Exponential Family Predictive Representations of State

14 0.48684877 24 nips-2007-An Analysis of Inference with the Universum

15 0.4867931 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

16 0.48638105 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

17 0.48511147 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

18 0.48488882 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation

19 0.48409259 16 nips-2007-A learning framework for nearest neighbor search

20 0.48390284 158 nips-2007-Probabilistic Matrix Factorization