jmlr jmlr2006 jmlr2006-52 knowledge-graph by maker-knowledge-mining

52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation


Source: pdf

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. [sent-8, score-1.111]

2 Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. [sent-9, score-0.634]

3 Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. [sent-10, score-0.676]

4 We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. [sent-11, score-0.54]

5 Introduction: Spectral clustering has many applications in machine learning, exploratory data analysis, computer vision and speech processing. [sent-13, score-0.47]

6 Most techniques explicitly or implicitly assume a metric or a similarity structure over the space of configurations, which is then used by clustering algorithms. [sent-14, score-0.429]

7 , 2003) or spectral clustering (Yu and Shi, 2002; Kamvar et al. [sent-20, score-0.453]

8 In this paper, we consider a complementary approach, providing a general framework for learning the similarity matrix for spectral clustering from examples. [sent-22, score-0.791]

9 We assume that we are given sample data with known partitions and are asked to build similarity matrices that will lead to these partitions when spectral clustering is performed. [sent-23, score-0.963]

10 of speech signals via partitioning of the time-frequency plane (Brown and Cooke, 1994), training examples can be created by mixing previously captured signals. [sent-28, score-0.39]

11 Another important motivation for our work is the need to develop spectral clustering methods that are robust to irrelevant features. [sent-29, score-0.453]

12 Our work is based on cost functions J1 (W, E) and J2 (W, E) that characterize how close the eigenstructure of a similarity matrix W is to a partition E. [sent-33, score-0.555]
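
To make this concrete, here is an illustrative subspace-comparison cost in the spirit of the text: it measures the Frobenius distance between the orthogonal projection onto the R-th principal subspace of the normalized similarity matrix and the projection onto the span of the degree-weighted partition indicators. This is a sketch of the general idea only; the paper's exact definitions of J1 and J2 may differ, and the function name and the encoding of E as a 0/1 indicator matrix are our assumptions.

    import numpy as np

    def subspace_cost(W, E):
        # Illustrative cost: how far the eigenstructure of W is from a partition E.
        # W is a P x P symmetric nonnegative similarity matrix,
        # E is a P x R 0/1 cluster-indicator matrix (assumed encoding).
        d = W.sum(axis=1)                              # degrees D_pp
        M = W / np.sqrt(np.outer(d, d))                # D^{-1/2} W D^{-1/2}
        R = E.shape[1]
        _, U = np.linalg.eigh(M)                       # eigenvalues in ascending order
        U = U[:, -R:]                                  # R-th principal subspace
        P_eig = U @ U.T                                # projection onto that subspace
        Q, _ = np.linalg.qr(np.sqrt(d)[:, None] * E)   # orthonormal basis of span(D^{1/2} E)
        P_part = Q @ Q.T
        return 0.5 * np.linalg.norm(P_eig - P_part, 'fro') ** 2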

13 3 provides foundational material on the approximation of the eigensubspace of a symmetric matrix that is needed for Section 4, which presents learning algorithms for spectral clustering. [sent-39, score-0.488]

14 We highlight one other aspect of the problem here—the major computational challenge involved in applying spectral methods to domains such as vision or speech separation. [sent-40, score-0.583]

15 Indeed, in image segmentation, the number of pixels in an image is usually in the hundreds of thousands, leading to similarity matrices of potentially huge size, while, for speech separation, four seconds of speech sampled at 5. [sent-41, score-0.89]

16 Thus a major part of our effort to apply spectral clustering techniques to speech separation has involved the design of numerical approximation schemes that exploit the different time scales present in speech signals. [sent-43, score-1.043]

17 Spectral Clustering and Normalized Cuts: In this section, we present our spectral clustering framework. [sent-48, score-0.453]

18 (2001), we derive the spectral relaxation through normalized cuts. [sent-50, score-0.402]

19 Alternative frameworks, based on Markov random walks (Meila and Shi, 2002), on different definitions of the normalized cut (Meila and Xu, 2003), or on constrained optimization (Higham and Kibble, 2004), lead to similar spectral relaxations. [sent-51, score-0.425]

20 1 Similarity Matrices: Spectral clustering refers to a class of techniques for clustering that are based on pairwise similarity relations among data points. [sent-53, score-0.599]

21 The goal of clustering is to organize the data set into disjoint subsets with high intra-cluster similarity and low inter-cluster similarity. [sent-55, score-0.429]

22 But the spectral clustering algorithms presented in this paper are not limited to positive semidefinite matrices. [sent-61, score-0.453]

23 In addition to being more immune to outliers, the normalized cut criterion and the ensuing spectral relaxations have a simpler theoretical asymptotic behavior when the number of data points tends to infinity (von Luxburg et al. [sent-85, score-0.425]

24 The following proposition gives the solution obtained from the spectral relaxation: Proposition 2 The maximum of tr Y⊤D−1/2W D−1/2Y over matrices Y ∈ RP×R such that Y⊤Y = I is the sum of the R largest eigenvalues of D−1/2W D−1/2. [sent-114, score-0.475]

25 It is attained at all Y of the form Y = UB1, where U ∈ RP×R is any orthonormal basis of the R-th principal subspace of D−1/2W D−1/2 and B1 is an arbitrary orthogonal matrix in RR×R. [sent-115, score-0.411]
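
A quick numerical illustration of Proposition 2 (a minimal sketch on a random synthetic similarity matrix; the variable names are ours): the trace attained at the top-R eigenvectors matches the sum of the R largest eigenvalues of D−1/2W D−1/2.

    import numpy as np

    rng = np.random.default_rng(0)
    P, R = 50, 3
    A = rng.standard_normal((P, P))
    W = np.abs(A + A.T)                      # synthetic symmetric nonnegative similarity
    d = W.sum(axis=1)
    M = W / np.sqrt(np.outer(d, d))          # D^{-1/2} W D^{-1/2}

    vals, vecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    U = vecs[:, -R:]                         # orthonormal basis of the R-th principal subspace
    print(np.trace(U.T @ M @ U))             # value of the relaxed objective at Y = U
    print(vals[-R:].sum())                   # sum of the R largest eigenvalues (equal)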

26 where the maximum is attained for all matrices Y of the form Y = UB1, where U ∈ RP×R is any orthonormal basis of the R-th principal subspace of W and B1 is an arbitrary orthogonal matrix in RR×R. [sent-120, score-0.53]

27 1 C OMPARISON OF S UBSPACES Solutions of the relaxed problem are defined up to an orthogonal matrix, that is, Yeig = UB1 , where U ∈ RP×R is any orthonormal basis of the R-th principal subspace of M and B 1 is an arbitrary orthogonal matrix. [sent-131, score-0.43]

28 Note that if the similarity matrix W has rank equal to R, then our cost function J1 (W, E) is exactly equal to the normalized cut C(W, E). [sent-145, score-0.667]
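
For reference, the R-way normalized cut itself can be evaluated directly from W and a partition. The sketch below assumes the standard definition (sum over clusters of the weight leaving each cluster divided by the cluster's total degree) and a 0/1 indicator matrix E; both are assumptions on our part rather than quotes from the paper.

    import numpy as np

    def normalized_cut(W, E):
        # Multiway normalized cut of the partition encoded by the indicator matrix E.
        d = W.sum(axis=1)
        total = 0.0
        for r in range(E.shape[1]):
            inside = E[:, r].astype(bool)
            cut = W[inside][:, ~inside].sum()    # weight of edges leaving cluster r
            vol = d[inside].sum()                # total degree (volume) of cluster r
            total += cut / vol
        return total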

29 (2002) have shown that when the similarity matrix is “close” to this idealized situation, the properly normalized rows tightly cluster around an orthonormal basis. [sent-158, score-0.543]

30 Minimizing with respect to the matrix W for a given partition E leads to algorithms for learning the similarity matrix, as we show in Section 3 and Section 4. [sent-166, score-0.413]

31 The first cost function has the closest ties to the normalized cut problem since it is equal to the normalized cut for similarity matrices of rank R, while the second cost function leads to better theoretical learning bounds, as shown in Section 3. [sent-168, score-0.915]

32 While K-means is often used heuristically as a postprocessor for spectral clustering (Ng et al. [sent-173, score-0.453]

33 (2002), shows that the cost function can be interpreted as a weighted distortion measure: Theorem 3 Let W be a similarity matrix and let U =(u1 , . [sent-177, score-0.494]

34 For the second cost function, we have a similar theorem, which leads naturally to the K-means algorithm presented in Figure 2: Theorem 4 Let W be a similarity matrix and let U be an orthonormal basis of the R-th principal subspace of D−1/2W D−1/2, and V = D1/2U(U⊤DU)−1/2. [sent-189, score-0.678]
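
A minimal sketch of the degree-weighted K-means rounding this theorem suggests: run K-means on the rows of V = D1/2U(U⊤DU)−1/2 with weights given by the degrees Dpp. The initialization, stopping rule and function names below are our own simplifications, not the exact algorithm of the paper's Figure 2.

    import numpy as np
    from scipy.linalg import eigh

    def weighted_kmeans_rounding(W, R, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        d = W.sum(axis=1)
        P = len(d)
        M = W / np.sqrt(np.outer(d, d))                    # D^{-1/2} W D^{-1/2}
        _, U = eigh(M, subset_by_index=[P - R, P - 1])     # R-th principal subspace
        A = U.T @ (d[:, None] * U)                         # U^T D U
        w, Q = np.linalg.eigh(A)
        V = (np.sqrt(d)[:, None] * U) @ (Q / np.sqrt(w)) @ Q.T   # D^{1/2} U (U^T D U)^{-1/2}
        centers = V[rng.choice(P, R, replace=False)]       # random initial centroids
        for _ in range(n_iter):
            labels = ((V[:, None, :] - centers) ** 2).sum(-1).argmin(1)
            for r in range(R):
                m = labels == r
                if m.any():                                # degree-weighted centroid update
                    centers[r] = (d[m][:, None] * V[m]).sum(0) / d[m].sum()
        return labels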

35 The rounding procedures that we propose in this paper are similar to those in other spectral clustering algorithms (Ng et al. [sent-194, score-0.503]

36 The main advantage of our procedure—which differs from the others in being derived from a cost function—is that it naturally leads to an algorithm for learning the similarity matrix from data, presented in Section 3. [sent-197, score-0.444]

37 A similarity matrix W is said to be perfect with respect to a partition E with R clusters if the cost functions J1 (W, E) and J2 (W, E) are exactly equal to zero. [sent-232, score-0.587]

38 7 Variational Formulation for the Normalized Cut: In this section, we show that there is a variational formulation of the normalized cut similar to Theorem 3 for positive semidefinite similarity matrices, that is, for matrices that can be factorized as W = GG⊤ where G ∈ RP×M with M ≪ P. [sent-240, score-0.52]

39 Cost Functions for Learning the Similarity Matrix: Given a similarity matrix W, the steps of a spectral clustering algorithm are (1) normalization, (2) computation of eigenvalues, and (3) partitioning of the eigenvectors using (weighted) K-means to obtain a partition E. [sent-251, score-0.96]

40 In this section, we assume that the partition E is given, and we develop a theoretical framework and a set of algorithms for learning a similarity matrix W . [sent-252, score-0.413]

41 It is important to note that if we put no constraints on W, then there is a trivial solution, namely any perfect similarity matrix with respect to the partition E, in particular, any matrix that is block-constant with the appropriate blocks. [sent-253, score-0.492]

42 Given a distance between partitions, a naive algorithm would simply minimize the distance between the true partition E and the output of the spectral clustering algorithm. [sent-257, score-0.528]

43 2 Cost Functions as Upper Bounds: We let E1 (W) denote the clustering obtained by minimizing the cost function J1 (W, E) with respect to E, and let E2 (W) denote the clustering obtained by minimizing the cost function J2 (W, E). [sent-276, score-0.552]

44 The following theorem shows that our cost functions are upper bounds on the distance between a partition and the output of the spectral clustering algorithm: Theorem 6 Let η(W) = maxp Dpp / minp Dpp ≥ 1. [sent-277, score-0.708]

45 This bound is tight at zero, consequently, if we are able to produce a similarity matrix W with small J1 (W, E) or J2 (W, E) cost, then the matrix will provably lead to partition that is close to E. [sent-289, score-0.492]

46 The set MP,R is open (Magnus and Neudecker, 1999), and for any matrix M in MP,R, the R-th principal subspace ER (M) is uniquely defined and the orthogonal projection ΠR (M) onto that subspace is a unique identifier of that subspace. [sent-308, score-0.405]

47 Note that the R-th eigensubspace is well defined even if some eigenvalues larger than the R-th eigenvalue coalesce (in which case, the R eigenvectors are not well defined but the R-th principal eigensubspace is). [sent-310, score-0.465]

48 Finally, in our setting of learning the similarity matrix, we can speed up the eigenvalue computation by initializing the power or Lanczos method with the eigensubspace of previous iterations. [sent-323, score-0.396]
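
A sketch of this warm-start idea using orthogonal (block power) iteration; the tolerance, iteration cap and convergence test are illustrative choices, not the paper's. Note that, as written, block power iteration converges to the subspace of largest-magnitude eigenvalues; a constant shift of M can be applied first if negative eigenvalues are a concern (cf. the remark on negative eigenvalues below).

    import numpy as np

    def principal_subspace(M, R, U0=None, n_iter=200, tol=1e-8, seed=0):
        # Orthonormal basis of the R-th principal subspace of a symmetric matrix M,
        # optionally warm-started from a previous estimate U0 (P x R).
        rng = np.random.default_rng(seed)
        U = U0 if U0 is not None else rng.standard_normal((M.shape[0], R))
        U, _ = np.linalg.qr(U)
        for _ in range(n_iter):
            U_new, _ = np.linalg.qr(M @ U)
            if np.linalg.norm(U_new @ U_new.T - U @ U.T) < tol:   # projection has converged
                return U_new
            U = U_new
        return U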

49 3 Approximation of Eigensubspace and Its Differential: When learning the similarity matrix, the cost function and its derivatives are computed many times and it is thus worthwhile to use an efficient approximation of the eigensubspace as well as its differential. [sent-344, score-0.45]

50 5 Negative Eigenvalues: The spectral relaxation in Proposition 2 involves the largest eigenvalues of the normalized matrix D−1/2W D−1/2. [sent-372, score-0.487]

51 We also present results for two existing approaches, one based on a Markov chain interpretation of spectral clustering (Meila and Shi, 2002) and one based on the alignment (Cristianini et al. [sent-385, score-0.453]

52 The problem with the latter cost functions is that these functions essentially measure the distance between the similarity matrix W (or a normalized version of W ) and a matrix T which (after permutation) is block-diagonal with constant blocks. [sent-404, score-0.59]

53 Spectral clustering does work with matrices which are close to block-constant; however, one of the strengths of spectral clustering is its ability to work effectively with similarity matrices which are not block-constant, and which may exhibit strong variations within each block. [sent-405, score-1.12]

54 Thus taking powers can be interpreted as “thickening” the graph to make the clusters more apparent, while not changing the eigenstructure of the matrix (taking powers of symmetric matrices only changes the eigenvalues, not the eigenvectors). [sent-411, score-0.415]

55 Note that other clustering approaches based on taking powers of similarity matrices have been studied by Tishby and Slonim (2001) and Szummer and Jaakkola (2002); these differ from our approach, in which powers are taken only to approximate the cost function used for learning the similarity matrix. [sent-412, score-0.985]

56 In particular, we consider diagonally-scaled Gaussian kernel matrices (for which the parameters are the scales of each dimension), as well as more complex parameterized matrices for the segmentation of line drawings in Section 4. [sent-418, score-0.509]
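
A sketch of such a diagonally-scaled Gaussian kernel matrix, with one nonnegative scale per feature dimension (function and parameter names are illustrative). Learning then amounts to tuning these scales, which can drive the scales of irrelevant dimensions toward zero.

    import numpy as np

    def diag_scaled_gaussian(X, alphas):
        # W_pq = exp(-sum_d alphas_d * (x_pd - x_qd)^2) for data X (P x D)
        # and nonnegative per-dimension scales alphas (length D).
        diff = X[:, None, :] - X[None, :, :]          # pairwise differences
        return np.exp(-(alphas * diff ** 2).sum(axis=-1))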

57 (c) Gold standard clustering error (black solid), spectral cost function J1 (red dotted) and J2 (blue dashed). [sent-437, score-0.559]

58 1 Learning Algorithm: We assume that we are given several related data sets with known partitions and our objective is to learn parameters of similarity matrices adapted to the overall problem. [sent-441, score-0.444]

59 In some cases, we may end the optimization with a few steps of steepest descent using the cost function with the true eigenvectors, that is, for q = ∞; this is particularly appropriate when the eigengaps of the optimal similarity matrices happen to be small. [sent-461, score-0.484]

60 2 Related Work: Several other frameworks aim at learning the similarity matrices for spectral clustering or related procedures. [sent-463, score-0.831]

61 (2005), which directly optimizes the eigenvectors of the similarity matrix, rather than the eigensubspaces, and is applied to image segmentation tasks. [sent-465, score-0.505]

62 4 that the cost function, although convex, may lead to similarity matrices that do not perform well. [sent-469, score-0.484]

63 In order to cluster previously unseen data sets, we compute the similarity matrix W and use the algorithm of Figure 1 or Figure 2. [sent-477, score-0.394]

64 That is, for small λ we set the parameter value to α + βλ and perform spectral clustering, selecting λ such that the (weighted) distortion obtained after application of the spectral clustering algorithm of Figure 1 or Figure 2 is minimal. [sent-480, score-0.786]
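
A sketch of this one-dimensional search (reusing the diag_scaled_gaussian sketch above; cluster_fn is a hypothetical helper assumed to return a partition together with its weighted distortion):

    import numpy as np

    def pick_lambda(X, alpha, beta, R, lambdas, cluster_fn):
        # Try parameter values alpha + lam * beta (arrays of per-dimension scales)
        # and keep the lam whose clustering has the smallest (weighted) distortion.
        best = None
        for lam in lambdas:
            W = diag_scaled_gaussian(X, alpha + lam * beta)
            labels, distortion = cluster_fn(W, R)
            if best is None or distortion < best[0]:
                best = (distortion, lam, labels)
        return best[1], best[2]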

65 If W is an element-wise product of similarity matrices, only a subset of which has a nice structure, we can still use these techniques, albeit with the possibility of requiring more than Q elements of the similarity matrix to be computed. [sent-509, score-0.597]

66 Without feature selection, the performance of spectral clustering degrades very rapidly when the number of irrelevant features increases, while our learning approach is very robust, even with only one training data set. [sent-569, score-0.453]

67 2 discusses the construction of the parameterized similarity matrices and Section 4. [sent-577, score-0.412]

68 Figure 8: Testing examples: (top) obtained from parameterized similarity matrices learned using the top row of Figure 7 for training, (bottom) obtained from parameterized similarity matrices learned using the bottom row of Figure 7 for training. [sent-640, score-0.824]

69 3 Simulations: In a first experiment, we learned the parameters of the similarity matrices on 50 small to medium images with two clusters and tested on various images of varying sizes and with varying numbers of clusters. [sent-644, score-0.483]

70 See Figure 6 for examples in which segmentation was successful and examples in which the similarity matrices failed to lead to the proper segmentation. [sent-646, score-0.53]

71 We then tested the two estimated parameterized similarity matrices on ambiguous line drawings and we show some results in Figure 8. [sent-650, score-0.529]

72 We take a "discriminative" approach to the problem of speech separation, based on the spectral learning methodology presented in Section 4. [sent-666, score-0.617]

73 We thus formulate speech separation as a problem of segmentation in the time-frequency plane. [sent-671, score-0.486]

74 Inversion: Our speech separation framework is based on the segmentation of the spectrogram of a signal f [t] into R ≥ 2 disjoint subsets Ai, i = 1, . . . , R. [sent-688, score-0.727]

75 We now need to find R speech signals f i [t] such that each Ui is the spectrogram of f i . [sent-693, score-0.588]
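
A rough sketch of the corresponding masking-and-inversion step using a short-time Fourier transform: each subset Ai is represented as a binary mask over time-frequency bins, the mixture STFT is masked, and each masked STFT is inverted back to a time signal. The window length, overlap and the use of scipy's stft/istft are our illustrative choices, not the paper's exact transform.

    import numpy as np
    from scipy.signal import stft, istft

    def separate_by_masks(f, masks, fs, nperseg=512, noverlap=384):
        # f: mixture signal; masks: list of binary arrays, one per source,
        # each with the same shape as the STFT of f.
        _, _, F = stft(f, fs=fs, nperseg=nperseg, noverlap=noverlap)
        sources = []
        for A in masks:
            _, fi = istft(F * A, fs=fs, nperseg=nperseg, noverlap=noverlap)
            sources.append(fi)
        return sources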

76 For a speech sample of length 4 seconds, we have T = 22,000 samples and then N = 407, which yields ≈ 2 × 10^5 spectrogram samples. [sent-706, score-0.454]

77 In this paper, we chose to rescale all speech signals as follows: for each time window n, we compute the total energy en = ∑m |Ufmn|2 and its 20-point moving average. [sent-712, score-0.423]
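
A sketch of this per-frame energy rescaling; only the per-frame energy and the 20-point moving average come from the text, while the exact rescaling rule (dividing each frame by the square root of the smoothed energy) is our assumption for illustration.

    import numpy as np

    def rescale_spectrogram(U, k=20):
        # Per-frame energy e_n = sum_m |U_mn|^2 and its k-point moving average;
        # each time frame is then divided by the square root of the smoothed energy.
        e = (np.abs(U) ** 2).sum(axis=0)
        e_smooth = np.convolve(e, np.ones(k) / k, mode='same')
        return U / np.sqrt(e_smooth + 1e-12)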

78 In order to reduce the number of spectrogram samples to consider, for a given pre-normalized speech signal, we threshold coefficients whose magnitudes are less than a value that was chosen so that the resulting distortion is inaudible. [sent-714, score-0.504]

79 That is, given two volume-normalized speech signals, f1 and f2, of the same duration, with spectrograms U1 and U2, we build a training sample as Utrain = U1 + U2, with a segmentation given by z = arg min{U1, U2}. [sent-718, score-0.408]
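
A minimal sketch of this training-pair construction, following the arg min formula quoted above literally (the magnitude comparison and the 0/1 label encoding are our assumptions):

    import numpy as np

    def make_training_example(U1, U2):
        # Mix two volume-normalized spectrograms and label every time-frequency
        # bin with z = argmin over {|U1|, |U2|}, as in the quoted formula.
        U_train = U1 + U2
        z = np.argmin(np.stack([np.abs(U1), np.abs(U2)]), axis=0)
        return U_train, z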

80 2 Harmonic Cues: This is the major cue for voiced speech (Gold and Morgan, 1999; Brown and Cooke, 1994; Bregman, 1990), and it acts at all time scales (small, medium and large): voiced speech is locally periodic and the local period is usually referred to as the pitch. [sent-745, score-0.549]

81 • Timbre: The pitch extraction algorithm presented in Appendix C also outputs the spectral envelope of the signal (Gold and Morgan, 1999). [sent-752, score-0.581]

82 Timbre can be loosely defined as the set of properties of a voiced speech signal once the pitch has been factored out (Bregman, 1990). [sent-754, score-0.554]

83 These features all come with some form of energy level, and all features involving pitch values ω should take this energy into account when the similarity matrix is built in Section 6. [sent-761, score-0.593]

84 Spectral Clustering for Speech Separation Given the features described in the previous section, we now show how to build similarity matrices that can be used to define a spectral segmenter. [sent-764, score-0.661]

85 Intuitively, if we consider the values of similarity as soft boolean variables, taking the product of two similarity matrices is equivalent to considering the conjunction of two matrices, while taking the sum can be seen as their disjunction. [sent-776, score-0.637]
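
A sketch of these two combination rules for a list of per-cue similarity matrices, treating similarity values as soft boolean variables in [0, 1] (averaging is used here as a bounded stand-in for the sum):

    import numpy as np

    def combine_similarities(Ws, mode="product"):
        # Element-wise product = conjunction of cues; element-wise average
        # (a rescaled sum) = disjunction of cues.
        Ws = np.stack(Ws)
        return Ws.prod(axis=0) if mode == "product" else Ws.mean(axis=0)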

86 We also performed two baseline experiments: (1) In order to show that the combination of features is indeed crucial for performance, we performed K-means clustering on the estimated pitch to separate the two signals (“Pitch”). [sent-828, score-0.559]

87 Figure 11: (Top) Optimal segmentation for the spectrogram of French speakers in Figure 9 (right), where the two speakers are "black" and "grey"; this segmentation is obtained from the known separated signals. [sent-834, score-0.732]

88 Finally, as mentioned earlier, there was a major computational challenge in applying spectral methods to single-microphone speech separation. [sent-839, score-0.539]

89 Conclusions In this paper, we have presented two sets of algorithms—one for spectral clustering and one for learning the similarity matrix. [sent-843, score-0.712]

90 This cost function depends directly on the eigenstructure of the similarity matrix. [sent-845, score-0.401]

91 We have shown that it can be approximated efficiently using the power method, yielding a method for learning similarity matrices that can cluster effectively in cases in which non-adaptive approaches fail. [sent-846, score-0.434]

92 Note in particular that our new approach yields a spectral clustering method that is significantly more robust to irrelevant features than current methods. [sent-847, score-0.453]

93 The former provide parameterized similarity matrices for spectral clustering, and the latter make use of our ability to generate segmented training data. [sent-850, score-0.745]

94 We have successfully demixed speech signals from two speakers using this approach. [sent-852, score-0.505]

95 We let U0 denote an orthogonal matrix composed of the first R eigenvectors of M0 (well defined up to sign because all eigenvalues are simple), and S0 = U0⊤M0U0 the diagonal matrix of the first R eigenvalues. [sent-879, score-0.461]

96 Pitch Extraction: As seen in Section 5, harmonic features such as pitch are essential for successful speech separation. [sent-908, score-0.549]

97 Since the speech signals are real, the spectrogram is symmetric and we consider only M/2 samples. [sent-913, score-0.629]

98 A unified view of kernel k-means, spectral clustering and graph cuts. [sent-1034, score-0.453]

99 On the equivalence of nonnegative matrix factorization and spectral clustering. [sent-1041, score-0.421]

100 On semidefinite relaxation for normalized k-cut and connections to spectral clustering. [sent-1252, score-0.402]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('spectral', 0.283), ('similarity', 0.259), ('speech', 0.256), ('pitch', 0.255), ('spectrogram', 0.198), ('pectral', 0.179), ('peech', 0.179), ('clustering', 0.17), ('eparation', 0.152), ('ordan', 0.152), ('pplication', 0.152), ('segmentation', 0.152), ('lustering', 0.136), ('shi', 0.136), ('signals', 0.134), ('bach', 0.129), ('matrices', 0.119), ('ar', 0.118), ('speakers', 0.115), ('cost', 0.106), ('rp', 0.103), ('orthogonal', 0.098), ('eigenvectors', 0.094), ('meila', 0.088), ('drawings', 0.085), ('eigengap', 0.085), ('eigensubspace', 0.085), ('uu', 0.085), ('orthonormal', 0.082), ('matrix', 0.079), ('separation', 0.078), ('principal', 0.076), ('subspace', 0.076), ('partition', 0.075), ('cut', 0.075), ('eigenvalues', 0.073), ('clusters', 0.068), ('du', 0.067), ('normalized', 0.067), ('partitions', 0.066), ('nonnegative', 0.059), ('ith', 0.058), ('pitches', 0.057), ('cluster', 0.056), ('cues', 0.056), ('blind', 0.054), ('earning', 0.053), ('malik', 0.053), ('eigenvalue', 0.052), ('relaxation', 0.052), ('vq', 0.052), ('rounding', 0.05), ('segmented', 0.05), ('distortion', 0.05), ('semide', 0.049), ('ds', 0.049), ('ding', 0.047), ('eigensubspaces', 0.047), ('inked', 0.047), ('timbre', 0.047), ('gold', 0.047), ('eigenvector', 0.046), ('rr', 0.046), ('vision', 0.044), ('signal', 0.043), ('tr', 0.043), ('subspaces', 0.043), ('frequency', 0.042), ('rank', 0.041), ('symmetric', 0.041), ('rf', 0.04), ('ng', 0.04), ('er', 0.04), ('circles', 0.039), ('iterations', 0.038), ('loan', 0.038), ('diagonal', 0.038), ('auditory', 0.038), ('cooke', 0.038), ('fowlkes', 0.038), ('jourjine', 0.038), ('minr', 0.038), ('osculating', 0.038), ('zha', 0.038), ('harmonic', 0.038), ('golub', 0.037), ('medium', 0.037), ('pp', 0.037), ('eigenstructure', 0.036), ('qr', 0.036), ('powers', 0.036), ('parameterized', 0.034), ('mallat', 0.033), ('window', 0.033), ('bandwidth', 0.033), ('ui', 0.032), ('demixing', 0.032), ('francis', 0.032), ('shental', 0.032), ('ambiguous', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

2 0.17685176 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

Author: Rasmus Kongsgaard Olsson, Lars Kai Hansen

Abstract: We apply a type of generative modelling to the problem of blind source separation in which prior knowledge about the latent source signals, such as time-varying auto-correlation and quasiperiodicity, are incorporated into a linear state-space model. In simulations, we show that in terms of signal-to-error ratio, the sources are inferred more accurately as a result of the inclusion of strong prior knowledge. We explore different schemes of maximum-likelihood optimization for the purpose of learning the model parameters. The Expectation Maximization algorithm, which is often considered the standard optimization method in this context, results in slow convergence when the noise variance is small. In such scenarios, quasi-Newton optimization yields substantial improvements in a range of signal to noise ratios. We analyze the performance of the methods on convolutive mixtures of speech signals. Keywords: blind source separation, state-space model, independent component analysis, convolutive model, EM, speech modelling

3 0.17113365 33 jmlr-2006-Fast SDP Relaxations of Graph Cut Clustering, Transduction, and Other Combinatorial Problems     (Special Topic on Machine Learning and Optimization)

Author: Tijl De Bie, Nello Cristianini

Abstract: The rise of convex programming has changed the face of many research fields in recent years, machine learning being one of the ones that benefitted the most. A very recent developement, the relaxation of combinatorial problems to semi-definite programs (SDP), has gained considerable attention over the last decade (Helmberg, 2000; De Bie and Cristianini, 2004a). Although SDP problems can be solved in polynomial time, for many relaxations the exponent in the polynomial complexity bounds is too high for scaling to large problem sizes. This has hampered their uptake as a powerful new tool in machine learning. In this paper, we present a new and fast SDP relaxation of the normalized graph cut problem, and investigate its usefulness in unsupervised and semi-supervised learning. In particular, this provides a convex algorithm for transduction, as well as approaches to clustering. We further propose a whole cascade of fast relaxations that all hold the middle between older spectral relaxations and the new SDP relaxation, allowing one to trade off computational cost versus relaxation accuracy. Finally, we discuss how the methodology developed in this paper can be applied to other combinatorial problems in machine learning, and we treat the max-cut problem as an example. Keywords: convex transduction, normalized graph cut, semi-definite programming, semisupervised learning, relaxation, combinatorial optimization, max-cut

4 0.089660481 78 jmlr-2006-Rearrangement Clustering: Pitfalls, Remedies, and Applications

Author: Sharlee Climer, Weixiong Zhang

Abstract: Given a matrix of values in which the rows correspond to objects and the columns correspond to features of the objects, rearrangement clustering is the problem of rearranging the rows of the matrix such that the sum of the similarities between adjacent rows is maximized. Referred to by various names and reinvented several times, this clustering technique has been extensively used in many fields over the last three decades. In this paper, we point out two critical pitfalls that have been previously overlooked. The first pitfall is deleterious when rearrangement clustering is applied to objects that form natural clusters. The second concerns a similarity metric that is commonly used. We present an algorithm that overcomes these pitfalls. This algorithm is based on a variation of the Traveling Salesman Problem. It offers an extra benefit as it automatically determines cluster boundaries. Using this algorithm, we optimally solve four benchmark problems and a 2,467-gene expression data clustering problem. As expected, our new algorithm identifies better clusters than those found by previous approaches in all five cases. Overall, our results demonstrate the benefits of rectifying the pitfalls and exemplify the usefulness of this clustering technique. Our code is available at our websites. Keywords: clustering, visualization of patterns in data, bond energy algorithm, traveling salesman problem, asymmetric clustering

5 0.087918155 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

6 0.065591022 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation

7 0.065167658 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution

8 0.063625999 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

9 0.063092999 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research     (Special Topic on Machine Learning and Optimization)

10 0.057918385 25 jmlr-2006-Distance Patterns in Structural Similarity

11 0.057430509 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections     (Special Topic on Machine Learning and Optimization)

12 0.05302529 45 jmlr-2006-Learning Coordinate Covariances via Gradients

13 0.046482321 49 jmlr-2006-Learning Parts-Based Representations of Data

14 0.04473415 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

15 0.044617895 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

16 0.043627672 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

17 0.04326316 93 jmlr-2006-Universal Kernels

18 0.042784642 68 jmlr-2006-On the Complexity of Learning Lexicographic Strategies

19 0.042498682 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers

20 0.042448722 13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.253), (1, -0.125), (2, -0.056), (3, 0.056), (4, -0.127), (5, -0.151), (6, -0.092), (7, 0.146), (8, -0.205), (9, 0.28), (10, -0.041), (11, -0.214), (12, -0.222), (13, 0.057), (14, 0.121), (15, 0.05), (16, 0.007), (17, 0.304), (18, -0.086), (19, 0.104), (20, -0.121), (21, 0.062), (22, 0.011), (23, -0.029), (24, -0.008), (25, 0.074), (26, 0.048), (27, -0.093), (28, -0.025), (29, 0.137), (30, 0.041), (31, -0.056), (32, 0.064), (33, 0.003), (34, -0.025), (35, -0.09), (36, 0.027), (37, 0.086), (38, 0.023), (39, -0.025), (40, 0.013), (41, -0.044), (42, 0.004), (43, 0.009), (44, -0.003), (45, -0.11), (46, -0.11), (47, -0.04), (48, -0.0), (49, -0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96122849 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

2 0.67823368 33 jmlr-2006-Fast SDP Relaxations of Graph Cut Clustering, Transduction, and Other Combinatorial Problems     (Special Topic on Machine Learning and Optimization)

Author: Tijl De Bie, Nello Cristianini

Abstract: The rise of convex programming has changed the face of many research fields in recent years, machine learning being one of the ones that benefitted the most. A very recent developement, the relaxation of combinatorial problems to semi-definite programs (SDP), has gained considerable attention over the last decade (Helmberg, 2000; De Bie and Cristianini, 2004a). Although SDP problems can be solved in polynomial time, for many relaxations the exponent in the polynomial complexity bounds is too high for scaling to large problem sizes. This has hampered their uptake as a powerful new tool in machine learning. In this paper, we present a new and fast SDP relaxation of the normalized graph cut problem, and investigate its usefulness in unsupervised and semi-supervised learning. In particular, this provides a convex algorithm for transduction, as well as approaches to clustering. We further propose a whole cascade of fast relaxations that all hold the middle between older spectral relaxations and the new SDP relaxation, allowing one to trade off computational cost versus relaxation accuracy. Finally, we discuss how the methodology developed in this paper can be applied to other combinatorial problems in machine learning, and we treat the max-cut problem as an example. Keywords: convex transduction, normalized graph cut, semi-definite programming, semisupervised learning, relaxation, combinatorial optimization, max-cut

3 0.58230817 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

Author: Rasmus Kongsgaard Olsson, Lars Kai Hansen

Abstract: We apply a type of generative modelling to the problem of blind source separation in which prior knowledge about the latent source signals, such as time-varying auto-correlation and quasiperiodicity, are incorporated into a linear state-space model. In simulations, we show that in terms of signal-to-error ratio, the sources are inferred more accurately as a result of the inclusion of strong prior knowledge. We explore different schemes of maximum-likelihood optimization for the purpose of learning the model parameters. The Expectation Maximization algorithm, which is often considered the standard optimization method in this context, results in slow convergence when the noise variance is small. In such scenarios, quasi-Newton optimization yields substantial improvements in a range of signal to noise ratios. We analyze the performance of the methods on convolutive mixtures of speech signals. Keywords: blind source separation, state-space model, independent component analysis, convolutive model, EM, speech modelling

4 0.43264201 78 jmlr-2006-Rearrangement Clustering: Pitfalls, Remedies, and Applications

Author: Sharlee Climer, Weixiong Zhang

Abstract: Given a matrix of values in which the rows correspond to objects and the columns correspond to features of the objects, rearrangement clustering is the problem of rearranging the rows of the matrix such that the sum of the similarities between adjacent rows is maximized. Referred to by various names and reinvented several times, this clustering technique has been extensively used in many fields over the last three decades. In this paper, we point out two critical pitfalls that have been previously overlooked. The first pitfall is deleterious when rearrangement clustering is applied to objects that form natural clusters. The second concerns a similarity metric that is commonly used. We present an algorithm that overcomes these pitfalls. This algorithm is based on a variation of the Traveling Salesman Problem. It offers an extra benefit as it automatically determines cluster boundaries. Using this algorithm, we optimally solve four benchmark problems and a 2,467-gene expression data clustering problem. As expected, our new algorithm identifies better clusters than those found by previous approaches in all five cases. Overall, our results demonstrate the benefits of rectifying the pitfalls and exemplify the usefulness of this clustering technique. Our code is available at our websites. Keywords: clustering, visualization of patterns in data, bond energy algorithm, traveling salesman problem, asymmetric clustering

5 0.29662523 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

6 0.29083058 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation

7 0.26656732 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution

8 0.22248451 25 jmlr-2006-Distance Patterns in Structural Similarity

9 0.2176346 49 jmlr-2006-Learning Parts-Based Representations of Data

10 0.21716341 18 jmlr-2006-Building Support Vector Machines with Reduced Classifier Complexity     (Special Topic on Machine Learning and Optimization)

11 0.20146921 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

12 0.19717775 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery

13 0.19251269 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

14 0.18980442 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

15 0.18478385 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research     (Special Topic on Machine Learning and Optimization)

16 0.18402746 93 jmlr-2006-Universal Kernels

17 0.18358417 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling

18 0.18328424 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

19 0.17936149 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

20 0.17878363 68 jmlr-2006-On the Complexity of Learning Lexicographic Strategies


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.054), (8, 0.045), (19, 0.257), (35, 0.012), (36, 0.049), (38, 0.017), (45, 0.028), (50, 0.051), (61, 0.025), (63, 0.075), (68, 0.015), (76, 0.02), (78, 0.033), (81, 0.05), (84, 0.02), (90, 0.034), (91, 0.028), (96, 0.081)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75547981 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

2 0.71067929 66 jmlr-2006-On Model Selection Consistency of Lasso

Author: Peng Zhao, Bin Yu

Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency

3 0.50384915 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

Author: Ron Begleiter, Ran El-Yaniv

Abstract: We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-ofthe-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms. Keywords: sequential prediction, the context tree weighting method, variable order Markov models, error bounds

4 0.48528877 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs

5 0.46285868 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs     (Special Topic on Machine Learning and Optimization)

Author: Tobias Glasmachers, Christian Igel

Abstract: Support vector machines are trained by solving constrained quadratic optimization problems. This is usually done with an iterative decomposition algorithm operating on a small working set of variables in every iteration. The training time strongly depends on the selection of these variables. We propose the maximum-gain working set selection algorithm for large scale quadratic programming. It is based on the idea to greedily maximize the progress in each single iteration. The algorithm takes second order information from cached kernel matrix entries into account. We prove the convergence to an optimal solution of a variant termed hybrid maximum-gain working set selection. This method is empirically compared to the prominent most violating pair selection and the latest algorithm using second order information. For large training sets our new selection scheme is significantly faster. Keywords: working set selection, sequential minimal optimization, quadratic programming, support vector machines, large scale optimization

6 0.45639697 44 jmlr-2006-Large Scale Transductive SVMs

7 0.45260227 51 jmlr-2006-Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming     (Special Topic on Machine Learning and Optimization)

8 0.4513534 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

9 0.45075282 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

10 0.44997674 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

11 0.44671243 53 jmlr-2006-Learning a Hidden Hypergraph

12 0.44447935 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

13 0.44215626 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

14 0.44099599 70 jmlr-2006-Online Passive-Aggressive Algorithms

15 0.43928888 14 jmlr-2006-An Efficient Implementation of an Active Set Method for SVMs    (Special Topic on Machine Learning and Optimization)

16 0.43899977 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation

17 0.43852556 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis

18 0.43803638 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)

19 0.43705341 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

20 0.43693867 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution