jmlr jmlr2007 jmlr2007-8 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marc Teboulle
Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods
Reference: text
sentIndex sentText sentNum sentScore
1 School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel. Editor: Charles Elkan. Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. [sent-4, score-0.341]
2 In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. [sent-5, score-0.574]
3 Introduction The clustering problem is to partition a given data set into similar subsets or clusters, so that objects/points in a cluster are more similar to each other than to points in another cluster. [sent-9, score-0.305]
4 The interdisciplinary nature of clustering is evident through its vast literature which includes many clustering problem formulations, and even more algorithms. [sent-12, score-0.526]
5 Basically, the two main approaches to clustering are hierarchical clustering and partitioning clustering. [sent-15, score-0.565]
6 Most well known partitioning clustering methods iteratively update the so-called centroids or cluster centers, and as such, are often referred to as center-based clustering methods. [sent-17, score-0.607]
7 These clustering methods are usually classified as hard or soft, depending on the way data points are assigned to clusters. [sent-18, score-0.309]
8 In that class, one of the most celebrated and widely used hard clustering algorithms is the classical k-means algorithm, whose basic ingredients can already be traced back to an earlier work of Steinhaus (1956). [sent-21, score-0.309]
9 Soft clustering is a relaxation of the binary strategy used in hard clustering, and allows for overlapping of the partitions. [sent-24, score-0.309]
10 As a result, the current literature abounds in approximation algorithms for the partitioning clustering problem. [sent-28, score-0.302]
11 This paper is a theoretical study of center-based clustering methods from a continuous optimization theory viewpoint. [sent-29, score-0.309]
12 One of the most basic and well-known formulations of the partitioning clustering problem consists of minimizing the sum of a finite collection of “min”-functions, which is a nonsmooth and nonconvex optimization problem. [sent-30, score-0.62]
13 Building on convex and nonlinear analysis techniques, we present a generic way to design, analyze and extend center-based clustering algorithms by replacing the nonsmooth clustering problem with a smooth optimization problem. [sent-31, score-1.079]
14 A more recent account of optimization approaches to clustering can be found in the excellent and extensive survey paper of Bagirov et al. [sent-46, score-0.309]
15 In hard clustering algorithms, several researchers have considered such an approach to extend the k-means algorithm. [sent-51, score-0.309]
16 More intensive research activities have focused on developing soft clustering algorithms. [sent-55, score-0.319]
17 Many soft clustering algorithms have emerged and have been developed from heuristic considerations, axiomatic approaches, or/and are based on statistical techniques, physical analogies, and information theoretic perspectives. [sent-57, score-0.39]
18 For example, well known soft clustering methods include the Fuzzy k-means (FKM), (see, for example, Bezdek, 1981), the Expectation Maximization algorithm (EM), (see, for example, Duda et al. [sent-58, score-0.319]
19 More recent and other soft clustering methods include for example the work of Zhang et al. [sent-62, score-0.319]
20 (1999), which proposes a clustering algorithm called k-harmonic means and relies on optimizing an objective function defined as the harmonic mean of the squared Euclidean distances from each data point to all centers. [sent-63, score-0.333]
21 This provides the basis for significantly extending the scope of center-based partitioning clustering algorithms, for bridging the gap with previous works that relied on heuristics and experimental methods, and for bringing new insights into their potential. [sent-69, score-0.302]
22 In Section 2 we begin with the standard nonsmooth nonconvex optimization formulation of the partitioning clustering problem in a finite dimensional Euclidean space. [sent-72, score-0.704]
23 In Section 3, we furnish the necessary background and known results from convex analysis, with a particular focus on two central mathematical objects: support and asymptotic functions, which will play a primary role in the forthcoming analysis of clustering problems. [sent-76, score-0.487]
24 The connection between support and asymptotic functions has been used in past optimization studies to develop a general approach to smoothing nonsmooth optimization problems (Ben-Tal and Teboulle, 1989; Auslender, 1999; Auslender and Teboulle, 2003). [sent-77, score-0.466]
25 Building on these ideas, in Section 4 we first describe an exact smoothing mechanism, which provides a very simple way to design and analyze a wide class of center-based hard clustering algorithms. [sent-78, score-0.395]
26 In turn, the support function formulation of the clustering problem also provides the starting point for developing a new and general approximate smoothing approach to clustering problems. [sent-79, score-0.65]
27 This enables us to arrive at a unified approach for the formulation and rigorous analysis of soft center-based clustering methods. [sent-83, score-0.357]
28 Finally, in Section 6, we show that all the aforementioned hard and soft center-based clustering methods, which have been proposed in the literature from different motivations and approaches can be realized as special cases of the proposed analysis, and we streamline their relationships. [sent-86, score-0.365]
29 The clustering problem consists of partitioning the data A into k subsets {A^1, ..., A^k}. [sent-121, score-0.302]
30 For each l = 1, ..., k, the cluster A^l is represented by its center (centroid) x^l, and we want to determine k cluster centers {x^1, ..., x^k} [sent-129, score-0.393]
31 such that the sum of proximity measures from each point a_i to a nearest cluster center x^l is minimized. [sent-132, score-0.545]
32 Then, the distance from each a_i ∈ A to the closest cluster center is: D(x, a_i) = min_{1≤l≤k} d(x^l, a_i). [sent-136, score-0.488]
33 Thus, assigning a positive weight ν_i to each D(x, a_i), such that ∑_{i=1}^m ν_i = 1 (for example, each ν_i can be used to model the relative importance of the point a_i), the clustering problem consists of finding the set of k centers {x^1, ..., x^k}. [sent-138, score-0.57]
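To make the formulation concrete, the following is a minimal sketch (not taken from the paper) of the nonsmooth objective F(x) = ∑_{i=1}^m ν_i min_{1≤l≤k} d(x^l, a_i), assuming d is the squared Euclidean distance; the function name and array conventions are illustrative only.

```python
import numpy as np

def clustering_objective(centers, points, nu):
    """F(x) = sum_i nu_i * min_l d(x^l, a_i), with d taken here to be the
    squared Euclidean distance (the paper allows general distance-like d).

    centers: (k, n) array of cluster centers x^1, ..., x^k
    points:  (m, n) array of data points a_1, ..., a_m
    nu:      (m,) array of positive weights summing to 1
    """
    # D[i, l] = ||a_i - x^l||^2
    D = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(np.dot(nu, D.min(axis=1)))
```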
34 Furthermore, the number of variables is n × k, and since n is typically very large, even with a moderate number of clusters k, the clustering problem yields a very large scale optimization problem to be solved. [sent-157, score-0.309]
35 2 Clustering with General Distance-Like Functions We introduce a broad class of distance-like functions d that is used to formulate the clustering problem in its general form, and we provide two generic families of such distance-like functions for clustering. [sent-162, score-0.427]
36 For a convex function g : IR^n → IR ∪ {+∞}, its effective domain is defined by dom g = {u | g(u) < +∞}, and the function is called proper if dom g ≠ ∅ and g(u) > −∞, ∀u ∈ dom g. [sent-166, score-0.54]
37 Let ϕ : IR → IR ∪ {+∞} be a proper, lsc, convex function such that dom ϕ ⊂ IR_+, dom ∂ϕ = IR_{++}, and such that ϕ is C^2, strictly convex, and nonnegative on IR_{++}, with ϕ(1) = ϕ′(1) = 0. [sent-203, score-0.377]
38 An interesting recent study can be found, for example, in the work of Modha and Spangler (2003), who considered convex k-means clustering algorithms based on some other proximity measures that are convex in the second argument. [sent-239, score-0.422]
39 More precisely, most smooth and nonsmooth optimization problems can be modelled via the following generic abstract optimization model: inf{c_0(u) + σ_Y(c(u)) | c(u) ∈ dom σ_Y}, where c(u) = (c_1(u), ...). [sent-272, score-0.544]
40 The re-formulation of optimization problems via their support functions provides an alternative way to view and tackle optimization problems by exploiting mathematical properties of support functions, and this is the line of analysis that will be used here for the clustering problem. [sent-276, score-0.379]
41 In fact, the support function provides a remarkable one-to-one correspondence between nonempty closed convex subsets of IR^k and the class of positively homogeneous lsc proper convex functions through the conjugacy operation. [sent-304, score-0.482]
42 The functions which are the support functions of non-empty convex sets are the lsc proper convex functions which are positively homogeneous. [sent-308, score-0.49]
43 The asymptotic cone of a nonempty convex set C ⊂ IR^k is a convex cone containing the origin, defined by C_∞ := {v ∈ IR^k : v + C ⊂ C}. [sent-312, score-0.413]
44 When working with convex sets and convex functions, the asymptotic cone (function) is often called the recession cone (function). [sent-324, score-0.373]
45 Let g : IR^k → IR ∪ {+∞} be a proper convex and lsc function, and let C = dom g* be the effective domain of the conjugate g*. [sent-350, score-0.423]
46 This last result connecting support and asymptotic functions, together with (5) in Proposition 6, provides the basis and motivation for developing a general approach to smoothing nonsmooth optimization problems. [sent-352, score-0.396]
47 Building on these ideas, we now develop smoothing approaches to the clustering problem. [sent-354, score-0.349]
48 Smoothing Methodologies for Clustering. We first describe a very simple exact smoothing mechanism which provides a novel and simple way to view and design all center-based hard clustering algorithms from an optimization perspective. [sent-356, score-0.441]
49 In turn, the support function formulation also provides the starting point for developing a new and general smoothing approach for clustering problems, which is based on combining asymptotic functions and the fundamental notion of nonlinear means of Hardy et al. [sent-357, score-0.529]
50 The resulting smoothing approach lends itself to devising a simple generic algorithm, which is computationally as simple as the popular k-means scheme (see Section 5), and encompasses and extends most well known soft clustering algorithms, see Section 6. [sent-359, score-0.483]
51 In the context of the clustering problem, we now briefly show that the support function allows us to derive an equivalent smooth formulation of the clustering problem, and in fact provides the foundation for the design and analysis of hard clustering algorithms. [sent-374, score-0.931]
52 The nonsmooth term min_{1≤l≤k} d(x^l, a_i) can be replaced by a smooth one, by using the support function. [sent-382, score-0.385]
53 For every i = 1, ..., m, min_{1≤l≤k} d(x^l, a_i) = −σ_{∆_i}(−d^i(x)) = min{ ⟨w^i, d^i(x)⟩ : w^i ∈ ∆_i },   (6)   where ∆_i is the unit simplex in IR^k given by ∆_i = { w^i ∈ IR^k | ∑_{l=1}^k w^i_l = 1, w^i_l ≥ 0, l = 1, ..., k }, [sent-386, score-0.438]
54 and where w^i_l is the "membership" variable associated with cluster A^l, which satisfies: w^i_l = 1 if the point a_i is closest to cluster A^l, and w^i_l = 0 otherwise. [sent-389, score-0.401]
55 Thus, substituting (6) in (NS), it follows that the nonsmooth clustering problem (NS) is equivalent to the exact smooth problem: (ES)   min_{x^1,...,x^k ∈ S} min_{w^i ∈ ∆_i} ∑_{i=1}^m ν_i ⟨w^i, d^i(x)⟩. [sent-390, score-0.56]
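As a quick illustration of the membership variables appearing in (6) and (ES), the sketch below recovers, for each data point, the minimizing vertex of the unit simplex from a precomputed distance matrix; it is a toy under the same squared-Euclidean assumption, with ties broken arbitrarily.

```python
import numpy as np

def hard_memberships(D):
    """Minimizing <w^i, d^i(x)> over the unit simplex is attained at a vertex:
    w^i_l = 1 for the closest center and 0 otherwise, so that
    <w^i, d^i(x)> = min_l d(x^l, a_i).

    D: (m, k) matrix with D[i, l] = d(x^l, a_i).
    """
    W = np.zeros_like(D, dtype=float)
    W[np.arange(D.shape[0]), D.argmin(axis=1)] = 1.0
    return W
```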
56 The smooth formulation (ES) explains precisely the mechanism of all well known, old and more recent, center-based hard clustering algorithms. [sent-400, score-0.451]
57 The algorithm HCD includes as special cases not only the popular k-means algorithm, but also many other hard clustering methods mentioned in the introduction. [sent-402, score-0.309]
58 In particular, it includes and extends the Bregman hard clustering algorithm recently derived by Banerjee et al. [sent-403, score-0.309]
59 Recall that the nonsmooth clustering problem (NS) is equivalent to the exact smooth formulation (ES). [sent-410, score-0.546]
60 Using (6), an equivalent representation of the clustering problem (ES) can also be written as (NS): min_{x^1,...,x^k ∈ S} ∑_{i=1}^m −ν_i σ_{∆_i}(−d(x^1, a_i), ..., −d(x^k, a_i)).   (7) [sent-411, score-0.403]
61 Thus, with a function g^i_s smooth enough, this approach leads to a generic smoothed approximate reformulation of the nonsmooth problem (NS) which depends on the parameter s > 0 that plays the role of a smoothing parameter. [sent-441, score-0.468]
62 Thus, using (8), the clustering problem (NS) can be approximated by solving the smoothed problem (NS)_s, which in this case reads: (NS)_s   min_{x^1,...,x^k ∈ S} F_s(x) := −s ∑_{i=1}^m ν_i log ∑_{l=1}^k exp(−d(x^l, a_i)/s). [sent-452, score-0.322]
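A minimal sketch of the smoothed objective F_s in this log-sum-exp case, again assuming squared Euclidean d; the use of scipy.special.logsumexp for numerical stability is an implementation choice of this sketch, not something prescribed by the paper.

```python
import numpy as np
from scipy.special import logsumexp

def smoothed_objective(centers, points, nu, s):
    """F_s(x) = -s * sum_i nu_i * log sum_l exp(-d(x^l, a_i)/s).

    As s -> 0 this tends to the nonsmooth objective F; d is taken to be the
    squared Euclidean distance (an assumption of this sketch).
    """
    D = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (m, k)
    return float(np.dot(nu, -s * logsumexp(-D / s, axis=1)))
```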
63 Further, when d is a Bregman function, as given in Example 3, the approximation model (NS)s yields precisely the Bregman soft clustering method recently derived by Banerjee et al. [sent-459, score-0.319]
64 As we shall see next, another natural way to smooth the clustering problem, which will later be reconciled with the asymptotic function approach, is to use the concept of nonlinear means. [sent-462, score-0.463]
65 At this juncture, it is interesting to note that Karayiannis (1999) has suggested an axiomatic approach to re-formulate clustering objective functions that essentially leads him to rediscover the notion of nonlinear means given in Hardy et al. [sent-479, score-0.415]
66 Recall that our basic clustering problem (NS) consists of minimizing the objective function F, which can be rewritten as: F(x) = ∑_{i=1}^m ν_i min_{1≤l≤k} d(x^l, a_i) = −∑_{i=1}^m ν_i max_{1≤l≤k} {−d(x^l, a_i)}. [sent-487, score-0.608]
67 Example 10 (Harmonic Mean Approximation). Consider the function h_2(t) = −t^{−1} from Example 9, with dom h_2 = int dom h_2 = (−∞, 0), which yields the harmonic mean G_{h_2}. [sent-494, score-0.303]
68 Example 11 (Geometric Mean Approximation). Take the function h_1(t) = −log(−t) given in Example 9, with dom h_1 = int dom h_1 = (−∞, 0), which yields the geometric mean G_{h_1}. [sent-501, score-0.303]
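To illustrate the quasi-arithmetic (nonlinear) means behind Examples 10 and 11, here is a small sketch of G_h(y) = h^{-1}(∑_l π_l h(y_l)) with the harmonic generator h_2 applied to y_l = −d_l; the helper names and numerical values are illustrative assumptions.

```python
import numpy as np

def nonlinear_mean(y, pi, h, h_inv):
    """Quasi-arithmetic mean G_h(y) = h^{-1}( sum_l pi_l * h(y_l) )."""
    return h_inv(np.dot(pi, h(y)))

# Harmonic-mean generator h_2(t) = -1/t on (-inf, 0), as in Example 10.
h2 = lambda t: -1.0 / t
h2_inv = lambda u: -1.0 / u

d = np.array([4.0, 1.0, 9.0])       # distances d(x^l, a_i) for one point
pi = np.full(3, 1.0 / 3)
# -G_{h_2}(-d) is the weighted harmonic mean of the d_l, a smooth
# surrogate for min_l d_l (the k-harmonic-means flavour).
surrogate = -nonlinear_mean(-d, pi, h2, h2_inv)
```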
69 This suggests an approach that would combine nonlinear means and asymptotic functions to provide a generic smoothing model which will approximate the original clustering problem (NS), and approach it in the limit, for a broad class of smooth approximations. [sent-509, score-0.689]
70 Then, G_h is convex if and only if it satisfies the gradient inequality; that is, recalling that (h^{−1})′(·) > 0, this is equivalent to saying: h^{−1}(ξ(y)) − h^{−1}(ξ(x)) ≥ ∑_{l=1}^k π_l (y_l − x_l) h′(x_l), ∀x, y ∈ Ω^k. [sent-524, score-0.401]
71 Applying Jensen's inequality for the concave function H(·, ·) at s := ξ(y), t := ξ(x), it follows that the function H is concave if and only if H(ξ(y), ξ(x)) ≥ ∑_{l=1}^k π_l H(h(y_l), h(x_l)) = ∑_{l=1}^k π_l (y_l − x_l). [sent-529, score-0.33]
72 Equipped with Lemma 10, the concept of the asymptotic nonlinear mean associated with a given convex nonlinear mean G_h is now well defined through Proposition 6. [sent-539, score-0.326]
73 These results, together with the representation (7), provide all the ingredients to combine nonlinear convex means, as characterized in Lemma 10, with asymptotic functions, and to formulate a broad class of smooth approximations to the clustering problem (NS) as follows. [sent-546, score-0.62]
74 For any h ∈ H, any fixed s > 0, and any π ∈ ∆_+, we approximate the nonsmooth objective F of the original clustering problem (NS) by the smooth function: F_s(x^1, ..., x^k) = −s ∑_{i=1}^m ν_i h^{−1}( ∑_{l=1}^k π_l h(−d(x^l, a_i)/s) ). [sent-547, score-0.547]
75 Motivation: Given d ∈ D(S), h ∈ H and any s > 0, to solve the clustering problem we consider a solution method that solves the approximate smoothed minimization problem inf{ F_s(x) | x ∈ S },   (14)   where F_s(x) = −s ∑_{i=1}^m ν_i h^{−1}( ∑_{l=1}^k π_l h(−d(x^l, a_i)/s) ). [sent-562, score-0.465]
76 For each l = 1, ..., k, ∇_l F_s(x) = π_l ∑_{i=1}^m ν_i (h^{−1})′( ∑_{j=1}^k π_j h(−d(x^j, a_i)/s) ) · h′(−d(x^l, a_i)/s) · ∇_1 d(x^l, a_i),   (19)   where ∇_1 d(x^l, a_i) is the gradient of d with respect to the first variable x^l. [sent-598, score-0.842]
77 For l = 1, ..., k, defining λ_il(x) := ν_i ρ_il(x) · ( ∑_{j=1}^m ν_j ρ_jl(x) )^{−1},   (22)   one has λ_il(x) > 0 and ∑_{i=1}^m λ_il(x) = 1, and (21) reduces to x^l = ∑_{i=1}^m λ_il(x) a_i,  l = 1, ..., k. [sent-613, score-0.52]
78 In fact, the fixed point iteration (24) obtained from solving (20) reads equivalently as: x^l(t + 1) = argmin_{x^l} ∑_{i=1}^m ν_i ρ_il(x(t)) d(x^l, a_i),   l = 1, ..., k. [sent-625, score-0.739]
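A sketch of one such fixed-point step for the log-sum-exp instance with squared Euclidean d, where ρ_il is a normalized exponential weight and each center is updated as the weighted mean x^l = ∑_i λ_il a_i; the specific choice of ρ and the parameter names are assumptions of this sketch rather than the paper's only instance.

```python
import numpy as np

def skm_step(centers, points, nu, pi, s):
    """One soft-clustering fixed-point step: form weights rho_il proportional
    to pi_l * exp(-d(x^l, a_i)/s), normalize over l, then move each center to
    the lambda-weighted mean of the data points.
    """
    D = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (m, k)
    logits = -D / s + np.log(pi)[None, :]
    logits -= logits.max(axis=1, keepdims=True)      # numerical stabilization
    rho = np.exp(logits)
    rho /= rho.sum(axis=1, keepdims=True)            # rho_il, rows sum to 1
    lam = nu[:, None] * rho                          # nu_i * rho_il
    lam /= lam.sum(axis=0, keepdims=True)            # lambda_il, columns sum to 1
    return lam.T @ points                            # new centers, shape (k, n)
```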
79 Since by definition ν_i > 0 and ρ_il(u) > 0, ∀i, l, and since by (d2) of Definition 1, for each i ∈ [1, m], one has dom ∂_1 d(·, a_i) = S, a nonempty open convex set, it follows that y(u) ∈ S and N_S(y^l(u)) = {0}, and hence (27) reduces to the desired equation (26). [sent-666, score-0.428]
80 (a) The computational complexity of SKM is comparable to that of the standard k-means algorithm, which makes SKM viable for large scale clustering problems and allows us to significantly extend the scope of soft center-based iterative clustering methods. [sent-704, score-0.582]
81 Since d(·, a_i) ∈ D(S) is given C^2 on S, with positive definite Hessian matrix ∇²_1 d(·, a_i), and since ν_i ρ_il(x(t)) > 0, it follows that ∑_{i=1}^m ν_i ∇²_1 d(·, a_i) ρ_il(x(t)) is also a positive definite matrix. [sent-763, score-0.42]
82 With w = (w^1, ..., w^m) ∈ ∆ := ∆_1 × ... × ∆_m, define C_1(x, w) := ∑_{i=1}^m ∑_{l=1}^k ν_i w^i_l d(x^l, a_i), while F(x) = ∑_{i=1}^m ν_i min_{1≤l≤k} d(x^l, a_i). [sent-783, score-0.365]
83 Below, we further exemplify point (ii) by briefly showing how the methods mentioned in the introduction, such as Maximum Entropy Clustering Algorithms (MECA), Expectation Maximization (EM), and the Bregman soft clustering algorithm, are essentially equivalent smoothing methods. [sent-823, score-0.405]
84 Moreover, with g(z) = log ∑_{l=1}^k π_l e^{z_l}, one has g*(y) = ∑_{l=1}^k y_l log(y_l / π_l), with dom g* = ∆. [sent-828, score-0.305]
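As a sanity check on this conjugacy relation, the following sketch compares a numerically computed value of the conjugate with the closed form; the sample values of π and y, and the use of scipy.optimize, are illustrative assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

pi = np.array([0.2, 0.5, 0.3])      # weights in the log-sum-exp
y = np.array([0.1, 0.6, 0.3])       # a point of the unit simplex

g = lambda z: np.log(np.dot(pi, np.exp(z)))
neg = lambda z: -(np.dot(y, z) - g(z))           # negative of <y, z> - g(z)
res = minimize(neg, x0=np.zeros(3))

numerical = -res.fun                             # sup_z <y, z> - g(z)
closed_form = np.sum(y * np.log(y / pi))         # relative entropy formula
print(numerical, closed_form)                    # the two values should agree closely
```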
85 Using the dual representation of the log-sum exponential function given in Lemma 18 in the objective function of (40), some algebra shows that the smooth optimization problem (40) is equivalent to: min{ C_1(x, w) + s ∑_{i=1}^m ∑_{l=1}^k ν_i w^i_l log(w^i_l / π_l)  |  w^i ∈ ∆_i, i = 1, ..., m }. [sent-832, score-0.346]
86 Of course, this shows that maximum entropy methods applied to the clustering problem are a special case of our smoothing approach. [sent-838, score-0.373]
87 The latter problem is a convex optimization problem in w (a probabilistic/soft assignment variable) that can be solved via the so-called Blahut-Arimoto algorithm (Berger, 1971), which is an iterative fixed-point type convex dual optimization method. [sent-865, score-0.33]
88 ρ_il(x̄) = π_l e^{−d(x̄^l, a_i)} / ∑_{j=1}^k π_j e^{−d(x̄^j, a_i)}.   (44)   Now, the second step in SKM consists of solving min_{x ∈ S} ∑_{i=1}^m ∑_{l=1}^k ν_i d(x^l, a_i) ρ_il(x̄), which admits a unique global minimizer (cf. [sent-878, score-0.322]
89 Therefore, with equations (44), (45) and (47), we have recovered, through a special realization of SKM, the EM algorithm for learning mixture models of exponential family distributions, or equivalently the Bregman soft clustering method, as recently derived in Banerjee et al. [sent-890, score-0.319]
90 The above comparisons were just briefly outlined to demonstrate the fundamental and useful role that convex analysis and optimization theory can play in the analysis and interpretation of iterative center-based clustering algorithms. [sent-894, score-0.428]
91 First, the proposed optimization framework and formalism provide a systematic and simple way to design and analyze a broad class of hard and soft center-based clustering algorithms, which retain the computational simplicity of the k-means algorithm. [sent-905, score-0.472]
92 Similarly, can we characterize the classes of functions h or/and the classes of distance-like functions d that would allow us to eliminate or/and control the inherent nonconvexity difficulty which is present in the clustering problem? [sent-916, score-0.311]
93 Can we rigorously measure the quality of clustering produced by the generic scheme, or some other possible refined variants, in terms of the problem data and the couple (h, d)? [sent-917, score-0.341]
94 In this appendix we briefly describe the basic mechanism of hard clustering center-based algorithms. [sent-923, score-0.309]
95 Hard Clustering with Distance-Like Functions. The popular k-means algorithm, with d being the squared Euclidean distance, is nothing else but the Gauss-Seidel method applied to the exact smooth formulation (ES) of the clustering problem. [sent-931, score-0.39]
96 Note that the main computational step in the generic HCD algorithm keeps the computational simplicity of the k-means algorithm, yet significantly expands the scope of center-based hard clustering methods. [sent-934, score-0.387]
97 Then, an optimal w^i is simply given by w^i_{l(i)}(t) = 1, where l(i) ∈ argmin_{1≤l≤k} d(x^l(t), a_i), that is, x^{l(i)} is the center closest to a_i, and w^i_l(t) = 0, ∀l ≠ l(i). [sent-964, score-0.599]
98 To solve step 2, noting that the objective is separable in each variable x^l, it reduces to solving, for each x^l: x^l(t + 1) = argmin_x ∑_{i=1}^m w^i_l(t) d(x, a_i). [sent-965, score-0.804]
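With d the squared Euclidean distance, the two HCD steps are exactly the k-means update; a minimal sketch follows (leaving empty clusters in place is one possible convention, assumed here).

```python
import numpy as np

def hcd_step(centers, points, nu):
    """One HCD (k-means) iteration: step 1 hard-assigns each a_i to its closest
    center; step 2 solves the separable subproblems, whose minimizer for
    squared Euclidean d is the nu-weighted mean of the assigned points.
    """
    D = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = D.argmin(axis=1)
    new_centers = centers.copy()
    for l in range(centers.shape[0]):
        mask = labels == l
        if mask.any():
            w = nu[mask]
            new_centers[l] = (w[:, None] * points[mask]).sum(axis=0) / w.sum()
    return new_centers
```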
99 It is easy to see that algorithm HCD includes as special cases not only the popular k-means algorithm, but also many other hard clustering methods mentioned in the introduction. [sent-970, score-0.309]
100 In particular, it includes and extends the Bregman hard clustering algorithm recently derived in Banerjee et al. [sent-971, score-0.309]
wordName wordTfidf (topN-words)
[('irk', 0.42), ('xl', 0.282), ('clustering', 0.263), ('teboulle', 0.233), ('irn', 0.224), ('nonsmooth', 0.187), ('ns', 0.16), ('skm', 0.156), ('gh', 0.155), ('ir', 0.148), ('eboulle', 0.148), ('bregman', 0.145), ('fs', 0.142), ('nified', 0.14), ('ptimization', 0.14), ('ai', 0.14), ('dom', 0.129), ('il', 0.119), ('convex', 0.119), ('ontinuous', 0.118), ('ramework', 0.118), ('auslender', 0.117), ('lsc', 0.117), ('lustering', 0.09), ('yl', 0.088), ('smoothing', 0.086), ('zl', 0.08), ('generic', 0.078), ('irm', 0.078), ('asymptotic', 0.077), ('annealing', 0.07), ('hcd', 0.07), ('nonlinear', 0.065), ('rose', 0.065), ('da', 0.062), ('wi', 0.059), ('smooth', 0.058), ('soft', 0.056), ('banerjee', 0.056), ('euclidean', 0.052), ('nonconvex', 0.047), ('emerged', 0.047), ('vti', 0.047), ('hardy', 0.046), ('hard', 0.046), ('optimization', 0.046), ('int', 0.045), ('bertsekas', 0.044), ('cluster', 0.042), ('xk', 0.041), ('rockafellar', 0.04), ('nonempty', 0.04), ('proximity', 0.04), ('partitioning', 0.039), ('aij', 0.039), ('fkm', 0.039), ('meca', 0.039), ('vtil', 0.039), ('objective', 0.039), ('formulation', 0.038), ('broad', 0.038), ('divergences', 0.038), ('elkan', 0.038), ('minimizer', 0.037), ('argmin', 0.035), ('proper', 0.034), ('pu', 0.033), ('wil', 0.033), ('smoothed', 0.033), ('fuzzy', 0.032), ('squared', 0.031), ('bagirov', 0.031), ('hal', 0.031), ('hamerly', 0.031), ('logt', 0.031), ('censor', 0.029), ('positively', 0.029), ('simplex', 0.029), ('cone', 0.029), ('deterministic', 0.029), ('inf', 0.029), ('forthcoming', 0.028), ('centers', 0.027), ('proposition', 0.027), ('separable', 0.026), ('lemma', 0.026), ('min', 0.026), ('quantization', 0.025), ('lim', 0.025), ('entropy', 0.024), ('functions', 0.024), ('conjugate', 0.024), ('concave', 0.024), ('axiomatic', 0.024), ('ueda', 0.024), ('distortion', 0.023), ('formalism', 0.023), ('inequality', 0.023), ('alluded', 0.023), ('centerbased', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
Author: Marc Teboulle
Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods
2 0.15806973 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
Author: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, Dharmendra S. Modha
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation—every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a varic 2007 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra Modha. BANERJEE , D HILLON , G HOSH , M ERUGU AND M ODHA ety of problem domains and also describe novel co-clustering applications such as missing value prediction and
3 0.11420219 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
Author: Wei Pan, Xiaotong Shen
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage
4 0.10828663 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
5 0.091867313 70 jmlr-2007-Ranking the Best Instances
Author: Stéphan Clémençon, Nicolas Vayatis
Abstract: We formulate a local form of the bipartite ranking problem where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. Eventually, we state preliminary statistical results for the local ranking problem. Keywords: ranking, ROC curve and AUC, empirical risk minimization, fast rates
6 0.07799679 90 jmlr-2007-Value Regularization and Fenchel Duality
7 0.048535097 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
8 0.04660039 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
9 0.044171896 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
10 0.043795209 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling
11 0.041432597 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
12 0.039276686 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
13 0.039001022 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
14 0.038165167 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
15 0.037755486 23 jmlr-2007-Concave Learners for Rankboost
17 0.034830309 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
18 0.034786664 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
19 0.033446707 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
20 0.03287011 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
topicId topicWeight
[(0, 0.222), (1, 0.001), (2, -0.002), (3, 0.102), (4, 0.06), (5, -0.307), (6, -0.043), (7, 0.234), (8, 0.172), (9, -0.071), (10, -0.024), (11, -0.001), (12, -0.109), (13, 0.115), (14, -0.286), (15, 0.091), (16, -0.126), (17, 0.042), (18, -0.044), (19, 0.158), (20, -0.122), (21, -0.036), (22, -0.103), (23, 0.132), (24, -0.033), (25, 0.049), (26, -0.056), (27, -0.054), (28, -0.16), (29, -0.035), (30, 0.032), (31, -0.055), (32, -0.075), (33, 0.001), (34, 0.054), (35, 0.045), (36, 0.107), (37, 0.07), (38, 0.051), (39, 0.014), (40, -0.108), (41, 0.055), (42, -0.014), (43, -0.056), (44, 0.02), (45, 0.042), (46, -0.042), (47, -0.003), (48, -0.097), (49, 0.099)]
simIndex simValue paperId paperTitle
same-paper 1 0.96031767 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
Author: Marc Teboulle
Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods
2 0.75862795 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
Author: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, Dharmendra S. Modha
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation—every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a varic 2007 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra Modha. BANERJEE , D HILLON , G HOSH , M ERUGU AND M ODHA ety of problem domains and also describe novel co-clustering applications such as missing value prediction and
3 0.43965361 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
4 0.42936173 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
Author: Wei Pan, Xiaotong Shen
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of modelbased clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method. Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage
5 0.37594184 70 jmlr-2007-Ranking the Best Instances
Author: Stéphan Clémençon, Nicolas Vayatis
Abstract: We formulate a local form of the bipartite ranking problem where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. Eventually, we state preliminary statistical results for the local ranking problem. Keywords: ranking, ROC curve and AUC, empirical risk minimization, fast rates
6 0.34567842 90 jmlr-2007-Value Regularization and Fenchel Duality
7 0.22397628 23 jmlr-2007-Concave Learners for Rankboost
9 0.20874117 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
10 0.18064904 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
11 0.17578988 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
12 0.17154203 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
13 0.16759063 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
14 0.16750781 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
15 0.16255771 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
16 0.16253537 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
17 0.16027087 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
18 0.1584162 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
19 0.15644386 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
20 0.14107816 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
topicId topicWeight
[(8, 0.03), (10, 0.517), (12, 0.026), (15, 0.012), (26, 0.021), (28, 0.043), (40, 0.058), (45, 0.012), (48, 0.021), (60, 0.039), (85, 0.035), (98, 0.092)]
simIndex simValue paperId paperTitle
1 0.91970825 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
Author: Marlon Núñez, Raúl Fidalgo, Rafael Morales
Abstract: In the process of concept learning, target concepts may have portions with short-term changes, other portions may support long-term changes, and yet others may not change at all. For this reason several local windows need to be handled. We suggest facing this problem, which naturally exists in the field of concept learning, by allocating windows which can adapt their size to portions of the target concept. We propose an incremental decision tree that is updated with incoming examples. Each leaf of the decision tree holds a time window and a local performance measure as the main parameter to be controlled. When the performance of a leaf decreases, the size of its local window is reduced. This learning algorithm, called OnlineTree2, automatically adjusts its internal parameters in order to face the current dynamics of the data stream. Results show that it is comparable to other batch algorithms when facing problems with no concept change, and it is better than evaluated methods in its ability to deal with concept drift when dealing with problems in which: concept change occurs at different speeds, noise may be present and, examples may arrive from different areas of the problem domain (virtual drift). Keywords: incremental algorithms, online learning, concept drift, decision trees, robust learners
same-paper 2 0.86882609 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
Author: Marc Teboulle
Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods
3 0.48864809 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
Author: J. Zico Kolter, Marcus A. Maloof
Abstract: We present an ensemble method for concept drift that dynamically creates and removes weighted experts in response to changes in performance. The method, dynamic weighted majority (DWM), uses four mechanisms to cope with concept drift: It trains online learners of the ensemble, it weights those learners based on their performance, it removes them, also based on their performance, and it adds new experts based on the global performance of the ensemble. After an extensive evaluation— consisting of five experiments, eight learners, and thirty data sets that varied in type of target concept, size, presence of noise, and the like—we concluded that DWM outperformed other learners that only incrementally learn concept descriptions, that maintain and use previously encountered examples, and that employ an unweighted, fixed-size ensemble of experts. Keywords: concept learning, online learning, ensemble methods, concept drift
4 0.39380053 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
5 0.38392368 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
Author: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, Dharmendra S. Modha
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation—every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a varic 2007 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra Modha. BANERJEE , D HILLON , G HOSH , M ERUGU AND M ODHA ety of problem domains and also describe novel co-clustering applications such as missing value prediction and
6 0.37366903 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
7 0.35832834 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
8 0.35713953 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
11 0.35391057 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
12 0.34131563 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
13 0.34096462 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
14 0.33027583 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
15 0.32989091 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
16 0.32872704 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
17 0.32749498 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
18 0.32151639 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
19 0.31798279 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
20 0.31796482 51 jmlr-2007-Loop Corrections for Approximate Inference on Factor Graphs