jmlr jmlr2011 jmlr2011-74 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
Reference: text
sentIndex sentText sentNum sentScore
1 A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. [sent-6, score-0.288]
2 Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. [sent-10, score-0.237]
3 Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components. [sent-12, score-0.204]
4 The class of spectral clustering algorithms is presently emerging as a promising alternative, showing improved performance over classical clustering algorithms on several benchmark problems. [sent-19, score-0.259]
5 An overview of spectral clustering algorithms may be found in von Luxburg (2007), and connections with kernel methods are exposed in Filippone et al. [sent-24, score-0.221]
6 The spectral clustering algorithm amounts to embedding the data into a feature space by using the eigenvectors of the similarity matrix in such a way that the clusters may be separated using simple rules, for example, a separation by hyperplanes. [sent-26, score-0.241]
7 The core component of the spectral clustering algorithm is therefore the similarity matrix, or certain normalizations of it, generally called graph Laplacian matrices; see Chung (1997). [sent-27, score-0.198]
8 In the context of spectral clustering, the convergence of the empirical graph Laplacian operators has been established in von Luxburg et al. [sent-34, score-0.214]
9 (2000, 2001), and in the related work by Azzalini and Torelli (2007), clustering is performed by estimating the connected components of L (t). [sent-54, score-0.209]
10 In the present paper, we adopt the definition of a cluster of Hartigan (1975), and we propose and study a spectral clustering algorithm on estimated level sets. [sent-57, score-0.199]
11 In the second step of the algorithm, we perform a spectral clustering of the extracted points. [sent-64, score-0.171]
12 For the spectral clustering part of the algorithm, we consider the setting where the kernel function, or similarity function, between any pair of observations is nonnegative and has compact support of diameter 2h, for some fixed positive real number h. [sent-68, score-0.238]
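To make this assumption concrete, here is one admissible choice of similarity profile: the paper only requires k to be nonnegative, of class C^2 and compactly supported, so that the rescaled kernel k_h(u) := k(u/h) has support hB (diameter 2h). The profile and the Python helpers below are an illustrative sketch of ours, not taken from the paper.

```python
import numpy as np

def k(u):
    """An assumed C^2, nonnegative profile supported on the closed unit ball:
    k(u) = (1 - |u|^2)^3 if |u| <= 1, and 0 otherwise."""
    r2 = np.sum(np.square(u), axis=-1)
    return np.where(r2 <= 1.0, (1.0 - r2) ** 3, 0.0)

def k_h(u, h):
    """Rescaled similarity k_h(u) = k(u / h): two points farther apart than h
    have similarity exactly zero."""
    return k(np.asarray(u, dtype=float) / h)
```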
13 This operator norm convergence is more amenable than the slightly weaker notion of convergence established in von Luxburg et al. [sent-76, score-0.212]
14 In the second set of results, we study the convergence of the spectrum of the empirical operator, as a corollary of the operator norm convergence. [sent-82, score-0.186]
15 As a consequence, in the asymptotic regime, any reasonable clustering algorithm applied on the transformed data partitions the observations according to the connected components of the level set. [sent-86, score-0.209]
16 Then, asymptotically, when the scale parameter is lower than the minimal distance between the connected components of L (t), this random walk cannot jump from one connected component to another. [sent-92, score-0.23]
17 Next, by exploiting the continuity of the operators in the scale parameter h, we obtain similar consistency results when h is slightly greater than the minimal distance between two connected components of L(t). [sent-93, score-0.2]
18 The special limit case t = 0 corresponds to performing a clustering on all the observations, and our results imply the convergence of the clustering to the partition of the support of the density into its connected components, for a suitable choice of the scale parameter. [sent-98, score-0.342]
19 At last, we obtain consistency in the sense of Hartigan’s definition when the correct number of clusters is requested, which corresponds to the number of connected components of L (t), and when the similarity function has a compact support . [sent-102, score-0.212]
20 Then we define the spectral clustering algorithm on estimated level sets, and we follow by introducing the functional operators associated with the algorithm. [sent-106, score-0.248]
21 We start by studying the properties of the limit operator in the case where the scale parameter h is lower than the minimal distance between two connected components of L (t). [sent-110, score-0.19]
22 Spectral Clustering Algorithm. In this section we give a description of the spectral clustering algorithm on level sets that is suitable for our theoretical analysis. [sent-117, score-0.171]
23 The minimal distance between the connected components of L(t) is denoted by d_min, that is, d_min := inf_{i ≠ j} dist(C_i, C_j) (1). [sent-135, score-0.381]
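For intuition only, the quantity d_min can be mimicked on a finite sample once the components are known; the helper below (our own sketch, assuming numpy and scipy) returns the smallest gap between labelled groups of points, whereas the d_min of Equation (1) is defined on the true level set L(t).

```python
import numpy as np
from scipy.spatial.distance import cdist

def minimal_gap(components):
    """Empirical stand-in for d_min = inf_{i != j} dist(C_i, C_j):
    smallest pairwise distance between distinct groups of points.
    `components` is a list of arrays of shape (n_k, d), one per component."""
    gaps = [cdist(components[i], components[j]).min()
            for i in range(len(components))
            for j in range(i + 1, len(components))]
    return min(gaps)
```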
24 We denote by k_h : R^d → R_+ the map defined by k_h(u) := k(u/h). [sent-152, score-0.963]
25 2.2 Algorithm. The first ingredient of our algorithm is the similarity matrix K_{n,h} whose elements are given by K_{n,h}(i, j) := k_h(X_j − X_i), and where the integers i and j range over the random set J(n). [sent-154, score-0.495]
26 Hence K_{n,h} is a random matrix indexed by J(n) × J(n), whose values depend on the function k_h, and on the observations X_j lying in the estimated level set L_n(t). [sent-155, score-0.468]
27 The spectral clustering algorithm is based on the matrix Q_{n,h} defined by Q_{n,h} := D^{−1} K_{n,h}. [sent-158, score-0.171]
28 To implement the spectral clustering algorithm, the data points of the partitioning problem are first embedded into R^ℓ by using the eigenvectors of Q_{n,h} associated with the ℓ largest eigenvalues, namely λ_{n,1}, λ_{n,2}, . . . [sent-167, score-0.214]
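Putting the two steps together, a minimal end-to-end sketch could look as follows. It assumes numpy and scikit-learn and reuses the k_h helper sketched above; the crude kernel density estimate, the dense eigendecomposition and the final k-means step are our own implementation choices, not prescriptions of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def level_set_spectral_clustering(X, t, h, n_clusters):
    """Two-step sketch: (1) keep the observations whose estimated density
    exceeds t, (2) run spectral clustering on them with the compactly
    supported similarity k_h (defined in the earlier snippet)."""
    n, d = X.shape
    # Step 1: crude kernel density estimate and extraction of L_n(t).
    diffs = X[:, None, :] - X[None, :, :]        # pairwise differences X_j - X_i
    K_all = k_h(diffs, h)                        # similarities k_h(X_j - X_i)
    density = K_all.sum(axis=1) / (n * h ** d)   # unnormalized kernel density estimate
    J = np.flatnonzero(density > t)              # indices of the retained points
    # Step 2: row-normalized similarity matrix Q_{n,h} = D^{-1} K_{n,h}.
    K = K_all[np.ix_(J, J)]
    Q = K / K.sum(axis=1, keepdims=True)
    # Embed via the eigenvectors of the n_clusters largest eigenvalues, then cluster.
    eigvals, eigvecs = np.linalg.eig(Q)
    top = np.argsort(eigvals.real)[::-1][:n_clusters]
    embedding = eigvecs[:, top].real
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
    return J, labels
```

Since the threshold t is compared here to an unnormalized estimate, it would have to be calibrated accordingly in practice.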
29 In this equation, P_t^n is the discrete random probability measure given by P_t^n := (1/j(n)) ∑_{j∈J(n)} δ_{X_j}, and q_{n,h}(x, y) := k_h(y − x) / K_{n,h}(x), where K_{n,h}(x) := ∫_{L_n(t)} k_h(y − x) P_t^n(dy). [sent-189, score-0.936]
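Spelled out on the sample, these definitions combine into a kernel-weighted average, (Q_{n,h} g)(x) = ∑_j k_h(X_j − x) g(X_j) / ∑_j k_h(X_j − x), over the retained points. The snippet below (our own illustration, reusing k_h from above) evaluates this empirical operator at a point.

```python
import numpy as np

def apply_Q_nh(g_values, X_level, x, h):
    """Evaluate (Q_{n,h} g)(x) given the values g(X_j) on the retained points
    X_level.  Assumes x is within distance h of at least one retained point,
    so that the denominator K_{n,h}(x) is nonzero."""
    w = k_h(X_level - np.asarray(x, dtype=float), h)   # weights k_h(X_j - x)
    return np.dot(w, g_values) / w.sum()
```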
30 Using this relation, asymptotic properties of the spectral clustering algorithm may be deduced from the limit behavior of the sequence of operators {Qn,h }n . [sent-192, score-0.226]
31 By definition of q_{n,h}, setting y = ϕ_n(x), we have ∑_{j∈J(n)} V_j k_h(y − X_j) / K_{n,h}(y) = 0 for all y ∈ L(t − ε_n). [sent-223, score-0.468]
32 Since the support of k_h is hB, the support of the function K_{n,h} is equal to ∪_{j∈J(n)} (X_j + hB), and since k_h is positive, it follows that V_j = 0 for all j in J(n). [sent-224, score-0.936]
33 Observe that for all j in J(n), g(ϕ_n^{−1}(X_j)) = (1/(λ j(n))) ∑_{j′∈J(n)} q_{n,h}(X_j, X_{j′}) V_{j′} (by definition of g) = (1/(λ j(n))) ∑_{j′∈J(n)} k_h(X_{j′} − X_j) V_{j′} / K_{n,h}(X_j) (by definition of K_{n,h} and q_{n,h}) = (1/λ) (Q_{n,h} V)_j = V_j (since V is an eigenvector). [sent-235, score-0.468]
34 The first term in (9) is bounded uniformly by |R_n g(x) − S_n g(x)| ≤ |n/j(n) − 1/µ(L(t))| ‖r‖_∞ ‖g‖_∞, and since j(n)/n tends to µ(L(t)) almost surely as n → ∞, we conclude that sup{ ‖R_n g − S_n g‖_∞ : ‖g‖_W ≤ 1 } → 0 a.s. [sent-268, score-0.186]
35 Third, I_1(x, g) ≤ sup_{x∈L(t)} |g(ϕ_n^{−1}(x)) − g(x)| ≤ ‖D_x g‖_∞ sup_{x∈L(t)} |ϕ_n^{−1}(x) − x| ≤ ‖D_x g‖_∞ sup_{x∈L(t)} |x − ϕ_n(x)| → 0 (15) as n → ∞ by Lemma 17. [sent-276, score-0.18]
36 Proof We will prove that, as n → ∞, almost surely, sup{ ‖Q_{n,h} g − Q_h g‖_∞ : ‖g‖_W ≤ 1 } → 0 (20) and sup{ ‖D_x Q_{n,h} g − D_x Q_h g‖_∞ : ‖g‖_W ≤ 1 } → 0. To this aim, we introduce the operator Q_{n,h} acting on W(L(t)) as Q_{n,h} g(x) = ∫_{L_n(t)} q_h(ϕ_n(x), y) g(ϕ_n^{−1}(y)) P_t^n(dy). [sent-292, score-0.885]
37 First, by Lemma 14, the function r = q_h satisfies the condition in Proposition 3, so that sup{ ‖Q_{n,h} g − Q_h g‖_∞ : ‖g‖_W ≤ 1 } → 0 (23) with probability one as n → ∞. [sent-294, score-0.699]
38 Next, since ‖q_h‖_∞ < ∞ by Lemma 14, there exists a finite constant C_h such that ‖Q_{n,h} g‖_∞ ≤ C_h for all n and all g with ‖g‖_W ≤ 1 (24). [sent-295, score-0.639]
39 By definition of q_{n,h}, for all x, y in the level set L(t), we have q_{n,h}(x, y) = [K_h(x) / K_{n,h}(x)] q_h(x, y). [sent-296, score-0.639]
40 By Lemma 14, the map r : (x, y) → Dx qh (x, y) satisfies the conditions in Proposition 3. [sent-304, score-0.666]
41 Thus, ‖R_n g − R g‖_∞ converges to 0 almost surely, uniformly over g in the unit ball of W(L(t)), and we deduce that sup{ ‖D_x Q_{n,h} g − D_x Q_h g‖_∞ : ‖g‖_W ≤ 1 } → 0 a.s. [sent-305, score-0.159]
42 On the one hand, we have D_x q_{n,h}(ϕ_n(x), y) = [K_h(ϕ_n(x)) / K_{n,h}(ϕ_n(x))] D_x ϕ_n(x) (D_x q_h)(ϕ_n(x), y) + [D_x K_{n,h}(ϕ_n(x)) / K_{n,h}(ϕ_n(x))] q_h(ϕ_n(x), y). [sent-309, score-1.278]
43 Hence, D_x Q_{n,h} g(x) = [K_h(ϕ_n(x)) / K_{n,h}(ϕ_n(x))] D_x ϕ_n(x) R_n g(x) + [D_x K_{n,h}(ϕ_n(x)) / K_{n,h}(ϕ_n(x))] Q_{n,h} g(x). On the other hand, since D_x q_h(ϕ_n(x), y) [sent-310, score-0.639]
44 = D_x ϕ_n(x) (D_x q_h)(ϕ_n(x), y), we have D_x Q_{n,h} g(x) = D_x ϕ_n(x) R_n g(x). [sent-311, score-0.639]
45 Consistency of the Algorithm. The consistency of the algorithm relies on the operator norm convergence of Q_{n,h} to the limit operator Q_h (Theorem 4), on the spectral properties of Q_h stated below in Section 4.1, and on the results collected in Appendix B on the perturbation theory of linear operators. [sent-319, score-0.312]
46 The starting point is the fact that, provided that h < d_min, the connected components of the level set L(t) are the recurrent classes of the Markov chain whose transitions are defined by Q_h. [sent-320, score-0.357]
47 Hence Qh defines the desired clustering via its eigenspace corresponding to the eigenvalue 1, since this latter is spanned by the characteristic functions of the connected components of L (t), as stated in Proposition 6 below. [sent-322, score-0.209]
48 In Section 4.2, the consistency of the clustering is obtained in Theorem 7 in the case where the scale parameter h is lower than d_min defined in (1), which is the minimum distance between any two connected components of L(t). [sent-324, score-0.363]
49 In Section 4.3, h is allowed to be larger than d_min, up to a value depending only on the underlying density f. [sent-326, score-0.159]
50 4.1 Properties of the Limit Operator Q_h When h < d_min. The transition kernel q_h(x, dy) := q_h(x, y) µ_t(dy) associated with the operator Q_h defines a Markov chain with state space L(t), which is not countable. [sent-328, score-1.594]
51 The properties of the Markov chain with transition kernel qh (x, dy) are stated in Proposition 5 below. [sent-331, score-0.734]
52 , Cℓ and that dmin , defined in (1), is the minimal distance between the connected components of L (t). [sent-335, score-0.251]
53 Proposition 5 Consider the Markov chain with state space L (t) and transition kernel qh (x, dy), and assume that h < dmin . [sent-336, score-0.886]
54 When started at a point x in some connected component of the state space, the chain evolves within this connected component only. [sent-340, score-0.26]
55 When the state space is reduced to some connected component of L (t), the chain is open set irreducible and positive Harris recurrent. [sent-342, score-0.213]
56 When the state space is reduced to some connected component C_k of L(t), the Markov chain has a unique invariant distribution ν_k(dy) and, for all x ∈ C_k, the sequence of distributions {q_h^n(x, dy)}_{n∈N} over C_k converges in total variation to ν_k(dy). [sent-344, score-0.265]
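On the finite sample, the analogue of this chain is the random walk driven by the row-stochastic matrix Q_{n,h}; the short simulation below (our own sketch, assuming numpy) makes the statement concrete: when h < d_min, the rows put no mass across components, so a trajectory started in one component stays there.

```python
import numpy as np

def simulate_walk(Q, start, n_steps, seed=0):
    """Simulate the random walk with row-stochastic transition matrix Q
    (for instance Q_{n,h}); returns the sequence of visited indices."""
    rng = np.random.default_rng(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(rng.choice(Q.shape[0], p=Q[path[-1]]))
    return np.array(path)
```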
57 Proof Denote by {ξn } the Markov chain with transition kernel qh (x, dy). [sent-345, score-0.734]
58 For all x ∈ L(t), the distribution q_h(x, dy) = q_h(x, y) µ_t(dy) is absolutely continuous with respect to the Lebesgue measure, with density y → f_h(x, y) defined by f_h(x, y) = q_h(x, y) f(y) 1_{L(t)}(y) / ∫_{y′∈L(t)} f(y′) dy′. [sent-346, score-2.098]
59 Since the similarity function k_h and the density f are both continuous, the map (x, y) → f_h(x, y) is continuous. [sent-347, score-0.627]
60 Since the similarity function k_h is continuous, with compact support hB, the map x → Q_h g(x) = ∫_{L(t)} q_h(x, dy) g(y) is continuous for every bounded, measurable function g. [sent-351, score-1.201]
61 Now we have to prove that the chain is topologically aperiodic, that is, that q_h^n(x, x + ηB) > 0 for each x ∈ L(t), for all n ≥ 1 and η > 0, where q_h^n(x, ·) is the distribution of ξ_n conditioned on ξ_0 = x. [sent-353, score-0.222]
62 Since the distribution q_h^n(x, ·) admits a continuous density f_h^n(x, ·), it is enough to prove that f_h^n(x, x) > 0. [sent-354, score-0.163]
63 By induction over n, using (30), f_h^n(x, x) > 0 and the chain is topologically aperiodic. [sent-356, score-0.182]
64 Whence, P_x(ξ_1 ∈ C_1) = q_h(x, C_1) = ∫_{C_1} q_h(x, y) µ_t(dy) = ∫_{L(t)} q_h(x, y) µ_t(dy) = 1. [sent-361, score-1.917]
65 qh (xN−1 , y) > 0 which proves that the chain is open set irreducible. [sent-392, score-0.734]
66 Therefore kh (y − x) = kh (x − y) which yields Kh (x)qh (x, dy)µt (dx) = Kh (y)qh (y, dx)µt (dy). [sent-395, score-0.936]
67 There exists a sequence {ξn }n of invertible linear transformations of Rℓ such that, for all j ∈ J(∞), ξn ρn (X j ) converges almost surely to ek( j) , where ek( j) is the vector of Rℓ whose components are all 0 except the k( j)th component equal to 1. [sent-427, score-0.159]
68 Remark 8 The last step of the spectral clustering algorithm consists in partitioning the transformed data in the feature space, which can be performed by a standard clustering algorithm, like the k-means algorithm or a hierarchical clustering. [sent-483, score-0.259]
69 Hence, splitting the graph into its connected components leads to the desired clustering as well. [sent-510, score-0.209]
70 But Theorem 7, by giving the asymptotic representation of the data when embedded in the feature space Rℓ , provides additional insight into spectral clustering algorithms. [sent-511, score-0.171]
71 Thus, the spectral properties of both operators will be close to the ones stated in Theorem 7 if h is in a neighborhood of the interval (0; dmin ). [sent-518, score-0.268]
72 In particular, the sum of the eigenspaces of Qh associated with the eigenvalues close to 1 is spanned by functions that are close to (in W (L (t))norm) the characteristic functions of the connected components of L (t). [sent-520, score-0.185]
73 However, when h is taken slightly larger than the critical value dmin , at least two connected components cannot be separated using the graph partitioning algorithm. [sent-530, score-0.251]
74 For all h ≤ dmin the ℓ largest eigenvalues of Qh are all equal to 1 and the corresponding eigenspace is spanned by the indicator functions of the connected components of the t-level set. [sent-532, score-0.293]
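A quick numerical sanity check of this statement (our own illustration, reusing k_h from above): for two clouds separated by more than h, the matrix Q_{n,h} is block diagonal, the eigenvalue 1 appears with multiplicity 2, and a spectral gap follows.

```python
import numpy as np

rng = np.random.default_rng(0)
h = 0.5
X1 = 0.1 * rng.standard_normal((50, 2))                          # first cloud
X2 = 0.1 * rng.standard_normal((50, 2)) + np.array([5.0, 0.0])   # second cloud, far away
X = np.vstack([X1, X2])
K = k_h(X[:, None, :] - X[None, :, :], h)    # zero similarity across the two clouds
Q = K / K.sum(axis=1, keepdims=True)
eigvals = np.sort(np.linalg.eigvals(Q).real)[::-1]
print(eigvals[:3])   # expected: two values equal to 1 (up to rounding), then a gap
```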
75 . . . , g_ℓ, at distance (in ‖·‖_W-norm) less than C_0/2 from the indicator functions of the connected components of L(t): ‖g_k − 1_{C_k}‖_∞ ≤ ‖g_k − 1_{C_k}‖_W < C_0/2 for k = 1, . . . , ℓ. [sent-538, score-0.177]
76 4.4 Generalizations and Open Problems. Our results allow us to relate the limit partition of a spectral clustering algorithm with the connected components of either the support of the distribution (case t = 0) or of an upper level set of the density (case t > 0). [sent-557, score-0.347]
77 Interestingly, the scale parameter h of the similarity function may be larger than the minimal distance between two connected components, up to a threshold value hmax above which we have no theoretical guarantee that the connected components will be recovered. [sent-559, score-0.287]
78 Among these, interpreting the limit partition of the classical spectral clustering algorithm with the underlying distribution when one asks for more groups than the number of connected components of its support remains largely an unsolved problem. [sent-561, score-0.318]
79 Lemma 12 The two collections of functions F_1 := { y → k_h(y − x) 1_{L(t)}(y) : x ∈ L(t − ε_0) } and F_2 := { y → D_x k_h(y − x) 1_{L(t)}(y) : x ∈ L(t − ε_0) } are Glivenko-Cantelli, where D_x k_h denotes the differential of k_h. [sent-617, score-1.872]
80 Define the functions g^l_{i,δ} and g^u_{i,δ} respectively by g^l_{i,δ}(y) = inf_x g_x(y) and g^u_{i,δ}(y) = sup_x g_x(y). [sent-627, score-0.232]
81 Observe that |g_x(y)| ≤ ‖k_h‖_∞ for all x ∈ L(t − ε_0) and all y ∈ L(t) since k_h is uniformly bounded, and that for any fixed y ∈ L(t), the map x → g_x(y) is continuous since k is of class C^2 on R^d under Assumption 2. [sent-632, score-1.027]
82 Therefore the function g^u_{i,δ} − g^l_{i,δ} converges pointwise to 0 and ‖g^u_{i,δ} − g^l_{i,δ}‖_{L^1(Q)} goes to 0 as δ → 0 by the Lebesgue dominated convergence theorem. [sent-633, score-0.172]
83 Since k_h is continuously differentiable, the same arguments apply to each component of D_x k_h, and so F_2 is also a Glivenko-Cantelli class. [sent-639, score-0.936]
84 Then sup_{y∈R^d} | f(y) − r_i(y) g_j(y) 1_{L(t)}(y) | = sup_{y∈L(t)} | r(x, y) g(y) − r_i(y) g_j(y) | = sup_{y∈L(t)} | (r(x, y) − r_i(y)) g(y) + r_i(y)(g(y) − g_j(y)) | ≤ sup_{y∈L(t)} |r(x, y) − r_i(y)| ‖g‖_∞ + ‖r_i‖_∞ sup_{y∈L(t)} |g(y) − g_j(y)| ≤ ε + ‖r_i‖_∞ ε, since ‖r_i‖_∞ = 1 for all i = 1, . . . [sent-677, score-0.531]
85 The following lemma gives useful bounds on Kh and qh , both defined in (19). [sent-690, score-0.668]
86 2. The kernel q_h is uniformly bounded, that is, ‖q_h‖_∞ < ∞; [sent-693, score-1.313]
87 3. The differential of q_h with respect to x is uniformly bounded on L(t − ε_0) × R^d, that is, sup{ ‖D_x q_h(x, y)‖ : (x, y) ∈ L(t − ε_0) × R^d } < ∞; [sent-694, score-1.373]
88 4. The Hessian of q_h with respect to x is uniformly bounded on L(t − ε_0) × R^d, that is, sup{ ‖D²_x q_h(x, y)‖ : (x, y) ∈ L(t − ε_0) × R^d } < ∞. [sent-695, score-1.373]
89 Proof First observe that the statements 2, 3 and 4 are immediate consequences of statement 1 together with the fact that the function k_h is of class C^2 with compact support, which implies that k_h(y − x), D_x k_h(y − x), and D²_x k_h(y − x) are uniformly bounded. [sent-696, score-1.947]
90 Moreover kh is bounded from below by some positive number on hB/2 by Assumption 2. [sent-704, score-0.468]
91 Proof Let K†_{n,h}(x) := [1/(n µ(L(t)))] ∑_{i=1}^n k_h(X_i − x) 1_{L_n(t)}(X_i) and K††_{n,h}(x) := [1/(n µ(L(t)))] ∑_{i=1}^n k_h(X_i − x) 1_{L(t)}(X_i). [sent-711, score-0.936]
92 Using the inequality |K_{n,h}(x) − K†_{n,h}(x)| ≤ |n/j(n) − 1/µ(L(t))| ‖k_h‖_∞, we conclude that the first term in (35) tends to 0 uniformly in x over L(t − ε_0) with probability one as n → ∞, since j(n)/n → µ(L(t)) almost surely, and since k_h is bounded on R^d. [sent-713, score-1.003]
93 The first term in (36) is bounded by |K†_{n,h}(x) − K††_{n,h}(x)| ≤ (‖k_h‖_∞ / µ(L(t))) (1/n) ∑_{i=1}^n |1_{L_n(t)}(X_i) − 1_{L(t)}(X_i)| = (‖k_h‖_∞ / µ(L(t))) (1/n) ∑_{i=1}^n 1_{L_n(t)∆L(t)}(X_i), where L_n(t)∆L(t) denotes the symmetric difference between L_n(t) and L(t). [sent-715, score-0.936]
94 Next, since the collection { y → k_h(y − x) 1_{L(t)}(y) : x ∈ L(t − ε_0) } is Glivenko-Cantelli by Lemma 12, we conclude that sup_{x∈L(t−ε_0)} |K††_{n,h}(x) − K_h(x)| → 0, with probability one as n → ∞. [sent-719, score-0.528]
95 The second statement may be proved by developing similar arguments, with k_h replaced by D_x k_h, and by noting that the collection of functions { y → D_x k_h(y − x) 1_{L(t)}(y) : x ∈ L(t − ε_0) } is also Glivenko-Cantelli by Lemma 12. [sent-721, score-1.404]
96 Hence sup_{x∈L(t)} |K_h(ϕ_n(x)) − K_h(x)| → 0 as n → ∞, and since K_{n,h} converges uniformly to K_h with probability one as n → ∞ by Lemma 15, this proves the first convergence result. [sent-725, score-0.18]
97 A Feller chain is said to be open set irreducible if, for all points x, y in S and every η > 0, ∑_{n≥1} q^n(x, y + ηB) > 0, where q^n(x, dy) stands for the n-step transition kernel; see (Meyn and Tweedie, 1993, p. [sent-807, score-0.249]
98 Such a behavior does not occur if the Feller chain is topologically aperiodic, that is, if for each initial state x and each η > 0, there exists n_0 such that q^n(x, x + ηB) > 0 for every n ≥ n_0; see (Meyn and Tweedie, 1993, p. [sent-819, score-0.186]
99 Assuming that the chain is Feller, open set irreducible, topologically aperiodic and positive Harris recurrent, the sequence of distributions {q^n(x, dy)}_{n≥1} converges in total variation to ν(dy), the unique invariant probability distribution; see Theorem 13. [sent-839, score-0.176]
100 Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. [sent-1040, score-0.171]
wordName wordTfidf (topN-words)
[('qh', 0.639), ('kh', 0.468), ('dx', 0.255), ('dy', 0.154), ('dmin', 0.13), ('elletier', 0.123), ('udlo', 0.123), ('pectral', 0.115), ('evel', 0.098), ('rd', 0.093), ('clustering', 0.088), ('connected', 0.085), ('spectral', 0.083), ('lustering', 0.081), ('fh', 0.076), ('ets', 0.071), ('operator', 0.069), ('chain', 0.068), ('lt', 0.065), ('tn', 0.061), ('sup', 0.06), ('surely', 0.059), ('hartigan', 0.059), ('qn', 0.058), ('luxburg', 0.055), ('operators', 0.055), ('hmax', 0.054), ('ptn', 0.054), ('tweedie', 0.054), ('ln', 0.053), ('harris', 0.053), ('meyn', 0.052), ('von', 0.05), ('spectrum', 0.05), ('feller', 0.046), ('rg', 0.045), ('eigenvectors', 0.043), ('eigenvalues', 0.042), ('balls', 0.042), ('norm', 0.041), ('compact', 0.04), ('aperiodic', 0.038), ('fn', 0.038), ('gn', 0.038), ('recurrent', 0.038), ('irreducible', 0.038), ('topologically', 0.038), ('components', 0.036), ('uniformly', 0.035), ('supremum', 0.035), ('laplacian', 0.034), ('vaart', 0.033), ('ch', 0.033), ('ri', 0.033), ('cubes', 0.033), ('converges', 0.032), ('belkin', 0.032), ('almost', 0.032), ('diffeomorphisms', 0.031), ('gl', 0.03), ('radius', 0.03), ('lemma', 0.029), ('gx', 0.029), ('density', 0.029), ('gk', 0.028), ('der', 0.028), ('cluster', 0.028), ('xi', 0.027), ('map', 0.027), ('transition', 0.027), ('multiplicity', 0.027), ('materials', 0.027), ('similarity', 0.027), ('isolated', 0.027), ('gu', 0.027), ('proves', 0.027), ('partition', 0.026), ('convergence', 0.026), ('projector', 0.026), ('pn', 0.026), ('acting', 0.025), ('banach', 0.025), ('markov', 0.025), ('fu', 0.025), ('ek', 0.025), ('sn', 0.024), ('consistency', 0.024), ('chains', 0.024), ('walk', 0.024), ('theorem', 0.023), ('px', 0.023), ('kato', 0.023), ('montpellier', 0.023), ('pelletier', 0.023), ('rennes', 0.023), ('gi', 0.022), ('state', 0.022), ('lebesgue', 0.022), ('functional', 0.022), ('eigenspaces', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999899 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
2 0.081031732 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
Author: Adam D. Bull
Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization
3 0.078935355 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combining with existing results, it shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function
4 0.075411461 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
Author: Aad van der Vaart, Harry van Zanten
Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel
5 0.065687001 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
6 0.059380401 16 jmlr-2011-Clustering Algorithms for Chains
7 0.047464404 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
8 0.042920489 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
9 0.041378897 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
10 0.041312996 35 jmlr-2011-Forest Density Estimation
11 0.039891943 91 jmlr-2011-The Sample Complexity of Dictionary Learning
12 0.038959716 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
13 0.038710374 55 jmlr-2011-Learning Multi-modal Similarity
14 0.035631761 104 jmlr-2011-X-Armed Bandits
15 0.035207368 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
16 0.034799419 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
17 0.032657109 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
18 0.032603811 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
19 0.032003161 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
20 0.03152401 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
topicId topicWeight
[(0, 0.171), (1, -0.025), (2, 0.003), (3, 0.044), (4, 0.063), (5, -0.011), (6, -0.038), (7, 0.029), (8, -0.189), (9, 0.087), (10, 0.035), (11, 0.084), (12, -0.015), (13, 0.069), (14, 0.023), (15, -0.119), (16, 0.089), (17, -0.089), (18, -0.004), (19, -0.162), (20, 0.041), (21, -0.241), (22, 0.03), (23, -0.059), (24, -0.15), (25, -0.157), (26, -0.077), (27, -0.052), (28, 0.052), (29, -0.161), (30, 0.092), (31, 0.092), (32, -0.107), (33, 0.217), (34, 0.117), (35, -0.083), (36, 0.106), (37, 0.038), (38, 0.075), (39, 0.057), (40, -0.049), (41, -0.009), (42, -0.038), (43, 0.159), (44, -0.014), (45, 0.032), (46, -0.094), (47, -0.124), (48, 0.037), (49, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.94930524 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
2 0.52214128 16 jmlr-2011-Clustering Algorithms for Chains
Author: Antti Ukkonen
Abstract: We consider the problem of clustering a set of chains to k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the significance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artificial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd’s algorithm, orders, preference statements, planted partition model, randomization testing
3 0.45164683 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combining with existing results, it shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function
4 0.44567141 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
Author: Adam D. Bull
Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization
5 0.4148137 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
Author: Aad van der Vaart, Harry van Zanten
Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel
6 0.37930143 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms
7 0.31968686 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
8 0.30205908 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
9 0.23155762 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
10 0.23000225 91 jmlr-2011-The Sample Complexity of Dictionary Learning
11 0.22364226 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
12 0.22329344 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
13 0.21630509 104 jmlr-2011-X-Armed Bandits
14 0.2130222 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
15 0.19571233 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
16 0.19525605 6 jmlr-2011-A Simpler Approach to Matrix Completion
17 0.19349413 92 jmlr-2011-The Stationary Subspace Analysis Toolbox
18 0.19255179 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
19 0.19055705 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
20 0.19011025 55 jmlr-2011-Learning Multi-modal Similarity
topicId topicWeight
[(4, 0.057), (6, 0.03), (9, 0.046), (10, 0.015), (11, 0.011), (23, 0.011), (24, 0.033), (31, 0.09), (32, 0.025), (39, 0.011), (41, 0.037), (60, 0.013), (65, 0.03), (66, 0.043), (71, 0.017), (73, 0.036), (77, 0.276), (78, 0.068), (90, 0.029), (94, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.71165764 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
2 0.47803175 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
Author: Tony Jebara
Abstract: A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. These classifiers operate in a common input space but are coupled as they recover an unknown shared representation. A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm which involves only convex optimization problems that are straightforward to implement. Three multitask scenarios are described. The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. The second multitask method produces multiple support vector machines that learn a shared conic kernel combination. The third multitask method produces a pooled classifier as well as adaptively specialized individual classifiers. Furthermore, extensions to regression, graphical model structure estimation and other sparse methods are discussed. The maximum entropy optimization problems are implemented via a sequential quadratic programming method which leverages recent progress in fast SVM solvers. Fast monotonic convergence bounds are provided by bounding the MED sparsifying cost function with a quadratic function and ensuring only a constant factor runtime increase above standard independent SVM solvers. Results are shown on multitask data sets and favor multitask learning over single-task or tabula rasa methods. Keywords: meta-learning, support vector machines, feature selection, kernel selection, maximum entropy, large margin, Bayesian methods, variational bounds, classification, regression, Lasso, graphical model structure estimation, quadratic programming, convex programming
3 0.45074287 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
4 0.44003388 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
Author: Huixin Wang, Xiaotong Shen, Wei Pan
Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning
5 0.43857118 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
Author: Piotr Zwiernik
Abstract: The standard Bayesian Information Criterion (BIC) is derived under regularity conditions which are not always satisfied in the case of graphical models with hidden variables. In this paper we derive the BIC for the binary graphical tree models where all the inner nodes of a tree represent binary hidden variables. This provides an extension of a similar formula given by Rusakov and Geiger for naive Bayes models. The main tool used in this paper is the connection between the growth behavior of marginal likelihood integrals and the real log-canonical threshold. Keywords: BIC, marginal likelihood, singular models, tree models, Bayesian networks, real log-canonical threshold
6 0.43611851 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
7 0.4348841 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
8 0.43394062 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
9 0.4315339 16 jmlr-2011-Clustering Algorithms for Chains
10 0.43005386 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
11 0.42816591 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.42571899 59 jmlr-2011-Learning with Structured Sparsity
13 0.42390537 104 jmlr-2011-X-Armed Bandits
14 0.42305619 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
15 0.42261463 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
16 0.42237836 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
17 0.42058766 12 jmlr-2011-Bayesian Co-Training
18 0.41870236 48 jmlr-2011-Kernel Analysis of Deep Networks
19 0.41796416 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
20 0.41756657 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices