jmlr jmlr2011 jmlr2011-74 knowledge-graph by maker-knowledge-mining

74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets


Source: pdf

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
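As a concrete illustration of the first step described in the abstract, the sketch below extracts the data points whose estimated density exceeds the level t. This is a minimal sketch under assumptions of our own (a Gaussian kernel density estimate from scipy and a user-supplied threshold t), not the authors' implementation; the helper name extract_level_set is hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def extract_level_set(X, t):
    """Step 1 (sketch): keep the observations whose estimated density exceeds t.

    X : (n, d) array of observations, t : density level.
    Returns the index set J(n) of the points forming the estimated level set Ln(t).
    """
    density_at_points = gaussian_kde(X.T)(X.T)   # nonparametric density estimate evaluated at each X_i
    return np.flatnonzero(density_at_points > t)
```

The extracted points are then clustered by the spectral step, a sketch of which is given further below.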

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. [sent-6, score-0.288]

2 Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. [sent-10, score-0.237]

3 Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components 1. [sent-12, score-0.204]

4 The class of spectral clustering algorithms is presently emerging as a promising alternative, showing improved performance over classical clustering algorithms on several benchmark problems. [sent-19, score-0.259]

5 An overview of spectral clustering algorithms may be found in von Luxburg (2007), and connections with kernel methods are discussed in Filippone et al. [sent-24, score-0.221]

6 The spectral clustering algorithm amounts to embedding the data into a feature space by using the eigenvectors of the similarity matrix in such a way that the clusters may be separated using simple rules, for example, a separation by hyperplanes. [sent-26, score-0.241]

7 The core component of the spectral clustering algorithm is therefore the similarity matrix, or certain normalizations of it, generally called graph Laplacian matrices; see Chung (1997). [sent-27, score-0.198]

8 In the context of spectral clustering, the convergence of the empirical graph Laplacian operators has been established in von Luxburg et al. [sent-34, score-0.214]

9 (2000, 2001), and in the related work by Azzalini and Torelli (2007), clustering is performed by estimating the connected components of L (t). [sent-54, score-0.209]

10 In the present paper, we adopt the definition of a cluster of Hartigan (1975), and we propose and study a spectral clustering algorithm on estimated level sets. [sent-57, score-0.199]

11 In the second step of the algorithm, we perform a spectral clustering of the extracted points. [sent-64, score-0.171]

12 For the spectral clustering part of the algorithm, we consider the setting where the kernel function, or similarity function, between any pair of observations is nonnegative and has compact support of diameter 2h, for some fixed positive real number h. [sent-68, score-0.238]
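One admissible similarity under these requirements (an illustrative choice of our own, not taken from the paper) is a nonnegative, C² kernel supported on the closed unit ball B, rescaled by h:

```latex
k(u) \;=\; \bigl(1 - \|u\|^{2}\bigr)_{+}^{3},
\qquad
k_h(u) \;:=\; k(u/h),
\qquad
\operatorname{supp} k_h \;=\; hB ,
```

so the support of kh has diameter 2h, and kh is bounded below by (3/4)³ on hB/2, consistent with the positive lower bound on hB/2 invoked later in the proofs.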

13 This operator norm convergence is more amenable than the slightly weaker notion of convergence established in von Luxburg et al. [sent-76, score-0.212]

14 In the second set of results, we study the convergence of the spectrum of the empirical operator, as a corollary of the operator norm convergence. [sent-82, score-0.186]

15 As a consequence, in the asymptotic regime, any reasonable clustering algorithm applied on the transformed data partitions the observations according to the connected components of the level set. [sent-86, score-0.209]

16 Then, asymptotically, when the scale parameter is lower than the minimal distance between the connected components of L (t), this random walk cannot jump from one connected component to another. [sent-92, score-0.230]

17 Next, by exploiting the continuity of the operators in the scale parameter h, we obtain similar consistency results when h is slightly greater than the minimal distance between two connected components of L (t). [sent-93, score-0.200]

18 The special limit case t = 0 corresponds to performing a clustering on all the observations, and our results imply the convergence of the clustering to the partition of the support of the density into its connected components, for a suitable choice of the scale parameter. [sent-98, score-0.342]

19 Lastly, we obtain consistency in the sense of Hartigan’s definition when the correct number of clusters is requested, which corresponds to the number of connected components of L (t), and when the similarity function has compact support. [sent-102, score-0.212]

20 Then we define the spectral clustering algorithm on estimated level sets, and we follow by introducing the functional operators associated with the algorithm. [sent-106, score-0.248]

21 We start by studying the properties of the limit operator in the case where the scale parameter h is lower than the minimal distance between two connected components of L (t). [sent-110, score-0.19]

22 Spectral Clustering Algorithm. In this section we give a description of the spectral clustering algorithm on level sets that is suitable for our theoretical analysis. [sent-117, score-0.171]

23 The minimal distance between the connected components of L (t) is denoted by dmin, that is, dmin := inf_{i≠j} dist(Ci, Cj) (1). [sent-135, score-0.381]
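When each connected component is represented by a finite sample of points, dmin can be approximated as the smallest pairwise distance across distinct components. The helper below is an illustrative sketch only; the paper's dmin is defined for the sets Ci themselves, not for samples.

```python
import numpy as np
from scipy.spatial.distance import cdist

def minimal_component_distance(components):
    """Approximate dmin = inf_{i != j} dist(C_i, C_j) from finite samples.

    components : list of (n_i, d) arrays, one array of points per connected component.
    """
    dmin = np.inf
    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            dmin = min(dmin, cdist(components[i], components[j]).min())
    return dmin
```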

24 We denote by kh : Rd → R+ the map defined by kh (u) := k(u/h). [sent-152, score-0.963]

25 Algorithm. The first ingredient of our algorithm is the similarity matrix Kn,h, whose elements are given by Kn,h(i, j) := kh(Xj − Xi), where the integers i and j range over the random set J(n). [sent-154, score-0.495]

26 Hence Kn,h is a random matrix indexed by J(n) × J(n), whose values depend on the function kh , and on the observations X j lying in the estimated level set Ln (t). [sent-155, score-0.468]

27 The spectral clustering algorithm is based on the matrix Qn,h defined by Qn,h := D^{−1} Kn,h. [sent-158, score-0.171]

28 To implement the spectral clustering algorithm, the data points of the partitioning problem are first embedded into Rℓ by using the eigenvectors of Qn,h associated with the ℓ largest eigenvalues, namely λn,1 , λn,2 , . [sent-167, score-0.214]
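The second step can be sketched as follows: build the similarity matrix Kn,h on the extracted points, row-normalize it into Qn,h, and embed each point through the eigenvectors attached to the ℓ largest eigenvalues. This is a minimal sketch; the compactly supported kernel below is an illustrative choice, and D is taken to be the diagonal matrix of row sums of Kn,h, which is the usual convention for Qn,h = D^{−1}Kn,h but is stated here as an assumption rather than quoted from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def spectral_embedding(X_level, h, ell):
    """Step 2 (sketch): embed the level-set points into R^ell via Q_{n,h} = D^{-1} K_{n,h}.

    X_level : (j_n, d) array of points in the estimated level set Ln(t);
    h : scale parameter of the similarity; ell : embedding dimension.
    """
    dists = squareform(pdist(X_level))
    K = np.maximum(1.0 - (dists / h) ** 2, 0.0) ** 3    # compactly supported similarity k_h (illustrative choice)
    D_inv = 1.0 / K.sum(axis=1)                         # diagonal of D^{-1}; K has ones on its diagonal, so row sums are positive
    Q = D_inv[:, None] * K                              # row-stochastic matrix Q_{n,h}
    eigvals, eigvecs = np.linalg.eig(Q)                 # Q need not be symmetric
    top = np.argsort(-eigvals.real)[:ell]               # indices of the ell largest eigenvalues
    return eigvecs[:, top].real                         # row j = representation of X_j in the feature space
```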

29 In this equation, Ptn is the discrete random probability measure given by Ptn := (1/j(n)) ∑_{j∈J(n)} δ_{Xj}, and qn,h(x, y) := kh(y − x) / Kn,h(x), where Kn,h(x) := ∫_{Ln (t)} kh(y − x) Ptn(dy). [sent-189, score-0.936]
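Written out as displayed equations (a restatement of the definitions above as reconstructed from the extracted text, so the notation may differ slightly from the paper), the empirical quantities are:

```latex
P_n^{t} := \frac{1}{j(n)} \sum_{j \in J(n)} \delta_{X_j},
\qquad
K_{n,h}(x) := \int_{\mathcal{L}_n(t)} k_h(y - x)\, P_n^{t}(dy),
\qquad
q_{n,h}(x,y) := \frac{k_h(y - x)}{K_{n,h}(x)},
\qquad
Q_{n,h}\, g(x) = \int_{\mathcal{L}_n(t)} q_{n,h}(x, y)\, g(y)\, P_n^{t}(dy).
```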

30 Using this relation, asymptotic properties of the spectral clustering algorithm may be deduced from the limit behavior of the sequence of operators {Qn,h }n . [sent-192, score-0.226]

31 By definition of qn,h, setting y = ϕn(x), we have ∑_{j∈J(n)} Vj kh(y − Xj) / Kn,h(y) = 0 for all y ∈ L (t − εn). [sent-223, score-0.468]

32 Since the support of kh is hB, the support of the function Kn,h is equal to ∪_{j∈J(n)} (Xj + hB), and since kh is positive, it follows that Vj = 0 for all j in J(n). [sent-224, score-0.936]

33 Observe that for all j in J(n), g(ϕn^{−1}(Xj)) = (1/(λ j(n))) ∑_{j′∈J(n)} qn,h(Xj, Xj′) Vj′ (by definition of g) = (1/(λ j(n))) ∑_{j′∈J(n)} kh(Xj′ − Xj) Vj′ / Kn,h(Xj) (by definition of Kn,h and qn,h) = (1/λ) (Qn,h V)_j = Vj, since V is an eigenvector. [sent-235, score-0.468]

34 The first term in (9) is bounded uniformly by |Rn g(x) − Sn g(x)| ≤ | n/j(n) − 1/µ(L (t)) | ‖r‖∞ ‖g‖∞, and since j(n)/n tends to µ(L (t)) almost surely as n → ∞, we conclude that sup{ ‖Rn g − Sn g‖∞ : ‖g‖W ≤ 1 } → 0 a.s. [sent-268, score-0.186]

35 Third, I1(x, g) ≤ sup_{x∈L (t)} | g(ϕn^{−1}(x)) − g(x) | ≤ ‖Dx g‖∞ sup_{x∈L (t)} ‖ϕn^{−1}(x) − x‖ ≤ ‖Dx g‖∞ sup_{x∈L (t)} ‖x − ϕn(x)‖ → 0 (15) as n → ∞ by Lemma 17. [sent-276, score-0.180]

36 Proof We will prove that, as n → ∞, almost surely, sup{ ‖Qn,h g − Qh g‖∞ : ‖g‖W ≤ 1 } → 0 (20) and sup{ ‖Dx Qn,h g − Dx Qh g‖∞ : ‖g‖W ≤ 1 } → 0. To this aim, we introduce the operator Qn,h acting on W (L (t)) as Qn,h g(x) = ∫_{Ln (t)} qh(ϕn(x), y) g(ϕn^{−1}(y)) Ptn(dy). [sent-292, score-0.885]

37 (22) First, by Lemma 14, the function r = qh satisfies the condition in Proposition 3, so that sup{ ‖Qn,h g − Qh g‖∞ : ‖g‖W ≤ 1 } → 0 (23) with probability one as n → ∞. [sent-294, score-0.699]

38 Next, since ‖qh‖∞ < ∞ by Lemma 14, there exists a finite constant Ch such that ‖Qn,h g‖∞ ≤ Ch for all n and all g with ‖g‖W ≤ 1. [sent-295, score-0.639]

39 (24) By definition of qn,h, for all x, y in the level set L (t), we have qn,h(x, y) = [Kh(x) / Kn,h(x)] qh(x, y). [sent-296, score-0.639]

40 By Lemma 14, the map r : (x, y) → Dx qh (x, y) satisfies the conditions in Proposition 3. [sent-304, score-0.666]

41 Thus, ‖Rn g − Rg‖∞ converges to 0 almost surely, uniformly over g in the unit ball of W (L (t)), and we deduce that sup{ ‖Dx Qn,h g − Dx Qh g‖∞ : ‖g‖W ≤ 1 } → 0 a.s. [sent-305, score-0.159]

42 On the one hand, we have Dx qn,h(ϕn(x), y) = [Kh(ϕn(x)) / Kn,h(ϕn(x))] Dx ϕn(x) (Dx qh)(ϕn(x), y) + Dx[Kh(ϕn(x)) / Kn,h(ϕn(x))] qh(ϕn(x), y). [sent-309, score-1.278]

43 Hence, Dx Qn,h g(x) = [Kh(ϕn(x)) / Kn,h(ϕn(x))] Dx ϕn(x) Rn g(x) + Dx[Kh(ϕn(x)) / Kn,h(ϕn(x))] Qn,h g(x). [sent-310, score-0.639]

44 On the other hand, since Dx[qh(ϕn(x), y)] = Dx ϕn(x) (Dx qh)(ϕn(x), y), we have Dx Qn,h g(x) = Dx ϕn(x) Rn g(x). [sent-311, score-0.639]

45 Consistency of the Algorithm. The consistency of the algorithm relies on the operator norm convergence of Qn,h to the limit operator Qh (Theorem 4), on the spectral properties of Qh stated below in Section 4.1, and on the results collected in Appendix B on the perturbation theory of linear operators. [sent-319, score-0.312]

46 The starting point is the fact that, provided that h < dmin, the connected components of the level set L (t) are the recurrent classes of the Markov chain whose transitions are defined by Qh. [sent-320, score-0.357]

47 Hence Qh defines the desired clustering via its eigenspace corresponding to the eigenvalue 1, since the latter is spanned by the characteristic functions of the connected components of L (t), as stated in Proposition 6 below. [sent-322, score-0.209]

48 In Section 4.2, the consistency of the clustering is obtained in Theorem 7 in the case where the scale parameter h is lower than dmin, defined in (1), which is the minimum distance between any two connected components of L (t). [sent-324, score-0.363]

49 In Section 4.3, h is allowed to be larger than dmin, up to a value depending only on the underlying density f. [sent-326, score-0.159]

50 4.1 Properties of the Limit Operator Qh When h < dmin. The transition kernel qh(x, dy) := qh(x, y)µt(dy) associated with the operator Qh defines a Markov chain with state space L (t), which is not countable. [sent-328, score-1.594]

51 The properties of the Markov chain with transition kernel qh (x, dy) are stated in Proposition 5 below. [sent-331, score-0.734]

52 , Cℓ and that dmin , defined in (1), is the minimal distance between the connected components of L (t). [sent-335, score-0.251]

53 Proposition 5 Consider the Markov chain with state space L (t) and transition kernel qh (x, dy), and assume that h < dmin . [sent-336, score-0.886]

54 When started at a point x in some connected component of the state space, the chain evolves within this connected component only. [sent-340, score-0.26]

55 When the state space is reduced to some connected component of L (t), the chain is open set irreducible and positive Harris recurrent. [sent-342, score-0.213]

56 When the state space is reduced to some connected component Ck of L (t), the Markov chain has a unique invariant distribution νk(dy) and, for all x ∈ Ck, the sequence of distributions {q_h^n(x, dy)}_{n∈N} over Ck converges in total variation to νk(dy). [sent-344, score-0.265]
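The total-variation convergence in this statement can be checked numerically on a finite discretization of the kernel. The snippet below is only an illustrative finite-state analogue (a row-stochastic matrix standing in for qh(x, dy)); it is not part of the paper.

```python
import numpy as np

def tv_convergence(P, x0=0, n_steps=50):
    """Track the total-variation distance between the n-step distribution
    q^n(x0, .) of a finite-state chain P and its invariant distribution nu."""
    w, v = np.linalg.eig(P.T)                       # left eigenvectors of P
    nu = np.real(v[:, np.argmin(np.abs(w - 1.0))])  # invariant distribution (eigenvalue 1)
    nu = nu / nu.sum()
    dist = np.zeros(P.shape[0])
    dist[x0] = 1.0                                  # chain started at state x0
    tv = []
    for _ in range(n_steps):
        dist = dist @ P                             # one more transition step
        tv.append(0.5 * np.abs(dist - nu).sum())    # total-variation distance to nu
    return tv
```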

57 Proof Denote by {ξn } the Markov chain with transition kernel qh (x, dy). [sent-345, score-0.734]

58 For all x ∈ L (t), the distribution qh(x, dy) = qh(x, y)µt(dy) is absolutely continuous with respect to the Lebesgue measure, with density y → fh(x, y) defined by fh(x, y) = qh(x, y) f(y) 1_{L (t)}(y) / ∫_{y′∈L (t)} f(y′) dy′. [sent-346, score-2.098]

59 Since the similarity function kh and the density f are both continuous, the map (x, y) → fh(x, y) is continuous. [sent-347, score-0.627]

60 Since the similarity function kh is continuous, with compact support hB, the map x → Qh g(x) = ∫_{L (t)} qh(x, dy) g(y) is continuous for every bounded, measurable function g. [sent-351, score-1.201]

61 Now we have to prove that the chain is topologically aperiodic, that is, that q_h^n(x, x + ηB) > 0 for each x ∈ L (t), for all n ≥ 1 and η > 0, where q_h^n(x, ·) is the distribution of ξn conditioned on ξ0 = x. [sent-353, score-0.222]

62 Since the distribution q_h^n(x, ·) admits a continuous density f_h^n(x, ·), it is enough to prove that f_h^n(x, x) > 0. [sent-354, score-0.163]

63 By induction over n, using (30), f_h^n(x, x) > 0 and the chain is topologically aperiodic. [sent-356, score-0.182]

64 Whence, Px(ξ1 ∈ C1) = qh(x, C1) = ∫_{C1} qh(x, y)µt(dy) = ∫_{L (t)} qh(x, y)µt(dy) = 1. [sent-361, score-1.917]

65 qh (xN−1 , y) > 0 which proves that the chain is open set irreducible. [sent-392, score-0.734]

66 Therefore kh (y − x) = kh (x − y) which yields Kh (x)qh (x, dy)µt (dx) = Kh (y)qh (y, dx)µt (dy). [sent-395, score-0.936]
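The displayed reversibility identity follows in one step from the symmetry of kh once qh is written as qh(x, y) = kh(y − x)/Kh(x) (the definition as reconstructed from the surrounding text); a sketch:

```latex
K_h(x)\, q_h(x,y) \;=\; k_h(y - x) \;=\; k_h(x - y) \;=\; K_h(y)\, q_h(y,x),
```

and multiplying by µt(dx) µt(dy), using qh(x, dy) = qh(x, y)µt(dy), gives Kh(x)qh(x, dy)µt(dx) = Kh(y)qh(y, dx)µt(dy).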

67 There exists a sequence {ξn }n of invertible linear transformations of Rℓ such that, for all j ∈ J(∞), ξn ρn (X j ) converges almost surely to ek( j) , where ek( j) is the vector of Rℓ whose components are all 0 except the k( j)th component equal to 1. [sent-427, score-0.159]

68 Remark 8 The last step of the spectral clustering algorithm consists in partitioning the transformed data in the feature space, which can be performed by a standard clustering algorithm, such as the k-means algorithm or a hierarchical clustering. [sent-483, score-0.259]
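A minimal sketch of this final partitioning step follows; the k-means variant is one possible choice, and kmeans2 and its arguments come from SciPy, not from the paper.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cluster_embedding(embedding, ell, seed=0):
    """Partition the rows of the feature-space representation into ell groups,
    one per connected component of L(t), using k-means."""
    _, labels = kmeans2(embedding, ell, minit='++', seed=seed)
    return labels
```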

69 Hence, splitting the graph into its connected components leads to the desired clustering as well. [sent-510, score-0.209]

70 But Theorem 7, by giving the asymptotic representation of the data when embedded in the feature space Rℓ , provides additional insight into spectral clustering algorithms. [sent-511, score-0.171]

71 Thus, the spectral properties of both operators will be close to the ones stated in Theorem 7 if h is in a neighborhood of the interval (0; dmin ). [sent-518, score-0.268]

72 In particular, the sum of the eigenspaces of Qh associated with the eigenvalues close to 1 is spanned by functions that are close (in W (L (t))-norm) to the characteristic functions of the connected components of L (t). [sent-520, score-0.185]

73 However, when h is taken slightly larger than the critical value dmin , at least two connected components cannot be separated using the graph partitioning algorithm. [sent-530, score-0.251]

74 For all h ≤ dmin the ℓ largest eigenvalues of Qh are all equal to 1 and the corresponding eigenspace is spanned by the indicator functions of the connected components of the t-level set. [sent-532, score-0.293]

75 , gℓ, at distance (in ‖·‖W-norm) less than C0/2 from the indicator functions of the connected components of L (t): ‖gk − 1Ck‖∞ ≤ ‖gk − 1Ck‖W < C0/2 for k = 1, . . . [sent-538, score-0.177]

76 Generalizations and Open Problems. Our results relate the limit partition of a spectral clustering algorithm to the connected components of either the support of the distribution (case t = 0) or of an upper level set of the density (case t > 0). [sent-557, score-0.347]

77 Interestingly, the scale parameter h of the similarity function may be larger than the minimal distance between two connected components, up to a threshold value hmax above which we have no theoretical guarantee that the connected components will be recovered. [sent-559, score-0.287]

78 Among these, interpreting the limit partition of the classical spectral clustering algorithm in terms of the underlying distribution, when one asks for more groups than the number of connected components of its support, remains largely an unsolved problem. [sent-561, score-0.318]

79 Lemma 12 The two collections of functions F1 := { y → kh(y − x) 1_{L (t)}(y) : x ∈ L (t − ε0) } and F2 := { y → Dx kh(y − x) 1_{L (t)}(y) : x ∈ L (t − ε0) } are Glivenko-Cantelli, where Dx kh denotes the differential of kh. [sent-617, score-1.872]

80 Define the functions g^l_{i,δ} and g^u_{i,δ} respectively by g^l_{i,δ}(y) = inf gx(y) and g^u_{i,δ}(y) = sup gx(y). [sent-627, score-0.232]

81 Observe that |gx(y)| ≤ ‖kh‖∞ for all x ∈ L (t − ε0) and all y ∈ L (t) since kh is uniformly bounded, and that for any fixed y ∈ L (t), the map x → gx(y) is continuous since k is of class C2 on Rd under Assumption 2. [sent-632, score-1.027]

82 Therefore the function g^u_{i,δ} − g^l_{i,δ} converges pointwise to 0 and ‖g^u_{i,δ} − g^l_{i,δ}‖_{L1(Q)} goes to 0 as δ → 0 by the Lebesgue dominated convergence theorem. [sent-633, score-0.172]

83 Since kh is continuously differentiable, the same arguments apply to each component of Dx kh, and so F2 is also a Glivenko-Cantelli class. [sent-639, score-0.936]

84 Then sup_{y∈Rd} | f(y) − ri(y) gj(y) 1_{L (t)}(y) | = sup_{y∈L (t)} | r(x, y) g(y) − ri(y) gj(y) | = sup_{y∈L (t)} | (r(x, y) − ri(y)) g(y) + ri(y)(g(y) − gj(y)) | ≤ sup_{y∈L (t)} | r(x, y) − ri(y) | ‖g‖∞ + ‖ri‖∞ sup_{y∈L (t)} | g(y) − gj(y) | ≤ ε ‖g‖∞ + ‖ri‖∞ ε, since ‖ri‖∞ = 1 for all i = 1, . . . [sent-677, score-0.531]

85 The following lemma gives useful bounds on Kh and qh , both defined in (19). [sent-690, score-0.668]

86 The kernel qh is uniformly bounded, that is, ‖qh‖∞ < ∞; [sent-693, score-1.313]

87 The differential of qh with respect to x is uniformly bounded on L (t − ε0) × Rd, that is, sup{ ‖Dx qh(x, y)‖ : (x, y) ∈ L (t − ε0) × Rd } < ∞; [sent-694, score-1.373]

88 The Hessian of qh with respect to x is uniformly bounded on L (t − ε0) × Rd, that is, sup{ ‖D²x qh(x, y)‖ : (x, y) ∈ L (t − ε0) × Rd } < ∞. [sent-695, score-1.373]

89 Proof First observe that the statements 2, 3 and 4 are immediate consequences of statement 1 together with the fact that the function kh is of class C2 with compact support, which implies that kh(y − x), Dx kh(y − x), and D²x kh(y − x) are uniformly bounded. [sent-696, score-1.947]

90 Moreover kh is bounded from below by some positive number on hB/2 by Assumption 2. [sent-704, score-0.468]

91 Proof Let K†n,h(x) := (1/(n µ(L (t)))) ∑_{i=1}^n kh(Xi − x) 1_{Ln (t)}(Xi) and K††n,h(x) := (1/(n µ(L (t)))) ∑_{i=1}^n kh(Xi − x) 1_{L (t)}(Xi). [sent-711, score-0.936]

92 Using the inequality |Kn,h(x) − K†n,h(x)| ≤ | n/j(n) − 1/µ(L (t)) | ‖kh‖∞, we conclude that the first term in (35) tends to 0 uniformly in x over L (t − ε0) with probability one as n → ∞, since j(n)/n → µ(L (t)) almost surely, and since kh is bounded on Rd. [sent-713, score-1.003]

93 (36) The first term in (36) is bounded by |K†n,h(x) − K††n,h(x)| ≤ (‖kh‖∞ / µ(L (t))) (1/n) ∑_{i=1}^n | 1_{Ln (t)}(Xi) − 1_{L (t)}(Xi) | = (‖kh‖∞ / µ(L (t))) (1/n) ∑_{i=1}^n 1_{Ln (t)∆L (t)}(Xi), where Ln (t)∆L (t) denotes the symmetric difference between Ln (t) and L (t). [sent-715, score-0.936]

94 Next, since the collection { y → kh(y − x) 1_{L (t)}(y) : x ∈ L (t − ε0) } is Glivenko-Cantelli by Lemma 12, we conclude that sup_{x∈L (t−ε0)} |K††n,h(x) − Kh(x)| → 0 with probability one as n → ∞. [sent-719, score-0.528]

95 The second statement may be proved by developing similar arguments, with kh replaced by Dx kh, and by noting that the collection of functions { y → Dx kh(y − x) 1_{L (t)}(y) : x ∈ L (t − ε0) } is also Glivenko-Cantelli by Lemma 12. [sent-721, score-1.404]

96 Hence sup_{x∈L (t)} |Kh(ϕn(x)) − Kh(x)| → 0 as n → ∞, and since Kn,h converges uniformly to Kh with probability one as n → ∞ by Lemma 15, this proves the first convergence result. [sent-725, score-0.180]

97 A Feller chain is said to be open set irreducible if, for all points x, y in S and every η > 0, ∑_{n≥1} q^n(x, y + ηB) > 0, where q^n(x, dy) stands for the n-step transition kernel; see (Meyn and Tweedie, 1993, p. [sent-807, score-0.249]

98 Such behavior does not occur if the Feller chain is topologically aperiodic, that is, if for each initial state x and each η > 0, there exists n0 such that q^n(x, x + ηB) > 0 for every n ≥ n0; see (Meyn and Tweedie, 1993, p. [sent-819, score-0.186]

99 Assuming that the chain is Feller, open set irreducible, topologically aperiodic and positive Harris recurrent, the sequence of distributions {q^n(x, dy)}_{n≥1} converges in total variation to ν(dy), the unique invariant probability distribution; see Theorem 13. [sent-839, score-0.176]

100 Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. [sent-1040, score-0.171]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qh', 0.639), ('kh', 0.468), ('dx', 0.255), ('dy', 0.154), ('dmin', 0.13), ('elletier', 0.123), ('udlo', 0.123), ('pectral', 0.115), ('evel', 0.098), ('rd', 0.093), ('clustering', 0.088), ('connected', 0.085), ('spectral', 0.083), ('lustering', 0.081), ('fh', 0.076), ('ets', 0.071), ('operator', 0.069), ('chain', 0.068), ('lt', 0.065), ('tn', 0.061), ('sup', 0.06), ('surely', 0.059), ('hartigan', 0.059), ('qn', 0.058), ('luxburg', 0.055), ('operators', 0.055), ('hmax', 0.054), ('ptn', 0.054), ('tweedie', 0.054), ('ln', 0.053), ('harris', 0.053), ('meyn', 0.052), ('von', 0.05), ('spectrum', 0.05), ('feller', 0.046), ('rg', 0.045), ('eigenvectors', 0.043), ('eigenvalues', 0.042), ('balls', 0.042), ('norm', 0.041), ('compact', 0.04), ('aperiodic', 0.038), ('fn', 0.038), ('gn', 0.038), ('recurrent', 0.038), ('irreducible', 0.038), ('topologically', 0.038), ('components', 0.036), ('uniformly', 0.035), ('supremum', 0.035), ('laplacian', 0.034), ('vaart', 0.033), ('ch', 0.033), ('ri', 0.033), ('cubes', 0.033), ('converges', 0.032), ('belkin', 0.032), ('almost', 0.032), ('diffeomorphisms', 0.031), ('gl', 0.03), ('radius', 0.03), ('lemma', 0.029), ('gx', 0.029), ('density', 0.029), ('gk', 0.028), ('der', 0.028), ('cluster', 0.028), ('xi', 0.027), ('map', 0.027), ('transition', 0.027), ('multiplicity', 0.027), ('materials', 0.027), ('similarity', 0.027), ('isolated', 0.027), ('gu', 0.027), ('proves', 0.027), ('partition', 0.026), ('convergence', 0.026), ('projector', 0.026), ('pn', 0.026), ('acting', 0.025), ('banach', 0.025), ('markov', 0.025), ('fu', 0.025), ('ek', 0.025), ('sn', 0.024), ('consistency', 0.024), ('chains', 0.024), ('walk', 0.024), ('theorem', 0.023), ('px', 0.023), ('kato', 0.023), ('montpellier', 0.023), ('pelletier', 0.023), ('rennes', 0.023), ('gi', 0.022), ('state', 0.022), ('lebesgue', 0.022), ('functional', 0.022), ('eigenspaces', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999899 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

2 0.081031732 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

Author: Adam D. Bull

Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization

3 0.078935355 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

Author: Liwei Wang

Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combining with existing results, it shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function

4 0.075411461 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

Author: Aad van der Vaart, Harry van Zanten

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel

5 0.065687001 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

Author: Thomas L. Griffiths, Zoubin Ghahramani

Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices

6 0.059380401 16 jmlr-2011-Clustering Algorithms for Chains

7 0.047464404 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

8 0.042920489 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

9 0.041378897 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

10 0.041312996 35 jmlr-2011-Forest Density Estimation

11 0.039891943 91 jmlr-2011-The Sample Complexity of Dictionary Learning

12 0.038959716 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

13 0.038710374 55 jmlr-2011-Learning Multi-modal Similarity

14 0.035631761 104 jmlr-2011-X-Armed Bandits

15 0.035207368 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

16 0.034799419 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

17 0.032657109 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

18 0.032603811 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

19 0.032003161 60 jmlr-2011-Locally Defined Principal Curves and Surfaces

20 0.03152401 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.171), (1, -0.025), (2, 0.003), (3, 0.044), (4, 0.063), (5, -0.011), (6, -0.038), (7, 0.029), (8, -0.189), (9, 0.087), (10, 0.035), (11, 0.084), (12, -0.015), (13, 0.069), (14, 0.023), (15, -0.119), (16, 0.089), (17, -0.089), (18, -0.004), (19, -0.162), (20, 0.041), (21, -0.241), (22, 0.03), (23, -0.059), (24, -0.15), (25, -0.157), (26, -0.077), (27, -0.052), (28, 0.052), (29, -0.161), (30, 0.092), (31, 0.092), (32, -0.107), (33, 0.217), (34, 0.117), (35, -0.083), (36, 0.106), (37, 0.038), (38, 0.075), (39, 0.057), (40, -0.049), (41, -0.009), (42, -0.038), (43, 0.159), (44, -0.014), (45, 0.032), (46, -0.094), (47, -0.124), (48, 0.037), (49, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94930524 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

2 0.52214128 16 jmlr-2011-Clustering Algorithms for Chains

Author: Antti Ukkonen

Abstract: We consider the problem of clustering a set of chains to k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the significance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artificial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd’s algorithm, orders, preference statements, planted partition model, randomization testing

3 0.45164683 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

Author: Liwei Wang

Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combining with existing results, it shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function

4 0.44567141 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

Author: Adam D. Bull

Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization

5 0.4148137 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

Author: Aad van der Vaart, Harry van Zanten

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Matérn and squared exponential kernels. For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Matérn kernel, squared exponential kernel

6 0.37930143 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms

7 0.31968686 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

8 0.30205908 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

9 0.23155762 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

10 0.23000225 91 jmlr-2011-The Sample Complexity of Dictionary Learning

11 0.22364226 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

12 0.22329344 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

13 0.21630509 104 jmlr-2011-X-Armed Bandits

14 0.2130222 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

15 0.19571233 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

16 0.19525605 6 jmlr-2011-A Simpler Approach to Matrix Completion

17 0.19349413 92 jmlr-2011-The Stationary Subspace Analysis Toolbox

18 0.19255179 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

19 0.19055705 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

20 0.19011025 55 jmlr-2011-Learning Multi-modal Similarity


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.057), (6, 0.03), (9, 0.046), (10, 0.015), (11, 0.011), (23, 0.011), (24, 0.033), (31, 0.09), (32, 0.025), (39, 0.011), (41, 0.037), (60, 0.013), (65, 0.03), (66, 0.043), (71, 0.017), (73, 0.036), (77, 0.276), (78, 0.068), (90, 0.029), (94, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71165764 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

2 0.47803175 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

Author: Tony Jebara

Abstract: A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. These classifiers operate in a common input space but are coupled as they recover an unknown shared representation. A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm which involves only convex optimization problems that are straightforward to implement. Three multitask scenarios are described. The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. The second multitask method produces multiple support vector machines that learn a shared conic kernel combination. The third multitask method produces a pooled classifier as well as adaptively specialized individual classifiers. Furthermore, extensions to regression, graphical model structure estimation and other sparse methods are discussed. The maximum entropy optimization problems are implemented via a sequential quadratic programming method which leverages recent progress in fast SVM solvers. Fast monotonic convergence bounds are provided by bounding the MED sparsifying cost function with a quadratic function and ensuring only a constant factor runtime increase above standard independent SVM solvers. Results are shown on multitask data sets and favor multitask learning over single-task or tabula rasa methods. Keywords: meta-learning, support vector machines, feature selection, kernel selection, maximum entropy, large margin, Bayesian methods, variational bounds, classification, regression, Lasso, graphical model structure estimation, quadratic programming, convex programming

3 0.45074287 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

4 0.44003388 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

Author: Huixin Wang, Xiaotong Shen, Wei Pan

Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning

5 0.43857118 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

Author: Piotr Zwiernik

Abstract: The standard Bayesian Information Criterion (BIC) is derived under regularity conditions which are not always satisfied in the case of graphical models with hidden variables. In this paper we derive the BIC for the binary graphical tree models where all the inner nodes of a tree represent binary hidden variables. This provides an extension of a similar formula given by Rusakov and Geiger for naive Bayes models. The main tool used in this paper is the connection between the growth behavior of marginal likelihood integrals and the real log-canonical threshold. Keywords: BIC, marginal likelihood, singular models, tree models, Bayesian networks, real log-canonical threshold

6 0.43611851 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

7 0.4348841 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

8 0.43394062 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

9 0.4315339 16 jmlr-2011-Clustering Algorithms for Chains

10 0.43005386 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

11 0.42816591 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

12 0.42571899 59 jmlr-2011-Learning with Structured Sparsity

13 0.42390537 104 jmlr-2011-X-Armed Bandits

14 0.42305619 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

15 0.42261463 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

16 0.42237836 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

17 0.42058766 12 jmlr-2011-Bayesian Co-Training

18 0.41870236 48 jmlr-2011-Kernel Analysis of Deep Networks

19 0.41796416 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

20 0.41756657 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices