jmlr jmlr2007 jmlr2007-36 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
Reference: text
sentIndex sentText sentNum sentScore
1 These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. [sent-6, score-0.189]
2 Seeger (2000) proposed the well-known cluster assumption as a reasonable one. [sent-7, score-0.165]
3 We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. [sent-8, score-0.363]
4 The methods try to give an answer to the question: “How to improve classification accuracy using unlabeled data together with the labeled data?” [sent-11, score-0.180]
5 The first one consists in using the unlabeled data to reduce the complexity of the problem in a broad sense. [sent-15, score-0.139]
6 In that case, unlabeled data is used to measure the compatibility between the classifiers and reduces the complexity of the set of candidate classifiers (see, for example, Balcan and Blum, 2005; Blum and Mitchell, 1998). [sent-17, score-0.139]
7 It assumes that the data contains clusters that have homogeneous labels and the unlabeled observations are used to identify these clusters. [sent-21, score-0.294]
8 Most of these methods use Hartigan’s (1975) definition of clusters, namely the connected components of the density level sets. [sent-25, score-0.162]
9 In the same spirit, Tipping (1999) and Rattray (2000) propose methods that learn a distance using unlabeled data in order to have intra-cluster distances smaller than inter-cluster distances. [sent-29, score-0.139]
10 The whole family of graph-based methods also aims at using unlabeled data to learn the distances between points. [sent-30, score-0.139]
11 Finally, we mention kernel methods, where unlabeled data are used to build the kernel. [sent-33, score-0.139]
12 Recalling that the kernel measures proximity between points, such methods can also be viewed as learning a distance using unlabeled data (see Bousquet et al. [sent-34, score-0.139]
13 The cluster assumption can be interpreted in another way, that is, as the requirement that the decision boundary has to lie in low density regions. [sent-37, score-0.23]
14 Although most methods make, sometimes implicitly, the cluster assumption, no formulation in probabilistic terms has been provided so far. [sent-45, score-0.159]
15 We also discuss what can and cannot be done using unlabeled data. [sent-47, score-0.139]
16 1 Outline of the Paper After describing the model, we formulate the cluster assumption and discuss why and how it can improve classification performance in Section 2. [sent-50, score-0.165]
17 1 which essentially states that the effect of unlabeled data on the rates of convergence cannot be observed on the whole excess-risk. [sent-52, score-0.17]
18 Indeed, such a population case corresponds in some way to the case where the amount of unlabeled data is infinite. [sent-55, score-0.139]
19 Section 4 contains the main result: after having defined the clusters in terms of density level sets, we propose an algorithm for which we derive rates of convergence for the cluster excess-risk as a measure of performance. [sent-56, score-0.396]
20 An example of consistent density level set estimators is given in Section 5. [sent-57, score-0.194]
21 In many applications, a large amount of unlabeled data is available together with a small set of labeled data $(X_1, Y_1), \dots, (X_n, Y_n)$. [sent-85, score-0.18]
22 However, simulations indicate that much faster rates should be attainable when unlabeled data is used to identify homogeneous clusters. [sent-101, score-0.202]
23 Of course, it is well known that in order to make use of the additional unlabeled observations, we have to make an assumption on the dependence between the marginal distribution of X and the joint distribution of (X,Y ) (see, for example, Zhang and Oles, 2000). [sent-102, score-0.189]
24 Seeger (2000) formulated the rather intuitive cluster assumption as follows: Two points $x, x' \in \mathcal{X}$ should have the same label $y$ if there is a path between them which passes only through regions of relatively high $P_X$. [sent-103, score-0.165]
25 Assume for the moment that we know what the clusters are, so that we do not have to define them in terms of density level sets. [sent-110, score-0.227]
26 We now make the assumption that the $T_j$'s are clusters of homogeneous data. [sent-116, score-0.182]
27 It is not hard to see that the cluster assumption (CA) is equivalent to the following assumption. [sent-127, score-0.165]
28 Since the cluster assumption (CA) says nothing about what happens outside of the set $\mathcal{C}$, we can only perform supervised classification on $\mathcal{C}^c$. [sent-140, score-0.199]
29 Consider a classifier $\hat{g}_{n,m}$ built from the labeled and unlabeled samples $(\mathcal{X}_l, \mathcal{X}_u)$ pooled together. [sent-141, score-0.21]
30 Since the unlabeled sample is of no help to classify points in $\mathcal{C}^c$, any reasonable classifier should be based on the sample $\mathcal{X}_l$, so that $\hat{g}_{n,m}(x) = \hat{g}_n(x)$ for all $x \in \mathcal{C}^c$, and we have $\mathcal{E}(\hat{g}_{n,m}) \ge \mathbb{E}_n \int_{\mathcal{C}^c} |2\eta(x) - 1|\, \mathbf{1}\{\hat{g}_n(x) \neq g^*(x)\}\, dP_X(x)$. [sent-145, score-0.213]
31 1 is that even when the cluster assumption (CA) is valid, the unlabeled data are useless for improving the rates of convergence. [sent-156, score-0.335]
32 To observe the effect of unlabeled data on the rates of convergence, we have to consider the cluster excess-risk of a classifier $\hat{g}_{n,m}$ defined by $\mathcal{E}_{\mathcal{C}}(\hat{g}_{n,m}) = \mathbb{E}_{n,m} \int_{\mathcal{C}} |2\eta(x) - 1|\, \mathbf{1}\{\hat{g}_{n,m}(x) \neq g^*(x)\}\, dP_X(x)$. [sent-162, score-0.308]
33 ˆ We now propose a method to obtain good upper bounds on the cluster excess-risk, taking advantage of the cluster assumption (CA). [sent-172, score-0.303]
34 Under the cluster assumption (CA), the function $x \mapsto \eta(x) - 1/2$ has constant sign on each $T_j$. [sent-183, score-0.165]
35 For any $j \ge 1$, define the random variable $Z_n^j = \sum_{i=1}^{n} (2Y_i - 1)\, \mathbf{1}\{X_i \in T_j\}$, and denote by $\hat{g}_n^j$ the function $\hat{g}_n^j(x) = \mathbf{1}\{Z_n^j > 0\}$, for all $x \in T_j$. [sent-187, score-0.148]
36 Consider the classifier defined on $\mathcal{C}$ by $\hat{g}_n(x) = \sum_{j \ge 1} \hat{g}_n^j(x)\, \mathbf{1}\{x \in T_j\}$, $x \in \mathcal{C}$. [sent-188, score-0.148]
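The per-cluster majority vote above is simple to implement. Below is a minimal sketch, not code from the paper: the helper `cluster_of` and the data layout are assumptions for illustration. It accumulates $Z_n^j$ from the labeled sample and returns the resulting classifier $\hat{g}_n$.

```python
import numpy as np

def majority_vote_classifier(X_labeled, y_labeled, cluster_of):
    """Label each known cluster T_j by the sign of Z_n^j = sum_i (2 Y_i - 1) 1{X_i in T_j}.

    y_labeled takes values in {0, 1}; cluster_of(x) returns the index of the cluster
    containing x, or -1 if x lies outside the union of the clusters (hypothetical helper).
    """
    signs = 2 * np.asarray(y_labeled) - 1                 # map labels {0,1} -> {-1,+1}
    votes = {}
    for x, s in zip(X_labeled, signs):
        j = cluster_of(x)
        if j >= 0:
            votes[j] = votes.get(j, 0) + s                # accumulate Z_n^j

    def predict(x):
        j = cluster_of(x)
        if j < 0 or j not in votes:
            return None                                   # outside the clusters: (CA) gives no guarantee
        return 1 if votes[j] > 0 else 0                   # g_n(x) = 1{Z_n^j > 0}

    return predict
```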
37 The following theorem gives rates of convergence for the cluster excess-risk of the classifier $\hat{g}_n$ under (CA) that can be exponential in $n$ under a mild additional assumption. [sent-189, score-0.169]
38 However, we need estimators of the clusters $T_j$, $j = 1, 2, \dots$ [sent-197, score-0.172]
39 In the next section we provide the main result on semi-supervised learning, that is, when the clusters are unknown but we can estimate them using the unlabeled sample $\mathcal{X}_u$. [sent-201, score-0.262]
40 We begin by giving a definition of the clusters in terms of density level sets. [sent-210, score-0.227]
41 The cluster assumption also involves a notion of connectedness of a set. [sent-217, score-0.201]
42 It can easily be shown that R is an equivalence relation, and its equivalence classes are called the connected components of C. [sent-220, score-0.146]
43 At this point, in view of the formulation of the cluster assumption, it is very tempting to define the clusters as the connected components of C. [sent-221, score-0.34]
44 In the next proposition we prove that, given a certain scale s > 0, it is possible to split an $r_0$-standard and closed set C into a unique partition that is coarser than the partition defined by the connected components of C, and that this partition is finite for such sets. [sent-268, score-0.227]
45 This is consistent with the fact that the finite number of unlabeled observations allows us to have only a blurred vision of the clusters. [sent-286, score-0.18]
46 We now formulate the cluster assumption when the clusters are defined in terms of density level sets. [sent-290, score-0.392]
47 , $H_J$, it is therefore natural to use a suitable reordering of the $(s_0 + u_m)$-connected components of $\hat{G}_m$, where $u_m$ is a positive sequence that tends to 0 as $m$ tends to infinity. [sent-326, score-0.222]
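In practice, when the estimate $\hat{G}_m$ is represented by a finite set of points (for instance the unlabeled points retained by a plug-in rule), its $s$-connected components can be extracted by single-linkage grouping at scale $s$. The sketch below only illustrates that idea under this point-cloud assumption; it uses Euclidean chains between sample points rather than the paper's $d_\infty$ paths inside the set.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def s_connected_components(points, s):
    """Group points into s-connected components: two points belong to the same
    component if they can be joined by a chain of points whose consecutive
    distances are at most s (single linkage at scale s)."""
    points = np.asarray(points)
    n = len(points)
    pairs = cKDTree(points).query_pairs(r=s, output_type='ndarray')  # all pairs within distance s
    adj = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
    n_components, labels = connected_components(adj, directed=False)
    return n_components, labels
```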
48 Since the measure of performance $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \,\triangle\, \Gamma)$ is defined up to a set of null Lebesgue measure, it may be the case that even an estimator $\hat{G}_m$ that satisfies $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \,\triangle\, \Gamma) = 0$ has only one $(s_0 + u_m)$-connected component whereas $\Gamma$ has several $s_0$-connected components. [sent-327, score-0.218]
49 Note that the performance of a density level set estimator $\hat{G}_m$ is measured by the quantity $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \,\triangle\, \Gamma) = \mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m^c \cap \Gamma) + \mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \cap \Gamma^c)$. (7) [sent-335, score-0.196]
50 For some estimators, such as the offset plug-in density level set estimators presented in Section 5, we can prove that the dominant term in the RHS of (7) is $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m^c \cap \Gamma)$. [sent-336, score-0.189]
51 We say that the estimator $\hat{G}_m$ is consistent from inside at rate $m^{-\alpha}$ if it satisfies $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \,\triangle\, \Gamma) = O(m^{-\alpha})$ and $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \cap \Gamma^c) = O(m^{-2\alpha})$. [sent-341, score-0.188]
52 The following proposition ensures that the clipped version of an estimator that is consistent from inside is also consistent from inside at the same rate. [sent-342, score-0.391]
53 Assume that $\mathcal{X}$ is bounded and let $\hat{G}_m$ be an estimator of $\Gamma$ that is consistent from inside at rate $m^{-\alpha}$. [sent-345, score-0.188]
54 Then, the clipped estimator $\tilde{G}_m = \hat{G}_m \setminus \mathrm{Clip}(\hat{G}_m)$ is also consistent from inside at rate $m^{-\alpha}$ and has a finite number $\tilde{K}_m \le \mathrm{Leb}_d(\mathcal{X})\, m^{\alpha}$ of $(s_0 + u_m)$-connected components that have Lebesgue measure greater than or equal to $m^{-\alpha}$. [sent-346, score-0.25]
55 Moreover, the $(s_0 + u_m)$-connected components of $\tilde{G}_m$ are mutually $(s_0 + \theta u_m)$-separated for any $\theta \in (0, 1)$. [sent-347, score-0.17]
56 3 Labeling the Clusters From the strong cluster assumption (SCA) the clusters are homogeneous regions. [sent-376, score-0.32]
57 Use the unlabeled data $\mathcal{X}_u$ to construct an estimator $\hat{G}_m$ of $\Gamma$ that is consistent from inside at rate $m^{-\alpha}$. [sent-381, score-0.272]
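Stitching the steps together, a hedged end-to-end sketch of this two-stage procedure could look as follows. Here `estimate_level_set` refers to the kernel plug-in sketch given after the kernel density estimator in Section 5 below, and `s_connected_components` and `majority_vote_classifier` refer to the sketches above; the nearest-point assignment of a query to a component is an illustrative simplification, not the paper's construction.

```python
import numpy as np

def semi_supervised_classify(X_labeled, y_labeled, X_unlabeled, x_new,
                             lam, offset, scale, bandwidth):
    """1) Estimate the level set {p >= lam} from the unlabeled sample (offset plug-in).
       2) Split the estimate into connected components at the given scale: the clusters.
       3) Label each cluster by majority vote over the labeled points it contains.
       4) Classify x_new by the label of its cluster (None outside the clusters)."""
    level_set_pts = estimate_level_set(X_unlabeled, lam + offset, bandwidth)   # step 1
    _, comp = s_connected_components(level_set_pts, scale)                     # step 2

    def cluster_of(x):
        d = np.linalg.norm(level_set_pts - np.asarray(x), axis=1)
        return int(comp[d.argmin()]) if d.min() <= scale else -1

    predict = majority_vote_classifier(X_labeled, y_labeled, cluster_of)       # step 3
    return predict(x_new)                                                      # step 4
```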
58 Fix $\alpha > 0$ and let $\hat{G}_m$ be an estimator of the density level set $\Gamma$ that is consistent from inside at rate $m^{-\alpha}$. [sent-388, score-0.292]
59 Plug-in Rules for Density Level Set Estimation Fix $\lambda > 0$ and recall that our goal is to use the unlabeled sample $\mathcal{X}_u$ of size $m$ to construct an estimator $\hat{G}_m$ of $\Gamma = \Gamma(\lambda) = \{x \in \mathcal{X} : p(x) \ge \lambda\}$ that is consistent from inside at rate $m^{-\alpha}$ for some $\alpha > 0$ that should be as large as possible. [sent-408, score-0.327]
60 A simple and intuitive way to achieve this goal is to use plug-in estimators of $\Gamma$ defined by $\hat{\Gamma} = \hat{\Gamma}(\lambda) = \{x \in \mathcal{X} : \hat{p}_m(x) \ge \lambda\}$, where $\hat{p}_m$ is some estimator of $p$. [sent-409, score-0.291]
61 A straightforward generalization are the offset plug-in estimators of $\Gamma(\lambda)$, defined by $\tilde{\Gamma}_\ell = \tilde{\Gamma}_\ell(\lambda) = \{x \in \mathcal{X} : \hat{p}_m(x) \ge \lambda + \ell\}$, where $\ell > 0$ is an offset. [sent-410, score-0.16]
62 Keeping in mind that we want estimators that are consistent from inside, we are going to consider a sufficiently large offset $\ell = \ell(m)$. [sent-412, score-0.181]
63 One advantage of plug-in rules over direct methods is that once we have an estimator $\hat{p}_m$, we can compute the whole collection $\{\tilde{\Gamma}_\ell(\lambda),\ \lambda > 0\}$, which might be of interest for the user who wants to try several values of $\lambda$. [sent-416, score-0.19]
64 A density estimator can be parametric, typically based on a mixture model, or nonparametric such as histograms or kernel density estimators. [sent-418, score-0.222]
65 However, it is not our intent to propose here the best clustering algorithm or the best density level set estimator and we present a simple proof of convergence for offset plug-in rules only for the sake of completeness. [sent-421, score-0.232]
66 Moreover, it ensures that when the offset is suitably chosen, the plug-in estimator is consistent from inside. [sent-428, score-0.169]
67 Let $\hat{p}_m$ be an estimator of the density $p$ based on the sample $\mathcal{X}_u$ of size $m \ge 1$ and let $\mathcal{P}$ be a class of densities on $\mathcal{X}$. [sent-431, score-0.232]
68 (13) Then the offset plug-in estimator $\tilde{\Gamma}_\ell$ is consistent from inside at rate $m^{-\gamma a/2}$ for any $p \in \mathcal{P}$. [sent-434, score-0.188]
69 Consider a kernel density estimator $\hat{p}_m^K$ based on the sample $\mathcal{X}_u$ defined by $\hat{p}_m^K(x) = \frac{1}{m h^d} \sum_{i=n+1}^{n+m} K\!\left(\frac{X_i - x}{h}\right)$, $x \in \mathcal{X}$, where $h > 0$ is the bandwidth parameter and $K : \mathbb{R}^d \to \mathbb{R}$ is a kernel. [sent-435, score-0.157]
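As an illustration of the offset plug-in rule built on such a kernel density estimator, here is a hedged sketch: the Gaussian kernel is an arbitrary choice (the paper allows general kernels satisfying its conditions), and representing the level-set estimate by the unlabeled points whose estimated density clears the threshold is a computational simplification, not the paper's definition.

```python
import numpy as np

def kde(x, X_unlabeled, h):
    """Kernel density estimate p_hat(x) = (1 / (m h^d)) * sum_i K((X_i - x) / h),
    with a Gaussian kernel K as an illustrative choice."""
    X_unlabeled = np.asarray(X_unlabeled)
    m, d = X_unlabeled.shape
    u = (X_unlabeled - np.asarray(x)) / h
    K = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (m * h ** d)

def estimate_level_set(X_unlabeled, threshold, h):
    """Offset plug-in rule restricted to the sample: keep the unlabeled points whose
    estimated density is at least threshold = lambda + offset."""
    X_unlabeled = np.asarray(X_unlabeled)
    dens = np.array([kde(x, X_unlabeled, h) for x in X_unlabeled])
    return X_unlabeled[dens >= threshold]
```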
70 Discussion We proposed a formulation of the cluster assumption in probabilistic terms. [sent-439, score-0.186]
71 This formulation relies on Hartigan’s (1975) definition of clusters but it can be modified to match other definitions of clusters. [sent-440, score-0.144]
72 Based on these remarks, we defined the cluster excess-risk on which we observe the effect of unlabeled data. [sent-442, score-0.277]
73 Finally we proved that when we have consistent estimators of the clusters, it is possible to achieve exponential rates of convergence for the cluster excess-risk. [sent-443, score-0.259]
74 Note that our definition of clusters is parametrized by λ which is left to the user, depending on his trust in the cluster assumption. [sent-445, score-0.261]
75 In terms of the cluster assumption, it means that when λ decreases to 0, the assumption (SCA) concerns bigger and bigger sets Γ(λ) and in that sense, it becomes more and more restrictive. [sent-447, score-0.165]
76 As a result, the parameter λ can be considered as a level of confidence characterizing to which extent the cluster assumption is valid for the distribution P. [sent-448, score-0.204]
77 These three algorithms implement clustering using a definition of clusters that involves density level sets and a certain notion of connectedness. [sent-459, score-0.227]
78 The label for X is then predicted using a majority vote on the labeled instances that are assigned to the same cluster as X. [sent-465, score-0.205]
79 Observe that, unlike the method described in the paper, the clusters depend on the labeled instances $(X_1, \dots, X_n)$. [sent-466, score-0.164]
80 Since all three algorithms are distance based, we could run them only on the unlabeled instances and then assign each labeled instance and the new instance to the same cluster as its nearest neighbor. [sent-471, score-0.318]
81 We now describe more precisely why these algorithms produce estimated clusters that are related to the sm -connected components of a plug-in estimator of the density level set. [sent-473, score-0.375]
82 Also, define the kernel density estimator $\hat{p}_m$ by $\hat{p}_m(x) = \frac{1}{m \varepsilon^d} \sum_{j=1}^{m} K\!\left(\frac{x - X_j}{\varepsilon}\right)$, where $K : \mathbb{R}^d \to \mathbb{R}$ is defined by $K(x) = \mathbf{1}\{\|x\| \le 1\}$ for any $x \in \mathbb{R}^d$. [sent-489, score-0.307]
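With this indicator kernel, the density estimate at a point is just a rescaled count of the unlabeled points within distance $\varepsilon$ of it, which is essentially the quantity DBSCAN thresholds through its minPts parameter. Below is a minimal sketch of this ball-counting estimate; the KD-tree is an implementation choice for illustration, not part of the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def ball_count_density(X_unlabeled, eps):
    """Plug-in density estimate with the indicator kernel K(x) = 1{||x|| <= 1}:
    p_hat(X_i) = #{j : ||X_i - X_j|| <= eps} / (m * eps^d)."""
    X_unlabeled = np.asarray(X_unlabeled)
    m, d = X_unlabeled.shape
    tree = cKDTree(X_unlabeled)
    counts = np.array([len(tree.query_ball_point(x, eps)) for x in X_unlabeled])
    return counts / (m * eps ** d)
```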
83 OPTICS: Both of the previous algorithms suffer from a major drawback that is inherent to our definition of clusters, which is based on a single global level when determining the density level sets. [sent-495, score-0.281]
84 Indeed, in many real data sets, some clusters can only be identified using several density levels. [sent-496, score-0.188]
85 Since it does not implement our method, we do not describe the algorithm in detail, but we mention it because it implements a more suitable definition of clusters that is also based on connectedness and density level sets. [sent-498, score-0.263]
86 It uses a nearest neighbor density estimator as a workhorse and uses a single input parameter that corresponds to the scale $s_0$. [sent-500, score-0.157]
87 This paper is an attempt to give a proper mathematical framework for the cluster assumption proposed in Seeger (2000). [sent-501, score-0.165]
88 As mentioned above, the definition of clusters that we use here is one among several available, and it could be interesting to modify the formulation of the cluster assumption to match other definitions of clusters. [sent-502, score-0.309]
89 In particular, the definition of clusters as $s_0$-connected components of the $\lambda$-level set of the density leaves open the problem of choosing $\lambda$ correctly. [sent-503, score-0.233]
90 To prove that the $(s_0 + u_m)$-connected components of $\tilde{G}_m$ are mutually $s_0$-separated, let $\tilde{T}_1 \neq \tilde{T}_2$ be two $(s_0 + u_m)$-connected components of $\tilde{G}_m$ and fix $x_1 \in \tilde{T}_1$, $x_2 \in \tilde{T}_2$. [sent-588, score-0.2]
91 Thus $d_\infty(\tilde{T}_1, \tilde{T}_2) \ge s_0 + u_m > s_0 + \theta u_m$ for any $u_m > 0$, $\theta \in (0, 1)$. [sent-590, score-0.192]
92 Thus, for any $m \ge m_0$, it holds that $(3 \log m)^{-1} \le s_0 \wedge r_0$ and $A_1(j) \subset \{\mathrm{Leb}_d[\mathcal{B}(x, (3\log m)^{-1}) \cap T_j] \le \mathrm{Leb}_d[\tilde{G}_m \,\triangle\, \Gamma]\} \subset \{\mathrm{Leb}_d[\tilde{G}_m \,\triangle\, \Gamma] \ge c_0 (3\log m)^{-d}\}$, where in the last inclusion we used the fact that $\Gamma$ is $r_0$-standard. [sent-603, score-0.159]
93 On the other hand, we have $\mathrm{Leb}_d(\tilde{G}_m \cap \mathcal{B}(y_j, (3\log m)^{-1})) = \mathrm{Leb}_d(\hat{G}_m \cap \mathcal{B}(y_j, (3\log m)^{-1})) - \mathrm{Leb}_d(\mathrm{Clip}(\hat{G}_m) \cap \mathcal{B}(y_j, (3\log m)^{-1})) \ge m^{-\alpha}(3\log m)^{-d} - \mathrm{Leb}_d(\hat{G}_m \cap \Gamma^c) \ge m^{-\alpha}(3\log m)^{-d} - c_3 m^{-1}$. [sent-653, score-0.265]
94 Using the Markov inequality for both terms we obtain $\mathrm{IP}_m\big(\mathrm{Leb}_d(\tilde{G}_m \,\triangle\, \Gamma) > c_0 (3 \log m)^{-d}\big) = O(m^{-\alpha})$, and $\mathrm{IP}_m\big(\mathrm{Leb}_d(\tilde{G}_m \cap \Gamma^c) \ge c_5 m^{-\alpha}(\log m)^{-d}\big) = O(m^{-\alpha})$, where we used the fact that $\tilde{G}_m$ is consistent from inside with rate $m^{-\alpha}$. [sent-657, score-0.149]
95 We have $\tilde{\Gamma}_\ell \cap \Gamma^c = \{x \in \mathcal{X} : \hat{p}_m(x) \ge \lambda + \ell,\ p(x) < \lambda\} \subset \{x \in \mathcal{X} : |\hat{p}_m(x) - p(x)| \ge \ell\}$. [sent-690, score-0.15]
96 Consider the following decomposition, where we drop the dependence on $x$ for notational convenience: $\tilde{\Gamma}_\ell^c \cap \Gamma = B_1 \cup B_2$, where $B_1 = \{\hat{p}_m < \lambda + \ell,\ p \ge \lambda + 2\ell\} \subset \{|\hat{p}_m - p| \ge \ell\}$ and $B_2 = \{\hat{p}_m < \lambda + \ell,\ \lambda \le p < \lambda + 2\ell\} \subset \{|p - \lambda| \le 2\ell\}$. [sent-694, score-0.15]
97 A PAC-style model for learning from labeled and unlabeled data. [sent-720, score-0.18]
98 The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. [sent-761, score-0.18]
99 Fast rates for plug-in estimators of density level sets. [sent-864, score-0.184]
100 Estimating the cluster type of a density by analyzing the minimal spanning tree of a sample. [sent-892, score-0.203]
wordName wordTfidf (topN-words)
[('lebd', 0.619), ('gm', 0.458), ('iem', 0.199), ('unlabeled', 0.139), ('cluster', 0.138), ('ien', 0.136), ('px', 0.135), ('igollet', 0.126), ('clusters', 0.123), ('xu', 0.113), ('tj', 0.111), ('dpx', 0.111), ('ip', 0.108), ('rror', 0.105), ('um', 0.096), ('estimator', 0.092), ('eneralization', 0.089), ('hk', 0.086), ('clip', 0.084), ('emi', 0.079), ('pm', 0.075), ('proposition', 0.075), ('gn', 0.074), ('ounds', 0.073), ('density', 0.065), ('ird', 0.062), ('lebesgue', 0.059), ('tsybakov', 0.058), ('fix', 0.057), ('lassification', 0.057), ('inside', 0.055), ('log', 0.053), ('dbscan', 0.052), ('xr', 0.052), ('estimators', 0.049), ('gc', 0.048), ('ec', 0.048), ('mutually', 0.044), ('equivalence', 0.044), ('rejection', 0.044), ('cuevas', 0.042), ('rigollet', 0.042), ('sca', 0.042), ('consistent', 0.041), ('labeled', 0.041), ('level', 0.039), ('card', 0.038), ('cc', 0.038), ('dx', 0.036), ('jumps', 0.036), ('offset', 0.036), ('connectedness', 0.036), ('rhs', 0.036), ('xn', 0.035), ('supervised', 0.034), ('event', 0.034), ('seeger', 0.034), ('chapelle', 0.033), ('clipped', 0.032), ('homogeneous', 0.032), ('castelli', 0.031), ('gnj', 0.031), ('xl', 0.031), ('km', 0.031), ('rates', 0.031), ('components', 0.03), ('pooled', 0.03), ('tn', 0.03), ('measurable', 0.029), ('piecewise', 0.028), ('connected', 0.028), ('remark', 0.028), ('assumption', 0.027), ('er', 0.027), ('dc', 0.027), ('hartigan', 0.027), ('optics', 0.027), ('classi', 0.026), ('sm', 0.026), ('vote', 0.026), ('ir', 0.026), ('partition', 0.024), ('contradicts', 0.024), ('mammen', 0.024), ('hereafter', 0.024), ('hj', 0.024), ('sup', 0.023), ('marginal', 0.023), ('collection', 0.023), ('closed', 0.022), ('audibert', 0.022), ('boucheron', 0.022), ('formulation', 0.021), ('alch', 0.021), ('ankerst', 0.021), ('buc', 0.021), ('dummit', 0.021), ('ester', 0.021), ('exivity', 0.021), ('herbei', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
2 0.11185756 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
3 0.09437985 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
Author: Ping Li, Trevor J. Hastie, Kenneth W. Church
Abstract: For dimension reduction in the l1 norm, the method of Cauchy random projections multiplies the original data matrix A ∈ Rn×D with a random matrix R ∈ RD×k (k ≪ D) whose entries are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility result, one cannot hope to recover the pairwise l1 distances in A from B = A × R ∈ Rn×k using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computations, information retrieval, learning, and data mining. We study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). We derive tail bounds for the geometric mean estimators and establish that k = O(log n / ε²) suffices, with the constants explicitly given. Asymptotically (as k → ∞), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLE and propose approximating its distribution by an inverse Gaussian. Keywords: dimension reduction, l1 norm, Johnson-Lindenstrauss (JL) lemma, Cauchy random projections
4 0.081745632 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
Author: Gal Elidan, Iftach Nachman, Nir Friedman
Abstract: Bayesian networks in general, and continuous variable networks in particular, have become increasingly popular in recent years, largely due to advances in methods that facilitate automatic learning from data. Yet, despite these advances, the key task of learning the structure of such models remains a computationally intensive procedure, which limits most applications to parameter learning. This problem is even more acute when learning networks in the presence of missing values or hidden variables, a scenario that is part of many real-life problems. In this work we present a general method for speeding structure search for continuous variable networks with common parametric distributions. We efficiently evaluate the approximate merit of candidate structure modifications and apply time consuming (exact) computations only to the most promising ones, thereby achieving significant improvement in the running time of the search algorithm. Our method also naturally and efficiently facilitates the addition of useful new hidden variables into the network structure, a task that is typically considered both conceptually difficult and computationally prohibitive. We demonstrate our method on synthetic and real-life data sets, both for learning structure on fully and partially observable data, and for introducing new hidden variables during structure search. Keywords: Bayesian networks, structure learning, continuous variables, hidden variables
5 0.066794641 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
6 0.065089121 44 jmlr-2007-Large Margin Semi-supervised Learning
7 0.060328469 71 jmlr-2007-Refinable Kernels
8 0.052614637 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
9 0.05186177 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
10 0.0484557 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
11 0.043409385 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
12 0.041432597 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
13 0.040013555 72 jmlr-2007-Relational Dependency Networks
14 0.039200909 9 jmlr-2007-AdaBoost is Consistent
15 0.038897615 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
16 0.038169321 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
17 0.037527449 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
18 0.036409661 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
19 0.033314098 87 jmlr-2007-Undercomplete Blind Subspace Deconvolution
20 0.032415245 35 jmlr-2007-General Polynomial Time Decomposition Algorithms (Special Topic on the Conference on Learning Theory 2005)
topicId topicWeight
[(0, 0.202), (1, -0.067), (2, 0.081), (3, 0.078), (4, 0.145), (5, -0.146), (6, 0.142), (7, -0.007), (8, 0.101), (9, -0.126), (10, -0.011), (11, -0.121), (12, 0.062), (13, -0.042), (14, 0.09), (15, -0.002), (16, 0.239), (17, 0.076), (18, -0.001), (19, 0.167), (20, 0.018), (21, 0.35), (22, -0.047), (23, -0.146), (24, 0.065), (25, -0.108), (26, -0.147), (27, 0.026), (28, 0.032), (29, -0.054), (30, 0.109), (31, 0.08), (32, -0.002), (33, 0.101), (34, -0.002), (35, -0.022), (36, -0.007), (37, -0.071), (38, -0.085), (39, -0.018), (40, -0.028), (41, -0.09), (42, -0.057), (43, 0.063), (44, -0.097), (45, 0.068), (46, -0.064), (47, 0.055), (48, -0.028), (49, -0.084)]
simIndex simValue paperId paperTitle
same-paper 1 0.94864118 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
2 0.48731431 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
Author: Ping Li, Trevor J. Hastie, Kenneth W. Church
Abstract: For dimension reduction in the l1 norm, the method of Cauchy random projections multiplies the original data matrix A ∈ Rn×D with a random matrix R ∈ RD×k (k ≪ D) whose entries are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility result, one cannot hope to recover the pairwise l1 distances in A from B = A × R ∈ Rn×k using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computations, information retrieval, learning, and data mining. We study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). We derive tail bounds for the geometric mean estimators and establish that k = O(log n / ε²) suffices, with the constants explicitly given. Asymptotically (as k → ∞), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLE and propose approximating its distribution by an inverse Gaussian. Keywords: dimension reduction, l1 norm, Johnson-Lindenstrauss (JL) lemma, Cauchy random projections
3 0.40090153 44 jmlr-2007-Large Margin Semi-supervised Learning
Author: Junhui Wang, Xiaotong Shen
Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. In such a situation, how to enhance predictability of classification through unlabeled data is the focus. In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. Based on this methodology, we develop two specific machines involving support vector machines and ψ-learning, denoted as SSVM and SPSI, through difference convex programming. In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. Finally, our theoretical and numerical analyses indicate that the proposed methodology achieves the desired objective of delivering high performance in generalization, particularly against some strong performers. Keywords: generalization, grouping, sequential quadratic programming, support vectors
4 0.39509279 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
5 0.3879517 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
Author: Gal Elidan, Iftach Nachman, Nir Friedman
Abstract: Bayesian networks in general, and continuous variable networks in particular, have become increasingly popular in recent years, largely due to advances in methods that facilitate automatic learning from data. Yet, despite these advances, the key task of learning the structure of such models remains a computationally intensive procedure, which limits most applications to parameter learning. This problem is even more acute when learning networks in the presence of missing values or hidden variables, a scenario that is part of many real-life problems. In this work we present a general method for speeding structure search for continuous variable networks with common parametric distributions. We efficiently evaluate the approximate merit of candidate structure modifications and apply time consuming (exact) computations only to the most promising ones, thereby achieving significant improvement in the running time of the search algorithm. Our method also naturally and efficiently facilitates the addition of useful new hidden variables into the network structure, a task that is typically considered both conceptually difficult and computationally prohibitive. We demonstrate our method on synthetic and real-life data sets, both for learning structure on fully and partially observable data, and for introducing new hidden variables during structure search. Keywords: Bayesian networks, structure learning, continuous variables, hidden variables
6 0.26612574 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
7 0.25476027 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
9 0.23920822 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
10 0.23367006 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
11 0.21238136 87 jmlr-2007-Undercomplete Blind Subspace Deconvolution
12 0.20324343 71 jmlr-2007-Refinable Kernels
13 0.18861128 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
14 0.17137912 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
15 0.16278239 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
16 0.1516858 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
17 0.14970648 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
18 0.14870551 72 jmlr-2007-Relational Dependency Networks
19 0.1471608 9 jmlr-2007-AdaBoost is Consistent
20 0.14530873 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
topicId topicWeight
[(4, 0.024), (8, 0.041), (10, 0.022), (12, 0.031), (13, 0.013), (15, 0.015), (28, 0.079), (40, 0.038), (45, 0.012), (48, 0.047), (60, 0.044), (64, 0.354), (80, 0.018), (85, 0.041), (98, 0.126)]
simIndex simValue paperId paperTitle
same-paper 1 0.70115507 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
2 0.42785549 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
3 0.42761114 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
Author: Jean-Yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. Keywords: statistical learning theory, PAC-Bayes theorems, generalization error bounds
Author: Onur C. Hamsici, Aleix M. Martinez
Abstract: Many feature representations, as in genomics, describe directional data where all feature vectors share a common norm. In other cases, as in computer vision, a norm or variance normalization step, where all feature vectors are normalized to a common length, is generally used. These representations and pre-processing step map the original data from R p to the surface of a hypersphere S p−1 . Such representations should then be modeled using spherical distributions. However, the difficulty associated with such spherical representations has prompted researchers to model their spherical data using Gaussian distributions instead—as if the data were represented in R p rather than S p−1 . This opens the question to whether the classification results calculated with the Gaussian approximation are the same as those obtained when using the original spherical distributions. In this paper, we show that in some particular cases (which we named spherical-homoscedastic) the answer to this question is positive. In the more general case however, the answer is negative. For this reason, we further investigate the additional error added by the Gaussian modeling. We conclude that the more the data deviates from spherical-homoscedastic, the less advisable it is to employ the Gaussian approximation. We then show how our derivations can be used to define optimal classifiers for spherical-homoscedastic distributions. By using a kernel which maps the original space into one where the data adapts to the spherical-homoscedastic model, we can derive non-linear classifiers with potential applications in a large number of problems. We conclude this paper by demonstrating the uses of spherical-homoscedasticity in the classification of images of objects, gene expression sequences, and text data. Keywords: directional data, spherical distributions, normal distributions, norm normalization, linear and non-linear classifiers, computer vision
5 0.423639 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
Author: Francesco Dinuzzo, Marta Neve, Giuseppe De Nicolao, Ugo Pietro Gianazza
Abstract: Support Vector Regression (SVR) for discrete data is considered. An alternative formulation of the representer theorem is derived. This result is based on the newly introduced notion of pseudoresidual and the use of subdifferential calculus. The representer theorem is exploited to analyze the sensitivity properties of ε-insensitive SVR and introduce the notion of approximate degrees of freedom. The degrees of freedom are shown to play a key role in the evaluation of the optimism, that is the difference between the expected in-sample error and the expected empirical risk. In this way, it is possible to define a C p -like statistic that can be used for tuning the parameters of SVR. The proposed tuning procedure is tested on a simulated benchmark problem and on a real world problem (Boston Housing data set). Keywords: statistical learning, reproducing kernel Hilbert spaces, support vector machines, representer theorem, regularization theory
6 0.42303228 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
7 0.41794103 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
8 0.41766906 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
9 0.4175964 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
10 0.4158344 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
11 0.413041 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
12 0.41085094 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
13 0.41051668 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
14 0.41049808 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
16 0.40828788 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
17 0.40788797 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
18 0.40540203 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
20 0.40437907 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis