jmlr jmlr2007 jmlr2007-36 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
Reference: text
sentIndex sentText sentNum sentScore
1 These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. [sent-6, score-0.189]
2 Seeger (2000) proposed the well-known cluster assumption as a reasonable one. [sent-7, score-0.165]
3 We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. [sent-8, score-0.363]
4 The methods try to give an answer to the question: “How to improve classification accuracy using unlabeled data together with the labeled data?” [sent-11, score-0.180]
5 The first one consists in using the unlabeled data to reduce the complexity of the problem in a broad sense. [sent-15, score-0.139]
6 In that case, unlabeled data is used to measure the compatibility between the classifiers and reduces the complexity of the set of candidate classifiers (see, for example, Balcan and Blum, 2005; Blum and Mitchell, 1998). [sent-17, score-0.139]
7 It assumes that the data contains clusters that have homogeneous labels and the unlabeled observations are used to identify these clusters. [sent-21, score-0.294]
8 Most of these methods use Hartigan’s (1975) definition of clusters, namely the connected components of the density level sets. [sent-25, score-0.162]
9 In the same spirit, Tipping (1999) and Rattray (2000) propose methods that learn a distance using unlabeled data in order to have intra-cluster distances smaller than inter-cluster distances. [sent-29, score-0.139]
10 The whole family of graph-based methods also aims at using unlabeled data to learn the distances between points. [sent-30, score-0.139]
11 Finally, we mention kernel methods, where unlabeled data are used to build the kernel. [sent-33, score-0.139]
12 Recalling that the kernel measures proximity between points, such methods can also be viewed as learning a distance using unlabeled data (see Bousquet et al. [sent-34, score-0.139]
13 The cluster assumption can be interpreted in another way, that is, as the requirement that the decision boundary has to lie in low density regions. [sent-37, score-0.23]
14 Although most methods make, sometimes implicitly, the cluster assumption, no formulation in probabilistic terms has been provided so far. [sent-45, score-0.159]
15 We also discuss what can and cannot be done using unlabeled data. [sent-47, score-0.139]
16 1 Outline of the Paper After describing the model, we formulate the cluster assumption and discuss why and how it can improve classification performance in Section 2. [sent-50, score-0.165]
17 1 which essentially states that the effect of unlabeled data on the rates of convergence cannot be observed on the whole excess-risk. [sent-52, score-0.17]
18 Indeed, such a population case corresponds in some way to the case where the amount of unlabeled data is infinite. [sent-55, score-0.139]
19 Section 4 contains the main result: after having defined the clusters in terms of density level sets, we propose an algorithm for which we derive rates of convergence for the cluster excess-risk as a measure of performance. [sent-56, score-0.396]
20 An example of consistent density level set estimators is given in Section 5. [sent-57, score-0.194]
21 In many applications, a large amount of unlabeled data is available together with a small set of labeled data $(X_1, Y_1), \dots, (X_n, Y_n)$. [sent-85, score-0.18]
22 However, simulations indicate that much faster rates should be attainable when unlabeled data is used to identify homogeneous clusters. [sent-101, score-0.202]
23 Of course, it is well known that in order to make use of the additional unlabeled observations, we have to make an assumption on the dependence between the marginal distribution of X and the joint distribution of (X,Y ) (see, for example, Zhang and Oles, 2000). [sent-102, score-0.189]
24 Seeger (2000) formulated the rather intuitive cluster assumption as follows: Two points $x, x' \in \mathcal{X}$ should have the same label $y$ if there is a path between them which passes only through regions of relatively high $P_X$. [sent-103, score-0.165]
25 Assume for the moment that we know what the clusters are, so that we do not have to define them in terms of density level sets. [sent-110, score-0.227]
26 We now make the assumption that the $T_j$'s are clusters of homogeneous data. [sent-116, score-0.182]
27 It is not hard to see that the cluster assumption (CA) is equivalent to the following assumption. [sent-127, score-0.165]
28 Since the cluster assumption (CA) says nothing about what happens outside of the set $\mathcal{C}$, we can only perform supervised classification on $\mathcal{C}^c$. [sent-140, score-0.199]
29 Consider a classifier $\hat{g}_{n,m}$ built from the labeled and unlabeled samples $(\mathcal{X}_l, \mathcal{X}_u)$ pooled together. [sent-141, score-0.21]
30 Since the unlabeled sample is of no help to classify points in $\mathcal{C}^c$, any reasonable classifier should be based on the sample $\mathcal{X}_l$, so that $\hat{g}_{n,m}(x) = \hat{g}_n(x)$ for all $x \in \mathcal{C}^c$, and we have $\mathcal{E}(\hat{g}_{n,m}) \ge \mathbb{E}_n \int_{\mathcal{C}^c} |2\eta(x) - 1|\, \mathbf{1}\{\hat{g}_n(x) \neq g^*(x)\}\, dP_X(x)$. [sent-145, score-0.213]
31 1 is that even when the cluster assumption (CA) is valid, the unlabeled data are useless for improving the rates of convergence. [sent-156, score-0.335]
32 To observe the effect of unlabeled data on the rates of convergence, we have to consider the cluster excess-risk of a classifier $\hat{g}_{n,m}$ defined by $\mathcal{E}_{\mathcal{C}}(\hat{g}_{n,m}) = \mathbb{E}_{n,m} \int_{\mathcal{C}} |2\eta(x) - 1|\, \mathbf{1}\{\hat{g}_{n,m}(x) \neq g^*(x)\}\, dP_X(x)$. [sent-162, score-0.308]
33 ˆ We now propose a method to obtain good upper bounds on the cluster excess-risk, taking advantage of the cluster assumption (CA). [sent-172, score-0.303]
34 Under the cluster assumption (CA), the function $x \mapsto \eta(x) - 1/2$ has constant sign on each $T_j$. [sent-183, score-0.165]
35 For any $j \ge 1$, define the random variable $Z_n^j = \sum_{i=1}^{n} (2Y_i - 1)\, \mathbf{1}\{X_i \in T_j\}$, and denote by $\hat{g}_n^j$ the function $\hat{g}_n^j(x) = \mathbf{1}\{Z_n^j > 0\}$, for all $x \in T_j$. [sent-187, score-0.148]
36 Consider the classifier defined on $\mathcal{C}$ by $\hat{g}_n(x) = \sum_{j \ge 1} \hat{g}_n^j(x)\, \mathbf{1}\{x \in T_j\}$, $x \in \mathcal{C}$. [sent-188, score-0.148]
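The per-cluster majority vote above is simple to implement. Below is a minimal sketch, not code from the paper: the helper `cluster_of` and the data layout are assumptions for illustration. It accumulates $Z_n^j$ from the labeled sample and returns the resulting classifier $\hat{g}_n$.

```python
import numpy as np

def majority_vote_classifier(X_labeled, y_labeled, cluster_of):
    """Label each known cluster T_j by the sign of Z_n^j = sum_i (2 Y_i - 1) 1{X_i in T_j}.

    y_labeled takes values in {0, 1}; cluster_of(x) returns the index of the cluster
    containing x, or -1 if x lies outside the union of the clusters (hypothetical helper).
    """
    signs = 2 * np.asarray(y_labeled) - 1                 # map labels {0,1} -> {-1,+1}
    votes = {}
    for x, s in zip(X_labeled, signs):
        j = cluster_of(x)
        if j >= 0:
            votes[j] = votes.get(j, 0) + s                # accumulate Z_n^j

    def predict(x):
        j = cluster_of(x)
        if j < 0 or j not in votes:
            return None                                   # outside the clusters: (CA) gives no guarantee
        return 1 if votes[j] > 0 else 0                   # g_n(x) = 1{Z_n^j > 0}

    return predict
```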
37 The following theorem gives rates of convergence for the cluster excess-risk of the classifier $\hat{g}_n$ under (CA) that can be exponential in $n$ under a mild additional assumption. [sent-189, score-0.169]
38 However, we need estimators of the clusters $T_j$, $j = 1, 2, \dots$ [sent-197, score-0.172]
39 In the next section we provide the main result on semi-supervised learning, that is, when the clusters are unknown but we can estimate them using the unlabeled sample $\mathcal{X}_u$. [sent-201, score-0.262]
40 We begin by giving a definition of the clusters in terms of density level sets. [sent-210, score-0.227]
41 The cluster assumption also involves a notion of connectedness of a set. [sent-217, score-0.201]
42 It can easily be shown that R is an equivalence relation, and its equivalence classes are called the connected components of C. [sent-220, score-0.146]
43 At this point, in view of the formulation of the cluster assumption, it is very tempting to define the clusters as the connected components of C. [sent-221, score-0.34]
44 In the next proposition we prove that, given a certain scale s > 0, it is possible to split an $r_0$-standard and closed set C into a unique partition that is coarser than the partition defined by the connected components of C, and that this partition is finite for such sets. [sent-268, score-0.227]
45 This is consistent with the fact that the finite number of unlabeled observations allows us to have only a blurred vision of the clusters. [sent-286, score-0.18]
46 We now formulate the cluster assumption when the clusters are defined in terms of density level sets. [sent-290, score-0.392]
47 , $H_J$, it is therefore natural to use a suitable reordering of the $(s_0 + u_m)$-connected components of $\hat{G}_m$, where $u_m$ is a positive sequence that tends to 0 as $m$ tends to infinity. [sent-326, score-0.222]
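In practice, when the estimate $\hat{G}_m$ is represented by a finite set of points (for instance the unlabeled points retained by a plug-in rule), its $s$-connected components can be extracted by single-linkage grouping at scale $s$. The sketch below only illustrates that idea under this point-cloud assumption; it uses Euclidean chains between sample points rather than the paper's $d_\infty$ paths inside the set.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def s_connected_components(points, s):
    """Group points into s-connected components: two points belong to the same
    component if they can be joined by a chain of points whose consecutive
    distances are at most s (single linkage at scale s)."""
    points = np.asarray(points)
    n = len(points)
    pairs = cKDTree(points).query_pairs(r=s, output_type='ndarray')  # all pairs within distance s
    adj = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
    n_components, labels = connected_components(adj, directed=False)
    return n_components, labels
```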
48 Since the measure of performance $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \,\triangle\, \Gamma)$ is defined up to a set of null Lebesgue measure, it may be the case that even an estimator $\hat{G}_m$ that satisfies $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \,\triangle\, \Gamma) = 0$ has only one $(s_0 + u_m)$-connected component whereas $\Gamma$ has several $s_0$-connected components. [sent-327, score-0.218]
49 Note that the performance of a density level set estimator $\hat{G}_m$ is measured by the quantity $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \,\triangle\, \Gamma) = \mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m^c \cap \Gamma) + \mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \cap \Gamma^c)$. (7) [sent-335, score-0.196]
50 For some estimators, such as the offset plug-in density level set estimators presented in Section 5, we can prove that the dominant term in the RHS of (7) is $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m^c \cap \Gamma)$. [sent-336, score-0.189]
51 We say that the estimator $\hat{G}_m$ is consistent from inside at rate $m^{-\alpha}$ if it satisfies $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \,\triangle\, \Gamma) = O(m^{-\alpha})$ and $\mathbb{E}_m \mathrm{Leb}_d(\hat{G}_m \cap \Gamma^c) = O(m^{-2\alpha})$. [sent-341, score-0.188]
52 The following proposition ensures that the clipped version of an estimator that is consistent from inside is also consistent from inside at the same rate. [sent-342, score-0.391]
53 Assume that $\mathcal{X}$ is bounded and let $\hat{G}_m$ be an estimator of $\Gamma$ that is consistent from inside at rate $m^{-\alpha}$. [sent-345, score-0.188]
54 Then, the clipped estimator $\tilde{G}_m = \hat{G}_m \setminus \mathrm{Clip}(\hat{G}_m)$ is also consistent from inside at rate $m^{-\alpha}$ and has a finite number $\tilde{K}_m \le \mathrm{Leb}_d(\mathcal{X})\, m^{\alpha}$ of $(s_0 + u_m)$-connected components that have Lebesgue measure greater than or equal to $m^{-\alpha}$. [sent-346, score-0.25]
55 Moreover, the $(s_0 + u_m)$-connected components of $\tilde{G}_m$ are mutually $(s_0 + \theta u_m)$-separated for any $\theta \in (0, 1)$. [sent-347, score-0.17]
56 3 Labeling the Clusters From the strong cluster assumption (SCA) the clusters are homogeneous regions. [sent-376, score-0.32]
57 Use the unlabeled data $\mathcal{X}_u$ to construct an estimator $\hat{G}_m$ of $\Gamma$ that is consistent from inside at rate $m^{-\alpha}$. [sent-381, score-0.272]
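Stitching the steps together, a hedged end-to-end sketch of this two-stage procedure could look as follows. Here `estimate_level_set` refers to the kernel plug-in sketch given after the kernel density estimator in Section 5 below, and `s_connected_components` and `majority_vote_classifier` refer to the sketches above; the nearest-point assignment of a query to a component is an illustrative simplification, not the paper's construction.

```python
import numpy as np

def semi_supervised_classify(X_labeled, y_labeled, X_unlabeled, x_new,
                             lam, offset, scale, bandwidth):
    """1) Estimate the level set {p >= lam} from the unlabeled sample (offset plug-in).
       2) Split the estimate into connected components at the given scale: the clusters.
       3) Label each cluster by majority vote over the labeled points it contains.
       4) Classify x_new by the label of its cluster (None outside the clusters)."""
    level_set_pts = estimate_level_set(X_unlabeled, lam + offset, bandwidth)   # step 1
    _, comp = s_connected_components(level_set_pts, scale)                     # step 2

    def cluster_of(x):
        d = np.linalg.norm(level_set_pts - np.asarray(x), axis=1)
        return int(comp[d.argmin()]) if d.min() <= scale else -1

    predict = majority_vote_classifier(X_labeled, y_labeled, cluster_of)       # step 3
    return predict(x_new)                                                      # step 4
```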
58 Fix $\alpha > 0$ and let $\hat{G}_m$ be an estimator of the density level set $\Gamma$ that is consistent from inside at rate $m^{-\alpha}$. [sent-388, score-0.292]
59 Plug-in Rules for Density Level Set Estimation Fix $\lambda > 0$ and recall that our goal is to use the unlabeled sample $\mathcal{X}_u$ of size $m$ to construct an estimator $\hat{G}_m$ of $\Gamma = \Gamma(\lambda) = \{x \in \mathcal{X} : p(x) \ge \lambda\}$ that is consistent from inside at rate $m^{-\alpha}$ for some $\alpha > 0$ that should be as large as possible. [sent-408, score-0.327]
60 A simple and intuitive way to achieve this goal is to use plug-in estimators of $\Gamma$ defined by $\hat{\Gamma} = \hat{\Gamma}(\lambda) = \{x \in \mathcal{X} : \hat{p}_m(x) \ge \lambda\}$, where $\hat{p}_m$ is some estimator of $p$. [sent-409, score-0.291]
61 A straightforward generalization are the offset plug-in estimators of $\Gamma(\lambda)$, defined by $\tilde{\Gamma}_\ell = \tilde{\Gamma}_\ell(\lambda) = \{x \in \mathcal{X} : \hat{p}_m(x) \ge \lambda + \ell\}$, where $\ell > 0$ is an offset. [sent-410, score-0.16]
62 Keeping in mind that we want estimators that are consistent from inside, we are going to consider a sufficiently large offset $\ell = \ell(m)$. [sent-412, score-0.181]
63 One advantage of plug-in rules over direct methods is that once we have an estimator $\hat{p}_m$, we can compute the whole collection $\{\tilde{\Gamma}_\ell(\lambda),\ \lambda > 0\}$, which might be of interest for the user who wants to try several values of $\lambda$. [sent-416, score-0.19]
64 A density estimator can be parametric, typically based on a mixture model, or nonparametric such as histograms or kernel density estimators. [sent-418, score-0.222]
65 However, it is not our intent to propose here the best clustering algorithm or the best density level set estimator and we present a simple proof of convergence for offset plug-in rules only for the sake of completeness. [sent-421, score-0.232]
66 Moreover, it ensures that when the offset is suitably chosen, the plug-in estimator is consistent from inside. [sent-428, score-0.169]
67 Let $\hat{p}_m$ be an estimator of the density $p$ based on the sample $\mathcal{X}_u$ of size $m \ge 1$ and let $\mathcal{P}$ be a class of densities on $\mathcal{X}$. [sent-431, score-0.232]
68 (13) Then the offset plug-in estimator $\tilde{\Gamma}_\ell$ is consistent from inside at rate $m^{-\gamma a/2}$ for any $p \in \mathcal{P}$. [sent-434, score-0.188]
69 Consider a kernel density estimator $\hat{p}_m^K$ based on the sample $\mathcal{X}_u$ defined by $\hat{p}_m^K(x) = \frac{1}{m h^d} \sum_{i=n+1}^{n+m} K\!\left(\frac{X_i - x}{h}\right)$, $x \in \mathcal{X}$, where $h > 0$ is the bandwidth parameter and $K : \mathbb{R}^d \to \mathbb{R}$ is a kernel. [sent-435, score-0.157]
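As an illustration of the offset plug-in rule built on such a kernel density estimator, here is a hedged sketch: the Gaussian kernel is an arbitrary choice (the paper allows general kernels satisfying its conditions), and representing the level-set estimate by the unlabeled points whose estimated density clears the threshold is a computational simplification, not the paper's definition.

```python
import numpy as np

def kde(x, X_unlabeled, h):
    """Kernel density estimate p_hat(x) = (1 / (m h^d)) * sum_i K((X_i - x) / h),
    with a Gaussian kernel K as an illustrative choice."""
    X_unlabeled = np.asarray(X_unlabeled)
    m, d = X_unlabeled.shape
    u = (X_unlabeled - np.asarray(x)) / h
    K = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (m * h ** d)

def estimate_level_set(X_unlabeled, threshold, h):
    """Offset plug-in rule restricted to the sample: keep the unlabeled points whose
    estimated density is at least threshold = lambda + offset."""
    X_unlabeled = np.asarray(X_unlabeled)
    dens = np.array([kde(x, X_unlabeled, h) for x in X_unlabeled])
    return X_unlabeled[dens >= threshold]
```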
70 Discussion We proposed a formulation of the cluster assumption in probabilistic terms. [sent-439, score-0.186]
71 This formulation relies on Hartigan’s (1975) definition of clusters but it can be modified to match other definitions of clusters. [sent-440, score-0.144]
72 Based on these remarks, we defined the cluster excess-risk on which we observe the effect of unlabeled data. [sent-442, score-0.277]
73 Finally we proved that when we have consistent estimators of the clusters, it is possible to achieve exponential rates of convergence for the cluster excess-risk. [sent-443, score-0.259]
74 Note that our definition of clusters is parametrized by λ which is left to the user, depending on his trust in the cluster assumption. [sent-445, score-0.261]
75 In terms of the cluster assumption, it means that when λ decreases to 0, the assumption (SCA) concerns bigger and bigger sets Γ(λ) and in that sense, it becomes more and more restrictive. [sent-447, score-0.165]
76 As a result, the parameter λ can be considered as a level of confidence characterizing to which extent the cluster assumption is valid for the distribution P. [sent-448, score-0.204]
77 These three algorithms implement clustering using a definition of clusters that involves density level sets and a certain notion of connectedness. [sent-459, score-0.227]
78 The label for X is then predicted using a majority vote on the labeled instances that are assigned to the same cluster as X. [sent-465, score-0.205]
79 Observe that, unlike the method described in the paper, the clusters depend on the labeled instances $(X_1, \dots, X_n)$. [sent-466, score-0.164]
80 Since all three algorithms are distance based, we could run them only on the unlabeled instances and then assign each labeled instance and the new instance to the same cluster as its nearest neighbor. [sent-471, score-0.318]
81 We now describe more precisely why these algorithms produce estimated clusters that are related to the sm -connected components of a plug-in estimator of the density level set. [sent-473, score-0.375]
82 Also, define the kernel density estimator $\hat{p}_m$ by $\hat{p}_m(x) = \frac{1}{m \varepsilon^d} \sum_{j=1}^{m} K\!\left(\frac{x - X_j}{\varepsilon}\right)$, where $K : \mathbb{R}^d \to \mathbb{R}$ is defined by $K(x) = \mathbf{1}\{\|x\| \le 1\}$ for any $x \in \mathbb{R}^d$. [sent-489, score-0.307]
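With this indicator kernel, the density estimate at a point is just a rescaled count of the unlabeled points within distance $\varepsilon$ of it, which is essentially the quantity DBSCAN thresholds through its minPts parameter. Below is a minimal sketch of this ball-counting estimate; the KD-tree is an implementation choice for illustration, not part of the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def ball_count_density(X_unlabeled, eps):
    """Plug-in density estimate with the indicator kernel K(x) = 1{||x|| <= 1}:
    p_hat(X_i) = #{j : ||X_i - X_j|| <= eps} / (m * eps^d)."""
    X_unlabeled = np.asarray(X_unlabeled)
    m, d = X_unlabeled.shape
    tree = cKDTree(X_unlabeled)
    counts = np.array([len(tree.query_ball_point(x, eps)) for x in X_unlabeled])
    return counts / (m * eps ** d)
```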
83 OPTICS: Both of the previous algorithms suffer from a major drawback that is inherent to our definition of clusters, which is based on a single global level when determining the density level sets. [sent-495, score-0.281]
84 Indeed, in many real data sets, some clusters can only be identified using several density levels. [sent-496, score-0.188]
85 Since it does not implement our method, we do not describe the algorithm in detail, but we mention it because it implements a more suitable definition of clusters that is also based on connectedness and density level sets. [sent-498, score-0.263]
86 It uses a nearest neighbor density estimator as a workhorse and uses a single input parameter that corresponds to the scale $s_0$. [sent-500, score-0.157]
87 This paper is an attempt to give a proper mathematical framework for the cluster assumption proposed in Seeger (2000). [sent-501, score-0.165]
88 As mentioned above, the definition of clusters that we use here is one among several available, and it could be interesting to modify the formulation of the cluster assumption to match other definitions of clusters. [sent-502, score-0.309]
89 In particular, the definition of clusters as $s_0$-connected components of the $\lambda$-level set of the density leaves open the problem of choosing $\lambda$ correctly. [sent-503, score-0.233]
90 To prove that the $(s_0 + u_m)$-connected components of $\tilde{G}_m$ are mutually $s_0$-separated, let $\tilde{T}_1 \neq \tilde{T}_2$ be two $(s_0 + u_m)$-connected components of $\tilde{G}_m$ and fix $x_1 \in \tilde{T}_1$, $x_2 \in \tilde{T}_2$. [sent-588, score-0.2]
91 Thus $d_\infty(\tilde{T}_1, \tilde{T}_2) \ge s_0 + u_m > s_0 + \theta u_m$ for any $u_m > 0$, $\theta \in (0, 1)$. [sent-590, score-0.192]
92 Thus, for any $m \ge m_0$, it holds that $(3 \log m)^{-1} \le s_0 \wedge r_0$ and $A_1(j) \subset \{\mathrm{Leb}_d[\mathcal{B}(x, (3\log m)^{-1}) \cap T_j] \le \mathrm{Leb}_d[\tilde{G}_m \,\triangle\, \Gamma]\} \subset \{\mathrm{Leb}_d[\tilde{G}_m \,\triangle\, \Gamma] \ge c_0 (3\log m)^{-d}\}$, where in the last inclusion we used the fact that $\Gamma$ is $r_0$-standard. [sent-603, score-0.159]
93 On the other hand, we have $\mathrm{Leb}_d(\tilde{G}_m \cap \mathcal{B}(y_j, (3\log m)^{-1})) = \mathrm{Leb}_d(\hat{G}_m \cap \mathcal{B}(y_j, (3\log m)^{-1})) - \mathrm{Leb}_d(\mathrm{Clip}(\hat{G}_m) \cap \mathcal{B}(y_j, (3\log m)^{-1})) \ge m^{-\alpha}(3\log m)^{-d} - \mathrm{Leb}_d(\hat{G}_m \cap \Gamma^c) \ge m^{-\alpha}(3\log m)^{-d} - c_3 m^{-1}$. [sent-653, score-0.265]
94 Using the Markov inequality for both terms we obtain $\mathrm{IP}_m\big(\mathrm{Leb}_d(\tilde{G}_m \,\triangle\, \Gamma) > c_0 (3 \log m)^{-d}\big) = O(m^{-\alpha})$, and $\mathrm{IP}_m\big(\mathrm{Leb}_d(\tilde{G}_m \cap \Gamma^c) \ge c_5 m^{-\alpha}(\log m)^{-d}\big) = O(m^{-\alpha})$, where we used the fact that $\tilde{G}_m$ is consistent from inside with rate $m^{-\alpha}$. [sent-657, score-0.149]
95 We have $\tilde{\Gamma}_\ell \cap \Gamma^c = \{x \in \mathcal{X} : \hat{p}_m(x) \ge \lambda + \ell,\ p(x) < \lambda\} \subset \{x \in \mathcal{X} : |\hat{p}_m(x) - p(x)| \ge \ell\}$. [sent-690, score-0.15]
96 Consider the following decomposition, where we drop the dependence on $x$ for notational convenience: $\tilde{\Gamma}_\ell^c \cap \Gamma = B_1 \cup B_2$, where $B_1 = \{\hat{p}_m < \lambda + \ell,\ p \ge \lambda + 2\ell\} \subset \{|\hat{p}_m - p| \ge \ell\}$ and $B_2 = \{\hat{p}_m < \lambda + \ell,\ \lambda \le p < \lambda + 2\ell\} \subset \{|p - \lambda| \le 2\ell\}$. [sent-694, score-0.15]
97 A PAC-style model for learning from labeled and unlabeled data. [sent-720, score-0.18]
98 The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. [sent-761, score-0.18]
99 Fast rates for plug-in estimators of density level sets. [sent-864, score-0.184]
100 Estimating the cluster type of a density by analyzing the minimal spanning tree of a sample. [sent-892, score-0.203]
wordName wordTfidf (topN-words)
[('lebd', 0.619), ('gm', 0.458), ('iem', 0.199), ('unlabeled', 0.139), ('cluster', 0.138), ('ien', 0.136), ('px', 0.135), ('igollet', 0.126), ('clusters', 0.123), ('xu', 0.113), ('tj', 0.111), ('dpx', 0.111), ('ip', 0.108), ('rror', 0.105), ('um', 0.096), ('estimator', 0.092), ('eneralization', 0.089), ('hk', 0.086), ('clip', 0.084), ('emi', 0.079), ('pm', 0.075), ('proposition', 0.075), ('gn', 0.074), ('ounds', 0.073), ('density', 0.065), ('ird', 0.062), ('lebesgue', 0.059), ('tsybakov', 0.058), ('fix', 0.057), ('lassification', 0.057), ('inside', 0.055), ('log', 0.053), ('dbscan', 0.052), ('xr', 0.052), ('estimators', 0.049), ('gc', 0.048), ('ec', 0.048), ('mutually', 0.044), ('equivalence', 0.044), ('rejection', 0.044), ('cuevas', 0.042), ('rigollet', 0.042), ('sca', 0.042), ('consistent', 0.041), ('labeled', 0.041), ('level', 0.039), ('card', 0.038), ('cc', 0.038), ('dx', 0.036), ('jumps', 0.036), ('offset', 0.036), ('connectedness', 0.036), ('rhs', 0.036), ('xn', 0.035), ('supervised', 0.034), ('event', 0.034), ('seeger', 0.034), ('chapelle', 0.033), ('clipped', 0.032), ('homogeneous', 0.032), ('castelli', 0.031), ('gnj', 0.031), ('xl', 0.031), ('km', 0.031), ('rates', 0.031), ('components', 0.03), ('pooled', 0.03), ('tn', 0.03), ('measurable', 0.029), ('piecewise', 0.028), ('connected', 0.028), ('remark', 0.028), ('assumption', 0.027), ('er', 0.027), ('dc', 0.027), ('hartigan', 0.027), ('optics', 0.027), ('classi', 0.026), ('sm', 0.026), ('vote', 0.026), ('ir', 0.026), ('partition', 0.024), ('contradicts', 0.024), ('mammen', 0.024), ('hereafter', 0.024), ('hj', 0.024), ('sup', 0.023), ('marginal', 0.023), ('collection', 0.023), ('closed', 0.022), ('audibert', 0.022), ('boucheron', 0.022), ('formulation', 0.021), ('alch', 0.021), ('ankerst', 0.021), ('buc', 0.021), ('dummit', 0.021), ('ester', 0.021), ('exivity', 0.021), ('herbei', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
2 0.11185756 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
3 0.09437985 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
Author: Ping Li, Trevor J. Hastie, Kenneth W. Church
Abstract: For dimension reduction in the l1 norm, the method of Cauchy random projections multiplies the original data matrix A ∈ Rn×D with a random matrix R ∈ RD×k (k ≪ D) whose entries are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility result, one cannot hope to recover the pairwise l1 distances in A from B = A × R ∈ Rn×k using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computations, information retrieval, learning, and data mining. We study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). We derive tail bounds for the geometric mean estimators and establish that k = O(log n / ε²) suffices, with the constants explicitly given. Asymptotically (as k → ∞), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLE and propose approximating its distribution by an inverse Gaussian. Keywords: dimension reduction, l1 norm, Johnson-Lindenstrauss (JL) lemma, Cauchy random projections
4 0.081745632 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
Author: Gal Elidan, Iftach Nachman, Nir Friedman
Abstract: Bayesian networks in general, and continuous variable networks in particular, have become increasingly popular in recent years, largely due to advances in methods that facilitate automatic learning from data. Yet, despite these advances, the key task of learning the structure of such models remains a computationally intensive procedure, which limits most applications to parameter learning. This problem is even more acute when learning networks in the presence of missing values or hidden variables, a scenario that is part of many real-life problems. In this work we present a general method for speeding structure search for continuous variable networks with common parametric distributions. We efficiently evaluate the approximate merit of candidate structure modifications and apply time consuming (exact) computations only to the most promising ones, thereby achieving significant improvement in the running time of the search algorithm. Our method also naturally and efficiently facilitates the addition of useful new hidden variables into the network structure, a task that is typically considered both conceptually difficult and computationally prohibitive. We demonstrate our method on synthetic and real-life data sets, both for learning structure on fully and partially observable data, and for introducing new hidden variables during structure search. Keywords: Bayesian networks, structure learning, continuous variables, hidden variables
5 0.066794641 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
Author: Yiming Ying, Ding-Xuan Zhou
Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
6 0.065089121 44 jmlr-2007-Large Margin Semi-supervised Learning
7 0.060328469 71 jmlr-2007-Refinable Kernels
8 0.052614637 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
9 0.05186177 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
10 0.0484557 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
11 0.043409385 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
12 0.041432597 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
13 0.040013555 72 jmlr-2007-Relational Dependency Networks
14 0.039200909 9 jmlr-2007-AdaBoost is Consistent
15 0.038897615 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
16 0.038169321 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
17 0.037527449 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
18 0.036409661 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
19 0.033314098 87 jmlr-2007-Undercomplete Blind Subspace Deconvolution
20 0.032415245 35 jmlr-2007-General Polynomial Time Decomposition Algorithms (Special Topic on the Conference on Learning Theory 2005)
topicId topicWeight
[(0, 0.202), (1, -0.067), (2, 0.081), (3, 0.078), (4, 0.145), (5, -0.146), (6, 0.142), (7, -0.007), (8, 0.101), (9, -0.126), (10, -0.011), (11, -0.121), (12, 0.062), (13, -0.042), (14, 0.09), (15, -0.002), (16, 0.239), (17, 0.076), (18, -0.001), (19, 0.167), (20, 0.018), (21, 0.35), (22, -0.047), (23, -0.146), (24, 0.065), (25, -0.108), (26, -0.147), (27, 0.026), (28, 0.032), (29, -0.054), (30, 0.109), (31, 0.08), (32, -0.002), (33, 0.101), (34, -0.002), (35, -0.022), (36, -0.007), (37, -0.071), (38, -0.085), (39, -0.018), (40, -0.028), (41, -0.09), (42, -0.057), (43, 0.063), (44, -0.097), (45, 0.068), (46, -0.064), (47, 0.055), (48, -0.028), (49, -0.084)]
simIndex simValue paperId paperTitle
same-paper 1 0.94864118 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
2 0.48731431 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
Author: Ping Li, Trevor J. Hastie, Kenneth W. Church
Abstract: For dimension reduction in the l1 norm, the method of Cauchy random projections multiplies the original data matrix A ∈ Rn×D with a random matrix R ∈ RD×k (k ≪ D) whose entries are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility result, one cannot hope to recover the pairwise l1 distances in A from B = A × R ∈ Rn×k using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computations, information retrieval, learning, and data mining. We study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). We derive tail bounds for the geometric mean estimators and establish that k = O(log n / ε²) suffices, with the constants explicitly given. Asymptotically (as k → ∞), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLE and propose approximating its distribution by an inverse Gaussian. Keywords: dimension reduction, l1 norm, Johnson-Lindenstrauss (JL) lemma, Cauchy random projections
3 0.40090153 44 jmlr-2007-Large Margin Semi-supervised Learning
Author: Junhui Wang, Xiaotong Shen
Abstract: In classification, semi-supervised learning occurs when a large amount of unlabeled data is available with only a small number of labeled data. In such a situation, how to enhance predictability of classification through unlabeled data is the focus. In this article, we introduce a novel large margin semi-supervised learning methodology, using grouping information from unlabeled data, together with the concept of margins, in a form of regularization controlling the interplay between labeled and unlabeled data. Based on this methodology, we develop two specific machines involving support vector machines and ψ-learning, denoted as SSVM and SPSI, through difference convex programming. In addition, we estimate the generalization error using both labeled and unlabeled data, for tuning regularizers. Finally, our theoretical and numerical analyses indicate that the proposed methodology achieves the desired objective of delivering high performance in generalization, particularly against some strong performers. Keywords: generalization, grouping, sequential quadratic programming, support vectors
4 0.39509279 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
5 0.3879517 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
Author: Gal Elidan, Iftach Nachman, Nir Friedman
Abstract: Bayesian networks in general, and continuous variable networks in particular, have become increasingly popular in recent years, largely due to advances in methods that facilitate automatic learning from data. Yet, despite these advances, the key task of learning the structure of such models remains a computationally intensive procedure, which limits most applications to parameter learning. This problem is even more acute when learning networks in the presence of missing values or hidden variables, a scenario that is part of many real-life problems. In this work we present a general method for speeding structure search for continuous variable networks with common parametric distributions. We efficiently evaluate the approximate merit of candidate structure modifications and apply time consuming (exact) computations only to the most promising ones, thereby achieving significant improvement in the running time of the search algorithm. Our method also naturally and efficiently facilitates the addition of useful new hidden variables into the network structure, a task that is typically considered both conceptually difficult and computationally prohibitive. We demonstrate our method on synthetic and real-life data sets, both for learning structure on fully and partially observable data, and for introducing new hidden variables during structure search. Keywords: Bayesian networks, structure learning, continuous variables, hidden variables
6 0.26612574 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
7 0.25476027 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
9 0.23920822 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
10 0.23367006 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
11 0.21238136 87 jmlr-2007-Undercomplete Blind Subspace Deconvolution
12 0.20324343 71 jmlr-2007-Refinable Kernels
13 0.18861128 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
14 0.17137912 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
15 0.16278239 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
16 0.1516858 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
17 0.14970648 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
18 0.14870551 72 jmlr-2007-Relational Dependency Networks
19 0.1471608 9 jmlr-2007-AdaBoost is Consistent
20 0.14530873 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
topicId topicWeight
[(4, 0.024), (8, 0.041), (10, 0.022), (12, 0.031), (13, 0.013), (15, 0.015), (28, 0.079), (40, 0.038), (45, 0.012), (48, 0.047), (60, 0.044), (64, 0.354), (80, 0.018), (85, 0.041), (98, 0.126)]
simIndex simValue paperId paperTitle
same-paper 1 0.70115507 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
Author: Philippe Rigollet
Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds
2 0.42785549 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
3 0.42761114 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
Author: Jean-Yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. Keywords: statistical learning theory, PAC-Bayes theorems, generalization error bounds
Author: Onur C. Hamsici, Aleix M. Martinez
Abstract: Many feature representations, as in genomics, describe directional data where all feature vectors share a common norm. In other cases, as in computer vision, a norm or variance normalization step, where all feature vectors are normalized to a common length, is generally used. These representations and pre-processing step map the original data from R p to the surface of a hypersphere S p−1 . Such representations should then be modeled using spherical distributions. However, the difficulty associated with such spherical representations has prompted researchers to model their spherical data using Gaussian distributions instead—as if the data were represented in R p rather than S p−1 . This opens the question to whether the classification results calculated with the Gaussian approximation are the same as those obtained when using the original spherical distributions. In this paper, we show that in some particular cases (which we named spherical-homoscedastic) the answer to this question is positive. In the more general case however, the answer is negative. For this reason, we further investigate the additional error added by the Gaussian modeling. We conclude that the more the data deviates from spherical-homoscedastic, the less advisable it is to employ the Gaussian approximation. We then show how our derivations can be used to define optimal classifiers for spherical-homoscedastic distributions. By using a kernel which maps the original space into one where the data adapts to the spherical-homoscedastic model, we can derive non-linear classifiers with potential applications in a large number of problems. We conclude this paper by demonstrating the uses of spherical-homoscedasticity in the classification of images of objects, gene expression sequences, and text data. Keywords: directional data, spherical distributions, normal distributions, norm normalization, linear and non-linear classifiers, computer vision
5 0.423639 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
Author: Francesco Dinuzzo, Marta Neve, Giuseppe De Nicolao, Ugo Pietro Gianazza
Abstract: Support Vector Regression (SVR) for discrete data is considered. An alternative formulation of the representer theorem is derived. This result is based on the newly introduced notion of pseudoresidual and the use of subdifferential calculus. The representer theorem is exploited to analyze the sensitivity properties of ε-insensitive SVR and introduce the notion of approximate degrees of freedom. The degrees of freedom are shown to play a key role in the evaluation of the optimism, that is the difference between the expected in-sample error and the expected empirical risk. In this way, it is possible to define a C p -like statistic that can be used for tuning the parameters of SVR. The proposed tuning procedure is tested on a simulated benchmark problem and on a real world problem (Boston Housing data set). Keywords: statistical learning, reproducing kernel Hilbert spaces, support vector machines, representer theorem, regularization theory
6 0.42303228 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
7 0.41794103 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
8 0.41766906 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
9 0.4175964 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
10 0.4158344 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
11 0.413041 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
12 0.41085094 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
13 0.41051668 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
14 0.41049808 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
16 0.40828788 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
17 0.40788797 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
18 0.40540203 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
20 0.40437907 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis