Error Bounds for Transductive Learning via Compression and Clustering

Philip Derbeko, Ran El-Yaniv, Ron Meir
Abstract: This paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering.

1 Introduction and Related Work

In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or "transduce") the labels of the test points. The transduction setting was introduced by Vapnik [1, 2], who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of points in the test set can be done using an inductive scheme. However, as pointed out in [2], it makes little sense to solve an easier problem by 'reducing' it to a much more difficult one. In particular, the prior knowledge carried by the (unlabeled) test points can be incorporated into an algorithm, potentially leading to superior performance. Indeed, a number of papers have demonstrated empirically that transduction can offer a substantial advantage over induction whenever the training set is small or moderate (see e.g. [3, 4, 5, 6]). However, unlike the current state of affairs in induction, the question of what constitutes provably effective learning principles for transduction is far from being resolved.

In this paper we provide new error bounds and a general technique for transductive learning. Our technique is based on bounds that can be viewed as an extension of McAllester's PAC-Bayesian framework [7, 8] to transductive learning. The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). This flexibility allows for the choice of "compact priors" (with small support) and therefore for tight bounds. Another simple observation is that the PAC-Bayesian framework can be operated with polynomially many (in m, the training sample size) different priors simultaneously. Altogether, this added flexibility of using multiple data-dependent priors allows for easy derivation of tight error bounds for "compression schemes" such as (transductive) SVMs and for clustering algorithms.

We briefly review some previous results. The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e.g. [2]), where an error bound is also proposed. However, this bound is implicit and rather unwieldy and, to the best of our knowledge, has not been applied in practical situations. A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. The bound is data-dependent, depending on the number of decision nodes, the margins at each node, and the sample size. However, the authors state that the transduction bound is not much tighter than the induction bound. Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error; however, the advantage is usually statistically insignificant. Refining the algorithm of [2], a transductive algorithm based on SVMs is proposed in [3]. The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain.
An error bound for transduction, based on the effective VC dimension, is given in [10]. More recently, Lanckriet et al. [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. Blum and Langford [12] recently also established an implicit bound for transduction, in the spirit of the results in [2].

2 The Transduction Setup

We consider the following setting proposed by Vapnik ([2], Chp. 8), which for simplicity is described in the context of binary classification (the general case will be discussed in the full paper). Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let X_{m+u} = {x_1, ..., x_{m+u}} be a set of points from X, each of which is chosen i.i.d. according to some unknown distribution µ(x). We call X_{m+u} the full sample. Let X_m = {x_1, ..., x_m} and Y_m = {y_1, ..., y_m}, where X_m is drawn uniformly from X_{m+u} and y_i ∈ {±1}. The set S_m = {(x_1, y_1), ..., (x_m, y_m)} is referred to as a training sample. In this paper we assume that y_i = φ(x_i) for some unknown function φ. The remaining subset X_u = X_{m+u} \ X_m is referred to as the unlabeled sample. Based on S_m and X_u our goal is to choose h ∈ H which predicts the labels of points in X_u as accurately as possible. For each h ∈ H and a set Z = x_1, ..., x_{|Z|} of samples define

    R_h(Z) = \frac{1}{|Z|} \sum_{i=1}^{|Z|} \ell(h(x_i), y_i),    (1)

where in our case \ell(·,·) is the zero-one loss function. Our goal in transduction is to learn an h such that R_h(X_u) is as small as possible. This problem setup is summarized by the following transduction "protocol" introduced in [2] and referred to as Setting 1:

(i) A full sample X_{m+u} = {x_1, ..., x_{m+u}} consisting of arbitrary m + u points is given.¹
(ii) We then choose uniformly at random the training sample X_m ⊆ X_{m+u} and receive its labeling Y_m; the resulting training set is S_m = (X_m, Y_m) and the remaining set X_u is the unlabeled sample, X_u = X_{m+u} \ X_m.
(iii) Using both S_m and X_u we select a classifier h ∈ H whose quality is measured by R_h(X_u).

¹ The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently at random according to some source distribution µ(x).

Vapnik [2] also considers another formulation of transduction, referred to as Setting 2:

(i) We are given a training set S_m = (X_m, Y_m) selected i.i.d. according to µ(x, y).
(ii) An independent test set S_u = (X_u, Y_u) of u samples is then selected in the same manner.
(iii) We are required to choose our best h ∈ H based on S_m and X_u so as to minimize

    R_{m,u}(h) = \int \frac{1}{u} \sum_{i=m+1}^{m+u} \ell(h(x_i), y_i) \, d\mu(x_1, y_1) \cdots d\mu(x_{m+u}, y_{m+u}).    (2)

Even though Setting 2 may appear more applicable in practical situations than Setting 1, the derivation of theoretical results can be easier within Setting 1. Nevertheless, as far as the expected losses are concerned, Vapnik [2] shows that an error bound in Setting 1 implies an equivalent bound in Setting 2. In view of this result we restrict ourselves in the sequel to Setting 1.

We make use of the following quantities, which are all instances of (1). The quantity R_h(X_{m+u}) is called the full sample risk of the hypothesis h, R_h(X_u) is referred to as the transduction risk (of h), and R_h(X_m) is the training error (of h). Thus, R_h(X_m) is the standard training error, denoted by \hat{R}_h(S_m). While our objective in transduction is to achieve small error over the unlabeled set (i.e. to minimize R_h(X_u)), it turns out that it is much easier to derive error bounds for the full sample risk.
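To make the setup concrete, here is a minimal Python sketch (not from the paper) of Setting 1 with the zero-one loss. The input distribution, the target φ, and the fixed hypothesis h are placeholders chosen purely for illustration; the training sample is drawn uniformly without replacement from the full sample, and the three risk quantities just defined are computed.

```python
import numpy as np

rng = np.random.default_rng(0)
m, u = 50, 200

def phi(x):                      # the unknown labeling function (placeholder)
    return np.where(x > 0.0, 1, -1)

def h(x):                        # some fixed hypothesis in H (placeholder)
    return np.where(x > 0.1, 1, -1)

X_full = rng.normal(size=m + u)  # the full sample X_{m+u}, drawn i.i.d. from mu

# Setting 1: choose the training sample uniformly at random without replacement.
train_idx = rng.choice(m + u, size=m, replace=False)
test_idx = np.setdiff1d(np.arange(m + u), train_idx)
X_m, X_u = X_full[train_idx], X_full[test_idx]

def risk(hyp, Z):
    """Empirical zero-one risk R_h(Z) of equation (1), with labels given by phi."""
    return np.mean(hyp(Z) != phi(Z))

R_train = risk(h, X_m)       # training error \hat{R}_h(S_m) = R_h(X_m)
R_trans = risk(h, X_u)       # transduction risk R_h(X_u)
R_full  = risk(h, X_full)    # full sample risk R_h(X_{m+u})

# The three risks satisfy R_h(X_{m+u}) = (m R_h(X_m) + u R_h(X_u)) / (m + u),
# the decomposition used by Lemma 2.1 below.
assert np.isclose(R_full, (m * R_train + u * R_trans) / (m + u))
print(R_train, R_trans, R_full)
```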
The following simple lemma translates an error bound on R_h(X_{m+u}), the full sample risk, into an error bound on the transduction risk R_h(X_u).

Lemma 2.1 For any h ∈ H and any C,

    R_h(X_{m+u}) \le \hat{R}_h(S_m) + C  \iff  R_h(X_u) \le \hat{R}_h(S_m) + \frac{m+u}{u} \cdot C.    (3)

Proof: For any h,

    R_h(X_{m+u}) = \frac{m R_h(X_m) + u R_h(X_u)}{m+u}.    (4)

Substituting \hat{R}_h(S_m) for R_h(X_m) in (4) and then substituting the result for the left-hand side of (3) we get

    R_h(X_{m+u}) = \frac{m \hat{R}_h(S_m) + u R_h(X_u)}{m+u} \le \hat{R}_h(S_m) + C.

The equivalence (3) is now obtained by isolating R_h(X_u) on the left-hand side. □

3 General Error Bounds for Transduction

Consider a hypothesis class H and assume for simplicity that H is countable; in fact, in the case of transduction it suffices to consider a finite hypothesis class. To see this, note that all m + u points are known in advance. Thus, in the case of binary classification (for example) it suffices to consider at most 2^{m+u} possible dichotomies. Recall that in the setting considered we select a sub-sample of m points from the set X_{m+u} of cardinality m + u. This corresponds to a selection of m points without replacement from a set of m + u points, leading to the m points being dependent. A naive utilization of large deviation bounds would therefore not be directly applicable in this setting. However, Hoeffding (see Theorem 4 in [13]) pointed out a simple procedure to transform the problem into one involving independent data. While this procedure leads to non-trivial bounds, it does not fully take advantage of the transductive setting and will not be used here.

Consider for simplicity the case of binary classification. In this case we make use of the following concentration inequality, based on [14].

Theorem 3.1 Let C = {c_1, ..., c_N}, c_i ∈ {0, 1}, be a finite set of binary numbers, and set \bar{c} = (1/N) \sum_{i=1}^N c_i. Let Z_1, ..., Z_m be random variables obtaining their values by sampling C uniformly at random without replacement. Set \bar{Z} = (1/m) \sum_{i=1}^m Z_i and β = m/N. Then, if ε ≤ min{1 − \bar{c}, \bar{c}(1 − β)/β},²

    \Pr\{\bar{Z} - E\bar{Z} > \varepsilon\} \le \exp\left( -m D(\bar{c} + \varepsilon \,\|\, \bar{c}) - (N - m)\, D\!\left(\bar{c} - \tfrac{\beta\varepsilon}{1-\beta} \,\Big\|\, \bar{c}\right) + 7 \log(N+1) \right),

where D(p‖q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)), p, q ∈ [0, 1], is the binary Kullback-Leibler divergence.

² The second condition, ε ≤ \bar{c}(1 − β)/β, simply guarantees that the number of 'ones' in the sub-sample does not exceed their number in the original sample.

Using this result we obtain the following error bound for transductive classification.

Theorem 3.2 Let X_{m+u} = X_m ∪ X_u be the full sample and let p = p(X_{m+u}) be a (prior) distribution over the class of binary hypotheses H that may depend on the full sample. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over choices of S_m (from the full sample) the following bound holds for any h ∈ H,

    R_h(X_u) \le \hat{R}_h(S_m) + \sqrt{\frac{2\hat{R}_h(S_m)(m+u)}{u} \cdot \frac{\log\frac{1}{p(h)} + \ln\frac{m}{\delta} + 7\log(m+u+1)}{m-1}} + \frac{2\left(\log\frac{1}{p(h)} + \ln\frac{m}{\delta} + 7\log(m+u+1)\right)}{m-1}.    (5)

Proof: (sketch) In our transduction setting the set X_m (and therefore S_m) is obtained by sampling the full sample X_{m+u} uniformly at random without replacement. We first claim that

    E_{\Sigma_m} \hat{R}_h(S_m) = R_h(X_{m+u}),    (6)

where E_{\Sigma_m}(·) is the expectation with respect to a random choice of S_m from X_{m+u} without replacement. This is shown as follows:

    E_{\Sigma_m} \hat{R}_h(S_m) = \frac{1}{\binom{m+u}{m}} \sum_{S_m} \hat{R}_h(S_m) = \frac{1}{\binom{m+u}{m}} \sum_{X_m \subseteq X_{m+u}} \frac{1}{m} \sum_{x \in S_m} \ell(h(x), \phi(x)).

By symmetry, all points x ∈ X_{m+u} are counted on the right-hand side an equal number of times; this number is precisely \binom{m+u}{m} - \binom{m+u-1}{m} = \binom{m+u-1}{m-1}. The equality (6) is obtained by considering the definition of R_h(X_{m+u}) and noting that \binom{m+u-1}{m-1} / \binom{m+u}{m} = \frac{m}{m+u}. The remainder of the proof combines Theorem 3.1 and the techniques presented in [15]; the details will be provided in the full paper. □
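Bound (5) is fully explicit, so it can be evaluated directly once \hat{R}_h(S_m), p(h), m, u and δ are known. The sketch below (not from the paper) does this for made-up numbers; it relies on the reconstruction of (5) given above and treats all logarithms as natural, both of which are assumptions of this sketch.

```python
import math

def transductive_bound(emp_risk, m, u, log_inv_prior, delta):
    """Right-hand side of bound (5). `emp_risk` is \hat{R}_h(S_m) and
    `log_inv_prior` is ln(1/p(h)); all logarithms are treated as natural."""
    complexity = (log_inv_prior + math.log(m / delta)
                  + 7 * math.log(m + u + 1)) / (m - 1)
    return (emp_risk
            + math.sqrt(2 * emp_risk * (m + u) / u * complexity)
            + 2 * complexity)

# Hypothetical numbers: m = 100 labeled and u = 900 unlabeled points, 5% training
# error, a prior giving the selected hypothesis p(h) = 2^-20, and delta = 0.05.
print(transductive_bound(0.05, 100, 900, 20 * math.log(2), 0.05))
# With zero training error the square-root term vanishes and the bound tightens.
print(transductive_bound(0.0, 100, 900, 20 * math.log(2), 0.05))
```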
Notice that when \hat{R}_h(S_m) → 0 the square root in (5) vanishes and faster rates are obtained. An important feature of Theorem 3.2 is that it allows one to use the sample X_{m+u} in order to choose the prior distribution p(h). This advantage has already been alluded to in [2], but does not seem to have been widely used in practice. Additionally, observe that (5) holds with probability at least 1 − δ with respect to the random selection of sub-samples of size m from the fixed set X_{m+u}. This should be contrasted with the standard inductive setting results, where the probabilities are with respect to a random choice of m training points chosen i.i.d. from µ(x, y).

The next bound we present is analogous to McAllester's Theorem 1 in [8]. This theorem concerns Gibbs composite classifiers, which are distributions over the base classifiers in H. For any distribution q over H denote by G_q the Gibbs classifier, which classifies an instance (in X_u) by randomly choosing, according to q, one hypothesis h ∈ H. For Gibbs classifiers we now extend definition (1) as follows. Let Z = x_1, ..., x_{|Z|} be any set of samples and let G_q be a Gibbs classifier over H. The risk of G_q over Z is

    R_{G_q}(Z) = E_{h \sim q}\!\left[ \frac{1}{|Z|} \sum_{i=1}^{|Z|} \ell(h(x_i), \phi(x_i)) \right].

As before, when Z = X_m (the training set) we use the standard notation \hat{R}_{G_q}(S_m) = R_{G_q}(X_m). Due to space limitations, the proof of the following theorem will appear in the full paper.

Theorem 3.3 Let X_{m+u} be the full sample. Let p be a distribution over H that may depend on X_{m+u} and let q be a (posterior) distribution over H that may depend on both S_m and X_u. Let δ ∈ (0, 1) be given. With probability at least 1 − δ over the choices of S_m, for any distribution q,

    R_{G_q}(X_u) \le \hat{R}_{G_q}(S_m) + \sqrt{\frac{2\hat{R}_{G_q}(S_m)(m+u)}{u} \cdot \frac{D(q \,\|\, p) + \ln\frac{m}{\delta} + 7\log(m+u+1)}{m-1}} + \frac{2\left(D(q \,\|\, p) + \ln\frac{m}{\delta} + 7\log(m+u+1)\right)}{m-1}.

In the context of inductive learning, a major obstacle in generating meaningful and effective bounds using the PAC-Bayesian framework [8] is the construction of "compact priors". Here we discuss two extensions to the PAC-Bayesian scheme, which together allow for easy choices of compact priors that can yield tight error bounds.

The first extension we offer is the use of multiple priors. Instead of a single prior p in the original PAC-Bayesian framework, we observe that one can use all PAC-Bayesian bounds with a number of priors p_1, ..., p_k and then replace the complexity term ln(1/p(h)) (in Theorem 3.2) by min_i ln(1/p_i(h)), at the cost of an additional ln k term (see below). Similarly, in Theorem 3.3 we can replace the KL-divergence term in the bound with min_i D(q‖p_i). The penalty for using k priors is logarithmic in k (specifically, the ln(1/δ) term in the original bound becomes ln(k/δ)). As long as k is sub-exponential in m we still obtain effective generalization bounds. The second "extension" is simply the feature of our transduction bounds (Theorems 3.2 and 3.3) which allows the priors to depend on the full sample X_{m+u}. The combination of these two simple ideas yields a powerful technique for deriving error bounds in realistic transductive settings. After stating the extended result we later use it for deriving tight bounds for known learning algorithms and for deriving new algorithms.
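Before the formal statement, here is a small numerical sketch (not from the paper) of the multiple-prior device: the complexity term in (5) becomes the minimum of ln(1/p_i(h)) over the k priors, while δ is replaced by δ/k. The priors and numbers are hypothetical, and logarithms are again treated as natural.

```python
import math

m, u, delta = 100, 900, 0.05

# ln(1/p_i(h)) for the hypothesis h actually selected, under k hypothetical
# priors (infinite when h lies outside the support of p_i).
log_inv = [35.0, 12.0, float("inf"), 20.5]
k = len(log_inv)

def complexity(log_inv_prior, delta):
    """The per-prior complexity term appearing in bound (5)."""
    return (log_inv_prior + math.log(m / delta) + 7 * math.log(m + u + 1)) / (m - 1)

single = complexity(log_inv[0], delta)        # a single fixed prior p_1
multi = complexity(min(log_inv), delta / k)   # best of k priors, delta -> delta/k
print(single, multi)  # the ln(k) penalty is negligible when one prior is compact
```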
Suppose that instead of a single prior p over H we want to utilize k priors, p_1, ..., p_k, and in retrospect choose the best among the k corresponding PAC-Bayesian bounds. The following theorem shows that one can use polynomially many priors with a minor penalty. The proof, which is omitted due to space limitations, utilizes the union bound in a straightforward manner.

Theorem 3.4 Let the conditions of Theorem 3.2 hold, except that we now have k prior distributions p_1, ..., p_k defined over H, each of which may depend on X_{m+u}. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over random choices of sub-samples of size m from the full sample, for all h ∈ H, (5) holds with p(h) replaced by min_{1≤i≤k} p_i(h) and log(1/δ) replaced by log(k/δ).

Remark: A similar result holds for the Gibbs algorithm of Theorem 3.3. Also, as noted by one of the reviewers, when the supports of the k priors intersect (i.e. there is at least one pair of priors p_i and p_j with overlapping support), then one can do better by utilizing the "super prior" p = (1/k) \sum_i p_i within the original Theorem 3.2. However, note that when the supports are disjoint, these two views (of multiple priors and a super prior) are equivalent. In the applications below we utilize non-intersecting priors.

4 Bounds for Compression Algorithms

Here we propose a technique for bounding the error of "compression" algorithms based on an appropriate construction of prior probabilities. Let A be a learning algorithm. Intuitively, A is a "compression scheme" if it can generate the same hypothesis using a subset of the data. More formally, a learning algorithm A (viewed as a function from samples to some hypothesis class) is a compression scheme with respect to a sample Z if there is a subsample Z', Z' ⊂ Z, such that A(Z') = A(Z). Observe that the SVM approach is a compression scheme, with Z' being determined by the set of support vectors.

Let A be a deterministic compression scheme and consider the full sample X_{m+u}. For each integer τ = 1, ..., m, consider all subsets of X_{m+u} of size τ, and for each subset construct all possible dichotomies of that subset (note that we are not proposing this approach as an algorithm, but rather as a means to derive bounds; in practice one need not construct all these dichotomies). A deterministic algorithm A uniquely determines at most one hypothesis h ∈ H for each dichotomy.³ For each τ, let the set of hypotheses generated by this procedure be denoted by H_τ. For the rest of this discussion we assume the worst case where |H_τ| = 2^τ \binom{m+u}{τ} (i.e. if H_τ does not contain one hypothesis for each dichotomy, the bounds only improve). The prior p_τ is then defined to be a uniform distribution over H_τ. In this way we have m priors, p_1, ..., p_m, which are constructed using only X_{m+u} (and are independent of S_m). Any hypothesis selected by the learning algorithm A based on the labeled sample S_m and on the test set X_u belongs to ∪_{τ=1}^m H_τ. The motivation for this construction is as follows. Each τ can be viewed as our "guess" for the maximal number of compression points that will be utilized by a resulting classifier. For each such τ the prior p_τ is constructed over all possible classifiers that use τ compression points. By systematically considering all possible dichotomies of τ points we can characterize a relatively small subset of H without observing labels of the training points. Thus, each prior p_τ represents one such guess.
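As a sanity check on this construction (a sketch, not part of the paper), the complexity term ln(1/p_τ(h)) induced by the uniform prior over H_τ can be computed exactly and compared with the closed-form upper bound τ ln(2e(m+u)/τ) used in Corollary 4.1 below; sample sizes are hypothetical and logarithms are taken as natural.

```python
import math

m, u = 100, 900   # hypothetical sample sizes

def log_inv_prior(tau):
    """ln(1/p_tau(h)) for the uniform prior over H_tau, assuming the worst case
    |H_tau| = 2^tau * C(m+u, tau) described above."""
    return tau * math.log(2) + math.log(math.comb(m + u, tau))

def closed_form(tau):
    """The bound tau * ln(2e(m+u)/tau), obtained from C(n, s) <= (e*n/s)^s."""
    return tau * math.log(2 * math.e * (m + u) / tau)

for tau in (5, 20, 80):   # "guesses" for the number of compression points
    print(tau, round(log_inv_prior(tau), 1), round(closed_form(tau), 1))
```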
Using Theorem 3.4 we are later allowed to choose in retrospect the bound corresponding to the best "guess". The following corollary identifies an upper bound on the divergence in terms of the observed size of the compression set of the final classifier.

Corollary 4.1 Let the conditions of Theorem 3.4 hold. Let A be a deterministic learning algorithm leading to a hypothesis h ∈ H based on a compression set of size s. Then with probability at least 1 − δ, for all h ∈ H, (5) holds with log(1/p(h)) replaced by s log(2e(m + u)/s) and ln(m/δ) replaced by ln(m²/δ).

Proof: Recall that H_s ⊆ H is the support set of p_s and that p_s(h) = 1/|H_s| for all h ∈ H_s, implying that ln(1/p_s(h)) = ln |H_s|. Using the inequality \binom{m+u}{s} \le (e(m+u)/s)^s we have that |H_s| = 2^s \binom{m+u}{s} \le (2e(m+u)/s)^s. Substituting this result in Theorem 3.4, while restricting the minimum over i to be over i ≥ s, leads to the desired result. □

The bound of Corollary 4.1 can be easily computed once the classifier is trained. If the size of the compression set happens to be small, we obtain a tight bound. SVM classification is one of the best studied compression schemes. The compression set for a sample S_m is given by the subset of support vectors. Thus the bound in Corollary 4.1 immediately applies with s being the number of observed support vectors (after training). We note that this bound is similar to a recently derived compression bound for inductive learning (Theorem 5.18 in [16]). Also, observe that the algorithm itself (inductive SVM) did not use in this case the unlabeled sample (although the bound does use this sample). Nevertheless, using exactly the same technique we obtain error bounds for the transductive SVM algorithms in [2, 3].⁴

³ It might be that for some dichotomies the algorithm will fail. For example, an SVM in feature space without soft margin will fail to classify non linearly-separable dichotomies of X_{m+u}.
⁴ Note, however, that our bounds are optimized with a "minimum number of support vectors" approach rather than "maximum margin".

5 Bounds for Clustering Algorithms

Some learning problems do not allow for high compression rates using compression schemes such as SVMs (i.e. the number of support vectors can sometimes be very large). A considerably stronger type of compression can often be achieved by clustering algorithms. While there is a lack of formal links between entirely unsupervised clustering and classification, within a transduction setting we can provide a principled approach to using clustering algorithms for classification.

Let A be any (deterministic) clustering algorithm which, given the full sample X_{m+u}, can cluster this sample into any desired number of clusters. We use A to cluster X_{m+u} into 2, 3, ..., c clusters, where c ≤ m. Thus, the algorithm generates a collection of partitions of X_{m+u} into τ = 2, 3, ..., c clusters, where each partition is denoted by C_τ. For each value of τ, let H_τ consist of those hypotheses which assign an identical label to all points in the same cluster of partition C_τ, and define the prior p_τ(h) = 1/2^τ for each h ∈ H_τ and zero otherwise (note that there are 2^τ possible dichotomies). The learning algorithm selects a hypothesis as follows. Upon observing the labeled sample S_m = (X_m, Y_m), for each of the clusterings C_2, ..., C_c constructed above, it assigns a label to each cluster based on the majority vote from the labels Y_m of points falling within the cluster (in case of ties, or if no points from X_m belong to the cluster, choose a label arbitrarily).
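The following sketch (not from the paper) implements the majority-vote labeling rule just described for an arbitrary partition of the full sample. The random partitions standing in for the clustering algorithm A, as well as all sizes and labels, are placeholders for illustration.

```python
import numpy as np

def cluster_vote_classifier(clusters, train_idx, y_train):
    """Given a partition of the full sample (clusters[i] is the cluster index of
    point i), assign each cluster the majority label of the training points that
    fall in it (label +1 arbitrarily for empty clusters or ties), and return the
    predicted labels for every point in the full sample."""
    labels = {}
    for c in np.unique(clusters):
        votes = y_train[clusters[train_idx] == c]
        labels[c] = 1 if votes.sum() >= 0 else -1   # majority of {+1, -1} votes
    return np.array([labels[c] for c in clusters])

# Hypothetical usage: partition[tau] stands for the clustering of X_{m+u} into
# tau clusters produced by a deterministic clustering algorithm A.
rng = np.random.default_rng(0)
n, m = 30, 10
partition = {tau: rng.integers(0, tau, size=n) for tau in (2, 3, 4)}
train_idx = rng.choice(n, size=m, replace=False)
y_train = rng.choice([-1, 1], size=m)

h = {tau: cluster_vote_classifier(partition[tau], train_idx, y_train)
     for tau in partition}     # the classifiers h_tau discussed next
```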
Doing this leads to c − 1 classifiers h_τ, τ = 2, ..., c. For each h_τ there is a valid error bound as given by Theorem 3.4, and all these bounds are valid simultaneously. Thus we choose the best classifier (equivalently, number of clusters) for which the best bound holds. We thus have the following corollary of Theorem 3.4 and Lemma 2.1.

Corollary 5.1 Let A be any clustering algorithm and let h_τ, τ = 2, ..., c, be classifications of the test set X_u as determined by clustering of the full sample X_{m+u} (into τ clusters) and the training set S_m, as described above. Let δ ∈ (0, 1) be given. Then with probability at least 1 − δ, for all τ, (5) holds with log(1/p(h)) replaced by τ and ln(m/δ) replaced by ln(mc/δ).

Error bounds obtained using Corollary 5.1 can be rather tight when the clustering algorithm is successful (i.e. when it captures the class structure in the data using a small number of clusters).

Corollary 5.1 can be extended in a number of ways. One simple extension is the use of an ensemble of clustering algorithms. Specifically, we can concurrently apply k clustering algorithms (using each algorithm to cluster the data into τ = 2, ..., c clusters). We thus obtain kc hypotheses (partitions of X_{m+u}). By a simple application of the union bound we can replace ln(cm/δ) by ln(kcm/δ) in Corollary 5.1 and guarantee that the kc bounds hold simultaneously for all kc hypotheses (with probability at least 1 − δ). We thus choose the hypothesis which minimizes the resulting bound. This extension is particularly attractive since typically, without prior knowledge, we do not know which clustering algorithm will be effective for the dataset at hand.

6 Concluding Remarks

We presented new bounds for transductive learning algorithms. We also developed a new technique for deriving tight error bounds for compression schemes and for clustering algorithms in the transductive setting. We expect that these bounds and new techniques will be useful for deriving new error bounds for other known algorithms and for deriving new types of transductive learning algorithms. It would be interesting to see if tighter transduction bounds can be obtained by reducing the "slacks" in the inequalities we use in our analysis. Another promising direction is the construction of better (multiple) priors. For example, in our compression bound (Corollary 4.1), for each number of compression points we assigned the same prior to each possible point subset and each possible dichotomy. However, in practice the vast majority of all these subsets and dichotomies are unlikely to occur.

Acknowledgments

The work of R.E. and R.M. was partially supported by the Technion V.P.R. fund for the promotion of sponsored research. Support from the Ollendorff center of the department of Electrical Engineering at the Technion is also acknowledged. We also thank the anonymous referees for their useful comments.

References

[1] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982.
[2] V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998.
[3] T. Joachims. Transductive inference for text classification using support vector machines. In European Conference on Machine Learning, 1999.
[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001.
[5] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS 2001), pages 1025–1032, 2001.
[6] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.
[7] D. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
[8] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
[9] D. Wu, K. Bennett, N. Cristianini, and J. Shawe-Taylor. Large margin trees for induction and transduction. In International Conference on Machine Learning, 1999.
[10] L. Bottou, C. Cortes, and V. Vapnik. On the effective VC dimension. Technical report, AT&T, 1994.
[11] G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Technical report, University of California, Berkeley, Computer Science Division, 2002.
[12] A. Blum and J. Langford. PAC-MDL bounds. In COLT, pages 344–357, 2003.
[13] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
[14] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer, New York, second edition, 1998.
[15] D. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, pages 203–215, 2003.
[16] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Cambridge, MA, 2002.
Reference: text
sentIndex sentText sentNum sentScore
1 il Abstract This paper is concerned with transductive learning. [sent-7, score-0.198]
2 Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. [sent-8, score-0.466]
3 We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. [sent-9, score-0.635]
4 The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering. [sent-10, score-0.685]
5 1 Introduction and Related Work In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. [sent-11, score-0.377]
6 The transduction setting was introduced by Vapnik [1, 2] who proposed basic bounds and an algorithm for this setting. [sent-13, score-0.464]
7 Clearly, inferring the labels of points in the test set can be done using an inductive scheme. [sent-14, score-0.115]
8 Indeed, a number of papers have demonstrated empirically that transduction can offer substantial advantage over induction whenever the training set is small or moderate (see e. [sent-17, score-0.377]
9 However, unlike the current state of affairs in induction, the question of what are provably effective learning principles for transduction is quite far from being resolved. [sent-20, score-0.327]
10 In this paper we provide new error bounds and a general technique for transductive learning. [sent-21, score-0.41]
11 Our technique is based on bounds that can be viewed as an extension of McAllester’s PAC-Bayesian framework [7, 8] to transductive learning. [sent-22, score-0.393]
12 The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). [sent-23, score-0.529]
13 Another simple observation is that the PAC-Bayesian framework can be operated with polynomially (in m, the training sample size) many different priors simultaneously. [sent-25, score-0.221]
14 Altogether, this added flexibility, of using data-dependent multiple priors allows for easy derivation of tight error bounds for “compression schemes” such as (transductive) SVMs and for clustering algorithms. [sent-26, score-0.408]
15 The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e. [sent-28, score-0.198]
16 A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. [sent-32, score-0.378]
17 The bound is data-dependent depending on the number of decision nodes, the margins at each node and the sample size. [sent-33, score-0.164]
18 However, the authors state that the transduction bound is not much tighter than the induction bound. [sent-34, score-0.448]
19 Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error, however, the advantage is usually statistically insignificant. [sent-35, score-0.35]
20 Refining the algorithm of [2] a transductive algorithm based on a SVMs is proposed in [3]. [sent-36, score-0.198]
21 The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain. [sent-37, score-0.282]
22 An error bound for transduction, based on the effective VC Dimension, is given in [10]. [sent-38, score-0.162]
23 [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. [sent-40, score-0.294]
24 Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let Xm+u = {x1 , . [sent-44, score-0.112]
25 , (xm , ym )} is referred to as a training sample. [sent-62, score-0.141]
26 Our goal in transduction is to learn an h such that Rh (Xu ) is as small as possible. [sent-70, score-0.282]
27 This problem setup is summarized by the following transduction “protocol” introduced in [2] and referred to as Setting 1: (i) A full sample Xm+u = {x1 , . [sent-71, score-0.427]
28 1 The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently at random according to some source distribution µ(x). [sent-80, score-0.16]
29 Nevertheless, as far as the expected losses are concerned, Vapnik [2] shows that an error bound in Setting 1 implies an equivalent bound in Setting 2. [sent-83, score-0.232]
30 The quantity Rh (Xm+u ) is called the full sample risk of the hypothesis h, Rh (Xu ) is referred to as the transduction risk (of h), and Rh (Xm ) is the training error (of h). [sent-86, score-0.64]
31 While our objective in transduction is to achieve small error over the unlabeled set (i. [sent-88, score-0.383]
32 to minimize Rh (Xu )), it turns out that it is much easier to derive error bounds for the full sample risk. [sent-90, score-0.326]
33 The following simple lemma translates an error bound on Rh (Xm+u ), the full sample risk, to an error bound on the transduction risk Rh (Xu ). [sent-91, score-0.712]
34 2 3 General Error Bounds for Transduction Consider a hypothesis class H and assume for simplicity that H is countable; in fact, in the case of transduction it suffices to consider a finite hypothesis class. [sent-97, score-0.426]
35 A naive utilization of large deviation bounds would therefore not be directly applicable in this setting. [sent-102, score-0.141]
36 While this procedure leads to non-trivial bounds, it does not fully take advantage of the transductive setting and will not be used here. [sent-104, score-0.256]
37 Using this result we obtain the following error bound for transductive classification. [sent-118, score-0.334]
38 2 Let Xm+u = Xm ∪Xu be the full sample and let p = p(Xm+u ) be a (prior) distribution over the class of binary hypotheses H that may depend on the full sample. [sent-120, score-0.286]
39 Then, with probability at least 1 − δ over choices of Sm (from the full sample) the following bound holds for any h ∈ H, ˆ 2Rh (Sm )(m + u) u ˆ Rh (Xu ) ≤ Rh (Sm ) + + 2 log 1 p(h) log + ln m + 7 log(m + u + 1) δ m−1 + ln m + 7 log(m + u + 1) δ m−1 1 p(h) . [sent-122, score-0.503]
40 (5) Proof: (sketch) In our transduction setting the set Xm (and therefore Sm ) is obtained by sampling the full sample Xm+u uniformly at random without replacement. [sent-123, score-0.466]
41 2 is that it allows one to use the sample Xm+u in order to choose the prior distribution p(h). [sent-134, score-0.135]
42 This should be contrasted with the standard inductive setting results where the probabilities are with respect to a random choice of m training points chosen i. [sent-137, score-0.159]
43 Due to space limitations, the proof of the following theorem will appear in the full paper. [sent-152, score-0.159]
44 With probability at least 1 − δ over the choices of Sm for any distribution q ˆ RGq (Xu ) ≤ RGq (Sm ) + + ˆ 2RGq (Sm )(m + u) u D(q p) + ln m + 7 log(m + u + 1) δ m−1 7 2 D(q p) + ln m + m log(m + u + 1) δ m−1 . [sent-157, score-0.259]
45 In the context of inductive learning, a major obstacle in generating meaningful and effective bounds using the PAC-Bayesian framework [8] is the construction of “compact priors”. [sent-158, score-0.239]
46 Here we discuss two extensions to the PAC-Bayesian scheme, which together allow for easy choices of compact priors that can yield tight error bounds. [sent-159, score-0.241]
47 Instead of a single prior p in the original PACBayesian framework we observe that one can use all PAC-Bayesian bounds with a number of priors p1 , . [sent-161, score-0.297]
48 2) by mini ln(1/pi (h)), at a cost of an additional ln k term (see below). [sent-165, score-0.129]
49 3 we can replace the KL-divergence term in the bound with mini D(q||pi ). [sent-167, score-0.137]
50 The penalty for using k priors is logarithmic in k (specifically the ln(1/δ) term in the original bound becomes ln(k/δ)). [sent-168, score-0.193]
51 The second “extension” is simply the feature of our transduction bounds (Theorems 3. [sent-170, score-0.423]
52 3), which allows for the priors to be dependent on the full sample Xm+u . [sent-172, score-0.218]
53 The combination of these two simple ideas yields a powerful technique for deriving error bounds in realistic transductive settings. [sent-173, score-0.468]
54 After stating the extended result we later use it for deriving tight bounds for known learning algorithms and for deriving new algorithms. [sent-174, score-0.316]
55 The following theorem shows that one can use polynomially many priors with a minor penalty. [sent-179, score-0.208]
56 Then, with probability at least 1 − δ over random choices of sub-samples of size m from the full-sample, for all h ∈ H, (5) holds with p(h) replaced by min1≤i≤k pi (h) and log 1 is replaced by log k . [sent-188, score-0.233]
57 there is at least one pair of priors pi and pj with overlapping support), then one can do better by utilizing the 1 “super prior” p = k i pi within the original Theorem 3. [sent-193, score-0.169]
58 However, note that when the supports are disjoint, these two views (of multiple priors and a super prior) are equivalent. [sent-195, score-0.119]
59 4 Bounds for Compression Algorithms Here we propose a technique for bounding the error of “compression” algorithms based on appropriate construction of prior probabilities. [sent-197, score-0.133]
60 More formally, a learning algorithm A (viewed as a function from samples to some hypothesis class) is a compression scheme with respect to a sample Z if there is a subsample Z , Z ⊂ Z, such that A(Z ) = A(Z). [sent-200, score-0.329]
61 Observe that the SVM approach is a compression scheme, with Z being determined by the set of support vectors. [sent-201, score-0.192]
62 Let A be a deterministic compression scheme and consider the full sample Xm+u . [sent-202, score-0.333]
63 , m, consider all subsets of Xm+u of size τ , and for each subset construct all possible dichotomies of that subset (note that we are not proposing this approach as an algorithm, but rather as a means to derive bounds; in practice one need not construct all these dichotomies). [sent-206, score-0.162]
64 if Hτ does not contains one hypothesis for each dichotomy τ the bounds improve). [sent-211, score-0.213]
65 Any hypothesis selected by the learning algorithm A based on the labeled sample Sm and on the test set Xu belongs to ∪m Hτ . [sent-217, score-0.162]
66 Each τ can be viewed as our “guess” for the maximal number of compression points that will be utilized by a resulting classifier. [sent-219, score-0.203]
67 For each such τ the prior pτ is constructed over all possible classifiers that use τ compression points. [sent-220, score-0.206]
68 By systematically considering all possible dichotomies of τ points we can characterize a relatively small subset of H without observing labels of the training points. [sent-221, score-0.254]
69 4 we are later allowed to choose in retrospect the bound corresponding to the best “guess”. [sent-224, score-0.147]
70 The following corollary identifies an upper bound on the divergence in terms of the observed size of the compression set of the final classifier. [sent-225, score-0.345]
71 Let A be a deterministic learning algorithm leading to a hypothesis h ∈ H based on a compression set of size s. [sent-229, score-0.259]
72 If the size of the compression set happens to be small, we obtain a tight bound. [sent-237, score-0.223]
73 SVM classification is one of the best studied compression schemes. [sent-238, score-0.164]
74 The compression set for a sample Sm is given by the subset of support vectors. [sent-239, score-0.283]
75 We note that this bound is similar to a recently derived compression bound for inductive learning (Theorem 5. [sent-242, score-0.408]
76 Also, observe that the algorithm itself (inductive SVM) did not use in this case the unlabeled sample (although the bound does use this sample). [sent-244, score-0.242]
77 Nevertheless, using exactly the same technique we obtain error bounds for the transductive SVM algorithms in [2, 3]. [sent-245, score-0.41]
78 4 3 It might be that for some dichotomies the algorithm will fail. [sent-246, score-0.116]
79 For example, an SVM in feature space without soft margin will fail to classify non linearly-separable dichotomies of Xm+u . [sent-247, score-0.136]
80 4 Note however that our bounds are optimized with a “minimum number of support vectors” approach rather than “maximum margin”. [sent-248, score-0.169]
81 5 Bounds for Clustering Algorithms Some learning problems do not allow for high compression rates using compression schemes such as SVMs (i. [sent-249, score-0.355]
82 A considerably stronger type of compression can often be achieved by clustering algorithms. [sent-252, score-0.235]
83 While there is lack of formal links between entirely unsupervised clustering and classification, within a transduction setting we can provide a principled approach to using clustering algorithms for classification. [sent-253, score-0.465]
84 Let A be any (deterministic) clustering algorithm which, given the full sample Xm+u , can cluster this sample into any desired number of clusters. [sent-254, score-0.298]
85 For each value of τ , let Hτ consist of those hypotheses which assign an identical label to all points in the same cluster of partition Cτ , and define the prior pτ (h) = 1/2τ for each h ∈ Hτ and zero otherwise (note that there are 2τ possible dichotomies). [sent-263, score-0.206]
86 Upon observing the labeled sample Sm = (Xm , Ym ), for each of the clusterings C2 , . [sent-265, score-0.115]
87 , Cc constructed above, it assigns a label to each cluster based on the majority vote from the labels Ym of points falling within the cluster (in case of ties, or if no points from Xm belong to the cluster, choose a label arbitrarily). [sent-268, score-0.203]
88 For each hτ there is a valid error bound as given by Theorem 3. [sent-273, score-0.136]
89 Thus we choose the best classifier (equivalently, number of clusters) for which the best bound holds. [sent-275, score-0.121]
90 , c be classifications of test set Xu as determined by clustering of the full sample Xm+u (into τ clusters) and the training set Sm , as described above. [sent-283, score-0.219]
91 1 can be rather tight when the clustering algorithm is successful (i. [sent-287, score-0.13]
92 By a simple application of the union bound we can replace ln cm by ln kcm in Corollary 5. [sent-298, score-0.323]
93 1 and guarantee that kc bounds hold siδ δ multaneously for all kc hypotheses (with probability at least 1 − δ). [sent-299, score-0.284]
94 This extension is particularly attractive since typically without prior knowledge we do not know which clustering algorithm will be effective for the dataset at hand. [sent-301, score-0.162]
95 6 Concluding Remarks We presented new bounds for transductive learning algorithms. [sent-302, score-0.339]
96 We also developed a new technique for deriving tight error bounds for compression schemes and for clustering algorithms in the transductive setting. [sent-303, score-0.789]
97 We expect that these bounds and new techniques will be useful for deriving new error bounds for other known algorithms and for deriving new types of transductive learning algorithms. [sent-304, score-0.636]
98 It would be interesting to see if tighter transduction bounds can be obtained by reducing the “slacks” in the inequalities we use in our analysis. [sent-305, score-0.459]
99 1), for each number of compression points we assigned the same prior to each possible point subset and each possible dichotomy. [sent-308, score-0.268]
100 However, in practice a vast majority of all these subsets and dichotomies are unlikely to occur. [sent-309, score-0.116]
wordName wordTfidf (topN-words)
[('rh', 0.479), ('xm', 0.451), ('sm', 0.33), ('xu', 0.292), ('transduction', 0.282), ('transductive', 0.198), ('compression', 0.164), ('bounds', 0.141), ('dichotomies', 0.116), ('ln', 0.105), ('priors', 0.097), ('bound', 0.096), ('ym', 0.09), ('corollary', 0.085), ('rgq', 0.083), ('theorem', 0.082), ('hypothesis', 0.072), ('hs', 0.072), ('clustering', 0.071), ('sample', 0.068), ('unlabeled', 0.061), ('tight', 0.059), ('deriving', 0.058), ('classi', 0.056), ('vapnik', 0.054), ('full', 0.053), ('inductive', 0.052), ('induction', 0.051), ('gq', 0.05), ('hypotheses', 0.047), ('gibbs', 0.044), ('technion', 0.043), ('prior', 0.042), ('setting', 0.041), ('let', 0.04), ('error', 0.04), ('points', 0.039), ('cluster', 0.038), ('risk', 0.037), ('kc', 0.037), ('clusters', 0.035), ('log', 0.035), ('mrh', 0.033), ('urh', 0.033), ('replaced', 0.032), ('technique', 0.031), ('svm', 0.031), ('pk', 0.031), ('blum', 0.03), ('polynomially', 0.029), ('svms', 0.029), ('support', 0.028), ('training', 0.027), ('choices', 0.027), ('schemes', 0.027), ('ers', 0.026), ('lanckriet', 0.026), ('retrospect', 0.026), ('effective', 0.026), ('observing', 0.025), ('scheme', 0.025), ('holds', 0.025), ('choose', 0.025), ('binary', 0.025), ('pi', 0.025), ('easier', 0.024), ('mcallester', 0.024), ('mini', 0.024), ('referred', 0.024), ('labels', 0.024), ('proof', 0.024), ('deterministic', 0.023), ('proceeding', 0.023), ('extension', 0.023), ('subset', 0.023), ('least', 0.022), ('substituting', 0.022), ('super', 0.022), ('vc', 0.022), ('uniformly', 0.022), ('labeled', 0.022), ('ps', 0.021), ('cristianini', 0.021), ('construction', 0.02), ('colt', 0.02), ('margin', 0.02), ('er', 0.019), ('utilize', 0.019), ('provably', 0.019), ('tighter', 0.019), ('compact', 0.018), ('exibility', 0.017), ('guess', 0.017), ('learner', 0.017), ('replace', 0.017), ('pointed', 0.017), ('limitations', 0.017), ('inequalities', 0.017), ('observe', 0.017), ('advantage', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
Author: Philip Derbeko, Ran El-Yaniv, Ron Meir
Abstract: This paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering. 1 Introduction and Related Work In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or “transduce”) the labels of the test points. The transduction setting was introduced by Vapnik [1, 2] who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of points in the test set can be done using an inductive scheme. However, as pointed out in [2], it makes little sense to solve an easier problem by ‘reducing’ it to a much more difficult one. In particular, the prior knowledge carried by the (unlabeled) test points can be incorporated into an algorithm, potentially leading to superior performance. Indeed, a number of papers have demonstrated empirically that transduction can offer substantial advantage over induction whenever the training set is small or moderate (see e.g. [3, 4, 5, 6]). However, unlike the current state of affairs in induction, the question of what are provably effective learning principles for transduction is quite far from being resolved. In this paper we provide new error bounds and a general technique for transductive learning. Our technique is based on bounds that can be viewed as an extension of McAllester’s PAC-Bayesian framework [7, 8] to transductive learning. The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). This flexibility allows for the choice of “compact priors” (with small support) and therefore, for tight bounds. Another simple observation is that the PAC-Bayesian framework can be operated with polynomially (in m, the training sample size) many different priors simultaneously. Altogether, this added flexibility, of using data-dependent multiple priors allows for easy derivation of tight error bounds for “compression schemes” such as (transductive) SVMs and for clustering algorithms. We briefly review some previous results. The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e.g. [2]), where an error bound is also proposed. However, this bound is implicit and rather unwieldy and, to the best of our knowledge, has not been applied in practical situations. A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. The bound is data-dependent depending on the number of decision nodes, the margins at each node and the sample size. However, the authors state that the transduction bound is not much tighter than the induction bound. Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error, however, the advantage is usually statistically insignificant. Refining the algorithm of [2] a transductive algorithm based on a SVMs is proposed in [3]. The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain. 
An error bound for transduction, based on the effective VC Dimension, is given in [10]. More recently Lanckriet et al. [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. Blum and Langford [12] recently also established an implicit bound for transduction, in the spirit of the results in [2]. 2 The Transduction Setup We consider the following setting proposed by Vapnik ([2] Chp. 8), which for simplicity is described in the context of binary classification (the general case will be discussed in the full paper). Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let Xm+u = {x1 , . . . , xm+u } be a set of points from X each of which is chosen i.i.d. according to some unknown distribution µ(x). We call Xm+u the full sample. Let Xm = {x1 , . . . , xm } and Ym = {y1 , . . . , ym }, where Xm is drawn uniformly from Xm+u and yi ∈ {±1}. The set Sm = {(x1 , y1 ), . . . , (xm , ym )} is referred to as a training sample. In this paper we assume that yi = φ(xi ) for some unknown function φ. The remaining subset Xu = Xm+u \ Xm is referred to as the unlabeled sample. Based on Sm and Xu our goal is to choose h ∈ H which predicts the labels of points in Xu as accurately as possible. For each h ∈ H and a set Z = x1 , . . . , x|Z| of samples define 1 Rh (Z) = |Z| |Z| (h(xi ), yi ), (1) i=1 where in our case (·, ·) is the zero-one loss function. Our goal in transduction is to learn an h such that Rh (Xu ) is as small as possible. This problem setup is summarized by the following transduction “protocol” introduced in [2] and referred to as Setting 1: (i) A full sample Xm+u = {x1 , . . . , xm+u } consisting of arbitrary m + u points is given.1 (ii) We then choose uniformly at random the training sample Xm ⊆ Xm+u and receive its labeling Ym ; the resulting training set is Sm = (Xm , Ym ) and the remaining set Xu is the unlabeled sample, Xu = Xm+u \ Xm ; (iii) Using both Sm and Xu we select a classifier h ∈ H whose quality is measured by Rh (Xu ). Vapnik [2] also considers another formulation of transduction, referred to as Setting 2: (i) We are given a training set Sm = (Xm , Ym ) selected i.i.d according to µ(x, y). (ii) An independent test set Su = (Xu , Yu ) of u samples is then selected in the same manner. 1 The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently at random according to some source distribution µ(x). (iii) We are required to choose our best h ∈ H based on Sm and Xu so as to minimize m+u Rm,u (h) = 1 (h(xi ), yi ) dµ(x1 , y1 ) · · · dµ(xm+u , ym+u ). u i=m+1 (2) Even though Setting 2 may appear more applicable in practical situations than Setting 1, the derivation of theoretical results can be easier within Setting 1. Nevertheless, as far as the expected losses are concerned, Vapnik [2] shows that an error bound in Setting 1 implies an equivalent bound in Setting 2. In view of this result we restrict ourselves in the sequel to Setting 1. We make use of the following quantities, which are all instances of (1). The quantity Rh (Xm+u ) is called the full sample risk of the hypothesis h, Rh (Xu ) is referred to as the transduction risk (of h), and Rh (Xm ) is the training error (of h). Thus, Rh (Xm ) is ˆ the standard training error denoted by Rh (Sm ). While our objective in transduction is to achieve small error over the unlabeled set (i.e. to minimize Rh (Xu )), it turns out that it is much easier to derive error bounds for the full sample risk. 
The following simple lemma translates an error bound on Rh (Xm+u ), the full sample risk, to an error bound on the transduction risk Rh (Xu ). Lemma 2.1 For any h ∈ H and any C ˆ Rh (Xm+u ) ≤ Rh (Sm ) + C ⇔ ˆ Rh (Xu ) ≤ Rh (Sm ) + m+u · C. u (3) Proof: For any h Rh (Xm+u ) = mRh (Xm ) + uRh (Xu ) . m+u (4) ˆ Substituting Rh (Sm ) for Rh (Xm ) in (4) and then substituting the result for the left-hand side of (3) we get Rh (Xm+u ) = ˆ mRh (Sm ) + uRh (Xu ) ˆ ≤ Rh (Sm ) + C. m+u The equivalence (3) is now obtained by isolating Rh (Xu ) on the left-hand side. 2 3 General Error Bounds for Transduction Consider a hypothesis class H and assume for simplicity that H is countable; in fact, in the case of transduction it suffices to consider a finite hypothesis class. To see this note that all m + u points are known in advance. Thus, in the case of binary classification (for example) it suffices to consider at most 2m+u possible dichotomies. Recall that in the setting considered we select a sub-sample of m points from the set Xm+u of cardinality m+u. This corresponds to a selection of m points without replacement from a set of m+u points, leading to the m points being dependent. A naive utilization of large deviation bounds would therefore not be directly applicable in this setting. However, Hoeffding (see Theorem 4 in [13]) pointed out a simple procedure to transform the problem into one involving independent data. While this procedure leads to non-trivial bounds, it does not fully take advantage of the transductive setting and will not be used here. Consider for simplicity the case of binary classification. In this case we make use of the following concentration inequality, based on [14]. Theorem 3.1 Let C = {c1 , . . . , cN }, ci ∈ {0, 1}, be a finite set of binary numbers, and N set c = (1/N ) i=1 ci . Let Z1 , . . . , Zm , be random variables obtaining their values ¯ by sampling C uniformly at random without replacement. Set Z = (1/m) β = m/N . Then, if 2 ε ≤ min{1 − c, c(1 − β)/β}, ¯¯ Pr {Z − EZ > ε} ≤ exp −mD(¯ + ε c) − (N − m) D c − c ¯ ¯ m i=1 Zi and βε c + 7 log(N + 1) ¯ 1−β where D(p q) = p log(p/q) = (1 − p) log(1 − p)/(1 − q), p, q, ∈ [0, 1] is the binary Kullback-Leibler divergence. Using this result we obtain the following error bound for transductive classification. Theorem 3.2 Let Xm+u = Xm ∪Xu be the full sample and let p = p(Xm+u ) be a (prior) distribution over the class of binary hypotheses H that may depend on the full sample. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over choices of Sm (from the full sample) the following bound holds for any h ∈ H, ˆ 2Rh (Sm )(m + u) u ˆ Rh (Xu ) ≤ Rh (Sm ) + + 2 log 1 p(h) log + ln m + 7 log(m + u + 1) δ m−1 + ln m + 7 log(m + u + 1) δ m−1 1 p(h) . (5) Proof: (sketch) In our transduction setting the set Xm (and therefore Sm ) is obtained by sampling the full sample Xm+u uniformly at random without replacement. We first claim that ˆ EΣm Rh (Sm ) = Rh (Xm+u ), (6) where EΣm (·) is the expectation with respect to a random choice of Sm from Xm+u without replacement. This is shown as follows. ˆ EΣm Rh (Sm ) = 1 m+u m ˆ Rh (Sm ) = Sm 1 m+u m Xm ⊆Xm+n 1 m (h(x), φ(x)). x∈Sm By symmetry, all points x ∈ Xm+u are counted on the right-hand side an equal number of times; this number is precisely m+u − m+u−1 = m+u−1 . The equality (6) is obtained m m m−1 m by considering the definition of Rh (Xm+u ) and noting that m+u−1 / m+u = m+u . m−1 m The remainder of the proof combines Theorem 3.1 and the techniques presented in [15]. 
The details will be provided in the full paper. 2 ˆ Notice that when Rh (Sm ) → 0 the square root in (5) vanishes and faster rates are obtained. An important feature of Theorem 3.2 is that it allows one to use the sample Xm+u in order to choose the prior distribution p(h). This advantage has already been alluded to in [2], but does not seem to have been widely used in practice. Additionally, observe that (5) holds with probability at least 1 − δ with respect to the random selection of sub-samples of size m from the fixed set Xm+u . This should be contrasted with the standard inductive setting results where the probabilities are with respect to a random choice of m training points chosen i.i.d. from µ(x, y). The next bound we present is analogous to McAllester’s Theorem 1 in [8]. This theorem concerns Gibbs composite classifiers, which are distributions over the base classifiers in H. For any distribution q over H denote by Gq the Gibbs classifier, which classifies an 2 The second condition, ε ≤ c(1 − β)/β, simply guarantees that the number of ‘ones’ in the ¯ sub-sample does not exceed their number in the original sample. , instance (in Xu ) by randomly choosing, according to q, one hypothesis h ∈ H. For Gibbs classifiers we now extend definition (1) as follows. Let Z = x1 , . . . , x|Z| be any set of samples and let Gq be a Gibbs classifier over H. The risk of Gq over Z is RGq (Z) = Eh∼q (1/|Z|) |Z| i=1 (h(xi ), φ(xi )) . As before, when Z = Xm (the training set) we ˆ use the standard notation RGq (Sm ) = RGq (Xm ). Due to space limitations, the proof of the following theorem will appear in the full paper. Theorem 3.3 Let Xm+u be the full sample. Let p be a distribution over H that may depend on Xm+u and let q be a (posterior) distribution over H that may depend on both Sm and Xu . Let δ ∈ (0, 1) be given. With probability at least 1 − δ over the choices of Sm for any distribution q ˆ RGq (Xu ) ≤ RGq (Sm ) + + ˆ 2RGq (Sm )(m + u) u D(q p) + ln m + 7 log(m + u + 1) δ m−1 7 2 D(q p) + ln m + m log(m + u + 1) δ m−1 . In the context of inductive learning, a major obstacle in generating meaningful and effective bounds using the PAC-Bayesian framework [8] is the construction of “compact priors”. Here we discuss two extensions to the PAC-Bayesian scheme, which together allow for easy choices of compact priors that can yield tight error bounds. The first extension we offer is the use of multiple priors. Instead of a single prior p in the original PACBayesian framework we observe that one can use all PAC-Bayesian bounds with a number of priors p1 , . . . , pk and then replace the complexity term ln(1/p(h)) (in Theorem 3.2) by mini ln(1/pi (h)), at a cost of an additional ln k term (see below). Similarly, in Theorem 3.3 we can replace the KL-divergence term in the bound with mini D(q||pi ). The penalty for using k priors is logarithmic in k (specifically the ln(1/δ) term in the original bound becomes ln(k/δ)). As long as k is sub-exponential in m we still obtain effective generalization bounds. The second “extension” is simply the feature of our transduction bounds (Theorems 3.2 and 3.3), which allows for the priors to be dependent on the full sample Xm+u . The combination of these two simple ideas yields a powerful technique for deriving error bounds in realistic transductive settings. After stating the extended result we later use it for deriving tight bounds for known learning algorithms and for deriving new algorithms. 
Suppose that instead of a single prior $p$ over $H$ we want to utilize $k$ priors, $p_1, \ldots, p_k$, and in retrospect choose the best among the $k$ corresponding PAC-Bayesian bounds. The following theorem shows that one can use polynomially many priors with only a minor penalty. The proof, which is omitted due to space limitations, utilizes the union bound in a straightforward manner.

Theorem 3.4  Let the conditions of Theorem 3.2 hold, except that we now have $k$ prior distributions $p_1, \ldots, p_k$ defined over $H$, each of which may depend on $X_{m+u}$. Let $\delta \in (0,1)$ be given. Then, with probability at least $1-\delta$ over random choices of sub-samples of size $m$ from the full sample, for all $h \in H$, (5) holds with $\log\frac{1}{p(h)}$ replaced by $\min_{1 \le i \le k} \log\frac{1}{p_i(h)}$ and $\log\frac{1}{\delta}$ replaced by $\log\frac{k}{\delta}$.

Remark: A similar result holds for the Gibbs algorithm of Theorem 3.3. Also, as noted by one of the reviewers, when the supports of the $k$ priors intersect (i.e. there is at least one pair of priors $p_i$ and $p_j$ with overlapping support), one can do better by utilizing the "super prior" $p = \frac{1}{k} \sum_i p_i$ within the original Theorem 3.2. However, note that when the supports are disjoint, these two views (of multiple priors and of a super prior) are equivalent. In the applications below we utilize non-intersecting priors.

4  Bounds for Compression Algorithms

Here we propose a technique for bounding the error of "compression" algorithms based on an appropriate construction of prior probabilities. Let $A$ be a learning algorithm. Intuitively, $A$ is a "compression scheme" if it can generate the same hypothesis using a subset of the data. More formally, a learning algorithm $A$ (viewed as a function from samples to some hypothesis class) is a compression scheme with respect to a sample $Z$ if there is a subsample $Z'$, $Z' \subset Z$, such that $A(Z') = A(Z)$. Observe that the SVM approach is a compression scheme, with $Z'$ being determined by the set of support vectors.

Let $A$ be a deterministic compression scheme and consider the full sample $X_{m+u}$. For each integer $\tau = 1, \ldots, m$, consider all subsets of $X_{m+u}$ of size $\tau$, and for each subset construct all possible dichotomies of that subset (note that we are not proposing this approach as an algorithm, but rather as a means to derive bounds; in practice one need not construct all these dichotomies). A deterministic algorithm $A$ uniquely determines at most one hypothesis $h \in H$ for each dichotomy (it might be that for some dichotomies the algorithm fails; for example, an SVM in feature space without a soft margin will fail to classify non-linearly-separable dichotomies of $X_{m+u}$). For each $\tau$, let the set of hypotheses generated by this procedure be denoted by $H_\tau$. For the rest of this discussion we assume the worst case, where $|H_\tau| = 2^\tau \binom{m+u}{\tau}$ (i.e. if $H_\tau$ does not contain one hypothesis for each dichotomy, the bounds only improve). The prior $p_\tau$ is then defined to be a uniform distribution over $H_\tau$. In this way we have $m$ priors, $p_1, \ldots, p_m$, which are constructed using only $X_{m+u}$ (and are independent of $S_m$). Any hypothesis selected by the learning algorithm $A$ based on the labeled sample $S_m$ and on the test set $X_u$ belongs to $\cup_{\tau=1}^{m} H_\tau$. The motivation for this construction is as follows. Each $\tau$ can be viewed as our "guess" for the maximal number of compression points that will be utilized by a resulting classifier. For each such $\tau$ the prior $p_\tau$ is constructed over all possible classifiers that use $\tau$ compression points. By systematically considering all possible dichotomies of $\tau$ points we can characterize a relatively small subset of $H$ without observing labels of the training points. Thus, each prior $p_\tau$ represents one such guess.
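The complexity term contributed by each such prior is simply the log of its support size, which can be computed exactly and compared with the closed-form upper bound that Corollary 4.1 below plugs into (5). A minimal sketch, assuming the worst-case count $|H_\tau| = 2^\tau \binom{m+u}{\tau}$ described above (the function names are our own):

```python
from math import comb, e, log

def compression_prior_complexity(tau, m, u):
    """Exact log(1/p_tau(h)) for the uniform prior over H_tau, assuming the
    worst case |H_tau| = 2^tau * C(m+u, tau): one hypothesis for every
    dichotomy of every tau-subset of the full sample."""
    return log(2 ** tau * comb(m + u, tau))

def compression_complexity_upper_bound(s, m, u):
    """Closed-form upper bound s * log(2e(m+u)/s) used in Corollary 4.1."""
    return s * log(2 * e * (m + u) / s)

# Full sample of 1000 points and a compression set of 20 points: the exact
# complexity and its closed-form upper bound are close.
m, u, s = 200, 800, 20
print(compression_prior_complexity(s, m, u),
      compression_complexity_upper_bound(s, m, u))
```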
Using Theorem 3.4 we are then allowed to choose, in retrospect, the bound corresponding to the best "guess". The following corollary gives an upper bound on the complexity term $\log(1/p(h))$ in terms of the observed size of the compression set of the final classifier.

Corollary 4.1  Let the conditions of Theorem 3.4 hold. Let $A$ be a deterministic learning algorithm leading to a hypothesis $h \in H$ based on a compression set of size $s$. Then with probability at least $1-\delta$, for all $h \in H$, (5) holds with $\log(1/p(h))$ replaced by $s \log\big(2e(m+u)/s\big)$ and $\ln(m/\delta)$ replaced by $\ln(m^2/\delta)$.

Proof: Recall that $H_s \subseteq H$ is the support set of $p_s$ and that $p_s(h) = 1/|H_s|$ for all $h \in H_s$, implying that $\ln(1/p_s(h)) = \ln|H_s|$. Using the inequality $\binom{m+u}{s} \le (e(m+u)/s)^s$ we have that $|H_s| = 2^s \binom{m+u}{s} \le (2e(m+u)/s)^s$. Substituting this result in Theorem 3.4, while restricting the minimum over $i$ to be over $i \ge s$, leads to the desired result. □

The bound of Corollary 4.1 can be easily computed once the classifier is trained. If the size of the compression set happens to be small, we obtain a tight bound. SVM classification is one of the best studied compression schemes. The compression set for a sample $S_m$ is given by the subset of support vectors. Thus the bound in Corollary 4.1 immediately applies with $s$ being the number of observed support vectors (after training). We note that this bound is similar to a recently derived compression bound for inductive learning (Theorem 5.18 in [16]). Also, observe that in this case the algorithm itself (inductive SVM) did not use the unlabeled sample (although the bound does use this sample). Nevertheless, using exactly the same technique we obtain error bounds for the transductive SVM algorithms in [2, 3] (note, however, that our bounds are optimized with a "minimum number of support vectors" approach rather than "maximum margin").

5  Bounds for Clustering Algorithms

Some learning problems do not allow for high compression rates using compression schemes such as SVMs (i.e. the number of support vectors can sometimes be very large). A considerably stronger type of compression can often be achieved by clustering algorithms. While there is a lack of formal links between entirely unsupervised clustering and classification, within a transduction setting we can provide a principled approach to using clustering algorithms for classification.

Let $A$ be any (deterministic) clustering algorithm which, given the full sample $X_{m+u}$, can cluster this sample into any desired number of clusters. We use $A$ to cluster $X_{m+u}$ into $2, 3, \ldots, c$ clusters, where $c \le m$. Thus, the algorithm generates a collection of partitions of $X_{m+u}$ into $\tau = 2, 3, \ldots, c$ clusters, where each partition is denoted by $C_\tau$. For each value of $\tau$, let $H_\tau$ consist of those hypotheses which assign an identical label to all points in the same cluster of partition $C_\tau$, and define the prior $p_\tau(h) = 1/2^\tau$ for each $h \in H_\tau$ and zero otherwise (note that there are $2^\tau$ possible dichotomies). The learning algorithm selects a hypothesis as follows. Upon observing the labeled sample $S_m = (X_m, Y_m)$, for each of the clusterings $C_2, \ldots, C_c$ constructed above, it assigns a label to each cluster based on the majority vote among the labels $Y_m$ of the points falling within that cluster (in case of ties, or if no points from $X_m$ belong to the cluster, a label is chosen arbitrarily).
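The cluster-labeling rule just described is straightforward to implement. The following is a minimal sketch with our own naming, in which the clustering algorithm $A$ that produces the partition of the full sample is assumed given (e.g. k-means run on $X_{m+u}$):

```python
import numpy as np
from collections import Counter

def label_clusters_by_majority(cluster_ids, labeled_idx, labels_m, seed=0):
    """Turn a partition C_tau of the full sample into a classifier h_tau.

    cluster_ids -- cluster index of every point in the full sample X_{m+u}
    labeled_idx -- positions (in the full sample) of the training points X_m
    labels_m    -- their labels Y_m, each in {-1, +1}

    Each cluster receives the majority label of the training points it
    contains; an empty cluster gets a random label, and ties fall to
    whichever label Counter lists first (both count as the 'arbitrary'
    choice described in the text).
    """
    rng = np.random.default_rng(seed)
    votes = {}
    for i, y in zip(labeled_idx, labels_m):
        votes.setdefault(int(cluster_ids[i]), []).append(y)
    cluster_label = {
        int(c): Counter(votes[int(c)]).most_common(1)[0][0]
                if int(c) in votes else int(rng.choice([-1, 1]))
        for c in np.unique(cluster_ids)
    }
    return np.array([cluster_label[int(c)] for c in cluster_ids])

# A partition of 10 points into 3 clusters, with 4 of the points labeled.
ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
print(label_clusters_by_majority(ids, [0, 1, 3, 4], [+1, +1, -1, -1]))
```

Each resulting $h_\tau$ is one of the $2^\tau$ hypotheses supported by the prior $p_\tau$, which is what makes the bounds of the next corollary applicable.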
Doing this leads to $c-1$ classifiers $h_\tau$, $\tau = 2, \ldots, c$. For each $h_\tau$ there is a valid error bound as given by Theorem 3.4, and all these bounds are valid simultaneously. Thus we choose the best classifier (equivalently, the number of clusters) for which the best bound holds. We thus have the following corollary of Theorem 3.4 and Lemma 2.1.

Corollary 5.1  Let $A$ be any clustering algorithm and let $h_\tau$, $\tau = 2, \ldots, c$, be classifications of the test set $X_u$ as determined by the clustering of the full sample $X_{m+u}$ (into $\tau$ clusters) and the training set $S_m$, as described above. Let $\delta \in (0,1)$ be given. Then with probability at least $1-\delta$, for all $\tau$, (5) holds with $\log(1/p(h))$ replaced by $\tau$ and $\ln(m/\delta)$ replaced by $\ln(mc/\delta)$.

Error bounds obtained using Corollary 5.1 can be rather tight when the clustering algorithm is successful (i.e. when it captures the class structure in the data using a small number of clusters).

Corollary 5.1 can be extended in a number of ways. One simple extension is the use of an ensemble of clustering algorithms. Specifically, we can concurrently apply $k$ clustering algorithms (using each algorithm to cluster the data into $\tau = 2, \ldots, c$ clusters). We thus obtain $kc$ hypotheses (partitions of $X_{m+u}$). By a simple application of the union bound we can replace $\ln\frac{cm}{\delta}$ by $\ln\frac{kcm}{\delta}$ in Corollary 5.1 and guarantee that the $kc$ bounds hold simultaneously for all $kc$ hypotheses (with probability at least $1-\delta$). We then choose the hypothesis which minimizes the resulting bound. This extension is particularly attractive since, typically, without prior knowledge we do not know which clustering algorithm will be effective for the dataset at hand.

6  Concluding Remarks

We presented new bounds for transductive learning algorithms. We also developed a new technique for deriving tight error bounds for compression schemes and for clustering algorithms in the transductive setting. We expect that these bounds and new techniques will be useful for deriving new error bounds for other known algorithms and for deriving new types of transductive learning algorithms. It would be interesting to see if tighter transduction bounds can be obtained by reducing the "slacks" in the inequalities we use in our analysis. Another promising direction is the construction of better (multiple) priors. For example, in our compression bound (Corollary 4.1), for each number of compression points we assigned the same prior to each possible point subset and each possible dichotomy. However, in practice a vast majority of all these subsets and dichotomies are unlikely to occur.

Acknowledgments

The work of R.E. and R.M. was partially supported by the Technion V.P.R. fund for the promotion of sponsored research. Support from the Ollendorff center of the Department of Electrical Engineering at the Technion is also acknowledged. We also thank the anonymous referees for their useful comments.

References

[1] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982.
[2] V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998.
[3] T. Joachims. Transductive inference for text classification using support vector machines. In European Conference on Machine Learning, 1999.
[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001.
[5] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS 2001), pages 1025–1032, 2001.
[6] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.
[7] D. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
[8] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
[9] D. Wu, K. Bennett, N. Cristianini, and J. Shawe-Taylor. Large margin trees for induction and transduction. In International Conference on Machine Learning, 1999.
[10] L. Bottou, C. Cortes, and V. Vapnik. On the effective VC dimension. Technical report, AT&T, 1994.
[11] G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Technical report, University of California, Berkeley, Computer Science Division, 2002.
[12] A. Blum and J. Langford. PAC-MDL bounds. In COLT, pages 344–357, 2003.
[13] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
[14] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer, New York, second edition, 1998.
[15] D. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, pages 203–215, 2003.
[16] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Cambridge, MA, 2002.
2 0.13055184 48 nips-2003-Convex Methods for Transduction
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
3 0.10565405 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
Author: Xuanlong Nguyen, Michael I. Jordan
Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1
4 0.094580546 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
Author: Tong Zhang
Abstract: In this paper we obtain convergence bounds for the concentration of Bayesian posterior distributions (around the true distribution) using a novel method that simplifies and enhances previous results. Based on the analysis, we also introduce a generalized family of Bayesian posteriors, and show that the convergence behavior of these generalized posteriors is completely determined by the local prior structure around the true distribution. This important and surprising robustness property does not hold for the standard Bayesian posterior in that it may not concentrate when there exist “bad” prior structures even at places far away from the true distribution. 1
5 0.093844123 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
6 0.076193683 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles
7 0.074758172 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
8 0.067569345 151 nips-2003-PAC-Bayesian Generic Chaining
9 0.06630829 41 nips-2003-Boosting versus Covering
10 0.063372828 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
11 0.062303856 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
12 0.062081411 73 nips-2003-Feature Selection in Clustering Problems
13 0.061725639 111 nips-2003-Learning the k in k-means
14 0.059421968 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
15 0.059374135 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
16 0.059308618 46 nips-2003-Clustering with the Connectivity Kernel
17 0.058705796 107 nips-2003-Learning Spectral Clustering
18 0.057381965 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
19 0.05694776 124 nips-2003-Max-Margin Markov Networks
20 0.053701214 122 nips-2003-Margin Maximizing Loss Functions
topicId topicWeight
[(0, -0.173), (1, -0.092), (2, -0.066), (3, -0.023), (4, 0.086), (5, 0.032), (6, -0.057), (7, 0.039), (8, -0.1), (9, 0.043), (10, -0.041), (11, -0.076), (12, -0.001), (13, 0.049), (14, 0.052), (15, 0.039), (16, -0.029), (17, 0.074), (18, 0.002), (19, -0.09), (20, -0.078), (21, 0.003), (22, -0.032), (23, -0.014), (24, -0.07), (25, 0.03), (26, 0.135), (27, 0.141), (28, 0.033), (29, -0.031), (30, 0.036), (31, -0.041), (32, -0.012), (33, -0.011), (34, 0.132), (35, 0.021), (36, 0.204), (37, 0.064), (38, 0.038), (39, -0.017), (40, 0.053), (41, 0.046), (42, -0.033), (43, 0.087), (44, 0.068), (45, -0.162), (46, -0.064), (47, -0.092), (48, 0.013), (49, 0.156)]
simIndex simValue paperId paperTitle
same-paper 1 0.94380265 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
Author: Philip Derbeko, Ran El-Yaniv, Ron Meir
Abstract: This paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering. 1 Introduction and Related Work In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or “transduce”) the labels of the test points. The transduction setting was introduced by Vapnik [1, 2] who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of points in the test set can be done using an inductive scheme. However, as pointed out in [2], it makes little sense to solve an easier problem by ‘reducing’ it to a much more difficult one. In particular, the prior knowledge carried by the (unlabeled) test points can be incorporated into an algorithm, potentially leading to superior performance. Indeed, a number of papers have demonstrated empirically that transduction can offer substantial advantage over induction whenever the training set is small or moderate (see e.g. [3, 4, 5, 6]). However, unlike the current state of affairs in induction, the question of what are provably effective learning principles for transduction is quite far from being resolved. In this paper we provide new error bounds and a general technique for transductive learning. Our technique is based on bounds that can be viewed as an extension of McAllester’s PAC-Bayesian framework [7, 8] to transductive learning. The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). This flexibility allows for the choice of “compact priors” (with small support) and therefore, for tight bounds. Another simple observation is that the PAC-Bayesian framework can be operated with polynomially (in m, the training sample size) many different priors simultaneously. Altogether, this added flexibility, of using data-dependent multiple priors allows for easy derivation of tight error bounds for “compression schemes” such as (transductive) SVMs and for clustering algorithms. We briefly review some previous results. The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e.g. [2]), where an error bound is also proposed. However, this bound is implicit and rather unwieldy and, to the best of our knowledge, has not been applied in practical situations. A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. The bound is data-dependent depending on the number of decision nodes, the margins at each node and the sample size. However, the authors state that the transduction bound is not much tighter than the induction bound. Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error, however, the advantage is usually statistically insignificant. Refining the algorithm of [2] a transductive algorithm based on a SVMs is proposed in [3]. The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain. 
An error bound for transduction, based on the effective VC Dimension, is given in [10]. More recently Lanckriet et al. [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. Blum and Langford [12] recently also established an implicit bound for transduction, in the spirit of the results in [2]. 2 The Transduction Setup We consider the following setting proposed by Vapnik ([2] Chp. 8), which for simplicity is described in the context of binary classification (the general case will be discussed in the full paper). Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let Xm+u = {x1 , . . . , xm+u } be a set of points from X each of which is chosen i.i.d. according to some unknown distribution µ(x). We call Xm+u the full sample. Let Xm = {x1 , . . . , xm } and Ym = {y1 , . . . , ym }, where Xm is drawn uniformly from Xm+u and yi ∈ {±1}. The set Sm = {(x1 , y1 ), . . . , (xm , ym )} is referred to as a training sample. In this paper we assume that yi = φ(xi ) for some unknown function φ. The remaining subset Xu = Xm+u \ Xm is referred to as the unlabeled sample. Based on Sm and Xu our goal is to choose h ∈ H which predicts the labels of points in Xu as accurately as possible. For each h ∈ H and a set Z = x1 , . . . , x|Z| of samples define 1 Rh (Z) = |Z| |Z| (h(xi ), yi ), (1) i=1 where in our case (·, ·) is the zero-one loss function. Our goal in transduction is to learn an h such that Rh (Xu ) is as small as possible. This problem setup is summarized by the following transduction “protocol” introduced in [2] and referred to as Setting 1: (i) A full sample Xm+u = {x1 , . . . , xm+u } consisting of arbitrary m + u points is given.1 (ii) We then choose uniformly at random the training sample Xm ⊆ Xm+u and receive its labeling Ym ; the resulting training set is Sm = (Xm , Ym ) and the remaining set Xu is the unlabeled sample, Xu = Xm+u \ Xm ; (iii) Using both Sm and Xu we select a classifier h ∈ H whose quality is measured by Rh (Xu ). Vapnik [2] also considers another formulation of transduction, referred to as Setting 2: (i) We are given a training set Sm = (Xm , Ym ) selected i.i.d according to µ(x, y). (ii) An independent test set Su = (Xu , Yu ) of u samples is then selected in the same manner. 1 The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently at random according to some source distribution µ(x). (iii) We are required to choose our best h ∈ H based on Sm and Xu so as to minimize m+u Rm,u (h) = 1 (h(xi ), yi ) dµ(x1 , y1 ) · · · dµ(xm+u , ym+u ). u i=m+1 (2) Even though Setting 2 may appear more applicable in practical situations than Setting 1, the derivation of theoretical results can be easier within Setting 1. Nevertheless, as far as the expected losses are concerned, Vapnik [2] shows that an error bound in Setting 1 implies an equivalent bound in Setting 2. In view of this result we restrict ourselves in the sequel to Setting 1. We make use of the following quantities, which are all instances of (1). The quantity Rh (Xm+u ) is called the full sample risk of the hypothesis h, Rh (Xu ) is referred to as the transduction risk (of h), and Rh (Xm ) is the training error (of h). Thus, Rh (Xm ) is ˆ the standard training error denoted by Rh (Sm ). While our objective in transduction is to achieve small error over the unlabeled set (i.e. to minimize Rh (Xu )), it turns out that it is much easier to derive error bounds for the full sample risk. 
The following simple lemma translates an error bound on Rh (Xm+u ), the full sample risk, to an error bound on the transduction risk Rh (Xu ). Lemma 2.1 For any h ∈ H and any C ˆ Rh (Xm+u ) ≤ Rh (Sm ) + C ⇔ ˆ Rh (Xu ) ≤ Rh (Sm ) + m+u · C. u (3) Proof: For any h Rh (Xm+u ) = mRh (Xm ) + uRh (Xu ) . m+u (4) ˆ Substituting Rh (Sm ) for Rh (Xm ) in (4) and then substituting the result for the left-hand side of (3) we get Rh (Xm+u ) = ˆ mRh (Sm ) + uRh (Xu ) ˆ ≤ Rh (Sm ) + C. m+u The equivalence (3) is now obtained by isolating Rh (Xu ) on the left-hand side. 2 3 General Error Bounds for Transduction Consider a hypothesis class H and assume for simplicity that H is countable; in fact, in the case of transduction it suffices to consider a finite hypothesis class. To see this note that all m + u points are known in advance. Thus, in the case of binary classification (for example) it suffices to consider at most 2m+u possible dichotomies. Recall that in the setting considered we select a sub-sample of m points from the set Xm+u of cardinality m+u. This corresponds to a selection of m points without replacement from a set of m+u points, leading to the m points being dependent. A naive utilization of large deviation bounds would therefore not be directly applicable in this setting. However, Hoeffding (see Theorem 4 in [13]) pointed out a simple procedure to transform the problem into one involving independent data. While this procedure leads to non-trivial bounds, it does not fully take advantage of the transductive setting and will not be used here. Consider for simplicity the case of binary classification. In this case we make use of the following concentration inequality, based on [14]. Theorem 3.1 Let C = {c1 , . . . , cN }, ci ∈ {0, 1}, be a finite set of binary numbers, and N set c = (1/N ) i=1 ci . Let Z1 , . . . , Zm , be random variables obtaining their values ¯ by sampling C uniformly at random without replacement. Set Z = (1/m) β = m/N . Then, if 2 ε ≤ min{1 − c, c(1 − β)/β}, ¯¯ Pr {Z − EZ > ε} ≤ exp −mD(¯ + ε c) − (N − m) D c − c ¯ ¯ m i=1 Zi and βε c + 7 log(N + 1) ¯ 1−β where D(p q) = p log(p/q) = (1 − p) log(1 − p)/(1 − q), p, q, ∈ [0, 1] is the binary Kullback-Leibler divergence. Using this result we obtain the following error bound for transductive classification. Theorem 3.2 Let Xm+u = Xm ∪Xu be the full sample and let p = p(Xm+u ) be a (prior) distribution over the class of binary hypotheses H that may depend on the full sample. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over choices of Sm (from the full sample) the following bound holds for any h ∈ H, ˆ 2Rh (Sm )(m + u) u ˆ Rh (Xu ) ≤ Rh (Sm ) + + 2 log 1 p(h) log + ln m + 7 log(m + u + 1) δ m−1 + ln m + 7 log(m + u + 1) δ m−1 1 p(h) . (5) Proof: (sketch) In our transduction setting the set Xm (and therefore Sm ) is obtained by sampling the full sample Xm+u uniformly at random without replacement. We first claim that ˆ EΣm Rh (Sm ) = Rh (Xm+u ), (6) where EΣm (·) is the expectation with respect to a random choice of Sm from Xm+u without replacement. This is shown as follows. ˆ EΣm Rh (Sm ) = 1 m+u m ˆ Rh (Sm ) = Sm 1 m+u m Xm ⊆Xm+n 1 m (h(x), φ(x)). x∈Sm By symmetry, all points x ∈ Xm+u are counted on the right-hand side an equal number of times; this number is precisely m+u − m+u−1 = m+u−1 . The equality (6) is obtained m m m−1 m by considering the definition of Rh (Xm+u ) and noting that m+u−1 / m+u = m+u . m−1 m The remainder of the proof combines Theorem 3.1 and the techniques presented in [15]. 
The details will be provided in the full paper. 2 ˆ Notice that when Rh (Sm ) → 0 the square root in (5) vanishes and faster rates are obtained. An important feature of Theorem 3.2 is that it allows one to use the sample Xm+u in order to choose the prior distribution p(h). This advantage has already been alluded to in [2], but does not seem to have been widely used in practice. Additionally, observe that (5) holds with probability at least 1 − δ with respect to the random selection of sub-samples of size m from the fixed set Xm+u . This should be contrasted with the standard inductive setting results where the probabilities are with respect to a random choice of m training points chosen i.i.d. from µ(x, y). The next bound we present is analogous to McAllester’s Theorem 1 in [8]. This theorem concerns Gibbs composite classifiers, which are distributions over the base classifiers in H. For any distribution q over H denote by Gq the Gibbs classifier, which classifies an 2 The second condition, ε ≤ c(1 − β)/β, simply guarantees that the number of ‘ones’ in the ¯ sub-sample does not exceed their number in the original sample. , instance (in Xu ) by randomly choosing, according to q, one hypothesis h ∈ H. For Gibbs classifiers we now extend definition (1) as follows. Let Z = x1 , . . . , x|Z| be any set of samples and let Gq be a Gibbs classifier over H. The risk of Gq over Z is RGq (Z) = Eh∼q (1/|Z|) |Z| i=1 (h(xi ), φ(xi )) . As before, when Z = Xm (the training set) we ˆ use the standard notation RGq (Sm ) = RGq (Xm ). Due to space limitations, the proof of the following theorem will appear in the full paper. Theorem 3.3 Let Xm+u be the full sample. Let p be a distribution over H that may depend on Xm+u and let q be a (posterior) distribution over H that may depend on both Sm and Xu . Let δ ∈ (0, 1) be given. With probability at least 1 − δ over the choices of Sm for any distribution q ˆ RGq (Xu ) ≤ RGq (Sm ) + + ˆ 2RGq (Sm )(m + u) u D(q p) + ln m + 7 log(m + u + 1) δ m−1 7 2 D(q p) + ln m + m log(m + u + 1) δ m−1 . In the context of inductive learning, a major obstacle in generating meaningful and effective bounds using the PAC-Bayesian framework [8] is the construction of “compact priors”. Here we discuss two extensions to the PAC-Bayesian scheme, which together allow for easy choices of compact priors that can yield tight error bounds. The first extension we offer is the use of multiple priors. Instead of a single prior p in the original PACBayesian framework we observe that one can use all PAC-Bayesian bounds with a number of priors p1 , . . . , pk and then replace the complexity term ln(1/p(h)) (in Theorem 3.2) by mini ln(1/pi (h)), at a cost of an additional ln k term (see below). Similarly, in Theorem 3.3 we can replace the KL-divergence term in the bound with mini D(q||pi ). The penalty for using k priors is logarithmic in k (specifically the ln(1/δ) term in the original bound becomes ln(k/δ)). As long as k is sub-exponential in m we still obtain effective generalization bounds. The second “extension” is simply the feature of our transduction bounds (Theorems 3.2 and 3.3), which allows for the priors to be dependent on the full sample Xm+u . The combination of these two simple ideas yields a powerful technique for deriving error bounds in realistic transductive settings. After stating the extended result we later use it for deriving tight bounds for known learning algorithms and for deriving new algorithms. 
Suppose that instead of a single prior p over H we want to utilize k priors, p1 , . . . , pk and in retrospect choose the best among the k corresponding PAC-Bayesian bounds. The following theorem shows that one can use polynomially many priors with a minor penalty. The proof, which is omitted due to space limitations, utilizes the union bound in a straightforward manner. Theorem 3.4 Let the conditions of Theorem 3.2 hold, except that we now have k prior distributions p1 , . . . , pk defined over H, each of which may depend on Xm+u . Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over random choices of sub-samples of size m from the full-sample, for all h ∈ H, (5) holds with p(h) replaced by min1≤i≤k pi (h) and log 1 is replaced by log k . δ δ Remark: A similar result holds for the Gibbs algorithm of Theorem 3.3. Also, as noted by one of the reviewers, when the supports of the k priors intersect (i.e. there is at least one pair of priors pi and pj with overlapping support), then one can do better by utilizing the 1 “super prior” p = k i pi within the original Theorem 3.2. However, note that when the supports are disjoint, these two views (of multiple priors and a super prior) are equivalent. In the applications below we utilize non-intersecting priors. 4 Bounds for Compression Algorithms Here we propose a technique for bounding the error of “compression” algorithms based on appropriate construction of prior probabilities. Let A be a learning algorithm. Intuitively, A is a “compression scheme” if it can generate the same hypothesis using a subset of the data. More formally, a learning algorithm A (viewed as a function from samples to some hypothesis class) is a compression scheme with respect to a sample Z if there is a subsample Z , Z ⊂ Z, such that A(Z ) = A(Z). Observe that the SVM approach is a compression scheme, with Z being determined by the set of support vectors. Let A be a deterministic compression scheme and consider the full sample Xm+u . For each integer τ = 1, . . . , m, consider all subsets of Xm+u of size τ , and for each subset construct all possible dichotomies of that subset (note that we are not proposing this approach as an algorithm, but rather as a means to derive bounds; in practice one need not construct all these dichotomies). A deterministic algorithm A uniquely determines at most one hypothesis h ∈ H for each dichotomy.3 For each τ , let the set of hypotheses generated by this procedure be denoted by Hτ . For the rest of this discussion we assume the worst case where |Hτ | = m+u (i.e. if Hτ does not contains one hypothesis for each dichotomy τ the bounds improve). The prior pτ is then defined to be a uniform distribution over Hτ . In this way we have m priors, p1 , . . . , pm which are constructed using only Xm+u (and are independent of Sm ). Any hypothesis selected by the learning algorithm A based on the labeled sample Sm and on the test set Xu belongs to ∪m Hτ . The motivation for this τ =1 construction is as follows. Each τ can be viewed as our “guess” for the maximal number of compression points that will be utilized by a resulting classifier. For each such τ the prior pτ is constructed over all possible classifiers that use τ compression points. By systematically considering all possible dichotomies of τ points we can characterize a relatively small subset of H without observing labels of the training points. Thus, each prior pτ represents one such guess. 
Using Theorem 3.4 we are later allowed to choose in retrospect the bound corresponding to the best “guess”. The following corollary identifies an upper bound on the divergence in terms of the observed size of the compression set of the final classifier. Corollary 4.1 Let the conditions of Theorem 3.4 hold. Let A be a deterministic learning algorithm leading to a hypothesis h ∈ H based on a compression set of size s. Then with probability at least 1 − δ for all h ∈ H, (5) holds with log(1/p(h)) replaced by s log(2e(m + u)/s) and ln(m/δ) replaced by ln(m2 /δ). Proof: Recall that Hs ⊆ H is the support set of ps and that ps (h) = 1/|Hs | for all h ∈ Hs , implying that ln(1/ps (h)) = |Hs |. Using the inequality m+u ≤ (e(m + u)/s)s s we have that |Hs | = 2s m+u ≤ (2e(m + u)/s)s . Substituting this result in Theorem 3.4 s while restricting the minimum over i to be over i ≥ s, leads to the desired result. 2 The bound of Corollary 4.1 can be easily computed once the classifier is trained. If the size of the compression set happens to be small, we obtain a tight bound. SVM classification is one of the best studied compression schemes. The compression set for a sample Sm is given by the subset of support vectors. Thus the bound in Corollary 4.1 immediately applies with s being the number of observed support vectors (after training). We note that this bound is similar to a recently derived compression bound for inductive learning (Theorem 5.18 in [16]). Also, observe that the algorithm itself (inductive SVM) did not use in this case the unlabeled sample (although the bound does use this sample). Nevertheless, using exactly the same technique we obtain error bounds for the transductive SVM algorithms in [2, 3].4 3 It might be that for some dichotomies the algorithm will fail. For example, an SVM in feature space without soft margin will fail to classify non linearly-separable dichotomies of Xm+u . 4 Note however that our bounds are optimized with a “minimum number of support vectors” approach rather than “maximum margin”. 5 Bounds for Clustering Algorithms Some learning problems do not allow for high compression rates using compression schemes such as SVMs (i.e. the number of support vectors can sometimes be very large). A considerably stronger type of compression can often be achieved by clustering algorithms. While there is lack of formal links between entirely unsupervised clustering and classification, within a transduction setting we can provide a principled approach to using clustering algorithms for classification. Let A be any (deterministic) clustering algorithm which, given the full sample Xm+u , can cluster this sample into any desired number of clusters. We use A to cluster Xm+u into 2, 3 . . . , c clusters where c ≤ m. Thus, the algorithm generates a collection of partitions of Xm+u into τ = 2, 3, . . . , c clusters, where each partition is denoted by Cτ . For each value of τ , let Hτ consist of those hypotheses which assign an identical label to all points in the same cluster of partition Cτ , and define the prior pτ (h) = 1/2τ for each h ∈ Hτ and zero otherwise (note that there are 2τ possible dichotomies). The learning algorithm selects a hypothesis as follows. Upon observing the labeled sample Sm = (Xm , Ym ), for each of the clusterings C2 , . . . , Cc constructed above, it assigns a label to each cluster based on the majority vote from the labels Ym of points falling within the cluster (in case of ties, or if no points from Xm belong to the cluster, choose a label arbitrarily). 
Doing this leads to c − 1 classifiers hτ , τ = 2, . . . , c. For each hτ there is a valid error bound as given by Theorem 3.4 and all these bounds are valid simultaneously. Thus we choose the best classifier (equivalently, number of clusters) for which the best bound holds. We thus have the following corollary of Theorem 3.4 and Lemma 2.1. Corollary 5.1 Let A be any clustering algorithm and let hτ , τ = 2, . . . , c be classifications of test set Xu as determined by clustering of the full sample Xm+u (into τ clusters) and the training set Sm , as described above. Let δ ∈ (0, 1) be given. Then with probability at least 1 − δ, for all τ , (5) holds with log(1/p(h)) replaced by τ and ln(m/δ) replaced by ln(mc/δ). Error bounds obtained using Corollary 5.1 can be rather tight when the clustering algorithm is successful (i.e. when it captures the class structure in the data using a small number of clusters). Corollary 5.1 can be extended in a number of ways. One simple extension is the use of an ensemble of clustering algorithms. Specifically, we can concurrently apply k clustering algorithm (using each algorithm to cluster the data into τ = 2, . . . , c clusters). We thus obtain kc hypotheses (partitions of Xm+u ). By a simple application of the union bound we can replace ln cm by ln kcm in Corollary 5.1 and guarantee that kc bounds hold siδ δ multaneously for all kc hypotheses (with probability at least 1 − δ). We thus choose the hypothesis which minimizes the resulting bound. This extension is particularly attractive since typically without prior knowledge we do not know which clustering algorithm will be effective for the dataset at hand. 6 Concluding Remarks We presented new bounds for transductive learning algorithms. We also developed a new technique for deriving tight error bounds for compression schemes and for clustering algorithms in the transductive setting. We expect that these bounds and new techniques will be useful for deriving new error bounds for other known algorithms and for deriving new types of transductive learning algorithms. It would be interesting to see if tighter transduction bounds can be obtained by reducing the “slacks” in the inequalities we use in our analysis. Another promising direction is the construction of better (multiple) priors. For example, in our compression bound (Corollary 4.1), for each number of compression points we assigned the same prior to each possible point subset and each possible dichotomy. However, in practice a vast majority of all these subsets and dichotomies are unlikely to occur. Acknowledgments The work of R.E and R.M. was partially supported by the Technion V.P.R. fund for the promotion of sponsored research. Support from the Ollendorff center of the department of Electrical Engineering at the Technion is also acknowledged. We also thank anonymous referees for their useful comments. References [1] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982. [2] V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998. [3] T. Joachims. Transductive inference for text classification unsing support vector machines. In European Conference on Machine Learning, 1999. [4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceeding of The Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001. [5] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semisupervised learning. 
In Advances in Neural Information Processing Systems (NIPS 2001), pages 1025–1032, 2001. [6] T. Joachims. Transductive learning via spectral graph partitioning. In Proceeding of The Twentieth International Conference on Machine Learning (ICML-2003), 2003. [7] D. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999. [8] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003. [9] D. Wu, K. Bennett, N. Cristianini, and J. Shawe-Taylor. Large margin trees for induction and transduction. In International Conference on Machine Learning, 1999. [10] L. Bottou, C. Cortes, and V. Vapnik. On the effective VC dimension. Technical report, AT&T;, 1994. [11] G.R.G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M.I. Jordan. Learning the kernel matrix with semi-definite programming. Technical report, University of Berkeley, Computer Science Division, 2002. [12] A. Blum and J. Langford. Pac-mdl bounds. In COLT, pages 344–357, 2003. [13] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statis. Assoc., 58:13–30, 1963. [14] A. Dembo and O. Zeitouni. Large Deviation Techniques and Applications. Springer, New York, second edition, 1998. [15] D. McAllester. Simplified pac-bayesian margin bounds. In COLT, pages 203–215, 2003. [16] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Boston, 2002.
2 0.60105002 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
Author: Tong Zhang
Abstract: In this paper we obtain convergence bounds for the concentration of Bayesian posterior distributions (around the true distribution) using a novel method that simplifies and enhances previous results. Based on the analysis, we also introduce a generalized family of Bayesian posteriors, and show that the convergence behavior of these generalized posteriors is completely determined by the local prior structure around the true distribution. This important and surprising robustness property does not hold for the standard Bayesian posterior in that it may not concentrate when there exist “bad” prior structures even at places far away from the true distribution. 1
3 0.53630197 48 nips-2003-Convex Methods for Transduction
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
4 0.52395403 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
Author: Xuanlong Nguyen, Michael I. Jordan
Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1
5 0.42608142 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
Author: Claudio Gentile
Abstract: New feature selection algorithms for linear threshold functions are described which combine backward elimination with an adaptive regularization method. This makes them particularly suitable to the classification of microarray expression data, where the goal is to obtain accurate rules depending on few genes only. Our algorithms are fast and easy to implement, since they center on an incremental (large margin) algorithm which allows us to avoid linear, quadratic or higher-order programming methods. We report on preliminary experiments with five known DNA microarray datasets. These experiments suggest that multiplicative large margin algorithms tend to outperform additive algorithms (such as SVM) on feature selection tasks. 1
6 0.41482511 113 nips-2003-Learning with Local and Global Consistency
7 0.38993883 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
8 0.38309851 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
9 0.37643975 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
10 0.34237453 111 nips-2003-Learning the k in k-means
11 0.33883306 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
12 0.33260936 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
13 0.31562531 107 nips-2003-Learning Spectral Clustering
14 0.31384173 3 nips-2003-AUC Optimization vs. Error Rate Minimization
15 0.31275362 124 nips-2003-Max-Margin Markov Networks
16 0.30312735 146 nips-2003-Online Learning of Non-stationary Sequences
17 0.30204108 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
18 0.30188221 73 nips-2003-Feature Selection in Clustering Problems
19 0.30046886 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles
20 0.29673406 41 nips-2003-Boosting versus Covering
topicId topicWeight
[(0, 0.032), (11, 0.02), (29, 0.017), (30, 0.013), (35, 0.043), (53, 0.119), (65, 0.254), (69, 0.015), (71, 0.093), (76, 0.056), (85, 0.077), (91, 0.092), (99, 0.046)]
simIndex simValue paperId paperTitle
1 0.9024564 27 nips-2003-Analytical Solution of Spike-timing Dependent Plasticity Based on Synaptic Biophysics
Author: Bernd Porr, Ausra Saudargiene, Florentin Wörgötter
Abstract: Spike timing plasticity (STDP) is a special form of synaptic plasticity where the relative timing of post- and presynaptic activity determines the change of the synaptic weight. On the postsynaptic side, active backpropagating spikes in dendrites seem to play a crucial role in the induction of spike timing dependent plasticity. We argue that postsynaptically the temporal change of the membrane potential determines the weight change. Coming from the presynaptic side induction of STDP is closely related to the activation of NMDA channels. Therefore, we will calculate analytically the change of the synaptic weight by correlating the derivative of the membrane potential with the activity of the NMDA channel. Thus, for this calculation we utilise biophysical variables of the physiological cell. The final result shows a weight change curve which conforms with measurements from biology. The positive part of the weight change curve is determined by the NMDA activation. The negative part of the weight change curve is determined by the membrane potential change. Therefore, the weight change curve should change its shape depending on the distance from the soma of the postsynaptic cell. We find temporally asymmetric weight change close to the soma and temporally symmetric weight change in the distal dendrite. 1
same-paper 2 0.83193851 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
Author: Philip Derbeko, Ran El-Yaniv, Ron Meir
Abstract: This paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering. 1 Introduction and Related Work In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or “transduce”) the labels of the test points. The transduction setting was introduced by Vapnik [1, 2] who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of points in the test set can be done using an inductive scheme. However, as pointed out in [2], it makes little sense to solve an easier problem by ‘reducing’ it to a much more difficult one. In particular, the prior knowledge carried by the (unlabeled) test points can be incorporated into an algorithm, potentially leading to superior performance. Indeed, a number of papers have demonstrated empirically that transduction can offer substantial advantage over induction whenever the training set is small or moderate (see e.g. [3, 4, 5, 6]). However, unlike the current state of affairs in induction, the question of what are provably effective learning principles for transduction is quite far from being resolved. In this paper we provide new error bounds and a general technique for transductive learning. Our technique is based on bounds that can be viewed as an extension of McAllester’s PAC-Bayesian framework [7, 8] to transductive learning. The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). This flexibility allows for the choice of “compact priors” (with small support) and therefore, for tight bounds. Another simple observation is that the PAC-Bayesian framework can be operated with polynomially (in m, the training sample size) many different priors simultaneously. Altogether, this added flexibility, of using data-dependent multiple priors allows for easy derivation of tight error bounds for “compression schemes” such as (transductive) SVMs and for clustering algorithms. We briefly review some previous results. The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e.g. [2]), where an error bound is also proposed. However, this bound is implicit and rather unwieldy and, to the best of our knowledge, has not been applied in practical situations. A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. The bound is data-dependent depending on the number of decision nodes, the margins at each node and the sample size. However, the authors state that the transduction bound is not much tighter than the induction bound. Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error, however, the advantage is usually statistically insignificant. Refining the algorithm of [2] a transductive algorithm based on a SVMs is proposed in [3]. The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain. 
An error bound for transduction, based on the effective VC Dimension, is given in [10]. More recently Lanckriet et al. [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. Blum and Langford [12] recently also established an implicit bound for transduction, in the spirit of the results in [2]. 2 The Transduction Setup We consider the following setting proposed by Vapnik ([2] Chp. 8), which for simplicity is described in the context of binary classification (the general case will be discussed in the full paper). Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let Xm+u = {x1 , . . . , xm+u } be a set of points from X each of which is chosen i.i.d. according to some unknown distribution µ(x). We call Xm+u the full sample. Let Xm = {x1 , . . . , xm } and Ym = {y1 , . . . , ym }, where Xm is drawn uniformly from Xm+u and yi ∈ {±1}. The set Sm = {(x1 , y1 ), . . . , (xm , ym )} is referred to as a training sample. In this paper we assume that yi = φ(xi ) for some unknown function φ. The remaining subset Xu = Xm+u \ Xm is referred to as the unlabeled sample. Based on Sm and Xu our goal is to choose h ∈ H which predicts the labels of points in Xu as accurately as possible. For each h ∈ H and a set Z = x1 , . . . , x|Z| of samples define 1 Rh (Z) = |Z| |Z| (h(xi ), yi ), (1) i=1 where in our case (·, ·) is the zero-one loss function. Our goal in transduction is to learn an h such that Rh (Xu ) is as small as possible. This problem setup is summarized by the following transduction “protocol” introduced in [2] and referred to as Setting 1: (i) A full sample Xm+u = {x1 , . . . , xm+u } consisting of arbitrary m + u points is given.1 (ii) We then choose uniformly at random the training sample Xm ⊆ Xm+u and receive its labeling Ym ; the resulting training set is Sm = (Xm , Ym ) and the remaining set Xu is the unlabeled sample, Xu = Xm+u \ Xm ; (iii) Using both Sm and Xu we select a classifier h ∈ H whose quality is measured by Rh (Xu ). Vapnik [2] also considers another formulation of transduction, referred to as Setting 2: (i) We are given a training set Sm = (Xm , Ym ) selected i.i.d according to µ(x, y). (ii) An independent test set Su = (Xu , Yu ) of u samples is then selected in the same manner. 1 The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently at random according to some source distribution µ(x). (iii) We are required to choose our best h ∈ H based on Sm and Xu so as to minimize m+u Rm,u (h) = 1 (h(xi ), yi ) dµ(x1 , y1 ) · · · dµ(xm+u , ym+u ). u i=m+1 (2) Even though Setting 2 may appear more applicable in practical situations than Setting 1, the derivation of theoretical results can be easier within Setting 1. Nevertheless, as far as the expected losses are concerned, Vapnik [2] shows that an error bound in Setting 1 implies an equivalent bound in Setting 2. In view of this result we restrict ourselves in the sequel to Setting 1. We make use of the following quantities, which are all instances of (1). The quantity Rh (Xm+u ) is called the full sample risk of the hypothesis h, Rh (Xu ) is referred to as the transduction risk (of h), and Rh (Xm ) is the training error (of h). Thus, Rh (Xm ) is ˆ the standard training error denoted by Rh (Sm ). While our objective in transduction is to achieve small error over the unlabeled set (i.e. to minimize Rh (Xu )), it turns out that it is much easier to derive error bounds for the full sample risk. 
The following simple lemma translates an error bound on R_h(X_{m+u}), the full sample risk, to an error bound on the transduction risk R_h(X_u).

Lemma 2.1 For any h ∈ H and any C,

    R_h(X_{m+u}) \le \hat{R}_h(S_m) + C  \;\Longleftrightarrow\;  R_h(X_u) \le \hat{R}_h(S_m) + \frac{m+u}{u} \cdot C.    (3)

Proof: For any h,

    R_h(X_{m+u}) = \frac{m R_h(X_m) + u R_h(X_u)}{m+u}.    (4)

Substituting \hat{R}_h(S_m) for R_h(X_m) in (4) and then substituting the result for the left-hand side of (3) we get

    R_h(X_{m+u}) = \frac{m \hat{R}_h(S_m) + u R_h(X_u)}{m+u} \le \hat{R}_h(S_m) + C.

The equivalence (3) is now obtained by isolating R_h(X_u) on the left-hand side. □

3 General Error Bounds for Transduction

Consider a hypothesis class H and assume for simplicity that H is countable; in fact, in the case of transduction it suffices to consider a finite hypothesis class. To see this note that all m + u points are known in advance. Thus, in the case of binary classification (for example) it suffices to consider at most 2^{m+u} possible dichotomies. Recall that in the setting considered we select a sub-sample of m points from the set X_{m+u} of cardinality m + u. This corresponds to a selection of m points without replacement from a set of m + u points, leading to the m points being dependent. A naive utilization of large deviation bounds would therefore not be directly applicable in this setting. However, Hoeffding (see Theorem 4 in [13]) pointed out a simple procedure to transform the problem into one involving independent data. While this procedure leads to non-trivial bounds, it does not fully take advantage of the transductive setting and will not be used here. Consider for simplicity the case of binary classification. In this case we make use of the following concentration inequality, based on [14].

Theorem 3.1 Let C = {c_1, ..., c_N}, c_i ∈ {0, 1}, be a finite set of binary numbers, and set \bar{c} = (1/N) \sum_{i=1}^{N} c_i. Let Z_1, ..., Z_m be random variables obtaining their values by sampling C uniformly at random without replacement. Set \bar{Z} = (1/m) \sum_{i=1}^{m} Z_i and \beta = m/N. Then, if \varepsilon \le \min\{1 - \bar{c},\; \bar{c}(1-\beta)/\beta\} (the second condition simply guarantees that the number of 'ones' in the sub-sample does not exceed their number in the original sample),

    \Pr\{\bar{Z} - E\bar{Z} > \varepsilon\} \le \exp\left( -m\, D(\bar{c} + \varepsilon \,\|\, \bar{c}) - (N - m)\, D\!\left(\bar{c} - \tfrac{\beta\varepsilon}{1-\beta} \,\Big\|\, \bar{c}\right) + 7 \log(N + 1) \right),

where D(p\|q) = p \log(p/q) + (1 - p) \log\frac{1-p}{1-q}, p, q ∈ [0, 1], is the binary Kullback-Leibler divergence.

Using this result we obtain the following error bound for transductive classification.

Theorem 3.2 Let X_{m+u} = X_m ∪ X_u be the full sample and let p = p(X_{m+u}) be a (prior) distribution over the class of binary hypotheses H that may depend on the full sample. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over choices of S_m (from the full sample), the following bound holds for any h ∈ H,

    R_h(X_u) \le \hat{R}_h(S_m) + \sqrt{\frac{2\hat{R}_h(S_m)(m+u)}{u} \cdot \frac{\log\frac{1}{p(h)} + \ln\frac{m}{\delta} + 7\log(m+u+1)}{m-1}} + \frac{2\left(\log\frac{1}{p(h)} + \ln\frac{m}{\delta} + 7\log(m+u+1)\right)}{m-1}.    (5)

Proof: (sketch) In our transduction setting the set X_m (and therefore S_m) is obtained by sampling the full sample X_{m+u} uniformly at random without replacement. We first claim that

    E_{\Sigma_m} \hat{R}_h(S_m) = R_h(X_{m+u}),    (6)

where E_{\Sigma_m}(\cdot) is the expectation with respect to a random choice of S_m from X_{m+u} without replacement. This is shown as follows.

    E_{\Sigma_m} \hat{R}_h(S_m) = \frac{1}{\binom{m+u}{m}} \sum_{S_m} \hat{R}_h(S_m) = \frac{1}{\binom{m+u}{m}} \sum_{X_m \subseteq X_{m+u}} \frac{1}{m} \sum_{x \in S_m} \ell(h(x), \varphi(x)).

By symmetry, all points x ∈ X_{m+u} are counted on the right-hand side an equal number of times; this number is precisely \binom{m+u}{m} - \binom{m+u-1}{m} = \binom{m+u-1}{m-1}. The equality (6) is obtained by considering the definition of R_h(X_{m+u}) and noting that \binom{m+u-1}{m-1} / \binom{m+u}{m} = \frac{m}{m+u}. The remainder of the proof combines Theorem 3.1 and the techniques presented in [15]; the details will be provided in the full paper. □
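The right-hand side of (5) is easy to evaluate numerically once \hat{R}_h(S_m), p(h), m, u, and δ are given. The following is a minimal Python sketch of such an evaluation; it is our own illustration (the function name and the example numbers are invented), and the constants follow the reconstruction of (5) above, so they should be checked against the published version before being relied upon.

```python
import math

def transduction_bound(train_err, log_inv_prior, m, u, delta):
    """Evaluate the right-hand side of bound (5) as reconstructed above.

    train_err     -- empirical training error hat R_h(S_m)
    log_inv_prior -- log(1 / p(h)) for the (full-sample-dependent) prior
    m, u          -- sizes of the labeled and unlabeled samples
    delta         -- confidence parameter
    """
    complexity = log_inv_prior + math.log(m / delta) + 7 * math.log(m + u + 1)
    sqrt_term = math.sqrt(2 * train_err * (m + u) / u * complexity / (m - 1))
    linear_term = 2 * complexity / (m - 1)
    return train_err + sqrt_term + linear_term

# Example: m = 200 labeled, u = 800 unlabeled points, 5% training error,
# a prior giving the chosen hypothesis probability 2**-20, delta = 0.05.
print(transduction_bound(0.05, 20 * math.log(2), m=200, u=800, delta=0.05))
```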
Notice that when \hat{R}_h(S_m) → 0 the square root in (5) vanishes and faster rates are obtained. An important feature of Theorem 3.2 is that it allows one to use the sample X_{m+u} in order to choose the prior distribution p(h). This advantage has already been alluded to in [2], but does not seem to have been widely used in practice. Additionally, observe that (5) holds with probability at least 1 − δ with respect to the random selection of sub-samples of size m from the fixed set X_{m+u}. This should be contrasted with the standard inductive setting results, where the probabilities are with respect to a random choice of m training points chosen i.i.d. from \mu(x, y).

The next bound we present is analogous to McAllester's Theorem 1 in [8]. This theorem concerns Gibbs composite classifiers, which are distributions over the base classifiers in H. For any distribution q over H denote by G_q the Gibbs classifier, which classifies an instance (in X_u) by randomly choosing, according to q, one hypothesis h ∈ H. For Gibbs classifiers we now extend definition (1) as follows. Let Z = {x_1, ..., x_{|Z|}} be any set of samples and let G_q be a Gibbs classifier over H. The risk of G_q over Z is

    R_{G_q}(Z) = E_{h \sim q}\left[ \frac{1}{|Z|} \sum_{i=1}^{|Z|} \ell(h(x_i), \varphi(x_i)) \right].

As before, when Z = X_m (the training set) we use the standard notation \hat{R}_{G_q}(S_m) = R_{G_q}(X_m). Due to space limitations, the proof of the following theorem will appear in the full paper.

Theorem 3.3 Let X_{m+u} be the full sample. Let p be a distribution over H that may depend on X_{m+u} and let q be a (posterior) distribution over H that may depend on both S_m and X_u. Let δ ∈ (0, 1) be given. With probability at least 1 − δ over the choices of S_m, for any distribution q,

    R_{G_q}(X_u) \le \hat{R}_{G_q}(S_m) + \sqrt{\frac{2\hat{R}_{G_q}(S_m)(m+u)}{u} \cdot \frac{D(q\|p) + \ln\frac{m}{\delta} + 7\log(m+u+1)}{m-1}} + \frac{2\left(D(q\|p) + \ln\frac{m}{\delta} + 7\log(m+u+1)\right)}{m-1}.

In the context of inductive learning, a major obstacle in generating meaningful and effective bounds using the PAC-Bayesian framework [8] is the construction of "compact priors". Here we discuss two extensions to the PAC-Bayesian scheme which together allow for easy choices of compact priors that can yield tight error bounds. The first extension we offer is the use of multiple priors. Instead of a single prior p in the original PAC-Bayesian framework, we observe that one can use all PAC-Bayesian bounds with a number of priors p_1, ..., p_k and then replace the complexity term ln(1/p(h)) (in Theorem 3.2) by min_i ln(1/p_i(h)), at a cost of an additional ln k term (see below). Similarly, in Theorem 3.3 we can replace the KL-divergence term in the bound with min_i D(q\|p_i). The penalty for using k priors is logarithmic in k (specifically, the ln(1/δ) term in the original bound becomes ln(k/δ)). As long as k is sub-exponential in m we still obtain effective generalization bounds. The second "extension" is simply the feature of our transduction bounds (Theorems 3.2 and 3.3) which allows the priors to depend on the full sample X_{m+u}. The combination of these two simple ideas yields a powerful technique for deriving error bounds in realistic transductive settings. After stating the extended result we later use it for deriving tight bounds for known learning algorithms and for deriving new algorithms.
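A minimal sketch of the multiple-prior device just described (our own illustration; it reuses the transduction_bound helper sketched after Theorem 3.2 and assumes the k priors have disjoint supports): the bound is evaluated once per prior with δ replaced by δ/k, and the smallest value is kept.

```python
import math  # transduction_bound is the helper sketched after Theorem 3.2

def multi_prior_bound(train_err, log_inv_priors, m, u, delta):
    """Best bound over k full-sample-dependent priors p_1, ..., p_k.

    log_inv_priors -- list with one entry log(1 / p_i(h)) per prior.
    The union-bound penalty turns ln(1/delta) into ln(k/delta), which is
    implemented here by passing delta / k to each per-prior bound.
    """
    k = len(log_inv_priors)
    return min(
        transduction_bound(train_err, cost, m, u, delta / k)
        for cost in log_inv_priors
    )

# Three hypothetical priors assign the selected h probabilities 2^-30, 2^-12, 2^-18.
costs = [30 * math.log(2), 12 * math.log(2), 18 * math.log(2)]
print(multi_prior_bound(0.05, costs, m=200, u=800, delta=0.05))
```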
Suppose that instead of a single prior p over H we want to utilize k priors, p_1, ..., p_k, and in retrospect choose the best among the k corresponding PAC-Bayesian bounds. The following theorem shows that one can use polynomially many priors with a minor penalty. The proof, which is omitted due to space limitations, utilizes the union bound in a straightforward manner.

Theorem 3.4 Let the conditions of Theorem 3.2 hold, except that we now have k prior distributions p_1, ..., p_k defined over H, each of which may depend on X_{m+u}. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over random choices of sub-samples of size m from the full sample, for all h ∈ H, (5) holds with p(h) replaced by min_{1 \le i \le k} p_i(h) and log(1/δ) replaced by log(k/δ).

Remark: A similar result holds for the Gibbs algorithm of Theorem 3.3. Also, as noted by one of the reviewers, when the supports of the k priors intersect (i.e. there is at least one pair of priors p_i and p_j with overlapping support), then one can do better by utilizing the "super prior" p = \frac{1}{k}\sum_i p_i within the original Theorem 3.2. However, note that when the supports are disjoint, these two views (of multiple priors and a super prior) are equivalent. In the applications below we utilize non-intersecting priors.

4 Bounds for Compression Algorithms

Here we propose a technique for bounding the error of "compression" algorithms based on an appropriate construction of prior probabilities. Let A be a learning algorithm. Intuitively, A is a "compression scheme" if it can generate the same hypothesis using a subset of the data. More formally, a learning algorithm A (viewed as a function from samples to some hypothesis class) is a compression scheme with respect to a sample Z if there is a subsample Z', Z' ⊂ Z, such that A(Z') = A(Z). Observe that the SVM approach is a compression scheme, with Z' being determined by the set of support vectors.

Let A be a deterministic compression scheme and consider the full sample X_{m+u}. For each integer τ = 1, ..., m, consider all subsets of X_{m+u} of size τ, and for each subset construct all possible dichotomies of that subset (note that we are not proposing this approach as an algorithm, but rather as a means to derive bounds; in practice one need not construct all these dichotomies). A deterministic algorithm A uniquely determines at most one hypothesis h ∈ H for each dichotomy.³ For each τ, let the set of hypotheses generated by this procedure be denoted by H_τ. For the rest of this discussion we assume the worst case, where |H_τ| = 2^τ\binom{m+u}{τ} (i.e. if H_τ does not contain one hypothesis for each dichotomy the bounds improve). The prior p_τ is then defined to be a uniform distribution over H_τ. In this way we have m priors, p_1, ..., p_m, which are constructed using only X_{m+u} (and are independent of S_m). Any hypothesis selected by the learning algorithm A based on the labeled sample S_m and on the test set X_u belongs to \cup_{τ=1}^{m} H_τ. The motivation for this construction is as follows. Each τ can be viewed as our "guess" for the maximal number of compression points that will be utilized by a resulting classifier. For each such τ the prior p_τ is constructed over all possible classifiers that use τ compression points. By systematically considering all possible dichotomies of τ points we can characterize a relatively small subset of H without observing labels of the training points. Thus, each prior p_τ represents one such guess.
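To see what each "guess" τ costs in the bound, the complexity term of the uniform prior p_τ is log(1/p_τ(h)) = log(2^τ · C(m+u, τ)) under the worst-case assumption above. Below is a small sketch (ours; the sample sizes are arbitrary) that compares this exact cost with the closed-form upper bound τ log(2e(m+u)/τ) obtained from C(n, τ) ≤ (en/τ)^τ.

```python
import math

def compression_prior_cost(tau, m, u):
    """log(1 / p_tau(h)) for the uniform prior over H_tau, assuming the
    worst case |H_tau| = 2^tau * C(m + u, tau)."""
    return tau * math.log(2) + math.log(math.comb(m + u, tau))

m, u = 200, 800
for tau in (5, 20, 80):
    exact = compression_prior_cost(tau, m, u)
    upper = tau * math.log(2 * math.e * (m + u) / tau)  # tau * log(2e(m+u)/tau)
    print(tau, round(exact, 1), round(upper, 1))
```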
Using Theorem 3.4 we are later allowed to choose in retrospect the bound corresponding to the best "guess". The following corollary identifies an upper bound on the divergence in terms of the observed size of the compression set of the final classifier.

Corollary 4.1 Let the conditions of Theorem 3.4 hold. Let A be a deterministic learning algorithm leading to a hypothesis h ∈ H based on a compression set of size s. Then with probability at least 1 − δ, for all h ∈ H, (5) holds with log(1/p(h)) replaced by s log(2e(m + u)/s) and ln(m/δ) replaced by ln(m²/δ).

Proof: Recall that H_s ⊆ H is the support set of p_s and that p_s(h) = 1/|H_s| for all h ∈ H_s, implying that ln(1/p_s(h)) = ln|H_s|. Using the inequality \binom{m+u}{s} \le (e(m+u)/s)^s we have that |H_s| = 2^s\binom{m+u}{s} \le (2e(m+u)/s)^s. Substituting this result in Theorem 3.4, while restricting the minimum over i to be over i ≥ s, leads to the desired result. □

The bound of Corollary 4.1 can be easily computed once the classifier is trained. If the size of the compression set happens to be small, we obtain a tight bound. SVM classification is one of the best studied compression schemes. The compression set for a sample S_m is given by the subset of support vectors. Thus the bound in Corollary 4.1 immediately applies with s being the number of observed support vectors (after training). We note that this bound is similar to a recently derived compression bound for inductive learning (Theorem 5.18 in [16]). Also, observe that the algorithm itself (inductive SVM) did not use in this case the unlabeled sample (although the bound does use this sample). Nevertheless, using exactly the same technique we obtain error bounds for the transductive SVM algorithms in [2, 3].⁴

³ It might be that for some dichotomies the algorithm will fail. For example, an SVM in feature space without a soft margin will fail to classify non-linearly-separable dichotomies of X_{m+u}.
⁴ Note, however, that our bounds are optimized with a "minimum number of support vectors" approach rather than "maximum margin".

5 Bounds for Clustering Algorithms

Some learning problems do not allow for high compression rates using compression schemes such as SVMs (i.e. the number of support vectors can sometimes be very large). A considerably stronger type of compression can often be achieved by clustering algorithms. While there is a lack of formal links between entirely unsupervised clustering and classification, within a transduction setting we can provide a principled approach to using clustering algorithms for classification.

Let A be any (deterministic) clustering algorithm which, given the full sample X_{m+u}, can cluster this sample into any desired number of clusters. We use A to cluster X_{m+u} into 2, 3, ..., c clusters, where c ≤ m. Thus, the algorithm generates a collection of partitions of X_{m+u} into τ = 2, 3, ..., c clusters, where each partition is denoted by C_τ. For each value of τ, let H_τ consist of those hypotheses which assign an identical label to all points in the same cluster of partition C_τ, and define the prior p_τ(h) = 1/2^τ for each h ∈ H_τ and zero otherwise (note that there are 2^τ possible dichotomies). The learning algorithm selects a hypothesis as follows. Upon observing the labeled sample S_m = (X_m, Y_m), for each of the clusterings C_2, ..., C_c constructed above, it assigns a label to each cluster based on the majority vote from the labels Y_m of points falling within the cluster (in case of ties, or if no points from X_m belong to the cluster, a label is chosen arbitrarily); a short sketch of this cluster-and-vote step is given below.
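The following Python sketch illustrates the cluster-and-vote step (our own illustration, not code from the paper: k-means with a fixed seed stands in for the generic deterministic clustering algorithm A, and the toy data and all names are ours).

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_vote_classifiers(X_full, train_idx, y_train, c, seed=0):
    """For each tau = 2, ..., c, cluster the full sample into tau clusters and
    label each cluster by majority vote over the training points it contains.
    Ties and empty clusters get an arbitrary default label (+1).
    Returns a dict mapping tau to predicted labels for all m + u points."""
    predictions = {}
    for tau in range(2, c + 1):
        assign = KMeans(n_clusters=tau, n_init=10, random_state=seed).fit_predict(X_full)
        cluster_labels = np.ones(tau)            # arbitrary default label
        train_assign = assign[train_idx]
        for k in range(tau):
            votes = y_train[train_assign == k]
            if votes.size > 0 and votes.sum() < 0:
                cluster_labels[k] = -1           # majority of training labels is -1
        predictions[tau] = cluster_labels[assign]
    return predictions

# Toy usage: two Gaussian blobs with labels in {-1, +1}, 20 labeled points.
rng = np.random.default_rng(1)
X_full = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y_full = np.array([-1] * 50 + [1] * 50)
train_idx = rng.choice(100, size=20, replace=False)
preds = cluster_vote_classifiers(X_full, train_idx, y_full[train_idx], c=5)
```

Each resulting label vector, restricted to X_u, is one of the classifiers h_τ discussed next.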
Doing this leads to c − 1 classifiers h_τ, τ = 2, ..., c. For each h_τ there is a valid error bound as given by Theorem 3.4, and all these bounds are valid simultaneously. Thus we choose the best classifier (equivalently, number of clusters) for which the best bound holds. We thus have the following corollary of Theorem 3.4 and Lemma 2.1.

Corollary 5.1 Let A be any clustering algorithm and let h_τ, τ = 2, ..., c be classifications of the test set X_u as determined by clustering of the full sample X_{m+u} (into τ clusters) and the training set S_m, as described above. Let δ ∈ (0, 1) be given. Then with probability at least 1 − δ, for all τ, (5) holds with log(1/p(h)) replaced by τ and ln(m/δ) replaced by ln(mc/δ).

Error bounds obtained using Corollary 5.1 can be rather tight when the clustering algorithm is successful (i.e. when it captures the class structure in the data using a small number of clusters). Corollary 5.1 can be extended in a number of ways. One simple extension is the use of an ensemble of clustering algorithms. Specifically, we can concurrently apply k clustering algorithms (using each algorithm to cluster the data into τ = 2, ..., c clusters). We thus obtain kc hypotheses (partitions of X_{m+u}). By a simple application of the union bound we can replace ln(cm/δ) by ln(kcm/δ) in Corollary 5.1 and guarantee that the kc bounds hold simultaneously for all kc hypotheses (with probability at least 1 − δ). We thus choose the hypothesis which minimizes the resulting bound. This extension is particularly attractive since typically, without prior knowledge, we do not know which clustering algorithm will be effective for the dataset at hand.

6 Concluding Remarks

We presented new bounds for transductive learning algorithms. We also developed a new technique for deriving tight error bounds for compression schemes and for clustering algorithms in the transductive setting. We expect that these bounds and new techniques will be useful for deriving new error bounds for other known algorithms and for deriving new types of transductive learning algorithms. It would be interesting to see if tighter transduction bounds can be obtained by reducing the "slacks" in the inequalities we use in our analysis. Another promising direction is the construction of better (multiple) priors. For example, in our compression bound (Corollary 4.1), for each number of compression points we assigned the same prior to each possible point subset and each possible dichotomy. However, in practice a vast majority of all these subsets and dichotomies are unlikely to occur.

Acknowledgments The work of R.E. and R.M. was partially supported by the Technion V.P.R. fund for the promotion of sponsored research. Support from the Ollendorff Center of the Department of Electrical Engineering at the Technion is also acknowledged. We also thank the anonymous referees for their useful comments.

References
[1] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982.
[2] V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998.
[3] T. Joachims. Transductive inference for text classification using support vector machines. In European Conference on Machine Learning, 1999.
[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001.
[5] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning.
In Advances in Neural Information Processing Systems (NIPS 2001), pages 1025–1032, 2001.
[6] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.
[7] D. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
[8] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
[9] D. Wu, K. Bennett, N. Cristianini, and J. Shawe-Taylor. Large margin trees for induction and transduction. In International Conference on Machine Learning, 1999.
[10] L. Bottou, C. Cortes, and V. Vapnik. On the effective VC dimension. Technical report, AT&T, 1994.
[11] G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Technical report, University of California, Berkeley, Computer Science Division, 2002.
[12] A. Blum and J. Langford. PAC-MDL bounds. In COLT, pages 344–357, 2003.
[13] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
[14] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer, New York, second edition, 1998.
[15] D. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, pages 203–215, 2003.
[16] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Boston, 2002.
3 0.62414187 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
Author: Tong Zhang
Abstract: In this paper we obtain convergence bounds for the concentration of Bayesian posterior distributions (around the true distribution) using a novel method that simplifies and enhances previous results. Based on the analysis, we also introduce a generalized family of Bayesian posteriors, and show that the convergence behavior of these generalized posteriors is completely determined by the local prior structure around the true distribution. This important and surprising robustness property does not hold for the standard Bayesian posterior in that it may not concentrate when there exist “bad” prior structures even at places far away from the true distribution.
4 0.62275958 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P(x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations.
5 0.61947185 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
Author: Noam Shental, Aharon Bar-hillel, Tomer Hertz, Daphna Weinshall
Abstract: Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.
6 0.6176787 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
7 0.61604035 30 nips-2003-Approximability of Probability Distributions
8 0.61578017 189 nips-2003-Tree-structured Approximations by Expectation Propagation
9 0.61422276 107 nips-2003-Learning Spectral Clustering
10 0.61383843 78 nips-2003-Gaussian Processes in Reinforcement Learning
11 0.61190611 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
12 0.61168075 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
13 0.61066931 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
14 0.61016238 143 nips-2003-On the Dynamics of Boosting
15 0.6089499 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
16 0.60836315 113 nips-2003-Learning with Local and Global Consistency
17 0.60788542 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
18 0.60706997 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
19 0.60627133 112 nips-2003-Learning to Find Pre-Images
20 0.60599846 48 nips-2003-Convex Methods for Transduction