nips nips2003 nips2003-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
Reference: text
sentIndex sentText sentNum sentScore
1 net Abstract The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. [sent-11, score-0.474]
2 In this form, the problem has exponential computational complexity in the size of the working set. [sent-12, score-0.221]
3 So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. [sent-13, score-0.072]
4 In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. [sent-14, score-0.287]
5 To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. [sent-16, score-0.118]
6 1 Introduction The general transduction task is the following: given a training set of labelled data, and a working set of unlabelled data (also called transduction samples), estimate the value of a classification function at the given points in the working set. [sent-18, score-0.963]
7 Given this general task, much research has been concentrated on a specific approach to transduction (first proposed by Vapnik [1]), based on the use of Support Vector Machines (SVM’s). [sent-21, score-0.247]
8 In this case, the algorithm is aimed at finding a separating hyperplane for the training set that is also maximally distant from the (unlabelled) working set. [sent-22, score-0.332]
9 This hyperplane is used to predict the labels for the working set points. [sent-23, score-0.251]
10 A recent development of convex optimization theory is Semi Definite Programming (SDP), a branch of that field aimed at optimizing over the cone of semi positive definite (SPD) matrices. [sent-26, score-0.206]
11 One of its main attractions is that it has proven successful in constructing tight convex relaxations of hard combinatorial optimization problems [7]. [sent-27, score-0.185]
12 In this paper we show how to relax the problem of transduction into an SDP problem, that can then be solved by (polynomial time) convex optimization methods. [sent-29, score-0.421]
13 Empirical results on mid-sized data sets are very promising, however, due to the dimensionality of the feasible region of the relaxed parameters, still the algorithm complexity appears too large to tackle large scale problems. [sent-30, score-0.216]
14 Therefore, we subsequently shrink the feasible region by making an approximation that is based on a spectral clustering method. [sent-31, score-0.277]
15 Based on the dual of the 1-norm soft margin SVM with zero bias1 , the dual formulation of the transductive SVM optimization problem can be written as a minimization of the dual SVM cost function (which is the inverse margin plus training errors) over label matrix Γ ([1], p. [sent-34, score-0.743]
16 C ≥ αi ≥ 0 (2) Γ= yt yw · yt yw ′ (3) w yi ∈ {1, −1} (4) The (symmetric) matrix Γ is thus parameterized by the unknown working set label vector yw ∈ {−1, 1}nw (with nw the size of the working set). [sent-37, score-1.454]
17 The vector yt ∈ {−1, 1}nt (with nt the number of training points) is the given fixed vector containing the known labels for the training points. [sent-38, score-0.946]
18 The (symmetric) matrix K ∈ ℜ(nw +nt )×(nw +nt ) is the entire kernel matrix on the training set together with the working set. [sent-39, score-0.368]
19 The computational complexity scales exponentially in the size of the working set. [sent-43, score-0.186]
20 Scalars are lower case; vectors boldface lower case; matrices boldface upper case. [sent-45, score-0.072]
21 For ease of notation, the training part of the label matrix (and thus also of the kernel matrix) is always assumed to be its upper nt × nt block (as is assumed already in (3)). [sent-48, score-1.267]
22 Furthermore, the nt+ positive training samples are assumed to correspond to the first entries in yt , the nt− negative samples being at the end of this vector. [sent-49, score-0.576]
23 2 Relaxation to an SDP problem In this section, we will gradually derive a relaxed version of the transductive SVM formulation. [sent-50, score-0.209]
24 To start with, we replace some of the constraints by an equivalent set: Proposition 2. [sent-51, score-0.085]
25 1 (3) and (4) are equivalent with the following set of constraints: t t (5) [Γ]i,j∈{1:nt ,1:nt } = yi yj 1 We do not include a bias term since this would make the problem too non-convex. [sent-52, score-0.06]
26 It is basically the rank constraint that makes the resulting constrained optimization problem combinatorial. [sent-55, score-0.322]
27 Note that these constraints imply that Γ is semi positive definite (SPD): Γ 0 (this follows trivially from (3), or from (6) together with (7)). [sent-56, score-0.161]
28 Now, in literature (see eg [7]) it is observed that such an SPD rank one constraint can often be relaxed to only the SPD constraint without sacrificing too much of the performance. [sent-57, score-0.398]
29 2 If we relax the constraints by replacing (7) with Γ 0, (8) the optimization problem becomes convex. [sent-59, score-0.189]
30 While this relaxation of the rank constraint makes the optimization problem convex, the result will not be a rank one matrix anymore; it will only provide an approximation for the optimal rank one matrix. [sent-62, score-0.653]
31 1 A principal submatrix of an SPD matrix is also SPD [10]. [sent-65, score-0.123]
32 By applying this lemma on all 2 × 2 principal submatrices of Γ, it is shown that Corollary 2. [sent-66, score-0.074]
33 This is the problem will solve here: optimize (1) subject to (2), (5), (6) and (8). [sent-68, score-0.061]
34 In the remainder of this section we will reformulate the optimization problem into a standard form of SDP, make further simplifications based on the problem structure, and show how to extract an approximation for the labels from the result. [sent-69, score-0.235]
35 1 Formulation as a standard SDP problem In the derivations in this subsection the equality constraints (5) and (6) will not be stated for brevity. [sent-71, score-0.095]
36 Furthermore, in the implementation, they will be enforced explicitly by the parameterization, thus they will not appear as constraints in the optimization problem. [sent-73, score-0.113]
37 Also the SPD constraint (8) is not written every time, it should be understood. [sent-74, score-0.111]
38 Let 2ν ≥ 0 be the Lagrange dual variables corresponding to constraint αi ≥ 0 and 2µ ≥ 0 corresponding to constraint αi ≤ C. [sent-75, score-0.267]
39 with as additional constraint that (e + ν − µ) is orthogonal to the null space of K ⊙ Γ. [sent-81, score-0.148]
40 This latter constraint and the quadratic constraint can be reformulated as one SPD constraint thanks to the following extension of the Schur complement lemma [10] (the proof is omitted due to space restrictions): Lemma 2. [sent-82, score-0.45]
41 2 (Extended Schur complement lemma) For symmetric A and C ≻ 0: The column space of B ⊥ the null space of A A B ⇔ 0. [sent-83, score-0.17]
42 B′ C C B ′ A† B 0 Indeed, applying this lemma to our problem with A = K ⊙ Γ, B = e + ν − µ and C = t − 2Cµ′ e, leads to the problem formulation in the standard SDP form: min t (9) Γ,ν ≥0,µ≥0,t K⊙Γ (e + ν − µ)′ s. [sent-84, score-0.2]
43 (e + ν − µ) t − 2Cµ′ e 0 (10) together with the constraints (5), (6) and (8). [sent-86, score-0.087]
44 The relaxation for the hard margin SVM is found by following a very similar derivation, or by just equating µ to 0. [sent-87, score-0.165]
45 The number of variables specifying Γ, and the size of constraint (8) can be greatly reduced due to structure in the problem. [sent-88, score-0.137]
46 2 Simplifications due to the problem structure ′ Γc where we have a Γw ′ training block yt yt ∈ ℜnt ×nt , cross blocks Γc ∈ ℜnt ×nw and Γc ′ , and a transduction block Γw ∈ ℜnw ×nw , which is a symmetric matrix with diagonal entries equal to 1. [sent-91, score-1.189]
47 2, it follows that γ c is proportional to yt (denoted by γ c = gi yt ), and i i 1 γ c ′ yt yt i ′ † t t′ γ c = γ c ′ yyy 4 γ c . [sent-95, score-1.1]
48 This implies that 1 ≥ gi yt t i i i ′ yt yt ′ t yt 4 y gi 2 = gi such that −1 ≤ gi ≤ 1. [sent-96, score-1.316]
49 (Note that this is a corollary of the SPD constraint and does not need to be imposed explicitly. [sent-97, score-0.136]
50 ) Thus, the parameterization of Γ can be reduced to: ′ yt yt yt g′ with Γw = 1 ′ ii gyt Γw where g is the vector with gi as ith entry. [sent-98, score-0.926]
51 3 The constraint Γ 0 is equivalent to (and can thus be replaced by) the following SPD constraint on a smaller matrix Γ: Γ= 1 g g′ Γw 0. [sent-100, score-0.316]
52 Since Γ is a principal submatrix of Γ (assuming at least one training label is equal to 1), lemma 2. [sent-101, score-0.358]
53 On the other hand, note that by adding a column and corresponding row to Γ, the rank is not increased. [sent-103, score-0.144]
54 Due to the interlacing property for bordered matrices [10] and the fact that Γ 0, we know this can only be the smallest eigenvalue of the resulting matrix. [sent-105, score-0.085]
55 For the soft margin case, the number of n2 +5n n2 +3n parameters is now 1+2nt + w 2 w . [sent-108, score-0.094]
56 3 Extraction of an estimate for the labels from Γ In general, the optimal Γ will of course not be rank one. [sent-111, score-0.174]
57 We can approximate it by a rank one matrix however, by taking g as an approximation for the labels optimizing the unrelaxed problem. [sent-112, score-0.376]
58 This is the approach we adopt: a thresholded value of the entries of g will be taken as a guess for the labels of the working set. [sent-113, score-0.34]
59 Note that the minimum of the relaxed problem is always smaller than or equal to the minimum of the unrelaxed problem. [sent-114, score-0.245]
60 Furthermore, the minimum of the unrelaxed problem is smaller than or equal to the value achieved by the thresholded relaxed labels. [sent-115, score-0.276]
61 Especially the limitation on the working set is a drawback, since the advantage of transduction becomes apparent especially for a large working set as compared to the number of training samples. [sent-120, score-0.586]
62 3 Subspace SDP formulation However, if we would know a subspace (spanned by the d columns of a matrix V ∈ ℜ(nt +nw )×d ) in which (or close to which) the label vector lies, we can restrict the feasible region for Γ, leading to a much more efficient algorithm. [sent-122, score-0.439]
63 If we know that the true label vector y lies in the column space of a matrix V, we know the true label matrix can be written in the form Γ = VMV′ , with M a symmetric matrix. [sent-125, score-0.506]
64 Furthermore, constraint (8) that Γ 0 is then equivalent to M 0, which is a cheaper constraint. [sent-127, score-0.136]
65 Note however that in practical cases, the true label vector will not lie within but only close to the subspace spanned by the columns of V. [sent-128, score-0.248]
66 Then the diagonal of the label matrix Γ can not always be made exactly equal to e as required by (6). [sent-129, score-0.232]
67 We thus relax this constraint to the requirement that the diagonal is not larger than 2 The worst case complexity for the problem at hand is O((nt +n2 )2 (nt +nw )2. [sent-130, score-0.237]
68 5 −1 −1 −1 0 1 0 20 40 60 −1 −1 0 1 Figure 1: The left picture shows 10 labelled samples represented by a ’o’ or a ’+’, depending on their class, together with 60 unlabelled samples represented by a ’·’. [sent-139, score-0.36]
69 The middle picture shows the labels for the working set as estimated using the SDP method before thresholding: all are already invisibly close to 1 or −1. [sent-140, score-0.311]
70 The right picture shows contour lines of the classification surface obtained by training an SVM using all labels as found by the SDP method. [sent-141, score-0.191]
71 The method clearly finds a visually good label assignment that takes cluster structure in the data into account. [sent-142, score-0.139]
72 Similarly, the block in the label matrix corresponding to the training samples may not contain 1’s and −1’s exactly (constraint (5)). [sent-144, score-0.368]
73 However, the better V is chosen, the better this constraint will be met. [sent-145, score-0.111]
74 Thus we optimize (9) subject to (10) together with three constraints that replace the constraints (5), (6) and (8): Γ = VMV′ diag(Γ) ≤ e M 0 Thus we can approximate the relaxed transductive SVM using this reduced parameterization for Γ. [sent-146, score-0.43]
75 The number of effective variables is now only a linear function of nw : 1 + nt + nw + d(d + 1)/2 for a hard margin and 1 + 2(nt + nw ) + d(d + 1)/2 for a soft margin SVM. [sent-147, score-1.658]
76 Furthermore, one of the SPD constraints is now a constraint on a d × d matrix instead of a potentially large (nw + 1) × (nw + 1) matrix. [sent-148, score-0.24]
77 For a constant d, the worst case complexity is thus reduced to O((nt + nw )4. [sent-149, score-0.406]
78 4 Spectral transduction to find the subspace In this section we will discuss how to find a subspace V close to which the label vector will lie. [sent-152, score-0.473]
79 Our approach is based on the spectral clustering algorithm proposed in [11]. [sent-153, score-0.159]
80 The optimization problem corresponding to this eigenvalue problem is: max v 4. [sent-156, score-0.176]
81 (11) Constrained spectral clustering We could apply this algorithm to the kernel matrix K, but we can do more since we already know some of the labels: we will constrain the estimates of the labels for the training samples that are known to be in the same class to be equal to each other. [sent-160, score-0.567]
82 This can be achieved by choosing the following parameterization for v: √ ent+ / nt+ 0 0 ht+ √ 0 ent− / nt− 0 · ht− = Lh v= hw 0 0 I where en+ and en− denote the vectors containing nt+ (the number of positive training samples) and nt− (the number of negative training samples) ones. [sent-162, score-0.217]
83 1 Optimization problem (11) is equivalent with: max h h′ L′ D−1/2 KD−1/2 Lh s. [sent-164, score-0.06]
84 h′ h = 1 which corresponds to the eigenvalue problem L′ D−1/2 KD−1/2 Lh = λh. [sent-166, score-0.088]
85 This is an extension of spectral clustering towards transduction3 . [sent-168, score-0.159]
86 2 Spectral transduction provides a good V By construction, all entries of vi corresponding to positive training samples will be √ equal to ht+ / nt+ ; entries corresponding to the negative ones will all be equal to i √ ht− / nt− . [sent-171, score-0.732]
87 Furthermore, as in spectral clustering, the other entries of vectors vi i with large eigenvalue λi will reflect the cluster structure of the entire data set, while respecting the label assignment of the training points however4 . [sent-172, score-0.495]
88 This means that such a vi will provide a good approximation for the labels. [sent-173, score-0.098]
89 More specifically, the label vector will lie close to the column space of V, having d dominant ‘centered’ vi as its columns; the larger d, the better the approximation. [sent-174, score-0.246]
90 The way we ‘center’ vi is by adding a constant so that entries for positive training samples become equal to minus those for the negative ones. [sent-175, score-0.361]
91 Since then the first nt columns of the resulting Γ = VMV′ will be equal up to a sign, we can adopt basically the same approach as in section 2. [sent-176, score-0.603]
92 3 to guess the labels: pick and threshold the first column of Γ. [sent-177, score-0.075]
93 The positive class is formed by 100 randomly chosen samples representing a number 0, and 100 representing a 1; the negative class by 100 samples representing a 2 and 100 representing a 3. [sent-179, score-0.184]
94 The training set is chosen to contain only 10 samples from each of both classes, and is randomly drawn but evenly distributed over the 4 numbers. [sent-181, score-0.146]
95 4 Note: to reduce the influence from outliers, large entries of the vi can be thresholded. [sent-185, score-0.133]
96 To illustrate the scalability of the method, and to show that a larger working set is effectively exploited, we used a similar setting (same training set size) but with 1000 samples and d = 3, giving an average ROC-score of 0. [sent-192, score-0.282]
97 6 Conclusions We developed a relaxation for the transductive SVM as first proposed by Vapnik. [sent-195, score-0.16]
98 It is shown how this combinatorial problem can be relaxed to an SDP problem. [sent-196, score-0.164]
99 Unfortunately, the number of variables in combination with the complexity of SDP is too high for it to scale to significant problem sizes. [sent-197, score-0.085]
100 Therefore we show how, based on a new spectral method, the feasible region of the variables can be shrinked, leading to an approximation for the original SDP method. [sent-198, score-0.221]
wordName wordTfidf (topN-words)
[('nt', 0.476), ('nw', 0.33), ('sdp', 0.33), ('spd', 0.294), ('yt', 0.257), ('transduction', 0.247), ('working', 0.136), ('constraint', 0.111), ('label', 0.107), ('spectral', 0.103), ('rank', 0.095), ('transductive', 0.093), ('unlabelled', 0.085), ('svm', 0.085), ('relaxed', 0.081), ('samples', 0.079), ('labels', 0.079), ('lemma', 0.074), ('unrelaxed', 0.073), ('vmv', 0.073), ('gi', 0.072), ('matrix', 0.069), ('entries', 0.068), ('relaxation', 0.067), ('training', 0.067), ('vi', 0.065), ('kd', 0.065), ('lh', 0.064), ('constraints', 0.06), ('margin', 0.059), ('parameterization', 0.057), ('clustering', 0.056), ('equal', 0.056), ('ht', 0.055), ('submatrix', 0.054), ('yw', 0.054), ('optimization', 0.053), ('feasible', 0.053), ('eigenvalue', 0.053), ('schur', 0.051), ('complexity', 0.05), ('column', 0.049), ('bie', 0.049), ('nello', 0.049), ('tijl', 0.049), ('semi', 0.049), ('combinatorial', 0.048), ('subspace', 0.047), ('cristianini', 0.046), ('block', 0.046), ('dual', 0.045), ('convex', 0.045), ('picture', 0.045), ('labelled', 0.045), ('semide', 0.045), ('proposition', 0.043), ('complement', 0.043), ('columns', 0.043), ('relax', 0.041), ('symmetric', 0.041), ('ent', 0.039), ('furthermore', 0.039), ('hard', 0.039), ('programming', 0.037), ('null', 0.037), ('boldface', 0.036), ('hyperplane', 0.036), ('problem', 0.035), ('soft', 0.035), ('diag', 0.034), ('approximation', 0.033), ('cluster', 0.032), ('distant', 0.032), ('restrictions', 0.032), ('aimed', 0.032), ('know', 0.032), ('region', 0.032), ('duality', 0.031), ('thresholded', 0.031), ('formulation', 0.031), ('maximally', 0.029), ('en', 0.029), ('basically', 0.028), ('together', 0.027), ('optimizing', 0.027), ('vapnik', 0.027), ('already', 0.026), ('optimize', 0.026), ('reduced', 0.026), ('negative', 0.026), ('spanned', 0.026), ('guess', 0.026), ('inductive', 0.026), ('min', 0.025), ('close', 0.025), ('imply', 0.025), ('induction', 0.025), ('berlin', 0.025), ('corollary', 0.025), ('equivalent', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 48 nips-2003-Convex Methods for Transduction
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
2 0.15124901 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
3 0.13055184 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
Author: Philip Derbeko, Ran El-Yaniv, Ron Meir
Abstract: This paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering. 1 Introduction and Related Work In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or “transduce”) the labels of the test points. The transduction setting was introduced by Vapnik [1, 2] who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of points in the test set can be done using an inductive scheme. However, as pointed out in [2], it makes little sense to solve an easier problem by ‘reducing’ it to a much more difficult one. In particular, the prior knowledge carried by the (unlabeled) test points can be incorporated into an algorithm, potentially leading to superior performance. Indeed, a number of papers have demonstrated empirically that transduction can offer substantial advantage over induction whenever the training set is small or moderate (see e.g. [3, 4, 5, 6]). However, unlike the current state of affairs in induction, the question of what are provably effective learning principles for transduction is quite far from being resolved. In this paper we provide new error bounds and a general technique for transductive learning. Our technique is based on bounds that can be viewed as an extension of McAllester’s PAC-Bayesian framework [7, 8] to transductive learning. The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). This flexibility allows for the choice of “compact priors” (with small support) and therefore, for tight bounds. Another simple observation is that the PAC-Bayesian framework can be operated with polynomially (in m, the training sample size) many different priors simultaneously. Altogether, this added flexibility, of using data-dependent multiple priors allows for easy derivation of tight error bounds for “compression schemes” such as (transductive) SVMs and for clustering algorithms. We briefly review some previous results. The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e.g. [2]), where an error bound is also proposed. However, this bound is implicit and rather unwieldy and, to the best of our knowledge, has not been applied in practical situations. A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. The bound is data-dependent depending on the number of decision nodes, the margins at each node and the sample size. However, the authors state that the transduction bound is not much tighter than the induction bound. Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error, however, the advantage is usually statistically insignificant. Refining the algorithm of [2] a transductive algorithm based on a SVMs is proposed in [3]. The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain. 
An error bound for transduction, based on the effective VC Dimension, is given in [10]. More recently Lanckriet et al. [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. Blum and Langford [12] recently also established an implicit bound for transduction, in the spirit of the results in [2]. 2 The Transduction Setup We consider the following setting proposed by Vapnik ([2] Chp. 8), which for simplicity is described in the context of binary classification (the general case will be discussed in the full paper). Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let Xm+u = {x1 , . . . , xm+u } be a set of points from X each of which is chosen i.i.d. according to some unknown distribution µ(x). We call Xm+u the full sample. Let Xm = {x1 , . . . , xm } and Ym = {y1 , . . . , ym }, where Xm is drawn uniformly from Xm+u and yi ∈ {±1}. The set Sm = {(x1 , y1 ), . . . , (xm , ym )} is referred to as a training sample. In this paper we assume that yi = φ(xi ) for some unknown function φ. The remaining subset Xu = Xm+u \ Xm is referred to as the unlabeled sample. Based on Sm and Xu our goal is to choose h ∈ H which predicts the labels of points in Xu as accurately as possible. For each h ∈ H and a set Z = x1 , . . . , x|Z| of samples define 1 Rh (Z) = |Z| |Z| (h(xi ), yi ), (1) i=1 where in our case (·, ·) is the zero-one loss function. Our goal in transduction is to learn an h such that Rh (Xu ) is as small as possible. This problem setup is summarized by the following transduction “protocol” introduced in [2] and referred to as Setting 1: (i) A full sample Xm+u = {x1 , . . . , xm+u } consisting of arbitrary m + u points is given.1 (ii) We then choose uniformly at random the training sample Xm ⊆ Xm+u and receive its labeling Ym ; the resulting training set is Sm = (Xm , Ym ) and the remaining set Xu is the unlabeled sample, Xu = Xm+u \ Xm ; (iii) Using both Sm and Xu we select a classifier h ∈ H whose quality is measured by Rh (Xu ). Vapnik [2] also considers another formulation of transduction, referred to as Setting 2: (i) We are given a training set Sm = (Xm , Ym ) selected i.i.d according to µ(x, y). (ii) An independent test set Su = (Xu , Yu ) of u samples is then selected in the same manner. 1 The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently at random according to some source distribution µ(x). (iii) We are required to choose our best h ∈ H based on Sm and Xu so as to minimize m+u Rm,u (h) = 1 (h(xi ), yi ) dµ(x1 , y1 ) · · · dµ(xm+u , ym+u ). u i=m+1 (2) Even though Setting 2 may appear more applicable in practical situations than Setting 1, the derivation of theoretical results can be easier within Setting 1. Nevertheless, as far as the expected losses are concerned, Vapnik [2] shows that an error bound in Setting 1 implies an equivalent bound in Setting 2. In view of this result we restrict ourselves in the sequel to Setting 1. We make use of the following quantities, which are all instances of (1). The quantity Rh (Xm+u ) is called the full sample risk of the hypothesis h, Rh (Xu ) is referred to as the transduction risk (of h), and Rh (Xm ) is the training error (of h). Thus, Rh (Xm ) is ˆ the standard training error denoted by Rh (Sm ). While our objective in transduction is to achieve small error over the unlabeled set (i.e. to minimize Rh (Xu )), it turns out that it is much easier to derive error bounds for the full sample risk. 
The following simple lemma translates an error bound on Rh (Xm+u ), the full sample risk, to an error bound on the transduction risk Rh (Xu ). Lemma 2.1 For any h ∈ H and any C ˆ Rh (Xm+u ) ≤ Rh (Sm ) + C ⇔ ˆ Rh (Xu ) ≤ Rh (Sm ) + m+u · C. u (3) Proof: For any h Rh (Xm+u ) = mRh (Xm ) + uRh (Xu ) . m+u (4) ˆ Substituting Rh (Sm ) for Rh (Xm ) in (4) and then substituting the result for the left-hand side of (3) we get Rh (Xm+u ) = ˆ mRh (Sm ) + uRh (Xu ) ˆ ≤ Rh (Sm ) + C. m+u The equivalence (3) is now obtained by isolating Rh (Xu ) on the left-hand side. 2 3 General Error Bounds for Transduction Consider a hypothesis class H and assume for simplicity that H is countable; in fact, in the case of transduction it suffices to consider a finite hypothesis class. To see this note that all m + u points are known in advance. Thus, in the case of binary classification (for example) it suffices to consider at most 2m+u possible dichotomies. Recall that in the setting considered we select a sub-sample of m points from the set Xm+u of cardinality m+u. This corresponds to a selection of m points without replacement from a set of m+u points, leading to the m points being dependent. A naive utilization of large deviation bounds would therefore not be directly applicable in this setting. However, Hoeffding (see Theorem 4 in [13]) pointed out a simple procedure to transform the problem into one involving independent data. While this procedure leads to non-trivial bounds, it does not fully take advantage of the transductive setting and will not be used here. Consider for simplicity the case of binary classification. In this case we make use of the following concentration inequality, based on [14]. Theorem 3.1 Let C = {c1 , . . . , cN }, ci ∈ {0, 1}, be a finite set of binary numbers, and N set c = (1/N ) i=1 ci . Let Z1 , . . . , Zm , be random variables obtaining their values ¯ by sampling C uniformly at random without replacement. Set Z = (1/m) β = m/N . Then, if 2 ε ≤ min{1 − c, c(1 − β)/β}, ¯¯ Pr {Z − EZ > ε} ≤ exp −mD(¯ + ε c) − (N − m) D c − c ¯ ¯ m i=1 Zi and βε c + 7 log(N + 1) ¯ 1−β where D(p q) = p log(p/q) = (1 − p) log(1 − p)/(1 − q), p, q, ∈ [0, 1] is the binary Kullback-Leibler divergence. Using this result we obtain the following error bound for transductive classification. Theorem 3.2 Let Xm+u = Xm ∪Xu be the full sample and let p = p(Xm+u ) be a (prior) distribution over the class of binary hypotheses H that may depend on the full sample. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over choices of Sm (from the full sample) the following bound holds for any h ∈ H, ˆ 2Rh (Sm )(m + u) u ˆ Rh (Xu ) ≤ Rh (Sm ) + + 2 log 1 p(h) log + ln m + 7 log(m + u + 1) δ m−1 + ln m + 7 log(m + u + 1) δ m−1 1 p(h) . (5) Proof: (sketch) In our transduction setting the set Xm (and therefore Sm ) is obtained by sampling the full sample Xm+u uniformly at random without replacement. We first claim that ˆ EΣm Rh (Sm ) = Rh (Xm+u ), (6) where EΣm (·) is the expectation with respect to a random choice of Sm from Xm+u without replacement. This is shown as follows. ˆ EΣm Rh (Sm ) = 1 m+u m ˆ Rh (Sm ) = Sm 1 m+u m Xm ⊆Xm+n 1 m (h(x), φ(x)). x∈Sm By symmetry, all points x ∈ Xm+u are counted on the right-hand side an equal number of times; this number is precisely m+u − m+u−1 = m+u−1 . The equality (6) is obtained m m m−1 m by considering the definition of Rh (Xm+u ) and noting that m+u−1 / m+u = m+u . m−1 m The remainder of the proof combines Theorem 3.1 and the techniques presented in [15]. 
The details will be provided in the full paper. 2 ˆ Notice that when Rh (Sm ) → 0 the square root in (5) vanishes and faster rates are obtained. An important feature of Theorem 3.2 is that it allows one to use the sample Xm+u in order to choose the prior distribution p(h). This advantage has already been alluded to in [2], but does not seem to have been widely used in practice. Additionally, observe that (5) holds with probability at least 1 − δ with respect to the random selection of sub-samples of size m from the fixed set Xm+u . This should be contrasted with the standard inductive setting results where the probabilities are with respect to a random choice of m training points chosen i.i.d. from µ(x, y). The next bound we present is analogous to McAllester’s Theorem 1 in [8]. This theorem concerns Gibbs composite classifiers, which are distributions over the base classifiers in H. For any distribution q over H denote by Gq the Gibbs classifier, which classifies an 2 The second condition, ε ≤ c(1 − β)/β, simply guarantees that the number of ‘ones’ in the ¯ sub-sample does not exceed their number in the original sample. , instance (in Xu ) by randomly choosing, according to q, one hypothesis h ∈ H. For Gibbs classifiers we now extend definition (1) as follows. Let Z = x1 , . . . , x|Z| be any set of samples and let Gq be a Gibbs classifier over H. The risk of Gq over Z is RGq (Z) = Eh∼q (1/|Z|) |Z| i=1 (h(xi ), φ(xi )) . As before, when Z = Xm (the training set) we ˆ use the standard notation RGq (Sm ) = RGq (Xm ). Due to space limitations, the proof of the following theorem will appear in the full paper. Theorem 3.3 Let Xm+u be the full sample. Let p be a distribution over H that may depend on Xm+u and let q be a (posterior) distribution over H that may depend on both Sm and Xu . Let δ ∈ (0, 1) be given. With probability at least 1 − δ over the choices of Sm for any distribution q ˆ RGq (Xu ) ≤ RGq (Sm ) + + ˆ 2RGq (Sm )(m + u) u D(q p) + ln m + 7 log(m + u + 1) δ m−1 7 2 D(q p) + ln m + m log(m + u + 1) δ m−1 . In the context of inductive learning, a major obstacle in generating meaningful and effective bounds using the PAC-Bayesian framework [8] is the construction of “compact priors”. Here we discuss two extensions to the PAC-Bayesian scheme, which together allow for easy choices of compact priors that can yield tight error bounds. The first extension we offer is the use of multiple priors. Instead of a single prior p in the original PACBayesian framework we observe that one can use all PAC-Bayesian bounds with a number of priors p1 , . . . , pk and then replace the complexity term ln(1/p(h)) (in Theorem 3.2) by mini ln(1/pi (h)), at a cost of an additional ln k term (see below). Similarly, in Theorem 3.3 we can replace the KL-divergence term in the bound with mini D(q||pi ). The penalty for using k priors is logarithmic in k (specifically the ln(1/δ) term in the original bound becomes ln(k/δ)). As long as k is sub-exponential in m we still obtain effective generalization bounds. The second “extension” is simply the feature of our transduction bounds (Theorems 3.2 and 3.3), which allows for the priors to be dependent on the full sample Xm+u . The combination of these two simple ideas yields a powerful technique for deriving error bounds in realistic transductive settings. After stating the extended result we later use it for deriving tight bounds for known learning algorithms and for deriving new algorithms. 
Suppose that instead of a single prior p over H we want to utilize k priors, p1 , . . . , pk and in retrospect choose the best among the k corresponding PAC-Bayesian bounds. The following theorem shows that one can use polynomially many priors with a minor penalty. The proof, which is omitted due to space limitations, utilizes the union bound in a straightforward manner. Theorem 3.4 Let the conditions of Theorem 3.2 hold, except that we now have k prior distributions p1 , . . . , pk defined over H, each of which may depend on Xm+u . Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over random choices of sub-samples of size m from the full-sample, for all h ∈ H, (5) holds with p(h) replaced by min1≤i≤k pi (h) and log 1 is replaced by log k . δ δ Remark: A similar result holds for the Gibbs algorithm of Theorem 3.3. Also, as noted by one of the reviewers, when the supports of the k priors intersect (i.e. there is at least one pair of priors pi and pj with overlapping support), then one can do better by utilizing the 1 “super prior” p = k i pi within the original Theorem 3.2. However, note that when the supports are disjoint, these two views (of multiple priors and a super prior) are equivalent. In the applications below we utilize non-intersecting priors. 4 Bounds for Compression Algorithms Here we propose a technique for bounding the error of “compression” algorithms based on appropriate construction of prior probabilities. Let A be a learning algorithm. Intuitively, A is a “compression scheme” if it can generate the same hypothesis using a subset of the data. More formally, a learning algorithm A (viewed as a function from samples to some hypothesis class) is a compression scheme with respect to a sample Z if there is a subsample Z , Z ⊂ Z, such that A(Z ) = A(Z). Observe that the SVM approach is a compression scheme, with Z being determined by the set of support vectors. Let A be a deterministic compression scheme and consider the full sample Xm+u . For each integer τ = 1, . . . , m, consider all subsets of Xm+u of size τ , and for each subset construct all possible dichotomies of that subset (note that we are not proposing this approach as an algorithm, but rather as a means to derive bounds; in practice one need not construct all these dichotomies). A deterministic algorithm A uniquely determines at most one hypothesis h ∈ H for each dichotomy.3 For each τ , let the set of hypotheses generated by this procedure be denoted by Hτ . For the rest of this discussion we assume the worst case where |Hτ | = m+u (i.e. if Hτ does not contains one hypothesis for each dichotomy τ the bounds improve). The prior pτ is then defined to be a uniform distribution over Hτ . In this way we have m priors, p1 , . . . , pm which are constructed using only Xm+u (and are independent of Sm ). Any hypothesis selected by the learning algorithm A based on the labeled sample Sm and on the test set Xu belongs to ∪m Hτ . The motivation for this τ =1 construction is as follows. Each τ can be viewed as our “guess” for the maximal number of compression points that will be utilized by a resulting classifier. For each such τ the prior pτ is constructed over all possible classifiers that use τ compression points. By systematically considering all possible dichotomies of τ points we can characterize a relatively small subset of H without observing labels of the training points. Thus, each prior pτ represents one such guess. 
Using Theorem 3.4 we are later allowed to choose in retrospect the bound corresponding to the best “guess”. The following corollary identifies an upper bound on the divergence in terms of the observed size of the compression set of the final classifier. Corollary 4.1 Let the conditions of Theorem 3.4 hold. Let A be a deterministic learning algorithm leading to a hypothesis h ∈ H based on a compression set of size s. Then with probability at least 1 − δ for all h ∈ H, (5) holds with log(1/p(h)) replaced by s log(2e(m + u)/s) and ln(m/δ) replaced by ln(m2 /δ). Proof: Recall that Hs ⊆ H is the support set of ps and that ps (h) = 1/|Hs | for all h ∈ Hs , implying that ln(1/ps (h)) = |Hs |. Using the inequality m+u ≤ (e(m + u)/s)s s we have that |Hs | = 2s m+u ≤ (2e(m + u)/s)s . Substituting this result in Theorem 3.4 s while restricting the minimum over i to be over i ≥ s, leads to the desired result. 2 The bound of Corollary 4.1 can be easily computed once the classifier is trained. If the size of the compression set happens to be small, we obtain a tight bound. SVM classification is one of the best studied compression schemes. The compression set for a sample Sm is given by the subset of support vectors. Thus the bound in Corollary 4.1 immediately applies with s being the number of observed support vectors (after training). We note that this bound is similar to a recently derived compression bound for inductive learning (Theorem 5.18 in [16]). Also, observe that the algorithm itself (inductive SVM) did not use in this case the unlabeled sample (although the bound does use this sample). Nevertheless, using exactly the same technique we obtain error bounds for the transductive SVM algorithms in [2, 3].4 3 It might be that for some dichotomies the algorithm will fail. For example, an SVM in feature space without soft margin will fail to classify non linearly-separable dichotomies of Xm+u . 4 Note however that our bounds are optimized with a “minimum number of support vectors” approach rather than “maximum margin”. 5 Bounds for Clustering Algorithms Some learning problems do not allow for high compression rates using compression schemes such as SVMs (i.e. the number of support vectors can sometimes be very large). A considerably stronger type of compression can often be achieved by clustering algorithms. While there is lack of formal links between entirely unsupervised clustering and classification, within a transduction setting we can provide a principled approach to using clustering algorithms for classification. Let A be any (deterministic) clustering algorithm which, given the full sample Xm+u , can cluster this sample into any desired number of clusters. We use A to cluster Xm+u into 2, 3 . . . , c clusters where c ≤ m. Thus, the algorithm generates a collection of partitions of Xm+u into τ = 2, 3, . . . , c clusters, where each partition is denoted by Cτ . For each value of τ , let Hτ consist of those hypotheses which assign an identical label to all points in the same cluster of partition Cτ , and define the prior pτ (h) = 1/2τ for each h ∈ Hτ and zero otherwise (note that there are 2τ possible dichotomies). The learning algorithm selects a hypothesis as follows. Upon observing the labeled sample Sm = (Xm , Ym ), for each of the clusterings C2 , . . . , Cc constructed above, it assigns a label to each cluster based on the majority vote from the labels Ym of points falling within the cluster (in case of ties, or if no points from Xm belong to the cluster, choose a label arbitrarily). 
Doing this leads to c − 1 classifiers hτ , τ = 2, . . . , c. For each hτ there is a valid error bound as given by Theorem 3.4 and all these bounds are valid simultaneously. Thus we choose the best classifier (equivalently, number of clusters) for which the best bound holds. We thus have the following corollary of Theorem 3.4 and Lemma 2.1. Corollary 5.1 Let A be any clustering algorithm and let hτ , τ = 2, . . . , c be classifications of test set Xu as determined by clustering of the full sample Xm+u (into τ clusters) and the training set Sm , as described above. Let δ ∈ (0, 1) be given. Then with probability at least 1 − δ, for all τ , (5) holds with log(1/p(h)) replaced by τ and ln(m/δ) replaced by ln(mc/δ). Error bounds obtained using Corollary 5.1 can be rather tight when the clustering algorithm is successful (i.e. when it captures the class structure in the data using a small number of clusters). Corollary 5.1 can be extended in a number of ways. One simple extension is the use of an ensemble of clustering algorithms. Specifically, we can concurrently apply k clustering algorithm (using each algorithm to cluster the data into τ = 2, . . . , c clusters). We thus obtain kc hypotheses (partitions of Xm+u ). By a simple application of the union bound we can replace ln cm by ln kcm in Corollary 5.1 and guarantee that kc bounds hold siδ δ multaneously for all kc hypotheses (with probability at least 1 − δ). We thus choose the hypothesis which minimizes the resulting bound. This extension is particularly attractive since typically without prior knowledge we do not know which clustering algorithm will be effective for the dataset at hand. 6 Concluding Remarks We presented new bounds for transductive learning algorithms. We also developed a new technique for deriving tight error bounds for compression schemes and for clustering algorithms in the transductive setting. We expect that these bounds and new techniques will be useful for deriving new error bounds for other known algorithms and for deriving new types of transductive learning algorithms. It would be interesting to see if tighter transduction bounds can be obtained by reducing the “slacks” in the inequalities we use in our analysis. Another promising direction is the construction of better (multiple) priors. For example, in our compression bound (Corollary 4.1), for each number of compression points we assigned the same prior to each possible point subset and each possible dichotomy. However, in practice a vast majority of all these subsets and dichotomies are unlikely to occur. Acknowledgments The work of R.E and R.M. was partially supported by the Technion V.P.R. fund for the promotion of sponsored research. Support from the Ollendorff center of the department of Electrical Engineering at the Technion is also acknowledged. We also thank anonymous referees for their useful comments. References [1] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982. [2] V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998. [3] T. Joachims. Transductive inference for text classification unsing support vector machines. In European Conference on Machine Learning, 1999. [4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceeding of The Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001. [5] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semisupervised learning. 
In Advances in Neural Information Processing Systems (NIPS 2001), pages 1025–1032, 2001. [6] T. Joachims. Transductive learning via spectral graph partitioning. In Proceeding of The Twentieth International Conference on Machine Learning (ICML-2003), 2003. [7] D. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999. [8] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003. [9] D. Wu, K. Bennett, N. Cristianini, and J. Shawe-Taylor. Large margin trees for induction and transduction. In International Conference on Machine Learning, 1999. [10] L. Bottou, C. Cortes, and V. Vapnik. On the effective VC dimension. Technical report, AT&T;, 1994. [11] G.R.G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M.I. Jordan. Learning the kernel matrix with semi-definite programming. Technical report, University of Berkeley, Computer Science Division, 2002. [12] A. Blum and J. Langford. Pac-mdl bounds. In COLT, pages 344–357, 2003. [13] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statis. Assoc., 58:13–30, 1963. [14] A. Dembo and O. Zeitouni. Large Deviation Techniques and Applications. Springer, New York, second edition, 1998. [15] D. McAllester. Simplified pac-bayesian margin bounds. In COLT, pages 203–215, 2003. [16] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Boston, 2002.
4 0.13029815 171 nips-2003-Semi-Definite Programming by Perceptron Learning
Author: Thore Graepel, Ralf Herbrich, Andriy Kharechko, John S. Shawe-taylor
Abstract: We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasable region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. 1
5 0.12948997 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
6 0.12215657 176 nips-2003-Sequential Bayesian Kernel Regression
7 0.095548496 145 nips-2003-Online Classification on a Budget
8 0.089935504 113 nips-2003-Learning with Local and Global Consistency
9 0.076754883 46 nips-2003-Clustering with the Connectivity Kernel
10 0.075627729 124 nips-2003-Max-Margin Markov Networks
11 0.074890494 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
12 0.073664628 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
13 0.072647519 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
14 0.071871608 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
15 0.071203262 128 nips-2003-Minimax Embeddings
16 0.061521623 1 nips-2003-1-norm Support Vector Machines
17 0.060475636 73 nips-2003-Feature Selection in Clustering Problems
18 0.059635121 121 nips-2003-Log-Linear Models for Label Ranking
19 0.059586927 148 nips-2003-Online Passive-Aggressive Algorithms
20 0.058991436 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
topicId topicWeight
[(0, -0.209), (1, -0.084), (2, -0.067), (3, -0.09), (4, 0.077), (5, 0.081), (6, -0.045), (7, 0.094), (8, 0.089), (9, -0.013), (10, 0.068), (11, -0.086), (12, -0.042), (13, 0.039), (14, -0.08), (15, 0.057), (16, -0.034), (17, -0.007), (18, -0.084), (19, -0.002), (20, -0.024), (21, -0.045), (22, -0.15), (23, -0.142), (24, 0.029), (25, -0.036), (26, 0.219), (27, 0.153), (28, -0.008), (29, -0.107), (30, -0.028), (31, -0.023), (32, -0.099), (33, -0.135), (34, 0.041), (35, -0.028), (36, 0.121), (37, -0.069), (38, 0.044), (39, 0.029), (40, 0.085), (41, 0.094), (42, 0.026), (43, 0.143), (44, 0.078), (45, -0.014), (46, 0.022), (47, 0.047), (48, -0.01), (49, 0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.94696873 48 nips-2003-Convex Methods for Transduction
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
2 0.60166562 171 nips-2003-Semi-Definite Programming by Perceptron Learning
Author: Thore Graepel, Ralf Herbrich, Andriy Kharechko, John S. Shawe-taylor
Abstract: We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasable region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. 1
3 0.56647074 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
Author: Philip Derbeko, Ran El-Yaniv, Ron Meir
Abstract: This paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering. 1 Introduction and Related Work In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or “transduce”) the labels of the test points. The transduction setting was introduced by Vapnik [1, 2] who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of points in the test set can be done using an inductive scheme. However, as pointed out in [2], it makes little sense to solve an easier problem by ‘reducing’ it to a much more difficult one. In particular, the prior knowledge carried by the (unlabeled) test points can be incorporated into an algorithm, potentially leading to superior performance. Indeed, a number of papers have demonstrated empirically that transduction can offer substantial advantage over induction whenever the training set is small or moderate (see e.g. [3, 4, 5, 6]). However, unlike the current state of affairs in induction, the question of what are provably effective learning principles for transduction is quite far from being resolved. In this paper we provide new error bounds and a general technique for transductive learning. Our technique is based on bounds that can be viewed as an extension of McAllester’s PAC-Bayesian framework [7, 8] to transductive learning. The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). This flexibility allows for the choice of “compact priors” (with small support) and therefore, for tight bounds. Another simple observation is that the PAC-Bayesian framework can be operated with polynomially (in m, the training sample size) many different priors simultaneously. Altogether, this added flexibility, of using data-dependent multiple priors allows for easy derivation of tight error bounds for “compression schemes” such as (transductive) SVMs and for clustering algorithms. We briefly review some previous results. The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e.g. [2]), where an error bound is also proposed. However, this bound is implicit and rather unwieldy and, to the best of our knowledge, has not been applied in practical situations. A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. The bound is data-dependent depending on the number of decision nodes, the margins at each node and the sample size. However, the authors state that the transduction bound is not much tighter than the induction bound. Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error, however, the advantage is usually statistically insignificant. Refining the algorithm of [2] a transductive algorithm based on a SVMs is proposed in [3]. The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain. 
An error bound for transduction, based on the effective VC dimension, is given in [10]. More recently, Lanckriet et al. [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. Blum and Langford [12] recently also established an implicit bound for transduction, in the spirit of the results in [2].

2 The Transduction Setup
We consider the following setting proposed by Vapnik ([2], Chp. 8), which for simplicity is described in the context of binary classification (the general case will be discussed in the full paper). Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let Xm+u = {x1, . . . , xm+u} be a set of points from X, each of which is chosen i.i.d. according to some unknown distribution µ(x). We call Xm+u the full sample. Let Xm = {x1, . . . , xm} and Ym = {y1, . . . , ym}, where Xm is drawn uniformly from Xm+u and yi ∈ {±1}. The set Sm = {(x1, y1), . . . , (xm, ym)} is referred to as a training sample. In this paper we assume that yi = φ(xi) for some unknown function φ. The remaining subset Xu = Xm+u \ Xm is referred to as the unlabeled sample. Based on Sm and Xu our goal is to choose h ∈ H which predicts the labels of points in Xu as accurately as possible. For each h ∈ H and a set Z = x1, . . . , x|Z| of samples define

    Rh(Z) = (1/|Z|) Σ_{i=1}^{|Z|} ℓ(h(xi), yi),    (1)

where in our case ℓ(·, ·) is the zero-one loss function. Our goal in transduction is to learn an h such that Rh(Xu) is as small as possible. This problem setup is summarized by the following transduction “protocol”, introduced in [2] and referred to as Setting 1:
(i) A full sample Xm+u = {x1, . . . , xm+u} consisting of arbitrary m + u points is given. [Footnote 1: The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently at random according to some source distribution µ(x).]
(ii) We then choose uniformly at random the training sample Xm ⊆ Xm+u and receive its labeling Ym; the resulting training set is Sm = (Xm, Ym) and the remaining set Xu is the unlabeled sample, Xu = Xm+u \ Xm.
(iii) Using both Sm and Xu we select a classifier h ∈ H whose quality is measured by Rh(Xu).

Vapnik [2] also considers another formulation of transduction, referred to as Setting 2:
(i) We are given a training set Sm = (Xm, Ym) selected i.i.d. according to µ(x, y).
(ii) An independent test set Su = (Xu, Yu) of u samples is then selected in the same manner.
(iii) We are required to choose our best h ∈ H based on Sm and Xu so as to minimize

    Rm,u(h) = ∫ (1/u) Σ_{i=m+1}^{m+u} ℓ(h(xi), yi) dµ(x1, y1) · · · dµ(xm+u, ym+u).    (2)

Even though Setting 2 may appear more applicable in practical situations than Setting 1, the derivation of theoretical results can be easier within Setting 1. Nevertheless, as far as the expected losses are concerned, Vapnik [2] shows that an error bound in Setting 1 implies an equivalent bound in Setting 2. In view of this result we restrict ourselves in the sequel to Setting 1.

We make use of the following quantities, which are all instances of (1). The quantity Rh(Xm+u) is called the full sample risk of the hypothesis h, Rh(Xu) is referred to as the transduction risk (of h), and Rh(Xm) is the training error (of h). Thus, Rh(Xm) is the standard training error, denoted by R̂h(Sm). While our objective in transduction is to achieve small error over the unlabeled set (i.e. to minimize Rh(Xu)), it turns out that it is much easier to derive error bounds for the full sample risk.
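To make Setting 1 and the risk definition (1) concrete, the following minimal Python/numpy sketch draws a training sample without replacement from a synthetic full sample and evaluates the three risks just defined; the threshold rule playing the role of φ and the fixed hypothesis h are placeholders, not anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_one_risk(h, X, y):
    """R_h(Z) as in (1): average zero-one loss of h over the points in Z."""
    return float(np.mean(h(X) != y))

m, u = 30, 70
X_full = rng.normal(size=(m + u, 1))          # the full sample X_{m+u}
y_full = np.where(X_full[:, 0] > 0, 1, -1)    # a threshold rule standing in for phi

# Setting 1, step (ii): draw the m training indices uniformly without replacement.
train_idx = rng.choice(m + u, size=m, replace=False)
test_idx = np.setdiff1d(np.arange(m + u), train_idx)

h = lambda X: np.where(X[:, 0] > 0.1, 1, -1)  # an arbitrary fixed hypothesis

R_full  = zero_one_risk(h, X_full, y_full)                          # full sample risk
R_train = zero_one_risk(h, X_full[train_idx], y_full[train_idx])    # training error
R_test  = zero_one_risk(h, X_full[test_idx], y_full[test_idx])      # transduction risk

# Sanity check: the full sample risk is the m:u weighted mixture of the other two.
assert abs(R_full - (m * R_train + u * R_test) / (m + u)) < 1e-12
```

The final assertion is exactly the risk decomposition that drives the lemma stated next.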
The following simple lemma translates an error bound on Rh(Xm+u), the full sample risk, to an error bound on the transduction risk Rh(Xu).

Lemma 2.1 For any h ∈ H and any C,

    Rh(Xm+u) ≤ R̂h(Sm) + C   ⇔   Rh(Xu) ≤ R̂h(Sm) + ((m+u)/u) · C.    (3)

Proof: For any h,

    Rh(Xm+u) = ( m·Rh(Xm) + u·Rh(Xu) ) / (m+u).    (4)

Substituting R̂h(Sm) for Rh(Xm) in (4) and then substituting the result for the left-hand side of (3) we get

    ( m·R̂h(Sm) + u·Rh(Xu) ) / (m+u) ≤ R̂h(Sm) + C.

The equivalence (3) is now obtained by isolating Rh(Xu) on the left-hand side. □

3 General Error Bounds for Transduction
Consider a hypothesis class H and assume for simplicity that H is countable; in fact, in the case of transduction it suffices to consider a finite hypothesis class. To see this, note that all m + u points are known in advance. Thus, in the case of binary classification (for example) it suffices to consider at most 2^(m+u) possible dichotomies. Recall that in the setting considered we select a sub-sample of m points from the set Xm+u of cardinality m + u. This corresponds to a selection of m points without replacement from a set of m + u points, leading to the m points being dependent. A naive utilization of large deviation bounds would therefore not be directly applicable in this setting. However, Hoeffding (see Theorem 4 in [13]) pointed out a simple procedure to transform the problem into one involving independent data. While this procedure leads to non-trivial bounds, it does not fully take advantage of the transductive setting and will not be used here. Consider for simplicity the case of binary classification. In this case we make use of the following concentration inequality, based on [14].

Theorem 3.1 Let C = {c1, . . . , cN}, ci ∈ {0, 1}, be a finite set of binary numbers, and set c̄ = (1/N) Σ_{i=1}^{N} ci. Let Z1, . . . , Zm be random variables obtaining their values by sampling C uniformly at random without replacement. Set Z̄ = (1/m) Σ_{i=1}^{m} Zi and β = m/N. Then, if ε ≤ min{1 − c̄, c̄(1 − β)/β} (see Footnote 2 below),

    Pr{ Z̄ − E Z̄ > ε } ≤ exp( −m·D(c̄ + ε ‖ c̄) − (N − m)·D(c̄ − βε/(1−β) ‖ c̄) + 7·log(N + 1) ),

where D(p ‖ q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)), p, q ∈ [0, 1], is the binary Kullback-Leibler divergence.

Using this result we obtain the following error bound for transductive classification.

Theorem 3.2 Let Xm+u = Xm ∪ Xu be the full sample and let p = p(Xm+u) be a (prior) distribution over the class of binary hypotheses H that may depend on the full sample. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over choices of Sm (from the full sample), the following bound holds for any h ∈ H,

    Rh(Xu) ≤ R̂h(Sm) + sqrt( (2·R̂h(Sm)·(m+u)/u) · ( ln(1/p(h)) + ln(m/δ) + 7·log(m+u+1) ) / (m−1) ) + 2·( ln(1/p(h)) + ln(m/δ) + 7·log(m+u+1) ) / (m−1).    (5)

Proof: (sketch) In our transduction setting the set Xm (and therefore Sm) is obtained by sampling the full sample Xm+u uniformly at random without replacement. We first claim that

    E_{Σm} R̂h(Sm) = Rh(Xm+u),    (6)

where E_{Σm}(·) is the expectation with respect to a random choice of Sm from Xm+u without replacement. This is shown as follows (writing C(n, k) for the binomial coefficient):

    E_{Σm} R̂h(Sm) = (1/C(m+u, m)) Σ_{Sm} R̂h(Sm) = (1/C(m+u, m)) Σ_{Xm ⊆ Xm+u} (1/m) Σ_{x ∈ Sm} ℓ(h(x), φ(x)).

By symmetry, all points x ∈ Xm+u are counted on the right-hand side an equal number of times; this number is precisely C(m+u, m) − C(m+u−1, m) = C(m+u−1, m−1). The equality (6) is obtained by considering the definition of Rh(Xm+u) and noting that C(m+u−1, m−1)/C(m+u, m) = m/(m+u). The remainder of the proof combines Theorem 3.1 and the techniques presented in [15].
The details will be provided in the full paper. □

Notice that when R̂h(Sm) → 0 the square root in (5) vanishes and faster rates are obtained. An important feature of Theorem 3.2 is that it allows one to use the sample Xm+u in order to choose the prior distribution p(h). This advantage has already been alluded to in [2], but does not seem to have been widely used in practice. Additionally, observe that (5) holds with probability at least 1 − δ with respect to the random selection of sub-samples of size m from the fixed set Xm+u. This should be contrasted with the standard inductive setting results, where the probabilities are with respect to a random choice of m training points chosen i.i.d. from µ(x, y).

The next bound we present is analogous to McAllester’s Theorem 1 in [8]. This theorem concerns Gibbs composite classifiers, which are distributions over the base classifiers in H. For any distribution q over H denote by Gq the Gibbs classifier, which classifies an instance (in Xu) by randomly choosing, according to q, one hypothesis h ∈ H. [Footnote 2: The second condition, ε ≤ c̄(1 − β)/β, simply guarantees that the number of ‘ones’ in the sub-sample does not exceed their number in the original sample.] For Gibbs classifiers we now extend definition (1) as follows. Let Z = x1, . . . , x|Z| be any set of samples and let Gq be a Gibbs classifier over H. The risk of Gq over Z is

    R_{Gq}(Z) = E_{h∼q}[ (1/|Z|) Σ_{i=1}^{|Z|} ℓ(h(xi), φ(xi)) ].

As before, when Z = Xm (the training set) we use the standard notation R̂_{Gq}(Sm) = R_{Gq}(Xm). Due to space limitations, the proof of the following theorem will appear in the full paper.

Theorem 3.3 Let Xm+u be the full sample. Let p be a distribution over H that may depend on Xm+u and let q be a (posterior) distribution over H that may depend on both Sm and Xu. Let δ ∈ (0, 1) be given. With probability at least 1 − δ over the choices of Sm, for any distribution q,

    R_{Gq}(Xu) ≤ R̂_{Gq}(Sm) + sqrt( (2·R̂_{Gq}(Sm)·(m+u)/u) · ( D(q‖p) + ln(m/δ) + 7·log(m+u+1) ) / (m−1) ) + 2·( D(q‖p) + ln(m/δ) + 7·log(m+u+1) ) / (m−1).

In the context of inductive learning, a major obstacle in generating meaningful and effective bounds using the PAC-Bayesian framework [8] is the construction of “compact priors”. Here we discuss two extensions to the PAC-Bayesian scheme, which together allow for easy choices of compact priors that can yield tight error bounds. The first extension we offer is the use of multiple priors. Instead of a single prior p in the original PAC-Bayesian framework, we observe that one can apply the PAC-Bayesian bound with a number of priors p1, . . . , pk and then replace the complexity term ln(1/p(h)) (in Theorem 3.2) by min_i ln(1/pi(h)), at a cost of an additional ln k term (see below). Similarly, in Theorem 3.3 we can replace the KL-divergence term in the bound with min_i D(q‖pi). The penalty for using k priors is logarithmic in k (specifically, the ln(1/δ) term in the original bound becomes ln(k/δ)). As long as k is sub-exponential in m we still obtain effective generalization bounds. The second “extension” is simply the feature of our transduction bounds (Theorems 3.2 and 3.3) which allows the priors to be dependent on the full sample Xm+u. The combination of these two simple ideas yields a powerful technique for deriving error bounds in realistic transductive settings. After stating the extended result we later use it for deriving tight bounds for known learning algorithms and for deriving new algorithms.
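As a quick numerical illustration, the sketch below evaluates the right-hand side of (5) as reconstructed above; since the display was recovered from a garbled extraction, the exact constants (in particular the factor on the non-square-root term) should be treated as an assumption rather than a verbatim transcription of the theorem.

```python
import math

def transduction_bound(emp_risk, m, u, ln_inv_prior, delta):
    """Right-hand side of (5): emp_risk is the training error R_h(S_m) and
    ln_inv_prior is ln(1/p(h)) for the hypothesis actually selected."""
    cap = (ln_inv_prior + math.log(m / delta) + 7 * math.log(m + u + 1)) / (m - 1)
    return emp_risk + math.sqrt(2 * emp_risk * (m + u) / u * cap) + 2 * cap

# Example: a prior that puts mass 2**-10 on the chosen hypothesis.
print(transduction_bound(emp_risk=0.05, m=200, u=800,
                         ln_inv_prior=10 * math.log(2), delta=0.05))
```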
Suppose that instead of a single prior p over H we want to utilize k priors, p1, . . . , pk, and in retrospect choose the best among the k corresponding PAC-Bayesian bounds. The following theorem shows that one can use polynomially many priors with a minor penalty. The proof, which is omitted due to space limitations, utilizes the union bound in a straightforward manner.

Theorem 3.4 Let the conditions of Theorem 3.2 hold, except that we now have k prior distributions p1, . . . , pk defined over H, each of which may depend on Xm+u. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over random choices of sub-samples of size m from the full sample, for all h ∈ H, (5) holds with p(h) replaced by min_{1≤i≤k} pi(h) and log(1/δ) replaced by log(k/δ).

Remark: A similar result holds for the Gibbs algorithm of Theorem 3.3. Also, as noted by one of the reviewers, when the supports of the k priors intersect (i.e. there is at least one pair of priors pi and pj with overlapping support), then one can do better by utilizing the “super prior” p = (1/k) Σ_i pi within the original Theorem 3.2. However, note that when the supports are disjoint, these two views (of multiple priors and a super prior) are equivalent. In the applications below we utilize non-intersecting priors.

4 Bounds for Compression Algorithms
Here we propose a technique for bounding the error of “compression” algorithms based on an appropriate construction of prior probabilities. Let A be a learning algorithm. Intuitively, A is a “compression scheme” if it can generate the same hypothesis using a subset of the data. More formally, a learning algorithm A (viewed as a function from samples to some hypothesis class) is a compression scheme with respect to a sample Z if there is a subsample Z′, Z′ ⊂ Z, such that A(Z′) = A(Z). Observe that the SVM approach is a compression scheme, with Z′ being determined by the set of support vectors.

Let A be a deterministic compression scheme and consider the full sample Xm+u. For each integer τ = 1, . . . , m, consider all subsets of Xm+u of size τ, and for each subset construct all possible dichotomies of that subset (note that we are not proposing this approach as an algorithm, but rather as a means to derive bounds; in practice one need not construct all these dichotomies). A deterministic algorithm A uniquely determines at most one hypothesis h ∈ H for each dichotomy (see Footnote 3 below). For each τ, let the set of hypotheses generated by this procedure be denoted by Hτ. For the rest of this discussion we assume the worst case, where |Hτ| = 2^τ · C(m+u, τ) (i.e. if Hτ does not contain one hypothesis for each dichotomy, the bounds improve). The prior pτ is then defined to be a uniform distribution over Hτ. In this way we have m priors, p1, . . . , pm, which are constructed using only Xm+u (and are independent of Sm). Any hypothesis selected by the learning algorithm A based on the labeled sample Sm and on the test set Xu belongs to ∪_{τ=1}^{m} Hτ. The motivation for this construction is as follows. Each τ can be viewed as our “guess” for the maximal number of compression points that will be utilized by a resulting classifier. For each such τ the prior pτ is constructed over all possible classifiers that use τ compression points. By systematically considering all possible dichotomies of τ points we can characterize a relatively small subset of H without observing labels of the training points. Thus, each prior pτ represents one such guess.
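Under the worst-case count |Hτ| = 2^τ · C(m+u, τ) assumed above, the complexity term of the τ-th prior and the Theorem 3.4 penalty for using the k = m priors can be tabulated directly; a small sketch with illustrative sizes follows (the specific values of m, u and δ are placeholders).

```python
import math

def ln_inv_prior(tau, m, u):
    """ln(1/p_tau(h)) when p_tau is uniform over H_tau and, in the worst case,
    |H_tau| = 2^tau * C(m+u, tau): one hypothesis per dichotomy of every
    tau-point subset of the full sample."""
    return tau * math.log(2) + math.log(math.comb(m + u, tau))

m, u, delta = 100, 400, 0.05
# Theorem 3.4 with k = m priors: choosing the best "guess" tau in retrospect
# only replaces log(1/delta) by log(k/delta) in the bound.
penalty = math.log(m / delta)
complexity = {tau: round(ln_inv_prior(tau, m, u), 1) for tau in (1, 5, 20, 50)}
print(complexity, round(penalty, 2))
```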
Using Theorem 3.4 we are later allowed to choose in retrospect the bound corresponding to the best “guess”. The following corollary identifies an upper bound on the divergence in terms of the observed size of the compression set of the final classifier.

Corollary 4.1 Let the conditions of Theorem 3.4 hold. Let A be a deterministic learning algorithm leading to a hypothesis h ∈ H based on a compression set of size s. Then with probability at least 1 − δ, for all h ∈ H, (5) holds with log(1/p(h)) replaced by s·log(2e(m+u)/s) and ln(m/δ) replaced by ln(m²/δ).

Proof: Recall that Hs ⊆ H is the support set of ps and that ps(h) = 1/|Hs| for all h ∈ Hs, implying that ln(1/ps(h)) = ln|Hs|. Using the inequality C(m+u, s) ≤ (e(m+u)/s)^s we have that |Hs| = 2^s · C(m+u, s) ≤ (2e(m+u)/s)^s. Substituting this result in Theorem 3.4, while restricting the minimum over i to be over i ≥ s, leads to the desired result. □

The bound of Corollary 4.1 can be easily computed once the classifier is trained. If the size of the compression set happens to be small, we obtain a tight bound. SVM classification is one of the best studied compression schemes. The compression set for a sample Sm is given by the subset of support vectors. Thus the bound in Corollary 4.1 immediately applies with s being the number of observed support vectors (after training). We note that this bound is similar to a recently derived compression bound for inductive learning (Theorem 5.18 in [16]). Also, observe that the algorithm itself (inductive SVM) did not use in this case the unlabeled sample (although the bound does use this sample). Nevertheless, using exactly the same technique we obtain error bounds for the transductive SVM algorithms in [2, 3] (see Footnote 4 below).

[Footnote 3: It might be that for some dichotomies the algorithm will fail. For example, an SVM in feature space without soft margin will fail to classify non linearly-separable dichotomies of Xm+u.]
[Footnote 4: Note, however, that our bounds are optimized with a “minimum number of support vectors” approach rather than “maximum margin”.]

5 Bounds for Clustering Algorithms
Some learning problems do not allow for high compression rates using compression schemes such as SVMs (i.e. the number of support vectors can sometimes be very large). A considerably stronger type of compression can often be achieved by clustering algorithms. While there is a lack of formal links between entirely unsupervised clustering and classification, within a transduction setting we can provide a principled approach to using clustering algorithms for classification.

Let A be any (deterministic) clustering algorithm which, given the full sample Xm+u, can cluster this sample into any desired number of clusters. We use A to cluster Xm+u into 2, 3, . . . , c clusters, where c ≤ m. Thus, the algorithm generates a collection of partitions of Xm+u into τ = 2, 3, . . . , c clusters, where each partition is denoted by Cτ. For each value of τ, let Hτ consist of those hypotheses which assign an identical label to all points in the same cluster of partition Cτ, and define the prior pτ(h) = 1/2^τ for each h ∈ Hτ and zero otherwise (note that there are 2^τ possible dichotomies). The learning algorithm selects a hypothesis as follows. Upon observing the labeled sample Sm = (Xm, Ym), for each of the clusterings C2, . . . , Cc constructed above, it assigns a label to each cluster based on the majority vote from the labels Ym of points falling within the cluster (in case of ties, or if no points from Xm belong to the cluster, choose a label arbitrarily).
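A minimal sketch of the cluster-and-vote rule just described is given below; k-means with a fixed seed stands in for the unspecified deterministic clustering algorithm A, and the tie-breaking label is arbitrary, as allowed above.

```python
import numpy as np
from sklearn.cluster import KMeans  # any deterministic clustering of the full sample will do

def cluster_vote_classifier(X_full, train_idx, y_train, tau, seed=0):
    """Cluster the full sample X_{m+u} into tau clusters, then label every cluster
    by majority vote over the training points falling inside it; empty or tied
    clusters receive an arbitrary label (+1 here)."""
    clusters = KMeans(n_clusters=tau, n_init=10, random_state=seed).fit_predict(X_full)
    labels = np.ones(tau)                       # arbitrary default label
    for c in range(tau):
        votes = y_train[clusters[train_idx] == c]
        if votes.size and votes.sum() != 0:
            labels[c] = np.sign(votes.sum())
    return labels[clusters]                     # a predicted label for every point
```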
Doing this leads to c − 1 classifiers hτ, τ = 2, . . . , c. For each hτ there is a valid error bound as given by Theorem 3.4, and all these bounds are valid simultaneously. Thus we choose the best classifier (equivalently, number of clusters) for which the best bound holds. We thus have the following corollary of Theorem 3.4 and Lemma 2.1.

Corollary 5.1 Let A be any clustering algorithm and let hτ, τ = 2, . . . , c be classifications of the test set Xu as determined by clustering of the full sample Xm+u (into τ clusters) and the training set Sm, as described above. Let δ ∈ (0, 1) be given. Then with probability at least 1 − δ, for all τ, (5) holds with log(1/p(h)) replaced by τ and ln(m/δ) replaced by ln(mc/δ).

Error bounds obtained using Corollary 5.1 can be rather tight when the clustering algorithm is successful (i.e. when it captures the class structure in the data using a small number of clusters). Corollary 5.1 can be extended in a number of ways. One simple extension is the use of an ensemble of clustering algorithms. Specifically, we can concurrently apply k clustering algorithms (using each algorithm to cluster the data into τ = 2, . . . , c clusters). We thus obtain kc hypotheses (partitions of Xm+u). By a simple application of the union bound we can replace ln(cm/δ) by ln(kcm/δ) in Corollary 5.1 and guarantee that kc bounds hold simultaneously for all kc hypotheses (with probability at least 1 − δ). We thus choose the hypothesis which minimizes the resulting bound. This extension is particularly attractive since typically, without prior knowledge, we do not know which clustering algorithm will be effective for the dataset at hand. (A small numerical sketch of this bound-driven selection is given after the reference list below.)

6 Concluding Remarks
We presented new bounds for transductive learning algorithms. We also developed a new technique for deriving tight error bounds for compression schemes and for clustering algorithms in the transductive setting. We expect that these bounds and new techniques will be useful for deriving new error bounds for other known algorithms and for deriving new types of transductive learning algorithms. It would be interesting to see if tighter transduction bounds can be obtained by reducing the “slacks” in the inequalities we use in our analysis. Another promising direction is the construction of better (multiple) priors. For example, in our compression bound (Corollary 4.1), for each number of compression points we assigned the same prior to each possible point subset and each possible dichotomy. However, in practice a vast majority of all these subsets and dichotomies are unlikely to occur.

Acknowledgments
The work of R.E. and R.M. was partially supported by the Technion V.P.R. fund for the promotion of sponsored research. Support from the Ollendorff center of the department of Electrical Engineering at the Technion is also acknowledged. We also thank the anonymous referees for their useful comments.

References
[1] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[2] V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998.
[3] T. Joachims. Transductive inference for text classification using support vector machines. In European Conference on Machine Learning, 1999.
[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001.
[5] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning.
In Advances in Neural Information Processing Systems (NIPS 2001), pages 1025–1032, 2001.
[6] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.
[7] D. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
[8] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
[9] D. Wu, K. Bennett, N. Cristianini, and J. Shawe-Taylor. Large margin trees for induction and transduction. In International Conference on Machine Learning, 1999.
[10] L. Bottou, C. Cortes, and V. Vapnik. On the effective VC dimension. Technical report, AT&T, 1994.
[11] G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Technical report, University of California, Berkeley, Computer Science Division, 2002.
[12] A. Blum and J. Langford. PAC-MDL bounds. In COLT, pages 344–357, 2003.
[13] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.
[14] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer, New York, second edition, 1998.
[15] D. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, pages 203–215, 2003.
[16] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Cambridge, MA, 2002.
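The bound-driven choice of the number of clusters referred to above (Corollary 5.1) amounts to a one-line model selection once the per-τ training errors are known; in the sketch below the constants mirror the reconstruction of (5) given earlier, and the training errors are invented purely for illustration.

```python
import math

def clustering_bound(emp_risk, tau, m, u, c, delta):
    """Right-hand side of (5) with log(1/p(h)) -> tau and ln(m/delta) -> ln(m*c/delta),
    as prescribed by Corollary 5.1 (constants as reconstructed above)."""
    cap = (tau + math.log(m * c / delta) + 7 * math.log(m + u + 1)) / (m - 1)
    return emp_risk + math.sqrt(2 * emp_risk * (m + u) / u * cap) + 2 * cap

m, u, c, delta = 100, 400, 10, 0.05
emp = {2: 0.30, 3: 0.08, 4: 0.05, 5: 0.05, 6: 0.04}   # hypothetical training errors of h_tau
best_tau = min(emp, key=lambda t: clustering_bound(emp[t], t, m, u, c, delta))
# For an ensemble of k clustering algorithms, ln(c*m/delta) becomes ln(k*c*m/delta).
```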
4 0.52567649 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm, the Semidefinite Programming Machine (SDPM), which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in the dual variables and allows us to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
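The SDPM itself optimizes the margin over polynomial trajectories via semidefinite constraints; as a much simpler hedge, the sketch below enforces the margin only along degree-one trajectories x_i + t·d_i with t in a bounded interval, for which the worst case over t has the closed form y_i(w·x_i + b) − τ|w·d_i|. It is meant only to illustrate the idea of a margin along a transformation direction, not the authors' algorithm; the tangent vectors d_i and all parameters are assumptions.

```python
import numpy as np
import cvxpy as cp

def robust_margin_hyperplane(X, D, y, tau=0.1):
    """Hard-margin hyperplane that stays correct along x_i + t*d_i for |t| <= tau.
    X: (n, d) points, D: (n, d) tangent vectors, y: labels in {-1, +1}.
    May be infeasible if the worst-case trajectories are not separable."""
    n_features = X.shape[1]
    w, b = cp.Variable(n_features), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) - tau * cp.abs(D @ w) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value, b.value
```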
5 0.48219761 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
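For orientation, a generic normalized-cut style spectral clustering pipeline is sketched below (Gaussian affinity, normalized Laplacian, k smallest eigenvectors, k-means on the embedding); it is the standard recipe rather than the specific cost function or power-method approximation proposed in the paper, and the kernel width sigma is an arbitrary choice.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0, seed=0):
    """Standard spectral clustering on a Gaussian similarity matrix."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))               # similarity matrix
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_sym = np.eye(len(X)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    _, vecs = eigh(L_sym)                                   # eigenvalues in ascending order
    U = vecs[:, :k]                                         # k smallest eigenvectors
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)  # row-normalise
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```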
6 0.45233443 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
7 0.41786855 176 nips-2003-Sequential Bayesian Kernel Regression
8 0.41677621 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
9 0.39491281 113 nips-2003-Learning with Local and Global Consistency
10 0.37798274 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
11 0.37592852 124 nips-2003-Max-Margin Markov Networks
12 0.37212846 145 nips-2003-Online Classification on a Budget
13 0.3661426 128 nips-2003-Minimax Embeddings
14 0.35866243 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
15 0.35215092 46 nips-2003-Clustering with the Connectivity Kernel
16 0.34778199 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
17 0.34303573 73 nips-2003-Feature Selection in Clustering Problems
18 0.3407082 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
19 0.33043125 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
20 0.32312387 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
topicId topicWeight
[(0, 0.069), (11, 0.029), (30, 0.014), (35, 0.063), (53, 0.107), (58, 0.27), (66, 0.016), (71, 0.079), (76, 0.057), (85, 0.067), (91, 0.098), (99, 0.045)]
simIndex simValue paperId paperTitle
1 0.87761909 170 nips-2003-Self-calibrating Probability Forecasting
Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov
Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.
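A sketch of one plausible reading of a 1-Nearest-Neighbour Venn machine for binary labels follows: each postulated label for the new object induces a taxonomy (every example is categorised by the label of its nearest neighbour), and the output for that label is the empirical label distribution inside the new object's category, so the overall output is a small set of distributions. This illustrates the general Venn construction and is not necessarily the authors' exact taxonomy.

```python
import numpy as np

def one_nn_venn(X_train, y_train, x_new, labels=(-1, 1)):
    """Return {postulated label: empirical label distribution in the new object's
    1-NN category}, i.e. the 'multiprobability' output."""
    out = {}
    for y_cand in labels:
        X = np.vstack([X_train, x_new[None, :]])
        y = np.append(y_train, y_cand)
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)            # an example is not its own neighbour
        category = y[np.argmin(dists, axis=1)]     # taxonomy: nearest neighbour's label
        group = category == category[-1]           # examples sharing the new object's category
        out[y_cand] = {lab: float(np.mean(y[group] == lab)) for lab in labels}
    return out
```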
same-paper 2 0.80305851 48 nips-2003-Convex Methods for Transduction
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
3 0.77277845 41 nips-2003-Boosting versus Covering
Author: Kohei Hatano, Manfred K. Warmuth
Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all their positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k^2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k^2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction as well as in simple experiments based on artificial data. Further we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1
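For reference, the covering baseline the abstract compares against can be sketched as a greedy set cover over candidate literals; the version below learns a monotone disjunction over Boolean features and is only an illustration of that baseline, not of AdaBoost or InfoBoost.

```python
import numpy as np

def greedy_disjunction(X, y):
    """Greedy set cover for a monotone disjunction: X is a 0/1 matrix, y in {0, 1}.
    Admissible literals never fire on a negative example; each greedy step adds
    the literal covering the most still-uncovered positive examples."""
    pos, neg = X[y == 1], X[y == 0]
    admissible = np.where(neg.sum(axis=0) == 0)[0]
    uncovered = np.ones(len(pos), dtype=bool)
    chosen = []
    while uncovered.any() and admissible.size:
        gains = pos[uncovered][:, admissible].sum(axis=0)
        if gains.max() == 0:
            break                      # remaining positives cannot be covered consistently
        best = int(admissible[np.argmax(gains)])
        chosen.append(best)
        uncovered &= pos[:, best] == 0
    return chosen                      # predict 1 iff any chosen feature equals 1
```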
4 0.60249603 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
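As a baseline for the value-function model discussed in the abstract, plain Gaussian process regression with a squared-exponential kernel is sketched below; it is a generic stand-in (hyperparameters chosen arbitrarily), not the paper's closed-form policy-evaluation machinery.

```python
import numpy as np

def gp_posterior_mean(X_train, y_train, X_test, ell=1.0, sf=1.0, noise=1e-2):
    """Posterior mean of a GP with squared-exponential covariance."""
    def k(A, B):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return sf ** 2 * np.exp(-0.5 * sq / ell ** 2)
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    return k(X_test, X_train) @ np.linalg.solve(K, y_train)
```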
5 0.5973711 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
6 0.59720212 189 nips-2003-Tree-structured Approximations by Expectation Propagation
7 0.59665871 107 nips-2003-Learning Spectral Clustering
8 0.59657401 113 nips-2003-Learning with Local and Global Consistency
9 0.59016293 126 nips-2003-Measure Based Regularization
10 0.58991593 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
11 0.58924645 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
12 0.58899134 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
13 0.58587044 81 nips-2003-Geometric Analysis of Constrained Curves
14 0.58574808 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
15 0.58531928 143 nips-2003-On the Dynamics of Boosting
16 0.58467072 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
17 0.58359039 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
18 0.58283836 112 nips-2003-Learning to Find Pre-Images
19 0.58272392 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
20 0.58208936 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images