nips nips2003 nips2003-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
Reference: text
sentIndex sentText sentNum sentScore
1 net Abstract The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. [sent-11, score-0.474]
2 In this form, the problem has exponential computational complexity in the size of the working set. [sent-12, score-0.221]
3 So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. [sent-13, score-0.072]
4 In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. [sent-14, score-0.287]
5 To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. [sent-16, score-0.118]
6 1 Introduction The general transduction task is the following: given a training set of labelled data, and a working set of unlabelled data (also called transduction samples), estimate the value of a classification function at the given points in the working set. [sent-18, score-0.963]
7 Given this general task, much research has been concentrated on a specific approach to transduction (first proposed by Vapnik [1]), based on the use of Support Vector Machines (SVM’s). [sent-21, score-0.247]
8 In this case, the algorithm is aimed at finding a separating hyperplane for the training set that is also maximally distant from the (unlabelled) working set. [sent-22, score-0.332]
9 This hyperplane is used to predict the labels for the working set points. [sent-23, score-0.251]
10 A recent development of convex optimization theory is Semi Definite Programming (SDP), a branch of that field aimed at optimizing over the cone of semi positive definite (SPD) matrices. [sent-26, score-0.206]
11 One of its main attractions is that it has proven successful in constructing tight convex relaxations of hard combinatorial optimization problems [7]. [sent-27, score-0.185]
12 In this paper we show how to relax the problem of transduction into an SDP problem, that can then be solved by (polynomial time) convex optimization methods. [sent-29, score-0.421]
13 Empirical results on mid-sized data sets are very promising, however, due to the dimensionality of the feasible region of the relaxed parameters, still the algorithm complexity appears too large to tackle large scale problems. [sent-30, score-0.216]
14 Therefore, we subsequently shrink the feasible region by making an approximation that is based on a spectral clustering method. [sent-31, score-0.277]
15 Based on the dual of the 1-norm soft margin SVM with zero bias1 , the dual formulation of the transductive SVM optimization problem can be written as a minimization of the dual SVM cost function (which is the inverse margin plus training errors) over label matrix Γ ([1], p. [sent-34, score-0.743]
16 C ≥ αi ≥ 0 (2) Γ= yt yw · yt yw ′ (3) w yi ∈ {1, −1} (4) The (symmetric) matrix Γ is thus parameterized by the unknown working set label vector yw ∈ {−1, 1}nw (with nw the size of the working set). [sent-37, score-1.454]
17 The vector yt ∈ {−1, 1}nt (with nt the number of training points) is the given fixed vector containing the known labels for the training points. [sent-38, score-0.946]
18 The (symmetric) matrix K ∈ ℜ(nw +nt )×(nw +nt ) is the entire kernel matrix on the training set together with the working set. [sent-39, score-0.368]
19 The computational complexity scales exponentially in the size of the working set. [sent-43, score-0.186]
20 Scalars are lower case; vectors boldface lower case; matrices boldface upper case. [sent-45, score-0.072]
21 For ease of notation, the training part of the label matrix (and thus also of the kernel matrix) is always assumed to be its upper nt × nt block (as is assumed already in (3)). [sent-48, score-1.267]
22 Furthermore, the nt+ positive training samples are assumed to correspond to the first entries in yt , the nt− negative samples being at the end of this vector. [sent-49, score-0.576]
23 2 Relaxation to an SDP problem In this section, we will gradually derive a relaxed version of the transductive SVM formulation. [sent-50, score-0.209]
24 To start with, we replace some of the constraints by an equivalent set: Proposition 2. [sent-51, score-0.085]
25 1 (3) and (4) are equivalent with the following set of constraints: t t (5) [Γ]i,j∈{1:nt ,1:nt } = yi yj 1 We do not include a bias term since this would make the problem too non-convex. [sent-52, score-0.06]
26 It is basically the rank constraint that makes the resulting constrained optimization problem combinatorial. [sent-55, score-0.322]
27 Note that these constraints imply that Γ is semi positive definite (SPD): Γ 0 (this follows trivially from (3), or from (6) together with (7)). [sent-56, score-0.161]
28 Now, in literature (see eg [7]) it is observed that such an SPD rank one constraint can often be relaxed to only the SPD constraint without sacrificing too much of the performance. [sent-57, score-0.398]
29 2 If we relax the constraints by replacing (7) with Γ 0, (8) the optimization problem becomes convex. [sent-59, score-0.189]
30 While this relaxation of the rank constraint makes the optimization problem convex, the result will not be a rank one matrix anymore; it will only provide an approximation for the optimal rank one matrix. [sent-62, score-0.653]
31 1 A principal submatrix of an SPD matrix is also SPD [10]. [sent-65, score-0.123]
32 By applying this lemma on all 2 × 2 principal submatrices of Γ, it is shown that Corollary 2. [sent-66, score-0.074]
33 This is the problem will solve here: optimize (1) subject to (2), (5), (6) and (8). [sent-68, score-0.061]
34 In the remainder of this section we will reformulate the optimization problem into a standard form of SDP, make further simplifications based on the problem structure, and show how to extract an approximation for the labels from the result. [sent-69, score-0.235]
35 1 Formulation as a standard SDP problem In the derivations in this subsection the equality constraints (5) and (6) will not be stated for brevity. [sent-71, score-0.095]
36 Furthermore, in the implementation, they will be enforced explicitly by the parameterization, thus they will not appear as constraints in the optimization problem. [sent-73, score-0.113]
37 Also the SPD constraint (8) is not written every time, it should be understood. [sent-74, score-0.111]
38 Let 2ν ≥ 0 be the Lagrange dual variables corresponding to constraint αi ≥ 0 and 2µ ≥ 0 corresponding to constraint αi ≤ C. [sent-75, score-0.267]
39 with as additional constraint that (e + ν − µ) is orthogonal to the null space of K ⊙ Γ. [sent-81, score-0.148]
40 This latter constraint and the quadratic constraint can be reformulated as one SPD constraint thanks to the following extension of the Schur complement lemma [10] (the proof is omitted due to space restrictions): Lemma 2. [sent-82, score-0.45]
41 2 (Extended Schur complement lemma) For symmetric A and C ≻ 0: The column space of B ⊥ the null space of A A B ⇔ 0. [sent-83, score-0.17]
42 B′ C C B ′ A† B 0 Indeed, applying this lemma to our problem with A = K ⊙ Γ, B = e + ν − µ and C = t − 2Cµ′ e, leads to the problem formulation in the standard SDP form: min t (9) Γ,ν ≥0,µ≥0,t K⊙Γ (e + ν − µ)′ s. [sent-84, score-0.2]
43 (e + ν − µ) t − 2Cµ′ e 0 (10) together with the constraints (5), (6) and (8). [sent-86, score-0.087]
44 The relaxation for the hard margin SVM is found by following a very similar derivation, or by just equating µ to 0. [sent-87, score-0.165]
45 The number of variables specifying Γ, and the size of constraint (8) can be greatly reduced due to structure in the problem. [sent-88, score-0.137]
46 2 Simplifications due to the problem structure ′ Γc where we have a Γw ′ training block yt yt ∈ ℜnt ×nt , cross blocks Γc ∈ ℜnt ×nw and Γc ′ , and a transduction block Γw ∈ ℜnw ×nw , which is a symmetric matrix with diagonal entries equal to 1. [sent-91, score-1.189]
47 2, it follows that γ c is proportional to yt (denoted by γ c = gi yt ), and i i 1 γ c ′ yt yt i ′ † t t′ γ c = γ c ′ yyy 4 γ c . [sent-95, score-1.1]
48 This implies that 1 ≥ gi yt t i i i ′ yt yt ′ t yt 4 y gi 2 = gi such that −1 ≤ gi ≤ 1. [sent-96, score-1.316]
49 (Note that this is a corollary of the SPD constraint and does not need to be imposed explicitly. [sent-97, score-0.136]
50 ) Thus, the parameterization of Γ can be reduced to: ′ yt yt yt g′ with Γw = 1 ′ ii gyt Γw where g is the vector with gi as ith entry. [sent-98, score-0.926]
51 3 The constraint Γ 0 is equivalent to (and can thus be replaced by) the following SPD constraint on a smaller matrix Γ: Γ= 1 g g′ Γw 0. [sent-100, score-0.316]
52 Since Γ is a principal submatrix of Γ (assuming at least one training label is equal to 1), lemma 2. [sent-101, score-0.358]
53 On the other hand, note that by adding a column and corresponding row to Γ, the rank is not increased. [sent-103, score-0.144]
54 Due to the interlacing property for bordered matrices [10] and the fact that Γ 0, we know this can only be the smallest eigenvalue of the resulting matrix. [sent-105, score-0.085]
55 For the soft margin case, the number of n2 +5n n2 +3n parameters is now 1+2nt + w 2 w . [sent-108, score-0.094]
56 3 Extraction of an estimate for the labels from Γ In general, the optimal Γ will of course not be rank one. [sent-111, score-0.174]
57 We can approximate it by a rank one matrix however, by taking g as an approximation for the labels optimizing the unrelaxed problem. [sent-112, score-0.376]
58 This is the approach we adopt: a thresholded value of the entries of g will be taken as a guess for the labels of the working set. [sent-113, score-0.34]
59 Note that the minimum of the relaxed problem is always smaller than or equal to the minimum of the unrelaxed problem. [sent-114, score-0.245]
60 Furthermore, the minimum of the unrelaxed problem is smaller than or equal to the value achieved by the thresholded relaxed labels. [sent-115, score-0.276]
61 Especially the limitation on the working set is a drawback, since the advantage of transduction becomes apparent especially for a large working set as compared to the number of training samples. [sent-120, score-0.586]
62 3 Subspace SDP formulation However, if we would know a subspace (spanned by the d columns of a matrix V ∈ ℜ(nt +nw )×d ) in which (or close to which) the label vector lies, we can restrict the feasible region for Γ, leading to a much more efficient algorithm. [sent-122, score-0.439]
63 If we know that the true label vector y lies in the column space of a matrix V, we know the true label matrix can be written in the form Γ = VMV′ , with M a symmetric matrix. [sent-125, score-0.506]
64 Furthermore, constraint (8) that Γ 0 is then equivalent to M 0, which is a cheaper constraint. [sent-127, score-0.136]
65 Note however that in practical cases, the true label vector will not lie within but only close to the subspace spanned by the columns of V. [sent-128, score-0.248]
66 Then the diagonal of the label matrix Γ can not always be made exactly equal to e as required by (6). [sent-129, score-0.232]
67 We thus relax this constraint to the requirement that the diagonal is not larger than 2 The worst case complexity for the problem at hand is O((nt +n2 )2 (nt +nw )2. [sent-130, score-0.237]
68 5 −1 −1 −1 0 1 0 20 40 60 −1 −1 0 1 Figure 1: The left picture shows 10 labelled samples represented by a ’o’ or a ’+’, depending on their class, together with 60 unlabelled samples represented by a ’·’. [sent-139, score-0.36]
69 The middle picture shows the labels for the working set as estimated using the SDP method before thresholding: all are already invisibly close to 1 or −1. [sent-140, score-0.311]
70 The right picture shows contour lines of the classification surface obtained by training an SVM using all labels as found by the SDP method. [sent-141, score-0.191]
71 The method clearly finds a visually good label assignment that takes cluster structure in the data into account. [sent-142, score-0.139]
72 Similarly, the block in the label matrix corresponding to the training samples may not contain 1’s and −1’s exactly (constraint (5)). [sent-144, score-0.368]
73 However, the better V is chosen, the better this constraint will be met. [sent-145, score-0.111]
74 Thus we optimize (9) subject to (10) together with three constraints that replace the constraints (5), (6) and (8): Γ = VMV′ diag(Γ) ≤ e M 0 Thus we can approximate the relaxed transductive SVM using this reduced parameterization for Γ. [sent-146, score-0.43]
75 The number of effective variables is now only a linear function of nw : 1 + nt + nw + d(d + 1)/2 for a hard margin and 1 + 2(nt + nw ) + d(d + 1)/2 for a soft margin SVM. [sent-147, score-1.658]
76 Furthermore, one of the SPD constraints is now a constraint on a d × d matrix instead of a potentially large (nw + 1) × (nw + 1) matrix. [sent-148, score-0.24]
77 For a constant d, the worst case complexity is thus reduced to O((nt + nw )4. [sent-149, score-0.406]
78 4 Spectral transduction to find the subspace In this section we will discuss how to find a subspace V close to which the label vector will lie. [sent-152, score-0.473]
79 Our approach is based on the spectral clustering algorithm proposed in [11]. [sent-153, score-0.159]
80 The optimization problem corresponding to this eigenvalue problem is: max v 4. [sent-156, score-0.176]
81 (11) Constrained spectral clustering We could apply this algorithm to the kernel matrix K, but we can do more since we already know some of the labels: we will constrain the estimates of the labels for the training samples that are known to be in the same class to be equal to each other. [sent-160, score-0.567]
82 This can be achieved by choosing the following parameterization for v: √ ent+ / nt+ 0 0 ht+ √ 0 ent− / nt− 0 · ht− = Lh v= hw 0 0 I where en+ and en− denote the vectors containing nt+ (the number of positive training samples) and nt− (the number of negative training samples) ones. [sent-162, score-0.217]
83 1 Optimization problem (11) is equivalent with: max h h′ L′ D−1/2 KD−1/2 Lh s. [sent-164, score-0.06]
84 h′ h = 1 which corresponds to the eigenvalue problem L′ D−1/2 KD−1/2 Lh = λh. [sent-166, score-0.088]
85 This is an extension of spectral clustering towards transduction3 . [sent-168, score-0.159]
86 2 Spectral transduction provides a good V By construction, all entries of vi corresponding to positive training samples will be √ equal to ht+ / nt+ ; entries corresponding to the negative ones will all be equal to i √ ht− / nt− . [sent-171, score-0.732]
87 Furthermore, as in spectral clustering, the other entries of vectors vi i with large eigenvalue λi will reflect the cluster structure of the entire data set, while respecting the label assignment of the training points however4 . [sent-172, score-0.495]
88 This means that such a vi will provide a good approximation for the labels. [sent-173, score-0.098]
89 More specifically, the label vector will lie close to the column space of V, having d dominant ‘centered’ vi as its columns; the larger d, the better the approximation. [sent-174, score-0.246]
90 The way we ‘center’ vi is by adding a constant so that entries for positive training samples become equal to minus those for the negative ones. [sent-175, score-0.361]
91 Since then the first nt columns of the resulting Γ = VMV′ will be equal up to a sign, we can adopt basically the same approach as in section 2. [sent-176, score-0.603]
92 3 to guess the labels: pick and threshold the first column of Γ. [sent-177, score-0.075]
93 The positive class is formed by 100 randomly chosen samples representing a number 0, and 100 representing a 1; the negative class by 100 samples representing a 2 and 100 representing a 3. [sent-179, score-0.184]
94 The training set is chosen to contain only 10 samples from each of both classes, and is randomly drawn but evenly distributed over the 4 numbers. [sent-181, score-0.146]
95 4 Note: to reduce the influence from outliers, large entries of the vi can be thresholded. [sent-185, score-0.133]
96 To illustrate the scalability of the method, and to show that a larger working set is effectively exploited, we used a similar setting (same training set size) but with 1000 samples and d = 3, giving an average ROC-score of 0. [sent-192, score-0.282]
97 6 Conclusions We developed a relaxation for the transductive SVM as first proposed by Vapnik. [sent-195, score-0.16]
98 It is shown how this combinatorial problem can be relaxed to an SDP problem. [sent-196, score-0.164]
99 Unfortunately, the number of variables in combination with the complexity of SDP is too high for it to scale to significant problem sizes. [sent-197, score-0.085]
100 Therefore we show how, based on a new spectral method, the feasible region of the variables can be shrinked, leading to an approximation for the original SDP method. [sent-198, score-0.221]
wordName wordTfidf (topN-words)
[('nt', 0.476), ('nw', 0.33), ('sdp', 0.33), ('spd', 0.294), ('yt', 0.257), ('transduction', 0.247), ('working', 0.136), ('constraint', 0.111), ('label', 0.107), ('spectral', 0.103), ('rank', 0.095), ('transductive', 0.093), ('unlabelled', 0.085), ('svm', 0.085), ('relaxed', 0.081), ('samples', 0.079), ('labels', 0.079), ('lemma', 0.074), ('unrelaxed', 0.073), ('vmv', 0.073), ('gi', 0.072), ('matrix', 0.069), ('entries', 0.068), ('relaxation', 0.067), ('training', 0.067), ('vi', 0.065), ('kd', 0.065), ('lh', 0.064), ('constraints', 0.06), ('margin', 0.059), ('parameterization', 0.057), ('clustering', 0.056), ('equal', 0.056), ('ht', 0.055), ('submatrix', 0.054), ('yw', 0.054), ('optimization', 0.053), ('feasible', 0.053), ('eigenvalue', 0.053), ('schur', 0.051), ('complexity', 0.05), ('column', 0.049), ('bie', 0.049), ('nello', 0.049), ('tijl', 0.049), ('semi', 0.049), ('combinatorial', 0.048), ('subspace', 0.047), ('cristianini', 0.046), ('block', 0.046), ('dual', 0.045), ('convex', 0.045), ('picture', 0.045), ('labelled', 0.045), ('semide', 0.045), ('proposition', 0.043), ('complement', 0.043), ('columns', 0.043), ('relax', 0.041), ('symmetric', 0.041), ('ent', 0.039), ('furthermore', 0.039), ('hard', 0.039), ('programming', 0.037), ('null', 0.037), ('boldface', 0.036), ('hyperplane', 0.036), ('problem', 0.035), ('soft', 0.035), ('diag', 0.034), ('approximation', 0.033), ('cluster', 0.032), ('distant', 0.032), ('restrictions', 0.032), ('aimed', 0.032), ('know', 0.032), ('region', 0.032), ('duality', 0.031), ('thresholded', 0.031), ('formulation', 0.031), ('maximally', 0.029), ('en', 0.029), ('basically', 0.028), ('together', 0.027), ('optimizing', 0.027), ('vapnik', 0.027), ('already', 0.026), ('optimize', 0.026), ('reduced', 0.026), ('negative', 0.026), ('spanned', 0.026), ('guess', 0.026), ('inductive', 0.026), ('min', 0.025), ('close', 0.025), ('imply', 0.025), ('induction', 0.025), ('berlin', 0.025), ('corollary', 0.025), ('equivalent', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 48 nips-2003-Convex Methods for Transduction
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
2 0.15124901 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
3 0.13055184 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
Author: Philip Derbeko, Ran El-Yaniv, Ron Meir
Abstract: This paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering. 1 Introduction and Related Work In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or “transduce”) the labels of the test points. The transduction setting was introduced by Vapnik [1, 2] who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of points in the test set can be done using an inductive scheme. However, as pointed out in [2], it makes little sense to solve an easier problem by ‘reducing’ it to a much more difficult one. In particular, the prior knowledge carried by the (unlabeled) test points can be incorporated into an algorithm, potentially leading to superior performance. Indeed, a number of papers have demonstrated empirically that transduction can offer substantial advantage over induction whenever the training set is small or moderate (see e.g. [3, 4, 5, 6]). However, unlike the current state of affairs in induction, the question of what are provably effective learning principles for transduction is quite far from being resolved. In this paper we provide new error bounds and a general technique for transductive learning. Our technique is based on bounds that can be viewed as an extension of McAllester’s PAC-Bayesian framework [7, 8] to transductive learning. The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). This flexibility allows for the choice of “compact priors” (with small support) and therefore, for tight bounds. Another simple observation is that the PAC-Bayesian framework can be operated with polynomially (in m, the training sample size) many different priors simultaneously. Altogether, this added flexibility, of using data-dependent multiple priors allows for easy derivation of tight error bounds for “compression schemes” such as (transductive) SVMs and for clustering algorithms. We briefly review some previous results. The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e.g. [2]), where an error bound is also proposed. However, this bound is implicit and rather unwieldy and, to the best of our knowledge, has not been applied in practical situations. A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. The bound is data-dependent depending on the number of decision nodes, the margins at each node and the sample size. However, the authors state that the transduction bound is not much tighter than the induction bound. Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error, however, the advantage is usually statistically insignificant. Refining the algorithm of [2] a transductive algorithm based on a SVMs is proposed in [3]. The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain. 
An error bound for transduction, based on the effective VC Dimension, is given in [10]. More recently Lanckriet et al. [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. Blum and Langford [12] recently also established an implicit bound for transduction, in the spirit of the results in [2]. 2 The Transduction Setup We consider the following setting proposed by Vapnik ([2] Chp. 8), which for simplicity is described in the context of binary classification (the general case will be discussed in the full paper). Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let Xm+u = {x1 , . . . , xm+u } be a set of points from X each of which is chosen i.i.d. according to some unknown distribution µ(x). We call Xm+u the full sample. Let Xm = {x1 , . . . , xm } and Ym = {y1 , . . . , ym }, where Xm is drawn uniformly from Xm+u and yi ∈ {±1}. The set Sm = {(x1 , y1 ), . . . , (xm , ym )} is referred to as a training sample. In this paper we assume that yi = φ(xi ) for some unknown function φ. The remaining subset Xu = Xm+u \ Xm is referred to as the unlabeled sample. Based on Sm and Xu our goal is to choose h ∈ H which predicts the labels of points in Xu as accurately as possible. For each h ∈ H and a set Z = x1 , . . . , x|Z| of samples define 1 Rh (Z) = |Z| |Z| (h(xi ), yi ), (1) i=1 where in our case (·, ·) is the zero-one loss function. Our goal in transduction is to learn an h such that Rh (Xu ) is as small as possible. This problem setup is summarized by the following transduction “protocol” introduced in [2] and referred to as Setting 1: (i) A full sample Xm+u = {x1 , . . . , xm+u } consisting of arbitrary m + u points is given.1 (ii) We then choose uniformly at random the training sample Xm ⊆ Xm+u and receive its labeling Ym ; the resulting training set is Sm = (Xm , Ym ) and the remaining set Xu is the unlabeled sample, Xu = Xm+u \ Xm ; (iii) Using both Sm and Xu we select a classifier h ∈ H whose quality is measured by Rh (Xu ). Vapnik [2] also considers another formulation of transduction, referred to as Setting 2: (i) We are given a training set Sm = (Xm , Ym ) selected i.i.d according to µ(x, y). (ii) An independent test set Su = (Xu , Yu ) of u samples is then selected in the same manner. 1 The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently at random according to some source distribution µ(x). (iii) We are required to choose our best h ∈ H based on Sm and Xu so as to minimize m+u Rm,u (h) = 1 (h(xi ), yi ) dµ(x1 , y1 ) · · · dµ(xm+u , ym+u ). u i=m+1 (2) Even though Setting 2 may appear more applicable in practical situations than Setting 1, the derivation of theoretical results can be easier within Setting 1. Nevertheless, as far as the expected losses are concerned, Vapnik [2] shows that an error bound in Setting 1 implies an equivalent bound in Setting 2. In view of this result we restrict ourselves in the sequel to Setting 1. We make use of the following quantities, which are all instances of (1). The quantity Rh (Xm+u ) is called the full sample risk of the hypothesis h, Rh (Xu ) is referred to as the transduction risk (of h), and Rh (Xm ) is the training error (of h). Thus, Rh (Xm ) is ˆ the standard training error denoted by Rh (Sm ). While our objective in transduction is to achieve small error over the unlabeled set (i.e. to minimize Rh (Xu )), it turns out that it is much easier to derive error bounds for the full sample risk. 
The following simple lemma translates an error bound on Rh (Xm+u ), the full sample risk, to an error bound on the transduction risk Rh (Xu ). Lemma 2.1 For any h ∈ H and any C ˆ Rh (Xm+u ) ≤ Rh (Sm ) + C ⇔ ˆ Rh (Xu ) ≤ Rh (Sm ) + m+u · C. u (3) Proof: For any h Rh (Xm+u ) = mRh (Xm ) + uRh (Xu ) . m+u (4) ˆ Substituting Rh (Sm ) for Rh (Xm ) in (4) and then substituting the result for the left-hand side of (3) we get Rh (Xm+u ) = ˆ mRh (Sm ) + uRh (Xu ) ˆ ≤ Rh (Sm ) + C. m+u The equivalence (3) is now obtained by isolating Rh (Xu ) on the left-hand side. 2 3 General Error Bounds for Transduction Consider a hypothesis class H and assume for simplicity that H is countable; in fact, in the case of transduction it suffices to consider a finite hypothesis class. To see this note that all m + u points are known in advance. Thus, in the case of binary classification (for example) it suffices to consider at most 2m+u possible dichotomies. Recall that in the setting considered we select a sub-sample of m points from the set Xm+u of cardinality m+u. This corresponds to a selection of m points without replacement from a set of m+u points, leading to the m points being dependent. A naive utilization of large deviation bounds would therefore not be directly applicable in this setting. However, Hoeffding (see Theorem 4 in [13]) pointed out a simple procedure to transform the problem into one involving independent data. While this procedure leads to non-trivial bounds, it does not fully take advantage of the transductive setting and will not be used here. Consider for simplicity the case of binary classification. In this case we make use of the following concentration inequality, based on [14]. Theorem 3.1 Let C = {c1 , . . . , cN }, ci ∈ {0, 1}, be a finite set of binary numbers, and N set c = (1/N ) i=1 ci . Let Z1 , . . . , Zm , be random variables obtaining their values ¯ by sampling C uniformly at random without replacement. Set Z = (1/m) β = m/N . Then, if 2 ε ≤ min{1 − c, c(1 − β)/β}, ¯¯ Pr {Z − EZ > ε} ≤ exp −mD(¯ + ε c) − (N − m) D c − c ¯ ¯ m i=1 Zi and βε c + 7 log(N + 1) ¯ 1−β where D(p q) = p log(p/q) = (1 − p) log(1 − p)/(1 − q), p, q, ∈ [0, 1] is the binary Kullback-Leibler divergence. Using this result we obtain the following error bound for transductive classification. Theorem 3.2 Let Xm+u = Xm ∪Xu be the full sample and let p = p(Xm+u ) be a (prior) distribution over the class of binary hypotheses H that may depend on the full sample. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over choices of Sm (from the full sample) the following bound holds for any h ∈ H, ˆ 2Rh (Sm )(m + u) u ˆ Rh (Xu ) ≤ Rh (Sm ) + + 2 log 1 p(h) log + ln m + 7 log(m + u + 1) δ m−1 + ln m + 7 log(m + u + 1) δ m−1 1 p(h) . (5) Proof: (sketch) In our transduction setting the set Xm (and therefore Sm ) is obtained by sampling the full sample Xm+u uniformly at random without replacement. We first claim that ˆ EΣm Rh (Sm ) = Rh (Xm+u ), (6) where EΣm (·) is the expectation with respect to a random choice of Sm from Xm+u without replacement. This is shown as follows. ˆ EΣm Rh (Sm ) = 1 m+u m ˆ Rh (Sm ) = Sm 1 m+u m Xm ⊆Xm+n 1 m (h(x), φ(x)). x∈Sm By symmetry, all points x ∈ Xm+u are counted on the right-hand side an equal number of times; this number is precisely m+u − m+u−1 = m+u−1 . The equality (6) is obtained m m m−1 m by considering the definition of Rh (Xm+u ) and noting that m+u−1 / m+u = m+u . m−1 m The remainder of the proof combines Theorem 3.1 and the techniques presented in [15]. 
The details will be provided in the full paper. 2 ˆ Notice that when Rh (Sm ) → 0 the square root in (5) vanishes and faster rates are obtained. An important feature of Theorem 3.2 is that it allows one to use the sample Xm+u in order to choose the prior distribution p(h). This advantage has already been alluded to in [2], but does not seem to have been widely used in practice. Additionally, observe that (5) holds with probability at least 1 − δ with respect to the random selection of sub-samples of size m from the fixed set Xm+u . This should be contrasted with the standard inductive setting results where the probabilities are with respect to a random choice of m training points chosen i.i.d. from µ(x, y). The next bound we present is analogous to McAllester’s Theorem 1 in [8]. This theorem concerns Gibbs composite classifiers, which are distributions over the base classifiers in H. For any distribution q over H denote by Gq the Gibbs classifier, which classifies an 2 The second condition, ε ≤ c(1 − β)/β, simply guarantees that the number of ‘ones’ in the ¯ sub-sample does not exceed their number in the original sample. , instance (in Xu ) by randomly choosing, according to q, one hypothesis h ∈ H. For Gibbs classifiers we now extend definition (1) as follows. Let Z = x1 , . . . , x|Z| be any set of samples and let Gq be a Gibbs classifier over H. The risk of Gq over Z is RGq (Z) = Eh∼q (1/|Z|) |Z| i=1 (h(xi ), φ(xi )) . As before, when Z = Xm (the training set) we ˆ use the standard notation RGq (Sm ) = RGq (Xm ). Due to space limitations, the proof of the following theorem will appear in the full paper. Theorem 3.3 Let Xm+u be the full sample. Let p be a distribution over H that may depend on Xm+u and let q be a (posterior) distribution over H that may depend on both Sm and Xu . Let δ ∈ (0, 1) be given. With probability at least 1 − δ over the choices of Sm for any distribution q ˆ RGq (Xu ) ≤ RGq (Sm ) + + ˆ 2RGq (Sm )(m + u) u D(q p) + ln m + 7 log(m + u + 1) δ m−1 7 2 D(q p) + ln m + m log(m + u + 1) δ m−1 . In the context of inductive learning, a major obstacle in generating meaningful and effective bounds using the PAC-Bayesian framework [8] is the construction of “compact priors”. Here we discuss two extensions to the PAC-Bayesian scheme, which together allow for easy choices of compact priors that can yield tight error bounds. The first extension we offer is the use of multiple priors. Instead of a single prior p in the original PACBayesian framework we observe that one can use all PAC-Bayesian bounds with a number of priors p1 , . . . , pk and then replace the complexity term ln(1/p(h)) (in Theorem 3.2) by mini ln(1/pi (h)), at a cost of an additional ln k term (see below). Similarly, in Theorem 3.3 we can replace the KL-divergence term in the bound with mini D(q||pi ). The penalty for using k priors is logarithmic in k (specifically the ln(1/δ) term in the original bound becomes ln(k/δ)). As long as k is sub-exponential in m we still obtain effective generalization bounds. The second “extension” is simply the feature of our transduction bounds (Theorems 3.2 and 3.3), which allows for the priors to be dependent on the full sample Xm+u . The combination of these two simple ideas yields a powerful technique for deriving error bounds in realistic transductive settings. After stating the extended result we later use it for deriving tight bounds for known learning algorithms and for deriving new algorithms. 
Suppose that instead of a single prior p over H we want to utilize k priors, p1 , . . . , pk and in retrospect choose the best among the k corresponding PAC-Bayesian bounds. The following theorem shows that one can use polynomially many priors with a minor penalty. The proof, which is omitted due to space limitations, utilizes the union bound in a straightforward manner. Theorem 3.4 Let the conditions of Theorem 3.2 hold, except that we now have k prior distributions p1 , . . . , pk defined over H, each of which may depend on Xm+u . Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over random choices of sub-samples of size m from the full-sample, for all h ∈ H, (5) holds with p(h) replaced by min1≤i≤k pi (h) and log 1 is replaced by log k . δ δ Remark: A similar result holds for the Gibbs algorithm of Theorem 3.3. Also, as noted by one of the reviewers, when the supports of the k priors intersect (i.e. there is at least one pair of priors pi and pj with overlapping support), then one can do better by utilizing the 1 “super prior” p = k i pi within the original Theorem 3.2. However, note that when the supports are disjoint, these two views (of multiple priors and a super prior) are equivalent. In the applications below we utilize non-intersecting priors. 4 Bounds for Compression Algorithms Here we propose a technique for bounding the error of “compression” algorithms based on appropriate construction of prior probabilities. Let A be a learning algorithm. Intuitively, A is a “compression scheme” if it can generate the same hypothesis using a subset of the data. More formally, a learning algorithm A (viewed as a function from samples to some hypothesis class) is a compression scheme with respect to a sample Z if there is a subsample Z , Z ⊂ Z, such that A(Z ) = A(Z). Observe that the SVM approach is a compression scheme, with Z being determined by the set of support vectors. Let A be a deterministic compression scheme and consider the full sample Xm+u . For each integer τ = 1, . . . , m, consider all subsets of Xm+u of size τ , and for each subset construct all possible dichotomies of that subset (note that we are not proposing this approach as an algorithm, but rather as a means to derive bounds; in practice one need not construct all these dichotomies). A deterministic algorithm A uniquely determines at most one hypothesis h ∈ H for each dichotomy.3 For each τ , let the set of hypotheses generated by this procedure be denoted by Hτ . For the rest of this discussion we assume the worst case where |Hτ | = m+u (i.e. if Hτ does not contains one hypothesis for each dichotomy τ the bounds improve). The prior pτ is then defined to be a uniform distribution over Hτ . In this way we have m priors, p1 , . . . , pm which are constructed using only Xm+u (and are independent of Sm ). Any hypothesis selected by the learning algorithm A based on the labeled sample Sm and on the test set Xu belongs to ∪m Hτ . The motivation for this τ =1 construction is as follows. Each τ can be viewed as our “guess” for the maximal number of compression points that will be utilized by a resulting classifier. For each such τ the prior pτ is constructed over all possible classifiers that use τ compression points. By systematically considering all possible dichotomies of τ points we can characterize a relatively small subset of H without observing labels of the training points. Thus, each prior pτ represents one such guess. 
Using Theorem 3.4 we are later allowed to choose in retrospect the bound corresponding to the best “guess”. The following corollary identifies an upper bound on the divergence in terms of the observed size of the compression set of the final classifier. Corollary 4.1 Let the conditions of Theorem 3.4 hold. Let A be a deterministic learning algorithm leading to a hypothesis h ∈ H based on a compression set of size s. Then with probability at least 1 − δ for all h ∈ H, (5) holds with log(1/p(h)) replaced by s log(2e(m + u)/s) and ln(m/δ) replaced by ln(m2 /δ). Proof: Recall that Hs ⊆ H is the support set of ps and that ps (h) = 1/|Hs | for all h ∈ Hs , implying that ln(1/ps (h)) = |Hs |. Using the inequality m+u ≤ (e(m + u)/s)s s we have that |Hs | = 2s m+u ≤ (2e(m + u)/s)s . Substituting this result in Theorem 3.4 s while restricting the minimum over i to be over i ≥ s, leads to the desired result. 2 The bound of Corollary 4.1 can be easily computed once the classifier is trained. If the size of the compression set happens to be small, we obtain a tight bound. SVM classification is one of the best studied compression schemes. The compression set for a sample Sm is given by the subset of support vectors. Thus the bound in Corollary 4.1 immediately applies with s being the number of observed support vectors (after training). We note that this bound is similar to a recently derived compression bound for inductive learning (Theorem 5.18 in [16]). Also, observe that the algorithm itself (inductive SVM) did not use in this case the unlabeled sample (although the bound does use this sample). Nevertheless, using exactly the same technique we obtain error bounds for the transductive SVM algorithms in [2, 3].4 3 It might be that for some dichotomies the algorithm will fail. For example, an SVM in feature space without soft margin will fail to classify non linearly-separable dichotomies of Xm+u . 4 Note however that our bounds are optimized with a “minimum number of support vectors” approach rather than “maximum margin”. 5 Bounds for Clustering Algorithms Some learning problems do not allow for high compression rates using compression schemes such as SVMs (i.e. the number of support vectors can sometimes be very large). A considerably stronger type of compression can often be achieved by clustering algorithms. While there is lack of formal links between entirely unsupervised clustering and classification, within a transduction setting we can provide a principled approach to using clustering algorithms for classification. Let A be any (deterministic) clustering algorithm which, given the full sample Xm+u , can cluster this sample into any desired number of clusters. We use A to cluster Xm+u into 2, 3 . . . , c clusters where c ≤ m. Thus, the algorithm generates a collection of partitions of Xm+u into τ = 2, 3, . . . , c clusters, where each partition is denoted by Cτ . For each value of τ , let Hτ consist of those hypotheses which assign an identical label to all points in the same cluster of partition Cτ , and define the prior pτ (h) = 1/2τ for each h ∈ Hτ and zero otherwise (note that there are 2τ possible dichotomies). The learning algorithm selects a hypothesis as follows. Upon observing the labeled sample Sm = (Xm , Ym ), for each of the clusterings C2 , . . . , Cc constructed above, it assigns a label to each cluster based on the majority vote from the labels Ym of points falling within the cluster (in case of ties, or if no points from Xm belong to the cluster, choose a label arbitrarily). 
Doing this leads to c − 1 classifiers hτ , τ = 2, . . . , c. For each hτ there is a valid error bound as given by Theorem 3.4 and all these bounds are valid simultaneously. Thus we choose the best classifier (equivalently, number of clusters) for which the best bound holds. We thus have the following corollary of Theorem 3.4 and Lemma 2.1. Corollary 5.1 Let A be any clustering algorithm and let hτ , τ = 2, . . . , c be classifications of test set Xu as determined by clustering of the full sample Xm+u (into τ clusters) and the training set Sm , as described above. Let δ ∈ (0, 1) be given. Then with probability at least 1 − δ, for all τ , (5) holds with log(1/p(h)) replaced by τ and ln(m/δ) replaced by ln(mc/δ). Error bounds obtained using Corollary 5.1 can be rather tight when the clustering algorithm is successful (i.e. when it captures the class structure in the data using a small number of clusters). Corollary 5.1 can be extended in a number of ways. One simple extension is the use of an ensemble of clustering algorithms. Specifically, we can concurrently apply k clustering algorithm (using each algorithm to cluster the data into τ = 2, . . . , c clusters). We thus obtain kc hypotheses (partitions of Xm+u ). By a simple application of the union bound we can replace ln cm by ln kcm in Corollary 5.1 and guarantee that kc bounds hold siδ δ multaneously for all kc hypotheses (with probability at least 1 − δ). We thus choose the hypothesis which minimizes the resulting bound. This extension is particularly attractive since typically without prior knowledge we do not know which clustering algorithm will be effective for the dataset at hand. 6 Concluding Remarks We presented new bounds for transductive learning algorithms. We also developed a new technique for deriving tight error bounds for compression schemes and for clustering algorithms in the transductive setting. We expect that these bounds and new techniques will be useful for deriving new error bounds for other known algorithms and for deriving new types of transductive learning algorithms. It would be interesting to see if tighter transduction bounds can be obtained by reducing the “slacks” in the inequalities we use in our analysis. Another promising direction is the construction of better (multiple) priors. For example, in our compression bound (Corollary 4.1), for each number of compression points we assigned the same prior to each possible point subset and each possible dichotomy. However, in practice a vast majority of all these subsets and dichotomies are unlikely to occur. Acknowledgments The work of R.E and R.M. was partially supported by the Technion V.P.R. fund for the promotion of sponsored research. Support from the Ollendorff center of the department of Electrical Engineering at the Technion is also acknowledged. We also thank anonymous referees for their useful comments. References [1] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982. [2] V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998. [3] T. Joachims. Transductive inference for text classification unsing support vector machines. In European Conference on Machine Learning, 1999. [4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceeding of The Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001. [5] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semisupervised learning. 
In Advances in Neural Information Processing Systems (NIPS 2001), pages 1025–1032, 2001. [6] T. Joachims. Transductive learning via spectral graph partitioning. In Proceeding of The Twentieth International Conference on Machine Learning (ICML-2003), 2003. [7] D. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999. [8] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003. [9] D. Wu, K. Bennett, N. Cristianini, and J. Shawe-Taylor. Large margin trees for induction and transduction. In International Conference on Machine Learning, 1999. [10] L. Bottou, C. Cortes, and V. Vapnik. On the effective VC dimension. Technical report, AT&T;, 1994. [11] G.R.G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M.I. Jordan. Learning the kernel matrix with semi-definite programming. Technical report, University of Berkeley, Computer Science Division, 2002. [12] A. Blum and J. Langford. Pac-mdl bounds. In COLT, pages 344–357, 2003. [13] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statis. Assoc., 58:13–30, 1963. [14] A. Dembo and O. Zeitouni. Large Deviation Techniques and Applications. Springer, New York, second edition, 1998. [15] D. McAllester. Simplified pac-bayesian margin bounds. In COLT, pages 203–215, 2003. [16] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Boston, 2002.
4 0.13029815 171 nips-2003-Semi-Definite Programming by Perceptron Learning
Author: Thore Graepel, Ralf Herbrich, Andriy Kharechko, John S. Shawe-taylor
Abstract: We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasable region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. 1
5 0.12948997 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
6 0.12215657 176 nips-2003-Sequential Bayesian Kernel Regression
7 0.095548496 145 nips-2003-Online Classification on a Budget
8 0.089935504 113 nips-2003-Learning with Local and Global Consistency
9 0.076754883 46 nips-2003-Clustering with the Connectivity Kernel
10 0.075627729 124 nips-2003-Max-Margin Markov Networks
11 0.074890494 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
12 0.073664628 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
13 0.072647519 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
14 0.071871608 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
15 0.071203262 128 nips-2003-Minimax Embeddings
16 0.061521623 1 nips-2003-1-norm Support Vector Machines
17 0.060475636 73 nips-2003-Feature Selection in Clustering Problems
18 0.059635121 121 nips-2003-Log-Linear Models for Label Ranking
19 0.059586927 148 nips-2003-Online Passive-Aggressive Algorithms
20 0.058991436 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
topicId topicWeight
[(0, -0.209), (1, -0.084), (2, -0.067), (3, -0.09), (4, 0.077), (5, 0.081), (6, -0.045), (7, 0.094), (8, 0.089), (9, -0.013), (10, 0.068), (11, -0.086), (12, -0.042), (13, 0.039), (14, -0.08), (15, 0.057), (16, -0.034), (17, -0.007), (18, -0.084), (19, -0.002), (20, -0.024), (21, -0.045), (22, -0.15), (23, -0.142), (24, 0.029), (25, -0.036), (26, 0.219), (27, 0.153), (28, -0.008), (29, -0.107), (30, -0.028), (31, -0.023), (32, -0.099), (33, -0.135), (34, 0.041), (35, -0.028), (36, 0.121), (37, -0.069), (38, 0.044), (39, 0.029), (40, 0.085), (41, 0.094), (42, 0.026), (43, 0.143), (44, 0.078), (45, -0.014), (46, 0.022), (47, 0.047), (48, -0.01), (49, 0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.94696873 48 nips-2003-Convex Methods for Transduction
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
2 0.60166562 171 nips-2003-Semi-Definite Programming by Perceptron Learning
Author: Thore Graepel, Ralf Herbrich, Andriy Kharechko, John S. Shawe-taylor
Abstract: We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasable region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. 1
3 0.56647074 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
Author: Philip Derbeko, Ran El-Yaniv, Ron Meir
Abstract: This paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering. 1 Introduction and Related Work In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or “transduce”) the labels of the test points. The transduction setting was introduced by Vapnik [1, 2] who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of points in the test set can be done using an inductive scheme. However, as pointed out in [2], it makes little sense to solve an easier problem by ‘reducing’ it to a much more difficult one. In particular, the prior knowledge carried by the (unlabeled) test points can be incorporated into an algorithm, potentially leading to superior performance. Indeed, a number of papers have demonstrated empirically that transduction can offer substantial advantage over induction whenever the training set is small or moderate (see e.g. [3, 4, 5, 6]). However, unlike the current state of affairs in induction, the question of what are provably effective learning principles for transduction is quite far from being resolved. In this paper we provide new error bounds and a general technique for transductive learning. Our technique is based on bounds that can be viewed as an extension of McAllester’s PAC-Bayesian framework [7, 8] to transductive learning. The main advantage of using this framework in transduction is that here priors can be selected after observing the unlabeled data (but before observing the labeled sample). This flexibility allows for the choice of “compact priors” (with small support) and therefore, for tight bounds. Another simple observation is that the PAC-Bayesian framework can be operated with polynomially (in m, the training sample size) many different priors simultaneously. Altogether, this added flexibility, of using data-dependent multiple priors allows for easy derivation of tight error bounds for “compression schemes” such as (transductive) SVMs and for clustering algorithms. We briefly review some previous results. The idea of transduction, and a specific algorithm for SVM transductive learning, was introduced and studied by Vapnik (e.g. [2]), where an error bound is also proposed. However, this bound is implicit and rather unwieldy and, to the best of our knowledge, has not been applied in practical situations. A PAC-Bayes bound [7] for transduction with Perceptron Decision Trees is given in [9]. The bound is data-dependent depending on the number of decision nodes, the margins at each node and the sample size. However, the authors state that the transduction bound is not much tighter than the induction bound. Empirical tests show that this transduction algorithm performs slightly better than induction in terms of the test error, however, the advantage is usually statistically insignificant. Refining the algorithm of [2] a transductive algorithm based on a SVMs is proposed in [3]. The paper also provides empirical tests indicating that transduction is advantageous in the text categorization domain. 
An error bound for transduction, based on the effective VC dimension, is given in [10]. More recently, Lanckriet et al. [11] derived a transductive bound for kernel methods based on spectral properties of the kernel matrix. Blum and Langford [12] recently also established an implicit bound for transduction, in the spirit of the results in [2].

2 The Transduction Setup
We consider the following setting proposed by Vapnik ([2], Chp. 8), which for simplicity is described in the context of binary classification (the general case will be discussed in the full paper). Let H be a set of binary hypotheses consisting of functions from input space X to {±1} and let Xm+u = {x1, . . . , xm+u} be a set of points from X, each of which is chosen i.i.d. according to some unknown distribution µ(x). We call Xm+u the full sample. Let Xm = {x1, . . . , xm} and Ym = {y1, . . . , ym}, where Xm is drawn uniformly from Xm+u and yi ∈ {±1}. The set Sm = {(x1, y1), . . . , (xm, ym)} is referred to as a training sample. In this paper we assume that yi = φ(xi) for some unknown function φ. The remaining subset Xu = Xm+u \ Xm is referred to as the unlabeled sample. Based on Sm and Xu our goal is to choose h ∈ H which predicts the labels of points in Xu as accurately as possible. For each h ∈ H and a set Z = x1, . . . , x|Z| of samples define

    Rh(Z) = (1/|Z|) Σ_{i=1}^{|Z|} ℓ(h(xi), yi),    (1)

where in our case ℓ(·, ·) is the zero-one loss function. Our goal in transduction is to learn an h such that Rh(Xu) is as small as possible. This problem setup is summarized by the following transduction “protocol”, introduced in [2] and referred to as Setting 1:
(i) A full sample Xm+u = {x1, . . . , xm+u} consisting of arbitrary m + u points is given. [Footnote 1: The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently at random according to some source distribution µ(x).]
(ii) We then choose uniformly at random the training sample Xm ⊆ Xm+u and receive its labeling Ym; the resulting training set is Sm = (Xm, Ym) and the remaining set Xu is the unlabeled sample, Xu = Xm+u \ Xm.
(iii) Using both Sm and Xu we select a classifier h ∈ H whose quality is measured by Rh(Xu).

Vapnik [2] also considers another formulation of transduction, referred to as Setting 2:
(i) We are given a training set Sm = (Xm, Ym) selected i.i.d. according to µ(x, y).
(ii) An independent test set Su = (Xu, Yu) of u samples is then selected in the same manner.
(iii) We are required to choose our best h ∈ H based on Sm and Xu so as to minimize

    Rm,u(h) = ∫ (1/u) Σ_{i=m+1}^{m+u} ℓ(h(xi), yi) dµ(x1, y1) · · · dµ(xm+u, ym+u).    (2)

Even though Setting 2 may appear more applicable in practical situations than Setting 1, the derivation of theoretical results can be easier within Setting 1. Nevertheless, as far as the expected losses are concerned, Vapnik [2] shows that an error bound in Setting 1 implies an equivalent bound in Setting 2. In view of this result we restrict ourselves in the sequel to Setting 1.

We make use of the following quantities, which are all instances of (1). The quantity Rh(Xm+u) is called the full sample risk of the hypothesis h, Rh(Xu) is referred to as the transduction risk (of h), and Rh(Xm) is the training error (of h). Thus, Rh(Xm) is the standard training error, denoted by R̂h(Sm). While our objective in transduction is to achieve small error over the unlabeled set (i.e. to minimize Rh(Xu)), it turns out that it is much easier to derive error bounds for the full sample risk.
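To make Setting 1 and the risk definition (1) concrete, the following minimal Python/numpy sketch draws a training sample without replacement from a synthetic full sample and evaluates the three risks just defined; the threshold rule playing the role of φ and the fixed hypothesis h are placeholders, not anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_one_risk(h, X, y):
    """R_h(Z) as in (1): average zero-one loss of h over the points in Z."""
    return float(np.mean(h(X) != y))

m, u = 30, 70
X_full = rng.normal(size=(m + u, 1))          # the full sample X_{m+u}
y_full = np.where(X_full[:, 0] > 0, 1, -1)    # a threshold rule standing in for phi

# Setting 1, step (ii): draw the m training indices uniformly without replacement.
train_idx = rng.choice(m + u, size=m, replace=False)
test_idx = np.setdiff1d(np.arange(m + u), train_idx)

h = lambda X: np.where(X[:, 0] > 0.1, 1, -1)  # an arbitrary fixed hypothesis

R_full  = zero_one_risk(h, X_full, y_full)                          # full sample risk
R_train = zero_one_risk(h, X_full[train_idx], y_full[train_idx])    # training error
R_test  = zero_one_risk(h, X_full[test_idx], y_full[test_idx])      # transduction risk

# Sanity check: the full sample risk is the m:u weighted mixture of the other two.
assert abs(R_full - (m * R_train + u * R_test) / (m + u)) < 1e-12
```

The final assertion is exactly the risk decomposition that drives the lemma stated next.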
The following simple lemma translates an error bound on Rh(Xm+u), the full sample risk, to an error bound on the transduction risk Rh(Xu).

Lemma 2.1 For any h ∈ H and any C,

    Rh(Xm+u) ≤ R̂h(Sm) + C   ⇔   Rh(Xu) ≤ R̂h(Sm) + ((m+u)/u) · C.    (3)

Proof: For any h,

    Rh(Xm+u) = ( m·Rh(Xm) + u·Rh(Xu) ) / (m+u).    (4)

Substituting R̂h(Sm) for Rh(Xm) in (4) and then substituting the result for the left-hand side of (3) we get

    ( m·R̂h(Sm) + u·Rh(Xu) ) / (m+u) ≤ R̂h(Sm) + C.

The equivalence (3) is now obtained by isolating Rh(Xu) on the left-hand side. □

3 General Error Bounds for Transduction
Consider a hypothesis class H and assume for simplicity that H is countable; in fact, in the case of transduction it suffices to consider a finite hypothesis class. To see this, note that all m + u points are known in advance. Thus, in the case of binary classification (for example) it suffices to consider at most 2^(m+u) possible dichotomies. Recall that in the setting considered we select a sub-sample of m points from the set Xm+u of cardinality m + u. This corresponds to a selection of m points without replacement from a set of m + u points, leading to the m points being dependent. A naive utilization of large deviation bounds would therefore not be directly applicable in this setting. However, Hoeffding (see Theorem 4 in [13]) pointed out a simple procedure to transform the problem into one involving independent data. While this procedure leads to non-trivial bounds, it does not fully take advantage of the transductive setting and will not be used here. Consider for simplicity the case of binary classification. In this case we make use of the following concentration inequality, based on [14].

Theorem 3.1 Let C = {c1, . . . , cN}, ci ∈ {0, 1}, be a finite set of binary numbers, and set c̄ = (1/N) Σ_{i=1}^{N} ci. Let Z1, . . . , Zm be random variables obtaining their values by sampling C uniformly at random without replacement. Set Z̄ = (1/m) Σ_{i=1}^{m} Zi and β = m/N. Then, if ε ≤ min{1 − c̄, c̄(1 − β)/β} (see Footnote 2 below),

    Pr{ Z̄ − E Z̄ > ε } ≤ exp( −m·D(c̄ + ε ‖ c̄) − (N − m)·D(c̄ − βε/(1−β) ‖ c̄) + 7·log(N + 1) ),

where D(p ‖ q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)), p, q ∈ [0, 1], is the binary Kullback-Leibler divergence.

Using this result we obtain the following error bound for transductive classification.

Theorem 3.2 Let Xm+u = Xm ∪ Xu be the full sample and let p = p(Xm+u) be a (prior) distribution over the class of binary hypotheses H that may depend on the full sample. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over choices of Sm (from the full sample), the following bound holds for any h ∈ H,

    Rh(Xu) ≤ R̂h(Sm) + sqrt( (2·R̂h(Sm)·(m+u)/u) · ( ln(1/p(h)) + ln(m/δ) + 7·log(m+u+1) ) / (m−1) ) + 2·( ln(1/p(h)) + ln(m/δ) + 7·log(m+u+1) ) / (m−1).    (5)

Proof: (sketch) In our transduction setting the set Xm (and therefore Sm) is obtained by sampling the full sample Xm+u uniformly at random without replacement. We first claim that

    E_{Σm} R̂h(Sm) = Rh(Xm+u),    (6)

where E_{Σm}(·) is the expectation with respect to a random choice of Sm from Xm+u without replacement. This is shown as follows (writing C(n, k) for the binomial coefficient):

    E_{Σm} R̂h(Sm) = (1/C(m+u, m)) Σ_{Sm} R̂h(Sm) = (1/C(m+u, m)) Σ_{Xm ⊆ Xm+u} (1/m) Σ_{x ∈ Sm} ℓ(h(x), φ(x)).

By symmetry, all points x ∈ Xm+u are counted on the right-hand side an equal number of times; this number is precisely C(m+u, m) − C(m+u−1, m) = C(m+u−1, m−1). The equality (6) is obtained by considering the definition of Rh(Xm+u) and noting that C(m+u−1, m−1)/C(m+u, m) = m/(m+u). The remainder of the proof combines Theorem 3.1 and the techniques presented in [15].
The details will be provided in the full paper. □

Notice that when R̂h(Sm) → 0 the square root in (5) vanishes and faster rates are obtained. An important feature of Theorem 3.2 is that it allows one to use the sample Xm+u in order to choose the prior distribution p(h). This advantage has already been alluded to in [2], but does not seem to have been widely used in practice. Additionally, observe that (5) holds with probability at least 1 − δ with respect to the random selection of sub-samples of size m from the fixed set Xm+u. This should be contrasted with the standard inductive setting results, where the probabilities are with respect to a random choice of m training points chosen i.i.d. from µ(x, y).

The next bound we present is analogous to McAllester’s Theorem 1 in [8]. This theorem concerns Gibbs composite classifiers, which are distributions over the base classifiers in H. For any distribution q over H denote by Gq the Gibbs classifier, which classifies an instance (in Xu) by randomly choosing, according to q, one hypothesis h ∈ H. [Footnote 2: The second condition, ε ≤ c̄(1 − β)/β, simply guarantees that the number of ‘ones’ in the sub-sample does not exceed their number in the original sample.] For Gibbs classifiers we now extend definition (1) as follows. Let Z = x1, . . . , x|Z| be any set of samples and let Gq be a Gibbs classifier over H. The risk of Gq over Z is

    R_{Gq}(Z) = E_{h∼q}[ (1/|Z|) Σ_{i=1}^{|Z|} ℓ(h(xi), φ(xi)) ].

As before, when Z = Xm (the training set) we use the standard notation R̂_{Gq}(Sm) = R_{Gq}(Xm). Due to space limitations, the proof of the following theorem will appear in the full paper.

Theorem 3.3 Let Xm+u be the full sample. Let p be a distribution over H that may depend on Xm+u and let q be a (posterior) distribution over H that may depend on both Sm and Xu. Let δ ∈ (0, 1) be given. With probability at least 1 − δ over the choices of Sm, for any distribution q,

    R_{Gq}(Xu) ≤ R̂_{Gq}(Sm) + sqrt( (2·R̂_{Gq}(Sm)·(m+u)/u) · ( D(q‖p) + ln(m/δ) + 7·log(m+u+1) ) / (m−1) ) + 2·( D(q‖p) + ln(m/δ) + 7·log(m+u+1) ) / (m−1).

In the context of inductive learning, a major obstacle in generating meaningful and effective bounds using the PAC-Bayesian framework [8] is the construction of “compact priors”. Here we discuss two extensions to the PAC-Bayesian scheme, which together allow for easy choices of compact priors that can yield tight error bounds. The first extension we offer is the use of multiple priors. Instead of a single prior p in the original PAC-Bayesian framework, we observe that one can apply the PAC-Bayesian bound with a number of priors p1, . . . , pk and then replace the complexity term ln(1/p(h)) (in Theorem 3.2) by min_i ln(1/pi(h)), at a cost of an additional ln k term (see below). Similarly, in Theorem 3.3 we can replace the KL-divergence term in the bound with min_i D(q‖pi). The penalty for using k priors is logarithmic in k (specifically, the ln(1/δ) term in the original bound becomes ln(k/δ)). As long as k is sub-exponential in m we still obtain effective generalization bounds. The second “extension” is simply the feature of our transduction bounds (Theorems 3.2 and 3.3) which allows the priors to be dependent on the full sample Xm+u. The combination of these two simple ideas yields a powerful technique for deriving error bounds in realistic transductive settings. After stating the extended result we later use it for deriving tight bounds for known learning algorithms and for deriving new algorithms.
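As a quick numerical illustration, the sketch below evaluates the right-hand side of (5) as reconstructed above; since the display was recovered from a garbled extraction, the exact constants (in particular the factor on the non-square-root term) should be treated as an assumption rather than a verbatim transcription of the theorem.

```python
import math

def transduction_bound(emp_risk, m, u, ln_inv_prior, delta):
    """Right-hand side of (5): emp_risk is the training error R_h(S_m) and
    ln_inv_prior is ln(1/p(h)) for the hypothesis actually selected."""
    cap = (ln_inv_prior + math.log(m / delta) + 7 * math.log(m + u + 1)) / (m - 1)
    return emp_risk + math.sqrt(2 * emp_risk * (m + u) / u * cap) + 2 * cap

# Example: a prior that puts mass 2**-10 on the chosen hypothesis.
print(transduction_bound(emp_risk=0.05, m=200, u=800,
                         ln_inv_prior=10 * math.log(2), delta=0.05))
```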
Suppose that instead of a single prior p over H we want to utilize k priors, p1, . . . , pk, and in retrospect choose the best among the k corresponding PAC-Bayesian bounds. The following theorem shows that one can use polynomially many priors with a minor penalty. The proof, which is omitted due to space limitations, utilizes the union bound in a straightforward manner.

Theorem 3.4 Let the conditions of Theorem 3.2 hold, except that we now have k prior distributions p1, . . . , pk defined over H, each of which may depend on Xm+u. Let δ ∈ (0, 1) be given. Then, with probability at least 1 − δ over random choices of sub-samples of size m from the full sample, for all h ∈ H, (5) holds with p(h) replaced by min_{1≤i≤k} pi(h) and log(1/δ) replaced by log(k/δ).

Remark: A similar result holds for the Gibbs algorithm of Theorem 3.3. Also, as noted by one of the reviewers, when the supports of the k priors intersect (i.e. there is at least one pair of priors pi and pj with overlapping support), then one can do better by utilizing the “super prior” p = (1/k) Σ_i pi within the original Theorem 3.2. However, note that when the supports are disjoint, these two views (of multiple priors and a super prior) are equivalent. In the applications below we utilize non-intersecting priors.

4 Bounds for Compression Algorithms
Here we propose a technique for bounding the error of “compression” algorithms based on an appropriate construction of prior probabilities. Let A be a learning algorithm. Intuitively, A is a “compression scheme” if it can generate the same hypothesis using a subset of the data. More formally, a learning algorithm A (viewed as a function from samples to some hypothesis class) is a compression scheme with respect to a sample Z if there is a subsample Z′, Z′ ⊂ Z, such that A(Z′) = A(Z). Observe that the SVM approach is a compression scheme, with Z′ being determined by the set of support vectors.

Let A be a deterministic compression scheme and consider the full sample Xm+u. For each integer τ = 1, . . . , m, consider all subsets of Xm+u of size τ, and for each subset construct all possible dichotomies of that subset (note that we are not proposing this approach as an algorithm, but rather as a means to derive bounds; in practice one need not construct all these dichotomies). A deterministic algorithm A uniquely determines at most one hypothesis h ∈ H for each dichotomy (see Footnote 3 below). For each τ, let the set of hypotheses generated by this procedure be denoted by Hτ. For the rest of this discussion we assume the worst case, where |Hτ| = 2^τ · C(m+u, τ) (i.e. if Hτ does not contain one hypothesis for each dichotomy, the bounds improve). The prior pτ is then defined to be a uniform distribution over Hτ. In this way we have m priors, p1, . . . , pm, which are constructed using only Xm+u (and are independent of Sm). Any hypothesis selected by the learning algorithm A based on the labeled sample Sm and on the test set Xu belongs to ∪_{τ=1}^{m} Hτ. The motivation for this construction is as follows. Each τ can be viewed as our “guess” for the maximal number of compression points that will be utilized by a resulting classifier. For each such τ the prior pτ is constructed over all possible classifiers that use τ compression points. By systematically considering all possible dichotomies of τ points we can characterize a relatively small subset of H without observing labels of the training points. Thus, each prior pτ represents one such guess.
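Under the worst-case count |Hτ| = 2^τ · C(m+u, τ) assumed above, the complexity term of the τ-th prior and the Theorem 3.4 penalty for using the k = m priors can be tabulated directly; a small sketch with illustrative sizes follows (the specific values of m, u and δ are placeholders).

```python
import math

def ln_inv_prior(tau, m, u):
    """ln(1/p_tau(h)) when p_tau is uniform over H_tau and, in the worst case,
    |H_tau| = 2^tau * C(m+u, tau): one hypothesis per dichotomy of every
    tau-point subset of the full sample."""
    return tau * math.log(2) + math.log(math.comb(m + u, tau))

m, u, delta = 100, 400, 0.05
# Theorem 3.4 with k = m priors: choosing the best "guess" tau in retrospect
# only replaces log(1/delta) by log(k/delta) in the bound.
penalty = math.log(m / delta)
complexity = {tau: round(ln_inv_prior(tau, m, u), 1) for tau in (1, 5, 20, 50)}
print(complexity, round(penalty, 2))
```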
Using Theorem 3.4 we are later allowed to choose in retrospect the bound corresponding to the best “guess”. The following corollary identifies an upper bound on the divergence in terms of the observed size of the compression set of the final classifier.

Corollary 4.1 Let the conditions of Theorem 3.4 hold. Let A be a deterministic learning algorithm leading to a hypothesis h ∈ H based on a compression set of size s. Then with probability at least 1 − δ, for all h ∈ H, (5) holds with log(1/p(h)) replaced by s·log(2e(m+u)/s) and ln(m/δ) replaced by ln(m²/δ).

Proof: Recall that Hs ⊆ H is the support set of ps and that ps(h) = 1/|Hs| for all h ∈ Hs, implying that ln(1/ps(h)) = ln|Hs|. Using the inequality C(m+u, s) ≤ (e(m+u)/s)^s we have that |Hs| = 2^s · C(m+u, s) ≤ (2e(m+u)/s)^s. Substituting this result in Theorem 3.4, while restricting the minimum over i to be over i ≥ s, leads to the desired result. □

The bound of Corollary 4.1 can be easily computed once the classifier is trained. If the size of the compression set happens to be small, we obtain a tight bound. SVM classification is one of the best studied compression schemes. The compression set for a sample Sm is given by the subset of support vectors. Thus the bound in Corollary 4.1 immediately applies with s being the number of observed support vectors (after training). We note that this bound is similar to a recently derived compression bound for inductive learning (Theorem 5.18 in [16]). Also, observe that the algorithm itself (inductive SVM) did not use in this case the unlabeled sample (although the bound does use this sample). Nevertheless, using exactly the same technique we obtain error bounds for the transductive SVM algorithms in [2, 3] (see Footnote 4 below).

[Footnote 3: It might be that for some dichotomies the algorithm will fail. For example, an SVM in feature space without soft margin will fail to classify non linearly-separable dichotomies of Xm+u.]
[Footnote 4: Note, however, that our bounds are optimized with a “minimum number of support vectors” approach rather than “maximum margin”.]

5 Bounds for Clustering Algorithms
Some learning problems do not allow for high compression rates using compression schemes such as SVMs (i.e. the number of support vectors can sometimes be very large). A considerably stronger type of compression can often be achieved by clustering algorithms. While there is a lack of formal links between entirely unsupervised clustering and classification, within a transduction setting we can provide a principled approach to using clustering algorithms for classification.

Let A be any (deterministic) clustering algorithm which, given the full sample Xm+u, can cluster this sample into any desired number of clusters. We use A to cluster Xm+u into 2, 3, . . . , c clusters, where c ≤ m. Thus, the algorithm generates a collection of partitions of Xm+u into τ = 2, 3, . . . , c clusters, where each partition is denoted by Cτ. For each value of τ, let Hτ consist of those hypotheses which assign an identical label to all points in the same cluster of partition Cτ, and define the prior pτ(h) = 1/2^τ for each h ∈ Hτ and zero otherwise (note that there are 2^τ possible dichotomies). The learning algorithm selects a hypothesis as follows. Upon observing the labeled sample Sm = (Xm, Ym), for each of the clusterings C2, . . . , Cc constructed above, it assigns a label to each cluster based on the majority vote from the labels Ym of points falling within the cluster (in case of ties, or if no points from Xm belong to the cluster, choose a label arbitrarily).
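A minimal sketch of the cluster-and-vote rule just described is given below; k-means with a fixed seed stands in for the unspecified deterministic clustering algorithm A, and the tie-breaking label is arbitrary, as allowed above.

```python
import numpy as np
from sklearn.cluster import KMeans  # any deterministic clustering of the full sample will do

def cluster_vote_classifier(X_full, train_idx, y_train, tau, seed=0):
    """Cluster the full sample X_{m+u} into tau clusters, then label every cluster
    by majority vote over the training points falling inside it; empty or tied
    clusters receive an arbitrary label (+1 here)."""
    clusters = KMeans(n_clusters=tau, n_init=10, random_state=seed).fit_predict(X_full)
    labels = np.ones(tau)                       # arbitrary default label
    for c in range(tau):
        votes = y_train[clusters[train_idx] == c]
        if votes.size and votes.sum() != 0:
            labels[c] = np.sign(votes.sum())
    return labels[clusters]                     # a predicted label for every point
```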
Doing this leads to c − 1 classifiers hτ, τ = 2, . . . , c. For each hτ there is a valid error bound as given by Theorem 3.4, and all these bounds are valid simultaneously. Thus we choose the best classifier (equivalently, number of clusters) for which the best bound holds. We thus have the following corollary of Theorem 3.4 and Lemma 2.1.

Corollary 5.1 Let A be any clustering algorithm and let hτ, τ = 2, . . . , c be classifications of the test set Xu as determined by clustering of the full sample Xm+u (into τ clusters) and the training set Sm, as described above. Let δ ∈ (0, 1) be given. Then with probability at least 1 − δ, for all τ, (5) holds with log(1/p(h)) replaced by τ and ln(m/δ) replaced by ln(mc/δ).

Error bounds obtained using Corollary 5.1 can be rather tight when the clustering algorithm is successful (i.e. when it captures the class structure in the data using a small number of clusters). Corollary 5.1 can be extended in a number of ways. One simple extension is the use of an ensemble of clustering algorithms. Specifically, we can concurrently apply k clustering algorithms (using each algorithm to cluster the data into τ = 2, . . . , c clusters). We thus obtain kc hypotheses (partitions of Xm+u). By a simple application of the union bound we can replace ln(cm/δ) by ln(kcm/δ) in Corollary 5.1 and guarantee that kc bounds hold simultaneously for all kc hypotheses (with probability at least 1 − δ). We thus choose the hypothesis which minimizes the resulting bound. This extension is particularly attractive since typically, without prior knowledge, we do not know which clustering algorithm will be effective for the dataset at hand. (A small numerical sketch of this bound-driven selection is given after the reference list below.)

6 Concluding Remarks
We presented new bounds for transductive learning algorithms. We also developed a new technique for deriving tight error bounds for compression schemes and for clustering algorithms in the transductive setting. We expect that these bounds and new techniques will be useful for deriving new error bounds for other known algorithms and for deriving new types of transductive learning algorithms. It would be interesting to see if tighter transduction bounds can be obtained by reducing the “slacks” in the inequalities we use in our analysis. Another promising direction is the construction of better (multiple) priors. For example, in our compression bound (Corollary 4.1), for each number of compression points we assigned the same prior to each possible point subset and each possible dichotomy. However, in practice a vast majority of all these subsets and dichotomies are unlikely to occur.

Acknowledgments
The work of R.E. and R.M. was partially supported by the Technion V.P.R. fund for the promotion of sponsored research. Support from the Ollendorff center of the department of Electrical Engineering at the Technion is also acknowledged. We also thank the anonymous referees for their useful comments.

References
[1] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[2] V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998.
[3] T. Joachims. Transductive inference for text classification using support vector machines. In European Conference on Machine Learning, 1999.
[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001.
[5] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning.
In Advances in Neural Information Processing Systems (NIPS 2001), pages 1025–1032, 2001.
[6] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), 2003.
[7] D. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
[8] D. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
[9] D. Wu, K. Bennett, N. Cristianini, and J. Shawe-Taylor. Large margin trees for induction and transduction. In International Conference on Machine Learning, 1999.
[10] L. Bottou, C. Cortes, and V. Vapnik. On the effective VC dimension. Technical report, AT&T, 1994.
[11] G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Technical report, University of California, Berkeley, Computer Science Division, 2002.
[12] A. Blum and J. Langford. PAC-MDL bounds. In COLT, pages 344–357, 2003.
[13] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.
[14] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer, New York, second edition, 1998.
[15] D. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, pages 203–215, 2003.
[16] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Cambridge, MA, 2002.
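The bound-driven choice of the number of clusters referred to above (Corollary 5.1) amounts to a one-line model selection once the per-τ training errors are known; in the sketch below the constants mirror the reconstruction of (5) given earlier, and the training errors are invented purely for illustration.

```python
import math

def clustering_bound(emp_risk, tau, m, u, c, delta):
    """Right-hand side of (5) with log(1/p(h)) -> tau and ln(m/delta) -> ln(m*c/delta),
    as prescribed by Corollary 5.1 (constants as reconstructed above)."""
    cap = (tau + math.log(m * c / delta) + 7 * math.log(m + u + 1)) / (m - 1)
    return emp_risk + math.sqrt(2 * emp_risk * (m + u) / u * cap) + 2 * cap

m, u, c, delta = 100, 400, 10, 0.05
emp = {2: 0.30, 3: 0.08, 4: 0.05, 5: 0.05, 6: 0.04}   # hypothetical training errors of h_tau
best_tau = min(emp, key=lambda t: clustering_bound(emp[t], t, m, u, c, delta))
# For an ensemble of k clustering algorithms, ln(c*m/delta) becomes ln(k*c*m/delta).
```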
4 0.52567649 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm, the Semidefinite Programming Machine (SDPM), which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in the dual variables and allows us to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
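The SDPM itself optimizes the margin over polynomial trajectories via semidefinite constraints; as a much simpler hedge, the sketch below enforces the margin only along degree-one trajectories x_i + t·d_i with t in a bounded interval, for which the worst case over t has the closed form y_i(w·x_i + b) − τ|w·d_i|. It is meant only to illustrate the idea of a margin along a transformation direction, not the authors' algorithm; the tangent vectors d_i and all parameters are assumptions.

```python
import numpy as np
import cvxpy as cp

def robust_margin_hyperplane(X, D, y, tau=0.1):
    """Hard-margin hyperplane that stays correct along x_i + t*d_i for |t| <= tau.
    X: (n, d) points, D: (n, d) tangent vectors, y: labels in {-1, +1}.
    May be infeasible if the worst-case trajectories are not separable."""
    n_features = X.shape[1]
    w, b = cp.Variable(n_features), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) - tau * cp.abs(D @ w) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value, b.value
```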
5 0.48219761 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
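For orientation, a generic normalized-cut style spectral clustering pipeline is sketched below (Gaussian affinity, normalized Laplacian, k smallest eigenvectors, k-means on the embedding); it is the standard recipe rather than the specific cost function or power-method approximation proposed in the paper, and the kernel width sigma is an arbitrary choice.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0, seed=0):
    """Standard spectral clustering on a Gaussian similarity matrix."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))               # similarity matrix
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_sym = np.eye(len(X)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    _, vecs = eigh(L_sym)                                   # eigenvalues in ascending order
    U = vecs[:, :k]                                         # k smallest eigenvectors
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)  # row-normalise
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```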
6 0.45233443 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
7 0.41786855 176 nips-2003-Sequential Bayesian Kernel Regression
8 0.41677621 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
9 0.39491281 113 nips-2003-Learning with Local and Global Consistency
10 0.37798274 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
11 0.37592852 124 nips-2003-Max-Margin Markov Networks
12 0.37212846 145 nips-2003-Online Classification on a Budget
13 0.3661426 128 nips-2003-Minimax Embeddings
14 0.35866243 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
15 0.35215092 46 nips-2003-Clustering with the Connectivity Kernel
16 0.34778199 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
17 0.34303573 73 nips-2003-Feature Selection in Clustering Problems
18 0.3407082 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
19 0.33043125 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
20 0.32312387 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
topicId topicWeight
[(0, 0.069), (11, 0.029), (30, 0.014), (35, 0.063), (53, 0.107), (58, 0.27), (66, 0.016), (71, 0.079), (76, 0.057), (85, 0.067), (91, 0.098), (99, 0.045)]
simIndex simValue paperId paperTitle
1 0.87761909 170 nips-2003-Self-calibrating Probability Forecasting
Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov
Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.
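A sketch of one plausible reading of a 1-Nearest-Neighbour Venn machine for binary labels follows: each postulated label for the new object induces a taxonomy (every example is categorised by the label of its nearest neighbour), and the output for that label is the empirical label distribution inside the new object's category, so the overall output is a small set of distributions. This illustrates the general Venn construction and is not necessarily the authors' exact taxonomy.

```python
import numpy as np

def one_nn_venn(X_train, y_train, x_new, labels=(-1, 1)):
    """Return {postulated label: empirical label distribution in the new object's
    1-NN category}, i.e. the 'multiprobability' output."""
    out = {}
    for y_cand in labels:
        X = np.vstack([X_train, x_new[None, :]])
        y = np.append(y_train, y_cand)
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)            # an example is not its own neighbour
        category = y[np.argmin(dists, axis=1)]     # taxonomy: nearest neighbour's label
        group = category == category[-1]           # examples sharing the new object's category
        out[y_cand] = {lab: float(np.mean(y[group] == lab)) for lab in labels}
    return out
```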
same-paper 2 0.80305851 48 nips-2003-Convex Methods for Transduction
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
3 0.77277845 41 nips-2003-Boosting versus Covering
Author: Kohei Hatano, Manfred K. Warmuth
Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all their positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k^2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k^2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction as well as in simple experiments based on artificial data. Further we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1
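For reference, the covering baseline the abstract compares against can be sketched as a greedy set cover over candidate literals; the version below learns a monotone disjunction over Boolean features and is only an illustration of that baseline, not of AdaBoost or InfoBoost.

```python
import numpy as np

def greedy_disjunction(X, y):
    """Greedy set cover for a monotone disjunction: X is a 0/1 matrix, y in {0, 1}.
    Admissible literals never fire on a negative example; each greedy step adds
    the literal covering the most still-uncovered positive examples."""
    pos, neg = X[y == 1], X[y == 0]
    admissible = np.where(neg.sum(axis=0) == 0)[0]
    uncovered = np.ones(len(pos), dtype=bool)
    chosen = []
    while uncovered.any() and admissible.size:
        gains = pos[uncovered][:, admissible].sum(axis=0)
        if gains.max() == 0:
            break                      # remaining positives cannot be covered consistently
        best = int(admissible[np.argmax(gains)])
        chosen.append(best)
        uncovered &= pos[:, best] == 0
    return chosen                      # predict 1 iff any chosen feature equals 1
```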
4 0.60249603 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
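As a baseline for the value-function model discussed in the abstract, plain Gaussian process regression with a squared-exponential kernel is sketched below; it is a generic stand-in (hyperparameters chosen arbitrarily), not the paper's closed-form policy-evaluation machinery.

```python
import numpy as np

def gp_posterior_mean(X_train, y_train, X_test, ell=1.0, sf=1.0, noise=1e-2):
    """Posterior mean of a GP with squared-exponential covariance."""
    def k(A, B):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return sf ** 2 * np.exp(-0.5 * sq / ell ** 2)
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    return k(X_test, X_train) @ np.linalg.solve(K, y_train)
```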
5 0.5973711 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
6 0.59720212 189 nips-2003-Tree-structured Approximations by Expectation Propagation
7 0.59665871 107 nips-2003-Learning Spectral Clustering
8 0.59657401 113 nips-2003-Learning with Local and Global Consistency
9 0.59016293 126 nips-2003-Measure Based Regularization
10 0.58991593 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
11 0.58924645 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
12 0.58899134 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
13 0.58587044 81 nips-2003-Geometric Analysis of Constrained Curves
14 0.58574808 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
15 0.58531928 143 nips-2003-On the Dynamics of Boosting
16 0.58467072 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
17 0.58359039 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
18 0.58283836 112 nips-2003-Learning to Find Pre-Images
19 0.58272392 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
20 0.58208936 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images