nips nips2005 nips2005-12 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: François Laviolette, Mario Marchand, Mohak Shah
Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. 1
Reference: text
sentIndex sentText sentNum sentScore
1 ca Abstract We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. [sent-7, score-0.243]
2 1 Introduction Learning algorithms try to produce classifiers with small prediction error by optimizing some function that can be computed from a training set of examples and a classifier. [sent-8, score-0.152]
3 At one end of the spectrum, we have the set covering machine (SCM), proposed by Marchand and Shawe-Taylor (2002), that tries to find the sparsest classifier making few training errors. [sent-10, score-0.163]
4 At the other end of the spectrum, we have the support vector machine (SVM), proposed by Boser et al. (1992), which tries to find the maximum soft-margin separating hyperplane on the training data. [sent-12, score-0.081]
5 , 2005) if better classifiers could be found by learning algorithms that try to optimize a non-trivial function that depends on both the sparsity of a classifier and the magnitude of its separating margin. [sent-14, score-0.095]
6 Our main result was a general data-compression risk bound that applies to any algorithm producing classifiers represented by two complementary sources of information: a subset of the training set, called the compression set, and a message string of additional information. [sent-15, score-0.459]
7 In addition, we proposed a new algorithm for the SCM where the information string was used to encode radius values for data-dependent balls and, consequently, the location of the decision surface of the classifier. [sent-16, score-0.407]
8 Since a small message string is sufficient when large regions of equally good radius values exist for balls, the data compression risk bound applied to this version of the SCM exhibits, indirectly, a non-trivial margin-sparsity trade-off. [sent-17, score-0.511]
9 However, this version of the SCM currently suffers from the fact that the radius values used in the final classifier depend on an a priori chosen distance scale R. [sent-18, score-0.128]
10 Moreover, we propose a risk bound that depends more explicitly on the margin and which is also minimized by classifiers achieving a non-trivial margin-sparsity trade-off. [sent-20, score-0.285]
11 2 Definitions We consider binary classification problems where the input space X consists of an arbitrary subset of R^n and the output space is Y def= {0, 1}. [sent-21, score-0.23]
12 The (true) risk R(f) of a classifier f : X → Y is defined to be the probability that f misclassifies an example drawn according to D: R(f) def= Pr(x,y)∼D (f(x) ≠ y) = E(x,y)∼D I(f(x) ≠ y), where I(a) = 1 if predicate a is true and 0 otherwise. [sent-24, score-0.422]
13 Given a training set S = (z1, . . . , zm) of m examples, the task of a learning algorithm is to construct a classifier with the smallest possible risk without any information about D. [sent-28, score-0.171]
14 Each data-dependent ball is defined by a center and a radius value. [sent-30, score-0.272]
15 The center is an input example xi chosen from the training set S. [sent-31, score-0.133]
16 For any test example x, the output of a ball hi,ρ of radius ρ, centered on example xi, is given by hi,ρ(x) def= yi if d(x, xi) ≤ ρ, and ȳi otherwise, where ȳi denotes the boolean complement of yi and d(x, xi) denotes the distance between the two points. [sent-32, score-0.78]
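The following is a minimal Python sketch of the data-dependent ball hi,ρ just defined; the Euclidean metric and the names (ball_output, center, center_label, radius) are illustrative choices, not notation from the paper.

    import numpy as np

    def ball_output(x, center, center_label, radius):
        """Output of a data-dependent ball h_{i,rho}(x): the label y_i of the
        center if x lies inside the ball, its boolean complement otherwise."""
        if np.linalg.norm(x - center) <= radius:   # d(x, x_i) <= rho
            return center_label
        return 1 - center_label                    # boolean complement of y_i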
17 To specify a conjunction of balls we first need to list all the examples that participate as centers for the balls in the conjunction. [sent-34, score-0.808]
18 The indices of these centers form a vector i def= (i1, i2, . . . , i|i|) with i1 < i2 < · · · < i|i|, where |i| is the number of indices present in i (and thus the number of balls in the conjunction). [sent-44, score-0.263]
19 To complete the specification of a conjunction of balls, we need a vector ρ def= (ρi1, ρi2, . . . , ρi|i|) of radius values, one for each ball. [sent-45, score-0.062]
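Building on the ball_output sketch above, a conjunction of balls Ci,ρ outputs 1 only when every ball outputs 1; the representation of a conjunction as a list of (center, center_label, radius) triples is an illustrative assumption.

    def conjunction_output(x, balls):
        """C_{i,rho}(x): 1 iff every ball in the conjunction outputs 1.
        `balls` is a list of (center, center_label, radius) triples,
        one triple per index in the vector i."""
        return int(all(ball_output(x, c, y, r) == 1 for (c, y, r) in balls))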
20 A remarkable result that came out of this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier called the Gibbs classifier. [sent-59, score-0.24]
21 We first choose a classifier h according to the posterior distribution Q and then use h to assign the label h(x) to x. [sent-61, score-0.104]
22 However, for all these versions of the PAC-Bayes theorem, the prior P must be defined without reference to the training data. [sent-63, score-0.084]
23 In the sample compression setting, each classifier is described by a subset Si of the training data, called the compression set, and a message string σ that represents the additional information needed to obtain a classifier. [sent-65, score-0.307]
24 In other words, in this setting, there exists a reconstruction function R that outputs a classifier R(σ, Si ) when given an arbitrary compression set Si and a message string σ. [sent-66, score-0.161]
25 Given a training set S, the compression set Si ⊆ S is defined by a vector of indices i def= (i1, . . . , i|i|). [sent-67, score-0.399]
26 For the case of a conjunction of balls, each j ∈ i will point to a training example that is used for a ball center, and the message string σ will be the vector ρ of radius values (defined above) that are used for the balls. [sent-71, score-0.488]
27 Hence, given Si and ρ , the classifier obtained from R(ρ , Si ) is just the conjunction Ci,ρ defined previously. [sent-72, score-0.062]
28 Their proposed risk bound depends on a data-independent prior P and a data-dependent posterior Q that are both defined on I × M, where I denotes the set of the 2^m possible index vectors i and M denotes, in our case, the set of possible radius vectors ρ. [sent-74, score-0.388]
29 Given a training set S and given a new (testing) input example x, a sample-compressed Gibbs classifier GQ chooses randomly (i, ρ) according to Q to obtain a classifier R(ρ, Si) which is then used to determine the class label of x. [sent-76, score-0.1]
30 In this paper we focus on the case where, given any training set S, the learner returns a Gibbs classifier defined with a posterior distribution Q having all its weight on a single vector i. [sent-77, score-0.141]
31 Hence, a single compression set Si will be used for the final classifier. [sent-78, score-0.065]
32 However, the radius ρi for each i ∈ i will be chosen stochastically according to the posterior Q. [sent-79, score-0.166]
33 Hence we consider posteriors Q such that Q(i′, ρ) = I(i′ = i) Qi(ρ), where i is the vector of indices chosen by the learner. [sent-80, score-0.066]
34 Hence, given a training set S, the true risk R(GQi) of GQi and its empirical risk RS(GQi) are defined by R(GQi) def= Eρ∼Qi R(R(ρ, Si)) and RS(GQi) def= Eρ∼Qi RSī(R(ρ, Si)), where ī denotes the set of indices not present in i (so that RSī is the empirical risk measured on the examples of S outside the compression set). [sent-81, score-0.906]
35 In contrast with the posterior Q, the prior P assigns a non-zero weight to several vectors i. [sent-86, score-0.09]
36 Let PI(i) denote the prior probability that P assigns to vector i, and let Pi(ρ) denote the prior probability assigned to the radius vector ρ given i, so that P(i, ρ) = PI(i)Pi(ρ).¹ (¹ We assume that the examples in Si are ordered as in S so that the kth radius value in ρ is assigned to the kth example in Si.) [sent-87, score-0.246]
37 The risk bound depends on the Kullback-Leibler divergence KL(Q‖P) between the posterior Q and the prior P which, in our case, gives KL(Qi‖P) = Eρ∼Qi ln [ Qi(ρ) / (PI(i)Pi(ρ)) ]. [sent-89, score-0.402]
38 For these classes of posteriors Q and priors P, the PAC-Bayes theorem of Laviolette and Marchand (2005) reduces to the following simpler version. [sent-90, score-0.062]
39 Theorem 1 (Laviolette and Marchand (2005)) Given all our previous definitions, for any prior P and for any δ ∈ (0, 1]: PrS∼Dm ( ∀Qi : kl(RS(GQi) ‖ R(GQi)) ≤ [1/(m − |i|)] [ KL(Qi‖P) + ln((m + 1)/δ) ] ) ≥ 1 − δ, where kl(q‖p) def= q ln(q/p) + (1 − q) ln((1 − q)/(1 − p)). [sent-91, score-0.508]
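In practice, the bound of Theorem 1 is used by numerically inverting the binary kl divergence: given the empirical Gibbs risk q and the right-hand side ε, the guaranteed true risk is the largest p with kl(q‖p) ≤ ε. Below is a minimal sketch of this inversion by bisection; the function names and tolerance are my own choices.

    import math

    def kl_bernoulli(q, p):
        """kl(q||p) = q ln(q/p) + (1-q) ln((1-q)/(1-p)), with 0 ln 0 = 0."""
        eps = 1e-12
        p = min(max(p, eps), 1.0 - eps)
        term1 = 0.0 if q == 0 else q * math.log(q / p)
        term2 = 0.0 if q == 1 else (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
        return term1 + term2

    def kl_inverse_upper(q, epsilon, tol=1e-9):
        """Largest p in [q, 1) such that kl(q||p) <= epsilon (bisection)."""
        lo, hi = q, 1.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if kl_bernoulli(q, mid) <= epsilon:
                lo = mid
            else:
                hi = mid
        return lo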
40 For PI(i) we use a prior of the form PI(i) = p(|i|) / (m choose |i|) for some distribution p(d) over the number of balls; but since the risk bound will deteriorate for large |i|, it is generally preferable to choose, for p(d), a slowly decreasing function of d. [sent-98, score-0.217]
41 For the specification of Pi(ρ), we assume that each radius value, in some predefined interval [0, R], is equally likely to be chosen for each ρi such that i ∈ i. [sent-99, score-0.133]
42 For Qi (ρ ), a margin interval [ai , bi ] ⊆ [0, R] of equally good radius values is chosen by the learner for each i ∈ i. [sent-101, score-0.409]
43 Hence Qi(ρ) = ∏i∈i 1/(bi − ai) whenever each ρi ∈ [ai, bi] (and zero otherwise). Therefore, the Gibbs classifier returned by the learner will draw each radius ρi uniformly in [ai, bi]. [sent-103, score-0.645]
44 A deterministic classifier is then specified by fixing each radius value ρi ∈ [ai, bi]. [sent-104, score-0.27]
45 It is tempting at this point to choose ρi = (ai + bi)/2 ∀i ∈ i (i.e., the middle of each margin interval). [sent-105, score-0.187]
46 Consequently, with these choices for Qi(ρ), PI(i), and Pi(ρ), the KL divergence between Qi and P is given by KL(Qi‖P) = ln (m choose |i|) + ln(1/p(|i|)) + Σi∈i ln(R/(bi − ai)). [sent-109, score-0.619]
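This KL term can be evaluated directly, as in the sketch below; it assumes the prior PI(i) = p(|i|)/(m choose |i|) mentioned above, with p(d) passed in as a function, and all names are illustrative.

    import math

    def kl_posterior_prior(m, intervals, R, p_of_d):
        """KL(Q_i||P) = ln C(m,|i|) + ln(1/p(|i|)) + sum_i ln(R/(b_i - a_i)).
        `intervals` lists the margin intervals [a_i, b_i], one per ball;
        `p_of_d` is the distribution p(d) over the number of balls."""
        d = len(intervals)                                     # |i|
        log_binom = (math.lgamma(m + 1) - math.lgamma(d + 1)
                     - math.lgamma(m - d + 1))                 # ln C(m, d)
        return (log_binom + math.log(1.0 / p_of_d(d))
                + sum(math.log(R / (b - a)) for (a, b) in intervals))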
47 Notice that the KL divergence is small for small values of |i| (whenever p(|i|) is not too small) and for large margin values (bi − ai ). [sent-110, score-0.272]
48 Hence, the KL divergence term in Theorem 1 favors both sparsity (small |i|) and large margins. [sent-111, score-0.07]
49 Hence, in practice, the minimum might occur for some GQi that sacrifices sparsity whenever larger margins can be found. [sent-112, score-0.09]
50 Since the posterior Q is identified by i and by the intervals [ai, bi] ∀i ∈ i, we will now refer to the Gibbs classifier GQi by G^i_ab, where a and b are the vectors formed by the ai's and bi's respectively. [sent-113, score-0.972]
51 To obtain a risk bound for G^i_ab, we need to find a closed-form expression for RS(G^i_ab). [sent-114, score-0.658]
52 Therefore, ζ^i_{a,b}(x) def= Prρ∼U[a,b](hi,ρ(x) = 1) = σ^i_{a,b}(x) if yi = 1, and 1 − σ^i_{a,b}(x) if yi = 0, where σ^i_{a,b}(x) = Prρ∼U[a,b](d(x, xi) ≤ ρ) is a piecewise-linear function of the distance d(x, xi). [sent-116, score-0.308]
53 Now let G^i_ab(x) denote the probability that Ci,ρ(x) = 1 when each ρi in ρ is drawn according to U[ai, bi]. [sent-117, score-0.625]
54 Hence G^i_ab(x) = ∏i∈i ζ^i_{ai,bi}(x). Consequently, the risk R(x,y)(G^i_ab) on a single example (x, y) is given by G^i_ab(x) if y = 0 and by 1 − G^i_ab(x) otherwise. [sent-119, score-1.053]
55 Therefore R(x,y)(G^i_ab) = y(1 − G^i_ab(x)) + (1 − y)G^i_ab(x) = (1 − 2y)(G^i_ab(x) − y). [sent-120, score-0.441]
56 Hence, the empirical risk RS(G^i_ab) of the Gibbs classifier G^i_ab is given by RS(G^i_ab) = [1/(m − |i|)] Σj∈ī (1 − 2yj)(G^i_ab(xj) − yj). [sent-121, score-2.87]
57 From this expression we see that RS(G^i_ab) is small when G^i_ab(xj) → yj ∀j ∈ ī. [sent-122, score-0.494]
58 Training points where G^i_ab(xj) ≈ 1/2 should therefore be avoided. [sent-123, score-0.882]
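The soft quantities above have the following direct implementation. The sketch below assumes the Euclidean metric and takes σ^i_{a,b}(x) = Prρ∼U[a,b](d(x, xi) ≤ ρ), the piecewise-linear form implied by the definition of ζ; all names are illustrative.

    import numpy as np

    def sigma(x, center, a, b):
        """Pr_{rho~U[a,b]}(d(x, center) <= rho): 1 below a, 0 above b,
        linear in between."""
        d = np.linalg.norm(x - center)
        if b <= a:                        # degenerate interval: fixed radius
            return 1.0 if d <= a else 0.0
        return float(np.clip((b - d) / (b - a), 0.0, 1.0))

    def zeta(x, center, center_label, a, b):
        """Probability that the stochastic ball outputs 1."""
        s = sigma(x, center, a, b)
        return s if center_label == 1 else 1.0 - s

    def gibbs_output(x, soft_balls):
        """G^i_ab(x): probability that the conjunction outputs 1 when each
        rho_i ~ U[a_i, b_i]; `soft_balls` lists (center, label, a_i, b_i)."""
        g = 1.0
        for (c, y, a, b) in soft_balls:
            g *= zeta(x, c, y, a, b)
        return g

    def empirical_gibbs_risk(soft_balls, X_rest, y_rest):
        """RS(G^i_ab) over the examples not used as ball centers."""
        terms = [(1 - 2 * yj) * (gibbs_output(xj, soft_balls) - yj)
                 for xj, yj in zip(X_rest, y_rest)]
        return sum(terms) / len(terms)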
59 The PAC-Bayes theorem below provides a risk bound for the Gibbs classifier G^i_ab. [sent-124, score-0.7]
60 Since the Bayes classifier B^i_ab just performs a majority vote under the same posterior distribution as the one used by G^i_ab, we have that B^i_ab(x) = 1 iff G^i_ab(x) > 1/2. [sent-125, score-0.479]
61 From the above definitions, note that the decision surface of the Bayes classifier, given by G^i_ab(x) = 1/2, differs from the decision surface of the classifier Ci,ρ with ρi = (ai + bi)/2 ∀i ∈ i. [sent-126, score-1.548]
62 Theorem 2 Given all our previous definitions, for any δ ∈ (0, 1], for any p satisfying Σ_{d=0}^m p(d) = 1, and for any fixed distance value R, we have: PrS∼Dm ( ∀i, a, b : R(G^i_ab) ≤ sup { r : kl(RS(G^i_ab) ‖ r) ≤ [1/(m − |i|)] [ ln (m choose |i|) + ln(1/p(|i|)) + Σi∈i ln(R/(bi − ai)) + ln((m + 1)/δ) ] } ) ≥ 1 − δ. Furthermore: R(B^i_ab) ≤ 2 R(G^i_ab) ∀i, a, b. [sent-131, score-2.283]
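Assembling the pieces, the right-hand side of Theorem 2 can be evaluated by combining the explicit KL term with the inverted binary kl divergence; the sketch below reuses kl_posterior_prior and kl_inverse_upper from the earlier snippets and is my own assembly of the stated bound, not code from the paper.

    import math

    def theorem2_bound(emp_gibbs_risk, m, intervals, R, p_of_d, delta):
        """Upper bound on R(G^i_ab) holding with probability >= 1 - delta."""
        d = len(intervals)                                    # |i|
        epsilon = (kl_posterior_prior(m, intervals, R, p_of_d)
                   + math.log((m + 1) / delta)) / (m - d)
        return kl_inverse_upper(emp_gibbs_risk, epsilon)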
63 Recall that the KL divergence is small for small values of |i| (whenever p(|i|) is not too small) and for large margin values (bi − ai ). [sent-133, score-0.272]
64 Furthermore, the Gibbs empirical risk RS(G^i_ab) is small when the training points are located far away from the Bayes decision surface G^i_ab(x) = 1/2 (with G^i_ab(xj) → yj ∀j ∈ ī). [sent-134, score-0.754]
65 Consequently, the Gibbs classifier with the smallest risk guarantee should perform a non-trivial margin-sparsity trade-off. [sent-135, score-1.079]
66 4 A Soft Greedy Learning Algorithm Theorem 2 suggests that the learner should try to find the Bayes classifier B^i_ab that uses a small number of balls. [sent-136, score-0.301]
67 That is, a small |i|, each ball having a large separating margin (bi − ai), while keeping the empirical Gibbs risk RS(G^i_ab) at a low value. [sent-138, score-0.429]
68 To achieve this goal, we have adapted the greedy algorithm for the set covering machine (SCM) proposed by Marchand and Shawe-Taylor (2002). [sent-139, score-0.6]
69 Once the feature with the largest utility Ui is found, we remove Ni (the negative examples it covers) and Pi (the positive examples it misclassifies) from the training set S and then repeat (on the remaining examples) until either no more negative examples are present or a maximum number of features has been reached. [sent-141, score-0.161]
70 In our case, however, we need to keep the Gibbs risk on S low instead of the risk of a deterministic classifier. [sent-142, score-0.342]
71 Since the Gibbs risk is a “soft measure” that uses the piecewise-linear functions σ^i_{a,b} instead of “hard” indicator functions, we need a “softer” version of the utility function Ui. [sent-143, score-0.273]
72 Following this observation, let k be the vector of indices of the examples that we have used as ball centers so far for the construction of the classifier. [sent-145, score-0.262]
73 Let us first define the covering value C(G^k_ab) of G^k_ab by the “amount” of negative examples assigned to class 0 by G^k_ab: C(G^k_ab) def= Σj∈k̄ (1 − yj)(1 − G^k_ab(xj)). [sent-146, score-1.823]
74 We also define the positive-side error E(G^k_ab) of G^k_ab as the “amount” of positive examples assigned to class 0: E(G^k_ab) def= Σj∈k̄ yj (1 − G^k_ab(xj)). [sent-147, score-2.131]
75 We now want to add another ball, centered on an example with index i, to obtain a new vector k′ containing this new index in addition to those present in k. [sent-148, score-0.441]
76 Typically, the covering contribution of ball i should increase its “utility” and its positive-side error should decrease it. [sent-150, score-0.68]
77 Hence, we define the utility U^k_ab(i) of adding ball i to G^k_ab as U^k_ab(i) def= C^k_ab(i) − p E^k_ab(i), where the parameter p represents the penalty of misclassifying a positive example. [sent-151, score-0.466]
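A sketch of these soft quantities follows, building on gibbs_output above. Here the covering contribution C^k_ab(i) and positive-side error contribution E^k_ab(i) of a candidate ball are taken to be the increase in covering value and in positive-side error obtained by adding that ball to the current conjunction; this reading, and all names, are assumptions.

    def covering_value(soft_balls, X, y):
        """C(G^k_ab): "amount" of negative examples assigned to class 0."""
        return sum((1 - yj) * (1.0 - gibbs_output(xj, soft_balls))
                   for xj, yj in zip(X, y))

    def positive_side_error(soft_balls, X, y):
        """E(G^k_ab): "amount" of positive examples assigned to class 0."""
        return sum(yj * (1.0 - gibbs_output(xj, soft_balls))
                   for xj, yj in zip(X, y))

    def utility(soft_balls, candidate, X, y, p):
        """U^k_ab(i) = C^k_ab(i) - p * E^k_ab(i) for a candidate soft ball."""
        new_balls = soft_balls + [candidate]
        c_gain = covering_value(new_balls, X, y) - covering_value(soft_balls, X, y)
        e_gain = (positive_side_error(new_balls, X, y)
                  - positive_side_error(soft_balls, X, y))
        return c_gain - p * e_gain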
78 For a fixed value of p, the “soft greedy” algorithm simply consists of adding, to the current Gibbs classifier, a ball with maximum added utility until either the maximum number of possible features (balls) has been reached or all the negative examples have been (totally) covered. [sent-152, score-0.319]
79 It is understood that, during this soft greedy algorithm, we can remove an example (xj , yj ) from S whenever it is totally covered. [sent-153, score-0.224]
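The overall soft greedy loop can then be sketched as follows for fixed p and γ; the stopping rule, the interpretation of "totally covered" as G^k_ab(xj) = 0, and all names are illustrative assumptions rather than the paper's exact procedure.

    def soft_greedy_scm(X, y, candidates, p, max_balls):
        """Greedy construction of a conjunction of soft balls.
        `candidates` is a list of (center, center_label, a, b) tuples
        (e.g. with b - a = gamma); at each step the candidate with the
        largest added utility is appended."""
        balls = []
        X_work, y_work = list(X), list(y)
        while len(balls) < max_balls and any(yj == 0 for yj in y_work):
            best = max(candidates,
                       key=lambda c: utility(balls, c, X_work, y_work, p))
            balls.append(best)
            # drop examples that are now totally covered (output 0 with certainty)
            kept = [(xj, yj) for xj, yj in zip(X_work, y_work)
                    if gibbs_output(xj, balls) > 0.0]
            X_work = [xj for xj, _ in kept]
            y_work = [yj for _, yj in kept]
        return balls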
80 The term Σi∈i ln(R/(bi − ai)), present in the risk bound of Theorem 2, favors “soft balls” having large margins bi − ai. [sent-155, score-1.182]
81 Hence, we introduce a margin parameter γ ≥ 0 that we use as follows. [sent-156, score-0.068]
82 At each greedy step, we first search among balls having bi − ai = γ. [sent-157, score-0.601]
83 Once such a ball, of center xi, having maximum utility has been found, we try to further increase its utility by searching among all possible values of ai and bi > ai while keeping its center xi fixed². [sent-158, score-0.89]
84 We conclude this section with an analysis of the running time of this soft greedy learning algorithm for fixed p and γ. [sent-160, score-0.1]
85 For each potential ball center, we first sort the m − 1 other examples with respect to their distances from the center in O(m log m) time. [sent-161, score-0.22]
86 Then, for this center xi, the set of ai values that we examine is given by the distances (from xi) of the m − 1 sorted examples³. [sent-162, score-0.286]
87 Since the examples are sorted, it takes time ∈ O(km) to compute the covering contributions and the positive-side error for all the m − 1 values of ai . [sent-163, score-0.327]
88 It therefore takes time ∈ O(m log m) to compute the utility values of all the m − 1 different balls of a given center. [sent-166, score-0.319]
89 Once a ball with the largest utility value has been chosen, we then try to further increase its utility by searching among O(m²) pair values for (ai, bi). [sent-168, score-0.54]
90 We then remove the examples covered by this ball and repeat the algorithm on the remaining examples. [sent-169, score-0.23]
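For a given center, the candidate soft balls examined at the first stage can be enumerated from the sorted distances as described above; taking the sorted distances themselves as the ai values, with bi = ai + γ, follows the text, while the names are illustrative.

    import numpy as np

    def candidate_balls_for_center(center, center_label, X, gamma):
        """Candidate (center, label, a_i, b_i) tuples for one potential center:
        the a_i values are the sorted distances of the other examples from
        the center, and b_i = a_i + gamma at this stage."""
        dists = sorted(float(np.linalg.norm(xj - center)) for xj in X
                       if not np.array_equal(xj, center))
        return [(center, center_label, a, a + gamma) for a in dists]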
91 It is well known that greedy algorithms of this kind have the following guarantee: if there exist r balls that cover all the m examples, the greedy algorithm will find at most r ln(m) balls. [sent-170, score-0.325]
92 Both of these algorithms were also compared with the SVM equipped with an RBF kernel of variance σ² and a soft-margin parameter C. [sent-173, score-0.114]
93 (2005), each SCM was constrained to use only balls having centers of the same class (negative for conjunctions and positive for disjunctions). [sent-176, score-0.244]
94 ² The possible values for ai and bi are defined by the location of the training points. [sent-177, score-0.388]
95 ³ Recall that for each value of ai, the value of bi is set to ai + γ at this stage. [sent-178, score-0.497]
96 [Table: for each data set (breastw, bupa, credit, glass, heart, haberman, USvotes), the train/test split sizes and the results of the SVM (C, σ², number of SVs, errors), the SCM (number of balls b, errors), and SCM-PB (number of balls b, margin γ, errors).] [sent-180, score-0.118] [sent-183, score-0.12]
98 About half of the examples were used for training and the remaining examples were used for testing. [sent-191, score-0.168]
99 In addition to this, the “b” and “γ” columns refer, respectively, to the number of balls and the margin parameter (divided by the average distance between the positive and the negative examples). [sent-199, score-0.334]
100 We also observe that SCM-PB generally sacrifices sparsity (compared to SCM) to obtain some margin γ > 0. [sent-204, score-0.101]
wordName wordTfidf (topN-words)
[('ab', 0.441), ('scm', 0.259), ('gi', 0.248), ('gk', 0.233), ('def', 0.23), ('balls', 0.217), ('er', 0.175), ('gqi', 0.173), ('risk', 0.171), ('ai', 0.167), ('bi', 0.163), ('classi', 0.145), ('rs', 0.145), ('bab', 0.138), ('laviolette', 0.138), ('marchand', 0.138), ('gibbs', 0.137), ('qi', 0.137), ('ball', 0.134), ('pi', 0.119), ('radius', 0.107), ('covering', 0.105), ('utility', 0.102), ('gab', 0.086), ('kl', 0.085), ('ln', 0.084), ('ers', 0.08), ('si', 0.074), ('mario', 0.069), ('margin', 0.068), ('compression', 0.065), ('conjunction', 0.062), ('errs', 0.06), ('training', 0.058), ('xj', 0.057), ('examples', 0.055), ('greedy', 0.054), ('yj', 0.053), ('string', 0.052), ('fran', 0.052), ('gq', 0.052), ('ois', 0.052), ('ci', 0.047), ('svm', 0.047), ('indices', 0.046), ('soft', 0.046), ('bound', 0.046), ('learner', 0.045), ('xi', 0.044), ('message', 0.044), ('theorem', 0.042), ('try', 0.039), ('yi', 0.039), ('posterior', 0.038), ('divergence', 0.037), ('bupa', 0.035), ('cab', 0.035), ('mohak', 0.035), ('ottawa', 0.035), ('svs', 0.035), ('uab', 0.035), ('consequently', 0.034), ('bayes', 0.033), ('sparsity', 0.033), ('center', 0.031), ('ui', 0.031), ('surface', 0.031), ('pr', 0.03), ('whenever', 0.03), ('disjunction', 0.03), ('su', 0.029), ('assigned', 0.029), ('nitions', 0.029), ('ni', 0.028), ('misclassi', 0.028), ('negative', 0.028), ('margins', 0.027), ('pac', 0.027), ('centers', 0.027), ('hence', 0.026), ('non', 0.026), ('equally', 0.026), ('prior', 0.026), ('cv', 0.026), ('mcallester', 0.026), ('boser', 0.024), ('choose', 0.024), ('heart', 0.023), ('sacri', 0.023), ('hi', 0.023), ('separating', 0.023), ('called', 0.023), ('metric', 0.022), ('according', 0.021), ('covered', 0.021), ('totally', 0.021), ('distance', 0.021), ('label', 0.021), ('remove', 0.02), ('posteriors', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
Author: François Laviolette, Mario Marchand, Mohak Shah
Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. 1
2 0.10708802 95 nips-2005-Improved risk tail bounds for on-line algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We prove the strongest known bound for the risk of hypotheses selected from the ensemble generated by running a learning algorithm incrementally on the training data. Our result is based on proof techniques that are remarkably different from the standard risk analysis based on uniform convergence arguments.
3 0.096594237 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
4 0.091271445 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
Author: Anatoli Juditsky, Alexander Nazin, Alexandre Tsybakov, Nicolas Vayatis
Abstract: We consider the problem of constructing an aggregated estimator from a finite class of base functions which approximately minimizes a convex risk functional under the ℓ1 constraint. For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent in the dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms have been developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover their implementation is adapted to the online setting. The main result of the paper is the upper bound on the convergence rate for the generalization error. 1
5 0.085490048 47 nips-2005-Consistency of one-class SVM and related algorithms
Author: Régis Vert, Jean-philippe Vert
Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
6 0.082163192 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
7 0.080159791 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
8 0.076901428 50 nips-2005-Convex Neural Networks
9 0.076183863 58 nips-2005-Divergences, surrogate loss functions and experimental design
10 0.076162927 131 nips-2005-Multiple Instance Boosting for Object Detection
11 0.076040432 102 nips-2005-Kernelized Infomax Clustering
12 0.074349828 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
13 0.072442316 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
14 0.071337245 195 nips-2005-Transfer learning for text classification
15 0.070412807 144 nips-2005-Off-policy Learning with Options and Recognizers
16 0.067213856 78 nips-2005-From Weighted Classification to Policy Search
17 0.067003772 114 nips-2005-Learning Rankings via Convex Hull Separation
18 0.066789046 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
19 0.063465364 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
20 0.06288071 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
topicId topicWeight
[(0, 0.189), (1, 0.083), (2, -0.016), (3, -0.085), (4, 0.074), (5, 0.125), (6, 0.043), (7, 0.054), (8, -0.068), (9, -0.018), (10, -0.041), (11, -0.028), (12, 0.059), (13, 0.023), (14, 0.059), (15, 0.144), (16, -0.001), (17, -0.005), (18, 0.019), (19, 0.015), (20, -0.012), (21, 0.074), (22, 0.027), (23, 0.073), (24, 0.014), (25, 0.073), (26, 0.013), (27, 0.112), (28, 0.045), (29, 0.086), (30, -0.025), (31, 0.14), (32, -0.016), (33, -0.029), (34, -0.003), (35, -0.077), (36, 0.042), (37, -0.149), (38, 0.022), (39, -0.115), (40, -0.021), (41, 0.008), (42, 0.142), (43, -0.043), (44, 0.103), (45, 0.064), (46, -0.154), (47, -0.081), (48, 0.076), (49, -0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.94401383 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
Author: François Laviolette, Mario Marchand, Mohak Shah
Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. 1
2 0.49280241 195 nips-2005-Transfer learning for text classification
Author: Chuong B. Do, Andrew Y. Ng
Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form, function of training set statistics; we call this mapping mapping from statistics to parameters, the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks. 1
3 0.48090863 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
Author: Lacey Gunter, Ji Zhu
Abstract: In this paper we derive an algorithm that computes the entire solution path of the support vector regression, with essentially the same computational cost as fitting one SVR model. We also propose an unbiased estimate for the degrees of freedom of the SVR model, which allows convenient selection of the regularization parameter. 1
4 0.47691661 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
5 0.4721759 4 nips-2005-A Bayesian Spatial Scan Statistic
Author: Daniel B. Neill, Andrew W. Moore, Gregory F. Cooper
Abstract: We propose a new Bayesian method for spatial cluster detection, the “Bayesian spatial scan statistic,” and compare this method to the standard (frequentist) scan statistic approach. We demonstrate that the Bayesian statistic has several advantages over the frequentist approach, including increased power to detect clusters and (since randomization testing is unnecessary) much faster runtime. We evaluate the Bayesian and frequentist methods on the task of prospective disease surveillance: detecting spatial clusters of disease cases resulting from emerging disease outbreaks. We demonstrate that our Bayesian methods are successful in rapidly detecting outbreaks while keeping number of false positives low. 1
6 0.43812692 114 nips-2005-Learning Rankings via Convex Hull Separation
7 0.43405497 50 nips-2005-Convex Neural Networks
8 0.4291884 131 nips-2005-Multiple Instance Boosting for Object Detection
9 0.41902578 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
10 0.41857025 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
11 0.40515852 160 nips-2005-Query by Committee Made Real
12 0.39846301 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
13 0.39160675 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
14 0.37691185 95 nips-2005-Improved risk tail bounds for on-line algorithms
15 0.37623298 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
16 0.3743681 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
17 0.37394476 151 nips-2005-Pattern Recognition from One Example by Chopping
18 0.371923 146 nips-2005-On the Accuracy of Bounded Rationality: How Far from Optimal Is Fast and Frugal?
19 0.36966318 62 nips-2005-Efficient Estimation of OOMs
20 0.36917186 70 nips-2005-Fast Information Value for Graphical Models
topicId topicWeight
[(3, 0.066), (10, 0.019), (27, 0.011), (31, 0.024), (34, 0.62), (39, 0.013), (50, 0.012), (55, 0.023), (69, 0.03), (73, 0.013), (88, 0.051), (91, 0.025)]
simIndex simValue paperId paperTitle
1 0.9905296 2 nips-2005-A Bayes Rule for Density Matrices
Author: Manfred K. Warmuth
Abstract: The classical Bayes rule computes the posterior model probability from the prior probability and the data likelihood. We generalize this rule to the case when the prior is a density matrix (symmetric positive definite and trace one) and the data likelihood a covariance matrix. The classical Bayes rule is retained as the special case when the matrices are diagonal. In the classical setting, the calculation of the probability of the data is an expected likelihood, where the expectation is over the prior distribution. In the generalized setting, this is replaced by an expected variance calculation where the variance is computed along the eigenvectors of the prior density matrix and the expectation is over the eigenvalues of the density matrix (which form a probability vector). The variances along any direction is determined by the covariance matrix. Curiously enough this expected variance calculation is a quantum measurement where the co-variance matrix specifies the instrument and the prior density matrix the mixture state of the particle. We motivate both the classical and the generalized Bayes rule with a minimum relative entropy principle, where the Kullbach-Leibler version gives the classical Bayes rule and Umegaki’s quantum relative entropy the new Bayes rule for density matrices. 1
same-paper 2 0.98572689 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
Author: François Laviolette, Mario Marchand, Mohak Shah
Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. 1
3 0.9735226 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lankriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constraint quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm helps for automatic model selection, improving the interpretability of the learning result and works for hundred thousands of examples or hundreds of kernels to be combined. 1
4 0.97272491 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1
5 0.95010227 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning
Author: Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, Yann L. Cun
Abstract: We describe a vision-based obstacle avoidance system for off-road mobile robots. The system is trained from end to end to map raw input images to steering angles. It is trained in supervised mode to predict the steering angles provided by a human driver during training runs collected in a wide variety of terrains, weather conditions, lighting conditions, and obstacle types. The robot is a 50cm off-road truck, with two forwardpointing wireless color cameras. A remote computer processes the video and controls the robot via radio. The learning system is a large 6-layer convolutional network whose input is a single left/right pair of unprocessed low-resolution images. The robot exhibits an excellent ability to detect obstacles and navigate around them in real time at speeds of 2 m/s.
6 0.81369388 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
7 0.78952181 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
8 0.77512318 114 nips-2005-Learning Rankings via Convex Hull Separation
9 0.7691859 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
10 0.76430589 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
11 0.76336932 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
12 0.75660419 105 nips-2005-Large-Scale Multiclass Transduction
13 0.74982488 77 nips-2005-From Lasso regression to Feature vector machine
14 0.74174124 184 nips-2005-Structured Prediction via the Extragradient Method
15 0.73277134 50 nips-2005-Convex Neural Networks
16 0.73275256 58 nips-2005-Divergences, surrogate loss functions and experimental design
17 0.72663718 47 nips-2005-Consistency of one-class SVM and related algorithms
18 0.72127002 126 nips-2005-Metric Learning by Collapsing Classes
19 0.71404374 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
20 0.71375793 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models