nips nips2005 nips2005-12 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: François Laviolette, Mario Marchand, Mohak Shah
Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. 1
Reference: text
sentIndex sentText sentNum sentScore
1 ca Abstract We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. [sent-7, score-0.243]
2 1 Introduction Learning algorithms try to produce classifiers with small prediction error by optimizing some function that can be computed from a training set of examples and a classifier. [sent-8, score-0.152]
3 At one end of the spectrum, we have the set covering machine (SCM), proposed by Marchand and Shawe-Taylor (2002), that tries to find the sparsest classifier making few training errors. [sent-10, score-0.163]
4 At the other end of the spectrum, we have the support vector machine (SVM), proposed by Boser et al. (1992), which tries to find the maximum soft-margin separating hyperplane on the training data. [sent-12, score-0.081]
5 , 2005) if better classifiers could be found by learning algorithms that try to optimize a non-trivial function that depends on both the sparsity of a classifier and the magnitude of its separating margin. [sent-14, score-0.095]
6 Our main result was a general data-compression risk bound that applies to any algorithm producing classifiers represented by two complementary sources of information: a subset of the training set, called the compression set, and a message string of additional information. [sent-15, score-0.459]
7 In addition, we proposed a new algorithm for the SCM where the information string was used to encode radius values for data-dependent balls and, consequently, the location of the decision surface of the classifier. [sent-16, score-0.407]
8 Since a small message string is sufficient when large regions of equally good radius values exist for balls, the data compression risk bound applied to this version of the SCM exhibits, indirectly, a non-trivial margin-sparsity trade-off. [sent-17, score-0.511]
9 However, this version of the SCM currently suffers from the fact that the radius values used in the final classifier depend on an a priori chosen distance scale R. [sent-18, score-0.128]
10 Moreover, we propose a risk bound that depends more explicitly on the margin and which is also minimized by classifiers achieving a non-trivial margin-sparsity trade-off. [sent-20, score-0.285]
11 2 Definitions We consider binary classification problems where the input space X consists of an arbitrary subset of R^n and the output space is Y def= {0, 1}. [sent-21, score-0.23]
12 The (true) risk R(f) of a classifier f : X → Y is defined to be the probability that f misclassifies an example drawn according to D: R(f) def= Pr(x,y)∼D (f(x) ≠ y) = E(x,y)∼D I(f(x) ≠ y), where I(a) = 1 if predicate a is true and 0 otherwise. [sent-24, score-0.422]
13 Given a training set S = (z1, . . . , zm) of m examples, the task of a learning algorithm is to construct a classifier with the smallest possible risk without any information about D. [sent-28, score-0.171]
14 Each data-dependent ball is defined by a center and a radius value. [sent-30, score-0.272]
15 The center is an input example xi chosen from the training set S. [sent-31, score-0.133]
16 For any test example x, the output of a ball hi,ρ of radius ρ, centered on example xi, is given by hi,ρ(x) def= yi if d(x, xi) ≤ ρ, and ȳi otherwise, where ȳi denotes the boolean complement of yi and d(x, xi) denotes the distance between the two points. [sent-32, score-0.78]
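The following is a minimal Python sketch of the data-dependent ball hi,ρ just defined; the Euclidean metric and the names (ball_output, center, center_label, radius) are illustrative choices, not notation from the paper.

    import numpy as np

    def ball_output(x, center, center_label, radius):
        """Output of a data-dependent ball h_{i,rho}(x): the label y_i of the
        center if x lies inside the ball, its boolean complement otherwise."""
        if np.linalg.norm(x - center) <= radius:   # d(x, x_i) <= rho
            return center_label
        return 1 - center_label                    # boolean complement of y_i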
17 To specify a conjunction of balls we first need to list all the examples that participate as centers for the balls in the conjunction. [sent-34, score-0.808]
18 The indices of these centers form a vector i def= (i1, i2, . . . , i|i|) with i1 < i2 < · · · < i|i|, where |i| is the number of indices present in i (and thus the number of balls in the conjunction). [sent-44, score-0.263]
19 To complete the specification of a conjunction of balls, we need a vector ρ def= (ρi1, ρi2, . . . , ρi|i|) of radius values, one for each ball. [sent-45, score-0.062]
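Building on the ball_output sketch above, a conjunction of balls Ci,ρ outputs 1 only when every ball outputs 1; the representation of a conjunction as a list of (center, center_label, radius) triples is an illustrative assumption.

    def conjunction_output(x, balls):
        """C_{i,rho}(x): 1 iff every ball in the conjunction outputs 1.
        `balls` is a list of (center, center_label, radius) triples,
        one triple per index in the vector i."""
        return int(all(ball_output(x, c, y, r) == 1 for (c, y, r) in balls))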
20 A remarkable result that came out of this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier called the Gibbs classifier. [sent-59, score-0.24]
21 We first choose a classifier h according to the posterior distribution Q and then use h to assign the label h(x) to x. [sent-61, score-0.104]
22 However, for all these versions of the PAC-Bayes theorem, the prior P must be defined without reference to the training data. [sent-63, score-0.084]
23 In the sample compression setting, each classifier is described by a subset Si of the training data, called the compression set, and a message string σ that represents the additional information needed to obtain a classifier. [sent-65, score-0.307]
24 In other words, in this setting, there exists a reconstruction function R that outputs a classifier R(σ, Si ) when given an arbitrary compression set Si and a message string σ. [sent-66, score-0.161]
25 Given a training set S, the compression set Si ⊆ S is defined by a vector of indices i def= (i1, . . . , i|i|). [sent-67, score-0.399]
26 For the case of a conjunction of balls, each j ∈ i will point to a training example that is used for a ball center, and the message string σ will be the vector ρ of radius values (defined above) that are used for the balls. [sent-71, score-0.488]
27 Hence, given Si and ρ , the classifier obtained from R(ρ , Si ) is just the conjunction Ci,ρ defined previously. [sent-72, score-0.062]
28 Their proposed risk bound depends on a data-independent prior P and a data-dependent posterior Q that are both defined on I × M, where I denotes the set of the 2^m possible index vectors i and M denotes, in our case, the set of possible radius vectors ρ. [sent-74, score-0.388]
29 Given a training set S and given a new (testing) input example x, a sample-compressed Gibbs classifier GQ chooses randomly (i, ρ) according to Q to obtain a classifier R(ρ, Si) which is then used to determine the class label of x. [sent-76, score-0.1]
30 In this paper we focus on the case where, given any training set S, the learner returns a Gibbs classifier defined with a posterior distribution Q having all its weight on a single vector i. [sent-77, score-0.141]
31 Hence, a single compression set Si will be used for the final classifier. [sent-78, score-0.065]
32 However, the radius ρi for each i ∈ i will be chosen stochastically according to the posterior Q. [sent-79, score-0.166]
33 Hence we consider posteriors Q such that Q(i′, ρ) = I(i′ = i) Qi(ρ), where i is the vector of indices chosen by the learner. [sent-80, score-0.066]
34 Hence, given a training set S, the true risk R(GQi) of GQi and its empirical risk RS(GQi) are defined by R(GQi) def= Eρ∼Qi R(R(ρ, Si)) and RS(GQi) def= Eρ∼Qi RSī(R(ρ, Si)), where ī denotes the set of indices not present in i (so that RSī is the empirical risk measured on the examples of S outside the compression set). [sent-81, score-0.906]
35 In contrast with the posterior Q, the prior P assigns a non-zero weight to several vectors i. [sent-86, score-0.09]
36 Let PI(i) denote the prior probability that P assigns to vector i, and let Pi(ρ) denote the prior probability assigned to the radius vector ρ given i, so that P(i, ρ) = PI(i)Pi(ρ).¹ (¹ We assume that the examples in Si are ordered as in S so that the kth radius value in ρ is assigned to the kth example in Si.) [sent-87, score-0.246]
37 The risk bound depends on the Kullback-Leibler divergence KL(Q‖P) between the posterior Q and the prior P which, in our case, gives KL(Qi‖P) = Eρ∼Qi ln [ Qi(ρ) / (PI(i)Pi(ρ)) ]. [sent-89, score-0.402]
38 For these classes of posteriors Q and priors P, the PAC-Bayes theorem of Laviolette and Marchand (2005) reduces to the following simpler version. [sent-90, score-0.062]
39 Theorem 1 (Laviolette and Marchand (2005)) Given all our previous definitions, for any prior P and for any δ ∈ (0, 1]: PrS∼Dm ( ∀Qi : kl(RS(GQi) ‖ R(GQi)) ≤ [1/(m − |i|)] [ KL(Qi‖P) + ln((m + 1)/δ) ] ) ≥ 1 − δ, where kl(q‖p) def= q ln(q/p) + (1 − q) ln((1 − q)/(1 − p)). [sent-91, score-0.508]
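In practice, the bound of Theorem 1 is used by numerically inverting the binary kl divergence: given the empirical Gibbs risk q and the right-hand side ε, the guaranteed true risk is the largest p with kl(q‖p) ≤ ε. Below is a minimal sketch of this inversion by bisection; the function names and tolerance are my own choices.

    import math

    def kl_bernoulli(q, p):
        """kl(q||p) = q ln(q/p) + (1-q) ln((1-q)/(1-p)), with 0 ln 0 = 0."""
        eps = 1e-12
        p = min(max(p, eps), 1.0 - eps)
        term1 = 0.0 if q == 0 else q * math.log(q / p)
        term2 = 0.0 if q == 1 else (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
        return term1 + term2

    def kl_inverse_upper(q, epsilon, tol=1e-9):
        """Largest p in [q, 1) such that kl(q||p) <= epsilon (bisection)."""
        lo, hi = q, 1.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if kl_bernoulli(q, mid) <= epsilon:
                lo = mid
            else:
                hi = mid
        return lo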
40 For PI(i) we use a prior of the form PI(i) = p(|i|) / (m choose |i|) for some distribution p(d) over the number of balls; but since the risk bound will deteriorate for large |i|, it is generally preferable to choose, for p(d), a slowly decreasing function of d. [sent-98, score-0.217]
41 For the specification of Pi(ρ), we assume that each radius value, in some predefined interval [0, R], is equally likely to be chosen for each ρi such that i ∈ i. [sent-99, score-0.133]
42 For Qi (ρ ), a margin interval [ai , bi ] ⊆ [0, R] of equally good radius values is chosen by the learner for each i ∈ i. [sent-101, score-0.409]
43 Hence Qi(ρ) = ∏i∈i 1/(bi − ai) whenever each ρi ∈ [ai, bi] (and zero otherwise). Therefore, the Gibbs classifier returned by the learner will draw each radius ρi uniformly in [ai, bi]. [sent-103, score-0.645]
44 A deterministic classifier is then specified by fixing each radius value ρi ∈ [ai, bi]. [sent-104, score-0.27]
45 It is tempting at this point to choose ρi = (ai + bi)/2 ∀i ∈ i (i.e., the middle of each margin interval). [sent-105, score-0.187]
46 Consequently, with these choices for Qi(ρ), PI(i), and Pi(ρ), the KL divergence between Qi and P is given by KL(Qi‖P) = ln (m choose |i|) + ln(1/p(|i|)) + Σi∈i ln(R/(bi − ai)). [sent-109, score-0.619]
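This KL term can be evaluated directly, as in the sketch below; it assumes the prior PI(i) = p(|i|)/(m choose |i|) mentioned above, with p(d) passed in as a function, and all names are illustrative.

    import math

    def kl_posterior_prior(m, intervals, R, p_of_d):
        """KL(Q_i||P) = ln C(m,|i|) + ln(1/p(|i|)) + sum_i ln(R/(b_i - a_i)).
        `intervals` lists the margin intervals [a_i, b_i], one per ball;
        `p_of_d` is the distribution p(d) over the number of balls."""
        d = len(intervals)                                     # |i|
        log_binom = (math.lgamma(m + 1) - math.lgamma(d + 1)
                     - math.lgamma(m - d + 1))                 # ln C(m, d)
        return (log_binom + math.log(1.0 / p_of_d(d))
                + sum(math.log(R / (b - a)) for (a, b) in intervals))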
47 Notice that the KL divergence is small for small values of |i| (whenever p(|i|) is not too small) and for large margin values (bi − ai ). [sent-110, score-0.272]
48 Hence, the KL divergence term in Theorem 1 favors both sparsity (small |i|) and large margins. [sent-111, score-0.07]
49 Hence, in practice, the minimum might occur for some GQi that sacrifices sparsity whenever larger margins can be found. [sent-112, score-0.09]
50 Since the posterior Q is identified by i and by the intervals [ai, bi] ∀i ∈ i, we will now refer to the Gibbs classifier GQi by G^i_ab, where a and b are the vectors formed by the ai's and bi's respectively. [sent-113, score-0.972]
51 To obtain a risk bound for G^i_ab, we need to find a closed-form expression for RS(G^i_ab). [sent-114, score-0.658]
52 Therefore, ζ^i_{a,b}(x) def= Prρ∼U[a,b](hi,ρ(x) = 1) = σ^i_{a,b}(x) if yi = 1, and 1 − σ^i_{a,b}(x) if yi = 0, where σ^i_{a,b}(x) = Prρ∼U[a,b](d(x, xi) ≤ ρ) is a piecewise-linear function of the distance d(x, xi). [sent-116, score-0.308]
53 Now let G^i_ab(x) denote the probability that Ci,ρ(x) = 1 when each ρi in ρ is drawn according to U[ai, bi]. [sent-117, score-0.625]
54 Hence G^i_ab(x) = ∏i∈i ζ^i_{ai,bi}(x). Consequently, the risk R(x,y)(G^i_ab) on a single example (x, y) is given by G^i_ab(x) if y = 0 and by 1 − G^i_ab(x) otherwise. [sent-119, score-1.053]
55 Therefore R(x,y)(G^i_ab) = y(1 − G^i_ab(x)) + (1 − y)G^i_ab(x) = (1 − 2y)(G^i_ab(x) − y). [sent-120, score-0.441]
56 Hence, the empirical risk RS(G^i_ab) of the Gibbs classifier G^i_ab is given by RS(G^i_ab) = [1/(m − |i|)] Σj∈ī (1 − 2yj)(G^i_ab(xj) − yj). [sent-121, score-2.87]
57 From this expression we see that RS(G^i_ab) is small when G^i_ab(xj) → yj ∀j ∈ ī. [sent-122, score-0.494]
58 Training points where G^i_ab(xj) ≈ 1/2 should therefore be avoided. [sent-123, score-0.882]
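The soft quantities above have the following direct implementation. The sketch below assumes the Euclidean metric and takes σ^i_{a,b}(x) = Prρ∼U[a,b](d(x, xi) ≤ ρ), the piecewise-linear form implied by the definition of ζ; all names are illustrative.

    import numpy as np

    def sigma(x, center, a, b):
        """Pr_{rho~U[a,b]}(d(x, center) <= rho): 1 below a, 0 above b,
        linear in between."""
        d = np.linalg.norm(x - center)
        if b <= a:                        # degenerate interval: fixed radius
            return 1.0 if d <= a else 0.0
        return float(np.clip((b - d) / (b - a), 0.0, 1.0))

    def zeta(x, center, center_label, a, b):
        """Probability that the stochastic ball outputs 1."""
        s = sigma(x, center, a, b)
        return s if center_label == 1 else 1.0 - s

    def gibbs_output(x, soft_balls):
        """G^i_ab(x): probability that the conjunction outputs 1 when each
        rho_i ~ U[a_i, b_i]; `soft_balls` lists (center, label, a_i, b_i)."""
        g = 1.0
        for (c, y, a, b) in soft_balls:
            g *= zeta(x, c, y, a, b)
        return g

    def empirical_gibbs_risk(soft_balls, X_rest, y_rest):
        """RS(G^i_ab) over the examples not used as ball centers."""
        terms = [(1 - 2 * yj) * (gibbs_output(xj, soft_balls) - yj)
                 for xj, yj in zip(X_rest, y_rest)]
        return sum(terms) / len(terms)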
59 The PAC-Bayes theorem below provides a risk bound for the Gibbs classifier G^i_ab. [sent-124, score-0.7]
60 Since the Bayes classifier B^i_ab just performs a majority vote under the same posterior distribution as the one used by G^i_ab, we have that B^i_ab(x) = 1 iff G^i_ab(x) > 1/2. [sent-125, score-0.479]
61 From the above definitions, note that the decision surface of the Bayes classifier, given by G^i_ab(x) = 1/2, differs from the decision surface of the classifier Ci,ρ with ρi = (ai + bi)/2 ∀i ∈ i. [sent-126, score-1.548]
62 Theorem 2 Given all our previous definitions, for any δ ∈ (0, 1], for any p satisfying Σ_{d=0}^m p(d) = 1, and for any fixed distance value R, we have: PrS∼Dm ( ∀i, a, b : R(G^i_ab) ≤ sup { r : kl(RS(G^i_ab) ‖ r) ≤ [1/(m − |i|)] [ ln (m choose |i|) + ln(1/p(|i|)) + Σi∈i ln(R/(bi − ai)) + ln((m + 1)/δ) ] } ) ≥ 1 − δ. Furthermore: R(B^i_ab) ≤ 2 R(G^i_ab) ∀i, a, b. [sent-131, score-2.283]
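Assembling the pieces, the right-hand side of Theorem 2 can be evaluated by combining the explicit KL term with the inverted binary kl divergence; the sketch below reuses kl_posterior_prior and kl_inverse_upper from the earlier snippets and is my own assembly of the stated bound, not code from the paper.

    import math

    def theorem2_bound(emp_gibbs_risk, m, intervals, R, p_of_d, delta):
        """Upper bound on R(G^i_ab) holding with probability >= 1 - delta."""
        d = len(intervals)                                    # |i|
        epsilon = (kl_posterior_prior(m, intervals, R, p_of_d)
                   + math.log((m + 1) / delta)) / (m - d)
        return kl_inverse_upper(emp_gibbs_risk, epsilon)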
63 Recall that the KL divergence is small for small values of |i| (whenever p(|i|) is not too small) and for large margin values (bi − ai ). [sent-133, score-0.272]
64 Furthermore, the Gibbs empirical risk RS(G^i_ab) is small when the training points are located far away from the Bayes decision surface G^i_ab(x) = 1/2 (with G^i_ab(xj) → yj ∀j ∈ ī). [sent-134, score-0.754]
65 Consequently, the Gibbs classifier with the smallest risk guarantee should perform a non-trivial margin-sparsity trade-off. [sent-135, score-1.079]
66 4 A Soft Greedy Learning Algorithm Theorem 2 suggests that the learner should try to find the Bayes classifier B^i_ab that uses a small number of balls. [sent-136, score-0.301]
67 That is, a small |i|, each ball having a large separating margin (bi − ai), while keeping the empirical Gibbs risk RS(G^i_ab) at a low value. [sent-138, score-0.429]
68 To achieve this goal, we have adapted the greedy algorithm for the set covering machine (SCM) proposed by Marchand and Shawe-Taylor (2002). [sent-139, score-0.6]
69 Once the feature with the largest utility Ui is found, we remove Ni (the negative examples it covers) and Pi (the positive examples it misclassifies) from the training set S and then repeat (on the remaining examples) until either no more negative examples are present or a maximum number of features has been reached. [sent-141, score-0.161]
70 In our case, however, we need to keep the Gibbs risk on S low instead of the risk of a deterministic classifier. [sent-142, score-0.342]
71 Since the Gibbs risk is a “soft measure” that uses the piecewise-linear functions σ^i_{a,b} instead of “hard” indicator functions, we need a “softer” version of the utility function Ui. [sent-143, score-0.273]
72 Following this observation, let k be the vector of indices of the examples that we have used as ball centers so far for the construction of the classifier. [sent-145, score-0.262]
73 Let us first define the covering value C(G^k_ab) of G^k_ab by the “amount” of negative examples assigned to class 0 by G^k_ab: C(G^k_ab) def= Σj∈k̄ (1 − yj)(1 − G^k_ab(xj)). [sent-146, score-1.823]
74 We also define the positive-side error E(G^k_ab) of G^k_ab as the “amount” of positive examples assigned to class 0: E(G^k_ab) def= Σj∈k̄ yj (1 − G^k_ab(xj)). [sent-147, score-2.131]
75 We now want to add another ball, centered on an example with index i, to obtain a new vector k′ containing this new index in addition to those present in k. [sent-148, score-0.441]
76 Typically, the covering contribution of ball i should increase its “utility” and its positive-side error should decrease it. [sent-150, score-0.68]
77 Hence, we define the utility U^k_ab(i) of adding ball i to G^k_ab as U^k_ab(i) def= C^k_ab(i) − p E^k_ab(i), where the parameter p represents the penalty of misclassifying a positive example. [sent-151, score-0.466]
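A sketch of these soft quantities follows, building on gibbs_output above. Here the covering contribution C^k_ab(i) and positive-side error contribution E^k_ab(i) of a candidate ball are taken to be the increase in covering value and in positive-side error obtained by adding that ball to the current conjunction; this reading, and all names, are assumptions.

    def covering_value(soft_balls, X, y):
        """C(G^k_ab): "amount" of negative examples assigned to class 0."""
        return sum((1 - yj) * (1.0 - gibbs_output(xj, soft_balls))
                   for xj, yj in zip(X, y))

    def positive_side_error(soft_balls, X, y):
        """E(G^k_ab): "amount" of positive examples assigned to class 0."""
        return sum(yj * (1.0 - gibbs_output(xj, soft_balls))
                   for xj, yj in zip(X, y))

    def utility(soft_balls, candidate, X, y, p):
        """U^k_ab(i) = C^k_ab(i) - p * E^k_ab(i) for a candidate soft ball."""
        new_balls = soft_balls + [candidate]
        c_gain = covering_value(new_balls, X, y) - covering_value(soft_balls, X, y)
        e_gain = (positive_side_error(new_balls, X, y)
                  - positive_side_error(soft_balls, X, y))
        return c_gain - p * e_gain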
78 For a fixed value of p, the “soft greedy” algorithm simply consists of adding, to the current Gibbs classifier, a ball with maximum added utility until either the maximum number of possible features (balls) has been reached or all the negative examples have been (totally) covered. [sent-152, score-0.319]
79 It is understood that, during this soft greedy algorithm, we can remove an example (xj , yj ) from S whenever it is totally covered. [sent-153, score-0.224]
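The overall soft greedy loop can then be sketched as follows for fixed p and γ; the stopping rule, the interpretation of "totally covered" as G^k_ab(xj) = 0, and all names are illustrative assumptions rather than the paper's exact procedure.

    def soft_greedy_scm(X, y, candidates, p, max_balls):
        """Greedy construction of a conjunction of soft balls.
        `candidates` is a list of (center, center_label, a, b) tuples
        (e.g. with b - a = gamma); at each step the candidate with the
        largest added utility is appended."""
        balls = []
        X_work, y_work = list(X), list(y)
        while len(balls) < max_balls and any(yj == 0 for yj in y_work):
            best = max(candidates,
                       key=lambda c: utility(balls, c, X_work, y_work, p))
            balls.append(best)
            # drop examples that are now totally covered (output 0 with certainty)
            kept = [(xj, yj) for xj, yj in zip(X_work, y_work)
                    if gibbs_output(xj, balls) > 0.0]
            X_work = [xj for xj, _ in kept]
            y_work = [yj for _, yj in kept]
        return balls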
80 The term Σi∈i ln(R/(bi − ai)), present in the risk bound of Theorem 2, favors “soft balls” having large margins bi − ai. [sent-155, score-1.182]
81 Hence, we introduce a margin parameter γ ≥ 0 that we use as follows. [sent-156, score-0.068]
82 At each greedy step, we first search among balls having bi − ai = γ. [sent-157, score-0.601]
83 Once such a ball, of center xi, having maximum utility has been found, we try to further increase its utility by searching among all possible values of ai and bi > ai while keeping its center xi fixed². [sent-158, score-0.89]
84 We conclude this section with an analysis of the running time of this soft greedy learning algorithm for fixed p and γ. [sent-160, score-0.1]
85 For each potential ball center, we first sort the m − 1 other examples with respect to their distances from the center in O(m log m) time. [sent-161, score-0.22]
86 Then, for this center xi, the set of ai values that we examine is given by the distances (from xi) of the m − 1 sorted examples³. [sent-162, score-0.286]
87 Since the examples are sorted, it takes time ∈ O(km) to compute the covering contributions and the positive-side error for all the m − 1 values of ai . [sent-163, score-0.327]
88 It therefore takes time ∈ O(m log m) to compute the utility values of all the m − 1 different balls of a given center. [sent-166, score-0.319]
89 Once a ball with the largest utility value has been chosen, we then try to further increase its utility by searching among O(m²) pair values for (ai, bi). [sent-168, score-0.54]
90 We then remove the examples covered by this ball and repeat the algorithm on the remaining examples. [sent-169, score-0.23]
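For a given center, the candidate soft balls examined at the first stage can be enumerated from the sorted distances as described above; taking the sorted distances themselves as the ai values, with bi = ai + γ, follows the text, while the names are illustrative.

    import numpy as np

    def candidate_balls_for_center(center, center_label, X, gamma):
        """Candidate (center, label, a_i, b_i) tuples for one potential center:
        the a_i values are the sorted distances of the other examples from
        the center, and b_i = a_i + gamma at this stage."""
        dists = sorted(float(np.linalg.norm(xj - center)) for xj in X
                       if not np.array_equal(xj, center))
        return [(center, center_label, a, a + gamma) for a in dists]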
91 It is well known that greedy algorithms of this kind have the following guarantee: if there exist r balls that cover all the m examples, the greedy algorithm will find at most r ln(m) balls. [sent-170, score-0.325]
92 Both of these algorithms were also compared with the SVM equipped with an RBF kernel of variance σ² and a soft-margin parameter C. [sent-173, score-0.114]
93 (2005), each SCM was constrained to use only balls having centers of the same class (negative for conjunctions and positive for disjunctions). [sent-176, score-0.244]
94 ² The possible values for ai and bi are defined by the location of the training points. [sent-177, score-0.388]
95 ³ Recall that for each value of ai, the value of bi is set to ai + γ at this stage. [sent-178, score-0.497]
96 [Table: for each data set (breastw, bupa, credit, glass, heart, haberman, USvotes), the train/test split sizes and the results of the SVM (C, σ², number of SVs, errors), the SCM (number of balls b, errors), and SCM-PB (number of balls b, margin γ, errors).] [sent-180, score-0.118] [sent-183, score-0.12]
98 About half of the examples were used for training and the remaining examples were used for testing. [sent-191, score-0.168]
99 In addition to this, the “b” and “γ” columns refer, respectively, to the number of balls and the margin parameter (divided by the average distance between the positive and the negative examples). [sent-199, score-0.334]
100 We also observe that SCM-PB generally sacrifices sparsity (compared to SCM) to obtain some margin γ > 0. [sent-204, score-0.101]
wordName wordTfidf (topN-words)
[('ab', 0.441), ('scm', 0.259), ('gi', 0.248), ('gk', 0.233), ('def', 0.23), ('balls', 0.217), ('er', 0.175), ('gqi', 0.173), ('risk', 0.171), ('ai', 0.167), ('bi', 0.163), ('classi', 0.145), ('rs', 0.145), ('bab', 0.138), ('laviolette', 0.138), ('marchand', 0.138), ('gibbs', 0.137), ('qi', 0.137), ('ball', 0.134), ('pi', 0.119), ('radius', 0.107), ('covering', 0.105), ('utility', 0.102), ('gab', 0.086), ('kl', 0.085), ('ln', 0.084), ('ers', 0.08), ('si', 0.074), ('mario', 0.069), ('margin', 0.068), ('compression', 0.065), ('conjunction', 0.062), ('errs', 0.06), ('training', 0.058), ('xj', 0.057), ('examples', 0.055), ('greedy', 0.054), ('yj', 0.053), ('string', 0.052), ('fran', 0.052), ('gq', 0.052), ('ois', 0.052), ('ci', 0.047), ('svm', 0.047), ('indices', 0.046), ('soft', 0.046), ('bound', 0.046), ('learner', 0.045), ('xi', 0.044), ('message', 0.044), ('theorem', 0.042), ('try', 0.039), ('yi', 0.039), ('posterior', 0.038), ('divergence', 0.037), ('bupa', 0.035), ('cab', 0.035), ('mohak', 0.035), ('ottawa', 0.035), ('svs', 0.035), ('uab', 0.035), ('consequently', 0.034), ('bayes', 0.033), ('sparsity', 0.033), ('center', 0.031), ('ui', 0.031), ('surface', 0.031), ('pr', 0.03), ('whenever', 0.03), ('disjunction', 0.03), ('su', 0.029), ('assigned', 0.029), ('nitions', 0.029), ('ni', 0.028), ('misclassi', 0.028), ('negative', 0.028), ('margins', 0.027), ('pac', 0.027), ('centers', 0.027), ('hence', 0.026), ('non', 0.026), ('equally', 0.026), ('prior', 0.026), ('cv', 0.026), ('mcallester', 0.026), ('boser', 0.024), ('choose', 0.024), ('heart', 0.023), ('sacri', 0.023), ('hi', 0.023), ('separating', 0.023), ('called', 0.023), ('metric', 0.022), ('according', 0.021), ('covered', 0.021), ('totally', 0.021), ('distance', 0.021), ('label', 0.021), ('remove', 0.02), ('posteriors', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
Author: François Laviolette, Mario Marchand, Mohak Shah
Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. 1
2 0.10708802 95 nips-2005-Improved risk tail bounds for on-line algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We prove the strongest known bound for the risk of hypotheses selected from the ensemble generated by running a learning algorithm incrementally on the training data. Our result is based on proof techniques that are remarkably different from the standard risk analysis based on uniform convergence arguments.
3 0.096594237 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
4 0.091271445 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
Author: Anatoli Juditsky, Alexander Nazin, Alexandre Tsybakov, Nicolas Vayatis
Abstract: We consider the problem of constructing an aggregated estimator from a finite class of base functions which approximately minimizes a convex risk functional under the ℓ1 constraint. For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent in the dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms have been developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover their implementation is adapted to the online setting. The main result of the paper is the upper bound on the convergence rate for the generalization error. 1
5 0.085490048 47 nips-2005-Consistency of one-class SVM and related algorithms
Author: Régis Vert, Jean-philippe Vert
Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
6 0.082163192 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
7 0.080159791 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
8 0.076901428 50 nips-2005-Convex Neural Networks
9 0.076183863 58 nips-2005-Divergences, surrogate loss functions and experimental design
10 0.076162927 131 nips-2005-Multiple Instance Boosting for Object Detection
11 0.076040432 102 nips-2005-Kernelized Infomax Clustering
12 0.074349828 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
13 0.072442316 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
14 0.071337245 195 nips-2005-Transfer learning for text classification
15 0.070412807 144 nips-2005-Off-policy Learning with Options and Recognizers
16 0.067213856 78 nips-2005-From Weighted Classification to Policy Search
17 0.067003772 114 nips-2005-Learning Rankings via Convex Hull Separation
18 0.066789046 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
19 0.063465364 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
20 0.06288071 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
topicId topicWeight
[(0, 0.189), (1, 0.083), (2, -0.016), (3, -0.085), (4, 0.074), (5, 0.125), (6, 0.043), (7, 0.054), (8, -0.068), (9, -0.018), (10, -0.041), (11, -0.028), (12, 0.059), (13, 0.023), (14, 0.059), (15, 0.144), (16, -0.001), (17, -0.005), (18, 0.019), (19, 0.015), (20, -0.012), (21, 0.074), (22, 0.027), (23, 0.073), (24, 0.014), (25, 0.073), (26, 0.013), (27, 0.112), (28, 0.045), (29, 0.086), (30, -0.025), (31, 0.14), (32, -0.016), (33, -0.029), (34, -0.003), (35, -0.077), (36, 0.042), (37, -0.149), (38, 0.022), (39, -0.115), (40, -0.021), (41, 0.008), (42, 0.142), (43, -0.043), (44, 0.103), (45, 0.064), (46, -0.154), (47, -0.081), (48, 0.076), (49, -0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.94401383 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
Author: François Laviolette, Mario Marchand, Mohak Shah
Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. 1
2 0.49280241 195 nips-2005-Transfer learning for text classification
Author: Chuong B. Do, Andrew Y. Ng
Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form, function of training set statistics; we call this mapping mapping from statistics to parameters, the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks. 1
3 0.48090863 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
Author: Lacey Gunter, Ji Zhu
Abstract: In this paper we derive an algorithm that computes the entire solution path of the support vector regression, with essentially the same computational cost as fitting one SVR model. We also propose an unbiased estimate for the degrees of freedom of the SVR model, which allows convenient selection of the regularization parameter. 1
4 0.47691661 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
5 0.4721759 4 nips-2005-A Bayesian Spatial Scan Statistic
Author: Daniel B. Neill, Andrew W. Moore, Gregory F. Cooper
Abstract: We propose a new Bayesian method for spatial cluster detection, the “Bayesian spatial scan statistic,” and compare this method to the standard (frequentist) scan statistic approach. We demonstrate that the Bayesian statistic has several advantages over the frequentist approach, including increased power to detect clusters and (since randomization testing is unnecessary) much faster runtime. We evaluate the Bayesian and frequentist methods on the task of prospective disease surveillance: detecting spatial clusters of disease cases resulting from emerging disease outbreaks. We demonstrate that our Bayesian methods are successful in rapidly detecting outbreaks while keeping number of false positives low. 1
6 0.43812692 114 nips-2005-Learning Rankings via Convex Hull Separation
7 0.43405497 50 nips-2005-Convex Neural Networks
8 0.4291884 131 nips-2005-Multiple Instance Boosting for Object Detection
9 0.41902578 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
10 0.41857025 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
11 0.40515852 160 nips-2005-Query by Committee Made Real
12 0.39846301 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
13 0.39160675 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
14 0.37691185 95 nips-2005-Improved risk tail bounds for on-line algorithms
15 0.37623298 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
16 0.3743681 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
17 0.37394476 151 nips-2005-Pattern Recognition from One Example by Chopping
18 0.371923 146 nips-2005-On the Accuracy of Bounded Rationality: How Far from Optimal Is Fast and Frugal?
19 0.36966318 62 nips-2005-Efficient Estimation of OOMs
20 0.36917186 70 nips-2005-Fast Information Value for Graphical Models
topicId topicWeight
[(3, 0.066), (10, 0.019), (27, 0.011), (31, 0.024), (34, 0.62), (39, 0.013), (50, 0.012), (55, 0.023), (69, 0.03), (73, 0.013), (88, 0.051), (91, 0.025)]
simIndex simValue paperId paperTitle
1 0.9905296 2 nips-2005-A Bayes Rule for Density Matrices
Author: Manfred K. Warmuth
Abstract: The classical Bayes rule computes the posterior model probability from the prior probability and the data likelihood. We generalize this rule to the case when the prior is a density matrix (symmetric positive definite and trace one) and the data likelihood a covariance matrix. The classical Bayes rule is retained as the special case when the matrices are diagonal. In the classical setting, the calculation of the probability of the data is an expected likelihood, where the expectation is over the prior distribution. In the generalized setting, this is replaced by an expected variance calculation where the variance is computed along the eigenvectors of the prior density matrix and the expectation is over the eigenvalues of the density matrix (which form a probability vector). The variances along any direction is determined by the covariance matrix. Curiously enough this expected variance calculation is a quantum measurement where the co-variance matrix specifies the instrument and the prior density matrix the mixture state of the particle. We motivate both the classical and the generalized Bayes rule with a minimum relative entropy principle, where the Kullbach-Leibler version gives the classical Bayes rule and Umegaki’s quantum relative entropy the new Bayes rule for density matrices. 1
same-paper 2 0.98572689 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
Author: François Laviolette, Mario Marchand, Mohak Shah
Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non trivial margin-sparsity trade-off. 1
3 0.9735226 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lankriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constraint quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm helps for automatic model selection, improving the interpretability of the learning result and works for hundred thousands of examples or hundreds of kernels to be combined. 1
4 0.97272491 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1
5 0.95010227 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning
Author: Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, Yann L. Cun
Abstract: We describe a vision-based obstacle avoidance system for off-road mobile robots. The system is trained from end to end to map raw input images to steering angles. It is trained in supervised mode to predict the steering angles provided by a human driver during training runs collected in a wide variety of terrains, weather conditions, lighting conditions, and obstacle types. The robot is a 50cm off-road truck, with two forwardpointing wireless color cameras. A remote computer processes the video and controls the robot via radio. The learning system is a large 6-layer convolutional network whose input is a single left/right pair of unprocessed low-resolution images. The robot exhibits an excellent ability to detect obstacles and navigate around them in real time at speeds of 2 m/s.
6 0.81369388 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
7 0.78952181 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
8 0.77512318 114 nips-2005-Learning Rankings via Convex Hull Separation
9 0.7691859 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
10 0.76430589 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
11 0.76336932 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
12 0.75660419 105 nips-2005-Large-Scale Multiclass Transduction
13 0.74982488 77 nips-2005-From Lasso regression to Feature vector machine
14 0.74174124 184 nips-2005-Structured Prediction via the Extragradient Method
15 0.73277134 50 nips-2005-Convex Neural Networks
16 0.73275256 58 nips-2005-Divergences, surrogate loss functions and experimental design
17 0.72663718 47 nips-2005-Consistency of one-class SVM and related algorithms
18 0.72127002 126 nips-2005-Metric Learning by Collapsing Classes
19 0.71404374 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
20 0.71375793 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models