nips nips2004 nips2004-69 knowledge-graph by maker-knowledge-mining

69 nips-2004-Fast Rates to Bayes for Kernel Machines


Source: pdf

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. [sent-2, score-0.408]

2 In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. [sent-3, score-0.369]

3 Since by the no-free-lunch theorem of Devroye (see [4]) performance guarantees are impossible without assumptions on the data-generating distribution we will restrict our considerations to specific classes of distributions. [sent-13, score-0.216]

4 In particular, we will present a geometric condition which describes how distributions behave close to the decision boundary. [sent-14, score-0.391]

5 This condition is then used to establish learning rates for SVM’s. [sent-15, score-0.328]

6 To obtain learning rates faster than n^{-1/2} we also employ a noise condition of Tsybakov (see [5]). [sent-16, score-0.415]

7 Combining both concepts we are in particular able to describe distributions such that SVM’s with Gaussian kernel learn almost linearly, i. [sent-17, score-0.184]

8 To make this precise, the risk of a measurable function f : X → R is defined by R_P(f) := P({(x, y) : sign f(x) ≠ y}). [sent-34, score-0.138]

9 The smallest achievable risk R_P := inf{ R_P(f) | f : X → R measurable } is called the Bayes risk of P. [sent-35, score-0.239]

10 A function fP : X → Y attaining this risk is called a Bayes decision function. [sent-36, score-0.167]
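
To make the two notions above concrete, here is a small illustrative check (ours, not from the paper; the toy distribution with P_X uniform on [−1, 1] and η(x) = (1 + x)/2 is an assumption for the example): for a known conditional probability η, a Bayes decision function is f_P(x) = sign(2η(x) − 1) and the Bayes risk equals E_{P_X}[min(η(x), 1 − η(x))].

    import numpy as np

    # Illustrative check (not from the paper): toy distribution on X = [-1, 1]
    # with P_X uniform and eta(x) = P(y = 1 | x) = (1 + x) / 2.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=100_000)
    eta = (1.0 + x) / 2.0
    y = np.where(rng.uniform(size=x.size) < eta, 1, -1)

    f_bayes = np.where(2 * eta - 1 >= 0, 1, -1)        # a Bayes decision function
    empirical_risk = np.mean(f_bayes != y)             # Monte-Carlo estimate of R_P(f_P)
    bayes_risk = np.mean(np.minimum(eta, 1 - eta))     # E_{P_X}[min(eta, 1 - eta)]
    print(empirical_risk, bayes_risk)                  # both close to 0.25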

11 The next naturally arising question is whether there are classifiers which guarantee a specific rate of convergence in (1) for all distributions. [sent-39, score-0.148]

12 However, if one restricts considerations to certain smaller classes of distributions such rates exist for various classifiers, e. [sent-43, score-0.335]

13 • Assuming that the conditional probability η(x) := P(1|x) satisfies certain smoothness assumptions, Yang showed in [6] that some plug-in rules (cf. [sent-45, score-0.19]

14 [4]) achieve rates for (1) which are of the form n^{-α} for some 0 < α < 1/2 depending on the assumed smoothness. [sent-46, score-0.188]

15 He also showed that these rates are optimal in the sense that no classifier can obtain faster rates under the proposed smoothness assumptions. [sent-47, score-0.499]

16 1]) that using structural risk minimization over a sequence of hypothesis classes with finite VC-dimension, every distribution which has a Bayes decision function in one of the hypothesis classes can be learned with rate n^{-1/2}. [sent-50, score-0.416]

17 If F contains a Bayes decision function then up to a logarithmic factor the convergence rate of the ERM classifier over F is n^{-1} (see [4, Sec. [sent-54, score-0.183]

18 Restricting the class of distributions for classification always raises the question of whether it is likely that these restrictions are met in real world problems. [sent-57, score-0.136]

19 We conclude that the above listed rates are established for situations which are rarely met in practice. [sent-62, score-0.302]

20 Considering the ERM classifier and hypothesis classes F containing a Bayes decision function there is a large gap in the rates for noise-free and noisy distributions. [sent-63, score-0.383]

21 In [5] Tsybakov proposed a condition on the noise which describes intermediate situations. [sent-64, score-0.233]

22 In order to present this condition we write η(x) := P (y = 1|x), x ∈ X, for the conditional probability and PX for the marginal distribution of P on X. [sent-65, score-0.131]

23 We will use the following modified version of Tsybakov’s noise condition which describes the size of the latter regions: Definition 1. [sent-68, score-0.233]

24 We say that P has Tsybakov noise exponent q if there exists a constant C > 0 such that for all sufficiently small t > 0 we have P_X(|2η − 1| ≤ t) ≤ C · t^q. (2) [sent-70, score-0.763]

25 All distributions have at least noise exponent 0. [sent-71, score-0.803]

26 In particular this means that noise-free distributions have exponent q = ∞. [sent-73, score-0.714]
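
As an illustrative sketch (ours, not from the paper; the toy choice P_X uniform on [−1, 1] with η(x) = (1 + x)/2 is an assumption), the exponent in condition (2) can be checked numerically from the slope of log P_X(|2η − 1| ≤ t) against log t:

    import numpy as np

    # Illustrative sketch (not from the paper): Monte-Carlo check of condition (2)
    # for a toy distribution with P_X uniform on [-1, 1] and eta(x) = (1 + x) / 2,
    # so that |2*eta(x) - 1| = |x| and P_X(|2*eta - 1| <= t) = t, i.e. q = 1.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=200_000)
    margin = np.abs(2.0 * (1.0 + x) / 2.0 - 1.0)       # |2*eta(x) - 1| = |x|

    ts = np.logspace(-3, -0.5, 20)
    probs = np.array([(margin <= t).mean() for t in ts])

    # Slope of log P_X(|2*eta - 1| <= t) versus log t estimates q (here close to 1).
    q_hat, _ = np.polyfit(np.log(ts), np.log(probs), 1)
    print(f"estimated Tsybakov noise exponent q ~ {q_hat:.2f}")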

27 Finally note that Tsybakov's original noise condition assumed P_X(f ≠ f_P) ≤ c (R_P(f) − R_P)^{q/(1+q)} for all f : X → Y, which is satisfied if e. [sent-74, score-0.224]

28 In [5] Tsybakov showed that if P has a noise exponent q then ERM-type classifiers can obtain rates in (1) which are of the form n^{-(q+1)/(q+pq+2)}, where 0 < p < 1 measures the complexity of the hypothesis class. [sent-78, score-1.098]

29 In particular, rates faster than n^{-1/2} are possible whenever q > 0 and p < 1. [sent-79, score-0.22]

30 Furthermore, his classifier requires substantial knowledge on how to approximate the Bayes decision rules of the considered distributions. [sent-81, score-0.139]

31 2 Results: In this paper we will use the Tsybakov noise exponent to establish rates for SVM's which are very similar to the above rates of Tsybakov. [sent-83, score-1.173]

32 To this end let H be a reproducing kernel Hilbert space (RKHS) of a kernel k : X × X → R, i. [sent-85, score-0.194]

33 Now given a regularization parameter λ > 0 the decision function of an SVM is (f_{T,λ}, b_{T,λ}) := arg min_{f∈H, b∈R} [ λ‖f‖_H^2 + (1/n) Σ_{i=1}^n l(y_i(f(x_i) + b)) ],  (3)  where l(t) := max{0, 1 − t} is the so-called hinge loss. [sent-94, score-0.164]
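
A minimal sketch of fitting (3) in practice (ours, not the authors' code; the reparameterization is an assumption stated in the comments): up to C = 1/(2λn), the λ-penalized hinge-loss problem corresponds to the usual C-SVM, and sklearn's gamma plays the role of σ^2 for the Gaussian RBF kernel used later in the text.

    import numpy as np
    from sklearn.svm import SVC

    # Illustrative sketch (not the authors' code): the objective (3) with Gaussian
    # RBF kernel k_sigma(x, x') = exp(-sigma^2 * ||x - x'||^2) corresponds, up to
    # the reparameterization C = 1 / (2 * lam * n), to the standard C-SVM; sklearn's
    # `gamma` then plays the role of sigma^2.
    rng = np.random.default_rng(0)
    n, sigma, lam = 500, 2.0, 1e-3
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = np.where(X[:, 0] + 0.1 * rng.standard_normal(n) > 0, 1, -1)   # noisy labels

    clf = SVC(C=1.0 / (2.0 * lam * n), kernel="rbf", gamma=sigma**2)
    clf.fit(X, y)
    print("training error:", np.mean(clf.predict(X) != y))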

34 Unfortunately, only a few results on learning rates for SVM’s are known: In [8] it was shown that SVM’s can learn with linear rate if the distribution is noise-free and the two classes can be strictly separated by the RKHS. [sent-95, score-0.325]

35 For RKHS which are dense in the space C(X) of continuous functions the latter condition is satisfied if the two classes have strictly positive distance in the input space. [sent-96, score-0.211]

36 Furthermore, Wu and Zhou (see [9]) recently established rates under the assumption that η is contained in a Sobolev space. [sent-98, score-0.244]

37 In particular, they proved rates of the form (log n)^{-p} for some p > 0 if the SVM uses a Gaussian kernel. [sent-99, score-0.188]

38 Obviously, these rates are much too slow to be of practical interest and the difficulties with smoothness assumptions have already been discussed above. [sent-100, score-0.288]

39 Furthermore, we define the approximation error function by a(λ) := inf_{f∈H} [ λ‖f‖_H^2 + R_{l,P}(f) − R_{l,P} ], λ ≥ 0. [sent-105, score-0.129]

40 However, in non-trivial situations no rate of convergence which uniformly holds for all distributions P is possible. [sent-109, score-0.191]

41 Then H approximates P with exponent β ∈ (0, 1] if there is a C > 0 such that for all λ > 0: a(λ) ≤ Cλ^β. [sent-112, score-0.697]

42 Because of the specific structure of the approximation error function, values β > 1 are only possible for distributions with η ≡ 1/2. [sent-114, score-0.129]
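
A minimal worked check of this definition, under the assumption that some f* ∈ H attains the minimal l-risk R_{l,P} (the situation characterized via [10] further below):

    a(\lambda) \;\le\; \lambda\|f^*\|_H^2 + R_{l,P}(f^*) - R_{l,P} \;=\; \lambda\|f^*\|_H^2 ,
    \qquad\text{so } a(\lambda) \le C\lambda^{\beta} \text{ holds with } \beta = 1,\; C = \|f^*\|_H^2 .

This is consistent with the equivalence from [10] quoted below: approximation exponent β = 1 occurs exactly when H contains a minimizer of the l-risk.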

43 Then H has complexity exponent 0 < p ≤ 2 if there is an a_p > 0 such that for all ε > 0 we have log N(B_H, ε, C(X)) ≤ a_p ε^{-p}. [sent-123, score-0.798]

44 Note that in [10] the complexity exponent was defined in terms of N(B_H, ε, L_2(T_X)), where L_2(T_X) is the L_2-space with respect to the empirical measure of (x_1, . [sent-124, score-0.733]

45 Theorem 2.3: Let H be a RKHS of a continuous kernel on X with complexity exponent 0 < p < 2, and let P be a probability measure on X × Y with Tsybakov noise exponent 0 < q ≤ ∞. [sent-140, score-1.624]

46 Furthermore, assume that H approximates P with exponent 0 < β ≤ 1. [sent-141, score-0.697]

47 By Theorem 2.3 these rates have the form n^{-4β(q+1)/((2q+pq+4)(1+β)) + ε} for all ε > 0. [sent-149, score-0.188]

48 5 For brevity’s sake our major aim was to show the best possible rates using our techniques. [sent-151, score-0.188]

49 Theorem 2.3 states rates for the SVM under the assumption that (λ_n) is optimally chosen. [sent-153, score-0.188]

50 However, we emphasize that the techniques of [10] also give rates if (λ_n) is chosen in a different (and thus sub-optimal) way. [sent-154, score-0.188]

51 If we rescale the complexity exponent p from (0, 2) to (0, 1) and write p for the new complexity exponent, this rate becomes n^{-(q+1)/(q+pq+2)}. [sent-162, score-0.188]
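
The algebra behind this remark, written out for the rate stated above (assuming β = 1 and writing p̄ = p/2 ∈ (0, 1) for the rescaled complexity exponent):

    \frac{4\beta(q+1)}{(2q+pq+4)(1+\beta)}\Bigg|_{\beta=1}
      \;=\; \frac{4(q+1)}{2\,(2q+pq+4)}
      \;=\; \frac{q+1}{q+\frac{p}{2}\,q+2}
      \;=\; \frac{q+1}{q+\bar{p}\,q+2} ,

which is exactly Tsybakov's exponent from above with p̄ in place of p.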

52 It also suffices to suppose that H approximates P with exponent β′ for all β′ < β, and that H has complexity exponent p′ for all p′ > p. [sent-168, score-1.401]

53 Now, it is shown in [10] that the RKHS H has an approximation exponent β = 1 if and only if H contains a minimizer of the l-risk. [sent-169, score-0.728]

54 In particular, if H has approximation exponent β for all β < 1 but not for β = 1 then H does not contain such a minimizer but Theorem 2. [sent-170, score-0.728]

55 If in addition the RKHS consists of C^∞ functions we can choose p arbitrarily close to 0, and hence we can obtain rates up to n^{-1} even though H does not contain a minimizer of the l-risk, that means e. [sent-172, score-0.231]

56 In particular this seems to be true for the most popular kernel, that is the Gaussian RBF kernel k_σ(x, x′) = exp(−σ^2 ‖x − x′‖_2^2), x, x′ ∈ X, on (compact) subsets X of R^d with width 1/σ. [sent-177, score-0.131]
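
A small sketch of this kernel convention (ours; the helper name is illustrative, not from the paper):

    import numpy as np

    def gaussian_rbf_gram(X, sigma):
        """Gram matrix of k_sigma(x, x') = exp(-sigma^2 * ||x - x'||_2^2),
        i.e. the Gaussian RBF kernel with width 1/sigma."""
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-sigma**2 * sq_dists)

    X = np.random.default_rng(0).uniform(-1.0, 1.0, size=(5, 3))
    K = gaussian_rbf_gram(X, sigma=2.0)
    print(K.shape, np.allclose(K, K.T))   # (5, 5) and symmetric, as expected

Note that a larger σ means a narrower kernel (width 1/σ), which is why letting σ(λ) grow below increases the complexity of the hypothesis class H_{σ(λ)}.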

57 However, to our best knowledge no non-trivial condition on η or f_P = sign ∘ (2η − 1) which ensures an approximation exponent β > 0 for fixed width has been established, and [12] shows that Gaussian kernels poorly approximate smooth functions. [sent-178, score-0.946]

58 Hence plug-in rules based on Gaussian kernels may perform poorly under smoothness assumptions on η. [sent-179, score-0.192]

59 In particular, many types of SVM’s using other loss functions are plug-in rules and therefore, their approximation properties under smoothness assumptions on η may be poor if a Gaussian kernel is used. [sent-180, score-0.268]

60 However, our SVM’s are not plug-in rules since their decision functions approximate the Bayes decision function (see [13]). [sent-181, score-0.238]

61 Intuitively, we therefore only need a condition that measures the cost of approximating the “bump” of the Bayes decision function at the “decision boundary”. [sent-182, score-0.204]

62 We will now establish such a condition for Gaussian RBF kernels with varying widths 1/σ_n. [sent-183, score-0.192]

63 Since we are only interested in distributions P having a Tsybakov exponent q > 0 we always assume that X = X_{−1} ∪ X_1 holds P_X-almost surely. [sent-186, score-0.729]

64 With the help of this function we can define the following geometric condition for distributions: Definition 2. [sent-190, score-0.188]

65 We say that P has geometric noise exponent α ∈ (0, ∞] if we have ∫_X τ_x^{−αd} |2η(x) − 1| P_X(dx) < ∞. (6) [sent-192, score-0.871]

66 Furthermore, P has geometric noise exponent ∞ if (6) holds for all α > 0. [sent-193, score-0.886]
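
A toy example of condition (6) (our illustration, not from the paper; we read τ_x as the distance from x to the opposite class): take X = [−1, 1] ⊂ R (so d = 1), P_X uniform, X_{−1} = [−1, 0), X_1 = (0, 1], hence τ_x = |x|, and suppose |2η(x) − 1| = |x|^γ for some γ > 0. Then

    \int_X \tau_x^{-\alpha d}\,|2\eta(x)-1|\,P_X(dx)
      \;=\; \frac{1}{2}\int_{-1}^{1} |x|^{\gamma-\alpha}\,dx \;<\; \infty
      \quad\Longleftrightarrow\quad \alpha < \gamma + 1 ,

which matches the Hölder-based bound α < γ(q+1)/d quoted further below, since this toy η has Tsybakov noise exponent q = 1/γ.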

67 In the above definition we neither make any kind of smoothness assumption nor assume a condition on P_X in terms of absolute continuity with respect to the Lebesgue measure. [sent-194, score-0.146]

68 Instead, the integral condition (6) describes the concentration of the measure |2η − 1| dP_X near the decision boundary. [sent-195, score-0.246]

69 The less the measure is concentrated in this region, the larger the geometric noise exponent can be chosen. [sent-196, score-0.874]

70 Using this interpretation we can easily construct distributions which have geometric noise exponent ∞ and touching classes. [sent-201, score-0.911]

71 Definition 2.9: We say that η is Hölder about 1/2 with exponent γ > 0 on X ⊂ R^d if there is a constant c_γ > 0 such that for all x ∈ X we have |2η(x) − 1| ≤ c_γ τ_x^γ. (7) [sent-204, score-0.755]

72 If η is Hölder about 1/2 with exponent γ > 0, the graph of 2η(x) − 1 lies in a multiple of the envelope defined by τ_x^γ at the top and by −τ_x^γ at the bottom. [sent-205, score-0.729]

73 To be Hölder about 1/2 it is sufficient that η is Hölder continuous, but it is not necessary. [sent-206, score-0.214]

74 A function which is Hölder about 1/2 can be very irregular away from the decision boundary but it cannot jump across the decision boundary discontinuously. [sent-207, score-0.391]

75 In addition a Hölder continuous function's exponent must satisfy 0 < γ ≤ 1, whereas being Hölder about 1/2 only requires γ > 0. [sent-208, score-0.673]

76 For distributions with Tsybakov exponent such that η is Hölder about 1/2 we can bound the geometric noise exponent. [sent-209, score-0.911]

77 Indeed, let P be a distribution which has Tsybakov noise exponent q ≥ 0 and a conditional probability η which is Hölder about 1/2 with exponent γ > 0. [sent-210, score-1.422]

78 Then (see [10]) P has geometric noise exponent α for all α < γ(q+1)/d. [sent-211, score-0.845]

79 For distributions having a non-trivial geometric noise exponent we can now bound the approximation error function for Gaussian RBF kernels: [sent-212, score-0.974]

80 Theorem 2.10: Let X be the closed unit ball of the Euclidean space R^d, and let H_σ be the RKHS of the Gaussian RBF kernel k_σ on X with width 1/σ > 0. [sent-213, score-0.179]

81 Furthermore, let P be a distribution with geometric noise exponent 0 < α < ∞. [sent-214, score-0.883]

82 Roughly speaking this states that the family of spaces H_{σ(λ)} approximates P with exponent α/(α+1). [sent-221, score-0.727]

83 Note that we can obtain approximation rates up to linear order in λ for sufficiently benign distributions. [sent-222, score-0.251]

84 The price for this good approximation property is, however, an increasing complexity of the hypothesis class Hσ(λ) for σ → ∞, i. [sent-223, score-0.186]

85 The following theorem (Theorem 2.11) estimates this in terms of the complexity exponent. [sent-226, score-0.183]

86 Having established both results for the approximation and complexity exponent we can now formulate our main result for SVM's using Gaussian RBF kernels: [sent-229, score-0.823]

87 Theorem 2.12: Let X be the closed unit ball of the Euclidean space R^d, and P be a distribution on X × Y with Tsybakov noise exponent 0 < q ≤ ∞ and geometric noise exponent 0 < α < ∞. [sent-230, score-1.656]

88 3 Discussion of a modified support vector machine. Let us now discuss a recent result (see [11]) on rates for the following modification of the original SVM: f*_{T,λ} := arg min_{f∈H} [ λ‖f‖_H + (1/n) Σ_{i=1}^n l(y_i f(x_i)) ]. [sent-240, score-0.254]
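
A minimal subgradient sketch of the displayed objective (illustrative only; the representer ansatz f = Σ_j c_j k(x_j, ·), with ‖f‖_H = √(cᵀKc) and f(x_i) = (Kc)_i, and the decreasing step size are our assumptions, not taken from [11]):

    import numpy as np

    def fit_offset_free_svm(K, y, lam, steps=2000, lr=0.1):
        """Subgradient descent on  lam * ||f||_H + (1/n) * sum_i max(0, 1 - y_i * f(x_i)),
        with f = sum_j c_j k(x_j, .), so ||f||_H = sqrt(c' K c) and f(x_i) = (K c)_i."""
        n = K.shape[0]
        c = np.zeros(n)
        for t in range(steps):
            f = K @ c
            active = (y * f < 1).astype(float)        # hinge terms with nonzero subgradient
            grad_loss = -(K @ (active * y)) / n       # subgradient of the averaged hinge loss
            norm = np.sqrt(max(c @ K @ c, 1e-12))
            grad_reg = lam * (K @ c) / norm           # subgradient of lam * ||f||_H
            c -= lr / np.sqrt(t + 1.0) * (grad_loss + grad_reg)
        return c

Here K would be the kernel Gram matrix on the training inputs (for instance from the Gaussian RBF sketch above) and y the ±1 labels.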

89 To describe the result of [11] we need the following modification of the approximation error function: a*(λ) := inf_{f∈H} [ λ‖f‖_H + R_{l,P}(f) − R_{l,P} ], λ ≥ 0. [sent-242, score-0.129]

90 Theorem 3.1: Let H be a RKHS of a continuous kernel on X with complexity exponent 0 < p < 2, and let P be a distribution on X × Y with Tsybakov noise exponent ∞. [sent-248, score-1.595]

91 If H has complexity exponent p it can be shown that these eigenvalues decay at least as fast as n^{-2/p}. [sent-253, score-0.704]

92 It was also mentioned in [11] that using the techniques therein it is possible to derive rates for the original SVM. [sent-257, score-0.217]

93 To this end let us suppose that H approximates P with exponent 0 < β ≤ 1, i. [sent-267, score-0.761]

94 Therefore, if H approximates P with exponent β then the rate in Theorem 3. [sent-272, score-0.754]

95 It is natural to ask whether a similar observation can be made if we have a Tsybakov noise exponent which is smaller than ∞. [sent-278, score-0.737]

96 Furthermore, the asymptotically optimal choice of λn is again independent of the approximation exponent β. [sent-282, score-0.685]

97 However, it depends on the (typically unknown) noise exponent q. [sent-283, score-0.737]

98 12 if P has a non-trivial geometric noise exponent α? [sent-287, score-0.845]

99 Consistency of support vector machines and other regularized kernel machines. [sent-301, score-0.129]

100 On the influence of the kernel on the consistency of support vector machines. [sent-331, score-0.131]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('exponent', 0.622), ('tsybakov', 0.309), ('rp', 0.204), ('rates', 0.188), ('rkhs', 0.184), ('bh', 0.147), ('svm', 0.141), ('px', 0.125), ('rbf', 0.121), ('ft', 0.12), ('noise', 0.115), ('pq', 0.114), ('geometric', 0.108), ('lder', 0.107), ('bayes', 0.106), ('theorem', 0.101), ('decision', 0.099), ('remark', 0.083), ('complexity', 0.082), ('condition', 0.08), ('fp', 0.078), ('holder', 0.078), ('approximates', 0.075), ('risk', 0.068), ('smoothness', 0.066), ('distributions', 0.066), ('inf', 0.066), ('nition', 0.065), ('kernel', 0.065), ('cx', 0.064), ('approximation', 0.063), ('classi', 0.06), ('establish', 0.06), ('devroye', 0.059), ('er', 0.058), ('rate', 0.057), ('furthermore', 0.057), ('established', 0.056), ('pr', 0.056), ('classes', 0.055), ('kernels', 0.052), ('tx', 0.051), ('euclidian', 0.051), ('gaussian', 0.051), ('continuous', 0.051), ('svms', 0.051), ('bt', 0.05), ('ap', 0.047), ('minimizer', 0.043), ('boundary', 0.043), ('dpx', 0.043), ('erm', 0.043), ('holds', 0.041), ('hypothesis', 0.041), ('width', 0.04), ('rules', 0.04), ('question', 0.039), ('scovel', 0.039), ('ball', 0.038), ('rd', 0.038), ('describes', 0.038), ('let', 0.038), ('satis', 0.038), ('support', 0.037), ('regularization', 0.037), ('measurable', 0.037), ('closed', 0.036), ('obviously', 0.035), ('blanchard', 0.034), ('assumptions', 0.034), ('sign', 0.033), ('universally', 0.033), ('faster', 0.032), ('modi', 0.032), ('remarks', 0.031), ('met', 0.031), ('family', 0.03), ('consistency', 0.029), ('measure', 0.029), ('original', 0.029), ('ces', 0.028), ('covering', 0.028), ('hinge', 0.028), ('machines', 0.027), ('rarely', 0.027), ('concepts', 0.027), ('convergence', 0.027), ('considerations', 0.026), ('submitted', 0.026), ('say', 0.026), ('particular', 0.026), ('end', 0.026), ('write', 0.026), ('wu', 0.026), ('measures', 0.025), ('guarantee', 0.025), ('strictly', 0.025), ('showed', 0.025), ('conditional', 0.025), ('compact', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.

2 0.30154544 49 nips-2004-Density Level Detection is Classification

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1

3 0.28744105 137 nips-2004-On the Adaptive Properties of Decision Trees

Author: Clayton Scott, Robert Nowak

Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. 1

4 0.14291835 34 nips-2004-Breaking SVM Complexity with Cross-Training

Author: Léon Bottou, Jason Weston, Gökhan H. Bakir

Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1

5 0.11529162 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

Author: Laurent Zwald, Gilles Blanchard, Pascal Massart, Régis Vert

Abstract: This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has been previously used as a pre-processing step before applying an SVM but we point out that this method is somewhat redundant from a regularization point of view and we propose a new algorithm called Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performances as an SVM. 1

6 0.10254646 168 nips-2004-Semigroup Kernels on Finite Sets

7 0.094713233 92 nips-2004-Kernel Methods for Implicit Surface Modeling

8 0.092236325 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM

9 0.087576538 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

10 0.084742315 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems

11 0.083050311 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

12 0.079683907 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

13 0.078733191 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

14 0.076552555 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

15 0.076141931 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

16 0.074068159 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

17 0.073407136 138 nips-2004-Online Bounds for Bayesian Algorithms

18 0.069523677 42 nips-2004-Computing regularization paths for learning multiple kernels

19 0.067591868 68 nips-2004-Face Detection --- Efficient and Rank Deficient

20 0.065785192 94 nips-2004-Kernels for Multi--task Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.223), (1, 0.097), (2, 0.002), (3, 0.2), (4, -0.05), (5, 0.064), (6, 0.103), (7, -0.101), (8, 0.091), (9, -0.105), (10, -0.078), (11, 0.005), (12, 0.102), (13, -0.05), (14, -0.172), (15, -0.338), (16, 0.075), (17, 0.116), (18, -0.09), (19, 0.128), (20, -0.24), (21, 0.032), (22, 0.086), (23, -0.013), (24, -0.131), (25, 0.047), (26, 0.048), (27, -0.097), (28, -0.017), (29, 0.027), (30, -0.016), (31, 0.142), (32, -0.015), (33, -0.078), (34, 0.129), (35, 0.115), (36, 0.003), (37, -0.068), (38, -0.007), (39, 0.011), (40, 0.036), (41, 0.046), (42, 0.033), (43, -0.0), (44, 0.097), (45, 0.036), (46, -0.001), (47, 0.053), (48, 0.032), (49, 0.08)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96705317 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1

2 0.89867926 49 nips-2004-Density Level Detection is Classification

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1

3 0.83882982 137 nips-2004-On the Adaptive Properties of Decision Trees

Author: Clayton Scott, Robert Nowak

Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. 1

4 0.50501084 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

Author: Laurent Zwald, Gilles Blanchard, Pascal Massart, Régis Vert

Abstract: This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has been previously used as a pre-processing step before applying an SVM but we point out that this method is somewhat redundant from a regularization point of view and we propose a new algorithm called Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performances as an SVM. 1

5 0.4714964 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems

Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini

Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless by now there was no formal evidence neither that learning from examples could be seen as an inverse problem nor that theoretical results in learning theory could be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem in the language of regularization theory and show that consistency results and optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem. 1

6 0.47017947 34 nips-2004-Breaking SVM Complexity with Cross-Training

7 0.4055787 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

8 0.39468929 166 nips-2004-Semi-supervised Learning via Gaussian Processes

9 0.37987775 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM

10 0.36474788 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

11 0.34067711 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

12 0.33601433 168 nips-2004-Semigroup Kernels on Finite Sets

13 0.32956278 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

14 0.3236689 94 nips-2004-Kernels for Multi--task Learning

15 0.32045335 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters

16 0.31295279 185 nips-2004-The Convergence of Contrastive Divergences

17 0.30281374 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification

18 0.29946044 68 nips-2004-Face Detection --- Efficient and Rank Deficient

19 0.27564892 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

20 0.27168712 138 nips-2004-Online Bounds for Bayesian Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.098), (15, 0.134), (26, 0.121), (31, 0.086), (33, 0.165), (35, 0.014), (39, 0.039), (44, 0.132), (50, 0.124)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90915775 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1

2 0.90611047 153 nips-2004-Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity

Author: Sander M. Bohte, Michael C. Mozer

Abstract: Experimental studies have observed synaptic potentiation when a presynaptic neuron fires shortly before a postsynaptic neuron, and synaptic depression when the presynaptic neuron fires shortly after. The dependence of synaptic modulation on the precise timing of the two action potentials is known as spike-timing dependent plasticity or STDP. We derive STDP from a simple computational principle: synapses adapt so as to minimize the postsynaptic neuron’s variability to a given presynaptic input, causing the neuron’s output to become more reliable in the face of noise. Using an entropy-minimization objective function and the biophysically realistic spike-response model of Gerstner (2001), we simulate neurophysiological experiments and obtain the characteristic STDP curve along with other phenomena including the reduction in synaptic plasticity as synaptic efficacy increases. We compare our account to other efforts to derive STDP from computational principles, and argue that our account provides the most comprehensive coverage of the phenomena. Thus, reliability of neural response in the face of noise may be a key goal of cortical adaptation. 1

3 0.88091218 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

4 0.8500976 136 nips-2004-On Semi-Supervised Classification

Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink

Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1

5 0.84896731 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

Author: Liam Paninski

Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.

6 0.83532554 49 nips-2004-Density Level Detection is Classification

7 0.83228886 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

8 0.83111924 29 nips-2004-Beat Tracking the Graphical Model Way

9 0.81967002 103 nips-2004-Limits of Spectral Clustering

10 0.81797314 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)

11 0.81665981 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

12 0.81430262 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

13 0.81418371 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

14 0.81319541 28 nips-2004-Bayesian inference in spiking neurons

15 0.81201941 131 nips-2004-Non-Local Manifold Tangent Learning

16 0.8103804 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

17 0.80957097 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments

18 0.80868912 158 nips-2004-Sampling Methods for Unsupervised Learning

19 0.8079195 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

20 0.80422878 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve