nips nips2002 nips2002-77 knowledge-graph by maker-knowledge-mining

77 nips-2002-Effective Dimension and Generalization of Kernel Learning


Source: pdf

Author: Tong Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert function spaces. We introduce a concept of scale-sensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. We derive upper bounds on the generalization performance and show that the resulting convergence rates are optimal under various circumstances.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We investigate the generalization performance of some learning problems in Hilbert function spaces. [sent-5, score-0.076]

2 We introduce a concept of scale-sensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. [sent-6, score-0.552]

3 Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. [sent-7, score-0.359]

4 We derive upper bounds on the generalization performance and show that the resulting convergence rates are optimal under various circumstances. [sent-8, score-0.428]

5 Usually the quality of the predictor can be measured by a loss function. [sent-11, score-0.206]

6 Our goal is to find a predictor so that its expected loss, given below, is as small as possible. [sent-13, score-0.093]
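In standard notation (this is the usual definition of the quantity described above, not the paper's exact displayed formula, which was lost in extraction), the expected loss of a predictor f is

\[ R(f) \;=\; \mathbf{E}\, L\big(f(X), Y\big), \]

where the expectation is taken over the true underlying distribution of (X, Y).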

7 where we use E to denote the expectation with respect to the true (but unknown) underlying distribution. [sent-14, score-0.087]

8 In this paper we focus on smooth convex loss functions that are second-order differentiable with respect to the first component. [sent-15, score-0.203]

9 In addition we assume that the second derivative is bounded both above and below (away from zero). [sent-16, score-0.196]
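Written out (a standard way to state this assumption; the constant names are illustrative, not taken from the paper), the condition is

\[ c_1 \;\le\; \frac{\partial^2}{\partial u^2}\, L(u, y) \;\le\; c_2 \qquad \text{for some constants } 0 < c_1 \le c_2 . \]

For example, the squared loss L(u, y) = (u - y)^2 satisfies this with c_1 = c_2 = 2, and the logistic loss satisfies it on any bounded range of u.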

10 For example, our analysis applies to important methods such as least-squares regression (a.k.a. Gaussian processes) and logistic regression in Hilbert spaces. [sent-17, score-0.118]

11 In order to obtain a good predictor from training data, it is necessary to start with a model of the functional relationship. [sent-18, score-0.198]

12 In this paper, we consider models that are subsets of some Hilbert function space. [sent-19, score-0.094]

13 In particular, we consider models in a bounded convex subset of this space. [sent-21, score-0.241]

14 We would like to find the best model in this subset. (This boundedness assumption is not essential.) [sent-22, score-0.137]

15 However in this paper, in order to emphasize the main idea, we shall avoid using a more complex derivation that handles more general situations. [sent-23, score-0.122]

16 In supervised learning, we construct an estimator from a set of training examples. [sent-27, score-0.112]

17 Throughout the paper, we use the hat symbol to denote empirical quantities based on the observed training data. [sent-28, score-0.163]

18 Specifically, we use the hatted expectation to denote the empirical expectation with respect to the training samples. [sent-29, score-0.162]

19 This assumption is equivalent to the condition that each data point can be regarded as a bounded linear functional on the Hilbert space. [sent-34, score-0.396]

20 Therefore we can define a representer ψ(x) through f(x) = ⟨f, ψ(x)⟩ for all f, where ⟨·, ·⟩ denotes the inner product of the Hilbert space. [sent-36, score-0.156]

21 It is clear that ψ(x) can be regarded as a representing feature vector of x in the Hilbert space. [sent-43, score-0.102]

22 In the literature, the inner product is often referred to as the kernel, and the space itself as the reproducing kernel Hilbert space determined by the kernel function. [sent-44, score-0.365]
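For reference, the standard reproducing-kernel relations behind sentences 20-22 (stated in the usual textbook form, not copied from the paper) are

\[ f(x) \;=\; \langle f, \psi(x) \rangle_{\mathcal H} \quad \text{for all } f \in \mathcal H, \qquad k(x, x') \;=\; \langle \psi(x), \psi(x') \rangle_{\mathcal H}, \]

so that ψ(x) plays the role of the representing feature vector and k is the kernel function that determines the space.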

23 The purpose of this paper is to develop bounds on the true risk of any empirical estimator, compared to the optimal risk, based on its observed risk. [sent-45, score-1.154]

24 Specifically, we seek a bound of the following form, in which a positive constant depends only on the loss function, and a parameter characterizes the effective data dimensionality of the learning problem. [sent-46, score-0.619]
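One plausible reading of the shape of this bound, consistent with the surrounding sentences but not the exact displayed formula (which was lost in extraction), is

\[ \mathbf{E}\, L(\hat f) \;\le\; \inf_{f \in \mathcal F} \mathbf{E}\, L(f) \;+\; c \Big( \hat{\mathbf{E}}\, L(\hat f) - \inf_{f \in \mathcal F} \hat{\mathbf{E}}\, L(f) \Big) \;+\; O\!\Big(\frac{p}{n}\Big), \]

where c depends only on the loss, p is the effective dimension, and n is the sample size; the middle term is non-positive when the estimator minimizes the empirical risk over the class, which is what sentence 25 states.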

25 If the estimator is the empirical estimator that minimizes the empirical risk over the class, then the second term on the right-hand side is non-positive. [sent-47, score-0.316]

26 It will be shown that if the space is finite dimensional, then the third term scales like d/n, where d is the dimension of the space. [sent-49, score-0.328]

27 If the space is infinite dimensional (or when its dimension is large compared to the sample size), one can adjust the scale appropriately based on the sample size to get a bound where the effective dimension at the optimal scale becomes sample-size dependent. [sent-50, score-0.773]

28 However, this effective dimension will never grow faster than the square root of the sample size, and hence even in the worst case the excess risk converges to zero at a rate no worse than the inverse square root of the sample size. [sent-51, score-0.327]
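A heuristic calculation, given here only to make the worst-case claim concrete (the generic bound shape and the constants are assumptions, not the paper's exact statement): write D(λ) for the effective dimension at scale λ and Σ for the second-moment operator of the data. If the excess risk at scale λ is of order λ + D(λ)/n and, for a bounded kernel, D(λ) ≤ tr(Σ)/λ, then

\[ \inf_{\lambda > 0} \Big( \lambda + \frac{\operatorname{tr}(\Sigma)}{\lambda\, n} \Big) \;=\; 2 \sqrt{\frac{\operatorname{tr}(\Sigma)}{n}}, \qquad \text{attained at } \lambda \asymp n^{-1/2}, \]

so the optimal scale gives an effective dimension of order the square root of n and an overall rate of order one over the square root of n, matching the worst-case behavior described above.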

29 A consequence of our analysis is to obtain convergence rates better than the worst-case rate; for empirical estimators with least squares loss, this issue has been considered in [1, 2, 4] among others. [sent-53, score-0.483]

30 The approach in [1] won’t lead to the optimal rate of convergence for nonparametric classes. [sent-54, score-0.294]

31 The covering-number based analysis in [2, 4] uses the chaining argument [4] and ratio large-deviation inequalities. [sent-55, score-0.095]

32 However, it is known that chaining does not always lead to the optimal convergence rate, and for many problems covering numbers can be rather difficult to estimate. [sent-56, score-0.27]

33 The effective dimension based analysis presented here, while restricted to learning problems in Hilbert spaces (kernel methods), addresses these issues. [sent-57, score-0.446]

34 2 Decomposition of the loss function. The function class under consideration is closed under the uniform norm topology. [sent-58, score-0.188]

35 This inequality, which involves the derivative of the loss function, will be very important in our analysis. [sent-61, score-0.276]

36 The Bregman distance: using the convexity of the loss at the optimal solution, we obtain the following condition. [sent-63, score-0.076]

37 It is well known (and easy to check) that for a convex function, its Bregman divergence is always non-negative. [sent-64, score-0.117]
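For reference, the textbook definition being used here (standard material, not text extracted from the paper): the Bregman divergence of a differentiable convex function φ is

\[ d_\phi(a, b) \;=\; \phi(a) - \phi(b) - \phi'(b)\,(a - b), \]

and convexity of φ is exactly the statement that φ lies above its tangents, so d_φ(a, b) ≥ 0 for all a, b.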

38 As mentioned in the introduction, we assume for simplicity that there exist positive constants bounding, from above and below, the second-order derivative of the loss with respect to the first variable. [sent-65, score-0.157]

39 Now, by the non-negativity of the Bregman divergence and (2), the two terms on the right-hand side of the above equality (3) are both non-negative. [sent-67, score-0.140]

40 The above decomposition gives the following decomposition of the loss function. [sent-69, score-0.366]

41 We define an inner product and, with the identity operator as reference, the following structure. [sent-75, score-0.221]

42 where tr denotes the trace of a linear operator; the corresponding norm is denoted accordingly. [sent-79, score-0.255]

43 In addition, we consider the inner product space of self-adjoint operators on the Hilbert space, with the inner product defined through the trace. [sent-80, score-0.406]
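A minimal sketch of the standard choice here (assuming the trace-based, i.e. Hilbert-Schmidt, inner product; the paper's exact definition was lost in extraction): for self-adjoint operators A, B on the Hilbert space,

\[ \langle A, B \rangle \;=\; \operatorname{tr}(A B), \qquad \|A\| \;=\; \sqrt{\operatorname{tr}(A^2)} . \]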

44 We use a dedicated notation to denote the self-adjoint operator associated with each data point. [sent-82, score-0.160]

45 Given a positive number, we define an associated self-adjoint operator on the Hilbert space. [sent-85, score-0.236]

46 The corresponding norm is defined through a positive definite self-adjoint operator, whose trace is the sum of its eigenvalues. [sent-86, score-0.255]

47 Therefore, applying the Cauchy-Schwarz inequality, we obtain the desired estimate; this proves the first inequality. [sent-92, score-0.369]

48 To show the second inequality, we simply observe that the left-hand side is the largest absolute eigenvalue of the operator, which is upper bounded by its norm. [sent-93, score-0.438]

49 Therefore the second inequality follows immediately from the definition of the norm. [sent-94, score-0.254]

50 The significance of Lemma 3.1 is that it bounds the behavior of any estimator (which can be sample dependent) in terms of the norm of the empirical mean of zero-mean Hilbert-space valued random vectors. [sent-97, score-0.507]

51 The convergence rate of the latter can be easily estimated from the variance of the random vectors, and therefore we have significantly simplified the problem. [sent-98, score-0.235]

52 In order to apply Lemma 3.1, and hence characterize the behavior of the learning problem, we shall introduce the following notion of effective data dimensionality at a given scale. [sent-100, score-0.472]
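A common way to write such a scale-sensitive effective dimension (a standard formulation consistent with the surrounding sentences, not the paper's displayed formula, which was lost in extraction): with Σ = E[ψ(x) ⊗ ψ(x)] the second-moment operator of the data and λ > 0 a scale,

\[ D(\lambda) \;=\; \operatorname{tr}\!\big[ (\Sigma + \lambda I)^{-1} \Sigma \big] \;=\; \sum_j \frac{\lambda_j}{\lambda_j + \lambda}, \]

where the λ_j are the eigenvalues of Σ. This quantity is at most the dimension of the space when the space is finite dimensional, which matches sentence 53.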

53 In particular, for a finite dimensional space, this quantity is upper bounded by the dimensionality of the space. [sent-103, score-0.406]

54 We also define the following quantities to measure the boundedness of the input data. [sent-106, score-0.203]

55 We obtain from Lemma 3.3 that, with probability at least 1 − δ, the stated bound holds. [sent-123, score-0.085]

56 Proof: We introduce the following notations for convenience. [sent-124, score-0.184]

57 If we choose the scale parameter appropriately, the generalization error is bounded as follows. [sent-126, score-0.076]

58 With probability at least 1 − δ, the generalization error is bounded as follows. [sent-129, score-0.204]

59 Consider any sample-dependent estimator satisfying the stated conditions; we are now ready to derive the following main result of the paper. Proof: The bounds are straightforward applications of (6) and the previous two lemmas. [sent-132, score-0.384]

60 Due to space limitations, we skip the details. [sent-133, score-0.139]

61 Similarly, with probability at least 1 − δ, a corresponding bound holds. [sent-136, score-0.076]

62 By Lemma 3.2, with probability at least 1 − δ, we obtain the bound (6). [sent-138, score-0.152]

63 In this paper, we shall use the following variant of the above bound for convenience. [sent-139, score-0.301]

64 Next we need to use the following version of Bernstein's inequality in Hilbert spaces. [sent-144, score-0.273]
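One standard statement of Bernstein's inequality for Hilbert-space valued random variables (the precise constants vary across references, and this need not be the exact form used in the paper): if ξ_1, ..., ξ_n are i.i.d. zero-mean random vectors in a Hilbert space with ‖ξ_i‖ ≤ M and E‖ξ_i‖² ≤ σ², then with probability at least 1 − δ,

\[ \Big\| \frac{1}{n} \sum_{i=1}^{n} \xi_i \Big\| \;\le\; \sqrt{\frac{2 \sigma^2 \log(2/\delta)}{n}} \;+\; \frac{2 M \log(2/\delta)}{3 n} . \]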

65 Similarly to the proof of the first inequality, it is easy to check that this implies the third inequality. [sent-145, score-0.284]

66 In the worst case, we have the corresponding bound. [sent-150, score-0.078]

67 We combine these results and obtain, with probability at least 1 − δ, the stated bound (leading into the section on worst-case effective dimensionality and generalization). [sent-152, score-0.522]

68 In this case, we will only consider the empirical estimator that minimizes the empirical risk over the model class, as in the corollary below. [sent-153, score-0.277]

69 Assume also that the diameter of the model class is bounded. [sent-160, score-0.128]

70 Under the stated conditions, we have the given bound with probability at least 1 − δ. Note that both quantities go to zero as the sample size grows, so the assumption can be satisfied as long as we pick a scale larger than a critical value. [sent-162, score-0.183]

71 Using the bound above, we easily obtain the following result. [sent-163, score-0.264]

72 Solving the inequality, we obtain the stated result. [sent-165, score-0.293]

73 We can derive from (7) a bound which immediately implies the theorem. [sent-168, score-0.147]

74 Therefore we can let & • y §  y We can use the bound Corollary 3. [sent-171, score-0.19]

75 Smoothing splines. For simplicity, we only consider one-dimensional problems. [sent-173, score-0.131]

76 For smoothing splines, the corresponding Hilbert space consists of functions whose m-th derivative satisfies a boundedness (smoothness) condition. [sent-174, score-0.340]

77 We may consider periodic functions (or their restrictions to an interval), and the smoothness condition then corresponds to a decay condition on the Fourier coefficients. [sent-175, score-0.093]

78 Specifically, the space can be regarded as the reproducing kernel Hilbert space whose kernel is determined by this Fourier decay condition. [sent-176, score-0.682]
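A standard example of such a periodic spline (Sobolev) kernel of smoothness order m, given for illustration and not extracted from the paper, is

\[ k(x, x') \;=\; 1 + \sum_{j=1}^{\infty} \frac{2}{(2\pi j)^{2m}} \cos\!\big( 2\pi j (x - x') \big), \]

whose eigenvalues decay like j^{-2m}; it is this polynomial eigenvalue decay that keeps the effective dimension small at moderate scales.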

79 This gives the following bound (with probability at least 1 − δ). [sent-185, score-0.255]

80 This rate matches the best possible convergence rate for any data-dependent estimator. [sent-187, score-0.235]
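For a smoothness index m, the classical minimax rate in this nonparametric setting (well known in the statistics literature, and presumably the benchmark being matched here) is

\[ \mathbf{E}\, L(\hat f) - \inf_{f} \mathbf{E}\, L(f) \;=\; O\!\big( n^{-2m/(2m+1)} \big) . \]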

81 Exponential kernel. The exponential kernel has recently been popularized by Vapnik. [sent-188, score-0.465]

82 Again, for simplicity, we consider one-dimensional problems. [sent-189, score-0.092]

83 The kernel function is given by an exponential expression in the inputs. [sent-190, score-0.209]

84 We obtain an upper bound on the effective dimension for exponential kernels. [sent-191, score-0.782]
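The property driving this bound (stated here as an assumption about the spectrum, since the displayed bound was lost in extraction) is that exponential-type kernels have exponentially decaying eigenvalues; if λ_j ≈ c ρ^j with 0 < ρ < 1, then

\[ D(\lambda) \;=\; \sum_j \frac{\lambda_j}{\lambda_j + \lambda} \;=\; O\!\big( \log(1/\lambda) \big), \]

so the effective dimension grows only logarithmically as the scale decreases.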

85 Therefore the rate for exponential kernels improves accordingly. Conclusion: In this paper, we introduced a concept of scale-sensitive effective data dimension, and used it to derive generalization bounds for some kernel learning problems. [sent-194, score-0.721]

86 The resulting convergence rates are optimal for various learning problems. [sent-195, score-0.23]

87 We have also shown the following (footnote: the lower bound is well known in the non-parametric statistical literature; for example, see [3]). [sent-196, score-0.114]

88 The effective dimension at the appropriately chosen optimal scale can be sample-size dependent, and behaves like the square root of the sample size in the worst case. [sent-197, score-0.688]

89 This shows that despite the claim that a kernel method learns a predictor from an infinite dimensional Hilbert space, for a fixed sample size the effective dimension is rather small. [sent-198, score-0.814]

90 This in fact indicates that they are not any more powerful than learning in an appropriately chosen finite dimensional space. [sent-199, score-0.146]

91 This observation also raises the following computational question: given n samples, kernel methods use n parameters in the computation, but as we have shown, the effective number of parameters (the effective dimension) is much smaller. [sent-200, score-0.494]

92 Therefore it could be possible to significantly reduce the computational cost of kernel methods by explicitly parameterizing the effective dimensions. [sent-201, score-0.476]
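The paper does not spell out an algorithm for this; as one illustration of the idea, here is a minimal Nyström-style sketch in Python (all names and parameters are hypothetical) that fits a kernel ridge regressor using only m landmark points, roughly the effective number of parameters, instead of all n training points.

import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise Gaussian kernel matrix between the rows of X and Y.
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def nystrom_ridge(X, y, m=50, lam=1e-2, gamma=1.0, seed=0):
    # Kernel ridge regression restricted to m landmark points:
    # instead of n kernel coefficients we solve an m-dimensional
    # regularized least-squares problem.
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=min(m, X.shape[0]), replace=False)
    Z = X[idx]                          # landmark points
    K_nm = rbf_kernel(X, Z, gamma)      # n x m
    K_mm = rbf_kernel(Z, Z, gamma)      # m x m
    A = K_nm.T @ K_nm + lam * X.shape[0] * K_mm
    b = K_nm.T @ y
    alpha = np.linalg.solve(A + 1e-10 * np.eye(len(idx)), b)
    return Z, alpha

def nystrom_predict(Z, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, Z, gamma) @ alpha

# Usage sketch on synthetic data.
X = np.random.randn(500, 3)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(500)
Z, alpha = nystrom_ridge(X, y, m=50)
y_hat = nystrom_predict(Z, alpha, X[:5])

The landmark count m plays the role of the effective number of parameters here; increasing it trades computation for accuracy.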

93 A Properties of the scale-sensitive effective data dimension. We list some properties of the scale-sensitive data dimension. [sent-202, score-0.568]

94 Due to space limitations, we shall skip the proofs. [sent-203, score-0.261]

95 The following lemma implies that the quantity behaves like dimension if the underlying space is finite dimensional. [sent-204, score-0.677]

96 Proposition A.1: If the space is finite dimensional, then the effective dimension is bounded by its dimension; for general Hilbert spaces, we have the following bound. [sent-212, score-0.329]

97 2 Consider the complete set of ortho-normal eigen-pairs of the operator , where if and . [sent-213, score-0.16]
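In terms of these eigen-pairs, the spectral expression for the effective dimension sketched earlier can be computed numerically; the snippet below (an illustration under that assumed spectral formula, not code from the paper) estimates it from the eigenvalues of a kernel Gram matrix.

import numpy as np

def effective_dimension(gram, lam):
    # Estimate D(lam) = sum_j mu_j / (mu_j + lam), where the mu_j are
    # eigenvalues of the scaled Gram matrix K / n, which approximate
    # the eigenvalues of the underlying operator.
    n = gram.shape[0]
    mu = np.linalg.eigvalsh(gram / n)
    mu = np.clip(mu, 0.0, None)   # remove tiny negative round-off values
    return float(np.sum(mu / (mu + lam)))

# Example: a Gaussian Gram matrix on random one-dimensional inputs.
x = np.random.randn(200, 1)
K = np.exp(-((x - x.T) ** 2))
print(effective_dimension(K, lam=1e-2))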

98 In many cases, we can find a so-called feature representation of the kernel function. [sent-226, score-0.209]

99 3 Consider the following feature space decomposition of kernel: , where each is a real valued function. [sent-231, score-0.284]

100 The importance of convexity in learning with squared loss. [sent-240, score-0.076]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hilbert', 0.299), ('effective', 0.22), ('kernel', 0.209), ('inequality', 0.208), ('lemma', 0.19), ('dimension', 0.174), ('hgt', 0.163), ('sx', 0.161), ('operator', 0.16), ('corollary', 0.137), ('proposition', 0.137), ('bounded', 0.128), ('shall', 0.122), ('empirical', 0.12), ('bounds', 0.114), ('bound', 0.114), ('predictor', 0.113), ('estimator', 0.112), ('convergence', 0.105), ('inner', 0.104), ('decomposition', 0.104), ('regarded', 0.102), ('behaves', 0.1), ('dimensional', 0.098), ('norm', 0.095), ('chaining', 0.095), ('boundedness', 0.095), ('loss', 0.093), ('bregman', 0.09), ('nite', 0.089), ('splines', 0.086), ('obtain', 0.085), ('skip', 0.081), ('worst', 0.078), ('let', 0.076), ('generalization', 0.076), ('least', 0.076), ('convexity', 0.076), ('implying', 0.076), ('ud', 0.072), ('proof', 0.071), ('optimal', 0.07), ('risk', 0.069), ('derivative', 0.068), ('convex', 0.068), ('valued', 0.066), ('upper', 0.066), ('therefore', 0.065), ('following', 0.065), ('rate', 0.065), ('dimensionality', 0.065), ('reproducing', 0.064), ('fr', 0.062), ('characterizes', 0.062), ('limitation', 0.058), ('cally', 0.057), ('equality', 0.056), ('third', 0.056), ('concept', 0.055), ('rates', 0.055), ('implies', 0.054), ('check', 0.054), ('nonparametric', 0.054), ('spaces', 0.052), ('product', 0.052), ('easy', 0.049), ('space', 0.049), ('condition', 0.048), ('fe', 0.048), ('appropriately', 0.048), ('tial', 0.047), ('popularized', 0.047), ('tong', 0.047), ('tzhang', 0.047), ('yorktown', 0.047), ('parameterizing', 0.047), ('py', 0.047), ('wf', 0.047), ('won', 0.047), ('simplicity', 0.047), ('derive', 0.047), ('smoothing', 0.047), ('dependent', 0.046), ('immediately', 0.046), ('dc', 0.045), ('consider', 0.045), ('underlying', 0.045), ('de', 0.044), ('worse', 0.044), ('trp', 0.043), ('tb', 0.043), ('watson', 0.043), ('hut', 0.043), ('notations', 0.043), ('quantities', 0.043), ('respect', 0.042), ('squares', 0.042), ('side', 0.042), ('hand', 0.042), ('assumption', 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

Author: Tong Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert function spaces. We introduce a concept of scale-sensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. We derive upper bounds on the generalization performance and show that the resulting convergence rates are optimal under various circumstances.

2 0.24432951 156 nips-2002-On the Complexity of Learning the Kernel Matrix

Author: Olivier Bousquet, Daniel Herrmann

Abstract: We investigate data based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow to vary the spectrum and keep the eigenvectors fix. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.

3 0.2280238 106 nips-2002-Hyperkernels

Author: Cheng S. Ong, Robert C. Williamson, Alex J. Smola

Abstract: We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.

4 0.19456528 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

Author: Ron Meir, Tong Zhang

Abstract: We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases, where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting. 1 Introduction and Motivation The standard approach to Computational Learning Theory is usually formulated within the so-called frequentist approach to Statistics. Within this paradigm one is interested in constructing an estimator, based on a finite sample, which possesses a small loss (generalization error). While many algorithms have been constructed and analyzed within this context, it is not clear how these approaches relate to standard optimality criteria within the frequentist framework. Two classic optimality criteria within the latter approach are the minimax and admissibility criteria, which characterize optimality of estimators in a rigorous and precise fashion [9]. Except in some special cases [12], it is not known whether any of the approaches used within the Learning community lead to optimality in either of the above senses of the word. On the other hand, it is known that under certain regularity conditions, Bayesian estimators lead to either minimax or admissible estimators, and thus to well-defined optimality in the classical (frequentist) sense. In fact, it can be shown that Bayes estimators are essentially the only estimators which can achieve optimality in the above senses [9]. This optimality feature provides strong motivation for the study of Bayesian approaches in a frequentist setting. While Bayesian approaches have been widely studied, there have not been generally applicable bounds in the frequentist framework. Recently, several approaches have attempted to address this problem. In this paper we establish finite sample datadependent bounds for Bayesian mixture methods, which together with the above optimality properties suggest that these approaches should become more widely used. Consider the problem of supervised learning where we attempt to construct an estimator based on a finite sample of pairs of examples S = {(x1 , y1 ), . . . , (xn , yn )}, each drawn independently according to an unknown distribution µ(x, y). Let A be a learning algorithm which, based on the sample S, constructs a hypothesis (estimator) h from some set of hypotheses H. Denoting by (y, h(x)) the instantaneous loss of the hypothesis h, we wish to assess the true loss L(h) = Eµ (y, h(x)) where the expectation is taken with respect to µ. In particular, the objective is to provide data-dependent bounds of the following form. For any h ∈ H and δ ∈ (0, 1), with probability at least 1 − δ, L(h) ≤ Λ(h, S) + ∆(h, S, δ), (1) where Λ(h, S) is some empirical assessment of the true loss, and ∆(h, S, δ) is a complexity term. 
For example, in the classic Vapnik-Chervonenkis framework, Λ(h, S) n is the empirical error (1/n) i=1 (yi , h(xi )) and ∆(h, S, δ) depends on the VCdimension of H but is independent of both the hypothesis h and the sample S. By algorithm and data-dependent bounds we mean bounds where the complexity term depends on both the hypothesis (chosen by the algorithm A) and the sample S. 2 A Decision Theoretic Bayesian Framework Consider a decision theoretic setting where we define the sample dependent loss of an algorithm A by R(µ, A, S) = Eµ (y, A(x, S)). Let θµ be the optimal predictor for y, namely the function minimizing Eµ { (y, φ(x))} over φ. It is clear that the best algorithm A (Bayes algorithm) is the one that always return θµ , assuming µ is known. We are interested in the expected loss of an algorithm averaged over samples S: R(µ, A) = ES R(µ, A, S) = R(µ, A, S)dµ(S), where the expectation is taken with respect to the sample S drawn i.i.d. from the probability measure µ. If we consider a family of measures µ, which possesses some underlying prior distribution π(µ), then we can construct the averaged risk function with respect to the prior as, r(π, A) = Eπ R(µ, A) = where dπ(µ|S) = dµ(S)dπ(µ) dµ(S)dπ(µ) dµ(S)dπ(µ) R(µ, A, S)dπ(µ|S), is the posterior distribution on the µ family, which µ induces a posterior distribution on the sample space as πS = Eπ(µ|S) µ. An algorithm minimizing the Bayes risk r(π, A) is referred to as a Bayes algorithm. In fact, for a given prior, and a given sample S, the optimal algorithm should return the Bayes optimal predictor with respect to the posterior measure πS . For many important practical problems, the optimal Bayes predictor is a linear functional of the underlying probability measure. For example, if the loss function is quadratic, namely (y, A(x)) = (y −A(x))2 , then the optimal Bayes predictor θµ (x) is the conditional mean of y, namely Eµ [y|x]. For binary classification problems, we can let the predictor be the conditional probability θµ (x) = µ(y = 1|x) (the optimal classification decision rule then corresponds to a test of whether θµ (x) > 0.5), which is also a linear functional of µ. Clearly if the Bayes predictor is a linear functional of the probability measure, then the optimal Bayes algorithm with respect to the prior π is given by AB (x, S) = θµ (x)dπ(µ|S) = µ θ (x)dµ(S)dπ(µ) µ µ µ dµ(S)dπ(µ) . (2) In this case, an optimal Bayesian algorithm can be regarded as the predictor constructed by averaging over all predictors with respect to a data-dependent posterior π(µ|S). We refer to such methods as Bayesian mixture methods. While the Bayes estimator AB (x, S) is optimal with respect to the Bayes risk r(π, A), it can be shown, that under appropriate conditions (and an appropriate prior) it is also a minimax and admissible estimator [9]. In general, θµ is unknown. Rather we may have some prior information about possible models for θµ . In view of (2) we consider a hypothesis space H, and an algorithm based on a mixture of hypotheses h ∈ H. This should be contrasted with classical approaches where an algorithm selects a single hypothesis h form H. For simplicity, we consider a countable hypothesis space H = {h1 , h2 , . . .}; the general case will be deferred to the full paper. Let q = {qj }∞ be a probability j=1 vector, namely qj ≥ 0 and j qj = 1, and construct the composite predictor by fq (x) = j qj hj (x). Observe that in general fq (x) may be a great deal more complex that any single hypothesis hj . 
For example, if hj (x) are non-polynomial ridge functions, the composite predictor f corresponds to a two-layer neural network with universal approximation power. We denote by Q the probability distribution defined by q, namely j qj hj = Eh∼Q h. A main feature of this work is the establishment of data-dependent bounds on L(Eh∼Q h), the loss of the Bayes mixture algorithm. There has been a flurry of recent activity concerning data-dependent bounds (a non-exhaustive list includes [2, 3, 5, 11, 13]). In a related vein, McAllester [7] provided a data-dependent bound for the so-called Gibbs algorithm, which selects a hypothesis at random from H based on the posterior distribution π(h|S). Essentially, this result provides a bound on the average error Eh∼Q L(h) rather than a bound on the error of the averaged hypothesis. Later, Langford et al. [6] extended this result to a mixture of classifiers using a margin-based loss function. A more general result can also be obtained using the covering number approach described in [14]. Finally, Herbrich and Graepel [4] showed that under certain conditions the bounds for the Gibbs classifier can be extended to a Bayesian mixture classifier. However, their bound contained an explicit dependence on the dimension (see Thm. 3 in [4]). Although the approach pioneered by McAllester came to be known as PAC-Bayes, this term is somewhat misleading since an optimal Bayesian method (in the decision theoretic framework outline above) does not average over loss functions but rather over hypotheses. In this regard, the learning behavior of a true Bayesian method is not addressed in the PAC-Bayes analysis. In this paper, we would like to narrow the discrepancy by analyzing Bayesian mixture methods, where we consider a predictor that is the average of a family of predictors with respect to a data-dependent posterior distribution. Bayesian mixtures can often be regarded as a good approximation to a true optimal Bayesian method. In fact, we have shown above that they are equivalent for many important practical problems. Therefore the main contribution of the present work is the extension of the above mentioned results in PAC-Bayes analysis to a rather unified setting for Bayesian mixture methods, where different regularization criteria may be incorporated, and their effect on the performance easily assessed. Furthermore, it is also essential that the bounds obtained are dimension-independent, since otherwise they yield useless results when applied to kernel-based methods, which often map the input space into a space of very high dimensionality. Similar results can also be obtained using the covering number analysis in [14]. However the approach presented in the current paper, which relies on the direct computation of the Rademacher complexity, is more direct and gives better bounds. The analysis is also easier to generalize than the corresponding covering number approach. Moreover, our analysis applies directly to other non-Bayesian mixture approaches such as Bagging and Boosting. Before moving to the derivation of our bounds, we formalize our approach. Consider a countable hypothesis space H = {hj }∞ , and a probability distribution {qj } over j=1 ∞ H. Introduce the vector notation k=1 qk hk (x) = q h(x). A learning algorithm within the Bayesian mixture framework uses the sample S to select a distribution Q over H and then constructs a mixture hypothesis fq (x) = q h(x). 
In order to constrain the class of mixtures used in constructing the mixture q h we impose constraints on the mixture vector q. Let g(q) be a non-negative convex function of q and define for any positive A, ΩA = {q ∈ S : g(q) ≤ A} ; FA = fq : fq (x) = q h(x) : q ∈ ΩA , (3) where S denotes the probability simplex. In subsequent sections we will consider different choices for g(q), which essentially acts as a regularization term. Finally, for any mixture q h we define the loss by L(q h) = Eµ (y, (q h)(x)) and the n ˆ empirical loss incurred on the sample by L(q h) = (1/n) i=1 (yi , (q h)(xi )). 3 A Mixture Algorithm with an Entropic Constraint In this section we consider an entropic constraint, which penalizes weights deviating significantly from some prior probability distribution ν = {νj }∞ , which may j=1 incorporate our prior information about he problem. The weights q themselves are chosen by the algorithm based on the data. In particular, in this section we set g(q) to be the Kullback-Leibler divergence of q from ν, g(q) = D(q ν) ; qj log(qj /νj ). D(q ν) = j Let F be a class of real-valued functions, and denote by σi independent Bernoulli random variables assuming the values ±1 with equal probability. We define the data-dependent Rademacher complexity of F as 1 ˆ Rn (F) = Eσ sup n f ∈F n σi f (xi ) |S . i=1 ˆ The expectation of Rn (F) with respect to S will be denoted by Rn (F). We note ˆ n (F) is concentrated around its mean value Rn (F) (e.g., Thm. 8 in [1]). We that R quote a slightly adapted result from [5]. Theorem 1 (Adapted from Theorem 1 in [5]) Let {x1 , x2 , . . . , xn } ∈ X be a sequence of points generated independently at random according to a probability distribution P , and let F be a class of measurable functions from X to R. Furthermore, let φ be a non-negative Lipschitz function with Lipschitz constant κ, such that φ◦f is uniformly bounded by a constant M . Then for all f ∈ F with probability at least 1 − δ Eφ(f (x)) − 1 n n φ(f (xi )) ≤ 4κRn (F) + M i=1 log(1/δ) . 2n An immediate consequence of Theorem 1 is the following. Lemma 3.1 Let the loss function be bounded by M , and assume that it is Lipschitz with constant κ. Then for all q ∈ ΩA with probability at least 1 − δ log(1/δ) . 2n ˆ L(q h) ≤ L(q h) + 4κRn (FA ) + M Next, we bound the empirical Rademacher average of FA using g(q) = D(q ν). Lemma 3.2 The empirical Rademacher complexity of FA is upper bounded as follows: 2A n ˆ Rn (FA ) ≤ 1 n sup j n hj (xi )2 . i=1 Proof: We first recall a few facts from the theory of convex duality [10]. Let p(u) be a convex function over a domain U , and set its dual s(z) = supu∈U u z − p(u) . It is known that s(z) is also convex. Setting u = q and p(q) = j qj log(qj /νj ) we find that s(v) = log j νj ezj . From the definition of s(z) it follows that for any q ∈ S, q z≤ qj log(qj /νj ) + log ν j ez j . j j Since z is arbitrary, we set z = (λ/n) i σi h(xi ) and conclude that for q ∈ ΩA and any λ > 0   n   1 λ 1 sup A + log νj exp σi q h(xi ) ≤ σi hj (xi ) .  n i=1 λ n i q∈ΩA j Taking the expectation with respect to σ, and using 2 Eσ {exp ( i σi ai )} ≤ exp i ai /2 , we have that  λ 1 ˆ νj exp A + Eσ log σi hj (xi ) Rn (FA ) ≤ λ n j ≤ ≤ = i 1 λ A + sup log Eσ exp 1 λ A + sup log exp j j A λ + 2 sup λ 2n j λ2 n2 λ n i σi hj (xi ) the Chernoff bound    (Jensen) i hj (xi )2 2 (Chernoff) hj (xi )2 . i Minimizing the r.h.s. with respect to λ, we obtain the desired result. Combining Lemmas 3.1 and 3.2 yields our basic bound, where κ and M are defined in Lemma 3.1. 
Theorem 2 Let S = {(x1 , y1 ), . . . , (xn , yn )} be a sample of i.i.d. points each drawn according to a distribution µ(x, y). Let H be a countable hypothesis class, and set FA to be the class defined in (3) with g(q) = D(q ν). Set ∆H = (1/n)Eµ supj 1−δ n i=1 hj (xi )2 1/2 . Then for any q ∈ ΩA with probability at least ˆ L(q h) ≤ L(q h) + 4κ∆H 2A +M n log(1/δ) . 2n Note that if hj are uniformly bounded, hj ≤ c, then ∆H ≤ c. Theorem 2 holds for a fixed value of A. Using the so-called multiple testing Lemma (e.g. [11]) we obtain: Corollary 3.1 Let the assumptions of Theorem 2 hold, and let {Ai , pi } be a set of positive numbers such that i pi = 1. Then for all Ai and q ∈ ΩAi with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + 4κ∆H 2Ai +M n log(1/pi δ) . 2n Note that the only distinction with Theorem 2 is the extra factor of log pi which is the price paid for the uniformity of the bound. Finally, we present a data-dependent bound of the form (1). Theorem 3 Let the assumptions of Theorem 2 hold. Then for all q ∈ S with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + max(κ∆H , M ) × 130D(q ν) + log(1/δ) . n (4) Proof sketch Pick Ai = 2i and pi = 1/i(i + 1), i = 1, 2, . . . (note that i pi = 1). For each q, let i(q) be the smallest index for which Ai(q) ≥ D(q ν) implying that log(1/pi(q) ) ≤ 2 log log2 (4D(q ν)). A few lines of algebra, to be presented in the full paper, yield the desired result. The results of Theorem 3 can be compared to those derived by McAllester [8] for the randomized Gibbs procedure. In the latter case, the first term on the r.h.s. is ˆ Eh∼Q L(h), namely the average empirical error of the base classifiers h. In our case ˆ the corresponding term is L(Eh∼Q h), namely the empirical error of the average hypothesis. Since Eh∼Q h is potentially much more complex than any single h ∈ H, we expect that the empirical term in (4) is much smaller than the corresponding term in [8]. Moreover, the complexity term we obtain is in fact tighter than the corresponding term in [8] by a logarithmic factor in n (although the logarithmic factor in [8] could probably be eliminated). We thus expect that Bayesian mixture approach advocated here leads to better performance guarantees. Finally, we comment that Theorem 3 can be used to obtain so-called oracle inequalities. In particular, let q∗ be the optimal distribution minimizing L(q h), which can only be computed if the underlying distribution µ(x, y) is known. Consider an ˆ algorithm which, based only on the data, selects a distribution q by minimizing the r.h.s. of (4), with the implicit constants appropriately specified. Then, using standard approaches (e.g. [2]) we can obtain a bound on L(ˆ h) − L(q∗ h). For q lack of space, we defer the derivation of the precise bound to the full paper. 4 General Data-Dependent Bounds for Bayesian Mixtures The Kullback-Leibler divergence is but one way to incorporate prior information. In this section we extend the results to general convex regularization functions g(q). Some possible choices for g(q) besides the Kullback-Leibler divergence are the standard Lp norms q p . In order to proceed along the lines of Section 3, we let s(z) be the convex function associated with g(q), namely s(z) = supq∈ΩA q z − g(q) . Repeating n 1 the arguments of Section 3 we have for any λ > 0 that n i=1 σi q h(xi ) ≤ 1 λ i σi h(xi ) , which implies that λ A+s n 1 ˆ Rn (FA ) ≤ inf λ≥0 λ λ n A + Eσ s σi h(xi ) . 
(5) i n Assume that s(z) is second order differentiable, and that for any h = i=1 σi h(xi ) 1 2 (s(h + ∆h) + s(h − ∆h)) − s(h) ≤ u(∆h). Then, assuming that s(0) = 0, it is easy to show by induction that n n Eσ s (λ/n) i=1 σi h(xi ) ≤ u((λ/n)h(xi )). (6) i=1 In the remainder of the section we focus on the the case of regularization based on the Lp norm. Consider p and q such that 1/q + 1/p = 1, p ∈ (1, ∞), and let p = max(p, 2) and q = min(q, 2). Note that if p ≤ 2 then q ≥ 2, q = p = 2 and if p > 2 1 then q < 2, q = q, p = p. Consider p-norm regularization g(q) = p q p , in which p 1 case s(z) = q z q . The Rademacher averaging result for p-norm regularization q is known in the Geometric theory of Banach spaces (type structure of the Banach space), and it also follows from Khinchtine’s inequality. We show that it can be easily obtained in our framework. In this case, it is easy to see that s(z) = Substituting in (5) we have 1 ˆ Rn (FA ) ≤ inf λ≥0 λ A+ λ n q−1 q where Cq = ((q − 1)/q ) 1/q q 1 q z q q n h(xi ) q q = i=1 implies u(h(x)) ≤ Cq A1/p n1/p 1 n q−1 q h(x) q q . 1/q n h(xi ) q q i=1 . Combining this result with the methods described in Section 3, we establish a bound for regularization based on the Lp norm. Assume that h(xi ) q is finite for all i, and set ∆H,q = E (1/n) n i=1 h(xi ) q q 1/q . Theorem 4 Let the conditions of Theorem 3 hold and set g(q) = (1, ∞). Then for all q ∈ S, with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + max(κ∆H,q , M ) × O q p + n1/p log log( q p 1 p q p p , p ∈ + 3) + log(1/δ) n where O(·) hides a universal constant that depends only on p. 5 Discussion We have introduced and analyzed a class of regularized Bayesian mixture approaches, which construct complex composite estimators by combining hypotheses from some underlying hypothesis class using data-dependent weights. Such weighted averaging approaches have been used extensively within the Bayesian framework, as well as in more recent approaches such as Bagging and Boosting. While Bayesian methods are known, under favorable conditions, to lead to optimal estimators in a frequentist setting, their performance in agnostic settings, where no reliable assumptions can be made concerning the data generating mechanism, has not been well understood. Our data-dependent bounds allow the utilization of Bayesian mixture models in general settings, while at the same time taking advantage of the benefits of the Bayesian approach in terms of incorporation of prior knowledge. The bounds established, being independent of the cardinality of the underlying hypothesis space, can be directly applied to kernel based methods. Acknowledgments We thank Shimon Benjo for helpful discussions. The research of R.M. is partially supported by the fund for promotion of research at the Technion and by the Ollendorff foundation of the Electrical Engineering department at the Technion. References [1] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 224–240, 2001. [2] P.L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002. [3] O. Bousquet and A. Chapelle. Stability and generalization. J. Machine Learning Research, 2:499–526, 2002. [4] R. Herbrich and T. Graepel. A pac-bayesian margin bound for linear classifiers; why svms work. In Advances in Neural Information Processing Systems 13, pages 224–230, Cambridge, MA, 2001. MIT Press. 
[5] V. Koltchinksii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statis., 30(1), 2002. [6] J. Langford, M. Seeger, and N. Megiddo. An improved predictive accuracy bound for averaging classifiers. In Proceeding of the Eighteenth International Conference on Machine Learning, pages 290–297, 2001. [7] D. A. McAllester. Some pac-bayesian theorems. In Proceedings of the eleventh Annual conference on Computational learning theory, pages 230–234, New York, 1998. ACM Press. [8] D. A. McAllester. PAC-bayesian model averaging. In Proceedings of the twelfth Annual conference on Computational learning theory, New York, 1999. ACM Press. [9] C. P. Robert. The Bayesian Choice: A Decision Theoretic Motivation. Springer Verlag, New York, 1994. [10] R.T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, N.J., 1970. [11] J. Shawe-Taylor, P. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE trans. Inf. Theory, 44:1926– 1940, 1998. [12] Y. Yang. Minimax nonparametric classification - part I: rates of convergence. IEEE Trans. Inf. Theory, 45(7):2271–2284, 1999. [13] T. Zhang. Generalization performance of some learning problems in hilbert functional space. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2001. MIT Press. [14] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.

5 0.19393115 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

Author: Luis E. Ortiz, David A. McAllester

Abstract: This paper gives distribution-free concentration inequalities for the missing mass and the error rate of histogram rules. Negative association methods can be used to reduce these concentration problems to concentration questions about independent sums. Although the sums are independent, they are highly heterogeneous. Such highly heterogeneous independent sums cannot be analyzed using standard concentration inequalities such as Hoeffding’s inequality, the Angluin-Valiant bound, Bernstein’s inequality, Bennett’s inequality, or McDiarmid’s theorem.

6 0.17614534 120 nips-2002-Kernel Design Using Boosting

7 0.16821398 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

8 0.12700854 119 nips-2002-Kernel Dependency Estimation

9 0.12407067 161 nips-2002-PAC-Bayes & Margins

10 0.11586753 113 nips-2002-Information Diffusion Kernels

11 0.11461932 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA

12 0.11202866 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

13 0.11015797 125 nips-2002-Learning Semantic Similarity

14 0.10629527 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces

15 0.10619304 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

16 0.10441911 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

17 0.099547356 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata

18 0.097952381 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

19 0.0898242 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization

20 0.089815639 145 nips-2002-Mismatch String Kernels for SVM Protein Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.287), (1, -0.171), (2, 0.042), (3, -0.15), (4, -0.069), (5, -0.096), (6, 0.044), (7, 0.184), (8, -0.048), (9, 0.031), (10, 0.14), (11, -0.106), (12, 0.068), (13, -0.086), (14, 0.038), (15, -0.003), (16, 0.004), (17, -0.21), (18, -0.135), (19, -0.085), (20, -0.068), (21, 0.1), (22, 0.007), (23, 0.073), (24, -0.049), (25, 0.143), (26, -0.011), (27, 0.012), (28, 0.045), (29, 0.057), (30, -0.072), (31, 0.089), (32, 0.077), (33, -0.13), (34, 0.005), (35, 0.051), (36, -0.035), (37, -0.019), (38, -0.117), (39, 0.088), (40, -0.01), (41, 0.014), (42, -0.044), (43, 0.008), (44, 0.125), (45, -0.044), (46, 0.034), (47, -0.01), (48, 0.033), (49, -0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98160577 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

Author: Tong Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert function Spaces. We introduce a concept of scalesensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. We derive upper bounds on the generalization performance and show that the resulting convergent rates are optimal under various circumstances.

2 0.79125202 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

Author: Luis E. Ortiz, David A. McAllester

Abstract: This paper gives distribution-free concentration inequalities for the missing mass and the error rate of histogram rules. Negative association methods can be used to reduce these concentration problems to concentration questions about independent sums. Although the sums are independent, they are highly heterogeneous. Such highly heterogeneous independent sums cannot be analyzed using standard concentration inequalities such as Hoeffding’s inequality, the Angluin-Valiant bound, Bernstein’s inequality, Bennett’s inequality, or McDiarmid’s theorem.

3 0.77924496 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

Author: Christopher Williams, John S. Shawe-taylor

Abstract: In this paper we analyze the relationships between the eigenvalues of the m x m Gram matrix K for a kernel k(·, .) corresponding to a sample Xl, ... ,X m drawn from a density p(x) and the eigenvalues of the corresponding continuous eigenproblem. We bound the differences between the two spectra and provide a performance bound on kernel peA. 1

4 0.72564 156 nips-2002-On the Complexity of Learning the Kernel Matrix

Author: Olivier Bousquet, Daniel Herrmann

Abstract: We investigate data based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow to vary the spectrum and keep the eigenvectors fix. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.

5 0.72444499 106 nips-2002-Hyperkernels

Author: Cheng S. Ong, Robert C. Williamson, Alex J. Smola

Abstract: We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.

6 0.68923259 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

7 0.54439616 113 nips-2002-Information Diffusion Kernels

8 0.53618956 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA

9 0.50727969 120 nips-2002-Kernel Design Using Boosting

10 0.48942438 139 nips-2002-Margin-Based Algorithms for Information Filtering

11 0.48887715 161 nips-2002-PAC-Bayes & Margins

12 0.46703371 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces

13 0.46633127 119 nips-2002-Kernel Dependency Estimation

14 0.4517715 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

15 0.44236067 125 nips-2002-Learning Semantic Similarity

16 0.42719945 124 nips-2002-Learning Graphical Models with Mercer Kernels

17 0.4140231 178 nips-2002-Robust Novelty Detection with Single-Class MPM

18 0.41214496 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers

19 0.40687352 167 nips-2002-Rational Kernels

20 0.40484065 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.011), (42, 0.032), (54, 0.718), (74, 0.043), (92, 0.025), (98, 0.092)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99842054 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

Author: Tong Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert function Spaces. We introduce a concept of scalesensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. We derive upper bounds on the generalization performance and show that the resulting convergent rates are optimal under various circumstances.

2 0.99336588 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

Author: Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik

Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. Keywords: use and refinement of prior knowledge, support vector machines, linear programming 1

3 0.99126434 36 nips-2002-Automatic Alignment of Local Representations

Author: Yee W. Teh, Sam T. Roweis

Abstract: We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a highdimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4]. 1 Introduction: Local vs. Global Dimensionality Reduction Beyond density modelling, an important goal of unsupervised learning is to discover compact, informative representations of high-dimensional data. If the data lie on a smooth low dimensional manifold, then an excellent encoding is the coordinates internal to that manifold. The process of determining such coordinates is dimensionality reduction. Linear dimensionality reduction methods such as principal component analysis and factor analysis are easy to train but cannot capture the structure of curved manifolds. Mixtures of these simple unsupervised models [5, 6, 7, 8] have been used to perform local dimensionality reduction, and can provide good density models for curved manifolds, but unfortunately such mixtures cannot do dimensionality reduction. They do not describe a single, coherent low-dimensional coordinate system for the data since there is no pressure for the local coordinates of each component to agree. Roweis et al [1] recently proposed a model which performs global coordination of local coordinate systems in a mixture of factor analyzers (MFA). Their model is trained by maximizing the likelihood of the data, with an additional variational penalty term to encourage the internal coordinates of the factor analyzers to agree. While their model can trade off modelling the data and having consistent local coordinate systems, it requires a user given trade-off parameter, training is quite inefficient (although [2] describes an improved training algorithm for a more constrained model), and it has quite serious local minima problems (methods like LLE [4] or Isomap [3] have to be used for initialization). In this paper we describe a novel, automatic way to align the hidden representations used by each component of a mixture of dimensionality reducers into a single global representation of the data throughout space. Given an already trained mixture, the alignment is achieved by applying an eigensolver to a matrix constructed from the internal representations of the mixture components. Our method is efficient, simple to implement, and has no local optima in its optimization nor any learning rates or annealing schedules. 2 The Locally Linear Coordination Algorithm H 9¥ EI¡ CD66B9 ©9B 766 % G F 5 #

4 0.98604 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction

Author: Vin D. Silva, Joshua B. Tenenbaum

Abstract: Recently proposed algorithms for nonlinear dimensionality reduction fall broadly into two categories which have different advantages and disadvantages: global (Isomap [1]), and local (Locally Linear Embedding [2], Laplacian Eigenmaps [3]). We present two variants of Isomap which combine the advantages of the global approach with what have previously been exclusive advantages of local methods: computational sparsity and the ability to invert conformal maps.

5 0.96503633 67 nips-2002-Discriminative Binaural Sound Localization

Author: Ehud Ben-reuven, Yoram Singer

Abstract: Time difference of arrival (TDOA) is commonly used to estimate the azimuth of a source in a microphone array. The most common methods to estimate TDOA are based on finding extrema in generalized crosscorrelation waveforms. In this paper we apply microphone array techniques to a manikin head. By considering the entire cross-correlation waveform we achieve azimuth prediction accuracy that exceeds extrema locating methods. We do so by quantizing the azimuthal angle and treating the prediction problem as a multiclass categorization task. We demonstrate the merits of our approach by evaluating the various approaches on Sony’s AIBO robot.

6 0.95860809 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus

7 0.9090156 156 nips-2002-On the Complexity of Learning the Kernel Matrix

8 0.87754232 113 nips-2002-Information Diffusion Kernels

9 0.8679353 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

10 0.86176139 119 nips-2002-Kernel Dependency Estimation

11 0.8490665 190 nips-2002-Stochastic Neighbor Embedding

12 0.84621114 106 nips-2002-Hyperkernels

13 0.84292185 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques

14 0.84223372 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

15 0.84176713 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

16 0.83716047 120 nips-2002-Kernel Design Using Boosting

17 0.83656824 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers

18 0.83148164 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

19 0.81715202 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting

20 0.81620276 140 nips-2002-Margin Analysis of the LVQ Algorithm