nips nips2011 nips2011-207 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Optimal learning rates for least squares SVMs using Gaussian kernels M. [sent-1, score-0.431]
2 We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. [sent-6, score-0.478]
3 With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. [sent-8, score-0.303]
4 Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. [sent-9, score-0.466]
5 Given a data set D = ((x_1, y_1), . . . , (x_n, y_n)) of input/output observations drawn from an unknown distribution P on X × Y, where Y ⊂ ℝ, the goal of non-parametric least squares regression is to find a function f_D : X → ℝ [sent-16, score-0.24]
6 such that, for the least squares loss L : Y × ℝ → [0, ∞) defined by L(y, t) = (y − t)^2, the risk [sent-17, score-0.191]
7 R_{L,P}(f_D) := ∫_{X×Y} L(y, f_D(x)) dP(x, y) = ∫_{X×Y} (y − f_D(x))^2 dP(x, y) is small. [sent-18, score-0.048]
8 This means R_{L,P}(f_D) has to be close to the optimal risk R*_{L,P} := inf{R_{L,P}(f) | f : X → ℝ measurable}, [sent-19, score-0.089]
9 called the Bayes risk with respect to P and L. [sent-20, score-0.048]
10 In addition, the function f*_{L,P} : X → ℝ defined by f*_{L,P}(x) = E_P(Y |x), x ∈ X, is the only function for which the Bayes risk is attained. [sent-22, score-0.048]
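As a quick numerical illustration of the last statement, the sketch below estimates the least squares risk of the conditional mean and of a perturbed competitor on a toy distribution; the distribution and the competing function are invented for the example and are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution P on X x Y with X = (0, 1) and Y contained in [-1, 1]:
# Y = sin(2*pi*X)/2 + bounded noise, so the regression function is E_P(Y|x) = sin(2*pi*x)/2.
n = 200_000
x = rng.uniform(0.0, 1.0, size=n)
y = 0.5 * np.sin(2 * np.pi * x) + rng.uniform(-0.5, 0.5, size=n)

def risk(f):
    """Monte Carlo estimate of the least squares risk R_{L,P}(f)."""
    return np.mean((y - f(x)) ** 2)

bayes = lambda t: 0.5 * np.sin(2 * np.pi * t)        # the regression function f*_{L,P}
other = lambda t: 0.5 * np.sin(2 * np.pi * t) + 0.2  # some other decision function

print("risk of f*_{L,P}      :", risk(bayes))   # about Var(noise) = 1/12 ~ 0.083
print("risk of the competitor:", risk(other))   # about 1/12 + 0.2^2  ~ 0.123
```

As expected, the excess risk of the competitor is roughly its squared L_2(P_X) distance to the regression function.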
11 In this paper, we assume that X ⊂ ℝ^d is a non-empty, open and bounded set such that its boundary ∂X has Lebesgue measure 0, Y := [−M, M] for some M > 0, and P is a probability measure on X × Y such that P_X is the uniform distribution on X. [sent-24, score-0.138]
12 In Section 2 we also discuss that this condition can easily be generalized by assuming that P_X on X is absolutely continuous with respect to the Lebesgue measure on X such that the corresponding density of P_X is bounded away from 0 and ∞. [sent-25, score-0.189]
13 Recall that, because of the first assumption, it suffices to restrict considerations to decision functions f : X → ℝ. [sent-26, score-0.065]
14 The non-parametric least squares problem can be solved in many ways. [sent-30, score-0.163]
15 In this paper, we use SVMs to find a solution for the non-parametric least squares problem by solving the regularized problem f_{D,λ} = arg min_{f ∈ H} λ‖f‖_H^2 + R_{L,D}(f). (2) [sent-34, score-0.367]
16 Here, λ > 0 is a fixed real number, H is a reproducing kernel Hilbert space (RKHS) over X, and R_{L,D}(f) is the empirical risk of f, that is, R_{L,D}(f) = (1/n) ∑_{i=1}^{n} L(y_i, f(x_i)). [sent-35, score-0.088]
17 In this work we restrict our considerations to Gaussian RBF kernels k_γ on X, which are defined by [sent-36, score-0.132]
18 k_γ(x, x′) = exp(−‖x − x′‖_2^2 / γ^2), x, x′ ∈ X, for some width γ ∈ (0, 1]. [sent-37, score-0.094]
19 Our goal is to deduce asymptotically optimal learning rates for the SVMs (2) using the RKHS H_γ of k_γ. [sent-38, score-0.344]
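As a computational aside (not part of the paper's argument), the learning method (2) with the Gaussian kernel k_γ has a closed form for the least squares loss: by the representer theorem the minimizer is f_{D,λ,γ} = ∑_i α_i k_γ(x_i, ·) with α = (K + λnI)^{−1} y, where K is the kernel matrix on the inputs. The sketch below implements exactly this; the data-generating function is made up for illustration, and it is not the authors' code.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """k_gamma(x, x') = exp(-||x - x'||_2^2 / gamma^2) for all pairs of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / gamma ** 2)

def lssvm_fit(X, y, lam, gamma):
    """Solve (2) for the least squares loss: minimize lam*||f||_H^2 + (1/n) sum_i (y_i - f(x_i))^2.

    By the representer theorem the minimizer is f = sum_i alpha_i k_gamma(x_i, .),
    where alpha solves (K + lam * n * I) alpha = y.
    """
    n = len(y)
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def lssvm_predict(X_train, alpha, gamma, X_test):
    return gaussian_kernel(X_test, X_train, gamma) @ alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(300, 2))                  # P_X uniform on X = (0,1)^2
    f_star = lambda Z: np.sin(2 * np.pi * Z[:, 0]) * Z[:, 1]  # hypothetical regression function
    y = f_star(X) + rng.uniform(-0.3, 0.3, size=len(X))       # bounded noise, so Y is bounded

    alpha = lssvm_fit(X, y, lam=1e-3, gamma=0.3)
    X_test = rng.uniform(0.0, 1.0, size=(1000, 2))
    err = np.mean((lssvm_predict(X, alpha, 0.3, X_test) - f_star(X_test)) ** 2)
    print("squared L2(P_X) error against f*:", err)
```

The theoretical analysis below is independent of this implementation.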
20 To this end, we first establish a general oracle inequality. [sent-39, score-0.087]
21 Based on this oracle inequality, we then derive learning rates if the regression function is contained in some Besov space. [sent-40, score-0.339]
22 It will turn out that these learning rates are asymptotically optimal. [sent-41, score-0.274]
23 Finally, we show that these rates can be achieved by a simple data-dependent parameter selection method based on a hold-out set. [sent-42, score-0.17]
24 The rest of this paper is organized as follows: The next section presents the main theorems and as a consequence of these theorems some corollaries inducing asymptotically optimal learning rates for regression functions contained in Sobolev or Besov spaces. [sent-43, score-0.541]
25 2 Results In this section we present our main results including the optimal rates for LS-SVMs using Gaussian kernels. [sent-47, score-0.211]
26 To this end, we first need to introduce some function spaces, which are later assumed to contain the regression function. [sent-48, score-0.044]
27 For r ∈ ℕ, the r-th modulus of smoothness of f is defined by [sent-57, score-0.178]
28 ω_{r,L_p(ν)}(f, t) = sup_{‖h‖_2 ≤ t} ‖Δ_h^r(f, ·)‖_{L_p(ν)}, t ≥ 0, where ‖·‖_2 denotes the Euclidean norm and the r-th difference Δ_h^r(f, ·) is defined by Δ_h^r(f, x) = ∑_{j=0}^{r} \binom{r}{j} (−1)^{r−j} f(x + jh) if x ∈ Ω_{r,h} and Δ_h^r(f, x) = 0 if x ∉ Ω_{r,h}, for h = (h_1, . . . , h_d) ∈ ℝ^d. [sent-58, score-0.042]
29 It is well known that the modulus of smoothness with respect to L_p(ν) is a nondecreasing function of t and that, for the Lebesgue measure on Ω, it satisfies a scaling inequality of the form ω_{r,L_p(µ)}(f, t) ≤ c (t/s)^r ω_{r,L_p(µ)}(f, s) for t ≥ s > 0. [sent-62, score-0.209]
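To make the definition above concrete, the r-th modulus of smoothness can be estimated by brute force: sample increments h with ‖h‖_2 ≤ t, form the r-th differences Δ_h^r(f, ·) at points that remain in Ω, and take the largest empirical L_2 norm over the sampled h. The sketch below does this on Ω = (0, 1)^d for an arbitrarily chosen test function; it is only a numerical illustration of the quantity, not part of the paper's analysis.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def rth_difference(f, x, h, r):
    """Delta_h^r(f, x) = sum_{j=0}^{r} binom(r, j) * (-1)^(r - j) * f(x + j*h)."""
    return sum(comb(r, j) * (-1) ** (r - j) * f(x + j * h) for j in range(r + 1))

def modulus_of_smoothness(f, r, t, d, n_h=100, n_x=10_000):
    """Crude Monte Carlo estimate of omega_{r, L_2(mu)}(f, t) on Omega = (0, 1)^d."""
    best = 0.0
    for _ in range(n_h):
        h = rng.normal(size=d)
        h *= t * rng.uniform() ** (1.0 / d) / np.linalg.norm(h)  # random h with ||h||_2 <= t
        x = rng.uniform(0.0, 1.0, size=(n_x, d))
        # x is in Omega_{r,h} iff x + r*h stays in Omega (Omega is convex here)
        inside = np.all((x + r * h > 0.0) & (x + r * h < 1.0), axis=1)
        diff = rth_difference(f, x[inside], h, r)
        # empirical L_2 norm over Omega; the difference is zero outside Omega_{r,h} by definition
        best = max(best, float(np.sqrt(np.sum(diff ** 2) / n_x)))
    return best

f = lambda z: np.abs(z[:, 0] - 0.5)   # Lipschitz but not C^1: a single kink
for t in (0.2, 0.1, 0.05):
    print(t, modulus_of_smoothness(f, r=2, t=t, d=2))   # decays roughly like t^(3/2) here
```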
30 Moreover, the modulus of smoothness can be used to define the scale of Besov spaces. [sent-68, score-0.178]
31 For α > 0, p, q ∈ [1, ∞] and r := ⌊α⌋ + 1, the Besov seminorm is given by |f|_{B^α_{p,q}(ν)} := ( ∫_0^∞ ( t^{−α} ω_{r,L_p(ν)}(f, t) )^q dt/t )^{1/q} for q < ∞, and |f|_{B^α_{p,∞}(ν)} := sup_{t>0} t^{−α} ω_{r,L_p(ν)}(f, t). [sent-70, score-0.042]
32 In both cases the norm of B^α_{p,q}(ν) can be defined by ‖f‖_{B^α_{p,q}(ν)} := ‖f‖_{L_p(ν)} + |f|_{B^α_{p,q}(ν)}, see e.g. [sent-72, score-0.328]
33 44], that the Sobolev spaces W^α_p(ℝ^d) fall into the scale of Besov spaces, namely W^α_p(ℝ^d) ⊂ B^α_{p,q}(ℝ^d) (4) for α ∈ ℕ, p ∈ (1, ∞), and max{p, 2} ≤ q ≤ ∞, and especially W^α_2(ℝ^d) = B^α_{2,2}(ℝ^d). [sent-82, score-0.067]
34 In addition, we need every f : X → ℝ to admit an extension f̂ : ℝ^d → ℝ such that the smoothness properties of f described by some Sobolev or Besov space are preserved by f̂. [sent-85, score-0.076]
35 To be more precise, in this case there exists a linear operator E mapping functions f : Ω → ℝ [sent-87, score-0.102]
36 to functions Ef : ℝ^d → ℝ with the properties: (a) E(f)|_Ω = f, that is, E is an extension operator. [sent-89, score-0.074]
37 (b) E continuously maps W^m_p(Ω) into W^m_p(ℝ^d) for all p ∈ [1, ∞] and all integers m ≥ 0; that is, there exist constants a_{m,p} ≥ 0 such that, for every f ∈ W^m_p(Ω), we have ‖Ef‖_{W^m_p(ℝ^d)} ≤ a_{m,p} ‖f‖_{W^m_p(Ω)}. [sent-90, score-0.245]
38 The analogous property (c) holds for Besov spaces; that is, there exist constants a_{α,p,q} ≥ 0 such that, for every f ∈ B^α_{p,q}(Ω), we have ‖Ef‖_{B^α_{p,q}(ℝ^d)} ≤ a_{α,p,q} ‖f‖_{B^α_{p,q}(Ω)}. [sent-93, score-0.206]
39 Property (c) follows by an interpolation argument, since B^α_{p,q} can be interpreted as an interpolation space of the Sobolev spaces W^{m_0}_p and W^{m_1}_p for q ∈ [1, ∞], p ∈ (1, ∞), θ ∈ (0, 1) and m_0, m_1 ∈ ℕ_0 with m_0 ≠ m_1 and α = m_0(1 − θ) + m_1 θ, see [10, pp. [sent-97, score-0.175]
40 In the following, we always assume that we do have such an extension operator E. [sent-99, score-0.145]
41 Moreover, if µ is the Lebesgue measure on Ω such that ∂Ω has Lebesgue measure 0, the canonical extension of µ to ℝ^d is given by µ̃(A) := µ(A ∩ Ω) for all measurable A ⊂ ℝ^d. [sent-100, score-0.136]
42 Analogously, we proceed for the uniform distribution on Ω and its canonical extension to ℝ^d, and the same convention will be applied to measures P_X on Ω that are absolutely continuous w.r.t. the Lebesgue measure. [sent-102, score-0.176]
43 Let X ⊂ B_{ℓ_2^d} be a domain such that we have an extension operator E in the above sense. [sent-108, score-0.178]
44 Assume that we have fixed a version f*_{L,P} of the regression function such that f*_{L,P}(x) = E_P(Y |x) ∈ [−M, M] for all x ∈ X. [sent-110, score-0.044]
45 Then the SVM (2) satisfies, with probability P^n not less than 1 − e^{−τ}, an oracle inequality of the form λ‖f_{D,λ,γ}‖_{H_γ}^2 + R_{L,P}(f_{D,λ,γ}) − R*_{L,P} ≤ K λ γ^{−d} + K c^2 γ^{2α} + K γ^{−(1−p)(1+ε)d} λ^{−p} n^{−1} + K τ n^{−1}. [sent-113, score-0.157]
46 With this oracle inequality we can derive learning rates for the learning method (2). [sent-114, score-0.303]
47 Here, c1 > 0 and c2 > 0 are user-specified constants and C > 0 is a constant independent of n. [sent-117, score-0.042]
48 Note that for every ρ > 0 we can find ε, p ∈ (0, 1) sufficiently close to 0 such that the learning rate in Corollary 1 is at least as fast as n^{−2α/(2α+d)+ρ}. [sent-118, score-0.08]
49 To this end, we split D into D_1 := ((x_1, y_1), . . . , (x_m, y_m)) and D_2 := ((x_{m+1}, y_{m+1}), . . . , (x_n, y_n)), where m := ⌊n/2⌋ + 1 and n ≥ 4. [sent-138, score-0.064]
50 We will use D_1 as a training set by computing the SVM decision functions f_{D_1,λ,γ} := arg min_{f ∈ H_γ} λ‖f‖_{H_γ}^2 + R_{L,D_1}(f), (λ, γ) ∈ Λ_n × Γ_n, and use D_2 to determine (λ, γ) by choosing a (λ_{D_2}, γ_{D_2}) ∈ Λ_n × Γ_n such that R_{L,D_2}(f_{D_1,λ_{D_2},γ_{D_2}}) = min_{(λ,γ) ∈ Λ_n × Γ_n} R_{L,D_2}(f_{D_1,λ,γ}). [sent-139, score-0.164]
51 Then, for all ρ > 0, the TV-SVM producing the decision functions f_{D_1,λ_{D_2},γ_{D_2}} learns with the rate n^{−2α/(2α+d)+ρ} (7) with probability P^n not less than 1 − e^{−τ}. [sent-143, score-0.069]
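The training/validation procedure just described can be sketched directly: split D into D_1 and D_2 with m = ⌊n/2⌋ + 1, train the least squares SVM on D_1 for every (λ, γ) in finite candidate grids, and keep the pair with the smallest empirical risk on D_2. The grids and data below are placeholder choices for illustration; the paper only requires suitable finite sets Λ_n and Γ_n.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / gamma ** 2)

def lssvm_fit(X, y, lam, gamma):
    # closed-form minimizer of lam*||f||_H^2 + (1/n) sum_i (y_i - f(x_i))^2
    n = len(y)
    return np.linalg.solve(gaussian_kernel(X, X, gamma) + lam * n * np.eye(n), y)

def tv_svm(X, y, lambdas, gammas):
    """Train on D1, select (lambda, gamma) by the empirical least squares risk on D2."""
    m = len(y) // 2 + 1                                   # m = floor(n/2) + 1
    X1, y1, X2, y2 = X[:m], y[:m], X[m:], y[m:]           # D1 and D2
    best = None
    for lam in lambdas:
        for gamma in gammas:
            alpha = lssvm_fit(X1, y1, lam, gamma)
            pred = gaussian_kernel(X2, X1, gamma) @ alpha
            val_risk = np.mean((y2 - pred) ** 2)          # R_{L,D2}(f_{D1,lambda,gamma})
            if best is None or val_risk < best[0]:
                best = (val_risk, lam, gamma, alpha)
    return best  # (hold-out risk, selected lambda, selected gamma, coefficients)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(0.0, 1.0, size=(400, 2))
    y = np.sin(2 * np.pi * X[:, 0]) + rng.uniform(-0.3, 0.3, size=400)
    risk, lam, gamma, _ = tv_svm(X, y, np.logspace(-6, -1, 6), np.logspace(-1.5, 0, 4))
    print("selected (lambda, gamma):", (lam, gamma), "hold-out risk:", risk)
```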
52 What is left to do is to relate Assumption (6) with the function spaces introduced earlier, such that we can show that the learning rates deduced earlier are asymptotically optimal under some circumstances. [sent-144, score-0.382]
53 Let X ⊂ B_{ℓ_2^d} be a domain such that we have an extension operator E of the form described in front of Theorem 1. [sent-146, score-0.246]
54 If, for some α ∈ ℕ, we have f*_{L,P} ∈ W^α_2(P_X), then, for all ρ > 0, both the SVM considered in Corollary 1 and the TV-SVM considered in Theorem 2 learn with the rate n^{−2α/(2α+d)+ρ} with probability P^n not less than 1 − e^{−τ}. [sent-148, score-0.203]
55 Moreover, if α > d/2, then this rate is asymptotically optimal in a minimax sense. Similar to Corollary 2, we can verify assumption (6) and obtain asymptotically optimal learning rates if the regression function is contained in a Besov space. [sent-150, score-0.567]
56 Let X ⊂ B_{ℓ_2^d} be a domain such that we have an extension operator E of the form described in front of Theorem 1. [sent-152, score-0.246]
57 151]) and since B^α_{2,∞}(P_X) = B^α_{2,∞}(X) is continuously embedded into the space ℓ_∞(X) of all bounded functions on X, we obtain by [11, Theorem 2. [sent-158, score-0.108]
58 2] that n^{−2α/(2α+d)} is the optimal learning rate in a minimax sense for α > d (cf. [sent-159, score-0.079]
59 Therefore, for α > d, the learning rates obtained in Corollary 3 are asymptotically optimal. [sent-161, score-0.274]
60 This can be generalized by assuming that P_X is absolutely continuous w.r.t. [sent-163, score-0.064]
61 the Lebesgue measure µ such that the corresponding density is bounded away from zero and from infinity. [sent-166, score-0.125]
62 Moreover, to derive learning rates, we actually only need that the Lebesgue density of PX is upper bounded. [sent-168, score-0.058]
63 The assumption that the density is bounded away from zero is only needed to derive the lower bounds in Corollaries 2 and 3. [sent-169, score-0.122]
64 Instead, the density only needs to be bounded from above by some constant such that the estimates on the entropy numbers for Gaussian kernels used in the proofs can be applied. [sent-172, score-0.168]
65 Several investigations of learning rates for SVMs using the least squares loss have already been made, see e.g. [sent-174, score-0.333]
66 In particular, optimal rates have been established in [16] if f*_P ∈ H and the eigenvalue behavior of the integral operator associated to H is known. [sent-177, score-0.282]
67 Moreover, if f*_P ∉ H, then [17] and [12] both establish learning rates of the form n^{−β/(β+p)}, where β is a parameter describing the approximation properties of H and p is a parameter describing the eigenvalue decay. [sent-178, score-0.2]
68 Furthermore, in the introduction of [17] it is mentioned that the assumptions on the eigenvalues and eigenfunctions also hold for Gaussian kernels with fixed width, but this case as well as the more interesting case of Gaussian kernels with variable widths are not further investigated. [sent-179, score-0.299]
69 In the first case, where Gaussian kernels with fixed width are considered, the approximation error behaves very badly as shown in [18] and fast rates cannot be expected as we discuss below. [sent-180, score-0.392]
70 In the second case, where variable widths are considered as in our paper, it is crucial to carefully control the influence of γ on all arising constants, which unfortunately has not been worked out in [17] either. [sent-181, score-0.117]
71 On the other hand, [12] shows that the rate n^{−β/(β+p)} is often asymptotically optimal in a minimax sense. [sent-183, score-0.307]
72 In particular, the latter is the case for H = W^m_2(X), f ∈ W^s_2(X), and s ∈ (d/2, m], that is, when using a Sobolev space as the underlying RKHS H, then all target functions contained in a Sobolev space of lower smoothness s > d/2 can be learned with the asymptotically optimal rate n^{−2s/(2s+d)}. [sent-184, score-0.328]
73 Here we note that the condition s > d/2 ensures by Sobolev's embedding theorem that W^s_2(X) consists of bounded functions, and hence Y = [−M, M] does not impose an additional assumption on f*_{L,P}. [sent-185, score-0.174]
74 If s ∈ (0, d/2], then the results of [12] still yield the above mentioned rates, but we no longer know whether they are optimal in a minimax sense, since Y = [−M, M] does impose an additional assumption. [sent-186, score-0.193]
75 In addition, note that for Sobolev spaces this result, modulo an extra log factor, has already been proved by [1]. [sent-187, score-0.149]
76 This result suggests that by using a C^∞-kernel such as the Gaussian RBF kernel, one could actually learn the entire scale of Sobolev spaces with the above mentioned rates. [sent-188, score-0.096]
77 Indeed, [18] shows that for many analytic kernels the approximation error can only have polynomial decay if f*_{L,P} is analytic, too. [sent-190, score-0.186]
78 In particular, for Gaussian kernels with fixed width and f*_{L,P} ∉ C^∞, the approximation error does not decay polynomially fast, see [18, Proposition 1. [sent-191, score-0.288]
79 Since it seems rather unlikely that these poor approximation properties can be balanced by superior bounds on the estimation error, the above-mentioned results indicate that Gaussian kernels with fixed width may have a poor performance. [sent-194, score-0.222]
80 Although these authors actually consider binary classification using convex loss functions including the least squares loss, it is relatively straightforward to translate their findings to our least squares regression scenario. [sent-198, score-0.458]
81 The result is the learning rate n^{−m/(m+2d+2)}, again under the assumption f*_{L,P} ∈ W^m_2(X) for some m > 0. [sent-199, score-0.066]
82 Namely, the authors show the rate n^{−s/(8s+4t)}, modulo a logarithmic factor, where s ∈ (0, 1] and f*_{L,P} ∈ Lip(s). [sent-202, score-0.12]
83 Another direction of research that can be applied to Gaussian kernels with varying widths is multi-kernel regularization schemes; see [22, 23, 24] for some results in this direction. [sent-203, score-0.173]
84 For example, [22] establishes learning rates of the form n^{−(2m−d)/(4(4m−d))+ρ} whenever f*_{L,P} ∈ W^m_2(X) for some m ∈ (d/2, d/2 + 2), where again ρ > 0 can be chosen to be arbitrarily close to 0. [sent-204, score-0.17]
85 Clearly, all these results provide rates that are far from being optimal, so that it seems fair to say that our results represent a significant advance. [sent-205, score-0.17]
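To make the comparison concrete, the following short computation evaluates the rate exponents discussed in this section for one choice of smoothness and dimension; the competing exponents are taken as reconstructed above, so the exact values should be read with that caveat. A larger exponent means a faster rate.

```python
# Rate exponents rho such that the rate is n^(-rho); illustrative values only.
m, d = 3, 5   # one admissible choice: m must lie in (d/2, d/2 + 2) for the last rate

this_paper    = 2 * m / (2 * m + d)               # optimal exponent 2m/(2m+d) (alpha = m)
classif_based = m / (m + 2 * d + 2)               # exponent translated from the classification result
multi_kernel  = (2 * m - d) / (4 * (4 * m - d))   # exponent from [22], as reconstructed above

print(f"n^-{this_paper:.3f}  vs  n^-{classif_based:.3f}  vs  n^-{multi_kernel:.3f}")
# prints roughly n^-0.545 vs n^-0.200 vs n^-0.036, so the competing rates are far from optimal
```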
86 3 Proof of the main result To prove Theorem 1 we deduce an oracle inequality for the least squares loss by specializing [2, Theorem 7. [sent-207, score-0.353]
87 Let X ⊂ ℝ^d be a domain such that we have an extension operator E of the form described in front of Theorem 1, let P_X be the uniform distribution on X, and let f ∈ L_∞(X). [sent-212, score-0.284]
88 Let g ∈ L_2(ℝ^d), let H_γ be the RKHS of the Gaussian RBF kernel k_γ over X ⊂ ℝ^d, and [sent-221, score-0.04]
89 let K(x) := ∑_{j=1}^{r} \binom{r}{j} (−1)^{1−j} (1/j^d) (2/(γ^2 π))^{d/2} exp(−2‖x‖_2^2/(j^2 γ^2)) for x ∈ ℝ^d and a fixed r ∈ ℕ. [sent-222, score-0.062]
90 Let g ∈ L_∞(ℝ^d), let H_γ be the RKHS of the Gaussian RBF kernel k_γ over X ⊂ ℝ^d, and let K be as in Lemma 2. [sent-225, score-0.04]
91 Then we have f̃ ∈ L_∞(ℝ^d) and |K ∗ f̃(x)| ≤ a_{0,∞} (2^r − 1) M for all x ∈ X, which implies L(y, K ∗ f̃(x)) ≤ 4^r a_{0,∞}^2 M^2 for the least squares loss L and all (x, y) ∈ X × Y. [sent-228, score-0.191]
92 Let L : Y × ℝ → [0, ∞) be the least squares loss, k_γ be the Gaussian RBF kernel over X with width γ ∈ (0, 1], and H_γ be the associated RKHS. [sent-234, score-0.297]
93 With the previous results we are finally able to prove the oracle inequality stated in Theorem 1. [sent-237, score-0.133]
94 Define K(x) := ∑_{j=1}^{r} \binom{r}{j} (−1)^{1−j} (1/j^d) (2/(γ^2 π))^{d/2} exp(−2‖x‖_2^2/(j^2 γ^2)) as in Lemma 2, and f̃(x) := (γ√π/2)^{d/2} E f*_{L,P}(x) for all x ∈ ℝ^d. [sent-240, score-0.062]
95 The choice f*_{L,P}(x) ∈ [−M, M] for all x ∈ X implies f*_{L,P} ∈ L_2(X), and the latter, together with X ⊂ B_{ℓ_2^d} and (5), yields ‖f̃‖_{L_2(ℝ^d)} = (γ√π/2)^{d/2} ‖E f*_{L,P}‖_{L_2(ℝ^d)} ≤ a_{0,2} ‖f*_{L,P}‖_{L_2(X)} ≤ a_{0,2} M. (10) [sent-241, score-0.164]
96 Because of this and Lemma 2, f_0 := K ∗ f̃ ∈ H_γ is satisfied, and with Lemma 3 we have ‖L ∘ f_0‖_∞ = sup_{(x,y)∈X×Y} |L(y, f_0(x))| = sup_{(x,y)∈X×Y} L(y, K ∗ f̃(x)) ≤ 4^r a_{0,∞}^2 M^2 =: B_0. [sent-244, score-0.084]
97 Furthermore, (1) and Lemma 1 yield R_{L,P}(f_0) − R*_{L,P} ≤ C_{r,2} ω_{r,L_2(ℝ^d)}(E f*_{L,P}, γ/2)^2 ≤ C_{r,2} c^2 γ^{2α}, where we used assumption (6) for γ ∈ (0, 1]. [sent-246, score-0.056]
98 By Lemma 2 and (10) we know ‖f_0‖_{H_γ} = ‖K ∗ f̃‖_{H_γ} ≤ (2^r − 1) (2/(γ^2 π))^{d/4} ‖f̃‖_{L_2(ℝ^d)} ≤ (2^r − 1) (2/(γ^2 π))^{d/4} a_{0,2} M. [sent-248, score-0.164]
99 Therefore, Theorem 3 and the above choice of f_0 yield, for all fixed τ ≥ 1 and p ∈ (0, 1), that the SVM using H_γ and L satisfies λ‖f_{D,λ,γ}‖_{H_γ}^2 + R_{L,P}(f_{D,λ,γ}) − R*_{L,P} ≤ 9 λ (2^r − 1)^2 (2/(γ^2 π))^{d/2} a_{0,2}^2 M^2 + 9 C_{r,2} c^2 γ^{2α} + C_1 γ^{−(1−p)(1+ε)d} λ^{−p} n^{−1} + . . . with probability P^n not less than 1 − e^{−τ}, which gives the oracle inequality of Theorem 1. [sent-249, score-0.157]
100 Fast rates for support vector machines using Gaussian kernels. [sent-378, score-0.17]
wordName wordTfidf (topN-words)
[('px', 0.404), ('rd', 0.312), ('besov', 0.248), ('kh', 0.245), ('sobolev', 0.223), ('fd', 0.213), ('rates', 0.17), ('kf', 0.164), ('pn', 0.157), ('wp', 0.149), ('lebesgue', 0.144), ('kfd', 0.124), ('minmax', 0.124), ('squares', 0.121), ('rbf', 0.108), ('theorem', 0.108), ('corollary', 0.106), ('rkhs', 0.105), ('asymptotically', 0.104), ('modulus', 0.102), ('steinwart', 0.1), ('kernels', 0.098), ('width', 0.094), ('lip', 0.093), ('lp', 0.092), ('oracle', 0.087), ('modulo', 0.082), ('smoothness', 0.076), ('widths', 0.075), ('extension', 0.074), ('operator', 0.071), ('dp', 0.07), ('front', 0.068), ('spaces', 0.067), ('absolutely', 0.064), ('svms', 0.064), ('caponnetto', 0.062), ('jd', 0.062), ('kef', 0.062), ('klp', 0.062), ('kwp', 0.062), ('stuttgart', 0.062), ('lemma', 0.058), ('smale', 0.055), ('interpolation', 0.054), ('gaussian', 0.054), ('svm', 0.051), ('furthermore', 0.05), ('risk', 0.048), ('devore', 0.047), ('inequality', 0.046), ('corollaries', 0.045), ('regression', 0.044), ('ying', 0.043), ('constants', 0.042), ('least', 0.042), ('ep', 0.042), ('sup', 0.042), ('ams', 0.041), ('fp', 0.041), ('kk', 0.041), ('optimal', 0.041), ('regularized', 0.04), ('kernel', 0.04), ('continuously', 0.039), ('cr', 0.038), ('lq', 0.038), ('contained', 0.038), ('uniform', 0.038), ('bounded', 0.038), ('rate', 0.038), ('polynomially', 0.037), ('gaussians', 0.036), ('considerations', 0.034), ('theorems', 0.034), ('domain', 0.033), ('ym', 0.033), ('yn', 0.033), ('entropy', 0.032), ('measure', 0.031), ('functions', 0.031), ('approximation', 0.03), ('berlin', 0.03), ('density', 0.029), ('analytic', 0.029), ('decay', 0.029), ('volume', 0.029), ('actually', 0.029), ('xm', 0.029), ('deduce', 0.029), ('assumption', 0.028), ('know', 0.028), ('ln', 0.028), ('loss', 0.028), ('away', 0.027), ('chapters', 0.027), ('krzy', 0.027), ('jh', 0.027), ('isometrically', 0.027), ('moduli', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1
2 0.32933626 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
3 0.13351054 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.
4 0.12433445 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
Author: Samory Kpotufe
Abstract: Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. 1
5 0.11838315 301 nips-2011-Variational Gaussian Process Dynamical Systems
Author: Neil D. Lawrence, Michalis K. Titsias, Andreas Damianou
Abstract: High dimensional time series are endemic in applications of machine learning such as robotics (sensor data), computational biology (gene expression data), vision (video sequences) and graphics (motion capture data). Practical nonlinear probabilistic approaches to this data are required. In this paper we introduce the variational Gaussian process dynamical system. Our work builds on recent variational approximations for Gaussian process latent variable models to allow for nonlinear dimensionality reduction simultaneously with learning a dynamical prior in the latent space. The approach also allows for the appropriate dimensionality of the latent space to be automatically determined. We demonstrate the model on a human motion capture data set and a series of high resolution video sequences. 1
6 0.10918193 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
7 0.096873373 59 nips-2011-Composite Multiclass Losses
8 0.088924855 198 nips-2011-On U-processes and clustering performance
9 0.083028734 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
10 0.08145529 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
11 0.080561191 162 nips-2011-Lower Bounds for Passive and Active Learning
12 0.078528523 256 nips-2011-Solving Decision Problems with Limited Information
13 0.067588083 22 nips-2011-Active Ranking using Pairwise Comparisons
14 0.063954614 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
15 0.063678049 139 nips-2011-Kernel Bayes' Rule
16 0.063347496 29 nips-2011-Algorithms and hardness results for parallel large margin learning
17 0.061080117 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
18 0.058332257 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
19 0.057439581 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
20 0.05682582 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
topicId topicWeight
[(0, 0.176), (1, -0.064), (2, -0.039), (3, -0.13), (4, -0.042), (5, 0.04), (6, 0.013), (7, -0.107), (8, -0.033), (9, 0.048), (10, -0.149), (11, 0.089), (12, 0.132), (13, 0.036), (14, 0.005), (15, -0.025), (16, 0.08), (17, 0.032), (18, -0.112), (19, 0.235), (20, -0.039), (21, -0.146), (22, 0.03), (23, 0.061), (24, -0.087), (25, 0.022), (26, -0.006), (27, -0.068), (28, -0.227), (29, -0.015), (30, 0.127), (31, 0.14), (32, 0.002), (33, -0.073), (34, 0.067), (35, 0.081), (36, 0.226), (37, -0.026), (38, -0.12), (39, -0.088), (40, 0.091), (41, -0.022), (42, -0.016), (43, 0.049), (44, -0.059), (45, -0.093), (46, -0.055), (47, -0.041), (48, -0.038), (49, 0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.94794381 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1
2 0.90916777 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
3 0.72456509 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
Author: Kenji Fukumizu, Gert R. Lanckriet, Bharath K. Sriperumbudur
Abstract: The goal of this paper is to investigate the advantages and disadvantages of learning in Banach spaces over Hilbert spaces. While many works have been carried out in generalizing Hilbert methods to Banach spaces, in this paper, we consider the simple problem of learning a Parzen window classifier in a reproducing kernel Banach space (RKBS)—which is closely related to the notion of embedding probability measures into an RKBS—in order to carefully understand its pros and cons over the Hilbert space classifier. We show that while this generalization yields richer distance measures on probabilities compared to its Hilbert space counterpart, it however suffers from serious computational drawback limiting its practical applicability, which therefore demonstrates the need for developing efficient learning algorithms in Banach spaces.
4 0.58362663 59 nips-2011-Composite Multiclass Losses
Author: Elodie Vernet, Mark D. Reid, Robert C. Williamson
Abstract: We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a “proper composite loss”, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We determine the stationarity condition, Bregman representation, order-sensitivity, existence and uniqueness of the composite representation for multiclass losses. We subsume existing results on “classification calibration” by relating it to properness and show that the simple integral representation for binary proper losses can not be extended to multiclass losses. 1
5 0.55437243 256 nips-2011-Solving Decision Problems with Limited Information
Author: Denis D. Maua, Cassio Campos
Abstract: We present a new algorithm for exactly solving decision-making problems represented as an influence diagram. We do not require the usual assumptions of no forgetting and regularity, which allows us to solve problems with limited information. The algorithm, which implements a sophisticated variable elimination procedure, is empirically shown to outperform a state-of-the-art algorithm in randomly generated problems of up to 150 variables and 1064 strategies. 1
6 0.51168877 139 nips-2011-Kernel Bayes' Rule
7 0.50153309 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
8 0.40327018 162 nips-2011-Lower Bounds for Passive and Active Learning
9 0.36867711 198 nips-2011-On U-processes and clustering performance
10 0.36799294 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
11 0.34805134 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
12 0.34385005 202 nips-2011-On the Universality of Online Mirror Descent
13 0.33967125 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
14 0.33043242 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
15 0.32840493 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
16 0.3168799 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
17 0.31320983 69 nips-2011-Differentially Private M-Estimators
18 0.30608305 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
19 0.29758096 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
20 0.2961525 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
topicId topicWeight
[(0, 0.544), (4, 0.025), (20, 0.027), (26, 0.031), (31, 0.053), (43, 0.085), (45, 0.07), (57, 0.011), (74, 0.026), (83, 0.022), (99, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.91943395 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
Author: Mona Eberts, Ingo Steinwart
Abstract: We prove a new oracle inequality for support vector machines with Gaussian RBF kernels solving the regularized least squares regression problem. To this end, we apply the modulus of smoothness. With the help of the new oracle inequality we then derive learning rates that can also be achieved by a simple data-dependent parameter selection method. Finally, it turns out that our learning rates are asymptotically optimal for regression functions satisfying certain standard smoothness conditions. 1
2 0.83473754 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
Author: Yong Zhang, Zhaosong Lu
Abstract: In this paper we consider general rank minimization problems with rank appearing in either objective function or constraint. We first show that a class of matrix optimization problems can be solved as lower dimensional vector optimization problems. As a consequence, we establish that a class of rank minimization problems have closed form solutions. Using this result, we then propose penalty decomposition methods for general rank minimization problems. The convergence results of the PD methods have been shown in the longer version of the paper [19]. Finally, we test the performance of our methods by applying them to matrix completion and nearest low-rank correlation matrix problems. The computational results demonstrate that our methods generally outperform the existing methods in terms of solution quality and/or speed. 1
3 0.79029363 250 nips-2011-Shallow vs. Deep Sum-Product Networks
Author: Olivier Delalleau, Yoshua Bengio
Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1
It has been suggested that deep architectures are more powerful in the sense of being able to more efficiently represent highly-varying functions [4, 3]. In this paper, we measure “efficiency” in terms of the number of computational units in the network. An efficient representation is important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a fixed amount of training samples and computational power, better generalization is expected. The first successful algorithms for training deep architectures appeared in 2006, with efficient training procedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting the general idea of greedy layer-wise pre-training [6]. Since then, these ideas have been investigated further and applied in many settings, demonstrating state-of-the-art learning performance in object recognition [16, 28, 18, 15] and segmentation [20], audio classification [19, 10], natural language processing [9, 36, 21, 32], collaborative filtering [30], modeling textures [24], modeling motion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22]. Poon and Domingos [25] introduced deep sum-product networks as a method to compute partition functions of tractable graphical models. These networks are analogous to traditional artificial neural networks but with nodes that compute either products or weighted sums of their inputs. Analogously to neural networks, we define “hidden” nodes as those nodes that are neither input nodes nor output nodes. If the nodes are organized in layers, we define the “hidden” layers to be those that are neither the input layer nor the output layer. Poon and Domingos [25] report experiments with networks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief Networks [14, 3], where the number of hidden layers is usually on the order of three to five. Whether such deep architectures have theoretical advantages compared to so-called “shallow” architectures (i.e. those with a single hidden layer) remains an open question. After all, in the case of a sum-product network, the output value can always be written as a sum of products of input variables (possibly raised to some power by allowing multiple connections from the same input), and consequently it is easily rewritten as a shallow network with a sum output unit and product hidden units. The argument supported by our theoretical analysis is that a deep architecture is able to compute some functions much more efficiently than a shallow one. Until recently, very few theoretical results supported the idea that deep architectures could present an advantage in terms of representing some functions more efficiently. Most related results originate from the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the proof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of gates [1, 38], and more generally that there exist functions computable with a polynomial-size depthk circuit that would require exponential size when restricted to depth k − 1 [11]. Another recent result on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that bounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions from the uniform distribution (i.e. they are “fooled” by such input distributions). 
In particular, Braverman’s result suggests that shallow circuits can in general be fooled more easily than deep ones, i.e., that they would have more difficulty efficiently representing high-order dependencies (those involving many input variables). It is not obvious that circuit complexity results (that typically consider only boolean or at least discrete nodes) are directly applicable in the context of typical machine learning algorithms such as neural networks (that compute continuous representations of their input). Orponen [23] surveys theoretical results in computational complexity that are relevant to learning algorithms. For instance, H˚ stad and Goldmann [12] extended some results to the case of networks of linear threshold units a with positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity issues in networks of Gaussian radial basis functions and decision trees, showing intrinsic limitations of these architectures e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally discuss the advantages of depth in boolean circuit in the context of learning architectures. Bengio [3] suggests that some polynomials could be represented more efficiently by deep sumproduct networks, but without providing any formal statement or proofs. This work partly addresses this void by demonstrating families of circuits for which a deep architecture can be exponentially more efficient than a shallow one in the context of real-valued polynomials. Note that we do not address in this paper the problem of learning these parameters: even if an efficient deep representation exists for the function we seek to approximate, in general there is no 2 guarantee for standard optimization algorithms to easily converge to this representation. This paper focuses on the representational power of deep sum-product circuits compared to shallow ones, and studies it by considering particular families of target functions (to be represented by the learner). We first formally define sum-product networks. We consider two families of functions represented by deep sum-product networks (families F and G). For each family, we establish a lower bound on the minimal number of hidden units a depth-2 sum-product network would require to represent a function of this family, showing it is much less efficient than the deep representation. 2 Sum-product networks Definition 1. A sum-product network is a network composed of units that either compute the product of their inputs or a weighted sum of their inputs (where weights are strictly positive). Here, we restrict our definition of the generic term “sum-product network” to networks whose summation units have positive incoming weights1 , while others are called “negative-weight” networks. Definition 2. A “negative-weight“ sum-product network may contain summation units whose weights are non-positive (i.e. less than or equal to zero). Finally, we formally define what we mean by deep vs. shallow networks in the rest of the paper. Definition 3. A “shallow“ sum-product network contains a single hidden layer (i.e. a total of three layers when counting the input and output layers, and a depth equal to two). Definition 4. A “deep“ sum-product network contains more than one hidden layer (i.e. a total of at least four layers, and a depth at least three). 
The family F 3 3.1 Definition The first family of functions we study, denoted by F, is made of functions built from deep sumproduct networks that alternate layers of product and sum units with two inputs each (details are provided below). The basic idea we use here is that composing layers (i.e. using a deep architecture) is equivalent to using a factorized representation of the polynomial function computed by the network. Such a factorized representation can be exponentially more compact than its expansion as a sum of products (which can be associated to a shallow network with product units in its hidden layer and a sum unit as output). This is what we formally show in what follows. + ℓ2 = λ11ℓ1 + µ11ℓ1 = x1x2 + x3x4 = f (x1, x2, x3, x4) 2 1 1 λ11 = 1 µ11 = 1 × ℓ1 = x1x2 1 x1 x2 × ℓ1 = x3x4 2 x3 x4 Figure 1: Sum-product network computing the function f ∈ F such that i = λ11 = µ11 = 1. Let n = 4i , with i a positive integer value. Denote by ℓ0 the input layer containing scalar variables {x1 , . . . , xn }, such that ℓ0 = xj for 1 ≤ j ≤ n. Now define f ∈ F as any function computed by a j sum-product network (deep for i ≥ 2) composed of alternating product and sum layers: • ℓ2k+1 = ℓ2k · ℓ2k for 0 ≤ k ≤ i − 1 and 1 ≤ j ≤ 22(i−k)−1 2j−1 2j j • ℓ2k = λjk ℓ2k−1 + µjk ℓ2k−1 for 1 ≤ k ≤ i and 1 ≤ j ≤ 22(i−k) j 2j 2j−1 where the weights λjk and µjk of the summation units are strictly positive. The output of the network is given by f (x1 , . . . , xn ) = ℓ2i ∈ R, the unique unit in the last layer. 1 The corresponding (shallow) network for i = 1 and additive weights set to one is shown in Figure 1 1 This condition is required by some of the proofs presented here. 3 (this architecture is also the basic building block of bigger networks for i > 1). Note that both the input size n = 4i and the network’s depth 2i increase with parameter i. 3.2 Theoretical results The main result of this section is presented below in Corollary 1, providing a lower bound on the minimum number of hidden units required by a shallow sum-product network to represent a function f ∈ F. The high-level proof sketch consists in the following steps: (1) Count the number of unique products found in the polynomial representation of f (Lemma 1 and Proposition 1). (2) Show that the only possible architecture for a shallow sum-product network to compute f is to have a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5). (3) Conclude that the number of hidden units must be at least the number of unique products computed in step 3.2 (Lemma 6 and Corollary 1). Lemma 1. Any element ℓk can be written as a (positively) weighted sum of products of input varij ables, such that each input variable xt is used in exactly one unit of ℓk . Moreover, the number mk of products found in the sum computed by ℓk does not depend on j and obeys the following recurrence j rule for k ≥ 0: if k + 1 is odd, then mk+1 = m2 , otherwise mk+1 = 2mk . k Proof. We prove the lemma by induction on k. It is obviously true for k = 0 since ℓ0 = xj . j Assuming this is true for some k ≥ 0, we consider two cases: k+1 k • If k + 1 is odd, then ℓj = ℓk 2j−1 · ℓ2j . By the inductive hypothesis, it is the product of two (positively) weighted sums of products of input variables, and no input variable can k appear in both ℓk 2j−1 and ℓ2j , so the result is also a (positively) weighted sum of products k of input variables. 
Additionally, if the number of products in ℓk 2j−1 and ℓ2j is mk , then 2 mk+1 = mk , since all products involved in the multiplication of the two units are different (since they use disjoint subsets of input variables), and the sums have positive weights. Finally, by the induction assumption, an input variable appears in exactly one unit of ℓk . This unit is an input to a single unit of ℓk+1 , that will thus be the only unit of ℓk+1 where this input variable appears. k • If k + 1 is even, then ℓk+1 = λjk ℓk 2j−1 + µjk ℓ2j . Again, from the induction assumption, it j must be a (positively) weighted sum of products of input variables, but with mk+1 = 2mk such products. As in the previous case, an input variable will appear in the single unit of ℓk+1 that has as input the single unit of ℓk in which this variable must appear. 2i Proposition 1. The number of products in the sum computed in the output unit l1 of a network √ n−1 . computing a function in F is m2i = 2 Proof. We first prove by induction on k ≥ 1 that for odd k, mk = 22 k 22 1+1 2 2 k+1 2 −2 , and for even k, . This is obviously true for k = 1 since 2 = 2 = 1, and all units in ℓ1 are mk = 2 single products of the form xr xs . Assuming this is true for some k ≥ 1, then: −1 0 −2 • if k + 1 is odd, then from Lemma 1 and the induction assumption, we have: mk+1 = m2 = k 2 k 22 2 −1 k +1 = 22 2 • if k + 1 is even, then instead we have: mk+1 = 2mk = 2 · 22 k+1 2 −2 −2 = 22 = 22 (k+1)+1 2 (k+1) 2 −2 −1 which shows the desired result for k + 1, and thus concludes the induction proof. Applying this result with k = 2i (which is even) yields 2i m2i = 22 2 −1 √ =2 4 22i −1 √ =2 n−1 . 2i Lemma 2. The products computed in the output unit l1 can be split in two groups, one with products containing only variables x1 , . . . , x n and one containing only variables x n +1 , . . . , xn . 2 2 Proof. This is obvious since the last unit is a “sum“ unit that adds two terms whose inputs are these two groups of variables (see e.g. Fig. 1). 2i Lemma 3. The products computed in the output unit l1 involve more than one input variable. k Proof. It is straightforward to show by induction on k ≥ 1 that the products computed by lj all involve more than one input variable, thus it is true in particular for the output layer (k = 2i). Lemma 4. Any shallow sum-product network computing f ∈ F must have a “sum” unit as output. Proof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative. This unit must have more than one input, because in the case that it has only one input, the output would be either a (weighted) sum of input variables (which would violate Lemma 3), or a single product of input variables (which would violate Proposition 1), depending on the type (sum or product) of the single input hidden unit. Thus the last unit must compute a product of two or more hidden units. It can be re-written as a product of two factors, where each factor corresponds to either one hidden unit, or a product of multiple hidden units (it does not matter here which specific factorization is chosen among all possible ones). Regardless of the type (sum or product) of the hidden units involved, those two factors can thus be written as weighted sums of products of variables xt (with positive weights, and input variables potentially raised to powers above one). From Lemma 1, both x1 and xn must be present in the final output, and thus they must appear in at least one of these two factors. 
Without loss of generality, assume x1 appears in the first factor. Variables x n +1 , . . . , xn then cannot be present in the second factor, since otherwise one product in the output 2 would contain both x1 and one of these variables (this product cannot cancel out since weights must be positive), violating Lemma 2. But with a similar reasoning, since as a result xn must appear in the first factor, variables x1 , . . . , x n cannot be present in the second factor either. Consequently, no 2 input variable can be present in the second factor, leading to the desired contradiction. Lemma 5. Any shallow sum-product network computing f ∈ F must have only multiplicative units in its hidden layer. Proof. By contradiction, suppose there exists a “sum“ unit in the hidden layer, written s = t∈S αt xt with S the set of input indices appearing in this sum, and αt > 0 for all t ∈ S. Since according to Lemma 4 the output unit must also be a sum (and have positive weights according to Definition 1), then the final output will also contain terms of the form βt xt for t ∈ S, with βt > 0. This violates Lemma 3, establishing the contradiction. Lemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing f ∈ F √ must have at least 2 n−1 hidden units, if its output unit is a sum and its hidden units are products. Proof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a γ product of input variables, i.e. its output can be written as Σj wj Πt xt jt with wj ∈ R and γjt ∈ {0, 1}. In order to compute a function in F, this shallow network thus needs a number of hidden units at least equal to the number of unique products in that function. From Proposition 1, this √ number is equal to 2 n−1 . √ Corollary 1. Any shallow sum-product network computing f ∈ F must have at least 2 units. n−1 hidden Proof. This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that hidden units are products), and 6 (showing the desired result for any shallow network with this specific structure – regardless of the sign of weights). 5 3.3 Discussion Corollary 1 above shows that in order to compute some function in F with n inputs, the number of √ √ units in a shallow network has to be at least 2 n−1 , (i.e. grows exponentially in n). On another hand, the total number of units in the deep (for i > 1) network computing the same function, as described in Section 3.1, is equal to 1 + 2 + 4 + 8 + . . . + 22i−1 (since all units are binary), which is √ also equal to 22i − 1 = n − 1 (i.e. grows only quadratically in n). It shows that some deep sumproduct network with n inputs and depth O(log n) can represent with O(n) units what would √ require O(2 n ) units for a depth-2 network. Lemma 6 also shows a similar result regardless of the sign of the weights in the summation units of the depth-2 network, but assumes a specific architecture for this network (products in the hidden layer with a sum as output). 4 The family G In this section we present similar results with a different family of functions, denoted by G. Compared to F, one important difference of deep sum-product networks built to define functions in G is that they can vary their input size independently of their depth. Their analysis thus provides additional insight when comparing the representational efficiency of deep vs. shallow sum-product networks in the case of a fixed dataset. 
4.1 Definition Networks in family G also alternate sum and product layers, but their units have as inputs all units from the previous layer except one. More formally, define the family G = ∪n≥2,i≥0 Gin of functions represented by sum-product networks, where the sub-family Gin is made of all sum-product networks with n input variables and 2i + 2 layers (including the input layer ℓ0 ), such that: 1. ℓ1 contains summation units; further layers alternate multiplicative and summation units. 2. Summation units have positive weights. 3. All layers are of size n, except the last layer ℓ2i+1 that contains a single sum unit that sums all units in the previous layer ℓ2i . k−1 4. In each layer ℓk for 1 ≤ k ≤ 2i, each unit ℓk takes as inputs {ℓm |m = j}. j An example of a network belonging to G1,3 (i.e. with three layers and three input variables) is shown in Figure 2. ℓ3 = x2 + x2 + x2 + 3(x1x2 + x1x3 + x2x3) = g(x1, x2, x3) 3 2 1 1 + ℓ2 = x2 + x1x2 × 1 1 +x1x3 + x2x3 ℓ1 = x2 + x3 1 × ℓ2 = . . . 2 × ℓ2 = x2 + x1x2 3 3 +x1x3 + x2x3 + + ℓ1 = x1 + x3 2 + ℓ1 = x1 + x2 3 x1 x2 x3 Figure 2: Sum-product network computing a function of G1,3 (summation units’ weights are all 1’s). 4.2 Theoretical results The main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden units of a shallow sum-product network computing g ∈ G. The proof sketch is as follows: 1. We show that the polynomial expansion of g must contain a large set of products (Proposition 2 and Corollary 2). 2. We use both the number of products in that set as well as their degree to establish the desired lower bound (Proposition 3). 6 We will also need the following lemma, which states that when n − 1 items each belong to n − 1 sets among a total of n sets, then we can associate to each item one of the sets it belongs to without using the same set for different items. Lemma 7. Let S1 , . . . , Sn be n sets (n ≥ 2) containing elements of {P1 , . . . , Pn−1 }, such that for any q, r, |{r|Pq ∈ Sr }| ≥ n − 1 (i.e. each element Pq belongs to at least n − 1 sets). Then there exist r1 , . . . , rn−1 different indices such that Pq ∈ Srq for 1 ≤ q ≤ n − 1. Proof. Omitted due to lack of space (very easy to prove by construction). Proposition 2. For any 0 ≤ j ≤ i, and any product of variables P = Πn xαt such that αt ∈ N and t=1 t j 2j whose computed value, when expanded as a weighted t αt = (n − 1) , there exists a unit in ℓ sum of products, contains P among these products. Proof. We prove this proposition by induction on j. First, for j = 0, this is obvious since any P of this form must be made of a single input variable xt , that appears in ℓ0 = xt . t Suppose now the proposition is true for some j < i. Consider a product P = Πn xαt such that t=1 t αt ∈ N and t αt = (n − 1)j+1 . P can be factored in n − 1 sub-products of degree (n − 1)j , β i.e. written P = P1 . . . Pn−1 with Pq = Πn xt qt , βqt ∈ N and t βqt = (n − 1)j for all q. By t=1 the induction hypothesis, each Pq can be found in at least one unit ℓ2j . As a result, by property 4 kq (in the definition of family G), each Pq will also appear in the additive layer ℓ2j+1 , in at least n − 1 different units (the only sum unit that may not contain Pq is the one that does not have ℓ2j as input). kq By Lemma 7, we can thus find a set of units ℓ2j+1 such that for any 1 ≤ q ≤ n − 1, the product rq Pq appears in ℓ2j+1 , with indices rq being different from each other. Let 1 ≤ s ≤ n be such that rq 2(j+1) s = rq for all q. 
Then, from property 4 of family G, the multiplicative unit ℓ_s^{2(j+1)} computes the product Π_{q=1}^{n−1} ℓ_{r_q}^{2j+1} and, as a result, when expanded as a sum of products, it contains in particular P_1 ... P_{n−1} = P. The proposition is thus true for j + 1 and, by induction, is true for all j ≤ i.

Corollary 2. The output g_i^n of a sum-product network in G_i^n, when expanded as a sum of products, contains all products of variables of the form Π_{t=1}^n x_t^{α_t} such that α_t ∈ N and Σ_t α_t = (n − 1)^i.

Proof. Applying Proposition 2 with j = i, we obtain that all products of this form can be found in the multiplicative units of ℓ^{2i}. Since the output unit ℓ_1^{2i+1} computes a sum of these multiplicative units (weighted with positive weights), those products are also present in the output.

Proposition 3. A shallow negative-weight sum-product network computing g_i^n ∈ G_i^n must have at least (n − 1)^i hidden units.

Proof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute g_i^n, assuming we allow multiplicative units in the hidden layer to use powers of their inputs in the product they compute (which we allow here to make the proof more generic). However, it will require at least as many of these units as there are unique products in the expansion of g_i^n. In particular, from Corollary 2, it will require at least the number of unique tuples of the form (α_1, ..., α_n) such that α_t ∈ N and Σ_{t=1}^n α_t = (n − 1)^i. Denoting d_{ni} = (n − 1)^i, this number is known to be equal to the binomial coefficient C(n + d_{ni} − 1, d_{ni}), and it is easy to verify that it is greater than or equal to d_{ni} for any n ≥ 2 and i ≥ 0. Now suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit, since otherwise one could factor some input variable x_t out of the computed function; this is not possible because, by Corollary 2, for any variable x_t there exist products in the output function that do not involve x_t. So all hidden units must be additive, and since the computed function contains products of degree d_{ni}, there must be at least d_{ni} such hidden units.

4.3 Discussion

Proposition 3 shows that in order to compute the same function as g_i^n ∈ G_i^n, the number of units in the shallow network has to grow exponentially in i, i.e. in the network's depth, while the deep network's size grows only linearly in i. The shallow network also needs to grow polynomially in the number of input variables n (with a degree equal to i), while the deep network grows only linearly in n. This means that some deep sum-product network with n inputs and depth O(i) can represent with O(ni) units what would require O((n − 1)^i) units for a depth-2 network. Note that in the similar results found for family F, the depth-2 network computing the same function as a function in F had to be constrained either to have a specific combination of sum and product units (in Lemma 6) or to have non-negative weights (in Corollary 1). In contrast, the result presented here for family G holds without requiring either of these assumptions.

5 Conclusion

We compared a deep sum-product network and a shallow sum-product network representing the same function, taken from two families of functions F and G. For both families, we have shown that the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, in order to represent the same functions. The deep version thus offers a much more compact representation of the same functions.
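To illustrate the family G construction and the counts in Proposition 3, here is a small symbolic sketch. It is ours, not part of the paper: the function g_deep and the chosen (n, i) values are hypothetical helpers, and it assumes the sympy library is available. It builds the G_1^3 network of Figure 2 (with all summation weights equal to 1), checks its expanded output, and prints the shallow lower bound (n − 1)^i next to the deep unit count 2in + 1.

```python
# Sketch (ours): symbolic check of the G_i^n construction and of the counts in Proposition 3.
from math import comb
import sympy as sp

def g_deep(n: int, i: int):
    """Output polynomial of a network in G_i^n with all summation weights set to 1."""
    layer = list(sp.symbols(f"x1:{n + 1}"))  # input layer l^0 = (x1, ..., xn)
    for k in range(1, 2 * i + 1):
        if k % 2 == 1:  # summation layer: unit j sums all previous units except unit j
            layer = [sp.Add(*(layer[m] for m in range(n) if m != j)) for j in range(n)]
        else:           # product layer: unit j multiplies all previous units except unit j
            layer = [sp.Mul(*(layer[m] for m in range(n) if m != j)) for j in range(n)]
    return sp.expand(sp.Add(*layer))         # final single sum unit l^{2i+1}

# Figure 2 example (n = 3, i = 1): g = x1^2 + x2^2 + x3^2 + 3(x1x2 + x1x3 + x2x3).
x1, x2, x3 = sp.symbols("x1:4")
assert sp.expand(g_deep(3, 1) - (x1**2 + x2**2 + x3**2
                                 + 3 * (x1*x2 + x1*x3 + x2*x3))) == 0

# Counts: the deep network uses 2in + 1 units above the inputs, while a shallow one
# needs at least d = (n - 1)^i hidden units; C(n + d - 1, d) counts the degree-d monomials.
for n, i in [(3, 1), (3, 2), (4, 2), (5, 3)]:
    d = (n - 1) ** i
    print(f"n={n} i={i}  deep units={2 * i * n + 1}  shallow >= {d}  "
          f"monomials of degree d: {comb(n + d - 1, d)}")
```

For n = 3 and i = 2 the deep network has 13 units while the shallow bound is 4; for n = 5 and i = 3 the deep network has 31 units while the shallow bound is already 64, and the gap keeps widening with depth.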
This work focuses on two specific families of functions: finding more general parameterizations of functions leading to similar results would be an interesting topic for future research. Another open question is whether it is possible to represent such functions only approximately (e.g. up to an error bound ε) with a much smaller shallow network. Results by Braverman [8] on Boolean circuits suggest that results similar to those presented in this paper may still hold, but this topic has yet to be formally investigated in the context of sum-product networks. A related problem is to look into functions defined only on discrete input variables: our proofs do not trivially extend to this situation because we can no longer assume that two polynomials yielding the same output values must have the same expansion coefficients (since the number of input combinations becomes finite).

Acknowledgments

The authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving this manuscript, as well as the anonymous reviewers for their careful reviews. This work was partially funded by NSERC, CIFAR, and the Canada Research Chairs.

References

[1] Ajtai, M. (1983). Σ¹₁-formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1–48.
[2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–18. Lecture Notes in Computer Science 1180, Springer Verlag.
[3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book, Now Publishers, 2009.
[4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press.
[5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS'05, pages 107–114. MIT Press, Cambridge, MA.
[6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS 19, pages 153–160. MIT Press.
[7] Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467.
[8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115.
[9] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167.
[10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems (NIPS).
[11] Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th Annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press.
[12] Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity, 1, 113–129.
[13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
[14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
[15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS'10.
[16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML'07, pages 473–480. ACM.
[17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS'07, pages 873–880. MIT Press, Cambridge, MA.
[18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009, Montreal (Qc), Canada.
[19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS'09, pages 1096–1104.
[20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta.
[21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS'08, pages 1081–1088.
[22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML'2009, pages 737–744.
[23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110.
[24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. In NIPS'07, pages 1121–1128, Cambridge, MA. MIT Press.
[25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI'2011, Barcelona, Spain.
[26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML.
[27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS'06, pages 1137–1144. MIT Press.
[28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS'07, pages 1185–1192, Cambridge, MA. MIT Press.
[29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and Applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier.
[30] Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML 2007, pages 791–798, New York, NY, USA.
[31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56.
[32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML'2011.
[33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032.
[34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS'06, pages 1345–1352. MIT Press, Cambridge, MA.
[35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539.
[36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA.
[37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390.
[38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10.