nips nips2008 nips2008-217 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ingo Steinwart, Andreas Christmann
Abstract: In this paper lower and upper bounds for the number of support vectors are derived for support vector machines (SVMs) based on the ε-insensitive loss function. It turns out that these bounds are asymptotically tight under mild assumptions on the data-generating distribution. Finally, we briefly discuss a trade-off between sparsity and accuracy if the SVM is used to estimate the conditional median.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper lower and upper bounds for the number of support vectors are derived for support vector machines (SVMs) based on the ε-insensitive loss function. [sent-4, score-0.095]
2 It turns out that these bounds are asymptotically tight under mild assumptions on the data generating distribution. [sent-5, score-0.057]
3 Finally, we briefly discuss a trade-off between sparsity and accuracy if the SVM is used to estimate the conditional median. [sent-6, score-0.066]
4 1 Introduction Given a reproducing kernel Hilbert space (RKHS) of a kernel k : X × X → R and training set D := ((x1 , y1 ), . [sent-7, score-0.05]
5 […, 21], that the solution is of the form fD,λ = Σ_{i=1}^n β_i^* k(x_i, ·), (2) where the coefficients β_i^* are a solution of the optimization problem
maximize −ε Σ_{i=1}^n |β_i| + Σ_{i=1}^n y_i β_i − (1/2) Σ_{i,j=1}^n β_i β_j k(x_i, x_j)  (3)
subject to −C ≤ β_i ≤ C for all i = 1, . . . , n. [sent-14, score-0.047]
6 In the following, we write SV(fD,λ) := {i : β_i^* ≠ 0} for the set of indices that belong to the support vectors of fD,λ. [sent-21, score-0.054]
7 Furthermore, we write # for the counting measure, and hence #SV (fD,λ ) denotes the number of support vectors of fD,λ . [sent-22, score-0.079]
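To make the quantity #SV(fD,λ)/n studied below concrete, the following minimal Python sketch fits an ε-insensitive SVM with scikit-learn and reports the fraction of support vectors. The data-generating distribution, kernel, and all hyperparameter values are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (illustrative assumptions only): count the support vectors of an
# epsilon-insensitive SVM fitted on synthetic data with Gaussian conditional noise.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)   # P(.|x) = N(sin(x), 0.3^2)

# `epsilon` plays the role of the eps in the eps-insensitive loss;
# `C` corresponds to 1/(2*lambda*n) in the parameterization of Section 3.
svm = SVR(kernel="rbf", C=1.0, epsilon=0.2).fit(X, y)

frac_sv = len(svm.support_) / n   # #SV(f_{D,lambda}) / n
print(f"fraction of support vectors: {frac_sv:.3f}")
```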
8 Due to this fact, the ε-insensitive loss was originally motivated by the goal of achieving sparse decision functions, i.e. [sent-24, score-0.037]
9 The goal of this work is to provide such an explanation by establishing asymptotically tight lower and upper bounds for the number of support vectors. [sent-28, score-0.07]
10 Based on these bounds we then investigate the trade-off between sparsity and estimation accuracy of the ε-insensitive SVM. [sent-29, score-0.032]
11 To this end, let P be a probability measure on X × R, where X is some measurable space. [sent-31, score-0.119]
12 Given a measurable f : X → R, we then define the L_ε-risk of f by R_{L_ε,P}(f) := E_{(x,y)∼P} L_ε(y, f(x)). [sent-32, score-0.064]
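Throughout, L_ε denotes the ε-insensitive loss; its standard definition (stated here for reference, since it is not spelled out in the extracted sentences) is
\[
L_\varepsilon(y, t) \;=\; \max\{0,\ |y - t| - \varepsilon\}, \qquad y, t \in \mathbb{R},\ \varepsilon \ge 0,
\]
so that the L_ε-risk only penalizes residuals whose absolute value exceeds ε.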
13 Moreover, recall that P can be split into the marginal distribution PX on X and the regular conditional probability P( · |x). [sent-33, score-0.051]
14 Given an RKHS H of a bounded kernel k, [1] then showed that fP,λ := arg inf_{f∈H} λ‖f‖_H² + R_{L_ε,P}(f) exists and is uniquely determined whenever R_{L_ε,P}(0) < ∞. [sent-34, score-0.108]
15 Let us write δ_{(x,y)} for the Dirac measure at some (x, y) ∈ X × R. [sent-35, score-0.051]
16 Finally, we need to introduce the sets A^δ_low(f) := {(x, y) ∈ X × R : |f(x) − y| > ε + δ} and A^δ_up(f) := {(x, y) ∈ X × R : |f(x) − y| ≥ ε − δ}, where f : X → R is an arbitrary function and δ ∈ R. [sent-40, score-0.027]
17 Theorem 2.1 Let P be a probability measure on X × R and H be a separable RKHS with a bounded measurable kernel satisfying ‖k‖_∞ ≤ 1. [sent-44, score-0.241]
18 Then, for all n ≥ 1, ρ > 0, δ > 0, and λ > 0 satisfying δλ ≤ 4, we have
P^n( D ∈ (X × R)^n : #SV(fD,λ)/n > P(A^δ_low(fP,λ)) − ρ ) ≥ 1 − 3e^{−δ²λ²n/16} − e^{−2ρ²n} and
P^n( D ∈ (X × R)^n : #SV(fD,λ)/n < P(A^δ_up(fP,λ)) + ρ ) ≥ 1 − 3e^{−δ²λ²n/16} − e^{−2ρ²n}. [sent-45, score-0.068]
19 Corollary 2.2 Let P be a probability measure on X × R and H be a separable RKHS with a bounded measurable kernel satisfying ‖k‖_∞ ≤ 1. [sent-49, score-0.241]
20 Then, for all ρ > 0 and λ > 0, we have lim_{n→∞} P^n( D ∈ (X × R)^n : P(A_low(fP,λ)) − ρ ≤ #SV(fD,λ)/n ≤ P(A_up(fP,λ)) + ρ ) = 1. [sent-50, score-0.055]
21 Note that the above corollary exactly describes the asymptotic behavior of the fraction of support vectors modulo the probability of the set A_up(fP,λ) \ A_low(fP,λ) = {(x, fP,λ(x) − ε) : x ∈ X} ∪ {(x, fP,λ(x) + ε) : x ∈ X}. [sent-51, score-0.126]
22 In particular, if the conditional distributions P( · |x), x ∈ X, have no discrete components, then the above corollary gives an exact description. [sent-52, score-0.107]
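A rough numerical illustration of this sandwich (a sketch under assumed data and hyperparameters, not an experiment from the paper): fit an ε-insensitive SVM, use the fitted function as a stand-in for fP,λ, estimate P(A_low) and P(A_up) by Monte Carlo on a fresh sample, and compare with the empirical fraction of support vectors. For the continuous noise model below the two probabilities essentially coincide, matching the remark about distributions without discrete components.

```python
# Sketch of P(A_low) <= #SV/n <= P(A_up) (approximately, for large n).
# The distribution, sample sizes and hyperparameters are illustrative assumptions,
# and the fitted model is only a proxy for f_{P,lambda}.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
n = 5000
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
svm = SVR(kernel="rbf", C=1.0, epsilon=0.2).fit(X, y)
frac_sv = len(svm.support_) / n

# Monte Carlo estimates of P(A_low) and P(A_up) on a fresh sample.
m = 50_000
X_new = rng.uniform(-3.0, 3.0, size=(m, 1))
y_new = np.sin(X_new[:, 0]) + rng.normal(scale=0.3, size=m)
resid = np.abs(svm.predict(X_new) - y_new)
p_low = np.mean(resid > svm.epsilon)    # P(|f(x) - y| >  eps)
p_up = np.mean(resid >= svm.epsilon)    # P(|f(x) - y| >= eps)

print(f"P(A_low) ~ {p_low:.3f}   #SV/n = {frac_sv:.3f}   P(A_up) ~ {p_up:.3f}")
```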
23 Let us begin by denoting the Bayes L_ε-risk by R*_{L_ε,P} := inf R_{L_ε,P}(f), where P is a distribution and the infimum is taken over all measurable functions f : X → R. [sent-56, score-0.113]
24 In addition, given a distribution Q on R, [6] and [7, Chapter 3] defined the inner L_ε-risks by C_{L_ε,Q}(t) := ∫_R L_ε(y, t) dQ(y), t ∈ R, and the minimal inner L_ε-risks by C*_{L_ε,Q} := inf_{t∈R} C_{L_ε,Q}(t), so that R_{L_ε,P}(f) = ∫_X C_{L_ε,P(·|x)}(f(x)) dPX(x). [sent-57, score-0.038]
25 Moreover, we need the sets of conditional minimizers M*(x) := {t ∈ R : C_{L_ε,P(·|x)}(t) = C*_{L_ε,P(·|x)}}. [sent-61, score-0.039]
26 The following lemma collects some useful properties of these sets. [sent-62, score-0.059]
27 Lemma 2.3 Let P be a probability measure on X × R with R*_{L_ε,P} < ∞. Then M*(x) is a non-empty and compact interval for PX-almost all x ∈ X. [sent-64, score-0.035]
28 Lemma 2.3 shows that for PX-almost all x ∈ X there exists a unique t*(x) ∈ M*(x) such that |t*(x) − f(x)| ≤ |t − f(x)| for all t ∈ M*(x). (6) [sent-67, score-0.037]
29 Theorem 2.4 Let P be a probability measure on X × R and H be a separable RKHS with a bounded measurable kernel satisfying ‖k‖_∞ ≤ 1. [sent-74, score-0.241]
30 For this case, the following obvious corollary of Theorem 2.4 establishes lower and upper bounds on the number of support vectors. [sent-79, score-0.065] [sent-80, score-0.041]
32 Corollary 2.5 Let P be a probability measure on X × R and H be a separable RKHS with a bounded measurable kernel satisfying ‖k‖_∞ ≤ 1. [sent-82, score-0.241]
33 Then, for all ρ > 0, the probability P^n of D ∈ (X × R)^n satisfying lim inf_{m→∞} P(M_low(fP,λ_m)) − ρ ≤ #SV(fD,λ_n)/n ≤ lim sup_{m→∞} P(M_up(fP,λ_m)) + ρ converges to 1 for n → ∞. [sent-85, score-0.201]
34 As a consequence, it seems hard to show that, for probability measures P whose conditional distributions P(·|x), x ∈ X, have no discrete components, we always have lim inf P(M_low(fP,λ)) = lim sup P(M_up(fP,λ)). (7) [sent-90, score-0.215]
35 Consequently, (7) holds provided that the conditional distributions P(·|x), x ∈ X, have no discrete components, and hence Corollary 2.5 gives a tight bound on the number of support vectors. [sent-95, score-0.08] [sent-96, score-0.039]
37 Finally, recall that for this specific loss function, M∗ (x) equals the median of P( · |x), and hence M∗ (x) is a singleton whenever the median of P( · |x) is unique. [sent-101, score-0.116]
38 To this end, we assume in the following that the conditional distributions P(·|x) are symmetric, i.e., for PX-almost all x ∈ X there exists a conditional center c(x) ∈ R such that P(c(x) + A|x) = P(c(x) − A|x) for all measurable A ⊂ R. [sent-104, score-0.055] [sent-106, score-0.128]
40 Note that by considering A := [0, ∞) it is easy to see that c(x) is a median of P( · |x). [sent-107, score-0.05]
41 Furthermore, the assumption R_{L_ε,P}(0) < ∞ imposed in the results above ensures that the conditional mean f*_P(x) := E(Y|x) of P(·|x) exists PX-almost surely, and from this it is easy to conclude that c(x) = f*_P(x) for PX-almost all x ∈ X. [sent-108, score-0.103]
42 2.6 Let P be a probability measure on X × R such that R_{L_ε,P}(0) < ∞. [sent-113, score-0.035]
43 Assume that the conditional distributions P(·|x), x ∈ X, are symmetric and that for PX-almost all x ∈ X there exists a δ(x) > 0 such that for all δ ∈ (0, δ(x)] we have
P( f*_P(x) + [−δ, δ] | x ) > 0,  (8)
P( f*_P(x) + [ε − δ, ε + δ] | x ) > 0.  (9) [sent-114, score-0.08]
44 Then, for PX-almost all x ∈ X, we have M*(x) = {f*_P(x)}, and f*_P(x) PX-almost surely equals the unique median of P(·|x). [sent-115, score-0.049]
45 Obviously, condition (8) means that the conditional distributions have some mass around their median f*_P, whereas (9) means that the conditional distributions have some mass around f*_P ± ε. [sent-116, score-0.156]
46 Hence, by 2.6, the corresponding ε-insensitive SVM can be used to estimate the conditional median. [sent-118, score-0.05]
47 To this end, let us assume for the sake of simplicity that the conditional distributions P( · |x) have continuous Lebesgue densities p( · |x) : R → [0, ∞). [sent-120, score-0.101]
48 By the symmetry of the conditional distributions it is then easy to see that these densities are symmetric around f*_P(x). [sent-121, score-0.096]
49 Let us first consider the case where the conditional distributions are equal modulo translations. [sent-123, score-0.091]
50 In other words, we assume that there exists a continuous Lebesgue density q : R → [0, ∞) which is symmetric around 0 such that for PX-almost all x ∈ X we have q(y) = p(f*_P(x) + y | x), y ∈ R. [sent-124, score-0.038]
51 Combining 2.6 and the discussion around (7) we then conclude that under the assumptions of Corollary 2.5 the fraction of support vectors asymptotically approaches 2Q([ε, ∞)), where Q is the probability measure defined by q. [sent-131, score-0.051] [sent-132, score-0.087]
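For a concrete special case of this limit (an added illustration under the translation-invariance assumption above, with all numbers assumed): if q is the density of N(0, σ²), then 2Q([ε, ∞)) = 2(1 − Φ(ε/σ)), which can be compared against the empirical fraction of support vectors; agreement is only approximate at finite sample sizes.

```python
# Sketch: asymptotic SV fraction 2*Q([eps, inf)) for Gaussian noise q = N(0, sigma^2).
# sigma, eps, the regression function and all hyperparameters are assumptions.
import numpy as np
from scipy.stats import norm
from sklearn.svm import SVR

sigma, eps = 0.3, 0.2
predicted = 2.0 * (1.0 - norm.cdf(eps / sigma))   # 2*Q([eps, inf))

rng = np.random.default_rng(2)
n = 5000
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=sigma, size=n)
svm = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)

print(f"2Q([eps,inf)) = {predicted:.3f}, empirical #SV/n = {len(svm.support_) / n:.3f}")
```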
53 In particular, if Q([ε, ∞)) = 0, the corresponding SVM produces super-sparse decision functions, i.e., decision functions whose number of support vectors does not grow linearly in the sample size. [sent-134, score-0.039] [sent-136, score-0.057]
55 3] indicates that the size of q(ε) has a direct influence on the ability of fD,λ to estimate the conditional median f*_P. [sent-139, score-0.087]
56 By a literal repetition of the proof of [8, Theorem 2. [sent-143, score-0.031]
57 5, then we know that R_{L_ε,P}(fD,λ_n) → R*_{L_ε,P} in probability for n → ∞, and thus (10) shows that fD,λ_n approximates the conditional median f*_P with respect to the L_1(PX)-norm. [sent-146, score-0.088]
58 In other words, the sparsity of the decision functions may come at the price of less accurate estimates of the conditional median. [sent-150, score-0.092]
59 On the other hand, our results also show that moderate values for ε can lead to both reasonable estimates of the conditional median and relatively sparse decision functions. [sent-151, score-0.096]
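The trade-off just described can be made visible with a small sweep over ε (again a hedged sketch with assumed data and hyperparameters): larger ε yields fewer support vectors but a larger L_1(PX) distance to the conditional median, which for the symmetric noise model below equals the regression function sin(x).

```python
# Sketch of the sparsity/accuracy trade-off in eps. All distributional and
# hyperparameter choices are assumptions made for this illustration only.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
n = 4000
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)

X_eval = np.linspace(-3.0, 3.0, 2000).reshape(-1, 1)
median_eval = np.sin(X_eval[:, 0])            # conditional median f_P^* (known here)

for eps in (0.05, 0.2, 0.5, 1.0):
    svm = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    frac_sv = len(svm.support_) / n
    l1_err = np.mean(np.abs(svm.predict(X_eval) - median_eval))  # approx. L1(P_X) error
    print(f"eps={eps:4.2f}   #SV/n={frac_sv:5.3f}   L1 error={l1_err:5.3f}")
```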
60 3] to establish self-calibration inequalities that measure the distance of f to f*_P only up to ε. [sent-153, score-0.044]
61 In this case, however, it is obvious that such self-calibration inequalities are worse the larger ε is, and hence the informal conclusions above remain unchanged. [sent-154, score-0.059]
62 Finally, we would like to mention that, if the conditional distributions are not equal modulo translations, then the situation may become more involved. [sent-155, score-0.095]
63 In particular, if we are in a situation with p(f*_P(x)|x) > 0 and p(f*_P(x) + ε|x) > 0 but inf_x p(f*_P(x)|x) = inf_x p(f*_P(x) + ε|x) = 0, self-calibration inequalities of the form (10) are in general impossible, and weaker self-calibration inequalities require additional assumptions on P. [sent-156, score-0.145]
64 3 Proofs Setting C := 1/(2λn) and introducing slack variables, we can restate the optimization problem (1) as
minimize (1/2)‖f‖_H² + C Σ_{i=1}^n (ξ_i + ξ̃_i)  (11)
subject to f(x_i) − y_i ≤ ε + ξ_i, y_i − f(x_i) ≤ ε + ξ̃_i, and ξ_i, ξ̃_i ≥ 0 for all i = 1, . . . , n. [sent-158, score-0.094]
65 […, 117], that the dual optimization problem of (11) is
maximize −ε Σ_{i=1}^n (α̃_i + α_i) + Σ_{i=1}^n y_i (α̃_i − α_i) − (1/2) Σ_{i,j=1}^n (α̃_i − α_i)(α̃_j − α_j) k(x_i, x_j)  (12)
subject to 0 ≤ α_i, α̃_i ≤ C for all i = 1, . . . , n. [sent-166, score-0.047]
66 If (α_1^*, α̃_1^*, . . . , α_n^*, α̃_n^*) denotes a solution of (12), then we can recover the primal solution (f*, ξ*, ξ̃*) by
f* = Σ_{i=1}^n (α̃_i^* − α_i^*) k(x_i, ·),  (13)
ξ_i^* = max{0, f*(x_i) − y_i − ε},  (14)
ξ̃_i^* = max{0, y_i − f*(x_i) − ε},  (15)
for all i = 1, . . . , n. [sent-173, score-0.104]
67 Moreover, the Karush-Kuhn-Tucker conditions of (12) are
α_i^* (f*(x_i) − y_i − ε − ξ_i^*) = 0,  (16)
α̃_i^* (y_i − f*(x_i) − ε − ξ̃_i^*) = 0,  (17)
(α_i^* − C) ξ_i^* = 0,  (18)
(α̃_i^* − C) ξ̃_i^* = 0,  (19)
ξ_i^* ξ̃_i^* = 0,  (20)
α_i^* α̃_i^* = 0,  (21)
where i = 1, . . . , n. [sent-177, score-0.047]
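For illustration, the box-constrained dual (12) can be solved directly with a generic smooth optimizer and the expansion (13) then recovered from the solution. The sketch below uses L-BFGS-B on the negative dual objective with a toy Gaussian kernel and toy data; all of these choices are assumptions for illustration, and a dedicated QP/SMO solver would normally be used instead.

```python
# Sketch: solve the dual (12) with L-BFGS-B and recover beta_i^* = alpha~_i^* - alpha_i^*.
# Toy data, toy Gaussian kernel and all parameter values are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, eps, lam = 60, 0.2, 1e-2
C = 1.0 / (2.0 * lam * n)
x = rng.uniform(-3.0, 3.0, size=n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)    # Gaussian kernel, bandwidth 1

def neg_dual(z):
    a, a_til = z[:n], z[n:]                          # alpha, alpha~ in (12)
    beta = a_til - a
    val = -eps * np.sum(a_til + a) + y @ beta - 0.5 * beta @ K @ beta
    return -val

def neg_dual_grad(z):
    a, a_til = z[:n], z[n:]
    Kb = K @ (a_til - a)
    return np.concatenate([eps + y - Kb,             # d(-val)/d alpha
                           eps - y + Kb])            # d(-val)/d alpha~

res = minimize(neg_dual, np.zeros(2 * n), jac=neg_dual_grad,
               method="L-BFGS-B", bounds=[(0.0, C)] * (2 * n))
beta = res.x[n:] - res.x[:n]
# Thresholded count; an exact QP solver would identify the zero coefficients exactly.
print("support vectors:", int(np.sum(np.abs(beta) > 1e-6)), "of", n)
```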
68 The following simple lemma provides lower and upper bounds for the set of support vectors. [sent-182, score-0.1]
69 Lemma 3.1 Using the above notations we have {i : |fD,λ(x_i) − y_i| > ε} ⊂ {i : β_i^* ≠ 0} ⊂ {i : |fD,λ(x_i) − y_i| ≥ ε}. [sent-184, score-0.105]
70 Proof: Let us first prove the inclusion on the left hand side. [sent-185, score-0.037]
71 To this end, we begin by fixing an index i with fD,λ(x_i) − y_i > ε. [sent-186, score-0.047]
72 By fD,λ = f* and (14), we then find ξ_i^* > 0, and hence (18) implies α_i^* = C. [sent-187, score-0.037]
73 From (21) we conclude α̃_i^* = 0 and hence we have β_i^* = α̃_i^* − α_i^* = −C ≠ 0. [sent-188, score-0.051]
74 The case y_i − fD,λ(x_i) > ε can be shown analogously, and hence we obtain the first inclusion. [sent-189, score-0.072]
75 In order to show the second inclusion we fix an index i with β_i^* ≠ 0. [sent-190, score-0.026]
76 The KKT condition (16) together with fD,λ = f* implies fD,λ(x_i) − y_i − ε = ξ_i^*, and since ξ_i^* ≥ 0 we get fD,λ(x_i) − y_i ≥ ε. [sent-193, score-0.106]
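The two inclusions of Lemma 3.1 can also be checked numerically on a fitted model (a sketch under the same kind of illustrative assumptions as above; solvers satisfy the KKT conditions only up to a tolerance, so a small slack `tol` is used and tiny violation counts may occur).

```python
# Numerical check of Lemma 3.1:
#   {i : |f(x_i)-y_i| > eps}  subset  {i : beta_i^* != 0}  subset  {i : |f(x_i)-y_i| >= eps}.
# Data and hyperparameters are assumptions; `tol` absorbs solver inaccuracy.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
n = 1000
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
svm = SVR(kernel="rbf", C=1.0, epsilon=0.2).fit(X, y)

resid = np.abs(svm.predict(X) - y)
is_sv = np.zeros(n, dtype=bool)
is_sv[svm.support_] = True                 # indices i with beta_i^* != 0

tol = 1e-3
strictly_outside = resid > svm.epsilon + tol       # |f(x_i) - y_i| >  eps
on_or_outside = resid >= svm.epsilon - tol         # |f(x_i) - y_i| >= eps

print("outside tube but not SV:", int(np.sum(strictly_outside & ~is_sv)))  # expect 0
print("SV strictly inside tube:", int(np.sum(is_sv & ~on_or_outside)))     # expect 0
```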
77 2 Let (Ω, A, P) be a probability space and H be a separable Hilbert space. [sent-198, score-0.068]
78 Let η_1, . . . , η_n : Ω → H be independent random variables satisfying E_P η_i = 0 and ‖η_i‖_∞ ≤ 1 for all i = 1, . . . , n. [sent-202, score-0.041]
79 3 Let P be a probability measure on X × R and H be a separable RKHS with a bounded measurable kernel satisfying ‖k‖_∞ ≤ 1. [sent-210, score-0.064]
80 Let us now assume that we have a training set D ∈ (X × R)^n such that ‖fP,λ − fD,λ‖_H ≤ δ. (22) [sent-221, score-0.038]
81 Given a pair (x, y) ∈ A^δ_low(fP,λ), we then have ε + δ < |fP,λ(x) − y| ≤ |fD,λ(x) − y| + |fP,λ(x) − fD,λ(x)| ≤ |fD,λ(x) − y| + δ by the triangle inequality and ‖k‖_∞ ≤ 1 (which implies ‖·‖_∞ ≤ ‖·‖_H), and hence A^δ_low(fP,λ) ⊂ A_low(fD,λ). [sent-222, score-0.029]
82 Consequently, Lemma 3.1 yields #SV(fD,λ) ≥ #{i : |fD,λ(x_i) − y_i| > ε}. [sent-224, score-0.074]
83 In other words, we have #SV(fD,λ) ≥ #{i : |fP,λ(x_i) − y_i| > ε + δ} = Σ_{i=1}^n 1_{A^δ_low(fP,λ)}(x_i, y_i). [sent-225, score-0.094]
84 Combining this estimate with (22) we then obtain P^n( D ∈ (X × R)^n : #SV(fD,λ)/n ≥ (1/n) Σ_{i=1}^n 1_{A^δ_low(fP,λ)}(x_i, y_i) ) ≥ 1 − 3e^{−δ²λ²n/16}. [sent-226, score-0.112]
85 Hoeffding's inequality, see e.g. […1], shows P^n( D ∈ (X × R)^n : (1/n) Σ_{i=1}^n 1_{A^δ_low(fP,λ)}(x_i, y_i) > P(A^δ_low(fP,λ)) − ρ ) ≥ 1 − e^{−2ρ²n} for all ρ > 0 and n ≥ 1. [sent-230, score-0.101]
86 Lemma 3.1 then shows #SV(fD,λ) ≤ Σ_{i=1}^n 1_{A^δ_up(fP,λ)}(x_i, y_i), and hence (22) yields P^n( D ∈ (X × R)^n : #SV(fD,λ)/n ≤ (1/n) Σ_{i=1}^n 1_{A^δ_up(fP,λ)}(x_i, y_i) ) ≥ 1 − 3e^{−δ²λ²n/16}. [sent-234, score-0.119]
87 Using Hoeffding's inequality analogously to the proof of the first estimate we then obtain the second estimate. [sent-235, score-0.071]
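The concentration steps above use the following standard one-sided form of Hoeffding's inequality for i.i.d. random variables Z_1, . . . , Z_n with values in [0, 1] (stated here for readability; it matches the e^{−2ρ²n} terms):
\[
\mathrm{P}^n\Big(\tfrac{1}{n}\textstyle\sum_{i=1}^n Z_i \le \mathrm{E}\,Z_1 - \rho\Big) \;\le\; e^{-2\rho^2 n}, \qquad \rho > 0,
\]
applied to Z_i = 1_{A^δ_low(fP,λ)}(x_i, y_i) and, for the second estimate, to Z_i = 1 − 1_{A^δ_up(fP,λ)}(x_i, y_i).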
88 Let us show that ∪_{δ>0} A^δ_low(fP,λ) = A_low(fP,λ). (23) [sent-238, score-0.065]
89 From (23) we now conclude lim_{δ→0} P(A^δ_low(fP,λ)) = P(A_low(fP,λ)). (24) [sent-243, score-0.108]
90 In addition, we have A^δ′_up(fP,λ) ⊂ A^δ_up(fP,λ) for 0 ≤ δ′ ≤ δ, and it is easy to check that ∩_{δ>0} A^δ_up(fP,λ) = A_up(fP,λ). (25) [sent-244, score-0.04]
91 From (25) we then conclude lim_{δ→0} P(A^δ_up(fP,λ)) = P(A_up(fP,λ)). (26) [sent-249, score-0.081]
92 Proof of Lemma 2.3: Since the loss function L_ε is Lipschitz continuous and convex in t, it is easy to verify that t ↦ C_{L_ε,P(·|x)}(t) is Lipschitz continuous and convex for PX-almost all x ∈ X, and hence M*(x) is a closed interval. [sent-254, score-0.055]
93 Let us fix such an x, a B > 0, and a sequence (t_n) ⊂ R with t_n → −∞. [sent-257, score-0.067]
94 Furthermore, there exists an M > 0 with P([−M, M ] | x) ≥ 1/2, and since tn → −∞ there further exists an n0 ≥ 1 such that tn ≤ −M −r0 for all n ≥ n0 . [sent-259, score-0.162]
95 For y ∈ [−M, M] we thus have y − t_n ≥ r_0, and hence we finally find C_{L_ε,P(·|x)}(t_n) ≥ ∫_{[−M,M]} L_ε(y, t_n) dP(y|x) ≥ B for all n ≥ n_0. [sent-260, score-0.06]
96 4 Let P be a probability measure on X × R and H be a separable RKHS with a bounded measurable kernel satisfying ‖k‖_∞ ≤ 1. [sent-265, score-0.241]
97 Proof: Since H is dense in L_1(PX) we have inf_{f∈H} R_{L_ε,P}(f) = R*_{L_ε,P} by [9, Theorem 3], and hence lim_{λ→0} R_{L_ε,P}(fP,λ) = R*_{L_ε,P}. [sent-268, score-0.086]
98 5 Let P be a probability measure on X × R and H be a separable RKHS with a bounded measurable kernel satisfying ‖k‖_∞ ≤ 1. [sent-272, score-0.241]
99 Proof: We write t*(x) for the real number defined by (6) for f(x) := fP,λ(x). [sent-275, score-0.044]
100 Proof of Theorem 2.4: Analogously to the proofs of (24) and (26), we find lim_{δ→0} P(M^δ_low(fP,λ)) = P(M_low(fP,λ)) and lim_{δ→0} P(M^δ_up(fP,λ)) = P(M_up(fP,λ)). Combining these equations with Theorem 2. [sent-289, score-0.11]
wordName wordTfidf (topN-words)
[('fp', 0.826), ('fd', 0.338), ('mlow', 0.191), ('mup', 0.167), ('px', 0.153), ('sv', 0.1), ('alow', 0.095), ('aup', 0.095), ('cl', 0.092), ('rl', 0.091), ('measurable', 0.064), ('rkhs', 0.061), ('lemma', 0.059), ('separable', 0.056), ('tn', 0.056), ('lim', 0.055), ('corollary', 0.052), ('pn', 0.051), ('yi', 0.047), ('satisfying', 0.041), ('conditional', 0.039), ('inf', 0.038), ('median', 0.037), ('xi', 0.036), ('theorem', 0.035), ('steinwart', 0.032), ('low', 0.027), ('conclude', 0.026), ('moreover', 0.026), ('inclusion', 0.026), ('hence', 0.025), ('exists', 0.025), ('modulo', 0.025), ('kernel', 0.025), ('support', 0.025), ('svm', 0.024), ('bayreuth', 0.024), ('christmann', 0.024), ('ingo', 0.024), ('dense', 0.023), ('measure', 0.023), ('hoeffding', 0.022), ('analogously', 0.022), ('proof', 0.021), ('inequalities', 0.021), ('assertion', 0.021), ('dpx', 0.021), ('alamos', 0.021), ('let', 0.02), ('bounded', 0.02), ('decision', 0.02), ('surely', 0.02), ('super', 0.019), ('ep', 0.019), ('chapter', 0.018), ('vapnik', 0.018), ('write', 0.017), ('inequality', 0.017), ('svms', 0.017), ('paid', 0.017), ('ipping', 0.017), ('loss', 0.017), ('sparsity', 0.016), ('obviously', 0.016), ('distributions', 0.016), ('bounds', 0.016), ('lebesgue', 0.015), ('los', 0.015), ('situation', 0.015), ('asymptotically', 0.015), ('densities', 0.015), ('proposition', 0.014), ('tight', 0.014), ('hilbert', 0.014), ('end', 0.014), ('editors', 0.013), ('around', 0.013), ('obvious', 0.013), ('easy', 0.013), ('lipschitz', 0.012), ('york', 0.012), ('springer', 0.012), ('formulate', 0.012), ('implies', 0.012), ('unique', 0.012), ('vectors', 0.012), ('probability', 0.012), ('consequently', 0.012), ('assumptions', 0.012), ('combining', 0.011), ('furthermore', 0.011), ('notations', 0.011), ('us', 0.011), ('conversely', 0.011), ('estimate', 0.011), ('primal', 0.01), ('literal', 0.01), ('uous', 0.01), ('sym', 0.01), ('observe', 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss
Author: Ingo Steinwart, Andreas Christmann
Abstract: In this paper lower and upper bounds for the number of support vectors are derived for support vector machines (SVMs) based on the ε-insensitive loss function. It turns out that these bounds are asymptotically tight under mild assumptions on the data generating distribution. Finally, we briefly discuss a trade-off between sparsity and accuracy if the SVM is used to estimate the conditional median. 1
2 0.26790547 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
3 0.10834466 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos
Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1
4 0.070813209 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva
Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.
5 0.064132884 225 nips-2008-Supervised Bipartite Graph Inference
Author: Yoshihiro Yamanishi
Abstract: We formulate the problem of bipartite graph inference as a supervised learning problem, and propose a new method to solve it from the viewpoint of distance metric learning. The method involves the learning of two mappings of the heterogeneous objects to a unified Euclidean space representing the network topology of the bipartite graph, where the graph is easy to infer. The algorithm can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of compound-protein interaction network reconstruction from chemical structure data and genomic sequence data. 1
6 0.062259793 153 nips-2008-Nonlinear causal discovery with additive noise models
7 0.045683548 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
8 0.039725289 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
9 0.039586421 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
10 0.039211929 46 nips-2008-Characterizing response behavior in multisensory perception with conflicting cues
11 0.036264449 228 nips-2008-Support Vector Machines with a Reject Option
12 0.034929968 44 nips-2008-Characteristic Kernels on Groups and Semigroups
13 0.034534127 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
14 0.033859756 202 nips-2008-Robust Regression and Lasso
15 0.033820756 113 nips-2008-Kernelized Sorting
16 0.032913506 138 nips-2008-Modeling human function learning with Gaussian processes
17 0.032240883 59 nips-2008-Dependent Dirichlet Process Spike Sorting
18 0.031962451 170 nips-2008-Online Optimization in X-Armed Bandits
19 0.030969627 78 nips-2008-Exact Convex Confidence-Weighted Learning
20 0.029855713 178 nips-2008-Performance analysis for L\ 2 kernel classification
topicId topicWeight
[(0, -0.091), (1, 0.011), (2, -0.079), (3, 0.037), (4, -0.025), (5, 0.032), (6, 0.02), (7, -0.005), (8, 0.003), (9, 0.029), (10, 0.056), (11, 0.026), (12, 0.072), (13, -0.06), (14, -0.059), (15, 0.08), (16, 0.063), (17, -0.034), (18, -0.024), (19, 0.047), (20, -0.065), (21, 0.056), (22, 0.092), (23, -0.122), (24, 0.148), (25, 0.147), (26, 0.035), (27, -0.094), (28, -0.025), (29, -0.087), (30, 0.113), (31, 0.001), (32, 0.063), (33, 0.195), (34, -0.058), (35, -0.01), (36, 0.055), (37, -0.039), (38, 0.001), (39, 0.123), (40, 0.083), (41, -0.077), (42, -0.139), (43, -0.153), (44, -0.061), (45, -0.192), (46, 0.149), (47, -0.03), (48, -0.173), (49, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.95870572 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss
Author: Ingo Steinwart, Andreas Christmann
Abstract: In this paper lower and upper bounds for the number of support vectors are derived for support vector machines (SVMs) based on the ε-insensitive loss function. It turns out that these bounds are asymptotically tight under mild assumptions on the data generating distribution. Finally, we briefly discuss a trade-off between sparsity and accuracy if the SVM is used to estimate the conditional median. 1
2 0.68001324 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
3 0.47768012 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva
Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.
4 0.34411702 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates
Author: Richard Nock, Frank Nielsen
Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1
5 0.30849698 15 nips-2008-Adaptive Martingale Boosting
Author: Phil Long, Rocco Servedio
Abstract: In recent work Long and Servedio [LS05] presented a “martingale boosting” algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e. it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. We present an adaptive variant of the martingale boosting algorithm. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original [LS05] algorithm, such as random classification noise tolerance, and has other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidencerated weak hypotheses that output real values rather than Boolean predictions.
7 0.27763844 44 nips-2008-Characteristic Kernels on Groups and Semigroups
8 0.27560711 153 nips-2008-Nonlinear causal discovery with additive noise models
9 0.27198297 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
10 0.26674762 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
11 0.26376319 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
12 0.25123483 228 nips-2008-Support Vector Machines with a Reject Option
13 0.24774954 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
14 0.23430991 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
15 0.22738676 40 nips-2008-Bounds on marginal probability distributions
16 0.22212785 85 nips-2008-Fast Rates for Regularized Objectives
17 0.22092216 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints
18 0.20898551 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
19 0.20879191 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
20 0.20215979 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
topicId topicWeight
[(6, 0.068), (7, 0.059), (12, 0.058), (19, 0.324), (21, 0.024), (28, 0.151), (57, 0.03), (59, 0.027), (63, 0.02), (71, 0.026), (77, 0.046), (83, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.78494906 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss
Author: Ingo Steinwart, Andreas Christmann
Abstract: In this paper lower and upper bounds for the number of support vectors are derived for support vector machines (SVMs) based on the ε-insensitive loss function. It turns out that these bounds are asymptotically tight under mild assumptions on the data generating distribution. Finally, we briefly discuss a trade-off between sparsity and accuracy if the SVM is used to estimate the conditional median. 1
2 0.66105431 20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning
Author: Zenglin Xu, Rong Jin, Irwin King, Michael Lyu
Abstract: We consider the problem of multiple kernel learning (MKL), which can be formulated as a convex-concave problem. In the past, two efficient methods, i.e., Semi-Infinite Linear Programming (SILP) and Subgradient Descent (SD), have been proposed for large-scale multiple kernel learning. Despite their success, both methods have their own shortcomings: (a) the SD method utilizes the gradient of only the current solution, and (b) the SILP method does not regularize the approximate solution obtained from the cutting plane model. In this work, we extend the level method, which was originally designed for optimizing non-smooth objective functions, to convex-concave optimization, and apply it to multiple kernel learning. The extended level method overcomes the drawbacks of SILP and SD by exploiting all the gradients computed in past iterations and by regularizing the solution via a projection to a level set. Empirical study with eight UCI datasets shows that the extended level method can significantly improve efficiency by saving on average 91.9% of computational time over the SILP method and 70.3% over the SD method. 1
3 0.64130867 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
Author: Steven J. Phillips, Miroslav Dudík
Abstract: We apply robust Bayesian decision theory to improve both generative and discriminative learners under bias in class proportions in labeled training data, when the true class proportions are unknown. For the generative case, we derive an entropybased weighting that maximizes expected log likelihood under the worst-case true class proportions. For the discriminative case, we derive a multinomial logistic model that minimizes worst-case conditional log loss. We apply our theory to the modeling of species geographic distributions from presence data, an extreme case of labeling bias since there is no absence data. On a benchmark dataset, we find that entropy-based weighting offers an improvement over constant estimates of class proportions, consistently reducing log loss on unbiased test data. 1
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar
Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of study the performance of the ℓ1 -regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1 -regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d2 log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.
5 0.49966377 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
Author: Bernhard Nessler, Michael Pfeiffer, Wolfgang Maass
Abstract: Uncertainty is omnipresent when we perceive or interact with our environment, and the Bayesian framework provides computational methods for dealing with it. Mathematical models for Bayesian decision making typically require datastructures that are hard to implement in neural networks. This article shows that even the simplest and experimentally best supported type of synaptic plasticity, Hebbian learning, in combination with a sparse, redundant neural code, can in principle learn to infer optimal Bayesian decisions. We present a concrete Hebbian learning rule operating on log-probability ratios. Modulated by reward-signals, this Hebbian plasticity rule also provides a new perspective for understanding how Bayesian inference could support fast reinforcement learning in the brain. In particular we show that recent experimental results by Yang and Shadlen [1] on reinforcement learning of probabilistic inference in primates can be modeled in this way. 1
6 0.49951112 202 nips-2008-Robust Regression and Lasso
7 0.497798 195 nips-2008-Regularized Policy Iteration
8 0.49777094 85 nips-2008-Fast Rates for Regularized Objectives
9 0.49671867 179 nips-2008-Phase transitions for high-dimensional joint support recovery
10 0.49669579 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
11 0.49654824 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
12 0.49651101 196 nips-2008-Relative Margin Machines
13 0.49584323 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
14 0.49583343 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
15 0.49554351 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
16 0.49507219 48 nips-2008-Clustering via LP-based Stabilities
17 0.494605 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
18 0.49451771 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
19 0.49429628 62 nips-2008-Differentiable Sparse Coding
20 0.49383408 106 nips-2008-Inferring rankings under constrained sensing