nips nips2004 nips2004-138 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sham M. Kakade, Andrew Y. Ng
Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. [sent-4, score-0.114]
2 For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. [sent-5, score-0.858]
3 We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. [sent-6, score-0.259]
4 1 Introduction The last decade has seen significant progress in online learning algorithms that perform well even in adversarial settings (e. [sent-7, score-0.189]
5 In the online learning framework, one makes minimal assumptions on the data presented to the learner, and the goal is to obtain good (relative) performance on arbitrary sequences. [sent-11, score-0.122]
6 Our motivation is similar to that given in the online learning literature and the MDL literature (see Grunwald, 2005) —namely, that models are often chosen to balance realism with computational tractability, and often assumptions made by the Bayesian are not truly believed to hold (e. [sent-14, score-0.18]
7 We consider the widely used class of generalized linear models—focusing on Gaussian linear regression and logistic regression—and provide relative performance bounds (comparing to the best model in our model class) when the cost function is the logloss. [sent-21, score-0.499]
8 Though the regression problem has been studied in a competitive framework and, indeed, many ingenious algorithms have been devised for it (e. [sent-22, score-0.327]
9 Our bounds for linear regression are comparable to the best bounds in the literature (though we use the logloss as opposed to the square loss). [sent-25, score-0.651]
10 The competitive approach to regression started with Foster (1991), who provided competitive bounds for a variant of the ridge regression algorithm (under the square loss). [sent-26, score-0.725]
11 Vovk (2001) presents many competitive algorithms and provides bounds for linear regression (under the square loss) with an algorithm that differs slightly from the Bayesian one. [sent-27, score-0.493]
12 Azoury and Warmuth (2001) rederive Vovk’s bound with a different analysis based on Bregman distances. [sent-28, score-0.105]
13 We should also note that when the loss function is the logloss, multiplicative weights algorithms are sometimes identical to Bayes rule with particular choices of the parameters (see Freund and Schapire, 1999) . [sent-32, score-0.18]
14 Furthermore, Bayesian algorithms have been used in some online learning settings, such as the sleeping experts setting of Freund et al. [sent-33, score-0.155]
15 (1997) and the online boolean prediction setting of Cesa-Bianchi et al. [sent-34, score-0.194]
16 To our knowledge, there have been no studies of Bayesian generalized linear models in an adversarial online learning setting (though many variants have been considered as discussed above). [sent-37, score-0.242]
17 We also examine maximum a posteriori (MAP) algorithms for both Gaussian linear regression (i. [sent-38, score-0.282]
18 These algorithms are often used in practice, particularly in logistic regression where Bayesian model averaging is computationally expensive, but the MAP algorithm requires only solving a convex problem. [sent-41, score-0.372]
19 As expected, MAP algorithms are somewhat less competitive than full Bayesian model averaging, though not unreasonably so. [sent-42, score-0.177]
20 2 Bayesian Model Averaging We now consider the Bayesian model averaging (BMA) algorithm and give a bound on its worst-case online loss. [sent-43, score-0.246]
21 In logistic regression, we would have log p(y|x, θ) = y log [1/(1 + exp(−θᵀx))] + (1 − y) log [1 − 1/(1 + exp(−θᵀx))], (2) where we assume y ∈ {0, 1}. [sent-49, score-0.728]
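As a quick numeric illustration of the logloss in Equation 2, here is a minimal sketch; the function name and the inputs are ours, purely illustrative:

```python
import math

def logistic_logloss(y, x, theta):
    """Negative log-likelihood -log p(y | x, theta) for logistic regression,
    with p(y = 1 | x, theta) = 1 / (1 + exp(-theta^T x)) and y in {0, 1}."""
    z = sum(t * xi for t, xi in zip(theta, x))   # theta^T x
    p = 1.0 / (1.0 + math.exp(-z))               # predicted probability of y = 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# With theta^T x = 0 the model predicts p = 1/2, so the loss is log 2 for either label.
loss = logistic_logloss(1, [1.0, -1.0], [0.5, 0.5])
```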
22 Also, let p_t(θ) = p(θ|S_t) = [∏_{i=1}^t p(y^(i)|x^(i), θ)] p(θ) / ∫_θ [∏_{i=1}^t p(y^(i)|x^(i), θ)] p(θ) dθ be the posterior distribution over θ given the first t training examples. [sent-58, score-0.241]
23 We are then given the true label y^(t), and we suffer logloss −log p(y^(t)|x^(t), S_{t−1}). [sent-61, score-0.392]
24 We define the cumulative loss of the BMA algorithm after T rounds to be L_BMA(S) = ∑_{t=1}^T −log p(y^(t)|x^(t), S_{t−1}). [sent-62, score-0.331]
25 We are interested in comparing against the loss of any “expert” that uses some fixed parameters θ ∈ Rn . [sent-65, score-0.123]
26 Define ℓ_θ(t) = −log p(y^(t)|x^(t), θ), and let L_θ(S) = ∑_{t=1}^T ℓ_θ(t) = ∑_{t=1}^T −log p(y^(t)|x^(t), θ). [sent-66, score-0.445]
27 Given a distribution Q over θ, define ℓ_Q(t) = −∫_θ Q(θ) log p(y^(t)|x^(t), θ) dθ, and L_Q(S) = ∑_{t=1}^T ℓ_Q(t) = ∫_θ Q(θ) L_θ(S) dθ. [sent-68, score-0.208]
28 This is the expected logloss incurred by a procedure that first samples some θ ∼ Q and then uses this θ for all its predictions. [sent-69, score-0.21]
29 Note that the expectation is of the logloss, which is a different type of averaging than in BMA, which had the expectation and the log in the reverse order. [sent-71, score-0.294]
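The ordering between these two kinds of averaging follows from Jensen's inequality: taking the log of the averaged probability (as BMA does) never costs more than averaging the loglosses. A tiny numeric sketch; the two-model mixture below is our own illustrative example:

```python
import math

# Two candidate models assign these probabilities to the observed label:
probs = [0.9, 0.1]
weights = [0.5, 0.5]  # mixture weights playing the role of Q

# BMA-style loss: negative log of the averaged probability.
bma_loss = -math.log(sum(w * p for w, p in zip(weights, probs)))

# Expected logloss: the average of the individual loglosses (L_Q's per-round term).
expected_logloss = sum(w * (-math.log(p)) for w, p in zip(weights, probs))

# By Jensen's inequality, mixing before taking the log can only help:
assert bma_loss <= expected_logloss
```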
30 1 A Useful Variational Bound The following lemma provides a worst-case bound on the loss incurred by Bayesian algorithms and will be useful for deriving our main result in the next section. [sent-73, score-0.361]
31 As usual, define KL(q||p) = ∫_θ q(θ) log (q(θ)/p(θ)) dθ. [sent-77, score-0.208]
32 The chain rule of conditional probabilities implies that LBMA (S) = − log p(Y |X) and Lθ (S) = − log p(Y |X, θ). [sent-88, score-0.416]
33 So L_BMA(S) − L_Q(S) = −log p(Y|X) + ∫_θ Q(θ) log p(Y|X, θ) dθ = ∫_θ Q(θ) log [p(Y|X, θ)/p(Y|X)] dθ. By Bayes rule, we have that p_T(θ) = p(Y|X, θ) p_0(θ) / p(Y|X). [sent-89, score-0.624]
34 Continuing, L_BMA(S) − L_Q(S) = ∫_θ Q(θ) log [p_T(θ)/p_0(θ)] dθ = ∫_θ Q(θ) log [Q(θ)/p_0(θ)] dθ − ∫_θ Q(θ) log [Q(θ)/p_T(θ)] dθ = KL(Q||p_0) − KL(Q||p_T). [sent-90, score-0.416]
35 Note that for linear regression (as defined in Equation 1), we have that for all y, |f''_y(z)| = 1/σ², (3) and for logistic regression (as defined in Equation 2), we have that for y ∈ {0, 1}, |f''_y(z)| ≤ 1/4. [sent-96, score-0.512]
36 Then for all θ*, L_BMA(S) ≤ L_{θ*}(S) + (1/(2ν²)) ||θ*||² + (n/2) log(1 + Tcν²/n). (4) The ||θ*||²/(2ν²) term can be interpreted as a penalty term from our prior. [sent-100, score-0.268]
37 The log term reflects how fast our loss could grow in comparison to the best θ*. [sent-101, score-0.361]
38 Importantly, this extra loss is only logarithmic in T in this adversarial setting. [sent-102, score-0.181]
39 This bound is almost identical to those provided by Vovk (2001), Azoury and Warmuth (2001), and Foster (1991) for the linear regression case (under the square loss); the only difference is that in their bounds, the last term is multiplied by an upper bound on y^(t). [sent-103, score-0.526]
40 In contrast, we require no bound on y^(t) in the Gaussian linear regression case, due to the fact that we deal with the logloss (also recall |f''_y(z)| = 1/σ² for all y). [sent-104, score-0.511]
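To make the Gaussian case concrete, the sketch below runs BMA for Gaussian linear regression online and checks the bound of Equation 4 numerically. The synthetic data and all constants are our own choices, and the update formulas are the standard Bayesian linear-regression posterior updates rather than anything quoted from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 200
sigma2, nu2 = 1.0, 1.0          # noise variance sigma^2 and prior variance nu^2 (illustrative)
c = 1.0 / sigma2                # curvature bound |f''_y(z)| = 1/sigma^2 from Equation 3

X = rng.normal(size=(T, n))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce ||x^(t)|| <= 1
theta_star = np.array([1.0, -0.5, 0.25])
Y = X @ theta_star + rng.normal(scale=np.sqrt(sigma2), size=T)

def gauss_logloss(y, mean, var):
    """-log N(y; mean, var)."""
    return 0.5 * (y - mean) ** 2 / var + 0.5 * np.log(2.0 * np.pi * var)

A = np.eye(n) / nu2             # posterior precision, initialised at the prior's
b = np.zeros(n)
L_bma = 0.0
for x, y in zip(X, Y):
    Sigma = np.linalg.inv(A)                 # posterior covariance
    theta_hat = Sigma @ b                    # posterior mean
    s2 = sigma2 + x @ Sigma @ x              # predictive variance s_t^2
    L_bma += gauss_logloss(y, theta_hat @ x, s2)
    A += np.outer(x, x) / sigma2             # absorb the new example
    b += y * x / sigma2

L_star = float(np.sum(gauss_logloss(Y, X @ theta_star, sigma2)))
bound = (L_star + theta_star @ theta_star / (2 * nu2)
         + 0.5 * n * np.log(1 + T * c * nu2 / n))
assert L_bma <= bound           # the guarantee of Equation 4
```

Since the bound holds for arbitrary (even adversarial) sequences, the assertion passes for any choice of data here.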
41 Take Q = N(θ*, ε²I_n). Letting H(Q) = (n/2) log(2πeε²) be the entropy of Q, we have KL(Q||p_0) = −∫_θ Q(θ) log [ (1/((2π)^{n/2} |ν²I_n|^{1/2})) exp(−θᵀθ/(2ν²)) ] dθ − H(Q) = n log ν + (1/(2ν²)) ∫_θ Q(θ) θᵀθ dθ − n/2 − n log ε = n log ν + (1/(2ν²)) (||θ*||² + nε²) − n/2 − n log ε. [sent-108, score-1.069]
42 By taking a Taylor expansion of f_y (assume y ∈ S), we have that f_y(z) = f_y(z*) + f'_y(z*)(z − z*) + f''_y(ξ(z)) (z − z*)²/2, for some appropriate function ξ. [sent-110, score-2.53]
43 Thus, if z is a random variable with mean z*, we have E_z[f_y(z)] = f_y(z*) + f'_y(z*) · 0 + E_z[f''_y(ξ(z)) (z − z*)²/2] ≤ f_y(z*) + c E_z[(z − z*)²]/2 = f_y(z*) + (c/2) Var(z). Now consider a single example (x, y). [sent-111, score-2.738]
44 Thus, we have E_{θ∼Q}[f_y(θᵀx)] ≤ f_y(θ*ᵀx) + cε²/2. Since ℓ_Q(t) = E_{θ∼Q}[f_{y^(t)}(θᵀx^(t))] and ℓ_{θ*}(t) = f_{y^(t)}(θ*ᵀx^(t)), we can sum both sides from t = 1 to T to obtain L_Q(S) ≤ L_{θ*}(S) + Tcε²/2. [sent-115, score-0.534]
45 Putting this together with Lemma 2.1 and Equation 5, we find that L_BMA(S) ≤ L_{θ*}(S) + Tcε²/2 + n log ν + (1/(2ν²))(||θ*||² + nε²) − n/2 − n log ε. [sent-116, score-0.416]
46 3 A Lower Bound for Gaussian Linear Regression The following lower bound shows that, for linear regression, no other prediction scheme is better than Bayes in the worst case (when our penalty term is ||θ∗ ||2 ). [sent-120, score-0.26]
47 Here, we compare to an arbitrary predictive distribution q(y|x^(t), S_{t−1}) for prediction at time t, which suffers an instant loss ℓ_q(t) = −log q(y^(t)|x^(t), S_{t−1}). [sent-121, score-0.42]
48 3: Let Lθ∗ (S) be the loss under the Gaussian linear regression model using the parameter θ∗ , and let ν 2 = σ 2 = 1. [sent-124, score-0.374]
49 By the chain rule of conditional probabilities, L_BMA(S) = −log p(Y|X) (where p is the Gaussian linear regression model), and q's loss is ∑_{t=1}^T ℓ_q(t) = −log q(Y|X). [sent-132, score-0.761]
50 For any predictive distribution q that differs from p, there must exist some sequence S such that − log q(Y |X) is greater than − log p(Y |X) (since probabilities are normalized). [sent-133, score-0.515]
51 3 MAP Estimation We now present bounds for MAP algorithms for both Gaussian linear regression (i. [sent-137, score-0.345]
52 These algorithms use the maximum θ̂_{t−1} of p_{t−1}(θ) to form their predictive distribution p(y|x^(t), θ̂_{t−1}) at time t, as opposed to BMA's predictive distribution of p(y|x^(t), S_{t−1}). [sent-140, score-0.116]
53 As expected these bounds are weaker than BMA, though perhaps not unreasonably so. [sent-141, score-0.154]
54 1 The Square Loss and Ridge Regression Before we provide the MAP bound, let us first present the form of the posteriors and the predictions for Gaussian linear regression, where A_t = (1/ν²)I_n + (1/σ²)∑_{i=1}^t x^(i) x^(i)ᵀ and b_t = (1/σ²)∑_{i=1}^t y^(i) x^(i). [sent-143, score-0.101]
55 We now have that p_t(θ) = p(θ|S_t) = N(θ; θ̂_t, Σ̂_t), (6) where θ̂_t = A_t^{−1} b_t, and Σ̂_t = A_t^{−1}. [sent-145, score-0.206]
56 In contrast, the prediction of a fixed expert using parameter θ* would be p(y^(t)|x^(t), θ*) = N(y^(t); y*_t, σ²), (8) where y*_t = θ*ᵀx^(t). [sent-147, score-0.198]
57 Now the BMA loss is: L_BMA(S) = ∑_{t=1}^T [ (1/(2s_t²)) (y^(t) − θ̂ᵀ_{t−1} x^(t))² + log √(2πs_t²) ]. (9) Importantly, note how Bayes is adaptively weighting the squared term with the inverse variances 1/s_t², where s_t² = σ² + x^(t)ᵀΣ̂_{t−1}x^(t) is the predictive variance (which depends on the current observation x^(t)). [sent-148, score-0.361]
58 The logloss of using a fixed expert θ* is just: L_{θ*}(S) = ∑_{t=1}^T [ (1/(2σ²)) (y^(t) − θ*ᵀx^(t))² + log √(2πσ²) ]. (10) The MAP procedure (referred to as ridge regression) uses p(y|x^(t), θ̂_{t−1}), which has a fixed variance. [sent-149, score-0.521]
59 Hence, the MAP loss is essentially the square loss, and we define it as such: L_MAP(S) = (1/2) ∑_{t=1}^T (y^(t) − θ̂ᵀ_{t−1} x^(t))², L_{θ*}(S) = (1/2) ∑_{t=1}^T (y^(t) − θ*ᵀx^(t))², (11) where θ̂_t is the MAP estimate (see Equation 6). [sent-150, score-0.288]
60 For all S such that ||x^(t)|| ≤ 1 and for all θ*, we have L_MAP(S) ≤ γ² L_{θ*}(S) + (γ²σ²/(2ν²)) ||θ*||² + (γ²σ²n/2) log(1 + Tν²/(σ²n)), where γ² = (σ² + ν²)/σ². [sent-153, score-0.208]
61 Proof: Using Equations (9, 10) and Theorem 2.2, we have ∑_{t=1}^T (1/(2s_t²)) (y^(t) − θ̂ᵀ_{t−1} x^(t))² ≤ ∑_{t=1}^T (1/(2σ²)) (y^(t) − θ*ᵀx^(t))² + (1/(2ν²)) ||θ*||² + (n/2) log(1 + Tcν²/n) + ∑_{t=1}^T log(√(2πσ²)/√(2πs_t²)). Equations (6, 7) imply that σ² ≤ s_t² ≤ σ² + ν². [sent-154, score-0.452]
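The MAP guarantee above can also be checked numerically. The sketch below is our own: the constants and data are illustrative, the recursive A_t, b_t updates are the standard online ridge-regression ones, and the bound tested is L_MAP(S) ≤ γ²L_{θ*}(S) + (γ²σ²/(2ν²))||θ*||² + (γ²σ²n/2) log(1 + Tν²/(σ²n)) with γ² = (σ² + ν²)/σ², our reading of the theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 2, 100
sigma2, nu2 = 1.0, 1.0              # noise and prior variances (illustrative)
gamma2 = (sigma2 + nu2) / sigma2    # leading coefficient gamma^2

X = rng.normal(size=(T, n))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # ||x^(t)|| <= 1
theta_star = np.array([0.8, -0.3])
Y = X @ theta_star + rng.normal(scale=np.sqrt(sigma2), size=T)

A = np.eye(n) / nu2                 # A_t: posterior precision, recursively updated
b = np.zeros(n)
L_map = 0.0
for x, y in zip(X, Y):
    theta_hat = np.linalg.solve(A, b)        # ridge-regression (MAP) estimate on S_{t-1}
    L_map += 0.5 * (y - theta_hat @ x) ** 2  # square-loss definition of L_MAP (Equation 11)
    A += np.outer(x, x) / sigma2
    b += y * x / sigma2

L_star = 0.5 * float(np.sum((Y - X @ theta_star) ** 2))
bound = (gamma2 * L_star
         + gamma2 * sigma2 / (2 * nu2) * float(theta_star @ theta_star)
         + gamma2 * sigma2 * n / 2 * np.log(1 + T * nu2 / (sigma2 * n)))
assert L_map <= bound
```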
62 We might have hoped that MAP were more competitive, in that the leading coefficient γ² in front of the L_{θ*}(S) term in the bound would be 1 (similar to Theorem 2.2). [sent-156, score-0.124]
63 Some previous (non-Bayesian) algorithms did in fact have bounds with this coefficient being unity. [sent-159, score-0.123]
64 Vovk (2001) provides such an algorithm, though this algorithm differs from MAP in that its predictions at time t are a nonlinear function of x^(t) (it uses A_t instead of A_{t−1} at time t). [sent-160, score-0.098]
65 Foster (1991) provides a bound with this coefficient being 1 with more restrictive assumptions. [sent-161, score-0.105]
66 Azoury and Warmuth (2001) also provide a bound with a coefficient of 1 by using a MAP procedure with “clipping. [sent-162, score-0.105]
67 ” (Their algorithm thresholds the prediction ŷ_t = θ̂ᵀ_{t−1} x^(t) if it is larger than some upper bound. [sent-163, score-0.119]
68 Note that we do not assume any upper bound on y^(t). [sent-164, score-0.127]
69 ) As the following lower bound shows, it is not possible for the MAP linear regression algorithm to have a coefficient of 1 for Lθ∗ (S), with a reasonable regret bound. [sent-165, score-0.327]
70 A similar lower bound is in Vovk (2001), though it does not apply to our setting, where we have the additional constraint ||x^(t)|| ≤ 1. [sent-166, score-0.15]
71 2 Logistic Regression MAP estimation is often used for regularized logistic regression, since it requires only solving a convex program (while BMA has to deal with a high dimensional integral over θ that is intractable to compute exactly). [sent-172, score-0.127]
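As a sketch of this convex MAP problem, here is a minimal gradient-descent fit of the regularized logistic objective −∑ log p(y|x, θ) + ||θ||²/(2ν²). The dataset, step size, and iteration count are our own illustrative choices; a real implementation would use Newton's method or an off-the-shelf convex solver:

```python
import math

# Made-up tiny dataset: each example is (x, y) with a bias feature, y in {0, 1}.
data = [([1.0, 0.2], 1), ([1.0, -0.4], 0), ([1.0, 0.9], 1), ([1.0, -0.7], 0)]
nu2 = 1.0                     # prior variance; the penalty term is ||theta||^2 / (2 nu2)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = [0.0, 0.0]
lr = 0.1
for _ in range(2000):         # plain gradient descent on the convex MAP objective
    grad = [t / nu2 for t in theta]                 # gradient of the quadratic penalty
    for x, y in data:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for i, xi in enumerate(x):
            grad[i] += (p - y) * xi                 # gradient of -log p(y | x, theta)
    theta = [t - lr * g for t, g in zip(theta, grad)]
```

Because the objective is strongly convex, the iterate converges to the unique MAP estimate; at convergence the gradient above is (numerically) zero.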
72 Letting θ̂_{t−1} be the maximum of the posterior p_{t−1}(θ), define L_MAP(S) = ∑_{t=1}^T −log p(y^(t)|x^(t), θ̂_{t−1}). [sent-173, score-0.237]
73 As with the square loss case, the bound we present is multiplicatively worse (by a factor of 4). [sent-174, score-0.27]
74 Assuming ν ≤ 0.5, we have that for all sequences S such that ||x^(t)|| ≤ 1 and y^(t) ∈ {0, 1} and for all θ*, L_MAP(S) ≤ 4L_{θ*}(S) + (2/ν²)||θ*||² + 2n log(1 + Tν²/n). Proof: (sketch) Assume n = 1 (the general case is analogous). [sent-177, score-0.208]
75 The proof consists of showing that ℓ_{θ̂_{t−1}}(t) = −log p(y^(t)|x^(t), θ̂_{t−1}) ≤ 4 ℓ_BMA(t). [sent-178, score-0.262]
76 Without loss of generality, assume y^(t) = 1 and x^(t) ≥ 0, and for convenience, we just write x instead of x^(t). [sent-179, score-0.123]
77 Now the BMA prediction is ∫_θ p(1|θ, x) p_{t−1}(θ) dθ, and ℓ_BMA(t) is the negative log of this. [sent-180, score-0.257]
78 Note θ = ∞ gives probability 1 for y (t) = 1 (and this setting of θ minimizes the loss at time t). [sent-181, score-0.147]
79 Define p_q = ∫_θ p(1|θ, x) q(θ) dθ, which can be viewed as the prediction made using q rather than the posterior. [sent-183, score-0.37]
80 With this choice of q (which has positive probability only for θ ≥ θ̂_{t−1}), we first show that the loss of q, namely −log p_q, is less than or equal to ℓ_BMA(t). [sent-185, score-0.652]
81 Then we complete the proof by showing that ℓ_{θ̂_{t−1}}(t) ≤ −4 log p_q, which suffices since −log p_q ≤ ℓ_BMA(t). [sent-186, score-1.112]
82 ˆ Consider the q which maximizes pq subject to the following constraints: let q(θ) have its ˆ ˆ ˆ maximum at θt−1 ; let q(θ) = 0 if θ < θt−1 (intuitively, mass to the left of θt−1 is just 2 making the pq smaller); and impose the constraint that −(log q(θ)) ≥ 1/ν . [sent-187, score-0.742]
83 We now argue that for such a q, −log p_q ≤ ℓ_BMA(t). [sent-188, score-0.529]
84 Now if this posterior p_{t−1} were rectified (with support only for θ ≥ θ̂_{t−1}) and renormalized, then such a modified distribution clearly satisfies the aforementioned constraints, and it has loss less than the loss of p_{t−1} itself (since the rectification only increases the prediction). [sent-190, score-0.275]
85 Hence, the maximizer, q, of p_q subject to the constraints has loss less than that of p_{t−1}, i.e., −log p_q ≤ ℓ_BMA(t). [sent-191, score-0.444]
86 Assume some other q₂ satisfied these constraints and maximized p_q. [sent-195, score-0.321]
87 To see this, note that normalization and curvature imply that q₂ must cross q only once. [sent-198, score-0.28]
88 Now a sufficiently slight perturbation of this crossing point to the left, shifting more mass from the left to the right side of the crossing point, would not violate the curvature constraint and would result in a new distribution with larger p_q, contradicting the maximality of q₂. [sent-199, score-0.5]
89 This, along with the curvature constraint and normalization, implies that the rectified Gaussian, q, is the unique solution. [sent-201, score-0.118]
90 To complete the proof, we show ℓ_{θ̂_{t−1}}(t) = −log p(1|x, θ̂_{t−1}) ≤ −4 log p_q. [sent-202, score-0.737]
91 Using the boundedness of the derivative, |∂ log p(1|x, θ)/∂θ| < 1, and that q only has support for θ ≥ θ̂_{t−1}, we have p_q = ∫_θ exp(log p(1|x, θ)) q(θ) dθ ≤ ∫_θ exp(log p(1|x, θ̂_{t−1}) + θ − θ̂_{t−1}) q(θ) dθ. [sent-205, score-0.579]
92 Now observe that for θ̂_{t−1} ≤ 0, we have the lower bound −log p(1|x, θ̂_{t−1}) ≥ log 2. [sent-209, score-0.521]
93 Now consider the case θ̂_{t−1} ≥ 0. [sent-214, score-0.208]
94 Let σ be the sigmoid function, so p(1|x, θ) = σ(θx) and p_q = ∫_θ σ(θx) q(θ) dθ. [sent-215, score-0.348]
95 Since the sigmoid is concave for θ > 0 and, for this case, q only has support on positive θ, we have that p_q ≤ σ(x ∫_θ θ q(θ) dθ). [sent-216, score-0.348]
96 Using the definition of q, we then have that p_q ≤ σ(x(θ̂_{t−1} + ν)) ≤ σ(θ̂_{t−1} + ν), where the last inequality follows from θ̂_{t−1} + ν > 0 and x ≤ 1. [sent-217, score-0.321]
97 Using properties of σ, one can show |(log σ)'(z)| < −log σ(z) (for all z). [sent-218, score-0.208]
98 Hence, for all θ ≥ θ̂_{t−1}, |(log σ)'(θ)| < −log σ(θ) ≤ −log σ(θ̂_{t−1}). [sent-219, score-0.416]
99 Using this derivative condition along with the previous bound on p_q, we have that −log p_q ≥ −log σ(θ̂_{t−1} + ν) ≥ (−log σ(θ̂_{t−1}))(1 − ν) = ℓ_{θ̂_{t−1}}(t)(1 − ν), which shows that ℓ_{θ̂_{t−1}}(t) ≤ −4 log p_q (since ν ≤ 0.5). [sent-220, score-1.921]
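The sigmoid inequality invoked above, |(log σ)'(z)| < −log σ(z), can be sanity-checked numerically; this quick check is our own and is not part of the paper's proof:

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

# (log sigma)'(z) = 1 - sigma(z); verify it lies strictly below -log sigma(z)
# over a grid of z values (the inequality holds for all real z).
checks = [abs(1.0 - sigma(z)) < -math.log(sigma(z))
          for z in [k / 10.0 for k in range(-80, 81)]]
ok = all(checks)
```

The inequality reduces to x/(1 + x) < log(1 + x) for x = exp(−z) > 0, which holds for all positive x.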
100 Azoury, K. S. and Warmuth, M. K. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246. [sent-230, score-0.21]
wordName wordTfidf (topN-words)
[('fy', 0.506), ('bma', 0.345), ('pq', 0.321), ('log', 0.208), ('lbma', 0.207), ('regression', 0.186), ('logloss', 0.184), ('pt', 0.183), ('vovk', 0.16), ('lmap', 0.138), ('loss', 0.123), ('azoury', 0.12), ('bayesian', 0.116), ('foster', 0.11), ('st', 0.109), ('bound', 0.105), ('logistic', 0.104), ('warmuth', 0.098), ('online', 0.095), ('lq', 0.091), ('map', 0.091), ('bounds', 0.087), ('ridge', 0.076), ('competitive', 0.074), ('recti', 0.068), ('freund', 0.065), ('kl', 0.063), ('curvature', 0.061), ('renormalized', 0.06), ('adversarial', 0.058), ('gaussian', 0.056), ('proof', 0.054), ('expert', 0.053), ('prediction', 0.049), ('inf', 0.049), ('yt', 0.048), ('schapire', 0.048), ('dawid', 0.046), ('grunwald', 0.046), ('prequential', 0.046), ('averaging', 0.046), ('proves', 0.042), ('square', 0.042), ('theorem', 0.04), ('worst', 0.04), ('predictive', 0.04), ('nelder', 0.04), ('unreasonably', 0.037), ('tc', 0.037), ('linear', 0.036), ('predictions', 0.036), ('imply', 0.036), ('algorithms', 0.036), ('mccullagh', 0.034), ('helmbold', 0.034), ('sketch', 0.032), ('differs', 0.032), ('subsequence', 0.032), ('bayes', 0.031), ('lemma', 0.031), ('devised', 0.031), ('ez', 0.031), ('though', 0.03), ('importantly', 0.03), ('term', 0.03), ('imposes', 0.029), ('let', 0.029), ('generalized', 0.029), ('literature', 0.029), ('exp', 0.029), ('posterior', 0.029), ('sides', 0.028), ('crossing', 0.028), ('coef', 0.027), ('sigmoid', 0.027), ('sequence', 0.027), ('assumptions', 0.027), ('ng', 0.026), ('var', 0.026), ('boolean', 0.026), ('incurred', 0.026), ('rn', 0.026), ('setting', 0.024), ('posteriori', 0.024), ('regularized', 0.023), ('bt', 0.023), ('upper', 0.022), ('constraint', 0.021), ('widely', 0.021), ('prior', 0.021), ('mass', 0.021), ('derivative', 0.021), ('equation', 0.021), ('multiplicative', 0.021), ('letting', 0.021), ('expectation', 0.02), ('hoped', 0.02), ('cleverly', 0.02), ('retrospect', 0.02), ('contradicting', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 138 nips-2004-Online Bounds for Bayesian Algorithms
Author: Sham M. Kakade, Andrew Y. Ng
Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. 1
2 0.1999923 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination
Author: Philip M. Long, Xinyu Wu
Abstract: We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. 1
3 0.11879866 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
4 0.11087743 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
5 0.1000155 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
Author: Liam Paninski
Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximationtheoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.
6 0.096393272 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
7 0.092194758 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
8 0.091356851 164 nips-2004-Semi-supervised Learning by Entropy Minimization
9 0.087781735 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
10 0.083945714 64 nips-2004-Experts in a Markov Decision Process
11 0.075335436 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
12 0.074060872 136 nips-2004-On Semi-Supervised Classification
13 0.073887646 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
14 0.073407136 69 nips-2004-Fast Rates to Bayes for Kernel Machines
15 0.073376685 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
16 0.065838829 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
17 0.065649942 42 nips-2004-Computing regularization paths for learning multiple kernels
18 0.061935801 116 nips-2004-Message Errors in Belief Propagation
19 0.061125424 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
20 0.0598079 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
topicId topicWeight
[(0, -0.18), (1, 0.015), (2, 0.158), (3, 0.094), (4, 0.027), (5, -0.086), (6, -0.011), (7, 0.071), (8, 0.09), (9, 0.096), (10, 0.08), (11, 0.041), (12, 0.229), (13, -0.021), (14, -0.041), (15, -0.125), (16, 0.013), (17, -0.126), (18, -0.016), (19, 0.007), (20, 0.141), (21, 0.007), (22, -0.06), (23, -0.119), (24, -0.036), (25, -0.031), (26, -0.061), (27, 0.029), (28, -0.064), (29, 0.006), (30, -0.032), (31, 0.123), (32, -0.022), (33, -0.079), (34, 0.021), (35, -0.08), (36, -0.062), (37, -0.008), (38, 0.1), (39, -0.011), (40, -0.134), (41, 0.097), (42, 0.043), (43, 0.03), (44, 0.03), (45, 0.0), (46, 0.013), (47, -0.007), (48, 0.112), (49, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.95110089 138 nips-2004-Online Bounds for Bayesian Algorithms
Author: Sham M. Kakade, Andrew Y. Ng
Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. 1
2 0.71903259 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination
Author: Philip M. Long, Xinyu Wu
Abstract: We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. 1
3 0.58417875 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
Author: Tzu-kuo Huang, Chih-jen Lin, Ruby C. Weng
Abstract: The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated. 1
4 0.53817046 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
5 0.41232947 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
Author: Liam Paninski
Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximationtheoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.
6 0.40324134 70 nips-2004-Following Curved Regularized Optimization Solution Paths
7 0.39758265 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
8 0.39069396 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
9 0.38904613 64 nips-2004-Experts in a Markov Decision Process
10 0.37089428 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
11 0.36754352 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
12 0.36221978 136 nips-2004-On Semi-Supervised Classification
13 0.34302941 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
14 0.33943793 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
15 0.33826753 137 nips-2004-On the Adaptive Properties of Decision Trees
16 0.33167154 116 nips-2004-Message Errors in Belief Propagation
17 0.31588241 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
18 0.30591735 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception
19 0.30526614 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
20 0.30423841 8 nips-2004-A Machine Learning Approach to Conjoint Analysis
topicId topicWeight
[(13, 0.6), (15, 0.076), (26, 0.041), (31, 0.019), (33, 0.118), (39, 0.022), (50, 0.022)]
simIndex simValue paperId paperTitle
1 0.97645783 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
Author: Giorgio Giacinto, Fabio Roli
Abstract: High retrieval precision in content-based image retrieval can be attained by adopting relevance feedback mechanisms. These mechanisms require that the user judges the quality of the results of the query by marking all the retrieved images as being either relevant or not. Then, the search engine exploits this information to adapt the search to better meet the user’s needs. At present, the vast majority of proposed relevance feedback mechanisms are formulated in terms of a search model that has to be optimized. Such an optimization involves the modification of some search parameters so that the nearest neighbors of the query vector contain the largest number of relevant images. In this paper, a different approach to relevance feedback is proposed. After the user provides the first feedback, subsequent retrievals are not based on k-nn search, but on the computation of a relevance score for each image of the database. This score is computed as a function of two distances, namely the distance from the nearest non-relevant image and the distance from the nearest relevant one. Images are then ranked according to this score and the top k images are displayed. Reported results on three image data sets show that the proposed mechanism outperforms other state-of-the-art relevance feedback mechanisms. 1 Introduction A large number of content-based image retrieval (CBIR) systems rely on the vector representation of images in a multidimensional feature space representing low-level image characteristics, e.g., color, texture, shape, etc. [1]. Content-based queries are often expressed by visual examples in order to retrieve from the database the images that are “similar” to the examples. This kind of retrieval is often referred to as K nearest-neighbor retrieval.
It is easy to see that the effectiveness of content-based image retrieval (CBIR) systems strongly depends on the choice of the set of visual features, on the choice of the “metric” used to model the user’s perception of image similarity, and on the choice of the image used to query the database [1]. Typically, if we allow different users to mark the images retrieved with a given query as relevant or non-relevant, different subsets of images will be marked as relevant. Accordingly, the need for mechanisms to adapt the CBIR system response based on some feedback from the user is widely recognized. It is interesting to note that while relevance feedback mechanisms were first introduced in the information retrieval field [2], they are receiving more attention in the CBIR field (Huang). The vast majority of relevance feedback techniques proposed in the literature are based on modifying the values of the search parameters so as to better represent the concept the user bears in mind. To this end, search parameters are computed as a function of the relevance values assigned by the user to all the images retrieved so far. As an example, relevance feedback is often formulated in terms of the modification of the query vector, and/or in terms of adaptive similarity metrics [3]-[7]. Recently, pattern classification paradigms such as SVMs have been proposed [8]. Feedback is thus used to model the concept of relevant images and adjust the search accordingly. Concept modeling may be difficult on account of the distribution of relevant images in the selected feature space. “Narrow domain” image databases allow extracting good features, so that images bearing similar concepts belong to compact clusters. On the other hand, “broad domain” databases, such as image collections used by graphic professionals, or those made up of images from the Internet, are more difficult to subdivide into clusters because of the high variability of concepts [1].
In these cases, it is worth extracting only low-level, non-specialized features, and image retrieval is better formulated in terms of a search problem rather than concept modeling. The present paper aims at offering an original contribution in this direction. Rather than modeling the concept of “relevance” the user bears in mind, feedback is used to assign each image of the database a relevance score. Such a score depends only on two dissimilarities (distances) computed against the images already marked by the user: the dissimilarity from the set of relevant images, and the dissimilarity from the set of non-relevant images. Despite its computational simplicity, this mechanism outperforms state-of-the-art relevance feedback mechanisms both on “narrow domain” databases and on “broad domain” databases. This paper is organized as follows. Section 2 illustrates the idea behind the proposed mechanism and provides the basic assumptions. Section 3 details the proposed relevance feedback mechanism. Results on three image data sets are presented in Section 4, where the performance of other relevance feedback mechanisms is compared. Conclusions are drawn in Section 5. 2 Instance-based relevance estimation The proposed mechanism has been inspired by classification techniques based on the “nearest case” [9]-[10]. Nearest-case theory provides the mechanism to compute the dissimilarity of each image from the sets of relevant and non-relevant images. The ratio between the distances to the nearest relevant image and the nearest non-relevant image has been used to compute the degree of relevance of each image of the database [11]. The present section illustrates the rationale behind the use of the nearest-case paradigm. Let us assume that each image of the database has been represented by a number of low-level features, and that a (dis)similarity measure has been defined so that the proximity between pairs of images represents some kind of “conceptual” similarity.
In other words, the chosen feature space and similarity metric are meaningful at least for a restricted number of users. A search in image databases is usually performed by retrieving the k most similar images with respect to a given query. The dimension of k is usually small, to avoid displaying a large number of images at a time. Typical values for k are between 10 and 20. However, as the “relevant” images that the user wishes to retrieve may not fit perfectly with the similarity metric designed for the search engine, the user may be interested in exploring other regions of the feature space. To this end, the user marks the subset of “relevant” images out of the k retrieved. Usually, such relevance feedback is used to perform a new k-nn search by modifying some search parameters, i.e., the position of the query point, the similarity metric, and other tuning parameters [1]-[7]. Recent works proposed the use of support vector machines to learn the distribution of relevant images [8]. These techniques require some assumption about the general form of the distribution of relevant images in the feature space. As it is difficult to make any assumption about such a distribution for broad domain databases, we propose to exploit the information about the relevance of the images retrieved so far in a nearest-neighbor fashion. Nearest-neighbor techniques, as used in statistical pattern recognition, case-based reasoning, or instance-based learning, are effective in all applications where it is difficult to produce a high-level generalization of a “class” of objects [9]-[10],[12]-[13]. Relevance learning in content-based image retrieval may well fit into this definition, as it is difficult to provide a general model that can be adapted to represent different concepts of similarity. In addition, the number of available cases may be too small to estimate the optimal set of parameters for such a general model.
On the other hand, it can be more effective to use each “relevant” image, as well as each “non-relevant” image, as a “case” or “instance” against which the images of the database should be compared. Consequently, we assume that an image is relevant to the extent that its dissimilarity from the nearest relevant image is small. Analogously, an image is non-relevant to the extent that its dissimilarity from the nearest non-relevant image is small. 3 Relevance score computation According to the previous section, each image of the database can thus be characterized by a “degree of relevance” and a “degree of non-relevance” according to the dissimilarities from the nearest relevant image and from the nearest non-relevant image, respectively. However, it should be noted that these degrees should be treated differently, because only “relevant” images represent a “concept” in the user’s mind, while “non-relevant” images may represent a number of other concepts different from the user’s interest. In other words, while it is meaningful to treat the degree of relevance as a degree of membership to the class of relevant images, the same does not apply to the degree of non-relevance. For this reason, we propose to use the “degree of non-relevance” to weight the “degree of relevance”. Let us denote with R the subset of indexes j ∈ {1,...,k} related to the set of relevant images retrieved so far and the original query (which is relevant by default), and with NR the subset of indexes j ∈ {1,...,k} related to the set of non-relevant images retrieved so far. For each image I of the database, according to the nearest-neighbor rule, let us compute the dissimilarity from the nearest image in R and the dissimilarity from the nearest image in NR. Let us denote these dissimilarities as dR(I) and dNR(I), respectively. The value of dR(I) can clearly be used to measure the degree of relevance of image I, assuming that small values of dR(I) are related to very relevant images.
On the other hand, the hypothesis that image I is relevant to the user’s query can be supported by a high value of dNR(I). Accordingly, we define the relevance score

relevance(I) = (1 + dR(I) / dNR(I))^(-1)
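A minimal sketch of this instance-based scoring, assuming Euclidean dissimilarities and the score dNR / (dR + dNR) implied by the ratio-based description above (the function names are illustrative, not from the paper):

```python
import numpy as np

def relevance_scores(X, relevant, non_relevant):
    """Score every image (row of X) by its Euclidean distance to the
    nearest relevant example (dR) and to the nearest non-relevant one
    (dNR), combined as dNR / (dR + dNR) -- equivalently
    (1 + dR/dNR)^(-1); higher means more relevant."""
    def nearest(points, examples):
        # pairwise distances, then min over the example set
        d = np.linalg.norm(points[:, None, :] - examples[None, :, :], axis=2)
        return d.min(axis=1)
    dR = nearest(X, relevant)
    dNR = nearest(X, non_relevant)
    return dNR / (dR + dNR + 1e-12)

def top_k(X, relevant, non_relevant, k):
    """Rank the database by relevance score and return the top-k indices."""
    return np.argsort(-relevance_scores(X, relevant, non_relevant))[:k]
```

Note that no model of the relevant class is fit: each round of feedback simply enlarges the sets of cases, and the next ranking reflects them directly, which is what makes the mechanism cheap per iteration.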
2 0.97248381 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
Author: Yuanqing Lin, Daniel D. Lee
Abstract: Bayesian Regularization and Nonnegative Deconvolution (BRAND) is proposed for estimating time delays of acoustic signals in reverberant environments. Sparsity of the nonnegative filter coefficients is enforced using an L1-norm regularization. A probabilistic generative model is used to simultaneously estimate the regularization parameters and filter coefficients from the signal data. Iterative update rules are derived under a Bayesian framework using the Expectation-Maximization procedure. The resulting time delay estimation algorithm is demonstrated on noisy acoustic data.
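The core objective here — a sparse, nonnegative filter whose peak locates the time delay — can be sketched with plain projected gradient descent instead of the paper's Bayesian EM updates. Everything below (function name, pulse setup, fixed λ) is an illustrative assumption, not BRAND itself.

```python
import numpy as np

def nonneg_deconv(x, y, n_taps, lam=0.1, iters=2000):
    """Find g >= 0 minimizing ||y - X g||^2 + lam * sum(g), where X is the
    (truncated) convolution matrix of x.  Projected gradient descent; the
    argmax of g then estimates the dominant time delay."""
    n = len(y)
    X = np.zeros((n, n_taps))
    for d in range(n_taps):
        X[d:, d] = x[: n - d]               # X[t, d] = x[t - d]
    lr = 0.5 / np.linalg.norm(X.T @ X, 2)   # step below 1/Lipschitz
    g = np.zeros(n_taps)
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ g - y) + lam  # on g >= 0, d||g||_1/dg = 1
        g = np.maximum(g - lr * grad, 0.0)    # project onto the orthant
    return g

# toy usage: a signal delayed by 7 samples and attenuated to 0.9
rng = np.random.default_rng(1)
x = rng.standard_normal(64)
true_delay = 7
y = np.zeros(64)
y[true_delay:] = 0.9 * x[:-true_delay]
g = nonneg_deconv(x, y, n_taps=16)
delay_hat = int(np.argmax(g))
```

On g ≥ 0 the L1 penalty reduces to a constant slope lam, which is why the gradient step is so simple; BRAND additionally learns the regularization strength from the data rather than fixing it.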
3 0.96418953 176 nips-2004-Sub-Microwatt Analog VLSI Support Vector Machine for Pattern Classification and Sequence Estimation
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: An analog system-on-chip for kernel-based pattern classification and sequence estimation is presented. State transition probabilities conditioned on input data are generated by an integrated support vector machine. Dot-product-based kernels and support vector coefficients are implemented in analog programmable floating-gate translinear circuits, and probabilities are propagated and normalized using sub-threshold current-mode circuits. A 14-input, 24-state, and 720-support vector forward decoding kernel machine is integrated on a 3mm×3mm chip in 0.5µm CMOS technology. Experiments with the processor trained for speaker verification and phoneme sequence estimation demonstrate real-time recognition accuracy on par with floating-point software, at sub-microwatt power.
same-paper 4 0.93924987 138 nips-2004-Online Bounds for Bayesian Algorithms
Author: Sham M. Kakade, Andrew Y. Ng
Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression.
5 0.93113422 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
Author: Elizaveta Levina, Peter J. Bickel
Abstract: We propose a new method for estimating the intrinsic dimension of a dataset, derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators.
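A compact sketch of this estimator, using the standard Levina-Bickel formula based on nearest-neighbor distances; the brute-force pairwise distance matrix is for illustration only, and averaging the per-point estimates follows the paper's basic recipe.

```python
import numpy as np

def mle_intrinsic_dim(X, k=10):
    """Maximum likelihood intrinsic dimension (Levina-Bickel):
    m_hat(x) = [ (1/(k-1)) * sum_{j=1}^{k-1} log(T_k(x) / T_j(x)) ]^(-1),
    where T_j(x) is the distance from x to its j-th nearest neighbor;
    the per-point estimates are averaged over the dataset."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    T = np.sort(D, axis=1)[:, 1 : k + 1]      # drop the self-distance column
    logs = np.log(T[:, -1:] / T[:, :-1])      # log(T_k / T_j) for j = 1..k-1
    return float((1.0 / logs.mean(axis=1)).mean())

# usage: a 2-D Gaussian embedded in 5-D should give an estimate near 2
rng = np.random.default_rng(0)
X = np.concatenate([rng.standard_normal((500, 2)), np.zeros((500, 3))], axis=1)
```

The estimator only sees inter-point distances, so padding the data with constant coordinates (as above) leaves the estimate unchanged — the ambient dimension never enters.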
6 0.88986498 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
7 0.84172434 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
8 0.71534693 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
9 0.6552974 135 nips-2004-On-Chip Compensation of Device-Mismatch Effects in Analog VLSI Neural Networks
10 0.65068954 28 nips-2004-Bayesian inference in spiking neurons
11 0.64037973 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid
12 0.63806146 163 nips-2004-Semi-parametric Exponential Family PCA
13 0.6364761 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
14 0.63351387 131 nips-2004-Non-Local Manifold Tangent Learning
15 0.63179225 102 nips-2004-Learning first-order Markov models for control
16 0.63078219 116 nips-2004-Message Errors in Belief Propagation
17 0.63073987 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
18 0.62820053 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making
19 0.62533885 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination
20 0.62100953 70 nips-2004-Following Curved Regularized Optimization Solution Paths