nips nips2011 nips2011-109 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dong Dai, Tong Zhang
Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst case regret of the latter decays at the rate of O(1/√n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. [sent-4, score-0.115]
2 It is known that if the models are mis-specified, model averaging is superior to model selection. [sent-5, score-0.252]
3 Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst case regret of the latter decays at the rate of O(1/√n). [sent-6, score-0.298]
4 In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. [sent-7, score-0.56]
5 The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. [sent-9, score-0.337]
6 We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples. [sent-10, score-0.075]
7 1 Introduction This paper considers the model combination problem, where the goal is to combine multiple models in order to achieve improved accuracy. [sent-11, score-0.077]
8 That is, let k∗ be the best single model defined as: k∗ = \arg\min_k \|f_k − g\|_2^2, (1) then f_{k∗} ≠ g. [sent-44, score-0.064]
9 We are interested in an estimator \hat f of g that achieves a small regret R(\hat f) = (1/n)\|\hat f − g\|_2^2 − (1/n)\|f_{k∗} − g\|_2^2. [sent-45, score-0.247]
10 This paper considers a special class of model combination methods which we refer to as model averaging, with combined estimators of the form \hat f = \sum_{k=1}^{M} \hat w_k f_k, where \hat w_k ≥ 0 and \sum_k \hat w_k = 1. [sent-46, score-0.552]
11 A standard method for “model averaging” is model selection, where we choose the model \hat k with the smallest least squares error: \hat f^{MS} = f_{\hat k}; \hat k = \arg\min_k \|f_k − y\|_2^2. [sent-47, score-0.06]
12 This corresponds to the choice of \hat w_{\hat k} = 1 and \hat w_k = 0 when k ≠ \hat k. [sent-48, score-0.318]
13 However, it is well known that the worst case regret this procedure can achieve is R(\hat f^{MS}) = O(√(ln M/n)) [1]. [sent-49, score-0.487]
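For concreteness, a minimal sketch of the model selection baseline described above, assuming the M candidate fits f_k and the noisy observations y are available as NumPy arrays (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def model_selection(F, y):
    """Pick the single candidate with the smallest least squares error
    ||f_k - y||_2^2, i.e. the \\hat{f}^{MS} estimator described above.

    F : (M, n) array, row k holds the fitted values f_k on the n data points
    y : (n,) array of noisy observations
    """
    errors = np.sum((F - y) ** 2, axis=1)  # ||f_k - y||_2^2 for every model
    k_hat = int(np.argmin(errors))         # selected model index \hat{k}
    return k_hat, F[k_hat]
```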
14 Another standard model averaging method is the Exponential Weighted Model Averaging (EWMA) estimator, defined as \hat f^{EWMA} = \sum_{k=1}^{M} \hat w_k f_k, \hat w_k = q_k e^{−λ\|f_k − y\|_2^2} / \sum_{j=1}^{M} q_j e^{−λ\|f_j − y\|_2^2}, (2) with a tuned parameter λ ≥ 0. [sent-50, score-1.019]
15 In this setting, the most common prior choice is the flat prior qj = 1/M . [sent-56, score-0.287]
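A minimal sketch of the EWMA estimator in (2) under the flat prior q_j = 1/M; λ is the tuned temperature from the text, and the max-shift before exponentiating is only a numerical-stability detail of this sketch, not something stated in the paper:

```python
import numpy as np

def ewma_estimator(F, y, lam, q=None):
    """Exponential weighted model averaging, eq. (2):
    w_k proportional to q_k * exp(-lam * ||f_k - y||_2^2)."""
    M = F.shape[0]
    if q is None:
        q = np.full(M, 1.0 / M)              # flat prior q_j = 1/M
    losses = np.sum((F - y) ** 2, axis=1)    # ||f_k - y||_2^2
    logits = np.log(q) - lam * losses
    logits -= logits.max()                   # shift for numerical stability
    w = np.exp(logits)
    w /= w.sum()                             # normalize so the weights sum to one
    return w @ F, w                          # combined fit and the weights
```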
16 It should be pointed out that a progressive variant of (2), which returns the average of n + 1 EWMA estimators with Si = {(X1 , Y1 ), . [sent-57, score-0.14]
17 Nevertheless, the non-progressive version presented in (2) is clearly a more natural estimator, and this is the form that has been studied in more recent work [3, 6, 8]. [sent-64, score-0.082]
18 It is known that exponential model averaging leads to an average regret of O(ln M/n) which achieves the optimal rate; however it was pointed out in [1] that the rate does not hold with large probability. [sent-67, score-0.416]
19 Specifically, EWMA only leads to a sub-optimal deviation bound of O(√(ln M/n)) with large probability. [sent-68, score-0.39]
20 To remedy this sub-optimality, an empirical star algorithm (which we will refer to as STAR from now on) was then proposed in [1]; it was shown that the algorithm gives an O(ln M/n) deviation bound with large probability under the flat prior q_i = 1/M. [sent-69, score-0.272]
21 One major issue of the STAR algorithm is that its average performance is often inferior to EWMA, as we can see from our empirical examples. [sent-70, score-0.063]
22 Therefore although theoretically interesting, it is not an algorithm that can be regarded as a replacement of EWMA for practical purposes. [sent-71, score-0.07]
23 Partly for this reason, a more recent study [7] re-examined the problem of improving EWMA, where different estimators were proposed in order to achieve optimal deviation for model averaging. [sent-72, score-0.118]
24 The purpose of this paper is to present a simple greedy model averaging (GMA) algorithm that gives the optimal O(ln M/n) deviation bound with large probability, and it can be applied with arbitrary prior qi . [sent-74, score-0.395]
25 Moreover, unlike STAR, whose average performance is inferior to EWMA, the average performance of the GMA algorithm is generally superior to that of EWMA, as we shall illustrate with examples. [sent-75, score-0.132]
26 2 Greedy Model Averaging This paper studies a new model averaging procedure presented in Algorithm 1. [sent-77, score-0.233]
27 The procedure has L stages, and stage ℓ adds an additional model f_{k^{(ℓ)}} into the ensemble. [sent-78, score-0.058]
28 It is based on a simple but important modification of a classical sequential greedy approximation procedure in the literature [4], which corresponds to setting µ^{(ℓ)} = 0 and λ = 0 in Algorithm 1 with α^{(ℓ)} optimized over [0, 1]. [sent-79, score-0.148]
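As a point of comparison, a sketch of one stage of the classical sequential greedy procedure of [4] as characterized above (µ^{(ℓ)} = 0, λ = 0, α^{(ℓ)} optimized over [0, 1]); the closed-form line search over α used below is a standard least squares computation, not a formula taken from the paper:

```python
import numpy as np

def classical_greedy_step(f_prev, F, y):
    """One stage of the classical greedy procedure of [4]: choose the pair
    (alpha, f_j) minimizing ||alpha * f_prev + (1 - alpha) * f_j - y||_2^2
    with alpha restricted to [0, 1]."""
    best_err, best_j, best_alpha = np.inf, 0, 0.0
    for j, f_j in enumerate(F):
        d = f_j - f_prev
        denom = float(d @ d)
        # unconstrained minimizer of the quadratic in alpha, clipped to [0, 1]
        alpha = 0.0 if denom == 0.0 else float(np.clip((f_j - y) @ d / denom, 0.0, 1.0))
        fit = alpha * f_prev + (1.0 - alpha) * f_j
        err = float(np.sum((fit - y) ** 2))
        if err < best_err:
            best_err, best_j, best_alpha = err, j, alpha
    return best_alpha * f_prev + (1.0 - best_alpha) * F[best_j], best_j
```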
29 The STAR algorithm corresponds to the stage-2 estimator \hat f^{(2)} with the above-mentioned classical greedy procedure of [4]. [sent-80, score-0.263]
30 However, in order to prove the desired deviation bound, our analysis critically depends on the extra term µ^{(ℓ)} \|\hat f^{(ℓ−1)} − f_j\|_2^2, which is not present in the classical procedure (that is, our proof does not apply to the procedure of [4]). [sent-81, score-0.221]
31 As we will see in Section 4, this extra term does have a positive impact under suitable conditions that correspond to Theorem 1 and Theorem 2 below, and thus this term is not only of theoretical interest but also leads to practical benefits under the right conditions. [sent-82, score-0.083]
32 Another difference between GMA and the greedy algorithm in [4] is that our procedure allows the use of non-flat priors through the extra penalty term λc^{(ℓ)} ln(1/q_j). [sent-83, score-0.143]
33 Moreover, it is useful to notice that if we choose the flat prior q_j = 1/M, then the term λc^{(ℓ)} ln(1/q_j) is identical for all models, and thus this term can be removed from the optimization. [sent-85, score-0.245]
34 In this case, the proposed method has the advantage of being parameter free (with the default choice of ν = 0.5). [sent-86, score-0.072]
35 Algorithm 1 (GMA) — input: models f_1, . . . , f_M; output: averaged model \hat f^{(L)}; parameters: prior {q_j}_{j=1,...,M}, λ, ν. [sent-92, score-0.069]
36 As we have pointed out earlier, it is well known that only O(1/√n) regret can be achieved by any model selection procedure (that is, any procedure that returns a single model f_{\hat k} for some \hat k). [sent-101, score-0.253]
37 For clarity, we rewrite this stage-2 estimator as \hat k^{(2)} = \arg\min_j [ \|(1/2)(f_{\hat k^{(1)}} + f_j) − y\|_2^2 + (1/20)\|f_{\hat k^{(1)}} − f_j\|_2^2 + (λ/4) ln(1/q_j) ], \hat f^{(2)} = (1/2)(f_{\hat k^{(1)}} + f_{\hat k^{(2)}}). [sent-104, score-0.191]
38 Theorem 1 shows that this simple stage-2 estimator achieves O(1/n) regret. [sent-105, score-0.173]
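A sketch of the stage-2 estimator as rewritten above; the constants 1/2, 1/20 and λ/4 come directly from that display, while taking \hat k^{(1)} to be the plain model-selection index (the L = 1 case described in the experiments) is an assumption of this sketch:

```python
import numpy as np

def gma_stage2(F, y, lam, q=None):
    """Stage-2 greedy model averaging:
      k2 = argmin_j ||(f_k1 + f_j)/2 - y||^2 + (1/20)||f_k1 - f_j||^2 + (lam/4) ln(1/q_j)
      f_hat = (f_k1 + f_k2) / 2
    Stage 1 (k1) is taken here to be ordinary model selection."""
    M = F.shape[0]
    if q is None:
        q = np.full(M, 1.0 / M)                            # flat prior
    k1 = int(np.argmin(np.sum((F - y) ** 2, axis=1)))      # L = 1: model selection
    scores = (np.sum((0.5 * (F[k1] + F) - y) ** 2, axis=1)
              + np.sum((F[k1] - F) ** 2, axis=1) / 20.0
              + (lam / 4.0) * np.log(1.0 / q))
    k2 = int(np.argmin(scores))
    return 0.5 * (F[k1] + F[k2]), (k1, k2)
```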
39 A similar result was shown in [1] for the STAR algorithm under the flat prior qj = 1/M , which corresponds to the stage 2 estimator of the classical greedy algorithm in [4]. [sent-106, score-0.5]
40 This result also improves the recent work of [7] in that the resulting bound is scale free while the algorithm itself is significantly simpler. [sent-109, score-0.051]
41 One disadvantage of this stage-2 estimator (and similarly of the STAR estimator of [1]) is that its average performance is generally inferior to that of EWMA, mainly due to the relatively large constant in Theorem 1. [sent-110, score-0.325]
42 For this reason, the stage-2 estimator is not a practical replacement of EWMA. [sent-111, score-0.154]
43 Our empirical experiments show that in order to compete with EWMA for average performance, it is important to take L > 2. [sent-113, score-0.068]
44 However a relatively small L (as small as L = 5) is often sufficient, and in such case the resulting estimator is still quite sparse. [sent-114, score-0.147]
45 Theorem 1 Given q_j ≥ 0 such that \sum_{j=1}^{M} q_j ≤ 1, if λ ≥ 40σ², then with probability 1 − 2δ we have R(\hat f^{(2)}) ≤ (λ/n) [ (3/4) ln(1/q_{k∗}) + (1/2) ln(1/δ) ]. [sent-116, score-0.221]
46 While the stage-2 estimator \hat f^{(2)} achieves the optimal rate, running GMA for more stages can further improve the performance. [sent-117, score-0.226]
47 The following theorem shows that similar bounds can be obtained for GMA at stages larger than 2. [sent-118, score-0.153]
48 However, the constant before (σ²/n) ln(1/(q_{k∗} δ)) approaches 8 when ℓ → ∞ (with the default ν = 0.5). [sent-119, score-0.332]
49 In fact, with relatively large ℓ, the GMA method not only has the theoretical advantage of achieving smaller regret in deviation (that is, the regret bound holds with large probability) but also achieves better average performance in practice. [sent-122, score-0.373]
50 Theorem 2 Given q_j ≥ 0 such that \sum_{j=1}^{M} q_j ≤ 1, if λ ≥ 40σ² and 0 < ν < 1 in Algorithm 1, then with probability 1 − 2δ we have R(\hat f^{(ℓ)}) ≤ (λ/n) · [((ℓ − 2) + ln(ℓ − 1) + 30ν(1 − ν)) / (20ν(1 − ν)(ℓ − 1))] · ln(1/(q_{k∗} δ)). [sent-124, score-0.301]
51 Another important advantage of running GMA for ℓ > 2 stages is that the resulting estimator not only competes with the best single estimator f_{k∗}, but also competes with the best estimator in the convex hull cov(F) (with the parameter ν appropriately tuned). [sent-125, score-0.567]
52 Define the convex hull of F as cov(F) = { \sum_{j=1}^{M} w_j f_j : w_j ≥ 0; \sum_j w_j = 1 }. [sent-127, score-0.204]
53 The following theorem shows that as ℓ → ∞, the prediction error of \hat f^{(ℓ)} is no more than O(1/√n) worse than that of the optimal \bar f ∈ cov(F) when we choose a sufficiently small ν = O(1/√n) in Algorithm 1. [sent-128, score-0.071]
54 Note that in this case, it is beneficial to use a parameter ν smaller than the default choice of ν = 0.5. [sent-129, score-0.049]
55 Theorem 3 Consider any weights {w_j}_{j=1,...,M} such that \sum_{j=1}^{M} w_j = 1 and w_j ≥ 0, and let \bar f = \sum_j w_j f_j. [sent-136, score-0.168]
56 If λ ≥ 40σ² and 0 < ν < 1 in Algorithm 1, then with probability 1 − 2δ, as ℓ → ∞: (1/n)\|\hat f^{(ℓ)} − g\|_2^2 ≤ (1/n)\|\bar f − g\|_2^2 + (ν/(2n)) \sum_k w_k \|f_k − \bar f\|_2^2 + (4λ/(20ν(1 − ν)n)) \sum_k w_k ln(1/(δ q_k)) + O(1/ℓ). [sent-137, score-0.601]
57 The performance of an estimator \hat f measured here is the mean squared error (MSE), defined as MSE(\hat f) = (1/n)\|\hat f − g\|_2^2. [sent-155, score-0.115]
58 We run the Greedy Model Averaging (GMA) algorithm for L stages up to L = 40. [sent-156, score-0.099]
59 Moreover, we also list the performance of EWMA with projection, which is the method that runs EWMA but with each model f_k replaced by the model \tilde f_k = α_k f_k, where α_k = \arg\min_{α∈R} \|α f_k − y\|_2^2. [sent-158, score-0.042]
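The per-model projection used by “EWMA with projection” has the usual closed form α_k = ⟨f_k, y⟩ / \|f_k\|_2^2; a short sketch (the small guard against zero-norm models is an implementation detail added here, not from the paper):

```python
import numpy as np

def project_models(F, y):
    """Replace each f_k by alpha_k * f_k with
    alpha_k = argmin_a ||a * f_k - y||_2^2 = <f_k, y> / ||f_k||_2^2."""
    alphas = (F @ y) / np.maximum(np.sum(F ** 2, axis=1), 1e-12)
    return alphas[:, None] * F
```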
60 Note that this is a special case of the class of methods studied in [6] (which considers more general projections) that leads to non-progressive regret bounds, and this is the method of significant current interest [3, 8]. [sent-160, score-0.234]
61 The model f_{k∗} is clearly not a valid estimator because it depends on the unobserved g; however, its performance is informative, and it is thus included in the tables. [sent-163, score-0.136]
62 For simplicity, all algorithms use flat prior qk = 1/M . [sent-164, score-0.187]
63 Five hundred replications are run, and the MSE performance of the different algorithms is reported in Table 1 using the “mean ± standard deviation” format. [sent-166, score-0.06]
64 This is thus the situation in which model averaging does not achieve as good a performance as that of the best single model. [sent-171, score-0.241]
65 The results indicate that for GMA, from L = 1 (corresponding to model selection) to L = 2 (stage-2 model averaging of Theorem 1), there is significant reduction of error. [sent-173, score-0.217]
66 This isn’t surprising, because STAR can be regarded as the stage-2 estimator based on the more classical greedy algorithm of [4]. [sent-175, score-0.257]
67 It means that in order to achieve good performance, it is necessary to use more stages than L = 2 (although this does not change the O(1/n) rate for regret, it can significantly reduce the constant). [sent-177, score-0.107]
68 It becomes better than EWMA when L is as small as 5, which still gives a relatively sparse averaged model. [sent-178, score-0.056]
69 This again is consistent with Theorem 2, which shows that the new term we added into the greedy algorithm is indeed useful in this scenario. [sent-184, score-0.078]
70 Five hundred replications are run, and the MSE performance of the different algorithms is reported in Table 2 using the “mean ± standard deviation” format. [sent-187, score-0.06]
71 Table 1: MSE of different algorithms: best single model is superior to averaged models. Columns: STAR, EWMA, EWMA (with projection), BSM. [sent-188, score-0.1]
72 ... 0.5 is relatively small, which makes it beneficial to compete with the best model \bar f in the convex hull, even though GMA has a larger regret of O(1/√n) when competing with \bar f. [sent-232, score-0.281]
73 This is thus the situation considered in Theorem 3, which means that model averaging can achieve better performance than that of the best single model. [sent-233, score-0.241]
74 The results again show that for GMA, from L = 1 (corresponding to model selection) to L = 2 (stage-2 model averaging of Theorem 1), there is significant reduction of error. [sent-234, score-0.217]
75 It becomes better than EWMA when L is as small as 5, which still gives a relatively sparse averaged model. [sent-238, score-0.056]
76 ... ν = 0.5 in Theorem 2 is inferior to choosing smaller parameter values of ν = 0.… [sent-241, score-0.047]
77 This is consistent with Theorem 3, where it is beneficial to use a smaller value for ν in order to compete with the best model in the convex hull. [sent-244, score-0.093]
78 Table 2: MSE of different algorithms: best single model is inferior to averaged model. Columns: STAR, EWMA, EWMA (with projection), BSM. [sent-245, score-0.133]
79 Conclusion This paper presents a new model averaging scheme which we call greedy model averaging (GMA). [sent-286, score-0.47]
80 It is shown that the new method can achieve a regret bound of O(ln M/n) with large probability when competing with the single best model. [sent-287, score-0.193]
81 Moreover, it can also compete with the best combined model in the convex hull. [sent-288, score-0.093]
82 Due to the simplicity of our proposal, GMA may be regarded as a valid alternative to the more widely studied EWMA procedure both for practical applications and for theoretical purposes. [sent-290, score-0.105]
83 Finally we shall point out that while this work only considers static model averaging where the models F are finite, similar results can be obtained for affine estimators or infinite models considered in recent work [3, 6, 8]. [sent-291, score-0.291]
84 A Proof Sketches We only include proof sketches, and leave the details to the supplemental material that accompanies the submission. [sent-293, score-0.078]
85 The proofs can be found in the supplemental material. [sent-295, score-0.051]
86 Define event E1 as E1 = { ∀j : (f_j − f_{k∗})^⊤ ξ ≤ σ \|f_j − f_{k∗}\|_2 √(2 ln(1/(δ q_j))) } and define event E2 as E2 = { ∀j, k : (f_j − f_k)^⊤ ξ ≤ σ \|f_j − f_k\|_2 √(2 ln(1/(δ q_j q_k))) }; then P(E1) ≥ 1 − δ and P(E2) ≥ 1 − δ. [sent-301, score-0.281]
87 A.1 Proof Sketch of Theorem 1 A more detailed proof can be found in the supplemental material. [sent-303, score-0.078]
88 Note that with probability 1 − 2δ, both event E1 and event E2 of Proposition 1 hold. [sent-304, score-0.118]
89 In the above derivation, the inequality is equivalent to Q^{(2)}(\hat k^{(2)}) ≤ Q^{(2)}(k∗), which is a simple fact of the definition of \hat k^{(ℓ)} in the algorithm. [sent-306, score-0.047]
90 The first inequality above uses the tail probability bounds in the events E1 and E2. [sent-310, score-0.138]
91 We then use the algebraic inequalities 2a_1 b_1 ≤ a_1^2/r_1 + r_1 b_1^2 and 2a_2 b_2 ≤ a_2^2/r_2 + r_2 b_2^2 to obtain the last inequality, which implies the desired bound. [sent-311, score-0.081]
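For completeness, the elementary inequality used here follows from expanding a single square (a standard step, spelled out here rather than taken from the paper):

```latex
% For any real a, b and any r > 0:
0 \le \left(\frac{a}{\sqrt{r}} - \sqrt{r}\,b\right)^{2}
  = \frac{a^{2}}{r} - 2ab + r\,b^{2}
\qquad\Longrightarrow\qquad
2ab \le \frac{a^{2}}{r} + r\,b^{2}.
```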
92 A.2 Proof Sketch of Theorem 2 Again, a more detailed proof can be found in the supplemental material. [sent-313, score-0.078]
93 With probability 1 − 2δ, both event E1 and event E2 of Proposition 1 hold. [sent-314, score-0.118]
94 ... − q\|b\|_2^2 ≤ −(pq/(p + q))\|a + b\|_2^2 ... \hat f^{(ℓ)} − f_{k∗} ... µ^{(ℓ)}\|f_{k^{(ℓ)}} − \hat f^{(ℓ−1)}\|_2^2, and uses the Gaussian tail bound in the event E1. [sent-319, score-0.119]
95 The last inequality uses 2ab ≤ a²/r + r b², where r > 0 is taken to be r = λc^{(ℓ)}/2. [sent-320, score-0.047]
96 Solving this recursion for R^{(ℓ)} leads to the desired bound. A.3 Proof Sketch of Theorem 3 Again, a more detailed proof can be found in the supplemental material. [sent-325, score-0.134]
97 We have \|\hat f^{(ℓ)} − g\|_2^2 ≤ \sum_k w_k \|α^{(ℓ)} \hat f^{(ℓ−1)} + (1 − α^{(ℓ)}) f_k − g\|_2^2 + µ^{(ℓ)} \sum_k w_k \|\hat f^{(ℓ−1)} − f_k\|_2^2 − µ^{(ℓ)} \|\hat f^{(ℓ−1)} − f_{k^{(ℓ)}}\|_2^2 + λc^{(ℓ)} (\sum_k w_k ln(1/q_k) − ln(1/q_{k^{(ℓ)}})) + 2ξ^⊤ ((1 − α^{(ℓ)}) f_{k^{(ℓ)}} − (1 − α^{(ℓ)}) \sum_k w_k f_k). [sent-327, score-0.633]
98 The inequality is equivalent to Q^{(ℓ)}(\hat k^{(ℓ)}) ≤ \sum_k w_k Q^{(ℓ)}(k), which is a simple fact of the definition of \hat k^{(ℓ)} in the algorithm. [sent-328, score-0.197]
99 Denote \bar R^{(ℓ)} = \|\hat f^{(ℓ)} − g\|_2^2 − \|\bar f − g\|_2^2; then the same derivation as that of Theorem 2 implies that \bar R^{(ℓ)} ≤ ((ℓ − 1)/ℓ) \bar R^{(ℓ−1)} + λc^{(ℓ)} \sum_k w_k ln(1/(δ q_k)) + [µ^{(ℓ)} + (1 − α^{(ℓ)})²] \sum_k w_k \|f_k − \bar f\|_2^2. [sent-329, score-0.318]
100 A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. [sent-343, score-0.112]
wordName wordTfidf (topN-words)
[('ewma', 0.633), ('gma', 0.443), ('ln', 0.301), ('qj', 0.221), ('star', 0.177), ('averaging', 0.175), ('qk', 0.163), ('wk', 0.15), ('estimator', 0.115), ('regret', 0.103), ('stages', 0.082), ('greedy', 0.078), ('mse', 0.075), ('theorem', 0.071), ('bsm', 0.063), ('progressive', 0.061), ('event', 0.059), ('wj', 0.056), ('isn', 0.056), ('compete', 0.052), ('supplemental', 0.051), ('inferior', 0.047), ('inequality', 0.047), ('deviation', 0.043), ('philippe', 0.042), ('rigollet', 0.042), ('cov', 0.042), ('procedure', 0.037), ('hull', 0.036), ('superior', 0.035), ('pointed', 0.034), ('jersey', 0.034), ('projection', 0.034), ('fk', 0.033), ('classical', 0.033), ('relatively', 0.032), ('replications', 0.032), ('rutgers', 0.032), ('competes', 0.032), ('sketches', 0.032), ('tail', 0.032), ('default', 0.031), ('regarded', 0.031), ('considers', 0.031), ('five', 0.03), ('achieves', 0.029), ('pace', 0.029), ('estimators', 0.029), ('stage', 0.029), ('bound', 0.028), ('extra', 0.028), ('hundred', 0.028), ('proof', 0.027), ('purpose', 0.026), ('sketch', 0.026), ('moreover', 0.026), ('fm', 0.025), ('achieve', 0.025), ('decays', 0.025), ('alexandre', 0.025), ('averaged', 0.024), ('tuned', 0.024), ('rewrite', 0.024), ('aggregation', 0.024), ('prior', 0.024), ('scenario', 0.024), ('argmin', 0.023), ('free', 0.023), ('recursion', 0.022), ('clarity', 0.022), ('worst', 0.021), ('model', 0.021), ('replacement', 0.021), ('iid', 0.021), ('non', 0.021), ('rmed', 0.02), ('keeps', 0.02), ('exponential', 0.02), ('proposition', 0.02), ('best', 0.02), ('fj', 0.019), ('theoretical', 0.019), ('validation', 0.019), ('gilbert', 0.019), ('anatoli', 0.019), ('implies', 0.018), ('shall', 0.018), ('leads', 0.018), ('practical', 0.018), ('choice', 0.018), ('squares', 0.018), ('static', 0.017), ('noise', 0.017), ('bene', 0.017), ('tzhang', 0.017), ('run', 0.017), ('competing', 0.017), ('average', 0.016), ('slower', 0.016), ('desired', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 109 nips-2011-Greedy Model Averaging
Author: Dong Dai, Tong Zhang
Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst case regret of the latter decays at the rate of O(1/√n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples.
2 0.14191304 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_{ρt}(S; A)), where I_{ρt}(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_{ρt}(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_{ρt}(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows to provide the algorithm large amount of side information, let the algorithm to decide which side information is relevant for the task, and penalize the algorithm only for the side information that it is using de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round.
3 0.11720444 25 nips-2011-Adaptive Hedge
Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald
Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1
4 0.108407 268 nips-2011-Speedy Q-Learning
Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos
Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and the discount factor γ only T = O(log(n)/(ε²(1 − γ)⁴)) steps are required for the SQL algorithm to converge to an ε-optimal action-value function with high probability. This bound has a better dependency on 1/ε and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning.
5 0.10001283 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
Author: Armen Allahverdyan, Aram Galstyan
Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1
6 0.098603331 217 nips-2011-Practical Variational Inference for Neural Networks
7 0.09731669 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
8 0.087388247 260 nips-2011-Sparse Features for PCA-Like Linear Regression
9 0.084909633 56 nips-2011-Committing Bandits
10 0.078471698 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
11 0.077988565 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
12 0.076681584 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
13 0.074503809 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
14 0.070112161 291 nips-2011-Transfer from Multiple MDPs
15 0.066303812 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
16 0.06585592 220 nips-2011-Prediction strategies without loss
17 0.063931949 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
18 0.063524023 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
19 0.062058933 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
20 0.055527356 198 nips-2011-On U-processes and clustering performance
topicId topicWeight
[(0, 0.138), (1, -0.128), (2, -0.001), (3, -0.027), (4, 0.028), (5, 0.028), (6, 0.017), (7, 0.059), (8, -0.03), (9, 0.038), (10, -0.107), (11, -0.03), (12, -0.063), (13, -0.095), (14, -0.006), (15, -0.057), (16, -0.079), (17, -0.094), (18, 0.026), (19, 0.106), (20, -0.055), (21, 0.032), (22, 0.096), (23, 0.117), (24, -0.124), (25, 0.046), (26, 0.141), (27, 0.066), (28, -0.031), (29, -0.081), (30, -0.185), (31, 0.049), (32, 0.097), (33, 0.087), (34, 0.11), (35, -0.036), (36, -0.01), (37, -0.065), (38, 0.076), (39, 0.038), (40, 0.075), (41, 0.049), (42, -0.006), (43, 0.046), (44, 0.021), (45, -0.063), (46, -0.004), (47, 0.092), (48, 0.012), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.93899733 109 nips-2011-Greedy Model Averaging
Author: Dong Dai, Tong Zhang
Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst case regret of the latter decays at the rate of O(1/√n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples.
2 0.65384197 25 nips-2011-Adaptive Hedge
Author: Tim V. Erven, Wouter M. Koolen, Steven D. Rooij, Peter Grünwald
Abstract: Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods. 1
3 0.64751619 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
Author: Joseph Keshet, David A. McAllester
Abstract: We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized on a given sample. 1
4 0.61761498 268 nips-2011-Speedy Q-Learning
Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos
Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and the discount factor γ only T = O(log(n)/(ε²(1 − γ)⁴)) steps are required for the SQL algorithm to converge to an ε-optimal action-value function with high probability. This bound has a better dependency on 1/ε and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning.
5 0.61238378 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
Author: Samory Kpotufe
Abstract: Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that k-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query x and depend only on the way masses of balls centered at x vary with radius. Furthermore, we show a simple way to choose k = k(x) locally at any x so as to nearly achieve the minimax rate at x in terms of the unknown intrinsic dimension in the vicinity of x. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure. 1
6 0.56766462 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
7 0.53195447 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
8 0.50019932 56 nips-2011-Committing Bandits
9 0.47939265 217 nips-2011-Practical Variational Inference for Neural Networks
10 0.47022861 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
11 0.46709639 241 nips-2011-Scalable Training of Mixture Models via Coresets
12 0.42217439 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
13 0.34815297 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
14 0.34263653 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
15 0.33889806 291 nips-2011-Transfer from Multiple MDPs
16 0.33339152 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
17 0.31778842 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
18 0.31287968 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
19 0.31287572 68 nips-2011-Demixed Principal Component Analysis
20 0.31235102 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
topicId topicWeight
[(0, 0.024), (4, 0.041), (16, 0.228), (20, 0.076), (26, 0.027), (31, 0.052), (33, 0.022), (43, 0.057), (45, 0.124), (56, 0.011), (57, 0.041), (65, 0.035), (74, 0.062), (83, 0.02), (99, 0.072)]
simIndex simValue paperId paperTitle
1 0.78486228 55 nips-2011-Collective Graphical Models
Author: Daniel R. Sheldon, Thomas G. Dietterich
Abstract: There are many settings in which we wish to fit a model of the behavior of individuals but where our data consist only of aggregate information (counts or low-dimensional contingency tables). This paper introduces Collective Graphical Models—a framework for modeling and probabilistic inference that operates directly on the sufficient statistics of the individual model. We derive a highlyefficient Gibbs sampling algorithm for sampling from the posterior distribution of the sufficient statistics conditioned on noisy aggregate observations, prove its correctness, and demonstrate its effectiveness experimentally. 1
same-paper 2 0.76036775 109 nips-2011-Greedy Model Averaging
Author: Dong Dai, Tong Zhang
Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst case regret of the latter decays at the rate of O(1/√n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples.
3 0.64331925 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
4 0.62947023 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
5 0.62265074 244 nips-2011-Selecting Receptive Fields in Deep Networks
Author: Adam Coates, Andrew Y. Ng
Abstract: Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters can grow quadratically in the width of the network, thus necessitating hand-coded “local receptive fields” that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. 1
6 0.61497092 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
7 0.61387247 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
8 0.61360621 154 nips-2011-Learning person-object interactions for action recognition in still images
9 0.61360562 29 nips-2011-Algorithms and hardness results for parallel large margin learning
10 0.61203486 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
11 0.61168516 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
12 0.61149114 80 nips-2011-Efficient Online Learning via Randomized Rounding
13 0.61121458 303 nips-2011-Video Annotation and Tracking with Active Learning
14 0.61102819 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
15 0.61069995 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
16 0.61003363 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
17 0.60992843 287 nips-2011-The Manifold Tangent Classifier
18 0.60916865 265 nips-2011-Sparse recovery by thresholded non-negative least squares
19 0.60898626 25 nips-2011-Adaptive Hedge
20 0.60879612 251 nips-2011-Shaping Level Sets with Submodular Functions