nips nips2007 nips2007-44 knowledge-graph by maker-knowledge-mining

44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

Source: pdf

Author: Tim V. Erven, Steven D. Rooij, Peter Grünwald

Abstract: Bayesian model averaging, model selection and their approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the catch-up phenomenon as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we deﬁne the switch-distribution, a modiﬁcation of the Bayesian model averaging distribution. We prove that in many situations model selection and prediction based on the switch-distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efﬁcient algorithm. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 nl Abstract Bayesian model averaging, model selection and their approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. [sent-9, score-0.256]

2 We prove that in many situations model selection and prediction based on the switch-distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. [sent-13, score-0.321]

3 1 Introduction We consider inference based on a countable set of models (sets of probability distributions), focusing on two tasks: model selection and model averaging. [sent-15, score-0.252]

4 In model selection tasks, the goal is to select the model that best explains the given data. [sent-16, score-0.192]

5 In model averaging, the goal is to ﬁnd the weighted combination of models that leads to the best prediction of future data from the same source. [sent-17, score-0.117]

6 An attractive property of some criteria for model selection is that they are consistent under weak conditions, i. [sent-18, score-0.189]

7 BIC [14], Bayes factor model selection [8], Minimum Description Length (MDL) model selection [3] and prequential model validation [5] are examples of widely used model selection criteria that are usually consistent. [sent-21, score-0.618]

8 However, other model selection criteria such as AIC [1] and leave-one-out cross-validation (LOO) [16], while often inconsistent, do typically yield better predictions. [sent-22, score-0.189]

9 In this paper we reconcile these seemingly conﬂicting approaches [19] by improving the rate of convergence achieved in Bayesian model selection without losing its convergence properties. [sent-25, score-0.234]

10 and parameters therein, Bayesian inference associates each model Mk with the marginal distribution pk , given in (1), obtained by averaging over the parameters according to the prior. [sent-30, score-0.56]

11 In model selection the preferred model is the one with maximum a posteriori probability. [sent-31, score-0.192]

12 By Bayes’ rule this is arg maxk pk (xn )w(k), where w(k) denotes the prior probability of Mk . [sent-32, score-0.514]

13 The resulting distribution pbma (xn ) = k pk (xn )w(k) can be used for prediction. [sent-34, score-0.637]

14 In a se1 quential setting, the probability of a data sequence xn := x1 , . [sent-35, score-0.513]

15 , xn under a distribution p typically decreases exponentially fast in n. [sent-38, score-0.49]

16 It is therefore common to consider − log p(xn ), which we call the codelength of xn achieved by p. [sent-39, score-0.721]

17 We take all logarithms to base 2, allowing us to measure codelength in bits. [sent-40, score-0.163]

18 The name codelength refers to the correspondence between codelength functions and probability distributions based on the Kraft inequality, but one may also think of the codelength as the accumulated log loss that is incurred if we sequentially predict the xi by conditioning on the past, n i. [sent-41, score-0.652]

19 For BMA, we have − log pbma (xn ) = i=1 − log pbma (xi |xi−1 ). [sent-44, score-0.494]

20 Here the ith term represents the loss incurred when predicting xi given xi−1 using pbma (·|xi−1 ), which turns out to be equal to the posterior average: pbma (xi |xi−1 ) = k pk (xi |xi−1 )w(k|xi−1 ). [sent-45, score-0.873]

21 Prediction using pbma has the advantage that the codelength it achieves on xn is close to the codeˆ length of pk , where k is the index of best of the marginals p1 , p2 , . [sent-46, score-1.358]

22 Namely, given a prior w on ˆ model indices, the difference between − log pbma (xn ) = − log( k pk (xn )w(k)) and − log pk (xn ) ˆ ˆ must be in the range [0, − log w(k)], whatever data xn are observed. [sent-49, score-1.853]

23 Thus, using BMA for prediction is sensible if we are satisﬁed with doing essentially as well as the best model under consideration. [sent-50, score-0.117]

24 into a distribution that achieves ˆ smaller codelength than pk ! [sent-54, score-0.666]

25 It shows the accumulated codelength difference − log p2 (xn ) − (− log p1 (xn )) on “The Picture of Dorian Gray” by Oscar Wilde, where p1 and p2 are the Bayesian marginal distributions for the ﬁrst-order and second-order Markov chains, respectively, and each character in the book is an outcome. [sent-59, score-0.361]

26 In more complicated, more realistic model selection scenarios, the models may still be wrong, but it may not be known how to improve them. [sent-61, score-0.156]

27 However, pbma only starts to behave like p2 when it catches up with p1 at a sample size of about 310 000, when the codelength of p2 drops below that of p1 . [sent-66, score-0.366]

28 Thus, in the shaded area pbma behaves like p1 while p2 is making better predictions of those outcomes: since at n = 100 000, p2 is 40 000 bits behind, and at n = 310 000, it has caught up, in between it must have outperformed p1 by 40 000 bits! [sent-67, score-0.204]

29 We argue that failure to take this 20000 effect into account leads to the suboptimal 0 rate of convergence achieved by Bayes fac−20000 tor model selection and related methods. [sent-69, score-0.195]

30 We have developed an alternative method −40000 to combine distributions p1 and p2 into a −60000 single distribution psw , which we call the −80000 switch-distribution, deﬁned in Section 2. [sent-70, score-0.263]

31 psw differs from pbma in that it is based on a prior distribution on sequences of models rather than simply a prior distribution on models. [sent-72, score-0.479]

32 Indeed, the switch-distribution is related to earlier algorithms for tracking the best expert developed in the universal prediction literature [7, 18, 17, 10]; however, the applications we have in mind and the theorems we prove are completely different. [sent-75, score-0.11]

33 In Sections 3 and 4 we show that model selection based on the switch-distribution is consistent (Theorem 1), but unlike standard 2 Bayes factor model selection achieves optimal rates of convergence (Theorem 2). [sent-76, score-0.396]

34 , xn ) denote the ﬁrst n outcomes of X ∞ , such that xn takes values in the product space X n = ∞ X1 × · · · × Xn . [sent-93, score-1.027]

35 Any distribution P (X ∞ ) may be deﬁned by a sequential prediction strategy p that predicts the next outcome at any time n ∈ N. [sent-100, score-0.128]

36 To be precise: Given the previous outcomes xn at time n, this prediction strategy should issue a conditional density p(Xn+1 |xn ) with corresponding distribution P (Xn+1 |xn ) for the next outcome Xn+1 . [sent-101, score-0.698]

37 Such sequential prediction strategies are sometimes called prequential forecasting systems [5]. [sent-102, score-0.248]

38 It is natural to deﬁne the joint density p(xm |xn ) = p(xn+1 |xn ) · · · p(xm |xm−1 ) and let ∞ m P (Xn+1 |xn ) be the unique distribution such that, for all m > n, p(Xn+1 |xn ) is the density of its m ∞ n marginal distribution for Xn+1 . [sent-106, score-0.11]

39 Model Selection and Prediction The goal in model selection is to choose an explanation for observed data xn from a potentially inﬁnite list of candidate models M1 , M2 , . [sent-108, score-0.675]

40 We consider parametric models, which are sets {pθ : θ ∈ Θ} of prediction strategies pθ that are indexed by elements of Θ ⊆ Rd , for some smallest possible d ∈ N, the number of degrees of freedom. [sent-111, score-0.172]

41 A model selection criterion is a function δ : X ∗ → Z+ that, given any data sequence xn ∈ X ∗ , selects the model Mk with index k = δ(xn ). [sent-113, score-0.783]

42 We associate each model Mk with a single prediction strategy pk . [sent-114, score-0.6]

43 The bar emphasizes that pk is a ¯ ¯ meta-strategy based on the prediction strategies in Mk . [sent-115, score-0.6]

44 In many approaches to model selection, for ˆ example AIC and LOO, pk is deﬁned using some estimator θk for each model Mk , which maps a ¯ sequence xn of previous observations to an estimated parameter value that represents a “best guess” of the true/best distribution in the model. [sent-116, score-1.073]

45 Prediction is then based on this estimator: pk (Xn+1 | ¯ xn ) = pθk (xn ) (Xn+1 | xn ), which also deﬁnes a joint density pk (xn ) = pk (x1 ) · · · pk (xn |xn−1 ). [sent-117, score-2.867]

46 ¯ ¯ ¯ ˆ The Bayesian approach to model selection or model averaging goes the other way around. [sent-118, score-0.258]

47 We start out with a prior w on Θk , and deﬁne the Bayesian marginal density pk (xn ) = ¯ pθ (xn )w(θ) dθ. [sent-119, score-0.541]

48 (1) θ∈Θk When pk (xn ) is non-zero this joint density induces a unique conditional density pk (Xn+1 | xn ) = ¯ ¯ pk (Xn+1 , xn )/¯k (xn ), which is equal to the mixture of pθ ∈ Mk according to the posterior, ¯ p w(θ|xn ) = pθ (xn )w(θ)/ pθ (xn )w(θ) dθ, based on xn . [sent-120, score-2.954]

49 Thus the Bayesian approach also deﬁnes a prediction strategy pk (Xn+1 |xn ), whose corresponding distribution may be thought of as ¯ an estimator. [sent-121, score-0.564]

50 Then a prediction strategy p may be based on the Bernoulli ¯ model M = {pθ | θ ∈ [0, 1]} that regards X ∞ as a sequence of independent, identically distributed Bernoulli random variables with Pθ (Xn+1 = 1) = θ. [sent-129, score-0.165]

51 Perhaps surprisingly, the predictor p′ deﬁned by p′ (Xn+1 | xn ) = pθ′ (xn ) (Xn+1 ) equals the Bayesian predictive distribution based on ¯ ¯ ˆ a uniform prior. [sent-135, score-0.525]

52 ) We ﬁrst deﬁne a family Q = {qs : s ∈ S} of combinator prediction strategies that switch between the original prediction strategies. [sent-142, score-0.259]

53 (2) The parameter s ∈ S speciﬁes the identities of m constituent prediction strategies and the sample ′ ′ sizes, called switch-points, at which to switch between them. [sent-150, score-0.202]

54 , (t′ ′ , km′ )), we 1 m ′ ′ ′ deﬁne ti (s) = ti , ki (s) = ki and m(s) = m . [sent-154, score-0.264]

55 For each s ∈ S the corresponding qs ∈ Q is deﬁned as:   pk1 (Xn+1 |xn ) if n < t2 ,    p (X n   k2 n+1 |x ) if t2 ≤ n < t3 ,  . [sent-158, score-0.326]

56   p n  km−1 (Xn+1 |x ) if tm−1 ≤ n < tm ,    p (X n km n+1 |x ) if tm ≤ n. [sent-164, score-0.396]

57 Then the switchdistribution Psw with prior π is the distribution for X ∞ such that, for any n ∈ Z+ , the density of its marginal distribution for X n is given by psw (xn ) = qs (xn ) · π(s). [sent-169, score-0.702]

58 (4) s∈S Although the switch-distribution provides a general way to combine prediction strategies, in this paper it will only be applied to combine prediction strategies p1 , p2 , . [sent-170, score-0.261]

59 ¯ ¯ In this case we may deﬁne a corresponding model selection criterion δsw . [sent-174, score-0.189]

60 To this end, let Kn+1 : S → Z+ be a random variable that denotes the strategy/model that is used to predict Xn+1 given past observations xn . [sent-175, score-0.511]

61 Formally, Kn+1 (s) = ki (s) iff ti (s) ≤ n and i = m(s) ∨ n < ti+1 (s). [sent-176, score-0.132]

62 Algorithm 1, given in Section 5, efﬁciently computes the posterior distribution on Kn+1 given xn : π(Kn+1 = k | xn ) = {s:Kn+1 (s)=k} π psw (xn ) s qs (xn ) , (5) which is deﬁned whenever psw (xn ) is non-zero. [sent-177, score-1.818]

63 We turn this into a model selection criterion δsw (xn ) = arg maxk π(Kn+1 = k|xn ) that selects the model with maximum posterior probability. [sent-178, score-0.299]

64 ¯ ¯ Bayes factor model selection is consistent if for all k, k ′ = k, Pk (X ∞ ) and Pk′ (X ∞ ) are mutually ∞ ¯k (A) = 1 and Pk′ (A) = 0 [3]. [sent-181, score-0.215]

65 For consistency of δsw , we need to strengthen this to the requirement ∞ ∞ ¯ ¯ that, for all k ′ = k and all xn ∈ X ∗ , the distributions Pk (Xn+1 | xn ) and Pk′ (Xn+1 | xn ) are mutually singular. [sent-183, score-1.589]

66 according to each Pθ in all models, but also if X is countable and pk (xn+1 | xn ) > 0 for all k, all xn+1 ∈ X n+1 , then this conditional mutual ¯ ¯ ¯ singularity is automatically implied by ordinary mutual singularity of Pk (X ∞ ) and Pk′ (X ∞ ). [sent-190, score-1.194]

67 4 Let Es = {s′ ∈ S | m(s′ ) > m(s), (ti (s′ ), ki (s′ )) = (ti (s), ki (s)) for i = 1, . [sent-191, score-0.126]

68 be Bayesian prediction ¯ ¯ strategies with respective parameter spaces Θ1 , Θ2 , . [sent-198, score-0.142]

69 Suppose further that Pk (Xn+1 | xn ) and Pk′ (Xn+1 | xn ) are mutually singular ′ + ′ n ∗ ∗ + for all k, k ∈ Z , k = k , x ∈ X . [sent-207, score-1.086]

70 We deﬁne the risk at sample size n ≥ 1 of the ¯ estimator P relative to P ∗ as ¯ ¯ Rn (P ∗ , P ) = EX n−1 ∼P ∗ [D(P ∗ (Xn = · | X n−1 ) P (Xn = · | X n−1 ))], where D(· ·) is the Kullback-Leibler (KL) divergence [4]. [sent-215, score-0.122]

71 The following identity connects information-theoretic redundancy and accumulated statistical risk (see [4] or [6, Chapter 15]): If P ∗ admits a density p∗ , then for all prediction strategies p, ¯ n E X n ∼P ∗ ∗ n ¯ Ri (P ∗ , P ). [sent-218, score-0.368]

72 n [− log p(X ) + log p (X )] = ¯ (8) i=1 For a union of parametric models M = k≥1 Mk , we deﬁne the information closure M = {P ∗ | inf P ∈M D(P ∗ P ) = 0}, i. [sent-219, score-0.19]

73 The theorem requires that the prior π in (4) is of the form (7), and satisﬁes − log πM (m) = O(m) ; − log πK (k) = O(log k) ; − log πT (t) = O(log t). [sent-226, score-0.295]

74 In that case, Theorem 2 shows that the switch-distribution achieves the minimax convergence rate. [sent-231, score-0.145]

75 We restrict ourselves to model selection criteria which, at sample size n, never select a model Mk with k > nτ for some arbitrarily large but ﬁxed τ > 0; note that this condition will be met for most 5 practical model selection criteria. [sent-233, score-0.405]

76 Let h : Z+ → R+ denote the minimax optimal achievable risk as a function of the sample size, i. [sent-234, score-0.134]

77 ,⌈n ⌉} P ∗ ∈M∗ n′ ≥n where the inﬁmum is over all model selection criteria restricted to sample size n, and ⌈·⌉ denotes rounding up to the nearest integer. [sent-239, score-0.213]

78 pδ is the prediction strategy satisfying, for all n′ ≥ n, all ¯ ′ ′ ′ ′ xn ∈ X n , pδ (Xn′ +1 | xn ) := pδ(xn ) (Xn′ +1 | xn ), i. [sent-240, score-1.576]

79 at sample size n it predicts xn+1 using ¯ ¯ n pk for the k = δ(X ) chosen by δ, and it keeps predicting future xn′ +1 by this k. [sent-242, score-0.504]

80 We call h(n) ¯ the minimax optimal rate of convergence for model selection relative to data from M∗ , model list ¯ ¯ M1 , M2 , . [sent-243, score-0.34]

81 Theorem 2 expresses that the accumulated risk of the switch-distribution, as n increases, is not signiﬁcantly larger than the accumulated risk of any other procedure. [sent-263, score-0.222]

82 The proof works by bounding the redundancy of the switch-distribution, which, by (8), is identical to the accumulated risk. [sent-265, score-0.122]

83 , K do initialise wk ← θ · πK (k); wk ← (1 − θ) · πK (k) od a Report prior π(K1 ) = wK1 (a K-sized array) for n = 1, . [sent-278, score-0.373]

84 , K do wk ← wk · pk (xn |xn−1 ); wk ← wk · pk (xn |xn−1 ) od (loss update) a pool ← πT (Z = n | Z ≥ n) · k wk (share update) for k = 1, . [sent-284, score-1.716]

85 If πT (Z = n | Z ≥ n) and πK (k) can be computed in constant time, then the running time is Θ(N · K), which is of the same order as that of fast model selection criteria like AIC and BIC. [sent-289, score-0.189]

86 It is sufﬁcient to show that n s∈Un π s qs (X ) ¯ lim =0 with Pk∗ -probability 1. [sent-300, score-0.35]

87 (12) n n→∞ s∈S π s qs (X ) To see this, suppose the theorem is false. [sent-301, score-0.411]

88 Therefore π(s′ )qs′ (xn ) ≤ (π(s) + π(Es )) qs (xn ) ≤ (1 + c) s′ ∈Un s∈A Deﬁning the mixture r(xn ) = s∈A π(s)qs (xn ). [sent-312, score-0.326]

89 For ¯ tm (s) tm (s) ∞ tm ∞ ¯ all s ∈ A and x ∈X , by deﬁnition Qs (Xtm +1 |x ) equals Pkm (Xtm +1 |xtm ), which is ¯k∗ (X ∞ |xtm ) by assumption. [sent-315, score-0.507]

90 If X is a separable metric space, which mutually singular with P tm +1 holds because X ⊆ Rd for some d ∈ Z+ , it can be shown that this conditional mutual singularity ¯ implies mutual singularity of Qs (X ∞ ) and Pk∗ (X ∞ ). [sent-316, score-0.461]

91 To see this for countable X , let Bxtm be any t t ∞ ∞ ¯ tm |x m ) = 1 and Pk ∗ (Bxtm |x m ) = 0. [sent-317, score-0.229]

92 Any countable mixture e of distributions that are mutually singular with Pk∗ , in particular R, is mutually singular with Pk∗ . [sent-322, score-0.272]

93 1 of [2], which says that for any two mutually singular distributions R and P , the density ratio r(X n )/p(X n ) goes to 0 as n → ∞ with P -probability 1. [sent-324, score-0.161]

94 , ⌈nτ ⌉} be a model selection criterion, restricted to samples of size n, that is minimax optimal, i. [sent-334, score-0.217]

95 Fix an arbitrary n > 0 and let m be the unique integer such that tm < n ≤ tm+1 . [sent-339, score-0.169]

96 We will ﬁrst show that for arbitrary xn , psw achieves redundancy not much worse than qs with s = (t1 , k1 ), . [sent-340, score-1.165]

97 Then we show that the redundancy of this qs is small enough for (15) to hold. [sent-344, score-0.386]

98 Formally, we have, for some c > 0, uniformly for all n, xn ∈ X n , 7 m n − log psw (x ) = − log n ′ n q (x )π(s ) ≤ − log qs (x ) − log πM (m) − s′ s′ ∈S log πT (tj )πK (kj ) j=1 ≤ − log qs (xn ) + c log(n + 1) + cm(τ + 1) log n = − log qs (xn ) + O((log n)2 ). [sent-346, score-2.256]

99 Thus, applying (8), and then (16), and then (8) again, we ﬁnd that n Ri (P ∗ , Psw ) = EX n ∼P ∗ [− log psw (X n ) + log p∗ (X n )] i=1 ≤ EX n ∼P ∗ [− log qs (X n ) + log p∗ (X n )] + O((log n)2 ) m min{tj+1 ,n} n ¯ Ri (P ∗ , Pkj ) + O((log n)2 ). [sent-350, score-0.842]

100 Asymptotic mean efﬁciency of a selection of regression variables. [sent-449, score-0.12]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xn', 0.49), ('pk', 0.458), ('qs', 0.326), ('psw', 0.244), ('pbma', 0.179), ('tm', 0.169), ('codelength', 0.163), ('tj', 0.144), ('wk', 0.144), ('selection', 0.12), ('mk', 0.111), ('aic', 0.098), ('kn', 0.093), ('ri', 0.086), ('pkj', 0.081), ('prequential', 0.081), ('prediction', 0.081), ('ti', 0.069), ('log', 0.068), ('averaging', 0.066), ('bic', 0.065), ('singularity', 0.065), ('xtm', 0.065), ('nh', 0.065), ('theorem', 0.063), ('ki', 0.063), ('accumulated', 0.062), ('minimax', 0.061), ('strategies', 0.061), ('countable', 0.06), ('redundancy', 0.06), ('mutually', 0.059), ('km', 0.058), ('od', 0.057), ('density', 0.055), ('sw', 0.052), ('bayes', 0.052), ('bma', 0.049), ('loo', 0.049), ('supn', 0.049), ('switchdistribution', 0.049), ('risk', 0.049), ('outcomes', 0.047), ('singular', 0.047), ('bayesian', 0.047), ('achieves', 0.045), ('un', 0.04), ('convergence', 0.039), ('switch', 0.036), ('model', 0.036), ('priors', 0.036), ('predictor', 0.035), ('criteria', 0.033), ('xi', 0.033), ('criterion', 0.033), ('rn', 0.033), ('requirement', 0.033), ('bxtm', 0.033), ('ixed', 0.033), ('supi', 0.033), ('wkn', 0.033), ('estimators', 0.03), ('parametric', 0.03), ('mum', 0.03), ('predictors', 0.03), ('estimator', 0.03), ('universal', 0.029), ('list', 0.029), ('nonparametric', 0.029), ('harremo', 0.028), ('hare', 0.028), ('maxk', 0.028), ('rissanen', 0.028), ('mutual', 0.028), ('xm', 0.028), ('prior', 0.028), ('consistency', 0.027), ('ex', 0.027), ('mdl', 0.026), ('sup', 0.026), ('strategy', 0.025), ('bits', 0.025), ('sometimes', 0.025), ('posterior', 0.024), ('closure', 0.024), ('sample', 0.024), ('lim', 0.024), ('sequence', 0.023), ('index', 0.023), ('pool', 0.023), ('predicts', 0.022), ('suppose', 0.022), ('selects', 0.022), ('ml', 0.021), ('past', 0.021), ('satis', 0.02), ('omit', 0.02), ('combine', 0.019), ('relative', 0.019), ('nested', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

Author: Tim V. Erven, Steven D. Rooij, Peter Grünwald

2 0.09664803 43 nips-2007-Catching Change-points with Lasso

Author: Céline Levy-leduc, Zaïd Harchaoui

Abstract: We propose a new approach for dealing with the estimation of the location of change-points in one-dimensional piecewise constant signals observed in white noise. Our approach consists in reframing this task in a variable selection context. We use a penalized least-squares criterion with a 1 -type penalty for this purpose. We prove some theoretical results on the estimated change-points and on the underlying piecewise constant estimated function. Then, we explain how to implement this method in practice by combining the LAR algorithm and a reduced version of the dynamic programming algorithm and we apply it to synthetic and real data. 1

3 0.096580639 78 nips-2007-Efficient Principled Learning of Thin Junction Trees

Author: Anton Chechetka, Carlos Guestrin

Abstract: We present the ﬁrst truly polynomial algorithm for PAC-learning the structure of bounded-treewidth junction trees – an attractive subclass of probabilistic graphical models that permits both the compact representation of probability distributions and efﬁcient exact inference. For a constant treewidth, our algorithm has polynomial time and sample complexity. If a junction tree with sufﬁciently strong intraclique dependencies exists, we provide strong theoretical guarantees in terms of KL divergence of the result from the true distribution. We also present a lazy extension of our approach that leads to very signiﬁcant speed ups in practice, and demonstrate the viability of our method empirically, on several real world datasets. One of our key new theoretical insights is a method for bounding the conditional mutual information of arbitrarily large sets of variables with only polynomially many mutual information computations on ﬁxed-size subsets of variables, if the underlying distribution can be approximated by a bounded-treewidth junction tree. 1

4 0.080885321 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis

Author: Christoforos Christoforou, Paul Sajda, Lucas C. Parra

Abstract: Traditional analysis methods for single-trial classiﬁcation of electroencephalography (EEG) focus on two types of paradigms: phase locked methods, in which the amplitude of the signal is used as the feature for classiﬁcation, e.g. event related potentials; and second order methods, in which the feature of interest is the power of the signal, e.g. event related (de)synchronization. The procedure for deciding which paradigm to use is ad hoc and is typically driven by knowledge of the underlying neurophysiology. Here we propose a principled method, based on a bilinear model, in which the algorithm simultaneously learns the best ﬁrst and second order spatial and temporal features for classiﬁcation of EEG. The method is demonstrated on simulated data as well as on EEG taken from a benchmark data used to test classiﬁcation algorithms for brain computer interfaces. 1 1.1

5 0.078559071 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely based on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our ﬁrst result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning. 1

6 0.078394562 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

7 0.070780307 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

8 0.067290761 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

9 0.065368861 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs

10 0.063527919 118 nips-2007-Learning with Transformation Invariant Kernels

11 0.062979922 159 nips-2007-Progressive mixture rules are deviation suboptimal

12 0.058137845 115 nips-2007-Learning the 2-D Topology of Images

13 0.057159152 32 nips-2007-Bayesian Co-Training

14 0.054021597 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

15 0.050938573 213 nips-2007-Variational Inference for Diffusion Processes

16 0.048493039 106 nips-2007-Invariant Common Spatial Patterns: Alleviating Nonstationarities in Brain-Computer Interfacing

17 0.048309226 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents

18 0.047822814 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI

19 0.046700101 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach

20 0.046091247 187 nips-2007-Structured Learning with Approximate Inference

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.162), (1, 0.004), (2, -0.041), (3, 0.044), (4, -0.016), (5, 0.018), (6, -0.045), (7, 0.003), (8, -0.063), (9, -0.009), (10, -0.006), (11, 0.019), (12, -0.013), (13, -0.044), (14, 0.112), (15, 0.002), (16, 0.069), (17, 0.015), (18, -0.062), (19, -0.09), (20, 0.114), (21, -0.054), (22, 0.033), (23, 0.015), (24, 0.079), (25, 0.033), (26, -0.038), (27, -0.085), (28, -0.018), (29, 0.061), (30, -0.048), (31, 0.114), (32, -0.009), (33, 0.065), (34, 0.012), (35, 0.125), (36, -0.006), (37, -0.125), (38, -0.09), (39, -0.084), (40, -0.177), (41, -0.069), (42, -0.087), (43, 0.179), (44, 0.014), (45, 0.022), (46, 0.125), (47, 0.055), (48, -0.066), (49, -0.093)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96353179 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

Author: Tim V. Erven, Steven D. Rooij, Peter Grünwald

2 0.57064724 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of f -divergences, which turns the estimation into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature. 1

3 0.56168079 159 nips-2007-Progressive mixture rules are deviation suboptimal

Author: Jean-yves Audibert

Abstract: We consider the learning task consisting in predicting as well as the best function in a ﬁnite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisﬁed by the least square loss when the output is bounded), it is known that the progressive mixture rule g satisﬁes ˆ log |G| (1) ER(ˆ) ≤ ming∈G R(g) + Cst g , n where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution.This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is √ no better than Cst / n: it fails to achieve the expected Cst /n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. 1

4 0.54765934 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

Author: Tony Jebara, Yingbo Song, Kapil Thadani

Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya afﬁnity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difﬁcult and laden with local optima. Experiments in density estimation on a variety of datasets conﬁrm the value of isd over iid estimation, id estimation and mixture modeling.

5 0.53245223 43 nips-2007-Catching Change-points with Lasso

Author: Céline Levy-leduc, Zaïd Harchaoui

6 0.46204802 196 nips-2007-The Infinite Gamma-Poisson Feature Model

7 0.45820275 101 nips-2007-How SVMs can estimate quantiles and the median

8 0.44107997 119 nips-2007-Learning with Tree-Averaged Densities and Distributions

9 0.43098539 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents

10 0.41850069 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis

11 0.38369668 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach

12 0.38059786 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

13 0.37135676 78 nips-2007-Efficient Principled Learning of Thin Junction Trees

14 0.37066942 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

15 0.36997625 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data

16 0.36325085 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs

17 0.35960555 193 nips-2007-The Distribution Family of Similarity Distances

18 0.35889953 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

19 0.35340184 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

20 0.34855103 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.036), (13, 0.035), (16, 0.021), (21, 0.057), (31, 0.017), (34, 0.405), (35, 0.044), (47, 0.073), (49, 0.022), (83, 0.128), (85, 0.017), (87, 0.011), (90, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86733508 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

Author: Omer Bobrowski, Ron Meir, Shy Shoham, Yonina Eldar

Abstract: It is becoming increasingly evident that organisms acting in uncertain dynamical environments often employ exact or approximate Bayesian statistical calculations in order to continuously estimate the environmental state, integrate information from multiple sensory modalities, form predictions and choose actions. What is less clear is how these putative computations are implemented by cortical neural networks. An additional level of complexity is introduced because these networks observe the world through spike trains received from primary sensory afferents, rather than directly. A recent line of research has described mechanisms by which such computations can be implemented using a network of neurons whose activity directly represents a probability distribution across the possible “world states”. Much of this work, however, uses various approximations, which severely restrict the domain of applicability of these implementations. Here we make use of rigorous mathematical results from the theory of continuous time point process ﬁltering, and show how optimal real-time state estimation and prediction may be implemented in a general setting using linear neural networks. We demonstrate the applicability of the approach with several examples, and relate the required network properties to the statistical nature of the environment, thereby quantifying the compatibility of a given network with its environment. 1

2 0.84454799 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs

Author: Kyomin Jung, Devavrat Shah

Abstract: We present a new local approximation algorithm for computing MAP and logpartition function for arbitrary exponential family distribution represented by a ﬁnite-valued pair-wise Markov random ﬁeld (MRF), say G. Our algorithm is based on decomposing G into appropriately chosen small components; computing estimates locally in each of these components and then producing a good global solution. We prove that the algorithm can provide approximate solution within arbitrary accuracy when G excludes some ﬁnite sized graph as its minor and G has bounded degree: all Planar graphs with bounded degree are examples of such graphs. The running time of the algorithm is Θ(n) (n is the number of nodes in G), with constant dependent on accuracy, degree of graph and size of the graph that is excluded as a minor (constant for Planar graphs). Our algorithm for minor-excluded graphs uses the decomposition scheme of Klein, Plotkin and Rao (1993). In general, our algorithm works with any decomposition scheme and provides quantiﬁable approximation guarantee that depends on the decomposition scheme.

3 0.83718193 126 nips-2007-McRank: Learning to Rank Using Multiple Classification and Gradient Boosting

Author: Ping Li, Qiang Wu, Christopher J. Burges

Abstract: We cast the ranking problem as (1) multiple classiﬁcation (“Mc”) (2) multiple ordinal classiﬁcation, which lead to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact that perfect classiﬁcations result in perfect DCG scores and the DCG errors are bounded by classiﬁcation errors. We propose using the Expected Relevance to convert class probabilities into ranking scores. The class probabilities are learned using a gradient boosting tree algorithm. Evaluations on large-scale datasets show that our approach can improve LambdaRank [5] and the regressions-based ranker [6], in terms of the (normalized) DCG scores. An efﬁcient implementation of the boosting tree algorithm is also presented.

same-paper 4 0.81811684 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

Author: Tim V. Erven, Steven D. Rooij, Peter Grünwald

5 0.55123603 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to ﬁnd the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches. Graphical models are an effective approach for modeling complex objects via local interactions. In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum aposteriori) assignment problem. In the general case the problem is NP hard, with complexity exponential in the tree-width of the underlying graph. Linear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution is integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bi-partite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not use the graph structure explicitly and thus may be sub-optimal in terms of computational efﬁciency. The max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: For tree-structured graphs, max-product and LP both yield the exact MAP. A recent result [1] showed that for maximum weight matching on bi-partite graphs max-product and LP also yield the exact MAP [1]. Finally, Tree-Reweighted max-product (TRMP) algorithms [5, 10] were shown to converge to the LP solution for binary xi variables, as shown in [6]. In this work, we propose the Max Product Linear Programming algorithm (MPLP) - a very simple variation on max-product that is guaranteed to converge, and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed since coordinate descent may get stuck in suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP 1 solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. We show how the dual variables corresponding to these constraints turn out to be the messages in the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge for these problems) and TRMP. 1 The Max-Product and MPLP Algorithms The max-product algorithm [7] is one of the most often used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution, or in fact converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product, but have several key advantages: • After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. The messages also have a limit point that is a ﬁxed point of the update rule. • No additional parameters (e.g., tree weights as in [6]) are required. • If the ﬁxed point beliefs have a unique maximizer then they correspond to the exact MAP. • For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and variables are binary, MPLP will ﬁnd the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable). Pseudo code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization of a set of dual variables. For EMPLP minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties. 2 The MAP Problem and its LP Relaxation We consider functions over n variables x = {x1 , . . . , xn } deﬁned as follows. Given a graph G = (V, E) with n vertices, and potentials θij (xi , xj ) for all edges ij ∈ E, deﬁne the function1 f (x; θ) = θij (xi , xj ) . (1) ij∈E The MAP problem is deﬁned as ﬁnding an assignment xM that maximizes the function f (x; θ). Below we describe the standard LP relaxation for this problem. Denote by {µij (xi , xj )}ij∈E distributions over variables corresponding to edges ij ∈ E and {µi (xi )}i∈V distributions corresponding to nodes i ∈ V . We will use µ to denote a given set of distributions over all edges and nodes. The set ML (G) is deﬁned as the set of µ where pairwise and singleton distributions are consistent x ˆ xi µij (ˆi , xj ) = µj (xj ) , ˆ xj µij (xi , xj ) = µi (xi ) ∀ij ∈ E, xi , xj ˆ ML (G) = µ ≥ 0 ∀i ∈ V xi µi (xi ) = 1 Now consider the following linear program: MAPLPR : µL∗ = arg max µ∈ML (G) µ·θ. (2) where µ·θ is shorthand for µ·θ = ij∈E xi ,xj θij (xi , xj )µij (xi , xj ). It is easy to show (see e.g., [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e. µL∗ ·θ ≥ f (xM ). Furthermore, when the optimal µi (xi ) have only integral values, the assignment that maximizes µi (xi ) yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR. 1 P We note that some authors also add a term i∈V θi (xi ) to f (x; θ). However, these terms can be included in the pairwise functions θij (xi , xj ), so we ignore them for simplicity. 2 3 The LP Relaxation Dual Since MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of ML (G) with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. The dual is described in the following proposition. Proposition 1 The following optimization problem is a convex dual of MAPLPR DMAPLPR : min max xi i s.t. max βki (xk , xi ) (3) k∈N (i) xk βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ) , where the dual variables are βij (xi , xj ) for all ij, ji ∈ E and values of xi and xj . The dual has an intuitive interpretation in terms of re-parameterizations. Consider the star shaped graph Gi consisting of node i and all its neighbors N (i). Assume the potential on edge ki (for k ∈ N (i)) is βki (xk , xi ). The value of the MAP assignment for this model is max max βki (xk , xi ). This is exactly the term in the objective of DMAPLPR. Thus the dual xi k∈N (i) xk corresponds to individually decoding star graphs around all nodes i ∈ V where the potentials on the graph edges should sum to the original potential. It is easy to see that this will always result in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a β assignment such that star decoding yields the optimal value of MAPLPR. 4 Block Coordinate Descent in the Dual To obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, ﬁx all variables except a subset, and optimize over this subset. It turns out that this can be done in closed form for the cases we consider. We begin by deriving the EMPLP algorithm. Consider ﬁxing all the β variables except those corresponding to some edge ij ∈ E (i.e., βij and βji ), and minimizing DMAPLPR over the non-ﬁxed variables. Only two terms in the DMAPLPR objective depend on βij and βji . We can write those as f (βij , βji ) = max λ−j (xi ) + max βji (xj , xi ) + max λ−i (xj ) + max βij (xi , xj ) i j xi where we deﬁned λ−j (xi ) = i xj k∈N (i)\j xi xi (4) λki (xi ) and λki (xi ) = maxxk βki (xk , xi ) as in App. A. Note that the function f (βij , βji ) depends on the other β values only through λ−i (xj ) and λ−j (xi ). j i This implies that the optimization can be done solely in terms of λij (xj ) and there is no need to store the β values explicitly. The optimal βij , βji are obtained by minimizing f (βij , βji ) subject to the re-parameterization constraint βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ). The following proposition characterizes the minimum of f (βij , βji ). In fact, as mentioned above, we do not need to characterize the optimal βij (xi , xj ) itself, but only the new λ values. Proposition 2 Maximizing the function f (βij , βji ) yields the following λji (xi ) (and the equivalent expression for λij (xj )) 1 −j 1 λji (xi ) = − λi (xi ) + max λ−i (xj ) + θij (xi , xj ) j 2 2 xj The proposition is proved in App. B. The λ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the β optimization affects both λji (xi ) and λij (xj ), both these messages need to be updated simultaneously. We proceed to derive the NMPLP algorithm. For a given node i ∈ V , we consider all its neighbors j ∈ N (i), and wish to optimize over the variables βji (xj , xi ) for ji, ij ∈ E (i.e., all the edges in a star centered on i), while the other variables are ﬁxed. One way of doing so is to use the EMPLP algorithm for the edges in the star, and iterate it until convergence. We now show that the result of 3 Inputs: A graph G = (V, E), potential functions θij (xi , xj ) for each edge ij ∈ E. Initialization: Initialize messages to any value. Algorithm: • Iterate until a stopping criterion is satisﬁed: – Max-product: Iterate over messages and update (cji shifts the max to zero) h i mji (xi )← max m−i (xj ) + θij (xi , xj ) − cji j xj – EMPLP: For each ij ∈ E, update λji (xi ) and λij (xj ) simultaneously (the update for λij (xj ) is the same with i and j exchanged) h i 1 1 λji (xi )← − λ−j (xi ) + max λ−i (xj ) + θij (xi , xj ) j i 2 2 xj – NMPLP: Iterate over nodes i ∈ V and update all γij (xj ) where j ∈ N (i) 2 3 X 2 γij (xj )← max 4θij (xi , xj ) − γji (xi ) + γki (xi )5 xi |N (i)| + 1 k∈N(i) • Calculate node “beliefs”: Set biP i ) to be the sum of incoming messages into node i ∈ V (x (e.g., for NMPLP set bi (xi ) = k∈N(i) γki (xi )). Output: Return assignment x deﬁned as xi = arg maxxi b(ˆi ). x ˆ Figure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use mesP sages mij , λij and γij respectively. We use the notation m−i (xj ) = k∈N(j)\i mkj (xj ). j this optimization can be found in closed form. The assumption about β being ﬁxed outside the star implies that λ−i (xj ) is ﬁxed. Deﬁne: γji (xi ) = maxxj θij (xi , xj ) + λ−i (xj ) . Simple algebra j j yields the following relation between λ−j (xi ) and γki (xi ) for k ∈ N (i) i 2 λ−j (xi ) = −γji (xi ) + γki (xi ) (5) i |N (i)| + 1 k∈N (i) Plugging this into the deﬁnition of γji (xi ) we obtain the NMPLP update in Fig. 1. The messages for both algorithms can be initialized to any value since it can be shown that after one iteration they will correspond to valid β values. 5 Convergence Properties The MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a ﬁxed point of its updates. This in itself does not guarantee convergence to the dual optimum since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difﬁculty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP ﬁxed points and their relation to the corresponding LP. First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained. Proposition 3 If the ﬁxed point of MPLP has bi (xi ) such that for all i the function bi (xi ) has a unique maximizer x∗ , then x∗ is the solution to the MAP problem and the LP relaxation is exact. i Since the dual objective is always greater than or equal to the MAP value, it sufﬁces to show that there exists a dual feasible point whose objective value is f (x∗ ). Denote by β ∗ , λ∗ the value of the corresponding dual parameters at the ﬁxed point of MPLP. Then the dual objective satisﬁes λ∗ (xi ) = ki max i xi k∈N (i) ∗ max βki (xk , x∗ ) = i i k∈N (i) xk ∗ βki (x∗ , x∗ ) = f (x∗ ) k i i 4 k∈N (i) To see why the second equality holds, note that bi (x∗ ) = maxxi ,xj λ−j (xi ) + βji (xj , xi ) and i i bj (x∗ ) = maxxi ,xj λ−i (xj ) + βij (xi , xj ). By the equalization property in Eq. 9 the arguments of j j the two max operations are equal. From the unique maximum assumption it follows that x∗ , x∗ are i j the unique maximizers of the above. It follows that βji , βij are also maximized by x∗ , x∗ . i j In the general case, the MPLP ﬁxed point may not correspond to a primal optimum because of the local optima problem with coordinate descent. However, when the variables are binary, ﬁxed points do correspond to primal solutions, as the following proposition states. Proposition 4 When xi are binary, the MPLP ﬁxed point can be used to obtain the primal optimum. The claim can be shown by constructing a primal optimal solution µ∗ . For tied bi , set µ∗ (xi ) to 0.5 i and for untied bi , set µ∗ (x∗ ) to 1. If bi , bj are not tied we set µ∗ (x∗ , x∗ ) = 1. If bi is not tied but bj i i ij i j is, we set µ∗ (x∗ , xj ) = 0.5. If bi , bj are tied then βji , βij can be shown to be maximized at either ij i x∗ , x∗ = (0, 0), (1, 1) or x∗ , x∗ = (0, 1), (1, 0). We then set µ∗ to be 0.5 at one of these assignment i j i j ij ∗ pairs. The resulting µ∗ is clearly primal feasible. Setting δi = b∗ we obtain that the dual variables i (δ ∗ , λ∗ , β ∗ ) and primal µ∗ satisfy complementary slackness for the LP in Eq. 7 and therefore µ∗ is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables. 6 Beyond pairwise potentials: Generalized MPLP In the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters c1 , . . . , ck ⊂ {1, . . . , n} (the set of clusters is denoted by C), and a function f (x; θ) = c θc (xc ) deﬁned via potentials over clusters θc (xc ). The MAP problem in this case also has an LP relaxation (see e.g. [11]). To deﬁne the LP we introduce the following deﬁnitions: S = {c ∩ c : c, c ∈ C, c ∩ c = ∅} is the set of intersection between clusters ˆ ˆ ˆ and S(c) = {s ∈ S : s ⊆ c} is the set of overlap sets for cluster c.We now consider marginals over the variables in c ∈ C and s ∈ S and require that cluster marginals agree on their overlap. Denote this set by ML (C). The LP relaxation is then to maximize µ · θ subject to µ ∈ ML (C). As in Sec. 4, we can derive message passing updates that result in monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce |S(c)| copies of each marginal µc (xc ) (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation we assume all β are ﬁxed except those corresponding to some cluster c. The resulting messages are λc→s (xs ) from a cluster c to all of its intersection sets s ∈ S(c). The update on these messages turns out to be:   1 1 λ−c (xs ) + max  λ−c (xs ) + θc (xc ) λc→s (xs ) = − 1 − ˆ s s ˆ |S(c)| |S(c)| xc\s s∈S(c)\s ˆ where for a given c ∈ C all λc→s should be updated simultaneously for s ∈ S(c), and λ−c (xs ) is s deﬁned as the sum of messages into s that are not from c. We refer to this algorithm as Generalized EMPLP (GEMPLP). It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here. 7 Related Work Weiss et al. [11] recently studied the ﬁxed points of a class of max-product like algorithms. Their analysis focused on properties of ﬁxed points rather than convergence guarantees. Speciﬁcally, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its ﬁxed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution can be partially decodable. Both these properties are obtained for the MPLP ﬁxed points, and in fact we can show that MPLP satisﬁes the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress however, that [11] does not address convergence of algorithms, but rather properties of their ﬁxed points, if they converge. MPLP is similar in some aspects to Kolmogorov’s TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation and its ﬁxed points also have similar 5 guarantees to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6). Recently, several new dual LP algorithms have been introduced, which are more closely related to our formalism. Werner [12] presented a class of algorithms which also improve the dual LP at every iteration. The simplest of those is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although the updates are different from ours. Independently, Johnson et al. [4] presented a class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose the model into tractable parts by replicating variables and enforce replication constraints within the Lagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However, the updates in the algorithm and the analysis differ. Johnson et al. also presented a method for overcoming the local optimum problem, by smoothing the objective so that it is strictly convex. Such an approach could also be used within our algorithms. Vontobel and Koetter [9] recently introduced a coordinate descent algorithm for decoding LDPC codes. Their method is speciﬁcally tailored for this case, and uses updates that are similar to our edge based updates. Finally, the concept of dual coordinate descent may be used in approximating marginals as well. In [3] we use such an approach to optimize a variational bound on the partition function. The derivation uses some of the ideas used in the MPLP dual, but importantly does not ﬁnd the minimum for each coordinate. Instead, a gradient like step is taken at every iteration to decrease the dual objective. 8 Experiments We compared NMPLP to three other message passing algorithms:2 Tree-Reweighted max-product (TRMP) [10],3 standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard approach of damping messages using a factor of α = 0.5. We ran all algorithms for a maximum of 2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure is deﬁned as follows: At every iteration the beliefs can be used to obtain an assignment x with value f (x). We deﬁne the hit-time as the ﬁrst iteration at which the maximum value of f (x) is achieved.4 We ﬁrst experimented with a 10 × 10 grid graph, with 5 values per state. The function f (x) was 5 a Potts model: f (x) = The values for θij and θi (xi ) ij∈E θij I(xi = xj ) + i∈V θi (xi ). were randomly drawn from [−cI , cI ] and [−cF , cF ] respectively, and we used values of cI and cF in the range range [0.1, 2.35] (with intervals of 0.25), resulting in 100 different models. The clusters for GEMPLP were the faces of the graph [14]. To see if NMPLP converges to the LP solution we also used an LP solver to solve the LP relaxation. We found that the the normalized difference between NMPLP and LP objective was at most 10−3 (median 10−7 ), suggesting that NMPLP typically converged to the LP solution. Fig. 2 (top row) shows the results for the three algorithms. It can be seen that while all non-cluster based algorithms obtain similar f (x) values, NMPLP has better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better f (x) values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the f (x) value was less than 10−5 , suggesting that the exact MAP solution was found. We next applied the algorithms to the real world problems of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of ﬁnding a MAP in an appropriately constructed graphical model.6 We used all algorithms except GNMPLP (since there is no natural choice for clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2 − 158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in f (x) value. The performance of MP was good on the runs where it converged. 2 As expected, NMPLP was faster than EMPLP so only NMPLP results are given. The edge weights for TRMP corresponded to a uniform distribution over all spanning trees. 4 This is clearly a post-hoc measure since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence. 5 The potential θi (xi ) may be folded into the pairwise potential to yield a model as in Eq. 1. 6 Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz 3 6 (a) (b) (c) 100 (d) 0.6 2000 0.04 0.4 0.02 −50 0 −0.02 −0.04 ∆(Value) 0 1000 ∆(Hit Time) ∆(Value) ∆(Hit Time) 50 0 MP TRMP GMPLP 0 −0.2 −1000 −0.4 −0.06 −100 0.2 MP TRMP GMPLP MP TRMP MP TRMP Figure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horiz. red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of f (x) for NMPLP and the other algorithms. All ﬁgures are such that better MPLP performance yields positive Y axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. Only convergent max-product runs are shown. 9 Conclusion We have presented a convergent algorithm for MAP approximation that is based on block coordinate descent of the MAP-LP relaxation dual. The algorithm can also be extended to cluster based functions, which result empirically in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can result in signiﬁcant performance improvements. However generalized max-product algorithms [14] are not guaranteed to converge whereas GMPLP is. Furthermore, the GMPLP algorithm does not require a region graph and only involves intersection between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity, and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications. A Derivation of the dual Before deriving the dual, we ﬁrst express the constraint set ML (G) in a slightly different way. The deﬁnition of ML (G) in Sec. 2 uses a single distribution µij (xi , xj ) for every ij ∈ E. In what follows, we use two copies of this pairwise distribution for every edge, which we denote µij (xi , xj ) ¯ and µji (xj , xi ), and we add the constraint that these two copies both equal the original µij (xi , xj ). ¯ For this extended set of pairwise marginals, we consider the following set of constraints which is clearly equivalent to ML (G). On the rightmost column we give the dual variables that will correspond to each constraint (we omit non-negativity constraints). µij (xi , xj ) = µij (xi , xj ) ¯ µji (xj , xi ) = µij (xi , xj ) ¯ x xi µij (ˆi , xj ) = µj (xj ) ˆ ¯ µji (ˆj , xi ) = µi (xi ) ¯ x xj ˆ xi µi (xi ) = 1 ∀ij ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀ij ∈ E, xj ∀ji ∈ E, xi ∀i ∈ V βij (xi , xj ) βji (xj , xi ) λij (xj ) λji (xi ) δi (6) ¯ We denote the set of (µ, µ) satisfying these constraints by ML (G). We can now state an LP that ¯ is equivalent to MAPLPR, only with an extended set of variables and constraints. The equivalent ¯ problem is to maximize µ · θ subject to (µ, µ) ∈ ML (G) (note that the objective uses the original ¯ µ copy). LP duality transformation of the extended problem yields the following LP min i δi s.t. λij (xj ) − βij (xi , xj ) ≥ 0 βij (xi , xj ) + βji (xj , xi ) = θij (xi , xj ) − k∈N (i) λki (xi ) + δi ≥ 0 ∀ij, ji ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀i ∈ V, xi (7) We next simplify the above LP by eliminating some of its constraints and variables. Since each variable δi appears in only one constraint, and the objective minimizes δi it follows that δi = maxxi k∈N (i) λki (xi ) and the constraints with δi can be discarded. Similarly, since λij (xj ) appears in a single constraint, we have that for all ij ∈ E, ji ∈ E, xi , xj λij (xj ) = maxxi βij (xi , xj ) and the constraints with λij (xj ), λji (xi ) can also be discarded. Using the eliminated δi and λji (xi ) 7 variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 3 is convex since it is a sum of point-wise maxima of convex functions. B Proof of Proposition 2 We wish to minimize f in Eq. 4 subject to the constraint that βij + βji = θij . Rewrite f as f (βij , βji ) = max λ−j (xi ) + βji (xj , xi ) + max λ−i (xj ) + βij (xi , xj ) j i xi ,xj xi ,xj (8) The sum of the two arguments in the max is λ−j (xi ) + λ−i (xj ) + θij (xi , xj ) i j (because of the constraints on β). Thus the minimum must be greater than −j −i 1 2 maxxi ,xj λi (xi ) + λj (xj ) + θij (xi , xj ) . One assignment to β that achieves this minimum is obtained by requiring an equalization condition:7 λ−i (xj ) + βij (xi , xj ) = λ−j (xi ) + βji (xj , xi ) = j i 1 θij (xi , xj ) + λ−j (xi ) + λ−i (xj ) i j 2 (9) which implies βij (xi , xj ) = 1 θij (xi , xj ) + λ−j (xi ) − λ−i (xj ) and a similar expression for βji . i j 2 The resulting λij (xj ) = maxxi βij (xi , xj ) are then the ones in Prop. 2. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship. References [1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientiﬁc, Belmont, MA, 1995. [3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI. 2007. [4] J.K. Johnson, D.M. Malioutov, and A.S. Willsky. Lagrangian relaxation for map estimation in graphical models. In Allerton Conf. Communication, Control and Computing, 2007. [5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006. [6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artiﬁcial Intelligence (UAI). 2005. [7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. [8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and bregman projections. Journal of Machine Learning Research, pages 1627–1653, 2006. [9] P.O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] Y. Weiss, C. Yanover, and T. Meltzer. Map estimation, linear programming and belief propagation with convex free energies. In UAI. 2007. [12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007. [13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Jourmal of Machine Learning Research, 7:1887–1907, 2006. [14] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005. 7 Other solutions are possible but may not yield some of the properties of MPLP. 8

6 0.54097611 83 nips-2007-Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks

7 0.53347361 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

8 0.5159567 41 nips-2007-COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

9 0.50812876 128 nips-2007-Message Passing for Max-weight Independent Set

10 0.50593305 146 nips-2007-On higher-order perceptron algorithms

11 0.505243 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

12 0.50391906 141 nips-2007-New Outer Bounds on the Marginal Polytope

13 0.49570999 39 nips-2007-Boosting the Area under the ROC Curve

14 0.49163908 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers

15 0.49005133 171 nips-2007-Scan Strategies for Meteorological Radars

16 0.48557281 187 nips-2007-Structured Learning with Approximate Inference

17 0.4847953 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

18 0.47830784 142 nips-2007-Non-parametric Modeling of Partially Ranked Data

19 0.47268075 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

20 0.4722715 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs