jmlr jmlr2005 jmlr2005-52 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alexander T. Ihler, John W. Fisher III, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether due to quantization of the messages or model parameters, from other simplified message or model representations, or from stochastic approximation methods. The introduction of such errors into the BP message computations has the potential to affect the solution obtained adversely. We analyze the effect resulting from message approximation under two particular measures of error, and show bounds on the accumulation of errors in the system. This analysis leads to convergence conditions for traditional BP message passing, and both strict bounds and estimates of the resulting error in systems of approximate BP message passing. Keywords: belief propagation, sum-product, convergence, approximate inference, quantization
Reference: text
sentIndex sentText sentNum sentScore
1 At times, even further approximations are required, whether due to quantization of the messages or model parameters, from other simplified message or model representations, or from stochastic approximation methods. [sent-10, score-0.437]
2 This analysis leads to convergence conditions for traditional BP message passing, and both strict bounds and estimates of the resulting error in systems of approximate BP message passing. [sent-13, score-0.44]
3 Recently, some additional justifications for loopy belief propagation have been developed, including a handful of convergence results for graphs with cycles (Weiss, 2000; Tatikonda and Jordan, 2002; Heskes, 2004). [sent-20, score-0.485]
4 The approximate nature of loopy belief propagation is often a more than acceptable price for performing efficient inference; in fact, it is sometimes desirable to make additional approximations. [sent-21, score-0.448]
5 For belief propagation involving continuous, non-Gaussian potentials, some form of approximation is required to obtain a finite parameterization for the messages (Sudderth et al. [sent-24, score-0.378]
6 In distributed message passing, one may approximate the transmitted message to reduce its representational cost (Ihler et al. [sent-32, score-0.385]
7 In order to characterize the approximation effects in graphs with cycles, we analyze the deviation from the solution given by “exact” loopy belief propagation (not, as is typically considered, the deviation of loopy BP from the true marginal distributions). [sent-37, score-0.698]
8 This allows us to derive conditions for the convergence of traditional loopy belief propagation, and bounds on the distance between any pair of BP fixed points (Sections 5. [sent-42, score-0.383]
9 (c) For a graph with cycles, one may show an equivalence between n iterations of loopy BP and the depth-n computation tree [shown here for n = 3 and rooted at node 1; example from Tatikonda and Jordan (2002)]. [sent-62, score-0.374]
10 M_t(x_t) ∝ ψ_t(x_t) ∏_{u∈Γ_t} m_ut(x_t) (2). For tree-structured graphical models, belief propagation can be used to efficiently perform exact marginalization. [sent-71, score-0.386]
11 For loopy BP, the sequence of messages defined by (1) is not guaranteed to converge to a fixed point after any number of iterations. [sent-74, score-0.395]
12 That is, the initial messages m^0_ut in the loopy graph are taken as inputs to the leaf nodes. [sent-90, score-0.574]
13 Thus, the incoming messages to the root node (level 0) correspond to the messages in the nth iteration of loopy BP. [sent-92, score-0.7]
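To make the iteration in (1)-(2) concrete, the following is a minimal sketch of discrete loopy BP with a parallel message schedule; the graph structure, potential values, and helper names are illustrative assumptions rather than anything taken from the paper.

    # Minimal discrete loopy BP sketch (parallel schedule, no damping).
    # psi_node[t] is the local potential over x_t (length-K array);
    # psi_pair[(t, s)] is the pairwise potential indexed [x_t, x_s].
    # All names and values are illustrative.
    import numpy as np

    def loopy_bp(psi_node, psi_pair, neighbors, n_iters=50):
        K = len(next(iter(psi_node.values())))
        # m[(t, s)] is the message from t to s, a length-K function of x_s
        m = {(t, s): np.ones(K) / K for t in neighbors for s in neighbors[t]}
        for _ in range(n_iters):
            new_m = {}
            for (t, s) in m:
                # product of the local potential and all incoming messages except s's
                prod = psi_node[t].copy()
                for u in neighbors[t]:
                    if u != s:
                        prod = prod * m[(u, t)]
                # "convolution" step: sum over x_t against the pairwise potential
                msg = psi_pair[(t, s)].T @ prod
                new_m[(t, s)] = msg / msg.sum()      # normalize
            m = new_m
        # beliefs: local potential times all incoming messages, normalized
        beliefs = {}
        for t in neighbors:
            b = psi_node[t].copy()
            for u in neighbors[t]:
                b = b * m[(u, t)]
            beliefs[t] = b / b.sum()
        return m, beliefs

On a tree this recovers the exact marginals; on a graph with cycles it is exactly the iteration whose fixed points and message perturbations are analyzed below.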
14 We begin by assuming that the “true” messages m_ts(x_s) are some fixed point of BP, so that m^{i+1}_ts = m^i_ts = m_ts. [sent-95, score-1.44]
15 We may ask what happens when these messages are perturbed by some (perhaps small) error function e_ts(x_s). [sent-96, score-0.441]
16 Although there are certainly other possibilities, the fact that BP messages are combined by taking their product makes it natural to consider multiplicative message deviations (or additive in the log-domain): m̂^i_ts(x_s) = m_ts(x_s) e^i_ts(x_s). [sent-97, score-1.199]
17 In the first, we focus on the message products M̂^i_ts(x_t) ∝ ψ_t(x_t) ∏_{u∈Γ_t\s} m̂^i_ut(x_t) and M̂^i_t(x_t) ∝ ψ_t(x_t) ∏_{u∈Γ_t} m̂^i_ut(x_t) (3). [sent-99, score-0.466]
18 The second operation, then, is the message convolution m̂^{i+1}_ts(x_s) ∝ ∫ ψ_ts(x_s, x_t) M̂^i_ts(x_t) dx_t (4), where again M̂ is a normalized message or product of messages. [sent-102, score-1.066]
19 The uppercase quantities M and E refer to their products—at node t, the product of all incoming messages and the local potential is denoted M_t(x_t), its approximation M̂_t(x_t) = M_t(x_t) E_t(x_t), and similarly for M_ts, M̂_ts, and E_ts. [sent-109, score-0.436]
20 Figure 2: (a) A message m(x) and an example approximation m̂(x); (b) their log-ratio log m(x)/m̂(x), and the error measure log d(e). [sent-131, score-0.503]
21 We then apply these properties to analyze the behavior of loopy belief propagation (Section 5). [sent-137, score-0.435]
22 In this section, we examine the following measure on e_ts(x_s): let d(e_ts) denote the function’s dynamic range, specifically d(e_ts) = sup_{a,b} √(e_ts(a)/e_ts(b)). [sent-141, score-0.606]
23 The approximation is exact (i.e., the pointwise equality condition m̂_ts(x) = m_ts(x) ∀x holds) if and only if log d(e_ts) = 0. [sent-144, score-0.993]
24 The dynamic range measure (6) may be equivalently defined by log d(e_ts) = inf_α sup_x |log α m_ts(x) − log m̂_ts(x)| = inf_α sup_x |log α − log e_ts(x)|. [sent-150, score-1.258]
25 The minimum is given by log α = ½ (sup_a log e_ts(a) + inf_b log e_ts(b)), and thus the right-hand side is equal to ½ (sup_a log e_ts(a) − inf_b log e_ts(b)), or ½ sup_{a,b} log e_ts(a)/e_ts(b), which by definition is log d(e_ts). [sent-152, score-1.995]
26 The scalar α serves the purpose of “zero-centering” the function log e_ts(x) and making the measure invariant to simple rescaling. [sent-156, score-0.377]
27 The dynamic range measure can be used to bound the log-approximation error: |log m̂_ts(x) − log m_ts(x)| ≤ 2 log d(e_ts) ∀x. [sent-162, score-1.119]
28 We first consider the magnitude of log α: ∀x, |log(α m_ts(x)/m̂_ts(x))| ≤ log d(e_ts) ⇒ 1/d(e_ts) ≤ α m_ts(x)/m̂_ts(x) ≤ d(e_ts) ⇒ (1/d(e_ts)) ∫ m̂_ts(x) dx ≤ α ∫ m_ts(x) dx ≤ d(e_ts) ∫ m̂_ts(x) dx, and since the messages are normalized, |log α| ≤ log d(e_ts). [sent-164, score-2.769]
29 Then by the triangle inequality, |log m̂_ts(x) − log m_ts(x)| ≤ |log α m_ts(x) − log m̂_ts(x)| + |log α| ≤ 2 log d(e_ts). [sent-165, score-1.597]
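As a small, hedged illustration of the measure (a sketch on a discrete state space with made-up message values, not the paper's experiments), the following computes log d(e) for a multiplicative error between two normalized messages and checks the pointwise bound above:

    # Sketch: dynamic range of a multiplicative message error (discrete case).
    import numpy as np

    rng = np.random.default_rng(0)

    def normalize(f):
        return f / f.sum()

    def log_dynamic_range(e):
        # log d(e) = (1/2) * sup_{a,b} log e(a)/e(b)
        log_e = np.log(e)
        return 0.5 * (log_e.max() - log_e.min())

    m = normalize(rng.random(8))                          # "true" message
    m_hat = normalize(m * rng.uniform(0.5, 2.0, size=8))  # perturbed, renormalized
    e = m_hat / m                                         # multiplicative error

    log_d = log_dynamic_range(e)
    pointwise = np.abs(np.log(m_hat) - np.log(m)).max()
    assert pointwise <= 2 * log_d + 1e-12    # |log m_hat - log m| <= 2 log d(e)
    print(log_d, pointwise)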
30 This bound may be equivalently regarded as a statement about the required quantization level for an accurate implementation of loopy belief propagation. [sent-167, score-0.399]
31 Interestingly, it may also be related to the floating-point precision of m_ts(x). [sent-168, score-0.427]
32 Let m̂_ts(x) be an F-bit mantissa floating-point approximation to m_ts(x). [sent-170, score-0.854]
33 For an F-bit mantissa, we have |m̂_ts(x) − m_ts(x)| < 2^−F · 2^⌊log₂ m_ts(x)⌋ ≤ 2^−F · m_ts(x). [sent-173, score-1.281]
34 The KL-divergence satisfies the inequality D(m_ts ∥ m̂_ts) ≤ 2 log d(e_ts). Proof. [sent-178, score-0.522]
35 By Theorem 2, we have D(m_ts ∥ m̂_ts) = ∫ m_ts(x) log( m_ts(x)/m̂_ts(x) ) dx ≤ ∫ m_ts(x) (2 log d(e_ts)) dx = 2 log d(e_ts). [sent-179, score-2.42]
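A similarly hedged numeric check of this KL bound on an arbitrary discrete example:

    # Sketch: D(m || m_hat) <= 2 log d(e) for normalized discrete messages.
    import numpy as np

    rng = np.random.default_rng(1)
    m = rng.random(10); m /= m.sum()
    m_hat = m * rng.uniform(0.7, 1.4, size=10); m_hat /= m_hat.sum()

    e = m_hat / m
    log_d = 0.5 * (np.log(e).max() - np.log(e).min())
    kl = float(np.sum(m * np.log(m / m_hat)))
    assert kl <= 2 * log_d + 1e-12
    print(kl, 2 * log_d)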
36 The log of the dynamic range measure is sub-additive: log d(E^i_ts) ≤ ∑_{u∈Γ_t\s} log d(e^i_ut) and log d(E^i_t) ≤ ∑_{u∈Γ_t} log d(e^i_ut). [sent-196, score-0.709]
37 Increasing the number of degrees of freedom gives log d(E^i_ts) = ½ log sup_{a,b} ∏_u e^i_ut(a)/e^i_ut(b) ≤ ½ log ∏_u sup_{a_u,b_u} e^i_ut(a_u)/e^i_ut(b_u) = ∑_u log d(e^i_ut). [sent-204, score-0.522]
38 Theorem 6 allows us to bound the error resulting from a combination of the incoming approximations from two different neighbors of the node t. [sent-205, score-0.621]
39 The log of the dynamic range measure satisfies the triangle inequality: log d (e1 e2 ) ≤ log d (e1 ) + log d (e2 ) . [sent-208, score-0.469]
40 Figure 4: Three bounds on the error output d(e) as a function of the error on the product of incoming messages d(E). [sent-217, score-0.747]
41 The outgoing message error d(e^{i+1}_ts) is bounded by d(e^{i+1}_ts) ≤ d(ψ_ts)². [sent-224, score-0.468]
42 The limits of this bound are quite intuitive: for log d(ψ) = 0 (independence of x_t and x_s), this derivative is zero; increasing the error in incoming messages m^i_ut has no effect on the error in m^{i+1}_ts. [sent-229, score-1.241]
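The bounds plotted in Figure 4 compose a per-edge map from the incoming product error d(E) to an outgoing message error of at most (d(ψ)² d(E) + 1)/(d(ψ)² + d(E)); a brief sketch of that map, with arbitrary numbers:

    # Sketch: per-edge error map d(E) -> (d(psi)^2 d(E) + 1) / (d(psi)^2 + d(E)),
    # which never exceeds d(psi)^2 and equals 1 when d(E) = 1 (no incoming error).
    import numpy as np

    def outgoing_error_bound(d_E, d_psi):
        return (d_psi ** 2 * d_E + 1.0) / (d_psi ** 2 + d_E)

    d_psi = 1.5                                   # potential strength, illustrative
    for d_E in [1.0, 2.0, 10.0, 1e6]:
        out = outgoing_error_bound(d_E, d_psi)
        assert out <= d_psi ** 2 + 1e-12          # saturates at d(psi)^2
        print(d_E, out)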
43 We begin by examining loopy belief propagation with exact messages, using the previous results to derive a new sufficient condition for BP convergence to a unique fixed point. [sent-232, score-0.46]
44 This allows us to consider the effect of introducing additional errors into the messages passed at each iteration, showing sufficient conditions for this operation to converge, and a bound on the resulting error from exact loopy BP. [sent-234, score-0.479]
45 We then use the same methodology to analyze the behavior of loopy BP for quantized or otherwise approximated messages and potential functions. [sent-237, score-0.487]
46 Let the “true” messages m_ts be any fixed point of BP, and consider the incoming error observed by a node t at level n − 1 of the computation tree (corresponding to the first iteration of BP), and having parent node s. [sent-250, score-0.846]
47 Note that the assumption that this incoming error is bounded above by some constant log ε_n is trivially true (for any n) with the constant log ε_1 = max_t ∑_{u∈Γ_t} log d(ψ_ut)², since the error on any message m_ut is bounded above by d(ψ_ut)². [sent-252, score-0.805]
48 Theorem 8 bounds the maximum log-error log d(E^{i+1}_ts) at any replica of node t with parent s, where s is on level n − i of the tree (which corresponds to the i-th iteration of loopy BP), by log d(E^{i+1}_ts) ≤ g_ts(log ε_i) = G_ts(ε_i) = ∑_{u∈Γ_t\s} log [ (d(ψ_ut)² ε_i + 1) / (d(ψ_ut)² + ε_i) ] (11). [sent-254, score-0.749]
49 We observe a contraction of the error between iterations i and i + 1 if the bound g_ts(log ε_i) is smaller than log ε_i for every (t, s) ∈ E, and asymptotically achieve log ε_i → 0 if this is the case for any value of ε_i > 1. [sent-255, score-0.395]
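A sketch of the resulting convergence test, iterating the worst-case bound above on a toy graph (the adjacency and the single potential strength are invented for illustration):

    # Sketch: iterate log eps_{i+1} = max_{(t,s)} sum_{u in Gamma_t \ s}
    # log[(d(psi_ut)^2 eps_i + 1) / (d(psi_ut)^2 + eps_i)]; if eps -> 1
    # (log eps -> 0), the bound certifies convergence to a unique fixed point.
    import numpy as np

    def bp_convergence_test(neighbors, d_psi, n_iters=200, tol=1e-10):
        # d_psi[(u, t)] = d(psi_ut) >= 1 for each directed edge
        eps = max(d ** 2 for d in d_psi.values())       # eps_1 = max d(psi)^2
        for _ in range(n_iters):
            log_eps_next = max(
                sum(np.log((d_psi[(u, t)] ** 2 * eps + 1) / (d_psi[(u, t)] ** 2 + eps))
                    for u in neighbors[t] if u != s)
                for t in neighbors for s in neighbors[t]
            )
            eps_next = float(np.exp(log_eps_next))
            if abs(eps_next - eps) < tol:
                eps = eps_next
                break
            eps = eps_next
        return eps

    # Toy 4-cycle with equal-strength potentials (values chosen arbitrarily).
    neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    d_psi = {(u, t): 1.4 for t in neighbors for u in neighbors[t]}
    print(bp_convergence_test(neighbors, d_psi))        # approaches 1 here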
50 However, both Theorem 10 and Theorem 11 depend only on the pairwise potentials ψst (xs , xt ), and not on the single-node potentials ψs (xs ), ψt (xt ). [sent-267, score-0.54]
51 Moreover, applying this bound to another, different fixed point {M̃_t} tells us that all fixed points of loopy BP must lie within a sphere of a given diameter [as measured by log d(M_t/M̃_t)]. [sent-283, score-0.366]
52 Then, after n > 1 iterations of loopy BP resulting in beliefs {M̂^n_t}, for any node t and for all x, log d(M_t/M̂^n_t) ≤ ∑_{u∈Γ_t} log [ (d(ψ_ut)² ε_{n−1} + 1) / (d(ψ_ut)² + ε_{n−1}) ], where ε_i is given by ε_1 = max_{s,t} d(ψ_st)² and log ε_{i+1} = max_{(s,t)∈E} ∑_{u∈Γ_t\s} log [ (d(ψ_ut)² ε_i + 1) / (d(ψ_ut)² + ε_i) ]. [sent-286, score-0.736]
53 Then, for any node t and for all x, |log M_t(x)/M̃_t(x)| ≤ 2 log d(M_t/M̃_t) ≤ 2 ∑_{u∈Γ_t} log [ (d(ψ_ut)² ε + 1) / (d(ψ_ut)² + ε) ] (14), where ε is the largest value satisfying log ε = max_{(s,t)∈E} G_ts(ε) = max_{(s,t)∈E} ∑_{u∈Γ_t\s} log [ (d(ψ_ut)² ε + 1) / (d(ψ_ut)² + ε) ]. [sent-291, score-0.535]
54 The rest follows from Theorem 12—taking the “approximate” messages to be any other fixed point of loopy BP, we see that the error cannot decrease over any number of iterations. [sent-294, score-0.411]
55 Letting the number of iterations i → ∞, we see that the message “errors” log d(M_ts/M̃_ts) must be at most log ε, and thus the difference in M_t (the belief of the root node of the computation tree) must satisfy (14). [sent-297, score-0.469]
56 Then, by additivity we obtain the stated bound on the error log d(E^n_t) at the root node. [sent-311, score-0.57]
57 As before, if this procedure results in log εts → 0 for all (t, s) ∈ E we are guaranteed that there is a unique fixed point for loopy BP; if not, we again obtain a bound on the distance between any two fixed-point beliefs. [sent-313, score-0.363]
58 When the graph is perfectly symmetric (every node has identical neighbors and potential strengths), this yields the same bound as Theorem 12; however, if the potential strengths are inhomogeneous Theorem 14 provides a strictly better bound on loopy BP convergence and errors. [sent-314, score-0.477]
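A sketch of the per-edge refinement just described, tracking a separate bound ε_ts on each directed edge so that inhomogeneous potential strengths are not collapsed into a single worst case (graph and strengths are again invented):

    # Sketch: per-edge contraction test. log_eps[(t, s)] bounds log d(E_ts),
    # the product error entering the message from t to s.
    import numpy as np

    def per_edge_convergence_test(neighbors, d_psi, n_iters=500, tol=1e-12):
        edges = [(t, s) for t in neighbors for s in neighbors[t]]
        # initial worst-case bound: log d(E_ts) <= sum_u log d(psi_ut)^2
        log_eps = {(t, s): sum(2 * np.log(d_psi[(u, t)]) for u in neighbors[t] if u != s)
                   for (t, s) in edges}
        for _ in range(n_iters):
            new_log_eps = {}
            for (t, s) in edges:
                total = 0.0
                for u in neighbors[t]:
                    if u == s:
                        continue
                    d2 = d_psi[(u, t)] ** 2
                    e_ut = np.exp(log_eps[(u, t)])
                    total += np.log((d2 * e_ut + 1) / (d2 + e_ut))
                new_log_eps[(t, s)] = total
            done = max(abs(new_log_eps[e] - log_eps[e]) for e in edges) < tol
            log_eps = new_log_eps
            if done:
                break
        return log_eps        # all values near zero certify a unique fixed point

    # 4-cycle with one strong edge and three weaker ones (illustrative numbers).
    neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    d_psi = {(u, t): (2.0 if {u, t} == {0, 1} else 1.2) for t in neighbors for u in neighbors[t]}
    print(per_edge_convergence_test(neighbors, d_psi))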
59 One grid (a) has equal-strength potentials log d(ψ)² = ω, while the other has many weaker potentials (ω/2). [sent-324, score-0.369]
60 Figure 6: Comparison of various uniqueness bounds: for binary potentials parameterized by η, we find the predicted η_crit at which loopy BP can no longer be guaranteed to be unique. [sent-356, score-0.389]
61 This may be the result of an actual distortion imposed on the message (perhaps to decrease its complexity, for example quantization), or the result of censoring the message update (reusing the message from the previous iteration) when the two are sufficiently similar. [sent-368, score-0.572]
62 Suppose that {M_t} are a fixed point of loopy BP on a graph defined by potentials ψ_ts and ψ_t, and let {M̂^n_t} be the beliefs after n iterations of loopy BP performed on a graph with potentials ψ̂_ts and ψ̂_t, where d(ψ̂_ts/ψ_ts) ≤ δ_1 and d(ψ̂_t/ψ_t) ≤ δ_2. [sent-380, score-0.884]
63 Then, log d(M_t/M̂^n_t) ≤ ∑_{u∈Γ_t} log υ^n_ut + log δ_2, where υ^i_ut is defined by the iteration log υ^{i+1}_ts = log [ (d(ψ_ts)² ε^i_ts + 1) / (d(ψ_ts)² + ε^i_ts) ] + log δ_1, log ε^i_ts = log δ_2 + ∑_{u∈Γ_t\s} log υ^i_ut, with initial condition υ^1_ut = δ_1 d(ψ_ut)². [sent-381, score-1.323]
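A sketch of the bookkeeping in this result, propagating the per-edge bounds υ and ε when the pairwise and single-node potentials are perturbed by at most δ_1 and δ_2 (all numbers invented):

    # Sketch: belief-error bound under approximated potentials,
    # d(psi_hat/psi) <= delta1 (pairwise) and <= delta2 (single-node).
    import numpy as np

    def belief_error_bound(neighbors, d_psi, delta1, delta2, n_iters=100):
        edges = [(t, s) for t in neighbors for s in neighbors[t]]
        # initial condition: upsilon^1_ts = delta1 * d(psi_ts)^2
        log_ups = {(t, s): np.log(delta1) + 2 * np.log(d_psi[(t, s)]) for (t, s) in edges}
        for _ in range(n_iters):
            # eps_ts: bound on the incoming product error for edge (t, s)
            log_eps = {(t, s): np.log(delta2) + sum(log_ups[(u, t)] for u in neighbors[t] if u != s)
                       for (t, s) in edges}
            # upsilon_ts: bound on the outgoing message error for edge (t, s)
            log_ups = {(t, s): np.log((d_psi[(t, s)] ** 2 * np.exp(log_eps[(t, s)]) + 1)
                                      / (d_psi[(t, s)] ** 2 + np.exp(log_eps[(t, s)])))
                       + np.log(delta1)
                       for (t, s) in edges}
        # belief-error bound at each node: sum of incoming upsilons plus log delta2
        return {t: sum(log_ups[(u, t)] for u in neighbors[t]) + np.log(delta2)
                for t in neighbors}

    neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    d_psi = {(t, s): 1.3 for t in neighbors for s in neighbors[t]}
    print(belief_error_bound(neighbors, d_psi, delta1=1.05, delta2=1.02))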
64 Then, E[ (log d(E^i_t))² ] ≤ ∑_{u∈Γ_t} (σ^i_ut)² (18), where σ^1_ts = log d(ψ_ts)² and (σ^{i+1}_ts)² = ( log [ (d(ψ_ts)² λ^i_ts + 1) / (d(ψ_ts)² + λ^i_ts) ] )² + (log δ)², with (log λ^i_ts)² = ∑_{u∈Γ_t\s} (σ^i_ut)². [sent-408, score-0.66]
65 Let us define the (nuisance) scale factor α^i_ts = arg min_α sup_x |log α e^i_ts(x)| for each error e^i_ts, so that the zero-centered log-error is log α^i_ts e^i_ts(x). [sent-410, score-0.49]
66 The assumption of uncorrelated errors is clearly questionable, since propagation around loops may couple the incoming message errors. [sent-414, score-0.421]
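As a hedged sketch of the resulting estimate (an estimate, not a bound), the recursion above can be run numerically, treating the per-iteration perturbations of size at most δ as uncorrelated and adding their squared log-strengths in quadrature; the graph and numbers are invented:

    # Sketch: second-moment estimate of the accumulated message error when each
    # message is perturbed by an error of dynamic range at most delta per
    # iteration and errors are treated as uncorrelated.
    import numpy as np

    def error_variance_estimate(neighbors, d_psi, delta, n_iters=50):
        edges = [(t, s) for t in neighbors for s in neighbors[t]]
        sigma2 = {(t, s): (2 * np.log(d_psi[(t, s)])) ** 2 for (t, s) in edges}  # (sigma^1)^2
        for _ in range(n_iters):
            # (log lambda_ts)^2 = sum of incoming sigma^2 (excluding the target s)
            lam = {(t, s): np.exp(np.sqrt(sum(sigma2[(u, t)] for u in neighbors[t] if u != s)))
                   for (t, s) in edges}
            sigma2 = {(t, s): np.log((d_psi[(t, s)] ** 2 * lam[(t, s)] + 1)
                                     / (d_psi[(t, s)] ** 2 + lam[(t, s)])) ** 2
                      + np.log(delta) ** 2
                      for (t, s) in edges}
        # estimate of E[(log d(E_t))^2] at each node
        return {t: sum(sigma2[(u, t)] for u in neighbors[t]) for t in neighbors}

    neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    d_psi = {(t, s): 1.3 for t in neighbors for s in neighbors[t]}
    print(error_variance_estimate(neighbors, d_psi, delta=1.05))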
67 (Figure: the maximum error max log d(E_t) plotted against the perturbation level δ on log-log axes; the two panels correspond to two different potential strengths log d(ψ)².) [sent-430, score-0.422]
68 More specifically, the tails of a message approximation can become important if two parts of the graph strongly disagree, in which case the tails of each message are the only overlap of significant likelihood. [sent-446, score-0.411]
69 One way to discount this possibility is to consider the graph potentials themselves (in particular, the single node potentials ψt ) as a realization of random variables which “typically” agree, then apply a probabilistic measure to estimate the typical performance. [sent-447, score-0.389]
70 As discussed in the previous section, we treat the {y_t} as random variables; thus almost all quantities in this graph are themselves random variables (as they are dependent on the y_t), so that the single-node observation potentials ψ^y_s(x_s), messages m_st(x_t), etc., are random as well. [sent-460, score-0.408]
71 For models of the form (21)-(22), the (unique) BP message fixed point consists of normalized versions of the likelihood functions mts (xs ) ∝ p(yts |xs ), where yts denotes the set of all observations {yu } such that t separates u from s. [sent-463, score-0.754]
72 Then, for a tree-structured graph, the prior-weight normalized fixed-point message from t to s is precisely m_ts(x_s) = p(y_ts|x_s)/p(y_ts) (23), and the products of incoming messages to t, as defined in Section 2, are M_ts(x_t) = p(x_t|y_ts) and M_t(x_t) = p(x_t|y). [sent-465, score-0.835]
73 We may now apply a posterior-weighted log-error measure, defined by D(m_ut ∥ m̂_ut) = ∫ p(x_t|y) log( m_ut(x_t)/m̂_ut(x_t) ) dx_t (24), and may relate (24) to the Kullback-Leibler divergence. [sent-467, score-1.154]
74 Again, the error D(m_ut ∥ m̂_ut) is a function of the observations y, both explicitly through the term p(x_t|y) and implicitly through the message m̂_ut(x_t), and is thus also a random variable. [sent-472, score-0.838]
75 Specifically, we can see that in expectation over the data y, it is simply E[D(m_ut ∥ m̂_ut)] = E[ ∫ p(x_t) m_ut(x_t) log( m_ut(x_t)/m̂_ut(x_t) ) dx_t ]. [sent-474, score-0.836]
76 First, if x_s is discrete-valued and the prior distribution p(x_s) constant (uniform), the expected message distortion with prior-normalized messages, E[D(m ∥ m̂)], and the KL-divergence of traditionally normalized messages behave equivalently, i.e., [sent-479, score-0.563]
77 E[D(m_ts ∥ m̂_ts)] = E[ D( m_ts/∫m_ts ∥ m̂_ts/∫m̂_ts ) ], where we have abused the notation of KL-divergence slightly to apply it to the normalized likelihood m_ts/∫m_ts. [sent-481, score-2.989]
78 For a true BP fixed-point message m_ut and two approximations m̂_ut, m̃_ut, we assume D(m_ut ∥ m̃_ut) ≤ D(m_ut ∥ m̂_ut) + D(m̂_ut ∥ m̃_ut). [sent-510, score-2.116]
79 For true BP fixed-point messages {m_ut} and approximations {m̂_ut}, we assume D(M_ts ∥ M̂_ts) ≤ ∑_{u∈Γ_t\s} D(m_ut ∥ m̂_ut) (27). [sent-514, score-0.499]
80 For a true BP fixed-point message product M_ts and approximation M̂_ts, we assume D(m_ts ∥ m̂_ts) ≤ (1 − γ_ts) D(M_ts ∥ M̂_ts) (28), where γ_ts = min_{a,b} ∫ min[ ρ(x_s, x_t = a), ρ(x_s, x_t = b) ] dx_s and ρ(x_s, x_t) = ψ_ts(x_s, x_t) ψ^x_s(x_s) / ∫ ψ_ts(x_s, x_t) ψ^x_s(x_s) dx_s. [sent-516, score-1.605]
81 For tree-structured graphical models with the parametrization described by (21)-(22), ρ(xs , xt ) = p(xs |xt ), and γts corresponds to the rate of contraction described by Boyen and Koller (1998). [sent-518, score-0.376]
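A short sketch of computing γ_ts for a discrete pairwise potential (the potential and prior below are arbitrary illustrations); 1 − γ_ts is the contraction factor assumed in (28):

    # Sketch: mixing coefficient gamma_ts for a discrete pairwise potential.
    # rho(x_s, x_t) is the pairwise potential times the single-node potential on
    # x_s, normalized over x_s for each value of x_t.
    import numpy as np

    def gamma_ts(psi_ts, psi_s):
        # psi_ts indexed [x_s, x_t]; psi_s is the single-node potential on x_s
        rho = psi_ts * psi_s[:, None]
        rho = rho / rho.sum(axis=0, keepdims=True)
        K = rho.shape[1]
        return min(np.minimum(rho[:, a], rho[:, b]).sum()
                   for a in range(K) for b in range(K))

    psi = np.array([[2.0, 1.0, 1.0],
                    [1.0, 2.0, 1.0],
                    [1.0, 1.0, 2.0]])          # weakly attractive, illustrative
    prior = np.ones(3)
    print(gamma_ts(psi, prior))                # here 0.75, so contraction 0.25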
82 This again applies Theorem 29 while ignoring any effects due to loops, along with the assumption that the message errors are uncorrelated (implying additivity of the variances of the incoming message errors). [sent-533, score-0.371]
83 Figure 8(a) shows the maximum KL-divergence from the correct fixed point obtained in each Monte Carlo trial for a grid with relatively weak potentials (in which loopy BP is analytically guaranteed to converge). [sent-540, score-0.373]
84 Conclusions and Future Directions: We have described a framework for the analysis of belief propagation stemming from the view that the message at each iteration is a noisy or erroneous version of some true BP fixed point. [sent-546, score-0.408]
85 Further analysis of the propagation of message errors has the potential to give an improved understanding of when and why BP converges (or fails to converge), and potentially also of the role of the message schedule in determining performance. [sent-563, score-0.53]
86 Recall that, for tree-structured models described by (21)-(22), the prior-weight normalized messages of the (unique) fixed point are equivalent to m_ut(x_t) = p(y_ut|x_t)/p(y_ut), and that the message products are given by M_ts(x_t) = p(x_t|y_ts) and M_t(x_t) = p(x_t|y). [sent-607, score-0.663]
87 Furthermore, let us define the approximate messages m̂_ut(x) in terms of some approximate likelihood function, prior-weight normalized in the same way. [sent-608, score-0.503]
88 For a tree-structured graphical model parameterized as in (21)-(22), and given the true BP message m_ut(x_t) and two approximations m̂_ut(x_t), m̃_ut(x_t), suppose that m_ut(x_t) ≤ c_ut m̂_ut(x_t) ∀x_t. [sent-616, score-1.869]
89 Then, D(m_ut ∥ m̃_ut) ≤ D(m_ut ∥ m̂_ut) + c_ut D(m̂_ut ∥ m̃_ut), and furthermore, if m̂_ut(x_t) ≤ c* m̃_ut(x_t) ∀x_t, then m_ut(x_t) ≤ c_ut c* m̃_ut(x_t) ∀x_t. [sent-617, score-2.274]
90 Since m, m̂ are prior-weight normalized (∫ p(x) m(x) dx = ∫ p(x) m̂(x) dx = 1), for a strictly positive prior p(x) we see that c_ut ≥ 1, with equality if and only if m_ut(x) = m̂_ut(x) ∀x. [sent-619, score-0.66]
91 The expected error E[D(M_t ∥ M̂_t)] between the true and approximate beliefs is nearly sub-additive; specifically, E[D(M_t ∥ M̂_t)] ≤ ∑_{u∈Γ_t} E[D(m_ut ∥ m̂_ut)] + Î − I (31), where I = E[ log p(y)/∏_{u∈Γ_t} p(y_ut) ] and Î = E[ log p̂(y)/∏_{u∈Γ_t} p̂(y_ut) ]. [sent-625, score-0.583]
92 Moreover, if m_ut(x_t) ≤ c_ut m̂_ut(x_t) for all x_t and for each u ∈ Γ_t, then M_t(x_t) ≤ ( ∏_{u∈Γ_t} c_ut ) C*_t M̂_t(x_t), where C*_t = [ p̂(y)/∏_{u∈Γ_t} p̂(y_ut) ] / [ p(y)/∏_{u∈Γ_t} p(y_ut) ] (32). Proof. [sent-626, score-0.928]
93 By definition we have E[D(M_t ∥ M̂_t)] = E[ ∫ p(x_t|y) log( M_t(x_t)/M̂_t(x_t) ) dx_t ] = E[ ∫ p(x_t|y) log( p(x_t) p(y|x_t) / ( p(y) M̂_t(x_t) ) ) dx_t ]. [sent-627, score-0.4]
94 On a tree-structured graphical model parameterized as in (21)-(22), the error measure D(M ∥ M̂) satisfies the inequality E[D(m_ts ∥ m̂_ts)] ≤ (1 − γ_ts) E[D(M_ts ∥ M̂_ts)], where γ_ts = min_{a,b} ∫ min[ p(x_s|x_t = a), p(x_s|x_t = b) ] dx_s. [sent-654, score-0.506]
95 Since the conditional f_1 induces independence between x_s and x_t, the first divergence term is zero, and since f_2 is a valid conditional distribution, the second divergence term is less than D( p(x_t|y_ts) ∥ p̂(x_t|y_ts) ) (see Cover and Thomas, 1991). [sent-661, score-0.484]
96 Given a model parameterized as in (21)-(22), p(x) ∝ ∏_{(s,t)∈E} ψ̃_st(x_s, x_t) ∏_s ψ̃^x_s(x_s) (33); then, given a BP fixed point {M̃_st, M̃_s} of (33), we may choose a new parameterization of the same p(x) given by ψ_st(x_s, x_t) = M̃_st(x_s) M̃_ts(x_t) ψ̃_st(x_s, x_t) / ( M̃_s(x_s) M̃_t(x_t) ) and ψ^x_s(x_s) = M̃_s(x_s). [sent-674, score-0.752]
97 However, the messages of loopy BP are no longer precisely equal to the likelihood functions m(x) = p(y|x)/p(y), and thus the expectation applied in Theorem 28 is no longer consistent with the messages themselves. [sent-677, score-0.554]
98 However, the assumption of independence is precisely the same assumption applied by loopy belief propagation itself to perform tractable approximate inference. [sent-680, score-0.448]
99 Thus, for problems in which loopy BP is well-behaved and results in answers similar to the true posterior distributions, we may expect our estimates of belief error to be similarly incorrect but near to the true divergence. [sent-681, score-0.366]
100 On the uniqueness of loopy belief propagation fixed points. [sent-733, score-0.451]
wordName wordTfidf (topN-words)
[('mts', 0.427), ('bp', 0.395), ('mut', 0.318), ('ets', 0.266), ('xt', 0.244), ('loopy', 0.236), ('xs', 0.204), ('message', 0.186), ('messages', 0.159), ('yts', 0.141), ('ut', 0.14), ('potentials', 0.137), ('mt', 0.129), ('ts', 0.119), ('belief', 0.114), ('dxt', 0.105), ('log', 0.095), ('yut', 0.089), ('contraction', 0.085), ('propagation', 0.085), ('hler', 0.083), ('illsky', 0.083), ('oopy', 0.078), ('rrors', 0.078), ('gts', 0.073), ('isher', 0.07), ('onvergence', 0.066), ('essage', 0.066), ('incoming', 0.063), ('node', 0.06), ('ffect', 0.058), ('quantized', 0.057), ('tatikonda', 0.057), ('heskes', 0.056), ('uncorrelated', 0.049), ('quantization', 0.049), ('graphical', 0.047), ('beliefs', 0.046), ('dynamic', 0.042), ('ihler', 0.042), ('mtn', 0.042), ('graph', 0.039), ('errors', 0.038), ('potential', 0.035), ('additivity', 0.035), ('mb', 0.033), ('simon', 0.032), ('triangle', 0.031), ('mae', 0.031), ('mbe', 0.031), ('graphs', 0.027), ('crit', 0.026), ('willsky', 0.026), ('tree', 0.025), ('condition', 0.025), ('cut', 0.024), ('strength', 0.024), ('iteration', 0.023), ('cycles', 0.023), ('neighbors', 0.023), ('convolution', 0.023), ('approximations', 0.022), ('boyen', 0.022), ('pairwise', 0.022), ('strict', 0.021), ('sudderth', 0.021), ('stochastic', 0.021), ('st', 0.02), ('parameterization', 0.02), ('median', 0.02), ('pointwise', 0.019), ('theorem', 0.019), ('ei', 0.018), ('diameter', 0.018), ('bounds', 0.018), ('divergence', 0.018), ('replicas', 0.017), ('bound', 0.017), ('error', 0.016), ('xb', 0.016), ('measure', 0.016), ('koller', 0.016), ('uniqueness', 0.016), ('sup', 0.016), ('dxs', 0.016), ('mti', 0.016), ('mtu', 0.016), ('yedidia', 0.015), ('wainwright', 0.015), ('distance', 0.015), ('strengths', 0.015), ('distortion', 0.014), ('iterations', 0.014), ('parent', 0.013), ('ya', 0.013), ('relied', 0.013), ('mst', 0.013), ('xed', 0.013), ('operation', 0.013), ('approximate', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
Author: Alexander T. Ihler, John W. Fisher III, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether due to quantization of the messages or model parameters, from other simplified message or model representations, or from stochastic approximation methods. The introduction of such errors into the BP message computations has the potential to affect the solution obtained adversely. We analyze the effect resulting from message approximation under two particular measures of error, and show bounds on the accumulation of errors in the system. This analysis leads to convergence conditions for traditional BP message passing, and both strict bounds and estimates of the resulting error in systems of approximate BP message passing. Keywords: belief propagation, sum-product, convergence, approximate inference, quantization
2 0.13498668 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
3 0.13188426 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
Author: Onno Zoeter, Tom Heskes
Abstract: We study the problem of learning two regimes (we have a normal and a prefault regime in mind) based on a train set of non-Markovian observation sequences. Key to the model is that we assume that once the system switches from the normal to the prefault regime it cannot restore and will eventually result in a fault. We refer to the particular setting as semi-supervised since we assume the only information given to the learner is whether a particular sequence ended with a stop (implying that the sequence was generated by the normal regime) or with a fault (implying that there was a switch from the normal to the fault regime). In the latter case the particular time point at which a switch occurred is not known. The underlying model used is a switching linear dynamical system (SLDS). The constraints in the regime transition probabilities result in an exact inference procedure that scales quadratically with the length of a sequence. Maximum aposteriori (MAP) parameter estimates can be found using an expectation maximization (EM) algorithm with this inference algorithm in the E-step. For long sequences this will not be practically feasible and an approximate inference and an approximate EM procedure is called for. We describe a Ä?Ĺš‚exible class of approximations corresponding to different choices of clusters in a Kikuchi free energy with weak consistency constraints. Keywords: change point problems, switching linear dynamical systems, strong junction trees, approximate inference, expectation propagation, Kikuchi free energies
4 0.10337016 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
Author: Marianthi Markatou, Hong Tian, Shameek Biswas, George Hripcsak
Abstract: This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random T T variables Y = Card(S j S j ) and Y ∗ = Card(Sc Sc ), where S j , S j are two training sets, and Sc , j j j c are the corresponding test sets. We prove that the distribution of Y and Y* is hypergeometric Sj and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results in the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation. Keywords: cross-validation, generalization error, moment approximation, prediction, variance estimation
5 0.059566308 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
6 0.058318671 32 jmlr-2005-Expectation Consistent Approximate Inference
7 0.055015523 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
8 0.051239874 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
9 0.048646096 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
10 0.041830737 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
11 0.040095273 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
12 0.031119395 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
13 0.031084917 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
14 0.027891371 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
15 0.027637046 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
16 0.024553472 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
17 0.023151971 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
18 0.023020022 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
19 0.02073472 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
20 0.02023075 61 jmlr-2005-Prioritization Methods for Accelerating MDP Solvers
topicId topicWeight
[(0, 0.172), (1, -0.091), (2, 0.038), (3, -0.178), (4, -0.398), (5, 0.265), (6, 0.072), (7, 0.168), (8, -0.138), (9, -0.05), (10, 0.048), (11, -0.045), (12, 0.091), (13, -0.066), (14, -0.267), (15, 0.0), (16, 0.167), (17, 0.036), (18, 0.043), (19, 0.033), (20, 0.066), (21, -0.1), (22, 0.052), (23, 0.066), (24, 0.051), (25, 0.04), (26, -0.061), (27, 0.025), (28, 0.104), (29, 0.006), (30, -0.046), (31, -0.113), (32, -0.031), (33, -0.135), (34, 0.017), (35, -0.078), (36, -0.141), (37, -0.018), (38, -0.011), (39, 0.069), (40, 0.062), (41, 0.02), (42, -0.053), (43, 0.04), (44, -0.089), (45, -0.056), (46, 0.025), (47, -0.061), (48, -0.055), (49, 0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.96852332 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
Author: Alexander T. Ihler, John W. Fisher III, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether due to quantization of the messages or model parameters, from other simplified message or model representations, or from stochastic approximation methods. The introduction of such errors into the BP message computations has the potential to affect the solution obtained adversely. We analyze the effect resulting from message approximation under two particular measures of error, and show bounds on the accumulation of errors in the system. This analysis leads to convergence conditions for traditional BP message passing, and both strict bounds and estimates of the resulting error in systems of approximate BP message passing. Keywords: belief propagation, sum-product, convergence, approximate inference, quantization
2 0.58769983 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
Author: Onno Zoeter, Tom Heskes
Abstract: We study the problem of learning two regimes (we have a normal and a prefault regime in mind) based on a train set of non-Markovian observation sequences. Key to the model is that we assume that once the system switches from the normal to the prefault regime it cannot restore and will eventually result in a fault. We refer to the particular setting as semi-supervised since we assume the only information given to the learner is whether a particular sequence ended with a stop (implying that the sequence was generated by the normal regime) or with a fault (implying that there was a switch from the normal to the fault regime). In the latter case the particular time point at which a switch occurred is not known. The underlying model used is a switching linear dynamical system (SLDS). The constraints in the regime transition probabilities result in an exact inference procedure that scales quadratically with the length of a sequence. Maximum aposteriori (MAP) parameter estimates can be found using an expectation maximization (EM) algorithm with this inference algorithm in the E-step. For long sequences this will not be practically feasible and an approximate inference and an approximate EM procedure is called for. We describe a Ä?Ĺš‚exible class of approximations corresponding to different choices of clusters in a Kikuchi free energy with weak consistency constraints. Keywords: change point problems, switching linear dynamical systems, strong junction trees, approximate inference, expectation propagation, Kikuchi free energies
3 0.47877684 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
Author: Marianthi Markatou, Hong Tian, Shameek Biswas, George Hripcsak
Abstract: This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random T T variables Y = Card(S j S j ) and Y ∗ = Card(Sc Sc ), where S j , S j are two training sets, and Sc , j j j c are the corresponding test sets. We prove that the distribution of Y and Y* is hypergeometric Sj and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results in the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation. Keywords: cross-validation, generalization error, moment approximation, prediction, variance estimation
4 0.40156466 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
5 0.23288646 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
6 0.2060314 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
7 0.1869968 32 jmlr-2005-Expectation Consistent Approximate Inference
8 0.18455137 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
9 0.18200785 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
10 0.14984627 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
11 0.12991139 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
12 0.12714165 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
13 0.11786316 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
14 0.10818846 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
15 0.10612752 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
16 0.099839419 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
17 0.09898892 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
18 0.095714211 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
19 0.089026742 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
20 0.088227324 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
topicId topicWeight
[(13, 0.049), (17, 0.016), (19, 0.047), (36, 0.02), (37, 0.035), (42, 0.014), (43, 0.033), (47, 0.011), (52, 0.095), (59, 0.021), (70, 0.033), (75, 0.408), (88, 0.079), (90, 0.021), (94, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.70600218 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
Author: Alexander T. Ihler, John W. Fisher III, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether due to quantization of the messages or model parameters, from other simplified message or model representations, or from stochastic approximation methods. The introduction of such errors into the BP message computations has the potential to affect the solution obtained adversely. We analyze the effect resulting from message approximation under two particular measures of error, and show bounds on the accumulation of errors in the system. This analysis leads to convergence conditions for traditional BP message passing, and both strict bounds and estimates of the resulting error in systems of approximate BP message passing. Keywords: belief propagation, sum-product, convergence, approximate inference, quantization
2 0.33070016 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
3 0.32191879 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
4 0.31875697 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
5 0.31803495 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
Author: Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall
Abstract: Many learning algorithms use a metric defined over the input space as a principal tool, and their performance critically depends on the quality of this metric. We address the problem of learning metrics using side-information in the form of equivalence constraints. Unlike labels, we demonstrate that this type of side-information can sometimes be automatically obtained without the need of human intervention. We show how such side-information can be used to modify the representation of the data, leading to improved clustering and classification. Specifically, we present the Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric. We show that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis. If dimensionality reduction is allowed within RCA, we show that it is optimally accomplished by a version of Fisher’s linear discriminant that uses constraints. Moreover, under certain Gaussian assumptions, RCA can be viewed as a Maximum Likelihood estimation of the within class covariance matrix. We conclude with extensive empirical evaluations of RCA, showing its advantage over alternative methods. Keywords: clustering, metric learning, dimensionality reduction, equivalence constraints, side information.
6 0.31736806 64 jmlr-2005-Semigroup Kernels on Measures
7 0.31708777 71 jmlr-2005-Variational Message Passing
8 0.31676441 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
9 0.31500778 39 jmlr-2005-Information Bottleneck for Gaussian Variables
10 0.31129169 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
11 0.3091096 3 jmlr-2005-A Classification Framework for Anomaly Detection
12 0.30892992 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
13 0.30873558 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
14 0.30843404 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
15 0.3068673 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
16 0.30606306 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
17 0.30515727 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
18 0.30339819 20 jmlr-2005-Clustering with Bregman Divergences
19 0.30139485 44 jmlr-2005-Learning Module Networks
20 0.30083403 11 jmlr-2005-Algorithmic Stability and Meta-Learning