NIPS 2004, paper 116: Message Errors in Belief Propagation
Author: Alexander T. Ihler, John W. Fisher, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether from quantization or other simplified message representations or from stochastic approximation methods. Introducing such errors into the BP message computations has the potential to adversely affect the solution obtained. We analyze this effect with respect to a particular measure of message error, and show bounds on the accumulation of errors in the system. This leads both to convergence conditions and error bounds in traditional and approximate BP message passing.
1 Introduction

In particular, the belief propagation (BP, or sum-product) algorithm has become a popular means of solving inference problems exactly or approximately. Recently, further justifications for loopy belief propagation have been developed, including a few convergence results for graphs with cycles [1-3].

The approximate nature of loopy BP is often a more than acceptable price for efficient inference; in fact, it is sometimes desirable to make additional approximations. There may be a number of reasons for this; for example, when exact message representation is computationally intractable, the messages may be approximated stochastically [4] or deterministically by discarding low-likelihood states [5]. For BP involving continuous, non-Gaussian potentials, some form of approximation is required to obtain a finite parametrization for the messages [6-8]. Additionally, graph simplification by edge removal may be regarded as a coarse form of message approximation. Finally, one may wish to approximate the messages and reduce their representation size for another reason: to decrease the communications required for distributed inference applications. In a distributed environment, one may approximate the transmitted message to reduce its representational cost [9], or discard it entirely if it is deemed "sufficiently similar" to the previously sent version [10].

Given that message approximation may be desirable, we would like to know what effect the introduced errors have on our overall solution. To characterize the effect in graphs with cycles, we analyze the deviation from a solution given by "exact" loopy BP (not, as is typically considered, the deviation of loopy BP from the true marginal distributions). In the process, we also develop some results on the convergence of loopy BP.
Specifically, each node s in a graph is associated with a random variable x_s, while the set of edges E describes the conditional dependency structure of the variables. Here we consider graphs with at most pairwise interactions (a typical assumption in BP), where the distribution factors according to

$$p(x) = \prod_{(s,t) \in E} \psi_{st}(x_s, x_t) \prod_{s} \psi_s(x_s) \qquad (1)$$

The goal of belief propagation [12], or BP, is to compute the marginal distribution p(x_t) at each node t. At any iteration, one may calculate the belief at node t by

$$M_t^i(x_t) \propto \psi_t(x_t) \prod_{u \in \Gamma_t} m_{ut}^i(x_t) \qquad (3)$$

For tree-structured graphical models, belief propagation can be used to efficiently perform exact marginalization. Specifically, the message iteration (2) converges in a finite number of iterations (at most the length of the longest path in the graph), after which the belief (3) equals the correct marginal p(x_t). However, as observed by [12], one may also apply belief propagation to arbitrary graphical models by following the same local message-passing rules at each node and ignoring the presence of cycles in the graph; this procedure is typically referred to as "loopy" BP. For loopy BP, the sequence of messages defined by (2) is not guaranteed to converge to a fixed point after any number of iterations. Fixed points, when reached, may not be unique, nor are the results exact (the belief M_t^i does not converge to the true marginal).
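As an illustration of the updates (2)-(3), the following is a minimal sum-product sketch for discrete variables; the 4-cycle graph, the attractive pairwise potential, and the biased single-node potential are hypothetical examples, not taken from the paper.

```python
def loopy_bp(neighbors, psi_pair, psi_node, n_iters=100):
    """Parallel sum-product updates: m_ts(x_s) is proportional to
    sum_{x_t} psi_ts(x_t, x_s) psi_t(x_t) prod_{u in G_t \ s} m_ut(x_t)."""
    k = len(next(iter(psi_node.values())))
    # messages msgs[(t, s)] are lists over the states of x_s, initialized uniform
    msgs = {(t, s): [1.0 / k] * k for t in neighbors for s in neighbors[t]}
    for _ in range(n_iters):
        new = {}
        for t in neighbors:
            for s in neighbors[t]:
                # product of psi_t with all incoming messages except the one from s
                prod = [psi_node[t][xt] for xt in range(k)]
                for u in neighbors[t]:
                    if u != s:
                        prod = [p * msgs[(u, t)][xt] for xt, p in enumerate(prod)]
                # convolve with the pairwise potential, then normalize
                m = [sum(psi_pair[(t, s)][xt][xs] * prod[xt] for xt in range(k))
                     for xs in range(k)]
                z = sum(m)
                new[(t, s)] = [v / z for v in m]
        msgs = new
    beliefs = {}
    for t in neighbors:
        b = [psi_node[t][xt] for xt in range(k)]
        for u in neighbors[t]:
            b = [bi * msgs[(u, t)][xt] for xt, bi in enumerate(b)]
        z = sum(b)
        beliefs[t] = [v / z for v in b]
    return beliefs

# hypothetical 4-cycle: attractive coupling, node 0 biased toward state 0
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
coupling = [[1.5, 1.0], [1.0, 1.5]]
psi_pair = {(t, s): coupling for t in neighbors for s in neighbors[t]}
psi_node = {t: [1.0, 1.0] for t in neighbors}
psi_node[0] = [2.0, 1.0]
beliefs = loopy_bp(neighbors, psi_pair, psi_node)
print(beliefs[0])
```

With attractive potentials, the bias at node 0 pulls every belief toward state 0, as expected.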
Figure 1: For a graph with cycles, one may show an equivalence between n iterations of loopy BP and the n-level computation tree (shown here for n = 3 and rooted at node 1; example from [2]).

It is sometimes convenient to think of loopy BP in terms of its computation tree [2]. The n-level computation tree rooted at some node t is a tree-structured "unrolling" of the graph, so that n iterations of loopy BP on the original graph are equivalent at node t to exact inference on the computation tree.
Figure 2: (a) A message m(x), solid, and its approximation m̂(x), dashed. (b) Their log-ratio log m̂(x)/m(x); log d(e) characterizes their similarity by measuring the error's dynamic range.

We begin by considering multiplicative error functions which describe the difference between a "true" message m(x) (typically meaning one consistent with some BP fixed point) and some approximation m̂(x) = m(x) · e(x).
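For discrete messages the dynamic-range measure can be computed directly; a small sketch, assuming the definition d(e)² = max_{a,b} e(a)/e(b) implied by (9) and Figure 2 (the message values are illustrative):

```python
import math

def log_dyn_range(e):
    """log d(e) for a pointwise multiplicative error e(x):
    d(e)^2 = max_{a,b} e(a)/e(b), so log d(e) = 0.5 * log(max(e)/min(e))."""
    return 0.5 * math.log(max(e) / min(e))

m    = [0.5, 0.3, 0.2]   # "true" message (illustrative values)
mhat = [0.6, 0.3, 0.1]   # its approximation
e    = [a / b for a, b in zip(mhat, m)]
print(log_dyn_range(e))
```

Note that the measure depends only on ratios, so it is invariant to rescaling either message; normalization constants drop out.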
When applied to traditional BP, this results in a novel sufficient condition for its convergence to a unique solution, specifically

$$\max_{(s,t) \in E} \sum_{u \in \Gamma_t \setminus s} \frac{d(\psi_{ut})^2 - 1}{d(\psi_{ut})^2 + 1} < 1, \qquad (4)$$

and may be further improved in most cases. The condition (4) is shown to be slightly stronger than the sufficient condition given in [2].
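Evaluating the left-hand side of (4) on a small graph is straightforward; a sketch in which the 4-cycle and the uniform strength d(ψ) = 1.5 are hypothetical:

```python
def condition4_statistic(neighbors, d_psi):
    """Left-hand side of Eq. (4): max over directed edges (s, t) of
    sum_{u in G_t \ s} (d(psi_ut)^2 - 1) / (d(psi_ut)^2 + 1).
    A value below 1 is sufficient for convergence to a unique fixed point."""
    worst = 0.0
    for t in neighbors:
        for s in neighbors[t]:
            total = sum((d_psi[(u, t)] ** 2 - 1) / (d_psi[(u, t)] ** 2 + 1)
                        for u in neighbors[t] if u != s)
            worst = max(worst, total)
    return worst

# hypothetical 4-cycle with uniform potential strength d(psi) = 1.5
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
d_psi = {(u, t): 1.5 for t in neighbors for u in neighbors[t]}
print(condition4_statistic(neighbors, d_psi))
```

Each node has degree 2, so the sum has a single term, (1.5² − 1)/(1.5² + 1) ≈ 0.385 < 1, and convergence is guaranteed.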
More importantly, however, the method by which it is derived allows us to generalize to many other situations:

• The condition (4) is easily improved for graphs with irregular geometry or potential strengths.
• The method also provides a bound on the distance between any two BP fixed points.
• The same methodology may be applied to the case of quantized or otherwise approximated messages, yielding bounds on the ensuing error (our original motivation).
• By regarding message errors as a stochastic process and applying a few additional assumptions, a similar analysis obtains alternate, tighter estimates (though not necessarily bounds) of performance.
4 Message Approximations

In order to discuss the effects and propagation of errors introduced into the BP messages, we first require a measure of the difference between two messages. This measure may also be related to more traditional error measures, including an absolute error on log m̂(x), a floating-point precision on m̂(x), and the Kullback-Leibler divergence D(m(x) ‖ m̂(x)); for details, see [14]. In this light, our analysis of message approximation (Section 5.3) may be equivalently regarded as a statement about the precision required for an accurate implementation of loopy BP.
Figure 2 shows an example message m(x) and approximation m̂(x) along with their associated error e(x). To facilitate our analysis, we split the message update operation (2) into two parts. In the first, we focus on the message products

$$M_{ts}(x_t) \propto \psi_t(x_t) \prod_{u \in \Gamma_t \setminus s} m_{ut}(x_t), \qquad M_t(x_t) \propto \psi_t(x_t) \prod_{u \in \Gamma_t} m_{ut}(x_t) \qquad (6)$$

where, as usual, the proportionality constant is chosen to normalize M. We show the message error metric is (sub-)additive, i.e., that the errors in each incoming message (at most) add in their impact on M. The second operation is the message convolution

$$m_{ts}(x_s) \propto \int \psi_{ts}(x_t, x_s) M_{ts}(x_t) \, dx_t \qquad (7)$$

where M is a normalized message or product of messages. We demonstrate a level of contraction: the approximation of m_ts is measurably better than the approximation of M_ts used to construct it. We use the convention that lowercase quantities (m_ts, e_ts, ...) refer to messages and message errors, while uppercase ones (M_ts, E_ts, M_t, ...) refer to products of messages or errors: all incoming messages to node t (M_t and E_t), or all except the one from s (M_ts and E_ts).
4.1 Additivity and Error Contraction

The log of (5) is sub-additive, since for several incoming messages {m̂_ut(x)} we have

$$\log d(E_{ts}) = \log d(\hat M_{ts} / M_{ts}) = \log d\Big( \prod_{u \in \Gamma_t \setminus s} e_{ut} \Big) \le \sum_{u \in \Gamma_t \setminus s} \log d(e_{ut}) \qquad (8)$$

We may also derive a minimum rate of contraction on the errors.
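Sub-additivity (8) can be checked numerically for pointwise multiplicative errors; a minimal sketch in which the helper log_dyn_range and the random error values are illustrative:

```python
import math
import random

def log_dyn_range(e):
    # log d(e) = 0.5 * log(max(e)/min(e)) for a pointwise multiplicative error
    return 0.5 * math.log(max(e) / min(e))

random.seed(0)
# three incoming message errors e_ut(x), each over five states
errs = [[random.uniform(0.5, 2.0) for _ in range(5)] for _ in range(3)]
# their pointwise product is the error E_ts on the message product
E = [a * b * c for a, b, c in zip(*errs)]
lhs = log_dyn_range(E)
rhs = sum(log_dyn_range(e) for e in errs)
print(lhs <= rhs)  # sub-additivity, Eq. (8): always True
```

The inequality is immediate: the maximum of a product is at most the product of maxima, and likewise for minima.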
We consider the message from t to s; since all quantities in this section relate to m_ts and M_ts, we suppress the subscripts. The error measure d(e) is given by

$$d(e)^2 = d(m / \hat m)^2 = \max_{a,b} \frac{\int \psi(x_t, a) M(x_t) E(x_t) \, dx_t}{\int \psi(x_t, a) M(x_t) \, dx_t} \cdot \frac{\int \psi(x_t, b) M(x_t) \, dx_t}{\int \psi(x_t, b) M(x_t) E(x_t) \, dx_t} \qquad (9)$$

subject to certain constraints, such as positivity of the messages and potentials. Here the dynamic range of a potential is defined by

$$d(\psi)^2 = \max_{a,b,c,d} \frac{\psi(a, b)}{\psi(c, d)} \qquad (12)$$

However, with some work one may show [14] the stronger measure of contraction

$$d(e) \le \frac{d(\psi)^2 \, d(E) + 1}{d(\psi)^2 + d(E)} \qquad (13)$$

Figure 3: Bounds on the error output d(e) as a function of the error in the product of incoming messages d(E). (Sketch of proof: while the full proof is rather involved, ...)
The bound (13) is shown in Figure 3; note that it improves upon both error bounds (11), shown as straight lines. In the next section, we use (8)-(13) to analyze the behavior of loopy BP.
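The contraction (13) can be evaluated directly; a small sketch in which the values d(ψ) = 2 and d(E) = 3 are illustrative:

```python
def contraction_bound(d_psi, d_E):
    """Right-hand side of Eq. (13): bound on the outgoing-message error d(e)
    given the incoming-product error d(E) and the potential strength d(psi)."""
    return (d_psi ** 2 * d_E + 1.0) / (d_psi ** 2 + d_E)

d_psi, d_E = 2.0, 3.0
b = contraction_bound(d_psi, d_E)
print(b)  # 13/7, strictly below both straight-line bounds d(E) and d(psi)^2
```

Note the limiting behavior: as d(E) grows without bound the result saturates at d(ψ)², and as d(E) → 1 (no incoming error) the outgoing error vanishes, b → 1.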
5 Implications in Graphs with Cycles

We begin by examining loopy BP with exact message passing, using the previous results to derive a new sufficient condition for convergence to a unique fixed point. When this condition is not satisfied, we instead obtain a bound on the relative distances between any two fixed points of the loopy BP equations. We then consider the effect of introducing additional errors into the messages passed at each iteration, showing sufficient conditions for this operation to converge, and a bound on the resulting deviation from exact loopy BP. Dobrushin's condition is a global measure and difficult to verify; a sufficient condition (often called Simon's condition) is given in [2]:

$$\max_t \sum_{u \in \Gamma_t} \log d(\psi_{ut}) < 1 \qquad (14)$$

where d(ψ) is defined as in (12).
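To see how the two conditions can differ, consider a hypothetical node of degree 2 whose potentials both have strength d(ψ) = 1.7; Simon's condition (14) fails while the per-edge sum in (4) stays below 1:

```python
import math

d = 1.7
# Eq. (14) at a degree-2 node: sum over both neighbors
simon = 2 * math.log(d)              # about 1.06 > 1: Simon's condition fails
# Eq. (4) per directed edge: sum over the single remaining neighbor
cond4 = (d ** 2 - 1) / (d ** 2 + 1)  # about 0.49 < 1: convergence still guaranteed
print(simon, cond4)
```

This is consistent with the claim above that (4) is slightly stronger than the condition in [2].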
Let us take the "true" messages m_ts to be any fixed point of BP, and "approximate" them at each iteration by performing loopy BP from some arbitrary initial conditions. Now suppose that the largest message-product error log d(E_ut) at any node u with parent t at level i of the computation tree (corresponding to iteration n − i out of n total iterations of loopy BP) is bounded above by some constant log ε_i. Note that this is trivially true (at any i) for the constant log ε_i = max_{(u,t) ∈ E} |Γ_t| log d(ψ_ut)². Now, we may bound d(E_ts) at any replicate of node t with parent s on level i − 1 of the tree by

$$\log d(E_{ts}) \le g_{ts}(\log \varepsilon_i) = \sum_{u \in \Gamma_t \setminus s} \log \frac{d(\psi_{ut})^2 \, \varepsilon_i + 1}{d(\psi_{ut})^2 + \varepsilon_i} \qquad (15)$$

and we may define log ε_{i−1} = max_{t,s} g_{ts}(log ε_i) to bound the error at level i − 1. This is guaranteed by the conditions g_ts(0) = 0, g′_ts(0) < 1, and g″_ts(z) < 0. Doing so, we see that Simon's condition is sufficient to guarantee (16), but that (16) may hold (implying convergence) when Simon's condition is not satisfied. In this case, the contraction (16) tells us that the outgoing message in either direction is always closer to the BP fixed point than the incoming message. When (16) does not hold, the sequence {ε_i} as defined will decrease until it reaches the largest value ε such that max_{t,s} g_ts(log ε) = log ε.
Figure 4: Two small (5 × 5) grids, with (a) all equal-strength potentials log d(ψ)² = α and (b) several weaker ones (log d(ψ)² = α/2).

If log ε is small (the condition (16) is nearly satisfied), then although we cannot guarantee convergence to a unique fixed point, we can guarantee that every fixed point and our estimate are all mutually close (in a log-ratio sense). In particular, the proofs of (16)-(17) assume that, as a message error propagates through the graph, repeated convolution with only the strongest set of potentials is possible.
Let us consider a message-passing procedure (potentially performed offline) in which node t passes a (scalar) bound υ_ts^i on the message error d(e_ts^i) at iteration i to its neighbor s. The bound may be initialized to υ_ts^1 = d(ψ_ts)², and the next iteration's (updated) outgoing bound is given by the pair of equations

$$\log \upsilon_{ts}^{i+1} = \log \frac{d(\psi_{ts})^2 \, \varepsilon_{ts}^i + 1}{d(\psi_{ts})^2 + \varepsilon_{ts}^i}, \qquad \log \varepsilon_{ts}^i = \sum_{u \in \Gamma_t \setminus s} \log \upsilon_{ut}^i \qquad (18)$$

Here, as in Section 5.1, ε_ts^i bounds the error d(E_ts) in the product of incoming messages. If (18) converges to log υ_ts^i → 0 for all (t, s), we may guarantee a unique fixed point for loopy BP; if not, we may compute log ε_t^i = Σ_{u ∈ Γ_t} log υ_ut^i to obtain a bound on the belief error at each node t. If every node is identical (same number of neighbors, same potential strengths), this yields the same bound as (17); however, if the graph or potential strengths are inhomogeneous, it provides a strictly stronger bound on loopy BP convergence and errors.
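A sketch of the bound propagation (18), again on a hypothetical 4-cycle with uniform strength d(ψ) = 1.5; here the bounds contract toward 1 (i.e., log υ → 0), so a unique fixed point is guaranteed for this graph:

```python
import math

def propagate_bounds(neighbors, d_psi, n_iters=500):
    """Sketch of Eq. (18): ups[(t, s)] bounds d(e_ts).  It is initialized to
    d(psi_ts)^2 and updated by additivity (Eq. 8) plus contraction (Eq. 13)."""
    ups = {(t, s): d_psi[(t, s)] ** 2 for t in neighbors for s in neighbors[t]}
    for _ in range(n_iters):
        new = {}
        for t in neighbors:
            for s in neighbors[t]:
                # additivity: error on the product of incoming messages
                log_eps = sum(math.log(ups[(u, t)])
                              for u in neighbors[t] if u != s)
                eps, d2 = math.exp(log_eps), d_psi[(t, s)] ** 2
                # contraction through the pairwise potential
                new[(t, s)] = (d2 * eps + 1.0) / (d2 + eps)
        ups = new
    return ups

neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
d_psi = {(t, s): 1.5 for t in neighbors for s in neighbors[t]}
ups = propagate_bounds(neighbors, d_psi)
print(max(ups.values()))  # approaches 1.0
```

On an inhomogeneous graph (mixed d(ψ) values), the same loop yields edge-dependent bounds, which is exactly where (18) improves on the uniform bound (17).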
This situation is illustrated in Figure 4: we specify two 5 × 5 grids in terms of their potential strengths and compute bounds on the log-range of their fixed-point beliefs. (While potential strength does not completely specify the graphical model, it is sufficient for all the bounds considered here.) One grid (a) has equal-strength potentials log d(ψ)² = α, while the other has many weaker potentials (α/2). However, the dashed curves show the estimate of (18), which improves only slightly for the strongly coupled graph (a) but considerably for the weaker graph (b).
5.3 Introducing additional errors

As discussed in the introduction, we may wish to introduce or allow additional errors in our messages at each stage, in order to improve the computational or communication efficiency of the algorithm. Specifically, let us describe the error functions log e_ts(x_s) for each x_s as random variables with mean zero and variance σ_ts². By assuming that the errors in each incoming message are uncorrelated, we obtain additivity of their variances: $\Sigma_{ts}^2 = \sum_{u \in \Gamma_t \setminus s} \sigma_{ut}^2$. The assumption of uncorrelated errors is clearly questionable, since propagation around loops may couple the incoming message errors, but it is common in quantization analysis, and we shall see that it appears reasonable in practice. We would also like to estimate the contraction of variance incurred in the convolution step. We may do so by applying a simple sigma-point quadrature ("unscented") approximation [16], in which the standard deviation of the convolved function m_ts(x_s) is estimated by applying the same nonlinearity (13) to the standard deviation of the error on the incoming product M_ts.
6 Experiments

We demonstrate the error bounds for perturbed messages with a set of Monte Carlo trials. In particular, for each trial we construct a binary-valued 5 × 5 grid with uniform potential strengths, which are either (1) all positively correlated, or (2) randomly chosen to be positively or negatively correlated (equally likely); we also assign random single-node potentials to each x_s. We then run a quantized version of BP, rounding each log-message to discrete values separated by 2 log δ (ensuring that the newly introduced error satisfies d(e) ≤ δ).
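The quantization step can be sketched as rounding each log-message to a grid of spacing 2 log δ; the message values and δ used here are illustrative:

```python
import math

def quantize_log_message(m, delta):
    """Round each log-message value to a grid of spacing 2*log(delta);
    the introduced multiplicative error then satisfies d(e) <= delta."""
    step = 2.0 * math.log(delta)
    q = [math.exp(round(math.log(v) / step) * step) for v in m]
    z = sum(q)
    return [v / z for v in q]

m = [0.5, 0.3, 0.2]
delta = 1.1
mq = quantize_log_message(m, delta)
e = [a / b for a, b in zip(mq, m)]
d_e = math.sqrt(max(e) / min(e))
print(d_e <= delta)  # renormalization rescales e uniformly, leaving d(e) unchanged
```

Each log value moves by at most log δ, so the ratio max(e)/min(e) is at most δ², giving d(e) ≤ δ regardless of normalization.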
Figure 5 shows the maximum belief error in each of 100 trials of this procedure for various values of δ. Also shown are the bound on belief error developed in Section 5.3 and the 2σ estimate computed assuming uncorrelated message errors.
Figure 5: Maximum belief errors incurred as a function of the quantization error, for (a) log d(ψ)² = .25 and (b) log d(ψ)² = 1; axes show max_t log d(E_t) versus log δ, with curves for the strict bound and the stochastic estimate, under positively correlated and mixed potentials.
The scatterplot indicates the maximum error measured in the graph for each of 200 Monte Carlo runs; this is strictly bounded above by the solution of (18), solid, and bounded with high probability (assuming uncorrelated errors) by (20), dashed.

7 Conclusions

We have described a particular measure of distortion on BP messages and shown that it is sub-additive and measurably contractive, leading to sufficient conditions for loopy BP to converge to a unique fixed point. Furthermore, this enables analysis of quantized, stochastic, or other approximate forms of BP, yielding sufficient conditions for convergence and bounds on the deviation from exact message passing.
References
- On the uniqueness of loopy belief propagation fixed points.
- Constructing free energy approximations and generalized belief propagation algorithms.
wordName wordTfidf (topN-words)
[('message', 0.428), ('bp', 0.379), ('mts', 0.327), ('loopy', 0.27), ('ets', 0.236), ('ts', 0.185), ('messages', 0.175), ('ut', 0.17), ('xt', 0.164), ('xs', 0.162), ('belief', 0.133), ('log', 0.131), ('gts', 0.127), ('potentials', 0.102), ('errors', 0.099), ('propagation', 0.095), ('incoming', 0.089), ('simon', 0.089), ('contraction', 0.087), ('dxt', 0.087), ('bound', 0.083), ('condition', 0.079), ('node', 0.079), ('bounds', 0.078), ('quantization', 0.076), ('mut', 0.073), ('ihler', 0.072), ('graph', 0.07), ('graphical', 0.065), ('error', 0.06), ('eut', 0.055), ('uncorrelated', 0.052), ('cycles', 0.052), ('convergence', 0.048), ('strengths', 0.047), ('iteration', 0.045), ('potential', 0.045), ('grids', 0.043), ('graphs', 0.042), ('mb', 0.04), ('quantized', 0.04), ('xed', 0.039), ('converge', 0.039), ('stochastic', 0.039), ('unique', 0.036), ('dobrushin', 0.036), ('mbe', 0.036), ('measurably', 0.036), ('mti', 0.036), ('tatikonda', 0.036), ('may', 0.035), ('tree', 0.034), ('outgoing', 0.033), ('max', 0.033), ('weaker', 0.033), ('mt', 0.032), ('additivity', 0.032), ('lincoln', 0.032), ('guarantee', 0.031), ('suf', 0.031), ('marginal', 0.03), ('exact', 0.03), ('grid', 0.029), ('approximate', 0.029), ('iii', 0.029), ('uai', 0.029), ('deviation', 0.029), ('mae', 0.029), ('nonuniform', 0.029), ('positively', 0.029), ('tighter', 0.028), ('convolution', 0.028), ('measure', 0.028), ('approximations', 0.028), ('satis', 0.028), ('approximation', 0.027), ('wainwright', 0.027), ('sudderth', 0.027), ('willsky', 0.027), ('strict', 0.027), ('gibbs', 0.026), ('fisher', 0.026), ('estimate', 0.025), ('loops', 0.025), ('perturbations', 0.025), ('strictly', 0.025), ('oxford', 0.025), ('stronger', 0.023), ('inference', 0.023), ('rooted', 0.023), ('laboratory', 0.023), ('operation', 0.023), ('bounded', 0.023), ('passing', 0.022), ('additional', 0.022), ('conditions', 0.022), ('neighbors', 0.022), ('distortion', 0.021), ('incurred', 0.021), ('implying', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 116 nips-2004-Message Errors in Belief Propagation
Author: Alexander T. Ihler, John W. Fisher, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether from quantization or other simplified message representations or from stochastic approximation methods. Introducing such errors into the BP message computations has the potential to adversely affect the solution obtained. We analyze this effect with respect to a particular measure of message error, and show bounds on the accumulation of errors in the system. This leads both to convergence conditions and error bounds in traditional and approximate BP message passing. 1
2 0.31296134 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks
Author: Joris M. Mooij, Hilbert J. Kappen
Abstract: We introduce a computationally efficient method to estimate the validity of the BP method as a function of graph topology, the connectivity strength, frustration and network size. We present numerical results that demonstrate the correctness of our estimates for the uniform random model and for a real-world network (“C. Elegans”). Although the method is restricted to pair-wise interactions, no local evidence (zero “biases”) and binary variables, we believe that its predictions correctly capture the limitations of BP for inference and MAP estimation on arbitrary graphical models. Using this approach, we find that BP always performs better than MF. Especially for large networks with broad degree distributions (such as scale-free networks) BP turns out to significantly outperform MF. 1
3 0.17763476 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
Author: Erik B. Sudderth, Michael I. Mandel, William T. Freeman, Alan S. Willsky
Abstract: We describe a three–dimensional geometric hand model suitable for visual tracking applications. The kinematic constraints implied by the model’s joints have a probabilistic structure which is well described by a graphical model. Inference in this model is complicated by the hand’s many degrees of freedom, as well as multimodal likelihoods caused by ambiguous image measurements. We use nonparametric belief propagation (NBP) to develop a tracking algorithm which exploits the graph’s structure to control complexity, while avoiding costly discretization. While kinematic constraints naturally have a local structure, self– occlusions created by the imaging process lead to complex interpendencies in color and edge–based likelihood functions. However, we show that local structure may be recovered by introducing binary hidden variables describing the occlusion state of each pixel. We augment the NBP algorithm to infer these occlusion variables in a distributed fashion, and then analytically marginalize over them to produce hand position estimates which properly account for occlusion events. We provide simulations showing that NBP may be used to refine inaccurate model initializations, as well as track hand motion through extended image sequences. 1
4 0.13525669 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters
Author: Tim K. Marks, J. C. Roddey, Javier R. Movellan, John R. Hershey
Abstract: We present a generative model and stochastic filtering algorithm for simultaneous tracking of 3D position and orientation, non-rigid motion, object texture, and background texture using a single camera. We show that the solution to this problem is formally equivalent to stochastic filtering of conditionally Gaussian processes, a problem for which well known approaches exist [3, 8]. We propose an approach based on Monte Carlo sampling of the nonlinear component of the process (object motion) and exact filtering of the object and background textures given the sampled motion. The smoothness of image sequences in time and space is exploited by using Laplace’s method to generate proposal distributions for importance sampling [7]. The resulting inference algorithm encompasses both optic flow and template-based tracking as special cases, and elucidates the conditions under which these methods are optimal. We demonstrate an application of the system to 3D non-rigid face tracking. 1 Background Recent algorithms track morphable objects by solving optic flow equations, subject to the constraint that the tracked points belong to an object whose non-rigid deformations are linear combinations of a set of basic shapes [10, 2, 11]. These algorithms require precise initialization of the object pose and tend to drift out of alignment on long video sequences. We present G-flow, a generative model and stochastic filtering formulation of tracking that address the problems of initialization and error recovery in a principled manner. We define a non-rigid object by the 3D locations of n vertices. The object is a linear combination of k fixed morph bases, with coefficients c = [c1 , c2 , · · · , ck ]T . The fixed 3 × k matrix hi contains the position of the ith vertex in all k morph bases. The transformation from object-centered to image coordinates consists of a rotation, weak perspective projection, and translation. 
Thus xi , the 2D location of the ith vertex on the image plane, is xi = grhi c + l, (1) where r is the 3 × 3 rotation matrix, l is the 2 × 1 translation vector, and g = 1 0 0 is the 010 projection matrix. The object pose, ut , comprises both the rigid motion parameters and the morph parameters at time t: ut = {r(t), l(t), c(t)}. (2) 1.1 Optic flow Let yt represent the current image, and let xi (ut ) index the image pixel that is rendered by the ith object vertex when the object assumes pose ut . Suppose that we know ut−1 , the pose at time t − 1, and we want to find ut , the pose at time t. This problem can be solved by minimizing the following form with respect to ut : ut = argmin ˆ ut 1 2 n 2 [yt (xi (ut )) − yt−1 (xi (ut−1 ))] . (3) i=1 In the special case in which the xi (ut ) are neighboring points that move with the same 2D displacement, this reduces to the standard Lucas-Kanade optic flow algorithm [9, 1]. Recent work [10, 2, 11] has shown that in the general case, this optimization problem can be solved efficiently using the Gauss-Newton method. We will take advantage of this fact to develop an efficient stochastic inference algorithm within the framework of G-flow. Notational conventions Unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for fixed model parameters. Subscripted colons indicate sequences: e.g., X1:t = X1 · · · Xt . The term In stands for the n × n identity matrix, E for expected value, V ar for the covariance matrix, and V ar−1 for the inverse of the covariance matrix (precision matrix). 2 The Generative Model for G-Flow Figure 1: Left: a(Ut ) determines which texel (color at a vertex of the object model or a pixel of the background model) is responsible for rendering each image pixel. Right: G-flow video generation model: At time t, the object’s 3D pose, Ut , is used to project the object texture, Vt , into 2D. 
This projection is combined with the background texture, Bt , to generate the observed image, Yt . We model the image sequence Y as a stochastic process generated by three hidden causes, U , V , and B, as shown in the graphical model (Figure 1, right). The m × 1 random vector Yt represents the m-pixel image at time t. The n × 1 random vector Vt and the m × 1 random vector Bt represent the n-texel object texture and the m-texel background texture, respectively. As illustrated in Figure 1, left, the object pose, Ut , determines onto which image pixels the object and background texels project at time t. This is formulated using the projection function a(Ut ). For a given pose, ut , the projection a(ut ) is a block matrix, def a(ut ) = av (ut ) ab (ut ) . Here av (ut ), the object projection function, is an m × n matrix of 0s and 1s that tells onto which image pixel each object vertex projects; e.g., a 1 at row j, column i it means that the ith object point projects onto image pixel j. Matrix ab plays the same role for background pixels. Assuming the foreground mapping is one-toone, we let ab = Im −av (ut )av (ut )T , expressing the simple occlusion constraint that every image pixel is rendered by object or background, but not both. In the G-flow generative model: Vt Yt = a(Ut ) + Wt Wt ∼ N (0, σw Im ), σw > 0 Bt (4) Ut ∼ p(ut | ut−1 ) v v Vt = Vt−1 + Zt−1 Zt−1 ∼ N (0, Ψv ), Ψv is diagonal b b Bt = Bt−1 + Zt−1 Zt−1 ∼ N (0, Ψb ), Ψb is diagonal where p(ut | ut−1 ) is the pose transition distribution, and Z v , Z b , W are independent of each other, of the initial conditions, and over time. The form of the pose distribution is left unspecified since the algorithm proposed here does not require the pose distribution or the pose dynamics to be Gaussian. For the initial conditions, we require that the variance of V1 and the variance of B1 are both diagonal. 
Non-rigid 3D tracking is a difficult nonlinear filtering problem because changing the pose has a nonlinear effect on the image pixels. Fortunately, the problem has a rich structure that we can exploit: under the G-flow model, video generation is a conditionally Gaussian process [3, 6, 4, 5]. If the specific values taken by the pose sequence, u1:t , were known, then the texture processes, V and B, and the image process, Y , would be jointly Gaussian. This suggests the following scheme: we could use particle filtering to obtain a distribution of pose experts (each expert corresponds to a highly probable sample of pose, u1:t ). For each expert we could then use Kalman filtering equations to infer the posterior distribution of texture given the observed images. This method is known in the statistics community as a Monte Carlo filtering solution for conditionally Gaussian processes [3, 4], and in the machine learning community as Rao-Blackwellized particle filtering [6, 5]. We found that in addition to Rao-Blackwellization, it was also critical to use Laplace’s method to generate the proposal distributions for importance sampling [7]. In the context of G-flow, we accomplished this by performing an optic flow-like optimization, using an efficient algorithm similar to those in [10, 2]. 3 Inference Our goal is to find an expression for the filtering distribution, p(ut , vt , bt | y1:t ). Using the law of total probability, we have the following equation for the filtering distribution: p(ut , vt , bt | y1:t ) = p(ut , vt , bt | u1:t−1 , y1:t ) p(u1:t−1 | y1:t ) du1:t−1 Opinion of expert (5) Credibility of expert We can think of the integral in (5) as a sum over a distribution of experts, where each expert corresponds to a single pose history, u1:t−1 . Based on its hypothesis about pose history, each expert has an opinion about the current pose of the object, Ut , and the texture maps of the object and background, Vt and Bt . 
Each expert also has a credibility, a scalar that measures how well the expert's opinion matches the observed image y_t. Thus, (5) can be interpreted as follows: the filtering distribution at time t is obtained by integrating, over the entire ensemble of experts, the opinion of each expert weighted by that expert's credibility. The opinion distribution of expert u_{1:t−1} factorizes into the expert's opinion about the pose U_t times the conditional distribution of the textures V_t, B_t given the pose:

  p(u_t, v_t, b_t | u_{1:t−1}, y_{1:t}) = p(u_t | u_{1:t−1}, y_{1:t}) p(v_t, b_t | u_{1:t}, y_{1:t})    (6)
                                          [pose opinion]             [texture opinion given pose]

The rest of this section explains how we evaluate each term in (5) and (6). We cover the distribution of texture given pose in Section 3.1, the pose opinion in Section 3.2, and the credibility in Section 3.3.

3.1 Texture opinion given pose

The distribution of V_t and B_t given the pose history u_{1:t} is Gaussian, with mean and covariance that can be obtained from the Kalman filter estimation equations:

  Var^{−1}(V_t, B_t | u_{1:t}, y_{1:t}) = Var^{−1}(V_t, B_t | u_{1:t−1}, y_{1:t−1}) + a(u_t)^T σ_w^{−1} a(u_t)    (7)

  E(V_t, B_t | u_{1:t}, y_{1:t}) = Var(V_t, B_t | u_{1:t}, y_{1:t})
      × [ Var^{−1}(V_t, B_t | u_{1:t−1}, y_{1:t−1}) E(V_t, B_t | u_{1:t−1}, y_{1:t−1}) + a(u_t)^T σ_w^{−1} y_t ]    (8)

This requires p(V_t, B_t | u_{1:t−1}, y_{1:t−1}), which we get from the Kalman prediction equations:

  E(V_t, B_t | u_{1:t−1}, y_{1:t−1}) = E(V_{t−1}, B_{t−1} | u_{1:t−1}, y_{1:t−1})    (9)

  Var(V_t, B_t | u_{1:t−1}, y_{1:t−1}) = Var(V_{t−1}, B_{t−1} | u_{1:t−1}, y_{1:t−1}) + [Ψ_v 0; 0 Ψ_b]    (10)

In (9), the expected value E(V_t, B_t | u_{1:t−1}, y_{1:t−1}) consists of texture maps (templates) for the object and background. In (10), Var(V_t, B_t | u_{1:t−1}, y_{1:t−1}) represents the degree of uncertainty about each texel in these texture maps. Since this is a diagonal matrix, we can refer to the mean and variance of each texel individually.
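Because the covariances are diagonal and a(u_t) simply routes texels to pixels, the information-form update (7)-(8) and the prediction (9)-(10) reduce to independent scalar recursions, one per texel. A sketch of that scalar case (function names are ours, for illustration):

```python
import numpy as np

def kalman_texel_predict(mu, var, psi):
    """Prediction step, eqs. (9)-(10): mean unchanged, variance grows by psi."""
    return mu, var + psi

def kalman_texel_update(mu_pred, var_pred, y_obs, sigma_w):
    """Scalar information-form update, eqs. (7)-(8), for one observed texel.
    y_obs is the image pixel onto which this texel projects under pose u_t."""
    prec = 1.0 / var_pred + 1.0 / sigma_w      # eq. (7), diagonal case
    var_post = 1.0 / prec
    mu_post = var_post * (mu_pred / var_pred + y_obs / sigma_w)   # eq. (8)
    return mu_post, var_post

# One texel over one time step:
mu_p, var_p = kalman_texel_predict(0.5, 0.04, 0.01)
mu_post, var_post = kalman_texel_update(mu_p, var_p, y_obs=0.7, sigma_w=0.1)
```

The posterior mean lies between the template value and the new pixel, weighted by their precisions, and the posterior variance is always smaller than the predicted one; texels that are occluded under the current pose receive no update.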
For the i-th texel in the object texture map, we use the following notation:

  μ^v_t(i) := i-th element of E(V_t | u_{1:t−1}, y_{1:t−1})
  σ^v_t(i) := (i, i)-th element of Var(V_t | u_{1:t−1}, y_{1:t−1})

Similarly, define μ^b_t(j) and σ^b_t(j) as the mean and variance of the j-th texel in the background texture map. (This notation leaves the dependency on u_{1:t−1} and y_{1:t−1} implicit.)

3.2 Pose opinion

Based on its current texture template (derived from the history of poses and images up to time t−1) and the new image y_t, each expert u_{1:t−1} has a pose opinion, p(u_t | u_{1:t−1}, y_{1:t}), a probability distribution representing that expert's beliefs about the pose at time t. Since the effect of u_t on the likelihood function is nonlinear, we will not attempt to find an analytical solution for the pose opinion distribution. However, due to the spatio-temporal smoothness of video signals, it is possible to estimate the peak and variance of an expert's pose opinion.

3.2.1 Estimating the peak of an expert's pose opinion

We want to estimate û_t(u_{1:t−1}), the value of u_t that maximizes the pose opinion. Since

  p(u_t | u_{1:t−1}, y_{1:t}) = [ p(y_{1:t−1} | u_{1:t−1}) / p(y_{1:t} | u_{1:t−1}) ] p(u_t | u_{t−1}) p(y_t | u_{1:t}, y_{1:t−1}),    (11)

we have

  û_t(u_{1:t−1}) := argmax_{u_t} p(u_t | u_{1:t−1}, y_{1:t}) = argmax_{u_t} p(u_t | u_{t−1}) p(y_t | u_{1:t}, y_{1:t−1}).    (12)

We now need an expression for the final term in (12), the predictive distribution p(y_t | u_{1:t}, y_{1:t−1}).
By integrating out the hidden texture variables from p(y_t, v_t, b_t | u_{1:t}, y_{1:t−1}), and using the conditional independence relationships defined by the graphical model (Figure 1, right), we can derive:

  log p(y_t | u_{1:t}, y_{1:t−1}) = −(m/2) log 2π − (1/2) log |Var(Y_t | u_{1:t}, y_{1:t−1})|
      − (1/2) Σ_{i=1}^n (y_t(x_i(u_t)) − μ^v_t(i))^2 / (σ^v_t(i) + σ_w)
      − (1/2) Σ_{j∉X(u_t)} (y_t(j) − μ^b_t(j))^2 / (σ^b_t(j) + σ_w),    (13)

where x_i(u_t) is the image pixel rendered by the i-th object vertex when the object assumes pose u_t, and X(u_t) is the set of all image pixels rendered by the object under pose u_t. Combining (12) and (13), we can derive:

  û_t(u_{1:t−1}) = argmin_{u_t} { −log p(u_t | u_{t−1})
      + (1/2) Σ_{i=1}^n [ (y_t(x_i(u_t)) − μ^v_t(i))^2 / (σ^v_t(i) + σ_w)              [foreground term]
        − (y_t(x_i(u_t)) − μ^b_t(x_i(u_t)))^2 / (σ^b_t(x_i(u_t)) + σ_w)
        − log(σ^b_t(x_i(u_t)) + σ_w) ] }                                               [background terms]    (14)

Note the similarity between (14) and constrained optic flow (3). For example, focus on the foreground term in (14) and ignore the weights in the denominator. The previous image y_{t−1} from (3) has been replaced by μ^v_t(·), the estimated object texture based on the images and poses up to time t−1. As in optic flow, we can find the pose estimate û_t(u_{1:t−1}) efficiently using the Gauss-Newton method.

3.2.2 Estimating the distribution of an expert's pose opinion

We estimate the distribution of an expert's pose opinion using a combination of Laplace's method and importance sampling. Suppose at time t−1 we are given a sample of experts indexed by d, each endowed with a pose sequence u^(d)_{1:t−1}, a weight w^(d)_{t−1}, and the means and variances of Gaussian distributions for object and background texture. For each expert u^(d)_{1:t−1}, we use (14) to compute û^(d)_t, the peak of the pose distribution at time t according to that expert. Define σ̂^(d)_t as the inverse Hessian matrix of (14) at this peak, the Laplace estimate of the covariance matrix of the expert's opinion.
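Because the predictive covariance in (13) is diagonal, the log-likelihood of a candidate pose is just a sum of per-pixel terms. The sketch below evaluates that sum for a toy 8-pixel image, representing the pose only through the pixel indices x_i(u_t) it induces; all sizes and values are hypothetical.

```python
import numpy as np

# Toy image of m = 8 pixels; the object covers n = 3 texels (sizes hypothetical).
sigma_w = 0.05
mu_v = np.ones(3);  sig_v = np.full(3, 0.01)    # object template mean / variance
mu_b = np.zeros(8); sig_b = np.full(8, 0.01)    # background template mean / variance

def log_predictive(y, x_idx):
    """log p(y_t | u_{1:t}, y_{1:t-1}) of eq. (13); x_idx[i] is the pixel
    rendered by object texel i under the candidate pose."""
    m = len(y)
    fg = np.zeros(m, dtype=bool)
    fg[x_idx] = True
    var_fg = sig_v + sigma_w                    # predictive variances, object pixels
    var_bg = sig_b[~fg] + sigma_w               # predictive variances, background pixels
    ll = -0.5 * m * np.log(2 * np.pi)
    ll -= 0.5 * (np.log(var_fg).sum() + np.log(var_bg).sum())   # diagonal log-det
    ll -= 0.5 * np.sum((y[x_idx] - mu_v) ** 2 / var_fg)         # foreground residuals
    ll -= 0.5 * np.sum((y[~fg] - mu_b[~fg]) ** 2 / var_bg)      # background residuals
    return ll

y = np.zeros(8); y[[2, 3, 4]] = 1.0    # image consistent with the object at pixels 2-4
ll_good = log_predictive(y, np.array([2, 3, 4]))
ll_bad = log_predictive(y, np.array([0, 1, 2]))
```

As expected, the pose that aligns the template with the image scores higher than a shifted one; maximizing this score over u_t (plus the pose prior) is exactly the objective in (14).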
We then generate a set of s independent samples {u^(d,e)_t : e = 1, …, s} from a Gaussian distribution with mean û^(d)_t and variance proportional to σ̂^(d)_t, namely g(· | û^(d)_t, α σ̂^(d)_t), where the parameter α > 0 determines the sharpness of the sampling distribution. (Note that letting α → 0 would be equivalent to simply setting the new pose equal to the peak of the pose opinion, u^(d,e)_t = û^(d)_t.) To find the parameters of this Gaussian proposal distribution, we use the Gauss-Newton method, ignoring the second of the two background terms in (14). (This term is not ignored in the importance sampling step.)

To refine our estimate of the pose opinion we use importance sampling. We assign each sample from the proposal distribution an importance weight w_t(d, e) that is proportional to the ratio between the posterior distribution and the proposal distribution:

  p(u_t | u^(d)_{1:t−1}, y_{1:t}) ≈ Σ_{e=1}^s [ w_t(d, e) / Σ_{f=1}^s w_t(d, f) ] δ(u_t − u^(d,e)_t)    (15)

  w_t(d, e) = p(u^(d,e)_t | u^(d)_{t−1}) p(y_t | u^(d)_{1:t−1}, u^(d,e)_t, y_{1:t−1}) / g(u^(d,e)_t | û^(d)_t, α σ̂^(d)_t)    (16)

The numerator of (16) is proportional to p(u^(d,e)_t | u^(d)_{1:t−1}, y_{1:t}) by (12), and the denominator of (16) is the sampling distribution.

3.3 Estimating an expert's credibility

The credibility of the d-th expert, p(u^(d)_{1:t−1} | y_{1:t}), is proportional to the product of a prior term and a likelihood term:

  p(u^(d)_{1:t−1} | y_{1:t}) = p(u^(d)_{1:t−1} | y_{1:t−1}) p(y_t | u^(d)_{1:t−1}, y_{1:t−1}) / p(y_t | y_{1:t−1}).    (17)

Regarding the likelihood,

  p(y_t | u_{1:t−1}, y_{1:t−1}) = ∫ p(y_t, u_t | u_{1:t−1}, y_{1:t−1}) du_t = ∫ p(y_t | u_{1:t}, y_{1:t−1}) p(u_t | u_{t−1}) du_t    (18)

We already generated a set of samples {u^(d,e)_t : e = 1, …, s} that estimate the pose opinion of the d-th expert, p(u_t | u^(d)_{1:t−1}, y_{1:t}).
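The importance-weighting step (15)-(16) can be sketched for a single expert with a 1-D pose. Here the prior p(u_t | u_{t−1}) and likelihood p(y_t | u_{1:t}, y_{1:t−1}) are replaced by toy Gaussian stand-ins (their true forms come from the pose dynamics and eq. (13)); the seed, sizes, and parameter values are all illustrative.

```python
import numpy as np

def gauss_logpdf(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(1)
s = 500
u_hat, lap_var = 2.0, 0.3   # peak and Laplace variance of the expert's opinion
alpha = 2.0                 # proposal-sharpness parameter, alpha > 0

# Draw from the Gaussian proposal g(. | u_hat, alpha * lap_var).
samples = rng.normal(u_hat, np.sqrt(alpha * lap_var), s)

# Toy Gaussian stand-ins for the two factors in the numerator of eq. (16):
log_prior = gauss_logpdf(samples, 1.5, 1.0)   # p(u_t | u_{t-1})
log_like = gauss_logpdf(samples, 2.2, 0.2)    # p(y_t | u_{1:t}, y_{1:t-1})
log_prop = gauss_logpdf(samples, u_hat, alpha * lap_var)

log_w = log_prior + log_like - log_prop       # eq. (16), in log space
w = np.exp(log_w - log_w.max())
w /= w.sum()                                  # normalization, as in eq. (15)
post_mean = np.sum(w * samples)               # mean of the weighted-sample opinion
```

Working in log space and subtracting the maximum before exponentiating avoids underflow, which matters because the image likelihood in (13) spans hundreds of log units.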
We can now use these samples to estimate the likelihood for the d-th expert:

  p(y_t | u^(d)_{1:t−1}, y_{1:t−1}) = ∫ p(y_t | u^(d)_{1:t−1}, u_t, y_{1:t−1}) p(u_t | u^(d)_{t−1}) du_t    (19)
      = ∫ p(y_t | u^(d)_{1:t−1}, u_t, y_{1:t−1}) [ p(u_t | u^(d)_{t−1}) / g(u_t | û^(d)_t, α σ̂^(d)_t) ] g(u_t | û^(d)_t, α σ̂^(d)_t) du_t
      ≈ (1/s) Σ_{e=1}^s w_t(d, e)

3.4 Updating the filtering distribution

Once we have calculated the opinion and credibility of each expert u_{1:t−1}, we evaluate the integral in (5) as a weighted sum over experts. The credibilities of all of the experts are normalized to sum to 1. New experts u_{1:t} (children) are created from the old experts u_{1:t−1} (parents) by appending a pose u_t to the parent's history of poses u_{1:t−1}. Every expert in the new generation is created as follows: one parent is chosen to sire the child, with probability proportional to the parent's credibility. The child's value of u_t is chosen at random from its parent's pose opinion (the weighted samples described in Section 3.2.2).

4 Relation to Optic Flow and Template Matching

In basic template matching, the same time-invariant texture map is used to track every frame in the video sequence. Optic flow can be thought of as template matching with a template that is completely reset at each frame for use in the subsequent frame. In most cases, optimal inference under G-flow involves a combination of optic flow-based and template-based tracking, in which the texture template gradually evolves as new images are presented. Pure optic flow and template matching emerge as special cases.

Optic Flow as a Special Case. Suppose that the pose transition probability p(u_t | u_{t−1}) is uninformative, that the background is uninformative, that every texel in the initial object texture map has equal variance, Var(V_1) = κ I_n, and that the texture transition uncertainty is very high, Ψ_v → diag(∞).
Using (7), (8), and (10), it follows that:

  μ^v_t(i) = [a_v(u_{t−1})^T y_{t−1}]_i = y_{t−1}(x_i(u_{t−1})),    (20)

i.e., the object texture map at time t is determined by the pixels from image y_{t−1} that, according to pose u_{t−1}, were rendered by the object. As a result, (14) reduces to:

  û_t(u_{1:t−1}) = argmin_{u_t} (1/2) Σ_{i=1}^n [ y_t(x_i(u_t)) − y_{t−1}(x_i(u_{t−1})) ]^2,    (21)

which is identical to (3). Thus constrained optic flow [10, 2, 11] is simply a special case of optimal inference under G-flow, with a single expert and with sampling parameter α → 0. The key assumption, Ψ_v → diag(∞), means that the object's texture is very different in adjacent frames. However, optic flow is typically applied in situations in which the object's texture in adjacent frames is similar. The optimal solution in such situations calls not for optic flow, but for a texture map that integrates information across multiple frames.

Template Matching as a Special Case. Suppose the initial texture map is known precisely, Var(V_1) = 0, and the texture transition uncertainty is very low, Ψ_v → 0. By (7), (8), and (10), it follows that μ^v_t(i) = μ^v_{t−1}(i) = μ^v_1(i), i.e., the texture map does not change over time, but remains fixed at its initial value (it is a texture template). Then (14) becomes:

  û_t(u_{1:t−1}) = argmin_{u_t} (1/2) Σ_{i=1}^n [ y_t(x_i(u_t)) − μ^v_1(i) ]^2,    (22)

where μ^v_1(i) is the i-th texel of the fixed texture template. This is the error function minimized by standard template-matching algorithms. The key assumption, Ψ_v → 0, means that the object's texture is constant from each frame to the next, which is rarely true in real data. G-flow provides a principled way to relax this unrealistic assumption of template methods.

General Case. In general, if the background is uninformative, then minimizing (14) results in a weighted combination of optic flow and template matching, with the weight of each approach depending on the current level of certainty about the object template.
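Both limits can be checked numerically with the scalar texel update implied by (7)-(8): near-zero texture variance pins the template in place, and near-infinite transition uncertainty makes the posterior mean track the newest observation. The extreme variances below stand in for the limits Ψ_v → 0 and Ψ_v → ∞.

```python
import numpy as np

def update(mu, var, y, sigma_w):
    """Scalar texel update (information form of eqs. (7)-(8))."""
    var_post = 1.0 / (1.0 / var + 1.0 / sigma_w)
    return var_post * (mu / var + y / sigma_w), var_post

# Template-matching limit: Var(V_1) = 0 and Psi_v -> 0, so the predicted texel
# variance stays (numerically) zero and the mean never moves from the template.
mu_template, _ = update(1.0, 1e-12, y=0.3, sigma_w=0.1)

# Optic-flow limit: Psi_v -> inf, so the predicted variance is huge and the
# posterior mean is essentially the newest observation.
mu_flow, _ = update(1.0, 1e12, y=0.3, sigma_w=0.1)
```

Intermediate variances interpolate between these two behaviors, which is the "weighted combination of optic flow and template matching" of the general case.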
In addition, when there is useful information in the background, G-flow infers a model of the background, which is used to improve tracking.

Figure 2: G-flow tracking an outdoor video. Results are shown for frames 1, 81, and 620.

5 Simulations

We collected a video (30 frames/sec) of a subject in an outdoor setting who made a variety of facial expressions while moving her head. A later motion-capture session was used to create a 3D morphable model of her face, consisting of a set of 5 morph bases (k = 5). Twenty experts were initialized randomly near the correct pose on frame 1 of the video and propagated using G-flow inference (assuming an uninformative background). See http://mplab.ucsd.edu for video. Figure 2 shows the distribution of experts for three frames. In each frame, every expert has a hypothesis about the pose (translation, rotation, scale, and morph coefficients). The 38 points in the model are projected into the image according to each expert's pose, yielding 760 red dots in each frame. In each frame, the mean of the experts gives a single hypothesis about the 3D non-rigid deformation of the face (lower right) as well as the rigid pose of the face (rotated 3D axes, lower left). Notice G-flow's ability to recover from error: bad initial hypotheses are weeded out, leaving only good hypotheses. To compare G-flow's performance with deterministic constrained optic flow algorithms such as [10, 2, 11], we used both G-flow and the method from [2] to track the same video sequence. We ran each tracker several times, introducing small errors in the starting pose.

Figure 3: Average error over time for G-flow (green) and for deterministic optic flow [2] (blue). Results were averaged over 16 runs (deterministic algorithm) or 4 runs (G-flow) and smoothed.

As ground truth, the 2D locations of 6 points were hand-labeled in every 20th frame.
The error at every 20th frame was calculated as the distance from these labeled locations to the inferred (tracked) locations, averaged across several runs. Figure 3 compares this tracking error as a function of time for the deterministic constrained optic flow algorithm and for a 20-expert version of the G-flow tracking algorithm. Notice that the deterministic system has a tendency to drift (increase in error) over time, whereas G-flow can recover from drift.

Acknowledgments

Tim K. Marks was supported by NSF grant IIS-0223052 and NSF grant DGE-0333451 to GWC. John Hershey was supported by the UCDIMI grant D00-10084. J. Cooper Roddey was supported by the Swartz Foundation. Javier R. Movellan was supported by NSF grants IIS-0086107, IIS-0220141, and IIS-0223052, and by the UCDIMI grant D00-10084.

References

[1] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, 2004.
[2] M. Brand. Flexible flow for 3D nonrigid tracking and shape recovery. In CVPR, volume 1, pages 315–322, 2001.
[3] H. Chen, P. Kumar, and J. van Schuppen. On Kalman filtering for conditionally Gaussian systems with random matrices. Syst. Contr. Lett., 13:397–404, 1989.
[4] R. Chen and J. Liu. Mixture Kalman filters. J. R. Statist. Soc. B, 62:493–508, 2000.
[5] A. Doucet and C. Andrieu. Particle filtering for partially observed Gaussian state space models. J. R. Statist. Soc. B, 64:827–838, 2002.
[6] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In 16th Conference on Uncertainty in AI, pages 176–183, 2000.
[7] A. Doucet, S. J. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208, 2000.
[8] Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831–864, 2000.
[9] B. Lucas and T. Kanade.
An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981.
[10] L. Torresani, D. Yang, G. Alexander, and C. Bregler. Tracking and modeling non-rigid objects with rank constraints. In CVPR, pages 493–500, 2001.
[11] Lorenzo Torresani, Aaron Hertzmann, and Christoph Bregler. Learning non-rigid 3D shape from 2D motion. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
5 0.13451242 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for deriving approximations for intractable probabilistic models. This framework is based on a free energy (negative log marginal likelihood) and can be seen as a generalization of adaptive TAP [1, 2, 3] and expectation propagation (EP) [4, 5]. The free energy is constructed from two approximating distributions which encode different aspects of the intractable model, such as single-node constraints and couplings, and are by construction consistent on a chosen set of moments. We test the framework on a difficult benchmark problem with binary variables on fully connected graphs and 2D grid graphs. We find good performance using sets of moments which either specify factorized nodes or a spanning tree on the nodes (structured approximation). Surprisingly, the Bethe approximation gives very inferior results even on grids. 1
6 0.11348259 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons
7 0.11280005 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
8 0.091002651 122 nips-2004-Modelling Uncertainty in the Game of Go
9 0.09013769 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields
10 0.089839622 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
11 0.087150596 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks
12 0.083542898 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks
13 0.081735283 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
14 0.08096575 175 nips-2004-Stable adaptive control with online learning
15 0.077824056 73 nips-2004-Generative Affine Localisation and Tracking
16 0.069144107 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
17 0.068278715 185 nips-2004-The Convergence of Contrastive Divergences
18 0.067655005 48 nips-2004-Convergence and No-Regret in Multiagent Learning
19 0.066900022 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
20 0.065937169 28 nips-2004-Bayesian inference in spiking neurons
topicId topicWeight
[(0, -0.19), (1, -0.036), (2, 0.145), (3, -0.006), (4, 0.166), (5, -0.105), (6, -0.009), (7, 0.157), (8, -0.159), (9, -0.365), (10, 0.068), (11, -0.203), (12, 0.173), (13, 0.01), (14, 0.007), (15, -0.063), (16, 0.058), (17, 0.07), (18, 0.061), (19, 0.157), (20, 0.097), (21, -0.107), (22, 0.11), (23, 0.075), (24, 0.049), (25, -0.077), (26, -0.034), (27, 0.095), (28, -0.004), (29, -0.007), (30, 0.117), (31, -0.032), (32, -0.031), (33, -0.029), (34, 0.045), (35, -0.051), (36, -0.023), (37, 0.038), (38, 0.033), (39, 0.105), (40, -0.006), (41, -0.047), (42, 0.103), (43, -0.097), (44, 0.008), (45, -0.0), (46, 0.055), (47, -0.058), (48, 0.041), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.96432507 116 nips-2004-Message Errors in Belief Propagation
Author: Alexander T. Ihler, John W. Fisher, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether from quantization or other simplified message representations or from stochastic approximation methods. Introducing such errors into the BP message computations has the potential to adversely affect the solution obtained. We analyze this effect with respect to a particular measure of message error, and show bounds on the accumulation of errors in the system. This leads both to convergence conditions and error bounds in traditional and approximate BP message passing. 1
2 0.87163377 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks
Author: Joris M. Mooij, Hilbert J. Kappen
Abstract: We introduce a computationally efficient method to estimate the validity of the BP method as a function of graph topology, the connectivity strength, frustration and network size. We present numerical results that demonstrate the correctness of our estimates for the uniform random model and for a real-world network (“C. Elegans”). Although the method is restricted to pair-wise interactions, no local evidence (zero “biases”) and binary variables, we believe that its predictions correctly capture the limitations of BP for inference and MAP estimation on arbitrary graphical models. Using this approach, we find that BP always performs better than MF. Especially for large networks with broad degree distributions (such as scale-free networks) BP turns out to significantly outperform MF. 1
3 0.71659267 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for deriving approximations for intractable probabilistic models. This framework is based on a free energy (negative log marginal likelihood) and can be seen as a generalization of adaptive TAP [1, 2, 3] and expectation propagation (EP) [4, 5]. The free energy is constructed from two approximating distributions which encode different aspects of the intractable model, such as single-node constraints and couplings, and are by construction consistent on a chosen set of moments. We test the framework on a difficult benchmark problem with binary variables on fully connected graphs and 2D grid graphs. We find good performance using sets of moments which either specify factorized nodes or a spanning tree on the nodes (structured approximation). Surprisingly, the Bethe approximation gives very inferior results even on grids. 1
4 0.49806735 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks
Author: Erik Aurell, Uri Gordon, Scott Kirkpatrick
Abstract: Survey propagation is a powerful technique from statistical physics that has been applied to solve the 3-SAT problem both in principle and in practice. We give, using only probability arguments, a common derivation of survey propagation, belief propagation and several interesting hybrid methods. We then present numerical experiments which use WSAT (a widely used random-walk based SAT solver) to quantify the complexity of the 3-SAT formulae as a function of their parameters, both as randomly generated and after simplification, guided by survey propagation. Some properties of WSAT which have not previously been reported make it an ideal tool for this purpose – its mean cost is proportional to the number of variables in the formula (at a fixed ratio of clauses to variables) in the easy-SAT regime and slightly beyond, and its behavior in the hard-SAT regime appears to reflect the underlying structure of the solution space that has been predicted by replica symmetry-breaking arguments. An analysis of the tradeoffs between the various methods of search for satisfying assignments shows WSAT to be far more powerful than has been appreciated, and suggests some interesting new directions for practical algorithm development. 1
5 0.48522046 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
Author: Erik B. Sudderth, Michael I. Mandel, William T. Freeman, Alan S. Willsky
Abstract: We describe a three–dimensional geometric hand model suitable for visual tracking applications. The kinematic constraints implied by the model’s joints have a probabilistic structure which is well described by a graphical model. Inference in this model is complicated by the hand’s many degrees of freedom, as well as multimodal likelihoods caused by ambiguous image measurements. We use nonparametric belief propagation (NBP) to develop a tracking algorithm which exploits the graph’s structure to control complexity, while avoiding costly discretization. While kinematic constraints naturally have a local structure, self–occlusions created by the imaging process lead to complex interdependencies in color and edge–based likelihood functions. However, we show that local structure may be recovered by introducing binary hidden variables describing the occlusion state of each pixel. We augment the NBP algorithm to infer these occlusion variables in a distributed fashion, and then analytically marginalize over them to produce hand position estimates which properly account for occlusion events. We provide simulations showing that NBP may be used to refine inaccurate model initializations, as well as track hand motion through extended image sequences. 1
6 0.46945032 122 nips-2004-Modelling Uncertainty in the Game of Go
7 0.37922868 29 nips-2004-Beat Tracking the Graphical Model Way
8 0.32877287 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields
9 0.32478538 57 nips-2004-Economic Properties of Social Networks
10 0.30467936 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons
11 0.2800127 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
12 0.27789155 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
13 0.27584293 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters
14 0.27267957 138 nips-2004-Online Bounds for Bayesian Algorithms
15 0.26974624 130 nips-2004-Newscast EM
16 0.26447946 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
17 0.25560981 175 nips-2004-Stable adaptive control with online learning
18 0.24614897 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
19 0.24345493 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
20 0.24300548 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks
topicId topicWeight
[(13, 0.13), (15, 0.147), (26, 0.076), (31, 0.025), (33, 0.207), (35, 0.02), (39, 0.03), (50, 0.037), (88, 0.218)]
simIndex simValue paperId paperTitle
same-paper 1 0.89973462 116 nips-2004-Message Errors in Belief Propagation
Author: Alexander T. Ihler, John W. Fisher, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether from quantization or other simplified message representations or from stochastic approximation methods. Introducing such errors into the BP message computations has the potential to adversely affect the solution obtained. We analyze this effect with respect to a particular measure of message error, and show bounds on the accumulation of errors in the system. This leads both to convergence conditions and error bounds in traditional and approximate BP message passing. 1
2 0.88885808 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
Author: Chakra Chennubhotla, Allan D. Jepson
Abstract: We show how to build a hierarchical, reduced-rank representation for large stochastic matrices and use this representation to design an efficient algorithm for computing the largest eigenvalues, and the corresponding eigenvectors. In particular, the eigen problem is first solved at the coarsest level of the representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy. A small number of power iterations are employed at each stage to correct the eigen solution. The typical speedups obtained by a Matlab implementation of our fast eigensolver over a standard sparse matrix eigensolver [13] are at least a factor of ten for large image sizes. The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8].

1 Spectral Methods

Graph-theoretic spectral methods have gained popularity in a variety of application domains: segmenting images [22]; embedding in low-dimensional spaces [4, 5, 8]; and clustering parallel scientific computation tasks [19]. Spectral methods enable the study of properties global to a dataset, using only local (pairwise) similarity or affinity measurements between the data points. The global properties that emerge are best understood in terms of a random-walk formulation on the graph. For example, the graph can be partitioned into clusters by analyzing the perturbations to the stationary distribution of a Markovian relaxation process defined in terms of the affinity weights [17, 18, 24, 7]. The Markovian relaxation process need never be explicitly carried out; instead, it can be analytically expressed using the leading-order eigenvectors, and eigenvalues, of the Markov transition matrix. In this paper we consider the practical application of spectral methods to large datasets. In particular, the eigen decomposition can be very expensive, on the order of O(n^3), where n is the number of nodes in the graph.
While it is possible to compute analytically the first eigenvector (see §3 below), the remaining subspace of vectors (necessary for, say, clustering) has to be explicitly computed. A typical approach to dealing with this difficulty is to first sparsify the links in the graph [22] and then apply an efficient eigensolver [13, 23, 3]. In comparison, we propose in this paper a specialized eigensolver suitable for large stochastic matrices with known stationary distributions. In particular, we exploit the spectral properties of the Markov transition matrix to generate hierarchical, successively lower-ranked approximations to the full transition matrix. The eigen problem is solved directly at the coarsest level of representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy, using a small number of power iterations to correct the solution at each stage.

2 Previous Work

One approach to speeding up the eigen decomposition is to use the fact that the columns of the affinity matrix are typically correlated. The idea then is to pick a small number of representative columns to perform eigen decomposition via SVD. For example, in the Nystrom approximation procedure, originally proposed for integral eigenvalue problems, the idea is to randomly pick a small set of m columns; generate the corresponding affinity matrix; solve the eigenproblem; and finally extend the solution to the complete graph [9, 10]. The Nystrom method has also been recently applied in kernel learning methods for fast Gaussian process classification and regression [25]. Other sampling-based approaches include the work reported in [1, 2, 11]. Our starting point is the transition matrix generated from affinity weights, and we show how building a representational hierarchy follows naturally from considering the stochastic matrix. A closely related work is the paper by Lin on reduced-rank approximations of transition matrices [14].
We differ in how we approximate the transition matrices; in particular, our objective function is computationally less expensive to solve. In particular, one of our goals in reducing transition matrices is to develop a fast, specialized eigensolver for spectral clustering. Fast eigensolving is also the goal in ACE [12], where successive levels in the hierarchy can potentially have negative affinities. A graph coarsening process for clustering was also pursued in [21, 3].

3 Markov Chain Terminology

We first provide a brief overview of the Markov chain terminology here (for more details see [17, 15, 6]). We consider an undirected graph G = (V, E) with vertices v_i, for i = {1, …, n}, and edges e_{i,j} with non-negative weights a_{i,j}. Here the weight a_{i,j} represents the affinity between vertices v_i and v_j. The affinities are represented by a non-negative, symmetric n × n matrix A having weights a_{i,j} as elements. The degree of a node j is defined to be d_j = Σ_{i=1}^n a_{i,j} = Σ_{i=1}^n a_{j,i}, where we define D = diag(d_1, …, d_n). A Markov chain is defined using these affinities by setting a transition probability matrix M = A D^{−1}, where the columns of M each sum to 1. The transition probability matrix defines the random walk of a particle on the graph G. The random walk need never be explicitly carried out; instead, it can be analytically expressed using the leading-order eigenvectors, and eigenvalues, of the Markov transition matrix. Because the stochastic matrices need not be symmetric in general, a direct eigen decomposition step is not preferred, for reasons of instability. This problem is easily circumvented by considering a normalized affinity matrix, L = D^{−1/2} A D^{−1/2}, which is related to the stochastic matrix by a similarity transformation: L = D^{−1/2} M D^{1/2}.
Because L is symmetric, it can be diagonalized: L = U Λ U^T, where U = [u_1, u_2, …, u_n] is an orthogonal set of eigenvectors and Λ is a diagonal matrix of eigenvalues [λ_1, λ_2, …, λ_n] sorted in decreasing order. The eigenvectors have unit length, ||u_k|| = 1, and from the form of A and D it can be shown that the eigenvalues λ_i ∈ (−1, 1], with at least one eigenvalue equal to one. Without loss of generality, we take λ_1 = 1. Because L and M are similar, we can perform an eigen decomposition of the Markov transition matrix as M = D^{1/2} L D^{−1/2} = D^{1/2} U Λ U^T D^{−1/2}. Thus an eigenvector u of L corresponds to an eigenvector D^{1/2} u of M with the same eigenvalue λ. The Markovian relaxation process after β iterations, namely M^β, can be represented as M^β = D^{1/2} U Λ^β U^T D^{−1/2}. Therefore, a particle undertaking a random walk with an initial distribution p_0 acquires after β steps a distribution p_β given by p_β = M^β p_0. Assuming the graph is connected, as β → ∞ the Markov chain approaches a unique stationary distribution given by π = diag(D) / Σ_{i=1}^n d_i, and thus M^∞ = π 1^T, where 1 is an n-dimensional column vector of all ones. Observe that π is an eigenvector of M, as it is easy to show that M π = π, and the corresponding eigenvalue is 1. Next, we show how to generate hierarchical, successively lower-ranked approximations for the transition matrix M.

4 Building a Hierarchy of Transition Matrices

The goal is to generate a very fast approximation, while simultaneously achieving sufficient accuracy. For notational ease, we think of M as a fine-scale representation and M̃ as some coarse-scale approximation to be derived here. By coarsening M̃ further, we can generate successive levels of the representation hierarchy. We use the stationary distribution π to construct a corresponding coarse-scale stationary distribution δ.
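The spectral facts above (column-stochastic M, similarity to L, and stationary distribution π proportional to the degrees) are straightforward to verify numerically. A small numpy sketch on a random symmetric affinity matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.random((n, n))
A = A + A.T                        # symmetric, non-negative affinity matrix
d = A.sum(axis=0)                  # node degrees d_j
M = A / d                          # column-stochastic M = A D^{-1}
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = D_inv_sqrt @ A @ D_inv_sqrt    # normalized affinity, similar to M

lam, U = np.linalg.eigh(L)         # symmetric eigendecomposition, ascending order
pi = d / d.sum()                   # stationary distribution, pi = diag(D) / sum_i d_i
```

Since `numpy.linalg.eigh` returns eigenvalues in ascending order, `lam[-1]` is the leading eigenvalue (equal to 1), and `np.sqrt(d) * U[:, -1]` is the corresponding eigenvector D^{1/2} u of M.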
As just discussed, a critical property of the fine-scale Markov matrix M is that it is similar to the symmetric matrix L, and we wish to preserve this property at every level of the representation hierarchy.

4.1 Deriving the Coarse-Scale Stationary Distribution

We begin by expressing the stationary distribution π as a probabilistic mixture of latent distributions. In matrix notation, we have

    π = K δ,   (1)

where δ is an unknown mixture coefficient vector of length m, and K is an n × m non-negative kernel matrix whose columns are latent distributions that each sum to 1: Σ_{i=1}^n K_{i,j} = 1, with m << n. It is easy to derive a maximum-likelihood approximation of δ using an EM-type algorithm [16]. The main step is to find a stationary point (δ, λ) of the Lagrangian

    E ≡ − Σ_{i=1}^n π_i ln( Σ_{j=1}^m K_{i,j} δ_j ) + λ ( Σ_{j=1}^m δ_j − 1 ).   (2)

An implicit step in this EM procedure is to compute the ownership probability r_{i,j} of the j-th kernel (or node) at the coarse scale for the i-th node on the fine scale, given by

    r_{i,j} = δ_j K_{i,j} / Σ_{k=1}^m δ_k K_{i,k}.   (3)

The EM procedure allows for an update of both δ and the latent distributions in the kernel matrix K (see §8.3.1 in [6]). For initialization, δ is taken to be uniform over the coarse-scale states. But by choosing the kernels K carefully, we provide a good initialization for the EM procedure. Specifically, the Markov matrix M is diffused using a small number of iterations to get M^β. The diffusion causes random walks from neighboring nodes to be less distinguishable. This in turn helps us select a small number of columns of M^β, in a fast and greedy way, to form the kernel matrix K. We defer the exact details of kernel selection to a later section (§4.3).

4.2 Deriving the Coarse-Scale Transition Matrix

To define M~, the coarse-scale transition matrix, we break the derivation into three steps. First, Markov chain propagation at the coarse scale is defined as

    q^{k+1} = M~ q^k,   (4)

where q is the coarse-scale probability distribution after k steps of the random walk.
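The δ-update implied by the Lagrangian and the ownership probabilities of Eq. (3) can be sketched as follows; K and π here are random stand-ins, and the kernel-matrix update (§8.3.1 of [6]) is omitted, so only the mixture-weight step is shown.

```python
import numpy as np

# Sketch of the EM update for the mixture pi ≈ K delta (Eqs. 1-3).
rng = np.random.default_rng(0)
n, m = 8, 3
K = rng.random((n, m))
K /= K.sum(axis=0, keepdims=True)          # columns of K are distributions
pi = rng.random(n)
pi /= pi.sum()                             # fine-scale stationary distribution

delta = np.full(m, 1.0 / m)                # uniform initialization
for _ in range(200):
    # E-step: ownership probabilities r_{i,j} (Eq. 3)
    r = K * delta[None, :]
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximum-likelihood update of the mixture weights
    delta = r.T @ pi

assert np.isclose(delta.sum(), 1.0)        # delta remains a distribution
```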
Second, we expand q^k into the fine scale using the kernels K, resulting in a fine-scale probability distribution p^k:

    p^k = K q^k.   (5)

Finally, we lift p^k back into the coarse scale by using the ownership probability of the j-th kernel for the i-th node on the fine grid:

    q_j^{k+1} = Σ_{i=1}^n r_{i,j} p_i^k.   (6)

Substituting Eqs. (3) and (5) into Eq. (6) gives

    q_j^{k+1} = Σ_{i=1}^n [ δ_j K_{i,j} / Σ_{k=1}^m δ_k K_{i,k} ] Σ_{t=1}^m K_{i,t} q_t^k.   (7)

We can write the preceding equation in matrix form:

    q^{k+1} = diag(δ) K^T diag(K δ)^{-1} K q^k.   (8)

Comparing this with Eq. (4), we can read off the coarse-scale transition matrix:

    M~ = diag(δ) K^T diag(K δ)^{-1} K.   (9)

It is easy to see that δ = M~ δ, so δ is the stationary distribution for M~. Following the definition of M~ and its stationary distribution δ, we can generate a symmetric coarse-scale affinity matrix A~ given by

    A~ = M~ diag(δ) = diag(δ) K^T diag(K δ)^{-1} K diag(δ),   (10)

where we have substituted the expression for M~ from Eq. (9). The coarse-scale affinity matrix A~ is then normalized to get

    L~ = D~^{-1/2} A~ D~^{-1/2};  D~ = diag(d~_1, d~_2, ..., d~_m),   (11)

where d~_j is the degree of node j in the coarse-scale graph represented by the matrix A~ (see §3 for the degree definition). Thus, the coarse-scale Markov matrix M~ is similar to a symmetric matrix L~.

4.3 Selecting Kernels

For demonstration purposes, we present the kernel-selection details on the image of an eye shown below. To begin with, a random walk is defined where each pixel in the test image is associated with a vertex of the graph G. The edges in G are defined by the standard 8-neighbourhood of each pixel. For the demonstrations in this paper, the edge weight a_{i,j} between neighbouring pixels x_i and x_j is given by a function of the difference in the corresponding intensities I(x_i) and I(x_j): a_{i,j} = exp(−(I(x_i) − I(x_j))^2 / 2σ_a^2), where σ_a is set according to the median absolute difference |I(x_i) − I(x_j)| between neighbours, measured over the entire image.
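The coarse-scale operators of Eqs. (9)-(11) and their claimed properties (column-stochasticity, stationarity of δ, symmetry of the coarse affinities) can be verified numerically; the K and δ below are random illustrative stand-ins.

```python
import numpy as np

# Sketch of Eqs. (9)-(11): coarse transition matrix from kernels K and
# mixture weights delta (values illustrative, not from the paper).
rng = np.random.default_rng(1)
n, m = 8, 3
K = rng.random((n, m)); K /= K.sum(axis=0, keepdims=True)
delta = rng.random(m); delta /= delta.sum()

Mc = np.diag(delta) @ K.T @ np.diag(1.0 / (K @ delta)) @ K        # Eq. 9
assert np.allclose(Mc.sum(axis=0), 1.0)    # columns of the coarse M sum to 1
assert np.allclose(Mc @ delta, delta)      # delta is stationary for the coarse M

Ac = Mc @ np.diag(delta)                   # Eq. 10: coarse affinities
assert np.allclose(Ac, Ac.T)               # symmetric by construction
dc = Ac.sum(axis=0)
Lc = Ac / np.sqrt(np.outer(dc, dc))        # Eq. 11: normalized coarse affinity
```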
The affinity matrix A with these edge weights is then used to generate a Markov transition matrix M. The kernel selection process we use is fast and greedy. First, the fine-scale Markov matrix M is diffused to M^β using β = 4. (The Markov matrix M is sparse because we make the affinity matrix A sparse.) Every column in the diffused matrix M^β is a potential kernel. To facilitate the selection process, the second step is to rank-order the columns of M^β by their probability value in the stationary distribution π. Third, the kernels (i.e., columns of M^β) are picked in such a way that for a chosen kernel K_i, all of the neighbours of pixel i that are within the half-height of the maximum value in the kernel K_i are suppressed from the selection process. Finally, kernel selection continues until every pixel in the image is within a half-height of the peak value of at least one kernel. If M is a full matrix, the expense of computing M^β explicitly can be avoided by selecting random kernel centers and computing only the corresponding columns of M^β. We show results from a three-scale hierarchy on the eye image (below). The image has 25 × 20 pixels but is shown here enlarged for clarity. At the first coarse scale, 83 kernels are picked. The kernels each correspond to a different column in the fine-scale transition matrix, and the pixels giving rise to these kernels are shown numbered on the image. [Figure: eye image with numbered kernel centers; panels labeled Coarse Scale 1 and Coarse Scale 2.] Using these kernels as an initialization, the EM procedure derives a coarse-scale stationary distribution δ (Eq. 2) while simultaneously updating the kernel matrix. Using the newly updated kernel matrix K and the derived stationary distribution δ, a transition matrix M~ is generated (Eq. 9). The coarse-scale Markov matrix is then diffused to M~^β, again using β = 4. The kernel selection algorithm is reapplied, this time picking 32 kernels for the second coarse scale.
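A simplified 1-D sketch of this greedy selection follows: diffuse M, rank columns by stationary probability, and suppress nodes already covered within the half-height of a chosen kernel. A chain graph stands in for the image grid, and the values are illustrative, not the paper's setup.

```python
import numpy as np

n, beta = 30, 4
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0        # chain: each node linked to its neighbours
d = A.sum(axis=0)
M = A / d[None, :]
Mb = np.linalg.matrix_power(M, beta)       # every column of M^beta is a candidate
pi = d / d.sum()

order = np.argsort(-pi)                    # rank candidates by stationary probability
covered = np.zeros(n, dtype=bool)
kernels = []
for j in order:
    if covered[j]:
        continue                           # already within a chosen kernel's half-height
    kernels.append(j)
    covered[j] = True
    k = Mb[:, j]
    covered |= (k >= 0.5 * k.max())        # suppress nodes within the half-height
K = Mb[:, kernels]                         # selected kernel matrix

assert covered.all()                       # every node is covered by some kernel
```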
Larger values of β cause the coarser level to have fewer elements, but the exact number of elements depends on the form of the kernels themselves. For the random experiments described later in §6, we found that β = 2 in the first iteration and β = 4 thereafter reduces the number of kernels to roughly 1/3 to 1/4 of its previous value at each level. At coarser levels of the hierarchy, we expect the kernels to become less sparse, and so will the affinity and transition matrices. To promote sparsity at successive levels of the hierarchy, we sparsify A~ by zeroing out elements associated with "small" transition probabilities in M~. However, in the experiments described later in §6, we observe that this sparsification step is not critical. To summarize, we use the stationary distribution π at the fine scale to derive a transition matrix M~, and its stationary distribution δ, at the coarse scale. The coarse-scale transition matrix in turn helps derive an affinity matrix A~ and its normalized version L~. This procedure can clearly be repeated recursively. We describe next how to use this representation hierarchy for building a fast eigensolver.

5 Fast EigenSolver

Our goal in generating a hierarchical representation of a transition matrix is to develop a fast, specialized eigensolver for spectral clustering. To this end, we perform a full eigendecomposition of the normalized affinity matrix only at the coarsest level. As discussed in the previous section, the affinity matrix at the coarsest level is not likely to be sparse, hence it will need a full (as opposed to a sparse) eigensolver.
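The sparsification step can be sketched as a threshold on the transition probabilities, applied so that the affinity matrix stays symmetric; the threshold value here is an illustrative choice, not taken from the paper.

```python
import numpy as np

# Illustrative dense symmetric affinities.
rng = np.random.default_rng(2)
A = rng.random((12, 12)); A = A + A.T
M = A / A.sum(axis=0, keepdims=True)       # column-stochastic transitions

# Zero out affinities whose transition probabilities are "small" in both
# directions, so the sparsified affinity matrix remains symmetric.
thresh = 0.05                              # illustrative threshold
A_sparse = np.where((M < thresh) & (M.T < thresh), 0.0, A)
assert np.allclose(A_sparse, A_sparse.T)
```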
However, it is typically the case that e ≤ m << n (even in the case of the three-scale hierarchy just considered), and hence we expect this step to be the least expensive computationally. The resulting eigenvectors are interpolated to the next lower level of the hierarchy by a process described next. Because the eigen-interpolation process between every adjacent pair of scales in the hierarchy is the same, we assume we have access to the leading eigenvectors U~ (size m × e) of the coarse-scale normalized affinity matrix L~ (size m × m) and describe how to generate the leading eigenvectors U (size n × e) and the leading eigenvalues S (size e × 1) of the fine-scale normalized affinity matrix L (size n × n). There are several steps to the eigen-interpolation process, and in the discussion that follows we refer to the lines in the pseudo-code presented below. First, the coarse-scale eigenvectors U~ can be interpolated using the kernel matrix K to generate U = K U~, an approximation of the fine-scale eigenvectors (line 9). Second, interpolation alone is unlikely to set the directions of U exactly aligned with U_L, the vectors one would obtain by a direct eigendecomposition of the fine-scale normalized affinity matrix L. We therefore update the directions in U by applying a small number of power iterations with L, as given in lines 12-14.

function (U, S) = CoarseToFine(L, K, U~, S~)
 1: INPUT
 2:   L, K  {L is n × n and K is n × m, where m << n}
 3:   U~, S~  {leading coarse-scale eigenvectors/eigenvalues of L~; U~ is of size m × e, e ≤ m}
 4: OUTPUT
 5:   U, S  {leading fine-scale eigenvectors/eigenvalues of L; U is n × e and S is e × 1}

Figure 1: Hierarchical eigensolver results.
(a) Comparison of ground-truth eigenvalues S_L (red circles) with the multi-scale eigensolver spectrum S (blue line). (b) Relative absolute error between eigenvalues: |S − S_L| / S_L. (c) Eigenvector mismatch, 1 − diag(|U^T U_L|), between eigenvectors U derived by the multi-scale eigensolver and the ground truth U_L. Observe the slight mismatch in the last few eigenvectors, but excellent agreement in the leading eigenvectors (see text).

 6: CONSTANTS: TOL = 1e-4; POWER_ITERS = 50
 7:
 8: TPI = min(POWER_ITERS, log(e × eps / TOL) / log(min(S~)))  {eps: machine accuracy}
 9: U = K U~  {interpolation from coarse to fine}
10: while not converged do
11:   U_old = U  {n × e matrix, e << n}
12:   for i = 1 to TPI do
13:     U ⇐ L U
14:   end for
15:   U ⇐ Gram-Schmidt(U)  {orthogonalize U}
16:   L_e = U^T L U  {L may be sparse, but L_e need not be}
17:   U_e S_e U_e^T = svd(L_e)  {eigenanalysis of L_e, which is of size e × e}
18:   U ⇐ U U_e  {update the leading eigenvectors of L}
19:   S = diag(S_e)  {grab the leading eigenvalues of L}
20:   innerProd = 1 − diag(U_old^T U)  {1 is an e × 1 vector of all ones}
21:   converged = max[abs(innerProd)] < TOL
22: end while

The number of power iterations TPI can be bounded as discussed next. Suppose v = U c, where U is a matrix of true eigenvectors and c is a coefficient vector for an arbitrary vector v. After TPI power iterations, v becomes v = U diag(S^TPI) c, where S holds the exact eigenvalues. In order for the component of a vector v in the direction U_e (the e-th column of U) not to be swamped by other components, we can limit its decay after TPI iterations as follows: (S(e)/S(1))^TPI ≥ e × eps / TOL, where S(e) is the exact e-th eigenvalue, S(1) = 1, eps is the machine precision, and TOL is the requested accuracy. Because we do not have access to the exact value S(e) at the beginning of the interpolation procedure, we estimate it from the coarse eigenvalues S~. This leads to the bound on the power iterations TPI derived on line 8 above.
Third, the interpolation process and the power iterations need not preserve orthogonality of the eigenvectors in U. We fix this with a Gram-Schmidt orthogonalization procedure (line 15). Finally, there is still a problem with the power iterations that needs to be resolved: it is very hard to separate nearby eigenvalues. In particular, for the convergence of the power iterations the ratio that matters is between the (e+1)-st and e-th eigenvalues. So the idea we pursue is to use the power iterations only to separate the reduced space of eigenvectors (of dimension e) from the orthogonal subspace (of dimension n − e). We then use a full SVD on the reduced space to update the leading eigenvectors U, and eigenvalues S, for the fine scale (lines 16-19). This idea is similar to computing the Ritz values and Ritz vectors in a Rayleigh-Ritz method.

6 Interpolation Results

Our multi-scale decomposition code is in Matlab. For the direct eigendecomposition, we have used the Matlab program svds.m, which invokes the compiled ARPACKC routine [13], with a default convergence tolerance of 1e-10. In Fig. 1a we compare the spectrum S obtained from a three-scale decomposition on the eye image (blue line) with the ground truth, which is the spectrum S_L resulting from direct eigendecomposition of the fine-scale normalized affinity matrix L (red circles). There is excellent agreement in the leading eigenvalues. To illustrate this, we show the absolute relative error between the spectra, |S − S_L| / S_L, in Fig. 1b. The spectra agree mostly, except for the last few eigenvalues. For a quantitative comparison between the eigenvectors, we plot in Fig. 1c the measure 1 − diag(|U^T U_L|), where U is the matrix of eigenvectors obtained by the multi-scale approximation, U_L is the ground truth resulting from a direct eigendecomposition of the fine-scale affinity matrix L, and 1 is a vector of all ones.
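A NumPy transcription of the CoarseToFine pseudo-code might look as follows; this is a sketch under the stated assumptions, not the authors' Matlab implementation. np.linalg.qr stands in for Gram-Schmidt, the outer loop is safety-capped, and min |S~| < 1 is assumed in the power-iteration bound.

```python
import numpy as np

def coarse_to_fine(L, K, Uc, Sc, tol=1e-4, power_iters=50):
    """Interpolate coarse-scale eigenvectors Uc (m x e, eigenvalues Sc)
    up to the fine-scale normalized affinity matrix L (n x n)."""
    eps = np.finfo(float).eps
    e = Uc.shape[1]
    # Line 8: bound the number of power iterations, estimating the
    # fine-scale eigenvalues by the coarse ones.
    denom = np.log(np.min(np.abs(Sc)))
    tpi = power_iters if denom == 0 else \
        min(power_iters, int(np.ceil(np.log(e * eps / tol) / denom)))
    U = K @ Uc                                       # line 9: interpolate
    for _ in range(100):                             # safety-capped outer loop
        U_old = U
        for _ in range(max(tpi, 1)):                 # lines 12-14: power iterations
            U = L @ U
        U, _ = np.linalg.qr(U)                       # line 15: orthogonalize
        Le = U.T @ L @ U                             # line 16: e x e projection
        Ue, Se, _ = np.linalg.svd(Le)                # line 17: small eigenproblem
        U = U @ Ue                                   # line 18: rotate the basis
        S = Se                                       # line 19: leading eigenvalues
        inner = 1.0 - np.abs(np.sum(U_old * U, axis=0))   # line 20
        if np.max(np.abs(inner)) < tol:              # line 21: converged
            break
    return U, S
```

On a toy graph with a clear cluster structure (so the e-th and (e+1)-st eigenvalues are well separated), a perturbed copy of the true leading eigenvectors is restored to high accuracy by this routine.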
The relative-error plot demonstrates a close match, within the tolerance threshold of 1e-4 chosen for the multi-scale method, in the leading eigenvector directions between the two methods. The relative error is high for the last few eigenvectors, which suggests that the power iterations have not clearly separated them from the other directions. The strategy we therefore suggest is to pad the required number of leading eigenvectors by about 20% before invoking the multi-scale procedure. Obviously, the number of hierarchical stages for the multi-scale procedure must be chosen such that the transition matrix at the coarsest scale can accommodate this slight increase in subspace dimension. For lack of space we omit further results (see Ch. 8 in [6]). Next we measure the time the hierarchical eigensolver takes to compute the leading eigenbasis for various input sizes, in comparison with the svds.m procedure [13]. We form images of different input sizes by Gaussian smoothing of i.i.d. noise; the Gaussian function has a standard deviation of 3 pixels. The edges in the graph G are defined by the standard 8-neighbourhood of each pixel, and the edge weights between neighbouring pixels are given by a function of the difference in the corresponding intensities (see §4.3). The affinity matrix A with these edge weights is then used to generate a Markov transition matrix M. The fast eigensolver is run on ten different instances of the input image of a given size, and the average of these times is reported here. For a fair comparison between the two procedures, we set the convergence tolerance for the svds.m procedure to 1e-4, the same value used for the fast eigensolver. We found the hierarchical representation derived with this tolerance threshold to be sufficiently accurate for the novel min-cut based segmentation results that we reported in [8].
Also, the subspace dimensionality is fixed at 51, where we expect (and indeed observe) the leading 40 eigenpairs derived from the multi-scale procedure to be accurate. Hence, when invoking svds.m we compute only the leading 41 eigenpairs. In the table shown below, the first column gives the number of nodes in the graph, the second and third columns report the time in seconds taken by the svds.m procedure and by the Matlab implementation of the multi-scale eigensolver respectively, and the fourth column reports the speedup of the multi-scale eigensolver over the svds.m procedure, all on a standard desktop (Intel P4, 2.5 GHz, 1 GB RAM). Lowering the tolerance threshold for svds.m made it faster by about 20-30%. Despite this, the multi-scale algorithm clearly outperforms the svds.m procedure. The most expensive step in the multi-scale algorithm is the power iteration required in the last stage, that is, interpolating eigenvectors from the first coarse scale to the required fine scale. The complexity is of order n × e, where e is the subspace dimensionality and n is the size of the graph. Indeed, from the table we can see that the multi-scale procedure takes time roughly proportional to n. Deviations from the linear trend are observed at specific values of n, which we believe are due to variations in the difficulty of the specific eigenvalue problem (e.g., nearly multiple eigenvalues).

  n        svds.m    Multi-Scale   Speedup
  32^2        1.6         1.5         1.1
  63^2       10.8         4.9         2.2
  64^2       20.5         5.5         3.7
  65^2       12.6         5.1         2.5
  100^2      44.2        13.1         3.4
  127^2      91.1        20.4         4.5
  128^2     230.9        35.2         6.6
  129^2      96.9        20.9         4.6
  160^2     179.3        34.4         5.2
  255^2     819.2        90.3         9.1
  256^2    2170.8       188.7        11.5
  257^2     871.7        93.3         9.3
  511^2    7977.2       458.8        17.4
  512^2   20269         739.3        27.4
  513^2    7887.2       461.9        17.1
  600^2   10841.4       644.2        16.8
  700^2   15048.8      1162.4        12.9
  800^2       -        1936.6          -

The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8].
Here we explored the use of random walks and associated spectral embedding techniques for the automatic generation of suitable proposal (source and sink) regions for a min-cut based algorithm. The multi-scale algorithm was used to generate the 40 leading eigenvectors of large transition matrices (e.g., of size 20K × 20K). In terms of future work, it will be useful to compare our work with other approximate methods for SVD, such as [23].

Acknowledgements: We thank S. Roweis, F. Estrada and M. Sakr for valuable comments.

References
[1] D. Achlioptas and F. McSherry. Fast Computation of Low-Rank Approximations. STOC, 2001.
[2] D. Achlioptas et al. Sampling Techniques for Kernel Methods. NIPS, 2001.
[3] S. Barnard and H. Simon. Fast Multilevel Implementation of Recursive Spectral Bisection for Partitioning Unstructured Problems. PPSC, 627-632.
[4] M. Belkin et al. Laplacian Eigenmaps and Spectral Techniques for Embedding. NIPS, 2001.
[5] M. Brand et al. A unifying theorem for spectral embedding and clustering. AI & STATS, 2002.
[6] C. Chennubhotla. Spectral Methods for Multi-scale Feature Extraction and Spectral Clustering. Ph.D. Thesis, Department of Computer Science, University of Toronto, Canada, 2004. http://www.cs.toronto.edu/~chakra/thesis.pdf
[7] C. Chennubhotla and A. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS, 2002.
[8] F. Estrada, A. Jepson and C. Chennubhotla. Spectral Embedding and Min-Cut for Image Segmentation. Manuscript under review, 2004.
[9] C. Fowlkes et al. Efficient spatiotemporal grouping using the Nystrom method. CVPR, 2001.
[10] S. Belongie et al. Spectral Partitioning with Indefinite Kernels using the Nystrom app. ECCV, 2002.
[11] A. Frieze et al. Fast Monte-Carlo Algorithms for finding low-rank approximations. FOCS, 1998.
[12] Y. Koren et al. ACE: A Fast Multiscale Eigenvectors Computation for Drawing Huge Graphs. IEEE Symp. on InfoVis, 2002, pp. 137-144.
[13] R. B. Lehoucq, D. C. Sorensen and C. Yang.
ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems by Implicitly Restarted Arnoldi Methods. SIAM, 1998.
[14] J. J. Lin. Reduced Rank Approximations of Transition Matrices. AI & STATS, 2002.
[15] L. Lovász. Random Walks on Graphs: A Survey. Combinatorics, 1996, 353-398.
[16] G. J. McLachlan et al. Mixture Models: Inference and Applications to Clustering. 1988.
[17] M. Meila and J. Shi. A random walks view of spectral segmentation. AI & STATS, 2001.
[18] A. Ng, M. Jordan and Y. Weiss. On Spectral Clustering: analysis and an algorithm. NIPS, 2001.
[19] A. Pothen. Graph partitioning algorithms with applications to scientific computing. Parallel Numerical Algorithms, D. E. Keyes et al. (eds.), Kluwer Academic Press, 1996.
[20] G. L. Scott et al. Feature grouping by relocalization of eigenvectors of the proximity matrix. BMVC, pp. 103-108, 1990.
[21] E. Sharon et al. Fast Multiscale Image Segmentation. CVPR, I:70-77, 2000.
[22] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, August 2000.
[23] H. Simon et al. Low-Rank Matrix Approximation Using the Lanczos Bidiagonalization Process with Applications. SIAM J. of Sci. Comp., 21(6):2257-2274, 2000.
[24] N. Tishby et al. Data clustering by Markovian Relaxation. NIPS, 2001.
[25] C. Williams et al. Using the Nystrom method to speed up the kernel machines. NIPS, 2001.
3 0.80106831 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
4 0.7997942 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
5 0.79644543 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the first-order conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
6 0.79296827 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
7 0.79241943 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.79221708 70 nips-2004-Following Curved Regularized Optimization Solution Paths
9 0.79182547 163 nips-2004-Semi-parametric Exponential Family PCA
10 0.79175282 73 nips-2004-Generative Affine Localisation and Tracking
11 0.79146093 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
12 0.79093516 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
13 0.79005986 69 nips-2004-Fast Rates to Bayes for Kernel Machines
14 0.78898799 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
15 0.78861964 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
16 0.78787798 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
17 0.787431 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
18 0.78735411 124 nips-2004-Multiple Alignment of Continuous Time Series
19 0.78689367 127 nips-2004-Neighbourhood Components Analysis
20 0.78655118 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification