nips nips2006 nips2006-57 knowledge-graph by maker-knowledge-mining

57 nips-2006-Conditional mean field


Source: pdf

Author: Peter Carbonetto, Nando de Freitas

Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. Mean field approximations have been applied to a broader set of problems, but the solutions are often poor. We propose a new class of conditionally-specified variational approximations based on mean field theory. While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean field. Moreover, experiments on a well-studied problem— inferring the stable configurations of the Ising spin glass—show that the solutions can be significantly better than those obtained using sum-product-based methods. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Conditional mean field. Nando de Freitas, Department of Computer Science, University of British Columbia, Vancouver, BC, Canada V6T 1Z4. nando@cs. [sent-1, score-0.134]

2 Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. [sent-5, score-0.361]

3 We propose a new class of conditionally-specified variational approximations based on mean field theory. [sent-7, score-0.48]

4 While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean field. [sent-8, score-0.193]

5 1 Introduction. Behind all variational methods for inference in probabilistic models lies a basic principle: treat the quantities of interest, which amount to moments of the random variables, as the solution to an optimization problem obtained via convex duality. [sent-10, score-0.299]

6 Since optimizing the dual is rarely an amelioration over the original inference problem, various strategies have arisen out of statistical physics and machine learning for making principled (and unprincipled) approximations to the objective. [sent-11, score-0.171]

7 One such class of techniques, mean field theory, requires that the solution define a distribution that factorizes in such a way that the statistics of interest are easily derived. [sent-12, score-0.184]

8 As remarked by Yedidia in [17], however, mean field theory often imposes unrealistic or questionable factorizations, leading to poor solutions. [sent-14, score-0.134]

9 Advances have been made in improving the quality of mean field approximations [17, 22, 26], but their applicability remains limited to specific models. [sent-15, score-0.26]

10 Related variational approximations based on convex combinations of tree-structured distributions [24] have the added advantage that they possess a unique global optimum (by contrast, we can only hope to discover a local minimum of the Bethe-Kikuchi and mean field objectives). [sent-17, score-0.48]

11 Expectation propagation projections and Monte Carlo approximations to the sum-product messages get around these limitations, but can be unsuitable for dense graphs or can introduce extraordinary computational costs [5, 23]. [sent-19, score-0.209]

12 Thus, there still exist factorized probabilistic models, such as sigmoid belief networks [21] and latent Dirichlet allocation [5], whereby mean field remains to date the tractable approximation of choice. [sent-20, score-0.357]

13 Several Monte Carlo methods have been proposed to correct for the discrepancy between the factorized variational approximations and the target distribution. [sent-21, score-0.436]

14 These methods include importance sampling [8, 14] and adaptive Markov Chain Monte Carlo (MCMC) [6]. [sent-22, score-0.157]

15 However, none of these techniques scale well to general, high-dimensional state spaces because the variational approximations tend to be too restrictive when used as a proposal distribution. [sent-23, score-0.253]

16 We propose an entirely new approach that overcomes the problems of the aforementioned methods by constructing a sequence of variational approximations that converges to the target distribution. [sent-25, score-0.422]

17 To accomplish this, we derive a new class of conditionally-specified mean field approximations, and use sequential Monte Carlo (SMC) [7] to obtain samples from them. [sent-26, score-0.193]

18 SMC acts as a mechanism to migrate particles from an easy-to-sample distribution (naive mean field) to a difficult-to-sample one (the distribution of interest), through a sequence of artificial distributions. [sent-27, score-0.368]

19 Each artificial distribution is a conditional mean field approximation, designed in such a way that it is at least as sensible as its predecessor because it recovers dependencies left out by mean field. [sent-28, score-0.488]

20 The problem is that the variance of the importance weights tends to degenerate around a system’s critical range of temperatures, as observed in [9]. [sent-37, score-0.227]

21 If, however, a reintroduced constraint has a large effect on the distribution, the particles may again rapidly deteriorate. [sent-40, score-0.146]

22 We limit our study to the Ising spin glass model [16]. [sent-41, score-0.173]

23 Conditional mean field raises many questions, and since we can only hope to answer some in this study, the Ising model represents a respectable first step. [sent-45, score-0.134]

24 We consider random variables x = (x1, . . . , xn)T ∈ Ω that admit a distribution belonging to the standard exponential family [25]. [sent-56, score-0.215]

25 Using the fact that − log(x) is convex, we obtain the variational lower bound Ψ(θ) = log E_{p(·;α)}[ exp(θ^T φ(X)) / p(X; α) ] ≥ θ^T µ(α) − ∫ p(x; α) log p(x; α) dx, (2) where the mean statistics are defined by µ(α) ≡ E_{p(·;α)}{φ(X)}. [sent-60, score-0.445]
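
For readability, the Jensen step behind (2) can be written out under the definitions above; this is only a restatement of the bound, with H(α) denoting the entropy of p(·; α):

```latex
\Psi(\theta)
  = \log \int \exp\big(\theta^{\top}\phi(x)\big)\,dx
  = \log \mathbb{E}_{p(\cdot;\alpha)}\!\left[\frac{\exp\big(\theta^{\top}\phi(X)\big)}{p(X;\alpha)}\right]
  \;\ge\; \mathbb{E}_{p(\cdot;\alpha)}\!\left[\theta^{\top}\phi(X) - \log p(X;\alpha)\right]
  = \theta^{\top}\mu(\alpha) + H(\alpha).
```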

26 As it is presented here, the variational principle is of little practical use because no tractable expressions exist for the entropy and mean statistics. [sent-66, score-0.493]

27 There do, however, exist particular choices of the variational parameters α where it is possible to compute them both. [sent-67, score-0.22]

28 We shall examine one particular set of choices, naive mean field, in the context of the Ising spin glass model. [sent-68, score-0.425]

29 For each i ∈ {1, . . . , n}, the random variable Xi is defined to be xi = +1 if the magnetic dipole is in the "up" spin position, or xi = −1 if it is "down". [sent-72, score-0.314]

30 We use an undirected graph G = (V, E), with vertices V = {1, . . . , n}, to represent the conditional independence structure of the probability measure (there is no edge between i and j if and only if Xi and Xj are conditionally independent given values at all other points of the graph). [sent-79, score-0.148]

31 Associating singleton factors with nodes of G and pairwise factors with its edges, and setting the entries of the sufficient statistics vector to be xi, ∀ i ∈ V, and xi xj, ∀ (i, j) ∈ E, we can write the probability density as p(x; θ) = exp( Σ_{i∈V} θi xi + Σ_{(i,j)∈E} θij xi xj − Ψ(θ) ). (3) [sent-80, score-0.533]
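
As a concrete illustration of the parameterization in (3), the sketch below evaluates the unnormalized Ising log-density for a spin configuration; the data layout (a vertex parameter vector, an edge list, and an edge parameter vector) is an assumption made for the example, not something specified in the paper.

```python
import numpy as np

def ising_log_density(x, theta_v, theta_e, edges):
    """Unnormalized log p(x; theta) = sum_i theta_i x_i + sum_(i,j) theta_ij x_i x_j.

    x       : spins in {-1, +1}, shape (n,)
    theta_v : singleton parameters theta_i, shape (n,)
    theta_e : pairwise parameters theta_ij, one per edge
    edges   : list of (i, j) vertex pairs defining E
    """
    singleton = float(np.dot(theta_v, x))
    pairwise = sum(t * x[i] * x[j] for (i, j), t in zip(edges, theta_e))
    return singleton + pairwise
```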

32 The corresponding variational lower bound on the log-partition function Ψ(θ) then decomposes as F(α) ≡ Σ_{i∈V} θi µi(α) + Σ_{(i,j)∈E} θij µij(α) + H(α), (4) where µi(α) and µij(α) are the expectations of single spins i and pairs of spins (i, j), respectively. [sent-81, score-0.417]

33 Naive mean field restricts the variational parameters α to belong to {α | ∀ (i, j) ∈ E, αij = 0}. [sent-82, score-0.354]

34 We can compute the lower bound (4) for any α belonging to this subset because we have tractable expressions for the mean statistics and entropy. [sent-83, score-0.324]

35 For the Ising spin glass, the mean statistics are µi(α) ≡ ∫ xi p(x; α) dx = tanh(αi) (5) and µij(α) ≡ ∫ xi xj p(x; α) dx = µi(α) µj(α), (6) and the entropy is derived to be H(α) = − Σ_{i∈V} [ (1−µi(α))/2 log((1−µi(α))/2) + (1+µi(α))/2 log((1+µi(α))/2) ]. (7) [sent-84, score-0.691]
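
The fixed point of (4) under the naive mean field constraint can be found with the standard coordinate-ascent update αi ← θi + Σ_{j∈N(i)} θij tanh(αj); the paper itself optimizes the unconstrained parameters with a numerical solver, so the sketch below is only a minimal illustrative implementation of (4)–(7), not the authors' code.

```python
import numpy as np

def naive_mean_field_ising(theta_v, theta_e, edges, n, iters=200):
    """Coordinate-ascent naive mean field for the Ising model.

    Returns alpha, the mean statistics mu_i = tanh(alpha_i) from (5),
    and the lower bound (4) assembled from (5)-(7).
    """
    nbrs = [[] for _ in range(n)]                     # adjacency list with edge weights
    for (i, j), t in zip(edges, theta_e):
        nbrs[i].append((j, t))
        nbrs[j].append((i, t))

    alpha = np.asarray(theta_v, dtype=float).copy()   # initialize at the singleton parameters
    for _ in range(iters):
        for i in range(n):
            alpha[i] = theta_v[i] + sum(t * np.tanh(alpha[j]) for j, t in nbrs[i])

    mu = np.tanh(alpha)
    p_up = np.clip((1 + mu) / 2, 1e-12, 1.0)          # clip to keep the entropy finite
    p_dn = np.clip((1 - mu) / 2, 1e-12, 1.0)
    H = -np.sum(p_dn * np.log(p_dn) + p_up * np.log(p_up))                  # entropy (7)
    F = (np.dot(theta_v, mu)
         + sum(t * mu[i] * mu[j] for (i, j), t in zip(edges, theta_e))      # mu_ij = mu_i mu_j, (6)
         + H)                                                               # lower bound (4)
    return alpha, mu, F
```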

36 Since the variables µi must be valid mean statistics, they are constrained to lie within an envelope known as the marginal polytope [25]. [sent-86, score-0.211]

37 Alternatively, one can solve the optimization problem with respect to the unconstrained variational parameters α. [sent-87, score-0.254]

38 This approach, as we will see, is necessary for optimizing the conditional mean field objective. [sent-89, score-0.282]

39 Assuming familiarity with importance sampling, this will be sufficient to explain key concepts underlying SMC, and does not overwhelm the reader with subscripts. [sent-91, score-0.157]

40 In the first step, samples x(s) ∈ Ω are drawn from some proposal density q(x) and assigned importance weights w(x) = π(x)/q(x). [sent-93, score-0.301]

41 In the second step, a Markov transition kernel K(x′ | x) shifts each sample towards the target, and the importance weights w̃(x, x′) compensate for any failure to do so. [sent-94, score-0.358]

42 In effect, the second step consists of extending the path of each particle onto the joint space Ω × Ω. [sent-95, score-0.166]

43 To prevent potential particle degeneracy in the marginal space, we adopt the standard stratified resampling algorithm [13]. [sent-98, score-0.189]
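
For completeness, here is one common form of the stratified resampling step (the algorithm cited as [13]); the exact scheme used in the paper may differ in details, and the interface here is an assumption.

```python
import numpy as np

def stratified_resample(weights, rng):
    """Return S ancestor indices drawn by stratified resampling.

    weights : importance weights (need not be pre-normalized), shape (S,)
    rng     : a numpy Generator, e.g. np.random.default_rng(0)
    """
    S = len(weights)
    cum = np.cumsum(weights)
    cum = cum / cum[-1]                              # normalize the cumulative weights
    positions = (rng.random(S) + np.arange(S)) / S   # one uniform draw per stratum
    return np.searchsorted(cum, positions)
```

After resampling, the particles indexed by the returned ancestors are kept and the weights are reset to 1/S.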

44 Loosely speaking, this means that if π(x) were to be a mean field approximation, then it would likely have lighter tails than the target distribution π̃(x). [sent-101, score-0.174]

45 3], the importance weights would simplify to w̃(x, x′) = π̃(x) / π(x) × w(x). (9) [sent-106, score-0.227]

46 Implicitly, this is the choice of backward kernel made in earlier sequential frameworks [11, 15]. [sent-107, score-0.211]

47 Since the mean field approximation π(x) might very well fail to "dominate" the target π̃(x), the expression (9) risks having unbounded variance. [sent-108, score-0.209]

48 This is a problem because the weights may change abruptly from one iteration to the next, or give too much importance to too few values x [18]. [sent-109, score-0.227]

49 An alternative is the backward kernel L(x | x′) = K(x′ | x) π(x) / ∫ K(x′ | x) π(x) dx. (10) It offers some hope because the resulting importance weights on the joint space, following (8), are w̃(x, x′) = π̃(x′) / [ ∫ K(x′ | x) π(x) dx ] × w(x). (11) [sent-115, score-0.35]

50 If the transition kernel increases the mass of the proposal in regions where π(x) is weak relative to π̃(x), the backward kernel (10) will rectify the problems caused by an overconfident proposal. [sent-116, score-0.355]

51 The drawback of the backward kernel (10) is that it limits the choice of transition kernel K(x′ | x), a crucial ingredient to a successful SMC simulation. [sent-118, score-0.283]

52 One transition kernel which fits our requirements and is widely applicable is a mixture of kernels based on the random-scan Gibbs sampler [18]. [sent-120, score-0.175]
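
To make this concrete, the sketch below performs one move of such a random-scan Gibbs kernel for an Ising-form density (which covers both the target p(x; θ) and the conditional mean field approximations p(x; α), since the latter have the same Ising form with some pairwise parameters set to zero). Variable names and the flat edge-list representation are assumptions made for the example.

```python
import numpy as np

def random_scan_gibbs_step(x, theta_v, theta_e, edges, rho, rng):
    """One random-scan Gibbs move: pick site k ~ rho, resample x_k | x_{-k}.

    For an Ising model, p(x_k = +1 | x_{-k}) = sigmoid(2 * (theta_k + sum_j theta_kj x_j)).
    """
    x = x.copy()
    k = rng.choice(len(x), p=rho)
    a = theta_v[k]                                   # local field at site k
    for (i, j), t in zip(edges, theta_e):
        if i == k:
            a += t * x[j]
        elif j == k:
            a += t * x[i]
    p_up = 1.0 / (1.0 + np.exp(-2.0 * a))            # p(x_k = +1 | x_{-k})
    x[k] = 1 if rng.random() < p_up else -1
    return x
```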

53 Following (11) and the identity for conditional probability, we arrive at the expression for the importance weights, w̃(x, x′) = π̃(x′)/π(x′) × [ Σ_k ρk π̃(x′k | x′−k) / π(x′k | x′−k) ]^{−1} × w(x). (13) [sent-122, score-0.305]
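
A sketch of the corresponding weight update in (13), assuming the single-site conditionals of the current approximation π and the next approximation π̃ can be evaluated (for conditional mean field they are available in closed form); all names here are placeholders rather than the authors' interface.

```python
import numpy as np

def smc_weight_update(w, log_pi_tilde, log_pi, cond_ratio, rho):
    """Weight update (13) after a random-scan Gibbs move x -> x'.

    w            : previous marginal importance weight w(x)
    log_pi_tilde : log of the next target (up to a constant), evaluated at x'
    log_pi       : log of the current approximation (up to a constant), at x'
    cond_ratio   : cond_ratio[k] = pi_tilde(x'_k | x'_{-k}) / pi(x'_k | x'_{-k})
    rho          : site-selection probabilities of the Gibbs mixture kernel
    """
    correction = float(np.dot(rho, cond_ratio))      # sum_k rho_k * ratio of conditionals
    return w * np.exp(log_pi_tilde - log_pi) / correction
```

As the next sentence notes, unnormalized densities suffice here because the weights are subsequently normalized.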

54 The normalized importance sampling estimator [18] yields (asymptotically unbiased) importance weights ŵ(x, x′) ∝ w̃(x, x′), where the unnormalized importance weights w̃(x, x′) in the joint space remain the same as (13), except that we substitute π(x) for f(x), and π̃(x) for f̃(x). [sent-126, score-0.659]

55 4 Conditional mean field. We start with a partition R (equivalence relation) of the set of vertices V. [sent-128, score-0.18]

56 Our strategy is to come up with a good naive mean field approximation to the conditional density p(xA | x−A ; θ) for every equivalence class A ∈ R, and then again for every configuration x−A . [sent-130, score-0.509]

57 The crux of the matter is that for any point α, the functions p(xA | x−A ; α) only represent valid conditional densities if they correspond to some unique joint, as discussed in [2]. [sent-132, score-0.189]

58 Fortunately, under the Ising model the terms p(xA | x−A ; α) represent valid conditionals for any α. [sent-133, score-0.137]

59 Conditional mean field forces each conditional p(xA | x−A ; α) to decompose as a product of marginals p(xi | x−A ; α), for all i ∈ A. [sent-136, score-0.282]

60 Notice that we have a set of free variational parameters αij defined on the edges (i, j) that straddle subsets of the partition. [sent-138, score-0.322]

61 Our variational formulation consists of competing objectives, since the conditionals p(xA | x−A ; α) share a common set of parameters. [sent-141, score-0.316]

62 We formulate the final objective function as a linear combination of conditional objectives. [sent-142, score-0.148]

63 A conditional mean field optimization problem with respect to graph partition R and linear weights λ is of the form: maximize F_{R,λ}(α) ≡ Σ_{A∈R} Σ_{xN(A)} λA(xN(A)) FA(α, xN(A)), subject to αij = 0 for all (i, j) ∈ E \ CR. (15) [sent-143, score-0.503]
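
To spell out the constraint set in (15): for a given partition R, the pairwise parameters αij on edges inside a subset are fixed to zero (each conditional factorizes completely, as in naive mean field), while the parameters on the connecting edges CR between subsets remain free. Below is a minimal sketch of that bookkeeping, with an assumed list-of-sets representation of R.

```python
def connecting_edges(edges, partition):
    """Return C_R, the edges whose alpha_ij stay free under constraint (15).

    edges     : list of (i, j) vertex pairs, the edge set E
    partition : list of sets of vertices, the equivalence classes A in R
    """
    block = {}                                  # map vertex -> index of its subset
    for b, A in enumerate(partition):
        for i in A:
            block[i] = b
    # alpha_ij = 0 on E \ C_R; only edges straddling two subsets remain free
    return [(i, j) for (i, j) in edges if block[i] != block[j]]
```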

64 Each conditional objective FA (α, xN (A) ) represents a naive mean field lower bound to the log-partition function of the conditional density p(xA | x−A ; θ) = p(xA | xN (A) ; θ). [sent-146, score-0.675]

65 For the Ising model, FA (α, xN (A) ) follows from the exact same steps used in the derivation of the naive mean field lower bound in Sec. [sent-147, score-0.293]

66 The entropy is identical to (7), with the mean statistics replaced with their conditional counterparts. [sent-150, score-0.372]

67 Unlike standard naive mean field, conditional mean field optimizes over the pairwise interactions αij defined on the connecting edges (i, j) ∈ CR . [sent-164, score-0.627]

68 Second, the conditional objective of a singleton subset has a unique maximum at αi = θi , so any solution to (15) is guaranteed to recover the original distribution when |R| = n. [sent-168, score-0.186]

69 1 The Conditional mean field algorithm. We propose an SMC algorithm that produces progressively refined particle estimates of the mean statistics, in which conditional mean field acts in a supporting role. [sent-170, score-0.704]

70 The initial SMC distribution is obtained by solving (15) for R = {V }, which amounts to the mean field approximation derived in Sec. [sent-171, score-0.169]

71 In subsequent steps, we iteratively solve (15), update the estimates of the mean statistics by reweighting (see (20)) and occasionally resampling the particles, then we split the partition until we cannot split it anymore, at which point |R| = n and we recover the target p(x; θ). [sent-173, score-0.343]

72 Figure 1: The graphs on the left depict the Markov properties of the conditional mean field approximations in steps 1 to 4. [sent-174, score-0.408]

73 Note that this estimate is not a variational lower bound. [sent-180, score-0.22]

74 We currently have a particle estimate of the R-partition conditional mean field approximation p(x; α) with samples x(s) and marginal importance weights w(s) . [sent-182, score-0.698]

75 In this manner, we ensure that the sequence is progressing toward the target (provided R = R ), and that it is always possible to evaluate the importance weights. [sent-186, score-0.233]

76 Next, we use the random-scan Gibbs sampler (12) to shift the particles toward the new distribution, where the Gibbs sites k correspond to the subsets B ∈ R′. [sent-188, score-0.28]

77 To obtain the particle estimate of the new distribution, we normalize the weights ŵ(s) ∝ w̃(s), assign the marginal importance weights w(s) ← ŵ(s), and set x(s) ← (x′)(s). [sent-192, score-0.451]

78 We assume we have enough particles to recover the distributions almost perfectly. [sent-197, score-0.184]

79 Setting R = {{1, 2, 3, 4}}, the first artificial distribution is the naive mean field solution α1:4 = (0. [sent-198, score-0.252]

80 Knowing that the true mean statistics are µ1:4 = (0. [sent-204, score-0.184]

81 27), and Var(Xi) = 1 − µi^2, it is easy to see that naive mean field largely underestimates the variance of the spins. [sent-208, score-0.252]

82 In step 2, we split the partition into R = {{1, 2}, {3, 4}}, and the new conditional mean field approximation is given by α1:4 = (0. [sent-209, score-0.363]

83 Tempering between successive distributions with intermediate densities p(x; α)^{1−γ} p(x; α′)^γ, γ ∈ [0, 1], gives the transition kernel more opportunity to correctly migrate the samples to the next distribution. [sent-229, score-0.183]
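
A sketch of that tempering bridge: each intermediate density is a geometric mixture of the current and next approximations, and the experiments described below use a linear schedule of length 100. Function and variable names here are illustrative, not the authors'.

```python
import numpy as np

def tempered_log_density(log_p_curr, log_p_next, gamma):
    """Log of the bridge p(x; alpha)^(1 - gamma) * p(x; alpha')^gamma, up to a constant.

    log_p_curr : log-density of the current approximation p(x; alpha) at x
    log_p_next : log-density of the next approximation p(x; alpha') at x
    gamma      : inverse temperature in [0, 1]
    """
    return (1.0 - gamma) * log_p_curr + gamma * log_p_next

# linear inverse-temperature schedule sweeping from the current to the next distribution
T = 100
schedule = np.linspace(0.0, 1.0, T + 1)[1:]
```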

84 To our knowledge, this is the only SMC implementation in which the next distribution in the sequence is constructed dynamically according to the particle approximation from the previous step. [sent-233, score-0.189]

85 (b) Average error of the mean statistics according to the hot coupling (HC), conditional mean field algorithm (CMF), Bethe-Kikuchi variational approximation (B-K), and tree-reweighted upper bound (TRW) estimates. [sent-238, score-0.938]

86 We use 1000 particles (as with most particle methods, the running time is proportional to the number of particles), and we temper across successive distributions with a linear inverse temperature schedule of length 100. [sent-246, score-0.264]

87 The particles are resampled when the effective sample size [18] drops below 1/2. [sent-247, score-0.146]
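
One common reading of this criterion, sketched here for concreteness: the effective sample size of the normalized weights is compared against half the number of particles (the threshold of 1/2 mentioned above); this interpretation and the function names are ours, not the paper's.

```python
import numpy as np

def effective_sample_size(weights):
    """Standard effective sample size 1 / sum(w_s^2) of normalized weights [18]."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def needs_resampling(weights, n_particles):
    """Trigger resampling when the ESS falls below half the particle count."""
    return effective_sample_size(weights) < 0.5 * n_particles
```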

88 We compare our results with the “hot coupling” SMC algorithm described in [9] (appropriately, using the same algorithm settings), and with two sum-product methods based on Bethe-Kikuchi approximations [1] and treereweighted upper bounds [24]. [sent-248, score-0.16]

89 We adopt the simplest formulation of both methods in which the regions (or junction graph nodes) are defined as the edges E. [sent-249, score-0.189]

90 Both Bethe-Kikuchi approximations and tree-reweighted upper bounds provide good approximations to the grid model. [sent-254, score-0.286]

91 The SMC algorithms proposed here and in [9], by contrast, produce significantly improved estimates of the mean statistics. [sent-258, score-0.134]

92 It is surprising that we achieve similar performance with hot coupling [9], given that we do not exploit the tractability of sum-product messages in the Ising model (which would offer guaranteed improvements due to the Rao-Blackwell theorem). [sent-259, score-0.185]

93 6 Conclusions and discussion We presented a sequential Monte Carlo algorithm in which each artificial distribution is the solution to a conditionally-specified mean field optimization problem. [sent-260, score-0.227]

94 We believe that the extra expense of nonlinear optimization at each step may be warranted in the long run as our method holds promise in solving more difficult inference problems, problems where Monte Carlo and variational methods alone perform poorly. [sent-261, score-0.299]

95 As noted in [22], naive mean field implies complete factorizability, which is not necessary under the Ising model. [sent-264, score-0.252]

96 Bethe-Kikuchi approximations based on junction graphs have many merits, but they cannot be considered candidates for our framework because they produce estimates of local mean statistics without defining a joint distribution. [sent-267, score-0.422]

97 2] to derive an expression for the importance weights that does not involve the joint. [sent-272, score-0.227]

98 Furthermore, conditions for guaranteeing the validity of conditional densities have been extensively studied in multivariate [2] and spatial statistics [3]. [sent-273, score-0.198]

99 Hot Coupling: a particle approach to inference and normalization on pairwise undirected graphs. [sent-337, score-0.163]

100 Variational approximations between mean field theory and the junction tree algorithm. [sent-464, score-0.324]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ising', 0.309), ('smc', 0.295), ('eld', 0.268), ('variational', 0.22), ('xn', 0.215), ('ij', 0.183), ('xa', 0.173), ('carlo', 0.172), ('monte', 0.172), ('importance', 0.157), ('conditional', 0.148), ('particles', 0.146), ('mean', 0.134), ('approximations', 0.126), ('particle', 0.118), ('naive', 0.118), ('spin', 0.096), ('conditionals', 0.096), ('hot', 0.091), ('cr', 0.086), ('xi', 0.084), ('spins', 0.078), ('fa', 0.077), ('glass', 0.077), ('dx', 0.075), ('recovers', 0.072), ('graph', 0.071), ('arti', 0.07), ('cial', 0.07), ('weights', 0.07), ('transition', 0.069), ('junction', 0.064), ('fr', 0.062), ('kernel', 0.062), ('gibbs', 0.061), ('sequential', 0.059), ('xk', 0.057), ('belief', 0.055), ('backward', 0.054), ('edges', 0.054), ('xj', 0.053), ('cmf', 0.052), ('factorizability', 0.052), ('hamze', 0.052), ('migrate', 0.052), ('nando', 0.052), ('overcon', 0.052), ('expressions', 0.051), ('coupling', 0.051), ('factorized', 0.05), ('magnetic', 0.05), ('statistics', 0.05), ('tractable', 0.048), ('subsets', 0.048), ('scalars', 0.048), ('joint', 0.048), ('markov', 0.047), ('solver', 0.046), ('partition', 0.046), ('logpartition', 0.045), ('moral', 0.045), ('inference', 0.045), ('sampler', 0.044), ('dent', 0.044), ('messages', 0.043), ('wainwright', 0.043), ('sites', 0.042), ('doucet', 0.041), ('freitas', 0.041), ('tanh', 0.041), ('temperatures', 0.041), ('density', 0.041), ('bound', 0.041), ('valid', 0.041), ('propagation', 0.04), ('entropy', 0.04), ('target', 0.04), ('connecting', 0.039), ('bc', 0.039), ('counting', 0.038), ('canada', 0.038), ('volume', 0.038), ('recover', 0.038), ('sequence', 0.036), ('british', 0.036), ('vancouver', 0.036), ('hc', 0.036), ('progressively', 0.036), ('choice', 0.036), ('marginal', 0.036), ('approximation', 0.035), ('resampling', 0.035), ('sigmoid', 0.035), ('del', 0.035), ('optimization', 0.034), ('upper', 0.034), ('guration', 0.033), ('proposal', 0.033), ('equivalence', 0.033), ('intelligence', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 57 nips-2006-Conditional mean field

Author: Peter Carbonetto, Nando de Freitas

Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. Mean field approximations have been applied to a broader set of problems, but the solutions are often poor. We propose a new class of conditionally-specified variational approximations based on mean field theory. While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean field. Moreover, experiments on a well-studied problem— inferring the stable configurations of the Ising spin glass—show that the solutions can be significantly better than those obtained using sum-product-based methods. 1

2 0.19372356 35 nips-2006-Approximate inference using planar graph decomposition

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: A number of exact and approximate methods are available for inference calculations in graphical models. Many recent approximate methods for graphs with cycles are based on tractable algorithms for tree structured graphs. Here we base the approximation on a different tractable model, planar graphs with binary variables and pure interaction potentials (no external field). The partition function for such models can be calculated exactly using an algorithm introduced by Fisher and Kasteleyn in the 1960s. We show how such tractable planar models can be used in a decomposition to derive upper bounds on the partition function of non-planar models. The resulting algorithm also allows for the estimation of marginals. We compare our planar decomposition to the tree decomposition method of Wainwright et. al., showing that it results in a much tighter bound on the partition function, improved pairwise marginals, and comparable singleton marginals. Graphical models are a powerful tool for modeling multivariate distributions, and have been successfully applied in various fields such as coding theory and image processing. Applications of graphical models typically involve calculating two types of quantities, namely marginal distributions, and MAP assignments. The evaluation of the model partition function is closely related to calculating marginals [12]. These three problems can rarely be solved exactly in polynomial time, and are provably computationally hard in the general case [1]. When the model conforms to a tree structure, however, all these problems can be solved in polynomial time. This has prompted extensive research into tree based methods. For example, the junction tree method [6] converts a graphical model into a tree by clustering nodes into cliques, such that the graph over cliques is a tree. The resulting maximal clique size (cf. tree width) may nevertheless be prohibitively large. Wainwright et. al. [9, 11] proposed an approximate method based on trees known as tree reweighting (TRW). The TRW approach decomposes the potential vector of a graphical model into a mixture over spanning trees of the model, and then uses convexity arguments to bound various quantities, such as the partition function. One key advantage of this approach is that it provides bounds on partition function value, a property which is not shared by approximations based on Bethe free energies [13]. In this paper we focus on a different class of tractable models: planar graphs. A graph is called planar if it can be drawn in the plane without crossing edges. Works in the 1960s by physicists Fisher [5] and Kasteleyn [7], among others, have shown that the partition function for planar graphs may be calculated in polynomial time. This, however, is true under two key restrictions. One is that the variables xi are binary. The other is that the interaction potential depends only on xi xj (where xi ∈ {±1}), and not on their individual values (i.e., the zero external field case). Here we show how the above method can be used to obtain upper bounds on the partition function for non-planar graphs. As in TRW, we decompose the potential of a non-planar graph into a sum over spanning planar models, and then use a convexity argument to obtain an upper bound on the log partition function. The bound optimization is a convex problem, and can be solved in polynomial time. 
We compare our method with TRW on a planar graph with an external field, and show that it performs favorably with respect to both pairwise marginals and the bound on the partition function, and the two methods give similar results for singleton marginals. 1 Definitions and Notations Given a graph G with n vertices and a set of edges E, we are interested in pairwise Markov Random Fields (MRF) over the graph G. A pairwise MRF [13] is a multivariate distribution over variables x = {x1 , . . . , xn } defined as 1 P p(x) = e ij∈E fij (xi ,xj ) (1) Z where fij are a set of |E| functions, or interaction potentials, defined over pairs of variables. The P partition function is defined as Z = x e ij∈E fij (xi ,xj ) . Here we will focus on the case where xi ∈ {±1}. Furthermore, we will be interested in interaction potentials which only depend on agreement or disagreement between the signs of their variables. We define those by 1 θij (1 + xi xj ) = θij I(xi = xj ) (2) 2 so that fij (xi , xj ) is zero if xi = xj and θij if xi = xj . The model is then defined via the set of parameters θij . We use θ to denote the vector of parameters θij , and denote the partition function by Z(θ) to highlight its dependence on these parameters. f (xi , xj ) = A graph G is defined as planar if it can be drawn in the plane without any intersection of edges [4]. With some abuse of notation, we define E as the set of line segments in 2 corresponding to the edges in the graph. The regions of 2 \ E are defined as the faces of the graph. The face which corresponds to an unbounded region is called the external face. Given a planar graph G, its dual graph G∗ is defined in the following way: the vertices of G∗ correspond to faces of G, and there is an edge between two vertices in G∗ iff the two corresponding faces in G share an edge. If the graph G is weighted, the weight on an edge in G∗ is the weight on the edge shared by the corresponding faces in G. A plane triangulation of a planar graph G is obtained from G by adding edges such that all the faces of the resulting graph have exactly three vertices. Thus a plane triangulated graph has a dual where all vertices have degree three. It can be shown that every plane graph can be plane triangulated [4]. We shall also need the notion of a perfect matching on a graph. A perfect matching on a graph G is defined as a set of edges H ⊆ E such that every vertex in G has exactly one edge in H incident on it. If the graph is weighted, the weight of the matching is defined as the product of the weights of the edges in the matching. Finally, we recall the definition of a marginal polytope of a graph [12]. Consider an MRF over a graph G where fij are given by Equation 2. Denote the probability of the event I(xi = xj ) under p(x) by τij . The marginal polytope of G, denoted by M(G), is defined as the set of values τij that can be obtained under some assignment to the parameters θij . For a general graph G the polytope M(G) cannot be described using a polynomial number of inequalities. However, for planar graphs, it turns out that a set of O(n3 ) constraints, commonly referred to as triangle inequalities, suffice to describe M(G) (see [3] page 434). The triangle inequalities are defined by 1 TRI(n) = {τij : τij + τjk − τik ≤ 1, τij + τjk + τik ≥ 1, ∀i, j, k ∈ {1, . . . , n}} (3) Note that the above inequalities actually contain variables τij which do not correspond to edges in the original graph G. 
Thus the equality M(G) = TRI(n) should be understood as referring only to the values of τij that correspond to edges in the graph. Importantly, the values of τij for edges not in the graph need not be valid marginals for any MRF. In other words M(G) is a projection of TRI(n) on the set of edges of G. It is well known that the marginal polytope for trees is described via pairwise constraints. It is thus interesting that for planar graphs, it is triplets, rather than pairwise 1 The definition here is slightly different from that in [3], since here we refer to agreement probabilities, whereas [3] refers to disagreement probabilities. This polytope is also referred to as the cut polytope. constraints, that characterize the polytope. In this sense, planar graphs and trees may be viewed as a hierarchy of polytope complexity classes. It remains an interesting problem to characterize other structures in this hierarchy and their related inference algorithms. 2 Exact calculation of partition function using perfect matching The seminal works of Kasteleyn [7] and Fisher [5] have shown how one can calculate the partition function for a binary MRF over a planar graph with pure interaction potentials. We briefly review Fisher’s construction, which we will use in what follows. Our interpretation of the method differs somewhat from that of Fisher, but we believe it is more straightforward. The key idea in calculating the partition function is to convert the summation over values of x to the problem of calculating the sum of weights of all perfect matchings in a graph constructed from G, as shown below. In this section, we consider weighted graphs (graphs with numbers assigned to their edges). For the graph G associated with the pairwise MRF, we assign weights wij = e2θij to the edges. The first step in the construction is to plane triangulate the graph G. Let us call the resulting graph GT . We define an MRF on GT by assigning a parameter θij = 0 to the edges that have been added to G, and the corresponding weight wij = 1. Thus GT essentially describes the same distribution as G, and therefore has the same partition function. We can thus restrict our attention to calculating the partition function for the MRF on GT . As a first step in calculating a partition function over GT , we introduce the following definition: a ˆ set of edges E in GT is an agreement edge set (or AES) if for every triangle face F in GT one of the ˆ ˆ following holds: The edges in F are all in E, or exactly one of the edges in F is in E. The weight ˆ is defined as the product of the weights of the edges in E. ˆ of a set E It can be shown that there exists a bijection between pairs of assignments {x, −x} and agreement edge sets. The mapping from x to an edge set is simply the set of edges such that xi = xj . It is easy to see that this is an agreement edge set. The reverse mapping is obtained by finding an assignment x such that xi = xj iff the corresponding edge is in the agreement edge set. The existence of this mapping can be shown by induction on the number of (triangle) faces. P The contribution of a given assignment x to the partition function is e ˆ sponds to an AES denoted by E it is easy to see that P e ij∈E θij I(xi =xj ) = e− P ij∈E θij P e ˆ ij∈E 2θij = ce P ˆ ij∈E ij∈E 2θij θij I(xi =xj ) =c wij . If x corre(4) ˆ ij∈E P where c = e− ij∈E θij . Define the superset Λ as the set of agreement edge sets. The above then implies that Z(θ) = 2c E∈Λ ij∈E wij , and is thus proportional to the sum of AES weights. 
ˆ ˆ To sum over agreement edge sets, we use the following elegant trick introduced by Fisher [5]. Construct a new graph GPM from the dual of GT by introducing new vertices and edges according to the following rule: Replace each original vertex with three vertices that are connected to each other, and assign a weight of one to the new edges. Next, consider the three neighbors of the original vertex 2 . Connect each of the three new vertices to one of these three neighbors, keeping the original weights on these edges. The transformation is illustrated in Figure 1. The new graph GPM has O(3n) vertices, and is also planar. It can be seen that there is a one to one correspondence between perfect matchings in GPM and agreement edge sets in GT . Define Ω to be the set of perfect matchings in GPM . Then Z(θ) = 2c M ∈Ω ij∈M wij where we have used the fact that all the new weights have a value of one. Thus, the partition function is a sum over the weights of perfect matchings in GPM . Finally, we need a way of summing over the weights of the set of perfect matchings in a graph. Kasteleyn [7] proved that for a planar graph GPM , this sum may be obtained using the following sequence of steps: • Direct the edges of the graph GPM such that for every face (except possibly the external face), the number of edges on its perimeter oriented in a clockwise manner is odd. Kasteleyn showed that such a so called Pfaffian orientation may be constructed in polynomial time for a planar graph (see also [8] page 322). 2 Note that in the dual of GT all vertices have degree three, since GT is plane triangulated. 1.2 0.7 0.6 1 1 1 0.8 0.6 0.8 1.5 1.4 1.5 1 1 1.2 1 1 1 1 0.7 1.4 1 1 1 Figure 1: Illustration of the graph transformations in Section 2 for a complete graph with four vertices. Left panel shows the original weighted graph (dotted edges and grey vertices) and its dual (solid edges and black vertices). Right panel shows the dual graph with each vertex replaced by a triangle (the graph GPM in the text). Weights for dual graph edges correspond to the weights on the original graph. • Define the matrix P (GPM ) to be a skew symmetric matrix such that Pij = 0 if ij is not an edge, Pij = wij if the arrow on edge ij runs from i to j and Pij = −wij otherwise. • The sum over weighted matchings can then be shown to equal |P (GPM )|. The partition function is thus given by Z(θ) = 2c |P (GPM )|. To conclude this section we reiterate the following two key points: the partition function of a binary MRF over a planar graph with interaction potentials as in Equation 2 may be calculated in polynomial time by calculating the determinant of a matrix of size O(3n). An important outcome of this result is that the functional relation between Z(θ) and the parameters θij is known, a fact we shall use in what follows. 3 Partition function bounds via planar decomposition Given a non-planar graph G over binary variables with a vector of interaction potentials θ, we wish to use the exact planar computation to obtain a bound on the partition function of the MRF on G. We assume for simplicity that the potentials on the MRF for G are given in the form of Equation 2. Thus, G violates the assumptions of the previous section only in its non-planarity. Define G(r) as a set of spanning planar subgraphs of G, i.e., each graph G(r) is planar and contains all the vertices of G and some its edges. Denote by m the number of such graphs. 
Introduce the following definitions: (r) • θ (r) is a set of parameters on the edges of G(r) , and θij is an element in this set. Z(θ (r) ) is the partition function of the MRF on G(r) with parameters θ (r) . ˆ (r) ˆ(r) • θ is a set of parameters on the edges of G such that if edge (ij) is in G(r) then θij = (r) ˆ(r) θ , and otherwise θ = 0. ij ij Given a distribution ρ(r) on the graphs G(r) (i.e., ρ(r) ≥ 0 for r = 1, . . . , m and assume that the parameters for G(r) are such that ˆ ρ(r)θ θ= (r) r ρ(r) = 1), (5) r Then, by the convexity of the log partition function, as a function of the model parameters, we have ρ(r) log Z(θ (r) ) ≡ f (θ, ρ, θ (r) ) log Z(θ) ≤ (6) r Since by assumption the graphs G(r) are planar, this bound can be calculated in polynomial time. Since this bound is true for any set of parameters θ (r) which satisfies the condition in Equation 5 and for any distribution ρ(r), we may optimize over these two variables to obtain the tightest bound possible. Define the optimal bound for a fixed value of ρ(r) by g(ρ, θ) (optimization is w.r.t. θ (r) ) g(ρ, θ) = f (θ, ρ, θ (r) ) min θ (r) : P ˆ ρ(r)θ (r) =θ (7) Also, define the optimum of the above w.r.t. ρ by h(θ). h(θ) = min g(θ, ρ) ρ(r) ≥ 0, ρ(r) = 1 (8) Thus, h(θ) is the optimal upper bound for the given parameter vector θ. In the following section we argue that we can in fact find the global optimum of the above problem. 4 Globally Optimal Bound Optimization First consider calculating g(ρ, θ) from Equation 7. Note that since log Z(θ (r) ) is a convex function of θ (r) , and the constraints are linear, the overall optimization is convex and can be solved efficiently. In the current implementation, we use a projected gradient algorithm [2]. The gradient of f (θ, ρ, θ (r) ) w.r.t. θ (r) is given by ∂f (θ, ρ, θ (r) ) (r) ∂θij (r) = ρ(r) 1 + eθij (r) P −1 (GPM ) (r) k(i,j) Sign(Pk(i,j) (GPM )) (9) where k(i, j) returns the row and column indices of the element in the upper triangular matrix of (r) (r) P (GPM ), which contains the element e2θij . Since the optimization in Equation 7 is convex, it has an equivalent convex dual. Although we do not use this dual for optimization (because of the difficulty of expressing the entropy of planar models solely in terms of triplet marginals), it nevertheless allows some insight into the structure of the problem. The dual in this case is closely linked to the notion of the marginal polytope defined in Section 1. Using a derivation similar to [11], we arrive at the following characterization of the dual g(ρ, θ) = max τ ∈TRI(n) ρ(r)H(θ (r) (τ )) θ·τ + (10) r where θ (r) (τ ) denotes the parameters of an MRF on G(r) such that its marginals are given by the restriction of τ to the edges of G(r) , and H(θ (r) (τ )) denotes the entropy of the MRF over G(r) with parameters θ (r) (τ ). The maximized function in Equation 10 is linear in ρ and thus g(ρ, θ) is a pointwise maximum over (linear) convex functions in ρ and is thus convex in ρ. It therefore has no (r) local minima. Denote by θmin (ρ) the set of parameters that minimizes Equation 7 for a given value of ρ. Using a derivation similar to that in [11], the gradient of g(ρ, θ) can be shown to be ∂g(ρ, θ) (r) = H(θmin (ρ)) ∂ρ(r) (11) Since the partition function for G(r) can be calculated efficiently, so can the entropy. We can now summarize the algorithm for calculating h(θ) • Initialize ρ0 . Iterate: – For ρt , find θ (r) which solves the minimization in Equation 7. 
– Calculate the gradient of g(ρ, θ) at ρt using the expression in Equation 11 – Update ρt+1 = ρt + αv where v is a feasible search direction calculated from the gradient of g(ρ, θ) and the simplex constraints on ρ. The step size α is calculated via an Armijo line search. – Halt when the change in g(ρ, θ) is smaller than some threshold. Note that the minimization w.r.t. θ (r) is not very time consuming since we can initialize it with the minimum from the previous step, and thus only a few iterations are needed to find the new optimum, provided the change in ρ is not too big. The above algorithm is guaranteed to converge to a global optimum of ρ [2], and thus we obtain the tightest possible upper bound on Z(θ) given our planar graph decomposition. The procedure described here is asymmetric w.r.t. ρ and θ (r) . In a symmetric formulation the minimizing gradient steps could be carried out jointly or in an alternating sequence. The symmetric ˆ (r) formulation can be obtained by decoupling ρ and θ (r) in the bi-linear constraint ρ(r)θ = θ. Field Figure 2: Illustration of planar subgraph construction for a rectangular lattice with external field. Original graph is shown on the left. The field vertex is connected to all vertices (edges not shown). The graph on the right results from isolating the 4th ,5th columns of the original graph (shown in grey), and connecting the field vertex to the external vertices of the three disconnected components. Note that the resulting graph is planar. ˜ ˜ Specifically, we introduce θ (r) = θ (r) ρ(r) and perform the optimization w.r.t. ρ and θ (r) . It can be ˜(r) ) with the relevant (de-coupled) constraint is equivalent shown that a stationary point of f (θ, ρ, θ to the procedure described above. The advantage of this approach is that the exact minimization w.r.t θ (r) is not required before modifying ρ. Our experiments have shown, however, that the methods take comparable times to converge, although this may be a property of the implementation. 5 Estimating Marginals The optimization problem as defined above minimizes an upper bound on the partition function. However, it may also be of interest to obtain estimates of the marginals of the MRF over G. To obtain marginal estimates, we follow the approach in [11]. We first characterize the optimum of Equation 7 for a fixed value of ρ. Deriving the Lagrangian of Equation 7 w.r.t. θ (r) we obtain the (r) following characterization of θmin (ρ): Marginal Optimality Criterion: For any two graphs G(r) , G(s) such that the edge (ij) is in both (r) (s) graphs, the optimal parameter vector satisfies τij (θmin (ρ)) = τij (θmin (ρ)). Thus, the optimal set of parameters for the graphs G(r) is such that every two graphs agree on the marginals of all the edges they share. This implies that at the optimum, there is a well defined set of marginals over all the edges. We use this set as an approximation to the true marginals. A different method for estimating marginals uses the partition function bound directly. We first P calculate partition function bounds on the sums: αi (1) = x:xi =1 e ij∈E fij (xi ,xj ) and αi (−1) = P αi (1) e ij∈E fij (xi ,xj ) and then normalize αi (1)+αi (−1) to obtain an estimate for p(xi = 1). This method has the advantage of being more numerically stable (since it does not depend on derivatives of log Z). However, it needs to be calculated separately for each variable, so that it may be time consuming if one is interested in marginals for a large set of variables. 
x:xi =−1 6 Experimental Evaluation We study the application of our Planar Decomposition (PDC) P method to a binary MRF on a square P lattice with an external field. The MRF is given by p(x) ∝ e ij∈E θij xi xj + i∈V θi xi where V are the lattice vertices, and θi and θij are parameters. Note that this interaction does not satisfy the conditions for exact calculation of the partition function, even though the graph is planar. This problem is in fact NP hard [1]. However, it is possible to obtain the desired interaction form by introducing an additional variable xn+1 that is connected to all the original variables.P Denote the correspondP ij∈E θij xi xj + i∈V θi,n+1 xi xn+1 , where ing graph by Gf . Consider the distribution p(x, xn+1 ) ∝ e θi,n+1 = θi . It is easy to see that any property of p(x) (e.g., partition function, marginals) may be calculated from the corresponding property of p(x, xn+1 ). The advantage of the latter distribution is that it has the desired interaction form. We can thus apply PDC by choosing planar subgraphs of the non-planar graph Gf . 0.25 0.15 0.1 0.05 0.5 1 1.5 Interaction Strength 0.03 Singleton Marginal Error Z Bound Error Pairwise Marginals Error 0.08 PDC TRW 0.2 0.07 0.06 0.05 0.04 0.03 0.02 2 0.5 1 1.5 Interaction Strength 0.025 0.02 0.015 0.01 0.005 2 0.5 1 1.5 Interaction Strength 2 !3 x 10 0.025 0.02 0.015 0.5 1 Field Strength 1.5 2 Singleton Marginal Error Pairwise Marginals Error Z Bound Error 0.03 0.03 0.025 0.02 0.015 0.5 1 Field Strength 1.5 2 9 8 7 6 5 4 3 0.5 1 Field Strength 1.5 2 Figure 3: Comparison of the TRW and Planar Decomposition (PDC) algorithms on a 7×7 square lattice. TRW results shown in red squares, and PDC in blue circles. Left column shows the error in the log partition bound. Middle column is the mean error for pairwise marginals, and right column is the error for the singleton marginal of the variable at the lattice center. Results in upper row are for field parameters drawn from U[−0.05, 0.05] and various interaction parameters. Results in the lower row are for interaction parameters drawn from U [−0.5, 0.5] and various field parameters. Error bars are standard errors calculated from 40 random trials. There are clearly many ways to choose spanning planar subgraphs of Gf . Spanning subtrees are one option, and were used in [11]. Since our optimization is polynomial in the number of subgraphs, √ we preferred to use a number of subgraphs that is linear in n. The key idea in generating these planar subgraphs is to generate disconnected components of the lattice and connect xn+1 only to the external vertices of these components. Here we generate three disconnected components by isolating two neighboring columns (or rows) from the rest of the graph, resulting in three components. This is √ illustrated in Figure 2. To this set of 2 n graphs, we add the independent variables graph consisting only of edges from the field node to all the other nodes. We compared the performance of the PDC and TRW methods 3 4 on a 7 × 7 lattice . Since the exact partition function and marginals can be calculated for this case, we could compare both algorithms to the true values. The MRF parameters were set according to the two following scenarios: 1) Varying Interaction - The field parameters θi were drawn uniformly from U[−0.05, 0.05], and the interaction θij from U[−α, α] where α ∈ {0.2, 0.4, . . . , 2}. This is the setting tested in [11]. 2) Varying Field θi was drawn uniformly from U[−α, α], where α ∈ {0.2, 0.4, . . . , 2} and θij from U[−0.5, 0.5]. 
For each scenario, we calculated the following measures: 1) Normalized log partition error 1 1 alg − log Z true ). 2) Error in pairwise marginals |E| ij∈E |palg (xi = 1, xj = 1) − 49 (log Z ptrue (xi = 1, xj = 1)|. Pairwise marginals were calculated jointly using the marginal optimality criterion of Section 5. 3) Error in singleton marginals. We calculated the singleton marginals for the innermost node in the lattice (i.e., coordinate [3, 3]), which intuitively should be the most difficult for the planar based algorithm. This marginal was calculated using two partition functions, as explained in Section 5 5 . The same method was used for TRW. The reported error measure is |palg (xi = 1) − ptrue (xi = 1)|. Results were averaged over 40 random trials. Results for the two scenarios and different evaluation measures are given in Figure 3. It can be seen that the partition function bound for PDC is significantly better than TRW for almost all parameter settings, although the difference becomes smaller for large field values. Error for the PDC pairwise 3 TRW and PDC bounds were optimized over both the subgraph parameters and the mixture parameters ρ. In terms of running time, PDC optimization for a fixed value of ρ took about 30 seconds, which is still slower than the TRW message passing implementation. 5 Results using the marginal optimality criterion were worse for PDC, possibly due to its reduced numerical precision. 4 marginals are smaller than those of TRW for all parameter settings. For the singleton parameters, TRW slightly outperforms PDC. This is not surprising since the field is modeled by every spanning tree in the TRW decomposition, whereas in PDC not all the structures model a given field. 7 Discussion We have presented a method for using planar graphs as the basis for approximating non-planar graphs such as planar graphs with external fields. While the restriction to binary variables limits the applicability of our approach, it remains relevant in many important applications, such as coding theory and combinatorial optimization. Moreover, it is always possible to convert a non-binary graphical model to a binary one by introducing additional variables. The resulting graph will typically not be planar, even when the original graph over k−ary variables is. However, the planar decomposition method can then be applied to this non-planar graph. The optimization of the decomposition is carried out explicitly over the planar subgraphs, thus limiting the number of subgraphs that can be used in the approximation. In the TRW method this problem is circumvented since it is possible to implicitly optimize over all spanning trees. The reason this can be done for trees is that the entropy of an MRF over a tree may be written as a function of its marginal variables. We do not know of an equivalent result for planar graphs, and it remains a challenge to find one. It is however possible to combine the planar and tree decompositions into one single bound, which is guaranteed to outperform the tree or planar approximations alone. The planar decomposition idea may in principle be applied to bounding the value of the MAP assignment. However, as in TRW, it can be shown that the solution is not dependent on the decomposition (as long as each edge appears in some structure), and the problem is equivalent to maximizing a linear function over the marginal polytope (which can be done in polynomial time for planar graphs). However, such a decomposition may suggest new message passing algorithms, as in [10]. 
Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson is also supported by the Rothschild Yad-Hanadiv fellowship. The authors also wish to thank Martin Wainwright for providing his TRW code. References [1] F. Barahona. On the computational complexity of ising spin glass models. J. Phys. A., 15(10):3241–3253, 1982. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] M.M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springe-Verlag, 1997. [4] R. Diestel. Graph Theory. Springer-Verlag, 1997. [5] M.E. Fisher. On the dimer solution of planar ising models. J. Math. Phys., 7:1776–1781, 1966. [6] M.I. Jordan, editor. Learning in graphical models. MIT press, Cambridge, MA, 1998. [7] P.W. Kasteleyn. Dimer statistics and phase transitions. Journal of Math. Physics, 4:287–293, 1963. [8] L. Lovasz and M.D. Plummer. Matching Theory, volume 29 of Annals of discrete mathematics. NorthHolland, New-York, 1986. [9] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. on Information Theory, 49(5):1120–1146, 2003. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51(7):2313–2335, 2005. [12] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Technical report, UC Berkeley Dept. of Statistics, 2003. [13] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.

3 0.18589699 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

Author: Frank Wood, Thomas L. Griffiths

Abstract: Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distribution over possible factorizations of unbounded dimensionality. A drawback to this approach is that posterior estimation is typically done using Gibbs sampling, which can be slow for large problems and when conjugate priors cannot be used. As an alternative, we present a particle filter for posterior estimation in nonparametric Bayesian matrix factorization models. We illustrate this approach with two matrix factorization models and show favorable performance relative to Gibbs sampling.

4 0.14093168 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

Author: Mark Girolami, Mingjun Zhong

Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1

5 0.13965279 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

Author: Yee W. Teh, David Newman, Max Welling

Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.

6 0.12368354 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

7 0.11587863 98 nips-2006-Inferring Network Structure from Co-Occurrences

8 0.11585026 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields

9 0.112133 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

10 0.097897418 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

11 0.097888187 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

12 0.097449116 20 nips-2006-Active learning for misspecified generalized linear models

13 0.096255854 159 nips-2006-Parameter Expanded Variational Bayesian Methods

14 0.08671578 69 nips-2006-Distributed Inference in Dynamical Systems

15 0.08596655 65 nips-2006-Denoising and Dimension Reduction in Feature Space

16 0.083525196 169 nips-2006-Relational Learning with Gaussian Processes

17 0.079414912 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

18 0.078499615 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds

19 0.076557145 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression

20 0.075322807 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.284), (1, 0.061), (2, 0.017), (3, -0.061), (4, 0.113), (5, 0.072), (6, 0.252), (7, 0.078), (8, -0.115), (9, -0.02), (10, 0.025), (11, -0.058), (12, 0.091), (13, -0.041), (14, 0.03), (15, 0.075), (16, -0.111), (17, 0.059), (18, 0.036), (19, 0.052), (20, 0.11), (21, 0.061), (22, 0.005), (23, 0.011), (24, 0.13), (25, 0.079), (26, -0.027), (27, 0.152), (28, -0.14), (29, 0.043), (30, -0.021), (31, -0.025), (32, -0.001), (33, -0.091), (34, -0.009), (35, -0.053), (36, -0.034), (37, -0.059), (38, -0.002), (39, 0.024), (40, -0.086), (41, -0.029), (42, 0.129), (43, 0.113), (44, 0.018), (45, -0.166), (46, 0.055), (47, 0.055), (48, -0.012), (49, -0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96925336 57 nips-2006-Conditional mean field

Author: Peter Carbonetto, Nando D. Freitas

Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. Mean field approximations have been applied to a broader set of problems, but the solutions are often poor. We propose a new class of conditionally-specified variational approximations based on mean field theory. While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean field. Moreover, experiments on a well-studied problem— inferring the stable configurations of the Ising spin glass—show that the solutions can be significantly better than those obtained using sum-product-based methods. 1
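For context on the baseline that the abstract refers to as conventional mean field, the following is a minimal sketch of naive mean field for an Ising model: each variational mean is repeatedly set to the conditional expectation of its spin given the current means of its neighbours. It illustrates ordinary mean field only — not the conditional mean field construction or its sequential Monte Carlo component — and the couplings and fields are arbitrary toy values.

import numpy as np

def naive_mean_field(J, h, n_iters=500, damping=0.5):
    # Damped fixed-point iteration m_i <- tanh(h_i + sum_j J_ij m_j) for an Ising
    # model p(x) ∝ exp(sum_{i<j} J_ij x_i x_j + sum_i h_i x_i), x_i in {-1, +1},
    # with J stored as the symmetric extension of the upper-triangular couplings.
    m = np.zeros(len(h))
    for _ in range(n_iters):
        m = (1 - damping) * m + damping * np.tanh(h + J @ m)
    return m  # approximate marginal means E[x_i]

rng = np.random.default_rng(1)
J = np.triu(rng.normal(scale=0.5, size=(4, 4)), 1)
J = J + J.T                              # symmetric couplings, zero diagonal
h = rng.normal(scale=0.1, size=4)
print(naive_mean_field(J, h))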

2 0.69960409 35 nips-2006-Approximate inference using planar graph decomposition

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: A number of exact and approximate methods are available for inference calculations in graphical models. Many recent approximate methods for graphs with cycles are based on tractable algorithms for tree structured graphs. Here we base the approximation on a different tractable model, planar graphs with binary variables and pure interaction potentials (no external field). The partition function for such models can be calculated exactly using an algorithm introduced by Fisher and Kasteleyn in the 1960s. We show how such tractable planar models can be used in a decomposition to derive upper bounds on the partition function of non-planar models. The resulting algorithm also allows for the estimation of marginals. We compare our planar decomposition to the tree decomposition method of Wainwright et. al., showing that it results in a much tighter bound on the partition function, improved pairwise marginals, and comparable singleton marginals. Graphical models are a powerful tool for modeling multivariate distributions, and have been successfully applied in various fields such as coding theory and image processing. Applications of graphical models typically involve calculating two types of quantities, namely marginal distributions, and MAP assignments. The evaluation of the model partition function is closely related to calculating marginals [12]. These three problems can rarely be solved exactly in polynomial time, and are provably computationally hard in the general case [1]. When the model conforms to a tree structure, however, all these problems can be solved in polynomial time. This has prompted extensive research into tree based methods. For example, the junction tree method [6] converts a graphical model into a tree by clustering nodes into cliques, such that the graph over cliques is a tree. The resulting maximal clique size (cf. tree width) may nevertheless be prohibitively large. Wainwright et. al. [9, 11] proposed an approximate method based on trees known as tree reweighting (TRW). The TRW approach decomposes the potential vector of a graphical model into a mixture over spanning trees of the model, and then uses convexity arguments to bound various quantities, such as the partition function. One key advantage of this approach is that it provides bounds on partition function value, a property which is not shared by approximations based on Bethe free energies [13]. In this paper we focus on a different class of tractable models: planar graphs. A graph is called planar if it can be drawn in the plane without crossing edges. Works in the 1960s by physicists Fisher [5] and Kasteleyn [7], among others, have shown that the partition function for planar graphs may be calculated in polynomial time. This, however, is true under two key restrictions. One is that the variables xi are binary. The other is that the interaction potential depends only on xi xj (where xi ∈ {±1}), and not on their individual values (i.e., the zero external field case). Here we show how the above method can be used to obtain upper bounds on the partition function for non-planar graphs. As in TRW, we decompose the potential of a non-planar graph into a sum over spanning planar models, and then use a convexity argument to obtain an upper bound on the log partition function. The bound optimization is a convex problem, and can be solved in polynomial time. 
We compare our method with TRW on a planar graph with an external field, and show that it performs favorably with respect to both pairwise marginals and the bound on the partition function, and the two methods give similar results for singleton marginals. 1 Definitions and Notations Given a graph G with n vertices and a set of edges E, we are interested in pairwise Markov Random Fields (MRF) over the graph G. A pairwise MRF [13] is a multivariate distribution over variables x = {x1 , . . . , xn } defined as 1 P p(x) = e ij∈E fij (xi ,xj ) (1) Z where fij are a set of |E| functions, or interaction potentials, defined over pairs of variables. The P partition function is defined as Z = x e ij∈E fij (xi ,xj ) . Here we will focus on the case where xi ∈ {±1}. Furthermore, we will be interested in interaction potentials which only depend on agreement or disagreement between the signs of their variables. We define those by 1 θij (1 + xi xj ) = θij I(xi = xj ) (2) 2 so that fij (xi , xj ) is zero if xi = xj and θij if xi = xj . The model is then defined via the set of parameters θij . We use θ to denote the vector of parameters θij , and denote the partition function by Z(θ) to highlight its dependence on these parameters. f (xi , xj ) = A graph G is defined as planar if it can be drawn in the plane without any intersection of edges [4]. With some abuse of notation, we define E as the set of line segments in 2 corresponding to the edges in the graph. The regions of 2 \ E are defined as the faces of the graph. The face which corresponds to an unbounded region is called the external face. Given a planar graph G, its dual graph G∗ is defined in the following way: the vertices of G∗ correspond to faces of G, and there is an edge between two vertices in G∗ iff the two corresponding faces in G share an edge. If the graph G is weighted, the weight on an edge in G∗ is the weight on the edge shared by the corresponding faces in G. A plane triangulation of a planar graph G is obtained from G by adding edges such that all the faces of the resulting graph have exactly three vertices. Thus a plane triangulated graph has a dual where all vertices have degree three. It can be shown that every plane graph can be plane triangulated [4]. We shall also need the notion of a perfect matching on a graph. A perfect matching on a graph G is defined as a set of edges H ⊆ E such that every vertex in G has exactly one edge in H incident on it. If the graph is weighted, the weight of the matching is defined as the product of the weights of the edges in the matching. Finally, we recall the definition of a marginal polytope of a graph [12]. Consider an MRF over a graph G where fij are given by Equation 2. Denote the probability of the event I(xi = xj ) under p(x) by τij . The marginal polytope of G, denoted by M(G), is defined as the set of values τij that can be obtained under some assignment to the parameters θij . For a general graph G the polytope M(G) cannot be described using a polynomial number of inequalities. However, for planar graphs, it turns out that a set of O(n3 ) constraints, commonly referred to as triangle inequalities, suffice to describe M(G) (see [3] page 434). The triangle inequalities are defined by 1 TRI(n) = {τij : τij + τjk − τik ≤ 1, τij + τjk + τik ≥ 1, ∀i, j, k ∈ {1, . . . , n}} (3) Note that the above inequalities actually contain variables τij which do not correspond to edges in the original graph G. 
Thus the equality M(G) = TRI(n) should be understood as referring only to the values of τij that correspond to edges in the graph. Importantly, the values of τij for edges not in the graph need not be valid marginals for any MRF. In other words M(G) is a projection of TRI(n) on the set of edges of G. It is well known that the marginal polytope for trees is described via pairwise constraints. It is thus interesting that for planar graphs, it is triplets, rather than pairwise 1 The definition here is slightly different from that in [3], since here we refer to agreement probabilities, whereas [3] refers to disagreement probabilities. This polytope is also referred to as the cut polytope. constraints, that characterize the polytope. In this sense, planar graphs and trees may be viewed as a hierarchy of polytope complexity classes. It remains an interesting problem to characterize other structures in this hierarchy and their related inference algorithms. 2 Exact calculation of partition function using perfect matching The seminal works of Kasteleyn [7] and Fisher [5] have shown how one can calculate the partition function for a binary MRF over a planar graph with pure interaction potentials. We briefly review Fisher’s construction, which we will use in what follows. Our interpretation of the method differs somewhat from that of Fisher, but we believe it is more straightforward. The key idea in calculating the partition function is to convert the summation over values of x to the problem of calculating the sum of weights of all perfect matchings in a graph constructed from G, as shown below. In this section, we consider weighted graphs (graphs with numbers assigned to their edges). For the graph G associated with the pairwise MRF, we assign weights wij = e2θij to the edges. The first step in the construction is to plane triangulate the graph G. Let us call the resulting graph GT . We define an MRF on GT by assigning a parameter θij = 0 to the edges that have been added to G, and the corresponding weight wij = 1. Thus GT essentially describes the same distribution as G, and therefore has the same partition function. We can thus restrict our attention to calculating the partition function for the MRF on GT . As a first step in calculating a partition function over GT , we introduce the following definition: a ˆ set of edges E in GT is an agreement edge set (or AES) if for every triangle face F in GT one of the ˆ ˆ following holds: The edges in F are all in E, or exactly one of the edges in F is in E. The weight ˆ is defined as the product of the weights of the edges in E. ˆ of a set E It can be shown that there exists a bijection between pairs of assignments {x, −x} and agreement edge sets. The mapping from x to an edge set is simply the set of edges such that xi = xj . It is easy to see that this is an agreement edge set. The reverse mapping is obtained by finding an assignment x such that xi = xj iff the corresponding edge is in the agreement edge set. The existence of this mapping can be shown by induction on the number of (triangle) faces. P The contribution of a given assignment x to the partition function is e ˆ sponds to an AES denoted by E it is easy to see that P e ij∈E θij I(xi =xj ) = e− P ij∈E θij P e ˆ ij∈E 2θij = ce P ˆ ij∈E ij∈E 2θij θij I(xi =xj ) =c wij . If x corre(4) ˆ ij∈E P where c = e− ij∈E θij . Define the superset Λ as the set of agreement edge sets. The above then implies that Z(θ) = 2c E∈Λ ij∈E wij , and is thus proportional to the sum of AES weights. 
ˆ ˆ To sum over agreement edge sets, we use the following elegant trick introduced by Fisher [5]. Construct a new graph GPM from the dual of GT by introducing new vertices and edges according to the following rule: Replace each original vertex with three vertices that are connected to each other, and assign a weight of one to the new edges. Next, consider the three neighbors of the original vertex 2 . Connect each of the three new vertices to one of these three neighbors, keeping the original weights on these edges. The transformation is illustrated in Figure 1. The new graph GPM has O(3n) vertices, and is also planar. It can be seen that there is a one to one correspondence between perfect matchings in GPM and agreement edge sets in GT . Define Ω to be the set of perfect matchings in GPM . Then Z(θ) = 2c M ∈Ω ij∈M wij where we have used the fact that all the new weights have a value of one. Thus, the partition function is a sum over the weights of perfect matchings in GPM . Finally, we need a way of summing over the weights of the set of perfect matchings in a graph. Kasteleyn [7] proved that for a planar graph GPM , this sum may be obtained using the following sequence of steps: • Direct the edges of the graph GPM such that for every face (except possibly the external face), the number of edges on its perimeter oriented in a clockwise manner is odd. Kasteleyn showed that such a so called Pfaffian orientation may be constructed in polynomial time for a planar graph (see also [8] page 322). 2 Note that in the dual of GT all vertices have degree three, since GT is plane triangulated. 1.2 0.7 0.6 1 1 1 0.8 0.6 0.8 1.5 1.4 1.5 1 1 1.2 1 1 1 1 0.7 1.4 1 1 1 Figure 1: Illustration of the graph transformations in Section 2 for a complete graph with four vertices. Left panel shows the original weighted graph (dotted edges and grey vertices) and its dual (solid edges and black vertices). Right panel shows the dual graph with each vertex replaced by a triangle (the graph GPM in the text). Weights for dual graph edges correspond to the weights on the original graph. • Define the matrix P (GPM ) to be a skew symmetric matrix such that Pij = 0 if ij is not an edge, Pij = wij if the arrow on edge ij runs from i to j and Pij = −wij otherwise. • The sum over weighted matchings can then be shown to equal |P (GPM )|. The partition function is thus given by Z(θ) = 2c |P (GPM )|. To conclude this section we reiterate the following two key points: the partition function of a binary MRF over a planar graph with interaction potentials as in Equation 2 may be calculated in polynomial time by calculating the determinant of a matrix of size O(3n). An important outcome of this result is that the functional relation between Z(θ) and the parameters θij is known, a fact we shall use in what follows. 3 Partition function bounds via planar decomposition Given a non-planar graph G over binary variables with a vector of interaction potentials θ, we wish to use the exact planar computation to obtain a bound on the partition function of the MRF on G. We assume for simplicity that the potentials on the MRF for G are given in the form of Equation 2. Thus, G violates the assumptions of the previous section only in its non-planarity. Define G(r) as a set of spanning planar subgraphs of G, i.e., each graph G(r) is planar and contains all the vertices of G and some its edges. Denote by m the number of such graphs. 
Introduce the following definitions: (r) • θ (r) is a set of parameters on the edges of G(r) , and θij is an element in this set. Z(θ (r) ) is the partition function of the MRF on G(r) with parameters θ (r) . ˆ (r) ˆ(r) • θ is a set of parameters on the edges of G such that if edge (ij) is in G(r) then θij = (r) ˆ(r) θ , and otherwise θ = 0. ij ij Given a distribution ρ(r) on the graphs G(r) (i.e., ρ(r) ≥ 0 for r = 1, . . . , m and assume that the parameters for G(r) are such that ˆ ρ(r)θ θ= (r) r ρ(r) = 1), (5) r Then, by the convexity of the log partition function, as a function of the model parameters, we have ρ(r) log Z(θ (r) ) ≡ f (θ, ρ, θ (r) ) log Z(θ) ≤ (6) r Since by assumption the graphs G(r) are planar, this bound can be calculated in polynomial time. Since this bound is true for any set of parameters θ (r) which satisfies the condition in Equation 5 and for any distribution ρ(r), we may optimize over these two variables to obtain the tightest bound possible. Define the optimal bound for a fixed value of ρ(r) by g(ρ, θ) (optimization is w.r.t. θ (r) ) g(ρ, θ) = f (θ, ρ, θ (r) ) min θ (r) : P ˆ ρ(r)θ (r) =θ (7) Also, define the optimum of the above w.r.t. ρ by h(θ). h(θ) = min g(θ, ρ) ρ(r) ≥ 0, ρ(r) = 1 (8) Thus, h(θ) is the optimal upper bound for the given parameter vector θ. In the following section we argue that we can in fact find the global optimum of the above problem. 4 Globally Optimal Bound Optimization First consider calculating g(ρ, θ) from Equation 7. Note that since log Z(θ (r) ) is a convex function of θ (r) , and the constraints are linear, the overall optimization is convex and can be solved efficiently. In the current implementation, we use a projected gradient algorithm [2]. The gradient of f (θ, ρ, θ (r) ) w.r.t. θ (r) is given by ∂f (θ, ρ, θ (r) ) (r) ∂θij (r) = ρ(r) 1 + eθij (r) P −1 (GPM ) (r) k(i,j) Sign(Pk(i,j) (GPM )) (9) where k(i, j) returns the row and column indices of the element in the upper triangular matrix of (r) (r) P (GPM ), which contains the element e2θij . Since the optimization in Equation 7 is convex, it has an equivalent convex dual. Although we do not use this dual for optimization (because of the difficulty of expressing the entropy of planar models solely in terms of triplet marginals), it nevertheless allows some insight into the structure of the problem. The dual in this case is closely linked to the notion of the marginal polytope defined in Section 1. Using a derivation similar to [11], we arrive at the following characterization of the dual g(ρ, θ) = max τ ∈TRI(n) ρ(r)H(θ (r) (τ )) θ·τ + (10) r where θ (r) (τ ) denotes the parameters of an MRF on G(r) such that its marginals are given by the restriction of τ to the edges of G(r) , and H(θ (r) (τ )) denotes the entropy of the MRF over G(r) with parameters θ (r) (τ ). The maximized function in Equation 10 is linear in ρ and thus g(ρ, θ) is a pointwise maximum over (linear) convex functions in ρ and is thus convex in ρ. It therefore has no (r) local minima. Denote by θmin (ρ) the set of parameters that minimizes Equation 7 for a given value of ρ. Using a derivation similar to that in [11], the gradient of g(ρ, θ) can be shown to be ∂g(ρ, θ) (r) = H(θmin (ρ)) ∂ρ(r) (11) Since the partition function for G(r) can be calculated efficiently, so can the entropy. We can now summarize the algorithm for calculating h(θ) • Initialize ρ0 . Iterate: – For ρt , find θ (r) which solves the minimization in Equation 7. 
– Calculate the gradient of g(ρ, θ) at ρt using the expression in Equation 11 – Update ρt+1 = ρt + αv where v is a feasible search direction calculated from the gradient of g(ρ, θ) and the simplex constraints on ρ. The step size α is calculated via an Armijo line search. – Halt when the change in g(ρ, θ) is smaller than some threshold. Note that the minimization w.r.t. θ (r) is not very time consuming since we can initialize it with the minimum from the previous step, and thus only a few iterations are needed to find the new optimum, provided the change in ρ is not too big. The above algorithm is guaranteed to converge to a global optimum of ρ [2], and thus we obtain the tightest possible upper bound on Z(θ) given our planar graph decomposition. The procedure described here is asymmetric w.r.t. ρ and θ (r) . In a symmetric formulation the minimizing gradient steps could be carried out jointly or in an alternating sequence. The symmetric ˆ (r) formulation can be obtained by decoupling ρ and θ (r) in the bi-linear constraint ρ(r)θ = θ. Field Figure 2: Illustration of planar subgraph construction for a rectangular lattice with external field. Original graph is shown on the left. The field vertex is connected to all vertices (edges not shown). The graph on the right results from isolating the 4th ,5th columns of the original graph (shown in grey), and connecting the field vertex to the external vertices of the three disconnected components. Note that the resulting graph is planar. ˜ ˜ Specifically, we introduce θ (r) = θ (r) ρ(r) and perform the optimization w.r.t. ρ and θ (r) . It can be ˜(r) ) with the relevant (de-coupled) constraint is equivalent shown that a stationary point of f (θ, ρ, θ to the procedure described above. The advantage of this approach is that the exact minimization w.r.t θ (r) is not required before modifying ρ. Our experiments have shown, however, that the methods take comparable times to converge, although this may be a property of the implementation. 5 Estimating Marginals The optimization problem as defined above minimizes an upper bound on the partition function. However, it may also be of interest to obtain estimates of the marginals of the MRF over G. To obtain marginal estimates, we follow the approach in [11]. We first characterize the optimum of Equation 7 for a fixed value of ρ. Deriving the Lagrangian of Equation 7 w.r.t. θ (r) we obtain the (r) following characterization of θmin (ρ): Marginal Optimality Criterion: For any two graphs G(r) , G(s) such that the edge (ij) is in both (r) (s) graphs, the optimal parameter vector satisfies τij (θmin (ρ)) = τij (θmin (ρ)). Thus, the optimal set of parameters for the graphs G(r) is such that every two graphs agree on the marginals of all the edges they share. This implies that at the optimum, there is a well defined set of marginals over all the edges. We use this set as an approximation to the true marginals. A different method for estimating marginals uses the partition function bound directly. We first P calculate partition function bounds on the sums: αi (1) = x:xi =1 e ij∈E fij (xi ,xj ) and αi (−1) = P αi (1) e ij∈E fij (xi ,xj ) and then normalize αi (1)+αi (−1) to obtain an estimate for p(xi = 1). This method has the advantage of being more numerically stable (since it does not depend on derivatives of log Z). However, it needs to be calculated separately for each variable, so that it may be time consuming if one is interested in marginals for a large set of variables. 
x:xi =−1 6 Experimental Evaluation We study the application of our Planar Decomposition (PDC) P method to a binary MRF on a square P lattice with an external field. The MRF is given by p(x) ∝ e ij∈E θij xi xj + i∈V θi xi where V are the lattice vertices, and θi and θij are parameters. Note that this interaction does not satisfy the conditions for exact calculation of the partition function, even though the graph is planar. This problem is in fact NP hard [1]. However, it is possible to obtain the desired interaction form by introducing an additional variable xn+1 that is connected to all the original variables.P Denote the correspondP ij∈E θij xi xj + i∈V θi,n+1 xi xn+1 , where ing graph by Gf . Consider the distribution p(x, xn+1 ) ∝ e θi,n+1 = θi . It is easy to see that any property of p(x) (e.g., partition function, marginals) may be calculated from the corresponding property of p(x, xn+1 ). The advantage of the latter distribution is that it has the desired interaction form. We can thus apply PDC by choosing planar subgraphs of the non-planar graph Gf . 0.25 0.15 0.1 0.05 0.5 1 1.5 Interaction Strength 0.03 Singleton Marginal Error Z Bound Error Pairwise Marginals Error 0.08 PDC TRW 0.2 0.07 0.06 0.05 0.04 0.03 0.02 2 0.5 1 1.5 Interaction Strength 0.025 0.02 0.015 0.01 0.005 2 0.5 1 1.5 Interaction Strength 2 !3 x 10 0.025 0.02 0.015 0.5 1 Field Strength 1.5 2 Singleton Marginal Error Pairwise Marginals Error Z Bound Error 0.03 0.03 0.025 0.02 0.015 0.5 1 Field Strength 1.5 2 9 8 7 6 5 4 3 0.5 1 Field Strength 1.5 2 Figure 3: Comparison of the TRW and Planar Decomposition (PDC) algorithms on a 7×7 square lattice. TRW results shown in red squares, and PDC in blue circles. Left column shows the error in the log partition bound. Middle column is the mean error for pairwise marginals, and right column is the error for the singleton marginal of the variable at the lattice center. Results in upper row are for field parameters drawn from U[−0.05, 0.05] and various interaction parameters. Results in the lower row are for interaction parameters drawn from U [−0.5, 0.5] and various field parameters. Error bars are standard errors calculated from 40 random trials. There are clearly many ways to choose spanning planar subgraphs of Gf . Spanning subtrees are one option, and were used in [11]. Since our optimization is polynomial in the number of subgraphs, √ we preferred to use a number of subgraphs that is linear in n. The key idea in generating these planar subgraphs is to generate disconnected components of the lattice and connect xn+1 only to the external vertices of these components. Here we generate three disconnected components by isolating two neighboring columns (or rows) from the rest of the graph, resulting in three components. This is √ illustrated in Figure 2. To this set of 2 n graphs, we add the independent variables graph consisting only of edges from the field node to all the other nodes. We compared the performance of the PDC and TRW methods 3 4 on a 7 × 7 lattice . Since the exact partition function and marginals can be calculated for this case, we could compare both algorithms to the true values. The MRF parameters were set according to the two following scenarios: 1) Varying Interaction - The field parameters θi were drawn uniformly from U[−0.05, 0.05], and the interaction θij from U[−α, α] where α ∈ {0.2, 0.4, . . . , 2}. This is the setting tested in [11]. 2) Varying Field θi was drawn uniformly from U[−α, α], where α ∈ {0.2, 0.4, . . . , 2} and θij from U[−0.5, 0.5]. 
For each scenario, we calculated the following measures: 1) Normalized log partition error 1 1 alg − log Z true ). 2) Error in pairwise marginals |E| ij∈E |palg (xi = 1, xj = 1) − 49 (log Z ptrue (xi = 1, xj = 1)|. Pairwise marginals were calculated jointly using the marginal optimality criterion of Section 5. 3) Error in singleton marginals. We calculated the singleton marginals for the innermost node in the lattice (i.e., coordinate [3, 3]), which intuitively should be the most difficult for the planar based algorithm. This marginal was calculated using two partition functions, as explained in Section 5 5 . The same method was used for TRW. The reported error measure is |palg (xi = 1) − ptrue (xi = 1)|. Results were averaged over 40 random trials. Results for the two scenarios and different evaluation measures are given in Figure 3. It can be seen that the partition function bound for PDC is significantly better than TRW for almost all parameter settings, although the difference becomes smaller for large field values. Error for the PDC pairwise 3 TRW and PDC bounds were optimized over both the subgraph parameters and the mixture parameters ρ. In terms of running time, PDC optimization for a fixed value of ρ took about 30 seconds, which is still slower than the TRW message passing implementation. 5 Results using the marginal optimality criterion were worse for PDC, possibly due to its reduced numerical precision. 4 marginals are smaller than those of TRW for all parameter settings. For the singleton parameters, TRW slightly outperforms PDC. This is not surprising since the field is modeled by every spanning tree in the TRW decomposition, whereas in PDC not all the structures model a given field. 7 Discussion We have presented a method for using planar graphs as the basis for approximating non-planar graphs such as planar graphs with external fields. While the restriction to binary variables limits the applicability of our approach, it remains relevant in many important applications, such as coding theory and combinatorial optimization. Moreover, it is always possible to convert a non-binary graphical model to a binary one by introducing additional variables. The resulting graph will typically not be planar, even when the original graph over k−ary variables is. However, the planar decomposition method can then be applied to this non-planar graph. The optimization of the decomposition is carried out explicitly over the planar subgraphs, thus limiting the number of subgraphs that can be used in the approximation. In the TRW method this problem is circumvented since it is possible to implicitly optimize over all spanning trees. The reason this can be done for trees is that the entropy of an MRF over a tree may be written as a function of its marginal variables. We do not know of an equivalent result for planar graphs, and it remains a challenge to find one. It is however possible to combine the planar and tree decompositions into one single bound, which is guaranteed to outperform the tree or planar approximations alone. The planar decomposition idea may in principle be applied to bounding the value of the MAP assignment. However, as in TRW, it can be shown that the solution is not dependent on the decomposition (as long as each edge appears in some structure), and the problem is equivalent to maximizing a linear function over the marginal polytope (which can be done in polynomial time for planar graphs). However, such a decomposition may suggest new message passing algorithms, as in [10]. 
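As a sanity check of the convexity bound log Z(θ) ≤ Σ_r ρ(r) log Z(θ^(r)) that underlies the planar decomposition described above, the sketch below verifies the inequality by brute-force enumeration on a small graph. Each edge is assigned to one of two spanning subgraphs and its parameter is rescaled by 1/ρ(r), so that Σ_r ρ(r) θ̂^(r) = θ. The graph, parameters, and edge assignment are invented, and in real use Z(θ^(r)) would be computed with the planar matching construction rather than by enumeration.

import itertools
import numpy as np

def log_Z(n, edges, theta):
    # Brute-force log partition function with agreement potentials,
    # Z = sum_x exp( sum_{ij in E} theta_ij * I(x_i == x_j) ), x_i in {-1, +1}.
    Z = sum(np.exp(sum(t for (i, j), t in zip(edges, theta) if x[i] == x[j]))
            for x in itertools.product([-1, 1], repeat=n))
    return np.log(Z)

n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # a small non-tree graph
theta = [0.7, -0.4, 0.9, 0.2, -0.6]                 # illustrative values
rho = [0.5, 0.5]
groups = [[0, 2, 4], [1, 3]]                        # edge indices per spanning subgraph
bound = sum(r * log_Z(n, [edges[k] for k in g], [theta[k] / r for k in g])
            for r, g in zip(rho, groups))
print(log_Z(n, edges, theta), "<=", bound)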
Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson is also supported by the Rothschild Yad-Hanadiv fellowship. The authors also wish to thank Martin Wainwright for providing his TRW code. References [1] F. Barahona. On the computational complexity of ising spin glass models. J. Phys. A., 15(10):3241–3253, 1982. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] M.M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springe-Verlag, 1997. [4] R. Diestel. Graph Theory. Springer-Verlag, 1997. [5] M.E. Fisher. On the dimer solution of planar ising models. J. Math. Phys., 7:1776–1781, 1966. [6] M.I. Jordan, editor. Learning in graphical models. MIT press, Cambridge, MA, 1998. [7] P.W. Kasteleyn. Dimer statistics and phase transitions. Journal of Math. Physics, 4:287–293, 1963. [8] L. Lovasz and M.D. Plummer. Matching Theory, volume 29 of Annals of discrete mathematics. NorthHolland, New-York, 1986. [9] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. on Information Theory, 49(5):1120–1146, 2003. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51(7):2313–2335, 2005. [12] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Technical report, UC Berkeley Dept. of Statistics, 2003. [13] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.

3 0.66505855 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

Author: Yee W. Teh, David Newman, Max Welling

Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.

4 0.62858945 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields

Author: Thomas Ott, Ruedi Stoop

Abstract: We rigorously establish a close relationship between message passing algorithms and models of neurodynamics by showing that the equations of a continuous Hopfield network can be derived from the equations of belief propagation on a binary Markov random field. As Hopfield networks are equipped with a Lyapunov function, convergence is guaranteed. As a consequence, in the limit of many weak connections per neuron, Hopfield networks exactly implement a continuous-time variant of belief propagation starting from message initialisations that prevent it from running into convergence problems. Our results lead to a better understanding of the role of message passing algorithms in real biological neural networks.
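For readers unfamiliar with the dynamical system being related to belief propagation here, the sketch below simulates a generic continuous Hopfield network with Euler steps; the weights, biases, and step size are arbitrary toy values, and the paper's mapping between these dynamics and belief-propagation messages is not reproduced.

import numpy as np

def hopfield_trajectory(W, b, steps=500, dt=0.05):
    # Continuous Hopfield dynamics: du/dt = -u + W @ sigma(u) + b,
    # with sigma the logistic function; symmetric W admits a Lyapunov function.
    sigma = lambda u: 1.0 / (1.0 + np.exp(-u))
    u = np.zeros(len(b))
    for _ in range(steps):
        u = u + dt * (-u + W @ sigma(u) + b)
    return sigma(u)  # steady-state firing rates

rng = np.random.default_rng(2)
W = rng.normal(scale=0.3, size=(5, 5)); W = 0.5 * (W + W.T); np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.2, size=5)
print(hopfield_trajectory(W, b))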

5 0.56350702 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

Author: Frank Wood, Thomas L. Griffiths

Abstract: Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distribution over possible factorizations of unbounded dimensionality. A drawback to this approach is that posterior estimation is typically done using Gibbs sampling, which can be slow for large problems and when conjugate priors cannot be used. As an alternative, we present a particle filter for posterior estimation in nonparametric Bayesian matrix factorization models. We illustrate this approach with two matrix factorization models and show favorable performance relative to Gibbs sampling.
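The abstract proposes particle filtering in place of Gibbs sampling for posterior estimation. The code below is only the generic sequential importance resampling skeleton that such a filter builds on, with placeholder proposal and likelihood functions on a one-dimensional toy problem; it is not the paper's matrix-factorization-specific algorithm.

import numpy as np

def particle_filter(observations, propose, log_likelihood, n_particles=100, seed=0):
    # Generic sequential importance resampling (SIR) skeleton.
    rng = np.random.default_rng(seed)
    particles = [propose(None, rng) for _ in range(n_particles)]
    for y in observations:
        particles = [propose(p, rng) for p in particles]          # propagate
        logw = np.array([log_likelihood(y, p) for p in particles])
        w = np.exp(logw - logw.max()); w /= w.sum()               # normalise weights
        idx = rng.choice(n_particles, size=n_particles, p=w)      # resample
        particles = [particles[i] for i in idx]
    return particles

# Toy 1-d example: latent random walk observed with Gaussian noise.
propose = lambda p, rng: (0.0 if p is None else p) + rng.normal(scale=0.5)
log_lik = lambda y, p: -0.5 * (y - p) ** 2
print(np.mean(particle_filter([0.3, 0.5, 0.9, 1.2], propose, log_lik)))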

6 0.56195235 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

7 0.54288536 169 nips-2006-Relational Learning with Gaussian Processes

8 0.51610601 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

9 0.50535047 98 nips-2006-Inferring Network Structure from Co-Occurrences

10 0.48659068 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

11 0.45426089 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

12 0.44330832 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

13 0.43489447 77 nips-2006-Fast Computation of Graph Kernels

14 0.41765523 159 nips-2006-Parameter Expanded Variational Bayesian Methods

15 0.41467607 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints

16 0.39889473 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

17 0.39638603 192 nips-2006-Theory and Dynamics of Perceptual Bistability

18 0.3958503 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell_1$-Regularized Logistic Regression

19 0.39273807 82 nips-2006-Gaussian and Wishart Hyperkernels

20 0.39263502 74 nips-2006-Efficient Structure Learning of Markov Networks using $L_1$-Regularization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.095), (3, 0.021), (7, 0.069), (9, 0.03), (20, 0.03), (22, 0.05), (42, 0.017), (44, 0.399), (57, 0.085), (65, 0.034), (69, 0.046), (83, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99569649 69 nips-2006-Distributed Inference in Dynamical Systems

Author: Stanislav Funiak, Carlos Guestrin, Rahul Sukthankar, Mark A. Paskin

Abstract: We present a robust distributed algorithm for approximate probabilistic inference in dynamical systems, such as sensor networks and teams of mobile robots. Using assumed density filtering, the network nodes maintain a tractable representation of the belief state in a distributed fashion. At each time step, the nodes coordinate to condition this distribution on the observations made throughout the network, and to advance this estimate to the next time step. In addition, we identify a significant challenge for probabilistic inference in dynamical systems: message losses or network partitions can cause nodes to have inconsistent beliefs about the current state of the system. We address this problem by developing distributed algorithms that guarantee that nodes will reach an informative consistent distribution when communication is re-established. We present a suite of experimental results on real-world sensor data for two real sensor network deployments: one with 25 cameras and another with 54 temperature sensors. 1

2 0.98508877 96 nips-2006-In-Network PCA and Anomaly Detection

Author: Ling Huang, Xuanlong Nguyen, Minos Garofalakis, Michael I. Jordan, Anthony Joseph, Nina Taft

Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.
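The residual-subspace detector described above can be sketched directly: fit principal components to normal measurements, project a new measurement onto the residual (minor) subspace, and flag it when the squared prediction error is large. This is a centralised illustration on synthetic low-rank data; the paper's actual contribution — the distributed filters and the perturbation analysis of the accuracy/communication tradeoff — is not shown.

import numpy as np

def residual_spe(X_train, x_new, k=2):
    # Squared prediction error of x_new in the residual subspace of a rank-k PCA.
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    P = Vt[:k].T                                   # top-k principal directions
    resid = (x_new - mu) - P @ (P.T @ (x_new - mu))
    return float(resid @ resid)

rng = np.random.default_rng(3)
normal_traffic = (rng.normal(size=(200, 2)) @ rng.normal(size=(2, 6))
                  + 0.05 * rng.normal(size=(200, 6)))   # roughly rank-2 "normal" data
spe_normal = residual_spe(normal_traffic, normal_traffic[0])
spe_anomaly = residual_spe(normal_traffic, normal_traffic[0] + 5.0)  # injected spike
print(spe_normal, spe_anomaly)   # the anomaly should have a much larger SPE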

3 0.97899044 116 nips-2006-Learning from Multiple Sources

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1

same-paper 4 0.97123933 57 nips-2006-Conditional mean field

Author: Peter Carbonetto, Nando D. Freitas

Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. Mean field approximations have been applied to a broader set of problems, but the solutions are often poor. We propose a new class of conditionally-specified variational approximations based on mean field theory. While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean field. Moreover, experiments on a well-studied problem— inferring the stable configurations of the Ising spin glass—show that the solutions can be significantly better than those obtained using sum-product-based methods. 1

5 0.93408728 139 nips-2006-Multi-dynamic Bayesian Networks

Author: Karim Filali, Jeff A. Bilmes

Abstract: We present a generalization of dynamic Bayesian networks to concisely describe complex probability distributions such as in problems with multiple interacting variable-length streams of random variables. Our framework incorporates recent graphical model constructs to account for existence uncertainty, value-specific independence, aggregation relationships, and local and global constraints, while still retaining a Bayesian network interpretation and efficient inference and learning techniques. We introduce one such general technique, which is an extension of Value Elimination, a backtracking search inference algorithm. Multi-dynamic Bayesian networks are motivated by our work on Statistical Machine Translation (MT). We present results on MT word alignment in support of our claim that MDBNs are a promising framework for the rapid prototyping of new MT systems. 1 INTRODUCTION The description of factorization properties of families of probabilities using graphs (i.e., graphical models, or GMs), has proven very useful in modeling a wide variety of statistical and machine learning domains such as expert systems, medical diagnosis, decision making, speech recognition, and natural language processing. There are many different types of graphical model, each with its own properties and benefits, including Bayesian networks, undirected Markov random fields, and factor graphs. Moreover, for different types of scientific modeling, different types of graphs are more or less appropriate. For example, static Bayesian networks are quite useful when the size of set of random variables in the domain does not grow or shrink for all data instances and queries of interest. Hidden Markov models (HMMs), on the other hand, are such that the number of underlying random variables changes depending on the desired length (which can be a random variable), and HMMs are applicable even without knowing this length as they can be extended indefinitely using online inference. HMMs have been generalized to dynamic Bayesian networks (DBNs) and temporal conditional random fields (CRFs), where an underlying set of variables gets repeated as needed to fill any finite but unbounded length. Probabilistic relational models (PRMs) [5] allow for a more complex template that can be expanded in multiple dimensions simultaneously. An attribute common to all of the above cases is that the specification of rules for expanding any particular instance of a model is finite. In other words, these forms of GM allow the specification of models with an unlimited number of random variables (RVs) using a finite description. This is achieved using parameter tying, so while the number of RVs increases without bound, the number of parameters does not. In this paper, we introduce a new class of model we call multi-dynamic Bayesian networks. MDBNs are motivated by our research into the application of graphical models to the domain of statistical machine translation (MT) and they have two key attributes from the graphical modeling perspective. First, an MDBN generalizes a DBN in that there are multiple “streams” of variables that can get unrolled, but where each stream may be unrolled by a differing amount. In the most general case, connecting these different streams together would require the specification of conditional probabil- ity tables with a varying and potentially unlimited number of parents. 
To avoid this problem and retain the template’s finite description length, we utilize a switching parent functionality (also called value-specific independence). Second, in order to capture the notion of fertility in MT-systems (defined later in the text), we employ a form of existence uncertainty [7] (that we call switching existence), whereby the existence of a given random variable might depend on the value of other random variables in the network. Being fully propositional, MDBNs lie between DBNs and PRMs in terms of expressiveness. While PRMs are capable of describing any MDBN, there are, in general, advantages to restricting ourselves to a more specific class of model. For example, in the DBN case, it is possible to provide a bound on inference costs just by looking at attributes of the DBN template only (e.g., the left or right interfaces [12, 2]). Restricting the model can also make it simpler to use in practice. MDBNs are still relatively simple, while at the same time making possible the easy expression of MT systems, and opening doors to novel forms of probabilistic inference as we show below. In section 2, we introduce MDBNs, and describe their application to machine translation showing how it is possible to represent even complex MT systems. In section 3, we describe MDBN learning and decoding algorithms. In section 4, we present experimental results in the area of statistical machine translation, and future work is discussed in section 5. 2 MDBNs A standard DBN [4] template consists of a directed acyclic graph G = (V, E) = (V1 ∪ V2 , E1 ∪ → E2 ∪ E2 ) with node set V and edge set E. For t ∈ {1, 2}, the sets Vt are the nodes at slice t, Et → are the intra-slice edges between nodes in Vt , and Et are the inter-slice edges between nodes in V1 and V2 . To unroll a DBN to length T , the nodes V2 along with the edges adjacent to any node in V2 are cloned T − 1 times (where parameters of cloned variables are constrained to be the same as the template) and re-connected at the corresponding places. An MDBN with K streams consists of the union of K DBN templates along with a template structure specifying rules to connect the various streams together. An MDBN template is a directed graph (k) G = (V, E) = ( V (k) , E (k) ∪ E ) k (k) (k) th k (k) where (V , E ) is the k DBN, and the edges E are rules specifying how to connect stream k to the other streams. These rules are general in that they specify the set of edges for all values of Tk . There can be arbitrary nesting of the streams such as, for example, it is possible to specify a model that can grow along several dimensions simultaneously. An MDBN also utilizes “switching existence”, meaning some subset of the variables in V bestow existence onto other variables in the network. We call these variables existence bestowing (or ebnodes). The idea of bestowing existence is well defined over a discrete space, and is not dissimilar to a variable length DBN. For example, we may have a joint distribution over lengths as follows: p(X1 , . . . , XN , N ) = p(X1 , . . . , Xn |N = n)p(N = n) where here N is an eb-node that determines the number of other random variables in the DGM. Our notion of eb-nodes allows us to model certain characteristics found within machine translation systems, such as “fertility” [3], where a given English word is cloned a random number of times in the generative process that explains a translation from French into English. This random cloning might happen simultaneously at all points along a given MDBN stream. 
This means that even for a given fixed stream length Ti = ti , each stream could have a randomly varying number of random variables. Our graphical notation for eb-nodes consists of the eb-node as a square box containing variables whose existence is determined by the eb-node. We start by providing a simple example of an expanded MDBN for three well known MT systems, namely the IBM models 1 and 2 [3], and the “HMM” model [15].1 We adopt the convention in [3] that our goal is to translate from a string of French words F = f of length M = m into a string of English words E = e of length L = l — of course these can be any two languages. The basic generative (noisy channel) approach when translating from French to English is to represent the joint 1 We will refer to it as M-HMM to avoid confusion with regular HMMs. distribution P (f , e) = P (f |e)P (e). P (e) is a language model specifying the prior over the word string e. The key goal is to produce a finite-description length representation for P (f |e) where f and e are of arbitrary length. A hidden alignment string, a, specifies how the English words align to the French word, leading to P (f |e) = a P (f , a|e). Figure 1(a) is a 2-stream MDBN expanded representation of the three models, in this case ℓ = 4 and m = 3. As shown, it appears that the fan-in to node fi will be ℓ and thus will grow without bound. However, a switching mechanism whereby P (fi |e, ai ) = P (fi |eai ) limits the number of parameters regardless of L. This means that the alignment variable ai indicates the English word eai that should be aligned to French word fi . The variable e0 is a null word that connects to French words not explained by any of e1 , . . . , eℓ . The graph expresses all three models — the difference is that, in Models 1 and 2, there are no edges between aj and aj+1 . In Model 1, p(aj = ℓ) is uniform on the set {1, . . . , L}; in Model 2, the distribution over aj is a function only of its position j, and on the English and French lengths ℓ and m respectively. In the M-HMM model, the ai variables form a first order Markov chain. l e0 ℓ e1 e3 e2 e1 e4 e2 e3 φ1 φ2 φ3 m’ φ0 τ01 a1 f2 a2 f3 a3 m (a) Models 1,2 and M-HMM τ12 τ13 τ21 π02 π11 π12 π13 π21 f2 f3 f4 f5 f6 a1 u v τ11 f1 f1 τ02 a2 a3 a4 a5 a6 π01 w y x m (b) Expanded M3 graph Figure 1: Expanded 2-stream MDBN description of IBM Models 1 and 2, and the M-HMM model for MT; and the expanded MDBN description of IBM Model 3 with fertility assignment φ0 = 2, φ1 = 3, φ2 = 1, φ3 = 0. From the above, we see that it would be difficult to express this model graphically using a standard DBN since L and M are unequal random variables. Indeed, there are two DBNs in operation, one consisting of the English string, and the other consisting of the French string and its alignment. Moreover, the fully connected structure of the graph in the figure can represent the appropriate family of model, but it also represents models whose parameter space grows without bound — the switching function allows the model template to stay finite regardless of L and M . With our MDBN descriptive abilities complete, it is now possible to describe the more complex IBM models 3, and 4[3] (an MDBN for Model3 is depicted in fig. 1(b)). The top most random variable, ℓ, is a hidden switching existence variable corresponding to the length of the English string. The box abutting ℓ includes all the nodes whose existence depends on the value of ℓ. 
In the figure, ℓ = 3, thus resulting in three English words e1 , e2 , and e3 connected using a second-order Markov chain. To each English word ei corresponds a conditionally dependent fertility eb-node φi , which indicates how many times ei is used by words in the French string. Each φi in turn controls the existence of a set of variables under it. Given the fertilities (the figure depicts the case φ1 = 3, φ2 = 1, φ3 = 0), for each word ei , φi French word variables are granted existence and are denoted by τi1 , τi2 , . . . , τiφi , what is called the tablet [3] of ei . The values taken by the τ variables need to match the actual observed French sequence f1 , . . . , fm . This is represented as a shared constraint between all the f , π, and τ variables which have incoming edges into the observed variable v. v’s conditional probability table is such that it is one only when the associated constraint is satisfied2 . The variable 2 This type of encoding of constraints corresponds to the standard mechanism used by Pearl [14]. A naive implementation, however, would enumerate a number of configurations exponential in the number of constrained variables, while typically only a small fraction of the configurations would have positive probability. πi,k ∈ {1, . . . , m} is a switching dependency parent with respect to the constraint variable v and determines which fj participates in an equality constraint with τi,k . The bottom variable m is a switching existence node (observed to be 6 in the figure) with corresponding French word sequence and alignment variables. The French sequence participates in the v constraint described above, while the alignment variables aj ∈ {1, . . . , ℓ}, j ∈ 1, . . . , m constrain the fertilities to take their unique allowable values (for the given alignment). Alignments also restrict the domain of permutation variables, π, using the constraint variable x. Finally, the domain size of each aj has to lie in the interval [0, ℓ] and that is enforced by the variable u. The dashed edges connecting the alignment a variables represent an extension to implement an M3/M-HMM hybrid. ℓ The null submodel involving the deterministic node m′ (= i=1 φi ) and eb-node φ0 accounts for French words that are not explained by any of the English words e1 , . . . , eℓ . In this submodel, successive permutation variables are ordered and this constraint is implemented using the observed child w of π0i and π0(i+1) . Model 4 [3] is similar to Model 3 except that the former is based on a more elaborate distortion model that uses relative instead of absolute positions both within and between tablets. 3 Inference, Parameter Estimation and MPE Multi-dynamic Bayesian Networks are amenable to any type of inference that is applicable to regular Bayesian networks as long as switching existence relationships are respected and all the constraints (aggregation for example) are satisfied. Unfortunately DBN inference procedures that take advantage of the repeatable template and can preprocess it offline, are not easy to apply to MDBNs. A case in point is the Junction Tree algorithm [11]. Triangulation algorithms exist that create an offline triangulated version of the input graph and do not re-triangulate it for each different instance of the input data [12, 2]. 
In MDBNs, due to the flexibility to unroll templates in several dimensions and to specify dependencies and constraints spanning the entire unrolled graph, it is not obvious how we can exploit any repetitive patterns in a Junction Tree-style offline triangulation of the graph template. In section 4, we discuss sampling inference methods we have used. Here we discuss our extension to a backtracking search algorithm with the same performance guarantees as the JT algorithm, but with the advantage of easily handling determinism, existence uncertainty, and constraints, both learned and explicitly stated. Value Elimination (VE) ([1]), is a backtracking Bayesian network inference technique that caches factors associated with portions of the search tree and uses them to avoid iterating again over the same subtrees. We follow the notation introduced in [1] and refer the reader to that paper for details about VE inference. We have extended the VE inference approach to handle explicitly encoded constraints, existence uncertainty, and to perform approximate local domain pruning (see section 4). We omit these details as well as others in the original paper and briefly describe the main data structure required by VE and sketch the algorithm we refer to as FirstPass (fig. 1) since it constitutes the first step of the learning procedure, our main contribution in this section. A VE factor, F , is such that we can write the following marginal of the joint distribution P (X = x, Y = y, Z) = F.val × f (Z) X=x such that (X∪Y)∩Z = ∅, F.val is a constant, and f (Z) a function of Z only. Y is a set of variables previously instantiated in the current branch of search tree to the value vector y. The pair (Y, y) is referred to as a dependency set (F.Dset). X is referred to as a subsumed set (F.Sset). By caching the tuple (F.Dset, F.Sset, F.val), we avoid recomputing the marginal again whenever (1) F.Dset is active, meaning all nodes stored in F.Dset are assigned their cached values in the current branch of the search tree; and (2) none of the variables in F.Sset are assigned yet. FirstPass (alg. 1) visits nodes in the graph in Depth First fashion. In line 7, we get the values of all Newly Single-valued (NSV) CPTs i.e., CPTs that involve the current node, V , and in which all We use a general directed domain pruning constraint. Deterministic relationships then become a special case of our constraint whereby the domain of the child variable is constrained to a single value with probability one. Variable traversal order: A, B, C, and D. Factors are numbered by order of creation. *Fi denotes the activation of factor i. 
Tau values propagated recursively F7: Dset={} Sset={A,B,C,D} val=P(E=e) F7.tau = 1.0 = P(Evidence)/F7.val A F5: Dset={A=0} Sset={B,C,D} F2 D *F1 *F2 Factor values needed for c(A=0) and c(C=0,B=0) computation: F5.val=P(B=0|A=0)*F3.val+P(B=1|A=0)*F4.val F3.val=P(C=0|B=0)*F1.val+P(C=1|B=0)*F2.val F4.val=P(C=0|B=1)*F1.val+P(C=1|B=1)*F2.val F1.val=P(D=0|C=0)P(E=e|D=0)+P(D=1|C=0)P(E=e|D=1) F2.val=P(D=0|C=1)P(E=e|D=0)+P(D=1|C=1)P(E=e|D=1) First pass C *F3 *F4 Second pass D B F4 C F6.tau = F7.tau * P(A=1) 1 B F3: Dset={B=0} Sset={C,D} F1 F5.tau = F7.tau * P(A=0) F6 0 F3.tau = F5.tau * P(B=0|A=0) + F6.tau * P(B=0|A=1) = P(B=0) F4.tau = F5.tau * P(B=1|A=0) + F6.tau * P(B=1|A=1) = P(B=1) F1.tau = F3.tau * P(C=0|B=0) + F4.tau * P(C=0|B=1) = P(C=0) F2.tau = F3.tau * P(C=1|B=0) + F4.tau * P(C=1|B=1) = P(C=1) c(A=0)=(1/P(e))*(F7.tau*P(A=0)*F5.val)=(1/P(e))(P(A=0)*P(E=e|A=0))=P(A=0|E=e) c(C=0,B=0)=(1/P(e))*F3.tau*P(C=0|B=0)*F1.val =(1/P(e) * (P(A=0,B=0)+P(A=1,B=0)) * P(C=0|B=0) * F1.val =(1/P(e)) * P(B=0) * P(C=0|B=0) * F1.val =(1/P(e)) * P(B=0) * P(C=0|B=0) * F1.val =(1/P(e)) * P(C=0,B=0) * F1.val =P(C=0,B=0,E=e)/P(e)=P(C=0,B=0|E=e) Figure 2: Learning example using the Markov chain A → B → C → D → E, where E is observed. In the first pass, factors (Dset, Sset and val) are learned in a bottom up fashion. Also, the normalization constant P (E = e) (probability of evidence) is obtained. In the second pass, tau values are updated in a top-down fashion and used to calculate expected counts c(F.head, pa(F.head)) corresponding to each F.head (the figure shows the derivations for (A=0) and (C=0,B=0), but all counts are updated in the same pass). other variables are already assigned (these variables and their values are accumulated into Dset). We also check for factors that are active, multiply their values in, and accumulate subsumed vars in Sset (to avoid branching on them). In line 10, we add V to the Sset. In line 11, we cache a new factor F with value F.val = sum. We store V into F.head, a pointer to the last variable to be inserted into F.Sset, and needed for parameter estimation described below. F.Dset consists of all the variables, except V , that appeared in any NSV CPT or the Dset of an activated factor at line 6. Regular Value Elimination is query-based, similar to variable elimination and recursive conditioning—what this means is that to answer a query of the type P (Q|E = e), where Q is query variable and E a set of evidence nodes, we force Q to be at the top of the search tree, run the backtracking algorithm and then read the answers to the queries P (Q = q|E = e), q ∈ Dom[Q], along each of the outgoing edges of Q. Parameter estimation would require running a number of queries on the order of the number of parameters to estimate. We extend VE into an algorithm that allows us to obtain Expectation Maximization sufficient statistics in a single run of Value Elimination plus a second pass, which can never take longer than the first one (and in practice is much faster). This two-pass procedure is analogous to the collect-distribute evidence procedure in the Junction Tree algorithm, but here we do this via a search tree. Let θX=x|pa(X)=y be a parameter associated with variable X with value x and parents Y = pa(X) when they have value y. 
Let θ_{X=x|pa(X)=y} be a parameter associated with variable X taking value x when its parents Y = pa(X) have value y. Assuming a maximum likelihood learning scenario (for Bayesian networks the likelihood function decomposes such that maximizing the expectation of the complete likelihood is equivalent to maximizing the "local likelihood" of each variable in the network), to estimate θ_{X=x|pa(X)=y} we need to compute

    f(X = x, pa(X) = y, E = e) = Σ_{W \ {X, pa(X)}} P(W, X = x, pa(X) = y, E = e),

which is a sum of joint probabilities over all configurations that are consistent with the assignment {X = x, pa(X) = y}. If we were to turn off factor caching, we would enumerate all such variable configurations and could compute the sum. When standard VE factors are used, however, this is no longer possible whenever X or any of its parents becomes subsumed. Fig. 2 illustrates an example of a VE tree and the factors that are learned in the case of a Markov chain with an evidence node at the end. We can readily estimate the parameters associated with variables A and B as they are not subsumed along any branch. C and D become subsumed, however, and we cannot obtain the correct counts along all the branches that would lead to C and D in the full-enumeration case.

To address this issue, we store a special value, F.tau, in each factor. F.tau holds the sum over all path probabilities from the first level of the search tree to the level at which the factor F was either created or activated. For example, F6.tau in fig. 2 is simply P(A = 1). Although we can compute F3.tau directly, we can also compute it recursively using F5.tau and F6.tau, as shown in the figure. This is because both F5 and F6 subsume F3: in the context {F5.Dset}, there exists a (unique) value d_sub of F5.head (recall that F.head is the last variable added to a newly created factor, in line 10 of Alg. 1) such that F3 becomes activable; likewise for F6. We cannot compute F1.tau directly, but we can compute it recursively from F3.tau and F4.tau by exploiting a similar subsumption relationship. In general, we can show that the following recursive relationship holds:

    F.tau ← Σ_{F_pa} F_pa.tau × NSV_{F_pa.head = d_sub} × ( Π_{F_act} F_act.val ) / F.val    (1)

where the sum is over F_pa, the set of factors that subsume F; F_act is the set of all factors (including F) that become active in the context {F_pa.Dset, F_pa.head = d_sub}; and NSV_{F_pa.head = d_sub} is the product of all newly single-valued CPTs under the same context. For top-level factors (not subsumed by any factor), F.tau = P_evidence / F.val, which is 1.0 when there is a unique top-level factor.

Alg. 2 is a simple recursive computation of eq. 1 for each factor. We visit learned factors in the reverse order in which they were learned to ensure that, for any factor F', F'.tau is incremented (line 13) by any F that might have activated F' (line 12). For example, in fig. 2, F4 uses F1 and F2, so F4.tau needs to be updated before F1.tau and F2.tau. In line 11, we can increment the counts for any NSV CPT entries, since F.tau accounts for the possible ways of reaching the configuration {F.Dset, F.head = d} in an equivalent full-enumeration tree.

Algorithm 1: FirstPass(level)
Input: Graph G
Output: A list of learned factors and P_evidence
 1  Select a variable V to branch on
 2  if V == NONE then return
 3  Sset = {}, Dset = {}, sum = 0
 4  for d ∈ Dom[V] do
 5      V ← d
 6      prod = productOfAllNSVsAndActiveFactors(Dset, Sset)
 7      if prod != 0 then FirstPass(level+1)
 8      sum += prod
 9  Sset = Sset ∪ {V}
10  cacheNewFactor(F.head ← V, F.val ← sum, F.Sset ← Sset, F.Dset ← Dset)

Algorithm 2: SecondPass()
Input: F, the list of factors in the reverse order learned in the first pass, and P_evidence
Result: Updated counts
 1  foreach F ∈ F do
 2      if F.Dset = {} then F.tau ← P_evidence / F.val
 3      else F.tau ← 0.0
 4  foreach F ∈ F do
 5      Assign the variables in F.Dset to their cached values
 6      V ← F.head   (the last node to have been subsumed in this factor)
 7      foreach d ∈ Dom[V] do
 8          prod = productOfAllNSVsAndActiveFactors()
 9          prod *= F.tau
10          foreach newly single-valued CPT C do
11              count(C.child, C.parents) += prod / P_evidence
12          F' = getListOfActiveFactors()
13          for F' ∈ F' do F'.tau += prod / F'.val
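As a sanity check on the val/tau bookkeeping, the following self-contained sketch reproduces the Figure 2 computation on the chain A → B → C → D → E with made-up CPT entries (all variables binary, E observed). On this chain the only factor that becomes active in each parent context is F itself, so the product term in eq. 1 cancels and the recursion reduces to F.tau = Σ F_pa.tau × NSV. The numbers and names below are illustrative, not taken from the paper.

    import itertools

    pA = [0.6, 0.4]                # P(A)
    pB = [[0.7, 0.3], [0.2, 0.8]]  # P(B|A)
    pC = [[0.5, 0.5], [0.1, 0.9]]  # P(C|B)
    pD = [[0.8, 0.2], [0.3, 0.7]]  # P(D|C)
    pE = [[0.9, 0.1], [0.4, 0.6]]  # P(E|D)
    e = 0                          # observed value of E

    # First pass (bottom-up): val[x] = P(E=e | X=x); e.g. F1.val corresponds to valC[0].
    valD = [pE[d][e] for d in range(2)]
    valC = [sum(pD[c][d] * valD[d] for d in range(2)) for c in range(2)]  # F1, F2
    valB = [sum(pC[b][c] * valC[c] for c in range(2)) for b in range(2)]  # F3, F4
    valA = [sum(pB[a][b] * valB[b] for b in range(2)) for a in range(2)]  # F5, F6
    p_evidence = sum(pA[a] * valA[a] for a in range(2))                   # F7.val

    # Second pass (top-down): tau[x] = prior marginal P(X=x); e.g. F3.tau corresponds to tauB[0].
    tauA = list(pA)
    tauB = [sum(tauA[a] * pB[a][b] for a in range(2)) for b in range(2)]
    tauC = [sum(tauB[b] * pC[b][c] for b in range(2)) for c in range(2)]

    # Expected count for the CPT entry (C=0, B=0), as derived in Figure 2.
    count = tauB[0] * pC[0][0] * valC[0] / p_evidence

    # Brute-force check: the count equals the posterior P(C=0, B=0 | E=e).
    def joint(a, b, c, d):
        return pA[a] * pB[a][b] * pC[b][c] * pD[c][d] * pE[d][e]

    num = sum(joint(a, 0, 0, d) for a in range(2) for d in range(2))
    den = sum(joint(a, b, c, d) for a, b, c, d in itertools.product(range(2), repeat=4))
    assert abs(count - num / den) < 1e-12
    print(round(count, 6))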
Most Probable Explanation. We compute the MPE using a very similar two-pass algorithm. In the first pass, factors store a maximum instead of a summation over the variables in the Sset, and we also keep track of the value of F.head at which the maximum is achieved. In the second pass, we recursively recover the optimal variable configuration by following the trail of factors that are activated when we assign each F.head variable to its maximizing value, starting from the last learned factor.

4 MACHINE TRANSLATION WORD ALIGNMENT EXPERIMENTS

A major motivation for pursuing the type of representation and inference described above is to make it possible to solve computationally intensive real-world problems using large amounts of data, while retaining the full generality and expressiveness afforded by the MDBN modeling language. In the experiments below we compare the running times of MDBNs to GIZA++ on IBM Models 1 through 4 and the M-HMM model. GIZA++ is a special-purpose, optimized C++ tool for MT word alignment that is widely used in current state-of-the-art phrase-based MT systems [10] and, at the time of this writing, is the only publicly available software that implements all of the IBM Models. We test on 107 hand-aligned French-English sentence pairs (available at http://www.cs.washington.edu/homes/karim) from the European parliament proceedings corpus (Europarl [9]) and train on 10,000 sentence pairs from the same corpus, each at most 40 words long. The Alignment Error Rate (AER) [13] evaluation metric quantifies how well the MPE assignment to the hidden alignment variables matches human-generated alignments.

Several pruning and smoothing techniques are used by GIZA and MDBNs. GIZA prunes low lexical (P(f|e)) probability values and uses a default small value for unseen (or pruned) probability table entries. For Models 3 and 4, for which there is no known polynomial-time algorithm to perform the full E-step or compute the MPE, GIZA generates a set of high-probability alignments using an M-HMM and hill-climbing, and collects EM counts over these alignments using M3 or M4. For the MDBN models we use the following pruning strategy: at each level of the search tree we prune the values which, together, account for the lowest specified percentage of the total probability mass of the product of all newly active CPTs in line 6 of Alg. 1. This is more effective than simply removing low-probability values of each CPD because it factors in the joint contribution of multiple active variables.

Table 1 compares timing numbers obtained with GIZA++ and MDBNs. The runtimes shown are for the combined tasks of training and decoding; however, training time dominates given the difference in size between the train and test sets. For Models 1 and 2 neither GIZA nor MDBNs perform any pruning. For the M-HMM, we prune 60% of the probability mass at each level and use a Dirichlet prior over the alignment variables such that long-range transitions are exponentially less likely than shorter ones (French and English have similar word orders; on a different language pair a different prior might be more appropriate, and with a uniform prior the MDBN M-HMM reaches 36.0% AER). This model achieves times and AER similar to GIZA's.
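To make the mass-based pruning rule concrete, here is a small hypothetical helper that keeps, at a given search level, the candidate values jointly covering the top (1 - p) fraction of the mass of the product of newly active CPTs and drops the rest; the function name and the exact tie-breaking are our own reading of the description above, not the paper's code.

    def prune_lowest_mass(scores, prune_fraction):
        """scores maps each candidate value of the branching variable to the product
        of all newly active CPT entries for that value; values that together account
        for the lowest prune_fraction of the total mass are dropped."""
        total = sum(scores.values())
        if total == 0.0:
            return set(scores)        # nothing to rank; keep everything
        kept, covered = set(), 0.0
        for value, s in sorted(scores.items(), key=lambda kv: -kv[1]):
            kept.add(value)
            covered += s
            if covered >= (1.0 - prune_fraction) * total:
                break
        return kept

    # With the 60% threshold used for the M-HMM runs, only the strongest
    # candidates at this (hypothetical) level survive.
    print(prune_lowest_mass({0: 0.05, 1: 0.50, 2: 0.30, 3: 0.15}, prune_fraction=0.60))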
Interestingly, without any pruning, the MDBN M-HMM takes 160 minutes to complete while only marginally improving upon the pruned model. Experimenting with several pruning thresholds, we found that AER degrades much more slowly than the runtime decreases. Models 3 and 4 have treewidth equal to the number of alignment variables (because of the global constraints tying them) and therefore require approximate inference. Using Model 3 and a drastic pruning threshold that keeps only the single highest-probability value at each level, we were able to achieve an AER not much higher than GIZA's. For Model 4, GIZA achieves its best AER of 31.7%, while we do not improve upon Model 3, most likely because the pruning is too restrictive. Nevertheless, a simple variation on Model 3 in the MDBN framework achieves a lower AER than our regular M3 (with the same pruning): the M3-HMM hybrid model combines the Markov alignment dependencies of the M-HMM model with the fertility model of M3.

MCMC Inference. Sampling is widely used for inference in high-treewidth models. Although MDBNs support Likelihood Weighting, it is very inefficient when the probability of evidence is very small, as is the case in our MT models. Besides being slow, Markov chain Monte Carlo can be problematic when the joint distribution is not positive everywhere, in particular in the presence of determinism and hard constraints. Techniques such as blocking Gibbs sampling [8] try to address the problem, but one often has to carefully choose a problem-dependent proposal distribution. We used MCMC to improve training of the M3-HMM model and were able to achieve an AER of 32.8% (down from 39.1%), but at the cost of 400 minutes of uniprocessor time.

Table 1: MDBN VE-based learning versus GIZA++ timings and %AER using 5 EM iterations. The columns M1 and M-HMM indicate the model used to initialize the model in the corresponding row. The last row is a hybrid Model3-HMM model that we implemented using MDBNs and that is not expressible using GIZA.

              GIZA++                          MDBN
  Model       Init M1        Init M-HMM      Init M1        Init M-HMM      MCMC
  M1          1m45s (47.7%)  N/A             3m20s (48.0%)  N/A
  M2          2m02s (41.3%)  N/A             5m30s (41.0%)  N/A
  M-HMM       4m05s (35.0%)  N/A             4m15s (33.0%)  N/A
  M3          2m50s (45.0%)  5m20s (38.5%)   12m (43.6%)    9m (42.5%)
  M4          5m20s (34.8%)  7m45s (31.7%)   25m (43.6%)    23m (42.6%)
  M3-HMM      N/A            N/A             9m30s (41.0%)  9m15s (39.1%)   400m (32.8%)

5 CONCLUSION

The existing classes of graphical models are not ideally suited for representing SMT models because "natural" semantics for specifying the latter combine flavors of different GM types on top of standard directed Bayesian network semantics: switching parents found in Bayesian Multinets [6], aggregation relationships such as in Probabilistic Relational Models [5], and existence uncertainty [7]. We have introduced a generalization of dynamic Bayesian networks for easily and concisely building models consisting of varying-length, parallel, asynchronous, and interacting data streams. We have shown that our framework is useful for expressing various statistical machine translation models. We have also introduced new parameter estimation and decoding algorithms using exact and approximate search-based probability computation.
While our runtimes are not yet as fast as those of a hand-optimized C++ program on the equivalent model, we have shown that even in this general-purpose MDBN framework they are competitive and usable. Our framework can of course do much more than the IBM and HMM models. One of our goals is to use it to rapidly prototype novel MT systems and to develop methods for statistically inducing an interlingua. We also intend to use MDBNs in other domains, such as multi-party social interaction analysis.

References
[1] F. Bacchus, S. Dalmao, and T. Pitassi. Value elimination: Bayesian inference via backtracking search. In UAI-03, pages 20-28, San Francisco, CA, 2003. Morgan Kaufmann.
[2] J. Bilmes and C. Bartels. On triangulating dynamic graphical models. In Uncertainty in Artificial Intelligence: Proceedings of the 19th Conference, pages 47-56. Morgan Kaufmann, 2003.
[3] P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85, June 1990.
[4] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. In AAAI, pages 524-528, 1988.
[5] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, pages 1300-1309, 1999.
[6] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82(1-2):45-74, 1996.
[7] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3(4-5):697-707, May 2003.
[8] C. Jensen, A. Kong, and U. Kjaerulff. Blocking Gibbs sampling in very large probabilistic expert systems. International Journal of Human-Computer Studies, Special Issue on Real-World Applications of Uncertain Reasoning, 1995.
[9] P. Koehn. Europarl: A multilingual corpus for evaluation of machine translation. http://www.isi.edu/koehn/publications/europarl, 2002.
[10] P. Koehn, F. Och, and D. Marcu. Statistical phrase-based translation. In NAACL/HLT 2003, 2003.
[11] S. Lauritzen. Graphical Models. Oxford Science Publications, 1996.
[12] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 2002.
[13] F. J. Och and H. Ney. Improved statistical alignment models. In ACL, pages 440-447, October 2000.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd printing edition, 1988.
[15] S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics, pages 836-841, Morristown, NJ, USA, 1996.

6 0.80534804 159 nips-2006-Parameter Expanded Variational Bayesian Methods

7 0.78487456 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier

8 0.77387148 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

9 0.75792146 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

10 0.74940223 98 nips-2006-Inferring Network Structure from Co-Occurrences

11 0.74528593 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions

12 0.74190897 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization

13 0.74065083 121 nips-2006-Learning to be Bayesian without Supervision

14 0.72564942 193 nips-2006-Tighter PAC-Bayes Bounds

15 0.72435606 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

16 0.72144848 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

17 0.71893054 175 nips-2006-Simplifying Mixture Models through Function Approximation

18 0.71823668 192 nips-2006-Theory and Dynamics of Perceptual Bistability

19 0.71611893 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

20 0.71497476 109 nips-2006-Learnability and the doubling dimension