jmlr jmlr2010 jmlr2010-75 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks are a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation and an efficient implementation that exploits the wide range of highly optimized ordinary differential equation (ODE) solvers; we experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
Reference: text
sentIndex sentText sentNum sentScore
1 We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. [sent-15, score-0.333]
2 The dynamics of such processes that are also time-homogeneous can be determined by a single rate matrix whose entries encode transition rates among states. [sent-47, score-0.385]
3 However, the size of the state space is exponential in the number of components, and so is the size of the transition matrix. [sent-48, score-0.294]
4 The instantaneous dynamics of each component depends only on the state of its parents in the graph, allowing a representation whose size scales linearly with the number of components and exponentially only with the indegree of the nodes of the graph. [sent-52, score-0.29]
5 The resulting procedure approximates the posterior distribution of the CTBN as a product of independent components, each of which is an inhomogeneous continuous-time Markov process. [sent-78, score-0.274]
6 By using this representation, we derive an iterative variational procedure that passes information between neighboring components and solves a small set of ordinary differential equations (ODEs) in each iteration. [sent-80, score-0.274]
7 Finally, we describe how to extend the proposed procedure to branching processes, and particularly to models of molecular evolution, which describe the historical dynamics of biological sequences with many interacting components. [sent-83, score-0.282]
8 (d/dt) p_{x,y}(t) = \sum_z p_{x,z}(t) q_{z,y}. (2) A similar characterization for the time-dependent probability distribution p(t), whose entries are defined by p_x(t) = Pr(X^{(t)} = x), x ∈ S, is obtained by multiplying the Markov transition function by the entries of the initial distribution p(0) and marginalizing, resulting in (d/dt) p = pQ. [sent-127, score-0.369]
9 (3) The solution of this ODE is p(t) = p(0) exp(tQ). Figure 1: An example of a CTMP trajectory: the process starts at state x1 = s0, transitions to x2 = s2 at t1, to x3 = s1 at t2, and finally to x4 = s2 at t3. [sent-128, score-0.407]
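The closed-form solution p(t) = p(0) exp(tQ) is easy to check numerically. The sketch below is illustrative, not taken from the paper: the 2-state rate matrix Q and the initial distribution are made up, and SciPy's matrix exponential stands in for the optimized ODE solvers the paper relies on.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical 2-state rate matrix: off-diagonal entries are transition
# rates, and each row sums to zero.
Q = np.array([[-1.0, 1.0],
              [2.0, -2.0]])
p0 = np.array([1.0, 0.0])  # initial distribution p(0)

def marginal(p0, Q, t):
    """Solution of the master equation dp/dt = pQ: p(t) = p(0) exp(tQ)."""
    return p0 @ expm(t * Q)

pt = marginal(p0, Q, 0.5)
```

Because each row of Q sums to zero, every row of exp(tQ) sums to one, so p(t) remains a probability distribution; for large t it converges to the stationary distribution of Q.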
10 (4) Although a CTMP is an uncountable collection of random variables (the state of the system at every time t), a trajectory σ of {X^{(t)}}_{t≥0} over a time interval [0, T] can be characterized by a finite number of transitions K, a sequence of states (x_0, x_1, …, x_K), and a sequence of transition times (t_1, …, t_K). [sent-132, score-0.277]
11 We denote by σ(t) the state at time t, that is, σ(t) = xk for tk ≤ t < tk+1 . [sent-139, score-0.363]
12 First, it is assumed that only one component can change at a time; thus transition rates involving simultaneous changes of two or more components are zero. [sent-145, score-0.317]
13 Second, the transition rate of each component i depends only on the state of some subset of components, denoted Pa_i ⊆ {1, …, D}. [sent-146, score-0.382]
14 With each component i we then associate a conditional rate matrix Q^{i|Pa_i}_{·|u_i} for each state u_i of Pa_i. [sent-155, score-0.249]
15 The off-diagonal entries q^{i|Pa_i}_{x_i,y_i|u_i} represent the rate at which X_i transitions from state x_i to state y_i given that its parents are in state u_i. [sent-156, score-0.834]
16 The diagonal entries are q^{i|Pa_i}_{x_i,x_i|u_i} = −∑_{y_i ≠ x_i} q^{i|Pa_i}_{x_i,y_i|u_i}, ensuring that each row in each conditional rate matrix sums up to zero. [sent-157, score-0.531]
17 The subset of random variables {X^{(t_k)} : k ≥ 0}, where t_k = kh, is a discrete-time Markov process over a D-dimensional state-space. [sent-161, score-0.551]
18 (7) Each such term is the local conditional probability that X_i^{(t_{k+1})} = y_i given the state of X_i and U_i at time t_k. [sent-164, score-0.378]
19 These are valid conditional distributions, because they are non-negative and normalized, that is, ∑_{y_i ∈ S_i} ( 1{x_i = y_i} + q^{i|Pa_i}_{x_i,y_i|u_i} h ) = 1 for every x_i and u_i. [sent-165, score-0.489]
20 More formally, we define the conditional rate matrices as q^{i|Pa_i}_{x_i,y_i|u_i} = τ ( 1 + e^{−2 y_i β ∑_{j ∈ Pa_i} x_j} )^{−1}, where x_j ∈ {−1, 1}. [sent-173, score-0.29]
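This parameterization is simple to compute directly. A minimal sketch of the ±1 Ising-chain rates q = τ (1 + e^{−2 y β Σ_{j∈Pa} x_j})^{−1}; the function name and the one-parent setting are illustrative, not from the paper:

```python
import math

def ising_rate(y_target, parent_states, beta, tau):
    # Conditional rate for a +/-1 component to transition into state
    # y_target given its parents' states (Example 1 parameterization):
    # q = tau * (1 + exp(-2 * y_target * beta * sum(parents)))**-1
    return tau / (1.0 + math.exp(-2.0 * y_target * beta * sum(parent_states)))

# With beta > 0, the rate into the state that agrees with the parent
# is higher than the rate into the disagreeing state.
agree = ising_rate(+1, [+1], beta=1.0, tau=2.0)
disagree = ising_rate(-1, [+1], beta=1.0, tau=2.0)
```

Note that for a single ±1 parent the two rates sum to τ, so τ sets the overall time scale of transitions while β sets the coupling strength, matching the roles the paper assigns to these two parameters.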
21 Here, in both components the conditional rates are higher for transitions into states that are identical to the state of their parent. [sent-184, score-0.279]
22 Clearly, transition rates into states in which both components have the same value are higher. [sent-187, score-0.278]
23 Higher transition rates imply higher transition probabilities; for example, p_{-+,--}(h) = 10h + o(h), while p_{--,-+}(h) = h + o(h). [sent-188, score-0.332]
24 The two possible types of evidence that may be given are continuous evidence, where we know the state of a subset U ⊆ X continuously over some sub-interval [t1 ,t2 ] ⊆ [0, T ], and point evidence of the state of U at some point t ∈ [0, T ]. [sent-192, score-0.328]
25 These statistics are the amount of time in which Xi is in state xi and Pai are in state ui , and the number of transitions that Xi underwent from xi to yi while its parents were in state ui (Nodelman et al. [sent-200, score-0.688]
26 Moreover, calculating expected residence times and expected number of transitions involves integration over the time interval of these quantities (Nodelman et al. [sent-206, score-0.245]
27 Here we aim at defining a lower bound on ln P_Q(e_T | e_0), as well as approximating the posterior probability P_Q(· | e_0, e_T), where P_Q is the distribution of the Markov process whose instantaneous rate matrix is Q. [sent-216, score-0.325]
28 Recall that the distribution of a time-homogeneous Markov process is characterized by the conditional transition probabilities p_{x,y}(t), which in turn are fully determined by the constant rate matrix Q. [sent-218, score-0.261]
29 It is not hard to see that whenever the prior distribution of a stochastic process is that of a homogeneous Markov process with rate matrix Q, then the posterior PQ (·|e0 , eT ) is also a Markov process, albeit generally not a homogeneous one. [sent-219, score-0.354]
30 The distribution of a continuous-time Markov process that is not homogeneous in time is determined by conditional transition probabilities p_{x,y}(s, s + t), which depend explicitly on both initial and final times. [sent-220, score-0.319]
31 These transition probabilities can be expressed by means of a time-dependent matrix-valued function, R(t), which describes instantaneous transition rates. [sent-221, score-0.412]
32 The connection between the time-dependent rate matrix R(t) and the transition probabilities p_{x,y}(s, s + t) is established by the master equation (d/dt) p_{x,y}(s, s + t) = ∑_z r_{z,y}(s + t) p_{x,z}(s, s + t), where r_{z,y}(t) are the entries of R(t). [sent-222, score-0.418]
33 As in the homogeneous case, it leads to a master equation for the time-dependent probability distribution, (d/dt) p_x(t) = ∑_y r_{y,x}(t) p_y(t), thereby generalizing Equation (3). [sent-224, score-0.303]
34 More precisely, writing the posterior transition probability using basic properties of conditional probabilities and the definition of the Markov transition function gives P_Q(X^{(t+h)} = y | X^{(t)} = x, X^{(T)} = e_T) = p_{x,y}(h) p_{y,e_T}(T − t − h) / p_{x,e_T}(T − t). [sent-226, score-0.441]
35 Taking the limit h → 0 we obtain the instantaneous transition rate of the posterior process. [sent-227, score-0.414]
36 r_{x,y}(t) = lim_{h→0} (1/h) P_Q(X^{(t+h)} = y | X^{(t)} = x, X^{(T)} = e_T) = q_{x,y} · p_{y,e_T}(T − t) / p_{x,e_T}(T − t). (9) This representation, although natural, proves problematic in the framework of deterministic evidence, because as t approaches T the transition rate into the observed state tends to infinity. [sent-228, score-0.388]
37 In particular, when x ≠ e_T and y = e_T, the posterior transition rate is q_{x,e_T} · p_{e_T,e_T}(T − t) / p_{x,e_T}(T − t). [sent-229, score-0.315]
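This divergence is easy to see numerically. The sketch below is illustrative (a made-up 2-state chain, not from the paper): it evaluates the posterior rate of Equation (9), r_{x,y}(t) = q_{x,y} p_{y,e_T}(T−t)/p_{x,e_T}(T−t), with the homogeneous transition function computed as a matrix exponential, and shows the rate into the observed state blowing up as t → T.

```python
import numpy as np
from scipy.linalg import expm

Q = np.array([[-1.0, 1.0],
              [2.0, -2.0]])   # hypothetical 2-state rate matrix
T, e_T = 1.0, 1               # point evidence X(T) = state 1

def posterior_rate(x, y, t):
    # Equation (9): r_xy(t) = q_xy * p_{y,eT}(T - t) / p_{x,eT}(T - t),
    # with p(s) = exp(sQ) for the homogeneous prior process.
    P = expm((T - t) * Q)
    return Q[x, y] * P[y, e_T] / P[x, e_T]

r_mid = posterior_rate(0, 1, 0.5)     # finite, moderate rate
r_late = posterior_rate(0, 1, 0.999)  # rate into the observed state explodes
```

Near t = T, p_{x,e_T}(T − t) shrinks like (T − t) for x ≠ e_T while p_{e_T,e_T}(T − t) stays near one, so the ratio, and hence the rate, grows without bound; this is exactly the numerical difficulty that motivates the alternative parameterization below.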
38 We therefore consider an alternative parameterization for this inhomogeneous process that is more suitable for variational approximations. [sent-231, score-0.296]
39 The function γx,y (t) is the probability density that X transitions from state x to y at time t. [sent-236, score-0.246]
40 Proof: see Appendix B. The converse is also true: for every integrable inhomogeneous rate matrix R(t), the corresponding marginal density set is defined by (d/dt) µ_x(t) = ∑_y r_{y,x}(t) µ_y(t) and γ_{x,y}(t) = µ_x(t) r_{x,y}(t). [sent-254, score-0.503]
41 This analysis suggests that for every homogeneous rate matrix and point evidence e there is a member in Me that corresponds to the posterior process. [sent-261, score-0.306]
42 2 Variational Principle The marginal density representation allows us to state the variational principle for continuous processes, which closely tracks similar principles for discrete processes. [sent-264, score-0.287]
43 The maximum over this set is the log-likelihood of the evidence and is attained for a density set that represents the posterior distribution. [sent-266, score-0.257]
44 The two terms in the functional are the average energy, E(η; Q) = ∫_0^T [ ∑_x µ_x(t) q_{x,x} + ∑_{y ≠ x} γ_{x,y}(t) ln q_{x,y} ] dt, and the entropy, H(η) = ∫_0^T ∑_x ∑_{y ≠ x} γ_{x,y}(t) [ 1 + ln µ_x(t) − ln γ_{x,y}(t) ] dt. [sent-270, score-0.463]
45 As in the homogeneous case, a trajectory σ of {X^{(t)}}_{t≥0} over a time interval [0, T] can be characterized by a finite number of transitions K, a sequence of states (x_0, x_1, …, x_K), and a sequence of transition times (t_1, …, t_K). [sent-280, score-0.279]
46 f_R(σ) = ∂^K/(∂t_1 ⋯ ∂t_K) Pr(x_0, …, x_K, τ_1 ≤ t_1, …, τ_K ≤ t_K), which is given by f_R(σ) = p_{x_0}(0) · ∏_{k=0}^{K−1} exp( ∫_{t_k}^{t_{k+1}} r_{x_k,x_k}(t) dt ) r_{x_k,x_{k+1}}(t_{k+1}) · exp( ∫_{t_K}^{t_{K+1}} r_{x_K,x_K}(t) dt ).
47 The KL-divergence between two densities that correspond to two inhomogeneous Markov processes with rate matrices R(t) and S(t) is ID(f_R || f_S) = ∫_Σ f_R(σ) ln [ f_R(σ) / f_S(σ) ] dσ. [sent-311, score-0.408]
48 The density function f_Q(σ, e) of the posterior distribution P_Q(· | e) satisfies f_S(σ) = f_Q(σ, e) / P_Q(e), where S(t) is the time-dependent rate matrix that corresponds to the posterior process. [sent-318, score-0.301]
49 Manipulating (15), we get ID(f_η || f_S) = ∫_Σ f_η(σ) ln f_η(σ) dσ − ∫_Σ f_η(σ) ln f_S(σ) dσ ≡ E_{f_η}[ln f_η(σ)] − E_{f_η}[ln f_S(σ)]. [sent-319, score-0.27]
50 Replacing ln f_S(σ) by ln f_Q(σ, e) − ln P_Q(e) and applying simple arithmetic operations gives ln P_Q(e) = E_{f_η}[ln f_Q(σ, e)] − E_{f_η}[ln f_η(σ)] + ID(f_η || f_S). [sent-320, score-0.54]
51 Applying these lemmas with ψ(x, t) = q_{x,x} and ψ(x, y, t) = 1{x ≠ y} · ln q_{x,y} gives E_{f_η}[ln f_Q(σ, e)] = ∫_0^T [ ∑_x µ_x(t) q_{x,x} + ∑_{y ≠ x} γ_{x,y}(t) ln q_{x,y} ] dt. [sent-333, score-0.464]
52 y=x Similarly, setting ψ(x,t) = rx,x (t) and ψ(x, y,t) = 1 x=y · ln rx,y (t), where R(t) is defined in Equation (11), we obtain T −E fη [ln fη (σ, e)] = − 0 ∑ x µx (t) γx,y (t) γx,x (t) dt = H (η) . [sent-334, score-0.329]
53 Assuming that Q is defined by a CTBN, and that η is a factored density set, we can rewrite E(η; Q) = ∑_i ∫_0^T [ ∑_{x_i} µ^i_{x_i}(t) E_{µ^{\i}(t)}[ q_{x_i,x_i | U_i} ] + ∑_{x_i} ∑_{y_i ≠ x_i} γ^i_{x_i,y_i}(t) E_{µ^{\i}(t)}[ ln q_{x_i,y_i | U_i} ] ] dt (see derivations in Appendix D). [sent-346, score-1.03]
54 Note that terms such as E_{µ^{\i}(t)}[ q_{x_i,x_i | U_i} ] involve only µ^j(t) for j ∈ Pa_i, because E_{µ^{\i}(t)}[ f(U_i) ] = ∑_{u_i} µ_{u_i}(t) f(u_i). [sent-348, score-0.241]
55 The Lagrangian is a functional of the functions µ^i_{x_i}(t), γ^i_{x_i,y_i}(t) and λ^i_{x_i}(t), and takes the following form: L = F̃(η; Q) − ∑_{i=1}^D ∫_0^T ∑_{x_i} λ^i_{x_i}(t) [ (d/dt) µ^i_{x_i}(t) − ∑_{y_i} γ^i_{y_i,x_i}(t) ] dt.
56 A necessary condition for the optimality of a density set η is the existence of λ such that (η, λ) is a stationary point of the Lagrangian.
57 In these equations we use the following shorthand notations for the averaged rates: q̄^i_{x_i,x_i}(t) = E_{µ^{\i}(t)}[ q^{i|Pa_i}_{x_i,x_i | U_i} ] and q̄^i_{x_i,x_i | x_j}(t) = E_{µ^{\i}(t)}[ q^{i|Pa_i}_{x_i,x_i | U_i} | x_j ], where µ^{\i}(t) is the product distribution of µ^1(t), …, µ^D(t), excluding component i.
58 Similarly, we have the following shorthand notations for the geometrically-averaged rates: q̃^i_{x_i,y_i}(t) = exp( E_{µ^{\i}(t)}[ ln q^{i|Pa_i}_{x_i,y_i | U_i} ] ) and q̃^i_{x_i,y_i | x_j}(t) = exp( E_{µ^{\i}(t)}[ ln q^{i|Pa_i}_{x_i,y_i | U_i} | x_j ] ).
59 Thus, the system of equations to be solved is (d/dt) µ_x(t) = ∑_{y ≠ x} ( γ_{y,x}(t) − γ_{x,y}(t) ) and (d/dt) ρ_x(t) = −∑_y q_{x,y} ρ_y(t), along with the algebraic equation ρ_x(t) γ_{x,y}(t) = µ_x(t) q_{x,y} ρ_y(t), y ≠ x.
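For a single chain with point evidence, this system can be integrated directly: ρ backward in time from ρ_x(T) = 1{x = e_T}, then µ forward using γ from the algebraic relation. A minimal forward-Euler sketch; the rate matrix, grid size, and evidence are made up for illustration, and a production implementation would use an adaptive ODE solver as the paper suggests:

```python
import numpy as np

Q = np.array([[-1.0, 1.0],
              [2.0, -2.0]])   # hypothetical 2-state rate matrix
T, e_T, n = 1.0, 1, 2000
h = T / n

# Backward pass: d rho/dt = -Q rho, with rho_x(T) = 1{x = e_T}.
rho = np.zeros((n + 1, 2))
rho[n] = np.eye(2)[e_T]
for k in range(n, 0, -1):
    rho[k - 1] = rho[k] + h * (Q @ rho[k])

# Forward pass: d mu_x/dt = sum_{y != x} (gamma_yx - gamma_xy), with
# gamma_xy(t) = mu_x(t) q_xy rho_y(t) / rho_x(t).
mu = np.zeros((n + 1, 2))
mu[0] = np.array([1.0, 0.0])
for k in range(n):
    gamma = mu[k][:, None] * Q * rho[k][None, :] / rho[k][:, None]
    np.fill_diagonal(gamma, 0.0)
    gamma = np.maximum(gamma, 0.0)
    mu[k + 1] = mu[k] + h * (gamma.sum(axis=0) - gamma.sum(axis=1))
```

The forward update conserves probability mass (inflows equal outflows in total), and at t = T the marginal µ concentrates on the observed state e_T, as the posterior should, even though the rate representation of the same process diverges there.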
60 Using the definition of q (Equation 1), rearranging, dividing by h and taking the limit h → 0 gives (d/dt) Pr(e_T | X^{(t)} = x) = −∑_y q_{x,y} · Pr(e_T | X^{(t)} = y), which is identical to the differential equation for ρ. [sent-412, score-0.274]
61 Equation (20) suggests that the instantaneous rate combines the original rate with the relative likelihood of the evidence at T given y and x. [sent-418, score-0.256]
62 Conversely, if y is unlikely to lead to the evidence, the rate of transitions to it is lower. [sent-420, score-0.253]
63 Since the model is symmetric, these trajectories are either ones in which both components are most of the time in state −1, or ones where both are most of the time in state 1 (Figure 3(a)). [sent-425, score-0.246]
64 This example also allows us to examine the implications of modeling the posterior by inhomogeneous Markov processes. [sent-431, score-0.241]
65 As expected in a dynamic process, we can see in Figure 3(d) that the inhomogeneous transition rates are very erratic. [sent-445, score-0.377]
66 This can be interpreted as putting most of the weight of the distribution on trajectories which transition from state -1 to 1 at that point. [sent-447, score-0.302]
67 where α and β are defined by the right-hand side of the differential equations (16). [sent-452, score-0.275]
68 As the free energy functional is bounded above by the log-probability of the evidence, this procedure will always converge, and the rate of increase of the free energy can be used to test for convergence. [sent-462, score-0.332]
69 To do so, we create a fictional rate matrix Q̃^i for each component and initialize µ^i to be the posterior of the process given the evidence e_{i,0} and e_{i,T}. [sent-470, score-0.312]
70 This setting typically leads to a posterior Markov process in which the instantaneous rates used by Opper and Sanguinetti diverge toward the end point: the rates of transition into the observed state go to infinity, leading to numerical problems near the end points. [sent-508, score-0.537]
71 In the case of a continuous-time Markov process, the sufficient statistics are Tx , the time spent in state x, and Mx,y , the number of transitions from state x to y. [sent-511, score-0.244]
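These statistics can be read off a trajectory (as in Figure 1) in a single pass over its transition times. In the sketch below the trajectory encoding, the function name, and the specific transition times are illustrative, not taken from the paper:

```python
from collections import defaultdict

def sufficient_stats(trajectory, T):
    # trajectory: list of (state, entry_time) pairs over [0, T].
    # Returns residence times T_x and transition counts M_xy.
    Tx = defaultdict(float)
    Mxy = defaultdict(int)
    for k, (state, t_in) in enumerate(trajectory):
        t_out = trajectory[k + 1][1] if k + 1 < len(trajectory) else T
        Tx[state] += t_out - t_in          # time spent in this visit
        if k + 1 < len(trajectory):
            Mxy[(state, trajectory[k + 1][0])] += 1  # observed transition
    return dict(Tx), dict(Mxy)

# A trajectory shaped like Figure 1 (s0 -> s2 -> s1 -> s2), with
# made-up transition times t1 = 1.0, t2 = 2.5, t3 = 3.0 and T = 4.0.
stats_T, stats_M = sufficient_stats(
    [("s0", 0.0), ("s2", 1.0), ("s1", 2.5), ("s2", 3.0)], T=4.0)
```

For a CTBN the same pass is done per component, with the residence times and transition counts further indexed by the state u_i of the component's parents, as in the expected sufficient statistics of Nodelman et al.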
72 Thus, our marginal density sets η provide what we consider a natural formulation for variational approaches to continuous-time Markov processes. [sent-516, score-0.413]
73 Specifically, we use Ising chains with the parameterization introduced in Example 1, where we explore regimes defined by the degree of coupling between the components (the parameter β) and the rate of transitions (the parameter τ). [sent-525, score-0.253]
74 In models with few transitions (small τ), most of the mass of the posterior is concentrated on a few canonical “types” of trajectories that can be captured by the approximation (as in Example 3). [sent-536, score-0.258]
75 At high transition rates, the components tend to transition often, and in a coordinated manner, which leads to a posterior that is hard to approximate by a product distribution. [sent-537, score-0.492]
76 Evaluation on Branching Processes The above-mentioned experimental results indicate that our approximation is accurate when reasoning about weakly-coupled components, or about time intervals involving few transitions (low transition rates). [sent-545, score-0.283]
77 Figure 4: (a) Relative error as a function of the coupling parameter β (x-axis) and transition rates τ (y-axis) for an 8-component Ising chain. [sent-548, score-0.272]
78 To gain intuition on this process we return to the discrete case, where our branching process can be viewed as a branching of the Dynamic Bayesian Network from one branch to two separate branches at the vertex, as in Figure 8(a). [sent-585, score-0.347]
79 Each point is an expected value of: (a) residence time in a specific state of a component and its parents; (b) number of transitions between two states. [sent-626, score-0.315]
80 In this model, the transition rate between two sequences that differ in a single nucleotide depends on the difference between their folding energies. [sent-629, score-0.289]
81 This equation implies that transition rates increase monotonically as the folding energy decreases. [sent-632, score-0.331]
82 Discussion In this paper we formulate a general variational principle for continuous-time Markov processes (by reformulating and extending the one proposed by Opper and Sanguinetti, 2007), and use it to derive an efficient procedure for inference in CTBNs. [sent-643, score-0.264]
83 We show how it is extended to perform inference on phylogenetic trees, where the posterior distribution is directly affected by several evidence points, and show that it provides fairly accurate answers in the context of a real application (Figure 12). [sent-651, score-0.272]
84 Using this representation we reformulate and extend the variational principle proposed by Opper and Sanguinetti (2007), which incorporates a different inhomogeneous representation. [sent-653, score-0.259]
85 We believe that even in cases where evidence is indirect and noisy, the marginal density representation should comprise smoother functions than posterior rates. [sent-661, score-0.297]
86 In such cases, the posterior transition rate from x into a state that better explains the observation can become arbitrarily large. [sent-663, score-0.383]
87 Plugging Equation (7) into Equation (6), the transition probability of the DBN is P_h(X^{(t_{k+1})} = y | X^{(t_k)} = x) = ∏_i ( 1{x_i = y_i} + q^{i|Pa_i}_{x_i,y_i|u_i} · h ).
88 Expanding the product gives P_h(X^{(t_{k+1})} = y | X^{(t_k)} = x) = ∏_i 1{x_i = y_i} + h ∑_i q^{i|Pa_i}_{x_i,y_i|u_i} ∏_{j ≠ i} 1{x_j = y_j} + o(h). [sent-684, score-0.292]
89 We can use these rates with the initial value µ_x(0) to construct the Markov process P_η from the forward master equation (d/dt) P_η(X^{(t)} = x) = ∑_y P_η(X^{(t)} = y) r_{y,x}(t), with P_η(X^{(0)} = x) = µ_x(0). [sent-691, score-0.357]
90 2 Expectations of Functions of Transitions (Proof of Lemma 5). Proof: Given a trajectory σ, there exists a small enough h > 0 such that for every transition time t_k and every t ∈ (t_k − h, t_k) we have σ(t) = x_{k−1} and σ(t + h) = x_k. [sent-702, score-0.264]
91 Proof of the Factored Representation of the Energy Functional. Proof: We begin with the definition of the average energy, E(η; Q) = ∫_0^T [ ∑_x µ_x(t) q_{x,x} + ∑_{y ≠ x} γ_{x,y}(t) ln q_{x,y} ] dt = ∫_0^T [ ∑_x µ_x(t) q_{x,x} + ∑_i ∑_{x_i} ∑_{y_i ≠ x_i} γ^i_{x_i,y_i}(t) µ^{\i}(t) ln q_{x_i,y_i} ] dt.
92 Next we simply write the previous sum as an expectation over X \ i: E(η; Q) = ∑_i ∫_0^T [ ∑_{x_i} µ^i_{x_i}(t) E_{µ^{\i}(t)}[ q_{x_i,x_i | U_i} ] + ∑_{x_i} ∑_{y_i ≠ x_i} γ^i_{x_i,y_i}(t) E_{µ^{\i}(t)}[ ln q_{x_i,y_i | U_i} ] ] dt, which concludes the proof.
93 For y(t) to be a stationary point, it must satisfy the Euler-Lagrange equations (Gelfand and Fomin, 1963): ∂f/∂y (t, y(t), y′(t)) − (d/dt) [ ∂f/∂y′ (t, y(t), y′(t)) ] = 0.
94 Therefore the derivative of the sum over j ≠ i of the energy terms is ψ^i_{x_i}(t) ≡ ∑_{j ∈ Children_i} [ ∑_{x_j} µ^j_{x_j}(t) q̄^j_{x_j,x_j | x_i}(t) + ∑_{x_j ≠ y_j} γ^j_{x_j,y_j}(t) ln q̃^j_{x_j,y_j | x_i}(t) ].
95 Additionally, the derivative of the energy term for j = i is q̄^i_{x_i,x_i}(t) ≡ E_{µ^{\i}(t)}[ q_{x_i,x_i | U_i} ].
96 Finally, the derivative of f with respect to (d/dt) µ^i_{x_i}(t) is −λ^i_{x_i}(t).
97 Differentiating with respect to γ^i_{x_i,y_i}(t) gives ln µ^i_{x_i}(t) + ln q̃^i_{x_i,y_i}(t) − ln γ^i_{x_i,y_i}(t) + λ^i_{y_i}(t) − λ^i_{x_i}(t) = 0.
98 Using this result and the definition of γ^i_{x_i,x_i} we have γ^i_{x_i,x_i}(t) = −∑_{y_i ≠ x_i} γ^i_{x_i,y_i}(t) = −µ^i_{x_i}(t) ∑_{y_i ≠ x_i} q̃^i_{x_i,y_i}(t) ρ^i_{y_i}(t) / ρ^i_{x_i}(t). Plugging this equality into (24) and using the identity (d/dt) [ λ^i_{x_i}(t) ρ^i_{x_i}(t) ] = ρ^i_{x_i}(t) (d/dt) λ^i_{x_i}(t) + λ^i_{x_i}(t) (d/dt) ρ^i_{x_i}(t), we obtain
99 (d/dt) ρ^i_{x_i}(t) = −ρ^i_{x_i}(t) [ q̄^i_{x_i,x_i}(t) + ψ^i_{x_i}(t) ] − ∑_{y_i ≠ x_i} q̃^i_{x_i,y_i}(t) ρ^i_{y_i}(t).
100 Thus the stationary point of the Lagrangian matches the updates of (16–17).
wordName wordTfidf (topN-words)
[('tk', 0.257), ('qxi', 0.241), ('hay', 0.206), ('riedman', 0.206), ('upferman', 0.206), ('dt', 0.194), ('ohn', 0.176), ('transition', 0.175), ('ean', 0.159), ('ield', 0.159), ('ontinuous', 0.159), ('inhomogeneous', 0.15), ('pr', 0.145), ('ln', 0.135), ('nodelman', 0.134), ('sanguinetti', 0.13), ('pproximation', 0.129), ('etworks', 0.129), ('ctbn', 0.125), ('ime', 0.122), ('ctbns', 0.117), ('opper', 0.109), ('variational', 0.109), ('transitions', 0.108), ('markov', 0.104), ('branching', 0.101), ('pq', 0.098), ('evidence', 0.096), ('energy', 0.096), ('ui', 0.093), ('fq', 0.092), ('posterior', 0.091), ('rna', 0.087), ('ph', 0.085), ('ising', 0.081), ('qxii', 0.076), ('processes', 0.074), ('qx', 0.072), ('branch', 0.071), ('homogeneous', 0.07), ('density', 0.07), ('state', 0.068), ('fs', 0.067), ('pa', 0.067), ('pai', 0.067), ('folding', 0.065), ('evolution', 0.065), ('bayesian', 0.063), ('instantaneous', 0.062), ('trajectories', 0.059), ('mei', 0.058), ('functional', 0.058), ('fr', 0.057), ('childreni', 0.054), ('mef', 0.054), ('dbn', 0.054), ('integration', 0.054), ('propagation', 0.054), ('ode', 0.054), ('yi', 0.053), ('ix', 0.053), ('rates', 0.052), ('components', 0.051), ('xi', 0.051), ('trajectory', 0.051), ('interval', 0.05), ('shelton', 0.05), ('uai', 0.05), ('rate', 0.049), ('inference', 0.048), ('stationary', 0.047), ('factored', 0.047), ('qi', 0.046), ('coupling', 0.045), ('gibbs', 0.044), ('efold', 0.043), ('nucleotides', 0.043), ('expectations', 0.043), ('differential', 0.041), ('equations', 0.04), ('marginal', 0.04), ('component', 0.039), ('equation', 0.039), ('interacting', 0.039), ('xk', 0.038), ('process', 0.037), ('phylogenetic', 0.037), ('eld', 0.036), ('lagrangian', 0.035), ('forward', 0.035), ('parents', 0.035), ('dynamics', 0.035), ('marginals', 0.034), ('procedure', 0.033), ('ctmp', 0.033), ('ido', 0.033), ('residence', 0.033), ('ruttor', 0.033), ('rxk', 0.033), ('simma', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks are a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation and an efficient implementation that exploits the wide range of highly optimized ordinary differential equation (ODE) solvers; we experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
2 0.32695696 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
Author: Yu Fan, Jing Xu, Christian R. Shelton
Abstract: A continuous time Bayesian network (CTBN) uses a structured representation to describe a dynamic system with a finite number of states which evolves in continuous time. Exact inference in a CTBN is often intractable as the state space of the dynamic system grows exponentially with the number of variables. In this paper, we first present an approximate inference algorithm based on importance sampling. We then extend it to continuous-time particle filtering and smoothing algorithms. These three algorithms can estimate the expectation of any function of a trajectory, conditioned on any evidence set constraining the values of subsets of the variables over subsets of the time line. We present experimental results on both synthetic networks and a network learned from a real data set on people’s life history events. We show the accuracy as well as the time efficiency of our algorithms, and compare them to other approximate algorithms: expectation propagation and Gibbs sampling. Keywords: continuous time Bayesian networks, importance sampling, approximate inference, filtering, smoothing
3 0.21169358 28 jmlr-2010-Continuous Time Bayesian Network Reasoning and Learning Engine
Author: Christian R. Shelton, Yu Fan, William Lam, Joon Lee, Jing Xu
Abstract: We present a continuous time Bayesian network reasoning and learning engine (CTBN-RLE). A continuous time Bayesian network (CTBN) provides a compact (factored) description of a continuoustime Markov process. This software provides libraries and programs for most of the algorithms developed for CTBNs. For learning, CTBN-RLE implements structure and parameter learning for both complete and partial data. For inference, it implements exact inference and Gibbs and importance sampling approximate inference for any type of evidence pattern. Additionally, the library supplies visualization methods for graphically displaying CTBNs or trajectories of evidence. Keywords: continuous time Bayesian networks, C++, open source software
4 0.13985273 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
Author: Joshua W. Robinson, Alexander J. Hartemink
Abstract: Learning dynamic Bayesian network structures provides a principled mechanism for identifying conditional dependencies in time-series data. An important assumption of traditional DBN structure learning is that the data are generated by a stationary process, an assumption that is not true in many important settings. In this paper, we introduce a new class of graphical model called a nonstationary dynamic Bayesian network, in which the conditional dependence structure of the underlying data-generation process is permitted to change over time. Non-stationary dynamic Bayesian networks represent a new framework for studying problems in which the structure of a network is evolving over time. Some examples of evolving networks are transcriptional regulatory networks during an organism’s development, neural pathways during learning, and traffic patterns during the day. We define the non-stationary DBN model, present an MCMC sampling algorithm for learning the structure of the model from time-series data under different assumptions, and demonstrate the effectiveness of the algorithm on both simulated and biological data. Keywords: Bayesian networks, graphical models, model selection, structure learning, Monte Carlo methods
5 0.12142134 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. 
The analysis of co-clustering easily
6 0.097429402 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
7 0.092723139 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
8 0.088997163 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
9 0.083644204 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
10 0.08330825 39 jmlr-2010-FastInf: An Efficient Approximate Inference Library
11 0.082283273 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
12 0.079895809 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
13 0.079193316 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
14 0.077164397 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
15 0.066840217 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
16 0.061628066 61 jmlr-2010-Learning From Crowds
17 0.059552468 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
18 0.055989977 22 jmlr-2010-Classification Using Geometric Level Sets
19 0.05586892 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
20 0.053996116 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
topicId topicWeight
[(0, -0.293), (1, 0.06), (2, -0.217), (3, -0.492), (4, 0.007), (5, 0.082), (6, -0.043), (7, 0.118), (8, 0.027), (9, -0.155), (10, -0.014), (11, -0.015), (12, 0.014), (13, 0.033), (14, 0.003), (15, -0.044), (16, 0.018), (17, -0.022), (18, 0.022), (19, -0.054), (20, 0.011), (21, -0.026), (22, -0.095), (23, 0.013), (24, 0.024), (25, -0.009), (26, 0.016), (27, 0.016), (28, -0.001), (29, 0.042), (30, 0.046), (31, -0.022), (32, -0.041), (33, 0.071), (34, 0.019), (35, -0.129), (36, 0.013), (37, 0.01), (38, 0.006), (39, -0.003), (40, -0.051), (41, 0.002), (42, -0.006), (43, -0.002), (44, -0.081), (45, -0.055), (46, -0.004), (47, -0.017), (48, 0.006), (49, -0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.95302421 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
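The abstract's point about exploiting optimized ODE solvers can be illustrated on the method's simplest building block. The sketch below is not the paper's implementation: for a single two-state Markov process the marginal distribution obeys the forward master equation dμ/dt = μQ, which an off-the-shelf solver integrates; the mean field approximation solves a coupled system of such equations, one inhomogeneous process per network component. The rate matrix here is a made-up example.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Rate matrix of a single two-state process; rows sum to zero,
# off-diagonal entries are transition rates between the states.
Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])

def master_equation(t, mu):
    # Forward (Kolmogorov) equation for the marginal: d mu / dt = mu Q
    return mu @ Q

# Integrate from a deterministic start in state 0 up to t = 10.
sol = solve_ivp(master_equation, (0.0, 10.0), y0=[1.0, 0.0],
                rtol=1e-8, atol=1e-10)
mu_final = sol.y[:, -1]
# mu_final approaches the stationary distribution [2/3, 1/3]
```

Solving for πQ = 0 gives the stationary distribution [2/3, 1/3] for this Q, which the integrated marginal reaches well before t = 10 since the relaxation rate is 3.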
2 0.8669982 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
Author: Yu Fan, Jing Xu, Christian R. Shelton
Abstract: A continuous time Bayesian network (CTBN) uses a structured representation to describe a dynamic system with a finite number of states which evolves in continuous time. Exact inference in a CTBN is often intractable as the state space of the dynamic system grows exponentially with the number of variables. In this paper, we first present an approximate inference algorithm based on importance sampling. We then extend it to continuous-time particle filtering and smoothing algorithms. These three algorithms can estimate the expectation of any function of a trajectory, conditioned on any evidence set constraining the values of subsets of the variables over subsets of the time line. We present experimental results on both synthetic networks and a network learned from a real data set on people’s life history events. We show the accuracy as well as the time efficiency of our algorithms, and compare them to other approximate algorithms: expectation propagation and Gibbs sampling. Keywords: continuous time Bayesian networks, importance sampling, approximate inference, filtering, smoothing
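The core idea can be sketched on a single two-state chain (the rates and the noisy-observation model below are made-up toys, not the paper's networks): trajectories are forward-sampled from the prior dynamics, and each sample is reweighted by the likelihood of the evidence, giving a self-normalized importance-sampling estimate of a posterior expectation.

```python
import random

random.seed(0)
rates = {0: 1.0, 1: 2.0}   # leaving rate of each state of a two-state chain

def sample_endpoint(t_end, state=0):
    """Forward (Gillespie-style) sampling; returns the state at t_end."""
    t = 0.0
    while True:
        t += random.expovariate(rates[state])
        if t >= t_end:
            return state
        state = 1 - state

def likelihood(state, report=1, eps=0.1):
    # Evidence model: a noisy sensor reports `report` at t_end and is
    # wrong with probability eps.
    return 1.0 - eps if state == report else eps

def posterior_prob_state1(n=20000, t_end=5.0):
    """Self-normalized importance sampling with the prior as proposal."""
    num = den = 0.0
    for _ in range(n):
        s = sample_endpoint(t_end)
        w = likelihood(s)          # importance weight = evidence likelihood
        num += w * s
        den += w
    return num / den

# Exact check: the prior at t_end is ~[2/3, 1/3] (stationary), so
# P(state=1 | report=1) = (1/3)(0.9) / ((1/3)(0.9) + (2/3)(0.1)) = 0.818...
```

The paper's proposals are more sophisticated (they push samples toward the evidence rather than using the prior), but the weighting-and-normalizing structure is the same.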
3 0.77524418 28 jmlr-2010-Continuous Time Bayesian Network Reasoning and Learning Engine
Author: Christian R. Shelton, Yu Fan, William Lam, Joon Lee, Jing Xu
Abstract: We present a continuous time Bayesian network reasoning and learning engine (CTBN-RLE). A continuous time Bayesian network (CTBN) provides a compact (factored) description of a continuous-time Markov process. This software provides libraries and programs for most of the algorithms developed for CTBNs. For learning, CTBN-RLE implements structure and parameter learning for both complete and partial data. For inference, it implements exact inference and Gibbs and importance sampling approximate inference for any type of evidence pattern. Additionally, the library supplies visualization methods for graphically displaying CTBNs or trajectories of evidence. Keywords: continuous time Bayesian networks, C++, open source software
4 0.6015622 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
Author: Joshua W. Robinson, Alexander J. Hartemink
Abstract: Learning dynamic Bayesian network structures provides a principled mechanism for identifying conditional dependencies in time-series data. An important assumption of traditional DBN structure learning is that the data are generated by a stationary process, an assumption that is not true in many important settings. In this paper, we introduce a new class of graphical model called a non-stationary dynamic Bayesian network, in which the conditional dependence structure of the underlying data-generation process is permitted to change over time. Non-stationary dynamic Bayesian networks represent a new framework for studying problems in which the structure of a network is evolving over time. Some examples of evolving networks are transcriptional regulatory networks during an organism’s development, neural pathways during learning, and traffic patterns during the day. We define the non-stationary DBN model, present an MCMC sampling algorithm for learning the structure of the model from time-series data under different assumptions, and demonstrate the effectiveness of the algorithm on both simulated and biological data. Keywords: Bayesian networks, graphical models, model selection, structure learning, Monte Carlo methods
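The structure-sampling idea can be sketched as a toy Metropolis-Hastings chain over edge sets. Everything below is illustrative: the `score` function is a stand-in for the marginal likelihood a real non-stationary DBN sampler would compute from time-series data, and the "true" edge set is invented for the demo.

```python
import math
import random

TARGET = frozenset({(0, 1), (1, 2)})   # hypothetical "true" edge set

def score(edges):
    # Stand-in log-score, sharply peaked at TARGET; a real sampler would
    # use a marginal likelihood of the data under the candidate structure.
    return -5.0 * len(edges.symmetric_difference(TARGET))

def structure_mcmc(n_nodes=3, n_iter=5000, seed=0):
    """Metropolis-Hastings over directed edge sets via single-edge flips."""
    rng = random.Random(seed)
    candidates = [(i, j) for i in range(n_nodes)
                  for j in range(n_nodes) if i != j]
    edges = frozenset()
    for _ in range(n_iter):
        e = rng.choice(candidates)
        proposal = edges - {e} if e in edges else edges | {e}
        delta = score(proposal) - score(edges)
        # Single-edge flips are a symmetric proposal, so the MH
        # acceptance ratio reduces to the score ratio.
        if delta >= 0 or rng.random() < math.exp(delta):
            edges = proposal
    return edges
```

After mixing, the chain concentrates on edge sets near the score's peak; the paper's sampler additionally moves over the change points at which the structure is allowed to switch.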
5 0.3937476 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order to estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather than to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
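A minimal illustration of the single-time-scale idea (a toy discounted 2-state, 2-action MDP of my own construction, not the paper's average-reward setting): the tabular critic and the softmax actor are both driven by the same TD error δ with the same step size α.

```python
import math
import random

def actor_critic(n_steps=20000, alpha=0.05, gamma=0.9, seed=0):
    """Single-time-scale actor-critic on a toy 2-state, 2-action MDP."""
    rng = random.Random(seed)
    V = [0.0, 0.0]        # critic: tabular state values
    theta = [0.0, 0.0]    # actor: per-state logit of P(a=1 | s)
    s = 0
    for _ in range(n_steps):
        p1 = 1.0 / (1.0 + math.exp(-theta[s]))
        a = 1 if rng.random() < p1 else 0
        r = 1.0 if a == 1 else 0.0     # action 1 is uniformly better
        s2 = rng.randrange(2)          # toy dynamics: uniform next state
        delta = r + gamma * V[s2] - V[s]      # one shared TD signal ...
        V[s] += alpha * delta                 # ... drives the critic
        theta[s] += alpha * delta * (a - p1)  # ... and the actor alike
        s = s2
    return theta, V

theta, V = actor_critic()
# theta[s] grows positive: the policy learns to prefer action 1
```

Since action 1 always yields the higher reward, the shared δ pushes both logits positive and the value estimates toward r/(1 − γ), which is the qualitative behavior the convergence result formalizes.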
6 0.36407679 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
7 0.36294684 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
8 0.35276923 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
9 0.34058604 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
10 0.33825281 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
11 0.33557293 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
12 0.33347458 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
13 0.32443658 39 jmlr-2010-FastInf: An Efficient Approximate Inference Library
14 0.29515937 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
15 0.28106481 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
16 0.26390898 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
17 0.26154685 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
18 0.25924936 63 jmlr-2010-Learning Instance-Specific Predictive Models
19 0.25517866 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
20 0.25480169 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
topicId topicWeight
[(3, 0.016), (4, 0.016), (8, 0.012), (21, 0.015), (24, 0.011), (32, 0.046), (33, 0.01), (36, 0.518), (37, 0.041), (75, 0.16), (85, 0.04), (96, 0.037)]
simIndex simValue paperId paperTitle
1 0.92717999 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
Author: Joshua W. Robinson, Alexander J. Hartemink
Abstract: Learning dynamic Bayesian network structures provides a principled mechanism for identifying conditional dependencies in time-series data. An important assumption of traditional DBN structure learning is that the data are generated by a stationary process, an assumption that is not true in many important settings. In this paper, we introduce a new class of graphical model called a non-stationary dynamic Bayesian network, in which the conditional dependence structure of the underlying data-generation process is permitted to change over time. Non-stationary dynamic Bayesian networks represent a new framework for studying problems in which the structure of a network is evolving over time. Some examples of evolving networks are transcriptional regulatory networks during an organism’s development, neural pathways during learning, and traffic patterns during the day. We define the non-stationary DBN model, present an MCMC sampling algorithm for learning the structure of the model from time-series data under different assumptions, and demonstrate the effectiveness of the algorithm on both simulated and biological data. Keywords: Bayesian networks, graphical models, model selection, structure learning, Monte Carlo methods
same-paper 2 0.89050233 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
3 0.58622664 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
Author: Yu Fan, Jing Xu, Christian R. Shelton
Abstract: A continuous time Bayesian network (CTBN) uses a structured representation to describe a dynamic system with a finite number of states which evolves in continuous time. Exact inference in a CTBN is often intractable as the state space of the dynamic system grows exponentially with the number of variables. In this paper, we first present an approximate inference algorithm based on importance sampling. We then extend it to continuous-time particle filtering and smoothing algorithms. These three algorithms can estimate the expectation of any function of a trajectory, conditioned on any evidence set constraining the values of subsets of the variables over subsets of the time line. We present experimental results on both synthetic networks and a network learned from a real data set on people’s life history events. We show the accuracy as well as the time efficiency of our algorithms, and compare them to other approximate algorithms: expectation propagation and Gibbs sampling. Keywords: continuous time Bayesian networks, importance sampling, approximate inference, filtering, smoothing
4 0.52215874 56 jmlr-2010-Introduction to Causal Inference
Author: Peter Spirtes
Abstract: The goal of many sciences is to understand the mechanisms by which variables came to take on the values they have (that is, to find a generative model), and to predict what the values of those variables would be if the naturally occurring mechanisms were subject to outside manipulations. The past 30 years have seen a number of conceptual developments that are partial solutions to the problem of causal inference from observational sample data or a mixture of observational sample and experimental data, particularly in the area of graphical causal modeling. However, in many domains, problems such as the large numbers of variables, small sample sizes, and possible presence of unmeasured causes remain serious impediments to practical applications of these developments. The articles in the Special Topic on Causality address these and other problems in applying graphical causal modeling algorithms. This introduction to the Special Topic on Causality provides a brief introduction to graphical causal modeling, places the articles in a broader context, and describes the differences between causal inference and ordinary machine learning classification and prediction problems. Keywords: Bayesian networks, causation, causal inference
5 0.51478761 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of tree-shaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights.
The analysis of co-clustering easily
6 0.51109821 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
7 0.50788546 37 jmlr-2010-Evolving Static Representations for Task Transfer
9 0.48927224 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
10 0.48086107 69 jmlr-2010-Lp-Nested Symmetric Distributions
11 0.47995839 63 jmlr-2010-Learning Instance-Specific Predictive Models
12 0.47930202 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
13 0.47138211 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
14 0.4580009 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
15 0.45597801 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
16 0.45541129 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
17 0.45353287 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
19 0.4400939 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
20 0.43928805 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation