nips nips2012 nips2012-314 knowledge-graph by maker-knowledge-mining

314 nips-2012-Slice Normalized Dynamic Markov Logic Networks


Source: pdf

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

Reference: text


Summary: the most important sentences generated by the tf-idf model
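The summary below is produced by scoring sentences with tf-idf. A minimal sketch of that idea (the function and the smoothing choice are illustrative, not the actual mining pipeline):

```python
import math
from collections import Counter

def tfidf_scores(sentences):
    """Score each sentence by the mean tf-idf weight of its tokens.

    `sentences` is a list of token lists; returns one score per sentence.
    """
    n = len(sentences)
    # Document frequency: in how many sentences each word appears.
    df = Counter(w for s in sentences for w in set(s))
    scores = []
    for s in sentences:
        tf = Counter(s)
        # Smoothed idf keeps ubiquitous words at (near-)zero weight.
        weights = [(tf[w] / len(s)) * math.log((1 + n) / (1 + df[w]))
                   for w in s]
        scores.append(sum(weights) / len(s) if s else 0.0)
    return scores

sents = [["markov", "logic", "network"],
         ["markov", "chain"],
         ["slice", "normalized", "markov", "logic"]]
scores = tfidf_scores(sents)
```

Sentences whose words are rare across the document score higher and are kept for the summary.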

sentIndex sentText sentNum sentScore

1 Abstract Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). [sent-3, score-0.805]

2 In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. [sent-4, score-0.328]

3 This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. [sent-5, score-0.391]

4 It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. [sent-6, score-0.23]

5 We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. [sent-7, score-0.443]

6 We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. [sent-9, score-0.525]

7 It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e. [sent-10, score-0.2]

8 Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. [sent-13, score-0.443]

9 1 Introduction Markov logic [1] is a language for statistical relational learning, which employs weighted first-order logic formulas to compactly represent a Markov random field (MRF) or a conditional random field (CRF). [sent-14, score-0.965]

10 A Markov logic theory where each predicate can take an argument representing a time point is called a dynamic Markov logic network (DMLN). [sent-15, score-0.774]

11 We will focus on two-slice dynamic Markov logic networks, i. [sent-16, score-0.391]

12 DMLNs are the undirected analogue of dynamic Bayesian networks (DBN) [13] and akin to dynamic conditional random fields [19]. [sent-19, score-0.202]

13 DMLNs have been shown useful for relational inference in complex dynamic domains; for example, [17] employed DMLNs for reasoning about the movements and strategies of 14-player games of Capture the Flag. [sent-20, score-0.145]

14 The usual method for performing offline inference in a DMLN is to simply unroll it into a CRF and employ a general MLN or CRF inference algorithm. [sent-21, score-0.137]

15 We will show, however, that the standard unrolling approach has a number of undesirable properties. [sent-22, score-0.115]

16 First, as one increases the number of time points in the domain, the marginals can fluctuate, even if the observations have little or no influence on the hidden variables. [sent-24, score-0.115]

17 Second, the model can become time-inhomogeneous, even if the ground weighted formulas between the time slices originate from the same weighted first-order logic formulas. [sent-25, score-0.669]

18 In domains where there are a large number of variables within each slice, dynamic-programming-based exact inference cannot be used. [sent-27, score-0.318]

19 When the number of time steps is high and/or online inference is required, unrolling the entire sequence (perhaps repeatedly) becomes prohibitively expensive. [sent-28, score-0.167]

20 Sequential Monte Carlo methods, or particle filters, are perhaps the most popular methods for online inference in high-dimensional sequential models. [sent-33, score-0.166]

21 , the Gaussian distributions used in [11], sampling from a two-slice CRF model can become expensive, due to the need to evaluate a partition function for each particle (see Sec. [sent-36, score-0.129]

22 As a solution to all of these concerns, we propose a novel way of unrolling a Markov logic theory such that in the resulting probabilistic model a smaller CRF is embedded into a larger CRF making the clique potentials between adjacent slices normalized. [sent-38, score-0.56]

23 We call this model slice normalized dynamic Markov logic network (SN-DMLN). [sent-39, score-0.525]

24 Because of the embedded CRF and the undirected components in our proposed model, the distribution represented by a SN-DMLN cannot be compactly captured by conventional chain graph [10], DBN or CRF graph representations, as we will explain in Sec. [sent-40, score-0.127]

25 The SN-DMLN has none of the negative theoretical or practical properties outlined above, and in accuracy and/or speed of inference it matches or outperforms unrolled CRFs and the slice-by-slice approximate inference methods. [sent-42, score-0.234]

26 2 Background Probabilistic graphical models compactly represent probability distributions using a graph structure that expresses conditional independences among the variables. [sent-44, score-0.082]

27 , they model the joint distribution of the hidden variables and the observations, and during training the joint probability of the training data is maximized. [sent-47, score-0.078]

28 Hidden Markov models are the prototypical directed models used for sequential data with hidden and observable parts. [sent-48, score-0.177]

29 It has been demonstrated that for classification problems, discriminative models, which model the conditional probability of the hidden variables given the observations, can outperform generative models [12]. [sent-49, score-0.119]

30 Markov logic [1] is a first-order probabilistic language that allows one to define template features that apply to whole classes of objects at once. [sent-54, score-0.328]

31 A Markov logic network is a set of weighted first-order logic formulas and a finite set of constants C = {c1 , c2 , . [sent-55, score-0.853]

32 , c|C| } which together define a Markov network ML,C that contains a binary node for each possible grounding of each predicate (ground atom) and a binary valued feature for each grounding of each first-order logic formula. [sent-58, score-0.507]

33 We will also call the ground atoms variables (since they are random variables). [sent-59, score-0.145]

34 Ground atoms share the same weight if they are groundings of the same weighted first-order logic formula, and (1) could be expressed in terms of ni(x, y) = Σ_j fi,j(x, y). [sent-65, score-0.451]
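The distribution an MLN defines over truth assignments is Pr(x) = (1/Z) exp(Σ_i wi ni(x)), where ni(x) counts the true groundings of formula i. A brute-force sketch for a tiny ground network (the helper and the example formula are illustrative, not from the paper):

```python
import itertools
import math

def mln_distribution(n_atoms, weighted_features):
    """Brute-force the distribution of a tiny ground MLN.

    `weighted_features` is a list of (weight, feature) pairs; each
    feature maps a truth assignment (tuple of 0/1 per ground atom) to
    its count n_i(x).  Pr(x) = exp(sum_i w_i * n_i(x)) / Z.
    """
    worlds = list(itertools.product([0, 1], repeat=n_atoms))
    unnorm = {x: math.exp(sum(w * f(x) for w, f in weighted_features))
              for x in worlds}
    z = sum(unnorm.values())  # partition function
    return {x: u / z for x, u in unnorm.items()}

# One hypothetical formula "P1 and P2" with weight 1.5 over two atoms.
dist = mln_distribution(2, [(1.5, lambda x: x[0] * x[1])])
```

Enumeration is exponential in the number of ground atoms, which is exactly why the later sections care about avoiding repeated partition-function computations.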

35 Dynamic MLNs [7] are MLNs with distinguished arguments in every predicate representing the flow of time or some other sequential quantity. [sent-67, score-0.129]

36 In our setting, Yt and Xt will denote the sets of hidden and observable random variables, respectively, at time t, and Y1:t and X1:t the corresponding sets from time step 1 to t. [sent-68, score-0.091]

37 Each set can contain many variables, and we should note that their distribution will be represented compactly by weighted first-order logic formulas. [sent-69, score-0.407]

38 The formulas in the knowledge base can be partitioned into two sets. [sent-70, score-0.199]

39 The transitions part contains the formulas for which it is true that for any grounding of each formula, there is a t such that the grounding shares variables only with Yt and Yt+1 . [sent-71, score-0.344]

40 The emission part represents the formulas which connect the hidden and observable variables, i. [sent-72, score-0.287]

41 We will use P (Yt , Yt+1 ) (or P (Yt:t+1 )) and P (Yt , Xt ) to denote the product of the potentials corresponding to weighted ground formulas at time t of the transition and the observation formulas, respectively. [sent-75, score-0.344]

42 Since some ground formulas may contain only variables from Yt ( i. [sent-76, score-0.245]

43 , defined over hidden variables within the same slice), in order to count the corresponding potentials exactly once, we always include their potentials in P̃(Yt, Yt−1), and for t = 1 we have a separate P̃(Y1). [sent-78, score-0.168]

44 , when the knowledge base is unrolled into a CRF: 1. [sent-82, score-0.17]

45 As one increases the number of time points, the marginals can fluctuate, even if all the clique potentials P̃(Yi = yi, Xi = xi) in (2) are uninformative. [sent-83, score-0.48]

46 The transition probability Pr(Yi+1 | Yi) can be dependent on i, even if every P̃(Yi = yi, Xi = xi) is uninformative and we use the same weighted first-order logic formula responsible for the ground formulas covering the transitions between every i and i + 1. [sent-85, score-1.142]

47 Saying that P̃(Yi = yi, Xi = xi) is uninformative is equivalent to saying that P̃(Yi = yi, Xi = xi) is constant. [sent-90, score-0.745]

48 If for some q and r, P̃(Yi = yi, Xi = xi) = r(yi) q(xi), then q could be marginalized out and r(Yi) could be snapped to P̃(Yi, Yi−1) in (2). [sent-93, score-0.398]

49 To demonstrate Property 1, consider an unrolled MRF with the temporal domain T = {1, . [sent-94, score-0.155]

50 , T }, with only predicate P (t) (t ∈ T ) and with the weighted formulas (+∞, P (t) ⇔ P (t + 1)) (hard constraint) and (w, P (t)) (soft constraint). [sent-97, score-0.252]

51 Consequently, we have diverging marginals as T → +∞. [sent-108, score-0.105]
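The divergence can be made concrete: under the hard constraint P(t) ⇔ P(t+1), only the all-true world (total soft weight T·w) and the all-false world (weight 0) have nonzero mass, so the marginal of P(1) is a logistic function of T·w and depends on the domain size. A small sketch (the function name is illustrative):

```python
import math

def marginal_p1(T, w):
    """Marginal Pr(P(1) = true) in the MRF unrolled over T time points
    with the hard constraint P(t) <=> P(t+1) and soft formula (w, P(t)).

    The hard constraint leaves only two worlds with nonzero mass:
    all atoms true (total soft weight T * w) and all atoms false
    (total soft weight 0), so the marginal is logistic in T * w.
    """
    return math.exp(T * w) / (math.exp(T * w) + 1.0)

m5, m50 = marginal_p1(5, 0.1), marginal_p1(50, 0.1)
```

For any w > 0 the marginal creeps toward 1 as T grows, even though no observation favors P being true.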

52 , an MLN with a single first-order logic formula P (t) ∨ P (t + 1) with weight w. [sent-113, score-0.375]

53 The unrolled MRF defines a distribution where Pr(¬P(3) | ¬P(2)) = (1 + exp(w)) / (1 + 2 exp(w) + exp(2w)), which is not equal to Pr(¬P(2) | ¬P(1)) = (1 + exp(w)) / (1 + exp(w) + 2 exp(2w)) for an arbitrary choice of w. [sent-115, score-0.13]
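This time-inhomogeneity can be verified by brute-force enumeration of the unrolled MRF; the sketch below (function name illustrative, T = 3 and w = 1 chosen for the check) recomputes the two conditional probabilities:

```python
import itertools
import math

def transition_prob(T, w, i):
    """Pr(not P(i+1) | not P(i)) in the MRF unrolled from the single
    formula P(t) v P(t+1) with weight w, computed by brute force."""
    # joint[(a, b)] accumulates unnormalized mass of P(i)=a, P(i+1)=b.
    joint = {}
    for x in itertools.product([0, 1], repeat=T):
        sat = sum(x[t] or x[t + 1] for t in range(T - 1))
        key = (x[i - 1], x[i])  # atoms are 1-indexed in the text
        joint[key] = joint.get(key, 0.0) + math.exp(w * sat)
    return joint[(0, 0)] / (joint[(0, 0)] + joint[(0, 1)])

p12 = transition_prob(3, 1.0, 1)  # Pr(not P(2) | not P(1))
p23 = transition_prob(3, 1.0, 2)  # Pr(not P(3) | not P(2))
```

The two probabilities differ even though both transitions come from the same weighted first-order formula, which is exactly the time-inhomogeneity being claimed.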

54 , N is an enumeration of all the possible truth assignments within each slice, and N is the number of possible truth assignments in the slice. [sent-123, score-0.208]

55 Pr(Y1:T = y1:T) = (1 / Z(Y1:T)) ∏_{i=1}^{T−1} P(Yi = yi, Yi+1 = yi+1), where Z(Y1:T) = Σ_{y1,…,yT} ∏_{i=1}^{T−1} P(Yi = yi, Yi+1 = yi+1). [sent-130, score-0.588]

56 The issue of diverging marginals and time-inhomogeneity has not been previously recognized as a practical problem. [sent-143, score-0.105]

57 This proposition can serve as an explanation of why in practice we do not encounter diverging marginals in linear-chain-type CRFs, and why, except for a finite number of transitions, the model becomes time-homogeneous. [sent-147, score-0.192]

58 , in a hidden Markov model), following standard particle filter design, having sampled s1:t−1 ∼ Pr(Y1:t−1 = s1:t−1 | X1:t−1 = x1:t−1), one would then use s1:t−1 to sample st ∼ Pr(Yt | Y1:t−1 = s1:t−1, X1:t−1). [sent-151, score-0.179]

59 Since Pr(Y1:t = s1:t |X1:t−1 = x1:t−1 ) = Pr(Yt = st |Yt−1 = st−1 )Pr(Y1:t−1 = s1:t−1 |X1:t−1 = x1:t−1 ) (4) we do not have any difficulty performing this sampling step, and all that is left is to re-sample the collection of s1:t with importance weights Pr(Yt = st |Xt = xt ). [sent-152, score-0.191]
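The propagate-then-reweight step described here can be sketched as a bootstrap particle filter for a small discrete-state model (the transition/emission tables and the function are illustrative, not the paper's DMLN):

```python
import random

def pf_step(particles, trans, emis, obs, rng=random.Random(0)):
    """One bootstrap particle-filter step for a discrete-state model.

    particles: list of current states; trans[s] maps next-state -> prob;
    emis[s][obs] is the observation likelihood.  Each particle is moved
    through the transition model, then the set is resampled with
    importance weights proportional to the emission likelihood of obs.
    """
    def sample(dist):
        r, acc = rng.random(), 0.0
        for s, p in dist.items():
            acc += p
            if r < acc:
                return s
        return s  # numerical safety on rounding

    moved = [sample(trans[s]) for s in particles]
    weights = [emis[s][obs] for s in moved]
    total = sum(weights)
    # Resampling distribution over particle indices.
    probs = {i: w / total for i, w in enumerate(weights)}
    return [moved[sample(probs)] for _ in moved]

trans = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
emis = {0: {"a": 0.8, "b": 0.2}, 1: {"a": 0.2, "b": 0.8}}
parts = pf_step([0] * 1000, trans, emis, "a")
```

The point of the surrounding discussion is that in a two-slice CRF the transition distribution itself is expensive to sample from, which is what the slice-normalized model removes.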

60 Although several algorithms have been proposed to estimate partition functions [16, 18], estimating the partition function can significantly increase both the running time of the sampling algorithm and its approximation error. [sent-156, score-0.12]

61 , when we have three weighted formulas in the previously used toy domain, namely, w : ¬P (Yt ) ∨ ¬P (Yt+1 ), −w : P (Yt ) ∧ ¬P (Yt+1 ) and w′ : P (Yt ) ↔ ¬P (Yt+1 ), where w > 0 and w′ < 0. [sent-160, score-0.197]

62 4 Slice normalized DMLNs As we demonstrated in Section 3, the root cause of the weaknesses of an ordinarily unrolled CRF lies in the fact that P̃(Yt = yt, Yt−1 = yt−1) is unnormalized, i. [sent-162, score-0.592]

63 In that case we would directly represent Pr(Yt |Xt , Yt−1 ), hence we could implement a sequential Monte Carlo algorithm simply directly sampling st ∼ Pr(Yt |Xt = xt , Yt−1 = st−1 ) from slice to slice. [sent-166, score-0.339]

64 P(Yt = yt | Yt−1 = yt−1) is defined by a two-slice Markov logic network (CRF), which describes the state transition probabilities in a compact way. [sent-181, score-0.797]

65 If we hide the details of this nested CRF component and treat it as one potential, we could represent the distribution in (6) by regular chain graphs or CRFs; however, we would then lose the compactness that the nested CRF provides for describing the distribution. [sent-182, score-0.077]

66 Similarly, we could collapse the variables at every time slice into one and could use a DBN (or again a chain graph), but it would need exponentially more entries in its conditional probability tables. [sent-183, score-0.332]

67 Furthermore, we do not have to compute the partition function between the slices, because equation (5) shows that drawing a sample yt ∼ P̃(Yt, Yt−1 = yt−1) while keeping the value yt−1 fixed is equivalent to sampling from P(Yt | Yt−1 = yt−1), the quantity present in equation (6). [sent-185, score-0.493]
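The point of equation (5) is that sampling yt only requires normalizing over the current slice with yt−1 held fixed, never over the whole sequence. A sketch under that assumption (the potential and all names are illustrative):

```python
import math
import random

def sample_next_slice(potential, states, y_prev, rng=random.Random(1)):
    """Sample y_t from the slice-normalized transition model.

    `potential(y_t, y_prev)` is the unnormalized two-slice potential;
    with y_prev fixed, normalizing over the current slice only gives
    P(Y_t | Y_{t-1} = y_prev) -- no sequence-level partition function.
    """
    weights = [potential(y, y_prev) for y in states]
    z = sum(weights)  # per-slice normalizer, cheap for small slices
    r, acc = rng.random() * z, 0.0
    for y, w in zip(states, weights):
        acc += w
        if r < acc:
            return y
    return states[-1]  # numerical safety on rounding

# Illustrative potential: weight exp(2) for keeping the state, exp(0) otherwise.
pot = lambda y, yp: math.exp(2.0 if y == yp else 0.0)
ys = [sample_next_slice(pot, [0, 1], 0) for _ in range(2000)]
```

For slices too large to enumerate, the same per-slice conditional can instead be sampled with MCMC over the slice, still without touching a sequence-level normalizer.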

68 We will partition the weights (parameters) of our model based on whether they belong to the transition or the emission part of the model. [sent-190, score-0.112]

69 an emission parameter we (to which feature ne belongs) is: ∂Ld/∂we = Σ_{i=1}^{t} ne(yi, xi) − E_{Pr(Y|X=x)}[ Σ_{i=1}^{t} ne(Yi, xi) ], (7) which is analogous to what one would expect for CRFs. [sent-197, score-0.141]

70 However, for a transition parameter wtr (belonging to feature ntr) we get something different: ∂Ld/∂wtr = Σ_{i=1}^{t−1} ntr(yi+1, yi) − Σ_{i=1}^{t−1} E_{P(Yi+1|yi)}[ntr(Yi+1, Yi = yi)] − E_{Pr(Y|X=x)}[ Σ_{i=1}^{t−1} ntr(Yi+1, Yi) − Σ_{i=1}^{t−1} E_{P̃(Yi+1|Yi)}[ntr(Yi+1, Yi)] ]. (8) [sent-198, score-1.214]

71 While Ld is concave when the transition parameters are kept fixed, allowing the transition parameters to vary makes Ld no longer concave. [sent-204, score-0.082]

72 Note that, in the SN-DMLN model, the uniformity of P̃(Yi = yi, Xi = xi) is a stronger assumption than the independence of Xi and Yi. [sent-205, score-0.346]

73 , when the model simplifies to a Markov chain with a compactly represented transition probability distribution. [sent-209, score-0.133]

74 The rationale behind our heuristic is that if P̃(Yi = yi, Xi = xi) had truly no information content, then for α = 0 we would find the global optimum, and as we increase α we are taking into account that the observations are correlated with the hidden variables with an increasing weight. [sent-221, score-0.45]

75 5 Experiments For our experiments we extended the Probabilistic Consistency Engine (PCE) [3], a Markov logic implementation that has been used effectively in different problem domains. [sent-222, score-0.328]

76 For training, we used 10000 samples for the unrolled CRF and 100 particles and 100 samples for approximating the conditional expectations in (9) for the SN-DMLN to estimate the gradients. [sent-223, score-0.202]

77 For inference we used 10000 samples for the CRF and 10000 particles for the mixed model. [sent-224, score-0.083]

78 The hidden predicates in our knowledge base were Smokes(person, time) and Friends(person1, person2, time), and the observable was Hangout(person, group, time). [sent-227, score-0.16]

79 The goal of inference was to predict which people could potentially be friends, based on the similarity of their smoking habits, which in turn could be inferred from the groups in which the individuals hang out. [sent-228, score-0.454]

80 Initially 2 people were randomly chosen to be smokers and 2 to be non-smokers. [sent-230, score-0.109]

81 People with the same smoking habits can become friends at any time step with probability 1 − 0. [sent-231, score-0.35]

82 Every 5th time step (starting with t = 0) people hang out in groups and for each person the probability of joining one of the groups is 1 − 0. [sent-234, score-0.177]

83 05α, everyone spends time with the group reflecting their smoking habits, and with probability 0. [sent-237, score-0.173]

84 , a smoker stays a smoker and a non-smoker stays a non-smoker at the next time step with probability 1 − 0. [sent-242, score-0.15]

85 The weights of the clauses we learned using the SN-DMLN and the CRF unrolled models are in Table 1. [sent-245, score-0.157]

86 In our experiments we compared three types of inference, and measured the prediction quality for the hidden predicate Friends by assigning true to every ground atom whose marginal probability was greater than a fixed threshold. (Table: results for chain lengths 5, 10, 20, 40.) [sent-248, score-0.399]
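The evaluation thresholds marginal probabilities into hard predictions and scores them; a minimal sketch (the atom names, the threshold default, and the example marginals are illustrative, not the experiment's actual values):

```python
def f_score(marginals, truth, threshold=0.5):
    """F1 of predictions obtained by thresholding marginal probabilities.

    `marginals` maps each ground atom to its estimated probability of
    being true; `truth` maps it to 0/1.  Atoms whose marginal exceeds
    the threshold are predicted true.
    """
    pred = {a: int(p > threshold) for a, p in marginals.items()}
    tp = sum(1 for a in truth if pred[a] == 1 and truth[a] == 1)
    fp = sum(1 for a in truth if pred[a] == 1 and truth[a] == 0)
    fn = sum(1 for a in truth if pred[a] == 0 and truth[a] == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

marg = {"F(a,b,1)": 0.9, "F(a,c,1)": 0.4, "F(b,c,1)": 0.7}
truth = {"F(a,b,1)": 1, "F(a,c,1)": 0, "F(b,c,1)": 0}
f = f_score(marg, truth)
```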

87 with respect to smoking and not smoking, and from the observations we could only tell that certain pairs of people have similar or different smoking habits, but not who smokes and who does not. [sent-301, score-0.707]

88 The inference over the entire test set took approximately 6 minutes for SN and MAR in every test case, while UNR required 5, 8, 12 and 40 minutes for the different test cases. [sent-305, score-0.137]

89 The performance of SN and MAR stays the same as we increase the length of the chain while the performance of UNR degrades. [sent-309, score-0.121]

90 This can be explained by the fact that MC-SAT requires more sampling steps to maintain the same performance as the chain length increases. [sent-311, score-0.115]

91 2 that SN outperforms both UNR and MAR as the chain length increases. [sent-316, score-0.089]

92 Moreover, UNR’s performance is clearly decreasing as the length of the chain increases. [sent-317, score-0.089]

93 6 Conclusion In this paper, we explored the theoretical and practical questions of unrolling a sequential Markov logic knowledge base into different probabilistic models. [sent-318, score-0.528]

94 (Figure 2: F-score of models trained and tested on different lengths of data; (a) α = 0, (b) α = 1.) The theoretical issues arising in a CRF-based MLN unrolling are a warning that unexpected results may occur if the observations are weakly correlated with the hidden variables. [sent-319, score-0.206]

95 We demonstrated that the CRF based unrolling can be outperformed by a model that mixes directed and undirected components (the proposed model does not suffer from any of the theoretical weaknesses, nor from the label-bias problem). [sent-321, score-0.217]

96 These savings are due to the fact that we do not have to unroll a new CRF at every time step, or to estimate a partition function, which is responsible for normalizing the product of clique potentials appearing in two consecutive slices. [sent-323, score-0.168]

97 The previously used approximate inference methods in dynamic MLNs either relied on belief propagation or assumed that approximating the distribution at every time step by the product of the marginals would not cause any error. [sent-324, score-0.23]

98 Although training the mixed model might have a higher computational cost than training a conditional random field, this cost is amortized over time, since in applications inference is performed many times, while weight learning happens only once. [sent-327, score-0.093]

99 Machine reading using markov logic networks for collective probabilistic inference. [sent-339, score-0.41]

100 Maximum entropy markov models for information extraction and segmentation. [sent-384, score-0.082]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('yt', 0.433), ('logic', 0.328), ('yi', 0.294), ('crf', 0.271), ('smokes', 0.259), ('smoking', 0.173), ('riends', 0.163), ('formulas', 0.159), ('pr', 0.152), ('habits', 0.147), ('unr', 0.147), ('slice', 0.134), ('ntr', 0.13), ('unrolled', 0.13), ('mar', 0.123), ('unrolling', 0.115), ('hang', 0.101), ('dmlns', 0.098), ('mln', 0.086), ('markov', 0.082), ('hangout', 0.082), ('prt', 0.082), ('sn', 0.078), ('people', 0.076), ('mlns', 0.072), ('particle', 0.069), ('crfs', 0.066), ('wtr', 0.065), ('dynamic', 0.063), ('grounding', 0.062), ('marginals', 0.062), ('mrf', 0.062), ('ground', 0.061), ('atoms', 0.059), ('st', 0.057), ('predicate', 0.055), ('hidden', 0.053), ('inference', 0.052), ('xi', 0.052), ('xt', 0.051), ('chain', 0.051), ('ld', 0.048), ('limt', 0.047), ('formula', 0.047), ('slices', 0.045), ('sequential', 0.045), ('potentials', 0.045), ('ep', 0.044), ('domains', 0.044), ('diverging', 0.043), ('smoker', 0.043), ('conditional', 0.041), ('compactly', 0.041), ('transition', 0.041), ('directed', 0.041), ('zt', 0.04), ('base', 0.04), ('length', 0.038), ('weighted', 0.038), ('observable', 0.038), ('truth', 0.037), ('emission', 0.037), ('transitions', 0.036), ('undirected', 0.035), ('partition', 0.034), ('dmln', 0.033), ('geier', 0.033), ('pce', 0.033), ('smokers', 0.033), ('unroll', 0.033), ('dbn', 0.032), ('stays', 0.032), ('particles', 0.031), ('pedro', 0.031), ('relational', 0.03), ('friends', 0.03), ('elds', 0.029), ('climbing', 0.029), ('weaknesses', 0.029), ('predicates', 0.029), ('every', 0.029), ('uninformative', 0.028), ('minutes', 0.028), ('clique', 0.027), ('nath', 0.027), ('clauses', 0.027), ('suffer', 0.026), ('heuristic', 0.026), ('sampling', 0.026), ('could', 0.026), ('days', 0.025), ('exp', 0.025), ('saying', 0.025), ('kersting', 0.025), ('rochester', 0.025), ('domingos', 0.025), ('variables', 0.025), ('domain', 0.025), ('belief', 0.024), ('normalization', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic


2 0.38947511 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

3 0.33410904 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

4 0.22360305 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun

Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.

5 0.19592947 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

Author: Deepak Venugopal, Vibhav Gogate

Abstract: First-order probabilistic models combine the power of first-order logic, the de facto tool for handling relational structure, with probabilistic graphical models, the de facto tool for handling uncertainty. Lifted probabilistic inference algorithms for them have been the subject of much recent research. The main idea in these algorithms is to improve the accuracy and scalability of existing graphical models’ inference algorithms by exploiting symmetry in the first-order representation. In this paper, we consider blocked Gibbs sampling, an advanced MCMC scheme, and lift it to the first-order level. We propose to achieve this by partitioning the first-order atoms in the model into a set of disjoint clusters such that exact lifted inference is polynomial in each cluster given an assignment to all other atoms not in the cluster. We propose an approach for constructing the clusters and show how it can be used to trade accuracy with computational complexity in a principled manner. Our experimental evaluation shows that lifted Gibbs sampling is superior to the propositional algorithm in terms of accuracy, scalability and convergence.

6 0.17011862 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

7 0.14748119 293 nips-2012-Relax and Randomize : From Value to Algorithms

8 0.13967903 292 nips-2012-Regularized Off-Policy TD-Learning

9 0.13834348 72 nips-2012-Cocktail Party Processing via Structured Prediction

10 0.13449308 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

11 0.12351439 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

12 0.10287363 283 nips-2012-Putting Bayes to sleep

13 0.099462546 303 nips-2012-Searching for objects driven by context

14 0.09809877 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

15 0.094750471 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

16 0.086220093 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

17 0.082562342 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

18 0.079360582 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

19 0.078617901 41 nips-2012-Ancestor Sampling for Particle Gibbs

20 0.076989956 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.213), (1, -0.016), (2, 0.125), (3, 0.327), (4, 0.032), (5, -0.213), (6, -0.01), (7, -0.061), (8, 0.007), (9, 0.069), (10, -0.011), (11, -0.056), (12, 0.09), (13, 0.06), (14, 0.037), (15, -0.119), (16, 0.038), (17, -0.074), (18, 0.124), (19, -0.051), (20, 0.004), (21, 0.04), (22, -0.003), (23, 0.079), (24, 0.046), (25, -0.017), (26, 0.031), (27, -0.107), (28, -0.037), (29, 0.11), (30, 0.067), (31, 0.094), (32, -0.037), (33, 0.124), (34, -0.012), (35, -0.048), (36, 0.003), (37, -0.009), (38, -0.032), (39, -0.032), (40, 0.028), (41, -0.037), (42, -0.006), (43, 0.089), (44, -0.017), (45, 0.054), (46, 0.082), (47, -0.038), (48, -0.012), (49, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96178317 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic


2 0.79807079 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

3 0.79588127 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

4 0.74469352 283 nips-2012-Putting Bayes to sleep

Author: Dmitry Adamskiy, Manfred K. Warmuth, Wouter M. Koolen

Abstract: We consider sequential prediction algorithms that are given the predictions from a set of models as inputs. If the nature of the data is changing over time in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (kind of like a weak restart). However, what if the favored models in each segment are from a small subset, i.e. the data is likely to be predicted well by models that predicted well before? Curiously, fitting such “sparse composite models” is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. Also it is important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations can be found for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors. We build atop the “specialist” framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new intriguing efficient update that achieves a significantly better bound. 1

5 0.67290068 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

Author: Liva Ralaivola

Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1

6 0.6273964 72 nips-2012-Cocktail Party Processing via Structured Prediction

7 0.6082623 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

8 0.60663903 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

9 0.56969595 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

10 0.55434817 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

11 0.53059101 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

12 0.52424061 293 nips-2012-Relax and Randomize : From Value to Algorithms

13 0.49700364 292 nips-2012-Regularized Off-Policy TD-Learning

14 0.43635511 303 nips-2012-Searching for objects driven by context

15 0.4057503 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

16 0.38710731 11 nips-2012-A Marginalized Particle Gaussian Process Regression

17 0.38309887 136 nips-2012-Forward-Backward Activation Algorithm for Hierarchical Hidden Markov Models

18 0.38081431 339 nips-2012-The Time-Marginalized Coalescent Prior for Hierarchical Clustering

19 0.37431109 222 nips-2012-Multi-Task Averaging

20 0.36988586 41 nips-2012-Ancestor Sampling for Particle Gibbs
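The simValue column above scores each candidate's similarity to this paper under the tfidf model. The exact model behind these numbers is not shown on this page, but a tfidf-style similarity is typically a cosine between per-paper score vectors like the coordinate list at the top of this section. A minimal sketch, with made-up vectors rather than actual corpus data:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length dense score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Illustrative (made-up) per-paper feature vectors, in the same spirit as the
# coordinate list at the top of this section:
paper_a = [0.213, -0.016, 0.125, 0.327]
paper_b = [0.200, 0.000, 0.100, 0.300]
sim = cosine_similarity(paper_a, paper_b)  # near 1.0 for closely related papers
```

Scoring every candidate this way and sorting descending reproduces the shape of the list above: the paper itself near similarity 1.0, then progressively less related papers.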


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (21, 0.024), (22, 0.011), (38, 0.069), (42, 0.025), (54, 0.02), (55, 0.014), (74, 0.03), (76, 0.11), (80, 0.556), (92, 0.029)]
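The (topicId, topicWeight) pairs above are this paper's sparse LDA topic mixture; topic 80 dominates at weight 0.556. A similarity between two papers under such a model can be sketched as a cosine over the sparse mixtures. This is a sketch only — the actual similarity function used by this page is not specified, and the second paper's weights below are invented for illustration:

```python
from math import sqrt

def sparse_cosine(p, q):
    """Cosine similarity between sparse (topicId, weight) topic mixtures."""
    dp = dict(p)
    dq = dict(q)
    # Only topic ids present in both mixtures contribute to the dot product.
    dot = sum(w * dq.get(t, 0.0) for t, w in dp.items())
    norm_p = sqrt(sum(w * w for w in dp.values()))
    norm_q = sqrt(sum(w * w for w in dq.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# First mixture echoes a few entries from the list above; the second is made up.
this_paper = [(0, 0.027), (38, 0.069), (76, 0.110), (80, 0.556)]
other_paper = [(38, 0.050), (76, 0.090), (80, 0.600), (92, 0.030)]
sim = sparse_cosine(this_paper, other_paper)  # high, since both load on topic 80
```

Because the mixtures are sparse, papers sharing no topics score exactly zero, which is why the LDA-based ranking below can differ noticeably from the tfidf-based one above.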

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95474249 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks


2 0.95269507 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

Author: James Scott, Jonathan W. Pillow

Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latent-variable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1

3 0.95242345 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

4 0.94997549 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speech and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

5 0.92342782 100 nips-2012-Discriminative Learning of Sum-Product Networks

Author: Robert Gens, Pedro Domingos

Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset. 1

6 0.90680093 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

7 0.8121348 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

8 0.78091228 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

9 0.75295937 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

10 0.74280596 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

11 0.73264366 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

12 0.73126668 200 nips-2012-Local Supervised Learning through Space Partitioning

13 0.72521222 293 nips-2012-Relax and Randomize : From Value to Algorithms

14 0.71115464 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

15 0.70606214 197 nips-2012-Learning with Recursive Perceptual Representations

16 0.70383084 279 nips-2012-Projection Retrieval for Classification

17 0.69327074 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

18 0.69212091 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

19 0.68010008 168 nips-2012-Kernel Latent SVM for Visual Recognition

20 0.67590606 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback