nips nips2003 nips2003-30 knowledge-graph by maker-knowledge-mining

30 nips-2003-Approximability of Probability Distributions


Source: pdf

Author: Alina Beygelzimer, Irina Rish

Abstract: We consider the question of how well a given distribution can be approximated with probabilistic graphical models. We introduce a new parameter, effective treewidth, that captures the degree of approximability as a tradeoff between the accuracy and the complexity of approximation. We present a simple approach to analyzing achievable tradeoffs that exploits the threshold behavior of monotone graph properties, and provide experimental results that support the approach. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We introduce a new parameter, effective treewidth, that captures the degree of approximability as a tradeoff between the accuracy and the complexity of approximation. [sent-10, score-0.292]

2 We present a simple approach to analyzing achievable tradeoffs that exploits the threshold behavior of monotone graph properties, and provide experimental results that support the approach. [sent-11, score-0.496]

3 (1 Introduction) One of the major concerns in probabilistic reasoning using graphical models, such as Bayesian networks, is the computational complexity of inference. [sent-12, score-0.173]

4 In general, probabilistic inference is NP-hard and a typical approach to handling this complexity is to use an approximate inference algorithm that trades accuracy for efficiency. [sent-13, score-0.266]

5 However, as demonstrated in [2], such criteria are unable to capture the inference complexity: two models that have similar representation complexity and fit data equally well may have quite different graph structures, making one model exponentially slower for inference than the other. [sent-22, score-0.343]

6 Thus the treewidth arises as a natural measure of inference complexity in graphical models. [sent-26, score-0.866]

7 Intuitively, a probability distribution is approximable, or easy, if it is close to a distribution represented by an efficient, low-treewidth graphical model. [sent-28, score-0.219]

8 . . . , Xn}, and let our target distribution P be the uniform distribution on the values to which it assigns 1. [sent-36, score-0.223]

9 It is easy to see that any approximation Q that decomposes over a network whose moralized graph misses at least one edge is precisely as inaccurate as the one that assumes all variables to be independent (i.e., the empty graph). [sent-39, score-0.401]

10 This follows from the fact that the probability distribution induced on any proper subset of the variables is uniform, and thus for any proper subset {Xi1, . . . , Xir} the induced marginal P(xi1, . . . , xir) equals 2^(−r). [sent-42, score-0.227]

11 Consequently every conditional Q(xi | πi) in the projection equals 1/2, so log Q(x1, . . . , xn) = Σ_i log Q(xi | πi) = log 2^(−n) = −n, and dKL(P, Q) = −H(P) + n = 1 since H(P) = n − 1; the divergence equals 1 for every treewidth from 0 (the empty graph) up to n − 2, and drops to 0 only at treewidth n − 1 (the clique). [sent-52, score-0.757]

12 There is thus nothing to be gained (i.e., absolutely no gain in accuracy and a potentially exponential loss of efficiency) in using a model more complex than the empty graph, short of the full clique. [sent-55, score-0.286]
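
The example in sentences 8-12 is only partially recoverable from these fragments; the sketch below (our own illustration, not code from the paper) assumes the standard parity construction, i.e. P uniform on the odd-parity n-bit strings, which is consistent with the stated facts (uniform marginals on proper subsets, H(P) = n − 1). It checks numerically that the fully independent (empty-graph) approximation Q has dKL(P, Q) = 1.

```python
import itertools, math

def parity_example(n=4):
    # P is uniform on the 2**(n-1) strings of odd parity; every marginal of P on a
    # proper subset of the variables is uniform, so the empty-graph projection Q
    # is the uniform distribution on all 2**n strings.
    support = [x for x in itertools.product([0, 1], repeat=n) if sum(x) % 2 == 1]
    p = 1.0 / len(support)
    q = 2.0 ** -n
    entropy = -len(support) * p * math.log2(p)          # H(P) = n - 1
    dkl = len(support) * p * math.log2(p / q)           # d_KL(P, Q) = n - H(P) = 1
    return entropy, dkl

print(parity_example(4))   # (3.0, 1.0)
```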

13 On the other hand, one can easily construct a distribution with large weak dependencies such that representing this distribution exactly requires a network with large treewidth; however, if we are willing to sacrifice just a bit of accuracy, we get a very simple model. [sent-59, score-0.232]

14 Or, what is the best achievable approximation accuracy given a constraint on the complexity (i.e., the treewidth) of the model? [sent-78, score-0.215]

15 The tradeoff between the complexity and accuracy is monotonic; however, it may be far from linear. [sent-81, score-0.201]

16 Our approach to analyzing accuracy vs. complexity trade-offs is based on the results from random graph theory which suggest that graph properties monotone in edge addition (e.g., [sent-84, score-0.65]

17 graph connectivity) appear rather suddenly: the transition from the property being very unlikely to it being very likely occurs during a small change of the edge probability p (density) in the random graph [7, 8]. [sent-86, score-0.543]

18 First, we show that both important properties of random graphical models, the property of “being efficient” (i.e., [sent-88, score-0.178]

19 having treewidth at most some fixed integer k) and the property of “being accurate” (i.e., [sent-90, score-0.819]

20 being at distance at most some δ from the target distribution), are monotone and demonstrate a threshold behavior, giving us two families of threshold curves parameterized by k and by δ, respectively. [sent-92, score-0.442]

21 Second, we introduce the notion of effective treewidth k(δ); note that minimizing dKL from the empirical distribution (induced by a given set of samples) also corresponds to maximizing the likelihood of the observed data. [sent-93, score-0.822]

22 The effective treewidth k(δ) is the smallest achievable treewidth k given a constraint δ on the KL-divergence (error) from the target (we also introduce a notion of ε-achievable k(δ), which requires at least an ε-fraction of models in a given set to achieve treewidth k and error δ). [sent-95, score-1.585]

23 The effective treewidth captures the approximability of the distribution, and is determined by the relative position of the threshold curves, an inherent property of the target distribution. [sent-96, score-1.026]

24 (2 Preliminaries and Related Work) Let P be a probability distribution on n discrete random variables X1, X2, . . . , Xn. [sent-99, score-0.185]

25 A Bayesian network exploits the independences among the Xi to provide a compact representation of P as a product of low-order conditional probability distributions. [sent-103, score-0.168]

26 The independences are encoded by a directed acyclic graph (DAG) G with nodes corresponding to X1, X2, . . . , Xn. [sent-104, score-0.326]

27 Each Xi is independent of its non-descendants given its parents in the graph [12]. [sent-108, score-0.222]

28 The dependencies are quantified by associating each node Xi with a local conditional probability distribution PB (Xi | Πi ), where Πi is the set of parents of Xi in G. [sent-109, score-0.24]

29 We say that a distribution P decomposes over a DAG G if there exist local conditional probability distributions corresponding to G such that P can be written as their product. [sent-114, score-0.243]

30 In order to use this algorithm in the presence of cycles, one typically constructs a junction tree of the network and runs the algorithm on this tree [12]. [sent-119, score-0.211]

31 Exact inference can then be done in time and space linear in the representation of clique marginals in the junction tree, which is exponential in the size of the largest clique induced during triangulation. [sent-126, score-0.365]

32 The minimum width over all possible triangulations is called the treewidth of the graph. [sent-128, score-0.729]

33 The triangulation procedure is defined for undirected graphs, so we must first make the network undirected while preserving the set of independence assumptions; this can be done by moralizing the network, i.e., marrying (connecting) the parents of each node and dropping edge directions. [sent-129, score-0.276]
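
For concreteness, here is a small sketch of the moralization step (our own helper, not code from the paper; the parents-dict representation and the node names are assumptions): connect ("marry") the parents of every node and drop edge directions.

```python
import itertools

def moralize(parents):
    """parents: dict mapping each node to the list of its parents in the DAG."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                                  # keep the original edges, undirected
            edges.add(frozenset((p, child)))
        for u, v in itertools.combinations(pa, 2):    # marry co-parents
            edges.add(frozenset((u, v)))
    return edges

# The v-structure A -> C <- B moralizes to the triangle on {A, B, C}.
print(moralize({"A": [], "B": [], "C": ["A", "B"]}))
```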

34 Restricting the size of dependencies controls both overfitting and the complexity of inference in the resulting model. [sent-133, score-0.184]

35 Given a target distribution P(X) and an approximation Q(X), the information divergence (or Kullback-Leibler distance) between P and Q is defined as dKL(P, Q) = Σ_x P(x) log(P(x)/Q(x)), where x ranges over all possible assignments to the variables in X (see [5]). [sent-141, score-0.264]
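
The definition translates directly into code; the following tiny sketch (our own, with distributions represented as dicts from assignments to probabilities) is only meant to fix the convention used here, namely log base 2 and the requirement that Q be positive wherever P is.

```python
import math

def dkl(P, Q):
    # d_KL(P, Q) = sum_x P(x) * log2(P(x) / Q(x)), summing over the support of P.
    return sum(p * math.log2(p / Q[x]) for x, p in P.items() if p > 0)

print(dkl({0: 0.5, 1: 0.5}, {0: 0.25, 1: 0.75}))   # about 0.21 bits
```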

36 Let Dk denote the class of distributions decomposable on graphs with treewidth at most k (0 ≤ k < n), with D1 corresponding to the set of tree-decomposable distributions. [sent-144, score-0.936]

37 The distribution within Dk minimizing the information divergence from the target distribution P is called the projection of P onto Dk . [sent-145, score-0.359]

38 For a fixed tree T , the projection of P onto the set of T -decomposable distributions is uniquely given by the distribution in which the conditional probabilities along the edges of T coincide with those computed from P . [sent-148, score-0.425]
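
A sketch of this tree projection (our own illustration; the variable ordering, the child-to-parent dict describing the rooted tree, and all helper names are assumptions, not the paper's notation): the projected distribution is Q(x) = P(x_root) multiplied over tree edges by P(x_child | x_parent), with every conditional computed from P.

```python
def tree_projection(P, variables, tree_parent, root):
    """P: dict mapping full assignments (tuples) to probabilities.
    variables: list of names matching tuple positions; tree_parent: child -> parent."""
    idx = {v: i for i, v in enumerate(variables)}

    def marginal(vs, vals):
        # probability under P that the variables in vs take the values in vals
        return sum(p for x, p in P.items()
                   if all(x[idx[v]] == val for v, val in zip(vs, vals)))

    def Q(x):
        q = marginal([root], [x[idx[root]]])
        for child, parent in tree_parent.items():
            num = marginal([child, parent], [x[idx[child]], x[idx[parent]]])
            den = marginal([parent], [x[idx[parent]]])
            q *= num / den if den > 0 else 0.0
        return q

    return Q
```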

39 Fix a network structure G, and let Q be a distribution decomposable over G. [sent-153, score-0.203]

40 If P is the empirical distribution induced by the given sample of size N (i.e., the relative frequencies observed in the sample) . . . [sent-155, score-0.216]

41 Hence if G is fixed, the projection onto the set of G-decomposable distributions is uniquely defined, and we will identify G with this projection (ignoring some notational abuse). [sent-161, score-0.198]

42 Srebro [13] showed that a similar undirected decomposition holds for bounded treewidth Markov networks (probabilistic models that use undirected graphs to represent dependencies). [sent-165, score-1.041]

43 (Threshold behavior of random graphs) We use the model of random directed acyclic graphs (DAGs) defined by Barak and Erdős [1]. [sent-170, score-0.387]

44 Consider the probability space G(n, p) of random undirected graphs on n nodes with edge probability p (i.e., each possible edge is present independently with probability p). [sent-171, score-0.425]

45 Let Gn,p stand for a random graph from this probability space. [sent-174, score-0.242]

46 We will also occasionally use Gn,m to denote a graph chosen randomly from among all graphs with n nodes and m edges. [sent-175, score-0.322]

47 Since the true distribution P is given only by the sample, we let P also denote the empirical distribution induced by the sample, ignoring some abuse of notation. [sent-180, score-0.282]

48 A graph property P is naturally associated with the set of graphs having P. [sent-181, score-0.37]

49 A property is monotone increasing if it is preserved under edge addition: If a graph G satisfies the property, then every graph on the same set of nodes containing G as a subgraph must satisfy it as well. [sent-182, score-0.704]

50 It is easy to see (and intuitively clear) that if P is a monotone increasing property then the probability that Gn,p satisfies P is a non-decreasing function of p. [sent-183, score-0.362]

51 For example, the property of having treewidth at most some fixed integer k is monotone decreasing: adding edges can only increase the treewidth. [sent-185, score-1.08]

52 The theory of random graphs was initiated by Erdős and Rényi [7], and one of the main observations they made was that many natural monotone properties appear rather suddenly, i.e., exhibit a threshold behavior. [sent-186, score-0.322]

53 Friedgut [8] proved that every monotone graph property of undirected graphs has such a threshold behavior. [sent-189, score-0.684]

54 Random DAGs (corresponding to random partially ordered sets) have received less attention than random undirected graphs, partially because of the additional structure that prevents the completely independent choice of edges. [sent-190, score-0.203]

55 (Accuracy) Recall that the information divergence of a given DAG G from the target distribution P is given by dKL(P, G) = W(G) − H(P), where W(G) = −Σ_{i=1}^{n} Σ_{xi,πi} P(xi, πi) log P(xi | πi). [sent-195, score-0.271]
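
The decomposition makes dKL straightforward to estimate from data: W(G) is just the sum of empirical conditional entropies of each node given its parents. The sketch below is our own (the sample format, the parents dict, and the function name W are assumptions), not the paper's code.

```python
import math
from collections import Counter

def W(sample, parents):
    """sample: list of dicts, each mapping variable name -> observed value.
    parents: dict mapping each node to the list of its parents in the DAG G."""
    N = len(sample)
    total = 0.0
    for child, pa in parents.items():
        joint = Counter(tuple([row[child]] + [row[p] for p in pa]) for row in sample)
        par = Counter(tuple(row[p] for p in pa) for row in sample)
        for key, c in joint.items():
            p_joint = c / N                     # empirical P(x_i, pi_i)
            p_cond = c / par[tuple(key[1:])]    # empirical P(x_i | pi_i)
            total -= p_joint * math.log2(p_cond)
    return total   # d_KL(P, G) is then W(G) - H(P)
```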

56 (In our case, P is the empirical distribution induced by the given sample S of size N.) [sent-196, score-0.216]

57 Notice that Pδ is monotone increasing: Adding edges to a graph can only bring the graph closer to the target distribution, since any distribution decomposable on the original graph is also decomposable on the augmented one. [sent-199, score-1.071]

58 (Complexity) Fix an integer k, and consider the property of an n-node DAG that the treewidth of its moralized graph is at most k. [sent-201, score-1.01]

59 Call this property Pk and note that it is a structural property of a DAG, which does not depend on the target distribution and its projection onto the DAG. [sent-202, score-0.441]

60 It is also a monotone decreasing property, since if a graph has treewidth at most k, then certainly any of its subgraphs does. [sent-203, score-1.049]

61 Recall that we identify each graph with the projection of the target distribution onto the graph. [sent-204, score-0.428]

62 We call a pair (k, δ) achievable for a distribution P , if there exists a distribution Q decomposable on a graph with treewidth at most k such that dKL (P, Q) ≤ δ. [sent-205, score-1.194]

63 The effective treewidth of P, with respect to a given δ, is defined as the smallest k(δ) such that the pair (k, δ) is achievable, i.e., [sent-206, score-0.751]

64 if no distribution at distance at most δ from P is decomposable on a graph with treewidth less than k(δ). [sent-208, score-0.936]

65 (4 Main Idea) Consider, for each treewidth bound k, the curve given by µk(p) = Pr[width(Gn,p) ≤ k], and let pk be such that µk(pk) = 1/2 + ε, where 0 < ε < 1/2 is some fixed constant. [sent-216, score-0.882]

66 For reasons that will become clear in a moment, our goal will be to find, for each feasible treewidth k, the value of δ such that pδ = pk . [sent-218, score-0.839]

67 To find each pk , the algorithm will simply do a binary search on the interval (0, 1): For the current value of edge probability p, the algorithm estimates µk (Gn,p ) by random sampling and branches according to the estimate. [sent-219, score-0.272]

68 To estimate µk(Gn,p) within an additive error ρ with probability at least 1 − γ, the algorithm samples m = ln(2/γ)/(2ρ^2) independent copies of Gn,p, and outputs the average value of the 0/1 random variable indicating whether the treewidth of the sampled DAG is at most k. [sent-221, score-0.868]
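
Sentences 67-68 describe a precomputation that is easy to sketch. In the sketch below (our own implementation choices, not the paper's code), undirected G(n, p) graphs and networkx's min-fill treewidth heuristic stand in for the paper's random moralized DAGs and its width-estimation heuristics; the Hoeffding-style sample size m = ln(2/γ)/(2ρ²) is taken from the text.

```python
import math
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_fill_in

def estimate_mu_k(n, p, k, rho=0.1, gamma=0.1):
    """Estimate mu_k(p) = Pr[width(G_{n,p}) <= k] within +/- rho, w.p. >= 1 - gamma."""
    m = math.ceil(math.log(2 / gamma) / (2 * rho ** 2))
    hits = 0
    for _ in range(m):
        g = nx.gnp_random_graph(n, p)
        width, _ = treewidth_min_fill_in(g)      # heuristic upper bound on treewidth
        hits += (width <= k)
    return hits / m

def find_p_k(n, k, eps=0.1, tol=0.01):
    """Binary search for the edge density p_k with mu_k(p_k) ~ 1/2 + eps."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        p = (lo + hi) / 2
        if estimate_mu_k(n, p, k) >= 0.5 + eps:
            lo = p      # low treewidth is still likely, so p_k lies at a higher density
        else:
            hi = p
    return (lo + hi) / 2
```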

69 Note that the values related to treewidth are independent of the target distribution and can be precomputed offline. [sent-223, score-0.88]

70 To find δ = δ(k) for a given value of k, the algorithm computes the values of W (Gn,pk ) for the m sampled random DAGs in G(n, pk ), orders them and chooses the median. [sent-224, score-0.185]
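
The remaining steps (sentences 70-72) reduce to order statistics over the sampled graphs. A minimal sketch with toy inputs (our own glue code; the W values would come from a helper such as the one sketched after sentence 55, and H_P is the entropy of the empirical target):

```python
import statistics

def delta_and_best(w_values, H_P):
    # delta(k) is the median of d_KL(P, G) = W(G) - H(P) over the sampled graphs;
    # the sampled graph with the smallest divergence realizes the tradeoff.
    divergences = [w - H_P for w in w_values]
    delta = statistics.median(divergences)
    best = min(range(len(divergences)), key=divergences.__getitem__)
    return delta, best

print(delta_and_best([10.2, 9.8, 11.0, 10.5, 9.9], H_P=9.0))   # (1.2, 1), toy numbers
```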

71 We know that at least a (1/2 + ε)-fraction of the DAGs in G(n, pk) satisfy Pk. [sent-226, score-0.173]

72 Moreover, there is a very simple probabilistic algorithm for finding a model realizing the tradeoff: We just need to sample O(1/ε) DAGs in G(n, pk) and choose the closest one. [sent-228, score-0.203]

73 Clearly we are overcounting, since the same DAGs may contribute to both probabilities; however not absurdly, since intuitively the graphs in G(n, pk ) with small treewidth will not fit the distribution better than the ones with larger treewidth. [sent-229, score-1.047]

74 A distribution is called k-wise independent if any subset of k variables is mutually independent (however, there may exist dependencies on larger subsets). [sent-231, score-0.214]
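
A brute-force checker for this definition (our own illustration, feasible only for small n; distributions are dicts from 0/1 tuples to probabilities) tests whether every marginal on k variables factorizes into the product of the single-variable marginals.

```python
import itertools

def is_k_wise_independent(P, n, k, tol=1e-9):
    def marginal(idxs, vals):
        return sum(p for x, p in P.items() if all(x[i] == v for i, v in zip(idxs, vals)))
    for idxs in itertools.combinations(range(n), k):
        for vals in itertools.product([0, 1], repeat=k):
            prod = 1.0
            for i, v in zip(idxs, vals):
                prod *= marginal([i], [v])
            if abs(marginal(idxs, vals) - prod) > tol:
                return False
    return True
```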

75 Figure 1 shows the curves for a 3-wise independent distribution on 8 random variables. [sent-232, score-0.218]

76 The x-axis denotes the number of edges m; the y-axis denotes the probability that Gn,m satisfies the property corresponding to a given curve. [sent-235, score-0.235]

77 The monotone decreasing curves correspond to the properties Pk for k = 1, 2, . . . . [sent-236, score-0.273]

78 The monotone increasing curves correspond to the property of having dKL at most δ. [sent-241, score-0.37]

79 As m increases, the probability of having small treewidth decreases, while the probability of getting close to the target increases. [sent-247, score-0.866]

80 As the random graph evolves, we want to capture the moment when the first probability is still high, while the second is already high. [sent-249, score-0.242]

81 As expected, graphs with treewidth at most 2 are as inaccurate as the empty graph since all triples are independent. [sent-250, score-1.066]

82 Given the desired level of closeness δ, we want to find the smallest treewidth k such that the corresponding curves meet above some cut-off probability. [sent-251, score-0.772]

83 If the curves meet above 0.7, we may suggest, say, projecting onto graphs with treewidth 4 (cutting at 0.7). [sent-253, score-0.852]

84 Let the target distribution be the empirical distribution P induced by a given sample. [sent-261, score-0.339]

85 Recall that dKL(P, G) decomposes into a sum of conditional entropies induced by G (minus the entropy of P). [sent-262, score-0.217]

86 Höffgen [9] showed how to estimate these conditional entropies with any fixed additive precision ρ using polynomially many samples. [sent-263, score-0.176]

87 More precisely, he showed that a sample of size m = m(γ, ρ) = O((n^(k+1)/ρ)^2 log^2(n/ρ) log(n/γ)) suffices to obtain good estimates of all induced conditional entropies with probability at least 1 − γ, which in turn suffices to estimate dKL(P, G) with additive precision ρ. [sent-264, score-0.355]

88 (Estimating Treewidth) We, of course, will not attempt to compute the treewidth of the randomly generated graphs exactly. [sent-265, score-0.808]

89 There are no theoretical guarantees in general, but heuristics tend to perform reasonably well: used in combination with various lower-bound techniques, they can often pin down the treewidth to a small range, or even identify it exactly. [sent-268, score-0.722]

90 We stress that the values related to treewidth are independent of the target distribution and can be precomputed. [sent-269, score-0.88]

91 The network has 37 nodes, 46 directed edges, and 19 additional undirected edges induced by moralization; the treewidth is 4. [sent-273, score-1.096]

92 For each treewidth bound k, the curve gives an estimate of the best achievable value of W (recall that dKL = W − H(P)). [sent-279, score-0.853]

93 (Figure 2: Tradeoff curve for ALARM; accuracy W versus complexity, i.e. treewidth.) The estimate is based on generating 400 random DAGs with 37 nodes and m edges, for every possible m. [sent-281, score-0.216]

94 Here we see that the highest gain in accuracy from allowing the model to be more complex occurs up to treewidth 4, less so 5 and 6; by further increasing the treewidth we do not gain much in accuracy. [sent-287, score-1.469]

95 We succeed in reconstructing the width in the sense that the distribution was simulated from a treewidth-4 model. (If k is fixed, the problem of determining whether a graph has treewidth at most k can be solved in linear time.) [sent-288, score-0.956]

96 (Figure 3: Threshold curves for ALARM; the x-axis is the number of edges.) [sent-297, score-0.17]

97 Such tradeoff curves are similar to commonly used ROC (Receiver Operating Characteristic) curves; the techniques for finding the cutoff value in ROC curves can be used here as well. [sent-298, score-0.221]

98 Instead of plotting the best achievable distance, we can plot the best distance achievable by at least an ε-fraction of models in the class, parameterizing the tradeoff curve by ε. [sent-299, score-0.301]

99 On the maximal number of strongly independent vertices in a random acyclic directed graph. [sent-306, score-0.177]

100 Note, however, that it does not imply that the empirical distribution itself decomposes on a treewidth-4 model. [sent-380, score-0.152]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('treewidth', 0.695), ('dkl', 0.272), ('dags', 0.232), ('monotone', 0.168), ('graph', 0.158), ('pk', 0.144), ('dag', 0.116), ('graphs', 0.113), ('property', 0.099), ('edges', 0.093), ('decomposable', 0.091), ('undirected', 0.09), ('clique', 0.089), ('xn', 0.085), ('target', 0.085), ('induced', 0.083), ('complexity', 0.081), ('achievable', 0.081), ('curves', 0.077), ('distribution', 0.069), ('dk', 0.069), ('tradeoff', 0.067), ('approximability', 0.066), ('erd', 0.066), ('tree', 0.058), ('friedgut', 0.057), ('larm', 0.057), ('fix', 0.056), ('threshold', 0.056), ('accuracy', 0.053), ('triangulation', 0.053), ('junction', 0.052), ('inference', 0.052), ('nodes', 0.051), ('dependencies', 0.051), ('decomposes', 0.05), ('pb', 0.05), ('hypertree', 0.05), ('divergence', 0.047), ('empty', 0.045), ('projection', 0.045), ('edge', 0.044), ('onto', 0.044), ('conditional', 0.044), ('network', 0.043), ('curve', 0.043), ('probability', 0.043), ('ll', 0.043), ('random', 0.041), ('acyclic', 0.04), ('entropies', 0.04), ('forcing', 0.04), ('directed', 0.039), ('xi', 0.039), ('graphical', 0.038), ('approximable', 0.038), ('beygelzimer', 0.038), ('ffgen', 0.038), ('independences', 0.038), ('nonapproximable', 0.038), ('distributions', 0.037), ('coincide', 0.035), ('width', 0.034), ('parents', 0.033), ('suddenly', 0.033), ('tradeoffs', 0.033), ('hawthorne', 0.033), ('moralization', 0.033), ('moralized', 0.033), ('rish', 0.033), ('srebro', 0.033), ('xik', 0.033), ('empirical', 0.033), ('variables', 0.032), ('sample', 0.031), ('log', 0.031), ('pair', 0.031), ('independent', 0.031), ('absolutely', 0.03), ('chow', 0.03), ('hardly', 0.03), ('triples', 0.03), ('least', 0.029), ('additive', 0.029), ('decreasing', 0.028), ('abuse', 0.028), ('barak', 0.028), ('probabilistic', 0.028), ('networks', 0.028), ('identify', 0.027), ('increasing', 0.026), ('intuitively', 0.026), ('reasoning', 0.026), ('vertices', 0.026), ('inaccurate', 0.025), ('ibm', 0.025), ('watson', 0.025), ('effective', 0.025), ('showed', 0.025), ('integer', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 30 nips-2003-Approximability of Probability Distributions

Author: Alina Beygelzimer, Irina Rish

Abstract: We consider the question of how well a given distribution can be approximated with probabilistic graphical models. We introduce a new parameter, effective treewidth, that captures the degree of approximability as a tradeoff between the accuracy and the complexity of approximation. We present a simple approach to analyzing achievable tradeoffs that exploits the threshold behavior of monotone graph properties, and provide experimental results that support the approach. 1

2 0.10017472 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

Author: Xuanlong Nguyen, Michael I. Jordan

Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1

3 0.098438874 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

Author: Quaid D. Morris, Brendan J. Frey

Abstract: This paper addresses the problem of untangling hidden graphs from a set of noisy detections of undirected edges. We present a model of the generation of the observed graph that includes degree-based structure priors on the hidden graphs. Exact inference in the model is intractable; we present an efficient approximate inference algorithm to compute edge appearance posteriors. We evaluate our model and algorithm on a biological graph inference problem.

4 0.094631761 189 nips-2003-Tree-structured Approximations by Expectation Propagation

Author: Yuan Qi, Tom Minka

Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1

5 0.083590358 169 nips-2003-Sample Propagation

Author: Mark A. Paskin

Abstract: Rao–Blackwellization is an approximation technique for probabilistic inference that flexibly combines exact inference with sampling. It is useful in models where conditioning on some of the variables leaves a simpler inference problem that can be solved tractably. This paper presents Sample Propagation, an efficient implementation of Rao–Blackwellized approximate inference for a large class of models. Sample Propagation tightly integrates sampling with message passing in a junction tree, and is named for its simple, appealing structure: it walks the clusters of a junction tree, sampling some of the current cluster’s variables and then passing a message to one of its neighbors. We discuss the application of Sample Propagation to conditional Gaussian inference problems such as switching linear dynamical systems. 1

6 0.082573533 51 nips-2003-Design of Experiments via Information Theory

7 0.078032337 121 nips-2003-Log-Linear Models for Label Ranking

8 0.073883787 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

9 0.069855817 152 nips-2003-Pairwise Clustering and Graphical Models

10 0.066563442 168 nips-2003-Salient Boundary Detection using Ratio Contour

11 0.059772741 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty

12 0.056499649 117 nips-2003-Linear Response for Approximate Inference

13 0.05618494 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

14 0.05433568 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays

15 0.053809792 170 nips-2003-Self-calibrating Probability Forecasting

16 0.053209476 32 nips-2003-Approximate Expectation Maximization

17 0.05005433 85 nips-2003-Human and Ideal Observers for Detecting Image Curves

18 0.049944557 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

19 0.049389977 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation

20 0.04839398 99 nips-2003-Kernels for Structured Natural Language Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.177), (1, -0.04), (2, -0.047), (3, 0.108), (4, 0.051), (5, -0.091), (6, 0.007), (7, 0.023), (8, -0.032), (9, -0.016), (10, 0.005), (11, 0.047), (12, 0.071), (13, 0.0), (14, 0.044), (15, -0.105), (16, -0.069), (17, -0.071), (18, 0.07), (19, 0.063), (20, -0.045), (21, 0.02), (22, 0.0), (23, -0.047), (24, -0.109), (25, -0.022), (26, 0.078), (27, 0.017), (28, -0.059), (29, 0.066), (30, -0.006), (31, -0.037), (32, -0.04), (33, 0.068), (34, 0.104), (35, -0.125), (36, -0.016), (37, 0.063), (38, 0.017), (39, -0.141), (40, 0.077), (41, -0.021), (42, 0.078), (43, -0.197), (44, -0.09), (45, -0.024), (46, 0.034), (47, -0.014), (48, 0.046), (49, 0.063)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93686384 30 nips-2003-Approximability of Probability Distributions

Author: Alina Beygelzimer, Irina Rish

Abstract: We consider the question of how well a given distribution can be approximated with probabilistic graphical models. We introduce a new parameter, effective treewidth, that captures the degree of approximability as a tradeoff between the accuracy and the complexity of approximation. We present a simple approach to analyzing achievable tradeoffs that exploits the threshold behavior of monotone graph properties, and provide experimental results that support the approach. 1

2 0.68522197 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

Author: Quaid D. Morris, Brendan J. Frey

Abstract: This paper addresses the problem of untangling hidden graphs from a set of noisy detections of undirected edges. We present a model of the generation of the observed graph that includes degree-based structure priors on the hidden graphs. Exact inference in the model is intractable; we present an efficient approximate inference algorithm to compute edge appearance posteriors. We evaluate our model and algorithm on a biological graph inference problem. 1 Introduction and motivation The inference of hidden graphs from noisy edge appearance data is an important problem with obvious practical application. For example, biologists are currently building networks of all the physical protein-protein interactions (PPI) that occur in particular organisms. The importance of this enterprise is commensurate with its scale: a completed network would be as valuable as a completed genome sequence, and because each organism contains thousands of different types of proteins, there are millions of possible types of interactions. However, scalable experimental methods for detecting interactions are noisy, generating many false detections. Motivated by this application, we formulate the general problem of inferring hidden graphs as probabilistic inference in a graphical model, and we introduce an efficient algorithm that approximates the posterior probability that an edge is present. In our model, a set of hidden, constituent graphs are combined to generate the observed graph. Each hidden graph is independently sampled from a prior on graph structure. The combination mechanism acts independently on each edge but can be either stochastic or deterministic. Figure 1 shows an example of our generative model. Typically one of the hidden graphs represents the graph of interest (the true graph), the others represent different types of observation noise. Independent edge noise may also be added by the combination mechanism. We use probabilistic inference to compute a likely decomposition of the observed graph into its constituent parts. This process is deemed “untangling”. We use the term “denoising” to refer to the special case where the edge noise is independent. In denoising there is a single hidden graph, the true graph, and all edge noise in the observed graph is due E1 1 eij E2 i 2 eij j xij i j i j X Figure 1: Illustrative generative model example. Figure shows an example where an observed graph, X, is a noisy composition of two constituent graphs, E 1 and E 2 . All graphs share the same vertex set, so each can be represented by a symmetric matrix of random binary variables (i.e., an adjacency matrix). This generative model is designed to solve a toy counter-espionage problem. The vertices represent suspects and each edge in X represents an observed call between two suspects. The graph X reflects zero or more spy rings (represented by E 1 ), telemarketing calls (represented by E 2 ), social calls (independent edge noise), and lost call records (more independent edge noise). The task is to locate any spy rings hidden in X. We model the distribution of spy ring graphs using a prior, P (E 1 ), that has support only on graphs where all vertices have degree of either 2 (i.e., are in the ring) or 0 (i.e., are not). Graphs of telemarketing call patterns are represented using a prior, P (E 2 ), under which all nodes have degrees of > 3 (i.e., are telemarketers), 1 (i.e., are telemarketees), or 0 (i.e., are neither). The displayed hidden graphs are one likely untangling of X. to the combination mechanism. 
Prior distributions over graphs can be specified in various ways, but our choice is motivated by problems we want to solve, and by a view to deriving an efficient inference algorithm. One compact representation of a distribution over graphs consists of specifying a distribution over vertex degrees, and assuming that graphs that have the same vertex degrees are equiprobable. Such a prior can model quite rich distributions over graphs. These degree-based structure priors are natural representions of graph structure; many classes of real-world networks have a characteristic functional form associated with their degree distributions [1], and sometimes this form can be predicted using knowledge about the domain (see, e.g., [2]) or detected empirically (see, e.g., [3, 4]). As such, our model incorporates degree-based structure priors. Though exact inference in our model is intractable in general, we present an efficient algorithm for approximate inference for arbitrary degree distributions. We evaluate our model and algorithm using the real-world example of untangling yeast proteinprotein interaction networks. 2 A model of noisy and tangled graphs For degree-based structure priors, inference consists of searching over vertex degrees and edge instantiations, while comparing each edge with its noisy observation and enforcing the constraint that the number of edges connected to every vertex must equal the degree of the vertex. Our formulation of the problem in this way is inspired by the success of the sum-product algorithm (loopy belief propagation) for solving similar formulations of problems in error-correcting decoding [6, 7], phase unwrapping [8], and random satisfiability [9]. For example, in error-correcting decoding, inference consists of searching over configurations of codeword bits, while comparing each bit with its noisy observation and enforcing parity-check constraints on subsets of bits [10]. For a graph on a set of N vertices, eij is a variable that indicates the presence of an edge connecting vertices i and j: eij = 1 if there is an edge, and eij = 0 otherwise. We assume the vertex set is fixed, so each graph is specified by an adjacency matrix, E = {eij }N . The degree of vertex i is denoted by di and the i,j=1 degree set by D = {di }N . The observations are given by a noisy adjacency matrix, i=1 X = {xij }N . Generally, edges can be directed, but in this paper we focus on i,j=1 undirected graphs, so eij = eji and xij = xji . Assuming the observation noise is independent for different edges, the joint distribution is P (X, E, D) = P (X|E)P (E, D) = P (xij |eij ) P (E, D). j≥i P (xij |eij ) models the edge observation noise. We use an undirected model for the joint distribution over edges and degrees, P (E, D), where the prior distribution over di is determined by a non-negative potential fi (di ). Assuming graphs that have the same vertex degrees are equiprobable, we have N eij ) , fi (di )I(di , P (E, D) ∝ i j=1 where I(a, b) = 1 if a = b, and I(a, b) = 0 if a = b. The term I(di , j eij ) ensures that the number of edges connected to vertex i is equal to di . It is straightforward to show that the marginal distribution over di is P (di ) ∝ fi (di ) D\di nD j=i fj (dj ) , where nD is the number of graphs with degrees D and the sum is over all degree variables except di . The potentials, fi , can be estimated from a given degree prior using Markov chain Monte Carlo; or, as an approximation, they can be set to an empirical degree distribution obtained from noise-free graphs. 
Figure 2a shows the factor graph [11] for the above model. Each filled square corresponds to a term in the factorization of the joint distribution, and the square is connected to all variables on which the term depends. Factor graphs are graphical models that unify the properties of Bayesian networks and Markov random fields [12]. Many inference algorithms, including the sum-product algorithm (a.k.a. loopy belief propagation), are more easily derived using factor graphs than Bayesian networks or Markov random fields. We describe the sum-product algorithm for our model in section 3.

Figure 2: (a) A factor graph that describes a distribution over graphs with vertex degrees d_i, binary edge indicator variables e_ij, and noisy edge observations x_ij. The indicator function I(d_i, \sum_j e_ij) enforces the constraint that the sum of the binary edge indicator variables for vertex i must equal the degree of vertex i. (b) A factor graph that explains noisy observed edges as a combination of two constituent graphs, with edge indicator variables e^1_ij and e^2_ij. (c) The constraint I(d_i, \sum_j e_ij) can be implemented using a chain with state variables, which leads to an exponentially faster message-passing algorithm.

2.1 Combining multiple graphs

The above model is suitable when we want to infer a graph that matches a degree prior, assuming the edge observation noise is independent. A more challenging goal, with practical application, is to infer multiple hidden graphs that combine to explain the observed edge data. In section 4, we show how priors over multiple hidden graphs can be used to infer protein-protein interactions. When there are H hidden graphs, each constituent graph is specified by a set of edges on the same set of N common vertices. For the degree variables and edge variables, we use a superscript to indicate which hidden graph the variable describes. Assuming the graphs are independent, the joint distribution over the observed edge data X, and the edge variables and degree variables for the hidden graphs, E^1, D^1, ..., E^H, D^H, is

P(X, E^1, D^1, \ldots, E^H, D^H) = \Big[ \prod_{j \ge i} P(x_{ij} | e^1_{ij}, \ldots, e^H_{ij}) \Big] \prod_{h=1}^H P(E^h, D^h),   (1)

where for each hidden graph, P(E^h, D^h) is modeled as described above. Here, the likelihood P(x_ij | e^1_ij, ..., e^H_ij) describes how the edges in the hidden graphs combine to model the observed edge. Figure 2b shows the factor graph for this model.

3 Probabilistic inference of constituent graphs

Exact probabilistic inference in the above models is intractable; here we introduce an approximate inference algorithm that consists of applying the sum-product algorithm while ignoring cycles in the factor graph. Although the sum-product algorithm has been used to obtain excellent results on several problems [6, 7, 13, 14, 8, 9], we have found that the algorithm works best when the model consists of uncertain observations of variables that are subject to a large number of hard constraints. This motivated the formulation of the model described above.
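To spell out equation (1), here is an assumption-laden sketch that evaluates the unnormalized log joint for H hidden graphs. It uses a deterministic OR combination followed by a symmetric flip-noise rate eps as a stand-in for the likelihood P(x_ij | e^1_ij, ..., e^H_ij); the paper leaves the combination mechanism general, so this choice is purely illustrative.

```python
import numpy as np

def log_joint(X, hidden, degree_sets, potentials, eps=0.05):
    """Unnormalized log P(X, E^1, D^1, ..., E^H, D^H) under eq. (1).

    X           : (N, N) observed adjacency matrix
    hidden      : list of H (N, N) hidden adjacency matrices E^h
    degree_sets : list of H (N,) degree vectors D^h
    potentials  : list of H (N, dmax+1) degree potentials f^h
    eps         : assumed probability that the OR of the hidden edges is flipped
    """
    total = 0.0
    # Prior term: each hidden graph has its own degree-based prior.
    for E, D, f in zip(hidden, degree_sets, potentials):
        realized = E.sum(axis=1)
        if np.any(realized != D):
            return -np.inf
        total += np.sum(np.log(f[np.arange(len(D)), D]))
    # Likelihood term over unordered vertex pairs (j >= i).
    combined = np.maximum.reduce(hidden)      # deterministic OR of the hidden edges
    iu = np.triu_indices(X.shape[0])
    match = (X[iu] == combined[iu])
    total += np.sum(np.where(match, np.log(1 - eps), np.log(eps)))
    return float(total)
```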
Conceptually, our inference algorithm is a straightforward application of the sum-product algorithm, cf. [15], where messages are passed along edges in the factor graph iteratively and then combined at variables to obtain estimates of posterior probabilities. However, direct implementation of the message-passing updates leads to an intractable algorithm. In particular, direct implementation of the update for the message sent from function I(d_i, \sum_j e_ij) to edge variable e_ik takes a number of scalar operations that is exponential in the number of vertices. Fortunately, there exists a more efficient way to compute these messages.

3.1 Efficiently summing over edge configurations

The function I(d_i, \sum_j e_ij) ensures that the number of edges connected to vertex i is equal to d_i. Passing messages through this function requires summing over all edge configurations that correspond to each possible degree d_i, and summing over d_i. Specifically, the message \mu_{I_i \to e_{ik}}(e_{ik}) sent from function I(d_i, \sum_j e_ij) to edge variable e_ik is given by

\mu_{I_i \to e_{ik}}(e_{ik}) = \sum_{d_i} \sum_{\{e_{ij} \,|\, j=1,\ldots,N,\ j \ne k\}} I\Big(d_i, \sum_j e_{ij}\Big) \prod_{j \ne k} \mu_{e_{ij} \to I_i}(e_{ij}),

where \mu_{e_{ij} \to I_i}(e_{ij}) is the message sent from e_ij to function I(d_i, \sum_j e_ij). The sum over {e_ij | j = 1, ..., N, j \ne k} contains 2^{N-1} terms, so direct computation is intractable. However, for a maximum degree of d_max, all messages departing from the function I(d_i, \sum_j e_ij) can be computed using order d_max N binary scalar operations, by introducing integer state variables s_ij. We define s_ij = \sum_{n \le j} e_{in} and note that, by recursion, s_ij = s_{i,j-1} + e_ij, where s_i0 = 0 and 0 \le s_ij \le d_max. This recursive expression enables us to write the high-complexity constraint as the sum of a product of low-complexity constraints,

I\Big(d_i, \sum_j e_{ij}\Big) = \sum_{\{s_{ij} \,|\, j=1,\ldots,N\}} I(s_{i1}, e_{i1}) \Big[ \prod_{j=2}^N I(s_{ij}, s_{i,j-1} + e_{ij}) \Big] I(d_i, s_{iN}).

This summation can be performed using the forward-backward algorithm. In the factor graph, the summation can be implemented by replacing the function I(d_i, \sum_j e_ij) with a chain of lower-complexity functions, connected as shown in Fig. 2c. The function vertex (filled square) on the far left corresponds to I(s_i1, e_i1) and the function vertex in the upper right corresponds to I(d_i, s_iN). So, messages can be passed through each constraint function I(d_i, \sum_j e_ij) in an efficient manner, by performing a single forward-backward pass in the corresponding chain.
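The following numpy sketch illustrates the idea behind this construction for a single vertex i: it builds forward and backward tables over the bounded partial sums s_ij and combines them to obtain every outgoing message from the degree-constraint factor, with the degree potential f_i folded in. It is a plain dynamic-programming rendering of the chain in Figure 2c rather than the authors' implementation, and its cost is O(N d_max^2) rather than the tighter bound quoted above.

```python
import numpy as np

def degree_factor_messages(mu_in, f):
    """Messages from the factor I(d_i, sum_j e_ij), with degree potential f_i
    folded in, to every edge variable e_ik attached to vertex i.

    mu_in : (N, 2) array, mu_in[j, b] = incoming message mu_{e_ij -> I_i}(e_ij = b)
    f     : (dmax+1,) array, f[d] = degree potential f_i(d)

    Returns an (N, 2) array of outgoing messages, avoiding the explicit sum
    over all 2^(N-1) edge configurations.
    """
    N = mu_in.shape[0]
    dmax = len(f) - 1

    # forward[j, s]: sum over e_i1..e_ij with partial sum s of the incoming messages
    forward = np.zeros((N + 1, dmax + 1))
    forward[0, 0] = 1.0
    for j in range(N):
        for s in range(dmax + 1):
            forward[j + 1, s] = forward[j, s] * mu_in[j, 0]
            if s > 0:
                forward[j + 1, s] += forward[j, s - 1] * mu_in[j, 1]

    # backward[j, s]: the same quantity for edges e_i,j+1 .. e_iN
    backward = np.zeros((N + 1, dmax + 1))
    backward[N, 0] = 1.0
    for j in range(N - 1, -1, -1):
        for s in range(dmax + 1):
            backward[j, s] = backward[j + 1, s] * mu_in[j, 0]
            if s > 0:
                backward[j, s] += backward[j + 1, s - 1] * mu_in[j, 1]

    mu_out = np.zeros((N, 2))
    for k in range(N):
        for b in (0, 1):
            total = 0.0
            for s1 in range(dmax + 1):
                for s2 in range(dmax + 1 - s1 - b):   # keep total degree <= dmax
                    total += forward[k, s1] * backward[k + 1, s2] * f[s1 + b + s2]
            mu_out[k, b] = total
    return mu_out
```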
4 Results

We evaluate our model using yeast protein-protein interaction (PPI) data compiled by [16]. These data include eight sets of putative, but noisy, interactions derived from various sources, and one gold-standard set of interactions detected by reliable experiments. Using the ~6300 yeast proteins as vertices, we represent the eight sets of putative interactions using adjacency matrices {Y^m}_{m=1}^8, where y^m_ij = 1 if and only if putative interaction dataset m contains an interaction between proteins i and j. We similarly use Y^gold to represent the gold-standard interactions. We construct an observed graph, X, by setting x_ij = max_m y^m_ij for all i and j; thus the observed edge set is the union of all the putative edge sets.

Figure 3: Protein-protein interaction network untangling results. (a) ROC curves measuring performance of predicting e^1_ij when x_ij = 1. (b) Degree distributions, comparing the empirical degree distribution of the test-set subgraph of E^1 to the degree potential f^1 estimated on the training-set subgraph of E^1, and to the distribution of \hat{d}_i = \sum_j p_ij, where p_ij = P(e^1_ij = 1 | X) is estimated by untangling.

We test our model on the task of discerning which of the edges in X are also in Y^gold. We formalize this problem as that of decomposing X into two constituent graphs E^1 and E^2, the true and the noise graphs respectively, such that e^1_ij = x_ij y^gold_ij and e^2_ij = x_ij - e^1_ij. We use a training set to fit our model parameters and then measure task performance on a test set. The training set contains a randomly selected half of the ~6300 yeast proteins, together with the subgraphs of E^1, E^2, and X restricted to those vertices. The test set contains the other half of the proteins and the corresponding subgraphs. Note that interactions connecting test-set proteins to training-set proteins (and vice versa) are ignored.

We fit three sets of parameters: a set of naive Bayes parameters that define a set of edge-specific likelihood functions, P_ij(x_ij | e^1_ij, e^2_ij); one degree potential, f^1, which is the same for every vertex in E^1 and defines the prior P(E^1); and a second, f^2, that similarly defines the prior P(E^2). The likelihood functions P_ij are used both to assign likelihoods and to enforce problem constraints. Given our problem definition, if x_ij = 0 then e^1_ij = e^2_ij = 0; otherwise x_ij = 1 and e^1_ij = 1 - e^2_ij. We enforce the former constraint by setting P_ij(x_ij = 0 | e^1_ij, e^2_ij) = (1 - e^1_ij)(1 - e^2_ij), and the latter by setting P_ij(x_ij = 1 | e^1_ij, e^2_ij) = 0 whenever e^1_ij = e^2_ij. This construction of P_ij simplifies the calculation of the \mu_{P_ij \to e^h_ij} messages and improves the computational efficiency of inference, because when x_ij = 0 we never need to update messages to and from the variables e^1_ij and e^2_ij. We complete the specification of P_ij(x_ij = 1 | e^1_ij, e^2_ij) as follows:

P_ij(x_ij = 1 | e^1_ij, e^2_ij) = \prod_m \theta_m^{y^m_{ij}} (1 - \theta_m)^{1 - y^m_{ij}}   if e^1_ij = 1 and e^2_ij = 0,
P_ij(x_ij = 1 | e^1_ij, e^2_ij) = \prod_m \psi_m^{y^m_{ij}} (1 - \psi_m)^{1 - y^m_{ij}}    if e^1_ij = 0 and e^2_ij = 1,

where {\theta_m} and {\psi_m} are naive Bayes parameters, \theta_m = \sum_{i,j} y^m_ij e^1_ij / \sum_{i,j} e^1_ij and \psi_m = \sum_{i,j} y^m_ij e^2_ij / \sum_{i,j} e^2_ij, respectively.

The degree potentials f^1(d) and f^2(d) are kernel density estimates fit to the degree distributions of the training-set subgraphs of E^1 and E^2, respectively. We use Gaussian kernels and set the width parameter (standard deviation) \sigma using leave-one-out cross-validation to maximize the total log density of the held-out datapoints. Each datapoint is the degree of a single vertex. Both degree potentials closely followed the training-set empirical degree distributions.

Untangling was done on the test-set subgraph of X. We initially set the \mu_{P_ij \to e^1_ij} messages equal to the likelihood functions P_ij, and we randomly initialized the \mu_{I^1_j \to e^1_ij} messages with samples from a normal distribution with mean 0 and variance 0.01. We then performed 40 iterations of the following message update order: \mu_{e^1_ij \to I^1_j}, \mu_{I^1_j \to e^1_ij}, \mu_{e^1_ij \to P_ij}, \mu_{P_ij \to e^2_ij}, \mu_{e^2_ij \to I^2_j}, \mu_{I^2_j \to e^2_ij}, \mu_{e^2_ij \to P_ij}, \mu_{P_ij \to e^1_ij}. We evaluated our untangling algorithm using an ROC curve, comparing the actual test-set subgraph of E^1 to the posterior marginal probabilities, \hat{P}(e^1_ij = 1 | X), estimated by our sum-product algorithm.
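A sketch of this style of evaluation is given below. The labels, posteriors, and baseline scores are synthetic stand-ins (in the paper they come from the gold-standard subgraph, the sum-product posteriors, and the fitted naive Bayes likelihoods), and the use of scikit-learn's roc_curve is our choice, not the authors'.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)

# Hypothetical stand-ins: true labels e^1_ij restricted to pairs with x_ij = 1,
# untangling posteriors P(e^1_ij = 1 | X), and baseline likelihood scores.
e1_true = rng.random(5000) < 0.1
posterior = np.clip(e1_true * 0.6 + rng.random(5000) * 0.4, 0, 1)  # fake scores, for illustration
baseline = np.clip(e1_true * 0.3 + rng.random(5000) * 0.7, 0, 1)

for name, score in [("untangling", posterior), ("baseline", baseline)]:
    fpr, tpr, _ = roc_curve(e1_true, score)
    mask = fpr <= 0.10              # focus on the low false-positive-rate regime, as in Figure 3a
    print(name, "TPR at 10% FPR:", tpr[mask][-1] if mask.any() else 0.0)
```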
Note that because the true interaction network is sparse (less than 0.2% of the 1.8 × 10^7 possible interactions are likely present [16]) and, in this case, true positive predictions are of greater biological interest than true negative predictions, we focus on the low false-positive-rate portions of the ROC curve. Figure 3a compares the performance of a classifier for e^1_ij based on thresholding \hat{P}(e^1_ij = 1 | X) to a baseline method based on thresholding the likelihood functions P_ij(x_ij = 1 | e^1_ij = 1, e^2_ij = 0). Note that because e^1_ij = 0 whenever x_ij = 0, we exclude the x_ij = 0 cases from our performance evaluation. The ROC curve shows that for the same low false positive rate, untangling produces 50-100% more true positives than the baseline method. Figure 3b shows that the degree potential, the true degree distribution, and the predicted degree distribution are all comparable. The slight overprediction of the true degree distribution may arise because the degree potential f^1 that defines P(E^1) is not equal to the expected degree distribution of graphs sampled from the distribution P(E^1).

5 Summary and Related Work

Related work includes other algorithms for structure-based graph denoising [17, 18]. These algorithms use structural properties of the observed graph to score edges and rely on the true graph having a surprisingly large number of three- (or four-) edge cycles compared to the noise graph. In contrast, we place graph generation in a probabilistic framework; our algorithm computes structural fit in the hidden graph, where this computation is not affected by the noise graph(s); and we allow for multiple sources of observation noise, each with its own structural properties. After submitting this paper to the NIPS conference, we discovered [19], in which a degree-based graph structure prior is used to denoise (but not untangle) observed graphs. That paper addresses denoising in directed as well as undirected graphs; however, the prior they use is not amenable to deriving an efficient sum-product algorithm. Instead, they use Markov chain Monte Carlo to do approximate inference in a hidden graph containing 40 vertices. It is not clear how well this approach scales to the ~3000-vertex graphs that we are using.

In summary, the contributions of the work described in this paper include: a general formulation of the problem of graph untangling as inference in a factor graph; an efficient approximate inference algorithm for a rich class of degree-based structure priors; and a set of reliability scores (i.e., edge posteriors) for interactions from a current version of the yeast protein-protein interaction network.

References

[1] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439), October 1999.
[2] A. Rzhetsky and S. M. Gomez. Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics, pages 988-996, 2001.
[3] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the Internet topology. Computer Communications Review, 29, 1999.
[4] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási. The large-scale organization of metabolic networks. Nature, 407, October 2000.
[5] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA, 1988.
[6] D. J. C. MacKay and R. M. Neal. Near Shannon limit performance of low density parity check codes. Electronics Letters, 32(18):1645-1646, August 1996.
Reprinted in Electronics Letters, vol. 33, March 1997, 457-458.
[7] B. J. Frey and F. R. Kschischang. Probability propagation and iterative decoding. In Proceedings of the 1996 Allerton Conference on Communication, Control and Computing, 1996.
[8] B. J. Frey, R. Koetter, and N. Petrovic. Very loopy belief propagation for unwrapping phase images. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[9] M. Mézard, G. Parisi, and R. Zecchina. Analytic and algorithmic solution of random satisfiability problems. Science, 297:812-815, 2002.
[10] B. J. Frey and D. J. C. MacKay. Trellis-constrained codes. In Proceedings of the 35th Allerton Conference on Communication, Control and Computing 1997, 1998.
[11] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, Special Issue on Codes on Graphs and Iterative Algorithms, 47(2):498-519, February 2001.
[12] B. J. Frey. Factor graphs: A unification of directed and undirected graphical models. University of Toronto Technical Report PSI-2003-02, 2003.
[13] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence, Stockholm, Sweden, 1999.
[14] W. Freeman and E. Pasztor. Learning low-level vision. In Proceedings of the International Conference on Computer Vision, pages 1182-1189, 1999.
[15] M. I. Jordan. An Introduction to Learning in Graphical Models. 2004. In preparation.
[16] C. von Mering et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 2002.
[17] R. Saito, H. Suzuki, and Y. Hayashizaki. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, pages 756-763, 2003.
[18] D. S. Goldberg and F. P. Roth. Assessing experimentally derived interactions in a small world. Proceedings of the National Academy of Sciences, 2003.
[19] S. M. Gomez and A. Rzhetsky. Towards the prediction of complete protein-protein interaction networks. In Pacific Symposium on Biocomputing, pages 413-424, 2002.

3 0.6419577 168 nips-2003-Salient Boundary Detection using Ratio Contour

Author: Song Wang, Toshiro Kubota, Jeffrey M. Siskind

Abstract: This paper presents a novel graph-theoretic approach, named ratio contour, to extract perceptually salient boundaries from a set of noisy boundary fragments detected in real images. The boundary saliency is defined using the Gestalt laws of closure, proximity, and continuity. This paper first constructs an undirected graph with two different sets of edges: solid edges and dashed edges. The weights of solid and dashed edges measure the local saliency in and between boundary fragments, respectively. Then the most salient boundary is detected by searching for an optimal cycle in this graph with minimum average weight. The proposed approach guarantees the global optimality without introducing any biases related to region area or boundary length. We collect a variety of images for testing the proposed approach with encouraging results. 1
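As a toy illustration of the minimum-average-weight-cycle objective mentioned in this abstract (and nothing more — the ratio-contour construction itself works on an undirected graph of solid and dashed edges and does not enumerate cycles), one can brute-force the criterion on a small directed graph:

```python
import networkx as nx

# Tiny directed example; only meant to illustrate the minimum-average-weight-cycle
# objective itself, not the paper's actual graph construction.
G = nx.DiGraph()
G.add_weighted_edges_from([
    (0, 1, 2.0), (1, 2, 3.0), (2, 0, 1.0),
    (2, 3, 0.5), (3, 2, 0.5),
])

best_cycle, best_mean = None, float("inf")
for cycle in nx.simple_cycles(G):
    edges = list(zip(cycle, cycle[1:] + cycle[:1]))
    mean_w = sum(G[u][v]["weight"] for u, v in edges) / len(edges)
    if mean_w < best_mean:
        best_cycle, best_mean = cycle, mean_w

print(best_cycle, best_mean)   # the 2<->3 cycle, mean weight 0.5
```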

4 0.48831403 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

Author: Xuanlong Nguyen, Michael I. Jordan

Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1

5 0.48354217 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation

Author: Chen Yanover, Yair Weiss

Abstract: Loopy belief propagation (BP) has been successfully used in a number of difficult graphical models to find the most probable configuration of the hidden variables. In applications ranging from protein folding to image analysis one would like to find not just the best configuration but rather the top M . While this problem has been solved using the junction tree formalism, in many real world problems the clique size in the junction tree is prohibitively large. In this work we address the problem of finding the M best configurations when exact inference is impossible. We start by developing a new exact inference algorithm for calculating the best configurations that uses only max-marginals. For approximate inference, we replace the max-marginals with the beliefs calculated using max-product BP and generalized BP. We show empirically that the algorithm can accurately and rapidly approximate the M best configurations in graphs with hundreds of variables. 1
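For intuition, the brute-force baseline that the paper's max-marginal machinery is designed to avoid looks like this on a tiny, hypothetical three-variable chain model; enumerating all 2^n configurations is exactly what becomes infeasible at scale.

```python
import itertools
import numpy as np

# Brute-force top-M configurations for a tiny chain model (illustrative numbers).
unary = np.log(np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]))   # 3 binary variables
pair = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))                # shared pairwise table

def log_score(cfg):
    s = sum(unary[i, x] for i, x in enumerate(cfg))
    s += sum(pair[cfg[i], cfg[i + 1]] for i in range(len(cfg) - 1))
    return s

M = 3
configs = sorted(itertools.product([0, 1], repeat=3), key=log_score, reverse=True)
for cfg in configs[:M]:
    print(cfg, round(float(log_score(cfg)), 3))
```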

6 0.48237863 189 nips-2003-Tree-structured Approximations by Expectation Propagation

7 0.4747752 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays

8 0.42845395 118 nips-2003-Link Prediction in Relational Data

9 0.42518833 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

10 0.42011592 169 nips-2003-Sample Propagation

11 0.40700835 99 nips-2003-Kernels for Structured Natural Language Data

12 0.39693138 85 nips-2003-Human and Ideal Observers for Detecting Image Curves

13 0.37747481 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus

14 0.36044478 75 nips-2003-From Algorithmic to Subjective Randomness

15 0.35025993 121 nips-2003-Log-Linear Models for Label Ranking

16 0.34769714 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation

17 0.34156951 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

18 0.33727959 32 nips-2003-Approximate Expectation Maximization

19 0.33457285 51 nips-2003-Design of Experiments via Information Theory

20 0.33108157 71 nips-2003-Fast Embedding of Sparse Similarity Graphs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (8, 0.255), (11, 0.031), (29, 0.026), (30, 0.027), (35, 0.053), (53, 0.117), (69, 0.023), (71, 0.088), (76, 0.05), (85, 0.056), (91, 0.109), (99, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80638015 30 nips-2003-Approximability of Probability Distributions

Author: Alina Beygelzimer, Irina Rish

Abstract: We consider the question of how well a given distribution can be approximated with probabilistic graphical models. We introduce a new parameter, effective treewidth, that captures the degree of approximability as a tradeoff between the accuracy and the complexity of approximation. We present a simple approach to analyzing achievable tradeoffs that exploits the threshold behavior of monotone graph properties, and provide experimental results that support the approach. 1

2 0.62196833 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

Author: Tong Zhang

Abstract: In this paper we obtain convergence bounds for the concentration of Bayesian posterior distributions (around the true distribution) using a novel method that simplifies and enhances previous results. Based on the analysis, we also introduce a generalized family of Bayesian posteriors, and show that the convergence behavior of these generalized posteriors is completely determined by the local prior structure around the true distribution. This important and surprising robustness property does not hold for the standard Bayesian posterior in that it may not concentrate when there exist “bad” prior structures even at places far away from the true distribution. 1

3 0.62042546 107 nips-2003-Learning Spectral Clustering

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
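A generic spectral-clustering sketch in the spirit of this abstract (a standard normalized-affinity pipeline, not the paper's specific cost function or its power-method approximation) might look like:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Cluster points given a symmetric, nonnegative similarity matrix W."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L_sym = D_inv_sqrt @ W @ D_inv_sqrt              # normalized affinity matrix
    vals, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, -k:]                                  # top-k eigenvectors
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```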

4 0.61878067 126 nips-2003-Measure Based Regularization

Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein

Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1

5 0.61518425 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
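For reference, the closed-form GP machinery this abstract leans on is standard GP regression; a minimal sketch (with an assumed squared-exponential kernel and noise level, unrelated to the paper's actual experiments) is:

```python
import numpy as np

def gp_posterior_mean(X_train, y_train, X_test, lengthscale=1.0, noise=0.1):
    """Closed-form GP regression posterior mean with a squared-exponential kernel."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq / lengthscale ** 2)
    K = k(X_train, X_train) + noise ** 2 * np.eye(len(X_train))
    alpha = np.linalg.solve(K, y_train)        # (K + sigma^2 I)^{-1} y
    return k(X_test, X_train) @ alpha
```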

6 0.61176324 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus

7 0.61011314 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering

8 0.60985911 189 nips-2003-Tree-structured Approximations by Expectation Propagation

9 0.60972005 113 nips-2003-Learning with Local and Global Consistency

10 0.60965848 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

11 0.60906643 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

12 0.60896099 81 nips-2003-Geometric Analysis of Constrained Curves

13 0.60877347 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

14 0.60840756 143 nips-2003-On the Dynamics of Boosting

15 0.6069749 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

16 0.60643017 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

17 0.60638261 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

18 0.60623395 158 nips-2003-Policy Search by Dynamic Programming

19 0.60423255 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

20 0.60293067 73 nips-2003-Feature Selection in Clustering Problems