nips nips2011 nips2011-117 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Animashree Anandkumar, Vincent Tan, Alan S. Willsky
Abstract: We consider the problem of Ising and Gaussian graphical model selection given n i.i.d. samples from the model. We propose an efficient threshold-based algorithm for structure estimation based on conditional mutual information thresholding. This simple local algorithm requires only low-order statistics of the data and decides whether two nodes are neighbors in the unknown graph. We identify graph families for which the proposed algorithm has low sample and computational complexities. Under some transparent assumptions, we establish that the proposed algorithm is structurally consistent (or sparsistent) when the number of samples scales as n = Ω(Jmin^{−4} log p), where p is the number of nodes and Jmin is the minimum edge potential. We also develop novel non-asymptotic techniques for obtaining necessary conditions for graphical model selection. Keywords: Graphical model selection, high-dimensional learning, local-separation property, necessary conditions, typical sets, Fano’s inequality.
Reference: text
sentIndex sentText sentNum sentScore
1 We propose an efficient threshold-based algorithm for structure estimation based on conditional mutual information thresholding. [sent-20, score-0.222]
2 We identify graph families for which the proposed algorithm has low sample and computational complexities. [sent-22, score-0.263]
3 Under some transparent assumptions, we establish that the proposed algorithm is structurally consistent (or sparsistent) when the number of samples scales as n = Ω(Jmin^{−4} log p), where p is the number of nodes and Jmin is the minimum edge potential. [sent-23, score-0.458]
4 We also develop novel non-asymptotic techniques for obtaining necessary conditions for graphical model selection. [sent-24, score-0.21]
5 1 Introduction The formalism of probabilistic graphical models can be employed to represent dependencies among a large set of random variables in the form of a graph [1]. [sent-26, score-0.335]
6 An important challenge in the study of graphical models is to learn the unknown graph using samples drawn from the graphical model. [sent-27, score-0.497]
7 One of the goals is to characterize tractable model classes for which consistent graphical model selection can be guaranteed with low computational and sample complexities. [sent-30, score-0.291]
8 The seminal work by Chow and Liu [3] proposed an efficient algorithm for maximum-likelihood structure estimation in tree-structured graphical models by reducing the problem to a maximum weight spanning tree problem. [sent-31, score-0.259]
9 We characterize the class of Ising and Gaussian graphical models for which we can guarantee efficient and consistent structure estimation using this simple algorithm. [sent-36, score-0.314]
10 These graphs are especially relevant in modeling social networks [10, 11]. [sent-38, score-0.281]
11 1 Summary of Results We propose an algorithm for structure estimation, termed conditional mutual information thresholding (CMIT), which computes the minimum empirical conditional mutual information of a given node pair over conditioning sets of bounded cardinality η. [sent-40, score-0.544]
12 If the minimum exceeds a given threshold (depending on the number of samples n and the number of nodes p), the node pair is declared as an edge. [sent-41, score-0.302]
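To make the CMIT decision rule above concrete, the following is a minimal Python sketch for binary (Ising-type) samples; the function names, the brute-force enumeration of conditioning sets, and the use of numpy are illustrative assumptions rather than the authors' implementation, and the threshold ξn,p is supplied as an input.

import numpy as np
from itertools import combinations

def empirical_cond_mi(X, i, j, S):
    # Empirical conditional mutual information I(X_i; X_j | X_S) for discrete
    # samples X (an n-by-p array, e.g. entries in {-1, +1} for an Ising model).
    n = X.shape[0]
    S = list(S)
    keys, counts = np.unique(X[:, [i, j] + S], axis=0, return_counts=True)
    mi = 0.0
    for k, cnt in zip(keys, counts):
        p_ijS = cnt / n                      # P_hat(x_i, x_j, x_S)
        mask_S = np.all(X[:, S] == k[2:], axis=1) if S else np.ones(n, dtype=bool)
        p_S = mask_S.mean()                  # P_hat(x_S)
        p_iS = np.mean(mask_S & (X[:, i] == k[0]))   # P_hat(x_i, x_S)
        p_jS = np.mean(mask_S & (X[:, j] == k[1]))   # P_hat(x_j, x_S)
        mi += p_ijS * np.log(p_ijS * p_S / (p_iS * p_jS))
    return mi

def cmit(X, eta, xi):
    # Declare (i, j) an edge iff the minimum empirical conditional mutual
    # information over conditioning sets of cardinality at most eta exceeds xi.
    n, p = X.shape
    edges = set()
    for i, j in combinations(range(p), 2):
        others = [k for k in range(p) if k not in (i, j)]
        min_mi = min(empirical_cond_mi(X, i, j, S)
                     for r in range(eta + 1)
                     for S in combinations(others, r))
        if min_mi > xi:
            edges.add((i, j))
    return edges

For each pair, the search ranges over all conditioning sets of cardinality at most η among the remaining nodes, so the overall cost stays polynomial in p only because η is held to a small constant.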
13 We establish that under a set of mild and transparent assumptions, structure learning is consistent in high dimensions for CMIT when the number of samples scales as n = Ω(Jmin^{−4} log p), for a p-node graph, where Jmin is the minimum (absolute) edge potential in the model. [sent-44, score-0.357]
14 We also develop novel techniques to obtain necessary conditions for consistent structure estimation of Erdős-Rényi random graphs. [sent-45, score-0.391]
15 We obtain non-asymptotic bounds on the number of samples n in terms of the expected degree and the number of nodes of the model. [sent-46, score-0.199]
16 Our results have many ramifications: we explicitly characterize the tradeoff between various graph parameters such as the maximum degree, girth and the strength of edge potentials for efficient and consistent structure estimation. [sent-48, score-0.625]
17 1 Graphical Models A p-dimensional graphical model is a family of p-dimensional multivariate distributions Markov on some undirected graph G = (V, E) [1]. [sent-60, score-0.296]
18 Each node i ∈ V in the graph is associated with a random variable Xi taking values in a set X . [sent-61, score-0.218]
19 A random vector X = (X1 , . . . , Xp ) with joint distribution P satisfies the global Markov property with respect to a graph G if, for all disjoint sets A, B ⊂ V separated by a set S in G, we have P (xA , xB |xS ) = P (xA |xS )P (xB |xS ). [sent-67, score-0.216]
20 The Hammersley-Clifford theorem states that under the positivity condition, given by P (x) > 0 for all x ∈ X p [1], the model P satisfies the global Markov property according to a graph G if and only if it factorizes according to the cliques of G. [sent-69, score-0.216]
21 Our results on structure learning depend on Jmin and Jmax , which is fairly natural – intuitively, models with edge potentials which are “too small” or “too large” are harder to learn than those with comparable potentials. [sent-81, score-0.203]
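For orientation, a common parameterization of the Ising model, under which Jmin and Jmax are the smallest and largest absolute edge potentials, is sketched in LaTeX below; the paper's own equation (4), including any external field terms and normalization, is not reproduced in this extraction, so the exact form is an assumption.

% Illustrative Ising parameterization (partition function and external fields omitted):
P(\mathbf{x}) \;\propto\; \exp\Big( \sum_{(i,j) \in E} J_{ij}\, x_i x_j \Big),
\qquad x_i \in \{-1,+1\},
\qquad J_{\min} := \min_{(i,j)\in E} |J_{ij}|,
\qquad J_{\max} := \max_{(i,j)\in E} |J_{ij}|.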
22 We consider the problem of structure learning, which involves the estimation of the edge set of the graph G given n i.i.d. samples. [sent-87, score-0.298]
23 2 Tractable Graph Families We consider the class of graphical models Markov on a graph Gp belonging to some ensemble G(p) of graphs with p nodes. [sent-96, score-0.774]
24 We emphasize that in our formulation the graph ensemble G(p) can either be deterministic or random – in the latter, we also specify a probability measure over the set of graphs in G(p). [sent-97, score-0.608]
25 We say that asymptotically almost surely (a.a.s.) a graph G ∼ G(p) satisfies a certain property Q (for example, connectedness) if lim_{p→∞} P [Gp satisfies Q] = 1. [sent-100, score-0.216]
26 Intuitively, this means that graphs that have a vanishing probability of occurrence as p → ∞ are ignored. [sent-105, score-0.281]
27 Our conditions and theoretical guarantees will be based on this notion for random graph ensembles. [sent-106, score-0.21]
28 We now characterize the ensemble of graphs amenable for consistent structure estimation. [sent-107, score-0.559]
29 For γ ∈ N, let Bγ (i; G) denote the set of vertices within distance γ from node i with respect to graph G. [sent-108, score-0.218]
30 Definition 1 (γ-Local Separator) Given a graph G, a γ-local separator Sγ (i, j) between i and j, for (i, j) ∉ G, is a minimal vertex separator with respect to the subgraph Hγ,i . [sent-110, score-0.303]
31 In other words, the γ-local separator Sγ (i, j) separates nodes i and j with respect to paths in G of length at most γ. [sent-112, score-0.201]
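One way to compute a γ-local separator is to restrict attention to the subgraph Hγ,i induced by the ball Bγ (i; G) and take a minimum vertex cut between i and j there; the short Python sketch below assumes networkx as the tooling and that (i, j) is a non-edge, as in Definition 1.

import networkx as nx

def gamma_local_separator(G, i, j, gamma):
    # Ball of radius gamma around i, and the induced local subgraph H_{gamma,i}.
    ball = nx.single_source_shortest_path_length(G, i, cutoff=gamma)
    H = G.subgraph(ball.keys())
    if j not in H or not nx.has_path(H, i, j):
        return set()   # i and j are already separated in H_{gamma,i}
    # Minimum vertex cut between i and j inside the local subgraph.
    return nx.minimum_node_cut(H, i, j)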
32 We now characterize the ensemble of graphs based on the size of local separators. [sent-113, score-0.511]
33 Definition 2 ((η, γ)-Local Separation Property) An ensemble of graphs G(p; η, γ) satisfies the (η, γ)-local separation property if, for a.a.s. Gp ∈ G(p; η, γ), the γ-local separators Sγ (i, j) of all non-neighbor pairs (i, j) ∉ Gp have cardinality at most η (Eq. (5)). [sent-114, score-0.554]
34 In Section 3, we propose an efficient algorithm for graphical model selection when the underlying graph belongs to a graph ensemble G(p; η, γ) with sparse local node separators (i.e., small η). [sent-117, score-0.731]
35 Below we provide examples of three graph families which satisfy (5) for small η. [sent-120, score-0.233]
36 The notion of a.a.s. does not apply to deterministic graph ensembles G(p) where no randomness is assumed, and in this setting, we assume that the property Q holds for every graph in the ensemble. [sent-124, score-0.413]
37 3 A minimal separator is a separator of smallest cardinality. [sent-125, score-0.224]
38 (Example 1) Bounded Degree: Any (deterministic or random) ensemble of degree-bounded graphs GDeg (p, ∆) satisfies the (η, γ)-local separation property with η = ∆ and every γ ∈ N. [sent-126, score-0.554]
39 Thus, our algorithm consistently recovers graphs with small (bounded) degrees (∆ = O(1)). [sent-127, score-0.313]
40 (Example 2) Bounded Local Paths: The (η, γ)-local separation property also holds when there are at most η paths of length at most γ in G between any two nodes (henceforth, termed as the (η, γ)-local paths property). [sent-131, score-0.261]
41 Thus, a graph with girth g (length of the shortest cycle) satisfies the (η, γ)-local separation property with η = 1 and γ = g. [sent-133, score-0.475]
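As a side illustration, the girth used in this example can be computed directly; the routine below is a simple, unoptimized sketch for an undirected simple graph and assumes networkx.

import networkx as nx

def girth(G):
    # Shortest cycle length: for every edge (u, v), the shortest cycle through
    # it has length 1 + (shortest u-v path avoiding that edge).
    g = float("inf")
    for u, v in list(G.edges()):
        G.remove_edge(u, v)
        try:
            g = min(g, nx.shortest_path_length(G, u, v) + 1)
        except nx.NetworkXNoPath:
            pass
        G.add_edge(u, v)
    return g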
42 107] and the random Cayley graphs [18] have large girths. [sent-135, score-0.281]
43 The girth condition can be weakened to allow for a small number of short cycles, while not allowing for overlapping cycles. [sent-136, score-0.246]
44 For instance, the ensemble of Erdős-Rényi graphs GER (p, c/p), where an edge between any node pair appears with probability c/p, independent of other node pairs, is locally tree-like. [sent-138, score-0.799]
45 Similar observations apply for the more general scale-free or power-law graphs [8, 19]. [sent-142, score-0.324]
46 Along similar lines, the ensemble of ∆-random regular graphs, denoted by GReg (p, ∆), which is the uniform ensemble of regular graphs with degree ∆, has no overlapping cycles of length at most Θ(log_{∆−1} p) a.a.s. [sent-143, score-0.774]
47 (Example 3) Small-World Graphs: The class of hybrid graphs or augmented graphs [8, Ch. [sent-147, score-0.586]
48 12] consists of graphs which are the union of two graphs: a “local” graph having short cycles and a “global” graph having small average distances between nodes. [sent-148, score-0.722]
49 Since the hybrid graph is the union of these local and global graphs, it simultaneously has large degrees and short cycles. [sent-149, score-0.284]
50 The simplest model GWatts (p, d, c/p), first studied by Watts and Strogatz [9], consists of the union of a d-dimensional grid and an Erdős-Rényi random graph with parameter c. [sent-150, score-0.372]
51 A graph G ∼ GWatts (p, d, c/p) a.a.s. satisfies the (η, γ)-local separation property in (5), with η = d + 2 and γ ≤ 4 log p / log c. [sent-153, score-0.284]
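A rough construction of the hybrid ensemble GWatts (p, d, c/p) described here, as the edge-wise union of a d-dimensional grid and an Erdős-Rényi graph, is sketched below; whether the grid is periodic and how p is matched to an integer grid side are assumptions of this sketch.

import networkx as nx

def watts_hybrid(p, d, c, seed=0):
    # "Local" graph: a d-dimensional (here toroidal) grid with roughly p nodes.
    side = max(2, int(round(p ** (1.0 / d))))
    grid = nx.grid_graph(dim=[side] * d, periodic=True)
    grid = nx.convert_node_labels_to_integers(grid)
    n = grid.number_of_nodes()                    # may differ slightly from p
    # "Global" graph: an Erdos-Renyi graph with edge probability c / p.
    er = nx.gnp_random_graph(n, c / p, seed=seed)
    # Hybrid (small-world) graph: union of the two edge sets on the same nodes.
    return nx.compose(grid, er)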
52 Similar observations apply for more general hybrid graphs studied in [8, Ch. 12]. [sent-154, score-0.348]
53 We require that the number of nodes p → ∞ to exploit the local separation properties of the class of graphs under consideration. [sent-159, score-0.438]
54 (A3) Local-Separation Property: We assume that the ensemble of graphs G(p; η, γ) satisfies the (η, γ)-local separation property with η, γ ∈ N satisfying η = O(1) and Jmin α^{−γ} = ω(1). (Footnote: two cycles are said to overlap if they have common vertices.) [sent-172, score-0.627]
55 We can weaken the second requirement in (9) to Jmin α^{−γ} = ω(1) for deterministic graph families (rather than random graph ensembles). [sent-175, score-0.402]
56 We provide examples of graphs where the above assumptions are met. [sent-186, score-0.303]
57 Gaussian Models on Girth-bounded Graphs: Consider the ensemble of graphs GDeg,Girth (p; ∆, g) with maximum degree ∆ and girth g. [sent-187, score-0.757]
58 (10) In (10), we notice a natural tradeoff between the girth and the maximum degree of the graph ensemble for successful estimation under our framework: graphs with large degrees can be learned efficiently if their girths are large. [sent-191, score-0.986]
59 Indeed, in the extreme case of trees, which have infinite girth, (10) imposes no constraint on the node degrees for consistent graphical model selection; recall that the Chow-Liu algorithm [3] is an efficient method for model selection on tree-structured graphical models. [sent-192, score-0.43]
60 Note that the condition in (10) allows for the maximum degree bound ∆ to grow with the number of nodes as long as the girth g also grows appropriately. [sent-193, score-0.403]
61 For example, if the maximum degree scales as ∆ = O(poly(log p)) and the girth scales as g = Ω(log log p), then (10) is satisfied. [sent-194, score-0.413]
62 This implies that graphs with fairly large degrees and short cycles can be recovered consistently using the algorithm in Section 3. [sent-195, score-0.416]
63 Gaussian Models on Erdős-Rényi and Small-World Graphs: We can also conclude that, a.a.s., an [sent-197, score-0.203]
64 Erdős-Rényi graph G ∼ GER (p, c/p) satisfies (9) with η = 2 when c = O(poly(log p)) under the best possible scaling for Jmin subject to the walk-summability constraint in (7). [sent-199, score-0.372]
65 For the ensemble of graphs GDeg,Girth (p; ∆, g) with degree ∆ and girth g, we can establish that J ∗ = Θ(1/∆). [sent-202, score-0.793]
66 Similarly, for both the Erdős-Rényi graph ensemble GER (p, c/p) and the small-world ensemble GWatts (p, d, c/p), we can establish that the threshold J ∗ = Θ(1/c), and thus, the observations made for the Gaussian setting hold for the Ising model as well. [sent-206, score-0.835]
67 If the above minimum value exceeds the given threshold ξn,p , then the node pair is declared as an edge. [sent-214, score-0.207]
68 The empirical conditional mutual information is obtained by first computing the empirical distribution and then computing its conditional mutual information. [sent-217, score-0.347]
69 That is, node pairs which can be separated in the unknown graph G. [sent-220, score-0.218]
70 Given a Gaussian graphical model or an Ising model Markov on a graph Gp ∼ G(p; η, γ), CMIT(xn ; ξn,p , η) is structurally consistent. [sent-228, score-0.32]
71 That is, lim_{n,p→∞} P [CMIT(xn ; ξn,p , η) = Gp ] = 1 (13). Consistency guarantee: The CMIT algorithm consistently recovers the structure of the graphical models with probability tending to one; the probability measure here is with respect to both the graph and the samples. [sent-230, score-0.377]
72 Sample complexity: The sample complexity of CMIT scales as Ω(Jmin^{−4} log p) and is favorable when the minimum edge potential Jmin is large. [sent-231, score-0.277]
73 It can be established that for both Gaussian and Ising models Markov on a degree-bounded graph ensemble GDeg (p, ∆) with maximum degree ∆ and satisfying assumption (A3), the minimum sample complexity is given by n = Ω(∆^4 log p). [sent-235, score-0.658]
74 We can have improved guarantees for the Erdős-Rényi random graphs GER (p, c/p). [sent-238, score-0.484]
75 Specifically, when the Erdős-Rényi random graphs have a bounded average degree (c = O(1)), we can obtain a minimum sample complexity of n = Ω(log p) for structure estimation of Ising models. [sent-246, score-0.78]
76 Thus, the complexity of learning sparse Erdős-Rényi random graphs is akin to learning trees in certain parameter regimes. [sent-248, score-0.529]
77 The sample complexity of structure estimation can be improved to n = Ω(Jmin^{−2} log p) by employing empirical conditional covariances for Gaussian models and empirical conditional variation distances in place of empirical conditional mutual information. [sent-249, score-0.487]
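For the Gaussian variant mentioned here, the empirical conditional covariance that replaces conditional mutual information can be obtained from the sample covariance by a Schur complement; the sketch below is an illustrative assumption about that computation, not the authors' code.

import numpy as np

def empirical_cond_cov(X, i, j, S):
    # Empirical conditional covariance Sigma_hat(i, j | S) =
    #   Sigma_ij - Sigma_iS Sigma_SS^{-1} Sigma_Sj, for an n-by-p sample matrix X.
    Sigma = np.cov(X, rowvar=False)
    S = list(S)
    if not S:
        return Sigma[i, j]
    return Sigma[i, j] - Sigma[i, S] @ np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[S, j])

The modified test then thresholds the minimum of |Σ(i, j | S)| over conditioning sets S of cardinality at most η, in direct analogy with CMIT.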
78 Our sample complexity (using conditional covariances) n = Ω(Jmin^{−2} log p) is the same in terms of Jmin , while there is no explicit dependence on the maximum degree ∆. [sent-255, score-0.299]
79 For structure estimation of Ising models, the work in [6] considers ℓ1-penalized logistic regression, which has a sample complexity of n = Ω(∆^3 log p) for a degree-bounded ensemble GDeg (p, ∆) satisfying certain “incoherence” conditions. [sent-257, score-0.346]
80 The sample complexity of CMIT, given by n = Ω(∆^4 log p), is slightly worse, while the modified algorithm described previously has a sample complexity of n = Ω(∆^2 log p), for general degree-bounded ensembles. [sent-258, score-0.236]
81 Additionally, under the CMIT algorithm, we can guarantee an improved sample complexity of n = Ω(c^4 log p) for Erdős-Rényi random graphs GER (p, c/p) and small-world graphs GWatts (p, d, c/p), since the average degree c/2 is typically much smaller than the maximum degree ∆. [sent-259, score-1.114]
82 (i) We establish that for any two non-neighbors (i, j) ∉ G, the minimum conditional mutual information in (11) (based on exact statistics) does not exceed the threshold ξn,p . [sent-264, score-0.321]
83 (ii) Similarly, we also establish that the conditional mutual information in (11) exceeds the threshold ξn,p for all neighbors (i, j) ∈ G. [sent-265, score-0.298]
84 To this end, we analyze the conditional mutual information when the conditioning set is a local separator between i and j and establish that it decays as p → ∞. [sent-269, score-0.392]
85 We focus on the ensemble of Erdős-Rényi graphs GER (p, c/p). [sent-274, score-0.642]
86 For the class of degree-bounded graphs GDeg (p, ∆), necessary conditions on sample complexity have been characterized previously [26] by considering a certain (restricted) set of ensembles. [sent-275, score-0.439]
87 2]) turns out to be too weak for the class of Erdős-Rényi graphs GER (p, c/p). [sent-277, score-0.484]
88 We provide novel necessary conditions for structure learning of Erdős-Rényi graphs. [sent-278, score-0.328]
89 Recall that a graph G is drawn from the ensemble of Erdős-Rényi graphs GER (p, c/p). [sent-280, score-0.281]
90 It is desired to derive tight necessary conditions on the number of samples n (as a function of the average degree c/2 and the number of nodes p) so that the probability of error Pe^{(p)} := P (Ĝ(X^n ) ≠ G) → 0 as the number of nodes p tends to infinity. [sent-289, score-0.342]
91 Again, note that the probability measure P is with respect to both the Erdős-Rényi graph and the samples. [sent-290, score-0.372]
92 Our bound involves only the average degree c/2 and not the maximum degree of the graph, which is typically much larger than c [7]. [sent-299, score-0.231]
93 2], which (lower) bounds the probability of error Pe as a function of the conditional entropy H(G|X^n ) and the size of the set of all graphs with p nodes. [sent-310, score-0.335]
94 To ameliorate this problem, we focus our attention on the typical graphs for applying Fano’s inequality and not all graphs. [sent-312, score-0.332]
95 The set of typical graphs has a small cardinality but high probability when p is large. [sent-313, score-0.34]
96 We can show that (i) the probability of the typical set tends to one as p → ∞, (ii) the graphs in the typical set are almost uniformly distributed (the asymptotic equipartition property), (iii) the cardinality of the typical set is small relative to the set of all graphs. [sent-315, score-0.452]
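In hedged form, the typical-set refinement of Fano's inequality can be summarized by the LaTeX sketch below; the exact constants and the precise definition of the typical set Tδ follow the paper's proof and are not reproduced here.

% Standard (weakened) Fano bound over the full ensemble of graphs:
P_e \;\ge\; \frac{H(G \mid X^n) - 1}{\log |\mathcal{G}(p)|}.
% Restricting the argument to a typical set \mathcal{T}_\delta with
% P(\mathcal{T}_\delta) \to 1 and nearly uniform probabilities (asymptotic
% equipartition) effectively replaces \log |\mathcal{G}(p)| by the much smaller
% \log |\mathcal{T}_\delta|, yielding non-asymptotic necessary conditions on n
% in terms of the average degree c/2 and the number of nodes p.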
97 We presented a simple local algorithm for structure estimation with low computational and sample complexities under a set of mild and transparent conditions. [sent-318, score-0.199]
98 This algorithm succeeds on a wide range of graph ensembles such as the Erdős-Rényi ensemble, small-world networks, etc. [sent-319, score-0.4]
99 We also employed novel information-theoretic techniques for establishing necessary conditions for graphical model selection. [sent-320, score-0.21]
100 Virag, “On the girth of random cayley graphs,” Random Structures & Algorithms, vol. [sent-423, score-0.226]
wordName wordTfidf (topN-words)
[('jmin', 0.539), ('ising', 0.286), ('graphs', 0.281), ('cmit', 0.244), ('nyi', 0.203), ('erd', 0.203), ('girth', 0.191), ('ger', 0.183), ('graph', 0.169), ('ensemble', 0.158), ('graphical', 0.127), ('jg', 0.127), ('separator', 0.112), ('jmax', 0.104), ('degree', 0.104), ('fano', 0.098), ('mutual', 0.098), ('gwatts', 0.087), ('cycles', 0.073), ('transparent', 0.07), ('willsky', 0.07), ('gdeg', 0.07), ('rgp', 0.07), ('xs', 0.069), ('gp', 0.069), ('separation', 0.068), ('families', 0.064), ('threshold', 0.063), ('potentials', 0.063), ('nodes', 0.06), ('establish', 0.059), ('edge', 0.059), ('gaussian', 0.057), ('conditional', 0.054), ('strogatz', 0.052), ('node', 0.049), ('minimum', 0.047), ('property', 0.047), ('markov', 0.046), ('watts', 0.046), ('complexity', 0.045), ('characterize', 0.043), ('xn', 0.043), ('log', 0.043), ('anandkumar', 0.042), ('structure', 0.042), ('necessary', 0.042), ('conditions', 0.041), ('conditioning', 0.04), ('pe', 0.039), ('incoherence', 0.039), ('models', 0.039), ('consistent', 0.035), ('satis', 0.035), ('cayley', 0.035), ('equipartition', 0.035), ('samples', 0.035), ('tan', 0.034), ('hb', 0.034), ('cardinality', 0.034), ('poly', 0.032), ('degrees', 0.032), ('summable', 0.031), ('sample', 0.03), ('selection', 0.03), ('short', 0.03), ('paths', 0.029), ('local', 0.029), ('ensembles', 0.028), ('termed', 0.028), ('clarendon', 0.028), ('estimation', 0.028), ('asymptotic', 0.027), ('potential', 0.027), ('scales', 0.026), ('tractable', 0.026), ('ravikumar', 0.026), ('inequality', 0.026), ('hold', 0.025), ('chung', 0.025), ('condition', 0.025), ('correlation', 0.025), ('typical', 0.025), ('hybrid', 0.024), ('decay', 0.024), ('preprint', 0.024), ('discrete', 0.024), ('author', 0.024), ('chow', 0.024), ('converse', 0.024), ('structurally', 0.024), ('declared', 0.024), ('banerjee', 0.024), ('exceeds', 0.024), ('edges', 0.024), ('maximum', 0.023), ('xb', 0.023), ('assumptions', 0.022), ('subgraph', 0.022), ('tanh', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
Author: Animashree Anandkumar, Vincent Tan, Alan S. Willsky
Abstract: We consider the problem of Ising and Gaussian graphical model selection given n i.i.d. samples from the model. We propose an efficient threshold-based algorithm for structure estimation based on conditional mutual information thresholding. This simple local algorithm requires only low-order statistics of the data and decides whether two nodes are neighbors in the unknown graph. We identify graph families for which the proposed algorithm has low sample and computational complexities. Under some transparent assumptions, we establish that the proposed algorithm is structurally consistent (or sparsistent) when the number of samples scales as n = Ω(Jmin^{−4} log p), where p is the number of nodes and Jmin is the minimum edge potential. We also develop novel non-asymptotic techniques for obtaining necessary conditions for graphical model selection. Keywords: Graphical model selection, high-dimensional learning, local-separation property, necessary conditions, typical sets, Fano’s inequality.
2 0.14710341 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum nodedegree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
3 0.11358344 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
Author: Shilin Ding, Grace Wahba, Xiaojin Zhu
Abstract: In discrete undirected graphical models, the conditional independence of node labels Y is specified by the graph structure. We study the case where there is another input random vector X (e.g. observed features) such that the distribution P (Y | X) is determined by functions of X that characterize the (higher-order) interactions among the Y ’s. The main contribution of this paper is to learn the graph structure and the functions conditioned on X at the same time. We prove that discrete undirected graphical models with feature X are equivalent to multivariate discrete models. The reparameterization of the potential functions in graphical models by conditional log odds ratios of the latter offers advantages in representation of the conditional independence structure. The functional spaces can be flexibly determined by kernels. Additionally, we impose a Structure Lasso (SLasso) penalty on groups of functions to learn the graph structure. These groups with overlaps are designed to enforce hierarchical function selection. In this way, we are able to shrink higher order interactions to obtain a sparse graph structure. 1
4 0.10880911 213 nips-2011-Phase transition in the family of p-resistances
Author: Morteza Alamgir, Ulrike V. Luxburg
Abstract: We study the family of p-resistances on graphs for p 1. This family generalizes the standard resistance distance. We prove that for any fixed graph, for p = 1 the p-resistance coincides with the shortest path distance, for p = 2 it coincides with the standard resistance distance, and for p ! 1 it converges to the inverse of the minimal s-t-cut in the graph. Secondly, we consider the special case of random geometric graphs (such as k-nearest neighbor graphs) when the number n of vertices in the graph tends to infinity. We prove that an interesting phase transition takes place. There exist two critical thresholds p⇤ and p⇤⇤ such that if p < p⇤ , then the p-resistance depends on meaningful global properties of the graph, whereas if p > p⇤⇤ , it only depends on trivial local quantities and does not convey any useful information. We can explicitly compute the critical values: p⇤ = 1 + 1/(d 1) and p⇤⇤ = 1 + 1/(d 2) where d is the dimension of the underlying space (we believe that the fact that there is a small gap between p⇤ and p⇤⇤ is an artifact of our proofs). We also relate our findings to Laplacian regularization and suggest to use q-Laplacians as regularizers, where q satisfies 1/p⇤ + 1/q = 1. 1
5 0.10184041 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
Author: Jiarong Jiang, Piyush Rai, Hal Daume
Abstract: We consider a general inference setting for discrete probabilistic graphical models where we seek maximum a posteriori (MAP) estimates for a subset of the random variables (max nodes), marginalizing over the rest (sum nodes). We present a hybrid message-passing algorithm to accomplish this. The hybrid algorithm passes a mix of sum and max messages depending on the type of source node (sum or max). We derive our algorithm by showing that it falls out as the solution of a particular relaxation of a variational framework. We further show that the Expectation Maximization algorithm can be seen as an approximation to our algorithm. Experimental results on synthetic and real-world datasets, against several baselines, demonstrate the efficacy of our proposed algorithm. 1
6 0.092176393 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs
7 0.082493968 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
8 0.080743261 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
9 0.076834261 123 nips-2011-How biased are maximum entropy models?
10 0.075395495 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
11 0.071926281 199 nips-2011-On fast approximate submodular minimization
12 0.069577388 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
13 0.069440112 67 nips-2011-Data Skeletonization via Reeb Graphs
14 0.065528512 31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding
15 0.064652756 265 nips-2011-Sparse recovery by thresholded non-negative least squares
16 0.063092582 229 nips-2011-Query-Aware MCMC
17 0.061637554 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
18 0.058968231 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm
19 0.056719001 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
20 0.056694031 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
topicId topicWeight
[(0, 0.164), (1, -0.002), (2, -0.035), (3, -0.068), (4, -0.057), (5, -0.082), (6, -0.118), (7, -0.045), (8, 0.097), (9, -0.117), (10, -0.064), (11, -0.003), (12, -0.103), (13, -0.005), (14, 0.012), (15, 0.071), (16, 0.136), (17, -0.081), (18, -0.049), (19, -0.007), (20, -0.099), (21, 0.101), (22, 0.029), (23, 0.032), (24, 0.119), (25, -0.064), (26, 0.022), (27, 0.031), (28, -0.024), (29, 0.025), (30, 0.021), (31, -0.066), (32, 0.051), (33, 0.135), (34, -0.034), (35, -0.107), (36, 0.016), (37, -0.057), (38, -0.073), (39, -0.097), (40, 0.102), (41, 0.01), (42, -0.028), (43, -0.008), (44, -0.012), (45, -0.022), (46, -0.015), (47, 0.021), (48, -0.052), (49, 0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.95705062 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
Author: Animashree Anandkumar, Vincent Tan, Alan S. Willsky
Abstract: We consider the problem of Ising and Gaussian graphical model selection given n i.i.d. samples from the model. We propose an efficient threshold-based algorithm for structure estimation based on conditional mutual information thresholding. This simple local algorithm requires only low-order statistics of the data and decides whether two nodes are neighbors in the unknown graph. We identify graph families for which the proposed algorithm has low sample and computational complexities. Under some transparent assumptions, we establish that the proposed algorithm is structurally consistent (or sparsistent) when the number of samples scales as n = Ω(Jmin^{−4} log p), where p is the number of nodes and Jmin is the minimum edge potential. We also develop novel non-asymptotic techniques for obtaining necessary conditions for graphical model selection. Keywords: Graphical model selection, high-dimensional learning, local-separation property, necessary conditions, typical sets, Fano’s inequality.
2 0.76871932 67 nips-2011-Data Skeletonization via Reeb Graphs
Author: Xiaoyin Ge, Issam I. Safa, Mikhail Belkin, Yusu Wang
Abstract: Recovering hidden structure from complex and noisy non-linear data is one of the most fundamental problems in machine learning and statistical inference. While such data is often high-dimensional, it is of interest to approximate it with a lowdimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional. Furthermore, there are many scenarios where the underlying structure is graph-like, e.g, river/road networks or various trajectories. In this paper, we develop a framework to extract, as well as to simplify, a one-dimensional ”skeleton” from unorganized data using the Reeb graph. Our algorithm is very simple, does not require complex optimizations and can be easily applied to unorganized high-dimensional data such as point clouds or proximity graphs. It can also represent arbitrary graph structures in the data. We also give theoretical results to justify our method. We provide a number of experiments to demonstrate the effectiveness and generality of our algorithm, including comparisons to existing methods, such as principal curves. We believe that the simplicity and practicality of our algorithm will help to promote skeleton graphs as a data analysis tool for a broad range of applications.
3 0.74580383 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
Author: Shilin Ding, Grace Wahba, Xiaojin Zhu
Abstract: In discrete undirected graphical models, the conditional independence of node labels Y is specified by the graph structure. We study the case where there is another input random vector X (e.g. observed features) such that the distribution P (Y | X) is determined by functions of X that characterize the (higher-order) interactions among the Y ’s. The main contribution of this paper is to learn the graph structure and the functions conditioned on X at the same time. We prove that discrete undirected graphical models with feature X are equivalent to multivariate discrete models. The reparameterization of the potential functions in graphical models by conditional log odds ratios of the latter offers advantages in representation of the conditional independence structure. The functional spaces can be flexibly determined by kernels. Additionally, we impose a Structure Lasso (SLasso) penalty on groups of functions to learn the graph structure. These groups with overlaps are designed to enforce hierarchical function selection. In this way, we are able to shrink higher order interactions to obtain a sparse graph structure. 1
4 0.696468 31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding
Author: Pablo M. Olmos, Luis Salamanca, Juan Fuentes, Fernando Pérez-Cruz
Abstract: We show an application of a tree structure for approximate inference in graphical models using the expectation propagation algorithm. These approximations are typically used over graphs with short-range cycles. We demonstrate that these approximations also help in sparse graphs with long-range loops, as the ones used in coding theory to approach channel capacity. For asymptotically large sparse graph, the expectation propagation algorithm together with the tree structure yields a completely disconnected approximation to the graphical model but, for for finite-length practical sparse graphs, the tree structure approximation to the code graph provides accurate estimates for the marginal of each variable. Furthermore, we propose a new method for constructing the tree structure on the fly that might be more amenable for sparse graphs with general factors. 1
5 0.69240236 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum nodedegree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
6 0.68802196 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs
7 0.67355961 213 nips-2011-Phase transition in the family of p-resistances
8 0.62862319 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
9 0.59128469 274 nips-2011-Structure Learning for Optimization
10 0.57507747 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
11 0.5684517 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
12 0.52864796 158 nips-2011-Learning unbelievable probabilities
13 0.51157045 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm
14 0.46010041 55 nips-2011-Collective Graphical Models
15 0.43192658 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
16 0.42330113 60 nips-2011-Confidence Sets for Network Structure
17 0.41297612 150 nips-2011-Learning a Distance Metric from a Network
18 0.41106048 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
19 0.40926847 265 nips-2011-Sparse recovery by thresholded non-negative least squares
20 0.40736273 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
topicId topicWeight
[(0, 0.028), (4, 0.073), (20, 0.031), (26, 0.033), (31, 0.068), (33, 0.018), (43, 0.109), (45, 0.109), (57, 0.059), (74, 0.04), (77, 0.014), (83, 0.032), (90, 0.272), (99, 0.021)]
simIndex simValue paperId paperTitle
1 0.94019574 201 nips-2011-On the Completeness of First-Order Knowledge Compilation for Lifted Probabilistic Inference
Author: Guy Broeck
Abstract: Probabilistic logics are receiving a lot of attention today because of their expressive power for knowledge representation and learning. However, this expressivity is detrimental to the tractability of inference, when done at the propositional level. To solve this problem, various lifted inference algorithms have been proposed that reason at the first-order level, about groups of objects as a whole. Despite the existence of various lifted inference approaches, there are currently no completeness results about these algorithms. The key contribution of this paper is that we introduce a formal definition of lifted inference that allows us to reason about the completeness of lifted inference algorithms relative to a particular class of probabilistic models. We then show how to obtain a completeness result using a first-order knowledge compilation approach for theories of formulae containing up to two logical variables. 1 Introduction and related work Probabilistic logic models build on first-order logic to capture relational structure and on graphical models to represent and reason about uncertainty [1, 2]. Due to their expressivity, these models can concisely represent large problems with many interacting random variables. While the semantics of these logics is often defined through grounding the models [3], performing inference at the propositional level is – as for first-order logic – inefficient. This has motivated the quest for lifted inference methods that exploit the structure of probabilistic logic models for efficient inference, by reasoning about groups of objects as a whole and avoiding repeated computations. The first approaches to exact lifted inference have upgraded the variable elimination algorithm to the first-order level [4, 5, 6]. More recent work is based on methods from logical inference [7, 8, 9, 10], such as knowledge compilation. While these approaches often yield dramatic improvements in runtime over propositional inference methods on specific problems, it is still largely unclear for which classes of models these lifted inference operators will be useful and for which ones they will eventually have to resort to propositional inference. One notable exception in this regard is lifted belief propagation [11], which performs exact lifted inference on any model whose factor graph representation is a tree. A first contribution of this paper is that we introduce a notion of domain lifted inference, which formally defines what lifting means, and which can be used to characterize the classes of probabilistic models to which lifted inference applies. Domain lifted inference essentially requires that probabilistic inference runs in polynomial time in the domain size of the logical variables appearing in the model. As a second contribution we show that the class of models expressed as 2-WFOMC formulae (weighted first-order model counting with up to 2 logical variables per formula) can be domain lifted using an extended first-order knowledge compilation approach [10]. The resulting approach allows for lifted inference even in the presence of (anti-) symmetric or total relations in a theory. These are extremely common and useful concepts that cannot be lifted by any of the existing first-order knowledge compilation inference rules. 1 2 Background We will use standard concepts of function-free first-order logic (FOL). An atom p(t1 , . . . , tn ) consists of a predicate p/n of arity n followed by n arguments, which are either constants or logical variables. 
An atom is ground if it does not contain any variables. A literal is an atom a or its negation ¬a. A clause is a disjunction l1 ∨ ... ∨ lk of literals. If k = 1, it is a unit clause. An expression is an atom, literal or clause. The pred(a) function maps an atom to its predicate and the vars(e) function maps an expression to its logical variables. A theory in conjunctive normal form (CNF) is a conjunction of clauses. We often represent theories by their set of clauses and clauses by their set of literals. Furthermore, we will assume that all logical variables are universally quantified. In addition, we associate a set of constraints with each clause or atom, either of the form X = t, where X is a logical variable and t is a constant or variable, or of the form X ∈ D, where D is a domain, or the negation of these constraints. These define a finite domain for each logical variable. Abusing notation, we will use constraints of the form X = t to denote a substitution of X by t. The function atom(e) maps an expression e to its atoms, now associating the constraints on e with each atom individually. To add the constraint c to an expression e, we use the notation e ∧ c. Two atoms unify if there is a substitution which makes them identical and if the conjunction of the constraints on both atoms with the substitution is satisfiable. Two expressions e1 and e2 are independent, written e1 ⊥ e2 , if no atom a1 ∈ atom(e1 ) unifies with an atom a2 ∈ atom(e2 ). ⊥ We adopt the Weighted First-Order Model Counting (WFOMC) [10] formalism to represent probabilistic logic models, building on the notion of a Herbrand interpretation. Herbrand interpretations are subsets of the Herbrand base HB (T ), which consists of all ground atoms that can be constructed with the available predicates and constant symbols in T . The atoms in a Herbrand interpretation are assumed to be true. All other atoms in HB (T ) are assumed to be false. An interpretation I satisfies a theory T , written as I |= T , if it satisfies all the clauses c ∈ T . The WFOMC problem is defined on a weighted logic theory T , which is a logic theory augmented with a positive weight function w and a negative weight function w, which assign a weight to each predicate. The WFOMC problem involves computing wmc(T, w, w) = w(pred(a)) I|=T a∈I 3 3.1 w(pred(a)). (1) a∈HB(T )\I First-order knowledge compilation for lifted probabilistic inference Lifted probabilistic inference A first-order probabilistic model defines a probability distribution P over the set of Herbrand interpretations H. Probabilistic inference in these models is concerned with computing the posterior probability P(q|e) of query q given evidence e, where q and e are logical expressions in general: P(q|e) = h∈H,h|=q∧e P(h) h∈H,h|=e P(h) (2) We propose one notion of lifted inference for first-order probabilistic models, defined in terms of the computational complexity of inference w.r.t. the domains of logical variables. It is clear that other notions of lifted inference are conceivable, especially in the case of approximate inference. Definition 1 (Domain Lifted Probabilistic Inference). A probabilistic inference procedure is domain lifted for a model m, query q and evidence e iff the inference procedure runs in polynomial time in |D1 |, . . . , |Dk | with Di the domain of the logical variable vi ∈ vars(m, q, e). 
Domain lifted inference does not prohibit the algorithm to be exponential in the size of the vocabulary, that is, the number of predicates, arguments and constants, of the probabilistic model, query and evidence. In fact, the definition allows inference to be exponential in the number of constants which occur in arguments of atoms in the theory, query or evidence, as long as it is polynomial in the cardinality of the logical variable domains. This definition of lifted inference stresses the ability to efficiently deal with the domains of the logical variables that arise, regardless of their size, and formalizes what seems to be generally accepted in the lifted inference literature. 2 A class of probabilistic models is a set of probabilistic models expressed in a particular formalism. As examples, consider Markov logic networks (MLN) [12] or parfactors [4], or the weighted FOL theories for WFOMC that we introduced above, when the weights are normalized. Definition 2 (Completeness). Restricting queries to atoms and evidence to a conjunction of literals, a procedure that is domain lifted for all probabilistic models m in a class of models M and for all queries q and evidence e, is called complete for M . 3.2 First-order knowledge compilation First-order knowledge compilation is an approach to lifted probabilistic inference consisting of the following three steps (see Van den Broeck et al. [10] for details): 1. Convert the probabilistic logical model to a weighted CNF. Converting MLNs or parfactors requires adding new atoms to the theory that represent the (truth) value of each factor or formula. set-disjunction 2 friends(X, Y ) ∧ smokes(X) ⇒ smokes(Y ) Smokers ⊆ People decomposable conjunction unit clause leaf (a) MLN Model ∧ smokes(X), X ∈ Smokers smokes(Y ) ∨ ¬ smokes(X) ∨¬ friends(X, Y ) ∨ ¬ f(X, Y ) friends(X, Y ) ∨ f(X, Y ) smokes(X) ∨ f(X, Y ) ¬ smokes(Y ) ∨ f(X, Y ). ∧ f(X, Y ), Y ∈ Smokers ∧ ¬ smokes(Y ), Y ∈ Smokers / ∧ f(X, Y ), X ∈ Smokers, Y ∈ Smokers / / x ∈ Smokers (b) CNF Theory deterministic disjunction Predicate friends smokes f w 1 1 e2 w 1 1 1 y ∈ Smokers / ∨ set-conjunction ∧ f(x, y) (c) Weight Functions ¬ friends(x, y) ∧ friends(x, y) ¬ f(x, y) (d) First-Order d-DNNF Circuit Figure 1: Friends-smokers example (taken from [10]) Example 1. The MLN in Figure 1a assigns a weight to a formula in FOL. Figure 1b represents the same model as a weighted CNF, introducing a new atom f(X, Y ) to encode the truth value of the MLN formula. The probabilistic information is captured by the weight functions in Figure 1c. 2. Compile the logical theory into a First-Order d-DNNF (FO d-DNNF) circuit. Figure 1d shows an example of such a circuit. Leaves represent unit clauses. Inner nodes represent the disjunction or conjunction of their children l and r, but with the constraint that disjunctions must be deterministic (l ∧ r is unsatisfiable) and conjunctions must be decomposable (l ⊥ r). ⊥ 3. Perform WFOMC inference to compute posterior probabilities. In a FO d-DNNF circuit, WFOMC is polynomial in the size of the circuit and the cardinality of the domains. To compile the CNF theory into a FO d-DNNF circuit, Van den Broeck et al. [10] propose a set of compilation rules, which we will refer to as CR 1 . We will now briefly describe these rules. Unit Propagation introduces a decomposable conjunction when the theory contains a unit clause. Independence creates a decomposable conjunction when the theory contains independent subtheories. 
Shannon decomposition applies when the theory contains ground atoms and introduces a deterministic disjunction between two modified theories: one where the ground atom is true, and one where it is false. Shattering splits clauses in the theory until all pairs of atoms represent either a disjoint or identical set of ground atoms. Example 2. In Figure 2a, the first two clauses are made independent from the friends(X, X) clause and split off in a decomposable conjunction by unit propagation. The unit clause becomes a leaf of the FO d-DNNF circuit, while the other operand requires further compilation. 3 dislikes(X, Y ) ∨ friends(X, Y ) fun(X) ∨ ¬ friends(X, Y ) friends(X, Y ) ∨ dislikes(X, Y ) ¬ friends(X, Y ) ∨ likes(X, Y ) friends(X, X) fun(X) ∨ ¬ friends(X, Y ) fun(X) ∨ ¬ friends(Y, X) ∧ FunPeople ⊆ People x ∈ People friends(X, X) friends(X, Y ) ∨ dislikes(X, Y ), X = Y ¬ friends(X, Y ) ∨ likes(X, Y ), X = Y likes(X, X) dislikes(x, Y ) ∨ friends(x, Y ) fun(x) ∨ ¬ friends(x, Y ) fun(X), X ∈ FunPeople ¬ fun(X), X ∈ FunPeople / fun(X) ∨ ¬ friends(X, Y ) fun(X) ∨ ¬ friends(Y, X) (a) Unit propagation of friends(X, X) (b) Independent partial grounding (c) Atom counting of fun(X) Figure 2: Examples of compilation rules. Circles are FO d-DNNF inner nodes. White rectangles show theories before and after applying the rule. All variable domains are People. (taken from [10]) Independent Partial Grounding creates a decomposable conjunction over a set of child circuits, which are identical up to the value of a grounding constant. Since they are structurally identical, only one child circuit is actually compiled. Atom Counting applies when the theory contains an atom with a single logical variable X ∈ D. It explicitly represents the domain D ⊆ D of X for which the atom is true. It compiles the theory into a deterministic disjunction between all possible such domains. Again, these child circuits are identical up to the value of D and only one is compiled. Example 3. The theory in Figure 2b is compiled into a decomposable set-conjunction of theories that are independent and identical up to the value of the x constant. The theory in Figure 2c contains an atom with one logical variable: fun(X). Atom counting compiles it into a deterministic setdisjunction over theories that differ in FunPeople, which is the domain of X for which fun(X) is true. Subsequent steps of unit propagation remove the fun(X) atoms from the theory entirely. 3.3 Completeness We will now characterize those theories where the CR 1 compilation rules cannot be used, and where the inference procedure has to resort to grounding out the theory to propositional logic. For these, first-order knowledge compilation using CR 1 is not yet domain lifted. When a logical theory contains symmetric, anti-symmetric or total relations, such as friends(X, Y ) ⇒ friends(Y, X), parent(X, Y ) ⇒ ¬ parent(Y, X), X = Y, ≤ (X, Y) ∨ ≤ (Y, X), or more general formulas, such as enemies(X, Y ) ⇒ ¬ friend(X, Y ) ∧ ¬ friend(Y, X), (3) (4) (5) (6) none of the CR 1 rules apply. Intuitively, the underlying problem is the presence of either: • Two unifying (not independent) atoms in the same clause which contain the same logical variable in different positions of the argument list. Examples include (the CNF of) Formulas 3, 4 and 5, where the X and Y variable are bound by unifying two atoms from the same clause. • Two logical variables that bind when unifying one pair of atoms but appear in different positions of the argument list of two other unifying atoms. 
Examples include Formula 6, which in CNF is ¬ friend(X, Y ) ∨ ¬ enemies(X, Y ) ¬ friend(Y, X) ∨ ¬ enemies(X, Y ) Here, unifying the enemies(X, Y ) atoms binds the X variables from both clauses, which appear in different positions of the argument lists of the unifying atoms friend(X, Y ) and friend(Y, X). Both of these properties preclude the use of CR 1 rules. Also in the context of other model classes, such as MLNs, probabilistic versions of the above formulas cannot be processed by CR 1 rules. 4 Even though first-order knowledge compilation with CR 1 rules does not have a clear completeness result, we can show some properties of theories to which none of the compilation rules apply. First, we need to distinguish between the arity of an atom and its dimension. A predicate with arity two might have atoms with dimension one, when one of the arguments is ground or both are identical. Definition 3 (Dimension of an Expression). The dimension of an expression e is the number of logical variables it contains: dim(e) = | vars(e)|. Lemma 1 (CR 1 Postconditions). The CR 1 rules remove all atoms from the theory T which have zero or one logical variable arguments, such that afterwards ∀a ∈ atom(T ) : dim(a) > 1. When no CR 1 rule applies, the theory is shattered and contains no independent subtheories. Proof. Ground atoms are removed by the Shannon decomposition operator followed by unit propagation. Atoms with a single logical variable (including unary relations) are removed by the atom counting operator followed by unit propagation. If T contains independent subtheories, the independence operator can be applied. Shattering is always applied when T is not yet shattered. 4 Extending first-order knowledge compilation In this section we introduce a new operator which does apply to the theories from Section 3.3. 4.1 Logical variable properties To formally define the operator we propose, and prove its correctness, we first introduce some mathematical concepts related to the logical variables in a theory (partly after Jha et al. [8]). Definition 4 (Binding Variables). Two logical variables X, Y are directly binding b(X, Y ) if they are bound by unifying a pair of atoms in the theory. The binding relationship b+ (X, Y ) is the transitive closure of the directly binding relation b(X, Y ). Example 4. In the theory ¬ p(W, X) ∨ ¬ q(X) r(Y ) ∨ ¬ q(Y ) ¬ r(Z) ∨ s(Z) the variable pairs (X, Y ) and (Y, Z) are directly binding. The variables X, Y and Z are binding. Variable W does not bind to any other variable. Note that the binding relationship b+ (X, Y ) is an equivalence relation that defines two equivalence classes: {X, Y, Z} and {W }. Lemma 2 (Binding Domains). After shattering, binding logical variables have identical domains. Proof. During shattering (see Section 3.2), when two atoms unify, binding two variables with partially overlapping domains, the atoms’ clauses are split up into clauses where the domain of the variables is identical, and clauses where the domains are disjoint and the atoms no longer unify. Definition 5 (Root Binding Class). A root variable is a variable that appears in all the atoms in its clause. A root binding class is an equivalence class of binding variables where all variables are root. Example 5. In the theory of Example 4, {X, Y, Z} is a root binding class and {W } is not. 4.2 Domain recursion We will now introduce the new domain recursion operator, starting with its preconditions. Definition 6. 
A theory allows for domain recursion when (i) the theory is shattered, (ii) the theory contains no independent subtheories and (iii) there exists a root binding class. From now on, we will denote with C the set of clauses of the theory at hand and with B a root binding class guaranteed to exist if C allows for domain recursion. Lemma 2 states that all variables in B have identical domains. We will denote the domain of these variables with D. The intuition behind the domain recursion operator is that it modifies D by making one element explicit: D = D ∪ {xD } with xD ∈ D . This explicit domain element is introduced by the S PLIT D / function, which splits clauses w.r.t. the new subdomain D and element xD . 5 Definition 7 (S PLIT D). For a clause c and given set of variables Vc ⊆ vars(c) with domain D, let S PLIT D(c, Vc ) = c, if Vc = ∅ S PLIT D(c1 , Vc \ {V }) ∪ S PLIT D(c2 , Vc \ {V }), if Vc = ∅ (7) where c1 = c ∧ (V = xD ) and c2 = c ∧ (V = xD ) ∧ (V ∈ D ) for some V ∈ Vc . For a set of clauses C and set of variables V with domain D: S PLIT D(C, V) = c∈C S PLIT D(c, V ∩ vars(c)). The domain recursion operator creates three sets of clauses: S PLIT D(C, B) = Cx ∪ Cv ∪ Cr , with Cx = {c ∧ (V = xD )|c ∈ C}, (8) V ∈B∩vars(c) Cv = {c ∧ (V = xD ) ∧ (V ∈ D )|c ∈ C}, (9) V ∈B∩vars(c) Cr = S PLIT D(C, B) \ Cx \ Cv . (10) Proposition 3. The conjunction of the domain recursion sets is equivalent to the original theory: c∈Cr c . c∈Cv c ∧ c∈C c ≡ c∈S PLIT D(C,B) c and therefore c∈C c ≡ c∈Cx c ∧ We will now show that these sets are independent and that their conjunction is decomposable. Theorem 4. The theories Cx , Cv and Cr are independent: Cx ⊥ Cv , Cx ⊥ Cr and Cv ⊥ Cr . ⊥ ⊥ ⊥ The proof of Theorem 4 relies on the following Lemma. Lemma 5. If the theory allows for domain recursion, all clauses and atoms contain the same number of variables from B: ∃n, ∀c ∈ C, ∀a ∈ atom(C) : | vars(c) ∩ B | = | vars(a) ∩ B | = n. c Proof. Denote with Cn the clauses in C that contain n logical variables from B and with Cn its compliment in C. If C is nonempty, there is a n > 0 for which Cn is nonempty. Then every atom in Cn contains exactly n variables from B (Definition 5). Since the theory contains no independent c c subtheories, there must be an atom a in Cn which unifies with an atom ac in Cn , or Cn is empty. After shattering, all unifications bind one variable from a to a single variable from ac . Because a contains exactly n variables from B, ac must also contain exactly n (Definition 4), and because B is c a root binding class, the clause of ac also contains exactly n, which contradicts the definition of Cn . c Therefore, Cn is empty, and because the variables in B are root, they also appear in all atoms. Proof of Theorem 4. From Lemma 5, all atoms in C contain the same number of variables from B. In Cx , these variables are all constrained to be equal to xD , while in Cv and Cr at least one variable is constrained to be different from xD . An attempt to unify an atom from Cx with an atom from Cv or Cr therefore creates an unsatisfiable set of constraints. Similarly, atoms from Cv and Cr cannot be unified. Finally, we extend the FO d-DNNF language proposed in Van den Broeck et al. [10] with a new node, the recursive decomposable conjunction ∧ r , and define the domain recursion compilation rule. Definition 8 ( ∧ r ). The FO d-DNNF node ∧ r (nx , nr , D, D , V) represents a decomposable conjunction between the d-DNNF nodes nx , nr and a d-DNNF node isomorphic to the ∧ r node itself. 
Finally, we extend the FO d-DNNF language proposed in Van den Broeck et al. [10] with a new node, the recursive decomposable conjunction ∧r, and define the domain recursion compilation rule.

Definition 8 (∧r). The FO d-DNNF node ∧r(nx, nr, D, D′, V) represents a decomposable conjunction between the d-DNNF nodes nx, nr and a d-DNNF node isomorphic to the ∧r node itself. In particular, the isomorphic operand is identical to the node itself, except for the size of the domain of the variables in V, which becomes one smaller, going from D to D′ in the isomorphic operand.

We have shown that the conjunction between the sets Cx, Cv and Cr is decomposable (Theorem 4) and logically equivalent to the original theory (Proposition 3). Furthermore, Cv is identical to C, up to the constraints on the domain of the variables in B. This leads us to the following definition of domain recursion.

Definition 9 (Domain Recursion). The domain recursion compilation rule compiles C into ∧r(nx, nr, D, D′, B), where nx, nr are the compiled circuits for Cx, Cr. The third set Cv is represented by the recursion on D′, according to Definition 8.

[Figure 3: Circuit for the symmetric relation in Equation 3, rooted in a recursive conjunction ∧r. Its operands are nx, the circuit for Cx (the clause ¬friends(x, x) ∨ friends(x, x)), and nr, the circuit for Cr (the clauses ¬friends(x, X) ∨ friends(X, x) and ¬friends(X, x) ∨ friends(x, X), with X ≠ x, over Person ← Person \ {x}).]

Example 6. Figure 3 shows the FO d-DNNF circuit for Equation 3. The theory is split up into three independent theories: Cr and Cx, shown in Figure 3, and Cv = {¬friends(X, Y) ∨ friends(Y, X), X ≠ x, Y ≠ x}. The conjunction of these theories is equivalent to Equation 3. Theory Cv is identical to Equation 3, up to the inequality constraints on X and Y.

Theorem 6. Given a function size, which maps domains to their size, the weighted first-order model count of a ∧r(nx, nr, D, D′, V) node is

wmc(∧r(nx, nr, D, D′, V), size) = wmc(nx, size)^size(D) · ∏_{s=0}^{size(D)−1} wmc(nr, size ∪ {D′ → s}),   (11)

where size ∪ {D′ → s} adds to the size function that the subdomain D′ has cardinality s.

Proof. If C allows for domain recursion, due to Theorem 4, the weighted model count is

wmc(C, size) = 1, if size(D) = 0
wmc(C, size) = wmc(Cx) · wmc(Cv, size′) · wmc(Cr, size′), if size(D) > 0   (12)

where size′ = size ∪ {D′ → size(D) − 1}.

Theorem 7. The Independent Partial Grounding compilation rule is a special case of the domain recursion rule, where ∀c ∈ C : |vars(c) ∩ B| = 1 (and therefore Cr = ∅).
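The recursion of Equation 12 is easy to prototype. The following Python sketch (my own illustration under simplifying assumptions, not the paper's implementation; wmc_nx and wmc_nr are hypothetical stand-ins for the evaluated operand circuits) evaluates a ∧r node by recursing on the domain size. It is checked on the symmetric friends relation of Equation 3 with all literal weights set to 1, where the model count over n people is known in closed form: 2^(n + n(n−1)/2), since each friends(a, a) is unconstrained and each unordered pair {a, b} contributes one free symmetric choice.

    def wmc_recursive(wmc_nx, wmc_nr, size_D):
        # Base case of Equation 12: an empty domain leaves the empty model, weight 1.
        if size_D == 0:
            return 1
        size_Dprime = size_D - 1   # D' = D \ {x_D}
        return (wmc_nx()                                       # Cx: all B-variables set to x_D
                * wmc_recursive(wmc_nx, wmc_nr, size_Dprime)   # Cv: isomorphic theory over D'
                * wmc_nr(size_Dprime))                         # Cr: remaining clauses over D'

    # Operand counts for the circuit of Figure 3 with unit weights:
    wmc_nx = lambda: 2            # friends(x, x) is unconstrained: 2 models
    wmc_nr = lambda s: 2 ** s     # one free symmetric choice per element of D'

    for n in range(7):
        assert wmc_recursive(wmc_nx, wmc_nr, n) == 2 ** (n + n * (n - 1) // 2)
    print("recursion matches the closed form for n = 0..6")

Unrolling this recursion, with one factor wmc(Cx) and one evaluation of Cr per peeled-off domain element, yields exactly the product form of Equation 11.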
4.3 Completeness

In this section, we introduce a class of models for which first-order knowledge compilation with domain recursion is complete.

Definition 10 (k-WFOMC). The class of k-WFOMC consists of WFOMC theories with clauses that have up to k logical variables.

A first completeness result is for 2-WFOMC, using the set of knowledge compilation rules CR2, which are the rules in CR1 extended with domain recursion.

Theorem 8 (Completeness for 2-WFOMC). First-order knowledge compilation using the CR2 compilation rules is a complete domain lifted probabilistic inference algorithm for 2-WFOMC.

Proof. From Lemma 1, after applying the CR1 rules, the theory contains only atoms with dimension larger than or equal to two. From Definition 10, each clause has dimension smaller than or equal to two. Therefore, each logical variable in the theory is a root variable and, according to Definition 5, every equivalence class of binding variables is a root binding class. Because of Lemma 1, the theory allows for domain recursion, which requires further compilation of two theories: Cx and Cr into nx and nr. Both have dimension smaller than 2 and can be lifted by CR1 compilation rules.

The properties of 2-WFOMC are a sufficient but not necessary condition for first-order knowledge compilation to be domain lifted. We can obtain a similar result for MLNs or parfactors by reducing them to a WFOMC problem. If an MLN contains only formulae with up to k logical variables, then its WFOMC representation will be in k-WFOMC.

This result for 2-WFOMC is not trivial. Van den Broeck et al. [10] showed in their experiments that counting first-order variable elimination (C-FOVE) [6] fails to lift the "Friends Smoker Drinker" problem, which is in 2-WFOMC. We will show in the next section that the CR1 rules fail to lift the theory in Figure 4a, which is in 2-WFOMC. Note that there are also useful theories that are not in 2-WFOMC, such as those containing the transitive relation friends(X, Y) ∧ friends(Y, Z) ⇒ friends(X, Z).

5 Empirical evaluation

To complement the theoretical results of the previous section, we extended the WFOMC implementation¹ with the domain recursion rule. We performed experiments with the theory in Figure 4a, which is a version of the friends and smokers model [11] extended with the symmetric relation of Equation 3. We evaluate performance by querying P(smokes(bob)) with increasing domain size, comparing our approach to the existing WFOMC implementation and its propositional counterpart, which first grounds the theory and then compiles it with the c2d compiler [13] to a propositional d-DNNF circuit. We did not compare to C-FOVE [6] because it cannot perform lifted inference on this model.

Propositional inference quickly becomes intractable when there are more than 20 people. The lifted inference algorithms scale much better. The CR1 rules can exploit some regularities in the model. For example, they eliminate all the smokes(X) atoms from the theory. They do, however, resort to grounding at a later stage of the compilation process. With the domain recursion rule, there is no need for grounding. This advantage is clear in the experiments: our approach has an almost constant inference time in this range of domain sizes. Note that the runtimes for c2d include compilation and evaluation of the circuit, whereas the WFOMC runtimes only represent evaluation of the FO d-DNNF. After all, propositional compilation depends on the domain size but first-order compilation does not. First-order compilation takes a constant two seconds for both rule sets.

[Figure 4: Symmetric friends and smokers experiment, comparing propositional knowledge compilation (c2d) to WFOMC using compilation rules CR1 and CR2 (which includes domain recursion). (a) MLN model: 2 smokes(X) ∧ friends(X, Y) ⇒ smokes(Y); friends(X, Y) ⇒ friends(Y, X). (b) Evaluation runtime [s] (log scale) versus the number of people (10 to 80), for c2d, WFOMC-CR1 and WFOMC-CR2.]
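As a rough back-of-envelope illustration of why the grounded approach stops scaling (my own sketch, not an analysis from the paper; it only counts ground clauses, which is merely a lower bound on the work the propositional compiler faces), the model of Figure 4a produces one ground clause per ordered pair of people for each of its two formulas:

    def ground_clause_count(n_people):
        smoker_rule = n_people * n_people     # smokes(X) ∧ friends(X, Y) ⇒ smokes(Y)
        symmetry_rule = n_people * n_people   # friends(X, Y) ⇒ friends(Y, X)
        return smoker_rule + symmetry_rule

    for n in (10, 20, 40, 80):
        print(n, ground_clause_count(n))      # 200, 800, 3200, 12800 ground clauses

The first-order representation, by contrast, keeps two clauses regardless of the domain size, which is consistent with the nearly flat lifted runtimes in Figure 4b.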
6 Conclusions

We proposed a definition of complete domain lifted probabilistic inference w.r.t. classes of probabilistic logic models. This definition considers algorithms to be lifted if they are polynomial in the size of logical variable domains. Existing first-order knowledge compilation turns out not to admit an intuitive completeness result. Therefore, we generalized the existing Independent Partial Grounding compilation rule to the domain recursion rule. With this one extra rule, we showed that first-order knowledge compilation is complete for a significant class of probabilistic logic models, where the WFOMC representation has up to two logical variables per clause.

Acknowledgments

The author would like to thank Luc De Raedt, Jesse Davis and the anonymous reviewers for valuable feedback. This work was supported by the Research Foundation-Flanders (FWO-Vlaanderen).

¹ http://dtai.cs.kuleuven.be/wfomc/

References

[1] Lise Getoor and Ben Taskar, editors. An Introduction to Statistical Relational Learning. MIT Press, 2007.
[2] Luc De Raedt, Paolo Frasconi, Kristian Kersting, and Stephen Muggleton, editors. Probabilistic inductive logic programming: theory and applications. Springer-Verlag, Berlin, Heidelberg, 2008.
[3] Daan Fierens, Guy Van den Broeck, Ingo Thon, Bernd Gutmann, and Luc De Raedt. Inference in probabilistic logic programs using weighted CNF's. In Proceedings of UAI, pages 256–265, 2011.
[4] David Poole. First-order probabilistic inference. In Proceedings of IJCAI, pages 985–991, 2003.
[5] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. In Proceedings of IJCAI, pages 1319–1325, 2005.
[6] Brian Milch, Luke S. Zettlemoyer, Kristian Kersting, Michael Haimes, and Leslie Pack Kaelbling. Lifted probabilistic inference with counting formulas. In Proceedings of AAAI, pages 1062–1068, 2008.
[7] Vibhav Gogate and Pedro Domingos. Exploiting logical structure in lifted probabilistic inference. In Proceedings of StarAI, 2010.
[8] Abhay Jha, Vibhav Gogate, Alexandra Meliou, and Dan Suciu. Lifted inference seen from the other side: The tractable features. In Proceedings of NIPS, 2010.
[9] Vibhav Gogate and Pedro Domingos. Probabilistic theorem proving. In Proceedings of UAI, pages 256–265, 2011.
[10] Guy Van den Broeck, Nima Taghipour, Wannes Meert, Jesse Davis, and Luc De Raedt. Lifted probabilistic inference by first-order knowledge compilation. In Proceedings of IJCAI, pages 2178–2185, 2011.
[11] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In Proceedings of AAAI, pages 1094–1099, 2008.
[12] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.
[13] Adnan Darwiche. New advances in compiling CNF to decomposable negation normal form. In Proceedings of ECAI, pages 328–332, 2004.
same-paper 2 0.7581116 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
Author: Animashree Anandkumar, Vincent Tan, Alan S. Willsky
Abstract: We consider the problem of Ising and Gaussian graphical model selection given n i.i.d. samples from the model. We propose an efficient threshold-based algorithm for structure estimation based on conditional mutual information thresholding. This simple local algorithm requires only loworder statistics of the data and decides whether two nodes are neighbors in the unknown graph. We identify graph families for which the proposed algorithm has low sample and computational complexities. Under some transparent assumptions, we establish that the proposed algorithm is −4 structurally consistent (or sparsistent) when the number of samples scales as n = Ω(Jmin log p), where p is the number of nodes and Jmin is the minimum edge potential. We also develop novel non-asymptotic techniques for obtaining necessary conditions for graphical model selection. Keywords: Graphical model selection, high-dimensional learning, local-separation property, necessary conditions, typical sets, Fano’s inequality.
3 0.57652134 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node-degree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d² log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d³ log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
4 0.57603347 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
5 0.57446694 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
6 0.57366574 231 nips-2011-Randomized Algorithms for Comparison-based Search
7 0.57174289 199 nips-2011-On fast approximate submodular minimization
8 0.57045734 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
9 0.5701108 139 nips-2011-Kernel Bayes' Rule
10 0.56890547 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
11 0.56815904 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
12 0.56801939 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure
13 0.56733721 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
14 0.56612259 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
15 0.56528628 265 nips-2011-Sparse recovery by thresholded non-negative least squares
16 0.56484026 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
17 0.56480128 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
18 0.56448901 280 nips-2011-Testing a Bayesian Measure of Representativeness Using a Large Image Database
19 0.56394142 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm
20 0.56372422 29 nips-2011-Algorithms and hardness results for parallel large margin learning