jmlr jmlr2010 jmlr2010-32 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Remco Bouckaert, Raymond Hemmecke, Silvia Lindner, Milan Studený
Abstract: The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by a computer the corresponding algebraic relation between the vectors, called the independence implication. We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods to the number of involved variables. We recall/describe the theoretical basis for all four methods involved in our computational experiments, whose aim was to compare the efficiency of the algorithms. The experiments show that the LP method is clearly the fastest one. As an example of possible application of such algorithms we show that testing inclusion of Bayesian network structures or whether a CI statement is encoded in an acyclic directed graph can be done by the algebraic method. Keywords: conditional independence inference, linear programming approach
Reference: text
sentIndex sentText sentNum sentScore
1 We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. [sent-12, score-0.252]
2 In Studený (2005), the method of structural imsets has been proposed as a non-graphical algebraic device for describing probabilistic CI structures; its advantage over graphical approaches is that it allows one to describe any discrete probabilistic CI structure. [sent-39, score-0.377]
3 The intended use is • computer testing of implications between CI statements, and • checking equivalence of structural imsets by an algebraic method. [sent-43, score-0.48]
4 2 Some Former Algorithms Given structural imsets u and v (over a set of variables N), the intuitive meaning of the implication u ⇀ v (≡ u independence implies v) is that the CI structure induced by u contains the CI structure induced by v. [sent-51, score-0.549]
5 It was the question whether every structural imset is a combinatorial imset, that is, whether it can be written as the sum of elementary imsets (= imsets corresponding to elementary CI statements). [sent-74, score-1.424]
6 (2008), where an example of a structural imset over 5 variables was found that is not combinatorial. [sent-76, score-0.436]
7 This fact naturally leads to a more advanced question motivated by the topic of CI inference: what is the so-called Hilbert basis of the cone generated by standard imsets (cf. [sent-77, score-0.477]
8 In fact, the CI inference problem is transformed to testing whether a given imset is combinatorial, respectively structural. [sent-82, score-0.461]
9 Given acyclic directed graphs G and H over N, we show that the BN structure induced by H is contained in the one induced by G iff the difference uG − uH of their standard imsets is a combinatorial imset, which happens iff it is a structural imset. [sent-90, score-0.584]
10 Thus, testing the inclusion can be transformed to testing combinatorial imsets (or structural ones). [sent-91, score-0.546]
11 A consequence of this observation is an elegant algebraic procedure for reading CI statements represented in a standard imset uG , a kind of counterpart of the graphical separation criterion. [sent-92, score-0.518]
12 Since the standard imset uG is quite simple we have reasons to conjecture that our procedure can be implemented with polynomial complexity with respect to |N|. [sent-93, score-0.395]
13 Since trivial statements are always valid, it follows from the semi-graphoid properties that, for any discrete probability distribution P, the CI structure induced by P is uniquely determined by the list of elementary CI statements valid with respect to P. [sent-112, score-0.348]
14 2 Imsets An imset over N is an integer-valued function on the power set of N, denoted by P (N) in the rest of the paper. [sent-114, score-0.395]
15 An imset associated with a CI statement A ⊥⊥ B | C is then the combination u⟨A,B|C⟩ ≡ δABC + δC − δAC − δBC . [sent-117, score-0.427]
16 The class of elementary imsets (over N), denoted by E (N), consists of imsets associated with elementary CI statements (over N). [sent-118, score-1.013]
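To make the algebraic objects concrete, here is a small Python sketch (our own helper names, not code from the paper) that represents an imset as a dictionary from frozensets to integers, builds the imset u⟨A,B|C⟩ of a CI statement, and enumerates the class E (N) of elementary imsets.

```python
from itertools import combinations

def delta(S):
    """The indicator imset delta_S of a single subset S."""
    return {frozenset(S): 1}

def add(u, v, coef=1):
    """Return the imset u + coef * v (absent entries count as 0; zeros are dropped)."""
    out = dict(u)
    for S, val in v.items():
        out[S] = out.get(S, 0) + coef * val
        if out[S] == 0:
            del out[S]
    return out

def ci_imset(A, B, C):
    """u_<A,B|C> = delta_ABC + delta_C - delta_AC - delta_BC for A _||_ B | C."""
    A, B, C = set(A), set(B), set(C)
    u = add(delta(A | B | C), delta(C))
    u = add(u, delta(A | C), -1)
    return add(u, delta(B | C), -1)

def elementary_imsets(N):
    """E(N): imsets u_<a,b|C> for distinct a, b in N and C a subset of N \\ {a, b}."""
    N, E = set(N), []
    for a, b in combinations(sorted(N), 2):
        rest = sorted(N - {a, b})
        for r in range(len(rest) + 1):
            for C in combinations(rest, r):
                E.append(ci_imset({a}, {b}, C))
    return E

# |E(N)| = |N|(|N|-1)/2 * 2^(|N|-2), e.g., 80 elementary imsets for |N| = 5
print(len(elementary_imsets("abcde")))  # -> 80
```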
17 A combinatorial imset is an imset u that can directly be decomposed into elementary imsets, that is, u = ∑w∈E(N) kw · w for some kw ∈ Z+ . [sent-119, score-1.022]
18 We denote the class of combinatorial imsets over N by C (N). [sent-120, score-0.44]
19 An imset u over N will be called structural if there exists n ∈ N such that the multiple n · u is a combinatorial imset, that is, n · u = ∑w∈E(N) kw · w for some n ∈ N, kw ∈ Z+ . [sent-121, score-0.584]
20 In other words, a structural imset is an imset which is a combination of elementary ones with non-negative rational coefficients. [sent-122, score-1.008]
21 The class of structural imsets over N will be denoted by S (N); this 1. [sent-123, score-0.377]
22 The word imset is an abbreviation for integer-valued multiset. [sent-124, score-0.395]
23 class of imsets is intended to describe CI structures. [sent-125, score-0.336]
24 3) that there exists a smallest n∗ ∈ N, depending on |N|, such that u ∈ S (N) ⇔ n∗ · u ∈ C (N) for every imset u over N. [sent-130, score-0.395]
25 3 Independence Implication Let u, v be structural imsets over N. [sent-135, score-0.377]
26 The importance of structural imsets follows from the fact that, for every discrete probability distribution P over N, there exists u ∈ S (N) such that the CI structure induced by P coincides with M (u) (use Studen´ , 2005, Theorem 5. [sent-138, score-0.377]
27 The scalar product of m and an imset u over N is then ⟨m, u⟩ ≡ ∑S⊆N m(S) · u(S) . [sent-143, score-0.395]
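In the dictionary representation sketched above, the scalar product is a one-liner (again our own naming):

```python
def scalar_product(m, u):
    """<m, u> = sum over S of m(S) * u(S); both arguments map frozensets to numbers."""
    return sum(m.get(S, 0) * val for S, val in u.items())
```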
28 More specifically, given an input list L of CI statements, let us “translate” every statement s in L into the associated imset us and introduce a combinatorial imset uL ≡ ∑s∈L us . [sent-147, score-0.929]
29 If we assume that the “input” u is a combinatorial imset and the “output” v an elementary one then such a limit exists. [sent-155, score-0.583]
30 Given an acyclic directed graph G over N, the standard imset for G, denoted by uG , is given by the formula uG = δN − δ∅ + ∑i∈N { δpaG(i) − δ{i}∪paG(i) }, (3) where paG (i) ≡ { j ∈ N; j → i in G} denotes the set of parents of a node i in G. [sent-170, score-0.456]
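Formula (3) translates directly into the representation above; the following sketch (our helper names, with `parents` mapping each node to its parent set) builds uG.

```python
def standard_imset(N, parents):
    """Standard imset u_G of an acyclic directed graph G over N, following formula (3)."""
    u = add(delta(set(N)), delta(set()), -1)        # delta_N - delta_emptyset
    for i in N:
        pa = set(parents.get(i, set()))
        u = add(u, delta(pa))                       # + delta_{pa(i)}
        u = add(u, delta({i} | pa), -1)             # - delta_{{i} union pa(i)}
    return u

# Collider a -> c <- b over N = {a, b, c}: u_G comes out as u_<a,b|{}>,
# reflecting the single independence a _||_ b encoded in this graph.
uG = standard_imset("abc", {"c": {"a", "b"}})
```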
31 1 in Studený (2005) says that uG is always a combinatorial imset and M (uG ) = I (G). [sent-172, score-0.455]
32 That means, the standard imset is a unique representative of the equivalence class of graphs. [sent-176, score-0.395]
33 Note that uG need not be the only combinatorial imset defining I (G); it is the simplest such imset, a kind of “standard” representative. [sent-177, score-0.455]
34 Using standard imsets we can easily characterize the inclusion for Bayesian networks: Proposition 2 Let G and H be acyclic directed graphs over N. [sent-178, score-0.422]
35 We have some reasons to conjecture that testing whether a difference of two standard imsets is a combinatorial imset can be done with polynomial complexity in |N|; see Conclusions for the discussion. [sent-185, score-0.833]
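Proposition 2 then yields a very short inclusion test. The sketch below leans on an `is_combinatorial` routine, a version of which is outlined later under Method 2, so it is an illustrative outline rather than the authors' implementation.

```python
def bn_included(N, parents_H, parents_G):
    """True iff the BN structure induced by H is contained in the one induced by G,
    that is, iff u_G - u_H is a combinatorial (equivalently structural) imset."""
    diff = add(standard_imset(N, parents_G), standard_imset(N, parents_H), -1)
    return is_combinatorial(diff, elementary_imsets(N), set(N))
```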
36 Note that the indirect transformational graphical description of independence inclusion by Chickering (2002) does not provide an algorithm for testing the inclusion for a given pair of acyclic directed graphs G and H. [sent-187, score-0.216]
37 Proposition 2 has the following consequence, which allows one to replace the graphical separation criterion by testing whether an imset is combinatorial. [sent-189, score-0.437]
38 If the polyhedral cone is rational then it is the conical hull of finitely many rational points. [sent-210, score-0.338]
39 A pointed polyhedral cone C has just one vertex, namely 0; its edges are half-lines, called extreme rays. [sent-242, score-0.312]
40 If C is a pointed rational polyhedral cone then each of its extreme rays contains a non-zero lattice point; this lattice point is unique if it is normalized, that is, if its components have no common integer divisor. [sent-247, score-0.488]
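Normalizing such a lattice point amounts to dividing out the greatest common divisor of its values; a minimal sketch in the same representation (our naming):

```python
from functools import reduce
from math import gcd

def normalize(u):
    """Scale an integer imset so that its values have no common integer divisor."""
    g = reduce(gcd, (abs(val) for val in u.values()), 0)
    return {S: val // g for S, val in u.items()} if g > 1 else dict(u)
```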
41 By a Hilbert basis of a rational polyhedral cone C (see Schrijver, 1986, § 16. [sent-251, score-0.291]
42 4) says that every rational polyhedral cone possesses a Hilbert basis. [sent-254, score-0.267]
43 An input list L of elementary CI statements and another elementary CI statement t over N. [sent-282, score-0.42]
44 1 Method 1: Using Skeletal Characterization. This is the first ever implemented method for testing independence implication (Studený et al. [sent-291, score-0.214]
45 The motivation source for this method was the dual definition (2) of independence implication in terms of supermodular functions (see Section 2. [sent-294, score-0.272]
46 The columns in the table correspond to the structural imsets ut and uL ; the items are the corresponding scalar products. [sent-315, score-0.449]
47 The independence implication can equivalently be defined in terms of (inclusion of) facets of the cone R (N) ≡ cone (E (N)), the cone spanned by elementary imsets. [sent-319, score-0.688]
48 These facets correspond to the extreme rays of the (dual) cone of ℓ-standardized supermodular functions. [sent-328, score-0.333]
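A sketch of the resulting skeletal test follows; it assumes a precomputed list `skeleton` of representatives of those extreme rays (the paper takes this list as given for small |N|, and we do not reproduce it here).

```python
def implies_by_skeleton(u, v, skeleton):
    """u independence-implies v unless some skeletal supermodular imset m
    certifies the opposite by <m, u> = 0 together with <m, v> > 0."""
    return not any(scalar_product(m, u) == 0 and scalar_product(m, v) > 0
                   for m in skeleton)
```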
49 2 Method 2: Racing Algorithms. The idea of the paper (Bouckaert and Studený, 2007) was to combine two algorithms for testing the independence implication u ⇀ v. [sent-332, score-0.214]
50 1 Verification: Decomposing into Elementary Imsets. Consider a combinatorial imset u ∈ C (N), an elementary imset v ∈ E (N) and the task to decide whether u ⇀ v. [sent-344, score-0.978]
51 That is, by (1), testing whether k · u − v is a structural imset for some k ∈ N. [sent-345, score-0.478]
52 Now, (1) implies that n · (k · u − v) is a combinatorial imset for some k, n ∈ N (see Section 2. [sent-348, score-0.455]
53 The characterization (6) allows one to transform testing independence implication to the task of deciding whether a given candidate imset y = l · u − v is combinatorial. [sent-357, score-0.395]
54 A combinatorial imset y may have many decompositions y = ∑w∈ E (N) kw · w, kw ∈ Z+ into elementary imsets. [sent-358, score-0.671]
55 Because there is a simple formula for the degree of the candidate imset y, the search space, the tree of potential decompositions, is known. [sent-362, score-0.395]
56 The point is that the degree of a candidate imset y = l · u − v grows (linearly) with l; consequently, the size of the corresponding tree of possible decompositions grows exponentially with l. [sent-367, score-0.395]
57 If this is the case then the degree of y must be 3, which means we search for a decomposition into elementary imsets with 3 summands. [sent-378, score-0.5]
58 There are 6 elementary imsets over {a, b, c, d} with this property and two of them are excluded by sanity checks. [sent-380, score-0.464]
59 For example, for v′ = u⟨c,d|ab⟩ and y′ = y − v′ one has −1 = ∑{c,d}⊆T y′ (T ) < 0, which is impossible for a combinatorial imset in place of y′ . [sent-381, score-0.455]
60 Thus, the implication uL ⇀ ut has been confirmed by the decomposition method. [sent-384, score-0.217]
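The decomposition method can be sketched as a degree-bounded backtracking search. The helpers below are ours: as the degree function we use ⟨m, ·⟩ with m(S) = |S|(|S|−1)/2, which assigns degree 1 to every elementary imset (the paper's normalization may differ), and the pairwise sanity check is the one illustrated above for y′.

```python
from itertools import combinations

def degree(u):
    """<m, u> with m(S) = |S|(|S|-1)/2; every elementary imset has degree 1,
    so a combinatorial imset of degree d decomposes into exactly d summands."""
    return sum(val * len(S) * (len(S) - 1) // 2 for S, val in u.items())

def pair_sums_nonnegative(y, N):
    """Necessary condition: for each pair {a,b}, the sum of y(T) over T containing
    both a and b counts summands u_<a,b|C> and therefore cannot be negative."""
    return all(sum(val for S, val in y.items() if a in S and b in S) >= 0
               for a, b in combinations(sorted(N), 2))

def is_combinatorial(y, E, N):
    """Backtracking test of whether y is a sum (with repetition) of imsets from E."""
    d = degree(y)
    if d < 0 or not pair_sums_nonnegative(y, N):
        return False
    if d == 0:
        return not y          # only the zero imset has an empty decomposition
    return any(is_combinatorial(add(y, w, -1), E, N) for w in E)
```

The tree of decompositions grows exponentially with the factor l, which is why the paper races this verification attempt against a falsification attempt.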
61 To explain the geometric interpretation (of the method) note that the cone R (N) = cone (E (N)) is slightly special. [sent-385, score-0.234]
62 The lattice points in this cone are just structural imsets. [sent-386, score-0.216]
63 3, lattice points in R (N) are just non-negative rational combinations of elementary imsets E (N), that is, structural imsets (see Section 2. [sent-389, score-0.948]
64 In particular, every structural imset belongs to one of the hyperplanes parallel to this basic one, and its “degree” says how far it is from the origin (= the zero imset). [sent-392, score-0.436]
65 The condition (6) is a minor modification of (1): it requires that a multiple of u minus v be a sum of elementary imsets (with possible repetition). [sent-394, score-0.464]
66 The algorithm, therefore, looks for the decomposition, and the degree (of the candidate imset y) serves as the measure of its complexity. [sent-395, score-0.431]
67 To disprove the implication u ⇀ v it is enough to find a supermodular function m : P (N) → R such that ⟨m, u⟩ = 0 and ⟨m, v⟩ > 0. [sent-399, score-0.234]
68 Actually, one can limit oneself to ℓ-standardized integer-valued supermodular functions, that is, supermodular imsets. [sent-400, score-0.222]
69 These imsets have a special form which allows one to generate them randomly. [sent-401, score-0.336]
70 If our random procedure generates the supermodular imset m = 2 · δabcd + δabc + δabd + δacd + 2 · δbcd + δbc + δbd + δcd , then one can observe that ⟨m, uL⟩ = 0 while ⟨m, u⟨b,c|a⟩⟩ = 1. [sent-408, score-0.495]
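The falsification side is easy to sketch: supermodularity of an imset m can be checked by taking scalar products with all elementary imsets, and a single m with ⟨m, u⟩ = 0 and ⟨m, v⟩ > 0 disproves u ⇀ v. The snippet below (our naming) reproduces the m quoted in the text.

```python
def is_supermodular(m, E):
    """m is supermodular iff <m, w> >= 0 for every elementary imset w in E."""
    return all(scalar_product(m, w) >= 0 for w in E)

def falsifies(m, u, v):
    """A supermodular m disproves u -> v when <m, u> = 0 and <m, v> > 0."""
    return scalar_product(m, u) == 0 and scalar_product(m, v) > 0

# The supermodular imset from the text, over N = {a, b, c, d}
m = {frozenset(S): coef for coef, S in
     [(2, "abcd"), (1, "abc"), (1, "abd"), (1, "acd"),
      (2, "bcd"), (1, "bc"), (1, "bd"), (1, "cd")]}
assert is_supermodular(m, elementary_imsets("abcd"))
assert scalar_product(m, ci_imset({"b"}, {"c"}, {"a"})) == 1
```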
71 3 Method 3: Decomposition via Hilbert Basis An alternative to testing whether an imset is combinatorial is testing whether it is structural. [sent-415, score-0.539]
72 Since structural imsets coincide with the lattice points in the cone R (N) ≡ cone (E (N)), each of them can be written as a sum (with possible repetition) of the elements of the Hilbert basis H (N) of the cone R (N) (see Section 3. [sent-416, score-0.81]
73 (2008) imply that the set of elementary imsets does not constitute a Hilbert basis of R (N) for |N| ≥ 5. [sent-419, score-0.488]
74 Thus, having the Hilbert basis of R (N) at hand, one can test the independence implication u ⇀ v for u, v ∈ S (N) through (1): the task is to find out whether there exists a decomposition of y ≡ k · u − v into Hilbert basis elements for some k ∈ N. [sent-425, score-0.256]
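Since only the generating set changes, the backtracking routine from Method 2 can be reused as is; `hilbert_basis` below stands for a precomputed Hilbert basis H (N) of cone (E (N)) (obtainable with lattice-point software such as 4ti2 or Normaliz), which we do not reproduce here.

```python
def is_structural(y, hilbert_basis, N):
    """y is structural iff it decomposes (with repetition) into elements of H(N);
    the degree-bounded search still terminates because every nonzero element of
    the cone has positive degree."""
    return is_combinatorial(y, hilbert_basis, N)
```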
75 1, where the set of elementary imsets E (N) was used instead of H (N). [sent-428, score-0.464]
76 4 Method 4: Linear Programming The basic idea is to re-formulate (the definition of) independence implication in terms of the (pointed rational polyhedral) cone R (N) ≡ cone (E (N)) spanned by elementary imsets. [sent-434, score-0.583]
77 Proof The cone R (N) consists of conic combinations of representatives of its extreme rays, that is, of elementary imsets. [sent-443, score-0.318]
78 This imset equality, specified for any S ⊆ N, yields (9). [sent-445, score-0.395]
79 Indeed, the rows of A correspond to subsets of N, while the columns correspond to elementary imsets and the factor k. [sent-447, score-0.464]
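A feasibility sketch of this LP with SciPy follows (our own formulation of the system described above, not the LP solver used in the paper's experiments): the variables are one non-negative coefficient per elementary imset plus the factor k, and for structural u, v the implication u ⇀ v holds iff the system ∑w λw·w − k·u = −v has a non-negative solution.

```python
import numpy as np
from scipy.optimize import linprog

def implies_by_lp(u, v, N):
    """LP feasibility test of u -> v for structural imsets u, v over N:
    does  sum_w lam_w * w  -  k * u  =  -v  have a solution with lam, k >= 0?"""
    E = elementary_imsets(N)
    subsets = sorted({S for imset in E + [u, v] for S in imset},
                     key=lambda S: (len(S), tuple(sorted(S))))
    row = {S: i for i, S in enumerate(subsets)}
    A = np.zeros((len(subsets), len(E) + 1))
    for j, w in enumerate(E):
        for S, val in w.items():
            A[row[S], j] = val
    for S, val in u.items():
        A[row[S], len(E)] = -val                    # last column carries the factor k
    b = np.zeros(len(subsets))
    for S, val in v.items():
        b[row[S]] = -val
    res = linprog(c=np.zeros(len(E) + 1), A_eq=A, b_eq=b,
                  bounds=[(0, None)] * (len(E) + 1), method="highs")
    return res.status == 0                          # 0 = feasible, 2 = infeasible

# Semi-graphoid example: {a _||_ b | c, a _||_ c} implies a _||_ c | b
u = add(ci_imset({"a"}, {"b"}, {"c"}), ci_imset({"a"}, {"c"}, set()))
print(implies_by_lp(u, ci_imset({"a"}, {"c"}, {"b"}), "abc"))   # -> True
```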
80 For example, consider the cone of ℓ-standardized supermodular functions. [sent-462, score-0.217]
81 containing 3 up to 11 elementary CI statements were generated and, for each elementary CI statement outside the input list, it was verified whether it was implied or not. [sent-483, score-0.395]
82 So, the thousand 3-input cases result in verification of a thousand times the total number of elementary CI statements (80 for 5 variables) minus the 3 statements already given in the input list, that is, 1000 × (80 − 3) = 77000 inference problems. [sent-484, score-0.344]
83 Comparing the Hilbert racer with the original racer algorithm shows that more problems are resolved in the time available, so the HB decomposition algorithm appears to perform better at resolving problems. [sent-547, score-0.342]
84 However, taking into account the number of unresolved problems, the Hilbert racer wins out, so it is to be expected that a slightly longer time-out period will bring the performance of the triple racer to the same level as the Hilbert racer. [sent-565, score-0.391]
85 If L is an input list of elementary CI statements, then one single experiment comprises the testing of |E (N)| many independence implications uL ⇀ u for all u ∈ E (N). [sent-575, score-0.303]
86 7 For |N| = 4, 5, 6 we have considered input lists L containing 3 up to 11 different elementary CI statements over N and performed 1000 such experiments for each combination of |N| and |L|. [sent-576, score-0.235]
87 The number of LP problems to be tested within a single experiment and the dimension of these problems depend on the number |E (N)| = |N|(|N|−1)/2 · 2^(|N|−2) of elementary imsets and on the number 2^|N| of subsets of N (compare with Table 2 in Section 4. [sent-579, score-0.464]
88 More specifically, by Proposition 2 (and Corollary 3), testing independence inclusion (and reading CI restrictions from an acyclic directed graph) can be transformed to the task of deciding whether a difference of two standard imsets is a combinatorial imset. [sent-618, score-0.587]
89 Since, by (3), every standard imset has at most 2 · |N| non-zero values, and, also, its degree is at most |N| − 1, we have good reasons to believe that the decomposition algorithm from Section 4. [sent-619, score-0.431]
90 Actually, a relevant result (Studený and Vomlel, 2011, Lemma 3) implies that if the difference of two standard imsets is a combinatorial imset then it is a plain sum of elementary imsets (= a combination with coefficients +1). [sent-622, score-1.255]
91 Thus, we dare to conjecture that testing whether a difference of two standard imsets is a combinatorial imset can be done with polynomial complexity in |N| using the procedure from Section 4. [sent-623, score-0.833]
92 9) is, in fact, equivalent to our task (9) with fixed factor k = 1, if one uses a suitable one-to-one linear transformation of the involved imsets u, v and w. [sent-629, score-0.336]
93 Now, assuming uL ⇀ ut , the equivalent characterization (2) of the independence implication implies that, for every discrete probability distribution P over N, whenever ⟨mP , uL⟩ = 0 then ⟨mP , ut⟩ = 0. [sent-655, score-0.34]
94 Observe that elementary imsets are closed under (the composition with) reflection. [sent-663, score-0.464]
95 (2010) is as follows: Proposition 6 For |N| = 5, the least factor n∗ ∈ N such that, for any imset u over N, it holds that u ∈ S (N) ⇔ n∗ · u ∈ C (N), is n∗ = 2. [sent-670, score-0.395]
96 Here, the rows of A correspond to subsets of N (d = |P (N)|), the columns to elementary imsets (n = |E (N)|) and b has b(S), S ⊆ N as components. [sent-715, score-0.464]
97 In other words, the first step of the simplex method, the move from z0 to z1 , means subtracting a non-negative integral multiple of an elementary imset w from b = z0 so that the result remains a non-negative vector. [sent-731, score-0.566]
98 Although further steps can no longer be interpreted as elementary imset subtraction, the following still holds. [sent-732, score-0.523]
99 If Q = ∅, the simplex method iteratively finds the decomposition of b into elementary imsets with non-negative coefficients. [sent-733, score-0.543]
100 On the conditional independence implication problem, a lattice theoretic approach. [sent-806, score-0.23]
wordName wordTfidf (topN-words)
[('studen', 0.482), ('imset', 0.395), ('imsets', 0.336), ('lp', 0.234), ('ci', 0.176), ('racer', 0.153), ('abc', 0.15), ('elementary', 0.128), ('ug', 0.125), ('polyhedron', 0.119), ('cone', 0.117), ('ul', 0.111), ('implication', 0.109), ('schrijver', 0.102), ('polyhedral', 0.101), ('supermodular', 0.1), ('emmecke', 0.095), ('indner', 0.095), ('milan', 0.095), ('onditional', 0.095), ('tuden', 0.095), ('bouckaert', 0.094), ('hilbert', 0.088), ('statements', 0.085), ('ndependence', 0.081), ('ouckaert', 0.081), ('skeletal', 0.08), ('bc', 0.075), ('fficient', 0.073), ('falsi', 0.073), ('hemmecke', 0.073), ('ut', 0.072), ('lgorithms', 0.067), ('ab', 0.067), ('independence', 0.063), ('hb', 0.062), ('combinatorial', 0.06), ('nference', 0.059), ('rays', 0.058), ('lattice', 0.058), ('uh', 0.057), ('ax', 0.055), ('representatives', 0.052), ('rational', 0.049), ('vertex', 0.047), ('bruns', 0.044), ('niepert', 0.044), ('pivoting', 0.044), ('kw', 0.044), ('unresolved', 0.044), ('simplex', 0.043), ('iff', 0.043), ('testing', 0.042), ('triple', 0.041), ('structural', 0.041), ('mp', 0.041), ('algebraic', 0.038), ('acyclic', 0.038), ('facets', 0.037), ('electronicly', 0.037), ('raymond', 0.037), ('decomposition', 0.036), ('faces', 0.034), ('statement', 0.032), ('pearl', 0.032), ('remco', 0.031), ('ac', 0.03), ('abcd', 0.029), ('gucht', 0.029), ('vomlel', 0.029), ('prague', 0.028), ('elapsed', 0.028), ('bd', 0.028), ('pointed', 0.026), ('valid', 0.025), ('list', 0.025), ('rn', 0.025), ('inclusion', 0.025), ('capped', 0.025), ('disprove', 0.025), ('basis', 0.024), ('inference', 0.024), ('characterization', 0.024), ('implications', 0.023), ('directed', 0.023), ('veri', 0.023), ('racing', 0.022), ('input', 0.022), ('conical', 0.022), ('cplex', 0.022), ('ichim', 0.022), ('ilog', 0.022), ('mathias', 0.022), ('oneself', 0.022), ('polyhedra', 0.022), ('tableau', 0.022), ('cd', 0.022), ('proposition', 0.021), ('extreme', 0.021), ('team', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
Author: Remco Bouckaert, Raymond Hemmecke, Silvia Lindner, Milan Studený
Abstract: The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by a computer the corresponding algebraic relation between the vectors, called the independence implication. We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods to the number of involved variables. We recall/describe the theoretical basis for all four methods involved in our computational experiments, whose aim was to compare the efficiency of the algorithms. The experiments show that the LP method is clearly the fastest one. As an example of possible application of such algorithms we show that testing inclusion of Bayesian network structures or whether a CI statement is encoded in an acyclic directed graph can be done by the algebraic method. Keywords: conditional independence inference, linear programming approach
2 0.07131999 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
3 0.04794405 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
Author: Pradeep Ravikumar, Alekh Agarwal, Martin J. Wainwright
Abstract: The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on “tree-based” linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences. As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems. Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. We establish convergence guarantees for our algorithms, and illustrate their performance via some simulations. We also develop two classes of rounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions. Our deterministic rounding schemes use a “re-parameterization” property of our algorithms so that when the LP solution is integral, the MAP solution can be obtained even before the LP-solver converges to the optimum. We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general. We analyze the performance of and report simulations comparing these rounding schemes. Keywords: graphical models, MAP Estimation, LP relaxation, proximal minimization, rounding schemes
4 0.037330367 53 jmlr-2010-Inducing Tree-Substitution Grammars
Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater
Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process
5 0.034705713 44 jmlr-2010-Graph Kernels
Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt
Abstract: We present a unified framework to study graph kernels, special cases of which include the random a walk (G¨ rtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mah´ et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time e complexity of kernel computation between unlabeled graphs with n vertices from O(n6 ) to O(n3 ). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn3 ) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n4 ) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n2 ) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment o kernel of Fr¨ hlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks
6 0.032007463 39 jmlr-2010-FastInf: An Efficient Approximate Inference Library
7 0.031911481 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation
8 0.030279562 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
9 0.028539428 27 jmlr-2010-Consistent Nonparametric Tests of Independence
10 0.027958613 116 jmlr-2010-WEKA−Experiences with a Java Open-Source Project
11 0.026206074 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
12 0.025898496 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
13 0.023102477 82 jmlr-2010-On Learning with Integral Operators
14 0.021205571 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
15 0.021083852 56 jmlr-2010-Introduction to Causal Inference
16 0.020826949 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
17 0.020735729 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
18 0.020311607 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
19 0.020156214 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
20 0.019771583 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
topicId topicWeight
[(0, -0.108), (1, 0.006), (2, -0.01), (3, -0.025), (4, -0.014), (5, -0.035), (6, -0.057), (7, -0.032), (8, -0.031), (9, 0.133), (10, -0.012), (11, 0.039), (12, 0.008), (13, 0.064), (14, -0.023), (15, -0.071), (16, -0.04), (17, -0.014), (18, -0.014), (19, 0.012), (20, -0.047), (21, 0.088), (22, -0.06), (23, -0.088), (24, 0.004), (25, -0.008), (26, 0.006), (27, -0.036), (28, -0.04), (29, 0.095), (30, -0.059), (31, 0.166), (32, 0.152), (33, 0.029), (34, 0.129), (35, -0.054), (36, -0.059), (37, -0.209), (38, -0.099), (39, 0.113), (40, -0.111), (41, 0.27), (42, 0.075), (43, -0.111), (44, -0.116), (45, 0.039), (46, -0.217), (47, 0.151), (48, 0.07), (49, -0.138)]
simIndex simValue paperId paperTitle
same-paper 1 0.94565207 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
2 0.42272395 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
Author: Rahul Gupta, Sunita Sarawagi, Ajit A. Diwan
Abstract: Many structured information extraction tasks employ collective graphical models that capture interinstance associativity by coupling them with various clique potentials. We propose tractable families of such potentials that are invariant under permutations of their arguments, and call them symmetric clique potentials. We present three families of symmetric potentials—MAX, SUM, and MAJORITY . We propose cluster message passing for collective inference with symmetric clique potentials, and present message computation algorithms tailored to such potentials. Our first message computation algorithm, called α-pass, is sub-quadratic in the clique size, outputs exact messages for 13 MAX , and computes 15 -approximate messages for Potts, a popular member of the SUM family. Empirically, it is upto two orders of magnitude faster than existing algorithms based on graph-cuts or belief propagation. Our second algorithm, based on Lagrangian relaxation, operates on MAJORITY potentials and provides close to exact solutions while being two orders of magnitude faster. We show that the cluster message passing framework is more principled, accurate and converges faster than competing approaches. We extend our collective inference framework to exploit associativity of more general intradomain properties of instance labelings, which opens up interesting applications in domain adaptation. Our approach leads to significant error reduction on unseen domains without incurring any overhead of model retraining. Keywords: passing graphical models, collective inference, clique potentials, cluster graphs, message
3 0.3684828 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
4 0.31622127 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure
Author: Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure. The previously derived constrained optimal search (COS) remains limited even for sparse superstructures. To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them. Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs. Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained graph. Empirical evaluations demonstrate that our algorithm can learn optimal Bayesian networks for some graphs containing several hundreds of vertices, and even for super-structures having a high average degree (up to four), which is a drastic improvement in feasibility over the previous optimal algorithm. Learnt networks are shown to largely outperform state-of-the-art heuristic algorithms both in terms of score and structural hamming distance. Keywords: Bayesian networks, structure learning, constrained optimal search
5 0.25990266 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
6 0.243836 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
7 0.23205459 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
8 0.21680027 27 jmlr-2010-Consistent Nonparametric Tests of Independence
9 0.19661976 82 jmlr-2010-On Learning with Integral Operators
10 0.19285657 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
11 0.192799 35 jmlr-2010-Error-Correcting Output Codes Library
12 0.17974931 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
13 0.1756247 36 jmlr-2010-Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity
14 0.17181119 39 jmlr-2010-FastInf: An Efficient Approximate Inference Library
15 0.16332322 56 jmlr-2010-Introduction to Causal Inference
16 0.15844563 53 jmlr-2010-Inducing Tree-Substitution Grammars
17 0.15671389 44 jmlr-2010-Graph Kernels
18 0.14948872 8 jmlr-2010-A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design
19 0.14512871 70 jmlr-2010-MOA: Massive Online Analysis
20 0.14343973 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
topicId topicWeight
[(3, 0.022), (4, 0.014), (8, 0.015), (21, 0.013), (24, 0.021), (32, 0.046), (33, 0.012), (36, 0.044), (37, 0.052), (40, 0.496), (75, 0.109), (81, 0.013), (85, 0.037), (96, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.66293752 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
2 0.49212146 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory
Author: Erik Ĺ trumbelj, Igor Kononenko
Abstract: We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method’s initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful. Keywords: data postprocessing, classification, explanation, visualization
3 0.27219051 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
4 0.26866788 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
5 0.26640108 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
6 0.26338756 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
7 0.26227778 69 jmlr-2010-Lp-Nested Symmetric Distributions
8 0.26221249 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
9 0.26180279 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
10 0.26099816 63 jmlr-2010-Learning Instance-Specific Predictive Models
11 0.26061645 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
12 0.2603074 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
13 0.26024294 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
14 0.26017135 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
15 0.25962096 109 jmlr-2010-Stochastic Composite Likelihood
16 0.25931293 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
17 0.25915679 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
18 0.25880522 44 jmlr-2010-Graph Kernels
19 0.2579619 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
20 0.25773877 66 jmlr-2010-Linear Algorithms for Online Multitask Classification