jmlr jmlr2013 jmlr2013-44 knowledge-graph by maker-knowledge-mining

44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints


Source: pdf

Author: Pekka Parviainen, Mikko Koivisto

Abstract: We consider the problem of finding a directed acyclic graph (DAG) that optimizes a decomposable Bayesian network score. While in a favorable case an optimal DAG can be found in polynomial time, in the worst case the fastest known algorithms rely on dynamic programming across the node subsets, taking time and space 2^n, to within a factor polynomial in the number of nodes n. In practice, these algorithms are feasible for networks of at most around 30 nodes, mainly due to the large space requirement. Here, we generalize the dynamic programming approach to enhance its feasibility in three dimensions: first, the user may trade space against time; second, the proposed algorithms easily and efficiently parallelize onto thousands of processors; third, the algorithms can exploit any prior knowledge about the precedence relation on the nodes. Underlying all these results is the key observation that, given a partial order P on the nodes, an optimal DAG compatible with P can be found in time and space roughly proportional to the number of ideals of P, which can be significantly less than 2^n. Considering sufficiently many carefully chosen partial orders guarantees that a globally optimal DAG will be found. Aside from the generic scheme, we present and analyze concrete tradeoff schemes based on parallel bucket orders. Keywords: exact algorithm, parallelization, partial order, space-time tradeoff, structure learning

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Underlying all these results is the key observation that, given a partial order P on the nodes, an optimal DAG compatible with P can be found in time and space roughly proportional to the number of ideals of P, which can be significantly less than 2^n. [sent-9, score-0.604]

2 Aside from the generic scheme, we present and analyze concrete tradeoff schemes based on parallel bucket orders. [sent-11, score-0.541]

3 Specifically, the time and space requirements can be made to grow roughly linearly in the number of node subsets that are closed under taking predecessors, called ideals (or downsets) of the partial order. [sent-56, score-0.692]

4 As the number of ideals can be much smaller than the number of all node subsets, depending on the width of the partial order, there is potential for a significant improvement over the existing DP algorithms, which are ignorant of precedence constraints. [sent-57, score-0.653]

5 The second part of our observation addresses this requirement: Instead of assuming a single given partial order, we may consider multiple partial orders that together cover all possible DAGs, that is, every DAG is compatible with at least one of the partial orders. [sent-59, score-0.744]

6 We are left with the freedom to choose either many partial orders, each with only few ideals (the extreme is to consider all linear orders), or just a few partial orders, each with many ideals, or something in between. [sent-62, score-0.589]

7 The generic approach being quite abstract, its more concrete implications are derived in Section 6 for a particular class of partial orders, namely (parallel compositions of) bucket orders. [sent-71, score-0.523]

8 Some of the combinatorial analyses concerning bucket orders build on our on-going work on related permutation problems—some results have been published in a preliminary form in a conference proceedings (Koivisto and Parviainen, 2010) that we will cite in Section 6. [sent-72, score-0.523]

9 To this end, we make the usual assumption that the scoring function is decomposable, that is, for each node v there exists a local scoring function fv such that f(A) = Σ_{v∈N} fv(Av) for all DAGs A on N; each fv is a function from the subsets of N \ {v} to the real numbers. [sent-105, score-1.906]

10 We call f (A) and fv (Av ) the score of A and the (local) score of Av , respectively. [sent-106, score-0.656]
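
(Illustration, not from the paper.) To make the decomposability assumption concrete, the following minimal Python sketch evaluates f(A) as a sum of local scores; the node names and score values are hypothetical.

```python
# Minimal illustration of a decomposable score f(A) = sum over v in N of f_v(A_v).
# The local score tables are hypothetical; in practice they would come from the data
# and a decomposable criterion such as BDeu or BIC.

from typing import Dict, FrozenSet

def total_score(parents: Dict[str, FrozenSet[str]],
                local_scores: Dict[str, Dict[FrozenSet[str], float]]) -> float:
    """f(A) as the sum of local scores f_v(A_v); -inf marks a disallowed parent set."""
    return sum(local_scores[v].get(A_v, float("-inf")) for v, A_v in parents.items())

# Example: the DAG  a -> b <- c  on three nodes.
local_scores = {
    "a": {frozenset(): -1.0},
    "b": {frozenset(): -3.0, frozenset({"a", "c"}): -1.5},
    "c": {frozenset(): -1.2},
}
dag = {"a": frozenset(), "b": frozenset({"a", "c"}), "c": frozenset()}
print(total_score(dag, local_scores))   # -3.7
```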

11 First, while the notion of decomposability concerns the mere existence of local scoring functions, we naturally assume that the functions fv are given explicitly as input. [sent-123, score-0.609]

12 In practice, the values fv (Av ) are computed for every relevant Av based on the data and the chosen scoring function. [sent-124, score-0.609]

13 Second, we allow fv (Av ) to take the value −∞ to indicate that Av cannot be the parent set of v. [sent-125, score-0.692]

14 We assume that each fv is provided as an array of tuples (Z, fv (Z), U ), where Z ∈ Fv and U consists of the nodes u outside Z satisfying Z ∪ {u} ∈ Fv . [sent-158, score-1.163]

15 Proposition 2 (interval queries) Given a downward closed collection Fv in the augmented representation and sets X, Y ⊆ N \ {v}, the scores fv (Z) for all Z ∈ Fv in the interval [X, Y ] can be listed in O(n) time per score. [sent-163, score-0.696]

16 Proof First, search for the tuple (X, fv (X), U ). [sent-164, score-0.538]

17 Otherwise let Z = X, list fv (Z), and proceed recursively to the tuples of Z ′ = Z ∪ {u} for each u ∈ U that belongs to Y \ Z and succeeds the maximum node in Z \ X. [sent-166, score-0.66]
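
(Illustration, not the authors' code.) A sketch of the interval query under the augmented representation described above, with a dictionary standing in for the array of tuples (Z, fv(Z), U), and nodes represented as integers:

```python
# Sketch of the interval query of Proposition 2. `table` maps each Z in F_v to
# (f_v(Z), U) with U = {u not in Z : Z ∪ {u} in F_v}, i.e. the augmented representation.
# Sets in [X, Y] are reached by adding the elements of Z \ X in increasing order, so each
# one is listed exactly once; downward closedness of F_v guarantees that every
# intermediate set is itself present in the table.

def list_interval(table, X, Y):
    """Yield (Z, f_v(Z)) for all Z in F_v with X ⊆ Z ⊆ Y (X, Y, Z are frozensets of ints)."""
    if X not in table:                       # empty interval if X itself is not in F_v
        return
    stack = [(X, float("-inf"))]             # (current set Z, largest element of Z \ X)
    while stack:
        Z, last = stack.pop()
        score, U = table[Z]
        yield Z, score
        for u in U:
            if u in Y and u > last:          # u ∉ Z holds by the definition of U
                stack.append((Z | {u}, u))
```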

18 In the first phase, the algorithm computes the best-score for each node v and set of predecessors Y ⊆ N \ {v}, defined as f̂v(Y) = max_{X⊆Y} fv(X). [sent-192, score-1.249]

19 The direct computation of f̂v(Y) for any fixed v and Y requires 2^|Y| basic operations. [sent-193, score-0.538]

20 Lemma 3 (Ott and Miyano 2003) If v ∈ N and Y ⊆ N \ {v}, then f̂v(Y) = max{ fv(Y), max_{u∈Y} f̂v(Y \ {u}) }. [sent-196, score-1.664]

21 If the values f̂v(X) have already been computed for all X ⊂ Y and stored in an array, computing f̂v(Y) takes no more than n comparisons. [sent-198, score-1.076]

22 Thus, the values f̂v(Y) for all v and Y can be computed in O(n^2 2^n) time. [sent-200, score-0.538]

23 One easily finds the recurrence g(Y) = max_{v∈Y} { g(Y \ {v}) + f̂v(Y \ {v}) } (1), with g(∅) = 0. [sent-204, score-0.653]

24 In other words, g(Y \ {v}) + f̂v(Y \ {v}) is the maximum score over all DAGs on Y such that v is the “last node”, that is, the parents of v are selected from Y \ {v}. [sent-205, score-0.668]

25 Given the values f̂v(Y) computed in the first phase, the values g(Y) can be computed in O(n 2^n) basic operations. [sent-206, score-0.538]
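
(Illustration, not the authors' code.) The two phases can be sketched as a small bitmask dynamic program for small n; it stores the full 2^n tables rather than the level-by-level variant discussed below, and it also returns the tables so that an optimal DAG can be reconstructed afterwards.

```python
# Illustrative bitmask DP for the two phases described above (small n only; our own sketch).
# local_scores[v] maps a parent-set bitmask (over nodes other than v) to f_v(parent set);
# masks that are absent are treated as disallowed, i.e. score -inf.

from math import inf

def best_network_score(n, local_scores):
    # Phase 1: f_hat[v][Y] = max over X ⊆ Y of f_v(X), via Lemma 3.
    f_hat = [[-inf] * (1 << n) for _ in range(n)]
    for v in range(n):
        for Y in range(1 << n):
            if Y >> v & 1:
                continue                       # Y must exclude v
            best = local_scores[v].get(Y, -inf)
            bits = Y
            while bits:                        # drop one element u of Y at a time
                u = bits & -bits
                best = max(best, f_hat[v][Y ^ u])
                bits ^= u
            f_hat[v][Y] = best

    # Phase 2: g[Y] = max over v in Y of g[Y \ {v}] + f_hat[v][Y \ {v}], with g[∅] = 0.
    g = [-inf] * (1 << n)
    g[0] = 0.0
    for Y in range(1, 1 << n):
        bits = Y
        while bits:
            vbit = bits & -bits
            v = vbit.bit_length() - 1
            rest = Y ^ vbit
            g[Y] = max(g[Y], g[rest] + f_hat[v][rest])
            bits ^= vbit
    return g[(1 << n) - 1], f_hat, g           # optimal score plus the two tables

# Example: two nodes where the arc 0 -> 1 is slightly better than no arcs.
scores = [{0b00: -1.0}, {0b00: -2.0, 0b01: -1.5}]
print(best_network_score(2, scores)[0])        # -2.5
```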

26 Like the time requirement, the space requirement is also dominated by the first phase, that is, the management of the values f̂v(Y). [sent-208, score-0.659]

27 , 2011): Note that the computation of both g(Y) and f̂v(Y) requires information only about the sets of size |Y| − 1. [sent-211, score-0.538]

28 Thereby, at any level ℓ we need to keep in memory the values g(Y) and f̂v(Y) only for O( (n choose ℓ) + (n choose ℓ−1) ) sets Y of size ℓ and ℓ − 1. [sent-213, score-0.58]

29 Starting with Y = N, first a node v ∈ Y such that g(Y) = g(Y \ {v}) + f̂v(Y \ {v}) can be found in O(n) time, since the required values of g and f̂v have already been computed and are available. [sent-216, score-0.66]

30 An optimal parent set for v can then be found by brute force in O(|Fv|) time. [sent-217, score-0.692]

31 If the values of g and f̂v are kept in memory for all node subsets, then this step can be just repeated n times. [sent-218, score-0.702]

32 Otherwise, if the values of g and f̂v are kept in memory only for the node subsets at the last two levels, one can re-solve the entire problem on the smaller node subset Y \ {v} to recompute the needed values of g and f̂v. [sent-219, score-1.39]
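
(Illustration, continuing the sketch above.) The full-memory reconstruction step can then be written as a simple backtracking pass over g; the reduced-memory variant would instead re-solve the problem on Y \ {v} as just described.

```python
# Backtracking over g to recover an optimal DAG when all values are kept in memory.
# local_scores[v] maps parent-set bitmasks to f_v; f_hat and g are the tables from above.

def reconstruct_dag(n, local_scores, f_hat, g):
    parents = {}
    Y = (1 << n) - 1
    while Y:
        # find a "last" node v with g[Y] = g[Y \ {v}] + f_hat[v][Y \ {v}]
        v = next(v for v in range(n)
                 if Y >> v & 1 and g[Y] == g[Y ^ (1 << v)] + f_hat[v][Y ^ (1 << v)])
        rest = Y ^ (1 << v)
        # brute force an optimal parent set for v among the stored sets inside `rest`;
        # falls back to the empty parent set if none fits
        candidates = [m for m in local_scores[v] if m & ~rest == 0]
        parents[v] = max(candidates, key=lambda m: local_scores[v][m], default=0)
        Y = rest
    return parents   # maps node index -> parent-set bitmask
```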

33 Computing g1 can be more expensive, since evaluating the term f̂v(N0 ∪ Y \ {v}) requires considering all possible subsets of N0 as parents of v, in addition to a subset of Y \ {v}. [sent-232, score-0.637]

34 To this end, at each X1 ⊆ N1 \ {v} define f′v(X1) = max{ fv(X) : X ∩ N1 = X1, X ∈ Fv } and f̂′v(Y1) = max_{X1⊆Y1} f′v(X1). [sent-233, score-2.202]

35 Observe that computing f′v(X1) for all X1 takes O((F + 2^{n−s})n) time in total, where F is the size of Fv. [sent-234, score-0.58]

36 Now, because f̂v(N0 ∪ Y1) = f̂′v(Y1), the algorithm of the previous section again applies, taking O((F + 2^{n−s})n^2 + 2^{n−s}n^2) time and O(2^{n−s}n) space in total. [sent-235, score-1.154]

37 Informally speaking, we replace a two-bucket partition by a partial order and, accordingly, the consideration of all fixed-size partitions by the consideration of sufficiently many partial orders so as to “cover” the linear orders. [sent-272, score-0.489]

38 A notion of compatibility between DAGs and partial orders will be central in our developments: Definition 7 (compatibility) A DAG A and a partial order P are said to be compatible with each other if there exists a partial order Q that is a superset of both A and P . [sent-293, score-0.752]

39 We will show that, when maximizing a decomposable scoring function over DAGs that are compatible with a given partial order, the basic dynamic programming algorithm can be trimmed to run only across the ideals. [sent-303, score-0.467]

40 Continuing the same reasoning to smaller subproblems shows that it is sufficient to tabulate the optimal scores (DAGs) for the ideals of the partial order. [sent-320, score-0.501]

41 Recall that in the first phase, the task is to compute the values f̂v(Y) for all nodes v and node subsets Y ⊆ N \ {v}. [sent-334, score-0.745]

42 In terms of the set functions fv and f̂v, for an arbitrary v, Lemma 10 amounts to the following generalization of Lemma 3 (one obtains Lemma 3 at X = Y): Lemma 11 Let X and Y be subsets of N \ {v} with X ⊆ Y. [sent-343, score-1.104]

43 Then f̂v(Y) = max{ max_{X⊆Z⊆Y} fv(Z), max_{u∈X} f̂v(Y \ {u}) }. [sent-344, score-1.689]

44 Proof By Lemma 10(ii) and the definition of f̂v, the maximum of fv(Z) over Z ∈ ∪_{u∈X} 2^{Y\{u}} equals max_{u∈X} f̂v(Y \ {u}). [sent-345, score-1.614]

45 By Lemma 10(i), the larger of max_{X⊆Z⊆Y} fv(Z) and max_{u∈X} f̂v(Y \ {u}) equals f̂v(Y). [sent-346, score-1.076]

46 For this choice, we make use of the fact that in the second phase of dynamic programming, as it will turn out in the next subsection, we need the values f̂v(Y) only for sets Y that are ideals of P. [sent-352, score-0.916]

47 By letting X = Y and noting that fv(Z) = −∞ for Z ∉ Fv, we may rephrase the equation in Lemma 11 as f̂v(Y) = max{ max_{X⊆Z⊆Y, Z∈Fv} fv(Z), max_{u∈X} f̂v(Y \ {u}) }. (4) [sent-360, score-2.227]

48 4 to perform the recurrence of Lemma 11 over the ideals of the partial order and their tails. [sent-386, score-0.545]

49 Now that we restrict our attention to DAGs that are compatible with the given partial order P , we may restrict the recurrence to only those sets that can begin some linear extension of P , that is, to the ideals of P . [sent-389, score-0.616]

50 Formally, define the function g^P by g^P(∅) = 0 and, for nonempty Y ∈ I(P), recursively g^P(Y) = max_{v∈Y, Y\{v}∈I(P)} { g^P(Y \ {v}) + f̂v(Y \ {v}) }. [sent-390, score-0.563]
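
(Illustration, not the authors' code.) A minimal sketch of this restricted recurrence: ideals of P are enumerated naively, and f̂v is assumed to be available as a callable, for example precomputed as in the earlier sketch.

```python
# Our own illustration of the restricted recurrence: enumerate the ideals of P naively
# and run g^P over them.  pred[v] is the set of nodes that must precede v under P,
# and f_hat(v, Y) is assumed to return the phase-1 value for node v and predecessor set Y.

from itertools import combinations
from math import inf

def ideals_of(nodes, pred):
    """All ideals of P (sets closed under taking predecessors), in order of cardinality."""
    out = []
    for k in range(len(nodes) + 1):
        for comb in combinations(sorted(nodes), k):
            Y = frozenset(comb)
            if all(pred[v] <= Y for v in Y):
                out.append(Y)
    return out

def g_P(nodes, pred, f_hat):
    ideal_list = ideals_of(nodes, pred)
    ideal_set = set(ideal_list)
    g = {frozenset(): 0.0}
    for Y in ideal_list:
        if not Y:
            continue
        best = -inf
        for v in Y:                      # v is admissible only if Y \ {v} is still an ideal
            rest = Y - {v}
            if rest in ideal_set:
                best = max(best, g[rest] + f_hat(v, rest))
        g[Y] = best
    return g[frozenset(nodes)]
```

With the empty partial order, every subset of N is an ideal and g^P coincides with the unconstrained recurrence (1) above.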

51 Here the score f(A′) is naturally defined as Σ_{v∈Y} fv(A′v). [sent-397, score-0.597]

52 Clearly, there is exactly one DAG A′ on {v} and it is compatible with P[{v}] = {vv}; the score of the DAG is f(A′) = fv(∅) = f̂v(∅). [sent-399, score-1.206]

53 Now, write max_{A compatible with P} f(A) = max_{L⊇P} max_{A⊆L} f(A) = max_{L⊇P} max_{A⊆L} Σ_{v∈N} fv(Av) = max_{L⊇P} Σ_{v∈N} max_{Av⊆Lv} fv(Av) = max_{L⊇P} Σ_{v∈N} f̂v(Lv) = max_{v∈N} [ f̂v(N \ {v}) + max_{L′⊇P[N\{v}]} Σ_{u∈N\{v}} f̂u(L′u) ] = max_{v∈N} [ f̂v(N \ {v}) + g^{P[N\{v}]}(N \ {v}) ]. [sent-403, score-2.986]

54 For the second statement, it suffices to observe that max_{P∈P} max_{L⊇P} Σ_{v∈N} f̂v(Lv) = max_L Σ_{v∈N} f̂v(Lv) = max_A f(A), since P is a cover on N. [sent-407, score-1.226]

55 In Algorithm 1 below, g^P[Y] and f̂v[Y] denote program variables that correspond to the respective target values g^P(Y) and f̂v(Y) to be computed. [sent-410, score-1.076]

56 For each nonempty Y ∈ I(P), in increasing order of cardinality: (a) let g^P[Y] ← max_{v∈Y̌} { g^P[Y \ {v}] + f̂v[Y \ {v}] }; (b) for each v ∈ Y, let f̂v[Y] be the larger of max fv(Z) and max_u f̂v[Y \ {u}]. [sent-417, score-2.256]

57 Proof Observe first that the algorithm correctly computes f̂v for each v ∈ N, that is, after the execution of the algorithm we have f̂v[Y] = f̂v(Y) for all v ∈ Y ∈ I(P). [sent-419, score-1.076]

58 To this end, it suffices to note that step 3(b) implements the recurrence of Lemma 11 as rephrased in (4). [sent-420, score-0.628]

59 The space requirement for a fixed partial order P is O(|I(P)| n), since by Lemma 12 the values g^P[Y] and f̂v[Y] are only needed for Y ∈ I(P). [sent-439, score-0.809]

60 Unfortunately, designing such an optimal scheme seems difficult due to the combinatorial challenge of simultaneously controlling the required covering property of the partial order system and the number of ideals of the members of the system. [sent-448, score-0.549]

61 The ease of the parallelization stems from the fact that dynamic programming over the ideals can be done independently for each partial order P in P. [sent-455, score-0.622]

62 Bucket Order Schemes To apply the partial order approach in practice, it is essential to find partial order systems that provide a good tradeoff between time and space requirements. [sent-459, score-0.514]

63 This section concentrates on a class of partial orders we call parallel bucket orders, which turn out to be relatively easy to analyze and seem to yield a good tradeoff in practice. [sent-460, score-0.792]

64 A series-parallel partial order is defined recursively: a partial order is a series-parallel partial order if it is either a singleton order or a parallel or series composition of two or more series-parallel partial orders. [sent-468, score-0.961]

65 For example, the partial order {AA, BB, CC, AB, AC} is a series composition of {AA} and {BB, CC}, of which the former is a singleton order and the latter is a parallel composition of two singleton orders. [sent-469, score-0.524]

66 We study two special classes of series-parallel partial orders, namely bucket orders and parallel bucket orders. [sent-471, score-1.1]

67 A bucket order is a series composition of the trivial orders on some ground sets B1, B2, . . . [sent-472, score-0.662]

68 For instance, the series-parallel partial order in the previous paragraph is a bucket order {A}{B, C}, thus of length two and type 1 ∗ 2. [sent-478, score-0.581]

69 Likewise, the partial order in Figure 1(b) is a bucket order of length three and type 2 ∗ 3 ∗ 1, the trivial order on some ground set M is of length one and type |M |, and a linear order on M is of length |M | and type 1 ∗ 1 ∗ · · · ∗ 1. [sent-479, score-0.703]

70 By taking a parallel composition of some number of bucket orders we obtain a parallel bucket order. [sent-480, score-1.095]

71 Note that the same parallel bucket order may be obtained by different collections of bucket orders. [sent-481, score-0.858]

72 It is, however, easy to observe that for each parallel bucket order P there is a unique collection of bucket orders P1, P2, . . . [sent-482, score-0.998]

73 . . . , Pp, called the bucket orders of P, such that their parallel composition is P and they are the connected components of P. [sent-485, score-0.652]

74 The following lemma states a well-known result concerning the number of ideals of a series-parallel partial order (see, for example, Steiner’s article, 1990, and references therein). [sent-486, score-0.51]

75 Then (i) the series composition of P1 and P2 has |I(P1 )| + |I(P2 )| − 1 ideals and (ii) the parallel composition of P1 and P2 has |I(P1 )||I(P2 )| ideals. [sent-488, score-0.496]
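
(Illustration, names ours.) Lemma 19 turns directly into code for the classes considered next, with bucket types given as lists of bucket sizes:

```python
# Ideal counts via Lemma 19: series composition adds counts (minus 1),
# parallel composition multiplies them.  A trivial order on b elements has 2**b ideals,
# so a bucket order of type b1 * b2 * ... * bl and a parallel bucket order follow directly.

from math import prod

def ideals_bucket_order(bucket_sizes):
    """|I(P)| for a bucket order of type b1 * b2 * ... * bl (series of trivial orders)."""
    counts = [2 ** b for b in bucket_sizes]
    return sum(counts) - (len(counts) - 1)

def ideals_parallel_bucket_order(list_of_types):
    """|I(P)| for a parallel composition of bucket orders, one type per component."""
    return prod(ideals_bucket_order(t) for t in list_of_types)

# Example: the bucket order {A}{B,C} of type 1 * 2 has 2 + 4 - 1 = 5 ideals,
# and composing it in parallel with a trivial order on two nodes gives 5 * 4 = 20.
print(ideals_bucket_order([1, 2]))                      # 5
print(ideals_parallel_bucket_order([[1, 2], [2]]))      # 20
```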

76 The next two lemmas state the number of ideals of a parallel bucket order. [sent-489, score-0.706]

77 Lemma 20 Let P be the parallel composition of bucket orders P1 , P2 , . [sent-499, score-0.652]

78 Furthermore, we say two parallel bucket orders are reorderings of each other if their bucket orders can be labelled as P1 , P2 , . [sent-507, score-1.121]

79 We denote the collection of reorderings of a parallel bucket order P by R(P ). [sent-514, score-0.554]

80 We note that two bucket orders are reorderings of each other if and only if they are automorphic to each other. [sent-515, score-0.544]

81 However, this equivalence does not hold for parallel bucket orders in general, for reordering only allows shuffling within each bucket order, but not between different bucket orders. [sent-516, score-1.325]

82 Theorem 22 Let P be a parallel composition of bucket orders. [sent-520, score-0.518]

83 It suffices to construct a parallel bucket order Q ∈ R(P ) such that L is an extension of Q. [sent-523, score-0.472]

84 Let P1, P2, . . . , Pp be the bucket orders of P, with respective ground sets M1, M2, . . . , Mp. [sent-527, score-0.558]

85 When P is a parallel composition of p bucket orders, each of type b1 ∗ b2 ∗ · · · ∗ bℓ , we find it convenient to denote the POS R(P ) by (b1 ∗ b2 ∗ · · · ∗ bℓ )p . [sent-545, score-0.518]

86 It is instructive to notice that a POS (b1 ∗ b2 ∗ · · · ∗ bℓ)^p consists of ((b1+b2+···+bℓ)! / (b1! b2! · · · bℓ!))^p different partial orders, since any bucket order of type b1 ∗ b2 ∗ · · · ∗ bℓ has exactly (b1+b2+···+bℓ choose b1, b2, . . . , bℓ) = (b1+b2+···+bℓ)! / (b1! b2! · · · bℓ!) reorderings. [sent-548, score-0.552]
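
(Illustration, names ours.) Assuming the multinomial count above, the size of such a POS is straightforward to compute:

```python
# Number of partial orders in the POS (b1 * b2 * ... * bl)^p: each bucket order on a
# fixed ground set can be reordered in (b1+...+bl)! / (b1! ... bl!) ways (a multinomial
# coefficient), and the p components are reordered independently.

from math import factorial, prod

def reorderings_of_bucket_order(bucket_sizes):
    return factorial(sum(bucket_sizes)) // prod(factorial(b) for b in bucket_sizes)

def pos_size(bucket_sizes, p):
    return reorderings_of_bucket_order(bucket_sizes) ** p

# Example: the POS (2 * 2)^3 on 12 nodes contains 6^3 = 216 partial orders.
print(pos_size([2, 2], 3))   # 216
```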

87 Let the bucket order on N1 be the one shown in Figure 1(b), denoted as P1 , and let the bucket order on N2 be the trivial order {GG, HH}, denoted as P2 . [sent-554, score-0.807]

88 2 Bucket Order Schemes A partial order system (b1 ∗ b2 ∗ · · · ∗ bℓ )p is associated with some natural parameters, such as the number of parallel bucket orders p. [sent-560, score-0.769]

89 When all or some of the parameters are treated as variables that can take different values, we refer to the implied family of partial order systems as a bucket order scheme. [sent-561, score-0.581]

90 Another example of a bucket order scheme is the two-bucket scheme of Section 3, which corresponds to partial order systems (s ∗ (n − s))1 , with s treated as the parameter. [sent-567, score-0.769]

91 Via Theorem 16, any fixed bucket order scheme implies parameterized time and space complexity bounds for the OBN problem. [sent-569, score-0.561]

92 Our prior study (Koivisto and Parviainen, 2010) shows that no such scheme exists among bucket order schemes. [sent-574, score-0.483]

93 Since the partial orders in (m/2 ∗ m/2)^{n/m} have as many or fewer ideals than the partial orders in (m/2 ∗ m/2)^p, for p ≤ n/m, we have the following characterization. [sent-618, score-0.857]
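
(Illustration, names ours; assumes m is even and divides n.) Combining the counts above gives a rough profile of the scheme (m/2 ∗ m/2)^{n/m}: the space is driven by the number of ideals of a single partial order, and the total work by the number of partial orders times that space.

```python
# Rough profile of the scheme (m/2 * m/2)^(n/m), assuming m even and m dividing n.
# Space per partial order ~ number of its ideals (Lemmas 19-20);
# number of partial orders in the POS = C(m, m/2)^(n/m) reorderings (cf. Theorem 22).

from math import comb

def scheme_profile(n, m):
    half = m // 2
    ideals_per_component = 2 ** half + 2 ** half - 1      # bucket order of type m/2 * m/2
    ideals_per_order = ideals_per_component ** (n // m)   # parallel composition of n/m of them
    num_orders = comb(m, half) ** (n // m)
    return ideals_per_order, num_orders

# Example with n = 24: compare m = 24 (a single two-bucket order) with m = 2 (the pairwise scheme).
print(scheme_profile(24, 24))   # (8191, 2704156)
print(scheme_profile(24, 2))    # (531441, 4096)
```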

94 5 We tested our implementation varying the number of nodes n, bucket order sizes m, and the number of parallel bucket orders p. [sent-634, score-1.023]

95 First we estimated the smallest bucket order size m that yields a memory requirement of 16 GB or less. [sent-652, score-0.474]

96 We analyzed two specializations of the generalized bucket order scheme (⌈m/2⌉ ∗ ⌊m/2⌋)p : the practical scheme, where p = 1, and the pairwise scheme, where m = 2. [sent-677, score-0.483]

97 Compared to the naive two-bucket scheme, the more general bucket order schemes yield significantly more efficient exchange of time and space resources. [sent-754, score-0.513]

98 Columns: input size is the number of parent sets per node; regular accesses is the number of accesses to input when the parent set is an ideal; tail accesses is the number of accesses to input when the parent set is in the tail of an ideal; total time is the running time in hours. [sent-766, score-0.805]

99 Namely, the number of potential parent sets, particularly after pruning, is typically relatively small compared to the amount of memory available—the bottleneck regarding memory consumption is the number of ideals of the partial orders. [sent-786, score-0.699]

100 In such a case, one needs to run the presented DP algorithm on just that partial order, thereby using much less time and space than the basic DP algorithm, which essentially cannot exploit given precedence constraints (besides excluding some parent sets for some nodes). [sent-823, score-0.465]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fv', 0.538), ('bucket', 0.36), ('ideals', 0.263), ('dag', 0.193), ('parviainen', 0.191), ('koivisto', 0.173), ('partial', 0.163), ('parent', 0.154), ('orders', 0.134), ('dags', 0.13), ('node', 0.122), ('onstraints', 0.116), ('recedence', 0.116), ('obn', 0.1), ('etworks', 0.1), ('scheme', 0.094), ('pos', 0.092), ('recurrence', 0.09), ('dp', 0.084), ('parallel', 0.083), ('ptimal', 0.077), ('dynamic', 0.076), ('bayesian', 0.075), ('composition', 0.075), ('scoring', 0.071), ('compatible', 0.071), ('parents', 0.071), ('precedence', 0.07), ('sing', 0.069), ('indegree', 0.066), ('av', 0.064), ('ground', 0.064), ('score', 0.059), ('malone', 0.058), ('nodes', 0.057), ('accesses', 0.057), ('campos', 0.057), ('downward', 0.053), ('tradeoff', 0.052), ('ty', 0.052), ('bns', 0.051), ('reorderings', 0.05), ('cover', 0.05), ('xy', 0.047), ('parallelization', 0.047), ('schemes', 0.046), ('subproblems', 0.044), ('programming', 0.044), ('requirement', 0.043), ('conquer', 0.043), ('processors', 0.042), ('time', 0.042), ('decomposable', 0.042), ('memory', 0.042), ('bodlaender', 0.042), ('salesman', 0.042), ('silander', 0.042), ('tamada', 0.042), ('phase', 0.039), ('requirements', 0.038), ('space', 0.036), ('ott', 0.036), ('potential', 0.035), ('pruned', 0.035), ('singleton', 0.035), ('cussens', 0.033), ('miyano', 0.033), ('uxy', 0.033), ('collection', 0.032), ('ideal', 0.032), ('arc', 0.032), ('running', 0.031), ('scores', 0.031), ('analytical', 0.031), ('bd', 0.03), ('array', 0.03), ('myllym', 0.03), ('aa', 0.03), ('heckerman', 0.03), ('concerning', 0.029), ('order', 0.029), ('hasse', 0.028), ('traveling', 0.028), ('pp', 0.028), ('bb', 0.028), ('reordering', 0.028), ('subsets', 0.028), ('closure', 0.027), ('solved', 0.027), ('networks', 0.027), ('lemma', 0.026), ('mp', 0.026), ('collections', 0.026), ('predecessors', 0.026), ('hours', 0.025), ('max', 0.025), ('etminani', 0.025), ('gurevich', 0.025), ('muxy', 0.025), ('rklund', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints

Author: Pekka Parviainen, Mikko Koivisto

Abstract: We consider the problem of finding a directed acyclic graph (DAG) that optimizes a decomposable Bayesian network score. While in a favorable case an optimal DAG can be found in polynomial time, in the worst case the fastest known algorithms rely on dynamic programming across the node subsets, taking time and space 2n , to within a factor polynomial in the number of nodes n. In practice, these algorithms are feasible to networks of at most around 30 nodes, mainly due to the large space requirement. Here, we generalize the dynamic programming approach to enhance its feasibility in three dimensions: first, the user may trade space against time; second, the proposed algorithms easily and efficiently parallelize onto thousands of processors; third, the algorithms can exploit any prior knowledge about the precedence relation on the nodes. Underlying all these results is the key observation that, given a partial order P on the nodes, an optimal DAG compatible with P can be found in time and space roughly proportional to the number of ideals of P , which can be significantly less than 2n . Considering sufficiently many carefully chosen partial orders guarantees that a globally optimal DAG will be found. Aside from the generic scheme, we present and analyze concrete tradeoff schemes based on parallel bucket orders. Keywords: exact algorithm, parallelization, partial order, space-time tradeoff, structure learning

2 0.13104956 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

Author: Nima Noorshams, Martin J. Wainwright

Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefficients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion

3 0.11511786 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models

Author: Naftali Harris, Mathias Drton

Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution

4 0.071742453 78 jmlr-2013-On the Learnability of Shuffle Ideals

Author: Dana Angluin, James Aspnes, Sarah Eisenstat, Aryeh Kontorovich

Abstract: PAC learning of unrestricted regular languages is long known to be a difficult problem. The class of shuffle ideals is a very restricted subclass of regular languages, where the shuffle ideal generated by a string u is the collection of all strings containing u as a subsequence. This fundamental language family is of theoretical interest in its own right and provides the building blocks for other important language families. Despite its apparent simplicity, the class of shuffle ideals appears quite difficult to learn. In particular, just as for unrestricted regular languages, the class is not properly PAC learnable in polynomial time if RP ≠ NP, and PAC learning the class improperly in polynomial time would imply polynomial time algorithms for certain fundamental problems in cryptography. In the positive direction, we give an efficient algorithm for properly learning shuffle ideals in the statistical query (and therefore also PAC) model under the uniform distribution. Keywords: PAC learning, statistical queries, regular languages, deterministic finite automata, shuffle ideals, subsequences

5 0.060322795 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion

Author: Rami Mahdi, Jason Mezey

Abstract: Constraint-based learning of Bayesian networks (BN) from limited data can lead to multiple testing problems when recovering dense areas of the skeleton and to conflicting results in the orientation of edges. In this paper, we present a new constraint-based algorithm, light mutual min (LMM) for improved accuracy of BN learning from small sample data. LMM improves the assessment of candidate edges by using a ranking criterion that considers conditional independence on neighboring variables at both sides of an edge simultaneously. The algorithm also employs an adaptive relaxation of constraints that, selectively, allows some nodes not to condition on some neighbors. This relaxation aims at reducing the incorrect rejection of true edges connecting high degree nodes due to multiple testing. LMM additionally incorporates a new criterion for ranking v-structures that is used to recover the completed partially directed acyclic graph (CPDAG) and to resolve conflicting v-structures, a common problem in small sample constraint-based learning. Using simulated data, each of these components of LMM is shown to significantly improve network inference compared to commonly applied methods when learning from limited data, including more accurate recovery of skeletons and CPDAGs compared to the PC, MaxMin, and MaxMin hill climbing algorithms. A proof of asymptotic correctness is also provided for LMM for recovering the correct skeleton and CPDAG. Keywords: Bayesian networks, skeleton, constraint-based learning, mutual min

6 0.054147005 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

7 0.051899377 95 jmlr-2013-Ranking Forests

8 0.051224697 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

9 0.045449924 82 jmlr-2013-Optimally Fuzzy Temporal Memory

10 0.039285634 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

11 0.036705703 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies

12 0.035194803 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

13 0.033174943 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

14 0.03305354 120 jmlr-2013-Variational Algorithms for Marginal MAP

15 0.03226842 86 jmlr-2013-Parallel Vector Field Embedding

16 0.031232435 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions

17 0.030671306 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

18 0.029881328 68 jmlr-2013-Machine Learning with Operational Costs

19 0.02818431 8 jmlr-2013-A Theory of Multiclass Boosting

20 0.028146364 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.165), (1, 0.013), (2, -0.035), (3, 0.038), (4, 0.08), (5, 0.198), (6, 0.088), (7, 0.042), (8, -0.143), (9, 0.223), (10, -0.09), (11, -0.018), (12, -0.016), (13, -0.077), (14, -0.087), (15, -0.153), (16, 0.119), (17, -0.094), (18, -0.004), (19, 0.007), (20, -0.055), (21, 0.06), (22, -0.102), (23, 0.02), (24, 0.025), (25, 0.047), (26, 0.03), (27, 0.06), (28, 0.258), (29, 0.016), (30, -0.003), (31, 0.092), (32, 0.117), (33, 0.034), (34, -0.014), (35, -0.036), (36, 0.064), (37, 0.043), (38, 0.047), (39, 0.135), (40, 0.037), (41, 0.042), (42, -0.015), (43, 0.172), (44, -0.078), (45, 0.188), (46, -0.013), (47, -0.054), (48, -0.046), (49, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95727336 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints

Author: Pekka Parviainen, Mikko Koivisto

Abstract: We consider the problem of finding a directed acyclic graph (DAG) that optimizes a decomposable Bayesian network score. While in a favorable case an optimal DAG can be found in polynomial time, in the worst case the fastest known algorithms rely on dynamic programming across the node subsets, taking time and space 2n , to within a factor polynomial in the number of nodes n. In practice, these algorithms are feasible to networks of at most around 30 nodes, mainly due to the large space requirement. Here, we generalize the dynamic programming approach to enhance its feasibility in three dimensions: first, the user may trade space against time; second, the proposed algorithms easily and efficiently parallelize onto thousands of processors; third, the algorithms can exploit any prior knowledge about the precedence relation on the nodes. Underlying all these results is the key observation that, given a partial order P on the nodes, an optimal DAG compatible with P can be found in time and space roughly proportional to the number of ideals of P , which can be significantly less than 2n . Considering sufficiently many carefully chosen partial orders guarantees that a globally optimal DAG will be found. Aside from the generic scheme, we present and analyze concrete tradeoff schemes based on parallel bucket orders. Keywords: exact algorithm, parallelization, partial order, space-time tradeoff, structure learning

2 0.55882746 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

Author: Nima Noorshams, Martin J. Wainwright

Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefficients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion

3 0.54337877 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion

Author: Rami Mahdi, Jason Mezey

Abstract: Constraint-based learning of Bayesian networks (BN) from limited data can lead to multiple testing problems when recovering dense areas of the skeleton and to conflicting results in the orientation of edges. In this paper, we present a new constraint-based algorithm, light mutual min (LMM) for improved accuracy of BN learning from small sample data. LMM improves the assessment of candidate edges by using a ranking criterion that considers conditional independence on neighboring variables at both sides of an edge simultaneously. The algorithm also employs an adaptive relaxation of constraints that, selectively, allows some nodes not to condition on some neighbors. This relaxation aims at reducing the incorrect rejection of true edges connecting high degree nodes due to multiple testing. LMM additionally incorporates a new criterion for ranking v-structures that is used to recover the completed partially directed acyclic graph (CPDAG) and to resolve conflicting v-structures, a common problem in small sample constraint-based learning. Using simulated data, each of these components of LMM is shown to significantly improve network inference compared to commonly applied methods when learning from limited data, including more accurate recovery of skeletons and CPDAGs compared to the PC, MaxMin, and MaxMin hill climbing algorithms. A proof of asymptotic correctness is also provided for LMM for recovering the correct skeleton and CPDAG. Keywords: Bayesian networks, skeleton, constraint-based learning, mutual min

4 0.48596537 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models

Author: Naftali Harris, Mathias Drton

Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution

5 0.40287685 82 jmlr-2013-Optimally Fuzzy Temporal Memory

Author: Karthik H. Shankar, Marc W. Howard

Abstract: Any learner with the ability to predict the future of a structured time-varying signal must maintain a memory of the recent past. If the signal has a characteristic timescale relevant to future prediction, the memory can be a simple shift register—a moving window extending into the past, requiring storage resources that linearly grows with the timescale to be represented. However, an independent general purpose learner cannot a priori know the characteristic prediction-relevant timescale of the signal. Moreover, many naturally occurring signals show scale-free long range correlations implying that the natural prediction-relevant timescale is essentially unbounded. Hence the learner should maintain information from the longest possible timescale allowed by resource availability. Here we construct a fuzzy memory system that optimally sacrifices the temporal accuracy of information in a scale-free fashion in order to represent prediction-relevant information from exponentially long timescales. Using several illustrative examples, we demonstrate the advantage of the fuzzy memory system over a shift register in time series forecasting of natural signals. When the available storage resources are limited, we suggest that a general purpose learner would be better off committing to such a fuzzy memory system. Keywords: temporal information compression, forecasting long range correlated time series

6 0.35530287 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

7 0.29218206 33 jmlr-2013-Dimension Independent Similarity Computation

8 0.27750453 95 jmlr-2013-Ranking Forests

9 0.25966057 78 jmlr-2013-On the Learnability of Shuffle Ideals

10 0.23245414 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

11 0.23136155 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

12 0.21914947 106 jmlr-2013-Stationary-Sparse Causality Network Learning

13 0.21566825 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

14 0.20908178 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries

15 0.184395 15 jmlr-2013-Bayesian Canonical Correlation Analysis

16 0.18109186 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models

17 0.17929202 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction

18 0.17444715 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

19 0.17221645 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

20 0.16537581 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.516), (5, 0.123), (6, 0.049), (10, 0.038), (20, 0.013), (23, 0.028), (41, 0.012), (68, 0.026), (70, 0.022), (75, 0.034), (85, 0.014), (87, 0.014), (89, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89243478 90 jmlr-2013-Quasi-Newton Method: A New Direction

Author: Philipp Hennig, Martin Kiefel

Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes

same-paper 2 0.88700455 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints

Author: Pekka Parviainen, Mikko Koivisto

Abstract: We consider the problem of finding a directed acyclic graph (DAG) that optimizes a decomposable Bayesian network score. While in a favorable case an optimal DAG can be found in polynomial time, in the worst case the fastest known algorithms rely on dynamic programming across the node subsets, taking time and space 2n , to within a factor polynomial in the number of nodes n. In practice, these algorithms are feasible to networks of at most around 30 nodes, mainly due to the large space requirement. Here, we generalize the dynamic programming approach to enhance its feasibility in three dimensions: first, the user may trade space against time; second, the proposed algorithms easily and efficiently parallelize onto thousands of processors; third, the algorithms can exploit any prior knowledge about the precedence relation on the nodes. Underlying all these results is the key observation that, given a partial order P on the nodes, an optimal DAG compatible with P can be found in time and space roughly proportional to the number of ideals of P , which can be significantly less than 2n . Considering sufficiently many carefully chosen partial orders guarantees that a globally optimal DAG will be found. Aside from the generic scheme, we present and analyze concrete tradeoff schemes based on parallel bucket orders. Keywords: exact algorithm, parallelization, partial order, space-time tradeoff, structure learning

3 0.4587914 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

Author: Vinod K. Valsalam, Risto Miikkulainen

Abstract: Sorting networks are an interesting class of parallel sorting algorithms with applications in multiprocessor computers and switching networks. They are built by cascading a series of comparisonexchange units called comparators. Minimizing the number of comparators for a given number of inputs is a challenging optimization problem. This paper presents a two-pronged approach called Symmetry and Evolution based Network Sort Optimization (SENSO) that makes it possible to scale the solutions to networks with a larger number of inputs than previously possible. First, it uses the symmetry of the problem to decompose the minimization goal into subgoals that are easier to solve. Second, it minimizes the resulting greedy solutions further by using an evolutionary algorithm to learn the statistical distribution of comparators in minimal networks. The final solutions improve upon half-century of results published in patents, books, and peer-reviewed literature, demonstrating the potential of the SENSO approach for solving difficult combinatorial problems. Keywords: symmetry, evolution, estimation of distribution algorithms, sorting networks, combinatorial optimization

4 0.41352415 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont

Abstract: We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy. Keywords: group feature selection, generalized spike-and-slab priors, expectation propagation, sparse linear model, approximate inference, sequential experimental design, signal reconstruction

5 0.39865488 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion

Author: Rami Mahdi, Jason Mezey

Abstract: Constraint-based learning of Bayesian networks (BN) from limited data can lead to multiple testing problems when recovering dense areas of the skeleton and to conflicting results in the orientation of edges. In this paper, we present a new constraint-based algorithm, light mutual min (LMM) for improved accuracy of BN learning from small sample data. LMM improves the assessment of candidate edges by using a ranking criterion that considers conditional independence on neighboring variables at both sides of an edge simultaneously. The algorithm also employs an adaptive relaxation of constraints that, selectively, allows some nodes not to condition on some neighbors. This relaxation aims at reducing the incorrect rejection of true edges connecting high degree nodes due to multiple testing. LMM additionally incorporates a new criterion for ranking v-structures that is used to recover the completed partially directed acyclic graph (CPDAG) and to resolve conflicting v-structures, a common problem in small sample constraint-based learning. Using simulated data, each of these components of LMM is shown to significantly improve network inference compared to commonly applied methods when learning from limited data, including more accurate recovery of skeletons and CPDAGs compared to the PC, MaxMin, and MaxMin hill climbing algorithms. A proof of asymptotic correctness is also provided for LMM for recovering the correct skeleton and CPDAG. Keywords: Bayesian networks, skeleton, constraint-based learning, mutual min

6 0.38862067 120 jmlr-2013-Variational Algorithms for Marginal MAP

7 0.3886036 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

8 0.38363802 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

9 0.38256708 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

10 0.37960941 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

11 0.37609935 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

12 0.37394333 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions

13 0.37125027 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

14 0.3695443 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

15 0.3678776 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

16 0.36265418 68 jmlr-2013-Machine Learning with Operational Costs

17 0.36231375 66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

18 0.36169752 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

19 0.35860249 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

20 0.3581616 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs