jmlr jmlr2011 jmlr2011-30 knowledge-graph by maker-knowledge-mining

30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints


Source: pdf

Author: Cassio P. de Campos, Qiang Ji

Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-and-bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 (2008) use structural constraints (creating an undirected super-structure from which the undirected subjacent graph of the optimal structure must be a subgraph) to reduce the search space, showing that this direction is promising when one wants to learn structures from large data sets. [sent-34, score-0.258]

2 Section 2 describes the notation and introduces Bayesian networks and the structure learning problem based on score functions. [sent-67, score-0.248]

3 Furthermore, $r_{\Pi_i} = |\Omega_{\Pi_i}| = \prod_{X_t \in \Pi_i} r_t$ is the number of possible instantiations of the parent set $\Pi_i$ of $X_i$, and $\theta = (\theta_{ijk})_{\forall ijk}$ is the entire vector of parameters such that the elements are $\theta_{ijk} = p(x_{ik} \mid \pi_{ij})$, with $i \in \{1, \ldots, n\}$. [sent-88, score-1.654]

4 Note that the values $(n_{ijk})_{\forall ijk}$ depend on the graph $G$ (more specifically, they depend on the parent set $\Pi_i$ of each $X_i$), so a more precise notation would be to use $n_{ijk}^{\Pi_i}$ instead of $n_{ijk}$. [sent-112, score-2.001]

5 We avoid this heavy notation for simplicity unless necessary in the context. [sent-113, score-0.483]

6 Moreover, we know that $\theta^* = (\theta^*_{ijk})_{\forall ijk} = \left(\frac{n_{ijk}}{n_{ij}}\right)_{\forall ijk} = \arg\max_{\theta} L_{G,D}(\theta)$, with $n_{ij} = \sum_k n_{ijk}$. [sent-114, score-2.13]
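
As a concrete illustration of these counts and maximum-likelihood estimates, the sketch below computes $n_{ijk}$, $n_{ij}$ and $\theta^*_{ijk} = n_{ijk}/n_{ij}$ from a complete data set. It is only a reading of the definitions above; the data layout (rows of integer state indices) and the function name are illustrative, not taken from the paper.

```python
from collections import defaultdict

def ml_estimates(data, i, parents):
    """Counts n_ijk and maximum-likelihood estimates theta*_ijk = n_ijk / n_ij
    for variable X_i under the parent set `parents` (column indices)."""
    n_ijk = defaultdict(int)                 # (j, k) -> count, j = parent instantiation
    for row in data:
        j = tuple(row[p] for p in parents)   # instantiation of the parent set Pi_i
        k = row[i]                           # observed state of X_i
        n_ijk[(j, k)] += 1
    n_ij = defaultdict(int)                  # j -> n_ij = sum_k n_ijk
    for (j, k), n in n_ijk.items():
        n_ij[j] += n
    return {(j, k): n / n_ij[j] for (j, k), n in n_ijk.items()}

# Example: theta* for X_2 given parents {X_0, X_1} on a tiny data set.
data = [(0, 1, 0), (0, 1, 1), (1, 0, 1), (0, 1, 1)]
print(ml_estimates(data, 2, (0, 1)))
```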

7 Hyper-parameters $(\alpha_{ijk})_{\forall ijk}$ also depend on the graph $G$, and we indicate it by $\alpha_{ijk}^{\Pi_i}$ if necessary in the context. [sent-117, score-1.015]

8 We assume that there is no preference for any graph, so $p(G)$ is uniform and vanishes in the computations. [sent-119, score-0.483]

9 Under the assumptions, it has been shown (Cooper and Herskovits, 1992) that for multinomial distributions, $s(G) = \log \prod_{i=1}^{n} \prod_{j=1}^{r_{\Pi_i}} \left( \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right)$. [sent-120, score-1.906]

10 The BDe score (Heckerman et al. [sent-121, score-0.508]

11 , 1995) assumes that $\alpha_{ijk} = \alpha^* \cdot p(\theta_{ijk} \mid G)$, where $\alpha^*$ is the hyper-parameter known as the Equivalent Sample Size (ESS), and $p(\theta_{ijk} \mid G)$ is the prior probability for 1. [sent-122, score-1.449]

12 If $n_{ij} = 0$, then $n_{ijk} = 0$ and we assume the fraction $\frac{n_{ijk}}{n_{ij}}$ to be equal to one. [sent-125, score-1.164]

13 The BDeu score (Buntine, 1991; Cooper and Herskovits, 1992) assumes further that local priors are such that $\alpha_{ijk}$ becomes $\frac{\alpha^*}{r_{\Pi_i} r_i}$ and $\alpha^*$ is the only free hyper-parameter. [sent-127, score-0.766]

14 And similarly, BD: $s_i(\Pi_i) = \sum_{j=1}^{r_{\Pi_i}} \left( \log \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} + \sum_{k=1}^{r_i} \log \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right)$ (2). [sent-129, score-2.354]

15 In the case of BIC and AIC, Equation (1) is used to compute the global score of a graph using the local scores at each node, while Equation (2) is employed for BD, BDe and BDeu, using the respective hyper-parameters $\alpha$. [sent-130, score-0.311]
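
The local score of Equation (2) is straightforward to evaluate with log-gamma functions once the counts are available. The sketch below computes the local BDeu score with hyper-parameters $\alpha_{ijk} = \alpha^*/(r_{\Pi_i} r_i)$ as in sentence 13; the data layout and the convention $\alpha_{ij} = \sum_k \alpha_{ijk}$ are assumptions stated here, not taken from the extracted text.

```python
import math
from collections import defaultdict

def local_bdeu_score(data, i, parents, r, ess=1.0):
    """Local score s_i(Pi_i) of Equation (2) with BDeu priors:
    alpha_ijk = ess / (r_Pi * r_i) and alpha_ij = ess / r_Pi (sum over k)."""
    r_pi = 1
    for p in parents:
        r_pi *= r[p]                          # r_Pi_i: number of parent instantiations
    a_ijk = ess / (r_pi * r[i])
    a_ij = ess / r_pi
    counts = defaultdict(lambda: defaultdict(int))   # j -> k -> n_ijk
    for row in data:
        counts[tuple(row[p] for p in parents)][row[i]] += 1
    score = 0.0
    for ks in counts.values():                # only j with n_ij != 0 contribute
        n_ij = sum(ks.values())
        score += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij)
        for n in ks.values():                 # only k with n_ijk != 0 contribute
            score += math.lgamma(a_ijk + n) - math.lgamma(a_ijk)
    return score
```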

16 We work with structural constraints that specify where arcs may or may not be included. [sent-133, score-0.269]

17 Learning Naive and TAN structures: For example, the constraints $\forall_{i \neq c,\, j \neq c}\; \neg\mathrm{arc}(X_i, X_j)$ and $\mathrm{indegree}(X_c, 0, \mathrm{eq})$ impose that only arcs from node $X_c$ to the others are possible, and that $X_c$ is a root node, that is, a Naive Bayes structure will be learned. [sent-151, score-0.306]

18 As another example, the constraints $\forall_{j \neq c}\; \mathrm{indegree}(X_j, 3, \mathrm{lt})$, $\mathrm{indegree}(X_c, 0, \mathrm{eq})$, and $\forall_{j \neq c}\; \mathrm{indegree}(X_j, 0, \mathrm{eq}) \vee \mathrm{arc}(X_c, X_j)$ ensure that all nodes have $X_c$ as a parent, or no parent at all. [sent-154, score-0.286]
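
A minimal way to read such constraints in code is as a predicate over candidate parent sets. The sketch below encodes the Naive Bayes constraints of sentence 17; the function name and the data representation are illustrative.

```python
def naive_bayes_constraints_ok(child, parents, c):
    """Constraints of sentence 17: indegree(X_c, 0, eq) makes X_c a root, and
    arc(X_i, X_j) is forbidden whenever both i != c and j != c, so the only
    admissible arcs go from the class node X_c to the other nodes."""
    if child == c:
        return len(parents) == 0
    return all(p == c for p in parents)

# Only parent sets that pass such checks are stored in the cache (sentence 68).
```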

19 If $X_i^t \in \mathbf{X}^t$ are the variables at time $t$, a DBN may have arcs between nodes $X_i^t$ of the same time $t$ and arcs from nodes $X_i^{t-1}$ (previous time) to nodes $X_i^t$ of time $t$. [sent-179, score-0.487]

20 With complete data, learning parameters of DBNs is similar to learning parameters of Bayesian networks, but we deal with counts $n_{ijk}$ for both $B_0$ and $B_{\rightarrow}$. [sent-200, score-0.851]

21 This holds because of the decomposability of the score function among nodes, so that the scores of the nodes $X_1, \ldots$ [sent-235, score-0.313]

22 , $X_n$ (which are arcs correlating variables of the same time slice) as well as arcs from the previous slice to the nodes $X_1', \ldots$ [sent-247, score-0.422]

23 Because of decomposability, we can avoid computing such functions several times by creating a cache that contains $s_i(\Pi_i)$ for each $X_i$ and each parent set $\Pi_i$. [sent-257, score-0.454]
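
A sketch of that cache, assuming a local score function like the one above and a bound on the number of parents; both the bound and the brute-force enumeration are simplifications of what the paper does with its pruning properties.

```python
from itertools import combinations

def build_cache(data, r, local_score, max_parents, constraint_ok=lambda i, p: True):
    """Cache C(X_i, Pi_i) -> s_i(Pi_i) for every variable and every admissible
    parent set, as described in sentences 23 and 66-68."""
    n = len(r)
    cache = {}
    for i in range(n):
        others = [v for v in range(n) if v != i]
        for size in range(max_parents + 1):
            for parents in combinations(others, size):
                if constraint_ok(i, parents):
                    cache[(i, parents)] = local_score(data, i, parents, r)
    return cache
```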

24 First, Lemma 1 is quite simple but very useful to discard elements from the cache of each node Xi . [sent-264, score-0.342]

25 Lemma 1 Let $X_i$ be a node of $G'$, a candidate DAG for a Bayesian network where the parent set of $X_i$ is $\Pi_i'$. [sent-268, score-0.294]

26 Any DAG $G'$ with parent set $\Pi_i'$ for $X_i$ has a subgraph $G$ with a better score than that of $G'$, and thus $\Pi_i'$ is not the optimal parent configuration for $X_i$ in $G^*$. [sent-276, score-0.505]

27 Unfortunately Lemma 1 does not tell us anything about supersets of $\Pi_i'$, that is, we still need to compute scores for all the possible parent sets and later verify which of them can be removed. [sent-277, score-0.301]

28 BIC and AIC Score Properties: The next theorems handle the issue of having to compute scores for all possible parent sets when one is using the BIC or AIC criteria. [sent-282, score-0.246]

29 Theorem 2 Using BIC or AIC as the score function, suppose that $X_i$, $\Pi_i$ are such that $r_{\Pi_i} > \frac{N \log r_i}{r_i - 1}$. If $\Pi_i'$ is a proper superset of $\Pi_i$, then $\Pi_i'$ is not the parent set of $X_i$ in an optimal structure. [sent-284, score-0.36]

30 Hence $s_i(\Pi_i') < s_i(\Pi_i)$, and Lemma 1 guarantees that $\Pi_i'$ cannot be the parent set of $X_i$ in an optimal structure. [sent-294, score-0.306]

31 Because every variable has at least two states, we know that $r_{\Pi_i} \geq 2^{|\Pi_i|} \geq N > \frac{N \log r_i}{r_i - 1}$, and by Theorem 2 we know that no proper superset of $\Pi_i$ can be an optimal parent set. [sent-298, score-0.383]
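
One practical consequence of this corollary is a cap on the parent-set sizes that ever need to be scored under BIC/AIC. The sketch below only encodes that cap; the theorem's sharper condition (involving $N \log r_i/(r_i - 1)$, as reconstructed above) can prune even more.

```python
import math

def max_parent_set_size(N):
    """Once 2^{|Pi_i|} >= N, the corollary in sentence 31 says no proper superset
    of Pi_i can be optimal under BIC/AIC, so parent sets larger than
    ceil(log2 N) never need to be scored."""
    return max(0, math.ceil(math.log2(N)))

print(max_parent_set_size(1000))   # -> 10: at most ten parents for N = 1000 samples
```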

32 Theorem 4 Let BIC or AIC be the score criterion and let $X_i$ be a node with $\Pi_i \subset \Pi_i'$ two possible parent sets such that $t_i(\Pi_i') + s_i(\Pi_i) > 0$. [sent-301, score-0.514]

33 Proof We have that $t_i(\Pi_i') + s_i(\Pi_i) > 0 \Rightarrow -t_i(\Pi_i') - s_i(\Pi_i) < 0$, and because $L_i(\cdot)$ is a negative function, it implies $(L_i(\Pi_i') - t_i(\Pi_i')) - s_i(\Pi_i) < 0 \Rightarrow s_i(\Pi_i') < s_i(\Pi_i)$. [sent-303, score-0.445]

34 The idea is to verify the assumptions of Theorem 4 every time the score of a parent set $\Pi_i$ of $X_i$ is about to be computed, by taking the best score of any subset and testing it against the theorem. [sent-307, score-0.482]

35 Only subsets that have been checked against the structural constraints can be used, that is, a subset with a high score but that violates constraints cannot be used as the “certificate” to discard its supersets (in fact, it is not a valid parent set in the first place). [sent-308, score-0.589]
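
A sketch of the Theorem 4 test, assuming the standard BIC penalty $t_i(\Pi_i) = \frac{\log N}{2}(r_i - 1)\, r_{\Pi_i}$ (use $(r_i - 1)\, r_{\Pi_i}$ for AIC); the penalty form is a standard assumption here rather than something stated in the extracted sentences.

```python
import math

def penalty_bic(parents, i, r, N):
    """t_i(Pi) for BIC: (log N / 2) * (r_i - 1) * r_Pi."""
    r_pi = 1
    for p in parents:
        r_pi *= r[p]
    return 0.5 * math.log(N) * (r[i] - 1) * r_pi

def superset_can_be_skipped(superset, best_valid_subset_score, i, r, N):
    """Theorem 4 test (sentences 32-35): if t_i(Pi') + s_i(Pi) > 0 for some valid
    subset Pi with score s_i(Pi), then s_i(Pi') < s_i(Pi), so Pi' cannot be an
    optimal parent set and its score never has to be computed."""
    return penalty_bic(superset, i, r, N) + best_valid_subset_score > 0
```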

36 BD Score Properties: First note that the BD scores can be rewritten as $s_i(\Pi_i) = \sum_{j \in J_i} \left( \log \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} + \sum_{k \in K_{ij}} \log \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right)$. [sent-315, score-2.294]

37 where $J_i = J_i^{\Pi_i} = \{1 \leq j \leq r_{\Pi_i} : n_{ij} \neq 0\}$, because $n_{ij} = 0$ implies that all terms cancel each other. [sent-317, score-0.654]

38 In the same manner, $n_{ijk} = 0$ implies that the terms of the internal summation cancel out, so let [sent-318, score-0.81]

39 $K_{ij} = K_{ij}^{\Pi_i} = \{1 \leq k \leq r_i : n_{ijk} \neq 0\}$ be the indices of the categories of $X_i$ such that $n_{ijk} \neq 0$. [sent-320, score-1.75]

40 The counts $n_{ijk}$ (and consequently $n_{ij} = \sum_k n_{ijk}$) are completely defined if we know the parent set $\Pi_i$. [sent-323, score-2.164]

41 Rewrite the score as follows: $s_i(\Pi_i) = \sum_{j \in J_i} \left[ f(K_{ij}, (\alpha_{ijk})_{\forall k}) + g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \right]$, with $f(K_{ij}, (\alpha_{ijk})_{\forall k}) = \log \Gamma(\alpha_{ij}) - \sum_{k \in K_{ij}} \log \Gamma(\alpha_{ijk})$ and $g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) = -\log \Gamma(\alpha_{ij} + n_{ij}) + \sum_{k \in K_{ij}} \log \Gamma(\alpha_{ijk} + n_{ijk})$. [sent-324, score-5.331]

42 We do not need $K_{ij}$ as an argument of $g(\cdot)$ because the set of non-zero $n_{ijk}$ is known from the counts $(n_{ijk})_{\forall k}$ that are already available as arguments of $g(\cdot)$. [sent-325, score-1.334]
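
The two terms are easy to evaluate with log-gamma functions. The sketch below assumes, as is standard for BD scores, that $\alpha_{ij} = \sum_k \alpha_{ijk}$; here `alpha` and `n` map each state index $k$ to $\alpha_{ijk}$ and $n_{ijk}$.

```python
import math

def f_term(K_ij, alpha):
    """f(K_ij, (alpha_ijk)_k) = log Gamma(alpha_ij) - sum_{k in K_ij} log Gamma(alpha_ijk)."""
    a_ij = sum(alpha.values())
    return math.lgamma(a_ij) - sum(math.lgamma(alpha[k]) for k in K_ij)

def g_term(n, alpha):
    """g((n_ijk)_k, (alpha_ijk)_k) = -log Gamma(alpha_ij + n_ij)
                                     + sum_{k: n_ijk != 0} log Gamma(alpha_ijk + n_ijk)."""
    a_ij = sum(alpha.values())
    n_ij = sum(n.values())
    out = -math.lgamma(a_ij + n_ij)
    for k, n_ijk in n.items():
        if n_ijk != 0:
            out += math.lgamma(alpha[k] + n_ijk)
    return out
```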

43 Lemma 5 Let $\Pi_i$ be the parent set of $X_i$, $(\alpha_{ijk})_{\forall ijk} > 0$ be the hyper-parameters, and integers $(n_{ijk})_{\forall ijk} \geq 0$ be counts obtained from data. [sent-327, score-2.149]

44 We have that $g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \leq -\log \Gamma(v) \approx 0.1214$ if $n_{ij} \geq 1$. [sent-328, score-0.994]

45 Here $v = \arg\max_{x>0} (-\log \Gamma(x)) \approx 1.4616$. [sent-329, score-0.355]
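
A quick numerical check of these two constants (illustrative only; a fine grid over a bracketing interval is enough, because $-\log\Gamma$ has a single maximum on $(0, \infty)$):

```python
import math

# -log Gamma(x) attains its maximum near x = 1.4616, with value about 0.121.
xs = [0.5 + i / 10000.0 for i in range(30000)]   # grid over (0.5, 3.5)
v = max(xs, key=lambda x: -math.lgamma(x))
print(v, -math.lgamma(v))                        # about 1.4616 and 0.121
```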

46 Furthermore, $g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \leq -\log \alpha_{ij} + \log \alpha_{ijk} - f(K_{ij}, (\alpha_{ijk})_{\forall k})$ if $|K_{ij}| = 1$. [sent-331, score-1.988]

47 Thus $g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) = -\log \frac{\Gamma(\alpha_{ij} + n_{ij})}{\prod_{k \in K_{ij}} \Gamma(\alpha_{ijk} + n_{ijk})} \leq -\log \Gamma(1 + \sum_{k \notin K_{ij}} \alpha_{ijk})$. [sent-336, score-1.832]

48 Because $v = \arg\max_{x>0} (-\log \Gamma(x))$, we have $-\log \Gamma(1 + \sum_{k \notin K_{ij}} \alpha_{ijk}) \leq -\log \Gamma(v)$. [sent-337, score-1.86]

49 Lemma 6 Let $\Pi_i$ be the parent set of $X_i$, $(\alpha_{ijk})_{\forall ijk} > 0$ be the hyper-parameters, and integers $(n_{ijk})_{\forall ijk} \geq 0$ be counts obtained from data. [sent-341, score-2.149]

50 We have that $g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) \leq 0$ if $n_{ij} \geq 2$. [sent-342, score-1.293]

51 Proof If $n_{ij} \geq 2$, we use the relation $\Gamma(x + \sum_k a_k) \geq \Gamma(x + 2) \prod_k \Gamma(a_k)$, for $x \geq 0$, $\forall k\; a_k \geq 1$ and $\sum_k a_k \geq 2$. [sent-343, score-0.474]

52 Finally, $g((n_{ijk})_{\forall k}, (\alpha_{ijk})_{\forall k}) = -\log \frac{\Gamma(\alpha_{ij} + n_{ij})}{\prod_{k \in K_{ij}} \Gamma(\alpha_{ijk} + n_{ijk})} \leq -\log \Gamma(2 + \sum_{k \notin K_{ij}} \alpha_{ijk}) \leq 0$, because $\Gamma(2 + \sum_{k \notin K_{ij}} \alpha_{ijk}) \geq 1$. [sent-346, score-3.608]

53 Thus, we conclude that the parent set $\Pi_i^0$ has a better score than $\Pi_i$, and the desired result follows from Lemma 1. [sent-349, score-0.329]

54 Lemma 8 Given the BDeu score, $(\alpha_{ijk})_{\forall ijk} > 0$, and integers $(n_{ijk})_{\forall ijk} \geq 0$ such that $\alpha_{ij} \leq 0.8349$ [sent-350, score-1.932]

55 and $|K_{ij}| \geq 2$ for a given $j$, then $f(K_{ij}, (\alpha_{ijk})_{\forall k}) \leq -|K_{ij}| \cdot \log r_i$. [sent-351, score-0.641]

56 Since $\alpha_{ijk} = \frac{\alpha_{ij}}{r_i}$ (for all $k$), we have $f(K_{ij}, (\alpha_{ijk})_{\forall k}) = \log \Gamma(\alpha_{ij}) - |K_{ij}| \log \Gamma(\tfrac{\alpha_{ij}}{r_i}) = \log \Gamma(\alpha_{ij}) - |K_{ij}| \log \Gamma(\tfrac{\alpha_{ij}}{r_i} + 1) + |K_{ij}| \log \tfrac{\alpha_{ij}}{r_i} = \log \Gamma(\alpha_{ij}) - |K_{ij}| \log \frac{\Gamma(\tfrac{\alpha_{ij}}{r_i} + 1)}{\alpha_{ij}} - |K_{ij}| \log r_i = |K_{ij}| \log \frac{\Gamma(\alpha_{ij})^{1/|K_{ij}|}\, \alpha_{ij}}{\Gamma(\tfrac{\alpha_{ij}}{r_i} + 1)} - |K_{ij}| \log r_i$. [sent-353, score-1.543]

57 0.8349 is a bound for $\alpha_{ij}$ that ensures that this last inequality holds when $r_i = |K_{ij}| = 2$, which is the worst-case scenario (greater values of $r_i$ and $|K_{ij}|$ make the left-hand side decrease and the right-hand side increase). [sent-357, score-0.26]

58 Theorem 9 Given the BDeu score and two parent sets $\Pi_i^0$ and $\Pi_i$ for a node $X_i$ such that $\Pi_i^0 \subset \Pi_i$ and $\alpha_{ij}^{\Pi_i} \leq 0.8349$ for every $j$, [sent-359, score-0.389]

59 if $s_i(\Pi_i^0) > -|K_i^{\Pi_i}| \log r_i$ then neither $\Pi_i$ nor any superset $\Pi_i' \supset \Pi_i$ are optimal parent sets for $X_i$. [sent-360, score-0.43]

60 Proof We have that $s_i(\Pi_i^0) > -|K_i^{\Pi_i}| \log r_i = \sum_{j \in J_i^{\Pi_i} :\, |K_{ij}^{\Pi_i}| \geq 2} -|K_{ij}^{\Pi_i}| \log r_i + \sum_{j \in J_i^{\Pi_i} :\, |K_{ij}^{\Pi_i}| = 1} -\log r_i$, which by Lemma 8 is greater than or equal to $\sum_{j \in J_i^{\Pi_i} :\, |K_{ij}^{\Pi_i}| \geq 2} f(K_{ij}^{\Pi_i}, (\alpha_{ijk}^{\Pi_i})_{\forall k}) + \sum_{j \in J_i^{\Pi_i} :\, |K_{ij}^{\Pi_i}| = 1} -\log r_i$. [sent-361, score-1.18]

61 Now, Lemma 7 suffices to show that $\Pi_i$ is not an optimal parent set, because $-\log r_i = \log \frac{\alpha_{ijk}^{\Pi_i}}{\alpha_{ij}^{\Pi_i}}$ for any $k$. [sent-362, score-0.845]

62 As we see later in the experiments, the practical size of the cache after the application of the properties is small even for considerably large networks, and both Lemma 1 and Theorem 9 help reduce the cache size, while Theorem 9 also helps to reduce computations. [sent-372, score-0.426]

63 In such a case, it is a feasible structure for the network and we compare its score against the best score so far (which is updated if needed). [sent-378, score-0.404]

64 Otherwise, there must be a directed cycle in the graph, which is then broken into subcases by forcing some arcs to be absent/present. [sent-379, score-0.287]

65 This is the same idea as the inclusion-exclusion principle of combinatorics, employed over the set of arcs that formed the cycle; it ensures that we never process the same graph twice and also that all subgraphs are covered. [sent-384, score-0.254]

66 The initialization of the algorithm is as follows: • C : (Xi , Πi ) → R is the cache with the scores for all the variables and their possible parent configurations. [sent-385, score-0.459]

67 This is constructed using a queue and analyzing parent sets according to the properties of Section 4, which saves (in practice) a large amount of space and time. [sent-386, score-0.308]

68 All the structural constraints are considered in this construction so that only valid parent sets are stored. [sent-387, score-0.29]

69 • G is the graph created by taking the best parent configuration for each node without checking for acyclicity (so it is not necessarily a DAG), and s is the score of G . [sent-388, score-0.469]
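
A sketch of that initialization, reading the best parent set for each node directly from a cache keyed as in the earlier sketches (names are illustrative):

```python
def best_relaxed_graph(cache, n):
    """Initial (G, s) of sentence 69: each node takes its best-scoring cached parent
    set, acyclicity is not checked, so G may contain cycles; s upper-bounds the
    score of any DAG consistent with the cache."""
    G, s = {}, 0.0
    for i in range(n):
        parents, score = max(((p, sc) for (v, p), sc in cache.items() if v == i),
                             key=lambda t: t[1])
        G[i] = parents
        s += score
    return G, s
```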

20 The order is such that the top of the queue always contains the triple of greatest $s$, while the bottom has the triple of smallest $s$. [sent-394, score-0.293]

21 bottom is a user parameter that controls how frequently elements are picked from the bottom of the queue instead of the usual removal from the top. [sent-401, score-0.251]
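
A minimal sketch of such a queue; a sorted list keeps it short (a real implementation would use a double-ended priority queue), and taking the bottom element with probability `bottom` is one simple way to realize the alternation described above.

```python
import random

class TwoEndedQueue:
    """Priority queue of (s, triple) pairs: pop() usually returns the element of
    greatest s (the top), and with probability `bottom` the element of smallest s
    (the bottom), as described in sentences 70-71."""
    def __init__(self, bottom=0.1):
        self.items, self.bottom = [], bottom
    def push(self, s, triple):
        self.items.append((s, triple))
        self.items.sort(key=lambda t: t[0])          # ascending in s
    def pop(self):
        if random.random() < self.bottom:
            return self.items.pop(0)                 # bottom: smallest s
        return self.items.pop()                      # top: greatest s
```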

72 If Gcur is a DAG, then update (Gbest , sbest ) with (Gcur , scur ), discard the current element and start the loop again (if Gcur came from the top of Q, then the algorithm stops—no other graph in the queue can be better than Gcur ). [sent-409, score-0.381]

23 (b) Recompute $(G, s)$ from $(G_{cur}, s_{cur})$ such that the new parent set of $X_{a_{y+1}}$ in $G$ complies with this new $H_{cur}$. [sent-421, score-0.258]

24 This is done by searching in the cache $C(X_{a_{y+1}}, \Pi_{a_{y+1}})$ for the best parent set. [sent-422, score-0.389]

75 If there is a parent set in the cache that satisfies Hcur , then – Include the triple (G , Hcur , s) into Q. [sent-423, score-0.447]

76 Note that as Gcur is the graph with the greatest score that respects Hcur , any other graph within the subspace defined by Hcur will have worse score. [sent-438, score-0.251]

27 If $G_{cur}$ has a score greater than $s_{best}$, then the graph $G_{cur}$ is checked for cycles, as it may or may not be acyclic (all we know is that $G_{cur}$ is a relaxed solution within the subspace $H_{cur}$). [sent-446, score-0.248]

78 Moreover, if the acyclic Gcur was extracted from the top of Q, then the algorithm may stop, as all the other elements in the queue have lower score (this is guaranteed by the priority of the queue). [sent-448, score-0.346]

79 Two characteristics must be kept by the branch step: (i) Hcur must be fully represented in the subcases (so we do not miss any graph), and (ii) the subcases must be disjoint (so we do not process the same graph more than once). [sent-451, score-0.264]
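
One standard way to obtain subcases with both properties is to split on the arcs of the detected cycle, forcing earlier arcs to be present and one arc to be absent; this is a sketch of that idea (the exact branching used by the algorithm may differ in details).

```python
def branch_on_cycle(cycle_arcs):
    """Subcase t forces arcs cycle_arcs[0..t-1] to be present and cycle_arcs[t] to
    be absent.  The subcases are pairwise disjoint and together cover every graph
    that misses at least one arc of the cycle; the only graphs left out contain
    the whole cycle and therefore cannot be DAGs."""
    subcases = []
    for t, arc in enumerate(cycle_arcs):
        subcases.append({"present": list(cycle_arcs[:t]), "absent": [arc]})
    return subcases

# Example: a 3-cycle a->b->c->a yields three disjoint subcases.
print(branch_on_cycle([("a", "b"), ("b", "c"), ("c", "a")]))
```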

30 Because of that, the algorithm runs in at most $\prod_i |C(X_i)|$ steps, where $|C(X_i)|$ is the size of the cache for $X_i$ (there are no more ways to combine parent sets than that number). [sent-462, score-0.389]

31 The B&B algorithm as described alternately picks elements from the top and from the bottom of the queue (the percentage of elements from the bottom is controlled by the user parameter bottom). [sent-464, score-0.311]

82 However, there is an important difference between elements from the top and the bottom: top elements improve the upper bound for the global score, because we know that the global score is less than or equal to the highest score in the queue. [sent-466, score-0.442]

33 In fact, we know that an element from the bottom, if not a DAG, will generate new elements of the queue whose subspaces have an upper-bound score less than that of the originating elements, which certainly puts them again in the bottom of the queue. [sent-469, score-0.389]

34 Experiments: We perform experiments to show the benefits of the reduced cache and search space. [sent-620, score-0.268]

35 Table 1 presents the memory used in MB (first block), the time in seconds (second block) and the number of steps in local score evaluations (third block) for the cache construction, using the properties of Section 4. [sent-626, score-0.416]

36 Table 2: Final cache characteristics: maximum number of parents (average per node; the actual maximum is given in parentheses), actual cache size, and (approximate) search space implied by the cache. [sent-762, score-0.568]

87 In Table 2 we present the final cache characteristics, where we find the most attractive results, for instance, the small cache sizes when compared to the worst case. [sent-767, score-0.426]

88 The worst-case is the total number of nodes in the data set minus one, apart from lung (where we have set a limit of at most six parents) and wdbc (with at most eight parents). [sent-769, score-0.258]

89 [Table residue: the rows list the data sets adult, breast, car, letter, lung, mushroom, nursery, wdbc and zoo, repeated for four result blocks, followed by a Score column; the numeric values are not recoverable here.] [sent-780, score-2.274]

90 That is because smoothing the different parent sets by the stronger prior makes it harder to see large differences in scores, and consequently the properties that would reduce the cache size are less effective. [sent-1088, score-0.389]

91 The scores obtained by each algorithm (as a percentage of the value obtained by B&B) and the corresponding time are shown in Table 3 (excluding the cache construction). [sent-1101, score-0.314]

92 With the reduced cache presented here, finding the best structure for a given ordering is very fast, so it is possible to run OS over millions of orderings in a short period of time. [sent-1106, score-0.314]

93 [Table residue: the rows list the data sets adult, car, letter, lung, mushroom, nursery, wdbc and zoo with a time(s) column; the numeric values are not recoverable here.] [sent-1121, score-0.572]

94 Time (in seconds) to find the global optimum, cache size (number of stored scores) and (reduced) space for the search. [sent-1131, score-0.252]

95 Note that the cache size, the search space and consequently the time to solve the problems have substantially decreased. [sent-1135, score-0.268]

96 Because of that, we are able to generate random structural constraints that are certainly valid for this true Bayesian network (approximately n constraints for each case). [sent-1138, score-0.253]

97 Each line contains the mean and standard deviation of ten executions (using randomly generated networks) for time and cache size with and without constraints (using the same data sets in order to compare them). [sent-1141, score-0.264]

98 This happens because the properties that reduce the cache size and search space are much more effective under small-sized data sets. [sent-1147, score-0.268]

99 Conclusions: This paper describes novel properties of decomposable score functions for learning Bayesian network structure from data. [sent-1227, score-0.251]

100 Such properties allow the construction of a cache with all possible local scores of nodes and their parents without large memory consumption, which can later be used by search algorithms. [sent-1228, score-0.479]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('jk', 0.483), ('ni', 0.327), ('ki', 0.25), ('cache', 0.213), ('parent', 0.176), ('gcur', 0.173), ('arcs', 0.155), ('bic', 0.155), ('score', 0.153), ('queue', 0.132), ('ri', 0.13), ('hcur', 0.128), ('ampos', 0.119), ('ess', 0.112), ('arc', 0.105), ('wdbc', 0.101), ('fail', 0.099), ('lung', 0.098), ('bayesian', 0.097), ('xay', 0.093), ('bdeu', 0.091), ('tructure', 0.091), ('parents', 0.087), ('scur', 0.082), ('subcases', 0.082), ('dag', 0.078), ('fficient', 0.078), ('aic', 0.078), ('dags', 0.077), ('zoo', 0.073), ('ets', 0.073), ('scores', 0.07), ('si', 0.065), ('dbn', 0.064), ('structural', 0.063), ('ji', 0.061), ('ti', 0.06), ('node', 0.06), ('nodes', 0.059), ('network', 0.058), ('bd', 0.058), ('triple', 0.058), ('networks', 0.055), ('search', 0.055), ('indegree', 0.055), ('mushroom', 0.055), ('nursery', 0.055), ('supersets', 0.055), ('campos', 0.054), ('dbns', 0.054), ('slice', 0.053), ('constraints', 0.051), ('branch', 0.051), ('adult', 0.051), ('cycle', 0.05), ('memory', 0.05), ('ak', 0.049), ('graph', 0.049), ('dynamic', 0.048), ('koivisto', 0.047), ('logri', 0.046), ('sbest', 0.046), ('os', 0.045), ('bottom', 0.045), ('car', 0.042), ('counts', 0.041), ('breast', 0.04), ('structure', 0.04), ('discard', 0.04), ('global', 0.039), ('prohibited', 0.039), ('xit', 0.039), ('letter', 0.039), ('xc', 0.039), ('cooper', 0.038), ('herskovits', 0.035), ('xi', 0.033), ('priority', 0.032), ('eq', 0.032), ('orderings', 0.032), ('loop', 0.032), ('decomposability', 0.031), ('acyclicity', 0.031), ('parviainen', 0.031), ('superset', 0.031), ('percentage', 0.031), ('optimum', 0.031), ('certainly', 0.03), ('anytime', 0.03), ('earning', 0.029), ('ordering', 0.029), ('elements', 0.029), ('cycles', 0.029), ('log', 0.028), ('silander', 0.028), ('consumption', 0.028), ('dp', 0.028), ('mush', 0.027), ('niijk', 0.027), ('nurse', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

Author: Cassio P. de Campos, Qiang Ji

Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-and-bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique

2 0.28550395 101 jmlr-2011-Variable Sparsity Kernel Learning

Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman

Abstract: This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls, s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels; hence we name the formulations Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over a product of simplices and has a computational complexity of O(m^2 n_tot log n_max / ε^2), where m is the number of training data points, n_max and n_tot are the maximum number of kernels in any group and the total number of kernels, respectively, and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability

3 0.18960291 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki

Abstract: We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (ˆfCLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that ˆfCLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources. Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation

4 0.13824561 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure

Author: Yoshinori Tamada, Seiya Imoto, Satoru Miyano

Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. This algorithm is based on a dynamic programming (DP) algorithm having O(n · 2^n) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes. The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes. While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(n^σ) time and space overhead calculations, where σ > 0 controls the communication-space trade-off. The overall time and space complexity is O(n^{σ+1} 2^n). This algorithm splits the search space so that the required communication between independent calculations is minimal. Because of this advantage, our algorithm can run on distributed memory supercomputers. Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0.74, compared to the original DP algorithm with a single processor. We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in literature. Keywords: optimal Bayesian network structure, parallel algorithm

5 0.13645746 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning

6 0.097432539 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

7 0.095864289 55 jmlr-2011-Learning Multi-modal Similarity

8 0.087876283 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

9 0.077475414 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

10 0.058835536 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

11 0.05677478 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

12 0.046190076 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

13 0.04594294 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

14 0.045624994 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

15 0.044284966 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough

16 0.042726509 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

17 0.040397175 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

18 0.039077934 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

19 0.036174111 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

20 0.034145467 35 jmlr-2011-Forest Density Estimation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.25), (1, -0.103), (2, 0.082), (3, -0.213), (4, 0.078), (5, 0.154), (6, 0.007), (7, 0.12), (8, 0.382), (9, 0.353), (10, 0.119), (11, -0.206), (12, 0.011), (13, 0.189), (14, -0.11), (15, 0.184), (16, 0.03), (17, -0.118), (18, -0.12), (19, -0.026), (20, 0.002), (21, -0.071), (22, -0.03), (23, -0.018), (24, 0.07), (25, -0.022), (26, 0.064), (27, -0.032), (28, -0.043), (29, -0.036), (30, 0.05), (31, -0.044), (32, -0.004), (33, 0.046), (34, 0.018), (35, -0.003), (36, -0.072), (37, -0.02), (38, 0.008), (39, -0.009), (40, -0.01), (41, 0.027), (42, -0.024), (43, 0.015), (44, -0.022), (45, 0.063), (46, -0.016), (47, -0.053), (48, 0.008), (49, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97779328 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

Author: Cassio P. de Campos, Qiang Ji

Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-and-bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique

2 0.72549099 101 jmlr-2011-Variable Sparsity Kernel Learning

Author: Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman

Abstract: This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l1 norm regularization for promoting sparsity within RKHS norms of each group and ls, s ≥ 2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels; hence we name the formulations Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits efficient Mirror-Descent (MD) based solving techniques. The proposed MD based algorithm optimizes over a product of simplices and has a computational complexity of O(m^2 n_tot log n_max / ε^2), where m is the number of training data points, n_max and n_tot are the maximum number of kernels in any group and the total number of kernels, respectively, and ε is the error in approximating the objective. A detailed proof of convergence of the algorithm is also presented. Experimental results show that the VSKL formulations are well-suited for multi-modal learning tasks like object categorization. Results also show that the MD based algorithm outperforms state-of-the-art MKL solvers in terms of computational efficiency. Keywords: multiple kernel learning, mirror descent, mixed-norm, object categorization, scalability

3 0.6119436 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki

Abstract: We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (ˆfCLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that ˆfCLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources. Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation

4 0.50001985 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure

Author: Yoshinori Tamada, Seiya Imoto, Satoru Miyano

Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. This algorithm is based on a dynamic programming (DP) algorithm having O(n · 2^n) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes. The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes. While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(n^σ) time and space overhead calculations, where σ > 0 controls the communication-space trade-off. The overall time and space complexity is O(n^{σ+1} 2^n). This algorithm splits the search space so that the required communication between independent calculations is minimal. Because of this advantage, our algorithm can run on distributed memory supercomputers. Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0.74, compared to the original DP algorithm with a single processor. We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in literature. Keywords: optimal Bayesian network structure, parallel algorithm

5 0.4576292 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding

Abstract: Gaussian process regression is a flexible and powerful tool for machine learning, but the high computational complexity hinders its broader applications. In this paper, we propose a new approach for fast computation of Gaussian process regression with a focus on large spatial data sets. The approach decomposes the domain of a regression function into small subdomains and infers a local piece of the regression function for each subdomain. We explicitly address the mismatch problem of the local pieces on the boundaries of neighboring subdomains by imposing continuity constraints. The new approach has comparable or better computation complexity as other competing methods, but it is easier to be parallelized for faster computation. Moreover, the method can be adaptive to non-stationary features because of its local nature and, in particular, its use of different hyperparameters of the covariance function for different local regions. We illustrate application of the method and demonstrate its advantages over existing methods using two synthetic data sets and two real spatial data sets. Keywords: domain decomposition, boundary value problem, Gaussian process regression, parallel computation, spatial prediction

6 0.43085092 54 jmlr-2011-Learning Latent Tree Graphical Models

7 0.27699658 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

8 0.26789817 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

9 0.26295343 55 jmlr-2011-Learning Multi-modal Similarity

10 0.23664068 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

11 0.20254649 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

12 0.18903099 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

13 0.17085293 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

14 0.16431169 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

15 0.15795471 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking

16 0.15726775 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

17 0.15206897 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

18 0.14908819 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

19 0.14629096 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

20 0.14493117 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.039), (9, 0.024), (10, 0.084), (24, 0.032), (31, 0.081), (32, 0.026), (41, 0.027), (49, 0.015), (60, 0.018), (65, 0.013), (71, 0.433), (73, 0.034), (78, 0.052), (90, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86440396 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

Author: Jianxin Wu, Wei-Chian Tan, James M. Rehg

Abstract: Common visual codebook generation methods used in a bag of visual words model, for example, k-means or Gaussian Mixture Model, use the Euclidean distance to cluster features into visual code words. However, most popular visual descriptors are histograms of image measurements. It has been shown that with histogram features, the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks. In this paper, we demonstrate that HIK can be used in an unsupervised manner to significantly improve the generation of visual codebooks. We propose a histogram kernel k-means algorithm which is easy to implement and runs almost as fast as the standard k-means. The HIK codebooks have consistently higher recognition accuracy over k-means codebooks by 2–4% in several benchmark object and scene recognition data sets. The algorithm is also generalized to arbitrary additive kernels. Its speed is thousands of times faster than a naive implementation of the kernel k-means algorithm. In addition, we propose a one-class SVM formulation to create more effective visual code words. Finally, we show that the standard kmedian clustering method can be used for visual codebook generation and can act as a compromise between the HIK / additive kernel and the k-means approaches. Keywords: visual codebook, additive kernel, histogram intersection kernel

same-paper 2 0.82013166 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

Author: Cassio P. de Campos, Qiang Ji

Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-and-bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique

3 0.34568977 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks

4 0.34340838 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar

Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.

5 0.34137249 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu

Abstract: Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch. In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the computation of any marginal in time that is logarithmic in the size of the input model. Moreover, in contrast to max-product our approach can also be used to update MAP configurations in time that is roughly proportional to the number of updated entries, rather than the size of the input model. To evaluate the practical effectiveness of our algorithms, we implement and test them using synthetic data as well as for two real-world computational biology applications. Our experiments show that adaptive inference can achieve substantial speedups over performing complete inference as the model undergoes small changes over time. Keywords: exact inference, factor graphs, factor elimination, marginalization, dynamic programming, MAP computation, model updates, parallel tree contraction

6 0.33344766 16 jmlr-2011-Clustering Algorithms for Chains

7 0.32932866 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

8 0.32826325 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

9 0.32644695 48 jmlr-2011-Kernel Analysis of Deep Networks

10 0.32469252 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

11 0.32295069 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

12 0.31969282 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

13 0.31718555 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

14 0.31537819 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

15 0.31290349 104 jmlr-2011-X-Armed Bandits

16 0.31187338 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

17 0.31183907 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

18 0.31116816 12 jmlr-2011-Bayesian Co-Training

19 0.30900547 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

20 0.30852258 29 jmlr-2011-Efficient Learning with Partially Observed Attributes