jmlr jmlr2011 jmlr2011-25 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
Abstract: We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (f̂CLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that f̂CLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources. Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation. © 2011 Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira and Petri Myllymäki.
Reference: text
sentIndex sentText sentNum sentScore
1 We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (f̂CLL), for learning Bayesian network classifiers. [sent-17, score-0.262]
2 The proposed score is an approximation of the conditional log-likelihood criterion. [sent-18, score-0.09]
3 The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. [sent-19, score-0.215]
4 The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. [sent-20, score-0.143]
5 Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation. [sent-23, score-0.213]
6 One of the likely causes for this appears to be the use of so-called generative learning methods in choosing the Bayesian network structure as well as its parameters. [sent-31, score-0.067]
7 Unfortunately, discriminative learning of Bayesian network classifiers has turned out to be computationally much more challenging than generative learning. [sent-33, score-0.128]
8 This led Friedman et al. (1997) to bring up the question: are there heuristic approaches that allow efficient discriminative learning of Bayesian network classifiers? [sent-35, score-0.128]
9 Over the past years, different discriminative approaches have been proposed, which tend to decompose the problem into two tasks: (i) discriminative structure learning, and (ii) discriminative parameter learning. [sent-36, score-0.183]
10 They introduced a discriminative parameter learning algorithm, called the Extended Logistic Regression (ELR) algorithm, that uses gradient descent to maximize the conditional log-likelihood (CLL) of the class variable given the other variables. [sent-38, score-0.106]
11 Roos et al. (2005) showed that globally optimal solutions can be guaranteed only for network structures that satisfy a certain graph-theoretic property, including, for example, the naive Bayes and tree-augmented naive Bayes (TAN) structures (see Friedman et al. [sent-48, score-0.157]
12 (1998) and Grossman and Domingos (2004) propose to choose network structures by maximizing CLL while choosing parameters by maximizing the parameter posterior or the (joint) log-likelihood (LL). [sent-52, score-0.122]
13 The criteria can be used for efficient learning of augmented naive Bayes classifiers. [sent-58, score-0.083]
14 Our first criterion is the approximated conditional log-likelihood (aCLL). [sent-63, score-0.081]
15 The proposed score is the minimum variance unbiased (MVU) approximation to CLL under a class of uniform priors on certain parameters of the joint distribution of the class variable and the other attributes. [sent-64, score-0.092]
16 Second, the criterion is not well-behaved in the sense that it sometimes diverges when the parameters approach the usual relative frequency estimates (maximizing LL). [sent-68, score-0.079]
17 To address these two shortcomings, we devise a second approximation, the factorized conditional log-likelihood (f̂CLL). [sent-69, score-0.094]
18 The f̂CLL approximation is uniformly bounded, and moreover, the f̂CLL criterion is maximized by the easily obtainable relative-frequency parameter estimates. [sent-70, score-0.088]
19 To gauge the performance of the proposed criteria in classification tasks, we compare them with several popular classifiers, namely, tree-augmented naive Bayes (TAN), greedy hill-climbing (GHC), C4.5, nearest neighbor, logistic regression, and support vector machines (SVM). [sent-72, score-0.083]
20 The advantage of f̂CLL with respect to these latter classifiers is that it is computationally as efficient as the LL scoring criterion, and considerably more efficient than ELR and SVMs. [sent-77, score-0.101]
21 In Section 3 we discuss generative and discriminative learning of Bayesian network classifiers. [sent-80, score-0.128]
22 In Section 4 we present our scoring criteria, followed by experimental results in Section 5. [sent-81, score-0.101]
23 The parameters θ_{ijk}, with i ∈ {1, ..., n}, j ∈ {1, ..., q_i} and k ∈ {1, ..., r_i}, determine the local distributions in the network via P_B(X_i = x_{ik} | Π_{X_i} = w_{ij}) = θ_{ijk}. [sent-113, score-0.118]
24 The conditional independence properties pertaining to the joint distribution are essentially determined by the network structure. [sent-118, score-0.089]
25 Learning unrestricted Bayesian networks from data under typical scoring criteria is NP-hard (Chickering et al. [sent-120, score-0.15]
26 This result has led the Bayesian network community to search for the largest subclass of network structures for which there is an efficient learning algorithm. [sent-122, score-0.151]
27 First attempts confined the network to tree structures and used the optimal branching algorithms of Edmonds (1967) and Chow and Liu (1968). [sent-123, score-0.084]
28 Indeed, Chickering (1996) showed that learning the structure of a Bayesian network is NP-hard even for networks constrained to have in-degree at most two. [sent-125, score-0.087]
29 Consequently, the standard methodology for addressing the problem of learning Bayesian networks has become heuristic score-based learning where a scoring criterion φ is considered in order to quantify the capability of a Bayesian network to explain the observed data. [sent-129, score-0.247]
30 Given data D = {y_1, ..., y_N} and a scoring criterion φ, the task is to find a Bayesian network B that maximizes the score φ(B, D). [sent-133, score-0.266]
31 The most common scoring criteria are reviewed in Carvalho (2009) and Yang and Chang (2002). [sent-135, score-0.13]
32 For newly developed scoring criteria, we refer the interested reader to the works of de Campos (2006) and Silander et al. [sent-136, score-0.13]
33 Score-based learning algorithms can be extremely efficient if the employed scoring criterion is decomposable. [sent-138, score-0.16]
34 A scoring criterion φ is said to be decomposable if the score can be expressed as a sum of local scores that depend only on each node and its parents, that is, in the form φ(B, D) = ∑_{i=1}^{n} φ_i(Π_{X_i}, D). [sent-139, score-0.229]
35 (1995): LL(B | D) = ∑_{t=1}^{N} log P_B(y_{t1}, ..., y_{tn}) = ∑_{i=1}^{n} ∑_{j=1}^{q_i} ∑_{k=1}^{r_i} N_{ijk} log θ_{ijk}, [sent-141, score-0.102]
36 which is clearly decomposable. [sent-144, score-0.189]
37 The maximum likelihood (ML) parameters that maximize LL are easily obtained as the observed frequency estimates (OFE), given by θ̂_{ijk} = N_{ijk} / N_{ij}, (1) where N_{ijk} denotes the number of instances in D where X_i = x_{ik} and Π_{X_i} = w_{ij}, and N_{ij} = ∑_{k=1}^{r_i} N_{ijk}. [sent-145, score-0.155]
38 Plugging these estimates back into the LL criterion yields LL(G | D) = ∑_{i=1}^{n} ∑_{j=1}^{q_i} ∑_{k=1}^{r_i} N_{ijk} log(N_{ijk} / N_{ij}). [sent-146, score-0.215]
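As a concrete illustration of Equations (1) and (3), the following is a minimal Python sketch (not taken from the paper or its Weka implementation; the function and variable names are ours) that uses the OFE parameters implicitly and evaluates the plugged-in LL score from raw counts.

```python
from collections import Counter
from math import log2

def ll_score(data, parents):
    """Plug-in log-likelihood LL(G | D) = sum_{i,j,k} N_ijk * log2(N_ijk / N_ij)."""
    score = 0.0
    for i, pa in parents.items():
        n_ijk, n_ij = Counter(), Counter()
        for row in data:                      # data: list of tuples of discrete values
            j = tuple(row[p] for p in pa)     # parent configuration w_ij
            n_ijk[(j, row[i])] += 1           # N_ijk
            n_ij[j] += 1                      # N_ij
        for (j, _), n in n_ijk.items():
            score += n * log2(n / n_ij[j])    # OFE: theta_ijk = N_ijk / N_ij
    return score

# toy example: X1 is the only parent of X0, X1 has no parents
data = [(0, 0), (0, 0), (1, 1), (1, 0)]
print(ll_score(data, {0: (1,), 1: ()}))
```

Because the score is a sum over nodes, each node's term can be cached and reused, which is what makes decomposability computationally attractive.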
39 The notation with G as the argument instead of B = (G, Θ) emphasizes that once the use of the OFE parameters is decided upon, the criterion is a function of the network structure, G, only. [sent-147, score-0.126]
40 The LL scoring criterion tends to favor complex network structures with many edges since adding an edge never decreases the likelihood. [sent-148, score-0.244]
41 Bayesian Network Classifiers. A Bayesian network classifier is a Bayesian network over X = (X_1, ..., X_n, C), where C is the class variable. [sent-151, score-0.134]
42 We focus on augmented naive Bayes classifiers, that is, Bayesian network classifiers where the class variable has no parents, Π_C = ∅, and all attributes have at least the class variable as a parent, C ∈ Π_{X_i} for all X_i. [sent-163, score-0.139]
43 Similarly to the general case, local distributions are determined by the corresponding parameters P(C = z_c) = θ_c and P(X_i = x_{ik} | Π*_{X_i} = w*_{ij}, C = z_c) = θ_{ijck}. [sent-173, score-0.396]
44 We denote by N_{ijck} the number of instances in the data D where X_i = x_{ik}, Π*_{X_i} = w*_{ij} and C = z_c. [sent-174, score-0.38]
45 Moreover, the following short-hand notations will become useful: N_{ij*k} = ∑_{c=1}^{s} N_{ijck}, N_{ijc} = ∑_{k=1}^{r_i} N_{ijck}, N_{ij*} = ∑_{k=1}^{r_i} ∑_{c=1}^{s} N_{ijck}, and N_c = (1/n) ∑_{i=1}^{n} ∑_{j=1}^{q*_i} ∑_{k=1}^{r_i} N_{ijck}. [sent-175, score-1.751]
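This count bookkeeping can be sketched directly; the Python fragment below (our own naming, not the paper's code) builds N_{ijck} and the shorthand marginals from data.

```python
from collections import Counter

def classifier_counts(data, attr_parents):
    """N_ijck: instances with X_i = x_ik, Pi*_i = w_ij and C = z_c.
    data is a list of (x_1, ..., x_n, c) tuples with the class last."""
    counts = {i: Counter() for i in attr_parents}
    for row in data:
        *x, c = row
        for i, pa in attr_parents.items():
            j = tuple(x[p] for p in pa)
            counts[i][(j, c, x[i])] += 1
    return counts

def shorthands(n_ijck):
    """Marginal counts N_ij*k, N_ijc and N_ij* for one attribute."""
    n_ij_star_k, n_ijc, n_ij_star = Counter(), Counter(), Counter()
    for (j, c, k), n in n_ijck.items():
        n_ij_star_k[(j, k)] += n   # summed over classes
        n_ijc[(j, c)] += n         # summed over values of X_i
        n_ij_star[j] += n          # summed over both
    return n_ij_star_k, n_ijc, n_ij_star

# toy example: attribute 1 has attribute 0 as an extra parent besides the class
data = [(0, 0, 0), (0, 1, 0), (1, 1, 1), (1, 0, 1)]
print(shorthands(classifier_counts(data, {0: (), 1: (0,)})[1]))
```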
46 The ML estimates (1) now become θ̂_c = N_c / N and θ̂_{ijck} = N_{ijck} / N_{ijc}, (2) which can again be plugged into the LL criterion: [sent-177, score-0.951]
47 LL(G | D) = ∑_{t=1}^{N} log P_B(y_{t1}, ..., y_{tn}, c_t) = ∑_{c=1}^{s} N_c log(N_c / N) + ∑_{i=1}^{n} ∑_{j=1}^{q*_i} ∑_{k=1}^{r_i} ∑_{c=1}^{s} N_{ijck} log(N_{ijck} / N_{ijc}). [sent-180, score-1.192]
48 Interestingly, the objective of generative learning is precisely to maximize the whole sum in (4), whereas the goal of discriminative learning consists in maximizing only the first term in (4). [sent-190, score-0.103]
49 For convenience, we denote the two terms in the denominator as U_t = P_B(y_{t1}, ..., y_{tn}, c_t) and V_t = P_B(y_{t1}, ..., y_{tn}, 1 − c_t), (6) [sent-214, score-0.139]
50 so that Equation (5) becomes simply P_B(c_t | y_{t1}, ..., y_{tn}) = U_t / (U_t + V_t). [sent-220, score-0.139]
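A small sketch of how U_t, V_t and the exact CLL would be computed for a binary class; it assumes some routine joint_prob that evaluates P_B(y, c) under the current network (that routine is not shown and its name is ours).

```python
from math import log2

def cll(data, joint_prob):
    """CLL(B | D) = sum_t log P_B(c_t | y_t), with P_B(c_t | y_t) = U_t / (U_t + V_t)."""
    total = 0.0
    for y, c in data:                  # data: list of (attribute_tuple, class) pairs
        u = joint_prob(y, c)           # U_t = P_B(y_t, c_t)
        v = joint_prob(y, 1 - c)       # V_t = P_B(y_t, 1 - c_t), the dual sample
        total += log2(u / (u + v))
    return total

# toy example with a hand-specified joint table over one binary attribute and the class
table = {((0,), 0): 0.4, ((0,), 1): 0.1, ((1,), 0): 0.1, ((1,), 1): 0.4}
print(cll([((0,), 0), ((1,), 1)], lambda y, c: table[(y, c)]))
```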
51 Here (y_{t1}, ..., y_{tn}, c_t) is the t'th sample in the data set D, [sent-228, score-0.139]
52 and the vector (y_{t1}, ..., y_{tn}, 1 − c_t) is what we call the dual sample of (y_{t1}, ..., y_{tn}, c_t). [sent-231, score-0.139]
53 Recall that our goal is to derive a decomposable scoring criterion. [sent-236, score-0.131]
54 The constants in the approximation f̂ are given by α = (π² + 6)/24, (8) β = (π² − 18)/24, (9) and γ = −((π² − 6)/12) log p − 2 + π²/(12 ln 2), (10) where log is the binary logarithm and ln is the natural logarithm. [sent-262, score-0.186]
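These constants are easy to evaluate numerically; the sketch below also runs a small Monte-Carlo check of the unbiasedness claim in Theorem 2. The exact expression for γ is our reading of the garbled formula above, so treat it as an assumption.

```python
import random
from math import pi, log, log2

def acll_constants(p):
    alpha = (pi**2 + 6) / 24                                              # Equation (8)
    beta = (pi**2 - 18) / 24                                              # Equation (9)
    gamma = -((pi**2 - 6) / 12) * log2(p) - 2 + pi**2 / (12 * log(2))     # Equation (10), as read above
    return alpha, beta, gamma

def mean_error(p, trials=200_000):
    """E[log2(u/(u+v)) - (alpha*log2 u + beta*log2 v + gamma)] for u, v ~ Uniform(0, p)."""
    a, b, g = acll_constants(p)
    total = 0.0
    for _ in range(trials):
        u = (1 - random.random()) * p          # sample in (0, p], avoids log2(0)
        v = (1 - random.random()) * p
        total += log2(u / (u + v)) - (a * log2(u) + b * log2(v) + g)
    return total / trials

print(acll_constants(0.5))   # alpha ~ 0.66, beta ~ -0.34
print(mean_error(0.5))       # should be close to zero if the constants are right
```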
55 Theorem 2 The approximation f̂ with α, β, γ defined as in Theorem 1 is the minimum variance unbiased (MVU) approximation of f. [sent-265, score-0.082]
56 Moreover, an approximation is (uniformly) minimum variance unbiased if the value E[(f̂(U_t, V_t) − f(U_t, V_t))²] is the lowest among all unbiased approximations and all values of p. [sent-271, score-0.077]
57 Using the approximation in Theorem 1 to approximate CLL yields CLL(B | D) ≈ ∑_{t=1}^{N} (α log U_t + β log V_t + γ) = ∑_{t=1}^{N} ((α + β) log U_t − β log(U_t / V_t) + γ) = (α + β) LL(B | D) − β ∑_{t=1}^{N} log(U_t / V_t) + Nγ, (11) where the constants α, β and γ are given by Equations (8), (9) and (10), respectively. [sent-284, score-0.101]
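A sketch of Equation (11) in code (our naming; joint_prob is assumed as before, and the constants use our reading of (8)–(10)).

```python
from math import pi, log, log2

def acll(data, joint_prob, p):
    """aCLL(B | D) ~= (alpha + beta) * LL(B | D) - beta * sum_t log(U_t / V_t) + N * gamma."""
    alpha = (pi**2 + 6) / 24
    beta = (pi**2 - 18) / 24
    gamma = -((pi**2 - 6) / 12) * log2(p) - 2 + pi**2 / (12 * log(2))
    ll, ratio = 0.0, 0.0
    for y, c in data:
        u, v = joint_prob(y, c), joint_prob(y, 1 - c)   # U_t and V_t
        ll += log2(u)                                   # sum_t log U_t = LL(B | D)
        ratio += log2(u / v)
    return (alpha + beta) * ll - beta * ratio + len(data) * gamma
```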
58 We are now in the position of having constructed a decomposable approximation of the conditional log-likelihood score that was shown to be very accurate for a wide range of parameters Ut and Vt . [sent-288, score-0.12]
59 Due to the dependency of these parameters on Θ, that is, the parameters of the Bayesian network B (recall Equation (6)), the score still requires that a suitable set of parameters is chosen. [sent-289, score-0.106]
60 To better see why this is the case, substitute the OFE parameters, Equation (2), into the aCLL criterion, Equation (12), to obtain âCLL(G | D) = (α + β) LL(G | D) − β ∑_{i=1}^{n} ∑_{j=1}^{q*_i} ∑_{k=1}^{r_i} ∑_{c=0}^{1} N_{ijck} log((N_{ijck} N_{ij(1−c)}) / (N_{ijc} N_{ij(1−c)k})) − β ∑_{c=0}^{1} N_c log(N_c / N_{1−c}). [sent-336, score-1.032]
61 In the LL and CLL criteria, similar expressions whose denominator may be zero are always eliminated by the OFE parameters, since they are always multiplied by zero; see, for example, Equation (3), where N_{ijc} = 0 implies N_{ijck} = 0. [sent-338, score-0.575]
62 Given a fixed network structure, the parameters that maximize the first term, (α + β)LL(B | D), are given by OFE. [sent-344, score-0.09]
63 Let f(B | D) be a score defined by f(B | D) = ∑_{i=1}^{n} ∑_{j=1}^{q*_i} ∑_{k=1}^{r_i} ∑_{c=0}^{1} N_{ijck} λ log( θ_{ijck} / ((N_{ijc}/N_{ij*}) θ_{ijck} + (N_{ij(1−c)}/N_{ij*}) θ_{ij(1−c)k}) ), (14) where λ is an arbitrary positive real value. [sent-352, score-1.396]
64 To clarify the analysis, we introduce the following short-hand notations: A_1 = N_{ijc} θ_{ijck}, A_2 = N_{ijc}, B_1 = N_{ij(1−c)} θ_{ij(1−c)k}, B_2 = N_{ij(1−c)}. [sent-356, score-0.81]
65 With simple algebra, we can rewrite the logarithm in the second term of Equation (12) using the above notations as log((N_{ijc} θ_{ijck}) / (N_{ij(1−c)} θ_{ij(1−c)k})) − log(N_{ijc} / N_{ij(1−c)}) = log(A_1 / B_1) − log(A_2 / B_2). [sent-357, score-0.954]
66 Assumption 2 We assume that N_{ijc} θ_{ijck} / (N_{ijc} θ_{ijck} + N_{ij(1−c)} θ_{ij(1−c)k}) ∼ Uniform(0, 1) and N_{ijc} / (N_{ijc} + N_{ij(1−c)}) ∼ Uniform(0, 1). [sent-367, score-1.62]
67 Figure 3: Plot of g(w) = log(w / (1 − w)) and ĝ(w) = λ log w + ρ. [sent-373, score-0.072]
68 Hence, Assumption 2 reduces to N_{ijck} / N_{ij*k} ∼ Uniform(0, 1) and N_{ijc} / N_{ij*} ∼ Uniform(0, 1). [sent-375, score-0.095]
69 The constants are λ = π²/6 (18) and ρ = π²/(6 ln 2). (19) Theorem 6 The approximation ĝ with λ and ρ defined as in Theorem 5 is the minimum variance unbiased (MVU) approximation of g. [sent-381, score-0.139]
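As with α, β and γ, the constants λ and ρ can be checked numerically; the average error below should be near zero if our reconstruction of (18)–(19) is correct (again an assumption on our part).

```python
import random
from math import pi, log, log2

lam = pi**2 / 6              # Equation (18), as read above
rho = pi**2 / (6 * log(2))   # Equation (19), as read above

# mean error of approximating g(w) = log2(w / (1 - w)) by lam * log2(w) + rho, w ~ Uniform(0, 1)
errors = []
for _ in range(200_000):
    w = 1 - random.random()                      # sample in (0, 1], avoids log2(0)
    if w == 1.0:
        continue                                 # avoid division by zero in 1 - w
    errors.append(log2(w / (1 - w)) - (lam * log2(w) + rho))
print(sum(errors) / len(errors))                 # should be close to zero
```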
70 To get an idea of the accuracy of the approximation ĝ, consider the graphs of log(w / (1 − w)) and λ log w + ρ in Figure 3. [sent-382, score-0.101]
71 A similar analysis can be applied to rewrite the logarithm of the third term in Equation (12), leading to log(θ_c / θ_{(1−c)}) = log(θ_c / (1 − θ_c)) ≈ λ log θ_c + ρ. [sent-386, score-0.108]
72 Observe that the third term of Equation (22) is such that −βλ ∑_{c=0}^{1} N_c log θ_c = −βλ N ∑_{c=0}^{1} (N_c / N) log θ_c, (23) and, since β < 0, by Gibbs' inequality (see Lemma 8 in the Appendix) the parameters that maximize Equation (23) are given by the OFE, that is, θ̂_c = N_c / N. [sent-388, score-0.095]
73 Information-Theoretical Interpretation. Before we present empirical results illustrating the behavior of the proposed scoring criteria, we point out that the f̂CLL criterion has an interesting information-theoretic interpretation based on interaction information. [sent-394, score-0.183]
74 As Friedman et al. (1997) point out, the local contribution of the i'th variable to LL(B | D) (recall Equation (3)) is given by N ∑_{j=1}^{q*_i} ∑_{c=0}^{1} ∑_{k=1}^{r_i} (N_{ijck} / N) log(N_{ijck} / N_{ijc}) = −N H_{P̂_D}(X_i | Π*_{X_i}, C) = −N H_{P̂_D}(X_i | C) + N I_{P̂_D}(X_i ; Π*_{X_i} | C), (25) where H_{P̂_D} denotes (conditional) entropy under the empirical distribution P̂_D. [sent-397, score-0.996]
75 Since I_{P̂_D}(X_i ; C) on the last line of Equation (26) does not depend on Π*_{X_i}, finding the parents of X_i that maximize the sum amounts to maximizing the interaction information. [sent-405, score-0.099]
76 All said, the f̂CLL criterion can be written as f̂CLL(G | D) = ∑_{i=1}^{n} ((α + β) N I_{P̂_D}(X_i ; Π*_{X_i} | C) − βλ N I_{P̂_D}(C ; X_i ; Π*_{X_i})) + const, (27) where const is a constant independent of the network structure and can thus be omitted. [sent-407, score-0.126]
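The information-theoretic form in Equation (27) suggests a direct way to score a candidate parent for one attribute from empirical (conditional) mutual information. The sketch below is ours, not the paper's implementation, and it assumes the interaction-information convention I(C; X; Z) = I(X; Z | C) − I(X; Z).

```python
from collections import Counter
from math import pi, log2

def mutual_info(pairs):
    """Empirical I(A; B) from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(c / n * log2(c * n / (pa[a] * pb[b])) for (a, b), c in joint.items())

def cond_mutual_info(triples):
    """Empirical I(A; B | C) from a list of (a, b, c) samples."""
    n = len(triples)
    classes = Counter(c for _, _, c in triples)
    return sum(classes[c0] / n * mutual_info([(a, b) for a, b, c in triples if c == c0])
               for c0 in classes)

def fcll_local(xi, pa, cls):
    """Local contribution of Equation (27) for one attribute and one candidate parent:
    (alpha+beta) * N * I(X_i; Pi*_i | C) - beta*lambda * N * I(C; X_i; Pi*_i)."""
    n = len(xi)
    alpha_plus_beta = (pi**2 - 6) / 12
    beta_lambda = ((pi**2 - 18) / 24) * (pi**2 / 6)         # beta < 0, so -beta*lambda > 0
    i_cond = cond_mutual_info(list(zip(xi, pa, cls)))
    interaction = i_cond - mutual_info(list(zip(xi, pa)))   # assumed sign convention
    return alpha_plus_beta * n * i_cond - beta_lambda * n * interaction

# toy example: score attribute X1 as a candidate parent of X0
X0 = [0, 0, 1, 1, 0, 1]
X1 = [0, 1, 1, 0, 0, 1]
C  = [0, 0, 1, 1, 0, 1]
print(fcll_local(X0, X1, C))
```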
77 The first term has a relative weight of approximately 36.7 percent, while the second term, which gives the f̂CLL criterion its discriminative nature, has the weight 63.3 percent. [sent-412, score-0.12]
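These two percentages can be reproduced from the constants (using our reconstructed values), which serves as a quick consistency check.

```python
from math import pi

generative_weight = (pi**2 - 6) / 12                         # alpha + beta
discriminative_weight = -((pi**2 - 18) / 24) * (pi**2 / 6)   # -beta * lambda
total = generative_weight + discriminative_weight
print(round(100 * generative_weight / total, 1),             # ~36.7
      round(100 * discriminative_weight / total, 1))         # ~63.3
```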
78 In addition to the insight provided by the information-theoretic interpretation of f̂CLL, it also yields a most useful practical corollary: the f̂CLL criterion is score equivalent. [sent-414, score-0.098]
79 A scoring criterion is said to be score equivalent if it assigns the same score to all network structures encoding the same independence assumptions, see Verma and Pearl (1990), Chickering (2002), Yang and Chang (2002) and de Campos (2006). [sent-415, score-0.322]
80 Theorem 7 The f̂CLL criterion is score equivalent for augmented naive Bayes classifiers. [sent-416, score-0.152]
81 Beyond Binary Classification and TAN. Although the aCLL and f̂CLL scoring criteria were devised with binary classification tasks in mind, their application to multi-class problems is straightforward. [sent-426, score-0.13]
82 To adapt it for multi-class problems, we considered N_{ij(1−c)k} = N_{ij*k} − N_{ijck} and N_{ij(1−c)} = N_{ij} − N_{ijc}. [sent-430, score-0.575]
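In code, the multi-class adaptation is just a "rest of the classes" lumping of the counts; a tiny sketch (names are ours):

```python
def rest_counts(n_ijck, n_ij_star_k, n_ij, n_ijc):
    """N_ij(1-c)k = N_ij*k - N_ijck and N_ij(1-c) = N_ij - N_ijc:
    every class other than c is treated as one aggregate class."""
    return n_ij_star_k - n_ijck, n_ij - n_ijc

print(rest_counts(3, 10, 25, 8))   # -> (7, 17)
```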
83 Finally, we point out that despite being derived under the augmented naive Bayes model, the f̂CLL score can be readily applied to models where the class variable is not a parent of some of the attributes. [sent-431, score-0.093]
84 Experimental Results. We implemented the f̂CLL scoring criterion on top of the Weka open-source software (Hall et al. [sent-435, score-0.16]
85 Unfortunately, the Weka implementation of the TAN classifier assumes that the learning criterion is score equivalent, which rules out the use of our aCLL criterion. [sent-437, score-0.098]
86 We evaluated the performance of the aCLL and f̂CLL scoring criteria in classification tasks, comparing them with state-of-the-art classifiers. [sent-444, score-0.13]
87 Bayesian network-based classifiers (GHC2 and TAN) were included in different flavors, differing in the scoring criterion used for structure learning (LL, aCLL, f̂CLL) and the parameter estimator (OFE, ELR). [sent-481, score-0.16]
88 The f̂CLL-based classifiers outperformed C4.5, the nearest neighbor classifiers, and logistic regression, as well as the generatively-trained Bayesian network classifiers, TAN-LL-OFE and GHC-LL-OFE, all differences being statistically significant at the p < 0.05 level. [sent-523, score-0.085]
89 Conclusions and Future Work. We proposed a new decomposable scoring criterion for classification tasks. [sent-627, score-0.19]
90 The new score, called the factorized conditional log-likelihood (f̂CLL), is based on an approximation of the conditional log-likelihood. [sent-628, score-0.145]
91 Moreover, the criterion is specifically designed for discriminative learning. [sent-631, score-0.12]
92 The merits of the new scoring criterion were evaluated and compared to those of common state-of-the-art classifiers, on a large suite of benchmark data sets from the UCI repository. [sent-632, score-0.16]
93 Optimal f̂CLL-scored tree-augmented naive Bayes (TAN) classifiers, as well as somewhat more general structures (referred to above as GHC2), performed better than generatively-trained Bayesian network classifiers, as well as C4.5. [sent-633, score-0.112]
94 Proof (Theorem 2) We have that ∫_0^p ∫_0^p (1/p²) (log(x / (x + y)) − (α log(x) + β log(y) + γ)) dy dx = 0 for α, β and γ defined as in (8), (9) and (10). [sent-657, score-0.086]
95 Proof (Theorem 3) We have that ∫_0^p ∫_0^p (1/p²) (log(x / (x + y)) − (α log(x) + β log(y) + γ))² dy dx = (36 + 36π² − π⁴) / (288 ln²(2)) − 2 for α, β and γ defined as in (8), (9) and (10), which concludes the proof. [sent-659, score-0.086]
96 Moreover, if we take the OFE for the parameters, we have θ̂_{ijck} = N_{ijck} / N_{ijc} and θ̂_{ij(1−c)k} = N_{ij(1−c)k} / N_{ij(1−c)}. [sent-664, score-0.34]
97 Proof (Theorem 5) We have that S(λ, ρ) = ∫_0^1 (log(x / (1 − x)) − (λ log(x) + ρ))² dx, and setting the gradient ∇S(λ, ρ) to zero yields the values in (18) and (19). [sent-667, score-0.072]
98 Since f̂CLL is a local scoring criterion, f̂CLL(G_1 | D) can be computed from f̂CLL(G_0 | D) by taking into account only the difference in the contribution of node Y. [sent-679, score-0.101]
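This local-update argument is what a search procedure exploits in practice; a minimal sketch (ours, not Weka's code) of rescoring a decomposable criterion after changing one node's parent set:

```python
def rescore_after_change(local_scores, changed_node, new_local_score):
    """Only the term of the node whose parent set changed is recomputed;
    all other cached local terms are reused."""
    updated = dict(local_scores)
    updated[changed_node] = new_local_score
    return updated, sum(updated.values())

cache = {"X1": -3.2, "X2": -1.5, "Y": -2.0}
print(rescore_after_change(cache, "Y", -1.7))
```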
99 A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. [sent-782, score-0.161]
100 On discriminative Bayesian network classifiers and logistic regression. [sent-933, score-0.146]
wordName wordTfidf (topN-words)
[('ni', 0.383), ('fcll', 0.375), ('jck', 0.34), ('ipd', 0.319), ('ofe', 0.248), ('jc', 0.235), ('acll', 0.177), ('cll', 0.163), ('ut', 0.16), ('hpd', 0.156), ('roos', 0.149), ('elr', 0.113), ('arvalho', 0.106), ('liveira', 0.106), ('yllym', 0.106), ('scoring', 0.101), ('vt', 0.101), ('tan', 0.1), ('onditional', 0.099), ('ytn', 0.099), ('ll', 0.091), ('aki', 0.09), ('weka', 0.08), ('nipd', 0.078), ('ers', 0.076), ('factorized', 0.072), ('og', 0.07), ('network', 0.067), ('pb', 0.065), ('discriminative', 0.061), ('criterion', 0.059), ('ikelihood', 0.058), ('ln', 0.057), ('nut', 0.057), ('nvt', 0.057), ('domingos', 0.054), ('greiner', 0.05), ('logut', 0.05), ('bayesian', 0.049), ('grossman', 0.048), ('classi', 0.046), ('ri', 0.045), ('nc', 0.043), ('nrt', 0.043), ('ct', 0.04), ('score', 0.039), ('log', 0.036), ('logr', 0.035), ('parents', 0.034), ('carvalho', 0.032), ('su', 0.03), ('decomposable', 0.03), ('approximation', 0.029), ('criteria', 0.029), ('edmonds', 0.028), ('gibb', 0.028), ('naive', 0.028), ('xi', 0.028), ('myllym', 0.027), ('jk', 0.027), ('dy', 0.026), ('augmented', 0.026), ('friedman', 0.026), ('dx', 0.024), ('unbiased', 0.024), ('xik', 0.024), ('helsinki', 0.024), ('maximize', 0.023), ('rt', 0.023), ('interaction', 0.023), ('conditional', 0.022), ('chow', 0.022), ('discretization', 0.022), ('zhou', 0.021), ('rbf', 0.021), ('qi', 0.021), ('ibk', 0.021), ('logvt', 0.021), ('svmg', 0.021), ('teemu', 0.021), ('chickering', 0.021), ('diverges', 0.02), ('kaufmann', 0.02), ('networks', 0.02), ('maximizing', 0.019), ('arrow', 0.019), ('mutual', 0.018), ('attributes', 0.018), ('logistic', 0.018), ('decomposability', 0.018), ('oliveira', 0.018), ('svm', 0.018), ('heckerman', 0.017), ('structures', 0.017), ('dirichlet', 0.017), ('mvu', 0.016), ('kohavi', 0.016), ('zc', 0.016), ('petri', 0.016), ('silander', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
2 0.18960291 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique
3 0.067074239 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
4 0.063109636 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
Author: Vladimir V. V'yugin
Abstract: In this paper the sequential prediction problem with expert advice is considered for the case where losses of experts suffered at each step cannot be bounded in advance. We present some modification of Kalai and Vempala algorithm of following the perturbed leader where weights depend on past losses of the experts. New notions of a volume and a scaled fluctuation of a game are introduced. We present a probabilistic algorithm protected from unrestrictedly large one-step losses. This algorithm has the optimal performance in the case when the scaled fluctuations of one-step losses of experts of the pool tend to zero. Keywords: prediction with expert advice, follow the perturbed leader, unbounded losses, adaptive learning rate, expected bounds, Hannan consistency, online sequential prediction
5 0.040817387 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
Author: Trine Julie Abrahamsen, Lars Kai Hansen
Abstract: Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions: First, we propose a computationally less intensive approximate leave-one-out estimator, secondly, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA. As for PCA our analysis also suggests a simplified approximate expression. Keywords: PCA, kernel PCA, generalizability, variance renormalization
6 0.037594017 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
7 0.035668228 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
8 0.033877354 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
9 0.033188697 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
10 0.030710522 50 jmlr-2011-LPmade: Link Prediction Made Easy
11 0.030539036 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
12 0.029867113 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
13 0.029734584 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
14 0.029029369 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
15 0.027833886 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
16 0.02733507 101 jmlr-2011-Variable Sparsity Kernel Learning
17 0.027273208 54 jmlr-2011-Learning Latent Tree Graphical Models
18 0.025883868 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
19 0.025202954 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
20 0.023772333 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
topicId topicWeight
[(0, 0.153), (1, -0.032), (2, 0.017), (3, -0.06), (4, 0.015), (5, 0.044), (6, 0.009), (7, 0.011), (8, 0.158), (9, 0.203), (10, 0.011), (11, -0.21), (12, 0.071), (13, 0.125), (14, -0.18), (15, 0.119), (16, -0.006), (17, -0.139), (18, -0.104), (19, -0.151), (20, 0.051), (21, -0.057), (22, 0.159), (23, 0.094), (24, 0.037), (25, -0.013), (26, 0.189), (27, -0.119), (28, -0.163), (29, -0.126), (30, -0.097), (31, 0.293), (32, -0.144), (33, -0.072), (34, -0.008), (35, 0.145), (36, 0.037), (37, 0.074), (38, -0.097), (39, 0.008), (40, -0.038), (41, -0.047), (42, 0.129), (43, 0.172), (44, -0.018), (45, -0.068), (46, 0.031), (47, -0.062), (48, -0.046), (49, -0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.9481529 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
2 0.51464951 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
3 0.48446912 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
Author: Vladimir V. V'yugin
4 0.31923214 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
5 0.26839522 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
Author: Yoshinori Tamada, Seiya Imoto, Satoru Miyano
Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. This algorithm is based on a dynamic programming (DP) algorithm having O(n · 2n ) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes. The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes. While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(nσ ) time and space overhead calculations, where σ > 0 controls the communication-space trade-off. The overall time and space complexity is O(nσ+1 2n ). This algorithm splits the search space so that the required communication between independent calculations is minimal. Because of this advantage, our algorithm can run on distributed memory supercomputers. Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0.74, compared to the original DP algorithm with a single processor. We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in literature. Keywords: optimal Bayesian network structure, parallel algorithm
6 0.22171065 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
7 0.20306833 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
8 0.19635682 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
9 0.18810308 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
10 0.18759599 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
11 0.1709232 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
12 0.16004153 101 jmlr-2011-Variable Sparsity Kernel Learning
13 0.15390384 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
14 0.142616 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
15 0.14255251 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
16 0.14154762 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
17 0.13997249 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
18 0.13886578 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
19 0.13779707 12 jmlr-2011-Bayesian Co-Training
20 0.13621731 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
topicId topicWeight
[(4, 0.038), (9, 0.014), (10, 0.051), (24, 0.018), (31, 0.07), (32, 0.528), (41, 0.029), (60, 0.021), (71, 0.027), (73, 0.026), (78, 0.053), (90, 0.014)]
simIndex simValue paperId paperTitle
1 0.95051849 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
Author: Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, Ioannis Vlahavas
Abstract: M ULAN is a Java library for learning from multi-label data. It offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. In addition, it contains an evaluation framework that calculates a rich variety of performance measures. Keywords: multi-label data, classification, ranking, thresholding, dimensionality reduction, hierarchical classification, evaluation 1. Multi-Label Learning A multi-label data set consists of training examples that are associated with a subset of a finite set of labels. Nowadays, multi-label data are becoming ubiquitous. They arise in an increasing number and diversity of applications, such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics and music categorization into genres and emotions. There exist two major multi-label learning tasks (Tsoumakas et al., 2010): multi-label classification and label ranking. The former is concerned with learning a model that outputs a bipartition of the set of labels into relevant and irrelevant with respect to a query instance. The latter is concerned with learning a model that outputs a ranking of the labels according to their relevance to a query instance. Some algorithms learn models that serve both tasks. Several algorithms learn models that primarily output a vector of numerical scores, one for each label. This vector is then converted to a ranking after solving ties, or to a bipartition, after thresholding (Ioannou et al., 2010). Multi-label learning methods addressing these tasks can be grouped into two categories (Tsoumakas et al., 2010): problem transformation and algorithm adaptation. The first group of methods are algorithm independent. They transform the learning task into one or more singlelabel classification tasks, for which a large body of learning algorithms exists. The second group of methods extend specific learning algorithms in order to handle multi-label data directly. There exist extensions of decision tree learners, nearest neighbor classifiers, neural networks, ensemble methods, support vector machines, kernel methods, genetic algorithms and others. Multi-label learning stretches across several other tasks. When labels are structured as a treeshaped hierarchy or a directed acyclic graph, then we have the interesting task of hierarchical multilabel learning. Dimensionality reduction is another important task for multi-label data, as it is for c 2011 Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek and Ioannis Vlahavas. T SOUMAKAS , S PYROMITROS -X IOUFIS , V ILCEK AND V LAHAVAS any kind of data. When bags of instances are used to represent a training object, then multi-instance multi-label learning algorithms are required. There also exist semi-supervised learning and active learning algorithms for multi-label data. 2. The M ULAN Library The main goal of M ULAN is to bring the benefits of machine learning open source software (MLOSS) (Sonnenburg et al., 2007) to people working with multi-label data. The availability of MLOSS is especially important in emerging areas like multi-label learning, because it removes the burden of implementing related work and speeds up the scientific progress. In multi-label learning, an extra burden is implementing appropriate evaluation measures, since these are different compared to traditional supervised learning tasks. 
Evaluating multi-label algorithms with a variety of measures, is considered important by the community, due to the different types of output (bipartition, ranking) and diverse applications. Towards this goal, M ULAN offers a plethora of state-of-the-art algorithms for multi-label classification and label ranking and an evaluation framework that computes a large variety of multi-label evaluation measures through hold-out evaluation and cross-validation. In addition, the library offers a number of thresholding strategies that produce bipartitions from score vectors, simple baseline methods for multi-label dimensionality reduction and support for hierarchical multi-label classification, including an implemented algorithm. M ULAN is a library. As such, it offers only programmatic API to the library users. There is no graphical user interface (GUI) available. The possibility to use the library via command line, is also currently not supported. Another drawback of M ULAN is that it runs everything in main memory so there exist limitations with very large data sets. M ULAN is written in Java and is built on top of Weka (Witten and Frank, 2005). This choice was made in order to take advantage of the vast resources of Weka on supervised learning algorithms, since many state-of-the-art multi-label learning algorithms are based on problem transformation. The fact that several machine learning researchers and practitioners are familiar with Weka was another reason for this choice. However, many aspects of the library are independent of Weka and there are interfaces for most of the core classes. M ULAN is an advocate of open science in general. One of the unique features of the library is a recently introduced experiments package, whose goal is to host code that reproduces experimental results reported on published papers on multi-label learning. To the best of our knowledge, most of the general learning platforms, like Weka, don’t support multi-label data. There are currently only a number of implementations of specific multi-label learning algorithms, but not a general library like M ULAN. 3. Using M ULAN This section presents an example of how to setup an experiment for empirically evaluating two multi-label algorithms on a multi-label data set using cross-validation. We create a new Java class for this experiment, which we call MulanExp1.java. The first thing to do is load the multi-label data set that will be used for the empirical evaluation. M ULAN requires two text files for the specification of a data set. The first one is in the ARFF format of Weka. The labels should be specified as nominal attributes with values “0” and “1” indicating 2412 M ULAN : A JAVA L IBRARY FOR M ULTI -L ABEL L EARNING absence and presence of the label respectively. The second file is in XML format. It specifies the labels and any hierarchical relationships among them. Hierarchies of labels can be expressed in the XML file by nesting the label tag. In our example, the two filenames are given to the experiment class through command-line parameters. String arffFile = Utils.getOption(
same-paper 2 0.82043338 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
3 0.81677836 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng
Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman's minimum margin bound. Thus our result suggests that the minimum margin may be not crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier
4 0.75026023 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
5 0.36381501 102 jmlr-2011-Waffles: A Machine Learning Toolkit
Author: Michael Gashler
Abstract: We present a breadth-oriented collection of cross-platform command-line tools for researchers in machine learning called Waffles. The Waffles tools are designed to offer a broad spectrum of functionality in a manner that is friendly for scripted automation. All functionality is also available in a C++ class library. Waffles is available under the GNU Lesser General Public License. Keywords: machine learning, toolkits, data mining, C++, open source
6 0.34076408 48 jmlr-2011-Kernel Analysis of Deep Networks
7 0.31127918 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
8 0.30682692 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
9 0.30045381 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
10 0.29112327 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
11 0.28954801 91 jmlr-2011-The Sample Complexity of Dictionary Learning
12 0.2880283 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
13 0.28632227 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data
14 0.28347775 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
15 0.28277788 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
16 0.28277704 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
17 0.28072152 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
18 0.27833804 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
19 0.27762589 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
20 0.27532408 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination