jmlr jmlr2011 jmlr2011-25 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
Abstract: We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (f̂CLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that f̂CLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources. Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation. © 2011 Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira and Petri Myllymäki.
Reference: text
sentIndex sentText sentNum sentScore
1 We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (f̂CLL), for learning Bayesian network classifiers. [sent-17, score-0.262]
2 The proposed score is an approximation of the conditional log-likelihood criterion. [sent-18, score-0.09]
3 The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. [sent-19, score-0.215]
4 The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. [sent-20, score-0.143]
5 Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation. [sent-23, score-0.213]
6 One of the likely causes for this appears to be the use of so-called generative learning methods in choosing the Bayesian network structure as well as its parameters. [sent-31, score-0.067]
7 Unfortunately, discriminative learning of Bayesian network classifiers has turned out to be computationally much more challenging than generative learning. [sent-33, score-0.128]
8 This led Friedman et al. (1997) to bring up the question: are there heuristic approaches that allow efficient discriminative learning of Bayesian network classifiers? [sent-35, score-0.128]
9 Over the past years, different discriminative approaches have been proposed, which tend to decompose the problem into two tasks: (i) discriminative structure learning, and (ii) discriminative parameter learning. [sent-36, score-0.183]
10 They introduced a discriminative parameter learning algorithm, called the Extended Logistic Regression (ELR) algorithm, that uses gradient descent to maximize the conditional log-likelihood (CLL) of the class variable given the other variables. [sent-38, score-0.106]
11 Roos et al. (2005) showed that globally optimal solutions can be guaranteed only for network structures that satisfy a certain graph-theoretic property, including, for example, the naive Bayes and tree-augmented naive Bayes (TAN) structures (see Friedman et al. [sent-48, score-0.157]
12 (1998) and Grossman and Domingos (2004) propose to choose network structures by maximizing CLL while choosing parameters by maximizing the parameter posterior or the (joint) log-likelihood (LL). [sent-52, score-0.122]
13 The criteria can be used for efficient learning of augmented naive Bayes classifiers. [sent-58, score-0.083]
14 Our first criterion is the approximated conditional log-likelihood (aCLL). [sent-63, score-0.081]
15 The proposed score is the minimum variance unbiased (MVU) approximation to CLL under a class of uniform priors on certain parameters of the joint distribution of the class variable and the other attributes. [sent-64, score-0.092]
16 Second, the criterion is not well-behaved in the sense that it sometimes diverges when the parameters approach the usual relative frequency estimates (maximizing LL). [sent-68, score-0.079]
17 To address these two shortcomings, we devise a second approximation, the factorized conditional log-likelihood (f̂CLL). [sent-69, score-0.094]
18 The f̂CLL approximation is uniformly bounded, and moreover, the f̂CLL criterion is maximized by the easily obtainable relative-frequency parameter estimates. [sent-70, score-0.088]
19 To gauge the performance of the proposed criteria in classification tasks, we compare them with several popular classifiers, namely, tree-augmented naive Bayes (TAN), greedy hill-climbing (GHC), C4.5, nearest neighbor, logistic regression, and support vector machines (SVM). [sent-72, score-0.083]
20 The advantage of f̂CLL with respect to these latter classifiers is that it is computationally as efficient as the LL scoring criterion, and considerably more efficient than ELR and SVMs. [sent-77, score-0.101]
21 In Section 3 we discuss generative and discriminative learning of Bayesian network classifiers. [sent-80, score-0.128]
22 In Section 4 we present our scoring criteria, followed by experimental results in Section 5. [sent-81, score-0.101]
23 The parameters θ_{ijk}, with i ∈ {1, ..., n}, j ∈ {1, ..., q_i} and k ∈ {1, ..., r_i}, determine the local distributions in the network via P_B(X_i = x_{ik} | Π_{X_i} = w_{ij}) = θ_{ijk}. [sent-113, score-0.118]
24 The conditional independence properties pertaining to the joint distribution are essentially determined by the network structure. [sent-118, score-0.089]
25 Learning unrestricted Bayesian networks from data under typical scoring criteria is NP-hard (Chickering et al. [sent-120, score-0.15]
26 This result has led the Bayesian network community to search for the largest subclass of network structures for which there is an efficient learning algorithm. [sent-122, score-0.151]
27 First attempts confined the network to tree structures and used the optimal branching algorithms of Edmonds (1967) and Chow and Liu (1968). [sent-123, score-0.084]
28 Indeed, Chickering (1996) showed that learning the structure of a Bayesian network is NP-hard even for networks constrained to have in-degree at most two. [sent-125, score-0.087]
29 Consequently, the standard methodology for addressing the problem of learning Bayesian networks has become heuristic score-based learning where a scoring criterion φ is considered in order to quantify the capability of a Bayesian network to explain the observed data. [sent-129, score-0.247]
30 Given data D = {y_1, ..., y_N} and a scoring criterion φ, the task is to find a Bayesian network B that maximizes the score φ(B, D). [sent-133, score-0.266]
31 The most common scoring criteria are reviewed in Carvalho (2009) and Yang and Chang (2002). [sent-135, score-0.13]
32 For newly developed scoring criteria, we refer the interested reader to the works of de Campos (2006) and Silander et al. [sent-136, score-0.13]
33 Score-based learning algorithms can be extremely efficient if the employed scoring criterion is decomposable. [sent-138, score-0.16]
34 A scoring criterion φ is said to be decomposable if the score can be expressed as a sum of local scores that depend only on each node and its parents, that is, in the form φ(B, D) = ∑_{i=1}^{n} φ_i(Π_{X_i}, D). [sent-139, score-0.229]
35 (1995): LL(B | D) = ∑_{t=1}^{N} log P_B(y_{t1}, ..., y_{tn}) = ∑_{i=1}^{n} ∑_{j=1}^{q_i} ∑_{k=1}^{r_i} N_{ijk} log θ_{ijk}, [sent-141, score-0.102]
36 which is clearly decomposable. [sent-144, score-0.189]
37 The maximum likelihood (ML) parameters that maximize LL are easily obtained as the observed frequency estimates (OFE), given by θ̂_{ijk} = N_{ijk} / N_{ij}, (1) where N_{ijk} denotes the number of instances in D where X_i = x_{ik} and Π_{X_i} = w_{ij}, and N_{ij} = ∑_{k=1}^{r_i} N_{ijk}. [sent-145, score-0.155]
38 Plugging these estimates back into the LL criterion yields LL(G | D) = ∑_{i=1}^{n} ∑_{j=1}^{q_i} ∑_{k=1}^{r_i} N_{ijk} log(N_{ijk} / N_{ij}). [sent-146, score-0.215]
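As a concrete illustration of Equations (1) and (3), the following is a minimal Python sketch (not taken from the paper or its Weka implementation; the function and variable names are ours) that uses the OFE parameters implicitly and evaluates the plugged-in LL score from raw counts.

```python
from collections import Counter
from math import log2

def ll_score(data, parents):
    """Plug-in log-likelihood LL(G | D) = sum_{i,j,k} N_ijk * log2(N_ijk / N_ij)."""
    score = 0.0
    for i, pa in parents.items():
        n_ijk, n_ij = Counter(), Counter()
        for row in data:                      # data: list of tuples of discrete values
            j = tuple(row[p] for p in pa)     # parent configuration w_ij
            n_ijk[(j, row[i])] += 1           # N_ijk
            n_ij[j] += 1                      # N_ij
        for (j, _), n in n_ijk.items():
            score += n * log2(n / n_ij[j])    # OFE: theta_ijk = N_ijk / N_ij
    return score

# toy example: X1 is the only parent of X0, X1 has no parents
data = [(0, 0), (0, 0), (1, 1), (1, 0)]
print(ll_score(data, {0: (1,), 1: ()}))
```

Because the score is a sum over nodes, each node's term can be cached and reused, which is what makes decomposability computationally attractive.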
39 The notation with G as the argument instead of B = (G, Θ) emphasizes that once the use of the OFE parameters is decided upon, the criterion is a function of the network structure, G, only. [sent-147, score-0.126]
40 The LL scoring criterion tends to favor complex network structures with many edges since adding an edge never decreases the likelihood. [sent-148, score-0.244]
41 Bayesian Network Classifiers. A Bayesian network classifier is a Bayesian network over X = (X_1, ..., X_n, C), where C is the class variable. [sent-151, score-0.134]
42 We focus on augmented naive Bayes classifiers, that is, Bayesian network classifiers where the class variable has no parents, Π_C = ∅, and all attributes have at least the class variable as a parent, C ∈ Π_{X_i} for all X_i. [sent-163, score-0.139]
43 Similarly to the general case, local distributions are determined by the corresponding parameters P(C = z_c) = θ_c and P(X_i = x_{ik} | Π*_{X_i} = w*_{ij}, C = z_c) = θ_{ijck}. [sent-173, score-0.396]
44 We denote by N_{ijck} the number of instances in the data D where X_i = x_{ik}, Π*_{X_i} = w*_{ij} and C = z_c. [sent-174, score-0.38]
45 Moreover, the following short-hand notations will become useful: N_{ij*k} = ∑_{c=1}^{s} N_{ijck}, N_{ijc} = ∑_{k=1}^{r_i} N_{ijck}, N_{ij*} = ∑_{k=1}^{r_i} ∑_{c=1}^{s} N_{ijck}, and N_c = (1/n) ∑_{i=1}^{n} ∑_{j=1}^{q*_i} ∑_{k=1}^{r_i} N_{ijck}. [sent-175, score-1.751]
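This count bookkeeping can be sketched directly; the Python fragment below (our own naming, not the paper's code) builds N_{ijck} and the shorthand marginals from data.

```python
from collections import Counter

def classifier_counts(data, attr_parents):
    """N_ijck: instances with X_i = x_ik, Pi*_i = w_ij and C = z_c.
    data is a list of (x_1, ..., x_n, c) tuples with the class last."""
    counts = {i: Counter() for i in attr_parents}
    for row in data:
        *x, c = row
        for i, pa in attr_parents.items():
            j = tuple(x[p] for p in pa)
            counts[i][(j, c, x[i])] += 1
    return counts

def shorthands(n_ijck):
    """Marginal counts N_ij*k, N_ijc and N_ij* for one attribute."""
    n_ij_star_k, n_ijc, n_ij_star = Counter(), Counter(), Counter()
    for (j, c, k), n in n_ijck.items():
        n_ij_star_k[(j, k)] += n   # summed over classes
        n_ijc[(j, c)] += n         # summed over values of X_i
        n_ij_star[j] += n          # summed over both
    return n_ij_star_k, n_ijc, n_ij_star

# toy example: attribute 1 has attribute 0 as an extra parent besides the class
data = [(0, 0, 0), (0, 1, 0), (1, 1, 1), (1, 0, 1)]
print(shorthands(classifier_counts(data, {0: (), 1: (0,)})[1]))
```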
46 The ML estimates (1) now become θ̂_c = N_c / N and θ̂_{ijck} = N_{ijck} / N_{ijc}, (2) which can again be plugged into the LL criterion: [sent-177, score-0.951]
47 LL(G | D) = ∑_{t=1}^{N} log P_B(y_{t1}, ..., y_{tn}, c_t) = ∑_{c=1}^{s} N_c log(N_c / N) + ∑_{i=1}^{n} ∑_{j=1}^{q*_i} ∑_{k=1}^{r_i} ∑_{c=1}^{s} N_{ijck} log(N_{ijck} / N_{ijc}). [sent-180, score-1.192]
48 Interestingly, the objective of generative learning is precisely to maximize the whole sum in (4), whereas the goal of discriminative learning consists in maximizing only the first term in (4). [sent-190, score-0.103]
49 For convenience, we denote the two terms in the denominator as U_t = P_B(y_{t1}, ..., y_{tn}, c_t) and V_t = P_B(y_{t1}, ..., y_{tn}, 1 − c_t), (6) [sent-214, score-0.139]
50 so that Equation (5) becomes simply P_B(c_t | y_{t1}, ..., y_{tn}) = U_t / (U_t + V_t). [sent-220, score-0.139]
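A small sketch of how U_t, V_t and the exact CLL would be computed for a binary class; it assumes some routine joint_prob that evaluates P_B(y, c) under the current network (that routine is not shown and its name is ours).

```python
from math import log2

def cll(data, joint_prob):
    """CLL(B | D) = sum_t log P_B(c_t | y_t), with P_B(c_t | y_t) = U_t / (U_t + V_t)."""
    total = 0.0
    for y, c in data:                  # data: list of (attribute_tuple, class) pairs
        u = joint_prob(y, c)           # U_t = P_B(y_t, c_t)
        v = joint_prob(y, 1 - c)       # V_t = P_B(y_t, 1 - c_t), the dual sample
        total += log2(u / (u + v))
    return total

# toy example with a hand-specified joint table over one binary attribute and the class
table = {((0,), 0): 0.4, ((0,), 1): 0.1, ((1,), 0): 0.1, ((1,), 1): 0.4}
print(cll([((0,), 0), ((1,), 1)], lambda y, c: table[(y, c)]))
```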
51 Here (y_{t1}, ..., y_{tn}, c_t) is the t'th sample in the data set D, [sent-228, score-0.139]
52 and the vector (y_{t1}, ..., y_{tn}, 1 − c_t) is what we call the dual sample of (y_{t1}, ..., y_{tn}, c_t). [sent-231, score-0.139]
53 Recall that our goal is to derive a decomposable scoring criterion. [sent-236, score-0.131]
54 The constants in the approximation f̂ are given by α = (π² + 6)/24, (8) β = (π² − 18)/24, (9) and γ = −((π² − 6)/12) log p − 2 + π²/(12 ln 2), (10) where log is the binary logarithm and ln is the natural logarithm. [sent-262, score-0.186]
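These constants are easy to evaluate numerically; the sketch below also runs a small Monte-Carlo check of the unbiasedness claim in Theorem 2. The exact expression for γ is our reading of the garbled formula above, so treat it as an assumption.

```python
import random
from math import pi, log, log2

def acll_constants(p):
    alpha = (pi**2 + 6) / 24                                              # Equation (8)
    beta = (pi**2 - 18) / 24                                              # Equation (9)
    gamma = -((pi**2 - 6) / 12) * log2(p) - 2 + pi**2 / (12 * log(2))     # Equation (10), as read above
    return alpha, beta, gamma

def mean_error(p, trials=200_000):
    """E[log2(u/(u+v)) - (alpha*log2 u + beta*log2 v + gamma)] for u, v ~ Uniform(0, p)."""
    a, b, g = acll_constants(p)
    total = 0.0
    for _ in range(trials):
        u = (1 - random.random()) * p          # sample in (0, p], avoids log2(0)
        v = (1 - random.random()) * p
        total += log2(u / (u + v)) - (a * log2(u) + b * log2(v) + g)
    return total / trials

print(acll_constants(0.5))   # alpha ~ 0.66, beta ~ -0.34
print(mean_error(0.5))       # should be close to zero if the constants are right
```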
55 Theorem 2 The approximation f̂ with α, β, γ defined as in Theorem 1 is the minimum variance unbiased (MVU) approximation of f. [sent-265, score-0.082]
56 Moreover, an approximation is (uniformly) minimum variance unbiased if the value E[(f̂(U_t, V_t) − f(U_t, V_t))²] is the lowest among all unbiased approximations and all values of p. [sent-271, score-0.077]
57 Using the approximation in Theorem 1 to approximate CLL yields CLL(B | D) ≈ ∑_{t=1}^{N} (α log U_t + β log V_t + γ) = ∑_{t=1}^{N} ((α + β) log U_t − β log(U_t / V_t) + γ) = (α + β) LL(B | D) − β ∑_{t=1}^{N} log(U_t / V_t) + Nγ, (11) where the constants α, β and γ are given by Equations (8), (9) and (10), respectively. [sent-284, score-0.101]
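A sketch of Equation (11) in code (our naming; joint_prob is assumed as before, and the constants use our reading of (8)–(10)).

```python
from math import pi, log, log2

def acll(data, joint_prob, p):
    """aCLL(B | D) ~= (alpha + beta) * LL(B | D) - beta * sum_t log(U_t / V_t) + N * gamma."""
    alpha = (pi**2 + 6) / 24
    beta = (pi**2 - 18) / 24
    gamma = -((pi**2 - 6) / 12) * log2(p) - 2 + pi**2 / (12 * log(2))
    ll, ratio = 0.0, 0.0
    for y, c in data:
        u, v = joint_prob(y, c), joint_prob(y, 1 - c)   # U_t and V_t
        ll += log2(u)                                   # sum_t log U_t = LL(B | D)
        ratio += log2(u / v)
    return (alpha + beta) * ll - beta * ratio + len(data) * gamma
```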
58 We are now in the position of having constructed a decomposable approximation of the conditional log-likelihood score that was shown to be very accurate for a wide range of parameters Ut and Vt . [sent-288, score-0.12]
59 Due to the dependency of these parameters on Θ, that is, the parameters of the Bayesian network B (recall Equation (6)), the score still requires that a suitable set of parameters is chosen. [sent-289, score-0.106]
60 To better see why this is the case, substitute the OFE parameters, Equation (2), into the aCLL criterion, Equation (12), to obtain âCLL(G | D) = (α + β) LL(G | D) − β ∑_{i=1}^{n} ∑_{j=1}^{q*_i} ∑_{k=1}^{r_i} ∑_{c=0}^{1} N_{ijck} log((N_{ijck} N_{ij(1−c)}) / (N_{ijc} N_{ij(1−c)k})) − β ∑_{c=0}^{1} N_c log(N_c / N_{1−c}). [sent-336, score-1.032]
61 In the LL and CLL criteria, similar expressions whose denominator may be zero are always eliminated by the OFE parameters, since they are always multiplied by zero; see, for example, Equation (3), where N_{ijc} = 0 implies N_{ijck} = 0. [sent-338, score-0.575]
62 Given a fixed network structure, the parameters that maximize the first term, (α + β)LL(B | D), are given by OFE. [sent-344, score-0.09]
63 Let f(B | D) be a score defined by f(B | D) = ∑_{i=1}^{n} ∑_{j=1}^{q*_i} ∑_{k=1}^{r_i} ∑_{c=0}^{1} N_{ijck} λ log( θ_{ijck} / ((N_{ijc}/N_{ij*}) θ_{ijck} + (N_{ij(1−c)}/N_{ij*}) θ_{ij(1−c)k}) ), (14) where λ is an arbitrary positive real value. [sent-352, score-1.396]
64 To clarify the analysis, we introduce the following short-hand notations: A_1 = N_{ijc} θ_{ijck}, A_2 = N_{ijc}, B_1 = N_{ij(1−c)} θ_{ij(1−c)k}, B_2 = N_{ij(1−c)}. [sent-356, score-0.81]
65 With simple algebra, we can rewrite the logarithm in the second term of Equation (12) using the above notations as log((N_{ijc} θ_{ijck}) / (N_{ij(1−c)} θ_{ij(1−c)k})) − log(N_{ijc} / N_{ij(1−c)}) = log(A_1 / B_1) − log(A_2 / B_2). [sent-357, score-0.954]
66 Assumption 2 We assume that N_{ijc} θ_{ijck} / (N_{ijc} θ_{ijck} + N_{ij(1−c)} θ_{ij(1−c)k}) ∼ Uniform(0, 1) and N_{ijc} / (N_{ijc} + N_{ij(1−c)}) ∼ Uniform(0, 1). [sent-367, score-1.62]
67 Figure 3: Plot of g(w) = log(w / (1 − w)) and ĝ(w) = λ log w + ρ. [sent-373, score-0.072]
68 Hence, Assumption 2 reduces to N_{ijck} / N_{ij*k} ∼ Uniform(0, 1) and N_{ijc} / N_{ij*} ∼ Uniform(0, 1). [sent-375, score-0.095]
69 The constants are λ = π²/6 (18) and ρ = π²/(6 ln 2). (19) Theorem 6 The approximation ĝ with λ and ρ defined as in Theorem 5 is the minimum variance unbiased (MVU) approximation of g. [sent-381, score-0.139]
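As with α, β and γ, the constants λ and ρ can be checked numerically; the average error below should be near zero if our reconstruction of (18)–(19) is correct (again an assumption on our part).

```python
import random
from math import pi, log, log2

lam = pi**2 / 6              # Equation (18), as read above
rho = pi**2 / (6 * log(2))   # Equation (19), as read above

# mean error of approximating g(w) = log2(w / (1 - w)) by lam * log2(w) + rho, w ~ Uniform(0, 1)
errors = []
for _ in range(200_000):
    w = 1 - random.random()                      # sample in (0, 1], avoids log2(0)
    if w == 1.0:
        continue                                 # avoid division by zero in 1 - w
    errors.append(log2(w / (1 - w)) - (lam * log2(w) + rho))
print(sum(errors) / len(errors))                 # should be close to zero
```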
70 To get an idea of the accuracy of the approximation ĝ, consider the graphs of log(w / (1 − w)) and λ log w + ρ in Figure 3. [sent-382, score-0.101]
71 A similar analysis can be applied to rewrite the logarithm of the third term in Equation (12), leading to log(θ_c / θ_{(1−c)}) = log(θ_c / (1 − θ_c)) ≈ λ log θ_c + ρ. [sent-386, score-0.108]
72 Observe that the third term of Equation (22) is such that −βλ ∑_{c=0}^{1} N_c log θ_c = −βλ N ∑_{c=0}^{1} (N_c / N) log θ_c, (23) and, since β < 0, by Gibbs' inequality (see Lemma 8 in the Appendix) the parameters that maximize Equation (23) are given by the OFE, that is, θ̂_c = N_c / N. [sent-388, score-0.095]
73 Information-Theoretical Interpretation. Before we present empirical results illustrating the behavior of the proposed scoring criteria, we point out that the f̂CLL criterion has an interesting information-theoretic interpretation based on interaction information. [sent-394, score-0.183]
74 As Friedman et al. (1997) point out, the local contribution of the i'th variable to LL(B | D) (recall Equation (3)) is given by N ∑_{j=1}^{q*_i} ∑_{c=0}^{1} ∑_{k=1}^{r_i} (N_{ijck} / N) log(N_{ijck} / N_{ijc}) = −N H_{P̂_D}(X_i | Π*_{X_i}, C) = −N H_{P̂_D}(X_i | C) + N I_{P̂_D}(X_i ; Π*_{X_i} | C), (25) where H_{P̂_D} denotes (conditional) entropy under the empirical distribution P̂_D. [sent-397, score-0.996]
75 Since I_{P̂_D}(X_i ; C) on the last line of Equation (26) does not depend on Π*_{X_i}, finding the parents of X_i that maximize the sum amounts to maximizing the interaction information. [sent-405, score-0.099]
76 All said, the f̂CLL criterion can be written as f̂CLL(G | D) = ∑_{i=1}^{n} ((α + β) N I_{P̂_D}(X_i ; Π*_{X_i} | C) − βλ N I_{P̂_D}(C ; X_i ; Π*_{X_i})) + const, (27) where const is a constant independent of the network structure and can thus be omitted. [sent-407, score-0.126]
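The information-theoretic form in Equation (27) suggests a direct way to score a candidate parent for one attribute from empirical (conditional) mutual information. The sketch below is ours, not the paper's implementation, and it assumes the interaction-information convention I(C; X; Z) = I(X; Z | C) − I(X; Z).

```python
from collections import Counter
from math import pi, log2

def mutual_info(pairs):
    """Empirical I(A; B) from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(c / n * log2(c * n / (pa[a] * pb[b])) for (a, b), c in joint.items())

def cond_mutual_info(triples):
    """Empirical I(A; B | C) from a list of (a, b, c) samples."""
    n = len(triples)
    classes = Counter(c for _, _, c in triples)
    return sum(classes[c0] / n * mutual_info([(a, b) for a, b, c in triples if c == c0])
               for c0 in classes)

def fcll_local(xi, pa, cls):
    """Local contribution of Equation (27) for one attribute and one candidate parent:
    (alpha+beta) * N * I(X_i; Pi*_i | C) - beta*lambda * N * I(C; X_i; Pi*_i)."""
    n = len(xi)
    alpha_plus_beta = (pi**2 - 6) / 12
    beta_lambda = ((pi**2 - 18) / 24) * (pi**2 / 6)         # beta < 0, so -beta*lambda > 0
    i_cond = cond_mutual_info(list(zip(xi, pa, cls)))
    interaction = i_cond - mutual_info(list(zip(xi, pa)))   # assumed sign convention
    return alpha_plus_beta * n * i_cond - beta_lambda * n * interaction

# toy example: score attribute X1 as a candidate parent of X0
X0 = [0, 0, 1, 1, 0, 1]
X1 = [0, 1, 1, 0, 0, 1]
C  = [0, 0, 1, 1, 0, 1]
print(fcll_local(X0, X1, C))
```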
77 The first term has a relative weight of approximately 36.7 percent, while the second term, which gives the f̂CLL criterion its discriminative nature, has the weight 63.3 percent. [sent-412, score-0.12]
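These two percentages can be reproduced from the constants (using our reconstructed values), which serves as a quick consistency check.

```python
from math import pi

generative_weight = (pi**2 - 6) / 12                         # alpha + beta
discriminative_weight = -((pi**2 - 18) / 24) * (pi**2 / 6)   # -beta * lambda
total = generative_weight + discriminative_weight
print(round(100 * generative_weight / total, 1),             # ~36.7
      round(100 * discriminative_weight / total, 1))         # ~63.3
```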
78 In addition to the insight provided by the information-theoretic interpretation of f̂CLL, it also yields a most useful practical corollary: the f̂CLL criterion is score equivalent. [sent-414, score-0.098]
79 A scoring criterion is said to be score equivalent if it assigns the same score to all network structures encoding the same independence assumptions, see Verma and Pearl (1990), Chickering (2002), Yang and Chang (2002) and de Campos (2006). [sent-415, score-0.322]
80 Theorem 7 The f̂CLL criterion is score equivalent for augmented naive Bayes classifiers. [sent-416, score-0.152]
81 Beyond Binary Classification and TAN. Although the aCLL and f̂CLL scoring criteria were devised with binary classification tasks in mind, their application to multi-class problems is straightforward. [sent-426, score-0.13]
82 To adapt it for multi-class problems, we considered N_{ij(1−c)k} = N_{ij*k} − N_{ijck} and N_{ij(1−c)} = N_{ij} − N_{ijc}. [sent-430, score-0.575]
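In code, the multi-class adaptation is just a "rest of the classes" lumping of the counts; a tiny sketch (names are ours):

```python
def rest_counts(n_ijck, n_ij_star_k, n_ij, n_ijc):
    """N_ij(1-c)k = N_ij*k - N_ijck and N_ij(1-c) = N_ij - N_ijc:
    every class other than c is treated as one aggregate class."""
    return n_ij_star_k - n_ijck, n_ij - n_ijc

print(rest_counts(3, 10, 25, 8))   # -> (7, 17)
```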
83 Finally, we point out that despite being derived under the augmented naive Bayes model, the f̂CLL score can be readily applied to models where the class variable is not a parent of some of the attributes. [sent-431, score-0.093]
84 Experimental Results. We implemented the f̂CLL scoring criterion on top of the Weka open-source software (Hall et al. [sent-435, score-0.16]
85 Unfortunately, the Weka implementation of the TAN classifier assumes that the learning criterion is score equivalent, which rules out the use of our aCLL criterion. [sent-437, score-0.098]
86 We evaluated the performance of the aCLL and f̂CLL scoring criteria in classification tasks, comparing them with state-of-the-art classifiers. [sent-444, score-0.13]
87 Bayesian network-based classifiers (GHC2 and TAN) were included in different flavors, differing in the scoring criterion used for structure learning (LL, aCLL, f̂CLL) and the parameter estimator (OFE, ELR). [sent-481, score-0.16]
88 The f̂CLL-based classifiers outperformed C4.5, the nearest neighbor classifiers, and logistic regression, as well as the generatively-trained Bayesian network classifiers, TAN-LL-OFE and GHC-LL-OFE, all differences being statistically significant at the p < 0.05 level. [sent-523, score-0.085]
89 Conclusions and Future Work. We proposed a new decomposable scoring criterion for classification tasks. [sent-627, score-0.19]
90 The new score, called the factorized conditional log-likelihood (f̂CLL), is based on an approximation of the conditional log-likelihood. [sent-628, score-0.145]
91 Moreover, the criterion is specifically designed for discriminative learning. [sent-631, score-0.12]
92 The merits of the new scoring criterion were evaluated and compared to those of common state-of-the-art classifiers, on a large suite of benchmark data sets from the UCI repository. [sent-632, score-0.16]
93 Optimal f̂CLL-scored tree-augmented naive Bayes (TAN) classifiers, as well as somewhat more general structures (referred to above as GHC2), performed better than generatively-trained Bayesian network classifiers, as well as C4.5. [sent-633, score-0.112]
94 Proof (Theorem 2) We have that ∫_0^p ∫_0^p (1/p²) (log(x / (x + y)) − (α log(x) + β log(y) + γ)) dy dx = 0 for α, β and γ defined as in (8), (9) and (10). [sent-657, score-0.086]
95 Proof (Theorem 3) We have that ∫_0^p ∫_0^p (1/p²) (log(x / (x + y)) − (α log(x) + β log(y) + γ))² dy dx = (36 + 36π² − π⁴) / (288 ln²(2)) − 2 for α, β and γ defined as in (8), (9) and (10), which concludes the proof. [sent-659, score-0.086]
96 Moreover, if we take the OFE for the parameters, we have θ̂_{ijck} = N_{ijck} / N_{ijc} and θ̂_{ij(1−c)k} = N_{ij(1−c)k} / N_{ij(1−c)}. [sent-664, score-0.34]
97 Proof (Theorem 5) We have that S(λ, ρ) = ∫_0^1 (log(x / (1 − x)) − (λ log(x) + ρ))² dx, and setting the gradient ∇S(λ, ρ) to zero yields the values in (18) and (19). [sent-667, score-0.072]
98 Since f̂CLL is a local scoring criterion, f̂CLL(G_1 | D) can be computed from f̂CLL(G_0 | D) by taking into account only the difference in the contribution of node Y. [sent-679, score-0.101]
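This local-update argument is what a search procedure exploits in practice; a minimal sketch (ours, not Weka's code) of rescoring a decomposable criterion after changing one node's parent set:

```python
def rescore_after_change(local_scores, changed_node, new_local_score):
    """Only the term of the node whose parent set changed is recomputed;
    all other cached local terms are reused."""
    updated = dict(local_scores)
    updated[changed_node] = new_local_score
    return updated, sum(updated.values())

cache = {"X1": -3.2, "X2": -1.5, "Y": -2.0}
print(rescore_after_change(cache, "Y", -1.7))
```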
99 A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. [sent-782, score-0.161]
100 On discriminative Bayesian network classifiers and logistic regression. [sent-933, score-0.146]
wordName wordTfidf (topN-words)
[('ni', 0.383), ('fcll', 0.375), ('jck', 0.34), ('ipd', 0.319), ('ofe', 0.248), ('jc', 0.235), ('acll', 0.177), ('cll', 0.163), ('ut', 0.16), ('hpd', 0.156), ('roos', 0.149), ('elr', 0.113), ('arvalho', 0.106), ('liveira', 0.106), ('yllym', 0.106), ('scoring', 0.101), ('vt', 0.101), ('tan', 0.1), ('onditional', 0.099), ('ytn', 0.099), ('ll', 0.091), ('aki', 0.09), ('weka', 0.08), ('nipd', 0.078), ('ers', 0.076), ('factorized', 0.072), ('og', 0.07), ('network', 0.067), ('pb', 0.065), ('discriminative', 0.061), ('criterion', 0.059), ('ikelihood', 0.058), ('ln', 0.057), ('nut', 0.057), ('nvt', 0.057), ('domingos', 0.054), ('greiner', 0.05), ('logut', 0.05), ('bayesian', 0.049), ('grossman', 0.048), ('classi', 0.046), ('ri', 0.045), ('nc', 0.043), ('nrt', 0.043), ('ct', 0.04), ('score', 0.039), ('log', 0.036), ('logr', 0.035), ('parents', 0.034), ('carvalho', 0.032), ('su', 0.03), ('decomposable', 0.03), ('approximation', 0.029), ('criteria', 0.029), ('edmonds', 0.028), ('gibb', 0.028), ('naive', 0.028), ('xi', 0.028), ('myllym', 0.027), ('jk', 0.027), ('dy', 0.026), ('augmented', 0.026), ('friedman', 0.026), ('dx', 0.024), ('unbiased', 0.024), ('xik', 0.024), ('helsinki', 0.024), ('maximize', 0.023), ('rt', 0.023), ('interaction', 0.023), ('conditional', 0.022), ('chow', 0.022), ('discretization', 0.022), ('zhou', 0.021), ('rbf', 0.021), ('qi', 0.021), ('ibk', 0.021), ('logvt', 0.021), ('svmg', 0.021), ('teemu', 0.021), ('chickering', 0.021), ('diverges', 0.02), ('kaufmann', 0.02), ('networks', 0.02), ('maximizing', 0.019), ('arrow', 0.019), ('mutual', 0.018), ('attributes', 0.018), ('logistic', 0.018), ('decomposability', 0.018), ('oliveira', 0.018), ('svm', 0.018), ('heckerman', 0.017), ('structures', 0.017), ('dirichlet', 0.017), ('mvu', 0.016), ('kohavi', 0.016), ('zc', 0.016), ('petri', 0.016), ('silander', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
2 0.18960291 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique
3 0.067074239 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
4 0.063109636 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
Author: Vladimir V. V'yugin
Abstract: In this paper the sequential prediction problem with expert advice is considered for the case where losses of experts suffered at each step cannot be bounded in advance. We present some modification of Kalai and Vempala algorithm of following the perturbed leader where weights depend on past losses of the experts. New notions of a volume and a scaled fluctuation of a game are introduced. We present a probabilistic algorithm protected from unrestrictedly large one-step losses. This algorithm has the optimal performance in the case when the scaled fluctuations of one-step losses of experts of the pool tend to zero. Keywords: prediction with expert advice, follow the perturbed leader, unbounded losses, adaptive learning rate, expected bounds, Hannan consistency, online sequential prediction
5 0.040817387 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
Author: Trine Julie Abrahamsen, Lars Kai Hansen
Abstract: Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions: First, we propose a computationally less intensive approximate leave-one-out estimator, secondly, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA. As for PCA our analysis also suggests a simplified approximate expression. Keywords: PCA, kernel PCA, generalizability, variance renormalization
6 0.037594017 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
7 0.035668228 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
8 0.033877354 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
9 0.033188697 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
10 0.030710522 50 jmlr-2011-LPmade: Link Prediction Made Easy
11 0.030539036 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
12 0.029867113 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
13 0.029734584 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
14 0.029029369 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
15 0.027833886 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
16 0.02733507 101 jmlr-2011-Variable Sparsity Kernel Learning
17 0.027273208 54 jmlr-2011-Learning Latent Tree Graphical Models
18 0.025883868 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
19 0.025202954 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
20 0.023772333 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
topicId topicWeight
[(0, 0.153), (1, -0.032), (2, 0.017), (3, -0.06), (4, 0.015), (5, 0.044), (6, 0.009), (7, 0.011), (8, 0.158), (9, 0.203), (10, 0.011), (11, -0.21), (12, 0.071), (13, 0.125), (14, -0.18), (15, 0.119), (16, -0.006), (17, -0.139), (18, -0.104), (19, -0.151), (20, 0.051), (21, -0.057), (22, 0.159), (23, 0.094), (24, 0.037), (25, -0.013), (26, 0.189), (27, -0.119), (28, -0.163), (29, -0.126), (30, -0.097), (31, 0.293), (32, -0.144), (33, -0.072), (34, -0.008), (35, 0.145), (36, 0.037), (37, 0.074), (38, -0.097), (39, 0.008), (40, -0.038), (41, -0.047), (42, 0.129), (43, 0.172), (44, -0.018), (45, -0.068), (46, 0.031), (47, -0.062), (48, -0.046), (49, -0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.9481529 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
2 0.51464951 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
Author: Cassio P. de Campos, Qiang Ji
3 0.48446912 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
Author: Vladimir V. V'yugin
4 0.31923214 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
5 0.26839522 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
Author: Yoshinori Tamada, Seiya Imoto, Satoru Miyano
Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. This algorithm is based on a dynamic programming (DP) algorithm having O(n · 2n ) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes. The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes. While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(nσ ) time and space overhead calculations, where σ > 0 controls the communication-space trade-off. The overall time and space complexity is O(nσ+1 2n ). This algorithm splits the search space so that the required communication between independent calculations is minimal. Because of this advantage, our algorithm can run on distributed memory supercomputers. Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0.74, compared to the original DP algorithm with a single processor. We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in literature. Keywords: optimal Bayesian network structure, parallel algorithm
6 0.22171065 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
7 0.20306833 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
8 0.19635682 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
9 0.18810308 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
10 0.18759599 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
11 0.1709232 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
12 0.16004153 101 jmlr-2011-Variable Sparsity Kernel Learning
13 0.15390384 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
14 0.142616 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
15 0.14255251 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
16 0.14154762 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
17 0.13997249 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
18 0.13886578 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
19 0.13779707 12 jmlr-2011-Bayesian Co-Training
20 0.13621731 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
topicId topicWeight
[(4, 0.038), (9, 0.014), (10, 0.051), (24, 0.018), (31, 0.07), (32, 0.528), (41, 0.029), (60, 0.021), (71, 0.027), (73, 0.026), (78, 0.053), (90, 0.014)]
simIndex simValue paperId paperTitle
1 0.95051849 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
Author: Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, Ioannis Vlahavas
Abstract: M ULAN is a Java library for learning from multi-label data. It offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. In addition, it contains an evaluation framework that calculates a rich variety of performance measures. Keywords: multi-label data, classification, ranking, thresholding, dimensionality reduction, hierarchical classification, evaluation 1. Multi-Label Learning A multi-label data set consists of training examples that are associated with a subset of a finite set of labels. Nowadays, multi-label data are becoming ubiquitous. They arise in an increasing number and diversity of applications, such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics and music categorization into genres and emotions. There exist two major multi-label learning tasks (Tsoumakas et al., 2010): multi-label classification and label ranking. The former is concerned with learning a model that outputs a bipartition of the set of labels into relevant and irrelevant with respect to a query instance. The latter is concerned with learning a model that outputs a ranking of the labels according to their relevance to a query instance. Some algorithms learn models that serve both tasks. Several algorithms learn models that primarily output a vector of numerical scores, one for each label. This vector is then converted to a ranking after solving ties, or to a bipartition, after thresholding (Ioannou et al., 2010). Multi-label learning methods addressing these tasks can be grouped into two categories (Tsoumakas et al., 2010): problem transformation and algorithm adaptation. The first group of methods are algorithm independent. They transform the learning task into one or more singlelabel classification tasks, for which a large body of learning algorithms exists. The second group of methods extend specific learning algorithms in order to handle multi-label data directly. There exist extensions of decision tree learners, nearest neighbor classifiers, neural networks, ensemble methods, support vector machines, kernel methods, genetic algorithms and others. Multi-label learning stretches across several other tasks. When labels are structured as a treeshaped hierarchy or a directed acyclic graph, then we have the interesting task of hierarchical multilabel learning. Dimensionality reduction is another important task for multi-label data, as it is for c 2011 Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek and Ioannis Vlahavas. T SOUMAKAS , S PYROMITROS -X IOUFIS , V ILCEK AND V LAHAVAS any kind of data. When bags of instances are used to represent a training object, then multi-instance multi-label learning algorithms are required. There also exist semi-supervised learning and active learning algorithms for multi-label data. 2. The M ULAN Library The main goal of M ULAN is to bring the benefits of machine learning open source software (MLOSS) (Sonnenburg et al., 2007) to people working with multi-label data. The availability of MLOSS is especially important in emerging areas like multi-label learning, because it removes the burden of implementing related work and speeds up the scientific progress. In multi-label learning, an extra burden is implementing appropriate evaluation measures, since these are different compared to traditional supervised learning tasks. 
Evaluating multi-label algorithms with a variety of measures, is considered important by the community, due to the different types of output (bipartition, ranking) and diverse applications. Towards this goal, M ULAN offers a plethora of state-of-the-art algorithms for multi-label classification and label ranking and an evaluation framework that computes a large variety of multi-label evaluation measures through hold-out evaluation and cross-validation. In addition, the library offers a number of thresholding strategies that produce bipartitions from score vectors, simple baseline methods for multi-label dimensionality reduction and support for hierarchical multi-label classification, including an implemented algorithm. M ULAN is a library. As such, it offers only programmatic API to the library users. There is no graphical user interface (GUI) available. The possibility to use the library via command line, is also currently not supported. Another drawback of M ULAN is that it runs everything in main memory so there exist limitations with very large data sets. M ULAN is written in Java and is built on top of Weka (Witten and Frank, 2005). This choice was made in order to take advantage of the vast resources of Weka on supervised learning algorithms, since many state-of-the-art multi-label learning algorithms are based on problem transformation. The fact that several machine learning researchers and practitioners are familiar with Weka was another reason for this choice. However, many aspects of the library are independent of Weka and there are interfaces for most of the core classes. M ULAN is an advocate of open science in general. One of the unique features of the library is a recently introduced experiments package, whose goal is to host code that reproduces experimental results reported on published papers on multi-label learning. To the best of our knowledge, most of the general learning platforms, like Weka, don’t support multi-label data. There are currently only a number of implementations of specific multi-label learning algorithms, but not a general library like M ULAN. 3. Using M ULAN This section presents an example of how to setup an experiment for empirically evaluating two multi-label algorithms on a multi-label data set using cross-validation. We create a new Java class for this experiment, which we call MulanExp1.java. The first thing to do is load the multi-label data set that will be used for the empirical evaluation. M ULAN requires two text files for the specification of a data set. The first one is in the ARFF format of Weka. The labels should be specified as nominal attributes with values “0” and “1” indicating 2412 M ULAN : A JAVA L IBRARY FOR M ULTI -L ABEL L EARNING absence and presence of the label respectively. The second file is in XML format. It specifies the labels and any hierarchical relationships among them. Hierarchies of labels can be expressed in the XML file by nesting the label tag. In our example, the two filenames are given to the experiment class through command-line parameters. String arffFile = Utils.getOption(
same-paper 2 0.82043338 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
3 0.81677836 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng
Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman's minimum margin bound. Thus our result suggests that the minimum margin may be not crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier
4 0.75026023 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
5 0.36381501 102 jmlr-2011-Waffles: A Machine Learning Toolkit
Author: Michael Gashler
Abstract: We present a breadth-oriented collection of cross-platform command-line tools for researchers in machine learning called Waffles. The Waffles tools are designed to offer a broad spectrum of functionality in a manner that is friendly for scripted automation. All functionality is also available in a C++ class library. Waffles is available under the GNU Lesser General Public License. Keywords: machine learning, toolkits, data mining, C++, open source
6 0.34076408 48 jmlr-2011-Kernel Analysis of Deep Networks
7 0.31127918 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
8 0.30682692 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
9 0.30045381 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
10 0.29112327 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
11 0.28954801 91 jmlr-2011-The Sample Complexity of Dictionary Learning
12 0.2880283 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
13 0.28632227 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data
14 0.28347775 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
15 0.28277788 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
16 0.28277704 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
17 0.28072152 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
18 0.27833804 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
19 0.27762589 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
20 0.27532408 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination