jmlr jmlr2006 jmlr2006-46 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pieter Abbeel, Daphne Koller, Andrew Y. Ng
Abstract: We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded degree can be learned in polynomial time and from a polynomial number of training examples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks. In addition to our main result, we show that the sample complexity of parameter learning in graphical models has an O(1) dependence on the number of variables in the model when using the KL-divergence normalized by the number of variables as the performance criterion.1 Keywords: probabilistic graphical models, parameter and structure learning, factor graphs, Markov networks, Bayesian networks
Reference: text
sentIndex sentText sentNum sentScore
1 The Hammersley-Clifford canonical parameterization expresses the distribution as a product of probabilities over all variables. [sent-42, score-0.348]
2 Section 2 provides necessary background about Gibbs distributions, the factor graph associated with a Gibbs distribution, Markov blankets and the Hammersley-Clifford canonical parameterization. [sent-78, score-0.385]
3 In Section 3, building on the canonical parameterization for factor graphs, we derive our novel parameterization, which forms the basis of our parameter estimation algorithm. [sent-81, score-0.378]
4 Preliminaries In this section we first introduce Gibbs distributions, the factor graph associated with a Gibbs distribution, Markov blankets and the canonical parameterization. [sent-89, score-0.385]
5 The factor graph associated with a Gibbs distribution is a bipartite graph whose nodes correspond to variables and factors, with an edge between a variable X and a factor f j if the scope of f j contains X. [sent-107, score-0.351]
6 Thus, the Markov blanket of a set of variables D is the minimal set of variables that separates D from the other variables in the factor graph. [sent-122, score-0.505]
7 For any Gibbs distribution, we have, for any subset of random variables D, that D ⊥ X − D − MB(D) | MB(D), (1) or in words: given its Markov blanket MB(D), the set of variables D is independent of all other variables X − D − MB(D). [sent-124, score-0.407]
8 The canonical parameterization (Hammersley and Clifford, 1971; Besag, 1974b) provides one specific choice of parameterization for a Gibbs distribution, with some nice properties (see below). [sent-138, score-0.354]
9 The canonical factor for D ⊆ X is defined as follows: f*_D(d) = exp( ∑_{U⊆D} (−1)^{|D−U|} log P(σ_U[d]) ). [sent-165, score-0.466]
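The inclusion-exclusion sum above is mechanical to evaluate once log-probabilities of full instantiations are available. The following is a minimal illustrative sketch, not code from the paper; the function name log_p, the dict-based representation of instantiations, and the variable names are assumptions made for the example.

```python
from itertools import combinations

def canonical_log_factor(d, D, log_p, default):
    """Illustrative sketch of the canonical factor definition:
    log f*_D(d) = sum over U subset of D of (-1)^{|D-U|} * log P(sigma_U[d]).

    d       -- dict {variable: value} giving the instantiation of the scope D
    D       -- list of the variables in the scope
    log_p   -- assumed user-supplied function: full instantiation (dict) -> log P(x)
    default -- dict giving the default (base) instantiation x-bar of all variables
    """
    total = 0.0
    for r in range(len(D) + 1):
        for U in combinations(D, r):
            # sigma_U[d]: d's values on the variables in U, default values everywhere else
            x = dict(default)
            for v in U:
                x[v] = d[v]
            total += (-1) ** (len(D) - len(U)) * log_p(x)
    return total
```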
10 The parameterization of P using the canonical factors { f*_{C*_j} }_{j=1}^J is called the canonical parameterization of P. [sent-172, score-0.593]
11 Note that first transforming a factor graph into a Markov network and then applying the Hammersley-Clifford theorem to the Markov network generally results in a significantly less sparse canonical parameterization than the canonical parameterization from Theorem 3. [sent-174, score-0.803]
12 We now give an example to clarify the definition of canonical factors and canonical parameterization. [sent-175, score-0.519]
13 (3) So to compute the canonical factor, we start with the joint instantiation of the factor variables {X1 , X2 } with all other variables {X3 , · · · , X9 } set to their default instantiations. [sent-179, score-0.424]
14 The canonical factor over just the variable X1 instantiated to x1 is given by log f*_{X1}(x1) = log P(X1 = x1, X2 = 0, X3 = 0, X4 = 0, · · · , X9 = 0) − log P(X1 = 0, X2 = 0, X3 = 0, X4 = 0, · · · , X9 = 0). [sent-184, score-0.79]
15 Then we formally introduce the key idea of Markov blanket canonical factors, which give a parameterization of a factor graph distribution only in terms of local probabilities. [sent-188, score-0.751]
16 So one could estimate the canonical factors (and thus the distribution) in closed form by estimating these probabilities from data. [sent-197, score-0.357]
17 Unfortunately the probabilities appearing in the canonical factors are over full joint instantiations of all variables. [sent-198, score-0.447]
18 (5) Here we used in order: the definition of conditional probability; same terms with opposite sign cancel; conditioning on the Markov blanket is equivalent to conditioning on all other variables; MB({X1 , X2 }) = {X3 , X4 , X5 } in our example. [sent-204, score-0.371]
19 The last line, Eqn. (6), defines f*_{{X1,X2}|{X3,X4,X5}}(x1, x2) (which we refer to as the Markov blanket canonical factor for {X1, X2}). [sent-208, score-0.52]
20 The Markov blanket canonical factor f*_{{X1,X2}|{X3,X4,X5}}(x1, x2) is computed from local probabilities as given in Eqn. [sent-210, score-0.662]
21 The (original) canonical factor f*_{{X1,X2}}(x1, x2) is computed from probabilities over full joint instantiations as given in Eqn. [sent-212, score-0.422]
22 Similarly, the other canonical factors have equivalent Markov blanket canonical factors which involve local probabilities only. [sent-214, score-0.455]
23 We now generalize our observation from the example in the previous section: namely, that we can express the canonical factors using only probabilities over much smaller instantiations—those corresponding to a factor and its Markov blanket. [sent-223, score-0.455]
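To make this locality concrete, here is a hedged sketch of how such a factor could be estimated from data using only counts over a factor scope and its Markov blanket; the counting scheme, the smoothing constant, and all names are illustrative assumptions, not the paper's exact estimator.

```python
from collections import Counter
from itertools import combinations
from math import log

def estimate_mb_canonical_log_factor(samples, D, MB, d, default):
    """Illustrative estimator of log f*_{D|MB(D)}(d) that touches only the variables in D and MB.

    samples -- list of full instantiations (dicts {variable: value})
    D, MB   -- lists of variable names (the scope and its Markov blanket)
    d       -- dict assigning values to the variables in D
    default -- the base instantiation x-bar
    The small smoothing constant below is a hack for the sketch, not part of the paper's analysis.
    """
    mb_bar = tuple(default[v] for v in MB)
    joint, mb_count = Counter(), 0
    for x in samples:
        if tuple(x[v] for v in MB) == mb_bar:      # Markov blanket in its default instantiation
            mb_count += 1
            joint[tuple(x[v] for v in D)] += 1

    def log_p_cond(assignment):                    # log of estimated P(D = assignment | MB = mb_bar)
        return log((joint[assignment] + 1e-6) / (mb_count + 1e-6))

    total = 0.0
    for r in range(len(D) + 1):
        for U in combinations(D, r):               # inclusion-exclusion over subsets of D
            assignment = tuple(d[v] if v in U else default[v] for v in D)
            total += (-1) ** (len(D) - len(U)) * log_p_cond(assignment)
    return total
```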
24 Proposition 4 shows that we can compute the canonical parameterization factors using probabilities over factor scopes and their Markov blankets only. [sent-244, score-0.657]
25 Using Eqn. (7) we can expand the Markov blanket canonical factors in Proposition 4, and we see that any factor graph distribution can be parameterized as a product of local probabilities only. [sent-247, score-0.828]
26 (Notation table) X1, . . . , Xn: the variables; the factors of P; the scopes of the factors of P; the empirical (sample) distribution; the distribution returned by the learning algorithm; the canonical factor as defined in Eqn. [sent-263, score-0.648]
27 The Markov blanket canonical factors are a product of local conditional probabilities. [sent-299, score-0.652]
28 Thus the Markov blanket canonical factors can be estimated accurately from a small number of training examples. [sent-301, score-0.644]
29 Thus the factor graph distribution—which is just a product of the Markov blanket canonical factors—can be estimated accurately from a small number of training examples. [sent-302, score-0.356]
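Assembling the learned model is then direct: up to the partition function, its log-density is the sum of the estimated canonical log-factors over the subscopes of the given structure. The sketch below is an illustration under the same assumptions as the previous sketch; mb_of is a hypothetical lookup for the Markov blanket used for each subscope, and estimate_mb_canonical_log_factor is reused from above.

```python
from itertools import combinations

def unnormalized_log_density(x, factor_scopes, mb_of, samples, default):
    """Illustrative sketch: up to the log-partition function, the learned distribution's
    log-density is the sum of estimated Markov blanket canonical log-factors over every
    non-empty subset D of the given scopes (each distinct D counted once).

    mb_of -- hypothetical lookup: subscope (tuple of variables) -> its Markov blanket (list)
    """
    subscopes = set()
    for C in factor_scopes:
        for r in range(1, len(C) + 1):
            subscopes.update(combinations(sorted(C), r))
    total = 0.0
    for D in subscopes:
        d = {v: x[v] for v in D}
        total += estimate_mb_canonical_log_factor(samples, list(D), mb_of(D), d, default)
    return total
```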
30 Let { f*_{D*_j|MB(D*_j)} }_{j=1}^J be the non-trivial Markov blanket canonical factors of Q (those factors with not all entries equal to one). [sent-317, score-0.759]
31 Let {C*_j}_{j=1}^J be the scopes of the canonical factors in the factor graph given to the algorithm. [sent-318, score-0.528]
32 The graceful degradation result is important, as it shows that each canonical factor that is incorrectly captured by our target structure adds at most a constant (namely, l·2^{l+1} log(1/γ) for an incorrectly captured factor over l variables) to our bound on the KL-divergence. [sent-323, score-0.7]
33 A canonical factor could be incorrectly captured when the corresponding factor scope is not included in the given structure. [sent-325, score-0.449]
34 Thus, canonical factors over large scopes are often close to the trivial all-ones factor in practice. [sent-327, score-0.51]
35 A canonical factor could also be incorrectly captured when the given structure does not have the correct Markov blanket for that factor. [sent-329, score-0.663]
36 Thus we have for a factor over l variables that max_{d*_j} log f*_{D*_j}(d*_j) ≤ log(1/γ^{l·2^{l−1}}) = l·2^{l−1} log(1/γ). [sent-346, score-0.615]
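One way to see where this constant comes from (a sketch, under the assumption that every conditional probability of a joint assignment to the l factor variables given their Markov blanket is at least γ^l, which holds if each single-variable conditional is at least γ):

```latex
% Sketch, assuming P(\sigma_U[d] \mid \overline{\mathrm{mb}}) \ge \gamma^{l} for every U \subseteq D, |D| = l.
\log f^*_{D\mid\mathrm{MB}(D)}(d)
  = \sum_{U \subseteq D} (-1)^{|D-U|}\,\log P\bigl(\sigma_U[d] \mid \overline{\mathrm{mb}}\bigr)
  \;\le\; \sum_{U:\,|D-U|\ \text{odd}} \log \frac{1}{P(\sigma_U[d] \mid \overline{\mathrm{mb}})}
  \;\le\; 2^{l-1}\, l \log \frac{1}{\gamma}
  \;=\; \log \frac{1}{\gamma^{\,l 2^{l-1}}}
```

The lower bound is symmetric, since exactly 2^{l−1} of the 2^l subsets carry each sign.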
37 Structure Learning The algorithm described in the previous section uses the known network to establish a Markov blanket for each factor. [sent-402, score-0.355]
38 This Markov blanket is then used to estimate the canonical parameters from empirical data. [sent-403, score-0.52]
39 1 Identifying Markov Blankets In the parameter learning results, the Markov blanket MB(C*_j) is used to efficiently estimate the conditional probability P(C*_j | X − C*_j), which is equal to P(C*_j | MB(C*_j)). [sent-406, score-0.363]
40 In j j j this section we show how conditional entropy can be used to find a candidate Markov blanket that gives a good approximation for this conditional probability. [sent-408, score-0.411]
41 For some readers, some intuition might be gained from the fact that the conditional entropy of C*_j given the candidate Markov blanket Y corresponds to the log-loss of predicting C*_j given the candidate Markov blanket Y. [sent-415, score-0.728]
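For reference, this is just the identity

```latex
H\bigl(\mathbf{C}^*_j \mid \mathbf{Y}\bigr)
  \;=\; \mathbb{E}_{(\mathbf{c},\mathbf{y}) \sim P}
        \bigl[-\log P(\mathbf{C}^*_j = \mathbf{c} \mid \mathbf{Y} = \mathbf{y})\bigr]
```

i.e., the expected log-loss incurred when predicting C*_j from Y under the true conditional distribution.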
42 Proposition 11 shows that conditional entropy can be used to find the Markov blanket for a given set of variables. [sent-420, score-0.358]
43 Thus, we can select the set of variables Y that minimizes H(D|Y) as our candidate Markov blanket for the set of variables D. [sent-423, score-0.404]
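A hedged sketch of this selection step, with function and variable names chosen for illustration only (same Python conventions as the earlier sketches):

```python
from collections import Counter
from itertools import combinations
from math import log

def empirical_conditional_entropy(samples, D, Y):
    """Empirical H(D | Y); samples are dicts {variable: value}."""
    joint, marg = Counter(), Counter()
    for x in samples:
        dval = tuple(x[v] for v in D)
        yval = tuple(x[v] for v in Y)
        joint[(dval, yval)] += 1
        marg[yval] += 1
    m = len(samples)
    return -sum((c / m) * log(c / marg[yval]) for (dval, yval), c in joint.items())

def candidate_markov_blanket(samples, D, variables, b):
    """Return the subset Y (|Y| <= b) of the remaining variables minimizing empirical H(D | Y)."""
    others = [v for v in variables if v not in D]
    best, best_h = [], float("inf")
    for size in range(b + 1):
        for Y in combinations(others, size):
            h = empirical_conditional_entropy(samples, D, list(Y))
            if h < best_h:
                best, best_h = list(Y), h
    return best
```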
44 However, as the empirical estimates of the conditional entropy are noisy, the true Markov blanket is not guaranteed to achieve the minimum of H(D|Y). [sent-432, score-0.358]
45 In other words, if a set of variables U ∪ W looks like a Markov blanket for X, as evaluated by the conditional entropy H(X|U, W), then the conditional distribution P(X|U, W) must be close to the conditional distribution P(X|X − X), i.e., the distribution of X conditioned on all the other variables. [sent-443, score-0.487]
46 Thus, it suffices to find such an approximate Markov blanket U ∪ W as a substitute for knowing the true Markov blanket U ∪ V. [sent-444, score-0.628]
47 The algorithm receives as input: training examples {x^(i)}_{i=1}^m; k: the maximum number of variables per factor; b: the maximum number of variables per Markov blanket for any set of variables up to size k; x̄: a base instantiation. [sent-448, score-0.424]
48 • Threshold to one the factor entries f̂*_{C*_j|MB(C*_j)}(c*_j) satisfying |log f̂*_{C*_j|MB(C*_j)}(c*_j)| ≤ ε/2^{k+2}, and discard the factors that have all entries equal to one. [sent-456, score-0.417]
49 Thus the running time is exponential in the maximum factor scope size k and the maximum Markov blanket size b, polynomial in the number of variables n and the maximum domain size v, and linear in the number of training examples m. [sent-463, score-0.512]
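Putting the pieces together, the structure-learning loop can be sketched as follows. This is an illustration only: the helper all_instantiations, the single clipping threshold, and the reuse of candidate_markov_blanket and estimate_mb_canonical_log_factor from the earlier sketches are assumptions, not the paper's pseudocode.

```python
from itertools import combinations, product

def all_instantiations(C, samples):
    """Enumerate value tuples for the variables in C, using the values seen in the data."""
    domains = [sorted({x[v] for x in samples}) for v in C]
    return product(*domains)

def learn_factor_graph(samples, variables, k, b, default, threshold):
    """Illustrative structure-learning loop:
      1. enumerate every candidate scope C with |C| <= k;
      2. choose a candidate Markov blanket of size <= b by minimizing empirical H(C | Y);
      3. estimate the Markov blanket canonical factor of C from local counts;
      4. clip near-trivial entries to one (log value 0) and drop factors that become all-ones.
    """
    factors = {}
    for size in range(1, k + 1):
        for C in combinations(variables, size):
            MB = candidate_markov_blanket(samples, list(C), variables, b)
            table = {}
            for values in all_instantiations(C, samples):
                d = dict(zip(C, values))
                table[values] = estimate_mb_canonical_log_factor(samples, list(C), MB, d, default)
            # thresholding step: entries with small |log f| are clipped to the trivial value
            table = {vals: (0.0 if abs(lf) <= threshold else lf) for vals, lf in table.items()}
            if any(lf != 0.0 for lf in table.values()):
                factors[C] = (MB, table)
    return factors
```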
50 Note that in the parameter learning setting we had b equal to the size of the largest Markov blanket for an actual factor in the distribution. [sent-466, score-0.412]
51 In contrast, now b corresponds to the size of the largest Markov blanket for any candidate factor up to size k. [sent-467, score-0.44]
52 The last term comes from computing the Markov blanket canonical factors. [sent-471, score-0.52]
53 from a distribution P, (b) an upper bound k on the number of variables per factor in the factor graph for P, and (c) an upper bound b on the number of variables per Markov blanket for any set of variables up to size k in the factor graph for P. [sent-477, score-0.795]
54 Theorem 15 shows that the sample complexity depends exponentially on the maximum factor size k and the maximum Markov blanket size b, and polynomially on 1/ε and 1/δ. [sent-483, score-0.433]
55 Theorem 15 considers the case where the generating distribution P factors according to a structure with factor scope sizes bounded by k and size of Markov blankets (of any subset of variables of size less than k) bounded by b. [sent-492, score-0.41]
56 Let { f*_{D*_j|MB(D*_j)} }_j be the non-trivial Markov blanket canonical factors of Q (those factors with not all entries equal to one). [sent-505, score-0.759]
57 Let J be the number of non-trivial Markov blanket canonical factors in Q with scope size smaller than k. [sent-506, score-0.657]
58 Second, the Markov blanket of the factor could be too large. [sent-516, score-0.412]
59 Recall that the structure learning algorithm correctly clips all estimates of trivial canonical factors to the trivial all-ones factor, when the structural assumptions are satisfied. [sent-519, score-0.375]
60 That is, trivial factors are correctly estimated as trivial if their Markov blanket is of size smaller than b. [sent-522, score-0.455]
61 The additional term |S|ε corresponds to estimation error on the factors that are trivial in the true distribution but that have a Markov blanket of size larger than b, and are thus not correctly estimated and clipped to trivial all-ones factors. [sent-523, score-0.505]
62 From a statistical perspective, our algorithm is based on the canonical parameterization, which is evaluated relative to a canonical assignment x̄. [sent-638, score-0.412]
63 In particular, we might be able to address the statistical limitation by putting together canonical factor estimates from multiple canonical assignments x̄. [sent-646, score-0.525]
64 If P is a positive Gibbs distribution with factor scopes {C_j}_{j=1}^J, then the canonical factors f*_D are trivial all-ones factors whenever D ∉ ∪_{j=1}^J 2^{C_j}. [sent-658, score-0.517]
65 The first part states that the canonical parameterization gives the correct distribution assuming we use a canonical factor for each subset of variables. [sent-659, score-0.625]
66 The second part states that we can ignore canonical factors over subsets of variables that do not appear together in one of the factor scopes {C_j}_{j=1}^J. [sent-661, score-0.54]
67 We have log f*_D(d) = ∑_{U⊆D} (−1)^{|D−U|} log P(σ_U[d]) = ∑_{U⊆D} (−1)^{|D−U|} ( ∑_{j=1}^J log f_{C_j}(C_j[σ_U[d]]) + log(1/Z) ) = ∑_{j=1}^J ∑_{U⊆D} (−1)^{|D−U|} log f_{C_j}(C_j[σ_U[d]]), where the log(1/Z) terms cancel because ∑_{U⊆D} (−1)^{|D−U|} = 0 for non-empty D. [sent-663, score-0.81]
68 Then we have that ∑_{U⊆D} (−1)^{|D−U|} log f_{C_j}(C_j[σ_U[d]]) = ∑_{U⊆D−Y} ( (−1)^{|D−U|} log f_{C_j}(C_j[σ_U[d]]) + (−1)^{|D−U−Y|} log f_{C_j}(C_j[σ_{U∪Y}[d]]) ). [sent-668, score-0.486]
69 – When going through the m data points, we need to add to the counts of the observed instantiation whenever the Markov blanket is in the default instantiation. [sent-701, score-0.372]
70 Reading a specific instantiation of a specific factor and its Markov blanket takes O(k + b) to read every variable. [sent-702, score-0.449]
71 This gives us O(2^{|C*_j|}) operations per factor entry, and thus O(J·2^{2k}·v^k) total for computing the canonical factor entries from the empirical probabilities. [sent-709, score-0.489]
72 The following lemma shows that the log of the empirical average is an accurate estimate of the log of the population average, if the population average is bounded away from zero. [sent-717, score-0.387]
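A compact version of the two-step argument (Hoeffding, then a Lipschitz bound on the logarithm); the constants shown are as reconstructed from the surrounding text:

```latex
% Step 1 (Hoeffding): with m \ge \tfrac{1}{2\varepsilon^2}\log\tfrac{2}{\delta} samples,
%   |\hat\phi - \phi| \le \varepsilon holds with probability at least 1-\delta.
% Step 2 (Lipschitz): if \phi \ge \lambda, then \hat\phi,\phi \in [\lambda-\varepsilon,\,1],
%   where x \mapsto \log x has Lipschitz constant at most \tfrac{1}{\lambda-\varepsilon}, so
|\log\hat\phi - \log\phi| \;\le\; \frac{|\hat\phi - \phi|}{\lambda - \varepsilon}
                          \;\le\; \frac{\varepsilon}{\lambda - \varepsilon}
```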
73 Since the function f(x) = log x is Lipschitz, with Lipschitz constant smaller than 1/(λ − ε) over the interval [λ − ε, 1], we have that for |log φ̂ − log φ| ≤ ε/(λ − ε) (22) to hold w.p. 1 − δ it suffices that m ≥ (1/(2ε²)) log(2/δ). [sent-730, score-0.805]
74 Then for |log P̂(X = x|Y = y) − log P(X = x|Y = y)| ≤ ε to hold for all x, y with probability 1 − δ, it suffices that m ≥ (1 + ε/2)² / (2λ²(ε/2)²) · log(4|val(X)||val(Y)|/δ). [sent-745, score-0.533]
75 Proof We have (using the definition of conditional probability and the triangle inequality) |log P̂(X = x|Y = y) − log P(X = x|Y = y)| = |(log P̂(X = x, Y = y) − log P̂(Y = y)) − (log P(X = x, Y = y) − log P(Y = y))| ≤ |log P̂(X = x, Y = y) − log P(X = x, Y = y)| + |log P̂(Y = y) − log P(Y = y)|. [sent-746, score-1.345]
76 Then for |log f*_{D|Y}(d) − log f̂*_{D|W}(d)| ≤ ε to hold for all d ∈ val(D), it suffices that for all instantiations d ∈ val(D) we have that |log P(d|ȳ) − log P̂(d|w̄)| ≤ ε/2^{|D|}, since the canonical factor is a signed sum of 2^{|D|} such log-probabilities. [sent-756, score-0.722]
77 We now show how the previous lemmas can be used to prove the parameter learning sample complexity result stated in Theorem 6. [sent-767, score-0.375]
78 Proof [Theorem 6] First note that since the scopes of the canonical factors used by the algorithm are subsets of the given scopes {C_j}_{j=1}^J, we have that max_j |C*_j ∪ MB(C*_j)| ≤ b + k. [sent-768, score-0.387]
79 For all j ∈ {1, . . . , J*}, for |log P̂(C*_j = c*_j | M*_j = m*_j) − log P(C*_j = c*_j | M*_j = m*_j)| ≤ ε (26) to hold for all instantiations c*_j, m*_j with probability 1 − δ, it suffices that m ≥ (1 + ε/2)² / (2γ^{2k+2b}(ε/2)²) · log(4v^{k+b}/δ). [sent-774, score-0.607]
80 (32) We have (adding and subtracting the same term): log( f*_{D*_j|MB(D*_j)}(d*_j) / f̂*_{D*_j|MB(D*_j)}(d*_j) ) = log( f*_{D*_j|MB(D*_j)}(d*_j) / f̄*_{D*_j|MB(D*_j)}(d*_j) ) + log( f̄*_{D*_j|MB(D*_j)}(d*_j) / f̂*_{D*_j|MB(D*_j)}(d*_j) ), where f̄* denotes the intermediate factor being added and subtracted. [sent-791, score-0.486]
81 Thus we have (adding zero, i.e., adding and subtracting the same term): log f̂*_{C*_j|MB(C*_j)}(c*_j) = log( f̂*_{C*_j|MB(C*_j)}(c*_j) / f*_{C*_j|MB(C*_j)}(c*_j) ) + log f*_{C*_j|MB(C*_j)}(c*_j). [sent-793, score-0.486]
82 (44) Also, using the same reasoning as in the proof of Lemma 20, we get that D_n(P‖P̃) + D_n(P̃‖P) ≤ (2/n) ∑_{j=1}^{J*} max_{c*_j} |log f*_{C*_j|MB(C*_j)}(c*_j) − log f̂*_{C*_j|MB(C*_j)}(c*_j)|. [sent-811, score-0.381]
83 For (45) to hold with probability 1 − δ, it suffices that m ≥ (1 + εδ/(nJ2^{2k+3}))² / (2γ^{2k+2b}(εδ/(nJ2^{2k+3}))²) · log( k2^{2k+4} v^{k+b} log(1/γ) · nJ/(εδ) ). [sent-817, score-0.371]
84 Proof We use the concavity of the log function to upper bound it with a tangent line at θ_i^{(2)}, which gives the following inequality: log θ_i^{(1)} ≤ log θ_i^{(2)} + (1/θ_i^{(2)}) (θ_i^{(1)} − θ_i^{(2)}). [sent-922, score-0.486]
85 Proof [Lemma 22] We have D(θ^{(1)}_{1:v} ‖ θ^{(2)}_{1:v}) = ∑_{i=1}^v θ_i^{(1)} log(θ_i^{(1)}/θ_i^{(2)}) ≤ max_i log(θ_i^{(1)}/θ_i^{(2)}) ≤ max_i log(1/θ_i^{(2)}) ≤ max_{y∈[γ,1−γ]} log(1/y) = log(1/γ), (55) which proves the lemma. [sent-949, score-0.81]
86 This case is trivial, since by Lemma 22 we have that D(θ̃_{1:v} ‖ θ̄_{1:v}) ≤ log(4v³/ε̄) for all θ̄_{1:v} ∈ [ε̄/(4v³), 1 − ε̄/(4v³)]^v, so the statement of the lemma is trivially implied and m = 0 samples is sufficient. [sent-956, score-0.526]
87 Then for |φ̂ log φ̂ − φ log φ| ≤ ε to hold w.p. [sent-980, score-0.347]
88 m ≥ (1/(2ε²)) log(2/δ). (60) Now since the function f(x) = x log x is Lipschitz, with Lipschitz constant smaller than max{1, |log(λ − ε)|} over the interval [λ − ε, 1], we have that for |φ̂ log φ̂ − φ log φ| ≤ ε·max{1, |log(λ − ε)|} to hold w.p. [sent-986, score-0.509]
89 (61) Now since for any two distributions P and P̂ we have |H(X|Y) − Ĥ(X|Y)| ≤ log|val(X)| ≤ 2|val(X)||val(Y)|, we have that for any ε ≥ 2|val(X)||val(Y)| the statement of the lemma holds trivially, independent of the number of samples m. [sent-1002, score-0.391]
90 From Eqn. (14) and the definition of conditional entropy we get that ∑_{x,u,v,w,y} P(x, u, v, w, y) log P(x|u, v, w, y) − ∑_{x,u,w} P(x, u, w) log P(x|u, w) ≤ ε. [sent-1008, score-0.385]
91 Combining this with Eqn. (13) (U ∪ V is the Markov blanket of X) gives us ∑_{x,u,v,w,y} P(x, u, v, w, y) log( P(x|u, v) / P(x|u, w) ) ≤ ε. [sent-1011, score-0.476]
92 Using Eqn. (13) (U ∪ V is the Markov blanket of X) we get ∑_{u,v,w} P(u, v, w) ∑_x P(x|u, v) log( P(x|u, v) / P(x|u, w) ) ≤ ε. [sent-1014, score-0.493]
3 Proof of Theorem 14 Proof [Theorem 14] There are O(k n^k · b n^b) candidate factor, candidate Markov blanket pairs, each with O(v^{k+b}) different instantiations. [sent-1025, score-0.431]
94 Putting it all together gives us an upper bound on the running time of O( k n^k b n^b v^{k+b} + m·k n^k·b n^b·(k + b) + k n^k b n^b v^{k+b} + k n^k·(m(k + b) + 2^k v^k) ). [sent-1031, score-0.433]
95 (65) Now from Lemma 18 we have that for |log P̂(C*_j|MB(C*_j)) − log P(C*_j|MB(C*_j))| ≤ 2√ε/γ^{k+b} (66) to hold for all instantiations c*_j ∈ val(C*_j) with probability 1 − δ, it suffices that m ≥ (1 + √ε/γ^{k+b})² / (2γ^{2k+2b}(√ε/γ^{k+b})²) · log(4v^{k+b}/δ). [sent-1049, score-0.607]
96 Clipping to one the entries f̂*_{C*_j|MB(C*_j)}(c*_j) satisfying |log f̂*_{C*_j|MB(C*_j)}(c*_j)| ≤ ε/2^{k+2} introduces additional error of at most ε/2^{k+2} per entry; thus after the clipping we have |log f*_{C*_j|X−C*_j}(c*_j) − log f̂*_{C*_j|MB(C*_j)}(c*_j)| ≤ ε/2^{k+1} (69) for all canonical factors of P. [sent-1060, score-0.872]
97 We also have that all candidate factors that are not present in the canonical form of the true distribution P will now have been removed and do not contribute to P̃. [sent-1061, score-0.365]
98 Choosing b large enough for these factors is important, and this value can be larger than the maximum Markov blanket size of factors actually present in the distribution. [sent-1063, score-0.528]
99 Trivial (all-ones) canonical factors computed using a candidate Markov blanket are only guaranteed to come out as all-ones when that candidate is a true Markov blanket. [sent-1064, score-0.941]
100 Moreover (recall the clipping procedure removed all candidate factors with scope less than k and Markov blanket size less than b that are not present in the canonical form of the true distribution P) we have that zero error is incurred on all other factors. [sent-1075, score-0.782]
wordName wordTfidf (topN-words)
[('mb', 0.69), ('blanket', 0.314), ('val', 0.296), ('canonical', 0.206), ('log', 0.162), ('pbn', 0.142), ('fd', 0.131), ('markov', 0.121), ('bbeel', 0.117), ('olynomial', 0.111), ('factors', 0.107), ('factor', 0.098), ('fc', 0.095), ('raphs', 0.095), ('omplexity', 0.095), ('ime', 0.095), ('ample', 0.085), ('gibbs', 0.082), ('scopes', 0.082), ('instantiations', 0.074), ('parameterization', 0.074), ('ex', 0.073), ('clipping', 0.073), ('vk', 0.062), ('bnb', 0.061), ('ces', 0.06), ('koller', 0.056), ('knk', 0.056), ('poly', 0.046), ('blankets', 0.046), ('graceful', 0.046), ('dn', 0.045), ('probabilities', 0.044), ('pax', 0.043), ('network', 0.041), ('lemma', 0.04), ('ml', 0.04), ('earning', 0.039), ('dependence', 0.038), ('instantiation', 0.037), ('bn', 0.036), ('graph', 0.035), ('cpt', 0.034), ('bayesian', 0.034), ('variables', 0.031), ('yes', 0.031), ('mh', 0.03), ('scope', 0.03), ('lemmas', 0.03), ('structure', 0.028), ('xk', 0.028), ('candidate', 0.028), ('degradation', 0.028), ('theorem', 0.028), ('graphs', 0.027), ('distributions', 0.027), ('clipped', 0.026), ('maxd', 0.025), ('conditional', 0.025), ('entries', 0.025), ('probability', 0.024), ('distribution', 0.024), ('bounded', 0.023), ('hold', 0.023), ('networks', 0.023), ('graphical', 0.022), ('polynomial', 0.022), ('uai', 0.022), ('maxc', 0.021), ('besag', 0.021), ('parents', 0.021), ('sample', 0.021), ('default', 0.021), ('kl', 0.021), ('abbeel', 0.02), ('abe', 0.02), ('jvk', 0.02), ('mknk', 0.02), ('narasimhan', 0.02), ('proofs', 0.02), ('dasgupta', 0.02), ('proof', 0.019), ('entropy', 0.019), ('suf', 0.018), ('trivial', 0.017), ('propositions', 0.017), ('union', 0.017), ('training', 0.017), ('captured', 0.017), ('get', 0.017), ('multinomial', 0.016), ('chickering', 0.016), ('conditioning', 0.016), ('appearing', 0.016), ('subsets', 0.016), ('contribution', 0.015), ('assignments', 0.015), ('nump', 0.015), ('srebro', 0.015), ('formal', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999881 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
Author: Pieter Abbeel, Daphne Koller, Andrew Y. Ng
Abstract: We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded degree can be learned in polynomial time and from a polynomial number of training examples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks. In addition to our main result, we show that the sample complexity of parameter learning in graphical models has an O(1) dependence on the number of variables in the model when using the KL-divergence normalized by the number of variables as the performance criterion.1 Keywords: probabilistic graphical models, parameter and structure learning, factor graphs, Markov networks, Bayesian networks
2 0.066308796 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
Author: Magnus Ekdahl, Timo Koski
Abstract: In many pattern recognition/classification problems the true class conditional model and class probabilities are approximated for reasons of reducing complexity and/or of statistical estimation. The approximated classifier is expected to have worse performance, here measured by the probability of correct classification. We present an analysis valid in general, and easily computable formulas for estimating the degradation in probability of correct classification when compared to the optimal classifier. An example of an approximation is the Naïve Bayes classifier. We show that the performance of the Naïve Bayes depends on the degree of functional dependence between the features and labels. We provide a sufficient condition for zero loss of performance, too. Keywords: Bayesian networks, naïve Bayes, plug-in classifier, Kolmogorov distance of variation, variational learning
3 0.059486132 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs (Special Topic on Machine Learning and Optimization)
Author: Tobias Glasmachers, Christian Igel
Abstract: Support vector machines are trained by solving constrained quadratic optimization problems. This is usually done with an iterative decomposition algorithm operating on a small working set of variables in every iteration. The training time strongly depends on the selection of these variables. We propose the maximum-gain working set selection algorithm for large scale quadratic programming. It is based on the idea to greedily maximize the progress in each single iteration. The algorithm takes second order information from cached kernel matrix entries into account. We prove the convergence to an optimal solution of a variant termed hybrid maximum-gain working set selection. This method is empirically compared to the prominent most violating pair selection and the latest algorithm using second order information. For large training sets our new selection scheme is significantly faster. Keywords: working set selection, sequential minimal optimization, quadratic programming, support vector machines, large scale optimization
4 0.059142426 25 jmlr-2006-Distance Patterns in Structural Similarity
Author: Thomas Kämpke
Abstract: Similarity of edge labeled graphs is considered in the sense of minimum squared distance between corresponding values. Vertex correspondences are established by isomorphisms if both graphs are of equal size and by subisomorphisms if one graph has fewer vertices than the other. Best fit isomorphisms and subisomorphisms amount to solutions of quadratic assignment problems and are computed exactly as well as approximately by minimum cost flow, linear assignment relaxations and related graph algorithms. Keywords: assignment problem, best approximation, branch and bound, inexact graph matching, model data base
5 0.050367597 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. The properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
8 0.03917788 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
9 0.037438221 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann
10 0.036805011 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
11 0.03632639 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
12 0.035459321 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
13 0.034887653 53 jmlr-2006-Learning a Hidden Hypergraph
14 0.034180954 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
15 0.033353407 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
16 0.032404799 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
17 0.030357216 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
18 0.02844003 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research (Special Topic on Machine Learning and Optimization)
19 0.027977435 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
20 0.026723903 2 jmlr-2006-A Graphical Representation of Equivalence Classes of AMP Chain Graphs
topicId topicWeight
[(0, 0.142), (1, -0.039), (2, -0.105), (3, -0.009), (4, -0.106), (5, 0.091), (6, 0.066), (7, -0.028), (8, 0.039), (9, -0.05), (10, 0.055), (11, 0.047), (12, 0.16), (13, -0.031), (14, 0.056), (15, 0.1), (16, -0.07), (17, 0.002), (18, -0.157), (19, 0.015), (20, -0.087), (21, 0.092), (22, 0.242), (23, -0.175), (24, 0.013), (25, -0.068), (26, -0.166), (27, -0.095), (28, 0.113), (29, -0.005), (30, 0.119), (31, -0.068), (32, 0.002), (33, 0.251), (34, -0.106), (35, -0.022), (36, -0.01), (37, -0.045), (38, 0.034), (39, 0.005), (40, 0.032), (41, 0.063), (42, -0.12), (43, 0.225), (44, 0.024), (45, -0.112), (46, 0.215), (47, -0.056), (48, -0.165), (49, -0.227)]
simIndex simValue paperId paperTitle
same-paper 1 0.94448286 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
Author: Pieter Abbeel, Daphne Koller, Andrew Y. Ng
Abstract: We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded degree can be learned in polynomial time and from a polynomial number of training examples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks. In addition to our main result, we show that the sample complexity of parameter learning in graphical models has an O(1) dependence on the number of variables in the model when using the KL-divergence normalized by the number of variables as the performance criterion.1 Keywords: probabilistic graphical models, parameter and structure learning, factor graphs, Markov networks, Bayesian networks
2 0.48336956 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
Author: Magnus Ekdahl, Timo Koski
Abstract: In many pattern recognition/classification problems the true class conditional model and class probabilities are approximated for reasons of reducing complexity and/or of statistical estimation. The approximated classifier is expected to have worse performance, here measured by the probability of correct classification. We present an analysis valid in general, and easily computable formulas for estimating the degradation in probability of correct classification when compared to the optimal classifier. An example of an approximation is the Naïve Bayes classifier. We show that the performance of the Naïve Bayes depends on the degree of functional dependence between the features and labels. We provide a sufficient condition for zero loss of performance, too. Keywords: Bayesian networks, naïve Bayes, plug-in classifier, Kolmogorov distance of variation, variational learning
3 0.28801242 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs (Special Topic on Machine Learning and Optimization)
Author: Tobias Glasmachers, Christian Igel
Abstract: Support vector machines are trained by solving constrained quadratic optimization problems. This is usually done with an iterative decomposition algorithm operating on a small working set of variables in every iteration. The training time strongly depends on the selection of these variables. We propose the maximum-gain working set selection algorithm for large scale quadratic programming. It is based on the idea to greedily maximize the progress in each single iteration. The algorithm takes second order information from cached kernel matrix entries into account. We prove the convergence to an optimal solution of a variant termed hybrid maximum-gain working set selection. This method is empirically compared to the prominent most violating pair selection and the latest algorithm using second order information. For large training sets our new selection scheme is significantly faster. Keywords: working set selection, sequential minimal optimization, quadratic programming, support vector machines, large scale optimization
Author: Shai Shalev-Shwartz, Yoram Singer
Abstract: We discuss the problem of learning to rank labels from a real valued feedback associated with each label. We cast the feedback as a preferences graph where the nodes of the graph are the labels and edges express preferences over labels. We tackle the learning problem by defining a loss function for comparing a predicted graph with a feedback graph. This loss is materialized by decomposing the feedback graph into bipartite sub-graphs. We then adopt the maximum-margin framework which leads to a quadratic optimization problem with linear constraints. While the size of the problem grows quadratically with the number of the nodes in the feedback graph, we derive a problem of a significantly smaller size and prove that it attains the same minimum. We then describe an efficient algorithm, called SOPOPO, for solving the reduced problem by employing a soft projection onto the polyhedron defined by a reduced set of constraints. We also describe and analyze a wrapper procedure for batch learning when multiple graphs are provided for training. We conclude with a set of experiments which show significant improvements in run time over a state of the art interior-point algorithm.
5 0.27591872 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. The properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
6 0.26773539 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
7 0.25567943 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
8 0.25057507 53 jmlr-2006-Learning a Hidden Hypergraph
9 0.24970517 25 jmlr-2006-Distance Patterns in Structural Similarity
10 0.23615544 48 jmlr-2006-Learning Minimum Volume Sets
11 0.22574411 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
12 0.22357146 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
13 0.21575437 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
15 0.169453 67 jmlr-2006-On Representing and Generating Kernels by Fuzzy Equivalence Relations
16 0.16151895 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
17 0.15687992 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
18 0.15196747 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
19 0.14783987 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
20 0.14657988 62 jmlr-2006-MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals
topicId topicWeight
[(8, 0.013), (35, 0.011), (36, 0.057), (45, 0.033), (50, 0.09), (53, 0.42), (61, 0.012), (63, 0.032), (68, 0.027), (76, 0.027), (78, 0.015), (79, 0.012), (81, 0.047), (84, 0.013), (90, 0.025), (91, 0.017), (96, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.67324913 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
Author: Pieter Abbeel, Daphne Koller, Andrew Y. Ng
Abstract: We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded degree can be learned in polynomial time and from a polynomial number of training examples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks. In addition to our main result, we show that the sample complexity of parameter learning in graphical models has an O(1) dependence on the number of variables in the model when using the KL-divergence normalized by the number of variables as the performance criterion.1 Keywords: probabilistic graphical models, parameter and structure learning, factor graphs, Markov networks, Bayesian networks
2 0.60035545 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
3 0.30102044 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
Author: Mikio L. Braun
Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
4 0.29169151 66 jmlr-2006-On Model Selection Consistency of Lasso
Author: Peng Zhao, Bin Yu
Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency
5 0.29057741 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. The properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
6 0.28966737 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
7 0.28934911 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
8 0.28577545 48 jmlr-2006-Learning Minimum Volume Sets
9 0.28315169 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
10 0.28260082 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
11 0.28088278 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
12 0.28008392 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
13 0.27969593 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
14 0.27949539 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
16 0.27506313 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
17 0.27390334 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
18 0.2715883 53 jmlr-2006-Learning a Hidden Hypergraph
19 0.26559448 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
20 0.26480144 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples