jmlr jmlr2009 jmlr2009-44 knowledge-graph by maker-knowledge-mining

44 jmlr-2009-Learning Acyclic Probabilistic Circuits Using Test Paths


Source: pdf

Author: Dana Angluin, James Aspnes, Jiang Chen, David Eisenstat, Lev Reyzin

Abstract: We define a model of learning probabilistic acyclic circuits using value injection queries, in which fixed values are assigned to an arbitrary subset of the wires and the value on the single output wire is observed. We adapt the approach of using test paths from the Circuit Builder algorithm (Angluin et al., 2009) to show that there is a polynomial time algorithm that uses value injection queries to learn acyclic Boolean probabilistic circuits of constant fan-in and log depth. We establish upper and lower bounds on the attenuation factor for general and transitively reduced Boolean probabilistic circuits of test paths versus general experiments. We give computational evidence that a polynomial time learning algorithm using general value injection experiments may not do much better than one using test paths. For probabilistic circuits with alphabets of size three or greater, we show that the test path lemmas (Angluin et al., 2009, 2008b) fail utterly. To overcome this obstacle, we introduce function injection queries, in which the values on a wire may be mapped to other values rather than just to themselves or constants, and prove a generalized test path lemma for this case. Keywords: nonadaptive learning algorithms, probabilistic circuits, causal Bayesian networks, value injection queries, test paths

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 To overcome this obstacle, we introduce function injection queries, in which the values on a wire may be mapped to other values rather than just to themselves or constants, and prove a generalized test path lemma for this case. [sent-19, score-1.053]

2 In this paper we consider a different setting, value injection queries, in which we can fix the values on any subset of wires in the target circuit, but still only observe the output of the circuit. [sent-31, score-0.907]

3 A value injection query may also be viewed as a set of such actions, one for each wire fixed to a value. [sent-37, score-0.965]

4 Circuit Builder uses value injection queries to learn acyclic deterministic circuits with constant-size alphabets, constant fan-in and depth O(log n) up to behavioral equivalence in polynomial time. [sent-41, score-0.878]

5 Another algorithm is given that learns constant-depth acyclic Boolean circuits with NOT gates and unbounded fan-in AND, OR, NAND and NOR gates up to behavioral equivalence in polynomial time using value injection queries. [sent-42, score-0.887]

6 Negative results include an exponential lower bound on the number of value injection queries to learn acyclic Boolean circuits of unbounded depth and unbounded fan-in, and the NP-hardness of learning acyclic Boolean circuits of unbounded depth and constant fan-in using value injection queries. [sent-43, score-1.554]

7 They give the Distinguishing Paths Algorithm, which uses value injection queries and learns acyclic deterministic circuits that are transitively reduced and have polynomial-size alphabets, constant fan-in and unbounded depth up to behavioral equivalence in polynomial time. [sent-47, score-1.015]

8 In Section 2 we formally define our model of acyclic probabilistic circuits, value injection queries and distribution injection queries, behavioral equivalence, and the learning problem that we consider. [sent-55, score-1.025]

9 We adapt the approach of the Circuit Builder algorithm (Angluin et al., 2009) to find, using value injection queries and with high probability, in time polynomial in n and 1/ε, a circuit that is ε-behaviorally equivalent to a target acyclic Boolean probabilistic circuit of size n with constant fan-in and depth bounded by a constant times log n. [sent-61, score-1.248]

10 In Section 8 we introduce a stronger kind of query, a function injection query, and show that test paths with function injections overcome the limitations of test paths for circuits with alphabets of size greater than two. [sent-64, score-0.865]

11 An unusual feature of this model is that circuits do not have distinguished inputs—since the learning algorithm seeks to predict the output behavior of value injection experiments that override the values on an arbitrary subset of wires, each wire is a potential input. [sent-68, score-1.206]

12 We call the set of C’s wires W , and these wires take values in a finite alphabet Σ with |Σ| ≥ 2. [sent-71, score-0.972]

13 A probabilistic circuit C maps wires to probabilistic gates. [sent-91, score-0.875]

14 The circuit graph of C has a node for each wire in W and a directed edge (u, w) if u is one of the input wires of the gate associated with w. [sent-94, score-1.601]

15 It is important to distinguish between wires in the circuit and edges in the circuit graph. [sent-95, score-1.092]

16 For example, if wire u is an input of wires v and w, then there will be two directed edges, (u, v) and (u, w), in the circuit graph. [sent-96, score-1.349]

17 Wire w is reachable from wire u if there is a directed path from u to w in the circuit graph. [sent-97, score-0.979]

18 A wire is relevant if the output wire is reachable from it. [sent-98, score-1.123]

19 The depth of a wire w is the number of edges in the longest simple path from w to the output wire in the circuit graph. [sent-99, score-1.532]

20 The circuit is transitively reduced if its circuit graph is transitively reduced, that is, if it contains no edge (u, w) such that there is a directed path of length at least two from u to w. [sent-102, score-0.984]
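
To make the preceding graph definitions concrete, here is a minimal Python sketch (the representation and names are illustrative, not from the paper; the example graph matches the Figure 2 circuit referenced later in this summary). It stores, for each wire, the input wires of its gate, and derives reachability, relevance, and transitive reduction from that.

```python
# Illustrative sketch, not the paper's code. inputs[w] lists the input
# wires of the gate at w, so the circuit graph has an edge (u, w) for
# each u in inputs[w].
inputs = {
    "w1": [],
    "w2": ["w1"],
    "w3": ["w1"],
    "w4": ["w2", "w3"],
    "w5": ["w1", "w3", "w4"],
}
output_wire = "w5"

def reachable_from(u, inputs):
    """Wires w such that there is a directed path from u to w."""
    children = {w: [] for w in inputs}
    for w, us in inputs.items():
        for x in us:
            children[x].append(w)
    seen, stack = set(), [u]
    while stack:
        for y in children[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def relevant_wires(inputs, output_wire):
    """Wires from which the output wire is reachable, plus the output."""
    return {u for u in inputs
            if output_wire in reachable_from(u, inputs)} | {output_wire}

def is_transitively_reduced(inputs):
    """No edge (u, w) with a directed path of length >= 2 from u to w.
    Any such path must enter w through another input mid of w."""
    return not any(
        mid in reachable_from(u, inputs)
        for w, us in inputs.items()
        for u in us
        for mid in us
        if mid != u
    )

print(relevant_wires(inputs, output_wire))  # all five wires
print(is_transitively_reduced(inputs))      # False: edge (w1, w5) is
                                            # redundant via w1 -> w3 -> w5
```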

21 Experiments: In an experiment some wires are constrained to be particular values or value distributions and the other wires are left free to take on values according to their gate functions and the values of their input wires. [sent-105, score-1.301]

22 For probabilistic circuits we consider both value injection experiments and distribution injection experiments. [sent-107, score-1.096]

23 A distribution injection experiment e is a function with domain W that maps each wire w to a special symbol ∗ or to a value distribution. [sent-108, score-1.033]

24 A value injection experiment e is a distribution injection experiment for which every value distribution assigned is deterministic—that is, always generates the same symbol. [sent-109, score-1.003]

25 If e is any experiment then a free path in e is a path in the circuit graph containing only wires w that are free in e. [sent-117, score-1.139]

26 Then a distribution injection experiment e determines a joint distribution over assignments of elements of Σ to all of the wires of the circuit, as follows. [sent-120, score-0.952]

27 If wire w is constrained then w is randomly and independently assigned a value in Σ drawn according to the value distribution e(w); in the case of a value injection experiment, this just assigns a fixed element of Σ to w. [sent-121, score-1.035]

28 If wire w is free and has probabilistic gate function f and inputs u1, . . . [sent-122, score-0.948]

29 Constrained gates and gates of fan-in zero give the base cases for the above recursive definition, which assigns an element of Σ to every wire because the circuit is acyclic. [sent-132, score-0.964]
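
The recursive sampling semantics just described can be written down directly. The sketch below is an illustrative assumption about representation (a gate maps a tuple of input values to a dict over Σ; an experiment maps each wire to "*" or such a dict), not the paper's code; the memoized recursion terminates because the circuit is acyclic.

```python
# Illustrative sketch of the sampling semantics. experiment[w] is "*"
# (free) or a dict {symbol: probability} (a value distribution; a value
# injection experiment uses only deterministic dicts such as {1: 1.0}).
# gates[w] maps a tuple of input values to such a dict.
import random

def sample_assignment(inputs, gates, experiment, rng=None):
    """Draw one joint assignment of alphabet symbols to all wires."""
    rng = rng or random.Random(0)
    values = {}

    def draw(dist):
        symbols = list(dist)
        return rng.choices(symbols, weights=[dist[s] for s in symbols])[0]

    def value_of(w):
        if w not in values:
            if experiment[w] != "*":   # constrained: draw from e(w)
                values[w] = draw(experiment[w])
            else:                      # free: evaluate the gate on its inputs
                values[w] = draw(
                    gates[w](tuple(value_of(u) for u in inputs[w])))
        return values[w]

    return {w: value_of(w) for w in inputs}
```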

30 It is easy to see that this is one of two possible outcomes for experiment e; either all wires are assigned 0 or all wires are assigned 1, and these each occur with probability 1/2. [sent-152, score-1.035]

31 Behavioral Equivalence: Two circuits C and C′ are behaviorally equivalent if they have the same set of wires, the same output wire and the same behavior, that is, for every value injection experiment e, C(e) = C′(e). [sent-163, score-1.295]

32 For ε ≥ 0, C is ε-behaviorally equivalent to C′ if they contain the same wires and the same output wire, and for every value injection experiment e, d(C(e),C′ (e)) ≤ ε, where d is the statistical distance between value distributions. [sent-167, score-0.983]
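
The distance d used here is the ordinary statistical (total variation) distance between value distributions; a standard formula, shown as a sketch rather than anything taken from the paper:

```python
def statistical_distance(p, q):
    """d(p, q) = (1/2) * sum over symbols s of |p(s) - q(s)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)

# Example: a fair coin versus a coin with bias 3/4 are at distance 1/4,
# so two circuits whose output distributions differ this way are not
# 0.2-behaviorally equivalent.
assert statistical_distance({0: 0.5, 1: 0.5}, {0: 0.25, 1: 0.75}) == 0.25
```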

33 However, behavioral equivalence is not sufficient to guarantee that two circuits have the same topology; even when all the gates are Boolean, deterministic and relevant, the circuit graph of the target circuit may not be uniquely determined by its behavior (Angluin et al. [sent-169, score-0.988]

34 Queries: The learning algorithm gets information about the target circuit by specifying a value injection experiment e and observing the element of Σ assigned to the output wire. [sent-172, score-0.853]

35 The Learning Problem: The learning problem is ε-approximate learning: by making value injection queries to a target circuit C drawn from a known class of probabilistic circuits, the goal is to find a circuit C′ that is ε-behaviorally equivalent to C. [sent-178, score-1.146]

36 The inputs to the learning algorithm are the names of the wires in C, the name of the output wire and positive numbers ε and δ, where the learning algorithm is required to succeed with probability at least (1 − δ). [sent-179, score-1.085]

37 Preliminary Results: In this section we establish some basic results about probabilistic circuits, value injection experiments and distribution injection experiments. [sent-186, score-0.884]

38 We start by considering two circuits C1 and C2 over the same wires, and distribution injection experiments e1 and e2 that agree on the distribution assigned to a wire w and that show a certain distance between C1 (e1 ) and C2 (e2 ). [sent-190, score-1.232]

39 Lemma 1 Let C1 and C2 be probabilistic circuits on wires W with the same output wire, let w ∈ W be a wire, let D be a value distribution, and let e1 and e2 be distribution injection experiments such that e1 (w) = e2 (w) = D. [sent-192, score-1.189]

40 By successively replacing each value distribution by a particular value, we may convert a distribution injection experiment that shows a certain distance between two circuits into a value injection experiment that shows at least that distance between the two circuits. [sent-196, score-1.172]

41 Lemma 2 Let C1 and C2 be probabilistic circuits on wires W with the same output wire and let e be a distribution injection experiment. [sent-197, score-1.69]

42 Thus, value injection experiments suffice to establish approximate behavioral equivalence with respect to distribution injection experiments. [sent-203, score-0.879]

43 Corollary 3 If circuits C1 and C2 are ε-behaviorally equivalent with respect to value injection experiments, then C1 and C2 are ε-behaviorally equivalent with respect to distribution injection experiments. [sent-204, score-1.029]

44 Lemma 4 Let C be a probabilistic circuit on wires W and let e1 and e2 be distribution injection experiments that agree on wires V ⊆ W . [sent-208, score-1.72]

45 Then there exist distribution injection experiments e′1 ≤ e1 and e′2 ≤ e2 such that for each wire w ∈ V, there exists a value σ ∈ Σ such that e′1(w) = e′2(w) = σ, and d(C(e′1), C(e′2)) ≥ d(C(e1), C(e2)). [sent-209, score-0.98]

46 The following lemma shows that constraining a wire w does not change the behavior of wires that are not reachable from w. [sent-216, score-1.034]

47 Lemma 5 Let C be a probabilistic circuit on wires W , let e be a distribution injection experiment, let w ∈ W be a wire free in e, and let D be a value distribution. [sent-217, score-1.857]

48 If we consider the distance between the behavior of a circuit with a wire constrained to two different value distributions, the following lemma allows us to move to a situation in which the wire is constrained to two different value distributions whose supports are disjoint. [sent-226, score-1.451]

49 A test path for a wire w, or w-test path, is a value injection experiment in which the free gates form a directed path in the circuit graph from w to the output wire. [sent-241, score-1.687]

50 A side wire with respect to a test path p is a wire fixed by p that is input to a free wire in p. [sent-243, score-1.8]

51 A w1 -test path is a value injection experiment that sets the wires of one of these paths to ∗ and the other wires to 0 or 1, for example, ∗011∗ or ∗∗0∗∗. [sent-246, score-1.553]

52 For the test path ∗011∗, the side wires are w3 and w4 , while for the test path ∗∗0∗∗ the side wire is w3 . [sent-247, score-1.206]
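
Enumerating the paths behind a test path and identifying its side wires is straightforward under the graph representation from the earlier sketch (illustrative names; `inputs` and `output_wire` are assumed from that sketch):

```python
def directed_paths(inputs, start, output_wire):
    """All directed paths from start to the output wire."""
    paths = []
    def extend(path):
        if path[-1] == output_wire:
            paths.append(path)
            return
        for v, us in inputs.items():
            if path[-1] in us:          # edge (path[-1], v) in the graph
                extend(path + [v])
    extend([start])
    return paths

def side_wires(inputs, path):
    """Wires fixed by a test path that feed some free wire on the path."""
    on_path = set(path)
    return {u for w in path for u in inputs[w] if u not in on_path}

# With the Figure-2 graph from the earlier sketch, the free path
# w1 w3 w5 has the single side wire w4 (fixed, but an input of w5).
print(side_wires(inputs, ["w1", "w3", "w5"]))   # {'w4'}
```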

53 If for some value injection experiment e, wire w free in e and alphabet symbols σ and τ it is the case that C(p|w=σ ) = C(p|w=τ ) for every test path p ≤ e then also C(e|w=σ ) = C(e|w=τ ). [sent-253, score-1.266]

54 Lemma 8 If |Σ| ≥ 3, there exists a probabilistic circuit C, value injection experiment e, wire w free in e and alphabet symbols σ and τ such that although for every test path p ≤ e for w, d(C(p|w=σ ),C(p|w=τ )) = 0, it is nevertheless the case that d(C(e|w=σ ),C(e|w=τ )) = 1/2. [sent-257, score-1.633]

55 However, the only test paths for w1 fix w3 and leave all other wires free, or fix w4 and leave all other wires free, and the two cases are symmetric. [sent-268, score-1.031]

56 If w3 is fixed to b ∈ {0, 1} and all other wires are free, then when w1 ∈ {0, 1}, w2 is a coin flip. [Figure 3: the circuit C, with gates w5 = X(w3, w4), w3 = w2, w4 = w2, w2 = T(w1), and w1 = U({0,1}); w5 is the output wire.] [sent-271, score-0.853]

57 Define Π(e, w) to be the set of all directed paths from w to the output wire on free wires in e. [sent-279, score-1.206]

58 Let S(e) be the set of wires that originate a free shortcut, that is, the set of free wires w such that there exists a path p ∈ Π(e, w) with two free wires to which w is an input. [sent-280, score-1.671]

59 As an example, if the target circuit has the circuit graph shown in Figure 2 and the experiment e leaves all wires free then Π(e, w1 ) contains the four paths w1 w5 , w1 w3 w5 , w1 w2 w4 w5 and w1 w3 w4 w5 , S(e) = {w1 , w3 }, and κ(e, w) is 2 + 4 + 2 + 4 = 12. [sent-284, score-1.354]
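
The summary quotes the example value κ(e, w1) = 2 + 4 + 2 + 4 = 12 but never states κ's definition. The sketch below assumes κ(e, w) = Σ_{p ∈ Π(e,w)} 2^|S(e) ∩ p|; this is only a guess that happens to reproduce the example, so treat it as an illustration of Π(e, w) and S(e) under the earlier graph representation.

```python
# ASSUMPTION: kappa(e, w) = sum over p in Pi(e, w) of 2 ** |S(e) ∩ p|.
# The summary does not give the definition, but this reproduces the
# quoted 2 + 4 + 2 + 4 = 12. Reuses inputs/output_wire from above.
def free_paths(inputs, free, start, output_wire):
    """Pi(e, w): directed paths from start to the output on free wires."""
    if start not in free:
        return []
    paths = []
    def extend(path):
        if path[-1] == output_wire:
            paths.append(path)
            return
        for v, us in inputs.items():
            if path[-1] in us and v in free:
                extend(path + [v])
    extend([start])
    return paths

def shortcut_origins(inputs, free, output_wire):
    """S(e): free wires w with some p in Pi(e, w) containing two free
    wires that w feeds directly."""
    S = set()
    for w in free:
        for p in free_paths(inputs, free, w, output_wire):
            if sum(1 for v in p if w in inputs[v]) >= 2:
                S.add(w)
                break
    return S

free = set(inputs)                     # the experiment e leaves all wires free
S = shortcut_origins(inputs, free, output_wire)
kappa = sum(2 ** len(S & set(p))
            for p in free_paths(inputs, free, "w1", output_wire))
print(S, kappa)                        # {'w1', 'w3'} 12
```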

60 Lemma 9 Let C be a probabilistic circuit, e be a distribution injection experiment, w and u be free wires where w is an input to u, and D0 be a value distribution. [sent-286, score-1.014]

61 Lemma 10 Let C be a Boolean probabilistic circuit, e be a distribution injection experiment, w be a wire free in e and D1 , D2 be value distributions. [sent-293, score-1.082]

62 Corollary 11 Let C be a transitively reduced Boolean probabilistic circuit, e be a distribution injection experiment, and w be a wire free in e. [sent-314, score-1.189]

63 Let U∗ be a universal set of value injection experiments such that for every set of kc log n wires and every assignment of symbols from Σ ∪ {∗} to those wires, some experiment e ∈ U∗ agrees with the values assigned to those wires. [sent-333, score-1.05]

64 For every wire w and test path p for w, there is an experiment in U∗ that leaves the path wires of p free and fixes the side wires of p to their values in p. [sent-337, score-1.801]
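
The summary asserts that such a universal set U∗ exists without giving a construction. A simple randomized construction (my assumption, not the paper's) samples uniform experiments until a union bound covers every small assignment; for d = O(log n) fixed wires and a constant-size alphabet the resulting set has polynomial size.

```python
# Randomized universal-set sketch (an assumed construction): sample
# enough uniform experiments that, by a union bound, every assignment
# of Sigma ∪ {*} to every d-subset of wires is matched by some
# experiment, except with probability delta.
import math
import random

def random_universal_set(wires, alphabet, d, delta, rng=None):
    rng = rng or random.Random(2)
    symbols = list(alphabet) + ["*"]
    p = (1.0 / len(symbols)) ** d      # one experiment hits a fixed target
    n_targets = math.comb(len(wires), d) * len(symbols) ** d
    m = math.ceil(math.log(n_targets / delta) / p)
    return [{w: rng.choice(symbols) for w in wires} for _ in range(m)]
```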

65 1895 (1) A NGLUIN , A SPNES , C HEN , E ISENSTAT AND R EYZIN Let e ∈ U∗ be a value injection experiment, w be a wire that e leaves free, and D be a value distribution. [sent-344, score-0.997]

66 While W′ ≠ W, PCB attempts to add another gate to C′ by searching for a wire w ∈ (W − W′) and a probabilistic gate g′ all of whose inputs are in W′ such that for each experiment e ∈ U∗ that leaves w free and fixes all inputs of g′, d(C(e), C(e|w=g′(e))) ≤ 2ε/(4nκ(n)). [sent-351, score-1.321]

67 The search for g′ iterates over every wire w ∈ (W − W′) and every choice of an r-tuple of distinct wires w1, . . . [sent-354, score-1.018]
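
Putting the last two sentences together, the PCB search loop has roughly the following shape. This is only a sketch: `candidates` and `query_distance` are hypothetical stand-ins for the gate enumeration and for the statistical estimate PCB obtains from repeated value injection queries.

```python
def pcb_loop(wires, candidates, experiments, threshold, query_distance):
    """Sketch of the PCB loop. candidates(w, built) yields pairs
    (input_tuple, gate) with all inputs already in built;
    query_distance(e, w, gate, ins) estimates d(C(e), C(e|w=gate(e)))
    from value injection queries; threshold plays the role of
    2*eps / (4*n*kappa(n))."""
    built, remaining = {}, set(wires)   # built is W', remaining is W - W'
    progress = True
    while remaining and progress:       # while W' != W and we can still add
        progress = False
        for w in sorted(remaining):
            for ins, g in candidates(w, set(built)):
                relevant = [e for e in experiments
                            if e[w] == "*" and all(e[u] != "*" for u in ins)]
                if all(query_distance(e, w, g, ins) <= threshold
                       for e in relevant):
                    built[w] = (ins, g)
                    remaining.remove(w)
                    progress = True
                    break
    return built
```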

68 Figure 4: A Boolean circuit with output wire w5, and some of its behavior.

Experiment   Pr[w5 = 1]
*****        1/2
0****        1/6
1****        5/6
**0**        5/12
**1**        3/4
*1*0*        1/3
01*0*        0
11*0*        2/3
*100*        1/6
*110*        1/2

[sent-394, score-0.882]

69 Lemma 13 Let C and C′ be probabilistic circuits on wires W , and let e be a distribution injection experiment. [sent-408, score-1.125]

70 Next we show that test paths are sufficient to determine whether a gate is η-correct for a wire in C. [sent-426, score-0.882]

71 To prove that PCB constructs a circuit C′ that is ε-behaviorally equivalent to the target circuit C, we show that for each wire w ∈ W , PCB assigns a gate that is ε/n-correct for w in C. [sent-435, score-1.411]

72 PCB looks for a wire w ∈ (W − W ′ ) and probabilistic gate g′ ∈ G with all of its inputs in W ′ such that for each experiment e ∈ U∗ that leaves w free and fixes all inputs of g′ , d(C(e), C(e|w=g′ (e) )) ≤ 2ε/(4nκ(n)). [sent-437, score-1.092]

73 Therefore, PCB will continue to make progress until it has assigned a gate to every wire in W , and every such gate will be ε/n-correct for its wire in C, which means that C′ will be ε-behaviorally equivalent to C. [sent-454, score-1.588]

74 For each choice, it must at worst iterate over O(n) wires in (W − W′), over all O(n^k) choices of k or fewer input wires from W′, over all |Σ|^k assignments of values to the input wires, and all experiments in U. [sent-460, score-0.978]

75 Proof For each positive integer ℓ, define the circuit Cℓ to be a chain of ℓ copies of the circuit C1 in Figure 1 with wire w4 of one copy identified with wire w1 of the next copy. [sent-467, score-1.724]

76 The wire wi,4 = A(wi,2 , wi,3 ) is the result of applying the two-input averaging probabilistic gate function A to the wires wi,2 and wi,3 . [sent-475, score-1.288]

77 To understand the operation of this circuit in response to a value injection experiment e, we may view each averaging gate as choosing one of its inputs to copy to its output. [sent-477, score-1.102]

78 Starting at the output wire, this determines a path back to the first wire whose value has been fixed, or to the wire w0,4 (which has no inputs) and the output of the circuit is the value of the wire so reached. [sent-478, score-2.105]
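
The tracing argument just described is easy to simulate for a circuit built from averaging gates; a short illustrative sketch (the names are my own):

```python
# Each averaging gate copies one of its inputs chosen uniformly at
# random, so the circuit's output can be sampled by walking back from
# the output wire until a fixed wire, or a wire with no inputs, is hit.
import random

def eval_averaging_circuit(inputs, experiment, defaults, output_wire,
                           rng=None):
    """experiment[w] is "*" or a fixed value; defaults[w] is the value
    produced by a fan-in-zero wire such as w0,4."""
    rng = rng or random.Random(1)
    w = output_wire
    while experiment[w] == "*" and inputs[w]:
        w = rng.choice(inputs[w])      # the gate at w copies this input
    return experiment[w] if experiment[w] != "*" else defaults[w]
```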

79 For each node w there exists a Boolean probabilistic circuit C whose circuit graph is G with output wire z such that for every value injection experiment e that leaves w free and for every test path p ≤ e for wire w we have d(C(e|w=1 ),C(e|w=0 )) ≥ π(e, w) · d(C(p|w=1 ),C(p|w=0 )). [sent-493, score-2.5]

80 Otherwise, the output of the circuit in response to e is determined by tracing from the output wire, following the choices of the averaging gates, until either the first wire fixed by e, or w, is reached. [sent-511, score-0.922]

81 There is a sequence of circuits D1, D2, . . . such that for each ℓ there exists a value injection experiment e and a wire w free in e such that π(e, w) = 4^ℓ and d(Dℓ(e|w=0), Dℓ(e|w=1)) = (5/7)^ℓ, but for any test path p for w, d(Dℓ(p|w=0), Dℓ(p|w=1)) = (1/7)^ℓ. [sent-520, score-1.179]

82 To construct Dℓ , we take ℓ copies of D1 and identify wire w5 in one copy with wire w1 in the next copy, making the wire w5 of the final copy the output wire of the whole circuit. [sent-537, score-2.207]

83 The following result contrasts with the case of deterministic circuits, where the Distinguishing Paths algorithm uses value injection queries to learn arbitrary transitively reduced acyclic deterministic circuits of constant fan-in over polynomial size alphabets in polynomial time (Angluin et al., 2008b). [sent-547, score-1.01]

84 Theorem 19 If BPP ≠ NP and k ≥ 4 then there is no polynomial time algorithm using value injection queries that approximately learns all acyclic transitively reduced Boolean probabilistic circuits with fan-in bounded by k. [sent-549, score-0.951]

85 Proof Suppose L is a polynomial time algorithm that approximately learns the behavior of every transitively reduced acyclic Boolean probabilistic circuit of fan-in bounded by 4 using value injection queries. [sent-550, score-1.027]

86 Each output wire gOutj has four inputs: the corresponding gadget input wire gInj and the three variable wires for the variables of the clause cj of φ. [sent-600, score-1.66]

87 The output of the whole circuit is the first output wire of the final (topmost) expander layer. [sent-608, score-0.957]

88 If a value injection experiment succeeds in setting the variable wires in every gadget layer to (possibly different) satisfying assignments for the formula φ and leaves all other wires free, then the output of C0 is 0 and the output of C1 is 1. [sent-613, score-1.612]

89 Without loss of generality we may assume that e fixes no wires other than variable wires and initial input wires, because any other fixed wires reduce the probability of reaching an initial input. [sent-623, score-1.386]

90 Otherwise, we use a random walk from the output wire in the circuit C to give an output for e. [sent-651, score-0.927]

91 In a value injection experiment, each wire is either fixed to a constant value or left free. [sent-658, score-0.97]

92 In a function injection experiment for a wire w, these possibilities are expanded to permit a transformation of the value that the wire w would take if it were left free. [sent-659, score-1.562]

93 A 1-partition is a constant function, achieving the same result as fixing the wire to a value in a value injection experiment. [sent-667, score-0.97]

94 A function injection experiment is a mapping e with domain W that assigns to each wire the symbol ∗ or a symbol from Σ or an alphabet transformation f. [sent-671, score-1.09]

95 We now define the joint probability distribution on assignments of symbols from Σ to wires determined by a function injection experiment e. [sent-675, score-0.959]
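
Relative to the earlier sampling sketch, a function injection experiment adds exactly one case: if e(w) is an alphabet transformation, the value w would take if free is drawn and then mapped through it. Again an illustrative sketch, not the paper's code.

```python
import random

def sample_with_functions(inputs, gates, experiment, rng=None):
    """Like sample_assignment, but experiment[w] may also be a callable
    f: Sigma -> Sigma applied to the value w would take if free."""
    rng = rng or random.Random(3)
    values = {}

    def draw(dist):
        symbols = list(dist)
        return rng.choices(symbols, weights=[dist[s] for s in symbols])[0]

    def value_of(w):
        if w in values:
            return values[w]
        action = experiment[w]
        if action == "*" or callable(action):
            v = draw(gates[w](tuple(value_of(u) for u in inputs[w])))
            if callable(action):       # alphabet transformation on w
                v = action(v)
        else:                          # fixed symbol (a 1-partition)
            v = action
        values[w] = v
        return v

    return {w: value_of(w) for w in inputs}
```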

96 In a function injection query (FIQ), the learning algorithm gives a function injection experiment e and receives a symbol σ assigned to the output wire of C by the probability distribution defined above. [sent-687, score-1.479]

97 A functional test path for a wire w is a function injection experiment in which the free and transformed wires are a directed path in the circuit graph from w to the output wire, and all other wires are fixed. [sent-688, score-2.534]

98 Lemma 20 Let C be a probabilistic circuit, e be a function injection experiment, w be a wire free in e and D1 , D2 be value distributions. [sent-694, score-1.069]

99 Then if e is the value injection experiment that leaves both wires free, C1 (e) = 1 = C2 (e). [sent-717, score-0.952]

100 The Distinguishing Paths Algorithm (Angluin et al., 2008b) learns transitively reduced acyclic deterministic circuits over polynomial size alphabets with constant fan-in and no depth bound using value injection queries in polynomial time, and relies on a version of the test path lemma. [sent-735, score-1.12]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wire', 0.538), ('wires', 0.458), ('injection', 0.392), ('circuit', 0.317), ('gate', 0.229), ('circuits', 0.212), ('pcb', 0.112), ('transitively', 0.108), ('paths', 0.09), ('acyclic', 0.086), ('path', 0.08), ('boolean', 0.073), ('gadget', 0.072), ('free', 0.069), ('angluin', 0.069), ('inputs', 0.062), ('eyzin', 0.06), ('ircuits', 0.06), ('isenstat', 0.06), ('ngluin', 0.06), ('spnes', 0.06), ('alphabet', 0.056), ('experiment', 0.055), ('robabilistic', 0.051), ('coin', 0.051), ('probabilistic', 0.05), ('gates', 0.049), ('expander', 0.048), ('layer', 0.044), ('xes', 0.042), ('queries', 0.04), ('behaviorally', 0.04), ('sing', 0.039), ('est', 0.037), ('hen', 0.035), ('attenuation', 0.034), ('behavioral', 0.032), ('aspnes', 0.032), ('assigned', 0.032), ('ip', 0.032), ('depth', 0.032), ('alphabets', 0.031), ('builder', 0.028), ('deterministic', 0.027), ('output', 0.027), ('leaves', 0.027), ('test', 0.025), ('clauses', 0.024), ('polynomial', 0.024), ('lev', 0.024), ('directed', 0.024), ('assignments', 0.021), ('dana', 0.021), ('reachable', 0.02), ('value', 0.02), ('symbols', 0.02), ('gin', 0.02), ('kc', 0.02), ('transformation', 0.019), ('reduced', 0.019), ('walk', 0.018), ('lemma', 0.018), ('deterministically', 0.017), ('experiments', 0.017), ('query', 0.015), ('earning', 0.015), ('clause', 0.015), ('agree', 0.015), ('symbol', 0.015), ('social', 0.015), ('assignment', 0.014), ('copy', 0.014), ('yale', 0.014), ('constrains', 0.014), ('equivalence', 0.013), ('inductive', 0.013), ('distribution', 0.013), ('jiang', 0.013), ('averaging', 0.013), ('james', 0.013), ('cu', 0.012), ('gene', 0.012), ('node', 0.012), ('bpp', 0.012), ('eout', 0.012), ('equiprobably', 0.012), ('haven', 0.012), ('reyzin', 0.012), ('tilt', 0.012), ('unsatis', 0.012), ('input', 0.012), ('exposed', 0.011), ('graph', 0.011), ('gm', 0.011), ('every', 0.011), ('nonadaptive', 0.01), ('kempe', 0.01), ('originate', 0.01), ('unbounded', 0.01), ('target', 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 44 jmlr-2009-Learning Acyclic Probabilistic Circuits Using Test Paths

Author: Dana Angluin, James Aspnes, Jiang Chen, David Eisenstat, Lev Reyzin

Abstract: We define a model of learning probabilistic acyclic circuits using value injection queries, in which fixed values are assigned to an arbitrary subset of the wires and the value on the single output wire is observed. We adapt the approach of using test paths from the Circuit Builder algorithm (Angluin et al., 2009) to show that there is a polynomial time algorithm that uses value injection queries to learn acyclic Boolean probabilistic circuits of constant fan-in and log depth. We establish upper and lower bounds on the attenuation factor for general and transitively reduced Boolean probabilistic circuits of test paths versus general experiments. We give computational evidence that a polynomial time learning algorithm using general value injection experiments may not do much better than one using test paths. For probabilistic circuits with alphabets of size three or greater, we show that the test path lemmas (Angluin et al., 2009, 2008b) fail utterly. To overcome this obstacle, we introduce function injection queries, in which the values on a wire may be mapped to other values rather than just to themselves or constants, and prove a generalized test path lemma for this case. Keywords: nonadaptive learning algorithms, probabilistic circuits, causal Bayesian networks, value injection queries, test paths

2 0.047585886 74 jmlr-2009-Properties of Monotonic Effects on Directed Acyclic Graphs

Author: Tyler J. VanderWeele, James M. Robins

Abstract: Various relationships are shown to hold between monotonic effects and weak monotonic effects and the monotonicity of certain conditional expectations. Counterexamples are provided to show that the results do not hold under less restrictive conditions. Monotonic effects are furthermore used to relate signed edges on a causal directed acyclic graph to qualitative effect modification. The theory is applied to an example concerning the direct effect of smoking on cardiovascular disease controlling for hypercholesterolemia. Monotonicity assumptions are used to construct a test for whether there is a variable that confounds the relationship between the mediator, hypercholesterolemia, and the outcome, cardiovascular disease. Keywords: Bayesian networks, conditional expectation, covariance, directed acyclic graphs, effect modification, monotonicity

3 0.039144937 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

Author: Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, Pascal Lamblin

Abstract: Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization often appears to get stuck in poor solutions. Hinton et al. recently proposed a greedy layer-wise unsupervised learning procedure relying on the training algorithm of restricted Boltzmann machines (RBM) to initialize the parameters of a deep belief network (DBN), a generative model with many layers of hidden causal variables. This was followed by the proposal of another greedy layer-wise procedure, relying on the usage of autoassociator networks. In the context of the above optimization problem, we study these algorithms empirically to better understand their success. Our experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input. We also present a series of experiments aimed at evaluating the link between the performance of deep neural networks and practical aspects of their topology, for example, demonstrating cases where the addition of more depth helps. Finally, we empirically explore simple variants of these training algorithms, such as the use of different RBM input unit distributions, a simple way of combining gradient estimators to improve performance, as well as on-line versions of those algorithms. Keywords: artificial neural networks, deep belief networks, restricted Boltzmann machines, autoassociators, unsupervised learning

4 0.029639769 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors

Author: Mathias Drton, Michael Eichler, Thomas S. Richardson

Abstract: In recursive linear models, the multivariate normal joint distribution of all variables exhibits a dependence structure induced by a recursive (or acyclic) system of linear structural equations. These linear models have a long tradition and appear in seemingly unrelated regressions, structural equation modelling, and approaches to causal inference. They are also related to Gaussian graphical models via a classical representation known as a path diagram. Despite the models’ long history, a number of problems remain open. In this paper, we address the problem of computing maximum likelihood estimates in the subclass of ‘bow-free’ recursive linear models. The term ‘bow-free’ refers to the condition that the errors for variables i and j be uncorrelated if variable i occurs in the structural equation for variable j. We introduce a new algorithm, termed Residual Iterative Conditional Fitting (RICF), that can be implemented using only least squares computations. In contrast to existing algorithms, RICF has clear convergence properties and yields exact maximum likelihood estimates after the first iteration whenever the MLE is available in closed form. Keywords: linear regression, maximum likelihood estimation, path diagram, structural equation model, recursive semi-Markov model, residual iterative conditional fitting

5 0.025413299 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning

Author: Vitaly Feldman

Abstract: We study the properties of the agnostic learning framework of Haussler (1992) and Kearns, Schapire, and Sellie (1994). In particular, we address the question: is there any situation in which membership queries are useful in agnostic learning? Our results show that the answer is negative for distribution-independent agnostic learning and positive for agnostic learning with respect to a specific marginal distribution. Namely, we give a simple proof that any concept class learnable agnostically by a distribution-independent algorithm with access to membership queries is also learnable agnostically without membership queries. This resolves an open problem posed by Kearns et al. (1994). For agnostic learning with respect to the uniform distribution over {0, 1}n we show a concept class that is learnable with membership queries but computationally hard to learn from random examples alone (assuming that one-way functions exist). Keywords: agnostic learning, membership query, separation, PAC learning

6 0.024166347 81 jmlr-2009-Robust Process Discovery with Artificial Negative Events    (Special Topic on Mining and Learning with Graphs and Relations)

7 0.023047645 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

8 0.020558054 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

9 0.020420367 6 jmlr-2009-An Algorithm for Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies Weak Transitivity

10 0.017795747 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

11 0.017678564 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning

12 0.016149687 50 jmlr-2009-Learning When Concepts Abound

13 0.014306322 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations

14 0.0140585 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression

15 0.01358219 54 jmlr-2009-Markov Properties for Linear Causal Models with Correlated Errors    (Special Topic on Causality)

16 0.013381064 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

17 0.013131997 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

18 0.013017583 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

19 0.012684844 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

20 0.01236041 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.064), (1, -0.033), (2, 0.03), (3, 0.032), (4, -0.008), (5, -0.034), (6, -0.082), (7, -0.025), (8, 0.033), (9, 0.005), (10, -0.036), (11, -0.001), (12, 0.027), (13, 0.001), (14, -0.156), (15, -0.128), (16, -0.083), (17, 0.028), (18, -0.041), (19, -0.101), (20, -0.064), (21, -0.12), (22, -0.028), (23, 0.005), (24, 0.193), (25, -0.007), (26, -0.148), (27, 0.005), (28, 0.164), (29, -0.099), (30, 0.136), (31, -0.045), (32, -0.258), (33, 0.024), (34, 0.253), (35, -0.376), (36, 0.125), (37, 0.241), (38, 0.269), (39, -0.224), (40, 0.051), (41, 0.118), (42, 0.136), (43, 0.143), (44, -0.019), (45, -0.024), (46, 0.115), (47, -0.06), (48, 0.115), (49, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97853756 44 jmlr-2009-Learning Acyclic Probabilistic Circuits Using Test Paths

Author: Dana Angluin, James Aspnes, Jiang Chen, David Eisenstat, Lev Reyzin

Abstract: We define a model of learning probabilistic acyclic circuits using value injection queries, in which fixed values are assigned to an arbitrary subset of the wires and the value on the single output wire is observed. We adapt the approach of using test paths from the Circuit Builder algorithm (Angluin et al., 2009) to show that there is a polynomial time algorithm that uses value injection queries to learn acyclic Boolean probabilistic circuits of constant fan-in and log depth. We establish upper and lower bounds on the attenuation factor for general and transitively reduced Boolean probabilistic circuits of test paths versus general experiments. We give computational evidence that a polynomial time learning algorithm using general value injection experiments may not do much better than one using test paths. For probabilistic circuits with alphabets of size three or greater, we show that the test path lemmas (Angluin et al., 2009, 2008b) fail utterly. To overcome this obstacle, we introduce function injection queries, in which the values on a wire may be mapped to other values rather than just to themselves or constants, and prove a generalized test path lemma for this case. Keywords: nonadaptive learning algorithms, probabilistic circuits, causal Bayesian networks, value injection queries, test paths

2 0.4110229 74 jmlr-2009-Properties of Monotonic Effects on Directed Acyclic Graphs

Author: Tyler J. VanderWeele, James M. Robins

Abstract: Various relationships are shown to hold between monotonic effects and weak monotonic effects and the monotonicity of certain conditional expectations. Counterexamples are provided to show that the results do not hold under less restrictive conditions. Monotonic effects are furthermore used to relate signed edges on a causal directed acyclic graph to qualitative effect modification. The theory is applied to an example concerning the direct effect of smoking on cardiovascular disease controlling for hypercholesterolemia. Monotonicity assumptions are used to construct a test for whether there is a variable that confounds the relationship between the mediator, hypercholesterolemia, and the outcome, cardiovascular disease. Keywords: Bayesian networks, conditional expectation, covariance, directed acyclic graphs, effect modification, monotonicity

3 0.24363342 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

Author: Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, Pascal Lamblin

Abstract: Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization often appears to get stuck in poor solutions. Hinton et al. recently proposed a greedy layer-wise unsupervised learning procedure relying on the training algorithm of restricted Boltzmann machines (RBM) to initialize the parameters of a deep belief network (DBN), a generative model with many layers of hidden causal variables. This was followed by the proposal of another greedy layer-wise procedure, relying on the usage of autoassociator networks. In the context of the above optimization problem, we study these algorithms empirically to better understand their success. Our experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input. We also present a series of experiments aimed at evaluating the link between the performance of deep neural networks and practical aspects of their topology, for example, demonstrating cases where the addition of more depth helps. Finally, we empirically explore simple variants of these training algorithms, such as the use of different RBM input unit distributions, a simple way of combining gradient estimators to improve performance, as well as on-line versions of those algorithms. Keywords: artificial neural networks, deep belief networks, restricted Boltzmann machines, autoassociators, unsupervised learning

4 0.16963695 6 jmlr-2009-An Algorithm for Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies Weak Transitivity

Author: Jose M. Peña, Roland Nilsson, Johan Björkegren, Jesper Tegnér

Abstract: We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that for any undirected graph G there exists a strictly positive discrete probability distribution with the prescribed sample spaces that is faithful to G. We also report an algorithm that implements the graphical criterion and whose running time is considered to be at most O(n2 (e + n)) for n nodes and e edges. Finally, we illustrate how the graphical criterion can be used within bioinformatics to identify biologically meaningful gene dependencies. Keywords: graphical models, vertex separation, graphoids, weak transitivity, bioinformatics

5 0.16192114 81 jmlr-2009-Robust Process Discovery with Artificial Negative Events    (Special Topic on Mining and Learning with Graphs and Relations)

Author: Stijn Goedertier, David Martens, Jan Vanthienen, Bart Baesens

Abstract: Process discovery is the automated construction of structured process models from information system event logs. Such event logs often contain positive examples only. Without negative examples, it is a challenge to strike the right balance between recall and specificity, and to deal with problems such as expressiveness, noise, incomplete event logs, or the inclusion of prior knowledge. In this paper, we present a configurable technique that deals with these challenges by representing process discovery as a multi-relational classification problem on event logs supplemented with Artificially Generated Negative Events (AGNEs). This problem formulation allows using learning algorithms and evaluation techniques that are well-know in the machine learning community. Moreover, it allows users to have a declarative control over the inductive bias and language bias. Keywords: graph pattern discovery, inductive logic programming, Petri net, process discovery, positive data only

6 0.15952864 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning

7 0.1511725 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

8 0.13190417 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors

9 0.13053863 50 jmlr-2009-Learning When Concepts Abound

10 0.10655661 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

11 0.10250433 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

12 0.095799215 67 jmlr-2009-Online Learning with Sample Path Constraints

13 0.093822658 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

14 0.082038768 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

15 0.079503894 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

16 0.07945025 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

17 0.077465162 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks

18 0.073876671 46 jmlr-2009-Learning Halfspaces with Malicious Noise

19 0.072476298 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

20 0.070718132 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.497), (8, 0.01), (11, 0.043), (19, 0.017), (38, 0.022), (39, 0.017), (47, 0.014), (52, 0.039), (55, 0.028), (58, 0.028), (66, 0.085), (68, 0.021), (90, 0.033), (96, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.70154309 44 jmlr-2009-Learning Acyclic Probabilistic Circuits Using Test Paths

Author: Dana Angluin, James Aspnes, Jiang Chen, David Eisenstat, Lev Reyzin

Abstract: We define a model of learning probabilistic acyclic circuits using value injection queries, in which fixed values are assigned to an arbitrary subset of the wires and the value on the single output wire is observed. We adapt the approach of using test paths from the Circuit Builder algorithm (Angluin et al., 2009) to show that there is a polynomial time algorithm that uses value injection queries to learn acyclic Boolean probabilistic circuits of constant fan-in and log depth. We establish upper and lower bounds on the attenuation factor for general and transitively reduced Boolean probabilistic circuits of test paths versus general experiments. We give computational evidence that a polynomial time learning algorithm using general value injection experiments may not do much better than one using test paths. For probabilistic circuits with alphabets of size three or greater, we show that the test path lemmas (Angluin et al., 2009, 2008b) fail utterly. To overcome this obstacle, we introduce function injection queries, in which the values on a wire may be mapped to other values rather than just to themselves or constants, and prove a generalized test path lemma for this case. Keywords: nonadaptive learning algorithms, probabilistic circuits, causal Bayesian networks, value injection queries, test paths

2 0.57991362 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

Author: Han Liu, John Lafferty, Larry Wasserman

Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult

3 0.22578868 13 jmlr-2009-Bounded Kernel-Based Online Learning

Author: Francesco Orabona, Joseph Keshet, Barbara Caputo

Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set

4 0.22169253 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

Author: Junning Li, Z. Jane Wang

Abstract: In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g., in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications. However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. We prove that the proposed method can control the FDR under a user-specified level at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. Keywords: Bayesian networks, false discovery rate, PC algorithm, directed acyclic graph, skeleton

5 0.22102621 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil

Abstract: We consider a general class of regularization methods which learn a vector of parameters on the basis of linear measurements. It is well known that if the regularizer is a nondecreasing function of the L2 norm, then the learned vector is a linear combination of the input data. This result, known as the representer theorem, lies at the basis of kernel-based methods in machine learning. In this paper, we prove the necessity of the above condition, in the case of differentiable regularizers. We further extend our analysis to regularization methods which learn a matrix, a problem which is motivated by the application to multi-task learning. In this context, we study a more general representer theorem, which holds for a larger class of regularizers. We provide a necessary and sufficient condition characterizing this class of matrix regularizers and we highlight some concrete examples of practical importance. Our analysis uses basic principles from matrix theory, especially the useful notion of matrix nondecreasing functions. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, regularization

6 0.21609785 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

7 0.21588796 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

8 0.21517459 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

9 0.21440031 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

10 0.21432221 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

11 0.21337225 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

12 0.21239156 46 jmlr-2009-Learning Halfspaces with Malicious Noise

13 0.20895055 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

14 0.20888191 38 jmlr-2009-Hash Kernels for Structured Data

15 0.20887333 78 jmlr-2009-Refinement of Reproducing Kernels

16 0.20814764 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

17 0.20757899 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning

18 0.20745084 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

19 0.2072541 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

20 0.20722179 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning