nips nips2007 nips2007-64 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hai L. Chieu, Wee S. Lee, Yee W. Teh
Abstract: We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. [sent-10, score-0.192]
2 We show that it outperforms all approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem. [sent-11, score-0.246]
3 As energy minimization in general MRFs is computationally intractable, approximate inference algorithms based on belief propagation are often necessary in practice. [sent-15, score-0.319]
4 Prior work (e.g. [16, 8]) on sufficient conditions for convergence of sum-product algorithms suggests that such algorithms converge better on MRFs containing potentials with small strengths. [sent-21, score-0.112]
5 In this paper, we propose a novel algorithm, called Relaxed Survey Propagation (RSP), based on performing the sum-product algorithm on a relaxed MRF. [sent-22, score-0.095]
6 In the relaxed MRF, there is a parameter vector y that can be optimized for convergence. [sent-23, score-0.095]
7 The relaxed MRF is built in two steps: (i) converting the energy minimization problem into its weighted MAX-SAT equivalent [17], and then (ii) constructing a relaxed version of the survey propagation MRF proposed in [14]. [sent-25, score-0.494]
8 We prove that the relaxed MRF has a joint distribution approximately equal to that of the original MRF (and hence the same MAP and marginals), largely independent of the parameter vector y. [sent-26, score-0.116]
9 For sum-product algorithms, we compare against residual belief propagation [6] as a state-of-the-art asynchronous belief propagation algorithm, as well as against tree-structured expectation propagation [15], which has been shown to be a special case of generalized belief propagation [19]. [sent-29, score-0.836]
10 We show that RSP outperforms all approaches for Ising models with mixed couplings, as well as in a supervised clustering application for web person disambiguation. [sent-30, score-0.174]
11 Figure 1 (caption fragment): the clauses α1 to α4 in (b) are entries of the factor a in (a). [sent-32, score-0.296]
12 The relaxed MRF for RSP has a similar form to the graph in (b). [sent-36, score-0.134]
13 For Xi ∈ V and Xa ⊆ V , we denote by xi the event that Xi = xi , and by xa the event Xa = xa . [sent-41, score-0.454]
14 When each Φa is a positive function, the joint distribution can be written as P(x) = (1/Z) exp(−E(x)/T), where E(x) = −Σa log Φa(xa) is the energy function and the temperature T is set to 1. [sent-44, score-0.136]
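To make the energy formulation concrete, here is a minimal sketch (not from the paper; the toy factors are invented for illustration) that enumerates a two-variable MRF, computes E(x) = −Σa log Φa(xa), and normalizes exp(−E(x)/T):

```python
import itertools
import math

# Toy pairwise MRF over two binary variables X1, X2 (values 0/1).
# The factor tables below are illustrative, not taken from the paper.
phi_1 = {0: 1.0, 1: 2.0}                              # single-variable factor on X1
phi_12 = {(0, 0): 3.0, (0, 1): 0.5,
          (1, 0): 0.5, (1, 1): 3.0}                   # pairwise factor on (X1, X2)

def energy(x1, x2):
    # E(x) = -sum_a log Phi_a(x_a)
    return -(math.log(phi_1[x1]) + math.log(phi_12[(x1, x2)]))

T = 1.0
weights = {x: math.exp(-energy(*x) / T)
           for x in itertools.product([0, 1], repeat=2)}
Z = sum(weights.values())                             # partition function
P = {x: w / Z for x, w in weights.items()}            # P(x) = exp(-E(x)/T) / Z
print(P)
```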
15 A factor graph [13] is a graphical representation of an MRF, in the form of a bipartite graph with two types of nodes: variables and factors. [sent-45, score-0.147]
16 Weighted MAX-SAT conversion [17]: Before describing RSP, we describe the weighted MAX-SAT (WMS) conversion of the energy minimization problem for an MRF. [sent-48, score-0.167]
17 Each clause is a set of literals (a literal being a variable or its negation), and is satisfied if at least one of its literals evaluates to 1. [sent-51, score-0.312]
18 In WMS, each clause has a weight, and the WMS problem consists of finding a configuration with maximum total weight of satisfied clauses (called the weight of the configuration). [sent-53, score-0.493]
19 For each Xi ∈ V and each xi ∈ Q, introduce the variable σ(i,xi) ∈ B as the predicate that Xi = xi. [sent-56, score-0.158]
20 For convenience, we index variables in B either by k or by (i, xi), denote factors in F with Roman letters (e.g. a, b, c), and denote clauses in C with Greek letters (e.g. α, β). [sent-57, score-0.18]
22 For a clause α ∈ C, we denote by C(α) the set of variables in α. [sent-62, score-0.266]
23 There are two types of clauses in C: interaction and positivity clauses. [sent-63, score-0.431]
24 Interaction clauses: For each entry Φa(xa) of a factor Φa ∈ F, introduce the clause α (we write α ∈ a to indicate that α comes from the factor a) defined as α = ∨xi∈xa ¬σ(i,xi), with the weight wα = −log(Φa(xa)). [sent-65, score-0.46]
25 The violation of an interaction clause corresponding to Φa (xa ) entails that σ(i,xi ) = 1 for all xi ∈ xa . [sent-67, score-0.511]
26 Positivity clauses: for each Xi ∈ V, introduce the clause β(i) = ∨xi∈Q σ(i,xi) with weight wβ(i) = 2 Σα∈Ci wα, where Ci is the set of interaction clauses containing any variable in {σ(i,xi)}xi∈Q. [sent-70, score-0.57]
27 For Xi ∈ V, we denote by β(i) the corresponding positivity clause in C, and for a positivity clause β ∈ C, we write src(β) for the corresponding variable in V. [sent-71, score-0.688]
28 Positivity clauses have large weights to ensure that for each Xi ∈ V , at least one predicate in {σ(i,xi ) }xi ∈Q equals 1. [sent-72, score-0.35]
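As an illustration of this conversion, the following sketch builds interaction and positivity clauses for a single invented pairwise factor; the data structures and the exact positivity-weight convention are assumptions made for the example, not the paper's implementation:

```python
import math
from itertools import product

# Hypothetical toy MRF: one pairwise factor Phi_a over (X1, X2), each variable
# taking values in Q = [0, 1].  A literal is ((variable, value), positive?).
Q = [0, 1]
phi_a = {(0, 0): 3.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 3.0}

interaction_clauses = []
for xa in product(Q, repeat=2):
    # clause for entry Phi_a(xa): a disjunction of *negated* predicates, so it is
    # violated exactly when sigma_(1,xa[0]) = sigma_(2,xa[1]) = 1
    literals = [((1, xa[0]), False), ((2, xa[1]), False)]
    interaction_clauses.append((literals, -math.log(phi_a[xa])))

# Positivity clause for each variable: at least one value predicate equals 1.
# The weight is made large relative to the interaction-clause weights; tying it to
# twice the sum of absolute interaction weights is an assumption for this sketch.
big = 2 * sum(abs(w) for _, w in interaction_clauses)
positivity_clauses = [([((i, v), True) for v in Q], big) for i in (1, 2)]

print(len(interaction_clauses), "interaction clauses,",
      len(positivity_clauses), "positivity clauses")
```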
29 There are two types of invalid configurations: MTO configurations, where more than one variable in the set {σ(i,xi)}xi∈Q equals 1, and AZ configurations, where all variables in the set equal zero. [sent-76, score-0.153]
30 For valid configurations σ, and for each a ∈ F, exactly one interaction clause in {α}α∈a is violated: when the α corresponding to Φa(xa) is violated, we have Xa = xa in x(σ). [sent-78, score-0.481]
31 Valid configurations have locally maximal weights [17]: MTO configurations have low weights since in all interaction clauses, variables appear as negative literals. [sent-79, score-0.136]
32 AZ configurations have low weights because they violate the positivity clauses. [sent-80, score-0.157]
33 3 Relaxed Survey Propagation In this section, we transform the WMS problem W = (B, C) into another MRF, Gs = (Vs , Fs ), based on the construction of the MRF for survey propagation [14]. [sent-82, score-0.195]
34 In survey propagation, in addition to the values {0, 1}, variables can take a third value, * (“joker” state), signifying that the variable is free to take either 0 or 1, without violating any clause. [sent-85, score-0.126]
35 [14] A variable σk is constrained by the clause α ∈ C if it is the unique satisfying variable for clause α (all other variables violate α). [sent-88, score-0.516]
36 An assignment σ is invalid for clause α if and only if all variables are unsatisfying except for exactly one for which σk = ∗. [sent-91, score-0.321]
37 Define VALα(σC(α)) = exp(wα) if σC(α) satisfies α, exp(−ysrc(α)) if σC(α) violates α, and 0 if σC(α) is invalid (Equation 1). The term exp(−ysrc(α)) is the penalty for violating clauses, with src(α) ∈ V ∪ F defined in Definitions 1 and 2. [sent-93, score-0.109]
38 For interaction clauses, we index ya by a ∈ F because, among valid configurations, exactly one clause in the group {α}α∈a is violated, and exp(−ya) becomes a constant factor. [sent-94, score-0.522]
39 Positivity clauses are always satisfied and the penalty factor will not appear for valid configurations. [sent-95, score-0.376]
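A minimal sketch of the clause compatibility in Equation 1; the data structures, and the handling of assignments with more than one joker variable, are assumptions made for illustration rather than the paper's code:

```python
import math

# A clause is a list of (variable_key, positive?) literals; an assignment maps each
# variable key to 0, 1, or '*' (the joker state).
def literal_satisfied(value, positive):
    return value in (0, 1) and (value == 1 if positive else value == 0)

def val_alpha(clause, assignment, w_alpha, y_src):
    values = [(assignment[k], pos) for k, pos in clause]
    if any(literal_satisfied(v, pos) for v, pos in values):
        return math.exp(w_alpha)          # clause satisfied
    stars = sum(1 for v, _ in values if v == '*')
    if stars == 1:
        return 0.0                        # invalid: all unsatisfying except exactly one '*'
    if stars == 0:
        return math.exp(-y_src)           # clause violated: pay the penalty exp(-y)
    # Two or more jokers and no satisfying 0/1 variable: treated here as "not violated";
    # see [14] for the precise convention in the survey propagation MRF.
    return math.exp(w_alpha)
```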
40 Following [14], define the parent set Pk of a variable σk to be the set of clauses for which σk is the unique satisfying variable. [sent-97, score-0.263]
41 The clause compatibilities introduce the clause weights and penalties into the joint distribution. [sent-106, score-0.552]
42 The factor graph has the following underlying distribution: P({σk, Pk}k) ∝ ω0^n0 ω∗^n∗ Πα∈SAT(σ) exp(wα) Πα∈UNSAT(σ) exp(−ysrc(α)) (Equation 4), where n0 is the number of unconstrained variables in σ, and n∗ the number of variables taking ∗. [sent-107, score-0.199]
43 In the limit where all ya , yi → ∞, RSP is equivalent to SP-ρ [14], with ρ = ω∗ . [sent-109, score-0.179]
44 Taking y to infinity corresponds to disallowing violated constraints, and SP-ρ was formulated for satisfiable SAT problems, where violated constraints are forbidden. [sent-110, score-0.19]
45 In this case, all clauses must be satisfied, the term Πα∈C exp(wα) in Equation 4 is a constant, and P(σ) ∝ ω0^n0 ω∗^n∗. [sent-111, score-0.263]
46 3.1 Main result. In the following, we assume the following settings: (1) ω∗ = 1 and ω0 = 0; (2) for positivity clauses β(i), let yi = 0; and (3) in the original MRF G = (V, F), single-variable factors are defined on all variables (we can always define uniform factors). [sent-113, score-0.468]
47 Under these settings, we will prove the main result that the joint distribution on the relaxed MRF is approximately equal to that on the original MRF, and that RSP estimates marginals on the original MRF. [sent-114, score-0.204]
48 The joint probability over valid configurations on Gs is proportional to the joint probability of the corresponding configurations on the original MRF, G = (V, F ). [sent-116, score-0.101]
49 For valid configurations, all positivity clauses are satisfied, and for each a ∈ F, all valid configurations have one violated constraint in the set of interaction clauses {α}α∈a. [sent-118, score-0.893]
50 Hence the penalty term for violated constraints, Πa∈F exp(−ya), is a constant factor. [sent-119, score-0.102]
51 Single-variable factors on G translate into single-literal clauses in the WMS formulation, which in turn become single-variable factors in Gs. [sent-123, score-0.373]
52 In MTO configurations, there exist i and xi ≠ xi′ such that σ(i,xi) = σ(i,xi′) = 1. [sent-129, score-0.178]
53 The positivity clause β(i) is hence non-constraining for these variables, and since all other clauses connected to them are interaction clauses and contain them as negative literals, both variables are unconstrained. [sent-130, score-0.96]
54 Assuming that exp(wβ(i)) ≫ 1 for all Xi ∈ V, the joint distribution over the relaxed MRF Gs = (Vs, Fs) is approximately equal to the joint distribution over the original MRF, G = (V, F). [sent-133, score-0.137]
55 Moreover, RSP estimates the marginals on the original MRF: at the fixed points, the belief at each node, B(σ(i,xi) = 1), is an estimate of P(Xi = xi), and Σxi∈Q B(σ(i,xi) = 1) ≈ 1. [sent-134, score-0.266]
56 We can understand the above theorem as follows: it effectively assumes that the probability of AZ invalid configurations is negligible (equivalent to assuming that the probability of violating positivity clauses is negligible). [sent-135, score-0.503]
57 Since the positivity clauses have large weights, the conditions exp(wβ(i)) ≫ 1 are usually satisfied. [sent-139, score-0.377]
58 Hence RSP, as the sum-product algorithm on the relaxed MRF, returns estimates of the marginals P(Xi = xi) as B(σ(i,xi) = 1). [sent-140, score-0.272]
59 Empirically, we observe that for a large range of values of {ya}a∈F, RSP returns marginals satisfying Σxi∈Q B(σ(i,xi) = 1) ≈ 1, indicating that AZ configurations do indeed have negligible probability. [sent-144, score-0.215]
60 To calculate N(Ψα) for an interaction clause α, we characterize these factors as follows: Ψα((σk, Pk), (σl, Pl)) = exp(−ysrc(α)) if clause α is violated, i.e. (σk, σl) = (0, 0), and exp(wα) otherwise (Equation 6). [sent-149, score-0.569]
61 As ya is shared among the clauses α ∈ a, we choose ya to minimize Σα∈a N(Ψα) = Σα∈a tanh(¼ |wα + ya|). [sent-151, score-0.382]
62 A good approximation for ya would be the median of {−wα}α∈a. [sent-152, score-0.179]
63 For our experiments, we divide the search range for ya into 10 bins, and use fminsearch in Matlab to find a local minimum. [sent-153, score-0.179]
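A small sketch of this choice of ya; the clause weights are illustrative, scipy's bounded scalar minimizer stands in for the Matlab fminsearch used in the paper, and the search window around the median is an assumption:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical interaction-clause weights {w_alpha} for one factor a;
# in the WMS conversion, w_alpha = -log(Phi_a(x_a)).
w = -np.log(np.array([3.0, 0.5, 0.5, 3.0]))

def total_strength(y_a, w_alpha):
    # sum over alpha in a of tanh(|w_alpha + y_a| / 4), the quantity minimized over y_a
    return np.sum(np.tanh(np.abs(w_alpha + y_a) / 4.0))

y_median = np.median(-w)                       # cheap initial guess from the text
res = minimize_scalar(total_strength, args=(w,),
                      bounds=(y_median - 5.0, y_median + 5.0), method='bounded')
print("median guess:", y_median, "refined y_a:", res.x)
```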
64 The interaction clause α sends a number να→(i,x) to each (i, x) ∈ C(α), and the positivity clause β(i) sends a number µβ(i)→(i,x) to (i, x) for each x ∈ Q. [sent-157, score-0.661]
65 This seems to work better than the schedule defined by residual belief propagation [6] on the relaxed MRF. [sent-160, score-0.381]
66 4 Experimental Results While Ising models with attractive couplings are exactly solvable by graph-cut algorithms, general Ising models with mixed couplings on complete graphs are NP-hard [4], and graph cut algorithms are not applicable to graphs with mixed couplings [12]. [sent-163, score-0.584]
67 Figure 2: (a) a 4-node (binary) complete graph with coupling matrix Ω, one of whose entries is the varied parameter ω; the pairwise factors are defined as Ψi,j(xi, xj) = exp(Ωi,j/4) if xi = xj and exp(−Ωi,j/4) if xi ≠ xj, under two settings of the coupling parameter, +1 and −1; (b) the +1 setting; (c) the −1 setting. [sent-169, score-0.272]
68 Figures 2(b) and 2(c) show the L2 distance between the returned marginals and the nearest mode of the graph. [sent-170, score-0.11]
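For concreteness, a tiny sketch of these pairwise factor tables (binary variables indexed 0/1; the coupling value passed in is arbitrary):

```python
import numpy as np

def pairwise_potential(omega_ij):
    # Psi_ij(x_i, x_j) = exp(+omega_ij/4) if x_i == x_j, exp(-omega_ij/4) otherwise
    same = np.exp(omega_ij / 4.0)
    diff = np.exp(-omega_ij / 4.0)
    return np.array([[same, diff],
                     [diff, same]])   # rows/cols indexed by x_i, x_j in {0, 1}

print(pairwise_potential(2.0))
```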
69 In each case, we vary ω from 0 to 12, and for each ω, run residual belief propagation (RBP) damped at 0. [sent-173, score-0.258]
70 We plot the L2 distance between the returned marginals and the nearest mode marginals (marginals with probability one on the modes). [sent-176, score-0.198]
71 As ω is increased, for the +1 setting in Figure 2(b), both approaches converge to marginals with probability 1 on one of the modes. [sent-181, score-0.139]
72 For the −1 setting, however, RSP again converges to marginals indicating a mode, while RBP faces convergence problems for ω ≥ 8. [sent-182, score-0.15]
73 When the algorithms converge for large ω, they converge not to the correct marginals, but to a MAP configuration. [sent-184, score-0.102]
74 Ising models with mixed couplings: we conduct experiments on complete graphs of size 20 with different percentages of attractive couplings, using the Ising model with the energy function H(s) = −Σi,j θi,j si sj − Σi θi si, where si ∈ {−1, 1}. [sent-188, score-0.264]
75 To control the percentage of attractive couplings, we draw θi,j from U [0, α], and randomly assign negative signs to the θi,j with probability (1 − ρ), where ρ is the percentage of attractive couplings required. [sent-191, score-0.271]
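A minimal sketch of this construction; the local-field distribution and the random seed handling are assumptions, since the extracted text only specifies how the couplings θi,j are drawn:

```python
import numpy as np

def random_mixed_ising(n=20, alpha=1.0, rho=0.5, seed=0):
    # Couplings theta_ij ~ U[0, alpha], made negative with probability (1 - rho),
    # so a fraction rho of the couplings is attractive.
    rng = np.random.default_rng(seed)
    theta = np.triu(rng.uniform(0.0, alpha, size=(n, n)), k=1)
    signs = np.where(rng.random((n, n)) < rho, 1.0, -1.0)
    J = theta * signs
    J = J + J.T                                   # symmetric coupling matrix, zero diagonal
    h = rng.uniform(-0.1, 0.1, size=n)            # local fields (distribution assumed)
    return J, h

def ising_energy(s, J, h):
    # H(s) = -sum_{i<j} theta_ij s_i s_j - sum_i theta_i s_i, with s_i in {-1, +1}
    return -0.5 * s @ J @ s - h @ s

J, h = random_mixed_ising()
s = np.random.default_rng(1).choice([-1, 1], size=20).astype(float)
print(ising_energy(s, J, h))
```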
76 For RSP, RBP and TEP, which are variants of the sum-product algorithm, we lower the temperature by a factor of 2 each time the method converges, and stop when the method fails to converge or when the results do not improve over the last temperature. [sent-201, score-0.146]
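A sketch of this cooling schedule as a generic control loop; the run_at_temperature callable and its return convention are hypothetical stand-ins for running one of the sum-product variants and reading off the energy of its decoding:

```python
def cooled_inference(run_at_temperature, t0=1.0, factor=2.0, max_rounds=20):
    # run_at_temperature(T) -> (converged, energy) is a hypothetical callable that runs
    # the inference method at temperature T and returns the energy of its decoding.
    best_energy = None
    T = t0
    for _ in range(max_rounds):
        converged, energy = run_at_temperature(T)
        if not converged:
            break                       # stop once the method fails to converge
        if best_energy is not None and energy >= best_energy:
            break                       # stop once the result no longer improves
        best_energy = energy
        T /= factor                     # lower the temperature by a factor of 2
    return best_energy
```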
77 While the performance of TRW-S remains constant from 25% to 75%, the sum-product based algorithms (RBP and TEP) improve as the percentage of attractive potentials is increased. [sent-203, score-0.117]
78 TEP, being a form of generalized belief propagation [19], runs significantly slower than RSP. [sent-205, score-0.21]
79 In training SVM^cluster, they have to minimize E(x) = Σi,j Sim(i, j) δ(xi, xj), where the xi are cluster labels. [sent-207, score-0.112]
80 Figure 3: Experiments on the complete-graph Ising model with mixed couplings (legend in (a)), with different percentages of attractive couplings: (a) 75%, (b) 50%, (c) 25%. [sent-214, score-0.274]
81 Table 1 (fragment): columns are α, MBP, TRW-S, RBP, TEP and OPT; the row for α = 1 reads 2/0, 26/0, 1/0, 2/0, 0/0 (75% attractive couplings). [sent-216, score-0.45]
82 We are able to obtain lower energy configurations by the recursive 2-way partitioning procedure in [5] used for graph cuts. [sent-226, score-0.125]
83 We use the web person disambiguation task defined in SemEval-2007 [1] as the test application. [sent-233, score-0.167]
84 Table 2 (fragment): columns are Method; number of test domains where RSP attains lower/higher energy E(x) than Method; percentage of convergence over all runs; and F-measure of purity and inverse purity [1]. The row for RSP reads 0/0, 91%, 75.78%. [sent-246, score-0.163]
85 Table 2: Results for the web person disambiguation task. [sent-250, score-0.167]
86 SP-ρ is the sum-product interpretation of the survey propagation (SP) algorithm [3]. [sent-252, score-0.195]
87 In RSP, we took inspiration from the SP-y algorithm [2] in adding a penalty term for violated clauses. [sent-255, score-0.102]
88 SP-y works on MAX-SAT problems, and SP can be considered as SP-y with y taken to ∞, hence disallowing violated constraints. [sent-256, score-0.109]
89 We show that while RSP is the sum-product algorithm on a relaxed MRF, it can be used for solving the energy minimization problem. [sent-260, score-0.204]
90 By tuning the strengths of the factors (based on convergence criteria in [16]) while keeping the underlying distribution approximately correct, RSP converges well even at low temperatures. [sent-261, score-0.146]
91 As far as we know, this is the first application of convergence criteria to aid convergence of belief propagation algorithms, and this mechanism can be used to exploit future work on sufficient conditions for the convergence of belief propagation algorithms. [sent-263, score-0.507]
92 Acknowledgments We would like to thank Yee Fan Tan for his help on the web person disambiguation task, and Tomas Lozano-Perez and Leslie Pack Kaelbling for valuable comments on the paper. [sent-264, score-0.167]
93 References [1] “Web person disambiguation task at SemEval,” 2007. [sent-266, score-0.119]
94 Koller, “Residual belief propagation: Informed scheduling for asynchronous message passing,” in UAI, 2006. [sent-293, score-0.116]
95 Heskes, “On the uniqueness of loopy belief propagation fixed points,” Neural Computation, vol. [sent-298, score-0.21]
96 Kolmogorov, “Convergent tree-reweighted message passing for energy minimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. [sent-307, score-0.11]
97 Zabih, “What energy functions can be minimized via graph cuts? [sent-312, score-0.125]
98 Wainwright, “A new look at survey propagation and its generalizations,” 2004. [sent-325, score-0.195]
99 Kappen, “Sufficient conditions for convergence of loopy belief propagation,” in UAI, 2005. [sent-336, score-0.101]
100 Weiss, “Constructing free-energy approximations and generalized belief propagation algorithms,” IEEE Transactions on Information Theory, vol. [sent-356, score-0.21]
wordName wordTfidf (topN-words)
[('rsp', 0.619), ('clauses', 0.263), ('mrf', 0.253), ('clause', 0.23), ('gurations', 0.207), ('ya', 0.179), ('xa', 0.138), ('propagation', 0.138), ('mbp', 0.124), ('rbp', 0.124), ('tep', 0.124), ('wms', 0.124), ('positivity', 0.114), ('pk', 0.11), ('con', 0.109), ('couplings', 0.101), ('relaxed', 0.095), ('xi', 0.089), ('marginals', 0.088), ('energy', 0.086), ('mto', 0.083), ('ysrc', 0.083), ('violated', 0.081), ('guration', 0.078), ('ising', 0.073), ('belief', 0.072), ('disambiguation', 0.072), ('sat', 0.066), ('gs', 0.066), ('az', 0.064), ('valid', 0.059), ('mrfs', 0.058), ('survey', 0.057), ('attractive', 0.056), ('exp', 0.055), ('factors', 0.055), ('val', 0.055), ('invalid', 0.055), ('interaction', 0.054), ('converge', 0.051), ('mixed', 0.049), ('compatibilities', 0.048), ('residual', 0.048), ('web', 0.048), ('person', 0.047), ('graphs', 0.044), ('literals', 0.041), ('src', 0.041), ('graph', 0.039), ('negligible', 0.038), ('variables', 0.036), ('sim', 0.036), ('vs', 0.034), ('factor', 0.033), ('converges', 0.033), ('predicate', 0.033), ('violating', 0.033), ('potentials', 0.032), ('satis', 0.032), ('equals', 0.031), ('clustering', 0.03), ('conversion', 0.029), ('percentage', 0.029), ('temperature', 0.029), ('strengths', 0.029), ('convergence', 0.029), ('cooled', 0.028), ('disallowing', 0.028), ('finley', 0.028), ('maxsat', 0.028), ('semeval', 0.028), ('unsat', 0.028), ('sp', 0.028), ('schedule', 0.028), ('boolean', 0.026), ('sv', 0.026), ('fs', 0.025), ('joachims', 0.025), ('greedy', 0.025), ('purity', 0.024), ('literal', 0.024), ('zecchina', 0.024), ('kappen', 0.024), ('mooij', 0.024), ('pcc', 0.024), ('tanh', 0.024), ('message', 0.024), ('map', 0.024), ('weights', 0.023), ('cluster', 0.023), ('minimization', 0.023), ('opt', 0.022), ('singapore', 0.022), ('wi', 0.022), ('returned', 0.022), ('joint', 0.021), ('penalty', 0.021), ('lemma', 0.021), ('violate', 0.02), ('asynchronous', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs
Author: Hai L. Chieu, Wee S. Lee, Yee W. Teh
Abstract: We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem. 1
2 0.21196936 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation
Author: Luis E. Ortiz
Abstract: This paper proposes constraint propagation relaxation (CPR), a probabilistic approach to classical constraint propagation that provides another view on the whole parametric family of survey propagation algorithms SP(ρ). More importantly, the approach elucidates the implicit, but fundamental assumptions underlying SP(ρ), thus shedding some light on its effectiveness and leading to applications beyond k-SAT. 1
3 0.10834537 114 nips-2007-Learning and using relational theories
Author: Charles Kemp, Noah Goodman, Joshua B. Tenenbaum
Abstract: Much of human knowledge is organized into sophisticated systems that are often called intuitive theories. We propose that intuitive theories are mentally represented in a logical language, and that the subjective complexity of a theory is determined by the length of its representation in this language. This complexity measure helps to explain how theories are learned from relational data, and how they support inductive inferences about unobserved relations. We describe two experiments that test our approach, and show that it provides a better account of human learning and reasoning than an approach developed by Goodman [1].
4 0.097012587 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
Author: Alan S. Willsky, Erik B. Sudderth, Martin J. Wainwright
Abstract: Variational methods are frequently used to approximate or bound the partition or likelihood function of a Markov random field. Methods based on mean field theory are guaranteed to provide lower bounds, whereas certain types of convex relaxations provide upper bounds. In general, loopy belief propagation (BP) provides often accurate approximations, but not bounds. We prove that for a class of attractive binary models, the so–called Bethe approximation associated with any fixed point of loopy BP always lower bounds the true likelihood. Empirically, this bound is much tighter than the naive mean field bound, and requires no further work than running BP. We establish these lower bounds using a loop series expansion due to Chertkov and Chernyak, which we show can be derived as a consequence of the tree reparameterization characterization of BP fixed points. 1
5 0.088646963 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs
Author: Kyomin Jung, Devavrat Shah
Abstract: We present a new local approximation algorithm for computing MAP and logpartition function for arbitrary exponential family distribution represented by a finite-valued pair-wise Markov random field (MRF), say G. Our algorithm is based on decomposing G into appropriately chosen small components; computing estimates locally in each of these components and then producing a good global solution. We prove that the algorithm can provide approximate solution within arbitrary accuracy when G excludes some finite sized graph as its minor and G has bounded degree: all Planar graphs with bounded degree are examples of such graphs. The running time of the algorithm is Θ(n) (n is the number of nodes in G), with constant dependent on accuracy, degree of graph and size of the graph that is excluded as a minor (constant for Planar graphs). Our algorithm for minor-excluded graphs uses the decomposition scheme of Klein, Plotkin and Rao (1993). In general, our algorithm works with any decomposition scheme and provides quantifiable approximation guarantee that depends on the decomposition scheme.
6 0.079252169 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
7 0.07784421 141 nips-2007-New Outer Bounds on the Marginal Polytope
8 0.077047586 51 nips-2007-Comparing Bayesian models for multisensory cue combination without mandatory integration
9 0.072254159 187 nips-2007-Structured Learning with Approximate Inference
10 0.07030122 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
11 0.06951087 56 nips-2007-Configuration Estimates Improve Pedestrian Finding
12 0.065368861 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging
13 0.061414044 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
14 0.059980195 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
15 0.054006957 30 nips-2007-Bayes-Adaptive POMDPs
16 0.052805554 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
17 0.052416384 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
18 0.046869166 128 nips-2007-Message Passing for Max-weight Independent Set
19 0.044947769 33 nips-2007-Bayesian Inference for Spiking Neuron Models with a Sparsity Prior
20 0.044798214 23 nips-2007-An Analysis of Convex Relaxations for MAP Estimation
topicId topicWeight
[(0, -0.146), (1, 0.009), (2, -0.069), (3, -0.004), (4, -0.063), (5, -0.132), (6, 0.016), (7, 0.12), (8, 0.105), (9, -0.101), (10, -0.05), (11, -0.042), (12, -0.016), (13, -0.004), (14, 0.033), (15, -0.03), (16, 0.081), (17, 0.024), (18, 0.009), (19, 0.003), (20, 0.16), (21, -0.032), (22, 0.086), (23, -0.02), (24, -0.092), (25, -0.006), (26, 0.032), (27, 0.095), (28, 0.114), (29, 0.323), (30, -0.01), (31, -0.121), (32, -0.045), (33, 0.138), (34, -0.085), (35, 0.108), (36, 0.132), (37, -0.117), (38, -0.015), (39, -0.028), (40, 0.007), (41, 0.007), (42, -0.073), (43, 0.051), (44, 0.004), (45, 0.132), (46, 0.014), (47, -0.033), (48, -0.038), (49, -0.099)]
simIndex simValue paperId paperTitle
same-paper 1 0.93308163 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs
Author: Hai L. Chieu, Wee S. Lee, Yee W. Teh
Abstract: We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem. 1
2 0.86273336 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation
Author: Luis E. Ortiz
Abstract: This paper proposes constraint propagation relaxation (CPR), a probabilistic approach to classical constraint propagation that provides another view on the whole parametric family of survey propagation algorithms SP(ρ). More importantly, the approach elucidates the implicit but fundamental assumptions underlying SP(ρ), thus shedding some light on its effectiveness and leading to applications beyond k-SAT. 1
3 0.58289647 114 nips-2007-Learning and using relational theories
Author: Charles Kemp, Noah Goodman, Joshua B. Tenenbaum
Abstract: Much of human knowledge is organized into sophisticated systems that are often called intuitive theories. We propose that intuitive theories are mentally represented in a logical language, and that the subjective complexity of a theory is determined by the length of its representation in this language. This complexity measure helps to explain how theories are learned from relational data, and how they support inductive inferences about unobserved relations. We describe two experiments that test our approach, and show that it provides a better account of human learning and reasoning than an approach developed by Goodman [1]. What is a theory, and what makes one theory better than another? Questions like these are of obvious interest to philosophers of science but are also discussed by psychologists, who have argued that everyday knowledge is organized into rich and complex systems that are similar in many respects to scientific theories. Even young children, for instance, have systematic beliefs about domains including folk physics, folk biology, and folk psychology [2]. Intuitive theories like these play many of the same roles as scientific theories: in particular, both kinds of theories are used to explain and encode observations of the world, and to predict future observations. This paper explores the nature, use and acquisition of simple theories. Consider, for instance, an anthropologist who has just begun to study the social structure of a remote tribe, and observes that certain words are used to indicate relationships between selected pairs of individuals. Suppose that term T1(·, ·) can be glossed as ancestor(·, ·), and that T2(·, ·) can be glossed as friend(·, ·). The anthropologist might discover that the first term is transitive, and that the second term is symmetric with a few exceptions. Suppose that term T3(·, ·) can be glossed as defers to(·, ·), and that the tribe divides into two castes such that members of the second caste defer to members of the first caste. In this case the anthropologist might discover two latent concepts (caste 1(·) and caste 2(·)) along with the relationship between these concepts. As these examples suggest, a theory can be defined as a system of laws and concepts that specify the relationships between the elements in some domain [2]. We will consider how these theories are learned, how they are used to encode relational data, and how they support predictions about unobserved relations. Our approach to all three problems relies on the notion of subjective complexity. We propose that theory learners prefer simple theories, that people remember relational data in terms of the simplest underlying theory, and that people extend a partially observed data set according to the simplest theory that is consistent with their observations. There is no guarantee that a single measure of subjective complexity can do all of the work that we require [3]. This paper, however, explores the strong hypothesis that a single measure will suffice. Our formal treatment of subjective complexity begins with the question of how theories are mentally represented. We suggest that theories are represented in some logical language, and propose a specific first-order language that serves as a hypothesis about the “language of thought.” We then pursue the idea that the subjective complexity of a theory corresponds to the length of its representation in this language. 
Our approach therefore builds on the work of Feldman [4], and is related to other psychological applications of the notion of Kolmogorov complexity [5]. The complexity measure we describe can be used to define a probability distribution over a space of theories, and we develop a model of theory acquisition by using this distribution as the prior for a Bayesian learner. We also 1 (a) Star 11 (b) Bipartite (c) Exception 22 33 44 55 66 77 88 16 26 36 46 56 21 31 41 51 61 71 81 11 17 27 37 47 57 18 28 38 48 58 R(X, X). T(6). T(7). T(8). R(X, Y) ← ¯(X), T(Y). T R(X, 1). (d) Symmetric (e) Transitive 11 22 33 44 55 66 77 13 31 12 21 24 42 56 65 26 36 46 56 17 27 37 47 57 18 28 38 48 58 T(6). T(7). T(8). R(X, Y) ← ¯(X), T(Y). T ¯ R(1, 1). R(1, 6). (f) Random 12 21 13 23 14 24 34 13 32 14 24 34 15 25 35 45 16 26 36 46 56 51 52 35 54 61 26 63 46 56 R(1, 2). R(1, 3). R(2, 4). R(5, 6). R(1, 2). R(2, 3). R(3, 4). R(5, X). R(X, 4). R(X, Y) ← R(Y, X). R(X, X). R(4, 5). R(5, 6). R(X, Z) ← R(X, Y), R(Y, Z). R(2, 1). R(1, 3). R(6, 1). R(3, 2). R(2, 6). R(3, 5). R(6, 3). R(4, 6). ¯ ¯ ¯ R(X, X). R(6, 4). R(5, 3). Figure 1: Six possible extensions for a binary predicate R(·, ·). In each case, the objects in the domain are represented as digits, and a pair such as 16 indicates that R(1, 6) is true. Below each set of pairs, the simplest theory according to our complexity measure is shown. show how the same Bayesian approach helps to explain how theories support inductive generalization: given a set of observations, future observations (e.g. whether one individual defers to another) can be predicted using the posterior distribution over the space of theories. We test our approach by developing two experiments where people learn and make predictions about binary and ternary relations. As far as we know, the approach of Goodman [1] is the only other measure of theory complexity that has previously been tested as a psychological model [6]. We show that our experiments support our approach and raise challenges for this alternative model. 1 Theory complexity: a representation length approach Intuitive theories correspond to mental representations of some sort, and our first task is to characterize the elements used to build these representations. We explore the idea that a theory is a system of statements in a logical language, and six examples are shown in Fig. 1. The theory in Fig. 1b is related to the defers to(·, ·) example already described. Here we are interested in a domain including 9 elements, and a two place predicate R(·, ·) that is true of all and only the 15 pairs shown. R is defined using a unary predicate T which is true of only three elements: 6, 7, and 8. The theory includes a clause which states that R(X, Y) is true for all pairs XY such that T(X) is false and T(Y) is true. The theory in Fig. 1c is very similar, but includes an additional clause which specifies that R(1, 1) is true, and an exception which specifies that R(1, 6) is false. Formally, each theory we consider is a collection of function-free definite clauses. All variables are universally quantified: for instance, the clause R(X, Z) ← R(X, Y), R(Y, Z) is equivalent to the logical formula ∀x ∀y ∀z (R(x, z) ← R(x, y) ∧ R(y, z)). For readability, the theories in Fig. 1 include parentheses and arrows, but note that these symbols are unnecessary and can be removed. Our proposed language includes only predicate symbols, variable symbols, constant symbols, and a period that indicates when one clause finishes and another begins. 
Each theory in Fig. 1 specifies the extension of one or more predicates. The extension of predicate P is defined in terms of predicate P+ (which captures the basic rules that lead to membership in P) and predicate P− (which captures exceptions to these rules). The resulting extension of P is defined 2 as P+ \ P− , or the set difference of P+ and P− .1 Once P has been defined, later clauses in the theory may refer to P or its negation ¯. To ensure that our semantics is well-defined, the predicates P in any valid theory must permit an ordering so that the definition of any predicate does not refer to predicates that follow it in the order. Formally, the definition of each predicate P+ or P− can refer only to itself (recursive definitions are allowed) and to any predicate M or ¯ where M < P. M Once we have committed to a specific language, the subjective complexity of a theory is assumed to correspond to the number of symbols in its representation. We have chosen a language where there is one symbol for each position in a theory where a predicate, variable or constant appears, and one symbol to indicate when each clause ends. Given this language, the subjective complexity c(T ) of theory T is equal to the sum of the number of clauses in the theory and the number of positions in the theory where a predicate, variable or constant appears: c(T ) = #clauses(T ) + #pred slots(T ) + #var slots(T ) + #const slots(T ). (1) For instance, the clause R(X, Z) ← R(X, Y), R(Y, Z). contributes ten symbols towards the complexity of a theory (three predicate symbols, six variable symbols, and one period). Other languages might be considered: for instance, we could use a language which uses five symbols (e.g. five bits) to represent each predicate, variable and constant, and one symbol (e.g. one bit) to indicate the end of a clause. Our approach to subjective complexity depends critically on the representation language, but once a language has been chosen the complexity measure is uniquely specified. Although our approach is closely related to the notion of Kolmogorov complexity and to Minimum Message Length (MML) and Minimum Description Length (MDL) approaches, we refer to it as a Representation Length (RL) approach. A RL approach includes a commitment to a specific language that is proposed as a psychological hypothesis, but these other approaches aspire towards results that do not depend on the language chosen.2 It is sometimes suggested that the notion of Kolmogorov complexity provides a more suitable framework for psychological research than the RL approach, precisely because it allows for results that do not depend on a specific description language [8]. We subscribe to the opposite view. Mental representations presumably rely on some particular language, and identifying this language is a central challenge for psychological research. The language we described should be considered as a tentative approximation of the language of thought. Other languages can and should be explored, but our language has several appealing properties. Feldman [4] has argued that definite clauses are psychologically natural, and working with these representations allows our approach to account for several classic results from the concept learning literature. 
For instance, our language leads to the prediction that conjunctive concepts are easier to learn than disjunctive concepts [9].3 Working with definite clauses also ensures that each of our theories has a unique minimal model, which means that the extension of a theory can be defined in a particularly simple way. Finally, human learners deal gracefully with noise and exceptions, and our language provides a simple way to handle exceptions. Any concrete proposal about the language of thought should make predictions about memory, learning and reasoning. Suppose that data set D lists the extensions of one or more predicates, and that a theory is a “candidate theory” for D if it correctly defines the extensions of all predicates in D. Note that a candidate theory may well include latent predicates—predicates that do not appear in D, but are useful for defining the predicates that have been observed. We will assume that humans encode D in terms of the simplest candidate theory for D, and that the difficulty of memorizing D is determined by the subjective complexity of this theory. Our approach can and should be tested against classic results from the memory literature. Unlike some other approaches to complexity [10], for instance, our model predicts that a sequence of k items is about equally easy to remember regardless of whether the items are drawn from a set of size 2, a set of size 10, or a set of size 1000 [11]. 1 The extension of P+ is the smallest set that satisfies all of the clauses that define P+ , and the extension of P is defined similarly. To simplify our notation, Fig. 1 uses P to refer to both P and P+ , and ¯ to refer to ¯ and P P P− . Any instance of P that appears in a clause defining P is really an instance of P+ , and any instance of ¯ that P appears in a clause defining ¯ is really an instance of P− . P 2 MDL approaches also commit to a specific language, but this language is often intended to be as general as possible. See, for instance, the discussion of universal codes in Gr¨ nwald et al. [7]. u 3 A conjunctive concept C(·) can be defined using a single clause: C(X) ← A(X), B(X). The shortest definition of a disjunctive concept requires two clauses: D(X) ← A(X). D(X) ← B(X). − 3 To develop a model of inductive learning and reasoning, we take a Bayesian approach, and use our complexity measure to define a prior distribution over a hypothesis space of theories: P (T ) ∝ 2−c(T ) .4 Given this prior distribution, we can use Bayesian inference to make predictions about unobserved relations and to discover the theory T that best accounts for the observations in data set D [12, 13]. Suppose that we have a likelihood function P (D|T ) which specifies how the examples in D were generated from some underlying theory T . The best explanation for the data D is the theory that maximizes the posterior distribution P (T |D) ∝ P (D|T )P (T ). If we need to predict whether ground term g is likely to be true, 5 we can sum over the space of theories: P (g|D) = P (g|T )P (T |D) = T 1 P (D) P (D|T )P (T ) (2) T :g∈T where the final sum is over all theories T that make ground term g true. 1.1 Related work The theories we consider are closely related to logic programs, and methods for Inductive Logic Programming (ILP) explore how these programs can be learned from examples [14]. ILP algorithms are often inspired by the idea of searching for the shortest theory that accounts for the available data, and ILP is occasionally cast as the problem of minimizing an explicit MDL criterion [10]. 
Although ILP algorithms are rarely considered as cognitive models, the RL approach has a long psychological history, and is proposed by Chomsky [15] and Leeuwenberg [16] among others. Formal measures of complexity have been developed in many fields [17], and there is at least one other psychological account of theory complexity. Goodman [1] developed a complexity measure that was originally a philosophical proposal about scientific theories, but was later tested as a model of subjective complexity [6]. A detailed description of this measure is not possible here, but we attempt to give a flavor of the approach. Suppose that a basis is a set of predicates. The starting point for Goodman’s model is the intuition that basis B1 is at least as complex as basis B2 if B1 can be used to define B2. Goodman argues that this intuition is flawed, but his model is founded on a refinement of this intuition. For instance, since the binary predicate in Fig. 1b can be defined in terms of two unary predicates, Goodman’s approach requires that the complexity of the binary predicate is no more than the sum of the complexities of the two unary predicates. We will use Goodman’s model as a baseline for evaluating our own approach, and a comparison between these two models should be informed by both theoretical and empirical considerations. On the theoretical side, our approach relies on a simple principle for deciding which structural properties are relevant to the measurement of complexity: the relevant properties are those with short logical representations. Goodman’s approach incorporates no such principle, and he proposes somewhat arbitrarily that reflexivity and symmetry are among the relevant structural properties but that transitivity is not. A second reason for preferring our model is that it makes contact with a general principle—the idea that simplicity is related to representation length—that has found many applications across psychology, machine learning, and philosophy. 2 Experimental results We designed two experiments to explore settings where people learn, remember, and make inductive inferences about relational data. Although theories often consist of systems of many interlocking relations, we keep our experiments simple by asking subjects to learn and reason about a single relation at a time. Despite this restriction, our experiments still make contact with several issues raised by systems of relations. As the defers to(·, ·) example suggests, a single relation may be best explained as the observable tip of a system involving several latent predicates (e.g. caste 1(·) and caste 2(·)). 4 To ensure that this distribution can be normalized, we assume that there is some upper bound on the number of predicate symbols, variable symbols, and constants, and on the length of the theories we will consider. There will therefore be a finite number of possible theories, and our prior will be a valid probability distribution. 5 A ground term is a term such as R(8, 9) that does not include any variables. 4 Learning time Complexity (RL) Complexity (Human) 6 300 Complexity (Goodman) 4 20 0 0 0 star bprt excp sym trans rand 2 0 star bprt excp sym trans rand 2 star bprt excp sym trans rand 100 200 star bprt excp sym trans rand 4 40 Figure 2: (a) Average time in seconds to learn the six sets in Fig. 1. (b) Average ratings of set complexity. (c) Complexity scores according to our representation length (RL) model. (d) Complexity scores according to Goodman’s model. 
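As a concrete reading of the representation-length measure described above (Equation 1), the following sketch scores a theory given as a list of clauses, where each clause is a list of (predicate, arguments) literals. The clause encoding is an illustrative choice of mine, not the authors' implementation.

def complexity(theory):
    c = len(theory)                      # one period per clause
    for clause in theory:
        for predicate, args in clause:
            c += 1 + len(args)           # predicate slot plus variable/constant slots
    return c

# "R(X, Z) <- R(X, Y), R(Y, Z)." : three predicate symbols, six variable symbols, one period = 10
transitivity = [[("R", ["X", "Z"]), ("R", ["X", "Y"]), ("R", ["Y", "Z"])]]
print(complexity(transitivity))          # 10, matching the count given in the text

prior = lambda T: 2.0 ** -complexity(T)  # unnormalized prior, P(T) proportional to 2^(-c(T))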
2.1 Experiment 1: memory and induction In our first experiment, we studied the subjective complexity of six binary relations that display a range of structural properties, including reflexivity, symmetry, and transitivity. Materials and Methods. 18 adults participated in this experiment. Subjects were required to learn the 6 sets shown in Fig. 1, and to make inductive inferences about each set. Although Fig. 1 shows pairs of digits, the experiment used letter pairs, and the letters for each condition and the order in which these conditions were presented were randomized across subjects. The pairs for each condition were initially laid out randomly on screen, and subjects could drag them around and organize them to help them understand the structure of the set. At any stage, subjects could enter a test phase where they were asked to list the 15 pairs belonging to the current set. Subjects who made an error on the test were returned to the learning phase. After 9 minutes had elapsed, subjects were allowed to pass the test regardless of how many errors they made. After passing the test, subjects were asked to rate the complexity of the set compared to other sets with 15 pairs. Ratings were provided on a 7 point scale. Subjects were then asked to imagine that a new letter (e.g. letter 9) had belonged to the current alphabet, and were given two inductive tasks. First they were asked to enter between 1 and 10 novel pairs that they might have expected to see (each novel pair was required to include the new letter). Next they were told about a novel pair that belonged to the set (e.g. pair 91), and were again asked to enter up to 10 additional pairs that they might have expected to see. Results. The average time needed to learn each set is shown in Fig. 2a, and ratings of set complexity are shown in Fig. 2b. It is encouraging that these measures yield converging results, but they may be confounded since subjects rated the complexity of a set immediately after learning it. The complexities plotted in Fig. 2c are the complexities of the theories shown in Fig. 1, which we believe to be the simplest theories according to our complexity measure. The final plot in Fig. 2 shows complexities according to Goodman’s model, which assigns each binary relation an integer between 0 and 4. There are several differences between these models: for instance, Goodman’s account incorrectly predicts that the exception case is the hardest of the six, but our model acknowledges that a simple theory remains simple if a handful of exceptions are added. Goodman’s account also predicts that transitivity is not an important structural regularity, but our model correctly predicts that the transitive set is simpler than the same set with some of the pairs reversed (the random set). Results for the inductive task are shown in Fig. 3. The first two columns show the number of subjects who listed each novel pair. The remaining two columns show the probability of set membership predicted by our model. To generate these predictions, we applied Equation 2 and summed over a set of theories created by systematically extending the theories shown in Fig. 1. Each extended theory includes up to one additional clause for each predicate in the base theory, and each additional clause includes at most two predicate slots. 
For instance, each extended theory for the bipartite case is created by choosing whether or not to add the clause T(9), and adding up to one clause for predicate R.6 For the first inductive task, the likelihood term P (D|T ) (see Equation 2) is set to 0 for all theories that are not consistent with the pairs observed during training, and to a constant for all remaining theories. For the second task we assumed in addition that the novel pair observed is 6 R(9, X), ¯(2, 9), and R(X, 9) ← R(X, 2) are three possible additions. R 5 18 9 9 0.5 trans symm excep bipart 0 91 random r=0.99 1 star 18 99 19 0 91 89 99 19 89 0.5 0 91 18 18 1 9 9 99 19 r=0.96 89 0.5 0 91 99 19 0 91 89 99 19 89 18 9 1 99 19 89 r=0.98 1 9 0.5 0 91 99 19 0 91 89 99 19 89 18 9 99 19 89 0.5 0 81 88 18 0 78 81 88 18 78 0 18 18 9 0 0 71 77 17 67 71 77 17 67 18 18 81 9 88 18 r=0.62 78 71 77 17 67 Human (no examples) 0 71 77 17 67 Human (1 example) 0 0 91 99 19 89 r=0.99 0 81 88 18 78 71 77 17 67 1 71 77 17 67 r=0.38 0.5 0 89 r=0.93 0.5 1 9 99 19 r=0.99 1 0.5 0 89 r=0.99 0.5 1 9 0 91 1 r=0.88 1 9 99 19 0.5 0 91 18 0 91 0.5 0 91 18 r=0.99 1 0 r=0.74 1 0.5 71 77 17 67 RL (no examples) 0 71 77 17 67 RL (one example) Figure 3: Data and model predictions for the induction task in Experiment 1. Columns 1 and 3 show predictions before any pairs involving the new letter are observed. Columns 2 and 4 show predictions after a single novel pair (marked with a gray bar) is observed to belong to the set. The model plots for each condition include correlations with the human data. sampled at random from all pairs involving the new letter.7 All model predictions were computed using Mace4 [18] to generate the extension of each theory considered. The supporting material includes predictions for a model based on the Goodman complexity measure and an exemplar model which assumes that the new letter will be just like one of the old letters.8 The exemplar model outperforms our model in the random condition, and makes accurate predictions about three other conditions. Overall, however, our model performs better than the two baselines. Here we focus on two important predictions that are not well handled by the exemplar model. In the symmetry condition, almost all subjects predict that 78 belongs to the set after learning that 87 belongs to the set, suggesting that they have learned an abstract rule. In the transitive condition, most subjects predict that pairs 72 through 76 belong to the set after learning that 71 belongs to the set. Our model accounts for this result, but the exemplar model has no basis for making predictions about letter 7, since this letter is now known to be unlike any of the others. 2.2 Experiment 2: learning from positive examples During the learning phase of our first experiment, subjects learned a theory based on positive examples (the theory included all pairs they had seen) and negative examples (the theory ruled out all pairs they had not seen). Often, however, humans learn theories based on positive examples alone. Suppose, for instance, that our anthropologist has spent only a few hours with a new tribe. She may have observed several pairs who are obviously friends, but should realize that many other pairs of friends have not yet interacted in her presence. 7 For the second task, P (D|T ) is set to 0 for theories that are inconsistent with the training pairs and theories 1 which do not include the observed novel pair. 
For all remaining theories, P (D|T ) is set to n , where n is the total number of novel pairs that are consistent with T . 8 Supporting material is available at www.charleskemp.com 6 1 221 331 441 551 c) 7 1 R(X, X, Y). 221 443 552 663 d) 7 1 R(X, Y, Z). 231 456 615 344 e) 7 1 −10 −5 −0.1 −10 −20 −20 −10 −0.2 −20 777 771 778 789 237 777 771 778 789 237 −10 231 234 235 236 0 777 771 778 789 237 0 777 771 778 789 237 0 777 771 778 789 237 0 RL 0 R(2, 3, X). 777 771 778 789 237 7 R(X, X, 1). 777 771 778 789 237 b) 777 771 778 789 237 1 111 222 333 444 777 771 778 789 237 7 R(X, X, X). 777 771 778 789 237 Human a) Figure 4: Data and model predictions for Experiment 2. The four triples observed for each set are shown at the top of the figure. The first row of plots shows average ratings on a scale from 1 (very unlikely to belong to the set) to 7 (very likely). Model predictions are plotted as log probabilities. Our framework can handle cases like these if we assume that the data D in Equation 2 are sampled from the ground terms that are true according to the underlying theory. We follow [10] and [13] and use a distribution P (D|T ) which assumes that the examples in D are randomly sampled with replacement from the ground terms that are true. This sampling assumption encourages our model to identify the theory with the smallest extension that is compatible with all of the training examples. We tested this approach by designing an experiment where learners were given sets of examples that were compatible with several underlying theories. Materials and Methods. 15 adults participated in this experiment immediately after taking Experiment 1. In each of five conditions, subjects were told about a set of triples built from an alphabet of 9 letters. They were shown four triples that belonged to the set (Fig. 4), and told that the set might include triples that they had not seen. Subjects then gave ratings on a seven point scale to indicate whether five additional triples (see Fig. 4) were likely to belong to the set. Results. Average ratings and model predictions are shown in Fig. 4. Model predictions for each condition were computed using Equation 2 and summing over a space of theories that included the five theories shown at the top of Fig. 4, variants of these five theories which stated that certain pairs of slots could not be occupied by the same constant,9 and theories that included no variables but merely enumerated up to 5 triples.10 Although there are general theories like R(X, Y, Z) that are compatible with the triples observed in all five conditions, Fig. 4 shows that people were sensitive to different regularities in each case.11 We focus on one condition (Fig. 4b) that exposes the strengths and weaknesses of our model. According to our model, the two most probable theories given the triples for this condition are R(X, X, 1) and the closely related variant that rules out R(1, 1, 1). The next most probable theory is R(X, X, Y). These predictions are consistent with people’s judgments that 771 is very likely to belong to the set, and that 778 is the next most likely option. Unlike our model, however, people consider 777 to be substantially less likely than 778 to belong to the set. This result may suggest that the variant of R(X, X, Y) that rules out R(X, X, X) deserves a higher prior probability than our model recognizes. 
To better account for cases like this, it may be worth considering languages where any two variables that belong to the same clause but have different names must refer to different entities. 3 Discussion and Conclusion There are many psychological models of concept learning [4, 12, 13], but few that use representations rich enough to capture the content of intuitive theories. We suggested that intuitive theories are mentally represented in a first-order logical language, and proposed a specific hypothesis about 9 ¯ One such theory includes two clauses: R(X, X, Y). R(X, X, X). One such theory is the following list of clauses: R(2, 2, 1). R(3, 3, 1). R(4, 4, 1). R(5, 5, 1). R(7, 7, 7). 11 Similar results have been found with 9-month old infants. Cases like Figs. 4b and 4c have been tested in an infant language-learning study where the stimuli were three-syllable strings [19]. 9-month old infants exposed to strings like the four in Fig. 4c generalized to other strings consistent with the theory R(X, X, Y), but infants in the condition corresponding to Fig. 4b generalized only to strings consistent with the theory R(X, X, 1). 10 7 this “language of thought.” We assumed that the subjective complexity of a theory depends on the length of its representation in this language, and described experiments which suggest that the resulting complexity measure helps to explain how theories are learned and used for inductive inference. Our experiments deliberately used stimuli that minimize the influence of prior knowledge. Theories, however, are cumulative, and the theory that seems simplest to a learner will often depend on her background knowledge. Our approach provides a natural place for background knowledge to be inserted. A learner can be supplied with a stock of background predicates, and the shortest representation for a data set will depend on which background predicates are available. Since different sets of predicates will lead to different predictions about subjective complexity, empirical results can help to determine the background knowledge that people bring to a given class of problems. Future work should aim to refine the representation language and complexity measure we proposed. We expect that something like our approach will be suitable for modeling a broad class of intuitive theories, but the specific framework presented here can almost certainly be improved. Future work should also consider different strategies for searching the space of theories. Some of the strategies developed in the ILP literature should be relevant [14], but a detailed investigation of search algorithms seems premature until our approach has held up to additional empirical tests. It is comparatively easy to establish whether the theories that are simple according to our approach are also considered simple by people, and our experiments have made a start in this direction. It is much harder to establish that our approach captures most of the theories that are subjectively simple, and more exhaustive experiments are needed before this conclusion can be drawn. Boolean concept learning has been studied for more than fifty years [4, 9], and many psychologists have made empirical and theoretical contributions to this field. An even greater effort will be needed to crack the problem of theory learning, since the space of intuitive theories is much richer than the space of Boolean concepts. The difficulty of this problem should not be underestimated, but computational approaches can contribute part of the solution. 
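The prediction rule in Equation 2, combined with the sampling assumption used for Experiment 2 (examples drawn with replacement from the true ground terms), can be sketched for a toy hypothesis space as follows. The three candidate theories and their extensions are chosen for illustration and do not reproduce the full theory space the authors used; each candidate is a single clause with one predicate and three argument slots, so its complexity under the rule above is 5.

candidates = [
    # (clause shown as text, extension over a 9-letter domain, complexity c(T))
    ("R(X, X, 1).", {(i, i, 1) for i in range(1, 10)}, 5),
    ("R(X, X, Y).", {(i, i, j) for i in range(1, 10) for j in range(1, 10)}, 5),
    ("R(X, Y, Z).", {(i, j, k) for i in range(1, 10) for j in range(1, 10) for k in range(1, 10)}, 5),
]
D = [(2, 2, 1), (4, 4, 1), (5, 5, 1), (6, 6, 1)]       # observed positive examples

def posterior(candidates, D):
    scores = []
    for name, ext, c in candidates:
        prior = 2.0 ** -c
        # Sampling with replacement from the true ground terms gives the size-principle likelihood.
        lik = (1.0 / len(ext)) ** len(D) if all(d in ext for d in D) else 0.0
        scores.append((name, prior * lik))
    z = sum(s for _, s in scores)
    return [(name, s / z) for name, s in scores]

def predict(candidates, D, g):
    # P(g | D): total posterior mass of theories whose extension contains the ground term g.
    post = dict(posterior(candidates, D))
    return sum(post[name] for name, ext, _ in candidates if g in ext)

print(posterior(candidates, D))
print("P(7,7,1 | D) =", predict(candidates, D, (7, 7, 1)))
print("P(7,7,8 | D) =", predict(candidates, D, (7, 7, 8)))

Because the three priors are equal here, the preference for the smallest consistent extension comes entirely from the likelihood term, which is the behaviour the sampling assumption is meant to induce.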
Acknowledgments Supported by the William Asbjornsen Albert memorial fellowship (CK), the James S. McDonnell Foundation Causal Learning Collaborative Initiative (NDG, JBT) and the Paul E. Newton chair (JBT). References [1] N. Goodman. The structure of appearance. 2nd edition, 1961. [2] S. Carey. Conceptual change in childhood. MIT Press, Cambridge, MA, 1985. [3] H. A. Simon. Complexity and the representation of patterned sequences of symbols. Psychological Review, 79:369–382, 1972. [4] J. Feldman. An algebra of human concept learning. JMP, 50:339–368, 2006. [5] N. Chater and P. Vitanyi. Simplicity: a unifying principle in cognitive science. TICS, 7:19–22, 2003. [6] J. T. Krueger. A theory of structural simplicity and its relevance to aspects of memory, perception, and conceptual naturalness. PhD thesis, University of Pennsylvania, 1979. [7] P. Grünwald, I. J. Myung, and M. Pitt, editors. Advances in Minimum Description Length: Theory and Applications. 2005. [8] N. Chater. Reconciling simplicity and likelihood principles in perceptual organization. Psychological Review, 103:566–581, 1996. [9] J. A. Bruner, J. S. Goodnow, and G. J. Austin. A study of thinking. Wiley, 1956. [10] D. Conklin and I. H. Witten. Complexity-based induction. Machine Learning, 16(3):203–225, 1994. [11] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(1):81–97, 1956. [12] N. D. Goodman, T. L. Griffiths, J. Feldman, and J. B. Tenenbaum. A rational analysis of rule-based concept learning. In CogSci, 2007. [13] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. BBS, 24:629–641, 2001. [14] S. Muggleton and L. De Raedt. Inductive logic programming: theory and methods. Journal of Logic Programming, 19-20:629–679, 1994. [15] N. Chomsky. The logical structure of linguistic theory. University of Chicago Press, Chicago, 1975. [16] E. L. J. Leeuwenberg. A perceptual coding language for visual and auditory patterns. American Journal of Psychology, 84(3):307–349, 1971. [17] B. Edmonds. Syntactic measures of complexity. PhD thesis, University of Manchester, 1999. [18] W. McCune. Mace4 reference manual and guide. Technical Report ANL/MCS-TM-264, Argonne National Laboratory, 2003. [19] L. Gerken. Decisions, decisions: infant language learning when multiple generalizations are possible. Cognition, 98(3):67–74, 2006.
4 0.41921034 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
Author: Michael Kearns, Jinsong Tan, Jennifer Wortman
Abstract: We provide provably privacy-preserving versions of belief propagation, Gibbs sampling, and other local algorithms — distributed multiparty protocols in which each party or vertex learns only its final local value, and absolutely nothing else. 1
5 0.41636324 141 nips-2007-New Outer Bounds on the Marginal Polytope
Author: David Sontag, Tommi S. Jaakkola
Abstract: We give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efficiently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints on the marginal polytope are derived through a series of projections onto the cut polytope. As a result, we obtain tighter upper bounds on the log-partition function. We also show empirically that the approximations of the marginals are significantly more accurate when using the tighter outer bounds. Finally, we demonstrate the advantage of the new constraints for finding the MAP assignment in protein structure prediction. 1
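One family of constraints that arises from projecting onto the cut polytope is the cycle inequalities; whether these are exactly the constraints used in the paper is my assumption here. The sketch below is not the paper's cutting-plane algorithm: it only checks cycle inequalities on a single triangle, given pairwise "disagreement" pseudomarginals x_e = P(endpoints of e differ), and illustrates how pseudomarginals that satisfy the local pairwise constraints can still fall outside the marginal polytope.

from itertools import combinations

# For a cycle C and any odd-sized F subset of C, a valid distribution must satisfy
#     sum_{e in F} x_e - sum_{e in C\F} x_e <= |F| - 1.
def violated_cycle_inequalities(cycle_edges, x):
    out = []
    for k in range(1, len(cycle_edges) + 1, 2):            # odd |F|
        for F in combinations(cycle_edges, k):
            lhs = sum(x[e] for e in F) - sum(x[e] for e in cycle_edges if e not in F)
            if lhs > len(F) - 1 + 1e-9:
                out.append((F, lhs))
    return out

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
# Illustrative values: each edge marginal is locally valid, yet no joint distribution on a
# triangle can make all three pairs disagree with probability 0.9 (disagreements come in even numbers).
x = {("a", "b"): 0.9, ("b", "c"): 0.9, ("a", "c"): 0.9}
print(violated_cycle_inequalities(triangle, x))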
6 0.40611637 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs
7 0.35907212 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
8 0.32615799 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
9 0.31345436 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging
10 0.30270955 198 nips-2007-The Noisy-Logical Distribution and its Application to Causal Inference
11 0.29450804 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
12 0.28338566 23 nips-2007-An Analysis of Convex Relaxations for MAP Estimation
13 0.28130767 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
14 0.27591032 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index
15 0.26893914 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
16 0.26802894 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
17 0.26422274 101 nips-2007-How SVMs can estimate quantiles and the median
18 0.25564387 128 nips-2007-Message Passing for Max-weight Independent Set
19 0.24297927 187 nips-2007-Structured Learning with Approximate Inference
20 0.24212381 171 nips-2007-Scan Strategies for Meteorological Radars
topicId topicWeight
[(5, 0.052), (13, 0.031), (16, 0.022), (21, 0.051), (29, 0.332), (31, 0.014), (34, 0.037), (35, 0.03), (47, 0.068), (49, 0.019), (79, 0.01), (83, 0.151), (85, 0.028), (87, 0.018), (90, 0.045)]
simIndex simValue paperId paperTitle
1 0.87611055 55 nips-2007-Computing Robust Counter-Strategies
Author: Michael Johanson, Martin Zinkevich, Michael Bowling
Abstract: Adaptation to other initially unknown agents often requires computing an effective counter-strategy. In the Bayesian paradigm, one must find a good counterstrategy to the inferred posterior of the other agents’ behavior. In the experts paradigm, one may want to choose experts that are good counter-strategies to the other agents’ expected behavior. In this paper we introduce a technique for computing robust counter-strategies for adaptation in multiagent scenarios under a variety of paradigms. The strategies can take advantage of a suspected tendency in the decisions of the other agents, while bounding the worst-case performance when the tendency is not observed. The technique involves solving a modified game, and therefore can make use of recently developed algorithms for solving very large extensive games. We demonstrate the effectiveness of the technique in two-player Texas Hold’em. We show that the computed poker strategies are substantially more robust than best response counter-strategies, while still exploiting a suspected tendency. We also compose the generated strategies in an experts algorithm showing a dramatic improvement in performance over using simple best responses. 1
2 0.80131012 169 nips-2007-Retrieved context and the discovery of semantic structure
Author: Vinayak Rao, Marc Howard
Abstract: Semantic memory refers to our knowledge of facts and relationships between concepts. A successful semantic memory depends on inferring relationships between items that are not explicitly taught. Recent mathematical modeling of episodic memory argues that episodic recall relies on retrieval of a gradually-changing representation of temporal context. We show that retrieved context enables the development of a global memory space that reflects relationships between all items that have been previously learned. When newly-learned information is integrated into this structure, it is placed in some relationship to all other items, even if that relationship has not been explicitly learned. We demonstrate this effect for global semantic structures shaped topologically as a ring, and as a two-dimensional sheet. We also examined the utility of this learning algorithm for learning a more realistic semantic space by training it on a large pool of synonym pairs. Retrieved context enabled the model to “infer” relationships between synonym pairs that had not yet been presented. 1
3 0.74843246 165 nips-2007-Regret Minimization in Games with Incomplete Information
Author: Martin Zinkevich, Michael Johanson, Michael Bowling, Carmelo Piccione
Abstract: Extensive games are a powerful model of multiagent decision-making scenarios with incomplete information. Finding a Nash equilibrium for very large instances of these games has received a great deal of recent attention. In this paper, we describe a new technique for solving large games based on regret minimization. In particular, we introduce the notion of counterfactual regret, which exploits the degree of incomplete information in an extensive game. We show how minimizing counterfactual regret minimizes overall regret, and therefore in self-play can be used to compute a Nash equilibrium. We demonstrate this technique in the domain of poker, showing we can solve abstractions of limit Texas Hold'em with as many as 10^12 states, two orders of magnitude larger than previous methods. 1
same-paper 4 0.72766203 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs
Author: Hai L. Chieu, Wee S. Lee, Yee W. Teh
Abstract: We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem. 1
5 0.6399585 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
Author: John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, Heung-yeung Shum
Abstract: We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classifiers. Theoretical results provide new insights into relationships among popular classifiers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Compression also provides a uniform means of handling classes of varying dimension. This simple classification criterion and its kernel and local versions perform competitively against existing classifiers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-specific information. 1
6 0.52203351 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
7 0.51724279 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
8 0.50187111 63 nips-2007-Convex Relaxations of Latent Variable Training
9 0.50016087 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
10 0.50009364 49 nips-2007-Colored Maximum Variance Unfolding
11 0.4996039 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
12 0.49924314 187 nips-2007-Structured Learning with Approximate Inference
13 0.49883699 115 nips-2007-Learning the 2-D Topology of Images
14 0.49784064 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
15 0.49735525 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
16 0.49487936 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
17 0.494432 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
18 0.49285781 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
19 0.49264252 134 nips-2007-Multi-Task Learning via Conic Programming
20 0.49262556 128 nips-2007-Message Passing for Max-weight Independent Set