nips nips2000 nips2000-127 knowledge-graph by maker-knowledge-mining
Author: Joshua B. Tenenbaum, Thomas L. Griffiths
Abstract: We use graphical models to explore the question of how people learn simple causal relationships from data. The two leading psychological theories can both be seen as estimating the parameters of a fixed graph. We argue that a complete account of causal induction should also consider how people learn the underlying causal graph structure, and we propose to model this inductive process as a Bayesian inference. Our argument is supported through the discussion of three data sets.
Structure learning in human causal induction

Joshua B. Tenenbaum, Thomas L. Griffiths

Abstract

We use graphical models to explore the question of how people learn simple causal relationships from data. The two leading psychological theories can both be seen as estimating the parameters of a fixed graph. We argue that a complete account of causal induction should also consider how people learn the underlying causal graph structure, and we propose to model this inductive process as a Bayesian inference. Our argument is supported through the discussion of three data sets.
1 Introduction

Causality plays a central role in human mental life. Our behavior depends upon our understanding of the causal structure of our environment, and we are remarkably good at inferring causation from mere observation. Constructing formal models of causal induction is currently a major focus of attention in computer science [7], psychology [3,6], and philosophy [5]. This paper attempts to connect these literatures by framing the debate between two major psychological theories in the computational language of graphical models.
Psychological models of causal induction address the question of how people learn associations between causes and effects, such as P(C → E), the probability that some event C causes outcome E. As a running example, suppose that three case studies have been done to evaluate the probability that certain chemicals, when injected into rats, cause certain genes to be expressed. In case 1, levels of gene 1 were measured in 100 rats injected with chemical 1, as well as in 100 uninjected rats; cases 2 and 3 were conducted likewise but with different chemicals and genes. In case 1, 40 out of 100 injected rats were found to have expressed the gene, while 0 out of 100 uninjected rats expressed the gene. In case 2, the corresponding counts were 7 out of 100 injected rats versus 0 out of 100 uninjected; in case 3, 53 out of 100 injected versus 46 out of 100 uninjected. For each case, we would like to know the probability that the chemical causes the gene to be expressed, P(C → E), where C denotes the chemical and E denotes gene expression. People typically rate P(C → E) highest for case 1, followed by case 2 and then case 3.
The two leading psychological models of causal induction both take as input the conditional probabilities P(e+|c+) and P(e+|c-), the frequencies with which the effect occurs in the presence and absence of the candidate cause, and elaborate upon this basis in attempting to specify P(C → E). The ΔP model [6] claims that people estimate P(C → E) according to

    ΔP = P(e+|c+) - P(e+|c-).   (1)

(We restrict our attention here to facilitatory causes, for which ΔP ≥ 0.) Equation 1 captures the intuition that C is perceived to cause E to the extent that C's occurrence increases the likelihood of observing E. Cheng [3] has argued against ΔP and proposed that P(C → E) instead corresponds to causal power, the probability that C produces E in the absence of all other causes. Formally, the power model can be expressed as:

    power = ΔP / (1 - P(e+|c-)).   (2)

There are a variety of normative arguments in favor of either of these models [3,7].
Empirically, however, neither model is fully adequate to explain human causal induction. We will present ample evidence for this claim below, but for now the basic problem can be illustrated with the three scenarios above. People rate P(C → E) higher for case 2, {7/100, 0/100}, than for case 3, {53/100, 46/100}; yet ΔP rates them equally, and the power model ranks case 3 over case 2.
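This discrepancy is easy to verify numerically. The following sketch (our own illustration, not the authors' code) computes both quantities for the contingencies quoted above:

```python
def delta_p(p_e_c1, p_e_c0):
    """Delta-P model (Equation 1): P(e+|c+) - P(e+|c-)."""
    return p_e_c1 - p_e_c0

def causal_power(p_e_c1, p_e_c0):
    """Causal power model (Equation 2): Delta-P / (1 - P(e+|c-))."""
    return delta_p(p_e_c1, p_e_c0) / (1.0 - p_e_c0)

# Contingencies {P(e+|c+), P(e+|c-)} for the three gene-expression cases.
cases = {1: (0.40, 0.00), 2: (0.07, 0.00), 3: (0.53, 0.46)}
for k, (p1, p0) in cases.items():
    print(k, round(delta_p(p1, p0), 3), round(causal_power(p1, p0), 3))
```

Cases 2 and 3 receive identical ΔP (0.07), while the power model scores case 3 (≈ 0.13) above case 2 (0.07), the reverse of people's ordering.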
To understand this discrepancy, we have to distinguish between two possible senses of P(C → E): "the probability that C causes E (on any given trial when C is present)" versus "the probability that C is a cause of E (in general, as opposed to being causally independent of E)".
The ΔP and power models concern only the former sense, while people's intuitions about P(C → E) are often concerned with the latter. In our example, while the effect of C on any given trial in case 3 may be equal to (according to ΔP) or stronger than (according to power) its effect in case 2, the general pattern of results seems more likely in case 2 than in case 3 to be due to a genuine causal influence, as opposed to a spurious correlation between random samples of two independent variables. In the following section, we formalize this distinction in terms of parameter estimation versus structure learning on a graphical model. Section 3 then compares two variants of our structure learning model with the parameter estimation models (ΔP and power) in light of data from three experiments on human causal induction.
2 Graphical models of causal induction

The language of causal graphical models provides a useful framework for thinking about people's causal intuitions [5,7]. All the induction models we consider here can be viewed as computations on a simple directed graph (Graph1 in Figure 1), in which the effect E has two parents: the candidate cause C and a constant background cause B. Each parent node is associated with a parameter, w_B or w_C, that defines the strength of its effect on E. In the ΔP model, E depends linearly on its parents:

    P(e+|b, c) = w_B · b + w_C · c.   (3)

In the causal power model, as first shown by Glymour [5], E is a noisy-OR gate:

    P(e+|b, c) = 1 - (1 - w_B)^b (1 - w_C)^c.   (4)
2.1 Parameter estimates: ΔP and causal power

The ΔP and power models' predictions for P(C → E) can be seen as maximum likelihood estimates of the causal strength parameter w_C in Graph1, but under different parameterizations.
Maximum likelihood estimation chooses w_B and w_C to maximize the log likelihood of the data X:

    L(X | w_B, w_C) = Σ_{i=1}^{N} e_i log Q(e+|c_i) + (1 - e_i) log(1 - Q(e+|c_i)),   (5)

where e_i and c_i indicate the presence or absence of the effect and the cause on trial i, Q(e+|c_i) is the model's probability of the effect on that trial (Equation 3 or 4, with b = b+), and we have suppressed the dependence of Q(e+|c_i) on w_B, w_C.
To show that the ΔP model's predictions for P(C → E) correspond to maximum likelihood estimates of w_C under a linear parameterization of Graph1, we identify w_C in Equation 3 with ΔP, and w_B with P(e+|c-). Then Equation 3 reduces to P(e+|c+) for c = c+ and to P(e+|c-) for c = c-, thus satisfying the sufficient conditions in Equations 8-9 for w_B and w_C to be maximum likelihood estimates. To show that the causal power model's predictions for P(C → E) correspond to maximum likelihood estimates of w_C under a noisy-OR parameterization, we follow the analogous procedure: identify w_C in Equation 4 with power (Equation 2), and w_B with P(e+|c-). Then Equation 4 reduces to P(e+|c+) for c = c+ and to P(e+|c-) for c = c-, again satisfying the conditions for w_B and w_C to be maximum likelihood estimates.
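This correspondence can be checked numerically. The sketch below is our own illustration under this reading of the text (not the authors' implementation): it grid-searches the maximum likelihood values of (w_B, w_C) for case 3's data under both parameterizations, recovering ΔP for the linear gate and causal power for the noisy-OR gate.

```python
import itertools
import math

def log_lik(q1, q0, n1p, n1, n0p, n0):
    """Log likelihood (Equation 5) of n1p/n1 effects with the cause present
    (gate output q1) and n0p/n0 with it absent (gate output q0)."""
    eps = 1e-12
    return (n1p * math.log(q1 + eps) + (n1 - n1p) * math.log(1 - q1 + eps) +
            n0p * math.log(q0 + eps) + (n0 - n0p) * math.log(1 - q0 + eps))

def linear(wb, wc, c):      # Equation 3, clipped to remain a probability
    return min(1.0, wb + wc * c)

def noisy_or(wb, wc, c):    # Equation 4
    return 1.0 - (1.0 - wb) * (1.0 - wc) ** c

def mle(gate, data, steps=200):
    """Grid search for the (w_B, w_C) maximizing the log likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(itertools.product(grid, grid),
               key=lambda w: log_lik(gate(w[0], w[1], 1),
                                     gate(w[0], w[1], 0), *data))

case3 = (53, 100, 46, 100)   # case 3: 53/100 injected, 46/100 uninjected
wb_lin, wc_lin = mle(linear, case3)
wb_nor, wc_nor = mle(noisy_or, case3)
print(wc_lin)   # ~0.07 = Delta-P
print(wc_nor)   # ~0.13 = power = 0.07 / 0.54
```

In both cases w_B comes out at P(e+|c-) = 0.46, as the identification above requires.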
2.2 Structural inferences: causal support and χ²

The central claim of this paper is that people's judgments of P(C → E) reflect something other than estimates of causal strength parameters, the quantities that we have just shown to be computed by ΔP and causal power. Rather, people's judgments may correspond to inferences about the underlying causal structure, such as the probability that C is a direct cause of E. In terms of the graphical models in Figure 1, human causal induction may be focused on trying to distinguish between Graph1, in which C is a parent of E, and the "null hypothesis" of Graph0, in which C is not. This structural inference can be formalized as a Bayesian decision.
Let h_C be a binary variable indicating whether or not the link C → E exists in the true causal model responsible for generating our observations. We will assume a noisy-OR gate, and thus our model is closely related to causal power. However, we propose to model human estimates of P(C → E) as causal support, the log posterior odds in favor of Graph1 (h_C = 1) over Graph0 (h_C = 0):

    support = log [ P(h_C = 1 | X) / P(h_C = 0 | X) ].   (10)

Via Bayes' rule, we can express P(h_C = 1 | X) in terms of the marginal likelihood or evidence, P(X | h_C = 1), and the prior probability that C is a cause of E, P(h_C = 1):

    P(h_C = 1 | X) ∝ P(X | h_C = 1) P(h_C = 1).   (11)

Computing the evidence requires integrating the likelihood P(X | w_B, w_C) over all possible values of the strength parameters:

    P(X | h_C = 1) = ∫∫ P(X | w_B, w_C) p(w_B, w_C | h_C = 1) dw_B dw_C.

We take p(w_B, w_C | h_C = 1) to be a uniform density, and we note that P(X | w_B, w_C) is simply the exponential of L(X | w_B, w_C) as defined in Equation 5.
In general, evaluating causal support may require fairly involved computations, but in the limit of large N and weak causal strength w_C, it can be approximated by the familiar χ² statistic for independence,

    N Σ_{c,e} [P(c, e) - P0(c, e)]² / P0(c, e),

where P0(c, e) = P(c) P(e) is the factorized approximation to the joint distribution of C and E.
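As an illustration (our own sketch, with a uniform parameter prior and equal prior odds on the two graphs, not the authors' implementation), causal support can be approximated by midpoint-rule numerical integration and compared against χ² for cases 2 and 3 from the introduction:

```python
import math

def log_lik(q1, q0, n1p, n1, n0p, n0):
    """Bernoulli log likelihood of the contingency data (Equation 5)."""
    eps = 1e-12
    return (n1p * math.log(q1 + eps) + (n1 - n1p) * math.log(1 - q1 + eps) +
            n0p * math.log(q0 + eps) + (n0 - n0p) * math.log(1 - q0 + eps))

def support(n1p, n1, n0p, n0, k=100):
    """Log posterior odds of Graph1 over Graph0 (Equation 10), assuming
    uniform p(wB, wC) and equal prior probability on the two graphs."""
    ws = [(i + 0.5) / k for i in range(k)]   # midpoint rule on (0, 1)
    # Graph1: noisy-OR gate, P(e+|c+) = 1 - (1-wB)(1-wC), P(e+|c-) = wB.
    ev1 = sum(math.exp(log_lik(1 - (1 - wb) * (1 - wc), wb, n1p, n1, n0p, n0))
              for wb in ws for wc in ws) / k ** 2
    # Graph0: no C -> E link, so P(e+) = wB in both conditions.
    ev0 = sum(math.exp(log_lik(wb, wb, n1p, n1, n0p, n0)) for wb in ws) / k
    return math.log(ev1 / ev0)

def chi2(n1p, n1, n0p, n0):
    """Pearson chi-square statistic for independence of C and E."""
    p_e = (n1p + n0p) / (n1 + n0)
    cells = [(n1p, n1 * p_e), (n1 - n1p, n1 * (1 - p_e)),
             (n0p, n0 * p_e), (n0 - n0p, n0 * (1 - p_e))]
    return sum((o - e) ** 2 / e for o, e in cells)

case2, case3 = (7, 100, 0, 100), (53, 100, 46, 100)
s2, s3 = support(*case2), support(*case3)
c2, c3 = chi2(*case2), chi2(*case3)
print(s2, s3, c2, c3)
```

Both measures rank case 2 well above case 3, matching people's ordering where ΔP does not.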
3 Comparison with experimental data

We now consider the two parameter estimation models, ΔP and causal power, and the two structural inference models, causal support and χ², as accounts of data from three behavioral experiments, each designed to address different aspects of human causal induction.
To compensate for possible nonlinearities in people's use of numerical rating scales on these tasks, all model predictions have been scaled by power-law transformations, f(x) = sign(x) |x|^γ, with γ chosen separately for each model and each data set to maximize their linear correlation with the data. In the figures, predictions are expressed over the same range as the data, with minimum and maximum values aligned.
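The scaling step can be implemented as a simple grid search. This is a hedged sketch of one way to do it; the text does not specify the fitting procedure beyond maximizing linear correlation:

```python
def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def power_transform(preds, gamma):
    """f(x) = sign(x) * |x|**gamma, applied elementwise."""
    return [(-1.0 if x < 0 else 1.0) * abs(x) ** gamma for x in preds]

def best_gamma(preds, ratings, grid=None):
    """Pick gamma maximizing the correlation of f(preds) with the ratings."""
    grid = grid or [g / 10 for g in range(1, 31)]   # gamma in 0.1 .. 3.0
    return max(grid, key=lambda g: pearson(power_transform(preds, g), ratings))
```

For example, if the ratings happened to be exactly the squares of positive predictions, the search would recover gamma = 2.0, where the correlation reaches 1.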
Figure 2 presents data from a study by Buehner & Cheng [2], designed to contrast the predictions of ΔP and causal power. People judged P(C → E) for hypothetical medical studies much like the gene expression scenarios described above, seeing eight cases in which C occurred and eight in which C did not occur. Some trends in the data are clearly captured by the causal power model but not by ΔP, such as the monotonic decrease in P(C → E) as ΔP stays constant but P(e+|c-), and hence power, decreases (columns 6-9). Other trends are clearly captured by ΔP but not by the power model, like the monotonic increase in P(C → E) as P(e+|c+) stays constant at 1.00 while ΔP increases and power stays constant. However, one of the most salient trends is captured by neither model: the decrease in P(C → E) as ΔP stays constant at 0 but P(e+|c-) decreases (columns 1-5).
The causal support model predicts this decrease, as well as the other trends. The intuition behind the model's predictions for ΔP = 0 is that decreasing the base rate P(e+|c-) increases the opportunity to observe the cause's influence, and thus increases the statistical force behind the inference that C does not cause E, given ΔP = 0. This effect is most obvious when P(e+|c+) = P(e+|c-) = 1, yielding a ceiling effect with no statistical leverage [3], but it also occurs to a lesser extent for P(e+|c-) < 1.
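The base-rate effect at ΔP = 0 can be seen directly in the support model. The sketch below is our own (not the authors'; uniform parameter priors, equal prior odds, midpoint-rule integration), comparing two ΔP = 0 data sets of 100 trials per condition:

```python
import math

def support(n1p, n1, n0p, n0, k=100):
    """Causal support (Equation 10): log Bayes factor of noisy-OR Graph1
    over Graph0, with uniform priors on the strength parameters."""
    eps = 1e-12
    def ll(q1, q0):
        return (n1p * math.log(q1 + eps) + (n1 - n1p) * math.log(1 - q1 + eps) +
                n0p * math.log(q0 + eps) + (n0 - n0p) * math.log(1 - q0 + eps))
    ws = [(i + 0.5) / k for i in range(k)]
    ev1 = sum(math.exp(ll(1 - (1 - wb) * (1 - wc), wb))
              for wb in ws for wc in ws) / k ** 2
    ev0 = sum(math.exp(ll(wb, wb)) for wb in ws) / k
    return math.log(ev1 / ev0)

# Delta-P = 0 in both cases, but the base rate differs.
ceiling = support(100, 100, 100, 100)   # P(e+|c+) = P(e+|c-) = 1.0
low_base = support(10, 100, 10, 100)    # P(e+|c+) = P(e+|c-) = 0.1
print(ceiling, low_base)
```

Support stays near zero at the ceiling, where the data cannot rule a link in or out, but is clearly negative at the lower base rate, mirroring the drop in people's ratings across columns 1-5.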
While χ² generally approximates the support model rather well, it also fails to explain the cases with P(e+|c+) = P(e+|c-), which always yield χ² = 0. The superior fit of the support model is reflected in its correlation with the data: it achieves the highest R² of the four models.
Figure 3 shows results from an experiment conducted by Lober and Shanks [6], designed to explore the trend in Buehner and Cheng's experiment that was predicted by ΔP but not by the power model. Columns 1-3 show a second situation in which the predictions of the power model are constant but judgments of P(C → E) increase. Columns 8-10 feature three scenarios with equal ΔP, for which the causal power model predicts a decreasing trend. For each of these trends the ΔP model outperforms the causal power model in overall R². However, it is important to note that the responses of the human subjects in columns 8-10 are not quite consistent with the predictions of ΔP: they show a slight V-shaped non-linearity, with P(C → E) judged to be smaller for the less extreme contingencies. This trend is predicted, however, by the causal support model and its χ² approximation, which both give a slightly better R².
Our third data set comes from an experiment we conducted ourselves (Figure 4). Participants each provided a judgment of P(C → E) in 14 different medical scenarios, where information about P(e+|c+) and P(e+|c-) was provided in terms of how many mice from a sample of 100 expressed a particular gene. Columns 1-3, 5-7, and 9-11 show contingency structures designed to elicit V-shaped trends in P(C → E). Column 14 attempted to explore the effects of manipulating sample size, with a contingency structure of {7/7, 93/193}. In each case, we observed the predicted nonlinearity: in a set of situations with the same ΔP, the situations involving less extreme probabilities show reduced judgments of P(C → E). These non-linearities are not consistent with the ΔP model, but are predicted by both causal support and χ². The support model gives a slightly worse fit than χ² (R² = 0.80), while the power model gives a poor account of the data.
4 Conclusions and future directions

In each of the studies above, the structural inference models based on causal support or χ² consistently outperformed the parameter estimation models, ΔP and causal power. While ΔP and causal power were each capable of capturing certain trends in the data, causal support was the only model capable of predicting all the trends. For the third data set, χ² provided a significantly better fit to the data than did causal support. This finding merits future investigation in a study designed to tease apart χ² and causal support; in any case, due to the close relationship between the two models, this result does not undermine our claim that probabilistic structural inferences are central to human causal induction.
One unique advantage of the Bayesian causal support model is its ability to draw inferences from very few observations. We have begun a line of experiments, inspired by Gopnik, Sobel & Glymour (submitted), to examine how adults revise their causal judgments when given only one or two observations, rather than the large samples used in the studies above.
In one such study, participants were given two pencils, analogous to B and C in Figure 1, and asked to rate how likely these pencils were to have superlead, that is, to cause a superlead detector to activate. Next, they were shown that the superlead detector responded when B and C were tested together, and their causal ratings of both B and C increased. Finally, they were shown that B set off the superlead detector on its own, and causal ratings of B increased to ceiling while ratings of C returned to their prior levels. This situation is exactly analogous to that explored in the medical tasks described above, and people were able to perform accurate causal inductions given only one trial of each type. Of the models we have considered, only Bayesian causal support can explain this behavior, by allowing the prior in Equation 11 to adapt depending on whether superlead is rare or common.
We also hope to look at inferences about more complex causal structures, including those with hidden variables. With just a single cause, causal support and χ² are highly correlated, but with more complex structures, the Bayesian computation of causal support becomes increasingly intractable while the χ² approximation becomes less accurate. Through experiments with more complex structures, we hope to discover where and how human causal induction strikes a balance between ponderous rationality and efficient heuristic.
Finally, we should stress that despite the superior performance of the structural inference models here, in many situations estimating causal strength parameters is likely to be just as important as inferring causal structure. Our hope is that by using graphical models to relate and extend existing accounts of causal induction, we have provided a framework for exploring the interplay between the different kinds of judgments that people make.
References

Cheng, P. (1997). Causal induction: The power PC theory versus the Rescorla-Wagner theory.
Figure 1: Different theories of human causal induction expressed as different operations on a simple graphical model. The ΔP and power models correspond to maximum likelihood parameter estimates on a fixed graph (Graph1), while the support model corresponds to a (Bayesian) inference about which graph is the true causal structure.
Figure 2: Computational models compared with the performance of human participants from Buehner and Cheng [1], Experiment 1B.
Figure 4: Computational models compared with the performance of human participants on a set of stimuli designed to elicit the non-monotonic trends shown in the data of Lober and Shanks [5].
wordName wordTfidf (topN-words)
[('causal', 0.729), ('flp', 0.253), ('wb', 0.218), ('power', 0.177), ('induction', 0.163), ('people', 0.141), ('cheng', 0.099), ('le', 0.095), ('graph', 0.094), ('trends', 0.093), ('support', 0.091), ('human', 0.091), ('grapho', 0.091), ('inferences', 0.086), ('judgments', 0.078), ('psychological', 0.077), ('cause', 0.074), ('lc', 0.074), ('columns', 0.072), ('buehner', 0.072), ('lober', 0.072), ('ratings', 0.072), ('rats', 0.071), ('ic', 0.066), ('gene', 0.065), ('shanks', 0.062), ('graphical', 0.061), ('causes', 0.059), ('predictions', 0.057), ('scenarios', 0.056), ('ihe', 0.054), ('supedead', 0.054), ('superlead', 0.054), ('likelihood', 0.054), ('glymour', 0.047), ('participants', 0.047), ('stays', 0.047), ('expressed', 0.045), ('structural', 0.045), ('theories', 0.044), ('models', 0.043), ('chemical', 0.042), ('strength', 0.04), ('model', 0.04), ('designed', 0.04), ('injected', 0.039), ('claim', 0.039), ('chemicals', 0.036), ('pencils', 0.036), ('uninjected', 0.036), ('equation', 0.035), ('decreases', 0.035), ('monotonic', 0.035), ('bayesian', 0.033), ('ei', 0.033), ('psychology', 0.033), ('inference', 0.033), ('opposed', 0.032), ('detector', 0.032), ('maximum', 0.032), ('intuitions', 0.031), ('rating', 0.031), ('causation', 0.031), ('ceiling', 0.031), ('elicit', 0.031), ('effect', 0.031), ('upon', 0.031), ('subjects', 0.03), ('hope', 0.03), ('humans', 0.03), ('medical', 0.029), ('parent', 0.029), ('explore', 0.029), ('predicted', 0.029), ('parameter', 0.028), ('survey', 0.028), ('contingency', 0.028), ('lei', 0.028), ('estimates', 0.027), ('captured', 0.026), ('judged', 0.026), ('iw', 0.026), ('trial', 0.026), ('situations', 0.026), ('structure', 0.025), ('log', 0.025), ('structures', 0.025), ('occurred', 0.025), ('inferring', 0.025), ('parameterization', 0.025), ('case', 0.024), ('effects', 0.024), ('analogous', 0.024), ('explain', 0.024), ('reflected', 0.023), ('gate', 0.023), ('studies', 0.022), ('correspond', 0.022), ('behavioral', 0.022), ('conducted', 
0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 127 nips-2000-Structure Learning in Human Causal Induction
Author: Joshua B. Tenenbaum, Thomas L. Griffiths
Abstract: We use graphical models to explore the question of how people learn simple causal relationships from data. The two leading psychological theories can both be seen as estimating the parameters of a fixed graph. We argue that a complete account of causal induction should also consider how people learn the underlying causal graph structure, and we propose to model this inductive process as a Bayesian inference. Our argument is supported through the discussion of three data sets.
2 0.11296694 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts
Author: Guy Mayraz, Geoffrey E. Hinton
Abstract: The product of experts learning procedure [1] can discover a set of stochastic binary features that constitute a non-linear generative model of handwritten images of digits. The quality of generative models learned in this way can be assessed by learning a separate model for each class of digit and then comparing the unnormalized probabilities of test images under the 10 different class-specific models. To improve discriminative performance, it is helpful to learn a hierarchy of separate models for each digit class. Each model in the hierarchy has one layer of hidden units and the nth level model is trained on data that consists of the activities of the hidden units in the already trained (n - l)th level model. After training, each level produces a separate, unnormalized log probabilty score. With a three-level hierarchy for each of the 10 digit classes, a test image produces 30 scores which can be used as inputs to a supervised, logistic classification network that is trained on separate data. On the MNIST database, our system is comparable with current state-of-the-art discriminative methods, demonstrating that the product of experts learning procedure can produce effective generative models of high-dimensional data. 1 Learning products of stochastic binary experts Hinton [1] describes a learning algorithm for probabilistic generative models that are composed of a number of experts. Each expert specifies a probability distribution over the visible variables and the experts are combined by multiplying these distributions together and renormalizing. (1) where d is a data vector in a discrete space, Om is all the parameters of individual model m, Pm(dIOm) is the probability of d under model m, and c is an index over all possible vectors in the data space. 
A Restricted Boltzmann machine [2, 3] is a special case of a product of experts in which each expert is a single, binary stochastic hidden unit that has symmetrical connections to a set of visible units, and connections between the hidden units are forbidden. Inference in an RBM is much easier than in a general Boltzmann machine and it is also much easier than in a causal belief net because there is no explaining away. There is therefore no need to perform any iteration to determine the activities of the hidden units. The hidden states, Sj , are conditionally independent given the visible states, Si, and the distribution of Sj is given by the standard logistic function : 1 p(Sj = 1) = (2) 1 + exp( - Li WijSi) Conversely, the hidden states of an RBM are marginally dependent so it is easy for an RBM to learn population codes in which units may be highly correlated. It is hard to do this in causal belief nets with one hidden layer because the generative model of a causal belief net assumes marginal independence. An RBM can be trained using the standard Boltzmann machine learning algorithm which follows a noisy but unbiased estimate of the gradient of the log likelihood of the data. One way to implement this algorithm is to start the network with a data vector on the visible units and then to alternate between updating all of the hidden units in parallel and updating all of the visible units in parallel. Each update picks a binary state for a unit from its posterior distribution given the current states of all the units in the other set. 
If this alternating Gibbs sampling is run to equilibrium, there is a very simple way to update the weights so as to minimize the Kullback-Leibler divergence, QOIIQoo, between the data distribution, QO, and the equilibrium distribution of fantasies over the visible units, Qoo, produced by the RBM [4]: flWij oc QO - Q~ (3) where < SiSj >Qo is the expected value of SiSj when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and Q ~ is the expected value of SiSj after prolonged Gibbs sampling. This learning rule does not work well because it can take a long time to approach thermal equilibrium and the sampling noise in the estimate of Q ~ can swamp the gradient. [1] shows that it is far more effective to minimize the difference between QOllQoo and Q111Qoo where Q1 is the distribution of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. The exact gradient of this
3 0.075879335 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
Author: Simon Tong, Daphne Koller
Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.
4 0.071894176 16 nips-2000-Active Inference in Concept Learning
Author: Jonathan D. Nelson, Javier R. Movellan
Abstract: People are active experimenters, not just passive observers, constantly seeking new information relevant to their goals. A reasonable approach to active information gathering is to ask questions and conduct experiments that maximize the expected information gain, given current beliefs (Fedorov 1972, MacKay 1992, Oaksford & Chater 1994). In this paper we present results on an exploratory experiment designed to study people's active information gathering behavior on a concept learning task (Tenenbaum 2000). The results of the experiment are analyzed in terms of the expected information gain of the questions asked by subjects. In scientific inquiry and in everyday life, people seek out information relevant to perceptual and cognitive tasks. Scientists perform experiments to uncover causal relationships; people saccade to informative areas of visual scenes, turn their head towards surprising sounds, and ask questions to understand the meaning of concepts . Consider a person learning a foreign language, who notices that a particular word,
5 0.058352813 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
Author: Yee Whye Teh, Geoffrey E. Hinton
Abstract: We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. Our method compares favorably with other methods in the literature. The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations.
6 0.056685578 54 nips-2000-Feature Selection for SVMs
7 0.056516796 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
8 0.04897001 13 nips-2000-A Tighter Bound for Graphical Models
9 0.044670779 15 nips-2000-Accumulator Networks: Suitors of Local Probability Propagation
10 0.044291146 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
11 0.043567676 49 nips-2000-Explaining Away in Weight Space
12 0.042983215 82 nips-2000-Learning and Tracking Cyclic Human Motion
13 0.042775024 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
14 0.042062897 114 nips-2000-Second Order Approximations for Probability Models
15 0.041660983 92 nips-2000-Occam's Razor
16 0.04019447 143 nips-2000-Using the Nyström Method to Speed Up Kernel Machines
17 0.037991632 10 nips-2000-A Productive, Systematic Framework for the Representation of Visual Structure
18 0.037864596 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
19 0.037241317 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
20 0.035701245 103 nips-2000-Probabilistic Semantic Video Indexing
topicId topicWeight
[(0, 0.138), (1, -0.019), (2, 0.053), (3, -0.023), (4, 0.061), (5, 0.005), (6, 0.064), (7, 0.015), (8, 0.038), (9, 0.055), (10, 0.056), (11, -0.05), (12, 0.083), (13, -0.03), (14, 0.116), (15, 0.015), (16, -0.06), (17, -0.002), (18, -0.105), (19, 0.008), (20, -0.007), (21, 0.159), (22, 0.01), (23, -0.126), (24, 0.097), (25, -0.046), (26, 0.056), (27, 0.023), (28, -0.01), (29, -0.01), (30, 0.019), (31, 0.011), (32, -0.265), (33, 0.029), (34, 0.164), (35, 0.148), (36, 0.089), (37, 0.002), (38, -0.081), (39, -0.078), (40, 0.231), (41, 0.07), (42, 0.049), (43, 0.164), (44, 0.153), (45, -0.144), (46, -0.027), (47, -0.109), (48, -0.161), (49, 0.144)]
simIndex simValue paperId paperTitle
same-paper 1 0.96365583 127 nips-2000-Structure Learning in Human Causal Induction
Author: Joshua B. Tenenbaum, Thomas L. Griffiths
Abstract: We use graphical models to explore the question of how people learn simple causal relationships from data. The two leading psychological theories can both be seen as estimating the parameters of a fixed graph. We argue that a complete account of causal induction should also consider how people learn the underlying causal graph structure, and we propose to model this inductive process as a Bayesian inference. Our argument is supported through the discussion of three data sets.
2 0.64036149 16 nips-2000-Active Inference in Concept Learning
Author: Jonathan D. Nelson, Javier R. Movellan
Abstract: People are active experimenters, not just passive observers, constantly seeking new information relevant to their goals. A reasonable approach to active information gathering is to ask questions and conduct experiments that maximize the expected information gain, given current beliefs (Fedorov 1972, MacKay 1992, Oaksford & Chater 1994). In this paper we present results on an exploratory experiment designed to study people's active information gathering behavior on a concept learning task (Tenenbaum 2000). The results of the experiment are analyzed in terms of the expected information gain of the questions asked by subjects. In scientific inquiry and in everyday life, people seek out information relevant to perceptual and cognitive tasks. Scientists perform experiments to uncover causal relationships; people saccade to informative areas of visual scenes, turn their head towards surprising sounds, and ask questions to understand the meaning of concepts . Consider a person learning a foreign language, who notices that a particular word,
3 0.36064494 132 nips-2000-The Interplay of Symbolic and Subsymbolic Processes in Anagram Problem Solving
Author: David B. Grimes, Michael Mozer
Abstract: Although connectionist models have provided insights into the nature of perception and motor control, connectionist accounts of higher cognition seldom go beyond an implementation of traditional symbol-processing theories. We describe a connectionist constraint satisfaction model of how people solve anagram problems. The model exploits statistics of English orthography, but also addresses the interplay of subsymbolic and symbolic computation by a mechanism that extracts approximate symbolic representations (partial orderings of letters) from subsymbolic structures and injects the extracted representation back into the model to assist in the solution of the anagram. We show the computational benefit of this extraction-injection process and discuss its relationship to conscious mental processes and working memory. We also account for experimental data concerning the difficulty of anagram solution based on the orthographic structure of the anagram string and the target word. Historically, the mind has been viewed from two opposing computational perspectives. The symbolic perspective views the mind as a symbolic information processing engine. According to this perspective, cognition operates on representations that encode logical relationships among discrete symbolic elements, such as stacks and structured trees, and cognition involves basic operations such as means-ends analysis and best-first search. In contrast, the subsymbolic perspective views the mind as performing statistical inference, and involves basic operations such as constraint-satisfaction search. The data structures on which these operations take place are numerical vectors. In some domains of cognition, significant progress has been made through analysis from one computational perspective or the other. The thesis of our work is that many of these domains might be understood more completely by focusing on the interplay of subsymbolic and symbolic information processing.
Consider the higher-cognitive domain of problem solving. At an abstract level of description, problem solving tasks can readily be formalized in terms of symbolic representations and operations. However, the neurobiological hardware that underlies human cognition appears to be subsymbolic: representations are noisy and graded, and the brain operates and adapts in a continuous fashion that is difficult to characterize in discrete symbolic terms. At some level, between the computational level of the task description and the implementation level of human neurobiology, the symbolic and subsymbolic accounts must come into contact with one another. We focus on this point of contact by proposing mechanisms by which symbolic representations can modulate subsymbolic processing, and mechanisms by which subsymbolic representations are made symbolic. We conjecture that these mechanisms can not only provide an account for the interplay of symbolic and subsymbolic processes in cognition, but that they form a sensible computational strategy that outperforms purely subsymbolic computation, and hence, symbolic reasoning makes sense from an evolutionary perspective. In this paper, we apply our approach to a high-level cognitive task, anagram problem solving. An anagram is a nonsense string of letters whose letters can be rearranged to form a word. For example, the solution to the anagram puzzle RYTEHO is THEORY. Anagram solving is an interesting task because it taps higher cognitive abilities and issues of awareness, it has a tractable state space, and interesting psychological data is available to model. 1 A Subsymbolic Computational Model We start by presenting a purely subsymbolic model of anagram processing. By subsymbolic, we mean that the model utilizes only English orthographic statistics and does not have access to an English lexicon. We will argue that this model proves insufficient to explain human performance on anagram problem solving.
However, it is a key component of a hybrid symbolic-subsymbolic model we propose, and is thus described in detail. 1.1 Problem Representation A computational model of anagram processing must represent letter orderings. For example, the model must be capable of representing a solution such as
4 0.33280829 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts
Author: Guy Mayraz, Geoffrey E. Hinton
Abstract: The product of experts learning procedure [1] can discover a set of stochastic binary features that constitute a non-linear generative model of handwritten images of digits. The quality of generative models learned in this way can be assessed by learning a separate model for each class of digit and then comparing the unnormalized probabilities of test images under the 10 different class-specific models. To improve discriminative performance, it is helpful to learn a hierarchy of separate models for each digit class. Each model in the hierarchy has one layer of hidden units and the nth level model is trained on data that consists of the activities of the hidden units in the already trained (n-1)th level model. After training, each level produces a separate, unnormalized log probability score. With a three-level hierarchy for each of the 10 digit classes, a test image produces 30 scores which can be used as inputs to a supervised, logistic classification network that is trained on separate data. On the MNIST database, our system is comparable with current state-of-the-art discriminative methods, demonstrating that the product of experts learning procedure can produce effective generative models of high-dimensional data. 1 Learning products of stochastic binary experts Hinton [1] describes a learning algorithm for probabilistic generative models that are composed of a number of experts. Each expert specifies a probability distribution over the visible variables and the experts are combined by multiplying these distributions together and renormalizing: p(d | θ_1, ..., θ_n) = Π_m p_m(d | θ_m) / Σ_c Π_m p_m(c | θ_m) (1), where d is a data vector in a discrete space, θ_m is all the parameters of individual model m, p_m(d | θ_m) is the probability of d under model m, and c is an index over all possible vectors in the data space.
A Restricted Boltzmann machine [2, 3] is a special case of a product of experts in which each expert is a single, binary stochastic hidden unit that has symmetrical connections to a set of visible units, and connections between the hidden units are forbidden. Inference in an RBM is much easier than in a general Boltzmann machine and it is also much easier than in a causal belief net because there is no explaining away. There is therefore no need to perform any iteration to determine the activities of the hidden units. The hidden states, s_j, are conditionally independent given the visible states, s_i, and the distribution of s_j is given by the standard logistic function: p(s_j = 1) = 1 / (1 + exp(-Σ_i w_ij s_i)) (2). Conversely, the hidden states of an RBM are marginally dependent, so it is easy for an RBM to learn population codes in which units may be highly correlated. It is hard to do this in causal belief nets with one hidden layer because the generative model of a causal belief net assumes marginal independence. An RBM can be trained using the standard Boltzmann machine learning algorithm, which follows a noisy but unbiased estimate of the gradient of the log likelihood of the data. One way to implement this algorithm is to start the network with a data vector on the visible units and then to alternate between updating all of the hidden units in parallel and updating all of the visible units in parallel. Each update picks a binary state for a unit from its posterior distribution given the current states of all the units in the other set.
If this alternating Gibbs sampling is run to equilibrium, there is a very simple way to update the weights so as to minimize the Kullback-Leibler divergence, KL(Q^0 || Q^∞), between the data distribution, Q^0, and the equilibrium distribution of fantasies over the visible units, Q^∞, produced by the RBM [4]: Δw_ij ∝ <s_i s_j>_{Q^0} - <s_i s_j>_{Q^∞} (3), where <s_i s_j>_{Q^0} is the expected value of s_i s_j when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and <s_i s_j>_{Q^∞} is the expected value of s_i s_j after prolonged Gibbs sampling. This learning rule does not work well because it can take a long time to approach thermal equilibrium and the sampling noise in the estimate of <s_i s_j>_{Q^∞} can swamp the gradient. [1] shows that it is far more effective to minimize the difference between KL(Q^0 || Q^∞) and KL(Q^1 || Q^∞), where Q^1 is the distribution of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. The exact gradient of this
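The one-step reconstruction scheme described above (contrastive divergence) can be sketched in a few lines of numpy. This is a simplified illustration under assumptions not stated in the abstract: biases are omitted, the negative-phase statistics use probabilities rather than sampled binary states, and the toy data and learning rate are invented here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, v0, rng, lr=0.1):
    # One contrastive-divergence update: compare <s_i s_j> with data clamped
    # against <s_i s_j> after a single Gibbs reconstruction step.
    ph0 = sigmoid(v0 @ W)                     # p(h = 1 | data)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0  # sample binary hidden states
    pv1 = sigmoid(h0 @ W.T)                   # one-step reconstruction of visibles
    ph1 = sigmoid(pv1 @ W)                    # hidden probabilities given reconstruction
    return W + lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)

# Toy data: two perfectly correlated visible units, which a single
# hidden unit can capture as a population feature.
data = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
W = 0.01 * rng.standard_normal((2, 1))
W_init = W.copy()
for _ in range(500):
    W = cd1_step(W, data, rng)
```

The update rule mirrors equation (3), with the prolonged-Gibbs term replaced by the cheap one-step reconstruction term, which is exactly the substitution the abstract argues is more effective in practice.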
5 0.29004124 143 nips-2000-Using the Nyström Method to Speed Up Kernel Machines
Author: Christopher K. I. Williams, Matthias Seeger
Abstract: unknown-abstract
6 0.28198692 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
7 0.26757571 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
8 0.24725786 43 nips-2000-Dopamine Bonuses
9 0.24453151 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
10 0.23979197 103 nips-2000-Probabilistic Semantic Video Indexing
11 0.2381776 116 nips-2000-Sex with Support Vector Machines
12 0.21895167 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
13 0.2184175 52 nips-2000-Fast Training of Support Vector Classifiers
14 0.21116811 82 nips-2000-Learning and Tracking Cyclic Human Motion
15 0.20575498 54 nips-2000-Feature Selection for SVMs
16 0.20523369 42 nips-2000-Divisive and Subtractive Mask Effects: Linking Psychophysics and Biophysics
17 0.20253737 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
18 0.19921753 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
19 0.19520347 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
topicId topicWeight
[(10, 0.025), (17, 0.119), (32, 0.026), (33, 0.058), (55, 0.038), (62, 0.042), (65, 0.037), (67, 0.057), (68, 0.305), (76, 0.04), (79, 0.039), (81, 0.032), (90, 0.013), (91, 0.021), (97, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.82916152 127 nips-2000-Structure Learning in Human Causal Induction
Author: Joshua B. Tenenbaum, Thomas L. Griffiths
Abstract: We use graphical models to explore the question of how people learn simple causal relationships from data. The two leading psychological theories can both be seen as estimating the parameters of a fixed graph. We argue that a complete account of causal induction should also consider how people learn the underlying causal graph structure, and we propose to model this inductive process as a Bayesian inference. Our argument is supported through the discussion of three data sets.
2 0.76988423 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia
Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in two of its arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks, but one that (1) has those properties and (2) is a universal approximator of continuous functions with these and other properties. We apply this new class of functions to the task of modeling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints.
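One way such architectural constraints can be realized is sketched below. This is an illustration of the general principle, not the paper's exact function class: non-negative weights (enforced by exponentiating free parameters) combined with a convex, non-decreasing activation (softplus) yield a network output that is non-decreasing and convex in every input; here all inputs are constrained for simplicity, whereas the paper constrains only selected arguments.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0)

def constrained_net(x, W, b, a, c):
    # W >= 0 and a >= 0 make the output non-decreasing in each input;
    # softplus is convex and non-decreasing, and a non-negative
    # combination of convex non-decreasing functions stays convex.
    return c + softplus(x @ W + b) @ a

rng = np.random.default_rng(1)
d, h = 2, 8
W = np.exp(rng.standard_normal((d, h)))  # positive hidden weights
b = rng.standard_normal(h)
a = np.exp(rng.standard_normal(h))       # positive output weights
c = 0.0

# Sweep the first input (e.g. an underlying-price-like argument),
# holding the second fixed:
xs = np.linspace(0.0, 2.0, 5)
ys = [float(constrained_net(np.array([x, 1.0]), W, b, a, c)) for x in xs]
```

Monotonicity shows up as strictly increasing outputs along the sweep, and convexity as non-negative second differences, regardless of the random initialization, because the constraints hold by construction rather than by training.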
3 0.47067037 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
Author: Zoubin Ghahramani, Matthew J. Beal
Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1
4 0.44862816 37 nips-2000-Convergence of Large Margin Separable Linear Classification
Author: Tong Zhang
Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed
5 0.44706059 94 nips-2000-On Reversing Jensen's Inequality
Author: Tony Jebara, Alex Pentland
Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1
6 0.44498888 74 nips-2000-Kernel Expansions with Unlabeled Examples
7 0.44475845 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
8 0.44461164 122 nips-2000-Sparse Representation for Gaussian Process Models
9 0.44436482 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
10 0.44296297 60 nips-2000-Gaussianization
11 0.44256866 146 nips-2000-What Can a Single Neuron Compute?
12 0.44180623 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
13 0.43938363 133 nips-2000-The Kernel Gibbs Sampler
14 0.43695775 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
15 0.43597224 49 nips-2000-Explaining Away in Weight Space
16 0.43574119 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
17 0.43552557 79 nips-2000-Learning Segmentation by Random Walks
18 0.43500265 4 nips-2000-A Linear Programming Approach to Novelty Detection
19 0.43471164 130 nips-2000-Text Classification using String Kernels
20 0.43414393 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks