nips nips2011 nips2011-194 knowledge-graph by maker-knowledge-mining

194 nips-2011-On Causal Discovery with Cyclic Additive Noise Models


Source: pdf

Author: Joris M. Mooij, Dominik Janzing, Tom Heskes, Bernhard Schölkopf

Abstract: We study a particular class of cyclic causal models, where each variable is a (possibly nonlinear) function of its parents and additive noise. We prove that the causal graph of such models is generically identifiable in the bivariate, Gaussian-noise case. We also propose a method to learn such models from observational data. In the acyclic case, the method reduces to ordinary regression, but in the more challenging cyclic case, an additional term arises in the loss function, which makes it a special case of nonlinear independent component analysis. We illustrate the proposed method on synthetic data. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We study a particular class of cyclic causal models, where each variable is a (possibly nonlinear) function of its parents and additive noise. [sent-13, score-0.767]

2 We prove that the causal graph of such models is generically identifiable in the bivariate, Gaussian-noise case. [sent-14, score-0.492]

3 In the acyclic case, the method reduces to ordinary regression, but in the more challenging cyclic case, an additional term arises in the loss function, which makes it a special case of nonlinear independent component analysis. [sent-16, score-0.423]

4 1 Introduction Causal discovery refers to a special class of statistical and machine learning methods that infer causal relationships between variables from data and prior knowledge [1, 2, 3]. [sent-18, score-0.516]

5 In this sense, causal discovery concentrates more on inferring the underlying mechanism that generated the data than on modeling the data itself. [sent-20, score-0.499]

6 An important assumption often made in causal discovery is that the causal mechanism is acyclic, i. [sent-21, score-0.923]

7 For example, if A causes B, and B causes C, then the possibility that C also causes A is usually excluded from the outset. [sent-24, score-0.143]

8 Nevertheless, causal cycles are known to occur frequently in biological systems such as gene regulatory networks and protein interaction networks. [sent-26, score-0.459]

9 One would expect that taking such feedback loops into account during data analysis should therefore significantly improve the quality of the inferred causal structure. [sent-27, score-0.472]

10 Essentially two strategies for dealing with cycles in causal models can be distinguished. [sent-28, score-0.479]

11 The first one is to perform repeated measurements in time, and to infer a causal model for the dynamics of the underlying system. [sent-29, score-0.423]

12 The fact that causes always precede their effects provides additional prior knowledge that simplifies causal discovery, which is exploited in methods based on Granger causality [4]. [sent-30, score-0.487]

13 However, all these methods need time series data where the temporal resolution of the measurements is high relative to the characteristic time scale of the feedback loops in order to rule out instantaneous cyclic relationships. [sent-32, score-0.272]

14 This noise models unobserved causes and is assumed to be different for each independent realization of the system, but constant during equilibration. [sent-38, score-0.208]

15 In the simplest case (assuming causal sufficiency), the noise terms are jointly independent. [sent-39, score-0.554]

16 An important novel aspect of our work is that we consider continuous-valued variables and nonlinear causal mechanisms. [sent-41, score-0.516]

17 Although the linear case has been studied in considerable detail already [6, 7, 8], as far as we know, nobody has yet investigated the (more realistic) case of nonlinear causal mechanisms. [sent-42, score-0.478]

18 The basic assumption made in [7] is the so-called Global Directed Markov Condition, which relates (conditional) independences between the variables with the structure of the causal graph. [sent-43, score-0.519]

19 In the cyclic case, however, it is not obvious what the relationship is with the class of nonlinear causal models that we consider here. [sent-44, score-0.721]

20 For instance, in the bivariate case, one cannot distinguish between X → Y, Y → X and X ⇄ Y using conditional independences alone. [sent-47, score-0.149]

21 Researchers have also studied cyclic causal models with discrete variables [9, 10]. [sent-48, score-0.704]

22 However, if the measured variables are intrinsically continuous-valued, it is desirable to avoid discretization as a preprocessing step, as this throws away information that is useful for causal discovery. [sent-49, score-0.461]

23 2 Cyclic additive noise models Let V be a finite index set. [sent-50, score-0.241]

24 Let (Xi )i∈V be random variables modeling measurable properties of the system of interest and let (Ei )i∈V be other random variables modeling unobservable noise sources. [sent-51, score-0.223]

25 We also assume that the noise variables (Ei)i∈V have densities and are jointly independent: $p(e_V) = \prod_{i \in V} p_{E_i}(e_i)$. [sent-53, score-0.194]

26 Under certain assumptions (see below), the following equations specify a unique probability distribution on the observable variables (Xi )i∈V : Xi = fi (Xpa(i) ) + Ei , i ∈ V. [sent-55, score-0.209]

27 The probability distribution p(X) induced by these equations is interpreted as the equilibrium distribution of an underlying dynamic system. [sent-57, score-0.201]
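
To make the structural equations concrete, here is a minimal sketch of a bivariate instance; the tanh mechanisms and unit noise scales are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch of a bivariate cyclic additive noise model
# X = f_X(Y) + E_X, Y = f_Y(X) + E_Y. The mechanisms and noise
# scales below are illustrative choices, not the paper's.
f_X = lambda y: 0.8 * np.tanh(y)   # direct causal effect of Y on X
f_Y = lambda x: 0.9 * np.tanh(x)   # direct causal effect of X on Y

rng = np.random.default_rng(0)
e_X, e_Y = rng.normal(size=2)      # independent unobserved causes
# An observation (x, y) is a solution of the coupled equations
# x = f_X(y) + e_X and y = f_Y(x) + e_Y (solved by iteration below).
```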

28 Each function fi represents a causal mechanism which determines Xi as a function of its parents Xpa(i) , which model its direct causes. [sent-58, score-0.521]

29 The noise variables can be interpreted as other, unobserved causes for their corresponding variables. [sent-59, score-0.231]

30 By assuming independence of the noise variables, we are assuming causal sufficiency, or in other words, absence of confounders (hidden common causes). [sent-60, score-0.582]

31 We call a model specified by (1) and (2) an additive noise model. [sent-61, score-0.221]

32 With any additive noise model we can associate a directed graph with vertices V and directed edges i → j if i ∈ pa(j), i. [sent-62, score-0.32]

33 If this graph is acyclic, we call the model an acyclic additive noise model. [sent-65, score-0.336]

34 If the graph contains (directed) cycles, we call the model a cyclic additive noise model. [sent-66, score-0.469]

35 Interpretation in the cyclic case Note that the presence of cycles increases the complexity of the model, because the equations (2) become recursive. [sent-67, score-0.358]

36 The interpretation of these equations also becomes less straightforward in the cyclic case. [sent-68, score-0.322]

37 In general, for a fixed noise value E = e, the fixed point equations x = f (x) + e can have any number of fixed points between 0 and ∞. [sent-69, score-0.23]

38 For simplicity, however, we will assume that for each noise value e there exists a unique fixed point x = F (e). [sent-70, score-0.156]

39 This interpretation also shows a way to sample from the joint distribution: First, one samples a joint value of the noise e. [sent-74, score-0.131]

40 Thus, the equations can be interpreted as the equilibrium distribution of a dynamic system in the presence of noise which is constant during equilibration, but differs across measurements (data points). [sent-78, score-0.348]
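
A sketch of this sampling scheme, assuming the fixed point is unique (guaranteed here because the example mechanisms are contractive); the function name and mechanisms are hypothetical.

```python
import numpy as np

def sample_cyclic_anm(f_X, f_Y, sigma_X, sigma_Y, n, n_iter=200, seed=0):
    """Draw n samples from a bivariate cyclic ANM by equilibration.

    Each data point gets its own noise draw (e_X, e_Y), which is held
    constant while x = f_X(y) + e_X, y = f_Y(x) + e_Y is iterated to
    its fixed point.
    """
    rng = np.random.default_rng(seed)
    e_X = rng.normal(0.0, sigma_X, n)
    e_Y = rng.normal(0.0, sigma_Y, n)
    x = np.zeros(n)
    y = np.zeros(n)
    for _ in range(n_iter):          # equilibrate with noise held fixed
        x, y = f_X(y) + e_X, f_Y(x) + e_Y
    return x, y

# Example with contractive mechanisms (sup |f_X' f_Y'| <= 0.72 < 1):
x, y = sample_cyclic_anm(lambda y: 0.8 * np.tanh(y),
                         lambda x: 0.9 * np.tanh(x),
                         sigma_X=1.0, sigma_Y=1.0, n=500)
```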

41 If in reality the noise does change over time, but on a slow time scale relative to the time scale of the equilibration, then this model can be considered as the first-order approximation. [sent-79, score-0.131]

42 The induced density Although the mapping F : e → x that maps noise values to their corresponding fixed points under (3) is nontrivial in most cases, a crucial observation is that its inverse $G = F^{-1} = I - f$ has a very simple form (here, I is the identity mapping). [sent-80, score-0.146]

43 Under the change of variables e → x, the transformation rule of the densities reads: $p_X(x) = p_E\bigl(x - f(x)\bigr)\,\bigl|I - f'(x)\bigr| = \bigl|I - f'(x)\bigr| \prod_{i \in V} p_{E_i}\bigl(x_i - f_i(x_{\mathrm{pa}(i)})\bigr)$ (4) where $f'(x)$ is the Jacobian of f evaluated at x and |·| denotes the absolute value of the determinant of a matrix. [sent-81, score-0.133]

44 Note that although sampling from the distribution pX is elaborate (as it typically involves many iterations of the fixed point equations), the corresponding density can be easily expressed analytically in terms of the noise distributions and partial derivatives of the causal mechanisms. [sent-82, score-0.554]
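
A sketch of the resulting log-density computation in the bivariate Gaussian-noise case, where the determinant of I − f′(x) reduces to 1 − f_X′(y) f_Y′(x); the derivatives df_X, df_Y must be supplied by the caller.

```python
import numpy as np
from scipy.stats import norm

def log_density(x, y, f_X, f_Y, df_X, df_Y, sigma_X, sigma_Y):
    """Log of the induced density (4), bivariate Gaussian-noise case.

    G = I - f maps observations to noise values, so the density is the
    product of the noise densities times |det(I - f'(x))|; in 2-D that
    determinant is 1 - f_X'(y) * f_Y'(x).
    """
    log_noise = (norm.logpdf(x - f_X(y), scale=sigma_X)
                 + norm.logpdf(y - f_Y(x), scale=sigma_Y))
    log_det = np.log(np.abs(1.0 - df_X(y) * df_Y(x)))
    return log_noise + log_det
```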

45 Causal interpretation An additive noise model can be used for ordinary prediction tasks (i. [sent-84, score-0.246]

46 Such an intervention can be modeled by replacing the equations for the intervened variables by simple equations Xi = Ci , with Ci the value set by the intervention. [sent-87, score-0.26]

47 If the altered fixed point equations induce a unique probability distribution on X, then this is the predicted distribution on X under the intervention. [sent-89, score-0.124]

48 In this sense, additive noise models are given a causal interpretation. [sent-90, score-0.664]

49 Hereafter, we will therefore refer to the graph associated with the additive noise model as the causal graph. [sent-91, score-0.669]

50 3 Identifiability An interesting and important question for causal discovery is under which conditions the causal graph is identifiable given only the joint distribution p(X). [sent-92, score-0.911]

51 [8] have shown that under ... ¹If some causal mechanism $f_j$ does not depend on one of its parents $i \in \mathrm{pa}(j)$, i.e. $\partial f_j / \partial x_i = 0$, [sent-94, score-0.474]

52 Cyclic additive noise models are also known as “non-recursive” (nonlinear) structural equation models, whereas the acyclic versions are known as “recursive” (nonlinear) SEMs. [sent-97, score-0.363]

53 This terminology is common usage but confusing, as it is precisely in the cyclic case that one needs a recursive procedure to calculate the solutions of equations (2), and not the other way around. [sent-98, score-0.338]

54 , all functions fi are linear), the causal graph is completely identifiable if at most one of the noise sources has a Gaussian distribution. [sent-101, score-0.626]

55 In this work, we focus our attention on the bivariate case. [sent-104, score-0.108]

56 Our main result, Theorem 1, can be seen as an extension of the identifiability result for acyclic nonlinear additive noise models derived in [11], although we make the additional simplifying assumption that the noise variables are Gaussian. [sent-105, score-0.572]

57 We believe that similar identifiability results can be derived in the multivariate case (|V | > 2) and for non-Gaussian noise distributions. [sent-106, score-0.131]

58 3.1 The bivariate case Before we state our identifiability result, we first give a sufficient condition for existence of a unique equilibrium distribution for the bivariate case. [sent-109, score-0.303]

59 If $\sup_{x,y} |f_X'(y)\, f_Y'(x)| = r < 1$, then for any $(c_X, c_Y)$, the fixed point equations converge to a unique fixed point that does not depend on the initial conditions. [sent-111, score-0.124]

60 Consider the mapping defined by applying the fixed point equations twice. [sent-113, score-0.114]

61 Lemma 1 in the supplement then shows that the same conclusion must hold for the mapping that applies the fixed point equations only once. [sent-119, score-0.131]

62 This lemma provides a sufficient condition for an additive noise model to be well-defined in the bivariate case. [sent-120, score-0.329]
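
A numerical illustration of the lemma, with hypothetical contractive mechanisms: two very different initializations of the fixed point iteration end up at the same point.

```python
import numpy as np

# With f_X(y) = 0.8*tanh(y) and f_Y(x) = 0.9*tanh(x) we have
# sup_{x,y} |f_X'(y) f_Y'(x)| <= 0.8 * 0.9 = 0.72 < 1, so by the
# lemma the iteration converges to a unique fixed point.
f_X = lambda y: 0.8 * np.tanh(y)
f_Y = lambda x: 0.9 * np.tanh(x)
c_X, c_Y = 0.3, -1.2                        # an arbitrary fixed noise value

for x, y in [(-5.0, 5.0), (10.0, -10.0)]:   # two distant starting points
    for _ in range(100):
        x, y = f_X(y) + c_X, f_Y(x) + c_Y
    print(x, y)                             # both runs print the same point
```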

63 Now suppose we are given the joint distribution pX,Y of two real-valued random variables X, Y which is induced by an additive noise model. [sent-122, score-0.259]

64 The question is whether we can identify the causal graph corresponding to the true model out of the four possibilities (no edge, X → Y, Y → X, X ⇄ Y). [sent-123, score-0.448]

65 [11] have shown that if one excludes the cyclic case X ⇄ Y, then in the generic case, the causal structure is identifiable. [sent-125, score-0.664]

66 Our aim is to prove a stronger identifiability result where the cyclic case is not excluded a priori. [sent-126, score-0.24]

67 What the theorem shows is that, apart from a small class of exceptions, bivariate additive Gaussian-noise models induce densities that allow a perfect reconstruction of the causal graph. [sent-133, score-0.666]

68 In a certain sense, the situation can be seen as similar to the well-known “faithfulness assumption” [3]: the latter assumption is often made in order to exclude the highly special cases of causal models which would spoil identifiability of the Markov equivalence class. [sent-134, score-0.475]

69 We assume $E_X \sim \mathcal{N}(0, \alpha_X^{-1})$ and $E_Y \sim \mathcal{N}(0, \alpha_Y^{-1})$, where $\alpha_X = \sigma_X^{-2}$, $\alpha_Y = \sigma_Y^{-2}$ are the precisions (inverse variances) of the Gaussian noise variables. [sent-140, score-0.131]

70 the causal graphs of M and M̃. For example, in the first case, because $\tilde f_X = \tilde f_Y = 0$, we obtain the following equation: $0 = \bigl(\alpha_X f_X'(y) + \alpha_Y f_Y'(x)\bigr)\bigl(1 - f_X'(y) f_Y'(x)\bigr)^{2} - f_X''(y)\, f_Y''(x)$ (11) This is a nonlinear partial differential equation in $\varphi(x) := f_Y'(x)$ and $\psi(y) := f_X'(y)$. [sent-152, score-0.551]
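
A derivation sketch of where (11) comes from, assuming the Gaussian form of the induced density (4); this step is reconstructed from the surrounding definitions rather than quoted from the paper.

```latex
% Under model M, (4) with Gaussian noise gives
% \log p(x,y) = -\tfrac{\alpha_X}{2}(x - f_X(y))^2
%               -\tfrac{\alpha_Y}{2}(y - f_Y(x))^2
%               +\log|1 - f_X'(y) f_Y'(x)| + \mathrm{const}.
% If the same density is also induced by a model \tilde{M} with no arrows,
% it must factorize, i.e. \partial_x \partial_y \log p(x,y) = 0:
\[
\frac{\partial^2 \log p}{\partial x\,\partial y}
  = \alpha_X f_X'(y) + \alpha_Y f_Y'(x)
    - \frac{f_X''(y)\,f_Y''(x)}{\bigl(1 - f_X'(y)\,f_Y'(x)\bigr)^{2}} = 0.
\]
% Multiplying through by (1 - f_X'(y) f_Y'(x))^2 gives equation (11).
```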

71 3] that gives a general method for solving functional-differential equations of the form $\Phi_1(x)\Psi_1(y) + \Phi_2(x)\Psi_2(y) + \cdots + \Phi_k(x)\Psi_k(y) = 0$ (12) where the functionals $\Phi_i(x)$ and $\Psi_i(y)$ depend only on x and y, respectively: $\Phi_i(x) = \Phi_i(x, \varphi, \varphi')$, $\Psi_i(y) = \Psi_i(y, \psi, \psi')$. [sent-155, score-0.116]

72 In the second case (where M̃ has one arrow) the equations show that either M̃ = M, or the model parameters should satisfy equations (5) and (6). [sent-164, score-0.198]

73 4 Learning additive noise models from observational data In this section, we propose a method to learn an additive noise model from a finite data set $D := \{x^{(n)}\}_{n=1}^{N}$. [sent-165, score-0.511]

74 We will only describe the bivariate case in detail, although the method can be extended to more than two variables in a straightforward way. [sent-166, score-0.146]

75 We first consider how we can learn the causal mechanisms {fi }i∈V for a fixed causal structure. [sent-167, score-0.871]

76 This can be done efficiently by a MAP estimate with respect to (the parameters of) the causal mechanisms. [sent-168, score-0.423]

77 Using (4), the MAP problem can be written as: $\operatorname{argmax}_{\hat f}\; p(\hat f)\, \prod_{n=1}^{N} \bigl|I - \hat f'(x^{(n)})\bigr| \prod_{i \in V} p_{E_i}\bigl(x_i^{(n)} - \hat f_i(x_{\mathrm{pa}(i)}^{(n)})\bigr)$ (13) where $p(\hat f)$ specifies the prior distribution of the causal mechanisms. [sent-169, score-0.47]
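
A sketch of the corresponding loss (the negative log of (13)) for the bivariate cyclic case, with a flat prior on f for brevity; in the acyclic case the log-determinant term vanishes and the loss reduces to ordinary regression.

```python
import numpy as np

def neg_log_posterior(x, y, f_X, f_Y, df_X, df_Y, sigma_X, sigma_Y):
    """Negative log of the MAP objective (13), bivariate cyclic case,
    with a flat prior p(f) for brevity (the paper uses GP priors)."""
    r_X = x - f_X(y)                 # estimated noise values for X
    r_Y = y - f_Y(x)                 # estimated noise values for Y
    gauss = (0.5 * np.sum(r_X**2) / sigma_X**2 + x.size * np.log(sigma_X)
             + 0.5 * np.sum(r_Y**2) / sigma_Y**2 + y.size * np.log(sigma_Y))
    # The term that is absent in ordinary regression: it penalizes
    # dependence between the estimated noise variables.
    log_det = np.sum(np.log(np.abs(1.0 - df_X(y) * df_Y(x))))
    return gauss - log_det
```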

78 In the cyclic case, however, the determinant is necessary in order to penalize dependencies between the estimated noise variables. [sent-171, score-0.408]

79 One can consider this as a special case of nonlinear independent component analysis, as the MAP estimate (13) can also be interpreted as the minimizer of the mutual information between the estimated noise variables. [sent-172, score-0.236]

80 If the estimated functions lead to noise estimates $\hat E_i = X_i - \hat f_i(X_{\mathrm{pa}(i)})$ which are mutually independent according to some independence test, then we accept the model. [sent-173, score-0.269]

81 One can try all possible causal graph structures and test which ones fit the data. [sent-174, score-0.448]

82 The models that lead to independent estimated noise values are possible causal explanations of the data. [sent-175, score-0.62]

83 If multiple models with different causal graphs lead to independent estimated noise values, we prefer models with fewer arrows in the graph. [sent-176, score-0.677]

84 If the number of data points is large enough, Theorem 1 suggests that for two variables with Gaussian noise, in the generic case, a unique causal structure will be identified in this way. [sent-177, score-0.504]
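
A sketch of this selection logic: fit each candidate graph, test the estimated noise values for independence, and among the accepted models prefer the one with fewest arrows. A proper independence test (e.g. a kernel-based one) is what the procedure calls for; the correlation-based proxy below is only illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def dependence_proxy(r_X, r_Y):
    """Crude stand-in for an independence test on estimated noise values:
    correlation of the residuals and of their squares."""
    c1, _ = pearsonr(r_X, r_Y)
    c2, _ = pearsonr(r_X**2, r_Y**2)
    return max(abs(c1), abs(c2))

def select_graph(fits, threshold=0.05):
    """fits maps a graph label to (n_arrows, r_X, r_Y) for the fitted
    model; return the accepted graph with the fewest arrows, if any."""
    accepted = [(n_arrows, label)
                for label, (n_arrows, r_X, r_Y) in fits.items()
                if dependence_proxy(r_X, r_Y) < threshold]
    return min(accepted)[1] if accepted else None
```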

85 For more than two variables, and for other noise distributions, the method can still be applied, but we do not know whether (in general and asymptotically) there will be a unique causal structure that explains the data. [sent-178, score-0.579]

86 Assuming Gaussian noise $E_X \sim \mathcal{N}(0, \sigma_X^2)$, $E_Y \sim \mathcal{N}(0, \sigma_Y^2)$ and using Gaussian Process priors for the causal mechanisms $f_X$ and $f_Y$, i. [sent-183, score-0.579]

87 , taking $\hat x := \hat f_X(y) \sim \mathcal{N}(0, K_X(y))$ and $\hat y := \hat f_Y(x) \sim \mathcal{N}(0, K_Y(x))$. ⁴Note that if a certain model leads to independent noise terms, then adding more arrows will still allow independent noise terms, by setting some functions to 0 (see also Figure 1 below). [sent-185, score-0.314]

88 We optimize simultaneously with respect to the noise values x, y and the hyperparameters log σX , log κX , log λX , log σY , log κY , log λY . [sent-192, score-0.245]
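
A sketch of one possible GP prior term, assuming a squared-exponential kernel with output scale λ and length scale κ; the paper's exact kernel choice is not specified in these excerpts, so the form below is an assumption.

```python
import numpy as np

def gram_matrix(y, kappa, lam):
    """Gram matrix K_X(y) of a squared-exponential kernel (assumed form)
    with length scale kappa and output scale lam, at inputs y."""
    d = y[:, None] - y[None, :]
    return lam**2 * np.exp(-0.5 * (d / kappa) ** 2)

def gp_log_prior(fx_hat, K, jitter=1e-6):
    """Log-density of the function values fx_hat under the GP prior
    N(0, K), stabilized with a small diagonal jitter."""
    K = K + jitter * np.eye(len(K))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, fx_hat)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(K) * np.log(2.0 * np.pi))
```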

89 Because of space constraints, we only show the learned cyclic additive noise models, omitting the acyclic ones. [sent-195, score-0.534]

90 Rows 1a and 1b concern the same data generated from a nonlinear and acyclic model. [sent-198, score-0.145]

91 Even though we learned a causal model with cyclic structure, in the accepted solution, one of the learned causal mechanisms becomes (almost) constant. [sent-200, score-1.094]

92 Rows 3a and 3b show again two different solutions for the same data, now generated from a nonlinear cyclic model. [sent-201, score-0.294]

93 Row 4 shows data from a linear, cyclic model, where the ratio of the noise sources equals the ratio of the slopes of the causal mechanisms. [sent-203, score-0.777]

94 This makes this linear model part of the special class of unidentifiable additive noise models. [sent-204, score-0.236]

95 In this case, the MAP estimates for the causal mechanisms are quite different from the true ones. [sent-205, score-0.448]

96 6 Discussion and Conclusion We have studied a particular class of cyclic causal models given by nonlinear SEMs with additive noise. [sent-206, score-0.811]

97 We have discussed how these models can be interpreted to describe the equilibrium distribution of a dynamic system with noise that is constant in time. [sent-207, score-0.269]

98 We have looked in detail at the bivariate Gaussian-noise case and shown generic identifiability of the causal graph. [sent-208, score-0.549]

99 Investigating causal relations by econometric models and cross-spectral methods. [sent-269, score-0.443]

100 Modeling discrete interventional data using directed cyclic graphical models. [sent-296, score-0.26]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fy', 0.543), ('fx', 0.523), ('causal', 0.423), ('cyclic', 0.223), ('ey', 0.156), ('noise', 0.131), ('ex', 0.116), ('bivariate', 0.108), ('equations', 0.099), ('acyclic', 0.09), ('additive', 0.09), ('xpa', 0.082), ('kx', 0.082), ('identi', 0.07), ('ky', 0.063), ('equilibrium', 0.062), ('nonlinear', 0.055), ('cy', 0.052), ('ability', 0.049), ('observational', 0.049), ('fi', 0.047), ('reconstructed', 0.046), ('tanh', 0.044), ('causes', 0.042), ('independences', 0.041), ('discovery', 0.04), ('variables', 0.038), ('directed', 0.037), ('cycles', 0.036), ('ei', 0.034), ('cx', 0.034), ('equation', 0.032), ('estimated', 0.031), ('parents', 0.031), ('independence', 0.028), ('equilibration', 0.027), ('pei', 0.027), ('netherlands', 0.026), ('loops', 0.026), ('differential', 0.026), ('densities', 0.025), ('ordinary', 0.025), ('mechanisms', 0.025), ('unique', 0.025), ('graph', 0.025), ('dominik', 0.024), ('interventions', 0.024), ('joris', 0.024), ('generically', 0.024), ('intervention', 0.024), ('lacerda', 0.024), ('sems', 0.024), ('pa', 0.024), ('determinant', 0.023), ('feedback', 0.023), ('arrows', 0.022), ('causality', 0.022), ('janzing', 0.022), ('nijmegen', 0.022), ('radboud', 0.022), ('spirtes', 0.022), ('uncertainty', 0.021), ('mechanism', 0.02), ('xed', 0.02), ('jacobian', 0.02), ('mooij', 0.02), ('hoyer', 0.02), ('dynamic', 0.02), ('models', 0.02), ('minima', 0.02), ('interpreted', 0.02), ('log', 0.019), ('generic', 0.018), ('bingen', 0.018), ('bernhard', 0.018), ('accept', 0.017), ('arrow', 0.017), ('functionals', 0.017), ('supplement', 0.017), ('gram', 0.017), ('rows', 0.017), ('assumption', 0.017), ('intelligence', 0.017), ('sketch', 0.017), ('excluded', 0.017), ('gm', 0.017), ('gaussian', 0.017), ('concentrates', 0.016), ('arti', 0.016), ('system', 0.016), ('solutions', 0.016), ('sch', 0.016), ('pe', 0.016), ('px', 0.016), ('cd', 0.016), ('special', 0.015), ('cos', 0.015), ('graphs', 0.015), ('independent', 0.015), ('mapping', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 194 nips-2011-On Causal Discovery with Cyclic Additive Noise Models

Author: Joris M. Mooij, Dominik Janzing, Tom Heskes, Bernhard Schölkopf

Abstract: We study a particular class of cyclic causal models, where each variable is a (possibly nonlinear) function of its parents and additive noise. We prove that the causal graph of such models is generically identifiable in the bivariate, Gaussian-noise case. We also propose a method to learn such models from observational data. In the acyclic case, the method reduces to ordinary regression, but in the more challenging cyclic case, an additional term arises in the loss function, which makes it a special case of nonlinear independent component analysis. We illustrate the proposed method on synthetic data. 1

2 0.33661175 15 nips-2011-A rational model of causal inference with continuous causes

Author: Thomas L. Griffiths, Michael James

Abstract: Rational models of causal induction have been successful in accounting for people’s judgments about causal relationships. However, these models have focused on explaining inferences from discrete data of the kind that can be summarized in a 2× 2 contingency table. This severely limits the scope of these models, since the world often provides non-binary data. We develop a new rational model of causal induction using continuous dimensions, which aims to diminish the gap between empirical and theoretical approaches and real-world causal induction. This model successfully predicts human judgments from previous studies better than models of discrete causal inference, and outperforms several other plausible models of causal induction with continuous causes in accounting for people’s inferences in a new experiment. 1

3 0.11601163 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

Author: Jun-ichiro Hirayama, Aapo Hyvärinen

Abstract: Components estimated by independent component analysis and related methods are typically not independent in real data. A very common form of nonlinear dependency between the components is correlations in their variances or energies. Here, we propose a principled probabilistic model to model the energycorrelations between the latent variables. Our two-stage model includes a linear mixing of latent signals into the observed ones like in ICA. The main new feature is a model of the energy-correlations based on the structural equation model (SEM), in particular, a Linear Non-Gaussian SEM. The SEM is closely related to divisive normalization which effectively reduces energy correlation. Our new twostage model enables estimation of both the linear mixing and the interactions related to energy-correlations, without resorting to approximations of the likelihood function or other non-principled approaches. We demonstrate the applicability of our method with synthetic dataset, natural images and brain signals. 1

4 0.074433595 139 nips-2011-Kernel Bayes' Rule

Author: Kenji Fukumizu, Le Song, Arthur Gretton

Abstract: A nonparametric kernel-based method for realizing Bayes’ rule is proposed, based on kernel representations of probabilities in reproducing kernel Hilbert spaces. The prior and conditional probabilities are expressed as empirical kernel mean and covariance operators, respectively, and the kernel mean of the posterior distribution is computed in the form of a weighted sample. The kernel Bayes’ rule can be applied to a wide variety of Bayesian inference problems: we demonstrate Bayesian computation without likelihood, and filtering with a nonparametric statespace model. A consistency rate for the posterior estimate is established. 1

5 0.055828888 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

Author: Alex K. Susemihl, Ron Meir, Manfred Opper

Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1

6 0.053827122 100 nips-2011-Gaussian Process Training with Input Noise

7 0.053644866 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

8 0.052435566 72 nips-2011-Distributed Delayed Stochastic Optimization

9 0.04370987 26 nips-2011-Additive Gaussian Processes

10 0.042458389 130 nips-2011-Inductive reasoning about chimeric creatures

11 0.038987789 301 nips-2011-Variational Gaussian Process Dynamical Systems

12 0.037170727 153 nips-2011-Learning large-margin halfspaces with more malicious noise

13 0.036527675 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

14 0.036167663 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

15 0.035591301 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample

16 0.035565607 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise

17 0.034898587 33 nips-2011-An Exact Algorithm for F-Measure Maximization

18 0.034337986 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

19 0.032852914 285 nips-2011-The Kernel Beta Process

20 0.032333735 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.105), (1, 0.006), (2, 0.03), (3, -0.04), (4, -0.035), (5, -0.026), (6, -0.017), (7, -0.04), (8, 0.041), (9, 0.031), (10, -0.065), (11, -0.058), (12, 0.043), (13, -0.032), (14, 0.093), (15, 0.039), (16, 0.145), (17, 0.041), (18, 0.05), (19, -0.062), (20, 0.019), (21, -0.009), (22, -0.053), (23, 0.04), (24, -0.016), (25, 0.173), (26, 0.041), (27, -0.156), (28, -0.225), (29, 0.205), (30, -0.111), (31, 0.041), (32, 0.256), (33, 0.047), (34, -0.136), (35, 0.018), (36, 0.117), (37, 0.056), (38, 0.153), (39, 0.021), (40, 0.042), (41, -0.038), (42, -0.076), (43, -0.107), (44, -0.026), (45, 0.083), (46, 0.013), (47, 0.046), (48, 0.161), (49, -0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94790077 194 nips-2011-On Causal Discovery with Cyclic Additive Noise Models

Author: Joris M. Mooij, Dominik Janzing, Tom Heskes, Bernhard Schölkopf

Abstract: We study a particular class of cyclic causal models, where each variable is a (possibly nonlinear) function of its parents and additive noise. We prove that the causal graph of such models is generically identifiable in the bivariate, Gaussian-noise case. We also propose a method to learn such models from observational data. In the acyclic case, the method reduces to ordinary regression, but in the more challenging cyclic case, an additional term arises in the loss function, which makes it a special case of nonlinear independent component analysis. We illustrate the proposed method on synthetic data. 1

2 0.89952683 15 nips-2011-A rational model of causal inference with continuous causes

Author: Thomas L. Griffiths, Michael James

Abstract: Rational models of causal induction have been successful in accounting for people’s judgments about causal relationships. However, these models have focused on explaining inferences from discrete data of the kind that can be summarized in a 2× 2 contingency table. This severely limits the scope of these models, since the world often provides non-binary data. We develop a new rational model of causal induction using continuous dimensions, which aims to diminish the gap between empirical and theoretical approaches and real-world causal induction. This model successfully predicts human judgments from previous studies better than models of discrete causal inference, and outperforms several other plausible models of causal induction with continuous causes in accounting for people’s inferences in a new experiment. 1

3 0.48901704 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

Author: Jun-ichiro Hirayama, Aapo Hyvärinen

Abstract: Components estimated by independent component analysis and related methods are typically not independent in real data. A very common form of nonlinear dependency between the components is correlations in their variances or energies. Here, we propose a principled probabilistic model to model the energycorrelations between the latent variables. Our two-stage model includes a linear mixing of latent signals into the observed ones like in ICA. The main new feature is a model of the energy-correlations based on the structural equation model (SEM), in particular, a Linear Non-Gaussian SEM. The SEM is closely related to divisive normalization which effectively reduces energy correlation. Our new twostage model enables estimation of both the linear mixing and the interactions related to energy-correlations, without resorting to approximations of the likelihood function or other non-principled approaches. We demonstrate the applicability of our method with synthetic dataset, natural images and brain signals. 1

4 0.41393188 130 nips-2011-Inductive reasoning about chimeric creatures

Author: Charles Kemp

Abstract: Given one feature of a novel animal, humans readily make inferences about other features of the animal. For example, winged creatures often fly, and creatures that eat fish often live in the water. We explore the knowledge that supports these inferences and compare two approaches. The first approach proposes that humans rely on abstract representations of dependency relationships between features, and is formalized here as a graphical model. The second approach proposes that humans rely on specific knowledge of previously encountered animals, and is formalized here as a family of exemplar models. We evaluate these models using a task where participants reason about chimeras, or animals with pairs of features that have not previously been observed to co-occur. The results support the hypothesis that humans rely on explicit representations of relationships between features. Suppose that an eighteenth-century naturalist learns about a new kind of animal that has fur and a duck’s bill. Even though the naturalist has never encountered an animal with this pair of features, he should be able to make predictions about other features of the animal—for example, the animal could well live in water but probably does not have feathers. Although the platypus exists in reality, from a eighteenth-century perspective it qualifies as a chimera, or an animal that combines two or more features that have not previously been observed to co-occur. Here we describe a probabilistic account of inductive reasoning and use it to account for human inferences about chimeras. The inductive problems we consider are special cases of the more general problem in Figure 1a where a reasoner is given a partially observed matrix of animals by features then asked to infer the values of the missing entries. This general problem has been previously studied and is addressed by computational models of property induction, categorization, and generalization [1–7]. A challenge faced by all of these models is to capture the background knowledge that guides inductive inferences. Some accounts rely on similarity relationships between animals [6, 8], others rely on causal relationships between features [9, 10], and others incorporate relationships between animals and relationships between features [11]. We will evaluate graphical models that capture both kinds of relationships (Figure 1a), but will focus in particular on relationships between features. Psychologists have previously suggested that humans rely on explicit mental representations of relationships between features [12–16]. Often these representations are described as theories—for example, theories that specify a causal relationship between having wings and flying, or living in the sea and eating fish. Relationships between features may take several forms: for example, one feature may cause, enable, prevent, be inconsistent with, or be a special case of another feature. For simplicity, we will treat all of these relationships as instances of dependency relationships between features, and will capture them using an undirected graphical model. Previous studies have used graphical models to account for human inferences about features but typically these studies consider toy problems involving a handful of novel features such as “has gene X14” or “has enzyme Y132” [9, 11]. Participants might be told, for example, that gene X14 leads to the production of enzyme Y132, then asked to use this information when reasoning about novel animals. 
Here we explore whether a graphical model approach can account for inferences 1 (a) slow heavy flies (b) wings hippo 1 1 0 0 rhino 1 1 0 0 sparrow 0 0 1 1 robin 0 0 1 1 new ? ? 1 ? o Figure 1: Inductive reasoning about animals and features. (a) Inferences about the features of a new animal onew that flies may draw on similarity relationships between animals (the new animal is similar to sparrows and robins but not hippos and rhinos), and on dependency relationships between features (flying and having wings are linked). (b) A graph product produced by combining the two graph structures in (a). about familiar features. Working with familiar features raises a methodological challenge since participants have a substantial amount of knowledge about these features and can reason about them in multiple ways. Suppose, for example, that you learn that a novel animal can fly (Figure 1a). To conclude that the animal probably has wings, you might consult a mental representation similar to the graph at the top of Figure 1a that specifies a dependency relationship between flying and having wings. On the other hand, you might reach the same conclusion by thinking about flying creatures that you have previously encountered (e.g. sparrows and robins) and noticing that these creatures have wings. Since the same conclusion can be reached in two different ways, judgments about arguments of this kind provide little evidence about the mental representations involved. The challenge of working with familiar features directly motivates our focus on chimeras. Inferences about chimeras draw on rich background knowledge but require the reasoner to go beyond past experience in a fundamental way. For example, if you learn that an animal flies and has no legs, you cannot make predictions about the animal by thinking of flying, no-legged creatures that you have previously encountered. You may, however, still be able to infer that the novel animal has wings if you understand the relationship between flying and having wings. We propose that graphical models over features can help to explain how humans make inferences of this kind, and evaluate our approach by comparing it to a family of exemplar models. The next section introduces these models, and we then describe two experiments designed to distinguish between the models. 1 Reasoning about objects and features Our models make use of a binary matrix D where the rows {o1 , . . . , o129 } correspond to objects, and the columns {f 1 , . . . , f 56 } correspond to features. A subset of the objects is shown in Figure 2a, and the full set of features is shown in Figure 2b and its caption. Matrix D was extracted from the Leuven natural concept database [17], which includes 129 animals and 757 features in total. We chose a subset of these features that includes a mix of perceptual and behavioral features, and that includes many pairs of features that depend on each other. For example, animals that “live in water” typically “can swim,” and animals that have “no legs” cannot “jump far.” Matrix D can be used to formulate problems where a reasoner observes one or two features of a new object (i.e. animal o130 ) and must make inferences about the remaining features of the animal. The next two sections describe graphical models that can be used to address this problem. The first graphical model O captures relationships between objects, and the second model F captures relationships between features. 
We then discuss how these models can be combined, and introduce a family of exemplar-style models that will be compared with our graphical models. A graphical model over objects Many accounts of inductive reasoning focus on similarity relationships between objects [6, 8]. Here we describe a tree-structured graphical model O that captures these relationships. The tree was constructed from matrix D using average linkage clustering and the Jaccard similarity measure, and part of the resulting structure is shown in Figure 2a. The subtree in Figure 2a includes clusters 2 alligator caiman crocodile monitor lizard dinosaur blindworm boa cobra python snake viper chameleon iguana gecko lizard salamander frog toad tortoise turtle anchovy herring sardine cod sole salmon trout carp pike stickleback eel flatfish ray plaice piranha sperm whale squid swordfish goldfish dolphin orca whale shark bat fox wolf beaver hedgehog hamster squirrel mouse rabbit bison elephant hippopotamus rhinoceros lion tiger polar bear deer dromedary llama giraffe zebra kangaroo monkey cat dog cow horse donkey pig sheep (a) (b) can swim lives in water eats fish eats nuts eats grain eats grass has gills can jump far has two legs has no legs has six legs has four legs can fly can be ridden has sharp teeth nocturnal has wings strong predator can see in dark eats berries lives in the sea lives in the desert crawls lives in the woods has mane lives in trees can climb well lives underground has feathers has scales slow has fur heavy Figure 2: Graph structures used to define graphical models O and F. (a) A tree that captures similarity relationships between animals. The full tree includes 129 animals, and only part of the tree is shown here. The grey points along the branches indicate locations where a novel animal o130 could be attached to the tree. (b) A network capturing pairwise dependency relationships between features. The edges capture both positive and negative dependencies. All edges in the network are shown, and the network also includes 20 isolated nodes for the following features: is black, is blue, is green, is grey, is pink, is red, is white, is yellow, is a pet, has a beak, stings, stinks, has a long neck, has feelers, sucks blood, lays eggs, makes a web, has a hump, has a trunk, and is cold-blooded. corresponding to amphibians and reptiles, aquatic creatures, and land mammals, and the subtree omitted for space includes clusters for insects and birds. We assume that the features in matrix D (i.e. the columns) are generated independently over O: P (f i |O, π i , λi ). P (D|O, π, λ) = i i i i The distribution P (f |O, π , λ ) is based on the intuition that nearby nodes in O tend to have the same value of f i . Previous researchers [8, 18] have used a directed graphical model where the distribution at the root node is based on the baserate π i , and any other node v with parent u has the following conditional probability distribution: i P (v = 1|u) = π i + (1 − π i )e−λ l , if u = 1 i π i − π i e−λ l , if u = 0 (1) where l is the length of the branch joining node u to node v. The variability parameter λi captures the extent to which feature f i is expected to vary over the tree. Note, for example, that any node v must take the same value as its parent u when λ = 0. To avoid free parameters, the feature baserates π i and variability parameters λi are set to their maximum likelihood values given the observed values of the features {f i } in the data matrix D. 
The conditional distributions in Equation 1 induce a joint distribution over all of the nodes in graph O, and the distribution P (f i |O, π i , λi ) is computed by marginalizing out the values of the internal nodes. Although we described O as a directed graphical model, the model can be converted into an equivalent undirected model with a potential for each edge in the tree and a potential for the root node. Here we use the undirected version of the model, which is a natural counterpart to the undirected model F described in the next section. The full version of structure O in Figure 2a includes 129 familiar animals, and our task requires inferences about a novel animal o130 that must be slotted into the structure. Let D′ be an expanded version of D that includes a row for o130 , and let O′ be an expanded version of O that includes a node for o130 . The edges in Figure 2a are marked with evenly spaced gray points, and we use a 3 uniform prior P (O′ ) over all trees that can be created by attaching o130 to one of these points. Some of these trees have identical topologies, since some edges in Figure 2a have multiple gray points. Predictions about o130 can be computed using: P (D′ |D) = P (D′ |O′ , D)P (O′ |D) ∝ O′ P (D′ |O′ , D)P (D|O′ )P (O′ ). (2) O′ Equation 2 captures the basic intuition that the distribution of features for o130 is expected to be consistent with the distribution observed for previous animals. For example, if o130 is known to fly then the trees with high posterior probability P (O′ |D) will be those where o130 is near other flying creatures (Figure 1a), and since these creatures have wings Equation 2 predicts that o130 probably also has wings. As this example suggests, model O captures dependency relationships between features implicitly, and therefore stands in contrast to models like F that rely on explicit representations of relationships between features. A graphical model over features Model F is an undirected graphical model defined over features. The graph shown in Figure 2b was created by identifying pairs where one feature depends directly on another. The author and a research assistant both independently identified candidate sets of pairwise dependencies, and Figure 2b was created by merging these sets and reaching agreement about how to handle any discrepancies. As previous researchers have suggested [13, 15], feature dependencies can capture several kinds of relationships. For example, wings enable flying, living in the sea leads to eating fish, and having no legs rules out jumping far. We work with an undirected graph because some pairs of features depend on each other but there is no clear direction of causal influence. For example, there is clearly a dependency relationship between being nocturnal and seeing in the dark, but no obvious sense in which one of these features causes the other. We assume that the rows of the object-feature matrix D are generated independently from an undirected graphical model F defined over the feature structure in Figure 2b: P (oi |F). P (D|F) = i Model F includes potential functions for each node and for each edge in the graph. These potentials were learned from matrix D using the UGM toolbox for undirected graphical models [19]. The learned potentials capture both positive and negative relationships: for example, animals that live in the sea tend to eat fish, and tend not to eat berries. Some pairs of feature values never occur together in matrix D (there are no creatures that fly but do not have wings). 
We therefore chose to compute maximum a posteriori values of the potential functions rather than maximum likelihood values, and used a diffuse Gaussian prior with a variance of 100 on the entries in each potential. After learning the potentials for model F, we can make predictions about a new object o130 using the distribution P (o130 |F). For example, if o130 is known to fly (Figure 1a), model F predicts that o130 probably has wings because the learned potentials capture a positive dependency between flying and having wings. Combining object and feature relationships There are two simple ways to combine models O and F in order to develop an approach that incorporates both relationships between features and relationships between objects. The output combination model computes the predictions of both models in isolation, then combines these predictions using a weighted sum. The resulting model is similar to a mixture-of-experts model, and to avoid free parameters we use a mixing weight of 0.5. The structure combination model combines the graph structures used by the two models and relies on a set of potentials defined over the resulting graph product. An example of a graph product is shown in Figure 1b, and the potential functions for this graph are inherited from the component models in the natural way. Kemp et al. [11] use a similar approach to combine a functional causal model with an object model O, but note that our structure combination model uses an undirected model F rather than a functional causal model over features. Both combination models capture the intuition that inductive inferences rely on relationships between features and relationships between objects. The output combination model has the virtue of 4 simplicity, and the structure combination model is appealing because it relies on a single integrated representation that captures both relationships between features and relationships between objects. To preview our results, our data suggest that the combination models perform better overall than either O or F in isolation, and that both combination models perform about equally well. Exemplar models We will compare the family of graphical models already described with a family of exemplar models. The key difference between these model families is that the exemplar models do not rely on explicit representations of relationships between objects and relationships between features. Comparing the model families can therefore help to establish whether human inferences rely on representations of this sort. Consider first a problem where a reasoner must predict whether object o130 has feature k after observing that it has feature i. An exemplar model addresses the problem by retrieving all previouslyobserved objects with feature i and computing the proportion that have feature k: P (ok = 1|oi = 1) = |f k & f i | |f i | (3) where |f k | is the number of objects in matrix D that have feature k, and |f k & f i | is the number that have both feature k and feature i. Note that we have streamlined our notation by using ok instead of o130 to refer to the kth feature value for object o130 . k Suppose now that the reasoner observes that object o130 has features i and j. The natural generalization of Equation 3 is: P (ok = 1|oi = 1, oj = 1) = |f k & f i & f j | |f i & f j | (4) Because we focus on chimeras, |f i & f j | = 0 and Equation 4 is not well defined. 
We therefore evaluate an exemplar model that computes predictions for the two observed features separately then computes the weighted sum of these predictions: P (ok = 1|oi = 1, oj = 1) = wi |f k & f i | |f k & f j | + wj . i| |f |f j | (5) where the weights wi and wj must sum to one. We consider four ways in which the weights could be set. The first strategy sets wi = wj = 0.5. The second strategy sets wi ∝ |f i |, and is consistent with an approach where the reasoner retrieves all exemplars in D that are most similar to the novel animal and reports the proportion of these exemplars that have feature k. The third strategy sets wi ∝ |f1i | , and captures the idea that features should be weighted by their distinctiveness [20]. The final strategy sets weights according to the coherence of each feature [21]. A feature is coherent if objects with that feature tend to resemble each other overall, and we define the coherence of feature i as the expected Jaccard similarity between two randomly chosen objects from matrix D that both have feature i. Note that the final three strategies are all consistent with previous proposals from the psychological literature, and each one might be expected to perform well. Because exemplar models and prototype models are often compared, it is natural to consider a prototype model [22] as an additional baseline. A standard prototype model would partition the 129 animals into categories and would use summary statistics for these categories to make predictions about the novel animal o130 . We will not evaluate this model because it corresponds to a coarser version of model O, which organizes the animals into a hierarchy of categories. The key characteristic shared by both models is that they explicitly capture relationships between objects but not features. 2 Experiment 1: Chimeras Our first experiment explores how people make inferences about chimeras, or novel animals with features that have not previously been observed to co-occur. Inferences about chimeras raise challenges for exemplar models, and therefore help to establish whether humans rely on explicit representations of relationships between features. Each argument can be represented as f i , f j → f k 5 exemplar r = 0.42 7 feature F exemplar (wi = |f i |) (wi = 0.5) r = 0.44 7 object O r = 0.69 7 output combination r = 0.31 7 structure combination r = 0.59 7 r = 0.60 7 5 5 5 5 5 3 3 3 3 3 3 all 5 1 1 0 1 r = 0.06 7 conflict 0.5 1 1 0 0.5 1 r = 0.71 7 1 0 0.5 1 r = −0.02 7 1 0 0.5 1 r = 0.49 7 0 5 5 5 5 3 3 3 3 1 5 3 0.5 r = 0.57 7 5 3 1 0 0.5 1 r = 0.51 7 edge 0.5 r = 0.17 7 1 1 0 0.5 1 r = 0.64 7 1 0 0.5 1 r = 0.83 7 1 0 0.5 1 r = 0.45 7 1 0 0.5 1 r = 0.76 7 0 5 5 5 5 3 3 3 3 1 5 3 0.5 r = 0.79 7 5 3 1 1 0 0.5 1 r = 0.26 7 other 1 0 1 0 0.5 1 r = 0.25 7 1 0 0.5 1 r = 0.19 7 1 0 0.5 1 r = 0.25 7 1 0 0.5 1 r = 0.24 7 0 7 5 5 5 5 5 3 3 3 3 1 5 3 0.5 r = 0.33 3 1 1 0 0.5 1 1 0 0.5 1 1 0 0.5 1 1 0 0.5 1 1 0 0.5 1 0 0.5 1 Figure 3: Argument ratings for Experiment 1 plotted against the predictions of six models. The y-axis in each panel shows human ratings on a seven point scale, and the x-axis shows probabilities according to one of the models. Correlation coefficients are shown for each plot. where f i and f k are the premises (e.g. “has no legs” and “can fly”) and f k is the conclusion (e.g. “has wings”). 
We are especially interested in conflict cases where the premises f i and f j lead to opposite conclusions when taken individually: for example, most animals with no legs do not have wings, but most animals that fly do have wings. Our models that incorporate feature structure F can resolve this conflict since F includes a dependency between “wings” and “can fly” but not between “wings” and “has no legs.” Our models that do not include F cannot resolve the conflict and predict that humans will be uncertain about whether the novel animal has wings. Materials. The object-feature matrix D includes 447 feature pairs {f i , f j } such that none of the 129 animals has both f i and f j . We selected 40 pairs (see the supporting material) and created 400 arguments in total by choosing 10 conclusion features for each pair. The arguments can be assigned to three categories. Conflict cases are arguments f i , f j → f k such that the single-premise arguments f i → f k and f j → f k lead to incompatible predictions. For our purposes, two singlepremise arguments with the same conclusion are deemed incompatible if one leads to a probability greater than 0.9 according to Equation 3, and the other leads to a probability less than 0.1. Edge cases are arguments f i , f j → f k such that the feature network in Figure 2b includes an edge between f k and either f i or f j . Note that some arguments are both conflict cases and edge cases. All arguments that do not fall into either one of these categories will be referred to as other cases. The 400 arguments for the experiment include 154 conflict cases, 153 edge cases, and 120 other cases. 34 arguments are both conflict cases and edge cases. We chose these arguments based on three criteria. First, we avoided premise pairs that did not co-occur in matrix D but that co-occur in familiar animals that do not belong to D. For example, “is pink” and “has wings” do not co-occur in D but “flamingo” is a familiar animal that has both features. Second, we avoided premise pairs that specified two different numbers of legs—for example, {“has four legs,” “has six legs”}. Finally, we aimed to include roughly equal numbers of conflict cases, edge cases, and other cases. Method. 16 undergraduates participated for course credit. The experiment was carried out using a custom-built computer interface, and one argument was presented on screen at a time. Participants 6 rated the probability of the conclusion on seven point scale where the endpoints were labeled “very unlikely” and “very likely.” The ten arguments for each pair of premises were presented in a block, but the order of these blocks and the order of the arguments within these blocks were randomized across participants. Results. Figure 3 shows average human judgments plotted against the predictions of six models. The plots in the first row include all 400 arguments in the experiment, and the remaining rows show results for conflict cases, edge cases, and other cases. The previous section described four exemplar models, and the two shown in Figure 3 are the best performers overall. Even though the graphical models include more numerical parameters than the exemplar models, recall that these parameters are learned from matrix D rather than fit to the experimental data. Matrix D also serves as the basis for the exemplar models, which means that all of the models can be compared on equal terms. The first row of Figure 3 suggests that the three models which include feature structure F perform better than the alternatives. 
The output combination model is the worst of the three models that incorporate F, and the correlation achieved by this model is significantly greater than the correlation achieved by the best exemplar model (p < 0.001, using the Fisher transformation to convert correlation coefficients to z scores). Our data therefore suggest that explicit representations of relationships between features are needed to account for inductive inferences about chimeras. The model that includes the feature structure F alone performs better than the two models that combine F with the object structure O, which may not be surprising since Experiment 1 focuses specifically on novel animals that do not slot naturally into structure O. Rows two through four suggest that the conflict arguments in particular raise challenges for the models which do not include feature structure F. Since these conflict cases are arguments f i , f j → f k where f i → f k has strength greater than 0.9 and f j → f k has strength less than 0.1, the first exemplar model averages these strengths and assigns an overall strength of around 0.5 to each argument. The second exemplar model is better able to differentiate between the conflict arguments, but still performs substantially worse than the three models that include structure F. The exemplar models perform better on the edge arguments, but are outperformed by the models that include F. Finally, all models achieve roughly the same level of performance on the other arguments. Although the feature model F performs best overall, the predictions of this model still leave room for improvement. The two most obvious outliers in the third plot in the top row represent the arguments {is blue, lives in desert → lives in woods} and {is pink, lives in desert → lives in woods}. Our participants sensibly infer that any animal which lives in the desert cannot simultaneously live in the woods. In contrast, the Leuven database indicates that eight of the twelve animals that live in the desert also live in the woods, and the edge in Figure 2b between “lives in the desert” and “lives in the woods” therefore represents a positive dependency relationship according to model F. This discrepancy between model and participants reflects the fact that participants made inferences about individual animals but the Leuven database is based on features of animal categories. Note, for example, that any individual animal is unlikely to live in the desert and the woods, but that some animal categories (including snakes, salamanders, and lizards) are found in both environments. 3 Experiment 2: Single-premise arguments Our results so far suggest that inferences about chimeras rely on explicit representations of relationships between features but provide no evidence that relationships between objects are important. It would be a mistake, however, to conclude that relationships between objects play no role in inductive reasoning. Previous studies have used object structures like the example in Figure 2a to account for inferences about novel features [11]—for example, given that alligators have enzyme Y132 in their blood, it seems likely that crocodiles also have this enzyme. Inferences about novel objects can also draw on relationships between objects rather than relationships between features. For example, given that a novel animal has a beak you will probably predict that it has feathers, not because there is any direct dependency between these two features, but because the beaked animals that you know tend to have feathers. 
Our second experiment explores inferences of this kind, using single-premise arguments.

Materials and Method. 32 undergraduates participated for course credit. The task was identical to Experiment 1 with the following exceptions. Each two-premise argument f_i, f_j → f_k from Experiment 1 was converted into two one-premise arguments f_i → f_k and f_j → f_k, and these one-premise arguments were randomly assigned to two sets. 16 participants rated the 400 arguments in the first set, and the other 16 rated the 400 arguments in the second set.

[Figure 4: Argument ratings and model predictions for Experiment 2. Panels compare human ratings with the predictions of the feature F, exemplar, object O, output combination, and structure combination models, with separate rows for the all, edge, and other argument sets; the reported correlations range from r = 0.21 to r = 0.87.]

Results. Figure 4 shows average human ratings for the 800 arguments plotted against the predictions of five models. Unlike Figure 3, Figure 4 includes a single exemplar model since there is no need to consider different feature weightings in this case. Unlike Experiment 1, the feature model F performs worse than the other alternatives (p < 0.001 in all cases). Not surprisingly, this model performs relatively well for edge cases f_j → f_k where f_j and f_k are linked in Figure 2b, but the final row shows that the model performs poorly across the remaining set of arguments. Taken together, Experiments 1 and 2 suggest that relationships between objects and relationships between features are both needed to account for human inferences. Experiment 1 rules out an exemplar approach, but models that combine graph structures over objects and features perform relatively well in both experiments. We considered two methods for combining these structures and both performed equally well. Combining the knowledge captured by these structures appears to be important, and future studies can explore in detail how humans achieve this combination.

4 Conclusion

This paper proposed that graphical models are useful for capturing knowledge about animals and their features and showed that a graphical model over features can account for human inferences about chimeras. A family of exemplar models and a graphical model defined over objects were unable to account for our data, which suggests that humans rely on mental representations that explicitly capture dependency relationships between features. Psychologists have previously used graphical models to capture relationships between features, but our work is the first to focus on chimeras and to explore models defined over a large set of familiar features. Although a simple undirected model accounted relatively well for our data, this model is only a starting point. The model incorporates dependency relationships between features, but people know about many specific kinds of dependencies, including cases where one feature causes, enables, prevents, or is inconsistent with another. An undirected graph with only one class of edges cannot capture this knowledge in full, and richer representations will ultimately be needed in order to provide a more complete account of human reasoning.

Acknowledgments. I thank Madeleine Clute for assisting with this research.
This work was supported in part by the Pittsburgh Life Sciences Greenhouse Opportunity Fund and by NSF grant CDI-0835797.

References
[1] R. N. Shepard. Toward a universal law of generalization for psychological science. Science, 237:1317–1323, 1987.
[2] J. R. Anderson. The adaptive nature of human categorization. Psychological Review, 98(3):409–429, 1991.
[3] E. Heit. A Bayesian analysis of some forms of inductive reasoning. In M. Oaksford and N. Chater, editors, Rational models of cognition, pages 248–274. Oxford University Press, Oxford, 1998.
[4] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001.
[5] C. Kemp and J. B. Tenenbaum. Structured statistical models of inductive reasoning. Psychological Review, 116(1):20–58, 2009.
[6] D. N. Osherson, E. E. Smith, O. Wilkie, A. López, and E. Shafir. Category-based induction. Psychological Review, 97(2):185–200, 1990.
[7] D. J. Navarro. Learning the context of a category. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1795–1803. 2010.
[8] C. Kemp, T. L. Griffiths, S. Stromsten, and J. B. Tenenbaum. Semi-supervised learning with trees. In Advances in Neural Information Processing Systems 16, pages 257–264. MIT Press, Cambridge, MA, 2004.
[9] B. Rehder. A causal-model theory of conceptual representation and categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29:1141–1159, 2003.
[10] B. Rehder and R. Burnett. Feature inference and the causal structure of categories. Cognitive Psychology, 50:264–314, 2005.
[11] C. Kemp, P. Shafto, and J. B. Tenenbaum. An integrated account of generalization across objects and features. Cognitive Psychology, in press.
[12] S. E. Barrett, H. Abdi, G. L. Murphy, and J. McCarthy Gallagher. Theory-based correlations and their role in children's concepts. Child Development, 64:1595–1616, 1993.
[13] S. A. Sloman, B. C. Love, and W. Ahn. Feature centrality and conceptual coherence. Cognitive Science, 22(2):189–228, 1998.
[14] D. Yarlett and M. Ramscar. A quantitative model of counterfactual reasoning. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 123–130. MIT Press, Cambridge, MA, 2002.
[15] W. Ahn, J. K. Marsh, C. C. Luhmann, and K. Lee. Effect of theory-based feature correlations on typicality judgments. Memory and Cognition, 30(1):107–118, 2002.
[16] C. McNorgan, R. A. Kotack, D. C. Meehan, and K. McRae. Feature-feature causal relations and statistical co-occurrences in object concepts. Memory and Cognition, 35(3):418–431, 2007.
[17] S. De Deyne, S. Verheyen, E. Ameel, W. Vanpaemel, M. J. Dry, W. Voorspoels, and G. Storms. Exemplar by feature applicability matrices and other Dutch normative data for semantic concepts. Behavior Research Methods, 40(4):1030–1048, 2008.
[18] J. P. Huelsenbeck and F. Ronquist. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754–755, 2001.
[19] M. Schmidt. UGM: A Matlab toolbox for probabilistic undirected graphical models. 2007. Available at http://people.cs.ubc.ca/~schmidtm/Software/UGM.html.
[20] L. J. Nelson and D. T. Miller. The distinctiveness effect in social categorization: you are what makes you unusual. Psychological Science, 6:246–249, 1995.
[21] A. L. Patalano, S. Chin-Parker, and B. H. Ross. The importance of being coherent: category coherence, cross-classification and reasoning. Journal of Memory and Language, 54:407–424, 2006.
[22] S. K. Reed. Pattern recognition and categorization. Cognitive Psychology, 3:393–407, 1972.

5 0.34087291 90 nips-2011-Evaluating the inverse decision-making approach to preference learning

Author: Alan Jern, Christopher G. Lucas, Charles Kemp

Abstract: Psychologists have recently begun to develop computational accounts of how people infer others' preferences from their behavior. The inverse decision-making approach proposes that people infer preferences by inverting a generative model of decision-making. Existing data sets, however, do not provide sufficient resolution to thoroughly evaluate this approach. We introduce a new preference learning task that provides a benchmark for evaluating computational accounts and use it to compare the inverse decision-making approach to a feature-based approach, which relies on a discriminative combination of decision features. Our data support the inverse decision-making approach to preference learning.

A basic principle of decision-making is that knowing people's preferences allows us to predict how they will behave: if you know your friend likes comedies and hates horror films, you can probably guess which of these options she will choose when she goes to the theater. Often, however, we do not know what other people like and we can only infer their preferences from their behavior. If you know that a different friend saw a comedy today, does that mean that he likes comedies in general? The conclusion you draw will likely depend on what else was playing and what movie choices he has made in the past. A goal for social cognition research is to develop a computational account of people's ability to infer others' preferences. One computational approach is based on inverse decision-making. This approach begins with a model of how someone's preferences lead to a decision. Then, this model is inverted to determine the most likely preferences that motivated an observed decision. An alternative approach might simply learn a functional mapping between features of an observed decision and the preferences that motivated it. For instance, in your friend's decision to see a comedy, perhaps the more movie options he turned down, the more likely it is that he has a true preference for comedies. The difference between the inverse decision-making approach and the feature-based approach maps onto the standard dichotomy between generative and discriminative models. Economists have developed an instance of the inverse decision-making approach known as the multinomial logit model [1] that has been widely used to infer consumers' preferences from their choices. This model has recently been explored as a psychological model [2, 3, 4], but there are few behavioral data sets for evaluating it as a model of how people learn others' preferences. Additionally, the data sets that do exist tend to be drawn from the developmental literature, which focuses on simple tasks that collect only one or two judgments from children [5, 6, 7]. The limitations of these data sets make it difficult to evaluate the multinomial logit model with respect to alternative accounts of preference learning like the feature-based approach. In this paper, we use data from a new experimental task that elicits a detailed set of preference judgments from a single participant in order to evaluate the predictions of several preference learning models from both the inverse decision-making and feature-based classes. Our task requires each participant to sort a large number of observed decisions on the basis of how strongly they indicate a preference for a chosen item.

[Figure 1: (a)–(c) Examples of the decisions used in the experiments. Each column represents one option and the boxes represent different effects. The chosen option is indicated by the black rectangle. (d) Features used by the weighted feature and ranked feature models: 1. Number of chosen effects (−/+); 2. Number of forgone effects (+/+); 3. Number of forgone options (+/+); 4. Number of forgone options containing x (−/−); 5. Max/min number of effects in a forgone option (+/−); 6. Is x in every option? (−/−); 7. Chose only option with x? (+/+); 8. Is x the only difference between options? (+/+); 9. Do all options have same number of effects? (+/+); 10. Chose option with max/min number of effects? (−/−). Features 5 and 10 involved maxima in Experiment 1, which focused on all positive effects, and minima in Experiment 2, which focused on all negative effects. The signs in parentheses indicate the direction of the feature that suggests a stronger preference in Experiment 1 / Experiment 2.]

Because the number of decisions is large and these decisions vary on multiple dimensions, predicting how people will order them offers a challenging benchmark on which to compare computational models of preference learning. Data sets from these sorts of detailed tasks have proved fruitful in other domains. For example, data reported by Shepard, Hovland, and Jenkins [8]; Osherson, Smith, Wilkie, López, and Shafir [9]; and Wasserman, Elek, Chatlosh, and Baker [10] have motivated much subsequent research on category learning, inductive reasoning, and causal reasoning, respectively. We first describe our preference learning task in detail. We then present several inverse decision-making and feature-based models of preference learning and compare these models' predictions to people's judgments in two experiments. The data are well predicted by models that follow the inverse decision-making approach, suggesting that this computational approach may help explain how people learn others' preferences.

1 Multi-attribute decisions and revealed preferences

We designed a task that can be used to elicit a large number of preference judgments from a single participant. The task involves a set of observed multi-attribute decisions, some examples of which are represented visually in Figure 1. Each decision is among a set of options and each option produces a set of effects. Figure 1 shows several decisions involving a total of five effects distributed among up to five options. The differently colored boxes represent different effects and the chosen option is marked by a black rectangle. For example, 1a shows a choice between an option with four effects and an option with a single effect; here, the decision maker chose the second option. In our task, people are asked to rank a large number of these decisions by how strongly they suggest that the decision maker had a preference for a particular effect (e.g., effect x in Figure 1). By imposing some minimal constraints, the space of unique multi-attribute decisions is finite and we can obtain rankings for every decision in the space. For example, Figure 2c shows a complete list of 47 unique decisions involving up to five effects, subject to several constraints described later. Three of these decisions are shown in Figure 1. If all the effects are positive—pieces of candy, for example—the first decision (1a) suggests a strong preference for candy x, because the decision maker turned down four pieces in favor of one. The second decision (1b), however, offers much weaker evidence because nearly everyone would choose four pieces of candy over one, even without a specific preference for x.
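To make the task representation concrete, here is a minimal sketch, written by us rather than taken from the paper, of how a decision could be encoded (options as sets of effect labels plus the index of the chosen option, an assumption on our part) and how a few of the Figure 1d surface features could then be computed.

def decision_features(options, chosen_idx, x):
    """Compute features 1-4 and 6-7 of Figure 1d for one decision.

    options: list of sets of effect labels, e.g. [{'d','c','b','a'}, {'x'}]
    chosen_idx: index of the chosen option
    x: the target effect
    """
    chosen = options[chosen_idx]
    forgone = [o for i, o in enumerate(options) if i != chosen_idx]
    return {
        "n_chosen_effects": len(chosen),                    # feature 1
        "n_forgone_effects": sum(len(o) for o in forgone),  # feature 2
        "n_forgone_options": len(forgone),                  # feature 3
        "n_forgone_with_x": sum(x in o for o in forgone),   # feature 4
        "x_in_every_option": all(x in o for o in options),  # feature 6
        "chose_only_option_with_x":
            (x in chosen) and not any(x in o for o in forgone),  # feature 7
    }

# Decision 1a: an option with four effects vs. the single-effect option {x};
# the decision maker chose the second option.
print(decision_features([{"d", "c", "b", "a"}, {"x"}], chosen_idx=1, x="x"))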
The third decision (1c) provides evidence that is strong but perhaps not quite as strong as the first decision. When all effects are negative—like electric shocks at different body locations—decision makers may still find some effects more tolerable than others, but different inferences are sometimes supported. For example, for negative effects, 1a provides weak evidence that x is relatively tolerable because nearly everyone would choose one shock over four.

2 A computational account of preference learning

We now describe a simple computational model for learning a person's preferences after observing that person make a decision like the ones in Figure 1. We assume that there are n available options {o_1, ..., o_n}, each of which produces one or more effects from the set {f_1, f_2, ..., f_m}. For simplicity, we assume that effects are binary. Let u_i denote the utility the decision maker assigns to effect f_i. We begin by specifying a model of decision-making that makes the standard assumptions that decision makers tend to choose things with greater utility and that utilities are additive. That is, if f_j is a binary vector indicating the effects produced by option o_j and u is a vector of utilities assigned to each of the m effects, then the total utility associated with option o_j can be expressed as U_j = f_j^T u. We complete the specification of the model by applying the Luce choice rule [11], a common psychological model of choice behavior, as the function that chooses among the options:

p(c = o_j \mid u, f) = \frac{\exp(U_j)}{\sum_{k=1}^{n} \exp(U_k)} = \frac{\exp(f_j^T u)}{\sum_{k=1}^{n} \exp(f_k^T u)} \qquad (1)

where c denotes the choice made. This model can predict the choice someone will make among a specified set of options, given the utilities that person assigns to the effects in each option. To obtain estimates of someone's utilities, we invert this model by applying Bayes' rule:

p(u \mid c, F) = \frac{p(c \mid u, F)\, p(u)}{p(c \mid F)} \qquad (2)

where F = {f_1, ..., f_n} specifies the available options and their corresponding effects. This is the multinomial logit model [1], a standard econometric model. In order to apply Equation 2 we must specify a prior p(u) on the utilities. We adopt a standard approach that places independent Gaussian priors on the utilities: u_i ∼ N(µ, σ²). For decisions where effects are positive—like candies—we set µ = 2σ, which corresponds to a prior distribution that places approximately 2% of the probability mass below zero. Similarly, for negative effects—like electric shocks—we set µ = −2σ.

2.1 Ordering a set of observed decisions

Equation 2 specifies a posterior probability distribution over utilities for a single observed decision but does not provide a way to compare the inferences drawn from multiple decisions for the purposes of ordering them. Suppose we are interested in a decision maker's preference for effect x and we wish to order a set of decisions by how strongly they support this preference. Two criteria for ordering the decisions are as follows:

Absolute utility: \quad E(u_x \mid c, F) = \mathbb{E}_{u_x}\!\left[ \frac{p(c \mid u_x, F)\, p(u_x)}{p(c \mid F)} \right]

Relative utility: \quad p(\forall j\; u_x \ge u_j \mid c, F) = \frac{p(c \mid \forall j\; u_x \ge u_j, F)\, p(\forall j\; u_x \ge u_j)}{p(c \mid F)}

The absolute utility model orders decisions by the mean posterior utility for effect x. This criterion is perhaps the most natural way to assess how much a decision indicates a preference for x, but it requires an inference about the utility of x in isolation, and research suggests that people often think about the utility of an effect only in relation to other salient possibilities [12].
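Before turning to the relative utility criterion, here is a small sketch, ours and not the authors' implementation, of Equations 1 and 2 combined: utilities are drawn from the Gaussian prior, weighted by the Luce likelihood of Equation 1, and the self-normalized weighted average gives the absolute utility criterion E(u_x | c, F). As a quick sanity check on the prior, µ = 2σ puts Φ(−2) ≈ 2.3% of the mass below zero, matching the "approximately 2%" above. The sample count here is deliberately small; the paper reports 5 million prior samples.

import numpy as np

rng = np.random.default_rng(0)

def absolute_utility(F, chosen_idx, x, mu, sigma, n_samples=200_000):
    """Monte Carlo estimate of E(u_x | c, F) using samples from the Gaussian
    prior u_i ~ N(mu, sigma^2) and the Luce/softmax likelihood of Equation 1."""
    n_options, m = F.shape
    u = rng.normal(mu, sigma, size=(n_samples, m))  # draws from the prior p(u)
    U = u @ F.T                                     # option utilities U_j = f_j^T u
    U -= U.max(axis=1, keepdims=True)               # numerical stability
    P = np.exp(U)
    w = P[:, chosen_idx] / P.sum(axis=1)            # likelihood p(c | u, F), Eq. 1
    return float((w * u[:, x]).sum() / w.sum())     # posterior mean of u_x, via Eq. 2

# Decision 1a with positive effects: {a,b,c,d} vs {x}, and the option {x} was
# chosen. Columns are effects a, b, c, d, x; the prior uses mu = 2*sigma.
F = np.array([[1, 1, 1, 1, 0],
              [0, 0, 0, 0, 1]])
print(absolute_utility(F, chosen_idx=1, x=4, mu=4.0, sigma=2.0))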
The relative utility model applies this idea to preference learning by ordering decisions based on how strongly they suggest that x has a greater utility than all other effects. The decisions in Figures 1b and 1c are cases where the two models lead to different predictions. If the effects are all negative (e.g., electric shocks), the absolute utility model predicts that 1b provides stronger evidence for a tolerance for x because the decision maker chose to receive four shocks instead of just one. The relative utility model predicts that 1c provides stronger evidence because 1b offers no way to determine the relative tolerance of the four chosen effects with respect to one another.

Like all generative models, the absolute and relative models incorporate three qualitatively different components: the likelihood term p(c|u, F), the prior p(u), and the reciprocal of the marginal likelihood 1/p(c|F). We assume that the total number of effects is fixed in advance and, as a result, the prior term will be the same for all decisions that we consider. The two other components, however, will vary across decisions. The inverse decision-making approach predicts that both components should influence preference judgments, and we will test this prediction by comparing our two inverse decision-making models to two alternatives that rely on only one of these components as an ordering criterion:

Representativeness: \quad p(c \mid \forall j\; u_x \ge u_j, F) \qquad\qquad Surprise: \quad 1/p(c \mid F)

The representativeness model captures how likely the observed decision would be if the utility for x were high, and previous research has shown that people sometimes rely on a representativeness computation of this kind [13]. The surprise model captures how unexpected the observed decision is overall; surprising decisions may be best explained in terms of a strong preference for x, but unsurprising decisions provide little information about x in particular.

2.2 Feature-based models

We also consider a class of feature-based models that use surface features to order decisions. The ten features that we consider are shown in Figure 1d, where x is the effect of interest. As an example, the first feature specifies the number of effects chosen; because x is always among the chosen effects, decisions where few or no other effects belong to the chosen option suggest the strongest preference for x (when all effects are positive). This and the second feature were previously identified by Newtson [14]; we included the eight additional features shown in Figure 1d in an attempt to include all possible features that seemed both simple and relevant. We consider two methods for combining this set of features to order a set of decisions by how strongly they suggest a preference for x. The first model is a standard linear regression model, which we refer to as the weighted feature model. The model learns a weight for each feature, and the rank of a given decision is determined by a weighted sum of its features. The second model is a ranked feature model that sorts the observed decisions with respect to a strict ranking of the features. The top-ranked feature corresponds to the primary sort key, the second-ranked feature to the secondary sort key, and so on. For example, suppose that the top-ranked feature is the number of chosen effects and the second-ranked feature is the number of forgone options. Sorting the three decisions in Figure 1 according to this criterion produces the following ordering: 1a, 1c, 1b.
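The lexicographic sort behind the ranked feature model is easy to state in code. The sketch below is ours; the feature values given to the three example decisions are hypothetical stand-ins rather than the true values of Figures 1a-c.

def ranked_feature_sort(decisions, ranking):
    """Sort decisions lexicographically: the top-ranked feature is the primary
    sort key, the next is the tiebreaker, and so on. `ranking` is a list of
    (feature_name, sign) pairs, where sign = +1 means larger values indicate a
    stronger preference (cf. the signs in Figure 1d)."""
    key = lambda d: tuple(-sign * d[name] for name, sign in ranking)
    return sorted(decisions, key=key)   # strongest evidence first

# Hypothetical values: primary key = number of chosen effects (fewer is
# stronger for positive effects), secondary = number of forgone options.
decisions = [
    {"id": "A", "n_chosen": 4, "n_forgone_options": 1},
    {"id": "B", "n_chosen": 1, "n_forgone_options": 1},
    {"id": "C", "n_chosen": 1, "n_forgone_options": 3},
]
order = ranked_feature_sort(decisions, [("n_chosen", -1), ("n_forgone_options", +1)])
print([d["id"] for d in order])   # ['C', 'B', 'A']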
This notion of sorting items on the basis of ranked features has been applied before to decision-making [15, 16] and other domains of psychology [17], but we are not aware of any previous applications to preference learning. Although our inverse decision-making and feature-based models represent two very different approaches, both may turn out to be valuable. An inverse decision-making approach may be the appropriate account of preference learning at Marr's [18] computational level, and a feature-based approach may capture the psychological processes by which the computational-level account is implemented. Our goal, therefore, is not necessarily to accept one of these approaches and dismiss the other. Instead, we entertain three distinct possibilities. First, both approaches may account well for the data, which would support the idea that they are valid accounts operating at different levels of analysis. Second, the inverse decision-making approach may offer a better account, suggesting that process-level accounts other than the feature-based approach should be explored. Finally, the feature-based approach may offer a better account, suggesting that inverse decision-making does not constitute an appropriate computational-level account of preference learning.

3 Experiment 1: Positive effects

Our first experiment focuses on decisions involving only positive effects. The full set of 47 decisions we used is shown in Figure 2c. This set includes every possible unique decision with up to five different effects, subject to the following constraints: (1) one of the effects (effect x) must always appear in the chosen option, (2) there are no repeated options, (3) each effect may appear in an option at most once, (4) only effects in the chosen option may be repeated in other options, and (5) when effects appear in multiple options, the number of effects is held constant across options. The first constraint is necessary for the sorting task, the second two constraints create a finite space of decisions, and the final two constraints limit attention to what we deemed the most interesting cases.

Method. 43 Carnegie Mellon undergraduates participated for course credit. Each participant was given a set of cards, with one decision printed on each card. The decisions were represented visually as in Figure 1 but without the letter labels.

[Figure 2: (a) Comparison between the absolute utility model rankings and the mean human rankings for Experiment 1. Each point represents one decision, numbered with respect to the list in panel c. (b) Comparison between the mean human rankings in Experiments 1 and 2. In both scatter plots, the solid diagonal lines indicate a perfect correspondence between the two sets of rankings. (c) The complete set of decisions, ordered by the mean human rankings from Experiment 1. Options are separated by vertical bars and the chosen option is always at the far right. Participants were always asked about a preference for effect x. The decisions in panel c are: 1. dcbax 2. cbax 3. bax 4. ax 5. x 6. dcax | bcax 7. dx | cx | bx | ax 8. cax | bax 9. bdx | bcx | bax 10. dcx | bax 11. bx | ax 12. bdx | cax | bax 13. cx | bx | ax 14. d | cbax 15. c | bax 16. b | ax 17. d | c | bax 18. dc | bax 19. c | b | ax 20. dc | bx | ax 21. bdc | bax 22. ad | cx | bx | ax 23. d | c | b | ax 24. bad | bcx | bax 25. ac | bx | ax 26. cb | ax 27. cbad | cbax 28. dc | b | ax 29. ad | ac | bx | ax 30. ab | ax 31. bad | bax 32. dc | ab | ax 33. dcb | ax 34. a | x 35. bad | bac | bax 36. ac | ab | ax 37. ad | ac | ab | ax 38. b | a | x 39. ba | x 40. c | b | a | x 41. cb | a | x 42. d | c | b | a | x 43. cba | x 44. dc | ba | x 45. dc | b | a | x 46. dcb | a | x 47. dcba | x]

Participants were told that the effects were different types of candy and each option was a bag containing one or more pieces of candy. They were asked to sort the cards by how strongly each decision suggested that the decision maker liked a particular target candy, labeled x in Figure 2c. They sorted the cards freely on a table but reported their final rankings by writing them on a sheet of paper, from weakest to strongest evidence. They were instructed to order the cards as completely as possible, but were told that they could assign the same ranking to a set of cards if they believed those cards provided equal evidence.

3.1 Results

Two participants were excluded as outliers based on the criterion that their rankings for at least five decisions were at least three standard deviations from the mean rankings. We performed a hierarchical clustering analysis of the remaining 41 participants' rankings using rank correlation as a similarity metric. Participants' rankings were highly correlated: cutting the resulting dendrogram at 0.2 resulted in one cluster that included 33 participants, and the second largest cluster included only 3 participants. Thus, we grouped all participants together and analyzed their mean rankings. The 0.2 threshold was chosen because it produced the most informative clustering in Experiment 2.

[Figure 3: Comparison between human rankings in both experiments and predicted rankings from four models (absolute utility, relative utility, representativeness, and surprise). The solid diagonal lines indicate a perfect correspondence between human and model rankings. Reported MAEs for Experiment 1 (positive effects): 2.3 for the absolute and relative utility models, 17.8 for representativeness, and 7.0 for surprise; the Experiment 2 (negative effects) panels report MAEs of 4.3, 6.7, 9.5, and 17.3.]

Inverse decision-making models. We implemented the inverse decision-making models using importance sampling with 5 million samples drawn from the prior distribution p(u). Because all the effects were positive, we used a prior on utilities that placed nearly all probability mass above zero (µ = 4, σ = 2). The mean human rankings are compared with the absolute utility model rankings in Figure 2a, and the mean human rankings are listed in order in 2c. Fractional rankings were used for both the human data and the model predictions. The human rankings in the figure are the means of participants' fractional rankings. The first row of Figure 3 contains similar plots that allow comparison of the four models we considered. In these plots, the solid diagonal lines indicate a perfect correspondence between model and human rankings. Thus, the largest deviations from this line represent the largest deviations in the data from the model's predictions.
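As an aside, the participant-clustering step described above can be sketched as follows, under some assumptions of ours: rank correlations are turned into distances as 1 − r, average linkage is used for the hierarchy (the paper does not name the linkage), and the dendrogram is cut at the 0.2 threshold. The rankings matrix here is random stand-in data.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
rankings = rng.random((41, 47)).argsort(axis=1)  # stand-in: 41 participants x 47 decisions

# Each row is already a ranking (a permutation of 0..46), so the Pearson
# correlation of rows equals the Spearman rank correlation.
corr = np.corrcoef(rankings)
dist = 1.0 - corr                                  # turn similarity into a distance
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=0.2, criterion="distance")  # cut the dendrogram at 0.2
print(np.bincount(labels)[1:])                     # cluster sizes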
Figure 3 shows that the absolute and relative utility models make virtually identical predictions and both models provide a strong account of the human rankings as measured by mean absolute error (MAE = 2.3 in both cases). Moreover, both models correctly predict the highest ranked decision and the set of lowest ranked decisions. The only clear discrepancy between the model predictions and the data is the cluster of points at the lower left, labeled as Decisions 6–13 in Figure 2a. These are all cases in which effect x appears in all options and therefore these decisions provide no information about a decision maker's preference for x. Consequently, the models assign the same ranking to this group as to the group of decisions in which there is only a single option (Decisions 1–5). Although people appeared to treat these groups somewhat differently, the models still correctly predict that the entire group of decisions 1–13 is ranked lower than all other decisions. The surprise and representativeness models do not perform nearly as well (MAE = 7.0 and 17.8, respectively). Although the surprise model captures some of the general trends in the human rankings, it makes several major errors. For example, consider Decision 7: dx|cx|bx|ax. This decision provides no information about a preference for x because x appears in every option. The decision is surprising, however, because a decision maker choosing at random from these options would make the observed choice only 1/4 of the time. The representativeness model performs even worse, primarily because it does not take into account alternative explanations for why an option was chosen, such as the fact that no other options were available (e.g., Decision 1 in Figure 2c). The failure of these models to adequately account for the data suggests that both the likelihood p(c|u, F) and the marginal likelihood p(c|F) are important components of the absolute and relative utility models.

Feature-based models. We compared the performance of the absolute and relative utility models to our two feature-based models: the weighted feature and ranked feature models. For each participant, we considered every subset of features¹ in Figure 1d in order to determine the minimum number of features needed by the two models to achieve the same level of accuracy as the absolute utility model, as measured by mean absolute error.

[Figure 4: Results of the feature-based model analysis from Experiment 1 for (a) the weighted feature models and (b) the ranked feature models. The histograms show the minimum number of features needed to match the accuracy (measured by MAE) of the absolute utility model for each participant.]

The results of these analyses are shown in Figure 4. For the majority of participants, at least four features were needed by both models to match the accuracy of the absolute utility model. For the weighted feature model, 14 participants could not be fit as well as the absolute utility model even when all ten features were considered. These results indicate that a feature-based account of people's inferences in our task must be supplied with a relatively large number of features. By contrast, the inverse decision-making approach provides a relatively parsimonious account of the data.

¹ A maximum of six features was considered for the ranked feature model because considering more features was computationally intractable.
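A possible reconstruction of that feature-subset analysis, ours rather than the authors' code, is sketched below: for a single participant, enumerate feature subsets in order of size, fit the weighted feature model by least squares, and stop at the first subset whose MAE matches the absolute utility model's. For simplicity, this fits raw predictions rather than converting them to fractional rankings first.

from itertools import combinations
import numpy as np

def min_features_needed(X, y, target_mae):
    """X: decisions x features matrix; y: one participant's rankings;
    target_mae: MAE achieved by the absolute utility model."""
    n_feat = X.shape[1]
    for size in range(1, n_feat + 1):
        for subset in combinations(range(n_feat), size):
            Xs = np.column_stack([X[:, list(subset)], np.ones(len(y))])  # add intercept
            w, *_ = np.linalg.lstsq(Xs, y, rcond=None)                   # fit weights
            if np.abs(Xs @ w - y).mean() <= target_mae:
                return size
    return None  # no subset matches (cf. the 14 unfit participants above)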
4 Experiment 2: Negative effects

Experiment 2 focused on a setting in which all effects are negative, motivated by the fact that the inverse decision-making models predict several major differences in orderings when effects are negative rather than positive. For instance, the absolute utility model's relative rankings of the decisions in Figures 1a and 1b are reversed when all effects are negative rather than positive.

Method. 42 Carnegie Mellon undergraduates participated for course credit. The experimental design was identical to Experiment 1 except that participants were told that the effects were electric shocks at different body locations. They were asked to sort the cards on the basis of how strongly each decision suggested that the decision maker finds shocks at the target location relatively tolerable. The model predictions were derived in the same way as for Experiment 1, but with a prior distribution on utilities that placed nearly all probability mass below zero (µ = −4, σ = 2) to reflect the fact that effects were all negative.

4.1 Results

Three participants were excluded as outliers by the same criterion applied in Experiment 1. The resulting mean rankings are compared with the corresponding rankings from Experiment 1 in Figure 2b. The figure shows that responses based on positive and negative effects were substantially different in a number of cases. Figure 3 shows how the mean rankings compare to the predictions of the four models we considered. Although the relative utility model is fairly accurate, no model achieves the same level of accuracy as the absolute and relative utility models in Experiment 1. In addition, the relative utility model provides a poor account of the responses of many individual participants. To better understand responses at the individual level, we repeated the hierarchical clustering analysis described in Experiment 1, which revealed that 29 participants could be grouped into one of four clusters, with the remaining participants each in their own clusters. We analyzed these four clusters independently, excluding the 10 participants that could not be naturally grouped. We compared the mean rankings of each cluster to the absolute and relative utility models, as well as all one- and two-feature weighted feature and ranked feature models. Figure 5 shows that the mean rankings of participants in Cluster 1 (N = 8) were best fit by the absolute utility model, the mean rankings of participants in Cluster 2 (N = 12) were best fit by the relative utility model, and the mean rankings of participants in Clusters 3 (N = 3) and 4 (N = 6) were better fit by feature-based models than by either the absolute or relative utility models.

[Figure 5: Comparison between human rankings for the four clusters of participants identified in Experiment 2 (Cluster 1, N = 8; Cluster 2, N = 12; Cluster 3, N = 3; Cluster 4, N = 6) and predicted rankings from three models: absolute utility, relative utility, and the best-fitting two-factor weighted feature model (factor pairs 1,3; 1,8; 6,7; and 3,8; reported MAEs range from 2.3 to 14.0). Each point in the plots corresponds to one decision and the solid diagonal lines indicate a perfect correspondence between human and model rankings. The third row shows the predictions of the best-fitting two-factor weighted feature model for each cluster; the two factors listed refer to Figure 1d.]

To examine how well the models accounted for individuals' rankings within each cluster, we compared the predictions of the inverse decision-making models to the best-fitting two-factor feature-based model for each participant. In Cluster 1, 7 out of 8 participants were best fit by the absolute utility model; in Cluster 2, 8 out of 12 participants were best fit by the relative utility model; in Clusters 3 and 4, all participants were better fit by feature-based models. No single feature-based model provided the best fit for more than two participants, suggesting that participants not fit well by the inverse decision-making models were not using a single alternative strategy. Applying the feature-based model analysis from Experiment 1 to the current results revealed that the weighted feature model required an average of 6.0 features to match the performance of the absolute utility model for participants in Cluster 1, and an average of 3.9 features to match the performance of the relative utility model for participants in Cluster 2. Thus, although a single model did not fit all participants well in the current experiment, many participants were fit well by one of the two inverse decision-making models, suggesting that this general approach is useful for explaining how people reason about negative effects as well as positive effects.

5 Conclusion

In two experiments, we found that an inverse decision-making approach offered a good computational account of how people make judgments about others' preferences. Although this approach is conceptually simple, our analyses indicated that it captures the influence of a fairly large number of relevant decision features. Indeed, the feature-based models that we considered as potential process models of preference learning could only match the performance of the inverse decision-making approach when supplied with a relatively large number of features. We feel that this result rules out the feature-based approach as psychologically implausible, meaning that alternative process-level accounts will need to be explored. One possibility is sampling, which has been proposed as a psychological mechanism for approximating probabilistic inferences [19, 20]. However, even if process models that use large numbers of features are considered plausible, the inverse decision-making approach provides a valuable computational-level account that helps to explain which decision features are informative.

Acknowledgments. This work was supported in part by the Pittsburgh Life Sciences Greenhouse Opportunity Fund and by NSF grant CDI-0835797.

References
[1] D. McFadden. Conditional logit analysis of qualitative choice behavior. In P. Zarembka, editor, Frontiers in Econometrics. Academic Press, New York, 1973.
[2] C. G. Lucas, T. L. Griffiths, F. Xu, and C. Fawcett. A rational model of preference learning and choice prediction by children. In Proceedings of Neural Information Processing Systems 21, 2009.
[3] L. Bergen, O. R. Evans, and J. B. Tenenbaum. Learning structured preferences. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society, 2010.
[4] A. Jern and C. Kemp. Decision factors that support preference learning. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011.
[5] T. Kushnir, F. Xu, and H. M. Wellman.
Young children use statistical sampling to infer the preferences of other people. Psychological Science, 21(8):1134–1140, 2010.
[6] L. Ma and F. Xu. Young children's use of statistical sampling evidence to infer the subjectivity of preferences. Cognition, in press.
[7] M. J. Doherty. Theory of Mind: How Children Understand Others' Thoughts and Feelings. Psychology Press, New York, 2009.
[8] R. N. Shepard, C. I. Hovland, and H. M. Jenkins. Learning and memorization of classifications. Psychological Monographs, 75, Whole No. 517, 1961.
[9] D. N. Osherson, E. E. Smith, O. Wilkie, A. López, and E. Shafir. Category-based induction. Psychological Review, 97(2):185–200, 1990.
[10] E. A. Wasserman, S. M. Elek, D. L. Chatlosh, and A. G. Baker. Rating causal relations: Role of probability in judgments of response-outcome contingency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19(1):174–188, 1993.
[11] R. D. Luce. Individual choice behavior. John Wiley, 1959.
[12] D. Ariely, G. Loewenstein, and D. Prelec. Tom Sawyer and the construction of value. Journal of Economic Behavior & Organization, 60:1–10, 2006.
[13] D. Kahneman and A. Tversky. Subjective probability: A judgment of representativeness. Cognitive Psychology, 3(3):430–454, 1972.
[14] D. Newtson. Dispositional inference from effects of actions: Effects chosen and effects forgone. Journal of Experimental Social Psychology, 10:489–496, 1974.
[15] P. C. Fishburn. Lexicographic orders, utilities and decision rules: A survey. Management Science, 20(11):1442–1471, 1974.
[16] G. Gigerenzer and P. M. Todd. Fast and frugal heuristics: The adaptive toolbox. Oxford University Press, New York, 1999.
[17] A. Prince and P. Smolensky. Optimality Theory: Constraint Interaction in Generative Grammar. Wiley-Blackwell, 2004.
[18] D. Marr. Vision. W. H. Freeman, San Francisco, 1982.
[19] A. N. Sanborn, T. L. Griffiths, and D. J. Navarro. Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, 117:1144–1167, 2010.
[20] L. Shi and T. L. Griffiths. Neural implementation of Bayesian inference by importance sampling. In Proceedings of Neural Information Processing Systems 22, 2009.

6 0.30916795 280 nips-2011-Testing a Bayesian Measure of Representativeness Using a Large Image Database

7 0.30301765 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

8 0.27304658 139 nips-2011-Kernel Bayes' Rule

9 0.26474088 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

10 0.24895556 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation

11 0.2483955 69 nips-2011-Differentially Private M-Estimators

12 0.24225652 26 nips-2011-Additive Gaussian Processes

13 0.24076129 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison

14 0.23999389 147 nips-2011-Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors

15 0.235724 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

16 0.23178521 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks

17 0.22971024 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample

18 0.22619195 221 nips-2011-Priors over Recurrent Continuous Time Processes

19 0.22248067 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

20 0.21549419 100 nips-2011-Gaussian Process Training with Input Noise


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.016), (4, 0.061), (20, 0.018), (26, 0.022), (31, 0.116), (33, 0.018), (38, 0.279), (43, 0.078), (45, 0.071), (48, 0.011), (57, 0.029), (65, 0.011), (74, 0.06), (83, 0.066), (99, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76662976 194 nips-2011-On Causal Discovery with Cyclic Additive Noise Models

Author: Joris M. Mooij, Dominik Janzing, Tom Heskes, Bernhard Schölkopf

Abstract: We study a particular class of cyclic causal models, where each variable is a (possibly nonlinear) function of its parents and additive noise. We prove that the causal graph of such models is generically identifiable in the bivariate, Gaussian-noise case. We also propose a method to learn such models from observational data. In the acyclic case, the method reduces to ordinary regression, but in the more challenging cyclic case, an additional term arises in the loss function, which makes it a special case of nonlinear independent component analysis. We illustrate the proposed method on synthetic data. 1

2 0.66934735 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar

Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL), which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER) and chunking problems.

1 Introduction and Related Work

Over the past decade there has been increased interest in using unlabeled data to supplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing supervised NLP system as these approaches enforce a particular choice of model. An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension "on" at a given time. Broadly, these embedding methods fall into two categories:

1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5].

2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8].

Unfortunately, most of these representations are 1) slow to train, 2) sensitive to the scaling of the embeddings (especially ℓ2-based approaches like LSA/PCA), 3) liable to get stuck in local optima (like an EM-trained HMM), and 4) learn a single embedding for a given word type; i.e. all the occurrences of the word "bank" will have the same embedding, irrespective of whether the context of the word suggests it means "a financial institution" or "a river bank". In this paper, we propose a novel context-specific word embedding method called Low Rank Multi-View Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution.
As presented here, our LR-MVL embeddings are context-specific, but context-oblivious embeddings (like the ones used by [6, 7]) can be trivially obtained from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11], we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL, being linear, does not face the danger of getting stuck in local optima as is the case for an EM-trained HMM. LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings, and also for the reasons mentioned in the last paragraph.

The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5.

2 Brief Review: Canonical Correlation Analysis (CCA)

CCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information. More specifically, given a set of n paired observation vectors {(l1, r1), ..., (ln, rn)}–in our case the two matrices are the left (L) and right (R) context matrices of a word–we would like to simultaneously find the directions Φ_l and Φ_r that maximize the correlation of the projections of L onto Φ_l with the projections of R onto Φ_r. This is expressed as

\max_{\Phi_l, \Phi_r} \frac{\mathbb{E}\left[\langle L, \Phi_l \rangle \langle R, \Phi_r \rangle\right]}{\sqrt{\mathbb{E}\left[\langle L, \Phi_l \rangle^2\right] \mathbb{E}\left[\langle R, \Phi_r \rangle^2\right]}} \qquad (1)

where E denotes the empirical expectation. We use the notation C_{lr} (C_{ll}) to denote the cross (auto) covariance matrices between L and R (i.e. L'R and L'L respectively). The left and right canonical correlates are the solutions Φ_l, Φ_r of the following equations:

C_{ll}^{-1} C_{lr} C_{rr}^{-1} C_{rl} \, \Phi_l = \lambda \Phi_l, \qquad C_{rr}^{-1} C_{rl} C_{ll}^{-1} C_{lr} \, \Phi_r = \lambda \Phi_r \qquad (2)

3 Low Rank Multi-View Learning (LR-MVL)

In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative). The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word space down to a k dimensional state space.
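A compact NumPy sketch of Equation 2 follows; it is our own illustration, and the ridge term added for numerical stability is our choice rather than part of the text.

import numpy as np

def cca(L, R, k, reg=1e-6):
    """Top-k canonical correlates of L and R via the eigenproblem of Eq. 2."""
    L = L - L.mean(axis=0)
    R = R - R.mean(axis=0)
    n = L.shape[0]
    Cll = L.T @ L / n + reg * np.eye(L.shape[1])   # auto-covariance of L
    Crr = R.T @ R / n + reg * np.eye(R.shape[1])   # auto-covariance of R
    Clr = L.T @ R / n                              # cross-covariance
    # M = Cll^{-1} Clr Crr^{-1} Crl; its top eigenvectors are the left correlates
    M = np.linalg.solve(Cll, Clr) @ np.linalg.solve(Crr, Clr.T)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)[:k]
    phi_l = vecs[:, order].real
    phi_r = np.linalg.solve(Crr, Clr.T) @ phi_l    # matching right correlates
    phi_r /= np.linalg.norm(phi_r, axis=0)
    return phi_l, phi_r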
All eigenvector computations are thus done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50,000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of the calculations that are needed. The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the "eigenfeature dictionary". We now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks.

3.1 The LR-MVL method

Given an unlabeled token sequence w = {w_0, w_1, ..., w_n} we want to learn a low (k-)dimensional state vector {z_0, z_1, ..., z_n} for each observed token. The key is to find a v × k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced-rank k-dimensional state vector, which is later used to induce context-specific embeddings for the tokens (Algorithm 2). For supervised learning, these context-specific embeddings are supplemented with other information about each token w_t, such as its identity, orthographic features such as prefixes and suffixes, or membership in domain-specific lexicons, and used as features in a classifier. Section 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms:

• Take the h words to the left and to the right of each target word w_t (the "Left" and "Right" contexts), and project them each down to k dimensions using A.
• Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k dimensional state vector (the "hidden state") for each token.
• Take the CCA between the hidden states and the tokens w_t. The singular vectors associated with w_t form a new estimate of the eigenfeature dictionary.

LR-MVL can be viewed as a type of co-training [13]: the state of each token w_t is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token and using the average over the states associated with all other occurrences of that word.

3.2 Theoretical Properties of LR-MVL

We now present the theory behind the LR-MVL algorithm; in particular we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data, and that the estimated state does the best possible job of capturing any label information that can be inferred by a linear model. Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n × hv matrix for the right context, and W be an n × v matrix of indicator functions for the words themselves. We will use the following assumptions at various points in our proof:

Assumption 1. L, W, and R come from a rank k HMM, i.e. it has a rank k observation matrix and a rank k transition matrix, both of which have the same domain.
For example, if the dimension of the hidden state is k and the vocabulary size is v then the observation matrix, which is k × v, has rank k. This rank condition is similar to the one used by [10].

Assumption 1A. For the three views L, W and R, assume that there exists a "hidden state H" of dimension n × k, where each row H_i has the same non-singular variance-covariance matrix, and such that E(L_i | H_i) = H_i β_L^T, E(R_i | H_i) = H_i β_R^T, and E(W_i | H_i) = H_i β_W^T, where all β's are of rank k, and L_i, R_i and W_i are the rows of L, R and W respectively. Assumption 1A follows from Assumption 1.

Assumption 2. ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where ρ(X_1, X_2) is the expected correlation between X_1 and X_2. Assumption 2 is a rank condition similar to that in [9].

Assumption 3. ρ([L, R], W) has k distinct singular values. Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values.

We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices:

Definition 1 (CCA). Compute the CCA between two matrices X_1 and X_2. Let Φ_{X_1} be a matrix containing the d largest singular vectors for X_1 (sorted from the largest on down). Likewise for Φ_{X_2}. Define the function CCA_d(X_1, X_2) = [Φ_{X_1}, Φ_{X_2}]. When we want just one of these Φ's, we will use CCA_d(X_1, X_2)_left = Φ_{X_1} for the left singular vectors and CCA_d(X_1, X_2)_right = Φ_{X_2} for the right singular vectors.

Note that the resulting singular vectors, [Φ_{X_1}, Φ_{X_2}], can be used to give two redundant estimates, X_1 Φ_{X_1} and X_2 Φ_{X_2}, of the "hidden" state relating X_1 and X_2, if such a hidden state exists.

Definition 2. Define the symbol "≈" to mean

X_1 \approx X_2 \iff \lim_{n \to \infty} X_1 = \lim_{n \to \infty} X_2

where n is the sample size.

Lemma 1. Define A by the following limit of the right singular vectors: CCA_k([L, R], W)_right ≈ A. Under assumptions 2, 3 and 1A, if CCA_k(L, R) ≡ [Φ_L, Φ_R] then CCA_k([LΦ_L, RΦ_R], W)_right ≈ A.

Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k dimensional state from them, and take the CCA of that state with the words and get the same result. See the supplementary material for the proof.

Let Ã_h denote a matrix formed by stacking h copies of A on top of each other. Right multiplying L or R by Ã_h projects each of the words in that context into the k-dimensional reduced rank space. The following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space.

Theorem 1. Under assumptions 1, 2 and 3 there exists a unique matrix A such that if CCA_k(LÃ_h, RÃ_h) ≡ [Φ̃_L, Φ̃_R] then CCA_k([LÃ_h Φ̃_L, RÃ_h Φ̃_R], W)_right ≈ A, where Ã_h is the stacked form of A. See the supplementary material for the proof.¹

¹ It is worth noting that our matrix A corresponds to the matrix Û used by [9, 10]. They showed that Û is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show it here (due to limited space), our A provides a more statistically efficient estimate than their Û, and hence can also be used to estimate the sequence probabilities.
3.3 Using Exponential Smooths

In practice, we replace the projected left and right contexts with exponential smooths of them at a few different time scales (a weighted average of the previous (or next) token's state Z_{t−1} (or Z_{t+1}) and the previous (or next) token's smoothed state S_{t−1} (or S_{t+1})), thus giving a further dimension reduction by a factor of the context length h (say 100 words) divided by the number of smooths (often 5-7). We use a mixture of both very short and very long contexts, which capture the short- and long-range dependencies required by NLP problems such as NER, Chunking and WSD. Since exponential smooths are linear, we preserve the linearity of our method.

3.4 The LR-MVL Algorithm

The LR-MVL algorithm (using exponential smooths) is given in Algorithm 1 (a Python sketch follows the listing); it computes the pair of CCAs described above in Theorem 1.

Algorithm 1 LR-MVL Algorithm - Learning from Large Amounts of Unlabeled Data
1: Input: token sequence W_{n×v}, state-space size k, smoothing rates α_j
2: Initialize the eigenfeature dictionary A to random values drawn from N(0, 1).
3: repeat
4:   Set the state Z_t (1 < t ≤ n) of each token w_t to the eigenfeature vector of the corresponding word: Z_t = (A_w : w = w_t)
5:   Smooth the state estimates before and after each token to get a pair of views for each smoothing rate α_j:
       S_t^{(l,j)} = (1 − α_j) S_{t−1}^{(l,j)} + α_j Z_{t−1}   // left view L
       S_t^{(r,j)} = (1 − α_j) S_{t+1}^{(r,j)} + α_j Z_{t+1}   // right view R
     where the t-th rows of L and R are, respectively, the concatenations of the smooths S_t^{(l,j)} and S_t^{(r,j)} for each of the α_j's.
6:   Find the left and right canonical correlates, which are the eigenvectors Φ_l and Φ_r of
       (LᵀL)⁻¹ LᵀR (RᵀR)⁻¹ RᵀL Φ_l = λ Φ_l,
       (RᵀR)⁻¹ RᵀL (LᵀL)⁻¹ LᵀR Φ_r = λ Φ_r.
7:   Project the left and right views onto the space spanned by the top k/2 left and right CCAs respectively,
       X_l = L Φ_l^{(k/2)} and X_r = R Φ_r^{(k/2)},
     where Φ_l^{(k/2)}, Φ_r^{(k/2)} are matrices composed of the singular vectors of Φ_l, Φ_r with the k/2 largest-magnitude singular values. Estimate the state for each token w_t as the union of the left and right estimates: Z = [X_l, X_r]
8:   Estimate the eigenfeatures of each word type w as the average of the states estimated for that word: A_w = avg(Z_t : w_t = w)
9:   Compute the change in A from the previous iteration
10: until |ΔA| < ε
11: Output: Φ_l^k, Φ_r^k, A.
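Below is a compact, hedged sketch of Algorithm 1; it reuses `cca_d` from the earlier sketch. The convergence tolerance `tol`, the iteration cap `max_iter`, the zero initialization of the boundary smooths, and the use of the leading k/2 canonical directions in place of an explicit singular-value sort are illustrative assumptions, not choices taken from the paper.

```python
# Sketch of Algorithm 1: alternate between smoothing token states and
# re-estimating the eigenfeature dictionary A via CCA.
import numpy as np

def smooth_views(Z, alphas):
    """Step 5: exponential smooths of the token states at several rates,
    yielding the left view L and right view R (each n x (len(alphas)*k))."""
    n, k = Z.shape
    m = len(alphas)
    L, R = np.zeros((n, m * k)), np.zeros((n, m * k))
    for j, a in enumerate(alphas):
        Sl, Sr = np.zeros(k), np.zeros(k)          # zero boundary states (assumption)
        for t in range(n):                         # forward pass: left smooths
            Sl = (1 - a) * Sl + a * (Z[t - 1] if t > 0 else 0)
            L[t, j * k:(j + 1) * k] = Sl
        for t in range(n - 1, -1, -1):             # backward pass: right smooths
            Sr = (1 - a) * Sr + a * (Z[t + 1] if t < n - 1 else 0)
            R[t, j * k:(j + 1) * k] = Sr
    return L, R

def lr_mvl_train(tokens, vocab, k, alphas, tol=1e-4, max_iter=20, seed=0):
    """Assumes every vocabulary word occurs at least once in `tokens`."""
    idx = {w: i for i, w in enumerate(vocab)}
    ids = np.array([idx[w] for w in tokens])
    A = np.random.default_rng(seed).normal(size=(len(vocab), k))  # step 2
    Phi_l = Phi_r = None
    for _ in range(max_iter):
        Z = A[ids]                                 # step 4: per-token states
        L, R = smooth_views(Z, alphas)             # step 5
        Phi_l, Phi_r = cca_d(L, R, k)              # step 6: top-k correlates
        Z_new = np.hstack([L @ Phi_l[:, :k // 2],  # step 7: top k/2 from each view
                           R @ Phi_r[:, :k // 2]])
        A_new = np.vstack([Z_new[ids == w].mean(0) # step 8: average per word type
                           for w in range(len(vocab))])
        delta = np.abs(A_new - A).max()            # step 9
        A = A_new
        if delta < tol:                            # step 10
            break
    return Phi_l, Phi_r, A                         # step 11
```

On the toy `tokens`/`vocab` from the earlier sketch, `lr_mvl_train(tokens, vocab, k=2, alphas=[0.1, 0.5])` runs in a fraction of a second; note that the sign ambiguity inherent in SVD-based directions can keep |ΔA| from shrinking monotonically, which is one reason the sketch also caps the number of iterations.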
A few iterations (∼5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components. In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges equally rapidly to the true canonical variates.

As mentioned earlier, we get a further dimensionality reduction in Step 5 by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced-rank projections of the context words. Step 6 finds the CCA between the Left and Right contexts. Step 7 estimates the state by combining the estimates from the left and right contexts, since we don't know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W. Because W is a matrix of indicator functions, this CCA takes the trivial form of a set of averages.

Once we have estimated the CCA model, it is used to generate context-specific embeddings for the tokens from the training, development and test sets (as described in Algorithm 2; a sketch follows below). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token.

Algorithm 2 LR-MVL Algorithm - Inducing Context-Specific Embeddings for Train/Dev/Test Data
1: Input: model (Φ_l^k, Φ_r^k, A) output by Algorithm 1, and token sequences W_train, (W_dev, W_test)
2: Project the left and right views L and R, after smoothing, onto the space spanned by the top k left and right CCAs respectively,
     X_l = L Φ_l^k and X_r = R Φ_r^k,
   and project the words onto the eigenfeature dictionary, X_w = W_train A.
3: Form the final embedding matrix X_train:embed by concatenating these three estimates of state: X_train:embed = [X_l, X_w, X_r]
4: Output: the embedding matrices X_train:embed, (X_dev:embed, X_test:embed) with context-specific representations for the tokens.

These embeddings are augmented with the baseline set of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier. Note that we can also get context-"oblivious" embeddings, i.e. one embedding per word type, just by using the eigenfeature dictionary (A_{v×k}) output by Algorithm 1.
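A matching sketch of Algorithm 2, under the same assumptions as above; it reuses `smooth_views` and the model returned by `lr_mvl_train`, and the same smoothing rates must be used at train and embedding time. Out-of-vocabulary handling is omitted for brevity.

```python
# Sketch of Algorithm 2: induce context-specific embeddings for a
# (train/dev/test) token sequence from the learned model (Phi_l, Phi_r, A).
import numpy as np

def lr_mvl_embed(tokens, vocab, Phi_l, Phi_r, A, alphas):
    idx = {w: i for i, w in enumerate(vocab)}
    ids = np.array([idx[w] for w in tokens])   # assumes no OOV tokens
    Z = A[ids]
    L, R = smooth_views(Z, alphas)             # smooth, exactly as in training
    X_l = L @ Phi_l                            # left-context state estimate
    X_r = R @ Phi_r                            # right-context state estimate
    X_w = Z                                    # word ("eigenfeature") estimate
    return np.hstack([X_l, X_w, X_r])          # n x 3k embedding matrix
```

Keeping only the `X_w` columns corresponds to the context-oblivious (CO) embeddings mentioned above, i.e. one embedding per word type.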
4 Experimental Results

In this section we present the experimental results of LR-MVL on Named Entity Recognition (NER) and Syntactic Chunking tasks. We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structure Optimization (ASO)) and [2] (a semi-supervised extension of CRFs), as well as to embeddings like C&W, HLBL and Brown clustering.

4.1 Datasets and Experimental Setup

For the NER experiments we used the data from the CoNLL 2003 shared task, and for the Chunking experiments we used the CoNLL 2000 shared task data², with the standard training, development and testing set splits. The CoNLL '03 and CoNLL '00 datasets had ∼204K/51K/46K and ∼212K/−/47K tokens, respectively, for the Train/Dev./Test sets.

²More details about the data and competition are available at http://www.cnts.ua.ac.be/conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/.

4.1.1 Named Entity Recognition (NER)

We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below:

• Current word w_i; its type information: all-capitalized, is-capitalized, all-digits and so on; prefixes and suffixes of w_i.
• Word tokens in a window of 2 around the current word, i.e. d = (w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2}); and the capitalization pattern in the window.
• The previous two predictions y_{i−1} and y_{i−2}, and the conjunction of d and y_{i−1}.
• Embedding features (LR-MVL, C&W, HLBL, Brown etc.) in a window of 2 around the current word (if applicable).

Following [17] we use a regularized averaged perceptron model with the above set of baseline features for the NER task. We also used their BILOU text-chunk representation and fast greedy inference, as these were shown to give superior performance. We also augment the above set of baseline features with gazetteers, as is standard practice in NER experiments.

We tuned our free parameter, namely the size of the LR-MVL embedding, on the development set. We scaled our embedding features to have an ℓ2 norm of 1 for each token and further multiplied them by a normalization constant (also chosen by cross-validation), so that when they are used in conjunction with other categorical features in a linear classifier they do not exert extra influence (a small sketch of this rescaling appears at the end of Section 4.1.3). The size of LR-MVL embeddings (state space) that gave the best performance on the development set was k = 50 (50 each for X_l, X_w, X_r in Algorithm 2), i.e. the total size of the embeddings was 50 × 3, and the best normalization constant was 0.5. We omit validation plots due to paucity of space.

4.1.2 Chunking

For our chunking experiments we use a similar base set of features:

• Current word w_i and word tokens in a window of 2 around the current word, i.e. d = (w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2}).
• POS tags t_i in a window of 2 around the current word.
• Word-conjunction features w_i ∩ w_{i+1}, i ∈ {−1, 0}, and tag-conjunction features t_i ∩ t_{i+1}, i ∈ {−2, −1, 0, 1} and t_i ∩ t_{i+1} ∩ t_{i+2}, i ∈ {−2, −1, 0}.
• Embedding features in a window of 2 around the current word (when applicable).

Since the CoNLL '00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So we trained our chunking models on 7936 training sentences, evaluated their F1 score on the 1000 development sentences, and used a CRF³ as the supervised classifier. We tuned the size of the embedding and the magnitude of the ℓ2 regularization penalty in the CRF on the development set, and took the log (or −log of the magnitude) of the value of the features⁴. The regularization penalty that gave the best performance on the development set was 2, and here again the best size of LR-MVL embeddings (state space) was k = 50. Finally, we trained the CRF on the entire ("original") training data, i.e. 8936 sentences.

³http://www.chokkan.org/software/crfsuite/
⁴Our embeddings are learnt using a linear model whereas the CRF is a log-linear model, so we did this normalization to keep things on the same scale.

4.1.3 Unlabeled Data and Induction of Embeddings

For inducing the embeddings we used the RCV1 corpus, containing Reuters newswire from Aug '96 to Aug '97, with about 63 million tokens in 3.3 million sentences⁵. Case was left intact and we did not do the "cleaning" done by [18, 16], i.e. removing all sentences which are less than 90% lowercase a-z, as our multi-view learning approach is robust to such noisy data (like news byline text, mostly all caps, which does not correlate strongly with the text of the article). We induced our LR-MVL embeddings over a period of 3 days (70 core-hours on a 3.0 GHz CPU) on the entire RCV1 data by performing 4 iterations, with a vocabulary size of 300K, and using a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts: α = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]. In principle we could tune the smoothing parameters on the development set, but we found this mixture of long- and short-term dependencies to work well in practice.

⁵We chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data.
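To round off the experimental setup, here is a small sketch of the embedding rescaling described in Section 4.1.1: each token's embedding is scaled to unit ℓ2 norm and then multiplied by a constant c chosen by cross-validation (c = 0.5 was the best value reported above for NER). The zero-norm guard is an added assumption for robustness, not something stated in the text.

```python
import numpy as np

def scale_embeddings(X, c=0.5):
    """Unit-l2-normalize each token's embedding row, then scale by c."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0        # guard for all-zero rows (assumption)
    return c * X / norms
```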
As for the other embeddings, i.e. C&W, HLBL and the Brown clusters, we downloaded them from http://metaoptimize.com/projects/wordreprs. The details about their induction and parameter tuning can be found in [16]; we report their best numbers here. It is also worth noting that the unsupervised training of LR-MVL was (> 1.5 times) faster than that of the other embeddings⁶.

⁶Some of these embeddings were trained on GPGPUs, which makes our method comparatively even faster.

4.2 Results

The results for NER and Chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both the NER and Chunking tasks.

Table 1: NER results (F1 scores).

  Embedding/Model                    Dev. Set   Test Set
  -- No Gazetteers --
  Baseline                           90.03      84.39
  C&W, 200-dim                       92.46      87.46
  HLBL, 100-dim                      92.00      88.13
  Brown, 1000 clusters               92.32      88.52
  Ando & Zhang '05                   93.15      89.31
  Suzuki & Isozaki '08               93.66      89.36
  LR-MVL (CO), 50 × 3-dim            93.11      89.55
  LR-MVL, 50 × 3-dim                 93.61      89.91
  -- With Gazetteers --
  HLBL, 100-dim                      92.91      89.35
  C&W, 200-dim                       92.98      88.88
  Brown, 1000 clusters               93.25      89.41
  LR-MVL (CO), 50 × 3-dim            93.91      89.89
  LR-MVL, 50 × 3-dim                 94.41      90.06

Notes: 1) LR-MVL (CO) are context-oblivious embeddings, obtained from A in Algorithm 1. 2) F1 score = harmonic mean of precision and recall. 3) The current state of the art for this NER task is 90.90 (test set), but it uses 700 billion tokens of unlabeled data [19].

Table 2: Chunking results.

  Embedding/Model                    Test Set F1
  Baseline                           93.79
  HLBL, 50-dim                       94.00
  C&W, 50-dim                        94.10
  Brown, 3200 clusters               94.11
  Ando & Zhang '05                   94.39
  Suzuki & Isozaki '08               94.67
  LR-MVL (CO), 50 × 3-dim            95.02
  LR-MVL, 50 × 3-dim                 95.44

It is important to note that in problems like NER the final accuracy depends on performance on rare words, and since LR-MVL is robustly able to correlate past with future views, it learns better representations for rare words, resulting in better overall accuracy. On rare words (occurring < 10 times in the corpus), we got 11.7%, 10.7% and 9.6% relative reductions in error over C&W, HLBL and Brown, respectively, for NER; on Chunking the corresponding numbers were 6.7%, 7.1% and 8.7%.

It is also worth mentioning that modeling the context in the embeddings gives decent improvements in accuracy on both the NER and Chunking problems. For NER, the polysemous words were mostly words like Chicago, Wales and Oakland, which could be either a location or an organization (sports teams, banks, etc.). When we do not use the gazetteer features (which are known lists of cities, persons, organizations, etc.), we get a larger increase in F score from modeling context; when gazetteer features are present, they already capture most of the information about polysemous words, so modeling the context does not help as much. The polysemous words for the Chunking dataset were words like spot (VP/NP), never (VP/ADVP) and more (NP/VP/ADVP/ADJP), and in this case embeddings with context helped significantly, giving a 3.1-6.5% relative improvement in accuracy over context-oblivious embeddings.

5 Summary and Conclusion

In this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large-scale sequence learning problems such as arise in NLP.
LR-MVL is a spectral method that works in a low-dimensional state space, so it is computationally efficient and can be trained using large amounts of unlabeled data; moreover, it does not get stuck in local optima the way an EM-trained HMM can. The embeddings learnt using LR-MVL can be used as features with any supervised learner. LR-MVL has strong theoretical grounding, is much simpler and faster than competing methods, and achieves state-of-the-art accuracies on the NER and Chunking problems.

Acknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three anonymous reviewers for providing valuable feedback. We would also like to thank Lev Ratinov and Joseph Turian for answering our questions regarding their paper [16].

References

[1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (2005) 1817-1853
[2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: ACL (2008)
[3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural language. Computational Linguistics 18 (December 1992) 467-479
[4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL (1993) 183-190
[5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised sequence labeling. In: ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 495-503
[6] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: ICML '08, New York, NY, USA, ACM (2008) 160-167
[7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. In: ICML '07, New York, NY, USA, ACM (2007) 641-648
[8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: SIGCHI Conference on Human Factors in Computing Systems, ACM (1988) 281-285
[9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden Markov models. In: COLT (2009)
[10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS (2010)
[11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: ICML (2010)
[12] Hotelling, H.: Canonical correlation analysis (CCA). Journal of Educational Psychology (1935)
[13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT '98 (1998) 92-100
[14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. (Dec 2010)
[15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system. In: CoNLL '03 (2003) 204-207
[16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: ACL '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384-394
[17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CoNLL (2009) 147-155
[18] Liang, P.: Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology (2005)
[19] Lin, D., Wu, X.: Phrase clustering for discriminative learning. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 1030-1038

3 0.65053642 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

Author: Guido F. Montufar, Johannes Rauh, Nihat Ay

Abstract: We present explicit classes of probability distributions that can be learned by Restricted Boltzmann Machines (RBMs) depending on the number of units that they contain, and which are representative of the expressive power of the model. We use this to show that the maximal Kullback-Leibler divergence to the RBM model with n visible and m hidden units is bounded from above by (n−1)−log(m+1). In this way we can specify the number of hidden units that guarantees a sufficiently rich model containing different classes of distributions and respecting a given error tolerance.

4 0.56615704 75 nips-2011-Dynamical segmentation of single trials from population neural data

Author: Biljana Petreska, Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, Maneesh Sahani

Abstract: Simultaneous recordings of many neurons embedded within a recurrently-connected cortical network may provide concurrent views into the dynamical processes of that network, and thus its computational function. In principle, these dynamics might be identified by purely unsupervised, statistical means. Here, we show that a Hidden Switching Linear Dynamical Systems (HSLDS) model, in which multiple linear dynamical laws approximate a nonlinear and potentially non-stationary dynamical process, is able to distinguish different dynamical regimes within single-trial motor cortical activity associated with the preparation and initiation of hand movements. The regimes are identified without reference to behavioural or experimental epochs, but nonetheless transitions between them correlate strongly with external events whose timing may vary from trial to trial. The HSLDS model also performs better than recent comparable models in predicting the firing rate of an isolated neuron based on the firing rates of others, suggesting that it captures more of the "shared variance" of the data. Thus, the method is able to trace the dynamical processes underlying the coordinated evolution of network activity in a way that appears to reflect its computational role.

5 0.54562318 249 nips-2011-Sequence learning with hidden units in spiking neural networks

Author: Johanni Brea, Walter Senn, Jean-pascal Pfister

Abstract: We consider a statistical framework in which recurrent networks of spiking neurons learn to generate spatio-temporal spike patterns. Given biologically realistic stochastic neuronal dynamics, we derive a tractable learning rule for the synaptic weights towards hidden and visible neurons that leads to optimal recall of the training sequences. We show that learning synaptic weights towards hidden neurons significantly improves the storing capacity of the network. Furthermore, we derive an approximate online learning rule and show that our learning rule is consistent with Spike-Timing Dependent Plasticity, in that if a presynaptic spike shortly precedes a postsynaptic spike, potentiation is induced, and otherwise depression is elicited.

6 0.54495317 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

7 0.54436117 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

8 0.5427317 86 nips-2011-Empirical models of spiking in neural populations

9 0.54146004 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

10 0.54023391 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

11 0.54019952 258 nips-2011-Sparse Bayesian Multi-Task Learning

12 0.53931189 301 nips-2011-Variational Gaussian Process Dynamical Systems

13 0.53779 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

14 0.5374884 219 nips-2011-Predicting response time and error rates in visual search

15 0.53671455 229 nips-2011-Query-Aware MCMC

16 0.53571719 158 nips-2011-Learning unbelievable probabilities

17 0.53532428 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

18 0.53519744 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

19 0.53393286 102 nips-2011-Generalised Coupled Tensor Factorisation

20 0.53366059 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation