nips nips2009 nips2009-66 knowledge-graph by maker-knowledge-mining

66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

Source: pdf

Author: Anne Hsu, Thomas L. Griffiths

Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. [sent-4, score-0.712]

2 The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. [sent-8, score-1.437]

3 The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. [sent-9, score-0.55]

4 With other verbs, alternations are restricted, and they are grammatical in only one form. [sent-13, score-0.568]

5 For example, “The rabbit disappeared” is grammatical whereas “I disappeared the rabbit” is ungrammatical. [sent-14, score-0.516]

6 A central part of the debate arises from the fact that a child mostly learns language only by hearing adults speak grammatical sentences, known as positive evidence. [sent-16, score-0.771]

7 Children are believed to learn language mostly from positive evidence because research has found that children rarely receive indications from parents that a sentence is not grammatical, and they ignore these indications when they do receive them. [sent-17, score-0.782]

8 An explicit indication that a sentence is not grammatical is known as negative evidence [5, 6, 7]. [sent-18, score-1.002]

9 One perspective is that language is acquired by learning rules for identifying grammatically acceptable and unacceptable sentences in a way that is robust to the actual distribution of observed sentences. [sent-23, score-0.56]

10 In particular, linguistic exceptions, such as the restrictions on verb alternations mentioned above, are cited as being impossible to learn empirically. [sent-25, score-0.639]

11 In addition to these general theoretical results, statistical learning models have been shown to be capable of learning exceptions in language from positive examples only in a variety of domains, including verb alternations [14, 15, 16, 17, 18, 19]. [sent-29, score-0.705]

12 In the former approach, the goal is to learn to identify grammatical sentences without making assumptions about the distribution from which they are drawn. [sent-32, score-0.871]

13 We show that without negative evidence, the generative model will judge a verb structure that is absent in the input to be ungrammatical, while the discriminative model will judge it to be grammatical. [sent-44, score-0.717]

14 We then conduct an experiment designed to encourage human participants to adopt either a generative or discriminative language learning perspective. [sent-45, score-0.605]

15 The experimental results indicate that human learners behave in accordance with model predictions: absent verb structures are rejected as ungrammatical under a generative learning perspective and accepted as grammatical under a discriminative one. [sent-46, score-1.473]

16 Finally, because our generative learning condition is much more similar to actual child language learning, our results lend weight to the argument that children can learn language empirically from positive input. [sent-49, score-0.662]

17 2 Models of language learning: Generative and discriminative Generative approaches seek to infer the probability distribution over sentences that characterizes the language, while discriminative models seek to identify a function that indicates whether a sentence is grammatical. [sent-50, score-1.223]

18 Here, we compare a simple generative and discriminative model’s predictions of how implicit negative evidence is used to learn verb alternations. [sent-54, score-0.876]

19 1 Generative model: Hierarchical Bayes In the generative model, the problem of learning verb alternations is formulated as follows. [sent-56, score-0.598]

20 The n_i sentences containing verb i can be summarized in a k-dimensional vector y^i containing the verb occurrence frequency in each of the k sentence structures. [sent-62, score-1.604]

21 For example, if we had three possible sentence structure types and verb i occurred in the first type two times, the second type four times and the third type zero times, y^i would be [2, 4, 0] and n_i would be 6. [sent-63, score-0.857]
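
The count summary described above can be sketched in a few lines of Python. This is only an illustration: the function name `count_vector` and the 0-based structure indices are assumptions made here, not notation from the paper.

```python
from collections import Counter


def count_vector(observed_structures, k):
    """Summarize one verb's sentences as a k-dimensional count vector.

    observed_structures holds one structure index (0..k-1) per observed
    sentence containing the verb (hypothetical input format).
    """
    counts = Counter(observed_structures)
    return [counts.get(j, 0) for j in range(k)]


# A verb seen twice in the first structure and four times in the second.
y_i = count_vector([0, 0, 1, 1, 1, 1], k=3)
n_i = sum(y_i)
print(y_i, n_i)  # [2, 4, 0] 6
```

The total n_i is simply the sum of the counts, so only the vector needs to be stored.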

22 the observed frequency of different grammatical sentence structures for verb i) given n_i occurrences of that verb, as summarized above. [sent-67, score-1.374]

23 θ^i captures the distribution over sentence structures associated with verb i, assuming that sentences are generated independently and structure k is generated with probability θ^i_k. [sent-68, score-1.216]

24 The hyperparameters α and β represent generalizations about the kinds of sentence structures that typically occur. [sent-69, score-0.445]

25 More precisely, β represents the distribution of sentence structures across all verbs, with β_k being the mean probability of sentence structure k, while α represents the extent to which verbs tend to appear in only one sentence structure type. [sent-70, score-1.518]
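
To make the roles of α and β concrete, here is a minimal sketch under an assumed parameterization: θ^i is drawn from a Dirichlet distribution with parameter vector αβ, and the counts for verb i are multinomial given θ^i. The extracted sentences do not spell out this form, and the values of `alpha` and `beta` below are hand-picked for illustration (the paper infers them by MCMC). With these assumptions the posterior mean of θ^i has a closed form, which shows how a structure that is never observed for a verb ends up with low probability.

```python
def posterior_mean_theta(y, alpha, beta):
    """Posterior mean of theta^i under the assumed Dirichlet(alpha * beta) prior.

    y     : count vector for one verb (one count per sentence structure)
    alpha : concentration (smaller values favour one dominant structure)
    beta  : mean structure probabilities across verbs, summing to 1
    """
    n = sum(y)
    return [(y_k + alpha * b_k) / (n + alpha) for y_k, b_k in zip(y, beta)]


# Hypothetical hyperparameters, chosen by hand for this sketch.
alpha = 1.0
beta = [0.5, 0.4, 0.1]

# A verb observed 18 times in S1 and never in S2 or S3.
theta_v4 = posterior_mean_theta([18, 0, 0], alpha, beta)
print(theta_v4)
```

Even though S2 is common across verbs in this toy prior, 18 observations without a single S2 sentence pull the estimated probability of S2 for this verb close to zero.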

26 In this model, the number of verbs and the number of possible sentence structures are both ﬁxed. [sent-71, score-0.644]

27 In the context of language learning, the observations are sentences and the classiﬁcation problem is deciding whether each sentence is grammatical. [sent-80, score-0.931]

28 xn , but now each sentence xj is associated with a variable cj indicating whether the sentence is grammatical (cj = +1) or ungrammatical (cj = −1). [sent-84, score-1.531]

29 Each sentence is associated with a feature vector f (xj ) that uses dummy variables to encode the verb, the sentence structure, and the interaction of the two (ie. [sent-85, score-0.822]

30 With m verbs and k sentence structures, this results in m verb features, k sentence structure features, and mk interaction features, each of which takes the value 1 when it matches the sentence and 0 when it does not. [sent-87, score-1.857]

31 For example, a sentence containing the second of four verbs in the ﬁrst of three sentence structures would be encoded with the binary feature vector 0100100000100000000. [sent-88, score-1.069]
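
The dummy-variable encoding can be sketched as follows. The function name `encode` and the ordering of the interaction features (grouped by verb, then by structure) are assumptions; that ordering is used here because it reproduces the 19-bit example given above.

```python
def encode(verb, structure, m=4, k=3):
    """Binary feature string for a sentence, using 1-based verb/structure indices.

    Layout: m verb bits, then k structure bits, then m*k interaction bits.
    The interaction bits are ordered verb-major (an assumed convention).
    """
    verb_bits = [0] * m
    structure_bits = [0] * k
    interaction_bits = [0] * (m * k)

    verb_bits[verb - 1] = 1
    structure_bits[structure - 1] = 1
    interaction_bits[(verb - 1) * k + (structure - 1)] = 1

    return "".join(str(b) for b in verb_bits + structure_bits + interaction_bits)


# Second of four verbs in the first of three sentence structures.
print(encode(2, 1))  # 0100100000100000000
```

Every sentence therefore has exactly three active features: one for its verb, one for its structure, and one for the combination.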

32 3 Testing the models on an artiﬁcial language To examine the predictions that these two models make about the use of implicit negative evidence in learning verb alternations, we applied them to a simple artiﬁcial language based on that used in [20]. [sent-95, score-0.952]

33 This language has four transitive verbs and three possible sentence structures. [sent-96, score-0.81]

34 Three of the verbs only appear in one sentence structure (non-alternating), while one verb appears in two possible sentence structures (alternating). [sent-97, score-1.48]

35 The language consisted of three-word sentences, each containing a subject (N1), object (N2) and verb (V), with the order depending on the particular sentence structure. [sent-98, score-1.015]

36 2 Syntax and grammar In our language of three-word sentences, a verb could appear in 3 different positions (as the 1st, 2nd or 3rd word). [sent-107, score-0.608]

37 In our experiment, the mapping from sentence structure to word order was randomized among participants. [sent-110, score-0.437]

38 There was always one sentence structure, which we denote S3, that was never grammatical for any of the verbs. [sent-112, score-0.892]

39 We designed our language to have 1 alternating verb and 3 non-alternating verbs. [sent-114, score-0.602]

40 One of the three non-alternating verbs was only grammatical in S1. [sent-115, score-0.68]

41 The other two non-alternating verbs were only grammatical in S2. [sent-116, score-0.68]

42 If semz was non-alternating, and only allowed in S2, nagid tombat semz would be grammatical and nagid tombat semz would be ungrammatical. [sent-119, score-0.965]

43 3 Modeling results The generative hierarchical Bayesian model and the discriminative logistic regression model outlined in the previous section were applied to a corpus of sentences generated from this language. [sent-123, score-0.652]

44 + and - indicate grammatical and ungrammatical respectively, while ? [sent-126, score-0.654]

45 The number in parentheses is the frequency with which each sentence was presented to model and human learners in our experiment. [sent-128, score-0.539]

46 Verb V4 was never shown in sentence structure S2. [sent-129, score-0.437]

47 Grammaticality predictions for sentences containing this verb were used to explore the interpretation of implicit negative evidence. [sent-130, score-0.904]

48 In parentheses next to the verb index in the title of each plot is the sentence structure(s) that were shown to be grammatical for that verb in the training corpus. [sent-136, score-1.69]

49 The frequencies of each verb and sentence structure combination are also shown in Table 1. [sent-137, score-0.836]

50 We were particularly interested in the predictions that the two models made about the grammaticality of verb V4 in sentence structure S2, since this combination of verb and sentence structure never occurs in the data. [sent-138, score-1.893]

51 As a consequence, a generative learner receives implicit negative evidence that S2 is not grammatical for V4, while a discriminative learner receives no information. [sent-139, score-0.93]

52 We trained the HBM on the grammatical instances of the sentences, using 10,000 iterations of MCMC. [sent-140, score-0.481]

53 The results indicate that V1 is expected to occur in both S1 and S2 50% of the time, while all other verbs are expected to occur 100% of the time in the one sentence structure for which they are grammatical, accurately reﬂecting the distribution in our language input. [sent-141, score-0.81]

54 Predictions for grammaticality are extracted from the HBM model as follows: The ith verb is grammatical in sentence structure k if the probability of sentence structure k, θ^i_k, is greater than or equal to ε and ungrammatical otherwise, where ε is a small number. [sent-142, score-2.11]
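
The thresholding rule can be illustrated with a short sketch. The value of `EPSILON` is an assumption (the text only says it is small), and the `theta` values are idealized versions of the reported estimates: 0.5 for V1 in S1 and S2, and 0 or 1 for the other verbs.

```python
EPSILON = 0.01  # assumed value; the source only says "a small number"


def grammaticality(theta, epsilon=EPSILON):
    """Map each verb's structure probabilities to True/False judgments."""
    return {verb: [p >= epsilon for p in probs] for verb, probs in theta.items()}


# Idealized estimates over (S1, S2, S3) for the four verbs.
theta = {
    "V1": [0.5, 0.5, 0.0],
    "V2": [0.0, 1.0, 0.0],
    "V3": [1.0, 0.0, 0.0],
    "V4": [1.0, 0.0, 0.0],
}

judgments = grammaticality(theta)
print(judgments["V4"])  # [True, False, False]
```

Under this rule V4 in S2 is rejected, because the structure never occurred with that verb and its estimated probability falls below the threshold.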

55 5 for V1 in S1 and S2, and either 0 or 1 for other verb and sentence structure combinations, resulting in clear grammaticality predictions. [sent-145, score-1.019]

56 Logistic regression was performed using all sentences in our corpus, both grammatical and ungrammatical. [sent-148, score-0.843]
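
A rough sketch of the discriminative side is given below, using plain batch gradient ascent on a logistic regression with verb, structure, and interaction features. Several things here are assumptions rather than details taken from the paper: the corpus counts are a reconstruction from the stimulus description (a 9/9 split for V1 and all six ungrammatical V4 sentences placed in S3), and the L2 penalty, learning rate, and step count are arbitrary choices.

```python
import math

VERBS = ["V1", "V2", "V3", "V4"]
STRUCTS = ["S1", "S2", "S3"]

# (verb, structure, label, count); label 1 = grammatical, 0 = ungrammatical.
# Reconstructed counts (an assumption); V4 never appears in S2.
CORPUS = [
    ("V1", "S1", 1, 9), ("V1", "S2", 1, 9), ("V1", "S3", 0, 6),
    ("V2", "S1", 0, 3), ("V2", "S2", 1, 18), ("V2", "S3", 0, 3),
    ("V3", "S1", 1, 18), ("V3", "S2", 0, 3), ("V3", "S3", 0, 3),
    ("V4", "S1", 1, 18), ("V4", "S3", 0, 6),
]
N_FEATURES = len(VERBS) + len(STRUCTS) + len(VERBS) * len(STRUCTS)


def features(verb, struct):
    """Dummy coding: verb bits, structure bits, then verb-by-structure bits."""
    v, s = VERBS.index(verb), STRUCTS.index(struct)
    x = [0.0] * N_FEATURES
    x[v] = 1.0
    x[len(VERBS) + s] = 1.0
    x[len(VERBS) + len(STRUCTS) + v * len(STRUCTS) + s] = 1.0
    return x


def predict(w, verb, struct):
    """Probability that the (verb, structure) pair is grammatical."""
    z = sum(wi * xi for wi, xi in zip(w, features(verb, struct)))
    return 1.0 / (1.0 + math.exp(-z))


def train(corpus, steps=2000, lr=0.5, l2=0.01):
    """Batch gradient ascent on the L2-penalized mean log-likelihood."""
    w = [0.0] * N_FEATURES
    total = sum(n for *_, n in corpus)
    for _ in range(steps):
        grad = [0.0] * N_FEATURES
        for verb, struct, label, n in corpus:
            err = label - predict(w, verb, struct)
            for j, xj in enumerate(features(verb, struct)):
                grad[j] += n * err * xj
        w = [wj + lr * (gj / total - l2 * wj) for wj, gj in zip(w, grad)]
    return w


w = train(CORPUS)
print(round(predict(w, "V4", "S2"), 3))  # probability for the unseen pair
```

Because the V4-by-S2 interaction feature is never active during training, its weight stays at zero, and the prediction for the unseen pair is driven by the V4 and S2 main effects alone. Both are learned mostly from grammatical sentences, so the classifier leans toward accepting V4 in S2, in contrast to the generative estimate.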

57 4 Generative and discriminative learning in humans The simulations above illustrate how generative and discriminative approaches to language learning differ in their treatment of implicit negative evidence. [sent-153, score-0.653]

58 In our experiment, participants learned the artiﬁcial language used to generate the model predictions in the previous section by watching computer animated scenes accompanied by spoken and written sentences describing each scene. [sent-156, score-0.804]

59 Participants were also provided with information about whether the sentence was grammatical or ungrammatical. [sent-157, score-0.892]

60 Participants in both conditions were exposed to exactly the same sentences and grammaticality information. [sent-159, score-0.55]

61 2 Stimuli As summarized in Table 1, participants viewed each of the 4 verbs 24 times, 18 grammatical sentences and 6 ungrammatical sentences. [sent-164, score-1.352]

62 The non-alternating verbs were shown 18 times each in their respective grammatical sentence structures and 3 times each in the 2 ungrammatical structures. [sent-166, score-1.298]

63 Presentation of sentences was ordered as follows: Two chains of sentences were constructed, one grammatical and one ungrammatical. [sent-167, score-1.173]

64 The grammatical chain consisted of 72 sentences (18 for each verb) and the ungrammatical chain consisted of 24 sentences (6 for each verb). [sent-168, score-1.424]

65 For each sentence chain, verbs were presented cyclically and randomized within cycles. [sent-169, score-0.61]

66 For the grammatical chain, V1 occurrences of S1 and S2 were cycled through in semi-random order (verbs V2-V4 appeared grammatically in only one sentence construction). [sent-170, score-1.012]

67 While participants were being trained on the language, presentation of one sentence from the ungrammatical chain was randomly interleaved within every three presentations of sentences from the grammatical chain. [sent-172, score-1.586]
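
The presentation schedule described above can be sketched as follows. This is a loose reconstruction, not the authors' procedure: the helper names `build_chain` and `interleave` are hypothetical, the per-verb structure lists are reconstructed from the stimulus description, and a plain shuffle stands in for the "semi-random" ordering of V1's two structures.

```python
import random


def build_chain(items_per_verb, rng):
    """Cycle through the verbs, shuffling the verb order within each cycle."""
    pools = {verb: list(items) for verb, items in items_per_verb.items()}
    for items in pools.values():
        rng.shuffle(items)
    chain = []
    while any(pools.values()):
        cycle = [verb for verb in pools if pools[verb]]
        rng.shuffle(cycle)
        for verb in cycle:
            chain.append((verb, pools[verb].pop()))
    return chain


def interleave(grammatical, ungrammatical, rng, block=3):
    """Insert one ungrammatical item at a random position per block of three."""
    schedule = []
    remaining = iter(ungrammatical)
    for start in range(0, len(grammatical), block):
        chunk = [("+", item) for item in grammatical[start:start + block]]
        extra = next(remaining, None)
        if extra is not None:
            chunk.insert(rng.randrange(len(chunk) + 1), ("-", extra))
        schedule.extend(chunk)
    return schedule


rng = random.Random(0)  # fixed seed so the sketch is reproducible
grammatical_chain = build_chain(
    {"V1": ["S1"] * 9 + ["S2"] * 9, "V2": ["S2"] * 18,
     "V3": ["S1"] * 18, "V4": ["S1"] * 18}, rng)
ungrammatical_chain = build_chain(
    {"V1": ["S3"] * 6, "V2": ["S1"] * 3 + ["S3"] * 3,
     "V3": ["S2"] * 3 + ["S3"] * 3, "V4": ["S3"] * 6}, rng)
schedule = interleave(grammatical_chain, ungrammatical_chain, rng)
print(len(schedule))  # 96
```

With 72 grammatical and 24 ungrammatical sentences, every block of four presentations contains exactly one ungrammatical sentence.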

68 Subject-object noun pairs were randomized for each verb across presentations. [sent-173, score-0.439]

69 During pre-training they heard and saw each word along with pictures of each noun and scenes corresponding to each verb along with spoken audio of each noun/verb. [sent-177, score-0.568]

70 1 Generative learning condition In the generative learning condition, participants were told that they would listen to an adult speaker who always spoke grammatical sentences and a child speaker who always spoke ungrammatically. [sent-184, score-1.369]

71 We hypothesized that participants in this condition would behave similarly to a generative model: they would build a probabilistic representation of the language from the grammatical sentences produced by the adult speaker. [sent-187, score-1.351]

72 2 Discriminative learning condition In the discriminative learning condition, participants were presented with spoken and written sentences describing each scene and asked to choose whether each of the presented sentences was grammatical or not. [sent-190, score-1.545]

73 They were assured that only relevant words were used and they only had to ﬁgure out if the verb occurred in a grammatical location. [sent-191, score-0.88]

74 [Figure 3 panels: a) V1 (S1,S2), b) V2 (S2), c) V3 (S1), d) V4 (S1); y-axis: proportion grammatical; x-axis: S1, S2, S3; legend: generative, discriminative] [sent-196, score-0.723]

75 Figure 3: Human grammar judgments, showing proportion grammatical for each sentence structure. [sent-200, score-0.927]

76 ” The main difference from the generative condition is that in the discriminative condition, the presented sentences are assumed to be chosen at random, whereas in the generative learning condition, sentences from the adult speaker are assumed to have been sampled from the language distribution. [sent-202, score-1.335]

77 We hypothesized that participants in the discriminative condition would behave similarly to a discriminative model: they would use feedback about both grammatical and ungrammatical sentences to formulate rules about what made sentences grammatical. [sent-203, score-1.841]

78 In this testing phase, participants were shown a series of written sentences and asked to rate the sentence as either grammatical or ungrammatical. [sent-207, score-1.406]

79 Here, all sentences had blergen as the subject and nagid as the object. [sent-208, score-0.454]

80 Participants also underwent a production test in which they were shown a scene and asked to type in a sentence describing that scene. [sent-211, score-0.477]

81 Most notably, the generative learners overwhelmingly judged verb V4 to be ungrammatical in S2, while the majority of discriminative learners deemed V4 to be grammatical in S2 (see Figure 3d). [sent-219, score-1.501]

82 Another difference we found between the two conditions was that discriminative learners were more willing to consider verbs to be alternating (i. [sent-225, score-0.471]

83 allow those verbs to be grammatical in two sentence structures. [sent-227, score-1.091]

84 ) This is evidenced by the fact that participants in the generative condition rated occurrences of V1 (the alternating verb) in S1 and S2 as grammatical only 68% and 72% of the time. [sent-228, score-0.858]

85 This is because many participants judged V1 to be grammatical in either S1 or S2 and not both. [sent-229, score-0.656]

86 On the other hand, participants in the discriminative condition rated occurrences of V1 in S1 and S2 grammatical 100% of the time (see Figure 3a). [sent-230, score-0.847]

87 From post-experiment questioning, we learned that many participants in the generative condition did not think verbs would occur in two possible sentence [text interrupted by Figure 4; panels: b) V2 (S2), c) V3 (S1), d) V4 (S1); y-axis: production; x-axis: S1, S2, S3, other; legend: generative, discriminative] [sent-236, score-1.151]

88 Figure 4: Human production data, showing proportion of productions in each sentence structure. [sent-240, score-0.462]

89 Why the two conditions prompted signiﬁcantly different prior assumptions about the prevalence of verb alternations will be a question for future research, but is particularly interesting in the context of the HBM, which can learn a prior expressing similar constraints. [sent-243, score-0.57]

90 Production test results showed that participants tended to use verbs in the sentence structures in which they heard them (see Figure 4). [sent-244, score-0.776]

91 Notably, even though the majority of the learners in the discriminative condition rated verb V4 in S2 as grammatical, only 20% of the productions of V4 were in S2. [sent-245, score-0.695]

92 This is in line with previous results that show that how often a sentence structure is produced is proportional to how often that structure is heard, and rarely heard structures are rarely produced, even if they are believed to be grammatical [20]. [sent-246, score-1.03]

93 5 Discussion We have shown that artiﬁcial language learners may or may not learn restrictions on verb alternations, depending on the learning context. [sent-247, score-0.719]

94 Our simulations of generative and discriminative learners made predictions about how these approaches deal with implicit negative evidence, and these predictions were borne out in an experiment with human learners. [sent-248, score-0.553]

95 Participants in both experimental conditions viewed exactly the same sentences and were told whether each sentence was grammatical or ungrammatical. [sent-249, score-1.286]

96 In the discriminative condition, participants were given yes/no grammaticality feedback on sentences presumed to be sampled at random. [sent-251, score-0.826]

97 Because of the random sampling assumption, the absence of a verb in a given sentence structure did not provide implicit negative evidence against the grammaticality of that construction. [sent-252, score-1.186]

98 Participants in our generative condition heard sentences spoken by a grammatical speaker, similar to the way children learn by listening to adult speech. [sent-257, score-1.179]

99 In post-experiment questioning, generative learners also stated that they ignored all negative evidence from the ungrammatical child speaker, similar to the way children ignore negative evidence in real language acquisition. [sent-258, score-0.692]

100 Addressing the learnability of verb subcategorizations with Bayesian inference. [sent-334, score-0.453]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('grammatical', 0.481), ('sentence', 0.411), ('verb', 0.399), ('sentences', 0.346), ('verbs', 0.199), ('grammaticality', 0.183), ('language', 0.174), ('ungrammatical', 0.173), ('participants', 0.153), ('discriminative', 0.13), ('generative', 0.112), ('learners', 0.092), ('alternations', 0.087), ('nagid', 0.086), ('linguistic', 0.084), ('tombat', 0.075), ('hbm', 0.065), ('evidence', 0.06), ('implicit', 0.057), ('debate', 0.057), ('cycled', 0.054), ('ern', 0.054), ('semz', 0.054), ('learnability', 0.054), ('heard', 0.052), ('negative', 0.05), ('children', 0.049), ('speaker', 0.046), ('child', 0.045), ('spoken', 0.04), ('noun', 0.04), ('predictions', 0.038), ('cj', 0.038), ('human', 0.036), ('grammar', 0.035), ('adult', 0.035), ('condition', 0.034), ('acquisition', 0.034), ('structures', 0.034), ('logistic', 0.033), ('production', 0.032), ('languages', 0.032), ('exceptions', 0.031), ('learn', 0.03), ('alternating', 0.029), ('occurrences', 0.028), ('told', 0.027), ('transitive', 0.026), ('structure', 0.026), ('analyses', 0.026), ('restrictions', 0.024), ('perspectives', 0.023), ('chain', 0.022), ('judged', 0.022), ('blergen', 0.022), ('grammatically', 0.022), ('indications', 0.022), ('innate', 0.022), ('opened', 0.022), ('poverty', 0.022), ('spoke', 0.022), ('ni', 0.021), ('conditions', 0.021), ('rated', 0.021), ('learner', 0.02), ('nouns', 0.02), ('pictures', 0.02), ('productions', 0.019), ('animated', 0.019), ('disappeared', 0.019), ('perfors', 0.019), ('prompted', 0.019), ('questioning', 0.019), ('underwent', 0.019), ('cognitive', 0.019), ('rules', 0.018), ('consisted', 0.017), ('accompanied', 0.017), ('scenes', 0.017), ('xj', 0.017), ('seek', 0.016), ('appeared', 0.016), ('regression', 0.016), ('cartoon', 0.016), ('grammars', 0.016), ('rabbit', 0.016), ('behave', 0.016), ('asserts', 0.015), ('opposing', 0.015), ('argument', 0.015), ('asked', 0.015), ('impossible', 0.015), ('hierarchical', 0.015), ('door', 0.015), ('alternation', 0.015), ('lend', 0.015), 
('assumptions', 0.014), ('feedback', 0.014), ('positive', 0.014), ('containing', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

Author: Anne Hsu, Thomas L. Griffiths

2 0.23215388 259 nips-2009-Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation

Author: Jie Luo, Barbara Caputo, Vittorio Ferrari

Abstract: Given a corpus of news items consisting of images accompanied by text captions, we want to find out “who’s doing what”, i.e. associate names and action verbs in the captions to the face and body pose of the persons in the images. We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. These models can then be used to recognize people and actions in novel images without captions. We demonstrate experimentally that our joint ‘face and pose’ model solves the correspondence problem better than earlier models covering only the face, and that it can perform recognition of new uncaptioned images.

3 0.117037 196 nips-2009-Quantification and the language of thought

Author: Charles Kemp

Abstract: Many researchers have suggested that the psychological complexity of a concept is related to the length of its representation in a language of thought. As yet, however, there are few concrete proposals about the nature of this language. This paper makes one such proposal: the language of thought allows ﬁrst order quantiﬁcation (quantiﬁcation over objects) more readily than second-order quantiﬁcation (quantiﬁcation over features). To support this proposal we present behavioral results from a concept learning study inspired by the work of Shepard, Hovland and Jenkins. Humans can learn and think about many kinds of concepts, including natural kinds such as elephant and water and nominal kinds such as grandmother and prime number. Understanding the mental representations that support these abilities is a central challenge for cognitive science. This paper proposes that quantiﬁcation plays a role in conceptual representation—for example, an animal X qualiﬁes as a predator if there is some animal Y such that X hunts Y . The concepts we consider are much simpler than real-world examples such as predator, but even simple laboratory studies can provide important clues about the nature of mental representation. Our approach to mental representation is based on the language of thought hypothesis [1]. As pursued here, the hypothesis proposes that mental representations are constructed in a compositional language of some kind, and that the psychological complexity of a concept is closely related to the length of its representation in this language [2, 3, 4]. Following previous researchers [2, 4], we operationalize the psychological complexity of a concept in terms of the ease with which it is learned and remembered. Given these working assumptions, the remaining challenge is to specify the representational resources provided by the language of thought. 
Some previous studies have relied on propositional logic as a representation language [2, 5], but we believe that the resources of predicate logic are needed to capture the structure of many human concepts. In particular, we suggest that the language of thought can accommodate relations, functions, and quantiﬁcation, and focus here on the role of quantiﬁcation. Our primary proposal is that quantiﬁcation is supported by the language of thought, but that quantiﬁcation over objects is psychologically more natural than quantiﬁcation over features. To test this idea we compare concept learning in two domains which are very similar except for one critical difference: one domain allows quantiﬁcation over objects, and the other allows quantiﬁcation over features. We consider several logical languages that can be used to formulate concepts in both domains, and ﬁnd that learning times are best predicted by a language that supports quantiﬁcation over objects but not features. Our work illustrates how theories of mental representation can be informed by comparing concept learning across two or more domains. Existing studies work with a range of domains, and it is useful to consider a “conceptual universe” that includes these possibilities along with many others that have not yet been studied. Table 1 charts a small fragment of this universe, and the penultimate column shows example stimuli that will be familiar from previous studies of concept learning. Previous studies have made important contributions by choosing a single domain in Table 1 and explaining 1 why some concepts within this domain are easier to learn than others [2, 4, 6, 7, 8, 9]. Comparisons across domains can also provide important information about learning and mental representation, and we illustrate this claim by comparing learning times across Domains 3 and 4. The next section introduces the conceptual universe in Table 1 in more detail. 
We then present a formal approach to concept learning that relies on a logical language and compare three candidate languages. Language OQ (for object quantiﬁcation) supports quantiﬁcation over objects but not features, language F Q (for feature quantiﬁcation) supports quantiﬁcation over features but not objects, and language OQ + F Q supports quantiﬁcation over both objects and features. We use these languages to predict learning times across Domains 3 and 4, and present an experiment which suggests that language OQ comes closest to the language of thought. 1 The conceptual universe Table 1 provides an organizing framework for thinking about the many domains in which learning can occur. The table includes 8 domains, each of which is deﬁned by specifying some number of objects, features, and relations, and by specifying the range of each feature and each relation. We refer to the elements in each domain as items, and the penultimate column of Table 1 shows items from each domain. The ﬁrst row shows a domain commonly used by studies of Boolean concept learning. Each item in this domain includes a single object a and speciﬁes whether that object has value v1 (small) or v2 (large) on feature F (size), value v3 (white) or v4 (gray) on feature G (color), and value v5 (vertical) or v6 (horizontal) on feature H (texture). Domain 2 also includes three features, but now each item includes three objects and each feature applies to only one of the objects. For example, feature H (texture) applies to only the third object in the domain (i.e. the third square on each card). Domain 3 is similar to Domain 1, but now the three features can be aligned— for any given item each feature will be absent (value 0) or present. The example in Table 1 uses three features (boundary, dots, and slash) that can each be added to an unadorned gray square. 
Domain 4 is similar to Domain 2, but again the feature values can be aligned, and the feature for each object will be absent (value 0) or present. Domains 5 and 6 are similar to domains 2 and 4 respectively, but each one includes relations rather than features. In Domain 6, for example, the relation R assigns value 0 (absent) or value 1 (present) to each undirected pair of objects. The ﬁrst six domains in Table 1 are all variants of Domain 1, which is the domain typically used by studies of Boolean concept learning. Focusing on six related domains helps to establish some of the dimensions along which domains can differ, but the ﬁnal two domains in Table 1 show some of the many alternative possibilities. Domain 7 includes two categorical features, each of which takes three rather than two values. Domain 8 is similar to Domain 6, but now the number of objects is 6 rather than 3 and relation R is directed rather than undirected. To mention just a handful of possibilities which do not appear in Table 1, domains may also have categorical features that are ordered (e.g. a size feature that takes values small, medium, and large), continuous valued features or relations, relations with more than two places, and objects that contain sub-objects or parts. Several learning problems can be formulated within any given domain. The most basic is to learn a single item—for example, a single item from Domain 8 [4]. A second problem is to learn a class of items—for example, a class that includes four of the items in Domain 1 and excludes the remaining four [6]. Learning an item class can be formalized as learning a unary predicate deﬁned over items, and a natural extension is to consider predicates with two or more arguments. For example, problems of the form A is to B as C is to ? can be formulated as problems where the task is to learn a binary relation analogous(·, ·) given the single example analogous(A, B). 
Here, however, we focus on the task of learning item classes or unary predicates. Since we focus on the role of quantification, we will work with domains where quantification is appropriate. Quantification over objects is natural in cases like Domain 4 where the feature values for all objects can be aligned. Note, for example, that the statement “every object has its feature” picks out the final example item in Domain 4 but that no such statement is possible in Domain 2. Quantification over features is natural in cases like Domain 3 where the ranges of each feature can be aligned. For example, “object a has all three features” picks out the final example item in Domain 3 but no such statement is possible in Domain 1. We therefore focus on Domains 3 and 4, and explore the problem of learning item classes in each domain.

# | Objects O | Features | Relations
1 | {a} | F : O → {v1, v2}, G : O → {v3, v4}, H : O → {v5, v6} | none
2 | {a, b, c} | F : a → {v1, v2}, G : b → {v3, v4}, H : c → {v5, v6} | none
3 | {a} | F : O → {0, v1}, G : O → {0, v2}, H : O → {0, v3} | none
4 | {a, b, c} | F : a → {0, v1}, G : b → {0, v2}, H : c → {0, v3} | none
5 | {a, b, c} | none | R : (a, b) → {v1, v2}, S : (a, c) → {v3, v4}, T : (b, c) → {v5, v6}
6 | {a, b, c} | none | R : O × O → {0, 1}
7 | {a} | F : O → {v1, v2, v3}, G : O → {v4, v5, v6} | none
8 | {a, b, c, d, e, f} | none | R : O × O → {0, 1}
[Example Items column: images not recoverable. Ref. column as extracted, row alignment lost: [4] [8, 9] [13] [6] [12] [6] [2, 6, 7, 10, 11]]

Table 1: The conceptual universe. Eight domains are shown, and each one is defined by a set of objects, a set of features, and a set of relations. We call the members of each domain items, and an item is created by specifying the extension of each feature and relation in the domain. The six domains above the double lines are closely related to the work of Shepard et al. [6]. Each one includes eight items which differ along three dimensions. 
These dimensions, however, emerge from different underlying representations in the six cases.

Figure 1: (a) A stimulus lattice for domains (e.g. Domains 3, 4, and 6) that can be encoded as a triple of binary values where 0 represents “absent” and 1 represents “present.” The lattice vertices are 111, 110, 101, 011, 100, 010, 001 and 000. (b) If the order of the values in the triple is not significant, there are 10 distinct ways to partition the lattice into two classes of four items. The SHJ type for each partition is shown in parentheses: 1 (I), 2 (II), 3 (III), 4 (III), 5 (IV), 6 (IV), 7 (V), 8 (V), 9 (V), 10 (VI).

Domains 3 and 4 both include 8 items each and we will consider classes that include exactly four of these items. Each item in these domains can be represented as a triple of binary values, where 0 indicates that a feature is absent and value 1 indicates that a feature is present. Each triple represents the values of the three features (Domain 3) or the feature values for the three objects (Domain 4). By representing each domain in this way, we have effectively adopted domain specifications that are simplifications of those shown in Table 1. Domain 3 is represented using three features of the form F, G, H : O → {0, 1}, and Domain 4 is represented using a single feature of the form F : O → {0, 1}. Simplifications of this kind are possible because the features in each domain can be aligned; notice that no corresponding simplifications are possible for Domains 1 and 2. The eight binary triples in each domain can be organized into the lattice shown in Figure 1a. Here we consider all ways to partition the vertices of the lattice into two groups of four. If partitions that differ only up to a permutation of the features (Domain 3) or objects (Domain 4) are grouped into equivalence classes, there are ten of these classes, and a representative of each is shown in Figure 1b. 
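The counting argument above can be checked by brute force. The following is a minimal sketch, not the author's code: it enumerates the 35 unordered splits of the eight binary triples into two classes of four and counts equivalence classes under two sets of symmetries. The function names (`partitions_into_fours`, `count_classes`, `make_transform`) are illustrative assumptions.

```python
from itertools import combinations, permutations, product

# The eight items, each a triple of binary values (0 = absent, 1 = present).
ITEMS = list(product((0, 1), repeat=3))


def partitions_into_fours():
    """Return all unordered splits of the 8 items into two classes of four."""
    seen = set()
    for chosen in combinations(ITEMS, 4):
        rest = frozenset(i for i in ITEMS if i not in chosen)
        seen.add(frozenset((frozenset(chosen), rest)))
    return seen


def make_transform(perm, flips=(0, 0, 0)):
    """Permute the positions of a triple, then optionally flip some values."""
    return lambda item: tuple(item[i] ^ f for i, f in zip(perm, flips))


def count_classes(transforms):
    """Count partitions up to the given group of item transformations."""
    canonical = set()
    for part in partitions_into_fours():
        images = [
            tuple(sorted(tuple(sorted(t(item) for item in cls)) for cls in part))
            for t in transforms
        ]
        # The smallest image identifies the whole orbit of this partition.
        canonical.add(min(images))
    return len(canonical)


# Permuting positions only: 0 and 1 keep their meaning (Domains 3 and 4).
PERMS_ONLY = [make_transform(p) for p in permutations(range(3))]
# Permuting positions and swapping the two values of any position (Domain 1).
PERMS_AND_FLIPS = [
    make_transform(p, f)
    for p in permutations(range(3))
    for f in product((0, 1), repeat=3)
]

if __name__ == "__main__":
    print(len(partitions_into_fours()))    # number of raw partitions
    print(count_classes(PERMS_ONLY))       # classes when 0/1 are distinguishable
    print(count_classes(PERMS_AND_FLIPS))  # classes when values may be swapped
```

Under position permutations alone the count is ten, matching Figure 1b; additionally allowing the two values of each position to be swapped yields the six SHJ types.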
Previous researchers [6] have pointed out that the stimuli in Domain 1 can be organized into a cube similar to Figure 1a, and that there are six ways to partition these stimuli into two groups of four up to permutations of the features and permutations of the range of each feature. We refer to these equivalence classes as the six Shepard-Hovland-Jenkins types (or SHJ types), and each partition in Figure 1b is labeled with its corresponding SHJ type label. Note, for example, that partitions 3 and 4 are both examples of SHJ type III. For us, partitions 3 and 4 are distinct since items 000 (all absent) and 111 (all present) are uniquely identiﬁable, and partition 3 assigns these items to different classes but partition 4 does not. Previous researchers have considered differences between some of the ﬁrst six domains in Table 1. Shepard et al. [6] ran experiments using compact stimuli (Domain 1) and distributed stimuli (Domains 2 and 4), and observed the same difﬁculty ranking of the six SHJ types in all cases. Their work, however, does not acknowledge that Domain 4 leads to 10 distinct types rather than 6, and therefore fails to address issues such as the relative complexities of concepts 5 and 6 in Figure 1. Social psychologists [13, 14] have studied Domain 6 and found that learning patterns depart from the standard SHJ order—in particular, that SHJ type VI (Concept 10 in Figure 1) is simpler than types III, IV and V. This ﬁnding has been used to support the claim that social learning relies on a domain-speciﬁc principle of structural balance [14]. We will see, however, that the relative simplicity of type VI in domains like 4 and 6 is consistent with a domain-general account based on representational economy. 2 A representation length approach to concept learning The conceptual universe in Table 1 calls for an account of learning that can apply across many domains. 
One candidate is the representation length approach, which proposes that concepts are mentally represented in a language of thought, and that the subjective complexity of a concept is determined by the length of its representation in this language [4]. We consider the case where a concept corresponds to a class of items, and explore the idea that these concepts are mentally represented in a logical language. More formally, a concept is represented as a logical sentence, and the concept includes all models of this sentence, or all items that make the sentence true. The predictions of this representation length approach depend critically on the language chosen. Here we consider three languages: an object quantification language OQ that supports quantification over objects, a feature quantification language FQ that supports quantification over features, and a language OQ + FQ that supports quantification over both objects and features. Language OQ is based on a standard logical language known as predicate logic with equality. The language includes symbols representing objects (e.g. a and b), and features (e.g. F and G) and these symbols can be combined to create literals that indicate that an object does (Fa) or does not have a certain feature (Fa′). Literals can be combined using two connectives: AND (Fa Ga) and OR (Fa + Ga). The language includes two quantifiers, for all (∀) and there exists (∃), and allows quantification over objects (e.g. ∀x Fx, where x is a variable that ranges over all objects in the domain). Finally, language OQ includes equality and inequality relations (= and ≠) which can be used to compare objects and object variables (e.g. =xa or ≠xy). Table 2 shows several sentences formulated in language OQ. Suppose that the OQ complexity of each sentence is defined as the number of basic propositions it contains, where a basic proposition can be a positive or negative literal (Fa or Fa′) or an equality or inequality statement (=xa or ≠xy). 
Equivalently, the complexity of a sentence is the total number of ANDs plus the total number of ORs plus one. This measure is equivalent by design to Feldman’s [2] notion of Boolean complexity when applied to a sentence without quantification. The complexity values in Table 2 show minimal complexity values for each concept in Domains 3 and 4. Table 2 also shows a single sentence that achieves each of these complexity values, although some concepts admit multiple sentences of minimal complexity. The complexity values in Table 2 were computed using an “enumerate then combine” approach. We began by enumerating a set of sentences according to criteria described in the next paragraph. Each sentence has an extension that specifies which items in the domain are consistent with the sentence. Given the extensions of all sentences generated during the enumeration phase, the combination phase considered all possible ways to combine these extensions using conjunctions or disjunctions. The procedure terminated once extensions corresponding to all of the concepts in the domain had been found. Although the number of possible sentences grows rapidly as the complexity of these sentences increases, the number of extensions is fixed and relatively small (2^8 = 256 for domains of size 8). The combination phase is tractable since sentences with the same extension can be grouped into a single equivalence class. The enumeration phase considered all formulae which had at most two quantifiers and which had a complexity value lower than four. For example, this phase did not include the formula ∃x ∃y ∃z ≠yz Fx′ Fy Fz (too many quantifiers) or the formula ∀x ∃y ≠xy Fy (Fx + Gx + Hx) (complexity too high). Despite these restrictions, we believe that the complexity values in Table 2 are identical to the values that would be obtained if we had considered all possible sentences. Language FQ is similar to OQ but allows quantification over features rather than objects. 
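The combination phase can be illustrated with a small sketch. This is not the author's implementation: it is a simplified version that assumes a quantifier-free language (so the enumeration phase yields only the six literals of Domain 3) and therefore computes minimal Boolean complexity rather than full OQ complexity. Extensions are stored as 8-bit masks, and the names `extension` and `minimal_complexities` are illustrative.

```python
from itertools import product

# Items of Domain 3: the (F, G, H) values of the single object a.
ITEMS = list(product((0, 1), repeat=3))
FULL = (1 << len(ITEMS)) - 1  # bitmask with every item present


def extension(pred):
    """Bitmask of the items for which pred(item) is true."""
    return sum(1 << i for i, item in enumerate(ITEMS) if pred(item))


def minimal_complexities():
    """Map each of the 2^8 extensions to its minimal number of literals."""
    # Enumeration phase (simplified): positive and negative literals only.
    literals = set()
    for k in range(3):
        positive = extension(lambda item, k=k: item[k] == 1)
        literals.update((positive, FULL & ~positive))
    best = {ext: 1 for ext in literals}
    by_size = {1: set(literals)}
    n = 1
    # Combination phase: join smaller extensions with AND (&) and OR (|).
    while len(best) < 2 ** len(ITEMS):
        n += 1
        found = set()
        for a in range(1, n // 2 + 1):
            for x in by_size[a]:
                for y in by_size[n - a]:
                    for ext in (x & y, x | y):
                        if ext not in best:
                            found.add(ext)
        for ext in found:
            best[ext] = n
        by_size[n] = found
    return best


BEST = minimal_complexities()

if __name__ == "__main__":
    # "Object a has 2 or more features" and "object a has 1 or 3 features".
    print(BEST[extension(lambda item: sum(item) >= 2)])
    print(BEST[extension(lambda item: sum(item) % 2 == 1)])
```

Grouping sentences by extension is what keeps the search small: each level only has to combine at most 256 distinct extensions, however many sentences share them.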
For example, FQ includes the statement ∀Q Qa, where Q is a variable that ranges over all features in the domain. Language FQ also allows features and feature variables to be compared for equality or inequality (e.g. =QF or ≠QR). Since FQ and OQ are closely related, it follows that the FQ complexity values for Domains 3 and 4 are identical to the OQ complexity values for Domains 4 and 3. For example, FQ can express concept 5 in Domain 3 as ∀Q ∃R ≠QR Ra. We can combine OQ and FQ to create a language OQ + FQ that allows quantification over both objects and features. Allowing both kinds of quantification leads to identical complexity values for Domains 3 and 4. Language OQ + FQ can express each of the formulae for Domain 4 in Table 2, and these formulae can be converted into corresponding formulae for Domain 3 by translating each instance of object quantification into an instance of feature quantification. Logicians distinguish between first-order logic, which allows quantification over objects but not predicates, and second-order logic, which allows quantification over objects and predicates. The difference between languages OQ and OQ + FQ is superficially similar to the difference between first-order and second-order logic, but does not cut to the heart of this matter. Since language OQ + FQ only supports quantification over a pre-specified set of features, it is equivalent to a typed first order logic that includes types for objects and features [15]. Future studies, however, can explore the cognitive relevance of higher-order logic as developed by logicians.

# | Domain 3 formula | C | Domain 4 formula | C
1 | Ga | 1 | Fb | 1
2 | Fa Ha + Fa′ Ha′ | 4 | Fa Fc + Fa′ Fc′ | 4
3 | Fa′ Ga + Fa Ha | 4 | Fa′ Fb + Fa Fc | 4
4 | Fa′ Ga′ + Fa Ha | 4 | Fa′ Fb′ + Fa Fc | 4
5 | Ga (Fa + Ha) + Fa Ha | 5 | ∀x ∃y ≠xy Fy | 2
6 | Ga (Fa + Ha) + Fa Ha | 5 | (∀x Fx) + Fb ∃y Fy | 3
7 | Ga (Fa + Ha) + Fa Ga Ha | 6 | (∀x Fx) + Fb (Fa + Fc) | 4
8 | Ha (Fa + Ga) + Fa Ga Ha | 6 | Fc (Fa + Fb) + Fa Fb Fc | 6
9 | Fa (Ga + Ha) + Fa Ga Ha | 6 | (∀x Fx) + Fa (Fb + Fc) | 4
10 | Ga′ (Fa Ha′ + Fa′ Ha) + Ga (Fa′ Ha′ + Fa Ha) | 10 | (∀x Fx) + ∃y ∀z Fy (=zy + Fz′) | 4
(The negation marks (′) in rows 6 to 9 could not be recovered from the source, so those formulae are shown without them.)

Table 2: Complexity values C and corresponding formulae for language OQ. Boolean complexity predicts complexity values for both domains that are identical to the OQ complexity values shown here for Domain 3. Language FQ predicts complexity values for Domains 3 and 4 that are identical to the OQ values for Domains 4 and 3 respectively. Language OQ + FQ predicts complexity values for both domains that are identical to the OQ complexity values for Domain 4.

3 Experiment

Now that we have introduced languages OQ, FQ and OQ + FQ our theoretical proposals can be sharply formulated. We suggest that quantification over objects plays an important role in mental representations, and predict that OQ complexity will account better for human learning than Boolean complexity. We also propose that quantification over objects is more natural than quantification over features, and predict that OQ complexity will account better for human learning than both FQ complexity and OQ + FQ complexity. We tested these predictions by designing an experiment where participants learned concepts from Domains 3 and 4. Method. 20 adults participated for course credit. Each participant was assigned to Domain 3 or Domain 4 and learned all ten concepts from that domain. The items used for each domain were the cards shown in Table 1. Note, for example, that each Domain 3 card showed one square, and that each Domain 4 card showed three squares. These items are based on stimuli developed by Sakamoto and Love [12]. The experiment was carried out using a custom built graphical interface. For each learning problem in each domain, all eight items were simultaneously presented on the screen, and participants were able to drag them around and organize them however they liked. 
Each problem had three phases. During the learning phase, the four items belonging to the current concept had red boundaries, and the remaining four items had blue boundaries. During the memory phase, these colored boundaries were removed, and participants were asked to sort the items into the red group and the blue group. If they made an error they returned to the learning phase, and could retake the test whenever they were ready. During the description phase, participants were asked to provide a written description of the two groups of cards. The color assignments (red or blue) were randomized across participants; in other words, the “red groups” learned by some participants were identical to the “blue groups” learned by others. The order in which participants learned the 10 concepts was also randomized. Model predictions. The OQ complexity values for the ten concepts in each domain are shown in Table 2 and plotted in Figure 2a. The complexity values in Figure 2a have been normalized so that they sum to one within each domain, and the differences of these normalized scores are shown in the final row of Figure 2a. The two largest bars in the difference plot indicate that Concepts 10 and 5 are predicted to be easier to learn in Domain 4 than in Domain 3. Language OQ can express statements like “either 1 or 3 objects have F ” (Concept 10 in Domain 4), or “2 or more objects have F ” (Concept 5 in Domain 4). Since quantification over features is not permitted, however, analogous statements (e.g. “object a has either 1 or 3 features”) cannot be formulated in Domain 3.

Figure 2: Normalized OQ complexity values and normalized learning times for the 10 concepts in Domains 3 and 4. Panels: (a) OQ complexity, (b) learning time; rows: Domain 3, Domain 4, and difference.

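The normalization and difference computation described for Figure 2a can be sketched as follows. The complexity values are those read from Table 2 for concepts 1 to 10; the sign convention (Domain 3 minus Domain 4) and the helper names are assumptions made for illustration.

```python
# OQ complexity values for concepts 1..10, as read from Table 2.
OQ_DOMAIN_3 = [1, 4, 4, 4, 5, 5, 6, 6, 6, 10]
OQ_DOMAIN_4 = [1, 4, 4, 4, 2, 3, 4, 6, 4, 4]


def normalize(values):
    """Rescale values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def differences(first, second):
    """Per-concept difference of normalized scores (first minus second)."""
    return [a - b for a, b in zip(normalize(first), normalize(second))]


# Assumed convention: positive means relatively harder in Domain 3.
DIFF = differences(OQ_DOMAIN_3, OQ_DOMAIN_4)

# Concepts with the two largest positive differences (1-indexed).
TOP_TWO = sorted(range(1, 11), key=lambda c: DIFF[c - 1], reverse=True)[:2]

if __name__ == "__main__":
    for concept, d in enumerate(DIFF, start=1):
        print(concept, round(d, 3))
    print(TOP_TWO)
```

Under this convention, the two largest positive differences fall on Concepts 10 and 5, the concepts the text singles out as predicted to be easier in Domain 4.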
Concept 10 corresponds to SHJ type VI, which often emerges as the most difficult concept in studies of Boolean concept learning. Our model therefore predicts that the standard ordering of the SHJ types will not apply in Domain 4. Our model also predicts that concepts assigned to the same SHJ type will have different complexities. In Domain 4 the model predicts that Concept 6 will be harder to learn than Concept 5 (both are examples of SHJ type IV), and that Concept 8 will be harder to learn than Concepts 7 or 9 (all three are examples of SHJ type V). Results. The computer interface recorded the amount of time participants spent on the learning phase for each concept. Domain 3 was a little more difficult than Domain 4 overall: on average, Domain 3 participants took 557 seconds and Domain 4 participants took 467 seconds to learn the 10 concepts. For all remaining analyses, we consider learning times that are normalized to sum to 1 for each participant. Figure 2b shows the mean values for these normalized times, and indicates the relative difficulties of the concepts within each condition. The difference plot in Figure 2b supports the two main predictions identified previously. Concepts 10 and 5 are the cases that differ most across the domains, and both concepts are easier to learn in Domain 4 than Domain 3. As predicted, Concept 5 is substantially easier than Concept 6 in Domain 4 even though both correspond to the same SHJ type. Concepts 7 through 9 also correspond to the same SHJ type, and the data for Domain 4 suggest that Concept 8 is the most difficult of the three, although the difference between Concepts 8 and 7 is not especially large. Four sets of complexity predictions are plotted against the human data in Figure 3. Boolean complexity and OQ complexity make identical predictions about Domain 3, and OQ complexity and OQ + FQ complexity make identical predictions about Domain 4. Only OQ complexity, however, accounts for the results observed in both domains. 
The concept descriptions generated by participants provide additional evidence that there are psychologically important differences between Domains 3 and 4. If the descriptions for concepts 5 and 10 are combined, 18 out of 20 responses in Domain 4 referred to quantification or counting. One representative description of Concept 5 stated that “red has multiple filled” and that “blue has one filled or none.” Only 3 of 20 responses in Domain 3 mentioned quantification. One representative description of Concept 5 stated that “red = multiple features” and that “blue = only one feature.”

Figure 3: Normalized learning times for each domain plotted against normalized complexity values predicted by four languages: Boolean logic, OQ, FQ and OQ + FQ. Correlations for Domain 3: Boolean r=0.84, OQ r=0.84, FQ r=0.26, OQ + FQ r=0.26. Correlations for Domain 4: Boolean r=0.27, OQ r=0.83, FQ r=0.27, OQ + FQ r=0.83.

These results suggest that people can count or quantify over features, but that it is psychologically more natural to quantify over objects rather than features. Although we have focused on three specific languages, the results in Figure 2b can be used to evaluate alternative proposals about the language of thought. One such alternative is an extension of Language OQ that allows feature values to be compared for equality. This extended language supports concise representations of Concept 2 in both Domain 3 (Fa = Ha) and Domain 4 (Fa = Fc), and predicts that Concept 2 will be easier to learn than all other concepts except Concept 1. Note, however, that this prediction is not compatible with the data in Figure 2b. Other languages might also be considered, but we know of no simple language that will account for our data better than OQ. 
4 Conclusion

Comparing concept learning across qualitatively different domains can provide valuable information about the nature of mental representation. We compared two domains that are similar in many respects, but that differ according to whether they include a single object (Domain 3) or multiple objects (Domain 4). Quantification over objects is possible in Domain 4 but not Domain 3, and this difference helps to explain the different learning patterns we observed across the two domains. Our results suggest that concept representations can incorporate quantification, and that quantifying over objects is more natural than quantifying over features. The model predictions we reported are based on a language (OQ) that is a generic version of first order logic with equality. Our results therefore suggest that some of the languages commonly considered by logicians (e.g. first order logic with equality) may indeed capture some aspects of the “laws of thought” [16]. A simple language like OQ offers a convenient way to explore the role of quantification, but this language will need to be refined and extended in order to provide a more accurate account of mental representation. For example, a comprehensive account of the language of thought will need to support quantification over features in some cases, but might be formulated so that quantification over features is typically more costly than quantification over objects. Many possible representation languages can be imagined and a large amount of empirical data will be needed to identify the language that comes closest to the language of thought. Many relevant studies have already been conducted [2, 6, 8, 9, 13, 17], but there are vast regions of the conceptual universe (Table 1) that remain to be explored. Navigating this universe is likely to involve several challenges, but web-based experiments [18, 19] may allow it to be explored at a depth and scale that are currently unprecedented. 
Characterizing the language of thought is undoubtedly a long term project, but modern methods of data collection may support rapid progress towards this goal.

Acknowledgments

I thank Maureen Satyshur for running the experiment. This work was supported in part by NSF grant CDI-0835797.

References

[1] J. A. Fodor. The language of thought. Harvard University Press, Cambridge, 1975.
[2] J. Feldman. Minimization of Boolean complexity in human concept learning. Nature, 407: 630–633, 2000.
[3] D. Fass and J. Feldman. Categorization under complexity: A unified MDL account of human learning of regular and irregular categories. In S. Thrun, S. Becker, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 35–34. MIT Press, Cambridge, MA, 2003.
[4] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008.
[5] N. D. Goodman, J. B. Tenenbaum, J. Feldman, and T. L. Griffiths. A rational analysis of rule-based concept learning. Cognitive Science, 32(1):108–154, 2008.
[6] R. N. Shepard, C. I. Hovland, and H. M. Jenkins. Learning and memorization of classifications. Psychological Monographs, 75(13), 1961. Whole No. 517.
[7] R. M. Nosofsky, M. Gluck, T. J. Palmeri, S. C. McKinley, and P. Glauthier. Comparing models of rule-based classification learning: A replication and extension of Shepard, Hovland, and Jenkins (1961). Memory and Cognition, 22:352–369, 1994.
[8] M. D. Lee and D. J. Navarro. Extending the ALCOVE model of category learning to featural stimulus domains. Psychonomic Bulletin and Review, 9(1):43–58, 2002.
[9] C. D. Aitkin and J. Feldman. Subjective complexity of categories defined over three-valued features. In R. Sun and N. Miyake, editors, Proceedings of the 28th Annual Conference of the Cognitive Science Society, pages 961–966. Psychology Press, New York, 2006.
[10] F. Mathy and J. Bradmetz. A theory of the graceful complexification of concepts and their learnability. Current Psychology of Cognition, 22(1):41–82, 2004.
[11] R. Vigo. A note on the complexity of Boolean concepts. Journal of Mathematical Psychology, 50:501–510, 2006.
[12] Y. Sakamoto and B. C. Love. Schematic influences on category learning and recognition memory. Journal of Experimental Psychology: General, 133(4):534–553, 2004.
[13] W. H. Crockett. Balance, agreement and positivity in the cognition of small social structures. In Advances in Experimental Social Psychology, Vol 15, pages 1–57. Academic Press, 1982.
[14] N. B. Cottrell. Heider’s structural balance principle as a conceptual rule. Journal of Personality and Social Psychology, 31(4):713–720, 1975.
[15] H. B. Enderton. A mathematical introduction to logic. Academic Press, New York, 1972.
[16] G. Boole. An investigation of the laws of thought on which are founded the mathematical theories of logic and probabilities. 1854.
[17] B. C. Love and A. B. Markman. The nonindependence of stimulus properties in human category learning. Memory and Cognition, 31(5):790–799, 2003.
[18] L. von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006.
[19] R. Snow, B. O’Connor, D. Jurafsky, and A. Ng. Cheap and fast–but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on empirical methods in natural language processing, pages 254–263. Association for Computational Linguistics, 2008.

4 0.10346539 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein

Abstract: Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming ﬁxed tic-tac-toe pruning. 1

5 0.084306613 21 nips-2009-Abstraction and Relational learning

Author: Charles Kemp, Alan Jern

Abstract: Most models of categorization learn categories deﬁned by characteristic features but some categories are described more naturally in terms of relations. We present a generative model that helps to explain how relational categories are learned and used. Our model learns abstract schemata that specify the relational similarities shared by instances of a category, and our emphasis on abstraction departs from previous theoretical proposals that focus instead on comparison of concrete instances. Our ﬁrst experiment suggests that abstraction can help to explain some of the ﬁndings that have previously been used to support comparison-based approaches. Our second experiment focuses on one-shot schema learning, a problem that raises challenges for comparison-based approaches but is handled naturally by our abstraction-based account. Categories such as family, sonnet, above, betray, and imitate differ in many respects but all of them depend critically on relational information. Members of a family are typically related by blood or marriage, and the lines that make up a sonnet must rhyme with each other according to a certain pattern. A pair of objects will demonstrate “aboveness” only if a certain spatial relationship is present, and an event will qualify as an instance of betrayal or imitation only if its participants relate to each other in certain ways. All of the cases just described are examples of relational categories. This paper develops a computational approach that helps to explain how simple relational categories are acquired. Our approach highlights the role of abstraction in relational learning. Given several instances of a relational category, it is often possible to infer an abstract representation that captures what the instances have in common. We refer to these abstract representations as schemata, although others may prefer to call them rules or theories. 
For example, a sonnet schema might specify the number of lines that a sonnet should include and the rhyming pattern that the lines should follow. Once a schema has been acquired it can support several kinds of inferences. A schema can be used to make predictions about hidden aspects of the examples already observed: if the final word in a sonnet is illegible, the rhyming pattern can help to predict the identity of this word. A schema can be used to decide whether new examples (e.g. new poems) qualify as members of the category. Finally, a schema can be used to generate novel examples of a category (e.g. novel sonnets). Most researchers would agree that abstraction plays some role in relational learning, but Gentner [1] and other psychologists have emphasized the role of comparison instead [2, 3]. Given one example of a sonnet and the task of deciding whether a second poem is also a sonnet, a comparison-based approach might attempt to establish an alignment or mapping between the two. Approaches that rely on comparison or mapping are especially prominent in the literature on analogical reasoning [4, 5], and many of these approaches can be viewed as accounts of relational categorization [6]. For example, the problem of deciding whether two systems are analogous can be formalized as the problem of deciding whether these systems are instances of the same relational category. Despite some notable exceptions [6, 7], most accounts of analogy focus on comparison rather than abstraction, and suggest that “analogy passes from one instance of a generalization to another without pausing for explicit induction of the generalization” (p 95) [8].

Figure 1: A hierarchical generative model for learning and using relational categories, with three levels: schema s (here ∀Q ∀x ∀y Q(x) < Q(y) ↔ D1(x) < D1(y)), group g, and observation o. The schema s at the top level is a logical sentence that specifies which groups are valid instances of the category. 
The group g at the second level is randomly sampled from the set of valid instances, and the observation o is a partially observed version of group g.

Researchers that focus on comparison sometimes discuss abstraction, but typically suggest that abstractions emerge as a consequence of comparing two or more concrete instances of a category [3, 5, 9, 10]. This view, however, will not account for one-shot inferences, or inferences based on a single instance of a relational category. Consider a learner who is shown one instance of a sonnet then asked to create a second instance. Since only one instance is provided, it is hard to see how comparisons between instances could account for success on the task. A single instance, however, will sometimes provide enough information for a schema to be learned, and this schema should allow subsequent instances to be generated [11]. Here we develop a formal framework for exploring relational learning in general and one-shot schema learning in particular. Our framework relies on the hierarchical Bayesian approach, which provides a natural way to combine abstraction and probabilistic inference [12]. The hierarchical Bayesian approach supports representations at multiple levels of abstraction, and helps to explain how abstract representations (e.g. a sonnet schema) can be acquired given observations of concrete instances (e.g. individual sonnets). The schemata we consider are represented as sentences in a logical language, and our approach therefore builds on previous probabilistic methods for learning and using logical theories [13, 14]. Following previous authors, we propose that logical representations can help to capture the content of human knowledge, and that Bayesian inference helps to explain how these representations are acquired and how they support inductive inference. The following sections introduce our framework then evaluate it using two behavioral experiments. 
Our ﬁrst experiment uses a standard classiﬁcation task where participants are shown one example of a category then asked to decide which of two alternatives is more likely to belong to the same category. Tasks of this kind have previously been used to argue for the importance of comparison, but we suggest that these tasks can be handled by accounts that focus on abstraction. Our second experiment uses a less standard generation task [15, 16] where participants are shown a single example of a category then asked to generate additional examples. As predicted by our abstraction-based account, we ﬁnd that people are able to learn relational categories on the basis of a single example. 1 A generative approach to relational learning Our examples so far have used real-world relational categories such as family and sonnet but we now turn to a very simple domain where relational categorization can be studied. Each element in the domain is a group of components that vary along a number of dimensions—in Figure 1, the components are ﬁgures that vary along the dimensions of size, color, and circle position. The groups can be organized into categories—one such category includes groups where every component is black. Although our domain is rather basic it allows some simple relational regularities to be explored. We can consider categories, for example, where all components in a group must be the same along some dimension, and categories where all components must be different along some dimension. We can also consider categories deﬁned by relationships between dimensions—for example, the category that includes all groups where the size and color dimensions are correlated. Each category is associated with a schema, or an abstract representation that speciﬁes which groups are valid instances of the category. 
Here we consider schemata that correspond to rules formulated in a logical language.

1 | {∀x, ∃x} Di(x) {=, ≠, <, >} vk
2 | {∀x ∀y x ≠ y →, ∃x ∃y x ≠ y ∧} Di(x) {=, ≠, <, >} Di(y)
3 | ∀x Di(x) {=, ≠} vk {∧, ∨, ↔} Dj(x) {=, ≠} vl
4 | ∀x ∀y x ≠ y → (Di(x) {=, ≠, <, >} Di(y) {∧, ∨, ↔} Dj(x) {=, ≠, <, >} Dj(y))
5 | {∀Q, ∃Q} {∀x ∀y x ≠ y →, ∃x ∃y x ≠ y ∧} Q(x) {=, ≠, <, >} Q(y)
6 | {∀Q Q ≠ Di →, ∃Q Q ≠ Di ∧} ∀x ∀y x ≠ y → (Q(x) {=, ≠, <, >} Q(y) {∧, ∨, ↔} Di(x) {=, ≠, <, >} Di(y))
7 | {∀Q ∀R Q ≠ R →, ∃Q ∃R Q ≠ R ∧} ∀x ∀y x ≠ y → (Q(x) {=, ≠, <, >} Q(y) {∧, ∨, ↔} R(x) {=, ≠, <, >} R(y))

Table 1: Templates used to construct a hypothesis space of logical schemata. An instance of a given template can be created by choosing an element from each set enclosed in braces, replacing each occurrence of Di or Dj with a dimension (e.g. D1) and replacing each occurrence of vk or vl with a value (e.g. 1).

The language includes three binary connectives: and (∧), or (∨), and if and only if (↔). Four binary relations (=, ≠, <, and >) are available for comparing values along dimensions. Universal quantification (∀x) and existential quantification (∃x) are both permitted, and the language includes quantification over objects (∀x) and dimensions (∀Q). For example, the schema in Figure 1 states that all dimensions are aligned. More precisely, if D1 is the dimension of size, the schema states that for all dimensions Q, a component x is smaller than a component y along dimension Q if and only if x is smaller in size than y. It follows that all three dimensions must increase or decrease together. To explain how rules in this logical language are learned we work with the hierarchical generative model in Figure 1. The representation at the top level is a schema s, and we assume that one or more groups g are generated from a distribution P(g|s). 
Following a standard approach to category learning [17, 18], we assume that g is uniformly sampled from all groups consistent with s:

$$P(g \mid s) \propto \begin{cases} 1 & \text{if } g \text{ is consistent with } s \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

For all applications in this paper, we assume that the number of components in a group is known and fixed in advance. The bottom level of the hierarchy specifies observations o that are generated from a distribution P(o|g). In most cases we assume that g can be directly observed, and that P(o|g) = 1 if o = g and 0 otherwise. We also consider the setting shown in Figure 1 where o is generated by concealing a component of g chosen uniformly at random. Note that the observation o in Figure 1 includes only four of the components in group g, and is roughly analogous to our earlier example of a sonnet with an illegible final word.

To convert Figure 1 into a fully-specified probabilistic model it remains to define a prior distribution P(s) over schemata. An appealing approach is to consider all of the infinitely many sentences in the logical language already mentioned, and to define a prior favoring schemata which correspond to simple (i.e. short) sentences. We approximate this approach by considering a large but finite space of sentences that includes all instances of the templates in Table 1 and all conjunctions of these instances. When instantiating one of these templates, each occurrence of Di or Dj should be replaced by one of the dimensions in the domain. For example, the schema in Figure 1 is a simplified instance of template 6 where Di is replaced by D1. Similarly, each instance of vk or vl should be replaced by a value along one of the dimensions. Our first experiment considers a problem where there are three dimensions and three possible values along each dimension (i.e. vk = 1, 2, or 3). As a result there are 1568 distinct instances of the templates in Table 1 and roughly one million conjunctions of these instances.
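The uniform likelihood in Equation 1 can be illustrated with a small enumeration. The sketch below assumes the Experiment 1 domain (three dimensions, three values, three cards) and, as a simplifying assumption not stated in the source, treats a group as an ordered triple of cards with repeats allowed. The two schemata are hand-written stand-ins for template instances.

```python
from itertools import product

VALUES = (1, 2, 3)
CARDS = list(product(VALUES, repeat=3))   # 27 possible cards
GROUPS = list(product(CARDS, repeat=3))   # assumption: ordered 3-card groups

def d1_always_3(group):
    return all(card[0] == 3 for card in group)

def d1_uniform(group):
    return len({card[0] for card in group}) == 1

def likelihood(group, schema):
    """P(g|s): uniform over the schema's extension, zero outside it."""
    if not schema(group):
        return 0.0
    extension_size = sum(1 for g in GROUPS if schema(g))
    return 1.0 / extension_size

g = ((3, 2, 1), (3, 3, 2), (3, 1, 3))
print(likelihood(g, d1_always_3))  # 1/729
print(likelihood(g, d1_uniform))   # 1/2187
```

Because the narrower schema has the smaller extension, it assigns the observed group a higher likelihood, which is what lets a single example favor specific schemata over general ones.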
Our second experiment uses three dimensions with five values along each dimension, which leads to 2768 template instances and roughly three million conjunctions of these instances.

The templates in Table 1 capture most of the simple regularities that can be formulated in our logical language. Template 1 generates all rules that include quantification over a single object variable and no binary connectives. Template 3 is similar but includes a single binary connective. Templates 2 and 4 are similar to 1 and 3 respectively, but include two object variables (x and y) rather than one. Templates 5, 6 and 7 add quantification over dimensions to Templates 2 and 4. Although the templates in Table 1 capture a large class of regularities, several kinds of templates are not included. Since we do not assume that the dimensions are commensurable, values along different dimensions cannot be directly compared (∃x D1(x) = D2(x) is not permitted). For the same reason, comparisons to a dimension value must involve a concrete dimension (∀x D1(x) = 1 is permitted) rather than a dimension variable (∀Q ∀x Q(x) = 1 is not permitted). Finally, we exclude all schemata where quantification over objects precedes quantification over dimensions, and as a result there are some simple schemata that our implementation cannot learn (e.g. ∃x∀y∃Q Q(x) = Q(y)).

The extension of each schema is a set of groups, and schemata with the same extension can be assigned to the same equivalence class. For example, ∀x D1(x) = v1 (an instance of template 1) and ∀x D1(x) = v1 ∧ D1(x) = v1 (an instance of template 3) end up in the same equivalence class. Each equivalence class can be represented by the shortest sentence that it contains, and we define our prior P(s) over a set that includes a single representative for each equivalence class. The prior probability P(s) of each sentence decreases exponentially with its length: P(s) ∝ λ^|s|, where |s| is the length of schema s and λ is a constant between 0 and 1.
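As a rough illustration of the length-based prior, the sketch below normalizes λ^|s| over a toy set of schemata. The schema names and their lengths are hypothetical (the source does not say how |s| is counted); λ = 0.8 is the value the paper reports using.

```python
LAMBDA = 0.8  # value reported in the paper

def length_prior(lengths, lam=LAMBDA):
    """Normalize lam ** |s| over a finite set of schema lengths."""
    weights = {name: lam ** n for name, n in lengths.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical lengths, chosen only to show the effect of the prior.
lengths = {
    "D1 always 3": 5,
    "D1 uniform": 8,
    "D1 uniform and D2 uniform": 17,
}
prior = length_prior(lengths)
for name, p in prior.items():
    print(f"{name}: {p:.3f}")
```

Each additional unit of length multiplies the prior weight by λ, so a schema three units longer receives 0.8³ ≈ 0.51 times the weight of the shorter one, regardless of which other schemata are in the space.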
For all applications in this paper we set λ = 0.8. The generative model in Figure 1 can be used for several purposes, including schema learning (inferring a schema s given one or more instances generated from the schema), classification (deciding whether group gnew belongs to a category given one or more instances of the category) and generation (generating a group gnew that belongs to the same category as one or more instances). Our first experiment explores all three of these problems.

2 Experiment 1: Relational classification

Our first experiment is organized around a triad task where participants are shown one example of a category and then asked to decide which of two choice examples is more likely to belong to the category. Triad tasks are regularly used by studies of relational categorization, and have been used to argue for the importance of comparison [1]. A comparison-based approach to this task, for instance, might compare the example object to each of the choice objects in order to decide which is the better match. Our first experiment is intended in part to explore whether a schema-learning approach can also account for inferences about triad tasks.

Materials and Method. 18 adults participated for course credit and interacted with a custom-built computer interface. The stimuli were groups of figures that varied along three dimensions (color, size, and ball position, as in Figure 1). Each shape was displayed on a single card, and all groups in Experiment 1 included exactly three cards. The cards in Figure 1 show five different values along each dimension, but Experiment 1 used only three values along each dimension. The experiment included inferences about 10 triads. Participants were told that aliens from a certain planet “enjoy organizing cards into groups,” and that “any group of cards will probably be liked by some aliens and disliked by others.” The ten triad tasks were framed as questions about the preferences of 10 aliens.
Participants were shown a group that Mr X likes (different names were used for the ten triads), then shown two choice groups and told that “Mr X likes one of these groups but not the other.” Participants were asked to select one of the choice groups, then asked to generate another 3-card group that Mr X would probably like. Cards could be added to the screen using an “Add Card” button, and there were three pairs of buttons that allowed each card to be increased or decreased along the three dimensions. Finally, participants were asked to explain in writing “what kind of groups Mr X likes.”

The ten triads used are shown in Figure 2. Each group is represented as a 3 by 3 matrix where rows represent cards and columns show values along the three dimensions. Triad 1, for example, has an example group including three cards that each take value 3 along D1. The first choice group is consistent with this regularity but the second choice group is not. The cards in each group were arrayed vertically on screen, and were initially sorted as shown in Figure 2 (i.e. first by D3, then by D2 and then by D1). The cards could be dragged around on screen, and participants were invited to move them around in order to help them understand each group. The mapping between the three dimensions in each matrix and the three dimensions in the experiment (color, position, and size) was randomized across participants, and the order in which triads were presented was also randomized.

[Figure 2 omitted. Panels: (a) D1 value always 3; (b) D2 uniform; (c) D2 and D3 aligned; (d) D1 and D3 anti-aligned; (e) Two dimensions aligned; (f) Two dimensions anti-aligned; (g) All dimensions uniform; (h) Some dimension uniform; (i) All dimensions have no repeats; (j) Some dimension has no repeats.]

Figure 2: Human responses and model predictions for the ten triads in Experiment 1. The plot at the left of each panel shows model predictions (white bars) and human preferences (black bars) for the two choice groups in each triad. The plots at the right of each panel summarize the groups created during the generation phase. The 23 elements along the x-axis correspond to the regularities listed in Table 2.

Table 2: Regularities used to code responses to the generation tasks in Experiments 1 and 2.

1  All dimensions aligned
2  Two dimensions aligned
3  D1 and D2 aligned
4  D1 and D3 aligned
5  D2 and D3 aligned
6  All dimensions aligned or anti-aligned
7  Two dimensions anti-aligned
8  D1 and D2 anti-aligned
9  D1 and D3 anti-aligned
10  D2 and D3 anti-aligned
11  All dimensions have no repeats
12  Two dimensions have no repeats
13  One dimension has no repeats
14  D1 has no repeats
15  D2 has no repeats
16  D3 has no repeats
17  All dimensions uniform
18  Two dimensions uniform
19  One dimension uniform
20  D1 uniform
21  D2 uniform
22  D3 uniform
23  D1 value is always 3

Model predictions and results. Let ge be the example group presented in the triad task and g1 and g2 be the two choice groups.
We use our model to compute the relative probability of two hypotheses: h1, which states that ge and g1 are generated from the same schema and that g2 is sampled randomly from all possible groups, and h2, which states that ge and g2 are generated from the same schema (with g1 sampled randomly). We set P(h1) = P(h2) = 0.5, and compute posterior probabilities P(h1 | ge, g1, g2) and P(h2 | ge, g1, g2) by integrating over all schemata in the hypothesis space already described. Our model assumes that two groups are considered similar to the extent that they appear to have been generated by the same underlying schema, and is consistent with the generative approach to similarity described by Kemp et al. [19].

Model predictions for the ten triads are shown in Figure 2. In each case, the choice probabilities plotted (white bars) are the posterior probabilities of hypotheses h1 and h2. In nine out of ten cases the best choice according to the model is the most common human response. Responses to triads 2c and 2d support the idea that people are sensitive to relationships between dimensions (i.e. alignment and anti-alignment). Triads 2e and 2f are similar to triads studied by Kotovsky and Gentner [1], and we replicate their finding that people are sensitive to relationships between dimensions even when the dimensions involved vary from group to group. The one case where human responses diverge from model predictions is shown in Figure 2h. Note that the schema for this triad involves existential quantification over dimensions (some dimension is uniform), and according to our prior P(s) this kind of quantification is no more complex than other kinds of quantification. Future applications of our approach can explore the idea that existential quantification over dimensions (∃Q) is psychologically more complex than universal quantification over dimensions (∀Q) or existential quantification over cards (∃x), and can consider logical languages that incorporate this inductive bias.
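The triad computation described above can be sketched on a deliberately tiny hypothesis space. Everything beyond the stated model is an assumption made for illustration: the space holds only "Di always v" and "Di uniform" schemata, the lengths are hypothetical, groups are ordered triples, and the example groups are invented rather than taken from Figure 2. Under equal priors on h1 and h2, the 1/|G| factor for the randomly sampled group appears in both hypotheses and cancels.

```python
from itertools import product

LAMBDA = 0.8
VALUES = (1, 2, 3)
CARDS = list(product(VALUES, repeat=3))
GROUPS = list(product(CARDS, repeat=3))   # assumption: ordered 3-card groups

def always(dim, value):
    return lambda g: all(card[dim] == value for card in g)

def uniform(dim):
    return lambda g: len({card[dim] for card in g}) == 1

# (predicate, hypothetical length) pairs; a stand-in for the full schema space.
SCHEMATA = [(always(d, v), 5) for d in range(3) for v in VALUES]
SCHEMATA += [(uniform(d), 8) for d in range(3)]
SIZES = [sum(1 for g in GROUPS if pred(g)) for pred, _ in SCHEMATA]

def joint(ga, gb):
    """Unnormalized sum over s of P(s) P(ga|s) P(gb|s)."""
    total = 0.0
    for (pred, length), size in zip(SCHEMATA, SIZES):
        if pred(ga) and pred(gb):
            total += LAMBDA ** length / size ** 2
    return total

def triad_posterior(ge, g1, g2):
    """Return (P(h1 | ge, g1, g2), P(h2 | ge, g1, g2)) with P(h1) = P(h2)."""
    j1, j2 = joint(ge, g1), joint(ge, g2)
    if j1 + j2 == 0:
        raise ValueError("no schema in this toy space explains either pairing")
    return j1 / (j1 + j2), j2 / (j1 + j2)

ge = ((3, 2, 1), (3, 3, 2), (3, 1, 3))   # D1 is always 3
g1 = ((3, 1, 1), (3, 2, 2), (3, 3, 3))   # D1 is always 3
g2 = ((2, 1, 1), (2, 3, 2), (2, 2, 3))   # D1 is uniform, but with value 2
p_h1, p_h2 = triad_posterior(ge, g1, g2)
print(p_h1, p_h2)
```

In this toy case both choice groups share a schema with the example ("D1 uniform"), but only the first also shares the narrower "D1 always 3", so the model prefers it strongly.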
To model the generation phase of the experiment we computed the posterior distribution

$$P(g_{\text{new}} \mid g_e, g_1, g_2) = \sum_{s,h} P(g_{\text{new}} \mid s)\, P(s \mid h, g_e, g_1, g_2)\, P(h \mid g_e, g_1, g_2)$$

where P(h | ge, g1, g2) is the distribution used to model selections in the triad task. Since the space of possible groups is large, we visualize this distribution using a profile that shows the posterior probability assigned to groups consistent with the 23 regularities shown in Table 2. The white bar plots in Figure 2 show profiles predicted by the model, and the black plots immediately above show profiles computed over the groups generated by our 18 participants. In many of the 10 cases the model accurately predicts regularities in the groups generated by people. In case 2c, for example, the model correctly predicts that generated groups will tend to have no repeats along dimensions D2 and D3 (regularities 15 and 16) and that these two dimensions will be aligned (regularities 2 and 5). There are, however, some departures from the model’s predictions, and a notable example occurs in case 2d. Here the model detects the regularity that dimensions D1 and D3 are anti-aligned (regularity 9).
Some groups generated by participants are consistent with this regularity, but people also regularly generate groups where two dimensions are aligned rather than anti-aligned (regularity 2). This result may indicate that some participants are sensitive to relationships between dimensions but do not consider the difference between a positive relationship (alignment) and an inverse relationship (anti-alignment) especially important.

[Figure 3 omitted. Panels: (a) All dimensions aligned; (b) D2 and D3 aligned; (c) D1 has no repeats, D2 and D3 uniform; (d) D2 uniform; (e) All dimensions uniform; (f) All dimensions have no repeats.]

Figure 3: Human responses and model predictions for the six cases in Experiment 2. In (a) and (b), the 4 cards used for the completion and generation phases are shown on either side of the dashed line (completion cards on the left). In the remaining cases, the same 4 cards were used for both phases. The plots at the right of each panel show model predictions (white bars) and human responses (black bars) for the generation task. In each case, the 23 elements along each x-axis correspond to the regularities listed in Table 2. The remaining plots show responses to the completion task. There are 125 possible responses, and the four responses shown always include the top two human responses and the top two model predictions.
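The regularity profiles used to summarize the generation phase can be sketched as follows. This is a simplified illustration, not the paper's implementation: it conditions on a single example group (dropping the sum over the triad hypotheses h), uses a toy schema space with hypothetical lengths, and treats groups as ordered triples. The example group is invented.

```python
from itertools import product

LAMBDA = 0.8
VALUES = (1, 2, 3)
CARDS = list(product(VALUES, repeat=3))
GROUPS = list(product(CARDS, repeat=3))   # assumption: ordered 3-card groups

def always(dim, value):
    return lambda g: all(card[dim] == value for card in g)

def uniform(dim):
    return lambda g: len({card[dim] for card in g}) == 1

def no_repeats(dim):
    return lambda g: len({card[dim] for card in g}) == len(g)

# name -> (predicate, hypothetical length); the lengths are illustrative only.
SCHEMATA = {}
for d in range(3):
    for v in VALUES:
        SCHEMATA[f"D{d + 1} always {v}"] = (always(d, v), 5)
    SCHEMATA[f"D{d + 1} uniform"] = (uniform(d), 8)
    SCHEMATA[f"D{d + 1} no repeats"] = (no_repeats(d), 8)

EXTENSIONS = {name: {g for g in GROUPS if pred(g)}
              for name, (pred, _) in SCHEMATA.items()}

def schema_posterior(ge):
    """P(s | ge), proportional to P(s) * P(ge | s)."""
    scores = {name: LAMBDA ** length / len(EXTENSIONS[name])
              for name, (_, length) in SCHEMATA.items()
              if ge in EXTENSIONS[name]}
    z = sum(scores.values())
    return {name: score / z for name, score in scores.items()}

def predictive(g_new, ge):
    """P(g_new | ge) = sum over s of P(g_new | s) P(s | ge)."""
    return sum(p / len(EXTENSIONS[name])
               for name, p in schema_posterior(ge).items()
               if g_new in EXTENSIONS[name])

def profile(regularity, ge):
    """Posterior probability that a generated group satisfies `regularity`."""
    return sum(p * sum(1 for g in EXTENSIONS[name] if regularity(g))
               / len(EXTENSIONS[name])
               for name, p in schema_posterior(ge).items())

ge = ((3, 2, 1), (3, 3, 2), (3, 1, 3))
print(schema_posterior(ge))
print(profile(always(0, 3), ge))  # mass on "D1 value is always 3"
print(profile(uniform(0), ge))    # mass on "D1 uniform"
```

A full profile would evaluate one such predicate for each of the 23 regularities in Table 2; only the uniform, no-repeats and fixed-value cases are shown here.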
Kotovsky and Gentner [1] suggest that comparison can explain how people respond to triad tasks, although they do not provide a computational model that can be compared with our approach. It is less clear how comparison might account for our generation data, and our next experiment considers a one-shot generation task that raises even greater challenges for a comparison-based approach.

3 Experiment 2: One-shot schema learning

As described already, comparison involves constructing mappings between pairs of category instances. In some settings, however, learners make confident inferences given a single instance of a category [15, 20], and it is difficult to see how comparison could play a major role when only one instance is available. Models that rely on abstraction, however, can naturally account for one-shot relational learning, and we designed a second experiment to evaluate this aspect of our approach.

Several previous studies have explored one-shot relational learning. Holyoak and Thagard [21] developed a study of analogical reasoning using stories as stimuli and found little evidence of one-shot schema learning. Ahn et al. [11] demonstrated, however, that one-shot learning can be achieved with complex materials such as stories, and modeled this result using explanation-based learning. Here we use much simpler stimuli and explore a probabilistic approach to one-shot learning.

Materials and Method. 18 adults participated for course credit. The same individuals completed Experiments 1 and 2, and Experiment 2 was always run before Experiment 1. The same computer interface was used in both experiments, and the only important difference was that the figures in Experiment 2 could now take five values along each dimension rather than three. The experiment included two phases. During the generation phase, participants saw a 4-card group that Mr X liked and were asked to generate two 5-card groups that Mr X would probably like.
During the completion phase, participants were shown four members of a 5-card group and were asked to generate the missing card. The stimuli used in each phase are shown in Figure 3. In the first two cases, slightly different stimuli were used in the generation and completion phases, and in all remaining cases the same set of four cards was used in both phases. All participants responded to the six generation questions before answering the six completion questions.

Model predictions and results. The generation phase is modeled as in Experiment 1, but now the posterior distribution P(gnew | ge) is computed after observing a single instance of a category. The human responses in Figure 3 (black bars) are consistent with the model in all cases, and confirm that a single example can provide sufficient evidence for learners to acquire a relational category. For example, the most common response in case 3a was the 5-card group shown in Figure 1, a group with all three dimensions aligned.

To model the completion phase, let oe represent a partial observation of group ge. Our model infers which card is missing from ge by computing the posterior distribution

$$P(g_e \mid o_e) \propto P(o_e \mid g_e) \sum_{s} P(g_e \mid s)\, P(s)$$

where P(oe | ge) captures the idea that oe is generated by randomly concealing one component of ge. The white bars in Figure 3 show model predictions, and in five out of six cases the best response according to the model is the same as the most common human response. In the remaining case (Figure 3d) the model generates a diffuse distribution over all cards with value 3 on dimension 2, and all human responses satisfy this regularity.

4 Conclusion

We presented a generative model that helps to explain how relational categories are learned and used. Our approach captures relational regularities using a logical language, and helps to explain how schemata formulated in this language can be learned from observed data.
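The completion-phase posterior can be sketched with a small schema space in which every schema constrains a single dimension, so that extension sizes factor across dimensions and never require enumerating all 5-card groups. The schema space, the lengths, and the four observed cards below are hypothetical; the observed cards are chosen only so that D2 always takes value 3. Because the concealed card is chosen uniformly at random, P(oe | ge) is the same constant for every completion consistent with oe and drops out of the normalization.

```python
from itertools import product

LAMBDA = 0.8
VALUES = (1, 2, 3, 4, 5)
N_CARDS = 5
SEQS = list(product(VALUES, repeat=N_CARDS))  # value sequences for one dimension

# Constraints on the sequence of values one dimension takes across the group.
def always(value):
    return lambda seq: all(x == value for x in seq)

def uniform(seq):
    return len(set(seq)) == 1

def no_repeats(seq):
    return len(set(seq)) == len(seq)

# (dimension, constraint, hypothetical length) triples.
SCHEMATA = [(d, always(v), 5) for d in range(3) for v in VALUES]
SCHEMATA += [(d, uniform, 8) for d in range(3)]
SCHEMATA += [(d, no_repeats, 8) for d in range(3)]

def extension_size(constraint):
    """Groups consistent with a one-dimension constraint; other dimensions are free."""
    return sum(1 for seq in SEQS if constraint(seq)) * len(SEQS) ** 2

def marginal(group):
    """Sum over s of P(g|s) P(s), up to the prior's normalizing constant."""
    total = 0.0
    for dim, constraint, length in SCHEMATA:
        seq = tuple(card[dim] for card in group)
        if constraint(seq):
            total += LAMBDA ** length / extension_size(constraint)
    return total

def completion_posterior(observed):
    """Posterior over the missing fifth card given four observed cards."""
    cards = list(product(VALUES, repeat=3))
    scores = {c: marginal(list(observed) + [c]) for c in cards}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

observed = [(4, 3, 1), (4, 3, 3), (1, 3, 5), (3, 3, 5)]  # hypothetical cards
post = completion_posterior(observed)
print(post[(1, 3, 1)], post[(1, 2, 1)])
```

With these cards the only schemata that survive are "D2 uniform" and "D2 always 3", so the toy posterior spreads evenly over the 25 cards whose second value is 3, which is the same qualitative pattern the text describes for Figure 3d.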
Our approach differs in several respects from previous accounts of relational categorization [1, 5, 10, 22]. First, we focus on abstraction rather than comparison. Second, we consider tasks where participants must generate examples of categories [16] rather than simply classify existing examples. Finally, we provide a formal account that helps to explain how relational categories can be learned from a single instance.

Our approach can be developed and extended in several ways. For simplicity, we implemented our model by working with a finite space of several million schemata, but future work can consider hypothesis spaces that assign non-zero probability to all regularities that can be formulated in the language we described. The specific logical language used here is only a starting point, and future work can aim to develop languages that provide a more faithful account of human inductive biases. Finally, we worked with a domain that provides one of the simplest ways to address core questions such as one-shot learning. Future applications of our general approach can consider domains that include more than three dimensions and a richer space of relational regularities.

Relational learning and analogical reasoning are tightly linked, and hierarchical generative models provide a promising approach to both problems. We focused here on relational categorization, but future studies can explore whether probabilistic accounts of schema learning can help to explain the inductive inferences typically considered by studies of analogical reasoning. Although there are many models of analogical reasoning, there are few that pursue a principled probabilistic approach, and the hierarchical Bayesian approach may help to fill this gap in the literature.

Acknowledgments

We thank Maureen Satyshur for running the experiments. This work was supported in part by NSF grant CDI-0835797.

References

[1] L. Kotovsky and D. Gentner. Comparison and categorization in the development of relational similarity. Child Development, 67:2797–2822, 1996.
[2] D. Gentner and A. B. Markman. Structure mapping in analogy and similarity. American Psychologist, 52:45–56, 1997.
[3] D. Gentner and J. Medina. Similarity and the development of rules. Cognition, 65:263–297, 1998.
[4] B. Falkenhainer, K. D. Forbus, and D. Gentner. The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41:1–63, 1989.
[5] J. E. Hummel and K. J. Holyoak. A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110:220–264, 2003.
[6] M. Mitchell. Analogy-making as perception: a computer model. MIT Press, Cambridge, MA, 1993.
[7] D. R. Hofstadter and the Fluid Analogies Research Group. Fluid concepts and creative analogies: computer models of the fundamental mechanisms of thought. 1995.
[8] W. V. O. Quine and J. Ullian. The Web of Belief. Random House, New York, 1978.
[9] J. Skorstad, D. Gentner, and D. Medin. Abstraction processes during concept learning: a structural view. In Proceedings of the 10th Annual Conference of the Cognitive Science Society, pages 419–425. 2009.
[10] D. Gentner and J. Loewenstein. Relational language and relational thought. In E. Amsel and J. P. Byrnes, editors, Language, literacy and cognitive development: the development and consequences of symbolic communication, pages 87–120. 2002.
[11] W. Ahn, W. F. Brewer, and R. J. Mooney. Schema acquisition from a single example. Journal of Experimental Psychology: Learning, Memory and Cognition, 18(2):391–412, 1992.
[12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis. Chapman & Hall, New York, 2nd edition, 2003.
[13] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008.
[14] S. Kok and P. Domingos. Learning the structure of Markov logic networks. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[15] J. Feldman. The structure of perceptual categories. Journal of Mathematical Psychology, 41:145–170, 1997.
[16] A. Jern and C. Kemp. Category generation. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 130–135. Cognitive Science Society, Austin, TX, 2009.
[17] D. Conklin and I. H. Witten. Complexity-based induction. Machine Learning, 16(3):203–225, 1994.
[18] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001.
[19] C. Kemp, A. Bernstein, and J. B. Tenenbaum. A generative theory of similarity. In B. G. Bara, L. Barsalou, and M. Bucciarelli, editors, Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1132–1137. Lawrence Erlbaum Associates, 2005.
[20] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Theory acquisition and the language of thought. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pages 1606–1611. Cognitive Science Society, Austin, TX, 2008.
[21] K. J. Holyoak and P. Thagard. Analogical mapping by constraint satisfaction. Cognitive Science, 13(3):295–355, 1989.
[22] L. A. A. Doumas, J. E. Hummel, and C. M. Sandhofer. A theory of the discovery and predication of relational concepts. Psychological Review, 115(1):1–43, 2008.
[23] M. L. Gick and K. J. Holyoak. Schema induction and analogical transfer. Cognitive Psychology, 15:1–38, 1983.

6 0.067120343 115 nips-2009-Individuation, Identification and Object Discovery

7 0.055969186 112 nips-2009-Human Rademacher Complexity

8 0.052847426 97 nips-2009-Free energy score space

9 0.052015264 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization

10 0.048941735 70 nips-2009-Discriminative Network Models of Schizophrenia

11 0.047041953 233 nips-2009-Streaming Pointwise Mutual Information

12 0.038845014 2 nips-2009-3D Object Recognition with Deep Belief Nets

13 0.036735896 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

14 0.030707426 72 nips-2009-Distribution Matching for Transduction

15 0.030465899 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions

16 0.029952547 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

17 0.029488362 253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks

18 0.028290262 260 nips-2009-Zero-shot Learning with Semantic Output Codes

19 0.027682086 25 nips-2009-Adaptive Design Optimization in Experiments with People

20 0.027222551 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.084), (1, -0.047), (2, -0.026), (3, -0.046), (4, 0.002), (5, -0.001), (6, -0.012), (7, -0.019), (8, -0.041), (9, 0.031), (10, 0.048), (11, -0.054), (12, 0.005), (13, -0.128), (14, 0.144), (15, 0.011), (16, 0.087), (17, 0.059), (18, -0.078), (19, 0.067), (20, -0.019), (21, -0.025), (22, 0.044), (23, 0.078), (24, -0.058), (25, 0.089), (26, 0.13), (27, 0.041), (28, 0.064), (29, 0.189), (30, -0.126), (31, -0.03), (32, -0.135), (33, -0.038), (34, 0.035), (35, -0.046), (36, 0.019), (37, -0.194), (38, 0.113), (39, 0.304), (40, 0.007), (41, -0.019), (42, -0.086), (43, 0.152), (44, -0.072), (45, -0.183), (46, 0.113), (47, 0.05), (48, 0.112), (49, 0.116)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97033787 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

Author: Anne Hsu, Thomas L. Griffiths

2 0.76575053 259 nips-2009-Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation

Author: Jie Luo, Barbara Caputo, Vittorio Ferrari

3 0.55609012 233 nips-2009-Streaming Pointwise Mutual Information

Author: Benjamin V. Durme, Ashwin Lall

Abstract: Recent work has led to the ability to perform space-efficient, approximate counting over large vocabularies in a streaming context. Motivated by the existence of data structures of this type, we explore the computation of associativity scores, otherwise known as pointwise mutual information (PMI), in a streaming context. We give theoretical bounds showing the impracticality of perfect online PMI computation, and detail an algorithm with high expected accuracy. Experiments on news articles show our approach gives high accuracy on real-world data.

4 0.53621203 196 nips-2009-Quantification and the language of thought

Author: Charles Kemp

5 0.35277939 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

Author: Alexandre Bouchard-côté, Slav Petrov, Dan Klein

6 0.33146581 21 nips-2009-Abstraction and Relational learning

7 0.31600296 115 nips-2009-Individuation, Identification and Object Discovery

8 0.29101959 112 nips-2009-Human Rademacher Complexity

9 0.27322829 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities

10 0.26785025 39 nips-2009-Bayesian Belief Polarization

11 0.24302687 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information

12 0.23585831 236 nips-2009-Structured output regression for detection with partial truncation

13 0.23083927 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization

14 0.22901218 124 nips-2009-Lattice Regression

15 0.22834581 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

16 0.22501674 25 nips-2009-Adaptive Design Optimization in Experiments with People

17 0.22247234 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

18 0.22186579 152 nips-2009-Measuring model complexity with the prior predictive

19 0.21554674 194 nips-2009-Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory

20 0.20621154 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.026), (25, 0.046), (35, 0.027), (36, 0.068), (39, 0.074), (58, 0.059), (61, 0.444), (71, 0.072), (86, 0.052), (91, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.857512 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

Author: Anne Hsu, Thomas L. Griffiths

2 0.82378477 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems.

3 0.76410335 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness.

4 0.70198566 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ C_L penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation.

5 0.66681677 33 nips-2009-Analysis of SVM with Indefinite Kernels

Author: Yiming Ying, Colin Campbell, Mark Girolami

Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.

6 0.57897156 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

7 0.53860176 134 nips-2009-Learning to Explore and Exploit in POMDPs

8 0.5380336 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

9 0.51790619 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

10 0.46806759 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

11 0.46393147 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

12 0.45978573 113 nips-2009-Improving Existing Fault Recovery Policies

13 0.44708914 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

14 0.44487789 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

15 0.43474117 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

16 0.43192101 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

17 0.41706026 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

18 0.41639528 48 nips-2009-Bootstrapping from Game Tree Search

19 0.41249856 52 nips-2009-Code-specific policy gradient rules for spiking neurons

20 0.40606052 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output