
112 nips-2009-Human Rademacher Complexity

Source: pdf

Author: Xiaojin Zhu, Bryan R. Gibson, Timothy T. Rogers

Abstract: We propose to use Rademacher complexity, originally developed in computational learning theory, as a measure of human learning capacity. Rademacher complexity measures a learner’s ability to fit random labels, and can be used to bound the learner’s true error based on the observed training sample error. We first review the definition of Rademacher complexity and its generalization bound. We then describe a “learning the noise” procedure to experimentally measure human Rademacher complexities. The results from empirical studies showed that: (i) human Rademacher complexity can be successfully measured, (ii) the complexity depends on the domain and training sample size in intuitive ways, (iii) human learning respects the generalization bounds, (iv) the bounds can be useful in predicting the danger of overfitting in human learning. Finally, we discuss the potential applications of human Rademacher complexity in cognitive science.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We propose to use Rademacher complexity, originally developed in computational learning theory, as a measure of human learning capacity. [sent-9, score-0.237]

2 Rademacher complexity measures a learner’s ability to ﬁt random labels, and can be used to bound the learner’s true error based on the observed training sample error. [sent-10, score-0.378]

3 We ﬁrst review the deﬁnition of Rademacher complexity and its generalization bound. [sent-11, score-0.2]

4 We then describe a “learning the noise” procedure to experimentally measure human Rademacher complexities. [sent-12, score-0.237]

5 Finally, we discuss the potential applications of human Rademacher complexity in cognitive science. [sent-14, score-0.445]

6 How much information can human beings hold in mind and deploy in simple memory tasks [19, 15, 6]? [sent-16, score-0.359]

7 How do human beings avoid over-ﬁtting learning examples when acquiring knowledge that allows them to generalize [20]? [sent-19, score-0.25]

8 Machine learning offers a variety of formal approaches to measuring the capacity of a learning system, with concepts such as Vapnik-Chervonenkis (VC) dimension [27, 25, 12] and Rademacher complexity [1, 13, 24]. [sent-21, score-0.307]

9 Based on these notions of capacity, one can quantify the generalization performance of a classiﬁer, and the danger of over-ﬁtting, by bounding its future test error using its observed training sample error. [sent-22, score-0.308]

10 We assess whether human capacity varies depending on the nature of the materials to be categorized, and empirically test whether human generalization behavior respects the error bounds in a variety of categorization tasks. [sent-25, score-0.738]

11 The results validate Rademacher complexity as a meaningful measure of human learning capacity, and provide a new perspective on the human tendency to overﬁt training data in category learning tasks. [sent-26, score-0.722]

12 We note that our aim is not to develop a new formal approach to complexity, but rather to show how a well-studied formal measure can be computed for human beings. [sent-27, score-0.285]

13 Let X be a domain of interest, which in psychology corresponds to a stimulus space. [sent-29, score-0.167]

14 Rademacher complexity (see for example [1]) measures the capacity of the hypothesis space F by how easy it is for F to ﬁt random noise. [sent-43, score-0.292]

15 That is, we are ﬁtting the particular training sample by ﬁnding the hypothesis in F with the best ﬁt. [sent-60, score-0.164]

16 For a set of real-valued functions F with domain X, a distribution P_X on X, and a size n, the Rademacher complexity R(F, X, P_X, n) is $R(F, X, P_X, n) = E_{x\sigma}\left[\sup_{f \in F} \left|\frac{2}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\right|\right]$ (1), where the expectation is over $x = x_1$, . [sent-69, score-0.391]
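To make definition (1) concrete, the following is a minimal Monte Carlo sketch for a small, finite hypothesis class. It assumes the absolute-value form of the definition with the 2/n scaling; the threshold classifiers, the uniform distribution on [0, 1], and all function names are hypothetical choices for illustration and do not come from the paper.

```python
import numpy as np

def rademacher_complexity(hypotheses, sample_x, n, trials=2000, seed=0):
    """Monte Carlo estimate of E_{x,sigma} sup_f |(2/n) * sum_i sigma_i f(x_i)|."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        x = sample_x(rng, n)                    # x_1..x_n drawn from P_X
        sigma = rng.choice([-1, 1], size=n)     # random +/-1 labels
        # Rows are hypotheses, columns are instances.
        preds = np.array([h(x) for h in hypotheses])
        # Supremum over the (finite) class of the absolute correlation.
        total += np.max(np.abs(preds @ sigma)) * 2.0 / n
    return total / trials

# Hypothetical hypothesis class: threshold rules on [0, 1].
def make_threshold(t):
    return lambda x: np.where(x >= t, 1, -1)

thresholds = [make_threshold(t) for t in np.linspace(0.0, 1.0, 21)]
uniform = lambda rng, n: rng.uniform(0.0, 1.0, size=n)

r_small = rademacher_complexity(thresholds, uniform, n=5)
r_large = rademacher_complexity(thresholds, uniform, n=40)
```

For this simple class the estimate shrinks as n grows, since a fixed set of thresholds finds it harder to correlate with a longer string of random labels.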

17 Rademacher complexity depends on the hypothesis space F, the domain X , the distribution on the domain PX , as well as the training sample size n. [sent-76, score-0.491]

18 For example, in Section 4 we will discuss two different tasks on the same X (set of words): classifying a word by its emotional valence, or by its length. [sent-79, score-0.229]

19 In Section 3, we will discuss our procedure to approximate the supremum in the case of human learning. [sent-95, score-0.251]

20 On a training sample $\{(x_i, y_i)\}_{i=1}^{n}$ of size n, the observed training sample error of f is $\hat{e}(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[y_i \neq f(x_i)]$. [sent-105, score-0.301]

21 Rademacher complexity allows us to bound the true error using training sample error as follows. [sent-109, score-0.494]

22 Let $\{(x_i, y_i)\}_{i=1}^{n} \overset{iid}{\sim} P_{XY}$ be a training sample of size n. [sent-113, score-0.205]

23 The bound has two factors, one from the Rademacher complexity and the other from the conﬁdence parameter δ and training sample size n. [sent-117, score-0.337]

24 When the bound is tight, training sample error is a good indicator of true error, and we can be conﬁdent that overﬁtting is unlikely. [sent-118, score-0.223]

25 A tight bound requires the Rademacher complexity to be close to zero. [sent-119, score-0.246]

26 On the other hand, if the Rademacher complexity is large, or n is too small, or the requested conﬁdence 1 − δ is overly stringent, the bound can be loose. [sent-120, score-0.207]
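The interplay between the Rademacher term, the confidence parameter δ, and n can be sketched numerically. The summary does not reproduce the bound itself, so the sketch below assumes the standard Bartlett and Mendelson form, true error ≤ training error + R/2 + sqrt(ln(1/δ)/(2n)). The function name is hypothetical. The value 1.76 is the complexity reported later in the text for the Word domain at n = 5, while 0.1 and n = 1000 are made-up values for contrast.

```python
import math

def generalization_bound(train_error, rademacher, n, delta=0.05):
    """Assumed form: e_hat + R/2 + sqrt(ln(1/delta) / (2n))."""
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return train_error + rademacher / 2.0 + slack

# Large complexity and tiny n: the bound exceeds 1 and says nothing.
loose = generalization_bound(train_error=0.0, rademacher=1.76, n=5)

# Small complexity and large n (made-up values): the bound is informative.
tight = generalization_bound(train_error=0.0, rademacher=0.1, n=1000)
```

Under this assumed form, a bound above 1 is vacuous for a 0/1 error, which is one way to read "loose" in the sentences above.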

27 We will demonstrate this generalization error bound on four different human classiﬁcation tasks in Section 4. [sent-122, score-0.403]

28 3 Measuring Human Rademacher Complexity by Learning the Noise Our aim is to measure the Rademacher complexity of the human learning system for a given stimulus space X , distribution of instances PX , and sample-size n. [sent-123, score-0.484]

29 By “human learning system,” we mean the set of binary classiﬁcation functions that an average human subject can come up with on the domain X , under the experiment conditions described below. [sent-124, score-0.38]

30 3 With the two assumptions above, we can compute human Rademacher complexity for a given stimulus domain X , distribution PX , and sample size n, by assessing how well human participants are able to learn randomly-assigned labels. [sent-135, score-0.888]

31 Each participant is presented with a training sample $\{(x_i, \sigma_i)\}_{i=1}^{n}$ where the σ’s are random ±1 labels, and asked to learn the instance-label mapping. [sent-136, score-0.257]

32 We assume that the subject will search within $H_a$ for the best hypothesis (“rule”), which is the one that minimizes training error: $f^* = \arg\max_{f \in H_a} \sum_{i=1}^{n} \sigma_i f(x_i) = \arg\min_{f \in H_a} \hat{e}(f)$. [sent-138, score-0.177]

33 Later, we ask the subject to classify the same training instances $\{x_i\}_{i=1}^{n}$ using what she has learned. [sent-140, score-0.208]
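One way to see why maximizing the correlation with the random labels coincides with minimizing training error is the short derivation below. It assumes every hypothesis f in H_a outputs ±1, which matches the binary classification setting but is not spelled out in this sentence, and it writes the training error against the random labels σ.

```latex
\sigma_i f(x_i) = 1 - 2\,\mathbf{1}[\sigma_i \neq f(x_i)]
\quad\Longrightarrow\quad
\sum_{i=1}^{n} \sigma_i f(x_i)
  = n - 2\sum_{i=1}^{n} \mathbf{1}[\sigma_i \neq f(x_i)]
  = n\,\bigl(1 - 2\,\hat{e}(f)\bigr).
```

Because the sum is a decreasing function of the training error, the hypothesis that maximizes one minimizes the other.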

34 For the measured Rademacher complexity to reﬂect actual learning capacity on the set Ha , it is important to prevent participants from simply doing rote learning. [sent-146, score-0.407]

35 With these considerations, we propose the following procedure to estimate human Rademacher complexity. [sent-147, score-0.213]

36 Given domain X, distribution P_X, training sample size n, and number of subjects m, we generate m random samples of size n each: $\{(x_i^{(1)}, \sigma_i^{(1)})\}_{i=1}^{n}$, . [sent-149, score-0.256]

37 , $\{(x_i^{(m)}, \sigma_i^{(m)})\}_{i=1}^{n}$, where $x_i^{(j)} \overset{iid}{\sim} P_X$ and $\sigma_i^{(j)} \overset{iid}{\sim} \mathrm{Bernoulli}(\frac{1}{2}, \frac{1}{2})$ with value ±1, for j = 1 . [sent-152, score-0.22]
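The sampling step can be sketched as follows. This is an illustrative sketch only: it assumes a finite domain with a uniform P_X, and the word list and function name are hypothetical stand-ins rather than the study's actual materials.

```python
import numpy as np

def make_noise_samples(domain, m, n, seed=0):
    """Draw m samples of n (instance, label) pairs with random +/-1 labels."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(m):
        # Assumption: P_X is uniform over a finite domain; iid draws may repeat.
        idx = rng.integers(0, len(domain), size=n)
        # Fair coin flips mapped to -1 / +1.
        sigma = rng.choice([-1, 1], size=n)
        samples.append([(domain[i], int(s)) for i, s in zip(idx, sigma)])
    return samples

# Hypothetical stand-in for a word domain.
words = ["meadow", "grenade", "queen", "daylight", "hospital", "termite"]
samples = make_noise_samples(words, m=10, n=5)
```

Each inner list corresponds to one participant's sheet; the labels carry no information about the instances by construction.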

38 Participant j is shown a printed sheet with the training sample $\{(x_i^{(j)}, \sigma_i^{(j)})\}_{i=1}^{n}$, where each instance $x_i^{(j)}$ is paired with its random label $\sigma_i^{(j)}$ (shown as “A” and “B” instead of -1, 1 for convenience). [sent-157, score-0.265]

39 The participant is informed that there are only two categories; the order does not matter; they have three minutes to study the sheet; and later they will be asked to use what they have learned to categorize more instances into “A” or “B”. [sent-158, score-0.241]

40 To prevent active maintenance of training items in working memory the participant performs a ﬁller task consisting of ten two-digit addition/subtraction questions. [sent-161, score-0.335]

41 The participant is given another sheet with the same training instances $\{x_i\}_{i=1}^{n}$ but no labels. [sent-163, score-0.316]

42 The participant is not told that they are the same training instances, and is asked to categorize each instance as “A” or “B” and is encouraged to guess if necessary. [sent-165, score-0.296]
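Once each participant's answers on the unlabeled sheet are collected, a plausible estimator averages, over participants, the absolute correlation between the random labels and the participant's answers. The sketch below assumes that form (with the 2/n scaling and the absolute value); the summary does not show the estimator explicitly, so the function name and the example data are hypothetical.

```python
import numpy as np

def human_rademacher(sigmas, responses):
    """Mean over subjects of |(2/n) * sum_i sigma_i * response_i|."""
    sigmas = np.asarray(sigmas, dtype=float)        # shape (m, n), entries +/-1
    responses = np.asarray(responses, dtype=float)  # same shape, entries +/-1
    n = sigmas.shape[1]
    per_subject = np.abs((sigmas * responses).sum(axis=1)) * 2.0 / n
    return per_subject.mean()

# Two hypothetical participants with n = 5:
# the first reproduces every label, the second gets four of five.
sigmas = [[1, -1, 1, 1, -1], [-1, -1, 1, -1, 1]]
responses = [[1, -1, 1, 1, -1], [-1, -1, 1, -1, -1]]
estimate = human_rademacher(sigmas, responses)
```

Here the per-participant values are 2.0 and 1.2, so the estimate is 1.6.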

43 They are not of speciﬁc interest in themselves but nicely illustrate many interesting properties of human Rademacher complexity: (1) The “Shape” Domain. [sent-176, score-0.213]

44 A few instances and their emotion ratings are shown in Figure 1(b). [sent-187, score-0.162]

45 Each group of m = 10 subjects worked on a unique combination of the Shape or the Word domain, and training sample size n in 5, 10, 20, or 40, using the procedure deﬁned previously. [sent-192, score-0.17]

46 [Figure 1: (a) examples from the Shape domain; (b) examples from the Word domain; axis label: Rademacher complexity.] [sent-199, score-0.482]

47 Figures 1(c,d) show the measured human Rademacher complexities on the domains X =Shape and Word respectively, with distribution PX =uniform, and with different training sample sizes n. [sent-204, score-0.45]

48 Several interesting observations can be made from the data: Observation 1: human Rademacher complexities in both domains decrease as n increases. [sent-206, score-0.32]

49 Indeed, when n = 5, our interviews show that, in both domains, 9 out of 10 participants offered some spurious rules of the random labels. [sent-208, score-0.25]

50 For example, one subject thought the shape categories were determined by whether the shape “faces” downward; another thought the word categories indicated whether the word contains the letter T. [sent-209, score-0.585]

51 In contrast, when n = 40, about half the participants indicated that they believed the labels to be random, as spurious “rules” are more difﬁcult to ﬁnd. [sent-211, score-0.276]

52 Observation 2: human Rademacher complexities are signiﬁcantly higher in the Word domain than in the Shape domain, for n = 10, 20, 40 respectively (t-tests, p < 0. [sent-212, score-0.373]

53 The higher complexity indicates that, for the same sample sizes, participants are better able to ﬁnd spurious explanations of the training data for the Words than for the Shapes. [sent-214, score-0.505]

54 Two distinct strategies were apparent in the Word domain interviews: (i) Some participants created mnemonics. [sent-215, score-0.235]

55 (ii) Other participants came up with idiosyncratic, but often imperfect, rules. [sent-217, score-0.174]

56 We speculate that human Rademacher complexities on other domains can be drastically different too, reflecting the richness of the participant’s pre-existing knowledge about the domain. [sent-220, score-0.32]

57 Observation 3: many of these human Rademacher complexities are relatively large. [sent-221, score-0.287]

58 This means that under those X , PX , n, humans have a large capacity to learn arbitrary labels, and so will be more prone to overﬁt on real (i. [sent-222, score-0.165]

59 We will present human generalization experiments in Section 4. [sent-225, score-0.258]

60 It is also interesting to note that both Rademacher complexities at n = 5 are less than 2: under our procedure, participants are not perfect at remembering the labels of merely ﬁve instances. [sent-226, score-0.31]
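A short check of why 2 is the ceiling here, assuming the absolute-value form of the definition with the 2/n scaling and ±1-valued responses:

```latex
\left|\frac{2}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\right|
  \le \frac{2}{n}\sum_{i=1}^{n} \bigl|\sigma_i f(x_i)\bigr| = 2,
\qquad
\text{with equality iff } f(x_i) = \sigma_i \ \forall i
\ \text{ or } \ f(x_i) = -\sigma_i \ \forall i .
```

A measured value below 2 therefore means that at least some participants did not reproduce all n labels.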

61 4 Bounding Human Generalization Errors We reiterate the interpretation of human Rademacher complexity for psychology. [sent-227, score-0.368]

62 The tasks are: (1) Shape-+: Recall the Shape domain is parametrized by x ∈ [0, 1]. [sent-360, score-0.167]

63 Note, however, that the two shape tasks share the same Rademacher complexity, and therefore have the same bound for the same n. [sent-375, score-0.177]

64 P (y = 1|x) = 0 if word x has a negative emotion rating in the Wisconsin Perceptual Attribute Ratings Database, and P (y = 1|x) = 1 otherwise. [sent-377, score-0.177]

65 The two word tasks are drastically different in that one focuses on semantics and the other on orthography, but they share the same Rademacher complexity and thus the same bound (for the same n), because the underlying domain is the same. [sent-379, score-0.466]

66 The procedure is identical to that in Section 3 except for two things: (i) Instead of random labels σ, we sample labels $y \overset{iid}{\sim} P(y|x)$ appropriate for each task. [sent-381, score-0.234]

67 (ii) In step 3, in addition to the training instances $\{x_i^{(j)}\}_{i=1}^{n}$, the j-th subject is also given 100 test instances $\{x_i^{(j)}\}_{i=n+1}^{n+100}$, sampled from P_X. [sent-382, score-0.299]

68 The order of the training and test instances is randomized. [sent-383, score-0.176]

69 We compute the participant’s training sample error as $\hat{e}(f^{(j)}) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[y_i^{(j)} \neq f^{(j)}(x_i^{(j)})]$, and estimate her generalization error as $e(f^{(j)}) = \frac{1}{100}\sum_{i=n+1}^{n+100} \mathbf{1}[y_i^{(j)} \neq f^{(j)}(x_i^{(j)})]$. [sent-385, score-0.212]

70 We present the performance of individual participants in Table 1: $\hat{e}$ is the observed training error for that subject, “bound e” is the 95% confidence (i. [sent-389, score-0.19]

71 Figure 2: Human Rademacher complexity predicts the trend of overfitting. [axis label: Rademacher complexity] [sent-396, score-0.31]
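The two error rates can be computed directly from a participant's answers. This is a minimal sketch with hypothetical function names and toy data; it assumes the responses have been put back into "training items first, test items after" order, although the sheet presents them in random order.

```python
def error_rate(labels, predictions):
    """Fraction of positions where the prediction differs from the label."""
    mismatches = sum(1 for y, p in zip(labels, predictions) if y != p)
    return mismatches / len(labels)

def train_and_test_error(labels, predictions, n):
    """Split at n: the first n items are training items, the rest are test items."""
    train = error_rate(labels[:n], predictions[:n])
    test = error_rate(labels[n:], predictions[n:])
    return train, test

# Toy data: 5 training items followed by 4 test items (the study uses 100).
labels      = [1, -1, 1, -1, 1,   1, 1, -1, -1]
predictions = [1, -1, 1, -1, 1,  -1, 1, -1,  1]
train_err, test_err = train_and_test_error(labels, predictions, n=5)
gap = test_err - train_err   # positive gap suggests overfitting
```

In this toy case the participant is perfect on the training items and wrong on half of the test items, so the gap is 0.5.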

72 Table 1 provides empirical support that our application of computational learning theory to human learning is viable. [sent-401, score-0.213]

73 05, Theorem 2 allows the bound to fail on about two (5% of 40) participants – which did not happen. [sent-403, score-0.201]

74 Figure 2 shows that as R increases, $e - \hat{e}$ increases (solid line; error bar ±standard error; averaged over the two different tasks with the same domain and n, as noted in the graph). [sent-408, score-0.179]

75 For example, the Word domain and n = 5 has a large Rademacher complexity 1. [sent-411, score-0.241]

76 76, and both task WordLength and task WordEmotion severely overﬁt: In task WordLength with n = 5, all subjects had zero training error but had large test error, suggesting that their good performance on the training items reﬂects overﬁtting. [sent-412, score-0.441]

77 Accordingly, the explanations offered during the post-test interviews for this group spuriously ﬁt the training items but did not reﬂect the true categorization rule. [sent-413, score-0.266]

78 Subject 111 thought that the class decision indicated “things you can go inside,” while subject 114 thought the class indicated an odd or even number of syllables. [sent-414, score-0.176]

79 Subject 102 received the training items (daylight, 1), (hospital, -1), (termite, -1), (envy, -1), (scream, -1), and concluded that class 1 is “anything related to omitting[sic] light,” and proceeded to classify the test items as such. [sent-416, score-0.223]

80 This distinction is not necessarily relevant here, as we adopt an abstract perspective which analyzes the human system as a black box that produces labels, and both learning and memory contribute to the process being executed in that black box. [sent-418, score-0.27]

81 Much recent research has explored the relationship between the statistical complexity of some categorization task and the ease with which humans learn the task [7, 5, 9, 11]. [sent-420, score-0.332]

82 Rademacher complexity is different: it indexes not the complexity of the X → Y categorization task, but the sophistication of the learner in domain X (note Y does not appear in Rademacher complexity). [sent-421, score-0.465]

83 Greater complexity indicates, not a more difﬁcult categorization task, but a greater tendency to overﬁt sparse data. [sent-422, score-0.23]

84 7 On the other hand, our deﬁnition of Rademacher complexity depends only on the domain, distribution, and sample size. [sent-423, score-0.2]

85 In human learning, other factors also contribute to learnability, such as the instructions, motivation, time to study, and should probably be incorporated into the complexity. [sent-424, score-0.213]

86 Unfortunately, human VC-dimension seems difficult to measure experimentally: First, shattering requires validating an exponential ($2^m$) number of classifications on a given subset. [sent-440, score-0.237]

87 Normalized Maximum Likelihood (NML) uses a similar complexity measure for a model class [21]; the connection merits further study ([23], p. [sent-445, score-0.179]
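To give a feel for the exponential growth mentioned here, the sketch below enumerates every ±1 labeling of a subset of size m and tabulates 2^m for a few sizes. The sizes chosen and the function name are illustrative assumptions only.

```python
from itertools import product

def labelings(m):
    """All +/-1 labelings of m instances; shattering requires realizing each one."""
    return list(product([-1, 1], repeat=m))

# Number of labelings a participant would need to demonstrate for a subset of size m.
counts = {m: 2 ** m for m in (5, 10, 20, 40)}
five_item_labelings = labelings(5)
```

Even a five-item subset already needs 32 separate labelings to be checked, which is why a direct shattering test quickly becomes impractical with human participants.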

88 Human Rademacher complexity might help to advance theories of human cognition in many ways. [sent-447, score-0.395]

89 First, human Rademacher complexity can provide a means of testing computational models of human concept learning. [sent-448, score-0.613]

90 Traditionally, such models are assessed by comparing their performance to human performance in terms of classiﬁcation error. [sent-449, score-0.213]

91 A new approach would be to derive or empirically estimate the Rademacher complexity of the computational models, and compare that to human Rademacher complexity. [sent-450, score-0.368]

92 Second, our procedure could be used to measure human Rademacher complexity in individuals or special populations, including typically and atypically-developing children and adults. [sent-452, score-0.392]

93 Third, human Rademacher complexity may help explain the human tendency to discern patterns in random stimuli, such as the well-known Rorschach inkblot test, “illusory correlations” [4], or the “false memory” effect [22]. [sent-456, score-0.613]

94 These effects may be viewed as spurious rule-ﬁtting to (or generalization of) the observed data, and Human Rademacher complexity may quantify the possibility of observing such an effect. [sent-457, score-0.245]

95 Fourth, cognitive psychologists have long entertained an interest in characterizing the capacity of different mental processes such as, for instance, the capacity limitations of short-term memory [19, 6]. [sent-458, score-0.365]

96 In this vein, our work suggests a different kind of metric for assessing the capacity of the human learning system. [sent-459, score-0.316]

97 Finally, human Rademacher complexity can help experimental psychologists to determine the propensity of overﬁtting in their stimulus materials. [sent-460, score-0.42]

98 We have seen that human Rademacher complexity can be much higher in some domains (e. [sent-461, score-0.401]

99 Our procedure could be used to measure the human Rademacher complexity of many standard concept-learning materials in cognitive science, such as the Greebles used by Tarr and colleagues [8] and the circle-and-line stimuli of McKinley & Nosofsky [17]. [sent-466, score-0.499]

100 XZ thanks Michael Coen for discussions that led to the realization of the difficulties in measuring human VC dimension. [sent-468, score-0.238]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rademacher', 0.731), ('px', 0.235), ('human', 0.213), ('complexity', 0.155), ('participants', 0.149), ('ha', 0.146), ('word', 0.121), ('capacity', 0.103), ('participant', 0.101), ('wordemotion', 0.093), ('wordlength', 0.093), ('domain', 0.086), ('training', 0.085), ('cognitive', 0.077), ('tting', 0.076), ('iid', 0.075), ('pxy', 0.075), ('complexities', 0.074), ('shape', 0.073), ('xi', 0.07), ('sheet', 0.065), ('instances', 0.065), ('humans', 0.062), ('subject', 0.058), ('memory', 0.057), ('labels', 0.057), ('items', 0.056), ('emotion', 0.056), ('emotional', 0.056), ('interviews', 0.056), ('valence', 0.056), ('psychology', 0.054), ('tasks', 0.052), ('bound', 0.052), ('magical', 0.049), ('categorize', 0.049), ('wisconsin', 0.048), ('spurious', 0.045), ('sample', 0.045), ('generalization', 0.045), ('categorization', 0.043), ('danger', 0.042), ('ratings', 0.041), ('error', 0.041), ('subjects', 0.04), ('tight', 0.039), ('supremum', 0.038), ('beings', 0.037), ('grenade', 0.037), ('illusory', 0.037), ('meadow', 0.037), ('shattered', 0.037), ('supf', 0.037), ('shapes', 0.037), ('vc', 0.036), ('task', 0.036), ('told', 0.035), ('hypothesis', 0.034), ('thought', 0.034), ('domains', 0.033), ('classi', 0.033), ('mckinley', 0.033), ('concept', 0.032), ('tendency', 0.032), ('ect', 0.031), ('attribute', 0.03), ('materials', 0.03), ('remembering', 0.03), ('mendelson', 0.03), ('parametrized', 0.029), ('rhs', 0.028), ('xm', 0.027), ('cognition', 0.027), ('stimulus', 0.027), ('verbal', 0.026), ('explanations', 0.026), ('iq', 0.026), ('vapnik', 0.026), ('students', 0.026), ('learner', 0.026), ('words', 0.026), ('test', 0.026), ('asked', 0.026), ('bartlett', 0.026), ('psychologists', 0.025), ('came', 0.025), ('measuring', 0.025), ('indicated', 0.025), ('theorem', 0.025), ('formal', 0.024), ('respects', 0.024), ('queen', 0.024), ('measure', 0.024), ('questions', 0.024), ('perceptual', 0.024), ('bounding', 0.024), ('separability', 0.023), ('categories', 0.023), ('rich', 0.023), 
('binary', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 112 nips-2009-Human Rademacher Complexity

Author: Xiaojin Zhu, Bryan R. Gibson, Timothy T. Rogers

2 0.13325304 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

Author: Peter Orbanz

Abstract: We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finite-dimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data.

3 0.12458058 196 nips-2009-Quantification and the language of thought

Author: Charles Kemp

Abstract: Many researchers have suggested that the psychological complexity of a concept is related to the length of its representation in a language of thought. As yet, however, there are few concrete proposals about the nature of this language. This paper makes one such proposal: the language of thought allows ﬁrst order quantiﬁcation (quantiﬁcation over objects) more readily than second-order quantiﬁcation (quantiﬁcation over features). To support this proposal we present behavioral results from a concept learning study inspired by the work of Shepard, Hovland and Jenkins. Humans can learn and think about many kinds of concepts, including natural kinds such as elephant and water and nominal kinds such as grandmother and prime number. Understanding the mental representations that support these abilities is a central challenge for cognitive science. This paper proposes that quantiﬁcation plays a role in conceptual representation—for example, an animal X qualiﬁes as a predator if there is some animal Y such that X hunts Y . The concepts we consider are much simpler than real-world examples such as predator, but even simple laboratory studies can provide important clues about the nature of mental representation. Our approach to mental representation is based on the language of thought hypothesis [1]. As pursued here, the hypothesis proposes that mental representations are constructed in a compositional language of some kind, and that the psychological complexity of a concept is closely related to the length of its representation in this language [2, 3, 4]. Following previous researchers [2, 4], we operationalize the psychological complexity of a concept in terms of the ease with which it is learned and remembered. Given these working assumptions, the remaining challenge is to specify the representational resources provided by the language of thought. 
Some previous studies have relied on propositional logic as a representation language [2, 5], but we believe that the resources of predicate logic are needed to capture the structure of many human concepts. In particular, we suggest that the language of thought can accommodate relations, functions, and quantiﬁcation, and focus here on the role of quantiﬁcation. Our primary proposal is that quantiﬁcation is supported by the language of thought, but that quantiﬁcation over objects is psychologically more natural than quantiﬁcation over features. To test this idea we compare concept learning in two domains which are very similar except for one critical difference: one domain allows quantiﬁcation over objects, and the other allows quantiﬁcation over features. We consider several logical languages that can be used to formulate concepts in both domains, and ﬁnd that learning times are best predicted by a language that supports quantiﬁcation over objects but not features. Our work illustrates how theories of mental representation can be informed by comparing concept learning across two or more domains. Existing studies work with a range of domains, and it is useful to consider a “conceptual universe” that includes these possibilities along with many others that have not yet been studied. Table 1 charts a small fragment of this universe, and the penultimate column shows example stimuli that will be familiar from previous studies of concept learning. Previous studies have made important contributions by choosing a single domain in Table 1 and explaining 1 why some concepts within this domain are easier to learn than others [2, 4, 6, 7, 8, 9]. Comparisons across domains can also provide important information about learning and mental representation, and we illustrate this claim by comparing learning times across Domains 3 and 4. The next section introduces the conceptual universe in Table 1 in more detail. 
We then present a formal approach to concept learning that relies on a logical language and compare three candidate languages. Language OQ (for object quantiﬁcation) supports quantiﬁcation over objects but not features, language F Q (for feature quantiﬁcation) supports quantiﬁcation over features but not objects, and language OQ + F Q supports quantiﬁcation over both objects and features. We use these languages to predict learning times across Domains 3 and 4, and present an experiment which suggests that language OQ comes closest to the language of thought. 1 The conceptual universe Table 1 provides an organizing framework for thinking about the many domains in which learning can occur. The table includes 8 domains, each of which is deﬁned by specifying some number of objects, features, and relations, and by specifying the range of each feature and each relation. We refer to the elements in each domain as items, and the penultimate column of Table 1 shows items from each domain. The ﬁrst row shows a domain commonly used by studies of Boolean concept learning. Each item in this domain includes a single object a and speciﬁes whether that object has value v1 (small) or v2 (large) on feature F (size), value v3 (white) or v4 (gray) on feature G (color), and value v5 (vertical) or v6 (horizontal) on feature H (texture). Domain 2 also includes three features, but now each item includes three objects and each feature applies to only one of the objects. For example, feature H (texture) applies to only the third object in the domain (i.e. the third square on each card). Domain 3 is similar to Domain 1, but now the three features can be aligned— for any given item each feature will be absent (value 0) or present. The example in Table 1 uses three features (boundary, dots, and slash) that can each be added to an unadorned gray square. 
Domain 4 is similar to Domain 2, but again the feature values can be aligned, and the feature for each object will be absent (value 0) or present. Domains 5 and 6 are similar to domains 2 and 4 respectively, but each one includes relations rather than features. In Domain 6, for example, the relation R assigns value 0 (absent) or value 1 (present) to each undirected pair of objects. The ﬁrst six domains in Table 1 are all variants of Domain 1, which is the domain typically used by studies of Boolean concept learning. Focusing on six related domains helps to establish some of the dimensions along which domains can differ, but the ﬁnal two domains in Table 1 show some of the many alternative possibilities. Domain 7 includes two categorical features, each of which takes three rather than two values. Domain 8 is similar to Domain 6, but now the number of objects is 6 rather than 3 and relation R is directed rather than undirected. To mention just a handful of possibilities which do not appear in Table 1, domains may also have categorical features that are ordered (e.g. a size feature that takes values small, medium, and large), continuous valued features or relations, relations with more than two places, and objects that contain sub-objects or parts. Several learning problems can be formulated within any given domain. The most basic is to learn a single item—for example, a single item from Domain 8 [4]. A second problem is to learn a class of items—for example, a class that includes four of the items in Domain 1 and excludes the remaining four [6]. Learning an item class can be formalized as learning a unary predicate deﬁned over items, and a natural extension is to consider predicates with two or more arguments. For example, problems of the form A is to B as C is to ? can be formulated as problems where the task is to learn a binary relation analogous(·, ·) given the single example analogous(A, B). 
Here, however, we focus on the task of learning item classes or unary predicates. Since we focus on the role of quantification, we will work with domains where quantification is appropriate. Quantification over objects is natural in cases like Domain 4 where the feature values for all objects can be aligned. Note, for example, that the statement “every object has its feature” picks out the final example item in Domain 4 but that no such statement is possible in Domain 2. Quantification over features is natural in cases like Domain 3 where the ranges of each feature can be aligned. For example, “object a has all three features” picks out the final example item in Domain 3 but no such statement is possible in Domain 1. We therefore focus on Domains 3 and 4, and explore the problem of learning item classes in each domain.
# | Objects O | Features | Relations
1 | {a} | F: O → {v1, v2}; G: O → {v3, v4}; H: O → {v5, v6} | -
2 | {a, b, c} | F: a → {v1, v2}; G: b → {v3, v4}; H: c → {v5, v6} | -
3 | {a} | F: O → {0, v1}; G: O → {0, v2}; H: O → {0, v3} | -
4 | {a, b, c} | F: a → {0, v1}; G: b → {0, v2}; H: c → {0, v3} | -
5 | {a, b, c} | - | R: (a, b) → {v1, v2}; S: (a, c) → {v3, v4}; T: (b, c) → {v5, v6}
6 | {a, b, c} | - | R: O × O → {0, 1}
7 | {a} | F: O → {v1, v2, v3}; G: O → {v4, v5, v6} | -
8 | {a, b, c, d, e, f} | - | R: O × O → {0, 1}
Ref.: [4] [8, 9] [13] [6] [12] [6] [2, 6, 7, 10, 11]
Table 1: The conceptual universe. Eight domains are shown, and each one is defined by a set of objects, a set of features, and a set of relations. We call the members of each domain items, and an item is created by specifying the extension of each feature and relation in the domain. The six domains above the double lines are closely related to the work of Shepard et al. [6]. Each one includes eight items which differ along three dimensions. 
These dimensions, however, emerge from different underlying representations in the six cases. Objects O # (a) (b) 1 (I) 2 (II) 3 (III) 4 (III) 5 (IV) 6 (IV) 7 (V) 8 (V) 9 (V) 10 (VI) 111 110 101 011 100 010 001 000 Figure 1: (a) A stimulus lattice for domains (e.g. Domains 3, 4, and 6) that can be encoded as a triple of binary values where 0 represents “absent” and 1 represents “present.” (b) If the order of the values in the triple is not signiﬁcant, there are 10 distinct ways to partition the lattice into two classes of four items. The SHJ type for each partition is shown in parentheses. Domains 3 and 4 both include 8 items each and we will consider classes that include exactly four of these items. Each item in these domains can be represented as a triple of binary values, where 0 indicates that a feature is absent and value 1 indicates that a feature is present. Each triple represents the values of the three features (Domain 3) or the feature values for the three objects (Domain 4). By representing each domain in this way, we have effectively adopted domain speciﬁcations that are simpliﬁcations of those shown in Table 1. Domain 3 is represented using three features of the form F, G, H : O → {0, 1}, and Domain 4 is represented using a single feature of the form F : O → {0, 1}. Simpliﬁcations of this kind are possible because the features in each domain can be aligned—notice that no corresponding simpliﬁcations are possible for Domains 1 and 2. The eight binary triples in each domain can be organized into the lattice shown in Figure 1a. Here we consider all ways to partition the vertices of the lattice into two groups of four. If partitions that differ only up to a permutation of the features (Domain 3) or objects (Domain 4) are grouped into equivalence classes, there are ten of these classes, and a representative of each is shown in Figure 1b. 
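The count of ten equivalence classes can be checked by brute force. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and items are encoded as binary triples as in Figure 1a. It enumerates all 35 ways to split the eight triples into two unlabeled classes of four and groups them by orbit under permutations of the three positions. For comparison, it also counts the classes obtained when the two values at each position may additionally be swapped.

```python
from itertools import combinations, permutations, product

ITEMS = list(product((0, 1), repeat=3))  # the eight binary triples of the lattice


def partitions():
    """All ways to split the eight items into two unlabeled classes of four."""
    everything = frozenset(ITEMS)
    found = set()
    for cls in combinations(ITEMS, 4):
        a = frozenset(cls)
        found.add(frozenset((a, everything - a)))
    return found


def transform(item, perm, flips):
    """Reorder the positions of a triple, then optionally flip each value."""
    return tuple(item[p] ^ f for p, f in zip(perm, flips))


def count_classes(allow_flips):
    """Number of equivalence classes of partitions under the chosen symmetries."""
    flip_options = list(product((0, 1), repeat=3)) if allow_flips else [(0, 0, 0)]
    symmetries = [(p, f) for p in permutations(range(3)) for f in flip_options]
    orbits = set()
    for part in partitions():
        # The orbit of a partition is the set of its images under all symmetries;
        # two partitions are equivalent exactly when their orbits coincide.
        orbit = frozenset(
            frozenset(frozenset(transform(i, p, f) for i in cls) for cls in part)
            for p, f in symmetries
        )
        orbits.add(orbit)
    return len(orbits)


print(count_classes(allow_flips=False))  # 10: permutations of positions only
print(count_classes(allow_flips=True))   # 6: value swaps also allowed
```

Representing each class by its full orbit avoids having to choose a canonical ordering of items, at the cost of some redundant work that is negligible at this size.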
Previous researchers [6] have pointed out that the stimuli in Domain 1 can be organized into a cube similar to Figure 1a, and that there are six ways to partition these stimuli into two groups of four up to permutations of the features and permutations of the range of each feature. We refer to these equivalence classes as the six Shepard-Hovland-Jenkins types (or SHJ types), and each partition in Figure 1b is labeled with its corresponding SHJ type label. Note, for example, that partitions 3 and 4 are both examples of SHJ type III. For us, partitions 3 and 4 are distinct since items 000 (all absent) and 111 (all present) are uniquely identifiable, and partition 3 assigns these items to different classes but partition 4 does not.

Previous researchers have considered differences between some of the first six domains in Table 1. Shepard et al. [6] ran experiments using compact stimuli (Domain 1) and distributed stimuli (Domains 2 and 4), and observed the same difficulty ranking of the six SHJ types in all cases. Their work, however, does not acknowledge that Domain 4 leads to 10 distinct types rather than 6, and therefore fails to address issues such as the relative complexities of concepts 5 and 6 in Figure 1. Social psychologists [13, 14] have studied Domain 6 and found that learning patterns depart from the standard SHJ order: in particular, SHJ type VI (Concept 10 in Figure 1) is simpler than types III, IV and V. This finding has been used to support the claim that social learning relies on a domain-specific principle of structural balance [14]. We will see, however, that the relative simplicity of type VI in domains like 4 and 6 is consistent with a domain-general account based on representational economy.

2 A representation length approach to concept learning

The conceptual universe in Table 1 calls for an account of learning that can apply across many domains.
One candidate is the representation length approach, which proposes that concepts are mentally represented in a language of thought, and that the subjective complexity of a concept is determined by the length of its representation in this language [4]. We consider the case where a concept corresponds to a class of items, and explore the idea that these concepts are mentally represented in a logical language. More formally, a concept is represented as a logical sentence, and the concept includes all models of this sentence, or all items that make the sentence true. The predictions of this representation length approach depend critically on the language chosen. Here we consider three languages: an object quantification language OQ that supports quantification over objects, a feature quantification language FQ that supports quantification over features, and a language OQ + FQ that supports quantification over both objects and features. Language OQ is based on a standard logical language known as predicate logic with equality. The language includes symbols representing objects (e.g. a and b) and features (e.g. F and G), and these symbols can be combined to create literals that indicate that an object does (Fa) or does not (Fa′) have a certain feature. Literals can be combined using two connectives: AND (Fa Ga) and OR (Fa + Ga). The language includes two quantifiers, for all (∀) and there exists (∃), and allows quantification over objects (e.g. ∀x Fx, where x is a variable that ranges over all objects in the domain). Finally, language OQ includes equality and inequality relations (= and ≠) which can be used to compare objects and object variables (e.g. =xa or ≠xy). Table 2 shows several sentences formulated in language OQ. Suppose that the OQ complexity of each sentence is defined as the number of basic propositions it contains, where a basic proposition can be a positive or negative literal (Fa or Fa′) or an equality or inequality statement (=xa or ≠xy).
Equivalently, the complexity of a sentence is the total number of ANDs plus the total number of ORs plus one. This measure is equivalent by design to Feldman's [2] notion of Boolean complexity when applied to a sentence without quantification. The complexity values in Table 2 show minimal complexity values for each concept in Domains 3 and 4. Table 2 also shows a single sentence that achieves each of these complexity values, although some concepts admit multiple sentences of minimal complexity. The complexity values in Table 2 were computed using an "enumerate then combine" approach. We began by enumerating a set of sentences according to criteria described in the next paragraph. Each sentence has an extension that specifies which items in the domain are consistent with the sentence. Given the extensions of all sentences generated during the enumeration phase, the combination phase considered all possible ways to combine these extensions using conjunctions or disjunctions. The procedure terminated once extensions corresponding to all of the concepts in the domain had been found. Although the number of possible sentences grows rapidly as the complexity of these sentences increases, the number of extensions is fixed and relatively small (2^8 = 256 for domains of size 8). The combination phase is tractable since sentences with the same extension can be grouped into a single equivalence class. The enumeration phase considered all formulae which had at most two quantifiers and which had a complexity value lower than four. For example, this phase did not include the formula ∃x ∃y ∃z ≠yz Fx′ Fy Fz (too many quantifiers) or the formula ∀x ∃y ≠xy Fy (Fx + Gx + Hx) (complexity too high). Despite these restrictions, we believe that the complexity values in Table 2 are identical to the values that would be obtained if we had considered all possible sentences. Language FQ is similar to OQ but allows quantification over features rather than objects.
For example, FQ includes the statement ∀Q Qa, where Q is a variable that ranges over all features in the domain. Language FQ also allows features and feature variables to be compared for equality or inequality (e.g. =QF or ≠QR). Since FQ and OQ are closely related, it follows that the FQ complexity values for Domains 3 and 4 are identical to the OQ complexity values for Domains 4 and 3 respectively. For example, FQ can express concept 5 in Domain 3 as ∀Q ∃R ≠QR Ra. We can combine OQ and FQ to create a language OQ + FQ that allows quantification over both objects and features. Allowing both kinds of quantification leads to identical complexity values for Domains 3 and 4. Language OQ + FQ can express each of the formulae for Domain 4 in Table 2, and these formulae can be converted into corresponding formulae for Domain 3 by translating each instance of object quantification into an instance of feature quantification. Logicians distinguish between first-order logic, which allows quantification over objects but not predicates, and second-order logic, which allows quantification over objects and predicates. The difference between languages OQ and OQ + FQ is superficially similar to the difference between first-order and second-order logic, but does not cut to the heart of this matter. Since language OQ + FQ only supports quantification over a pre-specified set of features, it is equivalent to a typed first order logic that includes types for objects and features [15]. Future studies, however, can explore the cognitive relevance of higher-order logic as developed by logicians.

#    C (Domain 3)   Formula (Domain 3)                               C (Domain 4)   Formula (Domain 4)
1    1              Ga                                               1              Fb
2    4              Fa Ha + Fa′ Ha′                                  4              Fa Fc + Fa′ Fc′
3    4              Fa′ Ga + Fa Ha                                   4              Fa′ Fb + Fa Fc
4    4              Fa′ Ga′ + Fa Ha                                  4              Fa′ Fb′ + Fa Fc
5    5              Ga (Fa + Ha) + Fa Ha                             2              ∀x ∃y ≠xy Fy
6    5                                                               3
7    6                                                               4
8    6                                                               6
9    6                                                               4
10   10             Ga′ (Fa Ha′ + Fa′ Ha) + Ga (Fa′ Ha′ + Fa Ha)     4              (∀x Fx) + ∃y ∀z Fy (=zy + Fz′)

Table 2: Complexity values C and corresponding formulae for language OQ. Boolean complexity predicts complexity values for both domains that are identical to the OQ complexity values shown here for Domain 3. Language FQ predicts complexity values for Domains 3 and 4 that are identical to the OQ values for Domains 4 and 3 respectively. Language OQ + FQ predicts complexity values for both domains that are identical to the OQ complexity values for Domain 4.

3 Experiment

Now that we have introduced languages OQ, FQ and OQ + FQ, our theoretical proposals can be sharply formulated. We suggest that quantification over objects plays an important role in mental representations, and predict that OQ complexity will account better for human learning than Boolean complexity. We also propose that quantification over objects is more natural than quantification over features, and predict that OQ complexity will account better for human learning than both FQ complexity and OQ + FQ complexity. We tested these predictions by designing an experiment where participants learned concepts from Domains 3 and 4.

Method. Twenty adults participated for course credit. Each participant was assigned to Domain 3 or Domain 4 and learned all ten concepts from that domain. The items used for each domain were the cards shown in Table 1. Note, for example, that each Domain 3 card showed one square, and that each Domain 4 card showed three squares. These items are based on stimuli developed by Sakamoto and Love [12]. The experiment was carried out using a custom built graphical interface. For each learning problem in each domain, all eight items were simultaneously presented on the screen, and participants were able to drag them around and organize them however they liked.
Each problem had three phases. During the learning phase, the four items belonging to the current concept had red boundaries, and the remaining four items had blue boundaries. During the memory phase, these colored boundaries were removed, and participants were asked to sort the items into the red group and the blue group. If they made an error they returned to the learning phase, and could retake the test whenever they were ready. During the description phase, participants were asked to provide a written description of the two groups of cards. The color assignments (red or blue) were randomized across participants: in other words, the "red groups" learned by some participants were identical to the "blue groups" learned by others. The order in which participants learned the 10 concepts was also randomized.

Model predictions. The OQ complexity values for the ten concepts in each domain are shown in Table 2 and plotted in Figure 2a. The complexity values in Figure 2a have been normalized so that they sum to one within each domain, and the differences of these normalized scores are shown in the final row of Figure 2a. The two largest bars in the difference plot indicate that Concepts 10 and 5 are predicted to be easier to learn in Domain 4 than in Domain 3.

[Figure 2: bar plots over Concepts 1 to 10 in two columns, (a) OQ complexity and (b) Learning time, each with rows for Domain 3, Domain 4, and Difference.]

Figure 2: Normalized OQ complexity values and normalized learning times for the 10 concepts in Domains 3 and 4.

Language OQ can express statements like "either 1 or 3 objects have F" (Concept 10 in Domain 4), or "2 or more objects have F" (Concept 5 in Domain 4). Since quantification over features is not permitted, however, analogous statements (e.g. "object a has either 1 or 3 features") cannot be formulated in Domain 3.
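To make the two quantified statements concrete, the following sketch evaluates them over the eight Domain 4 items. It is an illustration under stated assumptions rather than the paper's implementation: items are encoded as binary triples over objects a, b, c, the function names are hypothetical, and the two formulas are the Domain 4 entries for Concepts 5 and 10 in Table 2, read with an inequality in Concept 5.

```python
from itertools import product

OBJECTS = range(3)                       # objects a, b, c
ITEMS = list(product((0, 1), repeat=3))  # item[x] == 1 means object x has F


def concept_5(item):
    """For all x there exists y != x with F_y, i.e. 2 or more objects have F."""
    return all(any(y != x and item[y] for y in OBJECTS) for x in OBJECTS)


def concept_10(item):
    """(For all x, F_x) or (exists y, for all z: F_y and (z = y or not F_z))."""
    all_have_f = all(item[x] for x in OBJECTS)
    exactly_one = any(
        item[y] and all(z == y or not item[z] for z in OBJECTS)
        for y in OBJECTS
    )
    return all_have_f or exactly_one


def extension(concept):
    """The set of items (written as strings such as '110') that satisfy a concept."""
    return {"".join(map(str, item)) for item in ITEMS if concept(item)}


print(sorted(extension(concept_5)))   # items with 2 or 3 objects having F
print(sorted(extension(concept_10)))  # items with exactly 1 or exactly 3
```

Each extension contains four of the eight items, as required for the classes considered here. The second one is the odd-parity class, which is why a short quantified description of Concept 10 is available in Domain 4.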
Concept 10 corresponds to SHJ type VI, which often emerges as the most difficult concept in studies of Boolean concept learning. Our model therefore predicts that the standard ordering of the SHJ types will not apply in Domain 4. Our model also predicts that concepts assigned to the same SHJ type will have different complexities. In Domain 4 the model predicts that Concept 6 will be harder to learn than Concept 5 (both are examples of SHJ type IV), and that Concept 8 will be harder to learn than Concepts 7 or 9 (all three are examples of SHJ type V).

Results. The computer interface recorded the amount of time participants spent on the learning phase for each concept. Domain 3 was a little more difficult than Domain 4 overall: on average, Domain 3 participants took 557 seconds and Domain 4 participants took 467 seconds to learn the 10 concepts. For all remaining analyses, we consider learning times that are normalized to sum to 1 for each participant. Figure 2b shows the mean values for these normalized times, and indicates the relative difficulties of the concepts within each condition. The difference plot in Figure 2b supports the two main predictions identified previously. Concepts 10 and 5 are the cases that differ most across the domains, and both concepts are easier to learn in Domain 4 than Domain 3. As predicted, Concept 5 is substantially easier than Concept 6 in Domain 4 even though both correspond to the same SHJ type. Concepts 7 through 9 also correspond to the same SHJ type, and the data for Domain 4 suggest that Concept 8 is the most difficult of the three, although the difference between Concepts 8 and 7 is not especially large. Four sets of complexity predictions are plotted against the human data in Figure 3. Boolean complexity and OQ complexity make identical predictions about Domain 3, and OQ complexity and OQ + FQ complexity make identical predictions about Domain 4. Only OQ complexity, however, accounts for the results observed in both domains.
The concept descriptions generated by participants provide additional evidence that there are psychologically important differences between Domains 3 and 4. If the descriptions for concepts 5 and 10 are combined, 18 out of 20 responses in Domain 4 referred to quantification or counting. One representative description of Concept 5 stated that "red has multiple filled" and that "blue has one filled or none." Only 3 of 20 responses in Domain 3 mentioned quantification. One representative description of Concept 5 stated that "red = multiple features" and that "blue = only one feature."

[Figure 3: eight scatter plots of normalized learning time (vertical axis) against normalized complexity (horizontal axis). Domain 3 row: Boolean complexity r=0.84, OQ complexity r=0.84, FQ complexity r=0.26, OQ + FQ complexity r=0.26. Domain 4 row: Boolean complexity r=0.27, OQ complexity r=0.83, FQ complexity r=0.27, OQ + FQ complexity r=0.83.]

Figure 3: Normalized learning times for each domain plotted against normalized complexity values predicted by four languages: Boolean logic, OQ, FQ and OQ + FQ.

These results suggest that people can count or quantify over features, but that it is psychologically more natural to quantify over objects rather than features. Although we have focused on three specific languages, the results in Figure 2b can be used to evaluate alternative proposals about the language of thought. One such alternative is an extension of Language OQ that allows feature values to be compared for equality. This extended language supports concise representations of Concept 2 in both Domain 3 (Fa = Ha) and Domain 4 (Fa = Fc), and predicts that Concept 2 will be easier to learn than all other concepts except Concept 1. Note, however, that this prediction is not compatible with the data in Figure 2b. Other languages might also be considered, but we know of no simple language that will account for our data better than OQ.
4 Conclusion

Comparing concept learning across qualitatively different domains can provide valuable information about the nature of mental representation. We compared two domains that are similar in many respects, but that differ according to whether they include a single object (Domain 3) or multiple objects (Domain 4). Quantification over objects is possible in Domain 4 but not Domain 3, and this difference helps to explain the different learning patterns we observed across the two domains. Our results suggest that concept representations can incorporate quantification, and that quantifying over objects is more natural than quantifying over features. The model predictions we reported are based on a language (OQ) that is a generic version of first order logic with equality. Our results therefore suggest that some of the languages commonly considered by logicians (e.g. first order logic with equality) may indeed capture some aspects of the "laws of thought" [16]. A simple language like OQ offers a convenient way to explore the role of quantification, but this language will need to be refined and extended in order to provide a more accurate account of mental representation. For example, a comprehensive account of the language of thought will need to support quantification over features in some cases, but might be formulated so that quantification over features is typically more costly than quantification over objects. Many possible representation languages can be imagined and a large amount of empirical data will be needed to identify the language that comes closest to the language of thought. Many relevant studies have already been conducted [2, 6, 8, 9, 13, 17], but there are vast regions of the conceptual universe (Table 1) that remain to be explored. Navigating this universe is likely to involve several challenges, but web-based experiments [18, 19] may allow it to be explored at a depth and scale that are currently unprecedented.
Characterizing the language of thought is undoubtedly a long term project, but modern methods of data collection may support rapid progress towards this goal.

Acknowledgments

I thank Maureen Satyshur for running the experiment. This work was supported in part by NSF grant CDI-0835797.

References

[1] J. A. Fodor. The language of thought. Harvard University Press, Cambridge, 1975.
[2] J. Feldman. Minimization of Boolean complexity in human concept learning. Nature, 407:630–633, 2000.
[3] D. Fass and J. Feldman. Categorization under complexity: A unified MDL account of human learning of regular and irregular categories. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 35–34. MIT Press, Cambridge, MA, 2003.
[4] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008.
[5] N. D. Goodman, J. B. Tenenbaum, J. Feldman, and T. L. Griffiths. A rational analysis of rule-based concept learning. Cognitive Science, 32(1):108–154, 2008.
[6] R. N. Shepard, C. I. Hovland, and H. M. Jenkins. Learning and memorization of classifications. Psychological Monographs, 75(13), 1961. Whole No. 517.
[7] R. M. Nosofsky, M. Gluck, T. J. Palmeri, S. C. McKinley, and P. Glauthier. Comparing models of rule-based classification learning: A replication and extension of Shepard, Hovland, and Jenkins (1961). Memory and Cognition, 22:352–369, 1994.
[8] M. D. Lee and D. J. Navarro. Extending the ALCOVE model of category learning to featural stimulus domains. Psychonomic Bulletin and Review, 9(1):43–58, 2002.
[9] C. D. Aitkin and J. Feldman. Subjective complexity of categories defined over three-valued features. In R. Sun and N. Miyake, editors, Proceedings of the 28th Annual Conference of the Cognitive Science Society, pages 961–966. Psychology Press, New York, 2006.
[10] F. Mathy and J. Bradmetz. A theory of the graceful complexification of concepts and their learnability. Current Psychology of Cognition, 22(1):41–82, 2004.
[11] R. Vigo. A note on the complexity of Boolean concepts. Journal of Mathematical Psychology, 50:501–510, 2006.
[12] Y. Sakamoto and B. C. Love. Schematic influences on category learning and recognition memory. Journal of Experimental Psychology: General, 133(4):534–553, 2004.
[13] W. H. Crockett. Balance, agreement and positivity in the cognition of small social structures. In Advances in Experimental Social Psychology, Vol 15, pages 1–57. Academic Press, 1982.
[14] N. B. Cottrell. Heider's structural balance principle as a conceptual rule. Journal of Personality and Social Psychology, 31(4):713–720, 1975.
[15] H. B. Enderton. A mathematical introduction to logic. Academic Press, New York, 1972.
[16] G. Boole. An investigation of the laws of thought on which are founded the mathematical theories of logic and probabilities. 1854.
[17] B. C. Love and A. B. Markman. The nonindependence of stimulus properties in human category learning. Memory and Cognition, 31(5):790–799, 2003.
[18] L. von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006.
[19] R. Snow, B. O'Connor, D. Jurafsky, and A. Ng. Cheap and fast–but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics, 2008.

4 0.10505503 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization

Author: Adam Sanborn, Nick Chater, Katherine A. Heller

Abstract: Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. While from a mathematical point of view, an infinite number of basis sets can be used to represent points in this space, the choice of basis set is psychologically crucial. People generally choose the same basis dimensions – and have a strong preference to generalize along the axes of these dimensions, but not "diagonally". What makes some choices of dimension special? We explore the idea that the dimensions used by people echo the natural variation in the environment. Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. This bias is shaped by exposing the model to many categories with a structure hypothesized to be like those which children encounter. The learning behaviour of the model captures the developmental shift from roughly "isotropic" for children to the axis-aligned generalization that adults show.

5 0.090476371 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

Author: Richard Socher, Samuel Gershman, Per Sederberg, Kenneth Norman, Adler J. Perotte, David M. Blei

Abstract: We develop a probabilistic model of human memory performance in free recall experiments. In these experiments, a subject first studies a list of words and then tries to recall them. To model these data, we draw on both previous psychological research and statistical topic models of text documents. We assume that memories are formed by assimilating the semantic meaning of studied words (represented as a distribution over topics) into a slowly changing latent context (represented in the same space). During recall, this context is reinstated and used as a cue for retrieving studied words. By conceptualizing memory retrieval as a dynamic latent variable model, we are able to use Bayesian inference to represent uncertainty and reason about the cognitive processes underlying memory. We present a particle filter algorithm for performing approximate posterior inference, and evaluate our model on the prediction of recalled words in experimental data. By specifying the model hierarchically, we are also able to capture inter-subject variability.

6 0.088623233 115 nips-2009-Individuation, Identification and Object Discovery

7 0.087529011 260 nips-2009-Zero-shot Learning with Semantic Output Codes

8 0.079346552 21 nips-2009-Abstraction and Relational learning

9 0.072831504 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

10 0.065533884 85 nips-2009-Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model

11 0.064485282 152 nips-2009-Measuring model complexity with the prior predictive

12 0.059267767 72 nips-2009-Distribution Matching for Transduction

13 0.058867391 55 nips-2009-Compressed Least-Squares Regression

14 0.057183322 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

15 0.055969186 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

16 0.052336123 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information

17 0.05164386 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

18 0.050570723 96 nips-2009-Filtering Abstract Senses From Image Search Results

19 0.050458681 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

20 0.048215557 157 nips-2009-Multi-Label Prediction via Compressed Sensing

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.168), (1, -0.021), (2, -0.024), (3, -0.04), (4, 0.007), (5, -0.003), (6, -0.032), (7, -0.039), (8, -0.077), (9, 0.063), (10, 0.061), (11, -0.055), (12, 0.034), (13, -0.176), (14, 0.154), (15, 0.006), (16, 0.077), (17, 0.146), (18, -0.064), (19, 0.068), (20, 0.012), (21, -0.048), (22, 0.049), (23, 0.034), (24, -0.083), (25, -0.014), (26, 0.012), (27, -0.029), (28, -0.1), (29, 0.005), (30, -0.075), (31, -0.057), (32, -0.025), (33, 0.079), (34, -0.011), (35, 0.068), (36, 0.159), (37, -0.058), (38, -0.008), (39, -0.012), (40, -0.01), (41, -0.021), (42, 0.028), (43, 0.009), (44, 0.104), (45, 0.065), (46, -0.02), (47, -0.022), (48, -0.054), (49, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95032412 112 nips-2009-Human Rademacher Complexity

Author: Xiaojin Zhu, Bryan R. Gibson, Timothy T. Rogers

2 0.72132063 194 nips-2009-Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory

Author: Harold Pashler, Nicholas Cepeda, Robert Lindsey, Ed Vul, Michael C. Mozer

Abstract: When individuals learn facts (e.g., foreign language vocabulary) over multiple study sessions, the temporal spacing of study has a significant impact on memory retention. Behavioral experiments have shown a nonmonotonic relationship between spacing and retention: short or long intervals between study sessions yield lower cued-recall accuracy than intermediate intervals. Appropriate spacing of study can double retention on educationally relevant time scales. We introduce a Multiscale Context Model (MCM) that is able to predict the influence of a particular study schedule on retention for specific material. MCM's prediction is based on empirical data characterizing forgetting of the material following a single study session. MCM is a synthesis of two existing memory models (Staddon, Chelaru, & Higa, 2002; Raaijmakers, 2003). On the surface, these models are unrelated and incompatible, but we show they share a core feature that allows them to be integrated. MCM can determine study schedules that maximize the durability of learning, and has implications for education and training. MCM can be cast either as a neural network with inputs that fluctuate over time, or as a cascade of leaky integrators. MCM is intriguingly similar to a Bayesian multiscale model of memory (Kording, Tenenbaum, & Shadmehr, 2007), yet MCM is better able to account for human declarative memory.

3 0.70964509 115 nips-2009-Individuation, Identification and Object Discovery

Author: Charles Kemp, Alan Jern, Fei Xu

Abstract: Humans are typically able to infer how many objects their environment contains and to recognize when the same object is encountered twice. We present a simple statistical model that helps to explain these abilities and evaluate it in three behavioral experiments. Our ﬁrst experiment suggests that humans rely on prior knowledge when deciding whether an object token has been previously encountered. Our second and third experiments suggest that humans can infer how many objects they have seen and can learn about categories and their properties even when they are uncertain about which tokens are instances of the same object. From an early age, humans and other animals [1] appear to organize the ﬂux of experience into a series of encounters with discrete and persisting objects. Consider, for example, a young child who grows up in a home with two dogs. At a relatively early age the child will solve the problem of object discovery and will realize that her encounters with dogs correspond to views of two individuals rather than one or three. The child will also solve the problem of identiﬁcation, and will be able to reliably identify an individual (e.g. Fido) each time it is encountered. This paper presents a Bayesian approach that helps to explain both object discovery and identiﬁcation. Bayesian models are appealing in part because they help to explain how inferences are guided by prior knowledge. Imagine, for example, that you see some photographs taken by your friends Alice and Bob. The ﬁrst shot shows Alice sitting next to a large statue and eating a sandwich, and the second is similar but features Bob rather than Alice. The statues in each photograph look identical, and probably you will conclude that the two photographs are representations of the same statue. The sandwiches in the photographs also look identical, but probably you will conclude that the photographs show different sandwiches. 
The prior knowledge that contributes to these inferences appears rather complex, but we will explore some much simpler cases where prior knowledge guides identification. A second advantage of Bayesian models is that they help to explain how learners cope with uncertainty. In some cases a learner may solve the problem of object discovery but should maintain uncertainty when faced with identification problems. For example, I may be quite certain that I have met eight different individuals at a dinner party, even if I am unable to distinguish between two guests who are identical twins. In other cases a learner may need to reason about several related problems even if there is no definitive solution to any one of them. Consider, for example, a young child who must simultaneously discover which objects her world contains (e.g. Mother, Father, Fido, and Rex) and organize them into categories (e.g. people and dogs). Many accounts of categorization seem to implicitly assume that the problem of identification must be solved before categorization can begin, but we will see that a probabilistic approach can address both problems simultaneously. Identification and object discovery have been discussed by researchers from several disciplines, including psychology [2, 3, 4, 5, 6], machine learning [7, 8], statistics [9], and philosophy [10]. Many machine learning approaches can handle identity uncertainty, or uncertainty about whether two tokens correspond to the same object. Some approaches such as BLOG [8] can also handle problems where the number of objects is not specified in advance. We propose that some of these approaches can help to explain human learning, and this paper uses a simple BLOG-style approach [8] to account for human inferences. There are several existing psychological models of identification, and the work of Shepard [11], Nosofsky [3] and colleagues is probably the most prominent.
Models in this tradition usually focus on problems where the set of objects is specified in advance and where identity uncertainty arises as a result of perceptual noise. In contrast, we focus on problems where the number of objects must be inferred and where identity uncertainty arises from partial observability rather than noise. A separate psychological tradition focuses on problems where the number of objects is not fixed in advance. Developmental psychologists, for example, have used displays where only one object token is visible at any time to explore whether young infants can infer how many different objects have been observed in total [4]. Our work emphasizes some of the same themes as this developmental research, but we go beyond previous work in this area by presenting and evaluating a computational approach to object identification and discovery. The problem of deciding how many objects have been observed is sometimes called individuation [12], but here we treat individuation as a special case of object discovery. Note, however, that object discovery can also refer to cases where learners infer the existence of objects that have never been observed. Unobserved-object discovery has received relatively little attention in the psychological literature, but is addressed by statistical models including species-sampling models [9] and capture-recapture models [13]. Simple statistical models of this kind will not address some of the most compelling examples of unobserved-object discovery, such as the discovery of the planet Neptune, or the ability to infer the existence of a hidden object by following another person’s gaze [14]. We will show, however, that a simple statistical approach helps to explain how humans infer the existence of objects that they have never seen.
1 A probabilistic account of object discovery and identification

Object discovery and identification may depend on many kinds of observations and may be supported by many kinds of prior knowledge. This paper considers a very simple setting where these problems can be explored. Suppose that an agent is learning about a world that contains $n_w$ white balls and $n - n_w$ gray balls. Let $f(o_i)$ indicate the color of ball $o_i$, where each ball is white ($f(o_i) = 1$) or gray ($f(o_i) = 0$). An agent learns about the world by observing a sequence of object tokens. Suppose that label $l(j)$ is a unique identifier of token $j$; in other words, suppose that the $j$th token is a token of object $o_{l(j)}$. Suppose also that the $j$th token is observed to have feature value $g(j)$. Note the difference between $f$ and $g$: $f$ is a vector that specifies the color of the $n$ balls in the world, and $g$ is a vector that specifies the color of the object tokens observed thus far. We define a probability distribution over token sequences by assuming that a world is sampled from a prior $P(n, n_w)$ and that tokens are sampled from this world. The full generative model is:

$$P(n) \propto \begin{cases} \frac{1}{n} & \text{if } n \le 1000 \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

$$n_w \mid n \sim \mathrm{Uniform}(0, n) \tag{2}$$

$$l(j) \mid n \sim \mathrm{Uniform}(1, n) \tag{3}$$

$$g(j) = f(o_{l(j)}) \tag{4}$$

A prior often used for inferences about a population of unknown size is the scale-invariant Jeffreys prior $P(n) = \frac{1}{n}$ [15]. We follow this standard approach here but truncate at $n = 1000$. Choosing some upper bound is convenient when implementing the model, and has the advantage of producing a prior that is proper (note that the Jeffreys prior is improper). Equation 2 indicates that the number of white balls $n_w$ is sampled from a discrete uniform distribution. Equation 3 indicates that each token is generated by sampling one of the $n$ balls in the world uniformly at random, and Equation 4 indicates that the color of each token is observed without noise.
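The generative process in Equations 1 through 4 can be illustrated with a short simulation. This is a minimal sketch rather than the authors' implementation: the function names `sample_world` and `sample_tokens` are hypothetical, and the choice to make the first $n_w$ balls white is an arbitrary labeling convention assumed here for convenience.

```python
import random

N_MAX = 1000  # truncation point of the prior on n, as in Equation 1


def sample_world(rng):
    """Sample (n, f): the number of balls and their colors (1 = white, 0 = gray)."""
    sizes = list(range(1, N_MAX + 1))
    # Equation 1: P(n) is proportional to 1/n for n <= 1000.
    n = rng.choices(sizes, weights=[1.0 / k for k in sizes])[0]
    # Equation 2: n_w is uniform on {0, ..., n}.
    n_w = rng.randint(0, n)
    # Assumption: which balls are white is arbitrary, so take the first n_w.
    f = [1] * n_w + [0] * (n - n_w)
    return n, f


def sample_tokens(n, f, num_tokens, rng):
    """Sample labels l(j) and colors g(j) for a sequence of tokens."""
    # Equation 3: each token is one of the n balls, chosen uniformly.
    labels = [rng.randint(1, n) for _ in range(num_tokens)]
    # Equation 4: the token's color is the ball's color, observed without noise.
    colors = [f[label - 1] for label in labels]
    return labels, colors


rng = random.Random(0)
n, f = sample_world(rng)
labels, colors = sample_tokens(n, f, num_tokens=5, rng=rng)
print(n, labels, colors)
```

Sampling with replacement (Equation 3) is what allows the same label to appear more than once in `labels`, which is the source of the identification problem studied in the experiments.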
The generative assumptions just described can be used to define a probabilistic approach to object discovery and identification. Suppose that the observations available to a learner consist of a fully-observed feature vector $g$ and a partially-observed label vector $l_{obs}$. Object discovery and identification can be addressed by using the posterior distribution $P(l \mid g, l_{obs})$ to make inferences about the number of distinct objects observed and about the identity of each token. Computing the posterior distribution $P(n \mid g, l_{obs})$ allows the learner to make inferences about the total number of objects in the world. In some cases, the learner may solve the problem of unobserved-object discovery by realizing that the world contains more objects than she has observed thus far. The next sections explore the idea that the inferences made by humans correspond approximately to the inferences of this ideal learner. Since the ideal learner allows for the possible existence of objects that have not yet been observed, we refer to our model as the open world model. Although we make no claim about the psychological mechanisms that might allow humans to approximate the predictions of the ideal learner, in practice we need some method for computing the predictions of our model. Since the domains we consider are relatively small, all results in this paper were computed by enumerating and summing over the complete set of possible worlds.

2 Experiment 1: Prior knowledge and identification

The introduction described a scenario (the statue and sandwiches example) where prior knowledge appears to guide identification. Our first experiment explores a very simple instance of this idea. We consider a setting where participants observe balls that are sampled with replacement from an urn. In one condition, participants sample the same ball from the urn on four consecutive occasions and are asked to predict whether the token observed on the fifth draw is the same ball that they saw on the first draw.
In a second condition participants are asked exactly the same question about the ﬁfth token but sample four different balls on the ﬁrst four draws. We expect that these different patterns of data will shape the prior beliefs that participants bring to the identiﬁcation problem involving the ﬁfth token, and that participants in the ﬁrst condition will be substantially more likely to identify the ﬁfth token as a ball that they have seen before. Although we consider an abstract setting involving balls and urns the problem we explore has some real-world counterparts. Suppose, for example, that a colleague wears the same tie to four formal dinners. Based on this evidence you might be able to estimate the total number of ties that he owns, and might guess that he is less likely to wear a new tie to the next dinner than a colleague who wore different ties to the ﬁrst four dinners. Method. 12 adults participated for course credit. Participants interacted with a computer interface that displayed an urn, a robotic arm and a beam of UV light. The arm randomly sampled balls from the urn, and participants were told that each ball had a unique serial number that was visible only under UV light. After some balls were sampled, the robotic arm moved them under the UV light and revealed their serial numbers before returning them to the urn. Other balls were returned directly to the urn without having their serial numbers revealed. The serial numbers were alphanumeric strings such as “QXR182”—note that these serial numbers provide no information about the total number of objects, and that our setting is therefore different from the Jeffreys tramcar problem [15]. The experiment included ﬁve within-participant conditions shown in Figure 1. The observations for each condition can be summarized by a string that indicates the number of tokens and the serial numbers of some but perhaps not all tokens. 
The 1 1 1 1 1 condition in Figure 1a is a case where the same ball (without loss of generality, we call it ball 1) is drawn from the urn on five consecutive occasions. The 1 2 3 4 5 condition in Figure 1b is a case where five different balls are drawn from the urn. The 1 condition in Figure 1d is a case where five draws are made, but only the serial number of the first ball is revealed. Within any of the five conditions, all of the balls had the same color (white or gray), but different colors were used across different conditions. For simplicity, all draws in Figure 1 are shown as white balls. On the second and all subsequent draws, participants were asked two questions about any token that was subsequently identified. They first indicated whether the token was likely to be the same as the ball they observed on the first draw (the ball labeled 1 in Figure 1). They then indicated whether the token was likely to be a ball that they had never seen before. Both responses were provided on a scale from 1 (very unlikely) to 7 (very likely). At the end of each condition, participants were asked to estimate the total number of balls in the urn. Twelve options were provided ranging from “exactly 1” to “exactly 12,” and a thirteenth option was labeled “more than 12.” Responses to each option were again provided on a seven point scale.

[Figure 1, panels (a) through (e); plots not reproduced. Row labels: Human, Open world, DP mixture, PY mixture. x-axis of the population plots: # Balls.]

Figure 1: Model predictions and results for the five conditions in experiment 1. The left columns in (a) and (b) show inferences about the identification questions. In each plot, the first group of bars shows predictions about the probability that each new token is the same ball as the first ball drawn from the urn. The second group of bars shows the probability that each new token is a ball that has never been seen before. The right columns in (a) and (b) and the plots in (c) through (e) show inferences about the total number of balls in each urn. All human responses are shown on the 1-7 scale used for the experiment. Model predictions are shown as probabilities (identification questions) or ranks (population size questions).

Model predictions and results. The comparisons of primary interest involve the identification questions in conditions 1a and 1b. In condition 1a the open world model infers that the total number of balls is probably low, and becomes increasingly confident that each new token is the same as the first object observed. In condition 1b the model infers that the number of balls is probably high, and becomes increasingly confident that each new token is probably a new ball. The rightmost charts in Figures 1a and 1b show inferences about the total number of balls and confirm that humans expect the number of balls to be low in condition 1a and high in condition 1b. Note that participants in condition 1b have solved the problem of unobserved-object discovery and inferred the existence of objects that they have never seen. The leftmost charts in 1a and 1b show responses to the identification questions, and the final bar in each group of four shows predictions about the fifth token sampled.
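The open world predictions for conditions 1a and 1b can be approximated with a small enumeration over $n$. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes every previous token has had its serial number revealed, it ignores color, and the function names are hypothetical. Under these assumptions the probability of seeing $k$ distinct balls in a particular pattern over $m$ draws is $n(n-1)\cdots(n-k+1)/n^m$, which is combined with the truncated $1/n$ prior.

```python
N_MAX = 1000  # truncation of the prior, as in Equation 1


def posterior_n_labeled(m, k):
    """Posterior over n after m identified tokens that show k distinct balls.

    Assumes all serial numbers were revealed and ignores ball color.
    """
    weights = []
    for n in range(1, N_MAX + 1):
        # Ways to choose k distinct balls in order of first appearance;
        # this product is zero whenever n < k.
        ways = 1.0
        for i in range(k):
            ways *= n - i
        # Prior 1/n times the label-pattern probability ways / n^m.
        weights.append(ways / n ** (m + 1))
    total = sum(weights)
    return [w / total for w in weights]


def p_same_as_first(m, k):
    """Probability that the next token is the ball seen on the first draw."""
    post = posterior_n_labeled(m, k)
    return sum(p / n for n, p in enumerate(post, start=1))


def p_new_ball(m, k):
    """Probability that the next token is a ball never seen before."""
    post = posterior_n_labeled(m, k)
    return sum(p * (n - k) / n for n, p in enumerate(post, start=1))


for m in range(1, 5):
    # Condition 1a: one distinct ball so far; condition 1b: m distinct balls.
    print(m, round(p_same_as_first(m, 1), 3), round(p_new_ball(m, m), 3))
```

Both printed columns grow with `m`, which mirrors the qualitative pattern described for the model in conditions 1a and 1b.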
As predicted by the model, participants in 1a become increasingly confident that each new token is the same object as the first token, but participants in 1b become increasingly confident that each new token is a new object. The increase in responses to the new ball questions in Figure 1b is replicated in conditions 2d and 2e of Experiment 2, and therefore appears to be reliable. The third and fourth rows of Figures 1a and 1b show the predictions of two alternative models that are intuitively appealing but that fail to account for our results. The first is the Dirichlet Process (DP) mixture model, which was proposed by Anderson [16] as an account of human categorization. Unlike most psychological models of categorization, the DP mixture model reserves some probability mass for outcomes that have not yet been observed. The model incorporates a prior distribution over partitions. In most applications of the model these partitions organize objects into categories, but Anderson suggests that the model can also be used to organize object tokens into classes that correspond to individual objects. The DP mixture model successfully predicts that the ball 1 questions will receive higher ratings in 1a than 1b, but predicts that responses to the new ball question will be identical across these two conditions. According to this model, the probability that a new token corresponds to a new object is $\frac{\theta}{m+\theta}$, where $\theta$ is a hyperparameter and $m$ is the number of tokens observed thus far. Note that this probability is the same regardless of the identities of the $m$ tokens previously observed. The Pitman Yor (PY) mixture model in the fourth row is a generalization of the DP mixture model that uses a prior over partitions defined by two hyperparameters [17]. According to this model, the probability that a new token corresponds to a new object is $\frac{\theta + k\alpha}{m+\theta}$, where $\theta$ and $\alpha$ are hyperparameters and $k$ is the number of distinct objects observed so far.
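The two new-object probabilities stated above are easy to evaluate side by side. The sketch below is illustrative only: the function names are hypothetical, and the values $\theta = 1$ and $\alpha = 0.5$ are arbitrary choices made for the example rather than settings taken from the paper.

```python
def dp_new_prob(m, theta):
    """DP mixture: probability that the next token is a new object."""
    return theta / (m + theta)


def py_new_prob(m, k, theta, alpha):
    """Pitman-Yor mixture: same quantity, given k distinct objects so far."""
    return (theta + k * alpha) / (m + theta)


theta, alpha = 1.0, 0.5  # hypothetical hyperparameter values
for m in range(1, 5):
    # Condition 1a has k = 1 distinct ball; condition 1b has k = m.
    print(
        m,
        round(dp_new_prob(m, theta), 3),
        round(py_new_prob(m, 1, theta, alpha), 3),
        round(py_new_prob(m, m, theta, alpha), 3),
    )
```

The DP value depends only on `m`, so it cannot separate the two conditions, whereas the PY value depends on `k` and therefore can.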
The flexibility offered by a second hyperparameter allows the model to predict a difference in responses to the new ball questions across the two conditions, but the model does not account for the increasing pattern observed in condition 1b. Most settings of $\theta$ and $\alpha$ predict that the responses to the new ball questions will decrease in condition 1b. A non-generic setting of these hyperparameters with $\theta = 0$ can generate the flat predictions in Figure 1, but no setting of the hyperparameters predicts the increase in the human responses. Although the PY and DP models both make predictions about the identification questions, neither model can predict the total number of balls in the urn. Both models assume that the population of balls is countably infinite, which does not seem appropriate for the tasks we consider. Figures 1c through 1e show results for three control conditions. Like condition 1a, 1c and 1d are cases where exactly one serial number is observed. Like conditions 1a and 1b, 1d and 1e are cases where exactly five tokens are observed. None of these control conditions produces results similar to conditions 1a and 1b, suggesting that methods which simply count the number of tokens or serial numbers will not account for our results. In each of the final three conditions our model predicts that the posterior distribution on the number of balls $n$ should decay as $n$ increases. This prediction is not consistent with our data, since most participants assigned equal ratings to all 13 options, including “exactly 12 balls” and “more than 12 balls.” The flat responses in Figures 1c through 1e appear to indicate a generic desire to express uncertainty, and suggest that our ideal learner model accounts for human responses only after several informative observations have been made.

3 Experiment 2: Object discovery and identity uncertainty

Our second experiment focuses on object discovery rather than identification.
We consider cases where learners make inferences about the number of objects they have seen and the total number of objects in the urn even though there is substantial uncertainty about the identities of many of the tokens observed. Our probabilistic model predicts that observations of unidentified tokens can influence inferences about the total number of objects, and our second experiment tests this prediction. Method. 12 adults participated for course credit. The same participants took part in Experiments 1 and 2, and Experiment 2 was always completed after Experiment 1. Participants interacted with the same computer interface in both conditions, and the seven conditions in Experiment 2 are shown in Figure 2. Note that each condition now includes one or more gray tokens. In 2a, for example, there are four gray tokens and none of these tokens is identified. All tokens were sampled with replacement, and the condition labels in Figure 2 summarize the complete set of tokens presented in each condition. Within each condition the tokens were presented in a pseudo-random order; in 2a, for example, the gray and white tokens were interspersed with each other.

[Figure 2, panels (a) through (g); plots not reproduced. Row labels: Human, Open world. x-axis of the population plots: # Balls.]

Figure 2: Model predictions and results for the seven conditions in Experiment 2. The left columns in (a) through (e) show inferences about the identification questions, and the remaining plots show inferences about the total number of balls in each urn.

Model predictions and results. The cases of most interest are the inferences about the total number of balls in conditions 2a and 2c. In both conditions participants observe exactly four white tokens and all four tokens are revealed to be the same ball. The gray tokens in each condition are never identified, but the number of these tokens varies across the conditions. Even though the identities of the gray tokens are never revealed, the open world model can use these observations to guide its inference about the total number of balls. In 2a, the proportions of white tokens and gray tokens are equal and there appears to be only one white ball, suggesting that the total number of balls is around two. In 2c gray tokens are now three times more common, suggesting that the total number of balls is larger than two. As predicted, the human responses in Figure 2 show that the peak of the distribution in 2a shifts to the right in 2c. Note, however, that the model does not accurately predict the precise location of the peak in 2c. Some of the remaining conditions in Figure 2 serve as controls for the comparison between 2a and 2c. Conditions 2a and 2c differ in the total number of tokens observed, but condition 2b shows that this difference is not the critical factor.
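The comparison between 2a and 2c can be sketched by enumerating worlds $(n, n_w)$ under the generative model. This is a hedged reconstruction, not the authors' code: the function names are hypothetical, and the token counts used below (four white tokens of one ball with four gray tokens for 2a, and twelve gray tokens for 2c) are this sketch's reading of "three times more common". The likelihood assumes that the identified ball is white with probability $n_w/n$ and that each unidentified token is gray with probability $(n - n_w)/n$.

```python
N_MAX = 1000  # truncation of the prior, as in Equation 1


def posterior_n_mixed(num_white_same_ball, num_gray_unlabeled):
    """Posterior over n given white tokens that are all one identified ball,
    plus gray tokens whose identities are never revealed."""
    w, g = num_white_same_ball, num_gray_unlabeled
    weights = []
    for n in range(1, N_MAX + 1):
        # Prior 1/n, times n choices for the identified ball, times (1/n)^w
        # for drawing that same ball on each of the w identified draws.
        label_term = (1.0 / n) * n * (1.0 / n) ** w
        color_term = 0.0
        for n_w in range(0, n + 1):
            p_ball_white = n_w / n        # the identified ball is white
            p_gray_token = (n - n_w) / n  # an unidentified token is gray
            # n_w is uniform on {0, ..., n}, hence the 1 / (n + 1) weight.
            color_term += p_ball_white * p_gray_token ** g / (n + 1)
        weights.append(label_term * color_term)
    total = sum(weights)
    return [x / total for x in weights]


def posterior_mode(post):
    """Value of n with the highest posterior probability."""
    return max(range(len(post)), key=post.__getitem__) + 1


print(posterior_mode(posterior_n_mixed(4, 4)))   # setting assumed for 2a
print(posterior_mode(posterior_n_mixed(4, 12)))  # setting assumed for 2c
```

Under these assumptions the most probable urn size moves to a larger value when more unidentified gray tokens are observed, which is the direction of the shift described for 2a versus 2c.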
The number of tokens observed is the same across 2b and 2c, yet the inference in 2b is more similar to the inference in 2a than in 2c. Conditions 2a and 2c also differ in the proportion of white tokens observed, but conditions 2f and 2g show that this difference is not sufﬁcient to explain our results. The proportion of white tokens observed is the same across conditions 2a, 2f, and 2g, yet only 2a provides strong evidence that the total number of balls is low. The human inferences for 2f and 2g show the hint of an alternating pattern consistent with the inference that the total number of balls in the urn is even. Only 2 out of 12 participants generated this pattern, however, and the majority of responses are near uniform. Finally, conditions 2d and 2e replicate our ﬁnding from Experiment 1 that the identity labels play an important role. The only difference between 2a and 2e is that the four labels are distinct in the latter case, and this single difference produces a predictable divergence in human inferences about the total number of balls. 4 Experiment 3: Categorization and identity uncertainty Experiment 2 suggested that people make robust inferences about the existence and number of unobserved objects in the presence of identity uncertainty. Our ﬁnal experiment explores categorization in the presence of identity uncertainty. We consider an extreme case where participants make inferences about the variability of a category even though the tokens of that category have never been identiﬁed. Method. The experiment included two between subject conditions, and 20 adults were recruited for each condition. Participants were asked to reason about a category including eggs of a given species, where eggs in the same category might vary in size. The interface used in Experiments 1 and 2 was adapted so that the urn now contained two kinds of objects: notepads and eggs. Participants were told that each notepad had a unique color and a unique label written on the front. 
The UV light played no role in the experiment and was removed from the interface: notepads could be identiﬁed by visual inspection, and identifying labels for the eggs were never shown. In both conditions participants observed a sequence of 16 tokens sampled from the urn. Half of the tokens were notepads and the others were eggs, and all egg tokens were identical in size. Whenever an egg was sampled, participants were told that this egg was a Kwiba egg. At the end of the condition, participants were shown a set of 11 eggs that varied in size and asked to rate the probability that each one was a Kwiba egg. Participants then made inferences about the total number of eggs and the total number of notepads in the urn. The two conditions were intended to lead to different inferences about the total number of eggs in the urn. In the 4 egg condition, all items (notepad and eggs) were sampled with replacement. The 8 notepad tokens included two tokens of each of 4 notepads, suggesting that the total number of notepads was 4. Since the proportion of egg tokens and notepad tokens was equal, we expected participants to infer that the total number of eggs was roughly four. In the 1 egg condition, four notepads were observed in total, but the ﬁrst three were sampled without replacement and never returned to the urn. The ﬁnal notepad and the egg tokens were always sampled with replacement. After the ﬁrst three notepads had been removed from the urn, the remaining notepad was sampled about half of the time. We therefore expected participants to infer that the urn probably contained a single notepad and a single egg by the end of the experiment, and that all of the eggs they had observed were tokens of a single object. Model. We can simultaneously address identiﬁcation and categorization by combining the open world model with a Gaussian model of categorization. Suppose that the members of a given category (e.g. Kwiba eggs) vary along a single continuous dimension (e.g. size). 
We assume that the egg sizes are distributed according to a Gaussian with known mean and unknown variance $\sigma^2$. For convenience, we assume that the mean is zero (i.e. we measure size with respect to the average) and use the standard inverse-gamma prior on the variance: $p(\sigma^2) \propto (\sigma^2)^{-(\alpha+1)} e^{-\beta/\sigma^2}$. Since we are interested only in qualitative predictions of the model, the precise values of the hyperparameters are not very important. To generate the results shown in Figure 3 we set $\alpha = 0.5$ and $\beta = 2$. Before observing any eggs, the marginal distribution on sizes is $p(x) = \int p(x \mid \sigma^2)\, p(\sigma^2)\, d\sigma^2$. Suppose now that we observe $m$ random samples from the category and that each one has size zero. If $m$ is large then these observations provide strong evidence that the variance $\sigma^2$ is small, and the posterior distribution $p(x \mid m)$ will be tightly peaked around zero. If $m$ is small, however, then the posterior distribution will be broader.

[Figure 3, panels (a) through (c); plots not reproduced. Panel titles: Category pdf (1 egg), Category pdf (4 eggs), Model differences, Number of eggs (4 eggs), Number of eggs (1 egg), Human differences. x-axis of the size plots: x (size). The difference panels show $p_4(x) - p_1(x)$.]

Figure 3: (a) Model predictions for Experiment 3. The first two panels show the size distributions inferred for the two conditions, and the final panel shows the difference of these distributions. The difference curve for the model rises to a peak of around 1.6 but has been truncated at 0.1. (b) Human inferences about the total number of eggs in the urn. As predicted, participants in the 4 egg condition believe that the urn contains more eggs. (c) The difference of the size distributions generated by participants in each condition. The central peak is absent but otherwise the curve is qualitatively similar to the model prediction.
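The effect of $m$ on the predictive distribution $p(x \mid m)$ can be made concrete with the standard conjugate result for a zero-mean Gaussian with an inverse-gamma prior on the variance. The sketch below treats $m$ as known, which is a simplification of the full model, and the function name `predictive_pdf` is hypothetical. After $m$ observations of size zero the shape parameter becomes $\alpha + m/2$ while $\beta$ is unchanged, and integrating out $\sigma^2$ gives a Student-t density.

```python
import math

ALPHA, BETA = 0.5, 2.0  # hyperparameter values reported for Figure 3


def predictive_pdf(x, m, alpha=ALPHA, beta=BETA):
    """p(x | m): size density after m observations that all have size zero."""
    # Conjugate update with known zero mean: the shape grows by m / 2, and
    # the scale is unchanged because every squared deviation is zero.
    a = alpha + m / 2.0
    b = beta
    # Marginalizing sigma^2 gives
    #   Gamma(a + 1/2) / (Gamma(a) * sqrt(2 * pi * b)) * (1 + x^2 / (2b))^-(a + 1/2)
    log_norm = (
        math.lgamma(a + 0.5) - math.lgamma(a) - 0.5 * math.log(2 * math.pi * b)
    )
    return math.exp(log_norm - (a + 0.5) * math.log1p(x * x / (2 * b)))


for m in (0, 1, 4):
    print(m, round(predictive_pdf(0.0, m), 4), round(predictive_pdf(3.0, m), 4))
```

Larger `m` raises the density at zero and lowers it in the tails, which is the sharpening described above.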
The categorization model described so far is entirely standard, but note that our experiment considers a case where $T$, the observed stream of object tokens, is not sufficient to determine $m$, the number of distinct objects observed. We therefore use the open world model to generate a posterior distribution over $m$, and compute a marginal distribution over size by integrating out both $m$ and $\sigma^2$: $p(x \mid T) = \iint p(x \mid \sigma^2)\, p(\sigma^2 \mid m)\, p(m \mid T)\, d\sigma^2\, dm$. Figure 3a shows predictions of this “open world + Gaussian” model for the two conditions in our experiment. Note that the difference between the curves for the two conditions has the characteristic Mexican-hat shape produced by a difference of Gaussians. Results. Inferences about the total number of eggs suggested that our manipulation succeeded. Figure 3b indicates that participants in the 4 egg condition believed that they had seen more eggs than participants in the 1 egg condition. Participants in both conditions generated a size distribution for the category of Kwiba eggs, and the difference of these distributions is shown in Figure 3c. Although the magnitude of the differences is small, the shape of the difference curve is consistent with the model predictions. The x = 0 bar is the only case that diverges from the expected Mexican hat shape, and this result is probably due to a ceiling effect: 80% of participants in both conditions chose the maximum possible rating for the egg with mean size (size zero), leaving little opportunity for a difference between conditions to emerge. To support the qualitative result in Figure 3c we computed the variance of the curve generated by each individual participant and tested the hypothesis that the variances were greater in the 1 egg condition than in the 4 egg condition. A Mann-Whitney test indicated that this difference was marginally significant (p < 0.1, one-sided).

5 Conclusion

Parsing the world into stable and recurring objects is arguably our most basic cognitive achievement [2, 10].
This paper described a simple model of object discovery and identification and evaluated it in three behavioral experiments. Our first experiment confirmed that people rely on prior knowledge when solving identification problems. Our second and third experiments explored problems where the identities of many object tokens were never revealed. Despite the resulting uncertainty, we found that participants in these experiments were able to track the number of objects they had seen, to infer the existence of unobserved objects, and to learn and reason about categories. Although the tasks in our experiments were all relatively simple, future work can apply our approach to more realistic settings. For example, a straightforward extension of our model can handle problems where objects vary along multiple perceptual dimensions and where observations are corrupted by perceptual noise. Discovery and identification problems may take several different forms, but probabilistic inference can help to explain how all of these problems are solved.

Acknowledgments

We thank Bobby Han, Faye Han and Maureen Satyshur for running the experiments.

References

[1] E. A. Tibbetts and J. Dale. Individual recognition: it is good to be different. Trends in Ecology and Evolution, 22(10):529–537, 2007.
[2] W. James. Principles of psychology. Holt, New York, 1890.
[3] R. M. Nosofsky. Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115:39–57, 1986.
[4] F. Xu and S. Carey. Infants’ metaphysics: the case of numerical identity. Cognitive Psychology, 30:111–153, 1996.
[5] L. W. Barsalou, J. Huttenlocher, and K. Lamberts. Basing categorization on individuals and events. Cognitive Psychology, 36:203–272, 1998.
[6] L. J. Rips, S. Blok, and G. Newman. Tracing the identity of objects. Psychological Review, 113(1):1–30, 2006.
[7] A. McCallum and B. Wellner. Conditional models of identity uncertainty with application to noun coreference. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 905–912. MIT Press, Cambridge, MA, 2005.
[8] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 1352–1359, 2005.
[9] J. Bunge and M. Fitzpatrick. Estimating the number of species: a review. Journal of the American Statistical Association, 88(421):364–373, 1993.
[10] R. G. Millikan. On clear and confused ideas: an essay about substance concepts. Cambridge University Press, New York, 2000.
[11] R. N. Shepard. Stimulus and response generalization: a stochastic model relating generalization to distance in psychological space. Psychometrika, 22:325–345, 1957.
[12] A. M. Leslie, F. Xu, P. D. Tremoulet, and B. J. Scholl. Indexing and the object concept: developing ‘what’ and ‘where’ systems. Trends in Cognitive Science, 2(1):10–18, 1998.
[13] J. D. Nichols. Capture-recapture models. Bioscience, 42(2):94–102, 1992.
[14] G. Csibra and A. Volein. Infants can infer the presence of hidden objects from referential gaze information. British Journal of Developmental Psychology, 26:1–11, 2008.
[15] H. Jeffreys. Theory of Probability. Oxford University Press, Oxford, 1961.
[16] J. R. Anderson. The adaptive nature of human categorization. Psychological Review, 98(3):409–429, 1991.
[17] J. Pitman. Combinatorial stochastic processes, 2002. Notes for Saint Flour Summer School.

4 0.6609962 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization

Author: Adam Sanborn, Nick Chater, Katherine A. Heller

5 0.63344252 25 nips-2009-Adaptive Design Optimization in Experiments with People

Author: Daniel Cavagnaro, Jay Myung, Mark A. Pitt

Abstract: In cognitive science, empirical data collected from participants are the arbiters in model selection. Model discrimination thus depends on designing maximally informative experiments. It has been shown that adaptive design optimization (ADO) allows one to discriminate models as efficiently as possible in simulation experiments. In this paper we use ADO in a series of experiments with people to discriminate the Power, Exponential, and Hyperbolic models of memory retention, which has been a long-standing problem in cognitive science, providing an ideal setting in which to test the application of ADO for addressing questions about human cognition. Using an optimality criterion based on mutual information, ADO is able to find designs that are maximally likely to increase our certainty about the true model upon observation of the experiment outcomes. Results demonstrate the usefulness of ADO and also reveal some challenges in its implementation.

6 0.63250828 196 nips-2009-Quantification and the language of thought

7 0.59928191 21 nips-2009-Abstraction and Relational learning

8 0.59798932 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information

9 0.596892 152 nips-2009-Measuring model complexity with the prior predictive

10 0.54129374 260 nips-2009-Zero-shot Learning with Semantic Output Codes

11 0.53191704 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

12 0.50272042 39 nips-2009-Bayesian Belief Polarization

13 0.48488143 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

14 0.47294724 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

15 0.47095212 69 nips-2009-Discrete MDL Predicts in Total Variation

16 0.46005037 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities

17 0.44869384 85 nips-2009-Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model

18 0.43260452 71 nips-2009-Distribution-Calibrated Hierarchical Classification

19 0.4264425 154 nips-2009-Modeling the spacing effect in sequential category learning

20 0.41066152 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(18, 0.221), (24, 0.059), (25, 0.096), (35, 0.046), (36, 0.081), (39, 0.107), (58, 0.061), (61, 0.023), (71, 0.078), (81, 0.016), (86, 0.091), (91, 0.029)]
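
The topic list above is a sparse vector of (topicId, topicWeight) pairs. As a minimal sketch of how two such vectors could be compared, the snippet below computes cosine similarity between them. This is an assumption for illustration only: the page does not say how simValue is computed, and since plain cosine similarity of a vector with itself is 1 while the same-paper entry below reports 0.83262801, the page evidently uses some other measure or representation. The names `cosine_similarity` and `other_topics` are hypothetical, and `other_topics` is invented data, not taken from any paper.

```python
import math

# Sparse (topicId, topicWeight) pairs copied from the listing above.
paper_topics = [
    (18, 0.221), (24, 0.059), (25, 0.096), (35, 0.046),
    (36, 0.081), (39, 0.107), (58, 0.061), (61, 0.023),
    (71, 0.078), (81, 0.016), (86, 0.091), (91, 0.029),
]


def cosine_similarity(a, b):
    """Cosine similarity of two sparse topic vectors given as (id, weight) pairs."""
    wa, wb = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical second paper, invented purely for illustration.
other_topics = [(18, 0.150), (39, 0.200), (77, 0.300)]

self_sim = cosine_similarity(paper_topics, paper_topics)
cross_sim = cosine_similarity(paper_topics, other_topics)
```

Storing only the nonzero topics keeps each vector small, and converting to dictionaries lets the dot product skip topics that one of the two papers lacks.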

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83262801 112 nips-2009-Human Rademacher Complexity

Author: Xiaojin Zhu, Bryan R. Gibson, Timothy T. Rogers

2 0.81414574 204 nips-2009-Replicated Softmax: an Undirected Topic Model

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.

3 0.70504105 128 nips-2009-Learning Non-Linear Combinations of Kernels

Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

4 0.68267632 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov

Abstract: We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us “understand” a dataset of relational facts in at least two ways: by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.

5 0.67964053 154 nips-2009-Modeling the spacing effect in sequential category learning

Author: Hongjing Lu, Matthew Weiden, Alan L. Yuille

Abstract: We develop a Bayesian sequential model for category learning. The sequential model updates two category parameters, the mean and the variance, over time. We define conjugate temporal priors to enable closed form solutions to be obtained. This model can be easily extended to supervised and unsupervised learning involving multiple categories. To model the spacing effect, we introduce a generic prior in the temporal updating stage to capture a learning preference, namely, less change for repetition and more change for variation. Finally, we show how this approach can be generalized to efficiently perform model selection to decide whether observations are from one or multiple categories.

6 0.67905682 133 nips-2009-Learning models of object structure

7 0.67497313 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

8 0.67382884 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships

9 0.67032218 211 nips-2009-Segmenting Scenes by Matching Image Composites

10 0.66989046 226 nips-2009-Spatial Normalized Gamma Processes

11 0.66792041 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

12 0.66705036 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference

13 0.66246301 196 nips-2009-Quantification and the language of thought

14 0.66190785 115 nips-2009-Individuation, Identification and Object Discovery

15 0.66115373 110 nips-2009-Hierarchical Mixture of Classification Experts Uncovers Interactions between Brain Regions

16 0.659531 85 nips-2009-Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model

17 0.65937102 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

18 0.6589458 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

19 0.65875292 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

20 0.65872651 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition