nips nips2009 nips2009-109 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Adam Sanborn, Nick Chater, Katherine A. Heller
Abstract: Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. While from a mathematical point of view, an infinite number of basis sets can be used to represent points in this space, the choice of basis set is psychologically crucial. People generally choose the same basis dimensions – and have a strong preference to generalize along the axes of these dimensions, but not “diagonally”. What makes some choices of dimension special? We explore the idea that the dimensions used by people echo the natural variation in the environment. Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. This bias is shaped by exposing the model to many categories with a structure hypothesized to be like those which children encounter. The learning behaviour of the model captures the developmental shift from roughly “isotropic” for children to the axis-aligned generalization that adults show. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. [sent-10, score-0.573]
2 People generally choose the same basis dimensions – and have a strong preference to generalize along the axes of these dimensions, but not “diagonally”. [sent-12, score-0.579]
3 We explore the idea that the dimensions used by people echo the natural variation in the environment. [sent-14, score-0.406]
4 Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. [sent-15, score-0.604]
5 The learning behaviour of the model captures the developmental shift from roughly “isotropic” for children to the axis-aligned generalization that adults show. [sent-17, score-0.423]
6 Specifically, people generalize along the dimensions, such as size, color, or shape – dimensions that are termed separable. [sent-24, score-0.57]
7 In contrast, dimensions such as hue and saturation, which show isotropic generalization, are termed integral [3]. [sent-25, score-0.482]
8 An illustration of the importance of separable dimensions is found in the time to learn categories. [sent-26, score-0.456]
9 If dimensions did not play a strong role in generalization, then rotating a category structure in a parameter space of separable dimensions should not influence how easily it can be learned. [sent-27, score-0.987]
10 Similarity rating results also show strong trends of judging objects to be more similar if they match along separable dimensions [3, 5]. [sent-29, score-0.655]
11 The tendency to generalize categories along separable dimensions is learned over development. [sent-30, score-0.765]
12 On dimensions such as size and color, children produce generalizations that are more isotropic than adults [6]. [sent-31, score-0.668]
13 Interestingly, the developmental transition between isotropic and dimensionally biased generalizations is gradual [7]. [sent-32, score-0.382]
14 [1] identified shape as a key constant in categories, and we can find categories that are constant along other separable dimensions as well. [sent-37, score-0.688]
15 Models of categorization are able to account for both the isotropic and dimension-based components of generalization. [sent-40, score-0.487]
16 Classic models of categorization, such as the exemplar and prototype model, account for these using different mechanisms [8, 9, 10]. [sent-41, score-0.537]
17 Rational models of categorization have accounted for dimensional biases by assuming that the shapes of categories are aligned with the axes that people use for generalization [11, 12, 13]. [sent-42, score-1.244]
18 Neither the classic models nor rational models have investigated how people learn to use the particular dimension basis that they do. [sent-43, score-0.595]
19 This paper presents a model that learns the dimensional basis that people use for generalization. [sent-44, score-0.444]
20 We connect these biases with a hypothesis about the structure of categories in the environment and demonstrate how exposure to these categories during development results in human dimensional biases. [sent-45, score-0.864]
21 In the next section, we review models of categorization and how they have accounted for dimensional biases. [sent-46, score-0.579]
22 Next, we review current nonparametric Bayesian models of categorization, which all require that the dimensions be hand-coded. [sent-47, score-0.376]
23 Next, we introduce a new prior for categorization models that starts without pre-specified dimensions and learns to generalize new categories in the same way that previous categories varied. [sent-48, score-1.195]
24 We demonstrate that training the model on reasonable category structures produces generalization behavior that mimics that of human subjects at various ages. [sent-50, score-0.384]
25 2 Modeling Dimensional Biases in Categorization Models of categorization can be divided into generative and discriminative models – we will focus on generative models here and leave discriminative models for the discussion. [sent-52, score-0.497]
26 Generative models of categorization, such as the prototype [8] and exemplar models [9, 10], assume that people learn category distributions, not just rules for discriminating between categories. [sent-53, score-0.93]
27 The remaining items are collected in the vector $\mathbf{x}_{n-1}$ and the known labels for these items are $\mathbf{c}_{n-1}$. [sent-55, score-0.394]
28 For the prototype and exemplar models, the likelihood of an item belonging to a category is based on the weighted Minkowski power metric, $\Bigl(\sum_{(d)} w^{(d)} \bigl|x_n^{(d)} - R_i^{(d)}\bigr|^{r}\Bigr)^{1/r}$ (2), which computes the absolute value of the difference between the new example $x_n$ and the category representation $R_i$ for category $i$ on each dimension $d$. [sent-56, score-1.115]
29 For an exemplar model, $R_i$ is each example in $\mathbf{x}_{n-1}$, while for the prototype model, it is the single average of $\mathbf{x}_{n-1}$. [sent-57, score-0.877]
30 Integral dimensions are modeled with r = 2, which results in a Euclidean distance metric. [sent-58, score-0.313]
31 The Euclidean metric has the special property that rotating the basis set for the dimensions of the space does not affect the distances. [sent-59, score-0.404]
32 Separable dimensions are modeled with either r = 1, the city-block metric, or r < 1, which no longer obeys the triangle inequality [5]. [sent-61, score-0.361]
33 Dimensional biases are also modeled in categorization by modifying the weight $w^{(d)}$ on each dimension. [sent-62, score-0.77]
34 In effect, the weights stretch or shrink the space of stimuli along each dimension so that some items are closer than others. [sent-63, score-0.44]
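A minimal sketch of how Equation 2 behaves, assuming the reconstruction above (distance = (sum over d of w^(d) |x_n^(d) - R_i^(d)|^r)^(1/r)); the function names, the exponential similarity transform, and the example values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def minkowski_distance(x, rep, w, r):
    """Weighted Minkowski power metric between an item x and a representation rep."""
    return np.sum(w * np.abs(x - rep) ** r) ** (1.0 / r)

def category_similarity(x, exemplars, w, r, model="exemplar", c=1.0):
    """Summed similarity of x to a category, exemplar- or prototype-style."""
    if model == "prototype":
        reps = [exemplars.mean(axis=0)]   # single averaged representation R_i
    else:
        reps = list(exemplars)            # every stored exemplar is a representation
    # similarity decays with distance (an exponential decay is assumed here)
    return sum(np.exp(-c * minkowski_distance(x, rep, w, r)) for rep in reps)

category = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
probe = np.array([1.5, 1.0])
weights = np.array([0.7, 0.3])            # attentional weights w^(d)
print(category_similarity(probe, category, weights, r=1.0))                      # city-block
print(category_similarity(probe, category, weights, r=2.0, model="prototype"))   # Euclidean
```

With r = 2 the distance is unchanged by rotations of the two dimensions, whereas r = 1 or r < 1 ties generalization to the particular axes chosen, which is why the exponent and the weights both act as dimensional biases.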
35 These generative models of categorization have been developed to account for the different types of dimensional biases that are displayed by people, but they lack a means of learning the dimensions themselves. [sent-66, score-1.072]
36 If the chosen basis set did not match that used by people, then the models would be very poor descriptions of human dimensional biases. [sent-68, score-0.379]
37 3 Rational Models of Categorization Rational models of categorization view categorization behavior as the solution to a problem posed by the environment: how best to generalize properties from one object to another. [sent-70, score-0.796]
38 Both exemplar and prototype models can be viewed as restricted versions of rational models of categorization, which also allow interpolations between these two extreme views of representation. [sent-71, score-0.696]
39 Anderson [11] proposed a rational model of categorization which modeled the stimuli in a task as a mixture of clusters. [sent-72, score-0.734]
40 The model was extended to supervised learning so each category is a mixture [17], $P(x_\ell \mid \mathbf{x}_{\ell-1}, \mathbf{s}_{\ell-1}) = \sum_{k=1}^{K} P(s_\ell = k \mid \mathbf{s}_{\ell-1})\, P(x_\ell \mid s_\ell = k, \mathbf{x}_{\ell-1}, \mathbf{s}_{\ell-1})$ (3), where $x_\ell$ is the newest example in a category $i$ and $\mathbf{x}_{\ell-1}$ are the other members of category $i$. [sent-74, score-0.72]
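A rough sketch of the mixture-of-clusters predictive distribution in Equation 3, under the simplifying assumptions that cluster weights follow CRP-style counts and that each cluster's predictive density is a fixed-variance Gaussian centred on the cluster mean; all names and numerical choices here are illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def category_predictive(x_new, members, assignments, alpha=1.0, obs_var=0.5):
    """P(x_new | earlier members of the category), summing over clusters as in Eq. 3."""
    assignments = np.asarray(assignments)
    n, d = members.shape
    total = 0.0
    for k in np.unique(assignments):
        xs = members[assignments == k]
        weight = len(xs) / (n + alpha)                  # P(s = k | earlier assignments)
        density = multivariate_normal.pdf(x_new, mean=xs.mean(axis=0),
                                          cov=obs_var * np.eye(d))
        total += weight * density
    # probability of the new example starting a brand-new cluster
    total += (alpha / (n + alpha)) * multivariate_normal.pdf(
        x_new, mean=np.zeros(d), cov=np.eye(d))
    return total

members = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.1]])
print(category_predictive(np.array([0.1, 0.0]), members, assignments=[0, 0, 1]))
```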
41 The likelihood distribution for this model assumes a fixed basis set of dimensions, which must align with the separable dimensions to produce dimensional biases in generalization. [sent-89, score-0.955]
42 The inverse-Wishart distribution has its mode at $\Sigma / (m + D + 1)$, where $\Sigma$ is the mean covariance matrix parameter, $m$ is the degrees of freedom, and $D$ is the number of dimensions of the stimulus. [sent-93, score-0.42]
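The mode relation can be checked numerically with scipy's inverse-Wishart, assuming it is parameterized by a scale matrix (the $\Sigma$ above) and degrees of freedom $m$; the particular $\Sigma$, $m$, and $D$ below are illustrative.

```python
import numpy as np
from scipy.stats import invwishart

D = 2                                   # stimulus dimensions
m = D + 1                               # degrees of freedom
Sigma = np.diag([2.0, 0.5])             # mean covariance matrix parameter

mode = Sigma / (m + D + 1)
print(invwishart.pdf(mode, df=m, scale=Sigma))
# densities at rescaled covariance matrices should be no larger than at the mode
print(invwishart.pdf(1.5 * mode, df=m, scale=Sigma))
print(invwishart.pdf(0.5 * mode, df=m, scale=Sigma))
```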
43 This new basis set gives the possible dimensional biases for this cluster. [sent-95, score-0.503]
44 However, using Gaussian distributions for each cluster, with a unimodal prior on the covariance matrix, greatly limits the patterns of generalizations that can be produced. [sent-96, score-0.317]
45 For a diagonal covariance matrix, strong generalization along a particular dimension would be produced if the covariance matrix has a high variance along that dimension, but low variances along the remaining dimensions. [sent-97, score-0.718]
46 Thus, this model can learn to strongly generalize along one dimension, but people often make strong generalizations along multiple dimensions [5], such as in Equation 2 when r < 1. [sent-98, score-0.934]
47 A unimodal prior on covariance matrices cannot produce this behavior, so we use a mixture of inverse-Wishart distributions as a prior for covariance matrices, $p(\Sigma_k \mid \mathbf{u}_{k-1}, \lambda) = \sum_{j=1}^{J} p(u_k = j \mid \mathbf{u}_{k-1})\, p(\Sigma_k \mid \lambda_j, u_k = j)$ (6), where $\Sigma_k$ is the covariance parameter for the kth component and $\lambda_j$ denotes the parameters of the jth inverse-Wishart component. [sent-99, score-0.591]
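A small sketch of how a cluster covariance could be drawn from a mixture of inverse-Wishart components as in Equation 6, assuming two components (J = 2), fixed mixing weights, and the scipy parameterization; the specific scale matrices and weights are hypothetical.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)

D = 2
m = D + 1                                        # degrees of freedom
scales = [np.diag([4.0, 0.2]),                   # component 1: wide along dimension 1
          np.diag([0.2, 4.0])]                   # component 2: wide along dimension 2
mix_weights = np.array([0.5, 0.5])               # p(u_k = j)

def sample_cluster_covariance():
    """Draw Sigma_k for a new cluster: pick a component j, then an inverse-Wishart draw."""
    j = rng.choice(len(scales), p=mix_weights)
    return invwishart.rvs(df=m, scale=scales[j])

print(sample_cluster_covariance())
```

Because each component concentrates its variance on a different axis, clusters drawn from this prior tend to be elongated along one separable dimension or the other rather than isotropic.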
48 We now have two infinite mixtures: one that allows a category to be composed of a mixture of clusters, and one that allows the prior for the covariance matrices to be composed of a mixture of inverse-Wishart distributions. [sent-109, score-0.508]
49 5 Learning the Prior The categories we learn during development often vary along separable dimensions – and people are sensitive to this variability. [sent-113, score-0.946]
50 The linguistic classification of nouns helps to identify categories that are fixed on one separable dimension and variable on others. [sent-114, score-0.525]
51 These two types of nouns show an interesting regularity: count nouns are often relatively similar in size but vary greatly in color, while mass nouns are often relatively fixed in color but vary greatly in size. [sent-118, score-0.521]
52 The discriminability judgments of adults were used to scale the parameters of the stimuli, so that one step in color caused the same gain in discriminability as one step in size. [sent-121, score-0.358]
53 Participants were asked to group the stimuli into clusters. Figure 1: Schematic illustration of the hierarchical prior over covariance matrices. [sent-122, score-0.41]
54 The plot on the right shows some schematic examples of natural categories that tend to vary along either color or size. [sent-126, score-0.426]
55 The partitions of the stimuli into clusters that participants produced tended toward three informative patterns, shown in Figure 2. [sent-129, score-0.318]
56 The One-dimensional Similarity pattern is more biased towards generalizing along separable dimensions than the Overall Similarity pattern. [sent-131, score-0.541]
57 The strongest dimensional biases are shown by the One-Dimensional Identity pattern, with the dimensional match overriding the close isotropic similarity between neighboring stimuli. [sent-132, score-0.859]
58 The percentage of One-dimensional Identity clusterings increased with age, and was the dominant response for adults, supporting the idea that strong dimensional biases are learned. [sent-138, score-0.466]
59 We trained our model with clusters that were aligned with the dimensions of size and color. [sent-139, score-0.48]
60 Half of the clusters varied strongly in color and weakly in size, while the other half varied strongly in size and weakly in color. [sent-140, score-0.32]
61 The two dispersion parameters were set to 1, the degrees of freedom for all inverse-Wishart distributions were set to the number of dimensions plus 1, and 0. [sent-142, score-0.351]
62 In a free categorization task, the stimuli marked by dots in the top row were grouped by participants. [sent-148, score-0.447]
63 6 Generalization Gradients Standard models of categorization, such as the prototype or exemplar model, have a variety of mechanisms for producing the dimensional biases seen in experiments with adults. [sent-162, score-0.95]
64 In standard models of categorization, similarity ratings are modeled mainly by the exponent in the Minkowski power metric (Equation 2). [sent-169, score-0.386]
65 For rational models, similarity ratings can be modeled as the posterior predictive probability of one item, given the second item [20]. [sent-170, score-0.505]
66 The first two columns of Figure 3 give a comparison between the exemplar model and the model we propose for similarity ratings. [sent-171, score-0.468]
67 For integral dimensions, a Euclidean metric (r = 2) is used in the exemplar model, which the model we propose matches if it has not been trained on dimension-aligned categories. [sent-173, score-0.378]
68 For separable categories, the exemplar model usually uses a city-block metric (r = 1) [10]. [sent-174, score-0.476]
69 In experiments to test violations of the triangle inequality, Tversky and Gati [5] showed that the best-fitting exponent for similarity data is often r < 1. [sent-176, score-0.327]
70 The model we propose can produce this type of similarity prediction by using a prior that is a mixture of covariance matrices, in which each component of the mixture generalizes strongly along one dimension. [sent-177, score-0.66]
71 In a category of one item, which is the case when making similarity judgments with the posterior predictive distribution, it is uncertain which covariance component best describes the category. [sent-178, score-0.62]
72 As a result, our proposed model predicts violations of the triangle inequality if it has been trained on a set of clusters in which some vary strongly along one dimension and others vary strongly along another dimension. [sent-180, score-0.741]
73 A comparison between this generalization gradient and the exemplar model is shown in the second column of Figure 3. [sent-181, score-0.381]
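A sketch of the mechanism just described, assuming two covariance components that each generalize strongly along one dimension; similarity to a single stored item is computed as the average posterior-predictive density over the uncertain component. The covariance values and test points are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

covs = [np.diag([4.0, 0.1]),                     # generalizes strongly along dimension 1
        np.diag([0.1, 4.0])]                     # generalizes strongly along dimension 2

def similarity(a, b):
    """Similarity of b to the one-item category {a}: average over the covariance
    components that might have generated that item."""
    return np.mean([multivariate_normal.pdf(b, mean=a, cov=c) for c in covs])

a = np.array([0.0, 0.0])
b_axis = np.array([2.0, 0.0])                    # matches a on the second dimension
b_diag = np.array([np.sqrt(2.0), np.sqrt(2.0)])  # same Euclidean distance, off-axis
print(similarity(a, b_axis), similarity(a, b_diag))
```

The axis-aligned pair comes out far more similar than the equally distant diagonal pair, the diamond-shaped contour that a Minkowski exponent of r < 1 would produce.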
74 The second mechanism for dimensional biases in standard models of categorization is selective attention. [sent-182, score-0.856]
75 Selective attention is used to describe biases that occur in categorization experiments in which many training items are presented for each category. [sent-183, score-0.798]
76 These biases are implemented in the exemplar model as weights along each dimension, and early in learning there are usually large weights on a small number of separable dimensions [14, 21]. [sent-184, score-1.159]
77 Our proposed model does not have a mechanism for selective attention, but provides a rational explanation for this effect in terms of the strong sampling assumption [13]. [sent-185, score-0.32]
78 If two items are assumed to come from the same cluster, then generalization tends to be along a single dimension that has varied during training (third column of Figure 3). [sent-186, score-0.406]
79 However, if two items are inferred to belong to different clusters, then the generalization gradient corresponds to additive similarity without selective attention (fourth column of Figure 3). [sent-187, score-0.443]
80 We have shown that the model we have proposed can reproduce the key generalization gradients of the exemplar and prototype models. [sent-188, score-0.606]
81 The important difference between our model of dimensional biases and these standard categorization models is that we learn the basis set for dimensional biases, assuming these dimensions have proven to be useful for predicting category structure in the past. [sent-189, score-1.583]
82 To show that our model is not biased towards a particular basis set, we rotated the training stimuli 45 degrees in space. [sent-191, score-0.355]
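A minimal example of the rotation check, assuming the training stimuli are stored as an (n, 2) array of coordinates; the array contents are made up for illustration.

```python
import numpy as np

stimuli = np.array([[1.0, 0.2], [2.0, 0.3], [0.9, 1.8]])   # hypothetical size/color values
theta = np.deg2rad(45.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated_stimuli = stimuli @ R.T        # training set rotated 45 degrees in stimulus space
```

Training on the rotated set should lead the model to generalize along the rotated axes instead, since nothing in the prior singles out the original basis.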
83 7 Discussion The approach to dimensional biases we have outlined in this paper provides a single explanation for dimensional biases, in contrast to standard models of categorization, such as exemplar and prototype models. [sent-193, score-1.086]
84 These standard models of categorization assume two distinct mechanisms for producing dimensional biases: a Minkowski metric exponent, and attentional weights for each dimension. [sent-194, score-0.695]
85 In our approach, biases in both similarity judgments and categorization experiments are produced by learning covariance matrices that are shared between clusters. [sent-195, score-0.971]
86 For similarity judgments, the single item does not give information about which covariance mixture component was used to generate it. [sent-196, score-0.51]
87 This uncertainty produces similarity judgments that would be best fit with a Minkowski exponent of r < 1. [sent-197, score-0.316]
88 For category judgments, the alignment of the items along a dimension allows the generating covariance mixture component to be inferred, so the judgments will show a bias like that of attentional weights to the dimensions. [sent-198, score-0.902]
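A sketch of how alignment along a dimension disambiguates the generating component, reusing the two hypothetical covariance components from the similarity sketch above and assuming equal prior weights and a cluster mean fixed at the item mean for brevity.

```python
import numpy as np
from scipy.stats import multivariate_normal

covs = [np.diag([4.0, 0.1]), np.diag([0.1, 4.0])]          # same illustrative components

def component_posterior(items):
    """Posterior over which covariance component generated a set of items."""
    mu = items.mean(axis=0)
    liks = np.array([np.prod([multivariate_normal.pdf(x, mean=mu, cov=c) for x in items])
                     for c in covs])
    return liks / liks.sum()

aligned = np.array([[0.0, 0.0], [3.0, 0.0]])   # two items matching on the second dimension
print(component_posterior(aligned))            # nearly all mass on the dimension-1 component
```

Once the component is pinned down, generalization from the category is narrow on the matched dimension and broad on the varying one, mimicking attentional weights.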
89 The difference between tasks drives the different types of dimensional biases in our approach. [sent-199, score-0.413]
90 Attention to dimensions is learned in connectionist models of categorization by finding the best single set of weights for each dimension in a basis set [15, 16], or by cross-category learning in a Bayesian approach [22]. [sent-201, score-0.909]
91 The only previous approach we are aware of that learns such complex cross-category information is a Bayesian rule-based model of categorization [25]. [sent-206, score-0.409]
92 The main advantage of our approach over many other models of categorization is that we learn the basis set of dimensions that can display dimensional biases. [sent-207, score-0.93]
93 Our model learns the basis the same way people do, from categories in the environment (as opposed to fitting to human similarity or category judgements). [sent-208, score-0.887]
94 Using a version of the Transformed Dirichlet Process [26], a close relation to the Hierarchical Dirichlet Process previously proposed as a unifying model of categorization [17], a mixture of covariance matrices is learned from environmentally plausible training data. [sent-210, score-0.55]
95 Most other models of categorization, including exemplar models [10], prototype models [8], rule-based discriminative models [27], hierarchical Bayesian models for learning features [24, 22], and Bayesian rule-based models [25], must have a pre-specified basis set. [sent-211, score-0.899]
96 8 Summary and Conclusions People generalize categories in two ways: they generalize to stimuli with parameters near to the category and generalize to stimuli that match along separable dimensions. [sent-212, score-1.123]
97 Existing models of categorization must assume the dimensions to produce human-like generalization performance. [sent-213, score-0.752]
98 Our model learns these dimensions from the data: starting with an unbiased prior, the dimensions that categories vary along are learned to be dimensions important for generalization. [sent-214, score-1.215]
99 After training the model with categories intended to mirror those learned during development, our model reproduces the trajectory of generalization biases as children grow into adults. [sent-215, score-0.696]
100 Unifying rational models of categorization via the hierarchical Dirichlet process. [sent-323, score-0.603]
wordName wordTfidf (topN-words)
[('categorization', 0.332), ('dimensions', 0.272), ('biases', 0.27), ('exemplar', 0.254), ('category', 0.206), ('prototype', 0.188), ('categories', 0.183), ('similarity', 0.146), ('separable', 0.146), ('rational', 0.144), ('dimensional', 0.143), ('item', 0.135), ('people', 0.134), ('developmental', 0.123), ('nouns', 0.117), ('isotropic', 0.117), ('covariance', 0.116), ('stimuli', 0.115), ('items', 0.111), ('judgments', 0.107), ('generalizations', 0.106), ('cn', 0.094), ('minkowski', 0.094), ('generalization', 0.093), ('clusters', 0.092), ('adults', 0.091), ('basis', 0.09), ('psychological', 0.089), ('along', 0.087), ('children', 0.082), ('dimension', 0.079), ('xn', 0.078), ('generalize', 0.077), ('uk', 0.075), ('participants', 0.073), ('smith', 0.072), ('cogsci', 0.07), ('violations', 0.07), ('mixture', 0.068), ('color', 0.066), ('exponent', 0.063), ('associative', 0.061), ('objects', 0.057), ('selective', 0.056), ('models', 0.055), ('strong', 0.053), ('vary', 0.052), ('human', 0.051), ('psychology', 0.051), ('prior', 0.05), ('assignments', 0.05), ('review', 0.049), ('weights', 0.048), ('trained', 0.048), ('triangle', 0.048), ('rotated', 0.048), ('discriminability', 0.047), ('dispersion', 0.047), ('crp', 0.046), ('strongly', 0.046), ('patterns', 0.045), ('component', 0.045), ('learns', 0.043), ('age', 0.043), ('metric', 0.042), ('medin', 0.041), ('rosch', 0.041), ('perfors', 0.041), ('tversky', 0.041), ('cognitive', 0.041), ('modeled', 0.041), ('london', 0.04), ('match', 0.04), ('mechanisms', 0.04), ('ratings', 0.039), ('components', 0.038), ('partitions', 0.038), ('expectations', 0.038), ('schematic', 0.038), ('rotating', 0.038), ('sanborn', 0.038), ('learn', 0.038), ('hierarchical', 0.037), ('gradients', 0.037), ('attention', 0.037), ('biased', 0.036), ('varied', 0.036), ('cluster', 0.036), ('grif', 0.035), ('dirichlet', 0.035), ('attentional', 0.035), ('nosofsky', 0.035), ('development', 0.034), ('aligned', 0.034), ('model', 0.034), ('connectionist', 0.033), ('heller', 0.033), ('explanation', 0.033), ('degrees', 0.032), ('weakly', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization
Author: Adam Sanborn, Nick Chater, Katherine A. Heller
Abstract: Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. While from a mathematical point of view, an infinite number of basis sets can be used to represent points in this space, the choice of basis set is psychologically crucial. People generally choose the same basis dimensions – and have a strong preference to generalize along the axes of these dimensions, but not “diagonally”. What makes some choices of dimension special? We explore the idea that the dimensions used by people echo the natural variation in the environment. Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. This bias is shaped by exposing the model to many categories with a structure hypothesized to be like those which children encounter. The learning behaviour of the model captures the developmental shift from roughly “isotropic” for children to the axis-aligned generalization that adults show. 1
2 0.19031973 21 nips-2009-Abstraction and Relational learning
Author: Charles Kemp, Alan Jern
Abstract: Most models of categorization learn categories defined by characteristic features but some categories are described more naturally in terms of relations. We present a generative model that helps to explain how relational categories are learned and used. Our model learns abstract schemata that specify the relational similarities shared by instances of a category, and our emphasis on abstraction departs from previous theoretical proposals that focus instead on comparison of concrete instances. Our first experiment suggests that abstraction can help to explain some of the findings that have previously been used to support comparison-based approaches. Our second experiment focuses on one-shot schema learning, a problem that raises challenges for comparison-based approaches but is handled naturally by our abstraction-based account.
3 0.16459911 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships
Author: Tomasz Malisiewicz, Alyosha Efros
Abstract: The use of context is critical for scene understanding in computer vision, where the recognition of an object is driven by both local appearance and the object’s relationship to other elements of the scene (context). Most current approaches rely on modeling the relationships between object categories as a source of context. In this paper we seek to move beyond categories to provide a richer appearance-based model of context. We present an exemplar-based model of objects and their relationships, the Visual Memex, that encodes both local appearance and 2D spatial context between object instances. We evaluate our model on Torralba’s proposed Context Challenge against a baseline category-based system. Our experiments suggest that moving beyond categories for context modeling appears to be quite beneficial, and may be the critical missing ingredient in scene understanding systems. 1
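The abstract above is high level, so the following sketch is only a guess at the general shape of an exemplar-based context model: a hidden object is scored by how well candidate exemplars agree, in appearance and 2D offset, with the visible objects. The memory contents, similarity kernels, and edge structure are all invented for illustration and are not the Visual Memex implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Memory of object exemplars: appearance descriptor plus a category label (labels are
# only used to read off a prediction; the matching itself is per exemplar).
memory = [
    {"cat": "car",  "app": rng.normal(size=8)},
    {"cat": "road", "app": rng.normal(size=8)},
    {"cat": "tree", "app": rng.normal(size=8)},
]
# Learned pairwise context: typical (dx, dy) offset between pairs of exemplars,
# edge (i, j) meaning "exemplar i typically sits at this offset relative to exemplar j".
edges = {(0, 1): np.array([0.0, 0.4]), (2, 1): np.array([0.5, 0.6])}

def app_sim(a, b, sigma=1.0):
    return float(np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2)))

def spatial_sim(offset, typical, sigma=0.2):
    return float(np.exp(-np.sum((offset - typical) ** 2) / (2 * sigma ** 2)))

def predict_hidden(visible, hidden_pos):
    """Score each memory exemplar as an explanation of a hidden object.

    visible: list of dicts with keys "app" (descriptor) and "pos" (2D position).
    hidden_pos: 2D position of the object whose identity we want to predict.
    A candidate exemplar scores well when some visible object resembles an exemplar
    it is linked to, and the observed offset matches the memorized one.
    """
    scores = np.zeros(len(memory))
    for obj in visible:
        offset = hidden_pos - obj["pos"]
        for (i, j), typical in edges.items():
            scores[i] += app_sim(obj["app"], memory[j]["app"]) * spatial_sim(offset, typical)
    return {memory[i]["cat"]: float(s) for i, s in enumerate(scores)}

visible = [{"app": memory[1]["app"] + 0.1, "pos": np.array([0.0, 0.0])}]
print(predict_hidden(visible, hidden_pos=np.array([0.0, 0.4])))
```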
4 0.16398905 154 nips-2009-Modeling the spacing effect in sequential category learning
Author: Hongjing Lu, Matthew Weiden, Alan L. Yuille
Abstract: We develop a Bayesian sequential model for category learning. The sequential model updates two category parameters, the mean and the variance, over time. We define conjugate temporal priors to enable closed form solutions to be obtained. This model can be easily extended to supervised and unsupervised learning involving multiple categories. To model the spacing effect, we introduce a generic prior in the temporal updating stage to capture a learning preference, namely, less change for repetition and more change for variation. Finally, we show how this approach can be generalized to efficiently perform model selection to decide whether observations are from one or multiple categories.
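One plausible reading of the abstract above is a conjugate Normal-Inverse-Gamma posterior over a category's mean and variance, updated one trial at a time, with extra smoothing between trials to encode "less change for repetition and more change for variation". The sketch below follows that reading; the tempering step and all parameter values are assumptions rather than the authors' temporal prior.

```python
from dataclasses import dataclass

@dataclass
class NormalInverseGamma:
    mu: float = 0.0      # belief about the category mean
    kappa: float = 1.0   # pseudo-count behind the mean
    alpha: float = 1.0   # shape of the variance belief
    beta: float = 1.0    # scale of the variance belief

    def update(self, x):
        """Standard conjugate update after observing one category member x."""
        kappa_n = self.kappa + 1.0
        mu_n = (self.kappa * self.mu + x) / kappa_n
        alpha_n = self.alpha + 0.5
        beta_n = self.beta + 0.5 * self.kappa * (x - self.mu) ** 2 / kappa_n
        return NormalInverseGamma(mu_n, kappa_n, alpha_n, beta_n)

    def temper(self, decay):
        """Crude stand-in for a temporal prior: shrink the accumulated evidence
        between trials so the category parameters are allowed to drift."""
        return NormalInverseGamma(self.mu, self.kappa * decay,
                                  0.5 + (self.alpha - 0.5) * decay, self.beta * decay)

    def summary(self):
        """Posterior mean of the category mean and of its variance."""
        var = self.beta / (self.alpha - 1.0) if self.alpha > 1.0 else float("inf")
        return self.mu, var

def learn(sequence, decay_repeat=0.99, decay_vary=0.90):
    """Sequentially learn one category from a stimulus sequence; repeated items
    get a sticky prior (little expected change), varied items a looser one."""
    post, prev = NormalInverseGamma(), None
    for x in sequence:
        decay = decay_repeat if prev == x else decay_vary
        post = post.temper(decay).update(x)
        prev = x
    return post

massed = learn([1.0] * 6)                        # the same item repeated
spaced = learn([0.8, 1.2, 1.0, 0.9, 1.1, 1.0])   # spaced / varied presentations
print("massed :", massed.summary())
print("spaced :", spaced.summary())
```

The closed-form update is what makes this kind of sequential model cheap to run trial by trial; the only non-standard ingredient is whatever prior governs how much the parameters may change between trials.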
5 0.12766229 133 nips-2009-Learning models of object structure
Author: Joseph Schlecht, Kobus Barnard
Abstract: We present an approach for learning stochastic geometric models of object categories from single view images. We focus here on models expressible as a spatially contiguous assemblage of blocks. Model topologies are learned across groups of images, and one or more such topologies is linked to an object category (e.g. chairs). Fitting learned topologies to an image can be used to identify the object class, as well as detail its geometry. The latter goes beyond labeling objects, as it provides the geometric structure of particular instances. We learn the models using joint statistical inference over category parameters, camera parameters, and instance parameters. These produce an image likelihood through a statistical imaging model. We use trans-dimensional sampling to explore topology hypotheses, and alternate between Metropolis-Hastings and stochastic dynamics to explore instance parameters. Experiments on images of furniture objects such as tables and chairs suggest that this is an effective approach for learning models that encode simple representations of category geometry and the statistics thereof, and support inferring both category and geometry on held out single view images. 1
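The abstract above alternates trans-dimensional moves over topologies with sampling of instance parameters. The snippet below shows only the simpler half, a plain Metropolis-Hastings random walk over continuous instance parameters under a toy likelihood; the topology moves, camera parameters, and imaging model described in the abstract would sit on top of a kernel like this. The data, priors, and step sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: infer a block's width and height (the "instance parameters")
# from noisy measurements of its silhouette extents.
true_params = np.array([2.0, 1.0])                     # width, height
data = true_params + rng.normal(scale=0.1, size=(20, 2))

def log_likelihood(params):
    return float(-0.5 * np.sum((data - params) ** 2) / 0.1 ** 2)

def log_prior(params):
    return float(-0.5 * np.sum(params ** 2) / 10.0 ** 2)   # broad Gaussian prior

def metropolis_hastings(n_steps=5000, step=0.05):
    x = np.array([1.0, 1.0])
    lp = log_likelihood(x) + log_prior(x)
    samples = []
    for _ in range(n_steps):
        prop = x + rng.normal(scale=step, size=2)      # symmetric random-walk proposal
        lp_prop = log_likelihood(prop) + log_prior(prop)
        if np.log(rng.uniform()) < lp_prop - lp:       # MH accept/reject step
            x, lp = prop, lp_prop
        samples.append(x.copy())
    return np.array(samples)

samples = metropolis_hastings()
print("posterior mean of (width, height):", samples[1000:].mean(axis=0))
```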
6 0.11682653 115 nips-2009-Individuation, Identification and Object Discovery
7 0.10505503 112 nips-2009-Human Rademacher Complexity
8 0.09010122 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
9 0.088305965 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference
10 0.085693114 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process
11 0.082209945 196 nips-2009-Quantification and the language of thought
12 0.081416465 152 nips-2009-Measuring model complexity with the prior predictive
13 0.081065968 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
14 0.080411032 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization
15 0.079186633 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
16 0.077943295 85 nips-2009-Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model
17 0.070962049 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
18 0.070641778 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm
19 0.067107677 43 nips-2009-Bayesian estimation of orientation preference maps
20 0.066572368 183 nips-2009-Optimal context separation of spiking haptic signals by second-order somatosensory neurons
topicId topicWeight
[(0, -0.218), (1, -0.124), (2, -0.032), (3, -0.073), (4, 0.057), (5, 0.013), (6, 0.032), (7, 0.031), (8, -0.042), (9, -0.037), (10, 0.103), (11, -0.117), (12, 0.127), (13, -0.224), (14, 0.116), (15, 0.003), (16, 0.047), (17, 0.19), (18, -0.148), (19, 0.097), (20, -0.097), (21, -0.084), (22, 0.065), (23, 0.102), (24, -0.093), (25, -0.107), (26, -0.023), (27, -0.031), (28, -0.018), (29, 0.032), (30, 0.048), (31, 0.047), (32, 0.027), (33, 0.07), (34, -0.124), (35, 0.045), (36, 0.039), (37, 0.031), (38, -0.012), (39, -0.04), (40, -0.034), (41, -0.082), (42, -0.038), (43, -0.003), (44, -0.005), (45, 0.045), (46, -0.007), (47, 0.006), (48, 0.064), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.97411448 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization
Author: Adam Sanborn, Nick Chater, Katherine A. Heller
Abstract: Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. While from a mathematical point of view, an infinite number of basis sets can be used to represent points in this space, the choice of basis set is psychologically crucial. People generally choose the same basis dimensions – and have a strong preference to generalize along the axes of these dimensions, but not “diagonally”. What makes some choices of dimension special? We explore the idea that the dimensions used by people echo the natural variation in the environment. Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. This bias is shaped by exposing the model to many categories with a structure hypothesized to be like those which children encounter. The learning behaviour of the model captures the developmental shift from roughly “isotropic” for children to the axis-aligned generalization that adults show. 1
2 0.82273716 115 nips-2009-Individuation, Identification and Object Discovery
Author: Charles Kemp, Alan Jern, Fei Xu
Abstract: Humans are typically able to infer how many objects their environment contains and to recognize when the same object is encountered twice. We present a simple statistical model that helps to explain these abilities and evaluate it in three behavioral experiments. Our first experiment suggests that humans rely on prior knowledge when deciding whether an object token has been previously encountered. Our second and third experiments suggest that humans can infer how many objects they have seen and can learn about categories and their properties even when they are uncertain about which tokens are instances of the same object. From an early age, humans and other animals [1] appear to organize the flux of experience into a series of encounters with discrete and persisting objects. Consider, for example, a young child who grows up in a home with two dogs. At a relatively early age the child will solve the problem of object discovery and will realize that her encounters with dogs correspond to views of two individuals rather than one or three. The child will also solve the problem of identification, and will be able to reliably identify an individual (e.g. Fido) each time it is encountered. This paper presents a Bayesian approach that helps to explain both object discovery and identification. Bayesian models are appealing in part because they help to explain how inferences are guided by prior knowledge. Imagine, for example, that you see some photographs taken by your friends Alice and Bob. The first shot shows Alice sitting next to a large statue and eating a sandwich, and the second is similar but features Bob rather than Alice. The statues in each photograph look identical, and probably you will conclude that the two photographs are representations of the same statue. The sandwiches in the photographs also look identical, but probably you will conclude that the photographs show different sandwiches. The prior knowledge that contributes to these inferences appears rather complex, but we will explore some much simpler cases where prior knowledge guides identification. A second advantage of Bayesian models is that they help to explain how learners cope with uncertainty. In some cases a learner may solve the problem of object discovery but should maintain uncertainty when faced with identification problems. For example, I may be quite certain that I have met eight different individuals at a dinner party, even if I am unable to distinguish between two guests who are identical twins. In other cases a learner may need to reason about several related problems even if there is no definitive solution to any one of them. Consider, for example, a young child who must simultaneously discover which objects her world contains (e.g. Mother, Father, Fido, and Rex) and organize them into categories (e.g. people and dogs). Many accounts of categorization seem to implicitly assume that the problem of identification must be solved before categorization can begin, but we will see that a probabilistic approach can address both problems simultaneously. Identification and object discovery have been discussed by researchers from several disciplines, including psychology [2, 3, 4, 5, 6], machine learning [7, 8], statistics [9], and philosophy [10]. Many machine learning approaches can handle identity uncertainty, or uncertainty about whether two tokens correspond to the same object. 
Some approaches such such as BLOG [8] are able in addition to handle problems where the number of objects is not specified in advance. We propose 1 that some of these approaches can help to explain human learning, and this paper uses a simple BLOG-style approach [8] to account for human inferences. There are several existing psychological models of identification, and the work of Shepard [11], Nosofsky [3] and colleagues is probably the most prominent. Models in this tradition usually focus on problems where the set of objects is specified in advance and where identity uncertainty arises as a result of perceptual noise. In contrast, we focus on problems where the number of objects must be inferred and where identity uncertainty arises from partial observability rather than noise. A separate psychological tradition focuses on problems where the number of objects is not fixed in advance. Developmental psychologists, for example, have used displays where only one object token is visible at any time to explore whether young infants can infer how many different objects have been observed in total [4]. Our work emphasizes some of the same themes as this developmental research, but we go beyond previous work in this area by presenting and evaluating a computational approach to object identification and discovery. The problem of deciding how many objects have been observed is sometimes called individuation [12] but here we treat individuation as a special case of object discovery. Note, however, that object discovery can also refer to cases where learners infer the existence of objects that have never been observed. Unobserved-object discovery has received relatively little attention in the psychological literature, but is addressed by statistical models including including species-sampling models [9] and capture-recapture models [13]. Simple statistical models of this kind will not address some of the most compelling examples of unobserved-object discovery, such as the discovery of the planet Neptune, or the ability to infer the existence of a hidden object by following another person’s gaze [14]. We will show, however, that a simple statistical approach helps to explain how humans infer the existence of objects that they have never seen. 1 A probabilistic account of object discovery and identification Object discovery and identification may depend on many kinds of observations and may be supported by many kinds of prior knowledge. This paper considers a very simple setting where these problems can be explored. Suppose that an agent is learning about a world that contains nw white balls and n − nw gray balls. Let f (oi ) indicate the color of ball oi , where each ball is white (f (oi ) = 1) or gray (f (oi ) = 0). An agent learns about the world by observing a sequence of object tokens. Suppose that label l(j) is a unique identifier of token j—in other words, suppose that the jth token is a token of object ol(j) . Suppose also that the jth token is observed to have feature value g(j). Note the difference between f and g: f is a vector that specifies the color of the n balls in the world, and g is a vector that specifies the color of the object tokens observed thus far. We define a probability distribution over token sequences by assuming that a world is sampled from a prior P (n, nw ) and that tokens are sampled from this world. 
The full generative model is:

P (n) ∝ 1/n if n ≤ 1000, and P (n) = 0 otherwise (1)
nw | n ∼ Uniform(0, n) (2)
l(j) | n ∼ Uniform(1, n) (3)
g(j) = f (ol(j) ) (4)

A prior often used for inferences about a population of unknown size is the scale-invariant Jeffreys prior P (n) ∝ 1/n [15]. We follow this standard approach here but truncate at n = 1000. Choosing some upper bound is convenient when implementing the model, and has the advantage of producing a prior that is proper (note that the Jeffreys prior is improper). Equation 2 indicates that the number of white balls nw is sampled from a discrete uniform distribution. Equation 3 indicates that each token is generated by sampling one of the n balls in the world uniformly at random, and Equation 4 indicates that the color of each token is observed without noise. The generative assumptions just described can be used to define a probabilistic approach to object discovery and identification. Suppose that the observations available to a learner consist of a fully-observed feature vector g and a partially-observed label vector lobs . Object discovery and identification can be addressed by using the posterior distribution P (l|g, lobs ) to make inferences about the number of distinct objects observed and about the identity of each token. Computing the posterior distribution P (n|g, lobs ) allows the learner to make inferences about the total number of objects in the world. In some cases, the learner may solve the problem of unobserved-object discovery by realizing that the world contains more objects than she has observed thus far. The next sections explore the idea that the inferences made by humans correspond approximately to the inferences of this ideal learner. Since the ideal learner allows for the possible existence of objects that have not yet been observed, we refer to our model as the open world model. Although we make no claim about the psychological mechanisms that might allow humans to approximate the predictions of the ideal learner, in practice we need some method for computing the predictions of our model. Since the domains we consider are relatively small, all results in this paper were computed by enumerating and summing over the complete set of possible worlds.

2 Experiment 1: Prior knowledge and identification

The introduction described a scenario (the statue and sandwiches example) where prior knowledge appears to guide identification. Our first experiment explores a very simple instance of this idea. We consider a setting where participants observe balls that are sampled with replacement from an urn. In one condition, participants sample the same ball from the urn on four consecutive occasions and are asked to predict whether the token observed on the fifth draw is the same ball that they saw on the first draw. In a second condition participants are asked exactly the same question about the fifth token but sample four different balls on the first four draws. We expect that these different patterns of data will shape the prior beliefs that participants bring to the identification problem involving the fifth token, and that participants in the first condition will be substantially more likely to identify the fifth token as a ball that they have seen before. Although we consider an abstract setting involving balls and urns, the problem we explore has some real-world counterparts. Suppose, for example, that a colleague wears the same tie to four formal dinners.
Based on this evidence you might be able to estimate the total number of ties that he owns, and might guess that he is less likely to wear a new tie to the next dinner than a colleague who wore different ties to the first four dinners. Method. 12 adults participated for course credit. Participants interacted with a computer interface that displayed an urn, a robotic arm and a beam of UV light. The arm randomly sampled balls from the urn, and participants were told that each ball had a unique serial number that was visible only under UV light. After some balls were sampled, the robotic arm moved them under the UV light and revealed their serial numbers before returning them to the urn. Other balls were returned directly to the urn without having their serial numbers revealed. The serial numbers were alphanumeric strings such as “QXR182”—note that these serial numbers provide no information about the total number of objects, and that our setting is therefore different from the Jeffreys tramcar problem [15]. The experiment included five within-participant conditions shown in Figure 1. The observations for each condition can be summarized by a string that indicates the number of tokens and the serial numbers of some but perhaps not all tokens. The 1 1 1 1 1 condition in Figure 1a is a case where the same ball (without loss of generality, we call it ball 1) is drawn from the urn on five consecutive occasions. The 1 2 3 4 5 condition in Figure 1b is a case where five different balls are drawn from the urn. The 1 condition in Figure 1d is a case where five draws are made, but only the serial number of the first ball is revealed. Within any of the five conditions, all of the balls had the same color (white or gray), but different colors were used across different conditions. For simplicity, all draws in Figure 1 are shown as white balls. On the second and all subsequent draws, participants were asked two questions about any token that was subsequently identified. They first indicated whether the token was likely to be the same as the ball they observed on the first draw (the ball labeled 1 in Figure 1). They then indicated whether the token was likely to be a ball that they had never seen before. Both responses were provided on a scale from 1 (very unlikely) to 7 (very likely). At the end of each condition, participants were asked to estimate the total number of balls in the urn. Twelve options were provided ranging from “exactly 1” to “exactly 12,” and a thirteenth option was labeled “more than 12.” Responses to each option were again provided on a seven point scale. Model predictions and results. The comparisons of primary interest involve the identification questions in conditions 1a and 1b. In condition 1a the open world model infers that the total number of balls is probably low, and becomes increasingly confident that each new token is the same as the 3 a) b) 1 1 1 1 1 ?NEW = NEW 1 2 3 4 5 ? = (1) ?NEW = NEW BALL 1 BALL (1) NEW 5 5 3 3 3 3 1 1 1 1 Open world 7 5 0.66 DP mixture 7 5 0.66 PY mixture Human 7 ? = (1) BALL 1 1 1 0.66 0.66 0.33 0.33 0 0 7 13 0.66 9 0.33 5 0.33 5 0 1 0 1 1 # Balls 1 # Balls 0.66 1 1 ? (1)(?) 1 2 ? (1)(2)(?) (1)(2)(3)(?) 1 2 3 ? (1)(2)(3)(4)(?) 1 2 3 4 ? d) e) 5 5 3 3 3 1 1 1 13 13 13 9 9 9 5 5 5 1 1 1 # Balls # Balls 1 3 5 7 9 11 +12 7 5 1 3 5 7 9 11 +12 7 1 3 5 7 9 11 +12 7 Human 1 1 ? (1)(?) 1 2 ? (1)(2)(?) (1)(2)(3)(?) 1 2 3 ? (1)(2)(3)(4)(?) 1 2 3 4 ? 0 1 ? (1)(?) 1 1 ? (1)(1)(?) 1 1 1 ? (1)(1)(1)(?) (1)(1)(1)(1)(?) 1 1 1 1 ? 0.33 0 1 ? (1)(?) 1 1 ? (1)(1)(?) 
1 1 1 ? (1)(1)(1)(?) (1)(1)(1)(1)(?) 1 1 1 1 ? 0.33 1 3 5 7 9 11 +12 1 9 1 3 5 7 9 11 +12 13 Open world c) 1 # Balls Figure 1: Model predictions and results for the five conditions in experiment 1. The left columns in (a) and (b) show inferences about the identification questions. In each plot, the first group of bars shows predictions about the probability that each new token is the same ball as the first ball drawn from the urn. The second group of bars shows the probability that each new token is a ball that has never been seen before. The right columns in (a) and (b) and the plots in (c) through (e) show inferences about the total number of balls in each urn. All human responses are shown on the 1-7 scale used for the experiment. Model predictions are shown as probabilities (identification questions) or ranks (population size questions). first object observed. In condition 1b the model infers that the number of balls is probably high, and becomes increasingly confident that each new token is probably a new ball. The rightmost charts in Figures 1a and 1b show inferences about the total number of balls and confirm that humans expect the number of balls to be low in condition 1a and high in condition 1b. Note that participants in condition 1b have solved the problem of unobserved-object discovery and inferred the existence of objects that they have never seen. The leftmost charts in 1a and 1b show responses to the identification questions, and the final bar in each group of four shows predictions about the fifth token sampled. As predicted by the model, participants in 1a become increasingly confident that each new token is the same object as the first token, but participants in 1b become increasingly confident that each new token is a new object. The increase in responses to the new ball questions in Figure 1b is replicated in conditions 2d and 2e of Experiment 2, and therefore appears to be reliable. 4 The third and fourth rows of Figures 1a and 1b show the predictions of two alternative models that are intuitively appealing but that fail to account for our results. The first is the Dirichlet Process (DP) mixture model, which was proposed by Anderson [16] as an account of human categorization. Unlike most psychological models of categorization, the DP mixture model reserves some probability mass for outcomes that have not yet been observed. The model incorporates a prior distribution over partitions—in most applications of the model these partitions organize objects into categories, but Anderson suggests that the model can also be used to organize object tokens into classes that correspond to individual objects. The DP mixture model successfully predicts that the ball 1 questions will receive higher ratings in 1a than 1b, but predicts that responses to the new ball question will be identical across these two conditions. According to this model, the probability that a new token θ corresponds to a new object is m+θ where θ is a hyperparameter and m is the number of tokens observed thus far. Note that this probability is the same regardless of the identities of the m tokens previously observed. The Pitman Yor (PY) mixture model in the fourth row is a generalization of the DP mixture model that uses a prior over partitions defined by two hyperparameters [17]. According to this model, the probability that a new token corresponds to a new object is θ+kα , where θ and α are hyperparameters m+θ and k is the number of distinct objects observed so far. 
The flexibility offered by a second hyperparameter allows the model to predict a difference in responses to the new ball questions across the two conditions, but the model does not account for the increasing pattern observed in condition 1b. Most settings of θ and α predict that the responses to the new ball questions will decrease in condition 1b. A non-generic setting of these hyperparameters with θ = 0 can generate the flat predictions in Figure 1, but no setting of the hyperparameters predicts the increase in the human responses. Although the PY and DP models both make predictions about the identification questions, neither model can predict the total number of balls in the urn. Both models assume that the population of balls is countably infinite, which does not seem appropriate for the tasks we consider. Figures 1c through 1d show results for three control conditions. Like condition 1a, 1c and 1d are cases where exactly one serial number is observed. Like conditions 1a and 1b, 1d and 1e are cases where exactly five tokens are observed. None of these control conditions produces results similar to conditions 1a and 1b, suggesting that methods which simply count the number of tokens or serial numbers will not account for our results. In each of the final three conditions our model predicts that the posterior distribution on the number of balls n should decay as n increases. This prediction is not consistent with our data, since most participants assigned equal ratings to all 13 options, including “exactly 12 balls” and “more than 12 balls.” The flat responses in Figures 1c through 1e appear to indicate a generic desire to express uncertainty, and suggest that our ideal learner model accounts for human responses only after several informative observations have been made. 3 Experiment 2: Object discovery and identity uncertainty Our second experiment focuses on object discovery rather than identification. We consider cases where learners make inferences about the number of objects they have seen and the total number of objects in the urn even though there is substantial uncertainty about the identities of many of the tokens observed. Our probabilistic model predicts that observations of unidentified tokens can influence inferences about the total number of objects, and our second experiment tests this prediction. Method. 12 adults participated for course credit. The same participants took part in Experiments 1 and 2, and Experiment 2 was always completed after Experiment 1. Participants interacted with the same computer interface in both conditions, and the seven conditions in Experiment 2 are shown in Figure 2. Note that each condition now includes one or more gray tokens. In 2a, for example, there are four gray tokens and none of these tokens is identified. All tokens were sampled with replacement, and the condition labels in Figure 2 summarize the complete set of tokens presented in each condition. Within each condition the tokens were presented in a pseudo-random order—in 2a, for example, the gray and white tokens were interspersed with each other. Model predictions and results. The cases of most interest are the inferences about the total number of balls in conditions 2a and 2c. In both conditions participants observe exactly four white tokens and all four tokens are revealed to be the same ball. The gray tokens in each condition are never identified, but the number of these tokens varies across the conditions. Even though the identities 5 a) ?NEW = NEW 1 1 1 1 1 1 1 1 ? 
= (1) BALL 1 ?NEW = NEW 7 7 5 5 5 5 3 3 3 3 1 1 1 1 7 5 0.33 5 0 1 0 1 # Balls c) 1 2 3 4 ? = (1) BALL 1 ?NEW = NEW 5 3 3 3 3 1 1 1 1 1 13 1 13 0.66 9 0.66 9 0.33 5 0.33 5 0 1 0 1 e) ? = (1) BALL 1 ?NEW = NEW 1 1 3 5 7 9 11 +12 # Balls g) 1 3 3 3 1 1 1 13 1 13 1 13 0.66 9 9 9 0.33 5 5 5 0 1 1 1 # Balls # Balls 1 3 5 7 9 11 +12 5 3 1 3 5 7 9 11 +12 7 5 1 3 5 7 9 11 +12 7 5 [ ]x1 (1)(?) x1 1 ? [ ]x1x1 1 2 ? (1)(2)(?) [ ]x3 x3 1 2 3 ? (1)(2)(3)(?) 7 5 [ ]x1 (1)(?) x1 1 ? [ ]x1x1 1 2 ? (1)(2)(?) [ ]x3 x3 1 2 3 ? (1)(2)(3)(?) Human 7 Open world f) 1 2 3 4 7 (1)(?) x1 1 ? [ ]x1x1 1 2 ? (1)(2)(?) [ ]x1 x1 1 2 3 ? (1)(2)(3)(?) # Balls (1)(?) x1 1 ? [ ]x1x1 1 2 ? (1)(2)(?) [ ]x1 x1 1 2 3 ? (1)(2)(3)(?) 5 1 3 5 7 9 11 +12 5 [ ]x3 (1)(?) x3 1 ? [ ]x6x6 1 1 ? (1)(1)(?) [ ]x9 x9 1 1 1 ? (1)(1)(1)(?) 7 5 [ ]x3 (1)(?) x3 1 ? [ ]x6x6 1 1 ? (1)(1)(?) [ ]x9 x9 1 1 1 ? (1)(1)(1)(?) 7 Human ?NEW = NEW Open world 7 ? = (1) BALL 1 # Balls d) 1 1 1 1 1 3 5 7 9 11 +12 9 0.33 [ ]x3 (1)(?) x3 1 ? 13 0.66 [ ]x3 (1)(?) x3 1 ? 1 9 1 3 5 7 9 11 +12 13 [ ]x2 (1)(?) x2 1 ? x3 1 1 ? [ ]x3 (1)(1)(?) [ ]x3x3 1 1 1 ? (1)(1)(1)(?) 1 0.66 [ ]x2 (1)(?) x2 1 ? [ ]x3 (1)(1)(?) x3 1 1 ? [ ]x3x3 1 1 1 ? (1)(1)(1)(?) Human 7 Open world b) 1 1 1 1 ? = (1) BALL 1 # Balls Figure 2: Model predictions and results for the seven conditions in Experiment 2. The left columns in (a) through (e) show inferences about the identification questions, and the remaining plots show inferences about the total number of balls in each urn. of the gray tokens are never revealed, the open world model can use these observations to guide its inference about the total number of balls. In 2a, the proportions of white tokens and gray tokens are equal and there appears to be only one white ball, suggesting that the total number of balls is around two. In 2c grey tokens are now three times more common, suggesting that the total number of balls is larger than two. As predicted, the human responses in Figure 2 show that the peak of the distribution in 2a shifts to the right in 2c. Note, however, that the model does not accurately predict the precise location of the peak in 2c. Some of the remaining conditions in Figure 2 serve as controls for the comparison between 2a and 2c. Conditions 2a and 2c differ in the total number of tokens observed, but condition 2b shows that 6 this difference is not the critical factor. The number of tokens observed is the same across 2b and 2c, yet the inference in 2b is more similar to the inference in 2a than in 2c. Conditions 2a and 2c also differ in the proportion of white tokens observed, but conditions 2f and 2g show that this difference is not sufficient to explain our results. The proportion of white tokens observed is the same across conditions 2a, 2f, and 2g, yet only 2a provides strong evidence that the total number of balls is low. The human inferences for 2f and 2g show the hint of an alternating pattern consistent with the inference that the total number of balls in the urn is even. Only 2 out of 12 participants generated this pattern, however, and the majority of responses are near uniform. Finally, conditions 2d and 2e replicate our finding from Experiment 1 that the identity labels play an important role. The only difference between 2a and 2e is that the four labels are distinct in the latter case, and this single difference produces a predictable divergence in human inferences about the total number of balls. 
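Since all of the model predictions above are computed by enumerating and summing over possible worlds, a small sketch of that enumeration may be useful before turning to Experiment 3. It follows the generative model in Equations 1-4, but truncates at n_max = 8 instead of 1000 to keep the loop tiny, and assumes that, given nw, the particular assignment of colors to balls is uniform; the function and variable names are illustrative, not the authors' code.

```python
import itertools
from fractions import Fraction
from math import comb

def posterior_n(tokens, n_max=8):
    """Posterior over the number of balls n, given a token sequence.

    tokens: list of (label, color) pairs; label is an int identifying the ball
    (1, 2, ...) or None if the serial number was never revealed; color is 0/1.
    Enumerates every world (n, colour assignment f) up to n_max, following
    P(n) proportional to 1/n, nw ~ Uniform(0, n), and uniform token labels.
    """
    post = {}
    for n in range(1, n_max + 1):
        p_n = Fraction(1, n)
        total = Fraction(0)
        for f in itertools.product([0, 1], repeat=n):      # colour of each ball
            nw = sum(f)
            p_world = p_n * Fraction(1, n + 1) * Fraction(1, comb(n, nw))
            like = Fraction(1)
            for label, color in tokens:
                if label is not None:
                    if label > n or f[label - 1] != color:  # inconsistent world
                        like = Fraction(0)
                        break
                    like *= Fraction(1, n)                   # this ball drawn
                else:
                    # unidentified token: sum over which ball it could have been
                    like *= Fraction(sum(1 for c in f if c == color), n)
            total += p_world * like
        post[n] = total
    z = sum(post.values())
    return {n: float(p / z) for n, p in post.items()}

# Condition 1a: the same ball drawn five times (all white).
print(posterior_n([(1, 1)] * 5))
# Condition 1b: five different balls drawn (all white).
print(posterior_n([(i, 1) for i in range(1, 6)]))
```

Exact rational arithmetic is used only to avoid worrying about underflow in this tiny example; run with the tokens of each condition, the same enumeration produces the kind of posteriors over n discussed above.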
4 Experiment 3: Categorization and identity uncertainty Experiment 2 suggested that people make robust inferences about the existence and number of unobserved objects in the presence of identity uncertainty. Our final experiment explores categorization in the presence of identity uncertainty. We consider an extreme case where participants make inferences about the variability of a category even though the tokens of that category have never been identified. Method. The experiment included two between subject conditions, and 20 adults were recruited for each condition. Participants were asked to reason about a category including eggs of a given species, where eggs in the same category might vary in size. The interface used in Experiments 1 and 2 was adapted so that the urn now contained two kinds of objects: notepads and eggs. Participants were told that each notepad had a unique color and a unique label written on the front. The UV light played no role in the experiment and was removed from the interface: notepads could be identified by visual inspection, and identifying labels for the eggs were never shown. In both conditions participants observed a sequence of 16 tokens sampled from the urn. Half of the tokens were notepads and the others were eggs, and all egg tokens were identical in size. Whenever an egg was sampled, participants were told that this egg was a Kwiba egg. At the end of the condition, participants were shown a set of 11 eggs that varied in size and asked to rate the probability that each one was a Kwiba egg. Participants then made inferences about the total number of eggs and the total number of notepads in the urn. The two conditions were intended to lead to different inferences about the total number of eggs in the urn. In the 4 egg condition, all items (notepad and eggs) were sampled with replacement. The 8 notepad tokens included two tokens of each of 4 notepads, suggesting that the total number of notepads was 4. Since the proportion of egg tokens and notepad tokens was equal, we expected participants to infer that the total number of eggs was roughly four. In the 1 egg condition, four notepads were observed in total, but the first three were sampled without replacement and never returned to the urn. The final notepad and the egg tokens were always sampled with replacement. After the first three notepads had been removed from the urn, the remaining notepad was sampled about half of the time. We therefore expected participants to infer that the urn probably contained a single notepad and a single egg by the end of the experiment, and that all of the eggs they had observed were tokens of a single object. Model. We can simultaneously address identification and categorization by combining the open world model with a Gaussian model of categorization. Suppose that the members of a given category (e.g. Kwiba eggs) vary along a single continuous dimension (e.g. size). We assume that the egg sizes are distributed according to a Gaussian with known mean and unknown variance σ 2 . For convenience, we assume that the mean is zero (i.e. we measure size with respect to the average) and β use the standard inverse-gamma prior on the variance: p(σ 2 ) ∝ (σ 2 )−(α+1) e− σ2 . Since we are interested only in qualitative predictions of the model, the precise values of the hyperparameters are not very important. To generate the results shown in Figure 3 we set α = 0.5 and β = 2. Before observing any eggs, the marginal distribution on sizes is p(x) = p(x|σ 2 )p(σ 2 )dσ 2 . 
Suppose now that we observe m random samples from the category and that each one has size zero. If m is large then these observations provide strong evidence that the variance σ 2 is small, and the posterior distribution p(x|m) will be tightly peaked around zero. If m, is small, however, then the posterior distribution will be broader. 7 2 − Category pdf (1 egg) 1 2 1 0 0 7 7 5 5 3 3 1 1 = p4 (x) − p1 (x) Category pdf (4 eggs) p1 (x) p4 (x) a) Model differences 0.1 0 −0.1 −2 0 2 x (size) Human differences 12 8 10 6 4 0.4 0.2 0 −0.2 −0.4 2 12 8 10 6 4 2 −2 0 2 x (size) −2 0 2 x (size) b) Number of eggs (4 eggs) Number of eggs (1 egg) c) −4 −2 0 2 4 (size) Figure 3: (a) Model predictions for Experiment 3. The first two panels show the size distributions inferred for the two conditions, and the final panel shows the difference of these distributions. The difference curve for the model rises to a peak of around 1.6 but has been truncated at 0.1. (b) Human inferences about the total number of eggs in the urn. As predicted, participants in the 4 egg condition believe that the urn contains more eggs. (c) The difference of the size distributions generated by participants in each condition. The central peak is absent but otherwise the curve is qualitatively similar to the model prediction. The categorization model described so far is entirely standard, but note that our experiment considers a case where T , the observed stream of object tokens, is not sufficient to determine m, the number of distinct objects observed. We therefore use the open world model to generate a posterior distribution over m, and compute a marginal distribution over size by integrating out both m and σ 2 : p(x|T ) = p(x|σ 2 )p(σ 2 |m)p(m|T )dσ 2 dm. Figure 3a shows predictions of this “open world + Gaussian” model for the two conditions in our experiment. Note that the difference between the curves for the two conditions has the characteristic Mexican-hat shape produced by a difference of Gaussians. Results. Inferences about the total number of eggs suggested that our manipulation succeeded. Figure 3b indicates that participants in the 4 egg condition believed that they had seen more eggs than participants in the 1 egg condition. Participants in both conditions generated a size distribution for the category of Kwiba eggs, and the difference of these distributions is shown in Figure 3c. Although the magnitude of the differences is small, the shape of the difference curve is consistent with the model predictions. The x = 0 bar is the only case that diverges from the expected Mexican hat shape, and this result is probably due to a ceiling effect—80% of participants in both conditions chose the maximum possible rating for the egg with mean size (size zero), leaving little opportunity for a difference between conditions to emerge. To support the qualitative result in Figure 3c we computed the variance of the curve generated by each individual participant and tested the hypothesis that the variances were greater in the 1 egg condition than in the 4 egg condition. A Mann-Whitney test indicated that this difference was marginally significant (p < 0.1, one-sided). 5 Conclusion Parsing the world into stable and recurring objects is arguably our most basic cognitive achievement [2, 10]. This paper described a simple model of object discovery and identification and evaluated it in three behavioral experiments. Our first experiment confirmed that people rely on prior knowledge when solving identification problems. 
Our second and third experiments explored problems where the identities of many object tokens were never revealed. Despite the resulting uncertainty, we found that participants in these experiments were able to track the number of objects they had seen, to infer the existence of unobserved objects, and to learn and reason about categories. Although the tasks in our experiments were all relatively simple, future work can apply our approach to more realistic settings. For example, a straightforward extension of our model can handle problems where objects vary along multiple perceptual dimensions and where observations are corrupted by perceptual noise. Discovery and identification problems may take several different forms, but probabilistic inference can help to explain how all of these problems are solved. Acknowledgments We thank Bobby Han, Faye Han and Maureen Satyshur for running the experiments. 8 References [1] E. A. Tibbetts and J. Dale. Individual recognition: it is good to be different. Trends in Ecology and Evolution, 22(10):529–237, 2007. [2] W. James. Principles of psychology. Holt, New York, 1890. [3] R. M. Nosofsky. Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115:39–57, 1986. [4] F. Xu and S. Carey. Infants’ metaphysics: the case of numerical identity. Cognitive Psychology, 30:111–153, 1996. [5] L. W. Barsalou, J. Huttenlocher, and K. Lamberts. Basing categorization on individuals and events. Cognitive Psychology, 36:203–272, 1998. [6] L. J. Rips, S. Blok, and G. Newman. Tracing the identity of objects. Psychological Review, 113(1):1–30, 2006. [7] A. McCallum and B. Wellner. Conditional models of identity uncertainty with application to noun coreference. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 905–912. MIT Press, Cambridge, MA, 2005. [8] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 1352–1359, 2005. [9] J. Bunge and M. Fitzpatrick. Estimating the number of species: a review. Journal of the American Statistical Association, 88(421):364–373, 1993. [10] R. G. Millikan. On clear and confused ideas: an essay about substance concepts. Cambridge University Press, New York, 2000. [11] R. N. Shepard. Stimulus and response generalization: a stochastic model relating generalization to distance in psychological space. Psychometrika, 22:325–345, 1957. [12] A. M. Leslie, F. Xu, P. D. Tremoulet, and B. J. Scholl. Indexing and the object concept: developing ‘what’ and ‘where’ systems. Trends in Cognitive Science, 2(1):10–18, 1998. [13] J. D. Nichols. Capture-recapture models. Bioscience, 42(2):94–102, 1992. [14] G. Csibra and A. Volein. Infants can infer the presence of hidden objects from referential gaze information. British Journal of Developmental Psychology, 26:1–11, 2008. [15] H. Jeffreys. Theory of Probability. Oxford University Press, Oxford, 1961. [16] J. R. Anderson. The adaptive nature of human categorization. Psychological Review, 98(3): 409–429, 1991. [17] J. Pitman. Combinatorial stochastic processes, 2002. Notes for Saint Flour Summer School. 9
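Before the next entry in the listing, a brief sketch of the "open world + Gaussian" computation from Experiment 3 above: with the stated inverse-gamma prior and observations of size zero, the predictive distribution over sizes is a Student-t, and the reported size distribution mixes these predictives over the posterior on the number of distinct eggs observed. The fixed weights used for p(m|T) below stand in for that posterior, so the code is an illustration rather than the authors' implementation.

```python
import numpy as np
from scipy import stats

# Inverse-gamma prior on the category variance, as in the text: alpha = 0.5, beta = 2.
ALPHA, BETA = 0.5, 2.0

def size_density(x, p_m):
    """Marginal density over size x, mixing over the number m of distinct
    category members observed (all of size zero) with weights p_m[m].

    With a zero-mean Gaussian likelihood and an inverse-gamma prior, observing
    m examples of size 0 gives an inverse-gamma posterior with shape alpha + m/2
    and unchanged scale beta, and the predictive over x is a Student-t with
    2 * (alpha + m/2) degrees of freedom and scale sqrt(beta / (alpha + m/2)).
    """
    density = np.zeros_like(np.asarray(x, dtype=float))
    for m, w in p_m.items():
        a_n = ALPHA + m / 2.0
        density += w * stats.t.pdf(x, df=2 * a_n, loc=0.0, scale=np.sqrt(BETA / a_n))
    return density

xs = np.linspace(-4, 4, 9)
one_egg = size_density(xs, {1: 1.0})     # all egg tokens judged to be one object
four_egg = size_density(xs, {4: 1.0})    # four distinct eggs judged to have been seen
print(np.round(four_egg - one_egg, 3))   # difference curve, peaked at zero with negative flanks
```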
3 0.80374044 21 nips-2009-Abstraction and Relational learning
Author: Charles Kemp, Alan Jern
Abstract: Most models of categorization learn categories defined by characteristic features but some categories are described more naturally in terms of relations. We present a generative model that helps to explain how relational categories are learned and used. Our model learns abstract schemata that specify the relational similarities shared by instances of a category, and our emphasis on abstraction departs from previous theoretical proposals that focus instead on comparison of concrete instances. Our first experiment suggests that abstraction can help to explain some of the findings that have previously been used to support comparison-based approaches. Our second experiment focuses on one-shot schema learning, a problem that raises challenges for comparison-based approaches but is handled naturally by our abstraction-based account. Categories such as family, sonnet, above, betray, and imitate differ in many respects but all of them depend critically on relational information. Members of a family are typically related by blood or marriage, and the lines that make up a sonnet must rhyme with each other according to a certain pattern. A pair of objects will demonstrate “aboveness” only if a certain spatial relationship is present, and an event will qualify as an instance of betrayal or imitation only if its participants relate to each other in certain ways. All of the cases just described are examples of relational categories. This paper develops a computational approach that helps to explain how simple relational categories are acquired. Our approach highlights the role of abstraction in relational learning. Given several instances of a relational category, it is often possible to infer an abstract representation that captures what the instances have in common. We refer to these abstract representations as schemata, although others may prefer to call them rules or theories. For example, a sonnet schema might specify the number of lines that a sonnet should include and the rhyming pattern that the lines should follow. Once a schema has been acquired it can support several kinds of inferences. A schema can be used to make predictions about hidden aspects of the examples already observed—if the final word in a sonnet is illegible, the rhyming pattern can help to predict the identity of this word. A schema can be used to decide whether new examples (e.g. new poems) qualify as members of the category. Finally, a schema can be used to generate novel examples of a category (e.g. novel sonnets). Most researchers would agree that abstraction plays some role in relational learning, but Gentner [1] and other psychologists have emphasized the role of comparison instead [2, 3]. Given one example of a sonnet and the task of deciding whether a second poem is also a sonnet, a comparison-based approach might attempt to establish an alignment or mapping between the two. Approaches that rely on comparison or mapping are especially prominent in the literature on analogical reasoning [4, 5], and many of these approaches can be viewed as accounts of relational categorization [6]. For example, the problem of deciding whether two systems are analogous can be formalized as the problem of deciding whether these systems are instances of the same relational category. 
Despite some notable exceptions [6, 7], most accounts of analogy focus on comparison rather than abstraction, and suggest that “analogy passes from one instance of a generalization to another without pausing for explicit induction of the generalization” (p 95) [8]. 1 Schema s 0∀Q ∀x ∀y Q(x) < Q(y) ↔ D1 (x) < D1 (y) Group g Observation o Figure 1: A hierarchical generative model for learning and using relational categories. The schema s at the top level is a logical sentence that specifies which groups are valid instances of the category. The group g at the second level is randomly sampled from the set of valid instances, and the observation o is a partially observed version of group g. Researchers that focus on comparison sometimes discuss abstraction, but typically suggest that abstractions emerge as a consequence of comparing two or more concrete instances of a category [3, 5, 9, 10]. This view, however, will not account for one-shot inferences, or inferences based on a single instance of a relational category. Consider a learner who is shown one instance of a sonnet then asked to create a second instance. Since only one instance is provided, it is hard to see how comparisons between instances could account for success on the task. A single instance, however, will sometimes provide enough information for a schema to be learned, and this schema should allow subsequent instances to be generated [11]. Here we develop a formal framework for exploring relational learning in general and one-shot schema learning in particular. Our framework relies on the hierarchical Bayesian approach, which provides a natural way to combine abstraction and probabilistic inference [12]. The hierarchical Bayesian approach supports representations at multiple levels of abstraction, and helps to explains how abstract representations (e.g. a sonnet schema) can be acquired given observations of concrete instances (e.g. individual sonnets). The schemata we consider are represented as sentences in a logical language, and our approach therefore builds on previous probabilistic methods for learning and using logical theories [13, 14]. Following previous authors, we propose that logical representations can help to capture the content of human knowledge, and that Bayesian inference helps to explain how these representations are acquired and how they support inductive inference. The following sections introduce our framework then evaluate it using two behavioral experiments. Our first experiment uses a standard classification task where participants are shown one example of a category then asked to decide which of two alternatives is more likely to belong to the same category. Tasks of this kind have previously been used to argue for the importance of comparison, but we suggest that these tasks can be handled by accounts that focus on abstraction. Our second experiment uses a less standard generation task [15, 16] where participants are shown a single example of a category then asked to generate additional examples. As predicted by our abstraction-based account, we find that people are able to learn relational categories on the basis of a single example. 1 A generative approach to relational learning Our examples so far have used real-world relational categories such as family and sonnet but we now turn to a very simple domain where relational categorization can be studied. 
Each element in the domain is a group of components that vary along a number of dimensions—in Figure 1, the components are figures that vary along the dimensions of size, color, and circle position. The groups can be organized into categories—one such category includes groups where every component is black. Although our domain is rather basic it allows some simple relational regularities to be explored. We can consider categories, for example, where all components in a group must be the same along some dimension, and categories where all components must be different along some dimension. We can also consider categories defined by relationships between dimensions—for example, the category that includes all groups where the size and color dimensions are correlated. Each category is associated with a schema, or an abstract representation that specifies which groups are valid instances of the category. Here we consider schemata that correspond to rules formulated 2 1 2 3 4 5 6 7 ff ˘ ¯ ∀x D (x) =, =, <, > vk ∃xff i ff ˘ ¯ ∀x ∀y x = y → D (x) =, =, <, > Di (y) ∃x ∃y x = y ∧ 8 i9 ˘ ¯ <∧= ˘ ¯ ∀x Di (x) =, = vk ∨ Dj (x) =, = vl : ; ↔ 8 9 0 1 <∧= ˘ ¯ ˘ ¯ ∀x∀y x = y → @Di (x) =, =, <, > Di (y) ∨ Dj (x) =, =, <, > Dj (y)A : ; ↔ ff ff ff ˘ ¯ ∀Q ∀x ∀y x = y → Q(x) =, =, <, > Q(y) ∃Q ∃x ∃y x = y ∧ 8 9 0 1 ff <∧= ˘ ¯ ˘ ¯ ∀Q Q = Di → ∀x∀y x = y → @Q(x) =, =, <, > Q(y) ∨ Di (x) =, =, <, > Di (y)A ∃Q Q = Di ∧ : ; ↔ 8 9 0 1 ff ff <∧= ˘ ¯ ˘ ¯ ∀Q ∀R Q = R → ∀x∀y x = y → @Q(x) =, =, <, > Q(y) ∨ R(x) =, =, <, > R(y)A ∃Q ∃R Q = R ∧ : ; ↔ Table 1: Templates used to construct a hypothesis space of logical schemata. An instance of a given template can be created by choosing an element from each set enclosed in braces (some sets are laid out horizontally to save space), replacing each occurrence of Di or Dj with a dimension (e.g. D1 ) and replacing each occurrence of vk or vl with a value (e.g. 1). in a logical language. The language includes three binary connectives—and (∧), or (∨), and if and only if (↔). Four binary relations (=, =, <, and >) are available for comparing values along dimensions. Universal quantification (∀x) and existential quantification (∃x) are both permitted, and the language includes quantification over objects (∀x) and dimensions (∀Q). For example, the schema in Figure 1 states that all dimensions are aligned. More precisely, if D1 is the dimension of size, the schema states that for all dimensions Q, a component x is smaller than a component y along dimension Q if and only if x is smaller in size than y. It follows that all three dimensions must increase or decrease together. To explain how rules in this logical language are learned we work with the hierarchical generative model in Figure 1. The representation at the top level is a schema s, and we assume that one or more groups g are generated from a distribution P (g|s). Following a standard approach to category learning [17, 18], we assume that g is uniformly sampled from all groups consistent with s: p(g|s) ∝ 1 g is consistent with s 0 otherwise (1) For all applications in this paper, we assume that the number of components in a group is known and fixed in advance. The bottom level of the hierarchy specifies observations o that are generated from a distribution P (o|g). In most cases we assume that g can be directly observed, and that P (o|g) = 1 if o = g and 0 otherwise. We also consider the setting shown in Figure 1 where o is generated by concealing a component of g chosen uniformly at random. 
Note that the observation o in Figure 1 includes only four of the components in group g, and is roughly analogous to our earlier example of a sonnet with an illegible final word. To convert Figure 1 into a fully-specified probabilistic model it remains to define a prior distribution P (s) over schemata. An appealing approach is to consider all of the infinitely many sentences in the logical language already mentioned, and to define a prior favoring schemata which correspond to simple (i.e. short) sentences. We approximate this approach by considering a large but finite space of sentences that includes all instances of the templates in Table 1 and all conjunctions of these instances. When instantiating one of these templates, each occurrence of Di or Dj should be replaced by one of the dimensions in the domain. For example, the schema in Figure 1 is a simplified instance of template 6 where Di is replaced by D1 . Similarly, each instance of vk or vl should be replaced by a value along one of the dimensions. Our first experiment considers a problem where there are are three dimensions and three possible values along each dimension (i.e. vk = 1, 2, or 3). As a result there are 1568 distinct instances of the templates in Table 1 and roughly one million 3 conjunctions of these instances. Our second experiment uses three dimensions with five values along each dimension, which leads to 2768 template instances and roughly three million conjunctions of these instances. The templates in Table 1 capture most of the simple regularities that can be formulated in our logical language. Template 1 generates all rules that include quantification over a single object variable and no binary connectives. Template 3 is similar but includes a single binary connective. Templates 2 and 4 are similar to 1 and 3 respectively, but include two object variables (x and y) rather than one. Templates 5, 6 and 7 add quantification over dimensions to Templates 2 and 4. Although the templates in Table 1 capture a large class of regularities, several kinds of templates are not included. Since we do not assume that the dimensions are commensurable, values along different dimensions cannot be directly compared (∃x D1 (x) = D2 (x) is not permitted. For the same reason, comparisons to a dimension value must involve a concrete dimension (∀x D1 (x) = 1 is permitted) rather than a dimension variable (∀Q ∀x Q(x) = 1 is not permitted). Finally, we exclude all schemata where quantification over objects precedes quantification over dimensions, and as a result there are some simple schemata that our implementation cannot learn (e.g. ∃x∀y∃Q Q(x) = Q(y)). The extension of each schema is a set of groups, and schemata with the same extension can be assigned to the same equivalence class. For example, ∀x D1 (x) = v1 (an instance of template 1) and ∀x D1 (x) = v1 ∧ D1 (x) = v1 (an instance of template 3) end up in the same equivalence class. Each equivalence class can be represented by the shortest sentence that it contains, and we define our prior P (s) over a set that includes a single representative for each equivalence class. The prior probability P (s) of each sentence is inversely proportional to its length: P (s) ∝ λ|s| , where |s| is the length of schema s and λ is a constant between 0 and 1. For all applications in this paper we set λ = 0.8. 
The generative model in Figure 1 can be used for several purposes, including schema learning (inferring a schema s given one or more instances generated from the schema), classification (deciding whether group gnew belongs to a category given one or more instances of the category) and generation (generating a group gnew that belongs to the same category as one or more instances). Our first experiment explores all three of these problems. 2 Experiment 1: Relational classification Our first experiment is organized around a triad task where participants are shown one example of a category then asked to decide which of two choice examples is more likely to belong to the category. Triad tasks are regularly used by studies of relational categorization, and have been used to argue for the importance of comparison [1]. A comparison-based approach to this task, for instance, might compare the example object to each of the choice objects in order to decide which is the better match. Our first experiment is intended in part to explore whether a schema-learning approach can also account for inferences about triad tasks. Materials and Method. 18 adults participated for course credit and interacted with a custom-built computer interface. The stimuli were groups of figures that varied along three dimensions (color, size, and ball position, as in Figure 1). Each shape was displayed on a single card, and all groups in Experiment 1 included exactly three cards. The cards in Figure 1 show five different values along each dimension, but Experiment 1 used only three values along each dimension. The experiment included inferences about 10 triads. Participants were told that aliens from a certain planet “enjoy organizing cards into groups,” and that “any group of cards will probably be liked by some aliens and disliked by others.” The ten triad tasks were framed as questions about the preferences of 10 aliens. Participants were shown a group that Mr X likes (different names were used for the ten triads), then shown two choice groups and told that “Mr X likes one of these groups but not the other.” Participants were asked to select one of the choice groups, then asked to generate another 3-card group that Mr X would probably like. Cards could be added to the screen using an “Add Card” button, and there were three pairs of buttons that allowed each card to be increased or decreased along the three dimensions. Finally, participants were asked to explain in writing “what kind of groups Mr X likes.” The ten triads used are shown in Figure 2. Each group is represented as a 3 by 3 matrix where rows represent cards and columns show values along the three dimensions. 
Triad 1, for example, has an example group including three cards that each take value 3 along D1. The first choice group is consistent with this regularity but the second choice group is not. The cards in each group were arrayed vertically on screen, and were initially sorted as shown in Figure 2 (i.e. first by D3, then by D2 and then by D1). The cards could be dragged around on screen, and participants were invited to move them around in order to help them understand each group. The mapping between the three dimensions in each matrix and the three dimensions in the experiment (color, position, and size) was randomized across participants, and the order in which triads were presented was also randomized.

[Figure 2: Human responses and model predictions for the ten triads in Experiment 1. The plot at the left of each panel shows model predictions (white bars) and human preferences (black bars) for the two choice groups in each triad. The plots at the right of each panel summarize the groups created during the generation phase; the 23 elements along the x-axis correspond to the regularities listed in Table 2. Panel titles: (a) D1 value always 3; (b) D2 uniform; (c) D2 and D3 aligned; (d) D1 and D3 anti-aligned; (e) Two dimensions aligned; (f) Two dimensions anti-aligned; (g) All dimensions uniform; (h) Some dimension uniform; (i) All dimensions have no repeats; (j) Some dimension has no repeats.]

Table 2: Regularities used to code responses to the generation tasks in Experiments 1 and 2.
1. All dimensions aligned
2. Two dimensions aligned
3. D1 and D2 aligned
4. D1 and D3 aligned
5. D2 and D3 aligned
6. All dimensions aligned or anti-aligned
7. Two dimensions anti-aligned
8. D1 and D2 anti-aligned
9. D1 and D3 anti-aligned
10. D2 and D3 anti-aligned
11. All dimensions have no repeats
12. Two dimensions have no repeats
13. One dimension has no repeats
14. D1 has no repeats
15. D2 has no repeats
16. D3 has no repeats
17. All dimensions uniform
18. Two dimensions uniform
19. One dimension uniform
20. D1 uniform
21. D2 uniform
22. D3 uniform
23. D1 value is always 3

Model predictions and results. Let ge be the example group presented in the triad task and g1 and g2 be the two choice groups. We use our model to compute the relative probability of two hypotheses: h1, which states that ge and g1 are generated from the same schema and that g2 is sampled randomly from all possible groups, and h2, which states that ge and g2 are generated from the same schema. We set P(h1) = P(h2) = 0.5, and compute posterior probabilities P(h1 | ge, g1, g2) and P(h2 | ge, g1, g2) by integrating over all schemata in the hypothesis space already described.
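A toy version of this computation is sketched below. The tiny schema space, its length-based prior, and the assumption that P(g | s) is uniform over a schema's extension are simplifying stand-ins for the full hypothesis space described above, so the numbers are purely illustrative.

# Toy triad-choice computation: which choice group more plausibly shares a
# schema with the example group? All model pieces here are simplified stand-ins.
from itertools import product

VALUES, DIMS, CARDS = (1, 2, 3), 3, 3
GROUPS = [g for g in product(product(VALUES, repeat=DIMS), repeat=CARDS)]

SCHEMATA = [  # (length |s|, predicate over a group)
    (4, lambda g: all(c[0] == 3 for c in g)),          # forall x D1(x)=3
    (5, lambda g: len({c[0] for c in g}) == 1),         # D1 uniform
    (5, lambda g: len({c[0] for c in g}) == len(g)),    # D1 has no repeats
]
lam = 0.8
prior = [lam ** length for length, _ in SCHEMATA]
prior = [p / sum(prior) for p in prior]

def p_given_schema(g, pred):
    # Assumed likelihood: uniform over the schema's extension.
    ext = [h for h in GROUPS if pred(h)]
    return 1.0 / len(ext) if pred(g) else 0.0

def data_prob(example, same):
    """P(data | h): example and `same` share a schema; the remaining
    choice group is sampled uniformly at random from all groups."""
    shared = sum(p * p_given_schema(example, pred) * p_given_schema(same, pred)
                 for p, (_, pred) in zip(prior, SCHEMATA))
    return shared / len(GROUPS)

ge = ((3, 2, 1), (3, 3, 2), (3, 1, 3))    # all cards take value 3 on D1
g1 = ((3, 1, 1), (3, 2, 2), (3, 3, 3))    # consistent with that regularity
g2 = ((1, 1, 1), (2, 2, 2), (3, 3, 3))    # not consistent
s1, s2 = data_prob(ge, g1), data_prob(ge, g2)
print(s1 / (s1 + s2), s2 / (s1 + s2))     # posterior over h1 vs h2 (equal priors)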
Our model assumes that two groups are considered similar to the extent that they appear to have been generated by the same underlying schema, and is consistent with the generative approach to similarity described by Kemp et al. [19]. Model predictions for the ten triads are shown in Figure 2. In each case, the choice probabilities plotted (white bars) are the posterior probabilities of hypotheses h1 and h2. In nine out of ten cases the best choice according to the model is the most common human response. Responses to triads 2c and 2d support the idea that people are sensitive to relationships between dimensions (i.e. alignment and anti-alignment). Triads 2e and 2f are similar to triads studied by Kotovsky and Gentner [1], and we replicate their finding that people are sensitive to relationships between dimensions even when the dimensions involved vary from group to group. The one case where human responses diverge from model predictions is shown in Figure 2h. Note that the schema for this triad involves existential quantification over dimensions (some dimension is uniform), and according to our prior P(s) this kind of quantification is no more complex than other kinds of quantification. Future applications of our approach can explore the idea that existential quantification over dimensions (∃Q) is psychologically more complex than universal quantification over dimensions (∀Q) or existential quantification over cards (∃x), and can consider logical languages that incorporate this inductive bias.

To model the generation phase of the experiment we computed the posterior distribution

P(gnew | ge, g1, g2) = Σ_{s,h} P(gnew | s) P(s | h, ge, g1, g2) P(h | ge, g1, g2),

where P(h | ge, g1, g2) is the distribution used to model selections in the triad task. Since the space of possible groups is large, we visualize this distribution using a profile that shows the posterior probability assigned to groups consistent with the 23 regularities shown in Table 2. The white bar plots in Figure 2 show profiles predicted by the model, and the black plots immediately above show profiles computed over the groups generated by our 18 participants. In many of the 10 cases the model accurately predicts regularities in the groups generated by people. In case 2c, for example, the model correctly predicts that generated groups will tend to have no repeats along dimensions D2 and D3 (regularities 15 and 16) and that these two dimensions will be aligned (regularities 2 and 5). There are, however, some departures from the model's predictions, and a notable example occurs in case 2d. Here the model detects the regularity that dimensions D1 and D3 are anti-aligned (regularity 9). Some groups generated by participants are consistent with this regularity, but people also regularly generate groups where two dimensions are aligned rather than anti-aligned (regularity 2). This result may indicate that some participants are sensitive to relationships between dimensions but do not consider the difference between a positive relationship (alignment) and an inverse relationship (anti-alignment) especially important.

[Figure 3: Human responses and model predictions for the six cases in Experiment 2. Panel titles: (a) All dimensions aligned; (b) D2 and D3 aligned; (c) D1 has no repeats, D2 and D3 uniform; (d) D2 uniform; (e) All dimensions uniform; (f) All dimensions have no repeats. In (a) and (b), the 4 cards used for the completion and generation phases are shown on either side of the dashed line (completion cards on the left); in the remaining cases, the same 4 cards were used for both phases. The plots at the right of each panel show model predictions (white bars) and human responses (black bars) for the generation task, and the 23 elements along each x-axis correspond to the regularities listed in Table 2. The remaining plots show responses to the completion task: there are 125 possible responses, and the four responses shown always include the top two human responses and the top two model predictions.]

Kotovsky and Gentner [1] suggest that comparison can explain how people respond to triad tasks, although they do not provide a computational model that can be compared with our approach. It is less clear how comparison might account for our generation data, and our next experiment considers a one-shot generation task that raises even greater challenges for a comparison-based approach.

3 Experiment 2: One-shot schema learning

As described already, comparison involves constructing mappings between pairs of category instances. In some settings, however, learners make confident inferences given a single instance of a category [15, 20], and it is difficult to see how comparison could play a major role when only one instance is available. Models that rely on abstraction, however, can naturally account for one-shot relational learning, and we designed a second experiment to evaluate this aspect of our approach. Several previous studies have explored one-shot relational learning. Holyoak and Thagard [21] developed a study of analogical reasoning using stories as stimuli and found little evidence of one-shot schema learning. Ahn et al. [11] demonstrated, however, that one-shot learning can be achieved with complex materials such as stories, and modeled this result using explanation-based learning. Here we use much simpler stimuli and explore a probabilistic approach to one-shot learning.

Materials and Method. 18 adults participated for course credit. The same individuals completed Experiments 1 and 2, and Experiment 2 was always run before Experiment 1. The same computer interface was used in both experiments, and the only important difference was that the figures in Experiment 2 could now take five values along each dimension rather than three. The experiment included two phases. During the generation phase, participants saw a 4-card group that Mr X liked and were asked to generate two 5-card groups that Mr X would probably like. During the completion phase, participants were shown four members of a 5-card group and were asked to generate the missing card. The stimuli used in each phase are shown in Figure 3. In the first two cases, slightly different stimuli were used in the generation and completion phases, and in all remaining cases the same set of four cards was used in both phases. All participants responded to the six generation questions before answering the six completion questions.

Model predictions and results.
The generation phase is modeled as in Experiment 1, but now the posterior distribution P(gnew | ge) is computed after observing a single instance of a category. The human responses in Figure 3 (black bars) are consistent with the model in all cases, and confirm that a single example can provide sufficient evidence for learners to acquire a relational category. For example, the most common response in case 3a was the 5-card group shown in Figure 1, a group with all three dimensions aligned. To model the completion phase, let oe represent a partial observation of group ge. Our model infers which card is missing from ge by computing the posterior distribution

P(ge | oe) ∝ P(oe | ge) Σ_s P(ge | s) P(s),

where P(oe | ge) captures the idea that oe is generated by randomly concealing one component of ge. The white bars in Figure 3 show model predictions, and in five out of six cases the best response according to the model is the same as the most common human response. In the remaining case (Figure 3d) the model generates a diffuse distribution over all cards with value 3 on dimension 2, and all human responses satisfy this regularity.

4 Conclusion

We presented a generative model that helps to explain how relational categories are learned and used. Our approach captures relational regularities using a logical language, and helps to explain how schemata formulated in this language can be learned from observed data. Our approach differs in several respects from previous accounts of relational categorization [1, 5, 10, 22]. First, we focus on abstraction rather than comparison. Second, we consider tasks where participants must generate examples of categories [16] rather than simply classify existing examples. Finally, we provide a formal account that helps to explain how relational categories can be learned from a single instance. Our approach can be developed and extended in several ways. For simplicity, we implemented our model by working with a finite space of several million schemata, but future work can consider hypothesis spaces that assign non-zero probability to all regularities that can be formulated in the language we described. The specific logical language used here is only a starting point, and future work can aim to develop languages that provide a more faithful account of human inductive biases. Finally, we worked with a domain that provides one of the simplest ways to address core questions such as one-shot learning. Future applications of our general approach can consider domains that include more than three dimensions and a richer space of relational regularities. Relational learning and analogical reasoning are tightly linked, and hierarchical generative models provide a promising approach to both problems. We focused here on relational categorization, but future studies can explore whether probabilistic accounts of schema learning can help to explain the inductive inferences typically considered by studies of analogical reasoning. Although there are many models of analogical reasoning, there are few that pursue a principled probabilistic approach, and the hierarchical Bayesian approach may help to fill this gap in the literature.

Acknowledgments. We thank Maureen Satyshur for running the experiments. This work was supported in part by NSF grant CDI-0835797.

References [1] L. Kotovsky and D. Gentner. Comparison and categorization in the development of relational similarity. Child Development, 67:2797–2822, 1996. [2] D. Gentner and A. B. Markman.
Structure mapping in analogy and similarity. American Psychologist, 52:45–56, 1997. [3] D. Gentner and J. Medina. Similarity and the development of rules. Cognition, 65:263–297, 1998. [4] B. Falkenhainer, K. D. Forbus, and D. Gentner. The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41:1–63, 1989. [5] J. E. Hummel and K. J. Holyoak. A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110:220–264, 2003. [6] M. Mitchell. Analogy-making as perception: a computer model. MIT Press, Cambridge, MA, 1993. [7] D. R. Hofstadter and the Fluid Analogies Research Group. Fluid concepts and creative analogies: computer models of the fundamental mechanisms of thought. 1995. [8] W. V. O. Quine and J. Ullian. The Web of Belief. Random House, New York, 1978. [9] J. Skorstad, D. Gentner, and D. Medin. Abstraction processes during concept learning: a structural view. In Proceedings of the 10th Annual Conference of the Cognitive Science Society, pages 419–425. 2009. [10] D. Gentner and J. Loewenstein. Relational language and relational thought. In E. Amsel and J. P. Byrnes, editors, Language, literacy and cognitive development: the development and consequences of symbolic communication, pages 87–120. 2002. [11] W. Ahn, W. F. Brewer, and R. J. Mooney. Schema acquisition from a single example. Journal of Experimental Psychology: Learning, Memory and Cognition, 18(2):391–412, 1992. [12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis. Chapman & Hall, New York, 2nd edition, 2003. [13] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008. [14] S. Kok and P. Domingos. Learning the structure of Markov logic networks. In Proceedings of the 22nd International Conference on Machine Learning, 2005. [15] J. Feldman. The structure of perceptual categories. Journal of Mathematical Psychology, 41: 145–170, 1997. [16] A. Jern and C. Kemp. Category generation. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 130–135. Cognitive Science Society, Austin, TX, 2009. [17] D. Conklin and I. H. Witten. Complexity-based induction. Machine Learning, 16(3):203–225, 1994. [18] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001. [19] C. Kemp, A. Bernstein, and J. B. Tenenbaum. A generative theory of similarity. In B. G. Bara, L. Barsalou, and M. Bucciarelli, editors, Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1132–1137. Lawrence Erlbaum Associates, 2005. [20] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Theory acquisition and the language of thought. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pages 1606–1611. Cognitive Science Society, Austin, TX, 2008. [21] K. J. Holyoak and P. Thagard. Analogical mapping by constraint satisfaction. Cognitive Science, 13(3):295–355, 1989. [22] L. A. A. Doumas, J. E. Hummel, and C. M. Sandhofer. A theory of the discovery and predication of relational concepts. Psychological Review, 115(1):1–43, 2008. [23] M. L. Gick and K. J. Holyoak. Schema induction and analogical transfer. Cognitive Psychology, 15:1–38, 1983. 9
4 0.70119262 152 nips-2009-Measuring model complexity with the prior predictive
Author: Wolf Vanpaemel
Abstract: In the last few decades, model complexity has received a lot of press. While many methods have been proposed that jointly measure a model's descriptive adequacy and its complexity, few measures exist that measure complexity in itself. Moreover, existing measures ignore the parameter prior, which is an inherent part of the model and affects the complexity. This paper presents a stand-alone measure for model complexity that takes the number of parameters, the functional form, the range of the parameters and the parameter prior into account. This Prior Predictive Complexity (PPC) is an intuitive and easy to compute measure. It starts from the observation that model complexity is the property of the model that enables it to fit a wide range of outcomes. The PPC then measures how wide this range exactly is. keywords: Model Selection & Structure Learning; Model Comparison Methods; Perception

∗ I am grateful to Michael Lee and Liz Bonawitz.

The recent revolution in model selection methods in the cognitive sciences was driven to a large extent by the observation that computational models can differ in their complexity. Differences in complexity put models on unequal footing when their ability to approximate empirical data is assessed. Therefore, models should be penalized for their complexity when their adequacy is measured. The balance between descriptive adequacy and complexity has been termed generalizability [1, 2]. Much attention has been devoted to developing, advocating, and comparing different measures of generalizability (for a recent overview, see [3]). In contrast, measures of complexity have received relatively little attention. The aim of the current paper is to propose and illustrate a stand-alone measure of model complexity, called the Prior Predictive Complexity (PPC). The PPC is based on the intuitive idea that a complex model can predict many outcomes and a simple model can predict a few outcomes only. First, I discuss existing approaches to measuring model complexity and note some of their limitations. In particular, I argue that currently existing measures ignore one important aspect of a model: the prior distribution it assumes over the parameters. I then introduce the PPC, which, unlike the existing measures, is sensitive to the parameter prior. Next, the PPC is illustrated by calculating the complexities of two popular models of information integration.

1 Previous approaches to measuring model complexity

A first approach to assess the (relative) complexity of models relies on simulated data. Simulation-based methods differ in how these artificial data are generated. A first, atheoretical approach uses random data [4, 5]. In the semi-theoretical approach, the data are generated from some theoretically interesting functions, such as the exponential or the logistic function [4]. Using these approaches, the models under consideration are equally complex if each model provides the best optimal fit to roughly the same number of data sets. A final approach to generating artificial data is a theoretical one, in which the data are generated from the models of interest themselves [6, 7]. The parameter sets used in the generation can either be hand-picked by the researcher, estimated from empirical data or drawn from a previously specified distribution. If the models under consideration are equally complex, each model should provide the best optimal fit to self-generated data more often than the other models under consideration do.
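As a concrete illustration of this theoretical simulation-based approach, the sketch below generates artificial data from each of two toy models, fits both models to every data set by grid-search maximum likelihood, and counts how often each model gives the better fit. The two retention-style models, the parameter grid, and the design (five retention intervals, 50 trials each) are assumptions made purely for illustration; they are not models or settings taken from the paper.

# Theoretical simulation-based complexity comparison: generate from each model,
# fit both, count wins. Toy models and design are illustrative assumptions.
import math, random

T = [1, 2, 4, 8, 16]                      # retention intervals
N = 50                                    # trials per interval
GRID = [i / 200 for i in range(1, 200)]   # parameter grid for ML fitting

def p_exp(a, t): return math.exp(-a * t)
def p_pow(a, t): return (1 + t) ** (-a * 4)   # rescaled so the two ranges overlap

def loglik(pred, counts):
    # Binomial log likelihood of the observed counts under prediction pred(t).
    return sum(k * math.log(p) + (N - k) * math.log(1 - p)
               for p, k in ((max(min(pred(t), 1 - 1e-9), 1e-9), k)
                            for t, k in zip(T, counts)))

def best_fit(model, counts):
    return max(loglik(lambda t: model(a, t), counts) for a in GRID)

def simulate(generating, n_sets=200):
    wins = {"exp": 0, "pow": 0}
    for _ in range(n_sets):
        a = random.choice(GRID)           # parameter drawn from a specified set
        counts = [sum(random.random() < generating(a, t) for _ in range(N)) for t in T]
        better = "exp" if best_fit(p_exp, counts) >= best_fit(p_pow, counts) else "pow"
        wins[better] += 1
    return wins

random.seed(0)
print("data from exponential:", simulate(p_exp))
print("data from power:      ", simulate(p_pow))
# Roughly equal win counts suggest roughly equal complexity; a model that also
# wins on the other model's data is the more flexible (complex) one.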
One problem with this simulation-based approach is that it is very labor intensive. It requires generating a large number of artificial data sets, and fitting the models to all these data sets. Further, it relies on choices that are often made in an arbitrary fashion that nonetheless bias the results. For example, in the semi-theoretical approach, a crucial choice is which functions to use. Similarly, in the theoretical approach, results are heavily influenced by the parameter values used in generating the data. If they are fixed, on what basis? If they are estimated from empirical data, from which data? If they are drawn randomly, from which distribution? Further, a simulation study only gives a rough idea of complexity differences but provides no direct measure reflecting the complexity.

A number of proposals have been made to measure model complexity more directly. Consider a model M with k parameters, summarized in the parameter vector θ = (θ1, θ2, ..., θk), which has a range indicated by Ω. Let d denote the data and p(d|θ, M) the likelihood. The most straightforward measure of model complexity is the parametric complexity (PC), which simply counts the number of parameters:

PC = k. (1)

PC is attractive as a measure of model complexity since it is very easy to calculate. Further, it has a direct and well understood relation to complexity: the more parameters, the more complex the model. It is included as the complexity term of several generalizability measures such as AIC [8] and BIC [9], and it is at the heart of the Likelihood Ratio Test. Despite this intuitive appeal, PC is not free from problems. One problem with PC is that it reflects only a single aspect of complexity. Also the parameter range and the functional form (the way the parameters are combined in the model equation) influence a model's complexity, but these dimensions of complexity are ignored in PC [2, 6]. A complexity measure that takes these three dimensions into account is provided by the geometric complexity (GC) measure, which is inspired by differential geometry [10]. In GC, complexity is conceptualized as the number of distinguishable probability distributions a model can generate. It is defined by

GC = (k/2) ln(n/(2π)) + ln ∫_Ω √(det I(θ|M)) dθ, (2)

where n indicates the size of the data sample and I(θ) is the Fisher Information Matrix:

I_ij(θ|M) = −E_θ [ ∂² ln p(d|θ, M) / (∂θ_i ∂θ_j) ]. (3)

Note that I(θ|M) is determined by the likelihood function p(d|θ, M), which is in turn determined by the model equation. Hence GC is sensitive to the number of parameters (through k), the functional form (through I), and the range (through Ω). Quite surprisingly, GC turns out to be equal to the complexity term used in one version of Minimum Description Length (MDL), a measure of generalizability developed within the domain of information theory [2, 11, 12, 13]. GC contrasts favorably with PC, in the sense that it takes three dimensions of complexity into account rather than a single one. A major drawback of GC is that, unlike PC, it requires considerable technical sophistication to be computed, as it relies on the second derivative of the likelihood. A more important limitation of both PC and GC is that these measures are insensitive to yet another important dimension contributing to model complexity: the prior distribution over the model parameters. The relation between the parameter prior distribution and model complexity is discussed next.
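Before turning to the role of the prior, the sketch below evaluates Equation 2 numerically for the simplest possible case, a one-parameter coin model with p_h = θ on 0 ≤ θ ≤ 1, under the usual convention that the Fisher information inside the integral is the per-observation information. The choice of model, the sample sizes, and the quadrature scheme are illustrative assumptions; the closed-form value (the integral equals π for this model) is printed only as a sanity check.

# Numerical sketch of geometric complexity (Equation 2) for the coin model
# p_h = theta, 0 <= theta <= 1, where I(theta) = 1 / (theta * (1 - theta)).
import math

def fisher_info(theta):
    return 1.0 / (theta * (1.0 - theta))

def geometric_complexity(n, k=1, lo=1e-6, hi=1 - 1e-6, steps=200000):
    # Midpoint rule for the integral of sqrt(det I(theta)) over the range.
    width = (hi - lo) / steps
    integral = sum(math.sqrt(fisher_info(lo + (i + 0.5) * width)) * width
                   for i in range(steps))
    return 0.5 * k * math.log(n / (2 * math.pi)) + math.log(integral)

for n in (20, 60, 150):
    print(n, round(geometric_complexity(n), 3),
          "closed form:", round(0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi), 3))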
2 Model complexity and the parameter prior

The growing popularity of Bayesian methods in psychology has not only raised awareness that model complexity should be taken into account when testing models [6], it has also drawn attention to the fact that in many occasions, relevant prior information is available [14]. In Bayesian methods, there is room to incorporate this information in two different flavors: as a prior distribution over the models, or as a prior distribution over the parameters. Specifying a model prior is a daunting task, so almost invariably, the model prior is taken to be uniform (but see [15] for an exception). In contrast, information regarding the parameter is much easier to include, although still challenging (e.g., [16]). There are two ways to formalize prior information about a model's parameters: using the parameter prior range (often referred to as simply the range) and using the parameter prior distribution (often referred to as simply the prior). The prior range indicates which parameter values are allowed and which are forbidden. The prior distribution indicates which parameter values are likely and which are unlikely. Models that share the same equation and the same range but differ in the prior distribution can be considered different models (or at least different model versions), just like models that share the same equation but differ in range are different model versions. Like the parameter prior range, the parameter prior distribution influences the model complexity. In general, a model with a vague parameter prior distribution is more complex than a model with a sharply peaked parameter prior distribution, much as a model with a broad-ranged parameter is more complex than the same model where the parameter is heavily restricted.

To drive home the point that the parameter prior should be considered when model complexity is assessed, consider the following "fair coin" model Mf and a "biased coin" model Mb. There is a clear intuitive complexity difference between these models: Mb is more complex than Mf. The most straightforward way to formalize these models is as follows, where ph denotes the probability of observing heads:

ph = 1/2 (4)

for model Mf, while the triplet of equations

ph = θ, 0 ≤ θ ≤ 1, p(θ) = 1 (5)

jointly defines model Mb. The range forbids values smaller than 0 or greater than 1 because ph is a proportion. As Mf and Mb have a different number of parameters, both PC and GC, being sensitive to the number of parameters, pick up the difference in model complexity between the models. Alternatively, model Mf could be defined as follows:

ph = θ, 0 ≤ θ ≤ 1, p(θ) = δ(θ − 1/2), (6)

where δ(x) is the Dirac delta. Note that the model formalized in Equation 6 is exactly identical to the model formalized in Equation 4. However, relying on the formulation of model Mf in Equation 6, PC and GC now judge Mf and Mb to be equally complex: both models share the same model equation (which implies they have the same number of parameters and the same functional form) and the same range for the parameter. Hence, PC and GC make an incorrect judgement of the complexity difference between both models. This misjudgement is a direct result of the insensitivity of these measures to the parameter prior. As models Mf and Mb have different prior distributions over their parameter, a measure sensitive to the prior would pick up the complexity difference between these models. Such a measure is introduced next.
3 The Prior Predictive Complexity

Model complexity refers to the property of the model that enables it to predict a wide range of data patterns [2]. The idea of the PPC is to measure how wide this range exactly is. A complex model can predict many outcomes, and a simple model can predict a few outcomes only. Model simplicity, then, refers to the property of placing restrictions on the possible outcomes: the greater the restrictions, the greater the simplicity. To understand how model complexity is measured in the PPC, it is useful to think about the universal interval (UI) and the predicted interval (PI). The universal interval is the range of outcomes that could potentially be observed, irrespective of any model. For example, in an experiment with n binomial trials, it is impossible to observe less than zero successes, or more than n successes, so the range of possible outcomes is [0, n]. Similarly, the universal interval for a proportion is [0, 1]. The predicted interval is the interval containing all outcomes the model predicts. An intuitive way to gauge model complexity is then the cardinality of the predicted interval, relative to the cardinality of the universal interval, averaged over all m conditions or stimuli:

PPC = (1/m) Σ_{i=1}^{m} |PI_i| / |UI_i|. (7)

A key aspect of the PPC is deriving the predicted interval. For a parameterized likelihood-based model, prediction takes the form of a distribution over all possible outcomes for some future, yet-to-be-observed data d under some model M. This distribution is called the prior predictive distribution (ppd) and can be calculated using the law of total probability:

p(d|M) = ∫_Ω p(d|θ, M) p(θ|M) dθ. (8)

Predicting the probability of unseen future data d arising under the assumption that model M is true involves integrating the probability of the data for each of the possible parameter values, p(d|θ, M), as weighted by the prior probability of each of these values, p(θ|M). Note that the ppd relies on the number of parameters (through the number of integrals and the likelihood), the model equation (through the likelihood), and the parameter range (through Ω). Therefore, like GC, the PPC is sensitive to all these aspects. In contrast to GC, however, the ppd, and hence the PPC, also relies on the parameter prior. Since predictions are made probabilistically, virtually all outcomes will be assigned some prior weight. This implies that, in principle, the predicted interval equals the universal interval. However, for some outcomes the assigned weight will be extremely small. Therefore, it seems reasonable to restrict the predicted interval to the smallest interval that includes some predetermined amount of the prior mass. For example, the 95% predictive interval is defined by those outcomes with the highest prior mass that together make up 95% of the prior mass. Analytical solutions to the integral defining the ppd are rarely available. Instead, one should rely on approximations to the ppd by drawing samples from it. In the current study, sampling was performed using WinBUGS [17, 18], a highly versatile, user-friendly, and freely available software package. It contains sophisticated and relatively general-purpose Markov Chain Monte Carlo (MCMC) algorithms to sample from any distribution of interest.
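The following sketch approximates the PPC of the two coin models from the previous section by sampling from their prior predictive distributions. It uses plain Monte Carlo rather than WinBUGS, a 95% predicted set rather than the 99% intervals reported later, and arbitrary sample sizes and draw counts, so it is only an illustration of the recipe, not a reproduction of the paper's computations.

# PPC (Equation 7) by prior predictive sampling for a single condition of
# n binomial trials, so the universal interval is the set of counts {0, ..., n}.
import random

def predicted_set(sample_theta, n, draws=20000, mass=0.95):
    """Smallest set of outcome counts covering `mass` of the prior predictive."""
    counts = [0] * (n + 1)
    for _ in range(draws):
        theta = sample_theta()
        k = sum(random.random() < theta for _ in range(n))
        counts[k] += 1
    ranked = sorted(range(n + 1), key=lambda k: counts[k], reverse=True)
    covered, kept = 0, []
    for k in ranked:
        kept.append(k)
        covered += counts[k]
        if covered >= mass * draws:
            break
    return kept

def ppc(sample_theta, n):
    return len(predicted_set(sample_theta, n)) / (n + 1)   # |PI| / |UI|

random.seed(1)
n = 20
print("PPC of Mf (p_h = 1/2):    ", ppc(lambda: 0.5, n))
print("PPC of Mb (p_h ~ Uniform):", ppc(lambda: random.random(), n))
# Mb predicts almost every count and so gets a PPC near 1; Mf concentrates its
# predictions near n/2 and gets a much smaller PPC, matching the intuition that
# the biased-coin model is the more complex of the two.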
4 An application example

The PPC is illustrated by comparing the complexity of two popular models of information integration, which attempt to account for how people merge potentially ambiguous or conflicting information from various sensorial sources to create subjective experience. These models either assume that the sources of information are combined additively (the Linear Integration Model; LIM; [19]) or multiplicatively (the Fuzzy Logical Model of Perception; FLMP; [20, 21]).

4.1 Information integration tasks

A typical information integration task exposes participants simultaneously to different sources of information and requires this combined experience to be identified in a forced-choice identification task. The presented stimuli are generated from a factorial manipulation of the sources of information by systematically varying the ambiguity of each of the sources. The relevant empirical data consist of, for each of the presented stimuli, the counts km of the number of times the mth stimulus was identified as one of the response alternatives, out of the tm trials on which it was presented. For example, an experiment in phonemic identification could involve two phonemes to be identified, /ba/ and /da/, and two sources of information, auditory and visual. Stimuli are created by crossing different levels of audible speech, varying between /ba/ and /da/, with different levels of visible speech, also varying between these alternatives. The resulting set of stimuli spans a continuum between the two syllables. The participant is then asked to listen and to watch the speaker, and based on this combined audiovisual experience, to identify the syllable as being either /ba/ or /da/. In the so-called expanded factorial design, not only bimodal stimuli (containing both auditory and visual information) but also unimodal stimuli (providing only a single source of information) are presented.

4.2 Information integration models

In what follows, the formal description of the LIM and the FLMP is outlined for a design with two response alternatives (/da/ or /ba/) and two sources (auditory and visual), with I and J levels, respectively. In such a two-choice identification task, the counts km follow a Binomial distribution:

km ∼ Binomial(pm, tm), (9)

where pm indicates the probability that the mth stimulus is identified as /da/.

4.2.1 Model equation

The probability for the stimulus constructed with the ith level of the first source and the jth level of the second being identified as /da/ is computed according to the choice rule:

pij = s(ij, /da/) / [s(ij, /da/) + s(ij, /ba/)], (10)

where s(ij, /da/) represents the overall degree of support for the stimulus to be /da/. The sources of information are assumed to be evaluated independently, implying that different parameters are used for the different modalities. In the present example, the degree of auditory support for /da/ is denoted by ai (i = 1, ..., I) and the degree of visual support for /da/ by bj (j = 1, ..., J). When a unimodal stimulus is presented, the overall degree of support for each alternative is given by s(i∗, /da/) = ai and s(∗j, /da/) = bj, where the asterisk (∗) indicates the absence of information, implying that Equation 10 reduces to

pi∗ = ai and p∗j = bj. (11)

When a bimodal stimulus is presented, the overall degree of support for each alternative is based on the integration or blending of both these sources. Hence, for bimodal stimuli, s(ij, /da/) = ai ∘ bj, where the operator ∘ denotes the combination of both sources.
Hence, Equation 10 reduces to

pij = (ai ∘ bj) / [(ai ∘ bj) + ((1 − ai) ∘ (1 − bj))]. (12)

The LIM assumes an additive combination, i.e., ∘ = +, so Equation 12 becomes

pij = (ai + bj) / 2. (13)

The FLMP, in contrast, assumes a multiplicative combination, i.e., ∘ = ×, so Equation 12 becomes

pij = ai bj / [ai bj + (1 − ai)(1 − bj)]. (14)

4.2.2 Parameter prior range and distribution

Each level of auditory and visual support for /da/ (i.e., ai and bj, respectively) is associated with a free parameter, which implies that the FLMP and the LIM have an equal number of free parameters, I + J. Each of these parameters is constrained to satisfy 0 ≤ ai, bj ≤ 1. The original formulations of the LIM and FLMP unfortunately left the parameter priors unspecified. However, an implicit assumption that has been commonly used is a uniform prior for each of the parameters. This assumption implicitly underlies classical and widely adopted methods for model evaluation using accounted percentage of variance or maximum likelihood:

ai ∼ Uniform(0, 1) for i = 1, ..., I and bj ∼ Uniform(0, 1) for j = 1, ..., J. (15)

The models relying on this set of uniform priors will be referred to as LIMu and FLMPu. Note that LIMu and FLMPu treat the different parameters as independent. This approach misses important information. In particular, the experimental design is such that the amount of support for each level i + 1 is always higher than for level i. Because parameter ai (or bi) corresponds to the degree of auditory (or visual) support for a unimodal stimulus at the ith level, it seems reasonable to expect the following orderings among the parameters to hold (see also [6]):

aj > ai and bj > bi for j > i. (16)

The models relying on this set of ordered priors will be referred to as LIMo and FLMPo.

4.3 Complexity and experimental design

It is tempting to consider model complexity as an inherent characteristic of a model. For some models and for some measures of complexity this is clearly the case. Consider, for example, model Mb. In any experimental design (i.e., a number of coin tosses), PC_Mb = 1. However, more generally, this is not the case. Focusing on the FLMP and the LIM, it is clear that even a simple measure such as PC depends crucially on (some aspects of) the experimental design. In particular, every level corresponds to a new parameter, so PC = I + J. Similarly, GC is dependent on design choices. The PPC is not different in this respect. The design sensitivity implies that one can only make sensible conclusions about differences in model complexity by using different designs. In an information integration task, the design decisions include the type of design (expanded or not), the number of sources, the number of response alternatives, the number of levels for each source, and the number of observations for each stimulus (sample size). The present study focuses on the expanded factorial designs with two sources and two response alternatives. The additional design features were varied: both a 5 × 5 and an 8 × 2 design were considered, using three different sample sizes (20, 60 and 150, following [2]).

4.4 Results

Figure 1 shows the 99% predicted interval in the 8 × 2 design with n = 150. Each panel corresponds to a different model. In each panel, each of the 26 stimuli is displayed on the x-axis. The first eight stimuli correspond to the stimuli with the lowest level of visual support, and are ordered in increasing order of auditory support. The next eight stimuli correspond to the stimuli with the highest level of visual support.
The next eight stimuli correspond to the unimodal stimuli where only auditory information is provided (again ranked in increasing order). The final two stimuli are the unimodal visual stimuli.

[Figure 1: The 99% predicted interval for each of the 26 stimuli (x-axis) according to LIMu (Panel A), FLMPu (Panel B), LIMo (Panel C), and FLMPo (Panel D). The y-axis of each panel shows the proportion of /da/ responses, ranging from 0 to 1.]

Panel A shows that the predicted interval of LIMu nearly equals the universal interval, ranging between 0 and 1. This indicates that almost all outcomes are given a non-negligible prior mass by LIMu, making it almost maximally complex. FLMPu is even more complex. The predicted interval, shown in Panel B, virtually equals the universal interval, indicating that the model predicts virtually every possible outcome. Panels C and D show the dramatic effect of incorporating relevant prior information in the models. The predicted intervals of both LIMo and FLMPo are much smaller than their counterparts using the uniform priors. Focusing on the comparison between LIM and FLMP, the PPC indicates that the latter is more complex than the former. This observation holds irrespective of the model version (assuming uniform vs. ordered priors). The smaller complexity of LIM is in line with previous attempts to measure the relative complexities of LIM and FLMP, such as the atheoretical simulation-based approach ([4] but see [5]), the semi-theoretical simulation-based approach [4], the theoretical simulation-based approach [2, 6, 22], and a direct computation of the GC [2].

The PPCs for all six designs considered are displayed in Table 1.

Table 1: PPC, based on the 99% predicted interval, for four models across six different designs.
         5×5, n=20   5×5, n=60   5×5, n=150   8×2, n=20   8×2, n=60   8×2, n=150
LIMu     0.97        0.94        0.97         0.95        0.93        0.94
FLMPu    1           1           1            1           0.99        0.99
LIMo     0.75        0.67        0.77         0.69        0.64        0.66
FLMPo    0.83        0.80        0.86         0.82        0.78        0.81

It shows that the observations made for the 8 × 2, n = 150 design hold across the five remaining designs as well: LIM is simpler than FLMP; and models assuming ordered priors are simpler than models assuming uniform priors. Note that these conclusions would not have been possible based on PC or GC. For PC, all four models have the same complexity. GC, in contrast, would detect complexity differences between LIM and FLMP (i.e., the first conclusion), but due to its insensitivity to the parameter prior, the complexity differences between LIMu and LIMo on the one hand, and FLMPu and FLMPo on the other hand (i.e., the second conclusion) would have gone unnoticed.

5 Discussion

A theorist defining a model should clearly and explicitly specify at least the three following pieces of information: the model equation, the parameter prior range, and the parameter prior distribution. If any of these pieces is missing, the model should be regarded as incomplete, and therefore untestable. Consequently, any measure of generalizability should be sensitive to all three aspects of the model definition. Many currently popular generalizability measures do not satisfy this criterion, including AIC, BIC and MDL. A measure of generalizability that does take these three aspects of a model into account is the marginal likelihood [6, 7, 14, 23].
Often, the marginal likelihood is criticized exactly for its sensitivity to the prior range and distribution (e.g., [24]). However, in the light of the fact that the prior is a part of the model definition, I see the sensitivity of the marginal likelihood to the prior as an asset rather than a nuisance. It is precisely the measures of generalizability that are insensitive to the prior that miss an important aspect of the model. Similarly, any stand alone measure of model complexity should be sensitive to all three aspects of the model definition, as all three aspects contribute to the model’s complexity (with the model equation contributing two factors: the number of parameters and the functional form). Existing measures of complexity do not satisfy this requirement and are therefore incomplete. PC takes only part of the model equation into account, whereas GC takes only the model equation and the range into account. In contrast, the PPC currently proposed is sensitive to all these three aspects. It assesses model complexity using the predicted interval which contains all possible outcomes a model can generate. A narrow predicted interval (relative to the universal interval) indicates a simple model; a complex model is characterized by a wide predicted interval. There is a tight coupling between the notions of information, knowledge and uncertainty, and the notion of model complexity. As parameters correspond to unknown variables, having more information available leads to fewer parameters and hence to a simpler model. Similarly, the more information there is available, the sharper the parameter prior, implying a simpler model. To put it differently, the less uncertainty present in a model, the narrower its predicted interval, and the simpler the model. For example, in model Mb , there is maximal uncertainty. Nothing but the range is known about θ, so all values of θ are equally likely. In contrast, in model Mf , there is minimal uncertainty. In fact, ph is known for sure, so only a single value of θ is possible. This difference in uncertainty is translated in a difference in complexity. The same is true for the information integration models. Incorporating the order constraints in the priors reduces the uncertainty compared to the models without these constraints (it tells you, for example, that parameter a1 is smaller than a2 ). This reduction in uncertainty is reflected by a smaller complexity. There are many different sources of prior information that can be translated in a range or distribution. The illustration using the information integration models highlighted that prior information can reflect meaningful information in the design. Alternatively, priors can be informed by previous applications of similar models in similar settings. Probably the purest form of priors are those that translate theoretical assumptions made by a model (see [16]). The fact that it is often difficult to formalize this prior information may not be used as an excuse to leave the prior unspecified. Sure it is a challenging task, but so is translating theoretical assumptions into the model equation. Formalizing theory, intuitions, and information is what model building is all about. 8 References [1] Myung, I. J. (2000) The importance of complexity in model selection. Journal of Mathematical Psychology, 44, 190–204. [2] Pitt, M. A., Myung, I. J., and Zhang, S. (2002) Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491. [3] Shiffrin, R. M., Lee, M. 
D., Kim, W., and Wagenmakers, E. J. (2008) A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32, 1248–1284. [4] Cutting, J. E., Bruno, N., Brady, N. P., and Moore, C. (1992) Selectivity, scope, and simplicity of models: A lesson from fitting judgments of perceived depth. Journal of Experimental Psychology: General, 121, 364–381. [5] Dunn, J. (2000) Model complexity: The fit to random data reconsidered. Psychological Research, 63, 174–182. [6] Myung, I. J. and Pitt, M. A. (1997) Applying Occam’s razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79–95. [7] Vanpaemel, W. and Storms, G. (in press) Abstraction and model evaluation in category learning. Behavior Research Methods. [8] Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. Petrov, B. and Csaki, B. (eds.), Second International Symposium on Information Theory, pp. 267–281, Academiai Kiado. [9] Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461–464. [10] Myung, I. J., Balasubramanian, V., and Pitt, M. A. (2000) Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences, 97, 11170–11175. [11] Lee, M. D. (2002) Generating additive clustering models with minimal stochastic complexity. Journal of Classification, 19, 69–85. [12] Rissanen, J. (1996) Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47. [13] Gr¨ nwald, P. (2000) Model selection based on minimum description length. Journal of Mathematical u Psychology, 44, 133–152. [14] Lee, M. D. and Wagenmakers, E. J. (2005) Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662–668. [15] Lee, M. D. and Vanpaemel, W. (2008) Exemplars, prototypes, similarities and rules in category representation: An example of hierarchical Bayesian analysis. Cognitive Science, 32, 1403–1424. [16] Vanpaemel, W. and Lee, M. D. (submitted) Using priors to formalize theory: Optimal attention and the generalized context model. [17] Lee, M. D. (2008) Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review, 15, 1–15. [18] Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2004) WinBUGS User Manual Version 2.0. Medical Research Council Biostatistics Unit. Institute of Public Health, Cambridge. [19] Anderson, N. H. (1981) Foundations of information integration theory. Academic Press. [20] Oden, G. C. and Massaro, D. W. (1978) Integration of featural information in speech perception. Psychological Review, 85, 172–191. [21] Massaro, D. W. (1998) Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press. [22] Massaro, D. W., Cohen, M. M., Campbell, C. S., and Rodriguez, T. (2001) Bayes factor of model selection validates FLMP. Psychonomic Bulletin and Review, 8, 1–17. [23] Kass, R. E. and Raftery, A. E. (1995) Bayes factors. Journal of the American Statistical Association, 90, 773–795. [24] Liu, C. C. and Aitkin, M. (2008) Bayes factors: Prior sensitivity and model generalizability. Journal of Mathematical Psychology, 53, 362–375. 9
5 0.67934728 25 nips-2009-Adaptive Design Optimization in Experiments with People
Author: Daniel Cavagnaro, Jay Myung, Mark A. Pitt
Abstract: In cognitive science, empirical data collected from participants are the arbiters in model selection. Model discrimination thus depends on designing maximally informative experiments. It has been shown that adaptive design optimization (ADO) allows one to discriminate models as efficiently as possible in simulation experiments. In this paper we use ADO in a series of experiments with people to discriminate the Power, Exponential, and Hyperbolic models of memory retention, which has been a long-standing problem in cognitive science, providing an ideal setting in which to test the application of ADO for addressing questions about human cognition. Using an optimality criterion based on mutual information, ADO is able to find designs that are maximally likely to increase our certainty about the true model upon observation of the experiment outcomes. Results demonstrate the usefulness of ADO and also reveal some challenges in its implementation. 1
6 0.67889941 194 nips-2009-Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory
7 0.67184496 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information
8 0.65024602 112 nips-2009-Human Rademacher Complexity
9 0.64628547 196 nips-2009-Quantification and the language of thought
10 0.64582843 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships
11 0.63579214 154 nips-2009-Modeling the spacing effect in sequential category learning
13 0.54309052 133 nips-2009-Learning models of object structure
14 0.53277582 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization
15 0.53194785 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities
16 0.49749324 39 nips-2009-Bayesian Belief Polarization
17 0.48911712 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference
18 0.4055458 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
19 0.36931869 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
20 0.36666581 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning
topicId topicWeight
[(24, 0.019), (25, 0.063), (35, 0.028), (36, 0.052), (39, 0.558), (58, 0.057), (61, 0.013), (71, 0.053), (86, 0.058), (91, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.97009486 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization
Author: Adam Sanborn, Nick Chater, Katherine A. Heller
Abstract: Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. While from a mathematical point of view, an infinite number of basis sets can be used to represent points in this space, the choice of basis set is psychologically crucial. People generally choose the same basis dimensions – and have a strong preference to generalize along the axes of these dimensions, but not “diagonally”. What makes some choices of dimension special? We explore the idea that the dimensions used by people echo the natural variation in the environment. Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. This bias is shaped by exposing the model to many categories with a structure hypothesized to be like those which children encounter. The learning behaviour of the model captures the developmental shift from roughly “isotropic” for children to the axis-aligned generalization that adults show. 1
2 0.91510135 54 nips-2009-Compositionality of optimal control laws
Author: Emanuel Todorov
Abstract: We present a theory of compositionality in stochastic optimal control, showing how task-optimal controllers can be constructed from certain primitives. The primitives are themselves feedback controllers pursuing their own agendas. They are mixed in proportion to how much progress they are making towards their agendas and how compatible their agendas are with the present task. The resulting composite control law is provably optimal when the problem belongs to a certain class. This class is rather general and yet has a number of unique properties – one of which is that the Bellman equation can be made linear even for non-linear or discrete dynamics. This gives rise to the compositionality developed here. In the special case of linear dynamics and Gaussian noise our framework yields analytical solutions (i.e. non-linear mixtures of LQG controllers) without requiring the final cost to be quadratic. More generally, a natural set of control primitives can be constructed by applying SVD to Green’s function of the Bellman equation. We illustrate the theory in the context of human arm movements. The ideas of optimality and compositionality are both very prominent in the field of motor control, yet they have been difficult to reconcile. Our work makes this possible.
3 0.89068854 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models
Author: Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, Jiawei Han
Abstract: Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy and robustness over a single model. Their potential, however, is limited in applications which have no access to raw data but to the meta-level model output. In this paper, we study ensemble learning with output from multiple supervised and unsupervised models, a topic where little work has been done. Although unsupervised models, such as clustering, do not directly generate label prediction for each individual, they provide useful constraints for the joint prediction of a set of related objects. We propose to consolidate a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors the smoothness of the prediction over the graph, as well as penalizing deviations from the initial labeling provided by supervised models. We solve this problem through iterative propagation of probability estimates among neighboring nodes. Our method can also be interpreted as conducting a constrained embedding in a transformed space, or a ranking on the graph. Experimental results on three real applications demonstrate the benefits of the proposed method over existing alternatives1 . 1
4 0.85913384 251 nips-2009-Unsupervised Detection of Regions of Interest Using Iterative Link Analysis
Author: Gunhee Kim, Antonio Torralba
Abstract: This paper proposes a fast and scalable alternating optimization technique to detect regions of interest (ROIs) in cluttered Web images without labels. The proposed approach discovers highly probable regions of object instances by iteratively repeating the following two functions: (1) choose the exemplar set (i.e. a small number of highly ranked reference ROIs) across the dataset and (2) refine the ROIs of each image with respect to the exemplar set. These two subproblems are formulated as ranking in two different similarity networks of ROI hypotheses by link analysis. The experiments with the PASCAL 06 dataset show that our unsupervised localization performance is better than one of state-of-the-art techniques and comparable to supervised methods. Also, we test the scalability of our approach with five objects in Flickr dataset consisting of more than 200K images. 1
5 0.85526121 110 nips-2009-Hierarchical Mixture of Classification Experts Uncovers Interactions between Brain Regions
Author: Bangpeng Yao, Dirk Walther, Diane Beck, Li Fei-fei
Abstract: The human brain can be described as containing a number of functional regions. These regions, as well as the connections between them, play a key role in information processing in the brain. However, most existing multi-voxel pattern analysis approaches either treat multiple regions as one large uniform region or several independent regions, ignoring the connections between them. In this paper we propose to model such connections in a Hidden Conditional Random Field (HCRF) framework, where the classifier of one region of interest (ROI) makes predictions based on not only its voxels but also the predictions from ROIs that it connects to. Furthermore, we propose a structural learning method in the HCRF framework to automatically uncover the connections between ROIs. We illustrate this approach with fMRI data acquired while human subjects viewed images of different natural scene categories and show that our model can improve the top-level (the classifier combining information from all ROIs) and ROI-level prediction accuracy, as well as uncover some meaningful connections between ROIs. 1
6 0.61852932 21 nips-2009-Abstraction and Relational learning
7 0.58831167 148 nips-2009-Matrix Completion from Power-Law Distributed Samples
8 0.56521577 154 nips-2009-Modeling the spacing effect in sequential category learning
10 0.5635289 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships
11 0.54942656 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization
12 0.54788858 115 nips-2009-Individuation, Identification and Object Discovery
13 0.54017401 125 nips-2009-Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data
14 0.5384922 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering
15 0.53796643 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference
16 0.53304476 133 nips-2009-Learning models of object structure
17 0.51659435 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning
18 0.51209646 112 nips-2009-Human Rademacher Complexity
19 0.50007206 86 nips-2009-Exploring Functional Connectivities of the Human Brain using Multivariate Information Analysis
20 0.49986896 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs