nips nips2009 nips2009-155 knowledge-graph by maker-knowledge-mining

155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization


Source: pdf

Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov

Abstract: We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us “understand” a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We consider the problem of learning probabilistic models for complex relational structures between various types of objects. [sent-6, score-0.147]

2 A model can help us “understand” a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. [sent-7, score-0.436]

3 Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. [sent-8, score-0.147]

4 We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. [sent-9, score-0.15]

5 The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data. [sent-11, score-0.451]

6 1 Introduction Learning with relational data, or sets of propositions of the form (object, relation, object), has been important in a number of areas of AI and statistical data analysis. [sent-12, score-0.198]

7 AI researchers have proposed that by storing enough everyday relational facts and generalizing appropriately to unobserved propositions, we might capture the essence of human common sense. [sent-13, score-0.147]

8 Modelling relational data is also important for more immediate applications, including problems arising in social networks [2], bioinformatics [16], and collaborative filtering [18]. [sent-15, score-0.147]

9 Such a model defines a joint distribution over the binary variables T (a, r, b) ∈ {0, 1}, where a and b are objects, r is a relation, and the variable T (a, r, b) determines whether the relation (a, r, b) is true. [sent-17, score-0.083]

10 Given a set of true relations S = {(a, r, b)}, the model predicts that a new relation (a, r, b) is true with probability P (T (a, r, b) = 1|S). [sent-18, score-0.233]

11 In addition to making predictions on new relations, we also want to understand the data—that is, to find a small set of interpretable laws that explains a large fraction of the observations. [sent-19, score-0.147]

12 By introducing hidden variables over simple hypotheses, the posterior distribution over the hidden variables will concentrate on the laws the data is likely to obey, while the nature of the laws depends on the model. [sent-20, score-0.136]

13 For example, the Infinite Relational Model (IRM) [8] represents simple laws consisting of partitions of objects and partitions of relations. [sent-21, score-0.324]

14 To decide whether the relation (a, r, b) is valid, the IRM simply checks that the clusters to which a, r, and b belong are compatible. [sent-22, score-0.24]

15 The main advantage of the IRM is its ability to extract meaningful partitions of objects and relations from the observational data, 1 which greatly facilitates exploratory data analysis. [sent-23, score-0.321]

16 More elaborate proposals consider models over more powerful laws (e. [sent-24, score-0.068]

17 Models based on matrix or tensor factorization [18, 19, 3] have the potential of making better predictions than interpretable models of similar complexity, as we demonstrate in our experimental results section. [sent-27, score-0.386]

18 Factorization models learn a distributed representation for each object and each relation, and make predictions by taking appropriate inner products. [sent-28, score-0.113]

19 Their strength lies in the relative ease of their continuous (rather than discrete) optimization, and in their excellent predictive performance. [sent-29, score-0.068]

20 The tension between interpretability and predictive power is unfortunate: it is clearly better to have a model that has both strong predictive power and interpretability. [sent-31, score-0.136]

21 We address this problem by introducing the Bayesian Clustered Tensor Factorization (BCTF) model, which combines good interpretability with excellent predictive power. [sent-32, score-0.068]

22 Specifically, similarly to the IRM, the BCTF model learns a partition of the objects and a partition of the relations, so that the truth-value of a relation (a, r, b) depends primarily on the compatibility of the clusters to which a, r, and b belong. [sent-33, score-0.326]

23 At the same time, every entity has a distributed representation: each object a is assigned the two vectors aL , aR (one for a being a left argument in a relation and one for it being a right argument), and a relation r is assigned the matrix R. [sent-34, score-0.315]

24 Given the distributed representations, the truth of a relation (a, r, b) is determined by the value of a⊤ RbR , while the object partition encourages the objects L within a cluster to have similar distributed representations (and similarly for relations). [sent-35, score-0.473]

25 The experiments show that the BCTF model achieves better predictive performance than a number of related probabilistic relational models, including the IRM, on several datasets. [sent-36, score-0.215]

26 We also examine the structure found in BCTF’s clusters and learned vectors. [sent-38, score-0.189]

27 Finally, our results provide an example where the performance of a Bayesian model substantially outperforms a corresponding MAP estimate for large sparse datasets with minimal manual hyperparameter selection. [sent-39, score-0.111]

28 2 The Bayesian Clustered Tensor Factorization (BCTF) We begin with a simple tensor factorization model. [sent-40, score-0.307]

29 Suppose that we have a fixed finite set of objects O and a fixed finite set of relations R. [sent-41, score-0.236]

30 For each object a ∈ O the model maintains two vectors aL , aR ∈ Rd (the left and the right arguments of the relation), and for each relation r ∈ R it maintains a matrix R ∈ Rd×d , where d is the dimensionality of the model. [sent-42, score-0.176]

31 Given a setting of these parameters (collectively denoted by θ), the model independently chooses the truth-value of each relation (a, r, b) from the distribution P (T (a, r, b) = 1|θ) = 1/(1 + exp(−a⊤ RbR )). [sent-43, score-0.083]

32 In L particular, given a set of known relations S, we can learn the parameters by maximizing a penalized log likelihood log P (S|θ) − Reg(θ). [sent-44, score-0.15]

33 Specifically, the model defines a prior distribution over partitions of objects and partitions of relations using the Chinese Restaurant Process. [sent-47, score-0.406]

34 Once the partitions are chosen, each cluster C samples its own prior mean and prior diagonal covariance, which are then used to independently sample vectors {aL , aR : a ∈ C} that belong to cluster C (and similarly for the relations, where we treat R as a d2 -dimensional vector). [sent-48, score-0.305]

35 As a result, objects within a cluster have similar distributed representations. [sent-49, score-0.234]

36 When the clusters are sufficiently tight, the value of a⊤ RbR is mainly determined by the clusters to which a, r, and b L belong. [sent-50, score-0.314]

37 At the same time, the distributed representations help generalization, because they can represent graded similarities between clusters and fine differences between objects in the same cluster. [sent-51, score-0.342]

38 Thus, given a set of relations, we expect the model to find both meaningful clusters of objects and relations, as well as predictive distributed representations. [sent-52, score-0.367]

39 First, given the parameters θ, the chain updates c = (crel , cobj ) using a collapsed Gibbs sampling sweep and a step of the split-and-merge algorithm (where the launch state was obtained with two sweeps of Gibbs sampling starting from a uniformly random cluster assignment) [5]. [sent-79, score-0.294]

40 Next, it samples from the posterior mean 3 and covariance of each cluster, which is the distribution proportional to the term being integrated in Eq. [sent-80, score-0.089]

41 Next, the Markov chain samples the parameters {aL } given {aR }, {R}, and the cluster posterior means and covariances. [sent-82, score-0.147]

42 This step is tractable since the conditional distribution over the object vectors {aL } is Gaussian and factorizes into the product of conditional distributions over the individual object vectors. [sent-83, score-0.15]

43 This conditional independence is important, since it tends to make the Markov chain mix faster, and is a direct consequence of each object a having two vectors, aL and aR . [sent-84, score-0.112]

44 The speedup could be substantial, particularly when the number of objects is large. [sent-87, score-0.086]

45 The disadvantage of using two vectors for each object is that the model cannot as easily capture the “position-independent” properties of the object, especially in the sparse regime. [sent-88, score-0.093]

46 3 Evaluation In this section, we show that the BCTF model has excellent predictive power and that it finds interpretable clusters by applying it to five datasets and comparing its performance to the IRM [8] and the Multiple Relational Clustering (MRC) model [9]. [sent-95, score-0.368]

47 We also compare BCTF to its simpler counterpart: a Bayesian Tensor Factorization (BTF) model, where all the objects and the relations belong to a single cluster. [sent-96, score-0.236]

48 The Bayesian Tensor Factorization model is a generalization of the Bayesian probabilistic matrix factorization [17], and is closely related to many other existing tensor-factorization methods [3, 14, 1]. [sent-97, score-0.123]

49 In what follows, we will describe the datasets, report the predictive performance of our and of the competing algorithms, and examine the structure discovered by BCTF. [sent-98, score-0.068]

50 1 Description of the Datasets We use three of the four datasets used by [8] and [9], namely, the Animals, the UML, and the Kinship dataset, as well the Movielens [15] and the Conceptnet datasets [10]. [sent-100, score-0.128]

51 The animals dataset consists of 50 animals and 85 binary attributes. [sent-102, score-0.27]

52 The dataset is a fully observed matrix—so there is only one relation. [sent-103, score-0.06]

53 The kinship dataset consists of kinship relationships among the members of the Alyawarra tribe [4]. [sent-105, score-0.426]

54 This dataset is dense and has 104·26·104 = 218216 observations, most of which are 0. [sent-107, score-0.06]

55 The UML dataset [11] consists of a 135 medical terms and 49 relations. [sent-109, score-0.092]

56 The dataset is also fully observed and has 135·49·135 = 893025 (mostly 0) observations. [sent-110, score-0.06]

57 The Movielens [15] dataset consists of 1000209 observed integer ratings of 6041 movies on a scale from 1 to 5, which are rated by 3953 users. [sent-112, score-0.109]

58 The Conceptnet dataset [10] is a collection of common-sense assertions collected from the web. [sent-116, score-0.102]

59 First, we normalized each dataset so the mean of its observations was 0. [sent-126, score-0.06]

60 Next, we created 10 random train/test 4 algorithm MAP20 MAP40 BTF20 BCTF20 BTF40 BCTF40 IRM [8] MRC [9] animals RMSE AUC 0. [sent-127, score-0.105]

61 We report the performance of the following algorithms: the MAP-based Tensor Factorization, the Bayesian Tensor Factorization (BTF) with MCMC (where all objects belong to a single cluster), the full Bayesian Clustered Tensor Factorization (BCTF), the IRM [8] and the MRC [9]. [sent-191, score-0.086]

62 Right: The covariance of the distributed representations of the animals (bottom) and their attributes (top). [sent-195, score-0.327]

63 The IRM partitions the data into blocks; we compute the smoothed mean of the observed entries of each block and use it to predict the test entries in the same block. [sent-200, score-0.085]

64 3 Results We first applied BCTF to the Animals, Kinship, and the UML datasets using 20 and 40-dimensional vectors. [sent-202, score-0.064]

65 In fact, for the Kinship and the UML datasets, the simple tensor factorization model trained by MAP performs as well as BTF and BCTF. [sent-204, score-0.307]

66 This happens because for these datasets the number of observations is much larger than the number of parameters, so there is little uncertainty about the true parameter values. [sent-205, score-0.064]

67 However, the Animals dataset is considerably smaller, so BTF performs better, and BCTF performs even better than the BTF model. [sent-206, score-0.06]

68 This is important because both datasets are sparse, which makes overfitting difficult to combat. [sent-209, score-0.064]

69 Left: The covariance of the distributed representations {aL } learned for each person. [sent-215, score-0.22]

70 Right: The biclustering of a subset of the relations. [sent-216, score-0.084]

71 Left: The covariance of the distributed representations {aL } learned for each object. [sent-251, score-0.22]

72 Right: The inferred clusters, along with the biclustering of a subset of the relations. [sent-252, score-0.084]

73 Figure 2 shows some of the clusters learned by the model on the Animals dataset. [sent-256, score-0.189]

74 It also shows the biclustering, as well as the covariance of the distributed representations of the animals and their attributes, sorted by their clusters. [sent-257, score-0.293]

75 By inspecting the covariance, we can determine the clusters that are tight and the affinities between the clusters. [sent-258, score-0.199]

76 Indeed, the cluster structure is reflected in the block-diagonal structure of the covariance matrix. [sent-259, score-0.181]

77 For example, the covariance of the attributes (see Fig. [sent-260, score-0.123]

78 2, top-right panel) shows that cluster F1, containing {flippers, stainteeth,swims} is similar to cluster F4, containing {bulbous, slow, inactive}, but is very dissimilar to F2, containing {hooves, vegetation, grazer}. [sent-261, score-0.22]

79 The kinship dataset has 104 people with complex relationships between them: each person belongs to one of four sections, which strongly constrains the other relations. [sent-263, score-0.243]

80 After learning, each cluster was almost completely localized in gender, section, and age. [sent-265, score-0.092]

81 For clarity of presentation, we sort the clusters first by their section, then by their gender, and finally by their age, as done in [8]. [sent-266, score-0.157]

82 Figure 3 (panels (b-g)) displays some of the relations according to this clustering, and panel (a) shows the covariance between the vectors {aL } learned for each person. [sent-267, score-0.307]

83 The four sections are clearly visible in the covariance structure of the distributed representations. [sent-268, score-0.145]

84 Figure 4 shows the inferred clusters for the medical UML dataset. [sent-269, score-0.189]

85 Left: The covariance of the learned {aL } vectors for each object. [sent-337, score-0.157]

86 Qualitatively, the clustering appears to be on par with that of the IRM on all the datasets, but the BCTF model is able to predict held-out relations much better. [sent-340, score-0.15]

87 Figures 5 and 6 display the learned clusters for the Movielens and the Conceptnet datasets. [sent-341, score-0.189]

88 For the Movielens dataset, we show the most frequently-rated movies in each cluster where the clusters are sorted by size. [sent-342, score-0.298]

89 We also show the covariance between the movie vectors which are sorted by the clusters, where we display only the 100 most frequently-rated movies per cluster. [sent-343, score-0.174]

90 The covariance matrix is aligned with the table on the right, making it easy to see how the clusters relate to each other. [sent-344, score-0.246]

91 For example, according to the covariance structure, clusters 7 and 9, containing Hollywood action/adventure movies are similar to each other but are dissimilar to cluster 8, which consists of comedy/horror movies. [sent-345, score-0.423]

92 6 displays the 100 most frequent objects per category. [sent-347, score-0.086]

93 From the covariance matrix, we can infer that clusters 8, 9, and 11, containing concepts associated with humans taking actions, are very similar to each other, and are very dissimilar to cluster 10, which contains animals. [sent-348, score-0.374]

94 , clusters 2-6) are not crisp, which is reflected in the smaller covariances between vectors in each of these clusters. [sent-351, score-0.193]

95 4 Discussions and Conclusions We introduced a new method for modelling relational data which is able to both discover meaningful structure and generalize well. [sent-352, score-0.195]

96 In particular, our results illustrate the predictive power of distributed representations when applied to modelling relational data, since even simple tensor factorization models can sometimes outperform the more complex models. [sent-353, score-0.669]

97 Indeed, for the kinship and the UML datasets, the performance of the MAP-based tensor factorization was as good as the performance of the BCTF model, which is due to the density of these datasets: the number of observations was much larger than the number of parameters. [sent-354, score-0.49]

98 Learning systems of concepts with an infinite relational model. [sent-416, score-0.147]

99 Multi-HDP: A non parametric bayesian model for tensor factorization. [sent-445, score-0.233]

100 Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. [sent-474, score-0.178]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bctf', 0.377), ('conceptnet', 0.251), ('irm', 0.22), ('movielens', 0.209), ('tensor', 0.184), ('kinship', 0.183), ('crel', 0.168), ('uml', 0.168), ('clusters', 0.157), ('relations', 0.15), ('relational', 0.147), ('btf', 0.147), ('cobj', 0.147), ('al', 0.142), ('ar', 0.134), ('rel', 0.128), ('factorization', 0.123), ('animals', 0.105), ('obs', 0.105), ('rbr', 0.105), ('cluster', 0.092), ('covariance', 0.089), ('rmse', 0.089), ('objects', 0.086), ('partitions', 0.085), ('cup', 0.084), ('biclustering', 0.084), ('relation', 0.083), ('interpretable', 0.079), ('dysfunction', 0.073), ('mrc', 0.073), ('obj', 0.073), ('auc', 0.07), ('park', 0.068), ('glass', 0.068), ('laws', 0.068), ('predictive', 0.068), ('dp', 0.067), ('datasets', 0.064), ('whale', 0.063), ('dataset', 0.06), ('object', 0.057), ('distributed', 0.056), ('chain', 0.055), ('abnormality', 0.055), ('wars', 0.055), ('clustered', 0.053), ('cell', 0.052), ('propositions', 0.051), ('movies', 0.049), ('bayesian', 0.049), ('modelling', 0.048), ('water', 0.048), ('hyperparameter', 0.047), ('music', 0.045), ('protein', 0.043), ('representations', 0.043), ('day', 0.042), ('assertions', 0.042), ('beaver', 0.042), ('biologic', 0.042), ('bulbous', 0.042), ('coffee', 0.042), ('grazer', 0.042), ('hooves', 0.042), ('inspecting', 0.042), ('ippers', 0.042), ('juice', 0.042), ('jurassic', 0.042), ('men', 0.042), ('terminator', 0.042), ('vegetation', 0.042), ('star', 0.041), ('crp', 0.041), ('episode', 0.039), ('disease', 0.038), ('markov', 0.037), ('sh', 0.037), ('sport', 0.037), ('club', 0.037), ('thirtieth', 0.037), ('substance', 0.037), ('rocky', 0.037), ('buy', 0.037), ('alyawarra', 0.037), ('vectors', 0.036), ('dissimilar', 0.036), ('aaai', 0.035), ('map', 0.035), ('restaurant', 0.035), ('salakhutdinov', 0.035), ('city', 0.034), ('attributes', 0.034), ('kemp', 0.034), ('peptide', 0.034), ('ilya', 0.034), ('drinking', 0.034), ('medical', 0.032), ('learned', 0.032), ('mental', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov

Abstract: We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us “understand” a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.

2 0.22811437 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

Author: Kurt Miller, Michael I. Jordan, Thomas L. Griffiths

Abstract: As the availability and importance of relational data—such as the friendships summarized on a social networking website—increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting Bayesian nonparametric methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable—latent features—using a Bayesian nonparametric approach to simultaneously infer the number of features at the same time we learn which entities have each feature. Our model combines these inferred features with known covariates in order to perform link prediction. We demonstrate that the greater expressiveness of this approach allows us to improve performance on three datasets. 1

3 0.087102614 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

Author: Francois Caron, Arnaud Doucet

Abstract: Over recent years Dirichlet processes and the associated Chinese restaurant process (CRP) have found many applications in clustering while the Indian buffet process (IBP) is increasingly used to describe latent feature models. These models are attractive because they ensure exchangeability (over samples). We propose here extensions of these models where the dependency between samples is given by a known decomposable graph. These models have appealing properties and can be easily learned using Monte Carlo techniques. 1 Motivation The CRP and IBP have found numerous applications in machine learning over recent years [5, 10]. We consider here the case where the data we are interested in are ‘locally’ dependent; these dependencies being represented by a known graph G where each data point/object is associated to a vertex. These local dependencies can correspond to any conceptual or real (e.g. space, time) metric. For example, in the context of clustering, we might want to propose a prior distribution on partitions enforcing that data which are ‘close’ in the graph are more likely to be in the same cluster. Similarly, in the context of latent feature models, we might be interested in a prior distribution on features enforcing that data which are ‘close’ in the graph are more likely to possess similar features. The ‘standard’ CRP and IBP correspond to the case where the graph G is complete; that is it is fully connected. In this paper, we generalize the CRP and IBP to decomposable graphs. The resulting generalized versions of the CRP and IBP enjoy attractive properties. Each clique of the graph follows marginally a CRP or an IBP process and explicit expressions for the joint prior distribution on the graph is available. It makes it easy to learn those models using straightforward generalizations of Markov chain Monte Carlo (MCMC) or Sequential Monte Carlo (SMC) algorithms proposed to perform inference for the CRP and IBP [5, 10, 14]. The rest of the paper is organized as follows. In Section 2, we review the popular Dirichlet multinomial allocation model and the Dirichlet Process (DP) partition distribution. We propose an extension of these two models to decomposable graphical models. In Section 3 we discuss nonparametric latent feature models, reviewing briefly the construction in [5] and extending it to decomposable graphs. We demonstrate these models in Section 4 on two applications: an alternative to the hierarchical DP model [12] and a time-varying matrix factorization problem. 2 Prior distributions for partitions on decomposable graphs Assume we have n observations. When performing clustering, we associate to each of this observation an allocation variable zi ∈ [K] = {1, . . . , K}. Let Πn be the partition of [n] = {1, . . . , n} defined by the equivalence relation i ↔ j ⇔ zi = zj . The resulting partition Πn = {A1 , . . . , An(Πn ) } 1 is an unordered collection of disjoint non-empty subsets Aj of [n], j = 1, . . . , n(Πn ), where ∪j Aj = [n] and n(Πn ) is the number of subsets for partition Πn . We also denote by Pn be the set of all partitions of [n] and let nj , j = 1, . . . , n(Πn ), be the size of the subset Aj . Each allocation variable zi is associated to a vertex/site of an undirected graph G, which is assumed to be known. In the standard case where the graph G is complete, we first review briefly here two popular prior distributions on z1:n , equivalently on Πn . We then extend these models to undirected decomposable graphs; see [2, 8] for an introduction to decomposable graphs. Finally we briefly discuss the directed case. Note that the models proposed here are completely different from the hyper multinomial-Dirichlet in [2] and its recent DP extension [6]. 2.1 Dirichlet multinomial allocation model and DP partition distribution Assume for the time being that K is finite. When the graph is complete, a popular choice for the allocation variables is to consider a Dirichlet multinomial allocation model [11] θ θ , . . . , ), zi |π ∼ π (1) K K where D is the standard Dirichlet distribution and θ > 0. Integrating out π, we obtain the following Dirichlet multinomial prior distribution π ∼ D( Pr(z1:n ) = K j=1 Γ(θ) Γ(nj + θ K) (2) θ Γ(θ + n)Γ( K )K and then, using the straightforward equality Pr(Πn ) = PK where PK = {Πn ∈ Pn |n(Πn ) ≤ K}, we obtain K! (K−n(Πn ))! Pr(z1:n ) valid for for all Πn ∈ n(Π ) Pr(Πn ) = θ Γ(θ) j=1n Γ(nj + K ) K! . θ (K − n(Πn ))! Γ(θ + n)Γ( K )n(Πn ) (3) DP may be seen as a generalization of the Dirichlet multinomial model when the number of components K → ∞; see for example [10]. In this case the distribution over the partition Πn of [n] is given by [11] n(Π ) θn(Πn ) j=1n Γ(nj ) . (4) Pr(Πn ) = n i=1 (θ + i − 1) Let Π−k = {A1,−k , . . . , An(Π−k ),−k } be the partition induced by removing item k to Πn and nj,−k be the size of cluster j for j = 1, . . . , n(Π−k ). It follows from (4) that an item k is assigned to an existing cluster j, j = 1, . . . , n(Π−k ), with probability proportional to nj,−k / (n − 1 + θ) and forms a new cluster with probability θ/ (n − 1 + θ). This property is the basis of the CRP. We now extend the Dirichlet multinomial allocation and the DP partition distribution models to decomposable graphs. 2.2 Markov combination of Dirichlet multinomial and DP partition distributions Let G be a decomposable undirected graph, C = {C1 , . . . , Cp } a perfect ordering of the cliques and S = {S2 , . . . , Cp } the associated separators. It can be easily checked that if the marginal distribution of zC for each clique C ∈ C is defined by (2) then these distributions are consistent as they yield the same distribution (2) over the separators. Therefore, the unique Markov distribution over G with Dirichlet multinomial distribution over the cliques is defined by [8] Pr(zC ) S∈S Pr(zS ) C∈C Pr(z1:n ) = (5) where for each complete set B ⊆ G, we have Pr(zB ) given by (2). It follows that we have for any Πn ∈ PK Γ(θ) K! Pr(Πn ) = (K − n(Πn ))! C∈C Γ(θ) S∈S 2 K j=1 θ Γ(nj,C + K ) θ Γ(θ+nC )Γ( K )K K j=1 θ Γ(nj,S + K ) θ Γ(θ+nS )Γ( K )K (6) where for each complete set B ⊆ G, nj,B is the number of items associated to cluster j, j = 1, . . . , K in B and nB is the total number of items in B. Within each complete set B, the allocation variables define a partition distributed according to the Dirichlet-multinomial distribution. We now extend this approach to DP partition distributions; that is we derive a joint distribution over Πn such that the distribution of ΠB over each complete set B of the graph is given by (4) with θ > 0. Such a distribution satisfies the consistency condition over the separators as the restriction of any partition distributed according to (4) still follows (4) [7]. G Proposition. Let Pn be the set of partitions Πn ∈ Pn such that for each decomposition A, B, and any (i, j) ∈ A × B, i ↔ j ⇒ ∃k ∈ A ∩ B such that k ↔ i ↔ j. As K → ∞, the prior distribution G over partitions (6) is given for each Πn ∈ Pn by Pr(Πn ) = θn(Πn ) n(ΠC ) Γ(nj,C ) j=1 nC i=1 (θ+i−1) n(ΠS ) Γ(nj,S ) j=1 nS (θ+i−1) i=1 C∈C S∈S (7) where n(ΠB ) is the number of clusters in the complete set B. Proof. From (6), we have θ n(ΠC ) K(K − 1) . . . (K − n(Πn ) + 1) Pr(Πn ) = K C∈C n(ΠC )− S∈S n(ΠS ) C∈C θ n(ΠS ) S∈S n(ΠC ) θ Γ(nj,C + K ) j=1 nC (θ+i−1) i=1 n(ΠS ) θ Γ(nj,S + K ) j=1 nS (θ+i−1) i=1 Thus when K → ∞, we obtain (7) if n(Πn ) = C∈C n(ΠC ) − S∈S n(ΠS ) and 0 otherwise. We have n(Πn ) ≤ C∈C n(ΠC ) − S∈S n(ΠS ) for any Πn ∈ Pn and the subset of Pn verifying G n(Πn ) = C∈C n(ΠC ) − S∈S n(ΠS ) corresponds to the set Pn . Example. Let the notation i ∼ j (resp. i j) indicates an edge (resp. no edge) between two sites. Let n = 3 and G be the decomposable graph defined by the relations 1 ∼ 2, 2 ∼ 3 and 1 3. G The set P3 is then equal to {{{1, 2, 3}}; {{1, 2}, {3}}; {{1}, {2, 3}}; {{1}, {2}, {3}}}. Note that G the partition {{1, 3}, {2}} does not belong to P3 . Indeed, as there is no edge between 1 and 3, they cannot be in the same cluster if 2 is in another cluster. The cliques are C1 = {1, 2} and C2 = {2, 3} Pr(ΠC1 ) Pr(ΠC2 ) hence we can and the separator is S2 = {2}. The distribution is given by Pr(Π3 ) = Pr(ΠS ) 2 check that we obtain Pr({1, 2, 3}) = (θ + 1)−2 , Pr({1, 2}, {3}) = Pr({1, 2}, {3}) = θ(θ + 1)−2 and Pr({1}, {2}, {3}) = θ2 (θ + 1)−2 . Let now define the full conditional distributions. Based on (7) the conditional assignment of an item k is proportional to the conditional over the cliques divided by the conditional over the separators. G Let denote G−k the undirected graph obtained by removing vertex k from G. Suppose that Πn ∈ Pn . G−k If Π−k ∈ Pn−1 , then do not change the value of item k. Otherwise, item k is assigned to cluster j / where j = 1, . . . , n(Π−k ) with probability proportional to {C∈C|n−k,j,C >0} n−k,j,C {S∈S|n−k,j,S >0} n−k,j,S (8) and to a new cluster with probability proportional to θ, where n−k,j,C is the number of items in the set C \ {k} belonging to cluster j. The updating process is illustrated by the Chinese wedding party process1 in Fig. 1. The results of this section can be extended to the Pitman-Yor process, and more generally to species sampling models. Example (continuing). Given Π−2 = {A1 = {1}, A2 = {3}}, we have −1 Pr( item 2 assigned to A1 = {1}| Π−2 ) = Pr( item 2 assigned to A2 = {3}| Π−2 ) = (θ + 2) −1 and Pr( item 2 assigned to new cluster A3 | Π−2 ) = θ (θ + 2) . Given Π−2 = {A1 = {1, 3}}, item 2 is assigned to A1 with probability 1. 1 Note that this representation describes the full conditionals while the CRP represents the sequential updat- ing. 3 (a) (b) (d) (c) (e) Figure 1: Chinese wedding party. Consider a group of n guests attending a wedding party. Each of the n guests may belong to one or several cliques, i.e. maximal groups of people such that everybody knows everybody. The belonging of each guest to the different cliques is represented by color patches on the figures, and the graphical representation of the relationship between the guests is represented by the graphical model (e). (a) Suppose that the guests are already seated such that two guests cannot be together at the same table is they are not part of the same clique, or if there does not exist a group of other guests such that they are related (“Any friend of yours is a friend of mine”). (b) The guest number k leaves his table and either (c) joins a table where there are guests from the same clique as him, with probability proportional to the product of the number of guests from each clique over the product of the number of guests belonging to several cliques on that table or (d) he joins a new table with probability proportional to θ. 2.3 Monte Carlo inference 2.3.1 MCMC algorithm Using the full conditionals, a single site Gibbs sampler can easily be designed to approximate the posterior distribution Pr(Πn |z1:n ). Given a partition Πn , an item k is taken out of the partition. If G−k Π−k ∈ Pn−1 , item k keeps the same value. Otherwise, the item will be assigned to a cluster j, / j = 1, . . . , n(Π−k ), with probability proportional to p(z{k}∪Aj,−k ) × p(zAj,−k ) {C∈C|n−k,j,C >0} n−k,j,C {S∈S|n−k,j,S >0} n−k,j,S (9) and the item will be assigned to a new cluster with probability proportional to p(z{k} ) × θ. Similarly to [3], we can also define a procedure to sample from p(θ|n(Πn ) = k)). We assume that θ ∼ G(a, b) and use p auxiliary variables x1 , . . . , xp . The procedure is as follows. • For j = 1, . . . , p, sample xj |k, θ ∼ Beta(θ + nSj , nCj − nSj ) • Sample θ|k, x1:p ∼ G(a + k, b − j log xj ) 2.3.2 Sequential Monte Carlo We have so far only treated the case of an undirected decomposable graph G. We can formulate a sequential updating rule for the corresponding perfect directed version D of G. Indeed, let (a1 , . . . a|V | ) be a perfect ordering and pa(ak ) be the set of parents of ak which is by definition complete. Let Πk−1 = {A1,k−1 , . . . , An(Πk−1 ),k−1 } denote the partition of the first k−1 vertices a1:k−1 and let nj,pa(ak ) be the number of elements with value j in the set pa(ak ), j = 1, . . . , n(Πk−1 ). Then the vertex ak joins the set j with probability nj,pa(ak ) / θ + cluster with probability θ/ θ + q q nq,pa(ak ) and creates a new nq,pa(ak ) . One can then design a particle filter/SMC method in a similar fashion as [4]. Consider a set of (i) (i) (i) (i) N N particles Πk−1 with weights wk−1 ∝ Pr(Πk−1 , z1:k−1 ) ( i=1 wk−1 = 1) that approximate (i) the posterior distribution Pr(Πk−1 |z1:k−1 ). For each particle i, there are n(Πk−1 ) + 1 possible 4 (i,j) allocations for component ak . We denote Πk the partition obtained by associating component ak (i,j) to cluster j. The weight associated to Πk is given by  nj,pa(ak ) (i)  if j = 1, . . . , n(Πk−1 ) θ+ q nq,pa(ak ) (i,j) (i) p(z{ak }∪Aj,k−1 ) wk−1 = wk−1 × (10) (i) θ  θ+ n p(zAj,k−1 ) if j = n(Πk−1 ) + 1 q q,pa(ak ) (i,j) Then we can perform a deterministic resampling step by keeping the N particles Πk with highest (i,j) (i) (i) weights wk−1 . Let Πk be the resampled particles and wk the associated normalized weights. 3 Prior distributions for infinite binary matrices on decomposable graphs Assume we have n objects; each of these objects being associated to the vertex of a graph G. To K each object is associated a K-dimensional binary vector zn = (zn,1 , . . . , zn,K ) ∈ {0, 1} where zn,i = 1 if object n possesses feature i and zn,i = 0 otherwise. These vectors zt form a binary n × K matrix denoted Z1:n . We denote by ξ1:n the associated equivalence class of left-ordered matrices and let EK be the set of left-ordered matrices with at most K features. In the standard case where the graph G is complete, we review briefly here two popular prior distributions on Z1:n , equivalently on ξ1:n : the Beta-Bernoulli model and the IBP [5]. We then extend these models to undirected decomposable graphs. This can be used for example to define a time-varying IBP as illustrated in Section 4. 3.1 Beta-Bernoulli and IBP distributions The Beta-Bernoulli distribution over the allocation Z1:n is K Pr(Z1:n ) = α + K )Γ(n − nj + 1) α Γ(n + 1 + K ) α K Γ(nj j=1 (11) where nj is the number of objects having feature j. It follows that Pr(ξ1:n ) = K K! 2n −1 h=0 α K Γ(nj α + K )Γ(n − nj + 1) α Γ(n + 1 + K ) Kh ! j=1 (12) where Kh is the number of features possessing the history h (see [5] for details). The nonparametric model is obtained by taking the limit when K → ∞ Pr(ξ1:n ) = αK K+ + 2n −1 h=1 Kh ! exp(−αHn ) where K + is the total number of features and Hn = 3.2 (n − nj )!(nj − 1)! n! j=1 n 1 k=1 k . (13) The IBP follows from (13). Markov combination of Beta-Bernoulli and IBP distributions Let G be a decomposable undirected graph, C = {C1 , . . . , Cp } a perfect ordering of the cliques and S = {S2 , . . . , Cp } the associated separators. As in the Dirichlet-multinomial case, it is easily seen that if for each clique C ∈ C, the marginal distribution is defined by (11), then these distributions are consistent as they yield the same distribution (11) over the separators. Therefore, the unique Markov distribution over G with Beta-Bernoulli distribution over the cliques is defined by [8] Pr(ZC ) S∈S Pr(ZS ) C∈C Pr(Z1:n ) = (14) where Pr(ZB ) given by (11) for each complete set B ⊆ G. The prior over ξ1:n is thus given, for ξ1:n ∈ EK , by Pr(ξ1:n ) = K! 2n −1 h=0 Kh ! α K α Γ(nj,C + K )Γ(nC −nj,C +1) α Γ(nC +1+ K ) α α Γ(nj,S + K )Γ(nS −nj,S +1) K K α j=1 Γ(nS +1+ K ) K j=1 C∈C S∈S 5 (15) where for each complete set B ⊆ G, nj,B is the number of items having feature j, j = 1, . . . , K in the set B and nB is the whole set of objects in set B. Taking the limit when K → ∞, we obtain after a few calculations Pr(ξ1:n ) = α + K[n] exp [−α ( C HnC − 2n −1 h=1 Kh ! HnS )] × C∈C + KC (nC −nj,C )!(nj,C −1)! j=1 nC ! S∈S S + KS (nS −nj,S )!(nj,S −1)! j=1 nS ! + + + + if K[n] = C KC − S KS and 0 otherwise, where KB is the number of different features possessed by objects in B. G Let En be the subset of En such that for each decomposition A, B and any (u, v) ∈ A × B: {u and v possess feature j} ⇒ ∃k ∈ A ∩ B such that {k possesses feature j}. Let ξ−k be the left-ordered + matrix obtained by removing object k from ξn and K−k be the total number of different features in G−k + ξ−k . For each feature j = 1, . . . , K−k , if ξ−k ∈ En−1 then we have   b C∈C nj,C if i = 1 S∈C nj,S Pr(ξk,j = i) = (16)  b C∈C (nC −nj,C ) if i = 0 (nS −nj,S ) S∈C nS where b is the appropriate normalizing constant then the customer k tries Poisson α {S∈S|k∈S} nC {C∈C|k∈C} new dishes. We can easily generalize this construction to a directed version D of G using arguments similar to those presented in Section 2; see Section 4 for an application to time-varying matrix factorization. 4 4.1 Applications Sharing clusters among relative groups: An alternative to HDP Consider that we are given d groups with nj data yi,j in each group, i = 1, . . . , nj , j = 1, . . . , d. We consider latent cluster variables zi,j that define the partition of the data. We will use alternatively the notation θi,j = Uzi,j in the following. Hierarchical Dirichlet Process [12] (HDP) is a very popular model for sharing clusters among related groups. It is based on a hierarchy of DPs G0 ∼ DP (γ, H), Gj |G0 ∼ DP (α, G0 ) j = 1, . . . d θi,j |Gj ∼ Gj , yi,j |θi,j ∼ f (θi,j ) i = 1, . . . , nj . Under conjugacy assumptions, G0 , Gj and U can be integrated out and we can approximate the marginal posterior of (zi,j ) given y = (yi,j ) with Gibbs sampling using the Chinese restaurant franchise to sample from the full conditional p(zi,j |z−{i,j} , y). Using the graph formulation defined in Section 2, we propose an alternative to HDP. Let θ0,1 , . . . , θ0,N be N auxiliary variables belonging to what we call group 0. We define each clique Cj (j = 1, . . . , d) to be composed of elements from group j and elements from group 0. This defines a decomposable graphical model whose separator is given by the elements of group 0. We can rewrite the model in a way quite similar to HDP G0 ∼ DP (α, H), θ0,i |G0 ∼ G0 i = 1, ..., N α α Gj |θ0,1 , . . . , θ0,N ∼ DP (α + N, α+N H + α+N θi,j |Gj ∼ Gj , yi,j |θi,j ∼ f (θi,j ) i = 1, . . . , nj N i=1 δθ0,i ) j = 1, . . . d, N For any subset A and j = k ∈ {1, . . . , p} we have corr(Gj (A), Gk (A)) = α+N . Again, under conjugacy conditions, we can integrate out G0 , Gj and U and approximate the marginal posterior distribution over the partition using the Chinese wedding party process defined in Section 2. Note that for latent variables zi,j , j = 1, . . . , d, associated to data, this is the usual CRP update. As in HDP, multiple layers can be added to the model. Figures 2 (a) and (b) resp. give the graphical DP alternative to HDP and 2-layer HDP. 6 z0 root z0 root corpora docs z1 z2 z1 z2 z3 z1,1 z1,2 z2,1 z2,2 z2,3 docs (a) Graphical DP alternative to HDP (b) Graphical DP alternative to 2-layer HDP Figure 2: Hierarchical Graphs of dependency with (a) one layer and (b) two layers of hierarchy. If N = 0, then Gj ∼ DP (α, H) for all j and this is equivalent to setting γ → ∞ in HDP. If N → ∞ then Gj = G0 for all j, G0 ∼ DP (α, H). This is equivalent to setting α → ∞ in the HDP. One interesting feature of the model is that, contrary to HDP, the marginal distribution of Gj at any layer of the tree is DP (α, H). As a consequence, the total number of clusters scales logarithmically (as in the usual DP) with the size of each group, whereas it scales doubly logarithmically in HDP. Contrary to HDP, there are at most N clusters shared between different groups. Our model is in that sense reminiscent of [9] where only a limited number of clusters can be shared. Note however that contrary to [9] we have a simple CRP-like process. The proposed methodology can be straightforwardly extended to the infinite HMM [12]. The main issue of the proposed model is the setting of the number N of auxiliary parameters. Another issue is that to achieve high correlation, we need a large number of auxiliary variables. Nonetheless, the computational time used to sample from auxiliary variables is negligible compared to the time used for latent variables associated to data. Moreover, it can be easily parallelized. The model proposed offers a far richer framework and ensures that at each level of the tree, the marginal distribution of the partition is given by a DP partition model. 4.2 Time-varying matrix factorization Let X1:n be an observed matrix of dimension n × D. We want to find a representation of this matrix in terms of two latent matrices Z1:n of dimension n × K and Y of dimension K × D. Here Z1:n 2 is a binary matrix whereas Y is a matrix of latent features. By assuming that Y ∼ N 0, σY IK×D and 2 X1:n = Z1:n Y + σX εn where εn ∼ N 0, σX In×D , we obtain p(X1:n |Z1:n ) ∝ −D/2 2 2 + Z+T Z+ + σX /σY IKn 1:n 1:n + (n−Kn )D σX exp − + Kn D σY 2 2 + where Σ−1 = I − Z+ Z+T Z+ + σX /σY IKn n 1:n 1:n 1:n −1 1 T −1 2 tr X1:n Σn X1:n 2σX (17) + Z+T , Kn the number of non-zero columns of 1:n + Z1:n and Z+ is the first Kn columns of Z1:n . To avoid having to set K, [5, 14] assume that Z1:n 1:n follows an IBP. The resulting posterior distribution p(Z1:n |X1:n ) can be estimated through MCMC [5] or SMC [14]. We consider here a different model where the object Xt is assumed to arrive at time index t and we want a prior distribution on Z1:n ensuring that objects close in time are more likely to possess similar features. To achieve this, we consider the simple directed graphical model D of Fig. 3 where the site numbering corresponds to a time index in that case and a perfect numbering of D is (1, 2, . . .). The set of parents pa(t) is composed of the r preceding sites {{t − r}, . . . , {t − 1}}. The time-varying IBP to sample from p(Z1:n ) associated to this directed graph follows from (16) and proceeds as follows. At time t = 1 + new new • Sample K1 ∼Poisson(α), set z1,i = 1 for i = 1, ..., K1 and set K1 = Knew . At times t = 2, . . . , r n + new ∼Poisson( α ). • For k = 1, . . . Kt , sample zt,k ∼ Ber( 1:t−1,k ) and Kt t t 7   ?  ? - t−r - t−r+1 - . . . - t−1 - t - t+1        6 6 Figure 3: Directed graph. At times t = r + 1, . . . , n n + α new ∼Poisson( r+1 ). • For k = 1, . . . Kt , sample zt,k ∼ Ber( t−r:t−1,k ) and Kt r+1 + Here Kt is the total number of features appearing from time max(1, t − r) to t − 1 and nt−r:t−1,k the restriction of n1:t−1 to the r last customers. Using (17) and the prior distribution of Z1:n which can be sampled using the time-varying IBP described above, we can easily design an SMC method to sample from p(Z1:n |X1:n ). We do not detail it here. Note that contrary to [14], our algorithm does not require inverting a matrix whose dimension grows linearly with the size of the data but only a matrix of dimension r × r. In order to illustrate the model and SMC algorithm, we create 200 6 × 6 images using a ground truth Y consisting of 4 different 6 × 6 latent images. The 200 × 4 binary matrix was generated from Pr(zt,k = 1) = πt,k , where πt = ( .6 .5 0 0 ) if t = 1, . . . , 30, πt = ( .4 .8 .4 0 ) if t = 31, . . . , 50 and πt = ( 0 .3 .6 .6 ) if t = 51, . . . , 200. The order of the model is set to r = 50. The feature occurences Z1:n and true features Y and their estimates are represented in Figure 4. Two spurious features are detected by the model (features 2 and 5 on Fig. 3(c)) but quickly discarded (Fig. 4(d)). The algorithm is able to correctly estimate the varying prior occurences of the features over time. Feature1 Feature2 Feature1 Feature2 Feature3 20 20 40 40 60 60 Feature4 80 100 Feature4 Feature5 Feature6 Time Feature3 Time 80 100 120 120 140 140 160 160 180 200 180 1 2 3 200 4 Feature (a) 1 2 3 4 5 6 Feature (b) (c) (d) Figure 4: (a) True features, (b) True features occurences, (c) MAP estimate ZM AP and (d) associated E[Y|ZM AP ] t=20 t=50 t=20 t=50 t=100 t=200 t=100 t=200 (a) (b) Figure 5: (a) E[Xt |πt , Y] and (b) E[Xt |X1:t−1 ] at t = 20, 50, 100, 200. 5 Related work and Discussion The fixed-lag version of the time-varying DP of Caron et al. [1] is a special case of the proposed model when G is given by Fig. 3. The bivariate DP of Walker and Muliere [13] is also a special case when G has only two cliques. In this paper, we have assumed that the structure of the graph was known beforehand and we have shown that many flexible models arise from this framework. It would be interesting in the future to investigate the case where the graphical structure is unknown and must be estimated from the data. Acknowledgment The authors thank the reviewers for their comments that helped to improve the writing of the paper. 8 References [1] F. Caron, M. Davy, and A. Doucet. Generalized Polya urn for time-varying Dirichlet process mixtures. In Uncertainty in Artificial Intelligence, 2007. [2] A.P. Dawid and S.L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21:1272–1317, 1993. [3] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995. [4] P. Fearnhead. Particle filters for mixture models with an unknown number of components. Statistics and Computing, 14:11–21, 2004. [5] T.L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2006. [6] D. Heinz. Building hyper dirichlet processes for graphical models. Electonic Journal of Statistics, 3:290–315, 2009. [7] J.F.C. Kingman. Random partitions in population genetics. Proceedings of the Royal Society of London, 361:1–20, 1978. [8] S.L. Lauritzen. Graphical Models. Oxford University Press, 1996. [9] P. M¨ ller, F. Quintana, and G. Rosner. A method for combining inference across related nonu parametric Bayesian models. Journal of the Royal Statistical Society B, 66:735–749, 2004. [10] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000. [11] J. Pitman. Exchangeable and partially exchangeable random partitions. Probability theory and related fields, 102:145–158, 1995. [12] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006. [13] S. Walker and P. Muliere. A bivariate Dirichlet process. Statistics and Probability Letters, 64:1–7, 2003. [14] F. Wood and T.L. Griffiths. Particle filtering for nonparametric Bayesian matrix factorization. In Advances in Neural Information Processing Systems, 2007. 9

4 0.086945772 21 nips-2009-Abstraction and Relational learning

Author: Charles Kemp, Alan Jern

Abstract: Most models of categorization learn categories defined by characteristic features but some categories are described more naturally in terms of relations. We present a generative model that helps to explain how relational categories are learned and used. Our model learns abstract schemata that specify the relational similarities shared by instances of a category, and our emphasis on abstraction departs from previous theoretical proposals that focus instead on comparison of concrete instances. Our first experiment suggests that abstraction can help to explain some of the findings that have previously been used to support comparison-based approaches. Our second experiment focuses on one-shot schema learning, a problem that raises challenges for comparison-based approaches but is handled naturally by our abstraction-based account. Categories such as family, sonnet, above, betray, and imitate differ in many respects but all of them depend critically on relational information. Members of a family are typically related by blood or marriage, and the lines that make up a sonnet must rhyme with each other according to a certain pattern. A pair of objects will demonstrate “aboveness” only if a certain spatial relationship is present, and an event will qualify as an instance of betrayal or imitation only if its participants relate to each other in certain ways. All of the cases just described are examples of relational categories. This paper develops a computational approach that helps to explain how simple relational categories are acquired. Our approach highlights the role of abstraction in relational learning. Given several instances of a relational category, it is often possible to infer an abstract representation that captures what the instances have in common. We refer to these abstract representations as schemata, although others may prefer to call them rules or theories. For example, a sonnet schema might specify the number of lines that a sonnet should include and the rhyming pattern that the lines should follow. Once a schema has been acquired it can support several kinds of inferences. A schema can be used to make predictions about hidden aspects of the examples already observed—if the final word in a sonnet is illegible, the rhyming pattern can help to predict the identity of this word. A schema can be used to decide whether new examples (e.g. new poems) qualify as members of the category. Finally, a schema can be used to generate novel examples of a category (e.g. novel sonnets). Most researchers would agree that abstraction plays some role in relational learning, but Gentner [1] and other psychologists have emphasized the role of comparison instead [2, 3]. Given one example of a sonnet and the task of deciding whether a second poem is also a sonnet, a comparison-based approach might attempt to establish an alignment or mapping between the two. Approaches that rely on comparison or mapping are especially prominent in the literature on analogical reasoning [4, 5], and many of these approaches can be viewed as accounts of relational categorization [6]. For example, the problem of deciding whether two systems are analogous can be formalized as the problem of deciding whether these systems are instances of the same relational category. Despite some notable exceptions [6, 7], most accounts of analogy focus on comparison rather than abstraction, and suggest that “analogy passes from one instance of a generalization to another without pausing for explicit induction of the generalization” (p 95) [8]. 1 Schema s 0∀Q ∀x ∀y Q(x) < Q(y) ↔ D1 (x) < D1 (y) Group g Observation o Figure 1: A hierarchical generative model for learning and using relational categories. The schema s at the top level is a logical sentence that specifies which groups are valid instances of the category. The group g at the second level is randomly sampled from the set of valid instances, and the observation o is a partially observed version of group g. Researchers that focus on comparison sometimes discuss abstraction, but typically suggest that abstractions emerge as a consequence of comparing two or more concrete instances of a category [3, 5, 9, 10]. This view, however, will not account for one-shot inferences, or inferences based on a single instance of a relational category. Consider a learner who is shown one instance of a sonnet then asked to create a second instance. Since only one instance is provided, it is hard to see how comparisons between instances could account for success on the task. A single instance, however, will sometimes provide enough information for a schema to be learned, and this schema should allow subsequent instances to be generated [11]. Here we develop a formal framework for exploring relational learning in general and one-shot schema learning in particular. Our framework relies on the hierarchical Bayesian approach, which provides a natural way to combine abstraction and probabilistic inference [12]. The hierarchical Bayesian approach supports representations at multiple levels of abstraction, and helps to explains how abstract representations (e.g. a sonnet schema) can be acquired given observations of concrete instances (e.g. individual sonnets). The schemata we consider are represented as sentences in a logical language, and our approach therefore builds on previous probabilistic methods for learning and using logical theories [13, 14]. Following previous authors, we propose that logical representations can help to capture the content of human knowledge, and that Bayesian inference helps to explain how these representations are acquired and how they support inductive inference. The following sections introduce our framework then evaluate it using two behavioral experiments. Our first experiment uses a standard classification task where participants are shown one example of a category then asked to decide which of two alternatives is more likely to belong to the same category. Tasks of this kind have previously been used to argue for the importance of comparison, but we suggest that these tasks can be handled by accounts that focus on abstraction. Our second experiment uses a less standard generation task [15, 16] where participants are shown a single example of a category then asked to generate additional examples. As predicted by our abstraction-based account, we find that people are able to learn relational categories on the basis of a single example. 1 A generative approach to relational learning Our examples so far have used real-world relational categories such as family and sonnet but we now turn to a very simple domain where relational categorization can be studied. Each element in the domain is a group of components that vary along a number of dimensions—in Figure 1, the components are figures that vary along the dimensions of size, color, and circle position. The groups can be organized into categories—one such category includes groups where every component is black. Although our domain is rather basic it allows some simple relational regularities to be explored. We can consider categories, for example, where all components in a group must be the same along some dimension, and categories where all components must be different along some dimension. We can also consider categories defined by relationships between dimensions—for example, the category that includes all groups where the size and color dimensions are correlated. Each category is associated with a schema, or an abstract representation that specifies which groups are valid instances of the category. Here we consider schemata that correspond to rules formulated 2 1 2 3 4 5 6 7  ff ˘ ¯ ∀x D (x) =, =, <, > vk ∃xff  i  ff ˘ ¯ ∀x ∀y x = y → D (x) =, =, <, > Di (y) ∃x ∃y x = y ∧ 8 i9 ˘ ¯ <∧= ˘ ¯ ∀x Di (x) =, = vk ∨ Dj (x) =, = vl : ; ↔ 8 9 0 1 <∧= ˘ ¯ ˘ ¯ ∀x∀y x = y → @Di (x) =, =, <, > Di (y) ∨ Dj (x) =, =, <, > Dj (y)A : ; ↔  ff ff ff ˘ ¯ ∀Q ∀x ∀y x = y → Q(x) =, =, <, > Q(y) ∃Q ∃x ∃y x = y ∧ 8 9 0 1  ff <∧= ˘ ¯ ˘ ¯ ∀Q Q = Di → ∀x∀y x = y → @Q(x) =, =, <, > Q(y) ∨ Di (x) =, =, <, > Di (y)A ∃Q Q = Di ∧ : ; ↔ 8 9 0 1  ff ff <∧= ˘ ¯ ˘ ¯ ∀Q ∀R Q = R → ∀x∀y x = y → @Q(x) =, =, <, > Q(y) ∨ R(x) =, =, <, > R(y)A ∃Q ∃R Q = R ∧ : ; ↔ Table 1: Templates used to construct a hypothesis space of logical schemata. An instance of a given template can be created by choosing an element from each set enclosed in braces (some sets are laid out horizontally to save space), replacing each occurrence of Di or Dj with a dimension (e.g. D1 ) and replacing each occurrence of vk or vl with a value (e.g. 1). in a logical language. The language includes three binary connectives—and (∧), or (∨), and if and only if (↔). Four binary relations (=, =, <, and >) are available for comparing values along dimensions. Universal quantification (∀x) and existential quantification (∃x) are both permitted, and the language includes quantification over objects (∀x) and dimensions (∀Q). For example, the schema in Figure 1 states that all dimensions are aligned. More precisely, if D1 is the dimension of size, the schema states that for all dimensions Q, a component x is smaller than a component y along dimension Q if and only if x is smaller in size than y. It follows that all three dimensions must increase or decrease together. To explain how rules in this logical language are learned we work with the hierarchical generative model in Figure 1. The representation at the top level is a schema s, and we assume that one or more groups g are generated from a distribution P (g|s). Following a standard approach to category learning [17, 18], we assume that g is uniformly sampled from all groups consistent with s: p(g|s) ∝ 1 g is consistent with s 0 otherwise (1) For all applications in this paper, we assume that the number of components in a group is known and fixed in advance. The bottom level of the hierarchy specifies observations o that are generated from a distribution P (o|g). In most cases we assume that g can be directly observed, and that P (o|g) = 1 if o = g and 0 otherwise. We also consider the setting shown in Figure 1 where o is generated by concealing a component of g chosen uniformly at random. Note that the observation o in Figure 1 includes only four of the components in group g, and is roughly analogous to our earlier example of a sonnet with an illegible final word. To convert Figure 1 into a fully-specified probabilistic model it remains to define a prior distribution P (s) over schemata. An appealing approach is to consider all of the infinitely many sentences in the logical language already mentioned, and to define a prior favoring schemata which correspond to simple (i.e. short) sentences. We approximate this approach by considering a large but finite space of sentences that includes all instances of the templates in Table 1 and all conjunctions of these instances. When instantiating one of these templates, each occurrence of Di or Dj should be replaced by one of the dimensions in the domain. For example, the schema in Figure 1 is a simplified instance of template 6 where Di is replaced by D1 . Similarly, each instance of vk or vl should be replaced by a value along one of the dimensions. Our first experiment considers a problem where there are are three dimensions and three possible values along each dimension (i.e. vk = 1, 2, or 3). As a result there are 1568 distinct instances of the templates in Table 1 and roughly one million 3 conjunctions of these instances. Our second experiment uses three dimensions with five values along each dimension, which leads to 2768 template instances and roughly three million conjunctions of these instances. The templates in Table 1 capture most of the simple regularities that can be formulated in our logical language. Template 1 generates all rules that include quantification over a single object variable and no binary connectives. Template 3 is similar but includes a single binary connective. Templates 2 and 4 are similar to 1 and 3 respectively, but include two object variables (x and y) rather than one. Templates 5, 6 and 7 add quantification over dimensions to Templates 2 and 4. Although the templates in Table 1 capture a large class of regularities, several kinds of templates are not included. Since we do not assume that the dimensions are commensurable, values along different dimensions cannot be directly compared (∃x D1 (x) = D2 (x) is not permitted. For the same reason, comparisons to a dimension value must involve a concrete dimension (∀x D1 (x) = 1 is permitted) rather than a dimension variable (∀Q ∀x Q(x) = 1 is not permitted). Finally, we exclude all schemata where quantification over objects precedes quantification over dimensions, and as a result there are some simple schemata that our implementation cannot learn (e.g. ∃x∀y∃Q Q(x) = Q(y)). The extension of each schema is a set of groups, and schemata with the same extension can be assigned to the same equivalence class. For example, ∀x D1 (x) = v1 (an instance of template 1) and ∀x D1 (x) = v1 ∧ D1 (x) = v1 (an instance of template 3) end up in the same equivalence class. Each equivalence class can be represented by the shortest sentence that it contains, and we define our prior P (s) over a set that includes a single representative for each equivalence class. The prior probability P (s) of each sentence is inversely proportional to its length: P (s) ∝ λ|s| , where |s| is the length of schema s and λ is a constant between 0 and 1. For all applications in this paper we set λ = 0.8. The generative model in Figure 1 can be used for several purposes, including schema learning (inferring a schema s given one or more instances generated from the schema), classification (deciding whether group gnew belongs to a category given one or more instances of the category) and generation (generating a group gnew that belongs to the same category as one or more instances). Our first experiment explores all three of these problems. 2 Experiment 1: Relational classification Our first experiment is organized around a triad task where participants are shown one example of a category then asked to decide which of two choice examples is more likely to belong to the category. Triad tasks are regularly used by studies of relational categorization, and have been used to argue for the importance of comparison [1]. A comparison-based approach to this task, for instance, might compare the example object to each of the choice objects in order to decide which is the better match. Our first experiment is intended in part to explore whether a schema-learning approach can also account for inferences about triad tasks. Materials and Method. 18 adults participated for course credit and interacted with a custom-built computer interface. The stimuli were groups of figures that varied along three dimensions (color, size, and ball position, as in Figure 1). Each shape was displayed on a single card, and all groups in Experiment 1 included exactly three cards. The cards in Figure 1 show five different values along each dimension, but Experiment 1 used only three values along each dimension. The experiment included inferences about 10 triads. Participants were told that aliens from a certain planet “enjoy organizing cards into groups,” and that “any group of cards will probably be liked by some aliens and disliked by others.” The ten triad tasks were framed as questions about the preferences of 10 aliens. Participants were shown a group that Mr X likes (different names were used for the ten triads), then shown two choice groups and told that “Mr X likes one of these groups but not the other.” Participants were asked to select one of the choice groups, then asked to generate another 3-card group that Mr X would probably like. Cards could be added to the screen using an “Add Card” button, and there were three pairs of buttons that allowed each card to be increased or decreased along the three dimensions. Finally, participants were asked to explain in writing “what kind of groups Mr X likes.” The ten triads used are shown in Figure 2. Each group is represented as a 3 by 3 matrix where rows represent cards and columns show values along the three dimensions. Triad 1, for example, 4 (a) D1 value always 3 321 332 313 1 0.5 1 231 323 333 1 4 0.5 4 311 122 333 311 113 313 8 12 16 20 24 211 222 233 211 232 223 1 4 0.5 4 211 312 113 8 12 16 20 24 1 1 4 8 12 16 20 24 312 312 312 313 312 312 1 8 12 16 20 24 211 232 123 4 8 12 16 20 24 1 0.5 231 322 213 112 212 312 4 8 12 16 20 24 4 8 12 16 20 24 0.5 1 0.5 0.5 8 12 16 20 24 0.5 4 8 12 16 20 24 0.5 1 1 4 4 (j) Some dimension has no repeats 0.5 1 311 232 123 231 132 333 1 0.5 8 12 16 20 24 0.5 111 312 213 231 222 213 (i) All dimensions have no repeats 331 122 213 4 1 0.5 8 12 16 20 24 0.5 4 8 12 16 20 24 (h) Some dimension uniform 1 4 4 0.5 1 311 212 113 0.5 1 321 122 223 0.5 8 12 16 20 24 0.5 4 0.5 331 322 313 1 0.5 8 12 16 20 24 (f) Two dimensions anti-aligned (g) All dimensions uniform 133 133 133 4 0.5 1 321 222 123 0.5 1 8 12 16 20 24 1 0.5 8 12 16 20 24 1 0.5 111 212 313 331 212 133 1 (e) Two dimensions aligned 311 322 333 311 113 323 4 (d) D1 and D3 anti-aligned 0.5 1 0.5 1 1 0.5 1 0.5 8 12 16 20 24 (c) D2 and D3 aligned 1 132 332 233 1 0.5 331 323 333 (b) D2 uniform 1 311 321 331 8 12 16 20 24 311 331 331 4 8 12 16 20 24 4 8 12 16 20 24 0.5 Figure 2: Human responses and model predictions for the ten triads in Experiment 1. The plot at the left of each panel shows model predictions (white bars) and human preferences (black bars) for the two choice groups in each triad. The plots at the right of each panel summarize the groups created during the generation phase. The 23 elements along the x-axis correspond to the regularities listed in Table 2. 5 1 2 3 4 5 6 7 8 9 10 11 12 All dimensions aligned Two dimensions aligned D1 and D2 aligned D1 and D3 aligned D2 and D3 aligned All dimensions aligned or anti-aligned Two dimensions anti-aligned D1 and D2 anti-aligned D1 and D3 anti-aligned D2 and D3 anti-aligned All dimensions have no repeats Two dimensions have no repeats 13 14 15 16 17 18 19 20 21 22 23 One dimension has no repeats D1 has no repeats D2 has no repeats D3 has no repeats All dimensions uniform Two dimensions uniform One dimension uniform D1 uniform D2 uniform D3 uniform D1 value is always 3 Table 2: Regularities used to code responses to the generation tasks in Experiments 1 and 2 has an example group including three cards that each take value 3 along D1 . The first choice group is consistent with this regularity but the second choice group is not. The cards in each group were arrayed vertically on screen, and were initially sorted as shown in Figure 2 (i.e. first by D3 , then by D2 and then by D1 ). The cards could be dragged around on screen, and participants were invited to move them around in order to help them understand each group. The mapping between the three dimensions in each matrix and the three dimensions in the experiment (color, position, and size) was randomized across participants, and the order in which triads were presented was also randomized. Model predictions and results. Let ge be the example group presented in the triad task and g1 and g2 be the two choice groups. We use our model to compute the relative probability of two hypotheses: h1 which states that ge and g1 are generated from the same schema and that g2 is sampled randomly from all possible groups, and h2 which states that ge and g2 are generated from the same schema. We set P (h1 ) = P (h2 ) = 0.5, and compute posterior probabilities P (h1 |ge , g1 , g2 ) and P (h2 |ge , g1 , g2 ) by integrating over all schemata in the hypothesis space already described. Our model assumes that two groups are considered similar to the extent that they appear to have been generated by the same underlying schema, and is consistent with the generative approach to similarity described by Kemp et al. [19]. Model predictions for the ten triads are shown in Figure 2. In each case, the choice probabilities plotted (white bars) are the posterior probabilities of hypotheses h1 and h2 . In nine out of ten cases the best choice according to the model is the most common human response. Responses to triads 2c and 2d support the idea that people are sensitive to relationships between dimensions (i.e. alignment and anti-alignment). Triads 2e and 2f are similar to triads studied by Kotovsky and Gentner [1], and we replicate their finding that people are sensitive to relationships between dimensions even when the dimensions involved vary from group to group. The one case where human responses diverge from model predictions is shown in Figure 2h. Note that the schema for this triad involves existential quantification over dimensions (some dimension is uniform), and according to our prior P (s) this kind of quantification is no more complex than other kinds of quantification. Future applications of our approach can explore the idea that existential quantification over dimensions (∃Q) is psychologically more complex than universal quantification over dimensions (∀Q) or existential quantification over cards (∃x), and can consider logical languages that incorporate this inductive bias. To model the generation phase of the experiment we computed the posterior distribution P (gnew |ge , g1 , g2 ) = P (gnew |s)P (s|h, ge , g1 , g2 )P (h|ge , g1 , g2 ) s,h where P (h|ge , g1 , g2 ) is the distribution used to model selections in the triad task. Since the space of possible groups is large, we visualize this distribution using a profile that shows the posterior probability assigned to groups consistent with the 23 regularities shown in Table 2. The white bar plots in Figure 2 show profiles predicted by the model, and the black plots immediately above show profiles computed over the groups generated by our 18 participants. In many of the 10 cases the model accurately predicts regularities in the groups generated by people. In case 2c, for example, the model correctly predicts that generated groups will tend to have no repeats along dimensions D2 and D3 (regularities 15 and 16) and that these two dimensions will be aligned (regularities 2 and 5). There are, however, some departures from the model’s predictions, and a notable example occurs in case 2d. Here the model detects the regularity that dimensions D1 and D3 are anti-aligned (regularity 9). Some groups generated by participants are consistent with 6 (a) All dimensions aligned 1 0.5 1 8 12 16 20 24 (c) D1 has no repeats, D2 and D3 uniform 1 8 12 16 20 24 0.5 1 8 12 16 20 24 354 312 1 8 12 16 20 24 4 8 12 16 20 24 4 8 12 16 20 24 0.5 423 414 214 315 0.5 314 0.5 0.5 4 8 12 16 20 24 1 251 532 314 145 0.5 4 8 12 16 20 24 (f) All dimensions have no repeats 1 1 335 8 12 16 20 24 (e) All dimensions uniform 1 4 0.5 432 514 324 224 424 0.5 314 314 314 314 8 12 16 20 24 4 1 0.5 4 4 0.5 314 0.5 4 8 12 16 20 24 1 431 433 135 335 0.5 1 4 (d) D2 uniform 1 433 1 322 8 12 16 20 24 0.5 0.5 344 333 223 555 222 4 1 1 0.5 0.5 124 224 324 524 311 322 333 354 324 1 0.5 4 311 322 333 355 134 121 232 443 555 443 1 111 333 444 555 (b) D2 and D3 aligned Figure 3: Human responses and model predictions for the six cases in Experiment 2. In (a) and (b), the 4 cards used for the completion and generation phases are shown on either side of the dashed line (completion cards on the left). In the remaining cases, the same 4 cards were used for both phases. The plots at the right of each panel show model predictions (white bars) and human responses (black bars) for the generation task. In each case, the 23 elements along each x-axis correspond to the regularities listed in Table 2. The remaining plots show responses to the completion task. There are 125 possible responses, and the four responses shown always include the top two human responses and the top two model predictions. this regularity, but people also regularly generate groups where two dimensions are aligned rather than anti-aligned (regularity 2). This result may indicate that some participants are sensitive to relationships between dimensions but do not consider the difference between a positive relationship (alignment) and an inverse relationship (anti-alignment) especially important. Kotovsky and Gentner [1] suggest that comparison can explain how people respond to triad tasks, although they do not provide a computational model that can be compared with our approach. It is less clear how comparison might account for our generation data, and our next experiment considers a one-shot generation task that raises even greater challenges for a comparison-based approach. 3 Experiment 2: One-shot schema learning As described already, comparison involves constructing mappings between pairs of category instances. In some settings, however, learners make confident inferences given a single instance of a category [15, 20], and it is difficult to see how comparison could play a major role when only one instance is available. Models that rely on abstraction, however, can naturally account for one-shot relational learning, and we designed a second experiment to evaluate this aspect of our approach. 7 Several previous studies have explored one-shot relational learning. Holyoak and Thagard [21] developed a study of analogical reasoning using stories as stimuli and found little evidence of oneshot schema learning. Ahn et al. [11] demonstrated, however, that one-shot learning can be achieved with complex materials such as stories, and modeled this result using explanation-based learning. Here we use much simpler stimuli and explore a probabilistic approach to one-shot learning. Materials and Method. 18 adults participated for course credit. The same individuals completed Experiments 1 and 2, and Experiment 2 was always run before Experiment 1. The same computer interface was used in both experiments, and the only important difference was that the figures in Experiment 2 could now take five values along each dimension rather than three. The experiment included two phases. During the generation phase, participants saw a 4-card group that Mr X liked and were asked to generate two 5-card groups that Mr X would probably like. During the completion phase, participants were shown four members of a 5-card group and were asked to generate the missing card. The stimuli used in each phase are shown in Figure 3. In the first two cases, slightly different stimuli were used in the generation and completion phases, and in all remaining cases the same set of four cards was used in both cases. All participants responded to the six generation questions before answering the six completion questions. Model predictions and results. The generation phase is modeled as in Experiment 1, but now the posterior distribution P (gnew |ge ) is computed after observing a single instance of a category. The human responses in Figure 3 (white bars) are consistent with the model in all cases, and confirm that a single example can provide sufficient evidence for learners to acquire a relational category. For example, the most common response in case 3a was the 5-card group shown in Figure 1—a group with all three dimensions aligned. To model the completion phase, let oe represent a partial observation of group ge . Our model infers which card is missing from ge by computing the posterior distribution P (ge |oe ) ∝ P (oe |ge ) s P (ge |s)P (s), where P (oe |ge ) captures the idea that oe is generated by randomly concealing one component of ge . The white bars in Figure 3 show model predictions, and in five out of six cases the best response according to the model is the same as the most common human response. In the remaining case (Figure 3d) the model generates a diffuse distribution over all cards with value 3 on dimension 2, and all human responses satisfy this regularity. 4 Conclusion We presented a generative model that helps to explain how relational categories are learned and used. Our approach captures relational regularities using a logical language, and helps to explain how schemata formulated in this language can be learned from observed data. Our approach differs in several respects from previous accounts of relational categorization [1, 5, 10, 22]. First, we focus on abstraction rather than comparison. Second, we consider tasks where participants must generate examples of categories [16] rather than simply classify existing examples. Finally, we provide a formal account that helps to explain how relational categories can be learned from a single instance. Our approach can be developed and extended in several ways. For simplicity, we implemented our model by working with a finite space of several million schemata, but future work can consider hypothesis spaces that assign non-zero probability to all regularities that can be formulated in the language we described. The specific logical language used here is only a starting point, and future work can aim to develop languages that provide a more faithful account of human inductive biases. Finally, we worked with a domain that provides one of the simplest ways to address core questions such as one-shot learning. Future applications of our general approach can consider domains that include more than three dimensions and a richer space of relational regularities. Relational learning and analogical reasoning are tightly linked, and hierarchical generative models provide a promising approach to both problems. We focused here on relational categorization, but future studies can explore whether probabilistic accounts of schema learning can help to explain the inductive inferences typically considered by studies of analogical reasoning. Although there are many models of analogical reasoning, there are few that pursue a principled probabilistic approach, and the hierarchical Bayesian approach may help to fill this gap in the literature. Acknowledgments We thank Maureen Satyshur for running the experiments. This work was supported in part by NSF grant CDI-0835797. 8 References [1] L. Kotovsky and D. Gentner. Comparison and categorization in the development of relational similarity. Child Development, 67:2797–2822, 1996. [2] D. Gentner and A. B. Markman. Structure mapping in analogy and similarity. American Psychologist, 52:45–56, 1997. [3] D. Gentner and J. Medina. Similarity and the development of rules. Cognition, 65:263–297, 1998. [4] B. Falkenhainer, K. D. Forbus, and D. Gentner. The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41:1–63, 1989. [5] J. E. Hummel and K. J. Holyoak. A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110:220–264, 2003. [6] M. Mitchell. Analogy-making as perception: a computer model. MIT Press, Cambridge, MA, 1993. [7] D. R. Hofstadter and the Fluid Analogies Research Group. Fluid concepts and creative analogies: computer models of the fundamental mechanisms of thought. 1995. [8] W. V. O. Quine and J. Ullian. The Web of Belief. Random House, New York, 1978. [9] J. Skorstad, D. Gentner, and D. Medin. Abstraction processes during concept learning: a structural view. In Proceedings of the 10th Annual Conference of the Cognitive Science Society, pages 419–425. 2009. [10] D. Gentner and J. Loewenstein. Relational language and relational thought. In E. Amsel and J. P. Byrnes, editors, Language, literacy and cognitive development: the development and consequences of symbolic communication, pages 87–120. 2002. [11] W. Ahn, W. F. Brewer, and R. J. Mooney. Schema acquisition from a single example. Journal of Experimental Psychology: Learning, Memory and Cognition, 18(2):391–412, 1992. [12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis. Chapman & Hall, New York, 2nd edition, 2003. [13] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008. [14] S. Kok and P. Domingos. Learning the structure of Markov logic networks. In Proceedings of the 22nd International Conference on Machine Learning, 2005. [15] J. Feldman. The structure of perceptual categories. Journal of Mathematical Psychology, 41: 145–170, 1997. [16] A. Jern and C. Kemp. Category generation. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 130–135. Cognitive Science Society, Austin, TX, 2009. [17] D. Conklin and I. H. Witten. Complexity-based induction. Machine Learning, 16(3):203–225, 1994. [18] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001. [19] C. Kemp, A. Bernstein, and J. B. Tenenbaum. A generative theory of similarity. In B. G. Bara, L. Barsalou, and M. Bucciarelli, editors, Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1132–1137. Lawrence Erlbaum Associates, 2005. [20] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Theory acquisition and the language of thought. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pages 1606–1611. Cognitive Science Society, Austin, TX, 2008. [21] K. J. Holyoak and P. Thagard. Analogical mapping by constraint satisfaction. Cognitive Science, 13(3):295–355, 1989. [22] L. A. A. Doumas, J. E. Hummel, and C. M. Sandhofer. A theory of the discovery and predication of relational concepts. Psychological Review, 115(1):1–43, 2008. [23] M. L. Gick and K. J. Holyoak. Schema induction and analogical transfer. Cognitive Psychology, 15:1–38, 1983. 9

5 0.080411032 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization

Author: Adam Sanborn, Nick Chater, Katherine A. Heller

Abstract: Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. While from a mathematical point of view, an infinite number of basis sets can be used to represent points in this space, the choice of basis set is psychologically crucial. People generally choose the same basis dimensions – and have a strong preference to generalize along the axes of these dimensions, but not “diagonally”. What makes some choices of dimension special? We explore the idea that the dimensions used by people echo the natural variation in the environment. Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. This bias is shaped by exposing the model to many categories with a structure hypothesized to be like those which children encounter. The learning behaviour of the model captures the developmental shift from roughly “isotropic” for children to the axis-aligned generalization that adults show. 1

6 0.079398483 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation

7 0.077075645 226 nips-2009-Spatial Normalized Gamma Processes

8 0.074050114 201 nips-2009-Region-based Segmentation and Object Detection

9 0.067505792 96 nips-2009-Filtering Abstract Senses From Image Search Results

10 0.066731833 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

11 0.062552735 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

12 0.061183412 3 nips-2009-AUC optimization and the two-sample problem

13 0.060937628 195 nips-2009-Probabilistic Relational PCA

14 0.060336258 133 nips-2009-Learning models of object structure

15 0.057619441 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

16 0.054483354 175 nips-2009-Occlusive Components Analysis

17 0.0535767 85 nips-2009-Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model

18 0.05341807 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

19 0.052246526 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process

20 0.051268909 43 nips-2009-Bayesian estimation of orientation preference maps


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.172), (1, -0.088), (2, -0.057), (3, -0.105), (4, 0.038), (5, -0.061), (6, 0.055), (7, -0.001), (8, -0.015), (9, -0.045), (10, 0.031), (11, -0.076), (12, 0.024), (13, -0.118), (14, -0.048), (15, -0.018), (16, 0.106), (17, 0.023), (18, -0.045), (19, 0.045), (20, -0.101), (21, 0.05), (22, -0.008), (23, 0.016), (24, -0.013), (25, 0.053), (26, -0.057), (27, -0.071), (28, 0.062), (29, -0.008), (30, 0.112), (31, -0.079), (32, 0.116), (33, -0.05), (34, 0.002), (35, -0.092), (36, -0.145), (37, 0.14), (38, 0.053), (39, -0.088), (40, -0.057), (41, 0.034), (42, -0.043), (43, -0.114), (44, -0.042), (45, 0.024), (46, -0.028), (47, -0.113), (48, -0.018), (49, -0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92091137 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov

Abstract: We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us “understand” a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.

2 0.74684244 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

Author: Kurt Miller, Michael I. Jordan, Thomas L. Griffiths

Abstract: As the availability and importance of relational data—such as the friendships summarized on a social networking website—increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting Bayesian nonparametric methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable—latent features—using a Bayesian nonparametric approach to simultaneously infer the number of features at the same time we learn which entities have each feature. Our model combines these inferred features with known covariates in order to perform link prediction. We demonstrate that the greater expressiveness of this approach allows us to improve performance on three datasets. 1

3 0.73600763 195 nips-2009-Probabilistic Relational PCA

Author: Wu-jun Li, Dit-Yan Yeung, Zhihua Zhang

Abstract: One crucial assumption made by both principal component analysis (PCA) and probabilistic PCA (PPCA) is that the instances are independent and identically distributed (i.i.d.). However, this common i.i.d. assumption is unreasonable for relational data. In this paper, by explicitly modeling covariance between instances as derived from the relational information, we propose a novel probabilistic dimensionality reduction method, called probabilistic relational PCA (PRPCA), for relational data analysis. Although the i.i.d. assumption is no longer adopted in PRPCA, the learning algorithms for PRPCA can still be devised easily like those for PPCA which makes explicit use of the i.i.d. assumption. Experiments on realworld data sets show that PRPCA can effectively utilize the relational information to dramatically outperform PCA and achieve state-of-the-art performance. 1

4 0.60630965 21 nips-2009-Abstraction and Relational learning

Author: Charles Kemp, Alan Jern

Abstract: Most models of categorization learn categories defined by characteristic features but some categories are described more naturally in terms of relations. We present a generative model that helps to explain how relational categories are learned and used. Our model learns abstract schemata that specify the relational similarities shared by instances of a category, and our emphasis on abstraction departs from previous theoretical proposals that focus instead on comparison of concrete instances. Our first experiment suggests that abstraction can help to explain some of the findings that have previously been used to support comparison-based approaches. Our second experiment focuses on one-shot schema learning, a problem that raises challenges for comparison-based approaches but is handled naturally by our abstraction-based account. Categories such as family, sonnet, above, betray, and imitate differ in many respects but all of them depend critically on relational information. Members of a family are typically related by blood or marriage, and the lines that make up a sonnet must rhyme with each other according to a certain pattern. A pair of objects will demonstrate “aboveness” only if a certain spatial relationship is present, and an event will qualify as an instance of betrayal or imitation only if its participants relate to each other in certain ways. All of the cases just described are examples of relational categories. This paper develops a computational approach that helps to explain how simple relational categories are acquired. Our approach highlights the role of abstraction in relational learning. Given several instances of a relational category, it is often possible to infer an abstract representation that captures what the instances have in common. We refer to these abstract representations as schemata, although others may prefer to call them rules or theories. For example, a sonnet schema might specify the number of lines that a sonnet should include and the rhyming pattern that the lines should follow. Once a schema has been acquired it can support several kinds of inferences. A schema can be used to make predictions about hidden aspects of the examples already observed—if the final word in a sonnet is illegible, the rhyming pattern can help to predict the identity of this word. A schema can be used to decide whether new examples (e.g. new poems) qualify as members of the category. Finally, a schema can be used to generate novel examples of a category (e.g. novel sonnets). Most researchers would agree that abstraction plays some role in relational learning, but Gentner [1] and other psychologists have emphasized the role of comparison instead [2, 3]. Given one example of a sonnet and the task of deciding whether a second poem is also a sonnet, a comparison-based approach might attempt to establish an alignment or mapping between the two. Approaches that rely on comparison or mapping are especially prominent in the literature on analogical reasoning [4, 5], and many of these approaches can be viewed as accounts of relational categorization [6]. For example, the problem of deciding whether two systems are analogous can be formalized as the problem of deciding whether these systems are instances of the same relational category. Despite some notable exceptions [6, 7], most accounts of analogy focus on comparison rather than abstraction, and suggest that “analogy passes from one instance of a generalization to another without pausing for explicit induction of the generalization” (p 95) [8]. 1 Schema s 0∀Q ∀x ∀y Q(x) < Q(y) ↔ D1 (x) < D1 (y) Group g Observation o Figure 1: A hierarchical generative model for learning and using relational categories. The schema s at the top level is a logical sentence that specifies which groups are valid instances of the category. The group g at the second level is randomly sampled from the set of valid instances, and the observation o is a partially observed version of group g. Researchers that focus on comparison sometimes discuss abstraction, but typically suggest that abstractions emerge as a consequence of comparing two or more concrete instances of a category [3, 5, 9, 10]. This view, however, will not account for one-shot inferences, or inferences based on a single instance of a relational category. Consider a learner who is shown one instance of a sonnet then asked to create a second instance. Since only one instance is provided, it is hard to see how comparisons between instances could account for success on the task. A single instance, however, will sometimes provide enough information for a schema to be learned, and this schema should allow subsequent instances to be generated [11]. Here we develop a formal framework for exploring relational learning in general and one-shot schema learning in particular. Our framework relies on the hierarchical Bayesian approach, which provides a natural way to combine abstraction and probabilistic inference [12]. The hierarchical Bayesian approach supports representations at multiple levels of abstraction, and helps to explains how abstract representations (e.g. a sonnet schema) can be acquired given observations of concrete instances (e.g. individual sonnets). The schemata we consider are represented as sentences in a logical language, and our approach therefore builds on previous probabilistic methods for learning and using logical theories [13, 14]. Following previous authors, we propose that logical representations can help to capture the content of human knowledge, and that Bayesian inference helps to explain how these representations are acquired and how they support inductive inference. The following sections introduce our framework then evaluate it using two behavioral experiments. Our first experiment uses a standard classification task where participants are shown one example of a category then asked to decide which of two alternatives is more likely to belong to the same category. Tasks of this kind have previously been used to argue for the importance of comparison, but we suggest that these tasks can be handled by accounts that focus on abstraction. Our second experiment uses a less standard generation task [15, 16] where participants are shown a single example of a category then asked to generate additional examples. As predicted by our abstraction-based account, we find that people are able to learn relational categories on the basis of a single example. 1 A generative approach to relational learning Our examples so far have used real-world relational categories such as family and sonnet but we now turn to a very simple domain where relational categorization can be studied. Each element in the domain is a group of components that vary along a number of dimensions—in Figure 1, the components are figures that vary along the dimensions of size, color, and circle position. The groups can be organized into categories—one such category includes groups where every component is black. Although our domain is rather basic it allows some simple relational regularities to be explored. We can consider categories, for example, where all components in a group must be the same along some dimension, and categories where all components must be different along some dimension. We can also consider categories defined by relationships between dimensions—for example, the category that includes all groups where the size and color dimensions are correlated. Each category is associated with a schema, or an abstract representation that specifies which groups are valid instances of the category. Here we consider schemata that correspond to rules formulated 2 1 2 3 4 5 6 7  ff ˘ ¯ ∀x D (x) =, =, <, > vk ∃xff  i  ff ˘ ¯ ∀x ∀y x = y → D (x) =, =, <, > Di (y) ∃x ∃y x = y ∧ 8 i9 ˘ ¯ <∧= ˘ ¯ ∀x Di (x) =, = vk ∨ Dj (x) =, = vl : ; ↔ 8 9 0 1 <∧= ˘ ¯ ˘ ¯ ∀x∀y x = y → @Di (x) =, =, <, > Di (y) ∨ Dj (x) =, =, <, > Dj (y)A : ; ↔  ff ff ff ˘ ¯ ∀Q ∀x ∀y x = y → Q(x) =, =, <, > Q(y) ∃Q ∃x ∃y x = y ∧ 8 9 0 1  ff <∧= ˘ ¯ ˘ ¯ ∀Q Q = Di → ∀x∀y x = y → @Q(x) =, =, <, > Q(y) ∨ Di (x) =, =, <, > Di (y)A ∃Q Q = Di ∧ : ; ↔ 8 9 0 1  ff ff <∧= ˘ ¯ ˘ ¯ ∀Q ∀R Q = R → ∀x∀y x = y → @Q(x) =, =, <, > Q(y) ∨ R(x) =, =, <, > R(y)A ∃Q ∃R Q = R ∧ : ; ↔ Table 1: Templates used to construct a hypothesis space of logical schemata. An instance of a given template can be created by choosing an element from each set enclosed in braces (some sets are laid out horizontally to save space), replacing each occurrence of Di or Dj with a dimension (e.g. D1 ) and replacing each occurrence of vk or vl with a value (e.g. 1). in a logical language. The language includes three binary connectives—and (∧), or (∨), and if and only if (↔). Four binary relations (=, =, <, and >) are available for comparing values along dimensions. Universal quantification (∀x) and existential quantification (∃x) are both permitted, and the language includes quantification over objects (∀x) and dimensions (∀Q). For example, the schema in Figure 1 states that all dimensions are aligned. More precisely, if D1 is the dimension of size, the schema states that for all dimensions Q, a component x is smaller than a component y along dimension Q if and only if x is smaller in size than y. It follows that all three dimensions must increase or decrease together. To explain how rules in this logical language are learned we work with the hierarchical generative model in Figure 1. The representation at the top level is a schema s, and we assume that one or more groups g are generated from a distribution P (g|s). Following a standard approach to category learning [17, 18], we assume that g is uniformly sampled from all groups consistent with s: p(g|s) ∝ 1 g is consistent with s 0 otherwise (1) For all applications in this paper, we assume that the number of components in a group is known and fixed in advance. The bottom level of the hierarchy specifies observations o that are generated from a distribution P (o|g). In most cases we assume that g can be directly observed, and that P (o|g) = 1 if o = g and 0 otherwise. We also consider the setting shown in Figure 1 where o is generated by concealing a component of g chosen uniformly at random. Note that the observation o in Figure 1 includes only four of the components in group g, and is roughly analogous to our earlier example of a sonnet with an illegible final word. To convert Figure 1 into a fully-specified probabilistic model it remains to define a prior distribution P (s) over schemata. An appealing approach is to consider all of the infinitely many sentences in the logical language already mentioned, and to define a prior favoring schemata which correspond to simple (i.e. short) sentences. We approximate this approach by considering a large but finite space of sentences that includes all instances of the templates in Table 1 and all conjunctions of these instances. When instantiating one of these templates, each occurrence of Di or Dj should be replaced by one of the dimensions in the domain. For example, the schema in Figure 1 is a simplified instance of template 6 where Di is replaced by D1 . Similarly, each instance of vk or vl should be replaced by a value along one of the dimensions. Our first experiment considers a problem where there are are three dimensions and three possible values along each dimension (i.e. vk = 1, 2, or 3). As a result there are 1568 distinct instances of the templates in Table 1 and roughly one million 3 conjunctions of these instances. Our second experiment uses three dimensions with five values along each dimension, which leads to 2768 template instances and roughly three million conjunctions of these instances. The templates in Table 1 capture most of the simple regularities that can be formulated in our logical language. Template 1 generates all rules that include quantification over a single object variable and no binary connectives. Template 3 is similar but includes a single binary connective. Templates 2 and 4 are similar to 1 and 3 respectively, but include two object variables (x and y) rather than one. Templates 5, 6 and 7 add quantification over dimensions to Templates 2 and 4. Although the templates in Table 1 capture a large class of regularities, several kinds of templates are not included. Since we do not assume that the dimensions are commensurable, values along different dimensions cannot be directly compared (∃x D1 (x) = D2 (x) is not permitted. For the same reason, comparisons to a dimension value must involve a concrete dimension (∀x D1 (x) = 1 is permitted) rather than a dimension variable (∀Q ∀x Q(x) = 1 is not permitted). Finally, we exclude all schemata where quantification over objects precedes quantification over dimensions, and as a result there are some simple schemata that our implementation cannot learn (e.g. ∃x∀y∃Q Q(x) = Q(y)). The extension of each schema is a set of groups, and schemata with the same extension can be assigned to the same equivalence class. For example, ∀x D1 (x) = v1 (an instance of template 1) and ∀x D1 (x) = v1 ∧ D1 (x) = v1 (an instance of template 3) end up in the same equivalence class. Each equivalence class can be represented by the shortest sentence that it contains, and we define our prior P (s) over a set that includes a single representative for each equivalence class. The prior probability P (s) of each sentence is inversely proportional to its length: P (s) ∝ λ|s| , where |s| is the length of schema s and λ is a constant between 0 and 1. For all applications in this paper we set λ = 0.8. The generative model in Figure 1 can be used for several purposes, including schema learning (inferring a schema s given one or more instances generated from the schema), classification (deciding whether group gnew belongs to a category given one or more instances of the category) and generation (generating a group gnew that belongs to the same category as one or more instances). Our first experiment explores all three of these problems. 2 Experiment 1: Relational classification Our first experiment is organized around a triad task where participants are shown one example of a category then asked to decide which of two choice examples is more likely to belong to the category. Triad tasks are regularly used by studies of relational categorization, and have been used to argue for the importance of comparison [1]. A comparison-based approach to this task, for instance, might compare the example object to each of the choice objects in order to decide which is the better match. Our first experiment is intended in part to explore whether a schema-learning approach can also account for inferences about triad tasks. Materials and Method. 18 adults participated for course credit and interacted with a custom-built computer interface. The stimuli were groups of figures that varied along three dimensions (color, size, and ball position, as in Figure 1). Each shape was displayed on a single card, and all groups in Experiment 1 included exactly three cards. The cards in Figure 1 show five different values along each dimension, but Experiment 1 used only three values along each dimension. The experiment included inferences about 10 triads. Participants were told that aliens from a certain planet “enjoy organizing cards into groups,” and that “any group of cards will probably be liked by some aliens and disliked by others.” The ten triad tasks were framed as questions about the preferences of 10 aliens. Participants were shown a group that Mr X likes (different names were used for the ten triads), then shown two choice groups and told that “Mr X likes one of these groups but not the other.” Participants were asked to select one of the choice groups, then asked to generate another 3-card group that Mr X would probably like. Cards could be added to the screen using an “Add Card” button, and there were three pairs of buttons that allowed each card to be increased or decreased along the three dimensions. Finally, participants were asked to explain in writing “what kind of groups Mr X likes.” The ten triads used are shown in Figure 2. Each group is represented as a 3 by 3 matrix where rows represent cards and columns show values along the three dimensions. Triad 1, for example, 4 (a) D1 value always 3 321 332 313 1 0.5 1 231 323 333 1 4 0.5 4 311 122 333 311 113 313 8 12 16 20 24 211 222 233 211 232 223 1 4 0.5 4 211 312 113 8 12 16 20 24 1 1 4 8 12 16 20 24 312 312 312 313 312 312 1 8 12 16 20 24 211 232 123 4 8 12 16 20 24 1 0.5 231 322 213 112 212 312 4 8 12 16 20 24 4 8 12 16 20 24 0.5 1 0.5 0.5 8 12 16 20 24 0.5 4 8 12 16 20 24 0.5 1 1 4 4 (j) Some dimension has no repeats 0.5 1 311 232 123 231 132 333 1 0.5 8 12 16 20 24 0.5 111 312 213 231 222 213 (i) All dimensions have no repeats 331 122 213 4 1 0.5 8 12 16 20 24 0.5 4 8 12 16 20 24 (h) Some dimension uniform 1 4 4 0.5 1 311 212 113 0.5 1 321 122 223 0.5 8 12 16 20 24 0.5 4 0.5 331 322 313 1 0.5 8 12 16 20 24 (f) Two dimensions anti-aligned (g) All dimensions uniform 133 133 133 4 0.5 1 321 222 123 0.5 1 8 12 16 20 24 1 0.5 8 12 16 20 24 1 0.5 111 212 313 331 212 133 1 (e) Two dimensions aligned 311 322 333 311 113 323 4 (d) D1 and D3 anti-aligned 0.5 1 0.5 1 1 0.5 1 0.5 8 12 16 20 24 (c) D2 and D3 aligned 1 132 332 233 1 0.5 331 323 333 (b) D2 uniform 1 311 321 331 8 12 16 20 24 311 331 331 4 8 12 16 20 24 4 8 12 16 20 24 0.5 Figure 2: Human responses and model predictions for the ten triads in Experiment 1. The plot at the left of each panel shows model predictions (white bars) and human preferences (black bars) for the two choice groups in each triad. The plots at the right of each panel summarize the groups created during the generation phase. The 23 elements along the x-axis correspond to the regularities listed in Table 2. 5 1 2 3 4 5 6 7 8 9 10 11 12 All dimensions aligned Two dimensions aligned D1 and D2 aligned D1 and D3 aligned D2 and D3 aligned All dimensions aligned or anti-aligned Two dimensions anti-aligned D1 and D2 anti-aligned D1 and D3 anti-aligned D2 and D3 anti-aligned All dimensions have no repeats Two dimensions have no repeats 13 14 15 16 17 18 19 20 21 22 23 One dimension has no repeats D1 has no repeats D2 has no repeats D3 has no repeats All dimensions uniform Two dimensions uniform One dimension uniform D1 uniform D2 uniform D3 uniform D1 value is always 3 Table 2: Regularities used to code responses to the generation tasks in Experiments 1 and 2 has an example group including three cards that each take value 3 along D1 . The first choice group is consistent with this regularity but the second choice group is not. The cards in each group were arrayed vertically on screen, and were initially sorted as shown in Figure 2 (i.e. first by D3 , then by D2 and then by D1 ). The cards could be dragged around on screen, and participants were invited to move them around in order to help them understand each group. The mapping between the three dimensions in each matrix and the three dimensions in the experiment (color, position, and size) was randomized across participants, and the order in which triads were presented was also randomized. Model predictions and results. Let ge be the example group presented in the triad task and g1 and g2 be the two choice groups. We use our model to compute the relative probability of two hypotheses: h1 which states that ge and g1 are generated from the same schema and that g2 is sampled randomly from all possible groups, and h2 which states that ge and g2 are generated from the same schema. We set P (h1 ) = P (h2 ) = 0.5, and compute posterior probabilities P (h1 |ge , g1 , g2 ) and P (h2 |ge , g1 , g2 ) by integrating over all schemata in the hypothesis space already described. Our model assumes that two groups are considered similar to the extent that they appear to have been generated by the same underlying schema, and is consistent with the generative approach to similarity described by Kemp et al. [19]. Model predictions for the ten triads are shown in Figure 2. In each case, the choice probabilities plotted (white bars) are the posterior probabilities of hypotheses h1 and h2 . In nine out of ten cases the best choice according to the model is the most common human response. Responses to triads 2c and 2d support the idea that people are sensitive to relationships between dimensions (i.e. alignment and anti-alignment). Triads 2e and 2f are similar to triads studied by Kotovsky and Gentner [1], and we replicate their finding that people are sensitive to relationships between dimensions even when the dimensions involved vary from group to group. The one case where human responses diverge from model predictions is shown in Figure 2h. Note that the schema for this triad involves existential quantification over dimensions (some dimension is uniform), and according to our prior P (s) this kind of quantification is no more complex than other kinds of quantification. Future applications of our approach can explore the idea that existential quantification over dimensions (∃Q) is psychologically more complex than universal quantification over dimensions (∀Q) or existential quantification over cards (∃x), and can consider logical languages that incorporate this inductive bias. To model the generation phase of the experiment we computed the posterior distribution P (gnew |ge , g1 , g2 ) = P (gnew |s)P (s|h, ge , g1 , g2 )P (h|ge , g1 , g2 ) s,h where P (h|ge , g1 , g2 ) is the distribution used to model selections in the triad task. Since the space of possible groups is large, we visualize this distribution using a profile that shows the posterior probability assigned to groups consistent with the 23 regularities shown in Table 2. The white bar plots in Figure 2 show profiles predicted by the model, and the black plots immediately above show profiles computed over the groups generated by our 18 participants. In many of the 10 cases the model accurately predicts regularities in the groups generated by people. In case 2c, for example, the model correctly predicts that generated groups will tend to have no repeats along dimensions D2 and D3 (regularities 15 and 16) and that these two dimensions will be aligned (regularities 2 and 5). There are, however, some departures from the model’s predictions, and a notable example occurs in case 2d. Here the model detects the regularity that dimensions D1 and D3 are anti-aligned (regularity 9). Some groups generated by participants are consistent with 6 (a) All dimensions aligned 1 0.5 1 8 12 16 20 24 (c) D1 has no repeats, D2 and D3 uniform 1 8 12 16 20 24 0.5 1 8 12 16 20 24 354 312 1 8 12 16 20 24 4 8 12 16 20 24 4 8 12 16 20 24 0.5 423 414 214 315 0.5 314 0.5 0.5 4 8 12 16 20 24 1 251 532 314 145 0.5 4 8 12 16 20 24 (f) All dimensions have no repeats 1 1 335 8 12 16 20 24 (e) All dimensions uniform 1 4 0.5 432 514 324 224 424 0.5 314 314 314 314 8 12 16 20 24 4 1 0.5 4 4 0.5 314 0.5 4 8 12 16 20 24 1 431 433 135 335 0.5 1 4 (d) D2 uniform 1 433 1 322 8 12 16 20 24 0.5 0.5 344 333 223 555 222 4 1 1 0.5 0.5 124 224 324 524 311 322 333 354 324 1 0.5 4 311 322 333 355 134 121 232 443 555 443 1 111 333 444 555 (b) D2 and D3 aligned Figure 3: Human responses and model predictions for the six cases in Experiment 2. In (a) and (b), the 4 cards used for the completion and generation phases are shown on either side of the dashed line (completion cards on the left). In the remaining cases, the same 4 cards were used for both phases. The plots at the right of each panel show model predictions (white bars) and human responses (black bars) for the generation task. In each case, the 23 elements along each x-axis correspond to the regularities listed in Table 2. The remaining plots show responses to the completion task. There are 125 possible responses, and the four responses shown always include the top two human responses and the top two model predictions. this regularity, but people also regularly generate groups where two dimensions are aligned rather than anti-aligned (regularity 2). This result may indicate that some participants are sensitive to relationships between dimensions but do not consider the difference between a positive relationship (alignment) and an inverse relationship (anti-alignment) especially important. Kotovsky and Gentner [1] suggest that comparison can explain how people respond to triad tasks, although they do not provide a computational model that can be compared with our approach. It is less clear how comparison might account for our generation data, and our next experiment considers a one-shot generation task that raises even greater challenges for a comparison-based approach. 3 Experiment 2: One-shot schema learning As described already, comparison involves constructing mappings between pairs of category instances. In some settings, however, learners make confident inferences given a single instance of a category [15, 20], and it is difficult to see how comparison could play a major role when only one instance is available. Models that rely on abstraction, however, can naturally account for one-shot relational learning, and we designed a second experiment to evaluate this aspect of our approach. 7 Several previous studies have explored one-shot relational learning. Holyoak and Thagard [21] developed a study of analogical reasoning using stories as stimuli and found little evidence of oneshot schema learning. Ahn et al. [11] demonstrated, however, that one-shot learning can be achieved with complex materials such as stories, and modeled this result using explanation-based learning. Here we use much simpler stimuli and explore a probabilistic approach to one-shot learning. Materials and Method. 18 adults participated for course credit. The same individuals completed Experiments 1 and 2, and Experiment 2 was always run before Experiment 1. The same computer interface was used in both experiments, and the only important difference was that the figures in Experiment 2 could now take five values along each dimension rather than three. The experiment included two phases. During the generation phase, participants saw a 4-card group that Mr X liked and were asked to generate two 5-card groups that Mr X would probably like. During the completion phase, participants were shown four members of a 5-card group and were asked to generate the missing card. The stimuli used in each phase are shown in Figure 3. In the first two cases, slightly different stimuli were used in the generation and completion phases, and in all remaining cases the same set of four cards was used in both cases. All participants responded to the six generation questions before answering the six completion questions. Model predictions and results. The generation phase is modeled as in Experiment 1, but now the posterior distribution P (gnew |ge ) is computed after observing a single instance of a category. The human responses in Figure 3 (white bars) are consistent with the model in all cases, and confirm that a single example can provide sufficient evidence for learners to acquire a relational category. For example, the most common response in case 3a was the 5-card group shown in Figure 1—a group with all three dimensions aligned. To model the completion phase, let oe represent a partial observation of group ge . Our model infers which card is missing from ge by computing the posterior distribution P (ge |oe ) ∝ P (oe |ge ) s P (ge |s)P (s), where P (oe |ge ) captures the idea that oe is generated by randomly concealing one component of ge . The white bars in Figure 3 show model predictions, and in five out of six cases the best response according to the model is the same as the most common human response. In the remaining case (Figure 3d) the model generates a diffuse distribution over all cards with value 3 on dimension 2, and all human responses satisfy this regularity. 4 Conclusion We presented a generative model that helps to explain how relational categories are learned and used. Our approach captures relational regularities using a logical language, and helps to explain how schemata formulated in this language can be learned from observed data. Our approach differs in several respects from previous accounts of relational categorization [1, 5, 10, 22]. First, we focus on abstraction rather than comparison. Second, we consider tasks where participants must generate examples of categories [16] rather than simply classify existing examples. Finally, we provide a formal account that helps to explain how relational categories can be learned from a single instance. Our approach can be developed and extended in several ways. For simplicity, we implemented our model by working with a finite space of several million schemata, but future work can consider hypothesis spaces that assign non-zero probability to all regularities that can be formulated in the language we described. The specific logical language used here is only a starting point, and future work can aim to develop languages that provide a more faithful account of human inductive biases. Finally, we worked with a domain that provides one of the simplest ways to address core questions such as one-shot learning. Future applications of our general approach can consider domains that include more than three dimensions and a richer space of relational regularities. Relational learning and analogical reasoning are tightly linked, and hierarchical generative models provide a promising approach to both problems. We focused here on relational categorization, but future studies can explore whether probabilistic accounts of schema learning can help to explain the inductive inferences typically considered by studies of analogical reasoning. Although there are many models of analogical reasoning, there are few that pursue a principled probabilistic approach, and the hierarchical Bayesian approach may help to fill this gap in the literature. Acknowledgments We thank Maureen Satyshur for running the experiments. This work was supported in part by NSF grant CDI-0835797. 8 References [1] L. Kotovsky and D. Gentner. Comparison and categorization in the development of relational similarity. Child Development, 67:2797–2822, 1996. [2] D. Gentner and A. B. Markman. Structure mapping in analogy and similarity. American Psychologist, 52:45–56, 1997. [3] D. Gentner and J. Medina. Similarity and the development of rules. Cognition, 65:263–297, 1998. [4] B. Falkenhainer, K. D. Forbus, and D. Gentner. The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41:1–63, 1989. [5] J. E. Hummel and K. J. Holyoak. A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110:220–264, 2003. [6] M. Mitchell. Analogy-making as perception: a computer model. MIT Press, Cambridge, MA, 1993. [7] D. R. Hofstadter and the Fluid Analogies Research Group. Fluid concepts and creative analogies: computer models of the fundamental mechanisms of thought. 1995. [8] W. V. O. Quine and J. Ullian. The Web of Belief. Random House, New York, 1978. [9] J. Skorstad, D. Gentner, and D. Medin. Abstraction processes during concept learning: a structural view. In Proceedings of the 10th Annual Conference of the Cognitive Science Society, pages 419–425. 2009. [10] D. Gentner and J. Loewenstein. Relational language and relational thought. In E. Amsel and J. P. Byrnes, editors, Language, literacy and cognitive development: the development and consequences of symbolic communication, pages 87–120. 2002. [11] W. Ahn, W. F. Brewer, and R. J. Mooney. Schema acquisition from a single example. Journal of Experimental Psychology: Learning, Memory and Cognition, 18(2):391–412, 1992. [12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis. Chapman & Hall, New York, 2nd edition, 2003. [13] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008. [14] S. Kok and P. Domingos. Learning the structure of Markov logic networks. In Proceedings of the 22nd International Conference on Machine Learning, 2005. [15] J. Feldman. The structure of perceptual categories. Journal of Mathematical Psychology, 41: 145–170, 1997. [16] A. Jern and C. Kemp. Category generation. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 130–135. Cognitive Science Society, Austin, TX, 2009. [17] D. Conklin and I. H. Witten. Complexity-based induction. Machine Learning, 16(3):203–225, 1994. [18] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001. [19] C. Kemp, A. Bernstein, and J. B. Tenenbaum. A generative theory of similarity. In B. G. Bara, L. Barsalou, and M. Bucciarelli, editors, Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1132–1137. Lawrence Erlbaum Associates, 2005. [20] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Theory acquisition and the language of thought. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pages 1606–1611. Cognitive Science Society, Austin, TX, 2008. [21] K. J. Holyoak and P. Thagard. Analogical mapping by constraint satisfaction. Cognitive Science, 13(3):295–355, 1989. [22] L. A. A. Doumas, J. E. Hummel, and C. M. Sandhofer. A theory of the discovery and predication of relational concepts. Psychological Review, 115(1):1–43, 2008. [23] M. L. Gick and K. J. Holyoak. Schema induction and analogical transfer. Cognitive Psychology, 15:1–38, 1983. 9

5 0.52787012 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

Author: Francois Caron, Arnaud Doucet

Abstract: Over recent years Dirichlet processes and the associated Chinese restaurant process (CRP) have found many applications in clustering while the Indian buffet process (IBP) is increasingly used to describe latent feature models. These models are attractive because they ensure exchangeability (over samples). We propose here extensions of these models where the dependency between samples is given by a known decomposable graph. These models have appealing properties and can be easily learned using Monte Carlo techniques. 1 Motivation The CRP and IBP have found numerous applications in machine learning over recent years [5, 10]. We consider here the case where the data we are interested in are ‘locally’ dependent; these dependencies being represented by a known graph G where each data point/object is associated to a vertex. These local dependencies can correspond to any conceptual or real (e.g. space, time) metric. For example, in the context of clustering, we might want to propose a prior distribution on partitions enforcing that data which are ‘close’ in the graph are more likely to be in the same cluster. Similarly, in the context of latent feature models, we might be interested in a prior distribution on features enforcing that data which are ‘close’ in the graph are more likely to possess similar features. The ‘standard’ CRP and IBP correspond to the case where the graph G is complete; that is it is fully connected. In this paper, we generalize the CRP and IBP to decomposable graphs. The resulting generalized versions of the CRP and IBP enjoy attractive properties. Each clique of the graph follows marginally a CRP or an IBP process and explicit expressions for the joint prior distribution on the graph is available. It makes it easy to learn those models using straightforward generalizations of Markov chain Monte Carlo (MCMC) or Sequential Monte Carlo (SMC) algorithms proposed to perform inference for the CRP and IBP [5, 10, 14]. The rest of the paper is organized as follows. In Section 2, we review the popular Dirichlet multinomial allocation model and the Dirichlet Process (DP) partition distribution. We propose an extension of these two models to decomposable graphical models. In Section 3 we discuss nonparametric latent feature models, reviewing briefly the construction in [5] and extending it to decomposable graphs. We demonstrate these models in Section 4 on two applications: an alternative to the hierarchical DP model [12] and a time-varying matrix factorization problem. 2 Prior distributions for partitions on decomposable graphs Assume we have n observations. When performing clustering, we associate to each of this observation an allocation variable zi ∈ [K] = {1, . . . , K}. Let Πn be the partition of [n] = {1, . . . , n} defined by the equivalence relation i ↔ j ⇔ zi = zj . The resulting partition Πn = {A1 , . . . , An(Πn ) } 1 is an unordered collection of disjoint non-empty subsets Aj of [n], j = 1, . . . , n(Πn ), where ∪j Aj = [n] and n(Πn ) is the number of subsets for partition Πn . We also denote by Pn be the set of all partitions of [n] and let nj , j = 1, . . . , n(Πn ), be the size of the subset Aj . Each allocation variable zi is associated to a vertex/site of an undirected graph G, which is assumed to be known. In the standard case where the graph G is complete, we first review briefly here two popular prior distributions on z1:n , equivalently on Πn . We then extend these models to undirected decomposable graphs; see [2, 8] for an introduction to decomposable graphs. Finally we briefly discuss the directed case. Note that the models proposed here are completely different from the hyper multinomial-Dirichlet in [2] and its recent DP extension [6]. 2.1 Dirichlet multinomial allocation model and DP partition distribution Assume for the time being that K is finite. When the graph is complete, a popular choice for the allocation variables is to consider a Dirichlet multinomial allocation model [11] θ θ , . . . , ), zi |π ∼ π (1) K K where D is the standard Dirichlet distribution and θ > 0. Integrating out π, we obtain the following Dirichlet multinomial prior distribution π ∼ D( Pr(z1:n ) = K j=1 Γ(θ) Γ(nj + θ K) (2) θ Γ(θ + n)Γ( K )K and then, using the straightforward equality Pr(Πn ) = PK where PK = {Πn ∈ Pn |n(Πn ) ≤ K}, we obtain K! (K−n(Πn ))! Pr(z1:n ) valid for for all Πn ∈ n(Π ) Pr(Πn ) = θ Γ(θ) j=1n Γ(nj + K ) K! . θ (K − n(Πn ))! Γ(θ + n)Γ( K )n(Πn ) (3) DP may be seen as a generalization of the Dirichlet multinomial model when the number of components K → ∞; see for example [10]. In this case the distribution over the partition Πn of [n] is given by [11] n(Π ) θn(Πn ) j=1n Γ(nj ) . (4) Pr(Πn ) = n i=1 (θ + i − 1) Let Π−k = {A1,−k , . . . , An(Π−k ),−k } be the partition induced by removing item k to Πn and nj,−k be the size of cluster j for j = 1, . . . , n(Π−k ). It follows from (4) that an item k is assigned to an existing cluster j, j = 1, . . . , n(Π−k ), with probability proportional to nj,−k / (n − 1 + θ) and forms a new cluster with probability θ/ (n − 1 + θ). This property is the basis of the CRP. We now extend the Dirichlet multinomial allocation and the DP partition distribution models to decomposable graphs. 2.2 Markov combination of Dirichlet multinomial and DP partition distributions Let G be a decomposable undirected graph, C = {C1 , . . . , Cp } a perfect ordering of the cliques and S = {S2 , . . . , Cp } the associated separators. It can be easily checked that if the marginal distribution of zC for each clique C ∈ C is defined by (2) then these distributions are consistent as they yield the same distribution (2) over the separators. Therefore, the unique Markov distribution over G with Dirichlet multinomial distribution over the cliques is defined by [8] Pr(zC ) S∈S Pr(zS ) C∈C Pr(z1:n ) = (5) where for each complete set B ⊆ G, we have Pr(zB ) given by (2). It follows that we have for any Πn ∈ PK Γ(θ) K! Pr(Πn ) = (K − n(Πn ))! C∈C Γ(θ) S∈S 2 K j=1 θ Γ(nj,C + K ) θ Γ(θ+nC )Γ( K )K K j=1 θ Γ(nj,S + K ) θ Γ(θ+nS )Γ( K )K (6) where for each complete set B ⊆ G, nj,B is the number of items associated to cluster j, j = 1, . . . , K in B and nB is the total number of items in B. Within each complete set B, the allocation variables define a partition distributed according to the Dirichlet-multinomial distribution. We now extend this approach to DP partition distributions; that is we derive a joint distribution over Πn such that the distribution of ΠB over each complete set B of the graph is given by (4) with θ > 0. Such a distribution satisfies the consistency condition over the separators as the restriction of any partition distributed according to (4) still follows (4) [7]. G Proposition. Let Pn be the set of partitions Πn ∈ Pn such that for each decomposition A, B, and any (i, j) ∈ A × B, i ↔ j ⇒ ∃k ∈ A ∩ B such that k ↔ i ↔ j. As K → ∞, the prior distribution G over partitions (6) is given for each Πn ∈ Pn by Pr(Πn ) = θn(Πn ) n(ΠC ) Γ(nj,C ) j=1 nC i=1 (θ+i−1) n(ΠS ) Γ(nj,S ) j=1 nS (θ+i−1) i=1 C∈C S∈S (7) where n(ΠB ) is the number of clusters in the complete set B. Proof. From (6), we have θ n(ΠC ) K(K − 1) . . . (K − n(Πn ) + 1) Pr(Πn ) = K C∈C n(ΠC )− S∈S n(ΠS ) C∈C θ n(ΠS ) S∈S n(ΠC ) θ Γ(nj,C + K ) j=1 nC (θ+i−1) i=1 n(ΠS ) θ Γ(nj,S + K ) j=1 nS (θ+i−1) i=1 Thus when K → ∞, we obtain (7) if n(Πn ) = C∈C n(ΠC ) − S∈S n(ΠS ) and 0 otherwise. We have n(Πn ) ≤ C∈C n(ΠC ) − S∈S n(ΠS ) for any Πn ∈ Pn and the subset of Pn verifying G n(Πn ) = C∈C n(ΠC ) − S∈S n(ΠS ) corresponds to the set Pn . Example. Let the notation i ∼ j (resp. i j) indicates an edge (resp. no edge) between two sites. Let n = 3 and G be the decomposable graph defined by the relations 1 ∼ 2, 2 ∼ 3 and 1 3. G The set P3 is then equal to {{{1, 2, 3}}; {{1, 2}, {3}}; {{1}, {2, 3}}; {{1}, {2}, {3}}}. Note that G the partition {{1, 3}, {2}} does not belong to P3 . Indeed, as there is no edge between 1 and 3, they cannot be in the same cluster if 2 is in another cluster. The cliques are C1 = {1, 2} and C2 = {2, 3} Pr(ΠC1 ) Pr(ΠC2 ) hence we can and the separator is S2 = {2}. The distribution is given by Pr(Π3 ) = Pr(ΠS ) 2 check that we obtain Pr({1, 2, 3}) = (θ + 1)−2 , Pr({1, 2}, {3}) = Pr({1, 2}, {3}) = θ(θ + 1)−2 and Pr({1}, {2}, {3}) = θ2 (θ + 1)−2 . Let now define the full conditional distributions. Based on (7) the conditional assignment of an item k is proportional to the conditional over the cliques divided by the conditional over the separators. G Let denote G−k the undirected graph obtained by removing vertex k from G. Suppose that Πn ∈ Pn . G−k If Π−k ∈ Pn−1 , then do not change the value of item k. Otherwise, item k is assigned to cluster j / where j = 1, . . . , n(Π−k ) with probability proportional to {C∈C|n−k,j,C >0} n−k,j,C {S∈S|n−k,j,S >0} n−k,j,S (8) and to a new cluster with probability proportional to θ, where n−k,j,C is the number of items in the set C \ {k} belonging to cluster j. The updating process is illustrated by the Chinese wedding party process1 in Fig. 1. The results of this section can be extended to the Pitman-Yor process, and more generally to species sampling models. Example (continuing). Given Π−2 = {A1 = {1}, A2 = {3}}, we have −1 Pr( item 2 assigned to A1 = {1}| Π−2 ) = Pr( item 2 assigned to A2 = {3}| Π−2 ) = (θ + 2) −1 and Pr( item 2 assigned to new cluster A3 | Π−2 ) = θ (θ + 2) . Given Π−2 = {A1 = {1, 3}}, item 2 is assigned to A1 with probability 1. 1 Note that this representation describes the full conditionals while the CRP represents the sequential updat- ing. 3 (a) (b) (d) (c) (e) Figure 1: Chinese wedding party. Consider a group of n guests attending a wedding party. Each of the n guests may belong to one or several cliques, i.e. maximal groups of people such that everybody knows everybody. The belonging of each guest to the different cliques is represented by color patches on the figures, and the graphical representation of the relationship between the guests is represented by the graphical model (e). (a) Suppose that the guests are already seated such that two guests cannot be together at the same table is they are not part of the same clique, or if there does not exist a group of other guests such that they are related (“Any friend of yours is a friend of mine”). (b) The guest number k leaves his table and either (c) joins a table where there are guests from the same clique as him, with probability proportional to the product of the number of guests from each clique over the product of the number of guests belonging to several cliques on that table or (d) he joins a new table with probability proportional to θ. 2.3 Monte Carlo inference 2.3.1 MCMC algorithm Using the full conditionals, a single site Gibbs sampler can easily be designed to approximate the posterior distribution Pr(Πn |z1:n ). Given a partition Πn , an item k is taken out of the partition. If G−k Π−k ∈ Pn−1 , item k keeps the same value. Otherwise, the item will be assigned to a cluster j, / j = 1, . . . , n(Π−k ), with probability proportional to p(z{k}∪Aj,−k ) × p(zAj,−k ) {C∈C|n−k,j,C >0} n−k,j,C {S∈S|n−k,j,S >0} n−k,j,S (9) and the item will be assigned to a new cluster with probability proportional to p(z{k} ) × θ. Similarly to [3], we can also define a procedure to sample from p(θ|n(Πn ) = k)). We assume that θ ∼ G(a, b) and use p auxiliary variables x1 , . . . , xp . The procedure is as follows. • For j = 1, . . . , p, sample xj |k, θ ∼ Beta(θ + nSj , nCj − nSj ) • Sample θ|k, x1:p ∼ G(a + k, b − j log xj ) 2.3.2 Sequential Monte Carlo We have so far only treated the case of an undirected decomposable graph G. We can formulate a sequential updating rule for the corresponding perfect directed version D of G. Indeed, let (a1 , . . . a|V | ) be a perfect ordering and pa(ak ) be the set of parents of ak which is by definition complete. Let Πk−1 = {A1,k−1 , . . . , An(Πk−1 ),k−1 } denote the partition of the first k−1 vertices a1:k−1 and let nj,pa(ak ) be the number of elements with value j in the set pa(ak ), j = 1, . . . , n(Πk−1 ). Then the vertex ak joins the set j with probability nj,pa(ak ) / θ + cluster with probability θ/ θ + q q nq,pa(ak ) and creates a new nq,pa(ak ) . One can then design a particle filter/SMC method in a similar fashion as [4]. Consider a set of (i) (i) (i) (i) N N particles Πk−1 with weights wk−1 ∝ Pr(Πk−1 , z1:k−1 ) ( i=1 wk−1 = 1) that approximate (i) the posterior distribution Pr(Πk−1 |z1:k−1 ). For each particle i, there are n(Πk−1 ) + 1 possible 4 (i,j) allocations for component ak . We denote Πk the partition obtained by associating component ak (i,j) to cluster j. The weight associated to Πk is given by  nj,pa(ak ) (i)  if j = 1, . . . , n(Πk−1 ) θ+ q nq,pa(ak ) (i,j) (i) p(z{ak }∪Aj,k−1 ) wk−1 = wk−1 × (10) (i) θ  θ+ n p(zAj,k−1 ) if j = n(Πk−1 ) + 1 q q,pa(ak ) (i,j) Then we can perform a deterministic resampling step by keeping the N particles Πk with highest (i,j) (i) (i) weights wk−1 . Let Πk be the resampled particles and wk the associated normalized weights. 3 Prior distributions for infinite binary matrices on decomposable graphs Assume we have n objects; each of these objects being associated to the vertex of a graph G. To K each object is associated a K-dimensional binary vector zn = (zn,1 , . . . , zn,K ) ∈ {0, 1} where zn,i = 1 if object n possesses feature i and zn,i = 0 otherwise. These vectors zt form a binary n × K matrix denoted Z1:n . We denote by ξ1:n the associated equivalence class of left-ordered matrices and let EK be the set of left-ordered matrices with at most K features. In the standard case where the graph G is complete, we review briefly here two popular prior distributions on Z1:n , equivalently on ξ1:n : the Beta-Bernoulli model and the IBP [5]. We then extend these models to undirected decomposable graphs. This can be used for example to define a time-varying IBP as illustrated in Section 4. 3.1 Beta-Bernoulli and IBP distributions The Beta-Bernoulli distribution over the allocation Z1:n is K Pr(Z1:n ) = α + K )Γ(n − nj + 1) α Γ(n + 1 + K ) α K Γ(nj j=1 (11) where nj is the number of objects having feature j. It follows that Pr(ξ1:n ) = K K! 2n −1 h=0 α K Γ(nj α + K )Γ(n − nj + 1) α Γ(n + 1 + K ) Kh ! j=1 (12) where Kh is the number of features possessing the history h (see [5] for details). The nonparametric model is obtained by taking the limit when K → ∞ Pr(ξ1:n ) = αK K+ + 2n −1 h=1 Kh ! exp(−αHn ) where K + is the total number of features and Hn = 3.2 (n − nj )!(nj − 1)! n! j=1 n 1 k=1 k . (13) The IBP follows from (13). Markov combination of Beta-Bernoulli and IBP distributions Let G be a decomposable undirected graph, C = {C1 , . . . , Cp } a perfect ordering of the cliques and S = {S2 , . . . , Cp } the associated separators. As in the Dirichlet-multinomial case, it is easily seen that if for each clique C ∈ C, the marginal distribution is defined by (11), then these distributions are consistent as they yield the same distribution (11) over the separators. Therefore, the unique Markov distribution over G with Beta-Bernoulli distribution over the cliques is defined by [8] Pr(ZC ) S∈S Pr(ZS ) C∈C Pr(Z1:n ) = (14) where Pr(ZB ) given by (11) for each complete set B ⊆ G. The prior over ξ1:n is thus given, for ξ1:n ∈ EK , by Pr(ξ1:n ) = K! 2n −1 h=0 Kh ! α K α Γ(nj,C + K )Γ(nC −nj,C +1) α Γ(nC +1+ K ) α α Γ(nj,S + K )Γ(nS −nj,S +1) K K α j=1 Γ(nS +1+ K ) K j=1 C∈C S∈S 5 (15) where for each complete set B ⊆ G, nj,B is the number of items having feature j, j = 1, . . . , K in the set B and nB is the whole set of objects in set B. Taking the limit when K → ∞, we obtain after a few calculations Pr(ξ1:n ) = α + K[n] exp [−α ( C HnC − 2n −1 h=1 Kh ! HnS )] × C∈C + KC (nC −nj,C )!(nj,C −1)! j=1 nC ! S∈S S + KS (nS −nj,S )!(nj,S −1)! j=1 nS ! + + + + if K[n] = C KC − S KS and 0 otherwise, where KB is the number of different features possessed by objects in B. G Let En be the subset of En such that for each decomposition A, B and any (u, v) ∈ A × B: {u and v possess feature j} ⇒ ∃k ∈ A ∩ B such that {k possesses feature j}. Let ξ−k be the left-ordered + matrix obtained by removing object k from ξn and K−k be the total number of different features in G−k + ξ−k . For each feature j = 1, . . . , K−k , if ξ−k ∈ En−1 then we have   b C∈C nj,C if i = 1 S∈C nj,S Pr(ξk,j = i) = (16)  b C∈C (nC −nj,C ) if i = 0 (nS −nj,S ) S∈C nS where b is the appropriate normalizing constant then the customer k tries Poisson α {S∈S|k∈S} nC {C∈C|k∈C} new dishes. We can easily generalize this construction to a directed version D of G using arguments similar to those presented in Section 2; see Section 4 for an application to time-varying matrix factorization. 4 4.1 Applications Sharing clusters among relative groups: An alternative to HDP Consider that we are given d groups with nj data yi,j in each group, i = 1, . . . , nj , j = 1, . . . , d. We consider latent cluster variables zi,j that define the partition of the data. We will use alternatively the notation θi,j = Uzi,j in the following. Hierarchical Dirichlet Process [12] (HDP) is a very popular model for sharing clusters among related groups. It is based on a hierarchy of DPs G0 ∼ DP (γ, H), Gj |G0 ∼ DP (α, G0 ) j = 1, . . . d θi,j |Gj ∼ Gj , yi,j |θi,j ∼ f (θi,j ) i = 1, . . . , nj . Under conjugacy assumptions, G0 , Gj and U can be integrated out and we can approximate the marginal posterior of (zi,j ) given y = (yi,j ) with Gibbs sampling using the Chinese restaurant franchise to sample from the full conditional p(zi,j |z−{i,j} , y). Using the graph formulation defined in Section 2, we propose an alternative to HDP. Let θ0,1 , . . . , θ0,N be N auxiliary variables belonging to what we call group 0. We define each clique Cj (j = 1, . . . , d) to be composed of elements from group j and elements from group 0. This defines a decomposable graphical model whose separator is given by the elements of group 0. We can rewrite the model in a way quite similar to HDP G0 ∼ DP (α, H), θ0,i |G0 ∼ G0 i = 1, ..., N α α Gj |θ0,1 , . . . , θ0,N ∼ DP (α + N, α+N H + α+N θi,j |Gj ∼ Gj , yi,j |θi,j ∼ f (θi,j ) i = 1, . . . , nj N i=1 δθ0,i ) j = 1, . . . d, N For any subset A and j = k ∈ {1, . . . , p} we have corr(Gj (A), Gk (A)) = α+N . Again, under conjugacy conditions, we can integrate out G0 , Gj and U and approximate the marginal posterior distribution over the partition using the Chinese wedding party process defined in Section 2. Note that for latent variables zi,j , j = 1, . . . , d, associated to data, this is the usual CRP update. As in HDP, multiple layers can be added to the model. Figures 2 (a) and (b) resp. give the graphical DP alternative to HDP and 2-layer HDP. 6 z0 root z0 root corpora docs z1 z2 z1 z2 z3 z1,1 z1,2 z2,1 z2,2 z2,3 docs (a) Graphical DP alternative to HDP (b) Graphical DP alternative to 2-layer HDP Figure 2: Hierarchical Graphs of dependency with (a) one layer and (b) two layers of hierarchy. If N = 0, then Gj ∼ DP (α, H) for all j and this is equivalent to setting γ → ∞ in HDP. If N → ∞ then Gj = G0 for all j, G0 ∼ DP (α, H). This is equivalent to setting α → ∞ in the HDP. One interesting feature of the model is that, contrary to HDP, the marginal distribution of Gj at any layer of the tree is DP (α, H). As a consequence, the total number of clusters scales logarithmically (as in the usual DP) with the size of each group, whereas it scales doubly logarithmically in HDP. Contrary to HDP, there are at most N clusters shared between different groups. Our model is in that sense reminiscent of [9] where only a limited number of clusters can be shared. Note however that contrary to [9] we have a simple CRP-like process. The proposed methodology can be straightforwardly extended to the infinite HMM [12]. The main issue of the proposed model is the setting of the number N of auxiliary parameters. Another issue is that to achieve high correlation, we need a large number of auxiliary variables. Nonetheless, the computational time used to sample from auxiliary variables is negligible compared to the time used for latent variables associated to data. Moreover, it can be easily parallelized. The model proposed offers a far richer framework and ensures that at each level of the tree, the marginal distribution of the partition is given by a DP partition model. 4.2 Time-varying matrix factorization Let X1:n be an observed matrix of dimension n × D. We want to find a representation of this matrix in terms of two latent matrices Z1:n of dimension n × K and Y of dimension K × D. Here Z1:n 2 is a binary matrix whereas Y is a matrix of latent features. By assuming that Y ∼ N 0, σY IK×D and 2 X1:n = Z1:n Y + σX εn where εn ∼ N 0, σX In×D , we obtain p(X1:n |Z1:n ) ∝ −D/2 2 2 + Z+T Z+ + σX /σY IKn 1:n 1:n + (n−Kn )D σX exp − + Kn D σY 2 2 + where Σ−1 = I − Z+ Z+T Z+ + σX /σY IKn n 1:n 1:n 1:n −1 1 T −1 2 tr X1:n Σn X1:n 2σX (17) + Z+T , Kn the number of non-zero columns of 1:n + Z1:n and Z+ is the first Kn columns of Z1:n . To avoid having to set K, [5, 14] assume that Z1:n 1:n follows an IBP. The resulting posterior distribution p(Z1:n |X1:n ) can be estimated through MCMC [5] or SMC [14]. We consider here a different model where the object Xt is assumed to arrive at time index t and we want a prior distribution on Z1:n ensuring that objects close in time are more likely to possess similar features. To achieve this, we consider the simple directed graphical model D of Fig. 3 where the site numbering corresponds to a time index in that case and a perfect numbering of D is (1, 2, . . .). The set of parents pa(t) is composed of the r preceding sites {{t − r}, . . . , {t − 1}}. The time-varying IBP to sample from p(Z1:n ) associated to this directed graph follows from (16) and proceeds as follows. At time t = 1 + new new • Sample K1 ∼Poisson(α), set z1,i = 1 for i = 1, ..., K1 and set K1 = Knew . At times t = 2, . . . , r n + new ∼Poisson( α ). • For k = 1, . . . Kt , sample zt,k ∼ Ber( 1:t−1,k ) and Kt t t 7   ?  ? - t−r - t−r+1 - . . . - t−1 - t - t+1        6 6 Figure 3: Directed graph. At times t = r + 1, . . . , n n + α new ∼Poisson( r+1 ). • For k = 1, . . . Kt , sample zt,k ∼ Ber( t−r:t−1,k ) and Kt r+1 + Here Kt is the total number of features appearing from time max(1, t − r) to t − 1 and nt−r:t−1,k the restriction of n1:t−1 to the r last customers. Using (17) and the prior distribution of Z1:n which can be sampled using the time-varying IBP described above, we can easily design an SMC method to sample from p(Z1:n |X1:n ). We do not detail it here. Note that contrary to [14], our algorithm does not require inverting a matrix whose dimension grows linearly with the size of the data but only a matrix of dimension r × r. In order to illustrate the model and SMC algorithm, we create 200 6 × 6 images using a ground truth Y consisting of 4 different 6 × 6 latent images. The 200 × 4 binary matrix was generated from Pr(zt,k = 1) = πt,k , where πt = ( .6 .5 0 0 ) if t = 1, . . . , 30, πt = ( .4 .8 .4 0 ) if t = 31, . . . , 50 and πt = ( 0 .3 .6 .6 ) if t = 51, . . . , 200. The order of the model is set to r = 50. The feature occurences Z1:n and true features Y and their estimates are represented in Figure 4. Two spurious features are detected by the model (features 2 and 5 on Fig. 3(c)) but quickly discarded (Fig. 4(d)). The algorithm is able to correctly estimate the varying prior occurences of the features over time. Feature1 Feature2 Feature1 Feature2 Feature3 20 20 40 40 60 60 Feature4 80 100 Feature4 Feature5 Feature6 Time Feature3 Time 80 100 120 120 140 140 160 160 180 200 180 1 2 3 200 4 Feature (a) 1 2 3 4 5 6 Feature (b) (c) (d) Figure 4: (a) True features, (b) True features occurences, (c) MAP estimate ZM AP and (d) associated E[Y|ZM AP ] t=20 t=50 t=20 t=50 t=100 t=200 t=100 t=200 (a) (b) Figure 5: (a) E[Xt |πt , Y] and (b) E[Xt |X1:t−1 ] at t = 20, 50, 100, 200. 5 Related work and Discussion The fixed-lag version of the time-varying DP of Caron et al. [1] is a special case of the proposed model when G is given by Fig. 3. The bivariate DP of Walker and Muliere [13] is also a special case when G has only two cliques. In this paper, we have assumed that the structure of the graph was known beforehand and we have shown that many flexible models arise from this framework. It would be interesting in the future to investigate the case where the graphical structure is unknown and must be estimated from the data. Acknowledgment The authors thank the reviewers for their comments that helped to improve the writing of the paper. 8 References [1] F. Caron, M. Davy, and A. Doucet. Generalized Polya urn for time-varying Dirichlet process mixtures. In Uncertainty in Artificial Intelligence, 2007. [2] A.P. Dawid and S.L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21:1272–1317, 1993. [3] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995. [4] P. Fearnhead. Particle filters for mixture models with an unknown number of components. Statistics and Computing, 14:11–21, 2004. [5] T.L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2006. [6] D. Heinz. Building hyper dirichlet processes for graphical models. Electonic Journal of Statistics, 3:290–315, 2009. [7] J.F.C. Kingman. Random partitions in population genetics. Proceedings of the Royal Society of London, 361:1–20, 1978. [8] S.L. Lauritzen. Graphical Models. Oxford University Press, 1996. [9] P. M¨ ller, F. Quintana, and G. Rosner. A method for combining inference across related nonu parametric Bayesian models. Journal of the Royal Statistical Society B, 66:735–749, 2004. [10] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000. [11] J. Pitman. Exchangeable and partially exchangeable random partitions. Probability theory and related fields, 102:145–158, 1995. [12] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006. [13] S. Walker and P. Muliere. A bivariate Dirichlet process. Statistics and Probability Letters, 64:1–7, 2003. [14] F. Wood and T.L. Griffiths. Particle filtering for nonparametric Bayesian matrix factorization. In Advances in Neural Information Processing Systems, 2007. 9

6 0.50071782 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

7 0.49307323 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

8 0.49066943 226 nips-2009-Spatial Normalized Gamma Processes

9 0.47862819 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization

10 0.41684738 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison

11 0.41217661 51 nips-2009-Clustering sequence sets for motif discovery

12 0.40899175 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process

13 0.40661073 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships

14 0.40656793 175 nips-2009-Occlusive Components Analysis

15 0.4027499 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information

16 0.3997488 115 nips-2009-Individuation, Identification and Object Discovery

17 0.39931187 85 nips-2009-Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model

18 0.38519266 29 nips-2009-An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism

19 0.38238931 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation

20 0.3774403 196 nips-2009-Quantification and the language of thought


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.107), (7, 0.014), (18, 0.012), (24, 0.032), (25, 0.058), (31, 0.01), (35, 0.056), (36, 0.067), (39, 0.086), (45, 0.029), (46, 0.02), (58, 0.073), (61, 0.022), (71, 0.048), (72, 0.096), (81, 0.034), (84, 0.013), (86, 0.077), (91, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87079972 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov

Abstract: We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us “understand” a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.

2 0.82896572 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

Author: Shobha Venkataraman, Avrim Blum, Dawn Song, Subhabrata Sen, Oliver Spatscheck

Abstract: We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem, by putting together a number of different “experts” algorithms and online paging algorithms. We prove guarantees on our algorithm’s performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.

3 0.74641269 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur

Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1

4 0.74091977 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We 2 also show how to construct efficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets. 1

5 0.739281 112 nips-2009-Human Rademacher Complexity

Author: Xiaojin Zhu, Bryan R. Gibson, Timothy T. Rogers

Abstract: We propose to use Rademacher complexity, originally developed in computational learning theory, as a measure of human learning capacity. Rademacher complexity measures a learner’s ability to fit random labels, and can be used to bound the learner’s true error based on the observed training sample error. We first review the definition of Rademacher complexity and its generalization bound. We then describe a “learning the noise” procedure to experimentally measure human Rademacher complexities. The results from empirical studies showed that: (i) human Rademacher complexity can be successfully measured, (ii) the complexity depends on the domain and training sample size in intuitive ways, (iii) human learning respects the generalization bounds, (iv) the bounds can be useful in predicting the danger of overfitting in human learning. Finally, we discuss the potential applications of human Rademacher complexity in cognitive science. 1

6 0.73754263 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

7 0.7364493 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning

8 0.72929341 148 nips-2009-Matrix Completion from Power-Law Distributed Samples

9 0.72914231 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

10 0.72894335 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

11 0.72888511 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

12 0.72337723 70 nips-2009-Discriminative Network Models of Schizophrenia

13 0.72279429 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference

14 0.72030687 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

15 0.71979105 113 nips-2009-Improving Existing Fault Recovery Policies

16 0.71917176 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

17 0.71825314 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

18 0.71647328 154 nips-2009-Modeling the spacing effect in sequential category learning

19 0.71612036 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

20 0.71427292 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs