
195 nips-2009-Probabilistic Relational PCA

Source: pdf

Author: Wu-jun Li, Dit-Yan Yeung, Zhihua Zhang

Abstract: One crucial assumption made by both principal component analysis (PCA) and probabilistic PCA (PPCA) is that the instances are independent and identically distributed (i.i.d.). However, this common i.i.d. assumption is unreasonable for relational data. In this paper, by explicitly modeling covariance between instances as derived from the relational information, we propose a novel probabilistic dimensionality reduction method, called probabilistic relational PCA (PRPCA), for relational data analysis. Although the i.i.d. assumption is no longer adopted in PRPCA, the learning algorithms for PRPCA can still be devised easily like those for PPCA, which makes explicit use of the i.i.d. assumption. Experiments on real-world data sets show that PRPCA can effectively utilize the relational information to dramatically outperform PCA and achieve state-of-the-art performance.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 One crucial assumption made by both principal component analysis (PCA) and probabilistic PCA (PPCA) is that the instances are independent and identically distributed (i. [sent-13, score-0.174]

2 In this paper, by explicitly modeling covariance between instances as derived from the relational information, we propose a novel probabilistic dimensionality reduction method, called probabilistic relational PCA (PRPCA), for relational data analysis. [sent-21, score-1.008]

3 assumption is no longer adopted in PRPCA, the learning algorithms for PRPCA can still be devised easily like those for PPCA which makes explicit use of the i. [sent-25, score-0.041]

4 Experiments on realworld data sets show that PRPCA can effectively utilize the relational information to dramatically outperform PCA and achieve state-of-the-art performance. [sent-29, score-0.296]

5 The methods for discovering such low-dimensional embedding are often referred to as dimensionality reduction (DR) methods. [sent-31, score-0.055]

6 As a more recent development, probabilistic PCA (PPCA) [21] provides a probabilistic formulation of PCA [13] based on a Gaussian latent variable model [1]. [sent-33, score-0.124]

7 assumption, which means that the instances are assumed to be independent and identically distributed (i. [sent-42, score-0.083]

8 However, the data in many real-world applications, such as web pages and research papers, contain relations or links between (some) instances in the data in addition to the textual content information which is represented in the form of feature vectors. [sent-46, score-0.274]

9 Data of this sort, referred to as relational data1 [10, 20], can be found in such diverse application areas as web mining [3, 17, 23, 24], bioinformatics [22], social network analysis [4], and so on. [sent-47, score-0.297]

10 On one hand, the link structure among instances 1 In this paper, we use document classiﬁcation as a running example for relational data analysis. [sent-48, score-0.411]

11 Hence, for convenience of illustration, the speciﬁc term ‘textual content information’ is used in the paper to refer to the feature vectors describing the instances. [sent-49, score-0.093]

12 However, the algorithms derived in this paper can be applied to any relational data in which the instance feature vectors can represent any attribute information. [sent-50, score-0.307]

13 1 cannot be exploited easily when traditional DR methods such as PCA are applied to relational data. [sent-51, score-0.276]

14 Very often, the useful relational information is simply discarded. [sent-52, score-0.276]

15 One possible use of the relational information in PCA or PPCA is to ﬁrst convert the link structure into the format of ﬂat data by extracting some additional features from the links. [sent-54, score-0.346]

16 assumption underlying PCA and PPCA is unreasonable for relational data. [sent-59, score-0.325]

17 In relational data, the attributes of the connected (linked) instances are often correlated and the class label of one instance may have an inﬂuence on that of a linked instance. [sent-60, score-0.455]

18 In this paper, a novel probabilistic DR method called probabilistic relational PCA (PRPCA) is proposed for relational data analysis. [sent-66, score-0.612]

19 By explicitly modeling the covariance between instances as derived from the relational information, PRPCA seamlessly integrates relational information and textual content information into a uniﬁed probabilistic framework. [sent-67, score-0.814]

20 assumption is no longer adopted in PRPCA, the learning algorithms for PRPCA can still be devised easily like those for PPCA which makes explicit use of the i. [sent-72, score-0.041]

21 Extensive experiments on real-world data sets show that PRPCA can effectively utilize the relational information to dramatically outperform PCA and achieve state-of-the-art performance. [sent-76, score-0.296]

22 2 Notation We use boldface uppercase letters, such as K, to denote matrices, and boldface lowercase letters, such as z, to denote vectors. [sent-77, score-0.04]

23 The ith row and the jth column of a matrix K are denoted by Ki∗ and K∗j , respectively. [sent-78, score-0.036]

24 tr(·) denotes the trace of a matrix and etr(·) ≜ exp(tr(·)). [sent-83, score-0.057]

25 We overload N (·) for both multivariate normal distributions and matrix variate normal distributions [11]. [sent-88, score-0.157]

26 ⟨·⟩ denotes the expectation operation and cov(·) denotes the covariance operation. [sent-89, score-0.074]

27 Note that in relational data, there exist both content and link observations. [sent-90, score-0.439]

28 We further use the d × N matrix T to denote the content matrix with $T_{*n} = t_n$, and the q × N matrix X to denote the latent variables of T with $X_{*n} = W^T (t_n - \mu)$. [sent-92, score-0.302]

29 For relational data, the N × N matrix A denotes the adjacency (link) matrix of the N instances. [sent-93, score-0.39]

30 In this paper, we assume that the links are undirected. [sent-94, score-0.073]

31 For those data with directed links, we will convert the directed links into undirected links which can keep the original physical meaning of the links. [sent-95, score-0.228]

32 Hence, Aij = 1 if there exists a relation between instances i and j, and otherwise Aij = 0. [sent-99, score-0.091]

33 3 Probabilistic PCA To set the stage for the next section which introduces our PRPCA model, we ﬁrst brieﬂy present the derivation for PPCA [21], which was originally based on (vector-based) multivariate normal distributions, from the perspective of matrix variate normal distributions [11]. [sent-103, score-0.157]

34 If we use Υ to denote the Gaussian noise process and assume that Υ and the latent variable matrix X follow these distributions: $\Upsilon \sim \mathcal{N}_{d,N}(0, \sigma^2 I_d \otimes I_N)$, $X \sim \mathcal{N}_{q,N}(0, I_q \otimes I_N)$, (1) we can express a generative model as follows: $T = WX + \mu e^T + \Upsilon$. [sent-104, score-0.1]

35 Based on some properties of matrix variate normal distributions in [11], we get the following results: $T \mid X \sim \mathcal{N}_{d,N}(WX + \mu e^T, \sigma^2 I_d \otimes I_N)$, $T \sim \mathcal{N}_{d,N}(\mu e^T, (WW^T + \sigma^2 I_d) \otimes I_N)$. [sent-105, score-0.157]

36 The corresponding log-likelihood of the observation matrix T is then $L = \ln p(T) = -\frac{N}{2}\left[ d \ln(2\pi) + \ln|C| + \mathrm{tr}(C^{-1}S) \right]$, (3) where $S = \frac{(T - \mu e^T)(T - \mu e^T)^T}{N} = \frac{1}{N}\sum_{n=1}^{N} (T_{*n} - \mu)(T_{*n} - \mu)^T$. [sent-107, score-0.1]

37 We can see that S is just the sample covariance matrix of the content observations. [sent-108, score-0.161]
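
As a rough numerical illustration of the log-likelihood (3), the sketch below evaluates it with NumPy. It assumes C = WW^T + σ²I_d (the row covariance in the marginal distribution of T given above); the function name `ppca_log_likelihood` and the toy data are illustrative and do not come from the paper.

```python
import numpy as np


def ppca_log_likelihood(T, W, mu, sigma2):
    """Evaluate L = -(N/2) * [d ln(2*pi) + ln|C| + tr(C^{-1} S)].

    T: (d, N) content matrix, W: (d, q), mu: (d,), sigma2: noise variance.
    Assumption: C = W W^T + sigma2 * I_d.
    """
    d, N = T.shape
    C = W @ W.T + sigma2 * np.eye(d)
    centered = T - mu[:, None]              # T - mu e^T
    S = centered @ centered.T / N           # sample covariance about mu
    _, logdet_C = np.linalg.slogdet(C)      # C is positive definite
    trace_term = np.trace(np.linalg.solve(C, S))
    return -0.5 * N * (d * np.log(2.0 * np.pi) + logdet_C + trace_term)


# Toy usage with made-up numbers.
rng = np.random.default_rng(0)
T_demo = rng.normal(size=(4, 6))
W_demo = rng.normal(size=(4, 2))
mu_demo = T_demo.mean(axis=1)
ppca_ll_demo = ppca_log_likelihood(T_demo, W_demo, mu_demo, sigma2=0.5)
```

Because the columns of T are independent under PPCA, the same value can be obtained by summing N ordinary multivariate normal log-densities, which is a convenient sanity check.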

38 Using matrix notations, the graphical model of PPCA based on matrix variate normal distributions is shown in Figure 1(a). [sent-110, score-0.17]

39 Figure 1: Graphical models of PPCA (a) and PRPCA (b), in which T is the observation matrix, X is the latent variable matrix, µ, W and σ² are the parameters to learn, and the other quantities are kept constant. [sent-111, score-0.064]

40 assumption can make the modeling process much simpler and has achieved great success in many traditional applications, this assumption is however very unreasonable for relational data [10]. [sent-116, score-0.349]

41 In relational data, the attributes of connected (linked) instances are often correlated. [sent-117, score-0.367]

42 In this section, a probabilistic relational PCA model, called PRPCA, is proposed to integrate both the relational information and the content information seamlessly into a uniﬁed framework by eliminating the i. [sent-118, score-0.695]

43 Based on our reformulation of PPCA using matrix variate notations as presented in the previous section, we can obtain PRPCA just by introducing some relatively simple (but very effective) modiﬁcations. [sent-122, score-0.111]

44 1 Model Formulation Assume that the latent variable matrix X has the following distribution: X ∼ Nq,N (0, Iq ⊗ Φ). [sent-128, score-0.1]

45 assumption for relational data, one direct way is to use a non-identity covariance matrix Φ for the distribution of X in (4). [sent-143, score-0.368]

46 1 Relational Covariance Construction Because the covariance matrix Φ in PRPCA is constructed from the relational information in the data, we refer to it as relational covariance here. [sent-151, score-0.652]

47 The goal of PCA and PPCA is to ﬁnd those principal axes onto which the retained variance under projection is maximal [13, 21]. [sent-152, score-0.103]

48 For one speciﬁc X, the retained variance is tr[XXT ]. [sent-153, score-0.066]

49 If we rewrite p(X) in (1) as $p(X) = \frac{\exp\{\mathrm{tr}[-\frac{1}{2} XX^T]\}}{(2\pi)^{qN/2}} = \frac{\exp\{-\frac{1}{2}\mathrm{tr}[XX^T]\}}{(2\pi)^{qN/2}}$, we have the following observation: Observation 1 For PPCA, the larger the retained variance of X, i. [sent-154, score-0.066]

50 , the more X approaches the destination point, the lower is the probability density at X given by the prior. [sent-156, score-0.037]

51 Here, the destination point refers to the point where the goal of PPCA is achieved, i. [sent-157, score-0.037]

52 Moreover, we use the retained variance as a measure to deﬁne the gap between two different points. [sent-160, score-0.066]

53 The smaller is the gap between the retained variance of two points, the more they approach each other. [sent-161, score-0.066]

54 Because the design principle of PRPCA is similar to that of PPCA, our working hypothesis here is that Observation 1 can also guide us to design the relational covariance of PRPCA. [sent-162, score-0.308]

55 In PRPCA, we assume that the attributes of two linked instances are positively correlated. [sent-164, score-0.148]

56 Under this assumption, the ideal goal of PRPCA should be to make the latent representations of two instances as close as possible if there exists a relation (link) between them. [sent-165, score-0.155]

57 Hence, the measure to deﬁne the gap between two points refers to the closeness of the linked instances, i. [sent-166, score-0.057]

58 , the summation of the Euclidean distances between the linked instances. [sent-168, score-0.057]

59 Based on Observation 1, the more X approaches the destination point, the lower should be the probability density at X given by the prior. [sent-169, score-0.037]

60 Hence, under the latent space representation X, the closer the linked instances are, the lower should be the probability density at X given by the prior. [sent-170, score-0.186]

61 Note that Aij = 1 if there exists a relation between instances i and j, and otherwise Aij = 0. [sent-172, score-0.091]

62 Let $\tilde{D}$ denote a diagonal matrix whose diagonal elements $\tilde{D}_{ii} = \sum_j A_{ij}$. [sent-174, score-0.036]

63 Because $B_{ij} = \sum_{k=1}^{N} A_{ik} A_{kj}$ for $i \neq j$, we can see that $B_{ij}$ is the number of paths, each with path length 2, from instance i to instance j in the original adjacency graph A. [sent-178, score-0.107]

64 Because the attributes of two linked instances are positively correlated, Bij actually reﬂects the degree of correlation between instance i and instance j. [sent-179, score-0.21]

65 Let us take the paper citation graph as an example to illustrate this. [sent-180, score-0.042]

66 The existence of a citation relation between two papers often implies that they are about the same topic. [sent-181, score-0.044]

67 If paper i cites paper k and paper k cites paper j, it is highly likely that paper i and paper j are about the same topic. [sent-182, score-0.048]

68 Hence, the larger Bij is, the stronger is the correlation between instance i and instance j. [sent-184, score-0.062]

69 Because $B_{ij} = \sum_{k=1}^{N} A_{ik} A_{kj} = A_{*i}^T A_{*j}$, $B_{ij}$ can also be seen as the similarity between the link vectors of instance i and instance j. [sent-185, score-0.132]
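
The two readings of B_ij (number of length-2 paths, and inner product of link vectors) can be checked on a tiny graph. The following is a minimal sketch in plain Python; the helper names and the toy graph are hypothetical, and setting the diagonal of B to zero is an assumption made here because the text only discusses B_ij for distinct instances.

```python
def two_hop_counts(A):
    """Return B with B[i][j] = sum_k A[i][k] * A[k][j] for i != j.

    Assumption: the diagonal of B is set to 0, since the text only
    discusses B_ij for distinct instances i and j.
    """
    n = len(A)
    B = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                B[i][j] = sum(A[i][k] * A[k][j] for k in range(n))
    return B


def link_vector_similarity(A, i, j):
    """Inner product of columns i and j of A (the link vectors)."""
    return sum(A[k][i] * A[k][j] for k in range(len(A)))


# Toy undirected citation graph: paper 1 is linked to papers 0, 2 and 3.
A_demo = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
B_demo = two_hop_counts(A_demo)
```

In this toy graph, papers 0 and 2 are not directly linked but share the neighbor 1, so they are connected by exactly one path of length 2.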

70 Therefore, B can be seen as a weight matrix (corresponding to a weight graph) derived from the original adjacency matrix A, and B is also consistent with the physical meaning underlying A. [sent-186, score-0.119]

71 Letting G = 2A + B,3 we can ﬁnd that G actually combines the original graph reﬂected by A and the derived graph reﬂected by B to get a new graph, and puts a weight 2Aij + Bij on the edge between instance i and instance j in the new graph. [sent-187, score-0.133]

72 The new weight graph reﬂected by G is also consistent with the physical meaning underlying A. [sent-188, score-0.05]

73 Letting $L \triangleq D - G$, where D is a diagonal matrix whose diagonal elements $D_{ii} = \sum_j G_{ij}$ and L is called the Laplacian matrix [6] of G, we can get $\Delta = (1+\gamma) I_N + \tilde{D} + D - L$. [sent-189, score-0.095]
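
The construction of ∆ from the adjacency matrix can be sketched in a few lines of NumPy. This is an illustrative reading of the extracted formulas, not the authors' code: it takes the two D terms in the expression for ∆ to be the degree matrix of A (written D̃) and the degree matrix of G (written D), zeroes the diagonal of B, and uses an arbitrary value for γ. The function name `relational_precision` is hypothetical.

```python
import numpy as np


def relational_precision(A, gamma=1e-6):
    """Build Delta = (1 + gamma) I_N + D_tilde + D - L from a symmetric 0/1 A.

    Assumptions: B is A @ A with its diagonal zeroed, and gamma is a small
    positive constant; the default value here is arbitrary.
    """
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    D_tilde = np.diag(A.sum(axis=1))        # D_tilde_ii = sum_j A_ij
    B = A @ A
    np.fill_diagonal(B, 0.0)                # keep only the i != j entries
    G = 2.0 * A + B                         # combined weight graph
    D = np.diag(G.sum(axis=1))              # D_ii = sum_j G_ij
    L = D - G                               # graph Laplacian of G
    return (1.0 + gamma) * np.eye(N) + D_tilde + D - L


# Path graph 0 - 1 - 2, with a deliberately large gamma for readability.
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
Delta_path = relational_precision(A_path, gamma=0.5)
```

For this path graph the result is symmetric and positive definite, so it can be inverted to obtain a valid covariance matrix over the three instances.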

74 Links with other physical meanings, such as the directed links in web graphs [25], can be transformed into links satisfying the assumption in PRPCA via some preprocessing strategies. [sent-192, score-0.245]

75 The first term $\sum_{i=1}^{N} \hat{D}_{ii} \|X_{*i}\|^2$ in (5) can be treated as a measure of weighted variance of all the instances in the latent space. [sent-198, score-0.145]

76 This means that under the latent space representation X, the closer the linked instances are, the lower is the probability density at X given by the prior. [sent-201, score-0.186]

77 2 Model With the constructed relational covariance Φ, the generative model of PRPCA is defined as follows: $\Upsilon \sim \mathcal{N}_{d,N}(0, \sigma^2 I_d \otimes \Phi)$, $X \sim \mathcal{N}_{q,N}(0, I_q \otimes \Phi)$, $T = WX + \mu e^T + \Upsilon$, where $\Phi = \Delta^{-1}$. [sent-205, score-0.308]
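
To make the generative model concrete, the sketch below draws one sample of (T, X) with NumPy. It relies on a standard fact about matrix variate normals: if Z has i.i.d. standard normal entries and Φ = RR^T, then ZR^T has independent rows whose entries are correlated across instances according to Φ. The function name `sample_prpca`, the example ∆, and all parameter values are hypothetical.

```python
import numpy as np


def sample_prpca(W, mu, sigma2, Delta, rng):
    """Draw (T, X) from T = W X + mu e^T + Upsilon with Phi = Delta^{-1}.

    A q x N matrix Z of i.i.d. standard normals becomes a draw from
    N_{q,N}(0, I_q (x) Phi) after right-multiplying by chol(Phi)^T.
    """
    d, q = W.shape
    N = Delta.shape[0]
    Phi = np.linalg.inv(Delta)
    chol = np.linalg.cholesky(Phi)                   # Phi = chol @ chol.T
    X = rng.standard_normal((q, N)) @ chol.T         # instances correlated via Phi
    noise = np.sqrt(sigma2) * rng.standard_normal((d, N)) @ chol.T
    T = W @ X + mu[:, None] + noise
    return T, X


# Toy usage: 3 instances, 4 features, 2 latent dimensions (all made up).
Delta_demo = np.array([[2.5, 2.0, 1.0], [2.0, 3.5, 2.0], [1.0, 2.0, 2.5]])
rng_demo = np.random.default_rng(0)
W_gen = rng_demo.normal(size=(4, 2))
T_gen, X_gen = sample_prpca(W_gen, np.zeros(4), 0.1, Delta_demo, rng_demo)
```

Setting `Delta` to the identity matrix makes the instances independent again, which corresponds to the PPCA generative model in (1).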

78 When ∆ 0, we say T follows a singular matrix variate normal distribution [11], and all the derivations for PRPCA are still correct. [sent-218, score-0.134]

79 Then the log-likelihood of the observation matrix T in PRPCA is $L_1 = \ln p(T) = -\frac{N}{2}\left[ d \ln(2\pi) + \ln|C| + \mathrm{tr}(C^{-1}H) \right] + c$, (7) where $c = -\frac{d}{2} \ln|\Phi|$ can be seen as a constant independent of the parameters µ, W and σ², and $H = \frac{(T - \mu e^T)\,\Delta\,(T - \mu e^T)^T}{N}$. [sent-222, score-0.132]
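
The log-likelihood (7) differs from the PPCA case only in that the sample covariance is replaced by the ∆-weighted matrix H, plus a constant. A minimal NumPy sketch follows, assuming C = WW^T + σ²I_d and H normalized by N as in the PPCA sample covariance; the function name `prpca_log_likelihood` is illustrative.

```python
import numpy as np


def prpca_log_likelihood(T, W, mu, sigma2, Delta):
    """Evaluate L1 = -(N/2) [d ln(2 pi) + ln|C| + tr(C^{-1} H)] + c.

    Assumptions: C = W W^T + sigma2 * I_d,
    H = (T - mu e^T) Delta (T - mu e^T)^T / N, and
    c = -(d/2) ln|Phi| = (d/2) ln|Delta| because Phi = Delta^{-1}.
    """
    d, N = T.shape
    C = W @ W.T + sigma2 * np.eye(d)
    centered = T - mu[:, None]                       # T - mu e^T
    H = centered @ Delta @ centered.T / N            # relationally weighted covariance
    _, logdet_C = np.linalg.slogdet(C)
    _, logdet_Delta = np.linalg.slogdet(Delta)
    c = 0.5 * d * logdet_Delta                       # constant w.r.t. mu, W, sigma2
    trace_term = np.trace(np.linalg.solve(C, H))
    return -0.5 * N * (d * np.log(2.0 * np.pi) + logdet_C + trace_term) + c
```

With `Delta` equal to the identity, H reduces to the ordinary sample covariance and c vanishes, so the value coincides with the PPCA log-likelihood.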

80 5 Experiments Although PPCA possesses additional advantages when compared with the original non-probabilistic formulation of PCA, they will get similar DR results when there exist no missing values in the data. [sent-255, score-0.044]

81 If the task is to classify instances in the low-dimensional embedding, the classiﬁers based on the embedding results of PCA and PPCA are expected to achieve comparable results. [sent-256, score-0.097]

82 For WebKB, according to the semantics of authoritative pages and hub pages [25], we ﬁrst preprocess the link structure of this data set as follows: if two web pages are co-linked by or link to another common web page, we add a link between these two pages. [sent-267, score-0.279]
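
The co-link preprocessing rule described above can be sketched in plain Python. The function name `colink_edges` and the toy edge list are hypothetical. The text does not say whether the original direct links are kept alongside the derived ones, so this sketch returns only the derived undirected links.

```python
from itertools import combinations


def colink_edges(directed_edges):
    """Return undirected links between pages that share a common neighbor.

    Two pages are linked if some page points to both of them (co-linked)
    or if both point to the same page. Whether the original direct links
    are also kept is not stated in the text; this sketch returns only the
    derived links.
    """
    out_links, in_links = {}, {}
    for src, dst in directed_edges:
        out_links.setdefault(src, set()).add(dst)
        in_links.setdefault(dst, set()).add(src)

    undirected = set()
    for neighbors in list(out_links.values()) + list(in_links.values()):
        for i, j in combinations(sorted(neighbors), 2):
            undirected.add((i, j))          # stored once, with i < j
    return undirected


# Page 0 links to pages 1 and 2; pages 3 and 5 both link to page 4.
demo_links = colink_edges([(0, 1), (0, 2), (3, 4), (5, 4)])
```

Here pages 1 and 2 become linked because page 0 points to both, and pages 3 and 5 become linked because both point to page 4.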

83 After preprocessing, all the directed links are converted into undirected links. [sent-269, score-0.101]

84 For the PoliticalBook data set, we use the testing procedure of the latent Wishart process (LWP) model [15] for evaluation. [sent-273, score-0.064]

85 We can see that it is not easy to separate the two classes in the latent space of PCA. [sent-284, score-0.064]

86 However, the two classes are better separated from each other in the latent space of PRPCA. [sent-285, score-0.064]

87 Hence, better clustering or classiﬁcation performance can be expected when the examples are clustered or classiﬁed in the latent space of PRPCA. [sent-286, score-0.064]

88 Figure 2: Convergence speed of the EM learning procedure of PRPCA. Figure 3: Visualization of data points in the latent spaces of PCA and PRPCA. [sent-307, score-0.064]

89 Performance The dimensionality of Cora and WebKB is moderately high, but the dimensionality of PoliticalBook is very high. [sent-311, score-0.046]

90 Performance on Cora and WebKB The average classiﬁcation accuracy with its standard deviation based on 5-fold cross validation against the dimensionality of the latent space q is shown in Figure 4. [sent-313, score-0.111]

91 We can ﬁnd that PRPCA can dramatically outperform PCA on all the data sets under any dimensionality, which conﬁrms that the relational information is very informative and PRPCA can utilize it very effectively. [sent-314, score-0.296]

92 The methods include: SVM on content, which ignores the link structure in the data and applies SVM only on the content information in the original bag-of-words representation; SVM on links, which ignores the content information and treats the links as features, i. [sent-316, score-0.329]

93 Very recently, another method proposed by us, called relation regularized matrix factorization (RRMF) [14], has achieved better performance than PRPCA on the Cora data set. [sent-377, score-0.062]

94 [Figure: accuracy comparison of SVM on content, SVM on links, SVM on link-content, DGR, PLSI+PHITS, link-content MF, and PRPCA] [sent-383, score-0.166]

95 Although LWP can also learn a low-dimensional embedding for the instances, the computation cost to obtain a low-dimensional embedding for a test instance is O(N^3) because it has to invert the kernel matrix defined on the training data. [sent-395, score-0.148]

96 The missing link - a probabilistic model of document content and hypertext connectivity. [sent-453, score-0.214]

97 Analysis of a complex of statistical variables into principal components. [sent-486, score-0.037]

98 A Bayesian framework for community detection integrating content and link. [sent-562, score-0.093]

99 Combining link and content for community detection: a discriminative approach. [sent-569, score-0.163]

100 Combining content and link for classiﬁcation using matrix factorization. [sent-582, score-0.199]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('prpca', 0.755), ('ppca', 0.373), ('relational', 0.276), ('pca', 0.18), ('dr', 0.107), ('content', 0.093), ('mf', 0.087), ('politicalbook', 0.085), ('cora', 0.078), ('variate', 0.075), ('webkb', 0.075), ('links', 0.073), ('link', 0.07), ('iq', 0.069), ('bij', 0.066), ('em', 0.066), ('instances', 0.065), ('latent', 0.064), ('cornell', 0.061), ('dii', 0.058), ('linked', 0.057), ('retained', 0.05), ('id', 0.049), ('lwp', 0.049), ('rgp', 0.049), ('xgp', 0.049), ('tr', 0.044), ('wwt', 0.043), ('aij', 0.04), ('ds', 0.04), ('principal', 0.037), ('tn', 0.037), ('chi', 0.037), ('destination', 0.037), ('gpc', 0.037), ('matrix', 0.036), ('ln', 0.032), ('covariance', 0.032), ('embedding', 0.032), ('texas', 0.032), ('instance', 0.031), ('probabilistic', 0.03), ('ha', 0.029), ('wx', 0.029), ('directed', 0.028), ('aa', 0.028), ('preprocess', 0.027), ('attributes', 0.026), ('physical', 0.026), ('xxt', 0.026), ('relation', 0.026), ('wt', 0.025), ('unreasonable', 0.025), ('cov', 0.025), ('akj', 0.024), ('cites', 0.024), ('dgr', 0.024), ('phits', 0.024), ('rrmf', 0.024), ('uq', 0.024), ('assumption', 0.024), ('accuracy', 0.024), ('graph', 0.024), ('get', 0.023), ('ml', 0.023), ('dn', 0.023), ('dimensionality', 0.023), ('svm', 0.023), ('normal', 0.023), ('textual', 0.022), ('hence', 0.021), ('plsi', 0.021), ('aik', 0.021), ('denotes', 0.021), ('web', 0.021), ('hong', 0.021), ('adjacency', 0.021), ('missing', 0.021), ('boldface', 0.02), ('zhejiang', 0.02), ('wishart', 0.02), ('hw', 0.02), ('seamlessly', 0.02), ('yeung', 0.02), ('dramatically', 0.02), ('ected', 0.019), ('gij', 0.018), ('citation', 0.018), ('nigam', 0.018), ('et', 0.018), ('xt', 0.018), ('identically', 0.018), ('visualization', 0.018), ('devised', 0.017), ('letting', 0.017), ('invert', 0.017), ('kdd', 0.016), ('variance', 0.016), ('topic', 0.016), ('li', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 195 nips-2009-Probabilistic Relational PCA

Author: Wu-jun Li, Dit-Yan Yeung, Zhihua Zhang

2 0.12964702 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

Author: Kurt Miller, Michael I. Jordan, Thomas L. Griffiths

Abstract: As the availability and importance of relational data—such as the friendships summarized on a social networking website—increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting Bayesian nonparametric methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable—latent features—using a Bayesian nonparametric approach to simultaneously infer the number of features at the same time we learn which entities have each feature. Our model combines these inferred features with known covariates in order to perform link prediction. We demonstrate that the greater expressiveness of this approach allows us to improve performance on three datasets. 1

3 0.11475247 21 nips-2009-Abstraction and Relational learning

Author: Charles Kemp, Alan Jern

Abstract: Most models of categorization learn categories deﬁned by characteristic features but some categories are described more naturally in terms of relations. We present a generative model that helps to explain how relational categories are learned and used. Our model learns abstract schemata that specify the relational similarities shared by instances of a category, and our emphasis on abstraction departs from previous theoretical proposals that focus instead on comparison of concrete instances. Our ﬁrst experiment suggests that abstraction can help to explain some of the ﬁndings that have previously been used to support comparison-based approaches. Our second experiment focuses on one-shot schema learning, a problem that raises challenges for comparison-based approaches but is handled naturally by our abstraction-based account. Categories such as family, sonnet, above, betray, and imitate differ in many respects but all of them depend critically on relational information. Members of a family are typically related by blood or marriage, and the lines that make up a sonnet must rhyme with each other according to a certain pattern. A pair of objects will demonstrate “aboveness” only if a certain spatial relationship is present, and an event will qualify as an instance of betrayal or imitation only if its participants relate to each other in certain ways. All of the cases just described are examples of relational categories. This paper develops a computational approach that helps to explain how simple relational categories are acquired. Our approach highlights the role of abstraction in relational learning. Given several instances of a relational category, it is often possible to infer an abstract representation that captures what the instances have in common. We refer to these abstract representations as schemata, although others may prefer to call them rules or theories. 
For example, a sonnet schema might specify the number of lines that a sonnet should include and the rhyming pattern that the lines should follow. Once a schema has been acquired it can support several kinds of inferences. A schema can be used to make predictions about hidden aspects of the examples already observed—if the ﬁnal word in a sonnet is illegible, the rhyming pattern can help to predict the identity of this word. A schema can be used to decide whether new examples (e.g. new poems) qualify as members of the category. Finally, a schema can be used to generate novel examples of a category (e.g. novel sonnets). Most researchers would agree that abstraction plays some role in relational learning, but Gentner [1] and other psychologists have emphasized the role of comparison instead [2, 3]. Given one example of a sonnet and the task of deciding whether a second poem is also a sonnet, a comparison-based approach might attempt to establish an alignment or mapping between the two. Approaches that rely on comparison or mapping are especially prominent in the literature on analogical reasoning [4, 5], and many of these approaches can be viewed as accounts of relational categorization [6]. For example, the problem of deciding whether two systems are analogous can be formalized as the problem of deciding whether these systems are instances of the same relational category. Despite some notable exceptions [6, 7], most accounts of analogy focus on comparison rather than abstraction, and suggest that “analogy passes from one instance of a generalization to another without pausing for explicit induction of the generalization” (p 95) [8]. 1 Schema s 0∀Q ∀x ∀y Q(x) < Q(y) ↔ D1 (x) < D1 (y) Group g Observation o Figure 1: A hierarchical generative model for learning and using relational categories. The schema s at the top level is a logical sentence that speciﬁes which groups are valid instances of the category. 
The group g at the second level is randomly sampled from the set of valid instances, and the observation o is a partially observed version of group g. Researchers that focus on comparison sometimes discuss abstraction, but typically suggest that abstractions emerge as a consequence of comparing two or more concrete instances of a category [3, 5, 9, 10]. This view, however, will not account for one-shot inferences, or inferences based on a single instance of a relational category. Consider a learner who is shown one instance of a sonnet then asked to create a second instance. Since only one instance is provided, it is hard to see how comparisons between instances could account for success on the task. A single instance, however, will sometimes provide enough information for a schema to be learned, and this schema should allow subsequent instances to be generated [11]. Here we develop a formal framework for exploring relational learning in general and one-shot schema learning in particular. Our framework relies on the hierarchical Bayesian approach, which provides a natural way to combine abstraction and probabilistic inference [12]. The hierarchical Bayesian approach supports representations at multiple levels of abstraction, and helps to explains how abstract representations (e.g. a sonnet schema) can be acquired given observations of concrete instances (e.g. individual sonnets). The schemata we consider are represented as sentences in a logical language, and our approach therefore builds on previous probabilistic methods for learning and using logical theories [13, 14]. Following previous authors, we propose that logical representations can help to capture the content of human knowledge, and that Bayesian inference helps to explain how these representations are acquired and how they support inductive inference. The following sections introduce our framework then evaluate it using two behavioral experiments. 
Our ﬁrst experiment uses a standard classiﬁcation task where participants are shown one example of a category then asked to decide which of two alternatives is more likely to belong to the same category. Tasks of this kind have previously been used to argue for the importance of comparison, but we suggest that these tasks can be handled by accounts that focus on abstraction. Our second experiment uses a less standard generation task [15, 16] where participants are shown a single example of a category then asked to generate additional examples. As predicted by our abstraction-based account, we ﬁnd that people are able to learn relational categories on the basis of a single example. 1 A generative approach to relational learning Our examples so far have used real-world relational categories such as family and sonnet but we now turn to a very simple domain where relational categorization can be studied. Each element in the domain is a group of components that vary along a number of dimensions—in Figure 1, the components are ﬁgures that vary along the dimensions of size, color, and circle position. The groups can be organized into categories—one such category includes groups where every component is black. Although our domain is rather basic it allows some simple relational regularities to be explored. We can consider categories, for example, where all components in a group must be the same along some dimension, and categories where all components must be different along some dimension. We can also consider categories deﬁned by relationships between dimensions—for example, the category that includes all groups where the size and color dimensions are correlated. Each category is associated with a schema, or an abstract representation that speciﬁes which groups are valid instances of the category. 
Here we consider schemata that correspond to rules formulated 2 1 2 3 4 5 6 7  ﬀ ˘ ¯ ∀x D (x) =, =, <, > vk ∃xﬀ  i  ﬀ ˘ ¯ ∀x ∀y x = y → D (x) =, =, <, > Di (y) ∃x ∃y x = y ∧ 8 i9 ˘ ¯ <∧= ˘ ¯ ∀x Di (x) =, = vk ∨ Dj (x) =, = vl : ; ↔ 8 9 0 1 <∧= ˘ ¯ ˘ ¯ ∀x∀y x = y → @Di (x) =, =, <, > Di (y) ∨ Dj (x) =, =, <, > Dj (y)A : ; ↔  ﬀ ﬀ ﬀ ˘ ¯ ∀Q ∀x ∀y x = y → Q(x) =, =, <, > Q(y) ∃Q ∃x ∃y x = y ∧ 8 9 0 1  ﬀ <∧= ˘ ¯ ˘ ¯ ∀Q Q = Di → ∀x∀y x = y → @Q(x) =, =, <, > Q(y) ∨ Di (x) =, =, <, > Di (y)A ∃Q Q = Di ∧ : ; ↔ 8 9 0 1  ﬀ ﬀ <∧= ˘ ¯ ˘ ¯ ∀Q ∀R Q = R → ∀x∀y x = y → @Q(x) =, =, <, > Q(y) ∨ R(x) =, =, <, > R(y)A ∃Q ∃R Q = R ∧ : ; ↔ Table 1: Templates used to construct a hypothesis space of logical schemata. An instance of a given template can be created by choosing an element from each set enclosed in braces (some sets are laid out horizontally to save space), replacing each occurrence of Di or Dj with a dimension (e.g. D1 ) and replacing each occurrence of vk or vl with a value (e.g. 1). in a logical language. The language includes three binary connectives—and (∧), or (∨), and if and only if (↔). Four binary relations (=, =, <, and >) are available for comparing values along dimensions. Universal quantiﬁcation (∀x) and existential quantiﬁcation (∃x) are both permitted, and the language includes quantiﬁcation over objects (∀x) and dimensions (∀Q). For example, the schema in Figure 1 states that all dimensions are aligned. More precisely, if D1 is the dimension of size, the schema states that for all dimensions Q, a component x is smaller than a component y along dimension Q if and only if x is smaller in size than y. It follows that all three dimensions must increase or decrease together. To explain how rules in this logical language are learned we work with the hierarchical generative model in Figure 1. The representation at the top level is a schema s, and we assume that one or more groups g are generated from a distribution P (g|s). 
Following a standard approach to category learning [17, 18], we assume that g is uniformly sampled from all groups consistent with s: $p(g|s) \propto \begin{cases} 1 & \text{if } g \text{ is consistent with } s \\ 0 & \text{otherwise} \end{cases}$ (1) For all applications in this paper, we assume that the number of components in a group is known and fixed in advance. The bottom level of the hierarchy specifies observations o that are generated from a distribution P (o|g). In most cases we assume that g can be directly observed, and that P (o|g) = 1 if o = g and 0 otherwise. We also consider the setting shown in Figure 1 where o is generated by concealing a component of g chosen uniformly at random. Note that the observation o in Figure 1 includes only four of the components in group g, and is roughly analogous to our earlier example of a sonnet with an illegible final word. To convert Figure 1 into a fully-specified probabilistic model it remains to define a prior distribution P (s) over schemata. An appealing approach is to consider all of the infinitely many sentences in the logical language already mentioned, and to define a prior favoring schemata which correspond to simple (i.e. short) sentences. We approximate this approach by considering a large but finite space of sentences that includes all instances of the templates in Table 1 and all conjunctions of these instances. When instantiating one of these templates, each occurrence of Di or Dj should be replaced by one of the dimensions in the domain. For example, the schema in Figure 1 is a simplified instance of template 6 where Di is replaced by D1 . Similarly, each instance of vk or vl should be replaced by a value along one of the dimensions. Our first experiment considers a problem where there are three dimensions and three possible values along each dimension (i.e. vk = 1, 2, or 3). As a result there are 1568 distinct instances of the templates in Table 1 and roughly one million conjunctions of these instances.
Our second experiment uses three dimensions with five values along each dimension, which leads to 2768 template instances and roughly three million conjunctions of these instances.

The templates in Table 1 capture most of the simple regularities that can be formulated in our logical language. Template 1 generates all rules that include quantification over a single object variable and no binary connectives. Template 3 is similar but includes a single binary connective. Templates 2 and 4 are similar to 1 and 3 respectively, but include two object variables (x and y) rather than one. Templates 5, 6 and 7 add quantification over dimensions to Templates 2 and 4.

Although the templates in Table 1 capture a large class of regularities, several kinds of templates are not included. Since we do not assume that the dimensions are commensurable, values along different dimensions cannot be directly compared (∃x D1(x) = D2(x) is not permitted). For the same reason, comparisons to a dimension value must involve a concrete dimension (∀x D1(x) = 1 is permitted) rather than a dimension variable (∀Q ∀x Q(x) = 1 is not permitted). Finally, we exclude all schemata where quantification over objects precedes quantification over dimensions, and as a result there are some simple schemata that our implementation cannot learn (e.g. ∃x ∀y ∃Q Q(x) = Q(y)).

The extension of each schema is a set of groups, and schemata with the same extension can be assigned to the same equivalence class. For example, ∀x D1(x) = v1 (an instance of template 1) and ∀x D1(x) = v1 ∧ D1(x) = v1 (an instance of template 3) end up in the same equivalence class. Each equivalence class can be represented by the shortest sentence that it contains, and we define our prior P(s) over a set that includes a single representative for each equivalence class. The prior probability P(s) of each sentence decreases exponentially with its length: P(s) ∝ λ^{|s|}, where |s| is the length of schema s and λ is a constant between 0 and 1.
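As a small worked illustration of the length-based prior, the sketch below normalizes weights of the form λ^{|s|} over a handful of sentences. The sentence names and lengths are made up for the example, and how length is actually counted (symbols, tokens, or something else) is not stated in the text, so it is left as an input.

```python
def length_prior(lengths, lam=0.8):
    """Normalize weights lam ** |s| over a finite set of schemata.

    `lengths` maps a schema name to its sentence length |s|; both the names
    and the way length is measured are assumptions made for illustration.
    """
    weights = {name: lam ** n for name, n in lengths.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


prior = length_prior({"short": 4, "medium": 6, "long": 11})
for name, p in prior.items():
    print(f"{name}: {p:.3f}")
```

Each extra unit of length multiplies the prior weight by λ, so with 0 < λ < 1 a shorter sentence always receives more prior mass than a longer one.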
For all applications in this paper we set λ = 0.8.

The generative model in Figure 1 can be used for several purposes, including schema learning (inferring a schema s given one or more instances generated from the schema), classification (deciding whether group gnew belongs to a category given one or more instances of the category) and generation (generating a group gnew that belongs to the same category as one or more instances). Our first experiment explores all three of these problems.

2 Experiment 1: Relational classification

Our first experiment is organized around a triad task where participants are shown one example of a category then asked to decide which of two choice examples is more likely to belong to the category. Triad tasks are regularly used by studies of relational categorization, and have been used to argue for the importance of comparison [1]. A comparison-based approach to this task, for instance, might compare the example object to each of the choice objects in order to decide which is the better match. Our first experiment is intended in part to explore whether a schema-learning approach can also account for inferences about triad tasks.

Materials and Method. 18 adults participated for course credit and interacted with a custom-built computer interface. The stimuli were groups of figures that varied along three dimensions (color, size, and ball position, as in Figure 1). Each shape was displayed on a single card, and all groups in Experiment 1 included exactly three cards. The cards in Figure 1 show five different values along each dimension, but Experiment 1 used only three values along each dimension. The experiment included inferences about 10 triads. Participants were told that aliens from a certain planet “enjoy organizing cards into groups,” and that “any group of cards will probably be liked by some aliens and disliked by others.” The ten triad tasks were framed as questions about the preferences of 10 aliens.
Participants were shown a group that Mr X likes (different names were used for the ten triads), then shown two choice groups and told that “Mr X likes one of these groups but not the other.” Participants were asked to select one of the choice groups, then asked to generate another 3-card group that Mr X would probably like. Cards could be added to the screen using an “Add Card” button, and there were three pairs of buttons that allowed each card to be increased or decreased along the three dimensions. Finally, participants were asked to explain in writing “what kind of groups Mr X likes.”

The ten triads used are shown in Figure 2. Each group is represented as a 3 by 3 matrix where rows represent cards and columns show values along the three dimensions. Triad 1, for example, has an example group including three cards that each take value 3 along D1. The first choice group is consistent with this regularity but the second choice group is not. The cards in each group were arrayed vertically on screen, and were initially sorted as shown in Figure 2 (i.e. first by D3, then by D2 and then by D1). The cards could be dragged around on screen, and participants were invited to move them around in order to help them understand each group. The mapping between the three dimensions in each matrix and the three dimensions in the experiment (color, position, and size) was randomized across participants, and the order in which triads were presented was also randomized.

[Figure 2 panels: (a) D1 value always 3; (b) D2 uniform; (c) D2 and D3 aligned; (d) D1 and D3 anti-aligned; (e) Two dimensions aligned; (f) Two dimensions anti-aligned; (g) All dimensions uniform; (h) Some dimension uniform; (i) All dimensions have no repeats; (j) Some dimension has no repeats.]

Figure 2: Human responses and model predictions for the ten triads in Experiment 1. The plot at the left of each panel shows model predictions (white bars) and human preferences (black bars) for the two choice groups in each triad. The plots at the right of each panel summarize the groups created during the generation phase. The 23 elements along the x-axis correspond to the regularities listed in Table 2.

Table 2: Regularities used to code responses to the generation tasks in Experiments 1 and 2.

1. All dimensions aligned
2. Two dimensions aligned
3. D1 and D2 aligned
4. D1 and D3 aligned
5. D2 and D3 aligned
6. All dimensions aligned or anti-aligned
7. Two dimensions anti-aligned
8. D1 and D2 anti-aligned
9. D1 and D3 anti-aligned
10. D2 and D3 anti-aligned
11. All dimensions have no repeats
12. Two dimensions have no repeats
13. One dimension has no repeats
14. D1 has no repeats
15. D2 has no repeats
16. D3 has no repeats
17. All dimensions uniform
18. Two dimensions uniform
19. One dimension uniform
20. D1 uniform
21. D2 uniform
22. D3 uniform
23. D1 value is always 3

Model predictions and results. Let ge be the example group presented in the triad task and g1 and g2 be the two choice groups.
We use our model to compute the relative probability of two hypotheses: h1, which states that ge and g1 are generated from the same schema and that g2 is sampled randomly from all possible groups, and h2, which states that ge and g2 are generated from the same schema (with g1 then playing the role of the randomly sampled group). We set P(h1) = P(h2) = 0.5, and compute posterior probabilities P(h1 | ge, g1, g2) and P(h2 | ge, g1, g2) by integrating over all schemata in the hypothesis space already described. Our model assumes that two groups are considered similar to the extent that they appear to have been generated by the same underlying schema, and is consistent with the generative approach to similarity described by Kemp et al. [19].

Model predictions for the ten triads are shown in Figure 2. In each case, the choice probabilities plotted (white bars) are the posterior probabilities of hypotheses h1 and h2. In nine out of ten cases the best choice according to the model is the most common human response. Responses to triads 2c and 2d support the idea that people are sensitive to relationships between dimensions (i.e. alignment and anti-alignment). Triads 2e and 2f are similar to triads studied by Kotovsky and Gentner [1], and we replicate their finding that people are sensitive to relationships between dimensions even when the dimensions involved vary from group to group.

The one case where human responses diverge from model predictions is shown in Figure 2h. Note that the schema for this triad involves existential quantification over dimensions (some dimension is uniform), and according to our prior P(s) this kind of quantification is no more complex than other kinds of quantification. Future applications of our approach can explore the idea that existential quantification over dimensions (∃Q) is psychologically more complex than universal quantification over dimensions (∀Q) or existential quantification over cards (∃x), and can consider logical languages that incorporate this inductive bias.
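The comparison between h1 and h2 described for the triad task can be sketched with a toy hypothesis space. Everything named here is an assumption for illustration: the three schemata, their sentence lengths, and the unconstrained fallback schema are hypothetical, and groups are assumed to be unordered multisets of cards. Since the randomly sampled group contributes the same factor under both hypotheses and the prior over hypotheses is uniform, the posterior reduces to a ratio of two sums over schemata.

```python
from itertools import combinations_with_replacement, product

CARDS = list(product((1, 2, 3), repeat=3))
GROUPS = list(combinations_with_replacement(CARDS, 3))
LAM = 0.8

# Toy hypothesis space: (predicate, hypothetical sentence length).
SCHEMATA = [
    (lambda g: all(c[0] == 3 for c in g), 4),    # D1 value is always 3
    (lambda g: len({c[1] for c in g}) == 1, 6),  # D2 uniform
    (lambda g: True, 1),                         # unconstrained fallback (toy)
]
SIZES = [sum(1 for g in GROUPS if pred(g)) for pred, _ in SCHEMATA]


def shared_schema_score(ge, gi):
    """Sum over s of P(s) p(ge|s) p(gi|s), up to the prior's normalizer."""
    return sum(
        LAM ** length / size ** 2
        for (pred, length), size in zip(SCHEMATA, SIZES)
        if pred(ge) and pred(gi)
    )


def posterior_h1(ge, g1, g2):
    """P(h1 | ge, g1, g2) with P(h1) = P(h2) = 0.5."""
    m1 = shared_schema_score(ge, g1)
    m2 = shared_schema_score(ge, g2)
    return m1 / (m1 + m2)


ge = ((3, 1, 1), (3, 2, 1), (3, 3, 1))
g1 = ((3, 2, 1), (3, 3, 2), (3, 1, 3))  # also has D1 = 3 throughout
g2 = ((2, 3, 1), (3, 2, 3), (3, 3, 3))  # breaks the D1 regularity
print(round(posterior_h1(ge, g1, g2), 3))
```

The choice group that shares a specific, small-extension schema with the example dominates, because the product p(ge|s) p(gi|s) is largest for schemata that admit few groups.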
To model the generation phase of the experiment we computed the posterior distribution

$$P(g_{new} \mid g_e, g_1, g_2) = \sum_{s,h} P(g_{new} \mid s)\, P(s \mid h, g_e, g_1, g_2)\, P(h \mid g_e, g_1, g_2)$$

where P(h | ge, g1, g2) is the distribution used to model selections in the triad task. Since the space of possible groups is large, we visualize this distribution using a profile that shows the posterior probability assigned to groups consistent with the 23 regularities shown in Table 2. The white bar plots in Figure 2 show profiles predicted by the model, and the black plots immediately above show profiles computed over the groups generated by our 18 participants. In many of the 10 cases the model accurately predicts regularities in the groups generated by people. In case 2c, for example, the model correctly predicts that generated groups will tend to have no repeats along dimensions D2 and D3 (regularities 15 and 16) and that these two dimensions will be aligned (regularities 2 and 5). There are, however, some departures from the model’s predictions, and a notable example occurs in case 2d. Here the model detects the regularity that dimensions D1 and D3 are anti-aligned (regularity 9).
Some groups generated by participants are consistent with this regularity, but people also regularly generate groups where two dimensions are aligned rather than anti-aligned (regularity 2). This result may indicate that some participants are sensitive to relationships between dimensions but do not consider the difference between a positive relationship (alignment) and an inverse relationship (anti-alignment) especially important.

[Figure 3 panels: (a) All dimensions aligned; (b) D2 and D3 aligned; (c) D1 has no repeats, D2 and D3 uniform; (d) D2 uniform; (e) All dimensions uniform; (f) All dimensions have no repeats.]

Figure 3: Human responses and model predictions for the six cases in Experiment 2. In (a) and (b), the 4 cards used for the completion and generation phases are shown on either side of the dashed line (completion cards on the left). In the remaining cases, the same 4 cards were used for both phases. The plots at the right of each panel show model predictions (white bars) and human responses (black bars) for the generation task. In each case, the 23 elements along each x-axis correspond to the regularities listed in Table 2. The remaining plots show responses to the completion task. There are 125 possible responses, and the four responses shown always include the top two human responses and the top two model predictions.
Kotovsky and Gentner [1] suggest that comparison can explain how people respond to triad tasks, although they do not provide a computational model that can be compared with our approach. It is less clear how comparison might account for our generation data, and our next experiment considers a one-shot generation task that raises even greater challenges for a comparison-based approach.

3 Experiment 2: One-shot schema learning

As described already, comparison involves constructing mappings between pairs of category instances. In some settings, however, learners make confident inferences given a single instance of a category [15, 20], and it is difficult to see how comparison could play a major role when only one instance is available. Models that rely on abstraction, however, can naturally account for one-shot relational learning, and we designed a second experiment to evaluate this aspect of our approach.

Several previous studies have explored one-shot relational learning. Holyoak and Thagard [21] developed a study of analogical reasoning using stories as stimuli and found little evidence of one-shot schema learning. Ahn et al. [11] demonstrated, however, that one-shot learning can be achieved with complex materials such as stories, and modeled this result using explanation-based learning. Here we use much simpler stimuli and explore a probabilistic approach to one-shot learning.

Materials and Method. 18 adults participated for course credit. The same individuals completed Experiments 1 and 2, and Experiment 2 was always run before Experiment 1. The same computer interface was used in both experiments, and the only important difference was that the figures in Experiment 2 could now take five values along each dimension rather than three. The experiment included two phases. During the generation phase, participants saw a 4-card group that Mr X liked and were asked to generate two 5-card groups that Mr X would probably like.
During the completion phase, participants were shown four members of a 5-card group and were asked to generate the missing card. The stimuli used in each phase are shown in Figure 3. In the first two cases, slightly different stimuli were used in the generation and completion phases, and in all remaining cases the same set of four cards was used in both phases. All participants responded to the six generation questions before answering the six completion questions.

Model predictions and results. The generation phase is modeled as in Experiment 1, but now the posterior distribution P(gnew | ge) is computed after observing a single instance of a category. The human responses in Figure 3 (black bars) are consistent with the model in all cases, and confirm that a single example can provide sufficient evidence for learners to acquire a relational category. For example, the most common response in case 3a was the 5-card group shown in Figure 1, a group with all three dimensions aligned.

To model the completion phase, let oe represent a partial observation of group ge. Our model infers which card is missing from ge by computing the posterior distribution

$$P(g_e \mid o_e) \propto P(o_e \mid g_e) \sum_{s} P(g_e \mid s)\, P(s)$$

where P(oe | ge) captures the idea that oe is generated by randomly concealing one component of ge. The white bars in Figure 3 show model predictions, and in five out of six cases the best response according to the model is the same as the most common human response. In the remaining case (Figure 3d) the model generates a diffuse distribution over all cards with value 3 on dimension 2, and all human responses satisfy this regularity.

4 Conclusion

We presented a generative model that helps to explain how relational categories are learned and used. Our approach captures relational regularities using a logical language, and helps to explain how schemata formulated in this language can be learned from observed data.
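The completion-phase computation from Experiment 2 can be sketched on a reduced domain. This is an illustration under several assumptions, not the authors' implementation: it uses 3-card groups with three values per dimension (so that all groups can be enumerated), a hypothetical three-schema hypothesis space with made-up sentence lengths, and groups treated as unordered multisets of cards.

```python
from collections import Counter
from itertools import combinations_with_replacement, product

CARDS = list(product((1, 2, 3), repeat=3))
GROUPS = list(combinations_with_replacement(CARDS, 3))
LAM = 0.8

# Toy hypothesis space: (predicate, hypothetical sentence length).
SCHEMATA = [
    (lambda g: all(c[0] == 3 for c in g), 4),    # D1 value is always 3
    (lambda g: len({c[1] for c in g}) == 1, 6),  # D2 uniform
    (lambda g: True, 1),                         # unconstrained fallback (toy)
]
SIZES = [sum(1 for g in GROUPS if pred(g)) for pred, _ in SCHEMATA]


def group_score(group):
    """Sum over s of P(g|s) P(s), up to the prior's normalizing constant."""
    return sum(
        LAM ** length / size
        for (pred, length), size in zip(SCHEMATA, SIZES)
        if pred(group)
    )


def completion_posterior(observed):
    """Posterior over the concealed card given a tuple of visible cards."""
    scores = {}
    for card in CARDS:
        group = tuple(sorted(observed + (card,)))
        # P(o|g): chance that hiding one uniformly chosen component of the
        # full group leaves exactly the observed cards.
        p_obs = Counter(group)[card] / len(group)
        scores[card] = p_obs * group_score(group)
    total = sum(scores.values())
    return {card: s / total for card, s in scores.items()}


post = completion_posterior(((3, 1, 2), (3, 2, 2)))
mass_d1_is_3 = sum(p for card, p in post.items() if card[0] == 3)
print(round(mass_d1_is_3, 3))
```

With two visible cards that both take value 3 on D1, most of the posterior mass falls on missing cards that also take value 3 on D1, which mirrors the kind of diffuse-but-constrained prediction described for Figure 3d.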
Our approach differs in several respects from previous accounts of relational categorization [1, 5, 10, 22]. First, we focus on abstraction rather than comparison. Second, we consider tasks where participants must generate examples of categories [16] rather than simply classify existing examples. Finally, we provide a formal account that helps to explain how relational categories can be learned from a single instance.

Our approach can be developed and extended in several ways. For simplicity, we implemented our model by working with a finite space of several million schemata, but future work can consider hypothesis spaces that assign non-zero probability to all regularities that can be formulated in the language we described. The specific logical language used here is only a starting point, and future work can aim to develop languages that provide a more faithful account of human inductive biases. Finally, we worked with a domain that provides one of the simplest ways to address core questions such as one-shot learning. Future applications of our general approach can consider domains that include more than three dimensions and a richer space of relational regularities.

Relational learning and analogical reasoning are tightly linked, and hierarchical generative models provide a promising approach to both problems. We focused here on relational categorization, but future studies can explore whether probabilistic accounts of schema learning can help to explain the inductive inferences typically considered by studies of analogical reasoning. Although there are many models of analogical reasoning, there are few that pursue a principled probabilistic approach, and the hierarchical Bayesian approach may help to fill this gap in the literature.

Acknowledgments

We thank Maureen Satyshur for running the experiments. This work was supported in part by NSF grant CDI-0835797.

References

[1] L. Kotovsky and D. Gentner. Comparison and categorization in the development of relational similarity. Child Development, 67:2797–2822, 1996.
[2] D. Gentner and A. B. Markman. Structure mapping in analogy and similarity. American Psychologist, 52:45–56, 1997.
[3] D. Gentner and J. Medina. Similarity and the development of rules. Cognition, 65:263–297, 1998.
[4] B. Falkenhainer, K. D. Forbus, and D. Gentner. The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41:1–63, 1989.
[5] J. E. Hummel and K. J. Holyoak. A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110:220–264, 2003.
[6] M. Mitchell. Analogy-making as perception: a computer model. MIT Press, Cambridge, MA, 1993.
[7] D. R. Hofstadter and the Fluid Analogies Research Group. Fluid concepts and creative analogies: computer models of the fundamental mechanisms of thought. 1995.
[8] W. V. O. Quine and J. Ullian. The Web of Belief. Random House, New York, 1978.
[9] J. Skorstad, D. Gentner, and D. Medin. Abstraction processes during concept learning: a structural view. In Proceedings of the 10th Annual Conference of the Cognitive Science Society, pages 419–425. 2009.
[10] D. Gentner and J. Loewenstein. Relational language and relational thought. In E. Amsel and J. P. Byrnes, editors, Language, literacy and cognitive development: the development and consequences of symbolic communication, pages 87–120. 2002.
[11] W. Ahn, W. F. Brewer, and R. J. Mooney. Schema acquisition from a single example. Journal of Experimental Psychology: Learning, Memory and Cognition, 18(2):391–412, 1992.
[12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis. Chapman & Hall, New York, 2nd edition, 2003.
[13] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008.
[14] S. Kok and P. Domingos. Learning the structure of Markov logic networks. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[15] J. Feldman. The structure of perceptual categories. Journal of Mathematical Psychology, 41:145–170, 1997.
[16] A. Jern and C. Kemp. Category generation. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 130–135. Cognitive Science Society, Austin, TX, 2009.
[17] D. Conklin and I. H. Witten. Complexity-based induction. Machine Learning, 16(3):203–225, 1994.
[18] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001.
[19] C. Kemp, A. Bernstein, and J. B. Tenenbaum. A generative theory of similarity. In B. G. Bara, L. Barsalou, and M. Bucciarelli, editors, Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1132–1137. Lawrence Erlbaum Associates, 2005.
[20] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Theory acquisition and the language of thought. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pages 1606–1611. Cognitive Science Society, Austin, TX, 2008.
[21] K. J. Holyoak and P. Thagard. Analogical mapping by constraint satisfaction. Cognitive Science, 13(3):295–355, 1989.
[22] L. A. A. Doumas, J. E. Hummel, and C. M. Sandhofer. A theory of the discovery and predication of relational concepts. Psychological Review, 115(1):1–43, 2008.
[23] M. L. Gick and K. J. Holyoak. Schema induction and analogical transfer. Cognitive Psychology, 15:1–38, 1983.

4 0.070003361 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

Author: Chong Wang, David M. Blei

Abstract: The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efficiently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and handwritten digit modeling, where we show we can adapt the nCRP to continuous data as well.

5 0.060937628 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov

Abstract: We consider the problem of learning probabilistic models for complex relational structures between various types of objects. A model can help us “understand” a dataset of relational facts in at least two ways, by finding interpretable structure in the data, and by supporting predictions, or inferences about whether particular unobserved relations are likely to be true. Often there is a tradeoff between these two aims: cluster-based models yield more easily interpretable representations, while factorization-based approaches have given better predictive performance on large data sets. We introduce the Bayesian Clustered Tensor Factorization (BCTF) model, which embeds a factorized representation of relations in a nonparametric Bayesian clustering framework. Inference is fully Bayesian but scales well to large data sets. The model simultaneously discovers interpretable clusters and yields predictive performance that matches or beats previous probabilistic models for relational data.

6 0.058427349 104 nips-2009-Group Sparse Coding

7 0.056407996 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

8 0.052594163 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

9 0.052572999 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

10 0.052488603 67 nips-2009-Directed Regression

11 0.045779191 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

12 0.040823918 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

13 0.039941885 46 nips-2009-Bilinear classifiers for visual recognition

14 0.036008988 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model

15 0.035483528 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

16 0.032851178 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison

17 0.030678608 182 nips-2009-Optimal Scoring for Unsupervised Learning

18 0.030541983 226 nips-2009-Spatial Normalized Gamma Processes

19 0.030288583 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

20 0.0291411 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.11), (1, -0.001), (2, -0.031), (3, -0.031), (4, 0.027), (5, -0.048), (6, 0.034), (7, -0.016), (8, -0.001), (9, 0.034), (10, 0.038), (11, -0.014), (12, -0.017), (13, -0.071), (14, -0.016), (15, -0.021), (16, 0.073), (17, 0.032), (18, -0.056), (19, 0.026), (20, -0.018), (21, 0.017), (22, 0.028), (23, 0.009), (24, -0.006), (25, 0.096), (26, 0.008), (27, -0.084), (28, 0.147), (29, 0.067), (30, 0.118), (31, -0.097), (32, 0.08), (33, -0.076), (34, 0.006), (35, 0.085), (36, -0.155), (37, 0.059), (38, 0.038), (39, -0.133), (40, 0.035), (41, 0.107), (42, -0.113), (43, -0.162), (44, -0.041), (45, 0.08), (46, -0.136), (47, -0.009), (48, -0.001), (49, -0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91549003 195 nips-2009-Probabilistic Relational PCA

Author: Wu-jun Li, Dit-Yan Yeung, Zhihua Zhang

2 0.58148944 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov

3 0.57821673 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

Author: Kurt Miller, Michael I. Jordan, Thomas L. Griffiths

4 0.53623706 21 nips-2009-Abstraction and Relational learning

Author: Charles Kemp, Alan Jern

5 0.42071319 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

Author: Andrew McCallum, Karl Schultz, Sameer Singh

Abstract: Discriminatively trained undirected graphical models have had wide empirical success, and there has been increasing interest in toolkits that ease their application to complex relational data. The power in relational models is in their repeated structure and tied parameters; at issue is how to define these structures in a powerful and flexible way. Rather than using a declarative language, such as SQL or first-order logic, we advocate using an imperative language to express various aspects of model structure, inference, and learning. By combining the traditional, declarative, statistical semantics of factor graphs with imperative definitions of their construction and operation, we allow the user to mix declarative and procedural domain knowledge, and also gain significant efficiencies. We have implemented such imperatively defined factor graphs in a system we call FACTORIE, a software library for an object-oriented, strongly-typed, functional language. In experimental comparisons to Markov Logic Networks on joint segmentation and coreference, we find our approach to be 3-15 times faster while reducing error by 20-25%, achieving a new state of the art.

6 0.35320193 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

7 0.34718972 67 nips-2009-Directed Regression

8 0.31484959 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison

9 0.30919242 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

10 0.29359972 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

11 0.29237029 76 nips-2009-Efficient Learning using Forward-Backward Splitting

12 0.28472435 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

13 0.28343225 46 nips-2009-Bilinear classifiers for visual recognition

14 0.27143624 182 nips-2009-Optimal Scoring for Unsupervised Learning

15 0.27067277 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

16 0.26178652 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization

17 0.26134562 33 nips-2009-Analysis of SVM with Indefinite Kernels

18 0.25902638 3 nips-2009-AUC optimization and the two-sample problem

19 0.25763705 104 nips-2009-Group Sparse Coding

20 0.25674948 190 nips-2009-Polynomial Semantic Indexing

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.033), (25, 0.067), (35, 0.063), (36, 0.093), (39, 0.062), (55, 0.01), (58, 0.109), (71, 0.044), (81, 0.013), (86, 0.077), (91, 0.015), (97, 0.287)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76773006 195 nips-2009-Probabilistic Relational PCA

Author: Wu-jun Li, Dit-Yan Yeung, Zhihua Zhang

2 0.57049531 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

Author: Piyush Rai, Hal Daume

Abstract: Canonical Correlation Analysis (CCA) is a useful technique for modeling dependencies between two (or more) sets of variables. Building upon the recently suggested probabilistic interpretation of CCA, we propose a nonparametric, fully Bayesian framework that can automatically select the number of correlation components, and effectively capture the sparsity underlying the projections. In addition, given (partially) labeled data, our algorithm can also be used as a (semi)supervised dimensionality reduction technique, and can be applied to learn useful predictive features in the context of learning a set of related tasks. Experimental results demonstrate the efficacy of the proposed approach for both CCA as a stand-alone problem, and when applied to multi-label prediction.

3 0.56267858 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

Author: Jaakko Luttinen, Alexander T. Ihler

Abstract: We present a probabilistic factor analysis model which can be used for studying spatio-temporal datasets. The spatial and temporal structure is modeled by using Gaussian process priors both for the loading matrix and the factors. The posterior distributions are approximated using the variational Bayesian framework. High computational cost of Gaussian process modeling is reduced by using sparse approximations. The model is used to compute the reconstructions of the global sea surface temperatures from a historical dataset. The results suggest that the proposed model can outperform the state-of-the-art reconstruction systems.

4 0.55908096 104 nips-2009-Group Sparse Coding

Author: Samy Bengio, Fernando Pereira, Yoram Singer, Dennis Strelow

Abstract: Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification.

5 0.55873448 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

Author: Ilya Sutskever, Joshua B. Tenenbaum, Ruslan Salakhutdinov

6 0.5567494 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

7 0.55630219 70 nips-2009-Discriminative Network Models of Schizophrenia

8 0.5554511 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

9 0.55469972 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

10 0.55364645 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

11 0.55308527 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

12 0.55196685 100 nips-2009-Gaussian process regression with Student-t likelihood

13 0.55087912 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

14 0.55036646 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

15 0.54987663 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

16 0.54937488 97 nips-2009-Free energy score space

17 0.54934657 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

18 0.54923803 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

19 0.54871798 137 nips-2009-Learning transport operators for image manifolds

20 0.5486452 148 nips-2009-Matrix Completion from Power-Law Distributed Samples