nips nips2008 nips2008-236 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel M. Roy, Yee W. Teh
Abstract: We describe a novel class of distributions, called Mondrian processes, which can be interpreted as probability distributions over kd-tree data structures. Mondrian processes are multidimensional generalizations of Poisson processes and this connection allows us to construct multidimensional generalizations of the stickbreaking process described by Sethuraman (1994), recovering the Dirichlet process in one dimension. After introducing the Aldous-Hoover representation for jointly and separately exchangeable arrays, we show how the process can be used as a nonparametric prior distribution in Bayesian models of relational data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Mondrian processes are multidimensional generalizations of Poisson processes and this connection allows us to construct multidimensional generalizations of the stickbreaking process described by Sethuraman (1994), recovering the Dirichlet process in one dimension. [sent-7, score-0.27]
2 After introducing the Aldous-Hoover representation for jointly and separately exchangeable arrays, we show how the process can be used as a nonparametric prior distribution in Bayesian models of relational data. [sent-8, score-0.379]
3 A common Bayesian approach in the one-dimensional setting is to assume there is cluster structure and use a mixture model with a prior distribution over partitions of the objects in X. [sent-15, score-0.151]
4 A similar approach for relational data would naïvely require a prior distribution on partitions of the product space X × Y = {(x, y) | x ∈ X, y ∈ Y }. [sent-16, score-0.29]
5 For example, one could place a Chinese restaurant process (CRP) prior on partitions of X × Y . [sent-19, score-0.177]
6 An unsatisfactory implication of this choice is that the distribution on partitions of (Ri,j ) is exchangeable at the level of individual entries, ignoring the shared row and column structure of the array. [sent-20, score-0.121]
7 Stochastic block models2 place prior distributions on partitions of X and Y separately, which can be interpreted as inducing a distribution on partitions of the product space by considering the product of the partitions. [sent-23, score-0.364]
8 By arranging the rows and columns of (Ri,j ) so that clustered objects have adjacent indices, such partitions look like regular grids (Figure 1). [sent-24, score-0.19]
9 Annotated hierarchies (Roy et al., 2007) generate random partitions which are not constrained to be regular grids (Figure 1). [sent-28, score-0.139]
10 Motivated by the need for a consistent distribution on partitions of product spaces with more structure than classic block models, we define a class of nonparametric distributions we have named Mondrian processes after Piet Mondrian and his abstract grid-based paintings. [sent-30, score-0.26]
11 Mondrian processes are random partitions on product spaces not constrained to be regular grids. [sent-31, score-0.2]
12 We begin by introducing the notion of partially exchangeable arrays by Aldous (1981) and Hoover (1979), a generalization of exchangeability on sequences appropriate for modeling relational data. [sent-34, score-0.391]
13 We then define the Mondrian process, highlight a few of its elegant properties, and describe two nonparametric models for relational data that use the Mondrian process as a prior on partitions. [sent-42, score-0.215]
14 If x1 , x2 , . . . is an exchangeable sequence, then (by de Finetti's theorem) there exists a random parameter θ such that the sequence is conditionally iid given θ: p(x1 , . . . , xn ) = ∫ [ ∏_{i=1}^{n} px (xi | θ) ] pθ (θ) dθ. (1) [sent-47, score-0.207]
15 That is, exchangeable sequences arise as a mixture of iid sequences, where the mixing distribution is pθ (θ). [sent-50, score-0.187]
16 In this section we describe notions of exchangeability for relational data originally proposed by Aldous (1981) and Hoover (1979) in the context of exchangeable arrays. [sent-52, score-0.341]
17 Kallenberg (2005) significantly expanded on the concept, and Diaconis and Janson (2007) showed a strong correspondence between such exchangeable relations and a notion of limits on graph structures (Lovász and Szegedy, 2006). [sent-53, score-0.229]
18 We say that R is separately exchangeable if its distribution is invariant to separate permutations on its rows and columns. [sent-60, score-0.164]
19 Aldous (1981) and Hoover (1979) showed that separately exchangeable relations can always be represented in the following way: each object i (and j) has a latent representation ξi (ηj ) drawn iid from some distribution pξ (pη ); independently let θ be an additional random parameter. [sent-62, score-0.252]
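The display referred to below as (3) does not survive in this extract; under the setup just described, it has roughly the following form (our reconstruction, in the paper's notation, with pR the per-entry likelihood):

    p(R_{1:n,1:m}) = \int p_\theta(\theta) \left[ \prod_{i=1}^{n} p_\xi(\xi_i) \right] \left[ \prod_{j=1}^{m} p_\eta(\eta_j) \right] \prod_{i,j} p_R(R_{i,j} \mid \theta, \xi_i, \eta_j) \, d\theta \, d\xi_{1:n} \, d\eta_{1:m}    (3)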
20 We say R is jointly exchangeable if it is invariant to jointly permuting rows and columns; that is, for each n ≥ 1 and each permutation π ∈ Sn we have p(R1:n,1:n ) = p(Rπ(1:n),π(1:n) ) (4) Such jointly exchangeable relations also have a form similar to (3). [sent-67, score-0.393]
21 The first impression from (5) is that joint exchangeability implies a more restricted functional form than separately exchangeable (3). [sent-69, score-0.206]
22 In fact, the reverse holds—(5) means that the latent representations of row i and column i need not be independent, and that Ri,j and Rj,i need not be conditionally independent given the row and column representations, while (3) assumes independence of both. [sent-70, score-0.122]
23 The above Aldous-Hoover representation serves as the theoretical foundation for hierarchical Bayesian modeling of exchangeable relational data, just as de Finetti’s representation serves as a foundation for the modeling of exchangeable sequences. [sent-74, score-0.463]
24 Block models (Kemp et al., 2006) induce regular partitions on the product space, introducing structure where the data do not support it. [sent-81, score-0.173]
25 (2) Axis-aligned partitions, like those produced by annotated hierarchies and the Mondrian process, provide (a posteriori) resolution only where it is needed. [sent-82, score-0.113]
26 (4) We can visualize the sequential hierarchical process by spreading the cuts out over time. [sent-84, score-0.155]
27 (5) Mondrian process with beta Lévy measure, µ(dx) = x^{-1} dx on [0, 1]^2 . [sent-86, score-0.115]
28 3 The Mondrian Process The Mondrian process can be expressed as a recursive generative process that randomly makes axis-aligned cuts, partitioning the underlying product space in a hierarchical fashion akin to decision trees or kd-trees. [sent-89, score-0.176]
29 The implication of consistency is that we can extend the Mondrian process to infinite spaces and use it as a nonparametric prior for modeling exchangeable relational data. [sent-91, score-0.399]
30 1 The one dimensional case The simplest space to introduce the Mondrian process is the unit interval [0, 1]. [sent-93, score-0.11]
31 Each cut costs a random amount, eventually exhausting the budget and resulting in a finite partition m of the unit interval. [sent-95, score-0.211]
32 The cost, EI , to cut an interval I is exponentially distributed with inverse mean given by the length of the interval. [sent-96, score-0.139]
33 If λ < 0, we make no cuts and the process returns the trivial partition m = {[0, 1]}. [sent-99, score-0.225]
34 Otherwise, we make a cut uniformly at random, splitting the unit interval into two subintervals A and B. [sent-100, score-0.167]
35 The process recurses independently on A and B, with independent budgets λ′ (the budget remaining after paying the cost of the cut), producing partitions mA and mB , which are then combined into a partition m = mA ∪ mB of [0, 1]. [sent-101, score-0.247]
36 The resulting cuts can be shown to be a Poisson (point) process. [sent-102, score-0.099]
37 Unlike the standard description of the Poisson process, the cuts in this “break and branch” process are organized in a hierarchy. [sent-103, score-0.155]
38 As the Poisson process is a fundamental building block for random measures such as the Dirichlet process (DP), we will later exploit this relationship to build various multidimensional generalizations. [sent-104, score-0.193]
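To make the one-dimensional "break and branch" recursion concrete, here is a minimal Python sketch of the homogeneous case described above; the function and variable names are ours, not the paper's.

    import random

    def mondrian_1d(budget, a=0.0, b=1.0):
        # Cost of the next cut: exponential with inverse mean (i.e. rate) equal to
        # the interval length, so longer intervals attract more cuts.
        budget -= random.expovariate(b - a)
        if budget < 0:
            return None                        # budget exhausted: trivial partition {(a, b)}
        x = random.uniform(a, b)               # cut location, uniform on (a, b)
        return (x,
                mondrian_1d(budget, a, x),     # recurse ("branch") on the left piece
                mondrian_1d(budget, x, b))     # recurse on the right piece

    def leaves(m, a=0.0, b=1.0):
        # Read off the partition of (a, b) from the tree of cuts.
        if m is None:
            return [(a, b)]
        x, left, right = m
        return leaves(left, a, x) + leaves(right, x, b)

Calling leaves(mondrian_1d(2.0)) returns the random partition of [0, 1]; the cut locations behave like the Poisson point process noted above, but the recursion additionally records their hierarchy.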
39 2 Generalizations to higher dimensions and trees We begin in two dimensions by describing the generative process for a Mondrian process m ∼ MP(λ, (a, A), (b, B)) on the rectangle (a, A)×(b, B). [sent-106, score-0.142]
40 If λ < 0, the process halts, and returns the trivial partition {(a, A) × (b, B)}. [sent-108, score-0.126]
41 Otherwise, an axis-aligned cut is made uniformly at random along the combined lengths of (a, A) and (b, B); that is, the cut lies along a particular dimension with probability proportional to its length, and is drawn uniformly within that interval. [sent-109, score-0.226]
42 For example, a cut x ∈ (a, A) splits the interval into (a, x) and (x, A). [sent-114, score-0.139]
43 Like the one-dimensional special case, the λ parameter controls the number of cuts, with the process more likely to cut rectangles with large perimeters. [sent-117, score-0.169]
44 In higher dimensions, the cost E to make an additional cut is exponentially distributed with rate given by the sum over all dimensions of the interval lengths. [sent-119, score-0.139]
45 Similarly, the cut point is chosen uniformly at random from all intervals, splitting only that interval in the recursion. [sent-120, score-0.139]
46 Like non-homogeneous Poisson processes, the cut point need not be drawn uniformly at random; more generally, cuts along dimension d can be drawn from a non-atomic measure µd , as with the beta Lévy measure used later. (In this paper we shall always mean infinite exchangeability when we state exchangeability.) [sent-121, score-0.155]
47 The key property of intervals that the Mondrian process relies upon is that any point cuts the space into one-dimensional, simply connected pieces. [sent-125, score-0.175]
48 Trees also have this property: a cut along an edge splits a tree into two trees. [sent-126, score-0.145]
49 We denote a Mondrian process m with rate λ on a product of one-dimensional, simply-connected domains Θ1 × · · · × ΘD by m ∼ MP(λ, Θ1 , . . . , ΘD ). [sent-127, score-0.112]
50 Thus a draw from the Mondrian process is either a trivial partition or a tuple m = ⟨d, x, λ′ , m< , m> ⟩, representing a cut at x along the d’th dimension Θd , with nested Mondrians m< and m> on either side of the cut. [sent-138, score-0.261]
51 Therefore, m is itself a tree of axis-aligned cuts (a kd-tree data structure), with the leaves of the tree forming the partition of the original product space. [sent-139, score-0.267]
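A hedged sketch of the same recursion on a product of intervals (the homogeneous case, with our own names); it returns None for the trivial partition, or a tuple (d, x, left, right) mirroring the ⟨d, x, λ′, m<, m>⟩ structure described above, with the leaves of the tree forming the partition of the box.

    import random

    def mondrian(budget, intervals):
        # intervals: list of (low, high) pairs, one per dimension.
        lengths = [hi - lo for lo, hi in intervals]
        budget -= random.expovariate(sum(lengths))   # cut cost ~ Exp(sum of side lengths)
        if budget < 0:
            return None                              # trivial partition of the box
        # Dimension chosen with probability proportional to its length,
        # cut location uniform within that interval.
        d = random.choices(range(len(intervals)), weights=lengths)[0]
        lo, hi = intervals[d]
        x = random.uniform(lo, hi)
        left = list(intervals)
        left[d] = (lo, x)
        right = list(intervals)
        right[d] = (x, hi)
        return (d, x, mondrian(budget, left), mondrian(budget, right))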
52 Conditional Independencies: The generative process for the Mondrian produces a tree of cuts, where each subtree is itself a draw from a Mondrian. [sent-140, score-0.11]
53 Consistency: The Mondrian process satisfies an important self-consistency property: given a draw from a Mondrian on some domain, the partition on any subdomain has the same distribution as if we sampled a Mondrian process directly on that subdomain. [sent-144, score-0.225]
54 The restriction ρ(m, Φ1 , . . . , ΦD ) of m to Φ1 × · · · × ΦD is the subtree of cuts within Φ1 × · · · × ΦD . [sent-152, score-0.099]
55 We define restrictions inductively: if there are no cuts in m (i.e., m is the trivial partition), the restriction is the trivial partition of Φ1 × · · · × ΦD ; otherwise, the first cut of m is kept if it passes through the subdomain and skipped if it falls outside it, recursing on the appropriate subtree(s) (see the sketch below). [sent-153, score-0.099]
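A sketch of the restriction operation just described, reusing the (d, x, left, right) tuple convention of the sampler above; this is our reading of the truncated inductive definition, not code from the paper.

    def restrict(m, box):
        # m: a Mondrian tree as returned by `mondrian` above (None or (d, x, left, right)).
        # box: the subdomain Phi_1 x ... x Phi_D, a list of (low, high) pairs.
        if m is None:
            return None                      # no cuts: the restriction is trivial too
        d, x, left, right = m
        lo, hi = box[d]
        if x <= lo:                          # the cut misses the box on one side,
            return restrict(right, box)      # so only one subtree can intersect it
        if x >= hi:
            return restrict(left, box)
        # The cut passes through the box: keep it and restrict both children.
        left_box = list(box)
        left_box[d] = (lo, x)
        right_box = list(box)
        right_box[d] = (x, hi)
        return (d, x, restrict(left, left_box), restrict(right, right_box))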
56 A consequence of this consistency property is that we can now use the Daniell-Kolmogorov extension theorem to extend the Mondrian process to σ-finite domains (those that can be written as a countable union of finite domains). [sent-191, score-0.098]
57 For example, from a Mondrian process on products of intervals, we can construct a Mondrian process on all of RD . [sent-192, score-0.112]
58 Note that if the domains have infinite measure, the tree of cuts will be infinitely deep with no root and infinitely many leaves (being the infinite partition of the product space). [sent-193, score-0.257]
59 However, the restriction of the tree to any given finite subdomain will be finite with a root (with probability one). [sent-194, score-0.1]
60 But since µ1 is non-atomic, µ1 ({y}) = 0 thus ρ will not have any cuts in the first domain (with probability 1). [sent-202, score-0.099]
61 Figure 2: Modeling a Mondrian with a Mondrian: A posterior sample given relational data created from an actual Mondrian painting. [sent-211, score-0.135]
62 These synthetic data were generated by fitting a regular 6 × 7 point array over the painting (6 row objects, 7 column objects), and using the blocks in the painting to determine the block structure of these 42 relations. [sent-214, score-0.221]
63 We then sampled 18 relational arrays with this block structure. [sent-215, score-0.239]
64 The colors are for visual effect only as the partitions are contiguous rectangles. [sent-217, score-0.143]
65 Each point represents a relation Ri,j ; each row of points shows the relations (Ri,· ) for an object ξi , and similarly for columns. [sent-219, score-0.117]
66 (4) Induced partition on the (discrete) relational array, matching the painting. [sent-221, score-0.205]
67 (5) Partitioned and permuted relational data showing block structure. [sent-222, score-0.189]
68 To do so, we have to consider three possibilities: when m contains no cuts, when the first cut of m is in ρ, and when the first cut of m is above ρ. [sent-228, score-0.226]
69 Fortunately, the probability of each of these events can be computed easily; doing so amounts to drawing an exponential sample E ∼ Exp(Σd µd (Θd \ Φd )) and comparing it against the diminished rate after the first cut in ρ. [sent-229, score-0.113]
70 if ρ has no cuts then λ′ ← 0 else ⟨d′ , x′ , λ′ , ρ< , ρ> ⟩ ← ρ. [sent-240, score-0.099]
71 As an example, the expected number of slices along each dimension of (0, A) × (0, B) is λA and λB, while the expected total number of partitions is (1 + λA)(1 + λB). [sent-266, score-0.121]
72 Interestingly, this is also the expected number of partitions in a biclustering model where we first have two independent Poisson processes with rate λ partition (0, A) and (0, B), and then form the product partition of (0, A) × (0, B). [sent-267, score-0.322]
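As a quick check of these formulas under an assumed budget of λ = 2 on the unit square (A = B = 1): the expected number of slices along each dimension is λA = λB = 2, and the expected number of blocks is (1 + λA)(1 + λB) = 3 × 3 = 9, matching the product of two independent rate-2 Poisson partitions of (0, 1), each of which has 1 + λ pieces in expectation.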
73 The colors are for visual effect only as the partitions are contiguous rectangles. [sent-270, score-0.143]
74 Note that partitions are not necessarily contiguous; we use color to identify partitions. [sent-317, score-0.121]
75 The partition structure is related to the annotated hierarchies model (Roy et al., 2007). [sent-318, score-0.127]
76 (3) A sequence of cuts; each cut separates a subtree. [sent-321, score-0.113]
77 5 Relational Modeling To illustrate how the Mondrian process can be used to model relational data, we describe two nonparametric block models for exchangeable relations. [sent-323, score-0.433]
78 Recall the Aldous-Hoover representation (θ, ξi , ηj , pR ) for exchangeable arrays. [sent-325, score-0.164]
79 Using a Mondrian process with beta Lévy measure µ(dx) = αx^{-1} dx, we first sample a random partition of the unit square into blocks and assign each block a probability: M ∼ MP(λ, [0, 1], [0, 1]); φS | M ∼ Beta(a0 , a1 ), ∀S ∈ M. [sent-326, score-0.287]
80 Here (6) slices up the unit square into blocks and (7) gives each block S a probability φS . The pair (M, φ) plays the role of θ in the Aldous-Hoover representation. [sent-327, score-0.102]
81 Each row i then receives a shared x coordinate ξi (8) and each column j a shared y coordinate ηj (9). Let Sij be the block S ∈ M such that (ξi , ηj ) ∈ S. [sent-335, score-0.105]
82 Each relation is then drawn as Rij ∼ Bernoulli(φSij ) (10). This model clusters together relations whose (ξi , ηj ) pairs fall in the same block of the Mondrian partition and models each cluster with a beta-binomial likelihood model. [sent-342, score-0.135]
83 By mirroring the Aldous-Hoover representation, we guarantee that R is exchangeable and that there is no order dependence. [sent-343, score-0.135]
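For illustration, a hedged end-to-end sketch of this generative model, reusing the `mondrian` sampler from the earlier sketch; for simplicity it uses the homogeneous (Lebesgue) Mondrian on the unit square rather than the beta Lévy measure of (6), and all names, defaults and the budget value are our own choices.

    import random

    def block_of(m, box, point):
        # Walk the Mondrian tree to the leaf block containing `point`.
        if m is None:
            return tuple(box)
        d, x, left, right = m
        lo, hi = box[d]
        child = list(box)
        if point[d] < x:
            child[d] = (lo, x)
            return block_of(left, child, point)
        child[d] = (x, hi)
        return block_of(right, child, point)

    def sample_relations(n_rows, n_cols, budget=2.0, a0=1.0, a1=1.0):
        box = [(0.0, 1.0), (0.0, 1.0)]
        M = mondrian(budget, box)                       # random partition of the unit square
        xi = [random.random() for _ in range(n_rows)]   # shared x coordinate per row
        eta = [random.random() for _ in range(n_cols)]  # shared y coordinate per column
        phi = {}                                        # phi_S ~ Beta(a0, a1), one per block S
        R = [[0] * n_cols for _ in range(n_rows)]
        for i in range(n_rows):
            for j in range(n_cols):
                S = block_of(M, box, (xi[i], eta[j]))
                if S not in phi:
                    phi[S] = random.betavariate(a0, a1)
                R[i][j] = int(random.random() < phi[S])  # R_ij ~ Bernoulli(phi_S)
        return R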
84 This is similar to the IRM/IHRM (Kemp et al., 2006), where rows and columns are first clustered using a CRP prior; each relation Rij is then conditionally independent of the others given the clusters that row i and column j belong to. [sent-346, score-0.116]
85 If we replace (6) with M ∼ MP(λ, [0, 1]) × MP(λ, [0, 1]), a product of partitions of unit intervals (11), then we recover the same marginal distribution over relations as the IRM/IHRM. [sent-348, score-0.268]
86 To see this, recall that a Mondrian process in one dimension produces a partition whose cut points follow a Poisson point process. [sent-349, score-0.239]
87 The partitions induced by a Poisson point process on [0, 1] with the beta Lévy measure have the same distribution as those in the stick-breaking construction of the DP. [sent-353, score-0.115]
88 We can also construct an exchangeable variant of the Annotated Hierarchies model (a hierarchical block model) by moving from the unit square to a product of random trees drawn from Kingman’s coalescent prior (Kingman, 1982a). [sent-358, score-0.33]
89 For each dimension, sample a tree (12); partition the cross product of the trees (13); each block S gets a probability φS (14). Let Sij be the subset S ∈ M in which the leaves (i, j) fall. [sent-367, score-0.376]
90 Kingman shows that the partition on the leaves of a coalescent tree when its edges are cut by a Poisson point process is the same as that of a DP (Figure 4). [sent-376, score-0.271]
91 Therefore, the partition structure along every row and column is marginally the same as a DP. [sent-377, score-0.121]
92 Both the unit-square and product-of-trees models give DP-distributed partitions on each row and column, but they have different inductive biases. [sent-378, score-0.261]
93 Figure 2 shows a sample after 1500 iterations (starting from a random initialization) where the partition on the array is exactly recovered. [sent-383, score-0.104]
94 We next analyzed the classic Countries data set from the network analysis literature (Wasserman and Faust, 1994), which reports trade in 1984 between 24 countries in food and live animals; crude materials; minerals and fuels; basic manufactured goods; and exchange of diplomats. [sent-387, score-0.136]
95 Given three relations (friends, works-with, and gives-orders-to), the maximum a posteriori Mondrian process partitions the relations into homogeneous blocks. [sent-395, score-0.307]
96 7 Discussion While the Mondrian process has many elegant properties, much more work is required to determine its usefulness for relational modeling. [sent-398, score-0.191]
97 We are currently investigating improved MCMC sampling schemes for the Mondrian process, as well as working to develop a combinatorial representation of the distribution on partitions induced by the Mondrian process. [sent-400, score-0.121]
98 The axis-aligned partitions of [0, 1]^n produced by the Mondrian process have been studied extensively in combinatorics and computational geometry, where they are known as guillotine partitions. [sent-402, score-0.209]
99 Guillotine partitions have wide ranging applications including circuit design, approximation algorithms and computer graphics. [sent-403, score-0.121]
100 Learning systems of concepts with an infinite relational model. [sent-456, score-0.135]
wordName wordTfidf (topN-words)
[('mondrian', 0.748), ('janitor', 0.411), ('student', 0.24), ('exchangeable', 0.164), ('relational', 0.135), ('partitions', 0.121), ('cut', 0.113), ('cuts', 0.099), ('mp', 0.089), ('partition', 0.07), ('relations', 0.065), ('countries', 0.063), ('rij', 0.063), ('sij', 0.063), ('process', 0.056), ('poisson', 0.055), ('block', 0.054), ('hoover', 0.053), ('arrays', 0.05), ('kingman', 0.047), ('kemp', 0.047), ('roy', 0.042), ('exchangeability', 0.042), ('restriction', 0.036), ('beta', 0.035), ('array', 0.034), ('product', 0.034), ('tree', 0.032), ('aldous', 0.032), ('diplomats', 0.032), ('fuels', 0.032), ('guillotine', 0.032), ('janitors', 0.032), ('minerals', 0.032), ('painting', 0.032), ('subdomains', 0.032), ('wasserman', 0.032), ('trees', 0.03), ('objects', 0.03), ('hierarchies', 0.029), ('row', 0.028), ('unit', 0.028), ('goods', 0.028), ('materials', 0.028), ('finetti', 0.028), ('annotated', 0.028), ('processes', 0.027), ('multidimensional', 0.027), ('social', 0.026), ('interval', 0.026), ('china', 0.025), ('generalizations', 0.025), ('nonparametric', 0.024), ('relation', 0.024), ('vy', 0.024), ('iid', 0.023), ('slice', 0.023), ('column', 0.023), ('dp', 0.023), ('spain', 0.022), ('contiguous', 0.022), ('stochastic', 0.022), ('draw', 0.022), ('domains', 0.022), ('clustered', 0.021), ('crude', 0.021), ('alger', 0.021), ('anowadya', 0.021), ('argen', 0.021), ('brazl', 0.021), ('ecuad', 0.021), ('egypt', 0.021), ('ethio', 0.021), ('finla', 0.021), ('hondu', 0.021), ('indon', 0.021), ('irm', 0.021), ('israe', 0.021), ('kallenberg', 0.021), ('liber', 0.021), ('madag', 0.021), ('mondrians', 0.021), ('newze', 0.021), ('pakis', 0.021), ('piet', 0.021), ('professors', 0.021), ('subdomain', 0.021), ('switz', 0.021), ('syria', 0.021), ('thail', 0.021), ('yugos', 0.021), ('japan', 0.02), ('food', 0.02), ('square', 0.02), ('consistency', 0.02), ('intervals', 0.02), ('conditionally', 0.02), ('pr', 0.019), ('nite', 0.019), ('regular', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 236 nips-2008-The Mondrian Process
Author: Daniel M. Roy, Yee W. Teh
Abstract: We describe a novel class of distributions, called Mondrian processes, which can be interpreted as probability distributions over kd-tree data structures. Mondrian processes are multidimensional generalizations of Poisson processes and this connection allows us to construct multidimensional generalizations of the stickbreaking process described by Sethuraman (1994), recovering the Dirichlet process in one dimension. After introducing the Aldous-Hoover representation for jointly and separately exchangeable arrays, we show how the process can be used as a nonparametric prior distribution in Bayesian models of relational data. 1
2 0.05764275 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. 1
3 0.053695578 248 nips-2008-Using matrices to model symbolic relationship
Author: Ilya Sutskever, Geoffrey E. Hinton
Abstract: We describe a way of learning matrix representations of objects and relationships. The goal of learning is to allow multiplication of matrices to represent symbolic relationships between objects and symbolic relationships between relationships, which is the main novelty of the method. We demonstrate that this leads to excellent generalization in two different domains: modular arithmetic and family relationships. We show that the same system can learn first-order propositions such as (2, 5) ∈ +3 or (Christopher, Penelope) ∈ has wife, and higher-order propositions such as (3, +3) ∈ plus and (+3, −3) ∈ inverse or (has husband, has wife) ∈ higher oppsex. We further demonstrate that the system understands how higher-order propositions are related to first-order ones by showing that it can correctly answer questions about first-order propositions involving the relations +3 or has wife even though it has not been trained on any first-order examples involving these relations. 1
4 0.052845486 134 nips-2008-Mixed Membership Stochastic Blockmodels
Author: Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing
Abstract: In many settings, such as protein interactions and gene regulatory networks, collections of author-recipient email, and social networks, the data consist of pairwise measurements, e.g., presence or absence of links between pairs of objects. Analyzing such data with probabilistic models requires non-standard assumptions, since the usual independence or exchangeability assumptions no longer hold. In this paper, we introduce a class of latent variable models for pairwise measurements: mixed membership stochastic blockmodels. Models in this class combine a global model of dense patches of connectivity (blockmodel) with a local model to instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodel with applications to social networks and protein interaction networks. 1
5 0.045399711 234 nips-2008-The Infinite Factorial Hidden Markov Model
Author: Jurgen V. Gael, Yee W. Teh, Zoubin Ghahramani
Abstract: We introduce a new probability distribution over a potentially infinite number of binary Markov chains which we call the Markov Indian buffet process. This process extends the IBP to allow temporal dependencies in the hidden variables. We use this stochastic process to build a nonparametric extension of the factorial hidden Markov model. After constructing an inference scheme which combines slice sampling and dynamic programming we demonstrate how the infinite factorial hidden Markov model can be used for blind source separation. 1
6 0.044977229 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
7 0.042584244 235 nips-2008-The Infinite Hierarchical Factor Regression Model
8 0.042072561 233 nips-2008-The Gaussian Process Density Sampler
9 0.04080826 69 nips-2008-Efficient Exact Inference in Planar Ising Models
10 0.036411926 107 nips-2008-Influence of graph construction on graph-based clustering measures
11 0.035477776 81 nips-2008-Extracting State Transition Dynamics from Multiple Spike Trains with Correlated Poisson HMM
12 0.029764429 26 nips-2008-Analyzing human feature learning as nonparametric Bayesian inference
13 0.029409824 171 nips-2008-Online Prediction on Large Diameter Graphs
14 0.028126653 231 nips-2008-Temporal Dynamics of Cognitive Control
15 0.028000252 219 nips-2008-Spectral Hashing
16 0.02797737 229 nips-2008-Syntactic Topic Models
17 0.026728502 179 nips-2008-Phase transitions for high-dimensional joint support recovery
18 0.024862619 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
19 0.024265863 4 nips-2008-A Scalable Hierarchical Distributed Language Model
20 0.023930401 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
topicId topicWeight
[(0, -0.085), (1, -0.008), (2, 0.033), (3, -0.001), (4, 0.02), (5, -0.052), (6, 0.025), (7, 0.037), (8, 0.004), (9, -0.054), (10, -0.033), (11, 0.015), (12, 0.037), (13, -0.032), (14, -0.03), (15, 0.029), (16, 0.003), (17, -0.015), (18, 0.021), (19, 0.021), (20, -0.042), (21, 0.024), (22, 0.031), (23, -0.018), (24, -0.039), (25, 0.014), (26, -0.04), (27, 0.093), (28, 0.055), (29, 0.024), (30, -0.022), (31, 0.009), (32, 0.012), (33, 0.019), (34, 0.04), (35, 0.027), (36, 0.061), (37, -0.015), (38, 0.027), (39, -0.025), (40, 0.004), (41, 0.013), (42, 0.045), (43, -0.03), (44, 0.022), (45, 0.053), (46, 0.006), (47, 0.006), (48, -0.06), (49, -0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.90621769 236 nips-2008-The Mondrian Process
Author: Daniel M. Roy, Yee W. Teh
Abstract: We describe a novel class of distributions, called Mondrian processes, which can be interpreted as probability distributions over kd-tree data structures. Mondrian processes are multidimensional generalizations of Poisson processes and this connection allows us to construct multidimensional generalizations of the stickbreaking process described by Sethuraman (1994), recovering the Dirichlet process in one dimension. After introducing the Aldous-Hoover representation for jointly and separately exchangeable arrays, we show how the process can be used as a nonparametric prior distribution in Bayesian models of relational data. 1
2 0.6022048 134 nips-2008-Mixed Membership Stochastic Blockmodels
Author: Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing
Abstract: In many settings, such as protein interactions and gene regulatory networks, collections of author-recipient email, and social networks, the data consist of pairwise measurements, e.g., presence or absence of links between pairs of objects. Analyzing such data with probabilistic models requires non-standard assumptions, since the usual independence or exchangeability assumptions no longer hold. In this paper, we introduce a class of latent variable models for pairwise measurements: mixed membership stochastic blockmodels. Models in this class combine a global model of dense patches of connectivity (blockmodel) with a local model to instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodel with applications to social networks and protein interaction networks. 1
3 0.54870361 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
Author: Alexandre Bouchard-côté, Dan Klein, Michael I. Jordan
Abstract: Accurate and efficient inference in evolutionary trees is a central problem in computational biology. While classical treatments have made unrealistic site independence assumptions, ignoring insertions and deletions, realistic approaches require tracking insertions and deletions along the phylogenetic tree—a challenging and unsolved computational problem. We propose a new ancestry resampling procedure for inference in evolutionary trees. We evaluate our method in two problem domains—multiple sequence alignment and reconstruction of ancestral sequences—and show substantial improvement over the current state of the art. 1
4 0.54761362 11 nips-2008-A spatially varying two-sample recombinant coalescent, with applications to HIV escape response
Author: Alexander Braunstein, Zhi Wei, Shane T. Jensen, Jon D. Mcauliffe
Abstract: Statistical evolutionary models provide an important mechanism for describing and understanding the escape response of a viral population under a particular therapy. We present a new hierarchical model that incorporates spatially varying mutation and recombination rates at the nucleotide level. It also maintains separate parameters for treatment and control groups, which allows us to estimate treatment effects explicitly. We use the model to investigate the sequence evolution of HIV populations exposed to a recently developed antisense gene therapy, as well as a more conventional drug therapy. The detection of biologically relevant and plausible signals in both therapy studies demonstrates the effectiveness of the method. 1
5 0.54747087 152 nips-2008-Non-stationary dynamic Bayesian networks
Author: Joshua W. Robinson, Alexander J. Hartemink
Abstract: A principled mechanism for identifying conditional dependencies in time-series data is provided through structure learning of dynamic Bayesian networks (DBNs). An important assumption of DBN structure learning is that the data are generated by a stationary process—an assumption that is not true in many important settings. In this paper, we introduce a new class of graphical models called non-stationary dynamic Bayesian networks, in which the conditional dependence structure of the underlying data-generation process is permitted to change over time. Non-stationary dynamic Bayesian networks represent a new framework for studying problems in which the structure of a network is evolving over time. We define the non-stationary DBN model, present an MCMC sampling algorithm for learning the structure of the model from time-series data under different assumptions, and demonstrate the effectiveness of the algorithm on both simulated and biological data. 1
6 0.5250693 233 nips-2008-The Gaussian Process Density Sampler
7 0.51540029 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
8 0.50932556 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
9 0.50607163 115 nips-2008-Learning Bounded Treewidth Bayesian Networks
10 0.4865371 18 nips-2008-An Efficient Sequential Monte Carlo Algorithm for Coalescent Clustering
11 0.48498586 234 nips-2008-The Infinite Factorial Hidden Markov Model
12 0.48088738 248 nips-2008-Using matrices to model symbolic relationship
13 0.43569231 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring
14 0.43022913 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
15 0.42673469 69 nips-2008-Efficient Exact Inference in Planar Ising Models
16 0.42353401 111 nips-2008-Kernel Change-point Analysis
17 0.42040798 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
18 0.41172418 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
19 0.40328312 211 nips-2008-Simple Local Models for Complex Dynamical Systems
20 0.39604914 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
topicId topicWeight
[(6, 0.037), (7, 0.044), (12, 0.027), (28, 0.118), (57, 0.539), (59, 0.011), (63, 0.01), (77, 0.037), (83, 0.03)]
simIndex simValue paperId paperTitle
1 0.94139302 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
Author: Leo Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan L. Yuille
Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomial-time parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful. In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena. 1
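A loose illustration of the coarse-to-fine idea only: the toy routine below parses an image by dynamic programming on a quadtree, where each region is either explained by a single constant "template" or split into four children. This is far simpler than the HIM's recursive segmentation and recognition templates; the cost function, labels, and names are assumptions made for the example.

```python
import numpy as np

def parse_quadtree(img, labels, min_size=4):
    """Toy coarse-to-fine segmentation by dynamic programming on a quadtree:
    each region either takes the single best-fitting constant label or is
    split into four sub-regions, whichever has lower total cost."""
    h, w = img.shape
    # Cost of explaining the region with one constant label value.
    const_costs = [np.abs(img - v).sum() for v in labels]
    best_label = int(np.argmin(const_costs))
    best = (const_costs[best_label], {"label": labels[best_label]})
    if h > min_size and w > min_size:
        split_cost, children = 0.0, []
        for sub in (img[:h // 2, :w // 2], img[:h // 2, w // 2:],
                    img[h // 2:, :w // 2], img[h // 2:, w // 2:]):
            c, node = parse_quadtree(sub, labels, min_size)
            split_cost += c
            children.append(node)
        if split_cost < best[0]:
            best = (split_cost, {"children": children})
    return best

cost, tree = parse_quadtree(np.random.rand(32, 32), labels=[0.2, 0.8])
```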
2 0.93274349 233 nips-2008-The Gaussian Process Density Sampler
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skull-reconstruction task. 1
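The following sketch mimics the exchangeable generative view suggested by the abstract: proposals from a base density are thinned with probability given by a sigmoid of a Gaussian-process function value, with the GP instantiated lazily at proposal locations. The squared-exponential kernel, standard-normal base density, and logistic link are assumptions made for illustration, not necessarily the paper's exact choices.

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=2.0):
    """Squared-exponential covariance between two 1-D location arrays."""
    d = a[:, None] - b[None, :]
    return sf ** 2 * np.exp(-0.5 * (d / ell) ** 2)

def gpds_samples(n, ell=1.0, sf=2.0, rng=None):
    """Draw n samples whose density is proportional to N(x; 0, 1) * sigmoid(g(x)),
    with g drawn from a GP, by lazily instantiating g at proposal locations
    and keeping each proposal with probability sigmoid(g(x))."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.empty(0)   # locations where g has already been instantiated
    G = np.empty(0)   # corresponding function values
    accepted = []
    while len(accepted) < n:
        x = rng.normal()                               # proposal from the base density
        if len(X) == 0:
            mu, var = 0.0, sf ** 2
        else:
            K = rbf(X, X, ell, sf) + 1e-8 * np.eye(len(X))
            k = rbf(np.array([x]), X, ell, sf)[0]
            sol = np.linalg.solve(K, k)
            mu = sol @ G                               # conditional GP mean at x
            var = max(sf ** 2 - sol @ k, 1e-10)        # conditional GP variance at x
        g = mu + np.sqrt(var) * rng.normal()
        X, G = np.append(X, x), np.append(G, g)
        if rng.random() < 1.0 / (1.0 + np.exp(-g)):    # logistic thinning
            accepted.append(x)
    return np.array(accepted)

samples = gpds_samples(50)
```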
3 0.92303562 148 nips-2008-Natural Image Denoising with Convolutional Networks
Author: Viren Jain, Sebastian Seung
Abstract: We present an approach to low-level vision that combines two main ideas: the use of convolutional networks as an image processing architecture and an unsupervised learning procedure that synthesizes training samples from specific noise models. We demonstrate this approach on the challenging problem of natural image denoising. Using a test set with a hundred natural images, we find that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random field (MRF) methods. Moreover, we find that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean field theory for an MRF specially designed for image denoising. Although these approaches are related, convolutional networks avoid computational difficulties in MRF approaches that arise from probabilistic learning and inference. This makes it possible to learn image processing architectures that have a high degree of representational power (we train models with over 15,000 parameters), but whose computational expense is significantly less than that associated with inference in MRF approaches with even hundreds of parameters. 1 Background Low-level image processing tasks include edge detection, interpolation, and deconvolution. These tasks are useful both in themselves, and as a front-end for high-level visual tasks like object recognition. This paper focuses on the task of denoising, defined as the recovery of an underlying image from an observation that has been subjected to Gaussian noise. One approach to image denoising is to transform an image from pixel intensities into another representation where statistical regularities are more easily captured. For example, the Gaussian scale mixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet decomposition that provides an effective description of local image statistics [1, 2]. Another approach is to try and capture statistical regularities of pixel intensities directly using Markov random fields (MRFs) to define a prior over the image space. Initial work used hand-designed settings of the parameters, but recently there has been increasing success in learning the parameters of such models from databases of natural images [3, 4, 5, 6, 7, 8]. Prior models can be used for tasks such as image denoising by augmenting the prior with a noise model. Alternatively, an MRF can be used to model the probability distribution of the clean image conditioned on the noisy image. This conditional random field (CRF) approach is said to be discriminative, in contrast to the generative MRF approach. Several researchers have shown that the CRF approach can outperform generative learning on various image restoration and labeling tasks [9, 10]. CRFs have recently been applied to the problem of image denoising as well [5]. 1 The present work is most closely related to the CRF approach. Indeed, certain special cases of convolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. The advantage of the convolutional network approach is that it avoids a general difficulty with applying MRF-based methods to image analysis: the computational expense associated with both parameter estimation and inference in probabilistic models.
For example, naive methods of learning MRF-based models involve calculation of the partition function, a normalization factor that is generally intractable for realistic models and image dimensions. As a result, a great deal of research has been devoted to approximate MRF learning and inference techniques that meliorate computational difficulties, generally at the cost of either representational power or theoretical guarantees [12, 13]. Convolutional networks largely avoid these difficulties by posing the computational task within the statistical framework of regression rather than density estimation. Regression is a more tractable computation and therefore permits models with greater representational power than methods based on density estimation. This claim will be argued for with empirical results on the denoising problem, as well as mathematical connections between MRF and convolutional network approaches. 2 Convolutional Networks Convolutional networks have been extensively applied to visual object recognition using architectures that accept an image as input and, through alternating layers of convolution and subsampling, produce one or more output values that are thresholded to yield binary predictions regarding object identity [14, 15]. In contrast, we study networks that accept an image as input and produce an entire image as output. Previous work has used such architectures to produce images with binary targets in image restoration problems for specialized microscopy data [11, 16]. Here we show that similar architectures can also be used to produce images with the analog fluctuations found in the intensity distributions of natural images. Network Dynamics and Architecture A convolutional network is an alternating sequence of linear filtering and nonlinear transformation operations. The input and output layers include one or more images, while intermediate layers contain “hidden
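To make the image-to-image architecture described above concrete, here is a minimal forward pass of a small convolutional network: alternating convolution and tanh layers mapping a noisy image to an estimate. The weights are random placeholders (in the paper they would be learned from clean/noisy training pairs), and the layer widths, kernel size, and names are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_layer(maps, weights, bias):
    """One convolutional layer: each output map is a sum of 2-D convolutions
    of the input maps, plus a bias, passed through tanh."""
    out = []
    for j in range(weights.shape[1]):
        acc = bias[j]
        for i in range(len(maps)):
            acc = acc + convolve2d(maps[i], weights[i, j], mode="same")
        out.append(np.tanh(acc))
    return out

def denoise(image, hidden=8, ksize=5, layers=3, rng=None):
    """Forward pass of a small image-to-image convolutional network.
    Weights here are random placeholders; only the architecture is shown."""
    rng = np.random.default_rng() if rng is None else rng
    maps = [image]
    widths = [1] + [hidden] * (layers - 1) + [1]
    for l in range(layers):
        W = rng.normal(scale=0.1, size=(widths[l], widths[l + 1], ksize, ksize))
        b = rng.normal(scale=0.1, size=widths[l + 1])
        maps = conv_layer(maps, W, b)
    return maps[0]

noisy = np.random.rand(64, 64)
estimate = denoise(noisy)
```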
same-paper 4 0.91316605 236 nips-2008-The Mondrian Process
Author: Daniel M. Roy, Yee W. Teh
Abstract: We describe a novel class of distributions, called Mondrian processes, which can be interpreted as probability distributions over kd-tree data structures. Mondrian processes are multidimensional generalizations of Poisson processes and this connection allows us to construct multidimensional generalizations of the stickbreaking process described by Sethuraman (1994), recovering the Dirichlet process in one dimension. After introducing the Aldous-Hoover representation for jointly and separately exchangeable arrays, we show how the process can be used as a nonparametric prior distribution in Bayesian models of relational data. 1
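Since this is the paper the page is built around, a small sketch of the commonly described Mondrian generative process may be useful: axis-aligned cuts are made recursively, each consuming an exponentially distributed share of a budget, with the cut dimension chosen in proportion to side length. The function names, the budget value, and the unit box are illustrative choices, not taken verbatim from the paper.

```python
import numpy as np

def sample_mondrian(budget, box, rng=None):
    """Recursively sample a Mondrian partition of an axis-aligned box.
    `box` is a list of (low, high) intervals, one per dimension. Cut costs
    are exponential with rate equal to the box's linear dimension, so
    smaller boxes are cut less often and the recursion terminates."""
    rng = np.random.default_rng() if rng is None else rng
    lengths = np.array([hi - lo for lo, hi in box])
    cost = rng.exponential(1.0 / lengths.sum())
    if cost > budget:
        return {"leaf": True, "box": box}
    d = int(rng.choice(len(box), p=lengths / lengths.sum()))  # dimension to cut
    lo, hi = box[d]
    x = rng.uniform(lo, hi)                                   # cut location
    left = box[:d] + [(lo, x)] + box[d + 1:]
    right = box[:d] + [(x, hi)] + box[d + 1:]
    return {"leaf": False, "dim": d, "cut": x,
            "left": sample_mondrian(budget - cost, left, rng),
            "right": sample_mondrian(budget - cost, right, rng)}

tree = sample_mondrian(budget=3.0, box=[(0.0, 1.0), (0.0, 1.0)])
```

On a two-dimensional unit square the resulting nested axis-aligned cuts look like the grid-based paintings the process is named after, which is also why the leaves can be read off as a kd-tree partition.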
5 0.8971864 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
Author: Jihun Hamm, Daniel D. Lee
Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1
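A brief sketch of the Projection kernel the abstract refers to, computed from orthonormal bases of two subspaces; the ambient dimension, subspace dimension, and random subspaces are arbitrary choices for the example.

```python
import numpy as np

def orthonormal_basis(A):
    """Return an orthonormal basis for the column span of A (thin QR)."""
    Q, _ = np.linalg.qr(A)
    return Q

def projection_kernel(Y1, Y2):
    """Projection kernel between two subspaces given by orthonormal bases:
    k(Y1, Y2) = ||Y1^T Y2||_F^2, a positive definite kernel on the
    Grassmann manifold."""
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

# Two random 3-dimensional subspaces of R^20.
rng = np.random.default_rng(0)
Y1 = orthonormal_basis(rng.normal(size=(20, 3)))
Y2 = orthonormal_basis(rng.normal(size=(20, 3)))
k = projection_kernel(Y1, Y2)  # equals the sum of squared cosines of the principal angles
```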
6 0.79942435 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
7 0.77297717 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
8 0.72032815 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
9 0.6792503 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
10 0.64111954 234 nips-2008-The Infinite Factorial Hidden Markov Model
11 0.62273139 35 nips-2008-Bayesian Synchronous Grammar Induction
12 0.61863112 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
13 0.60728401 200 nips-2008-Robust Kernel Principal Component Analysis
14 0.59440053 232 nips-2008-The Conjoint Effect of Divisive Normalization and Orientation Selectivity on Redundancy Reduction
15 0.59029174 66 nips-2008-Dynamic visual attention: searching for coding length increments
16 0.58611917 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
17 0.58561397 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
18 0.58122778 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
19 0.57708031 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
20 0.57163596 229 nips-2008-Syntactic Topic Models