jmlr jmlr2011 jmlr2011-90 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. [sent-7, score-0.669]
2 Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices 1. [sent-11, score-0.645]
3 An alternative is to assume that the amount of latent structure is actually potentially unbounded, and that the observed objects only manifest a sparse subset of those classes or features (Rasmussen and Ghahramani, 2001). [sent-16, score-0.449]
4 The prior distribution over assignments of datapoints to classes is specified in such a way that the number of classes used by the model is bounded only by the number of objects, making Dirichlet process mixture models “infinite” mixture models (Rasmussen, 2000). [sent-24, score-0.612]
5 In this paper, we summarize recent work exploring the extension of this nonparametric approach to models in which objects are represented using an unknown number of latent features. [sent-41, score-0.431]
6 This distribution can be specified in terms of a simple stochastic process called the Indian buffet process, by analogy to the Chinese restaurant process used in Dirichlet process mixture models. [sent-43, score-0.66]
7 We illustrate how the Indian buffet process can be used to specify prior distributions in latent feature models, using a simple linear-Gaussian model to show how such models can be defined and used. [sent-44, score-0.626]
8 We review these applications, summarizing some of the innovations that have been introduced in order to use the Indian buffet process in different settings, as well as extensions to the basic model and alternative inference algorithms. [sent-47, score-0.412]
9 As for the Chinese restaurant process, we can arrive at the Indian buffet process in a number of different ways: as the infinite limit of a finite model, via the constructive specification of an infinite model, or by marginalizing out an underlying measure. [sent-49, score-0.449]
10 Section 2 summarizes the principles behind infinite mixture models, focusing on the prior on class assignments assumed in these models, which can be defined in terms of a simple stochastic process—the Chinese restaurant process. [sent-52, score-0.379]
11 We then develop a distribution on infinite binary matrices by considering how this approach can be extended to the case where objects are represented with multiple binary features. [sent-53, score-0.413]
12 Section 6 describes further applications of this approach, both in latent feature models and for inferring graph structures, and Section 7 discusses recent work extending the Indian buffet process and providing connections to other stochastic processes. [sent-57, score-0.594]
13 In a latent class model, such as a mixture model, each object is assumed to belong to a single class, ci , and the properties xi are generated from a distribution determined by that class. [sent-61, score-0.403]
14 P(c) = ∏_{k=1}^{K} [Γ(m_k + α/K) / Γ(α/K)] · Γ(α) / Γ(N + α), (3) where m_k = ∑_{i=1}^{N} δ(c_i = k) is the number of objects assigned to class k. [sent-115, score-0.389]
15 Rather, they are exchangeable (Bernardo and Smith, 1994), with the probability of an assignment vector remaining the same when the indices of the objects are permuted. [sent-118, score-0.37]
16 However, the distribution on assignment vectors defined by Equation 3 assumes an upper bound on the number of classes of objects, since it only allows assignments of objects to up to K classes. [sent-122, score-0.458]
17 These limiting probabilities define a valid distribution over partitions, and thus over equivalence classes of class assignments, providing a prior over class assignments for an infinite mixture model. [sent-154, score-0.368]
18 This process continues until all customers have seats, defining a distribution over allocations of people to tables, and, more generally, objects to classes. [sent-172, score-0.427]
19 If we assume an ordering on our N objects, then we can assign them to classes sequentially using the method specified by the CRP, letting objects play the role of customers and classes play the role of tables. [sent-175, score-0.399]
20 P(c_i = k | c_1, …, c_{i−1}) = m_k/(i − 1 + α) for k ≤ K_+, and α/(i − 1 + α) for k = K_+ + 1, where m_k is the number of objects currently assigned to class k, and K_+ is the number of classes for which m_k > 0. [sent-179, score-0.864]
21 If all N objects are assigned to classes via this process, the probability of a partition of objects c is that given in Equation 5. [sent-180, score-0.393]
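To make the sequential assignment rule above concrete, here is a minimal NumPy sketch of sampling from the CRP; the function name and interface are ours, not the paper's.

```python
import numpy as np

def sample_crp(N, alpha, rng=None):
    """Assign N objects to classes sequentially via the Chinese restaurant process."""
    rng = np.random.default_rng(rng)
    assignments = np.zeros(N, dtype=int)
    counts = []                                   # m_k: objects seated at each table
    for i in range(N):                            # customer i+1 arrives; i already seated
        probs = np.array(counts + [alpha]) / (i + alpha)
        k = rng.choice(len(probs), p=probs)       # existing table k, or a new table
        if k == len(counts):
            counts.append(1)                      # new table, probability alpha/(i + alpha)
        else:
            counts[k] += 1                        # table k, probability m_k/(i + alpha)
        assignments[i] = k
    return assignments
```

Running this for large N yields roughly α log N occupied tables on average, matching the behavior that lets the number of classes grow with the data.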
22 The CRP thus provides an intuitive means of specifying a prior for infinite mixture models, as well as revealing that there is a simple sequential process by which exchangeable class assignments can be generated. [sent-181, score-0.42]
23 In a finite mixture model with P(c) defined as in Equation 3, we can compute P(c_i = k|c_−i) by integrating over θ, obtaining P(c_i = k|c_−i) = ∫ P(c_i = k|θ) p(θ|c_−i) dθ = (m_−i,k + α/K)/(N − 1 + α), (6) where m_−i,k is the number of objects assigned to class k, not including object i. [sent-198, score-0.371]
24 In the next two sections, we use the insights underlying infinite mixture models to derive methods for representing objects in terms of infinitely many latent features, leading us to derive a distribution on infinite binary matrices. [sent-213, score-0.553]
25 Latent Feature Models In a latent feature model, each object is represented by a vector of latent feature values fi , and the properties xi are generated from a distribution determined by those latent feature values. [sent-215, score-0.784]
26 Using the matrix F = [f_1^T f_2^T · · · f_N^T]^T to indicate the latent feature values for all N objects, the model is specified by a prior over features, p(F), and a distribution over observed property matrices conditioned on those features, p(X|F). [sent-218, score-0.447]
27 We can break the matrix F into two components: a binary matrix Z indicating which features are possessed by each object, with zik = 1 if object i has feature k and 0 otherwise, and a second matrix V indicating the value of each feature for each object. [sent-221, score-0.785]
28 In many latent feature models, such as PCA and CVQ, objects have non-zero values on every feature, and every entry of Z is 1. [sent-223, score-0.395]
29 Assuming that Z is sparse, we can define a prior for infinite latent feature models by defining a distribution over infinite binary matrices. [sent-231, score-0.399]
30 Our analysis of latent class models provides two desiderata for such a distribution: objects should be exchangeable, and inference should be tractable. [sent-232, score-0.474]
31 [Figure residue: panels (a) and (b) each show an N objects × K features binary matrix.] [sent-237, score-0.504]
32 A binary matrix Z, as shown in (a), can be used as the basis for sparse infinite latent feature models, indicating which features take non-zero values. [sent-247, score-0.396]
33 1 A Finite Feature Model We have N objects and K features, and the possession of feature k by object i is indicated by a binary variable zik . [sent-251, score-0.668]
34 The zik thus form a binary N × K feature matrix, Z. [sent-253, score-0.413]
35 Given π = {π_1, …, π_K}, the conditional probability of Z is P(Z|π) = ∏_{k=1}^{K} ∏_{i=1}^{N} P(z_ik|π_k) = ∏_{k=1}^{K} π_k^{m_k} (1 − π_k)^{N−m_k}, where m_k = ∑_{i=1}^{N} z_ik is the number of objects possessing feature k. [sent-259, score-0.734]
36 Here B(r, s) = Γ(r)Γ(s)/Γ(r + s); with r = α/K and s = 1, Equation 8 becomes B(α/K, 1) = Γ(α/K)/Γ(1 + α/K) = K/α. Figure 4: Graphical model for the beta-binomial model used in defining the Indian buffet process (α → π_k → z_ik, with plates over the K features and the N objects). [sent-263, score-0.593]
37 The marginal probability of a binary matrix Z is P(Z) = ∏_{k=1}^{K} ∫ (∏_{i=1}^{N} P(z_ik|π_k)) p(π_k) dπ_k = ∏_{k=1}^{K} B(m_k + α/K, N − m_k + 1)/B(α/K, 1) = ∏_{k=1}^{K} (α/K) Γ(m_k + α/K) Γ(N − m_k + 1)/Γ(N + 1 + α/K). [sent-270, score-0.527]
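As an illustration of this finite beta-Bernoulli model, the sketch below draws Z by sampling π_k ~ Beta(α/K, 1) and z_ik ~ Bernoulli(π_k), and evaluates the marginal probability of Z using the Beta-function form above. The function names are ours, and SciPy is assumed for the log-Beta function.

```python
import numpy as np
from scipy.special import betaln

def sample_finite_feature_model(N, K, alpha, rng=None):
    """Draw Z from the finite model: pi_k ~ Beta(alpha/K, 1), z_ik ~ Bernoulli(pi_k)."""
    rng = np.random.default_rng(rng)
    pi = rng.beta(alpha / K, 1.0, size=K)
    Z = (rng.random((N, K)) < pi).astype(int)     # broadcast pi across the N rows
    return Z, pi

def log_marginal_prob(Z, alpha):
    """log P(Z) = sum_k log[ B(m_k + alpha/K, N - m_k + 1) / B(alpha/K, 1) ]."""
    N, K = Z.shape
    m = Z.sum(axis=0)                             # m_k: objects possessing feature k
    a = alpha / K
    return np.sum(betaln(m + a, N - m + 1) - betaln(a, 1.0))
```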
38 This left-ordered matrix was generated from the exchangeable Indian buffet process with α = 10. [sent-284, score-0.456]
39 2 Equivalence Classes In order to find the limit of the distribution specified by Equation 10 as K → ∞, we need to define equivalence classes of binary matrices—the analogue of partitions for assignment vectors. [sent-287, score-0.349]
40 lof(Z) is obtained by ordering the columns of the binary matrix Z from left to right by the magnitude of the binary number expressed by that column, taking the first row as the most significant bit. [sent-291, score-0.348]
41 K_h will denote the number of features possessing the history h, with K_0 being the number of features for which m_k = 0 and K_+ = ∑_{h=1}^{2^N−1} K_h being the number of features for which m_k > 0, so K = K_0 + K_+. [sent-307, score-0.708]
42 Any two binary matrices Y and Z are lof-equivalent if lof(Y) = lof(Z), that is, if Y and Z map to the same left-ordered form. [sent-311, score-0.485]
43 The lof-equivalence class of a binary matrix Z, denoted [Z], is the set of binary matrices that are lof-equivalent to Z. [sent-312, score-0.464]
44 Performing inference at the level of lof-equivalence classes is appropriate in models where feature order is not identifiable, with p(X|F) being unaffected by the order of the columns of F. [sent-314, score-0.409]
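A small sketch of the left-ordered form, assuming NumPy; sorting column histories lexicographically (first row most significant) is equivalent to sorting by the magnitude of the binary number each column encodes. The function name is ours.

```python
import numpy as np

def left_order_form(Z):
    """Return lof(Z): columns ordered by decreasing binary magnitude (first row = MSB)."""
    Z = np.asarray(Z)
    # Comparing column tuples lexicographically matches comparing the binary numbers.
    order = sorted(range(Z.shape[1]), key=lambda k: tuple(Z[:, k]), reverse=True)
    return Z[:, order]
```

Two matrices Y and Z are then lof-equivalent exactly when left_order_form(Y) and left_order_form(Z) are equal.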
45 The columns of a binary matrix are not guaranteed to be unique: since an object can possess multiple features, it is possible for two features to be possessed by exactly the same set of objects. [sent-317, score-0.378]
46 lof-equivalence classes play the same role for binary matrices as partitions do for assignment vectors: they collapse together all binary matrices (assignment vectors) that differ only in column ordering (class labels). [sent-325, score-0.536]
47 This relationship can be made precise by examining the lof-equivalence classes of binary matrices constructed from assignment vectors. [sent-326, score-0.366]
48 Define the class matrix generated by an assignment vector c to be a binary matrix Z where zik = 1 if and only if ci = k. [sent-327, score-0.52]
49 3 Taking the Infinite Limit Under the distribution defined by Equation 10, the probability of a particular lof-equivalence class of binary matrices, [Z], is P([Z]) = ∑_{Z∈[Z]} P(Z) = (K!/∏_{h=0}^{2^N−1} K_h!) ∏_{k=1}^{K} (α/K) Γ(m_k + α/K) Γ(N − m_k + 1)/Γ(N + 1 + α/K). (12) [sent-330, score-0.441]
50 In order to take the limit of this expression as K → ∞, we will divide the columns of Z into two subsets, corresponding to the features for which m_k = 0 and the features for which m_k > 0. [sent-333, score-0.699]
51 Reordering the columns such that mk > 0 if k ≤ K+ , and mk = 0 otherwise, we can break the product in Equation 12 into two parts, corresponding to these two subsets. [sent-334, score-0.495]
52 Carrying out this split gives P([Z]) = (K!/∏_h K_h!) · (N!/∏_{j=1}^{N}(j + α/K))^{K−K_+} · ∏_{k=1}^{K_+} (α/K) Γ(m_k + α/K) Γ(N − m_k + 1)/Γ(N + 1 + α/K); using (α/K)Γ(α/K) = Γ(1 + α/K), each m_k > 0 term reduces to (α/K)(N − m_k)! ∏_{j=1}^{m_k−1}(j + α/K) / ∏_{j=1}^{N}(j + α/K). [sent-336, score-0.645]
53 We can define a distribution over infinite binary matrices by specifying a procedure by which customers (objects) choose dishes (features). [sent-357, score-0.477]
54 In our Indian buffet process (IBP), N customers enter a restaurant one after another. [sent-358, score-0.536]
55 Each customer encounters a buffet consisting of infinitely many dishes arranged in a line. [sent-359, score-0.559]
56 The first customer starts at the left of the buffet and takes a serving from each dish, stopping after a Poisson(α) number of dishes as his plate becomes overburdened. [sent-360, score-0.559]
57 The ith customer moves along the buffet, sampling dishes in proportion to their popularity, serving himself dish k with probability m_k/i, where m_k is the number of previous customers who have sampled that dish; having reached the end of the previously sampled dishes, he then tries a Poisson(α/i) number of new dishes. [sent-361, score-0.94]
58 We can indicate which customers chose which dishes using a binary matrix Z with N rows and infinitely many columns, where z_ik = 1 if the ith customer sampled the kth dish. [sent-363, score-0.867]
59 The third customer tried 3 dishes tried by both previous customers, 5 dishes tried by only the first customer, and 2 new dishes. [sent-367, score-0.477]
60 [Figure 6: A binary matrix generated by the Indian buffet process with α = 10; customers 1–20 along one axis, dishes along the other.] [sent-378, score-0.4]
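The buffet metaphor translates directly into a short sampler. The sketch below generates Z row by row: customer i revisits each previously sampled dish k with probability m_k/i and then tries Poisson(α/i) new dishes, as described above; function and variable names are ours.

```python
import numpy as np

def sample_ibp(N, alpha, rng=None):
    """Generate a binary feature matrix Z from the Indian buffet process.

    Returns Z with N rows and K_+ columns (only dishes that were actually sampled).
    """
    rng = np.random.default_rng(rng)
    Z = np.zeros((N, 0), dtype=int)
    for i in range(1, N + 1):                          # customers are 1-indexed
        m = Z.sum(axis=0)                              # dish popularities so far
        old = (rng.random(Z.shape[1]) < m / i).astype(int)
        k_new = rng.poisson(alpha / i)                 # new dishes for this customer
        Z = np.hstack([Z, np.zeros((N, k_new), dtype=int)])
        Z[i - 1, :len(old)] = old                      # revisited dishes
        Z[i - 1, len(old):] = 1                        # newly introduced dishes
    return Z
```

On average each customer samples Poisson(α) dishes, and the total number of columns grows as α times the Nth harmonic number.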
61 However, if we only pay attention to the lof-equivalence classes of the matrices generated by this process, we obtain the exchangeable distribution P([Z]) given by Equation 14: P([Z]) = α^{K_+}/(∏_{i=1}^{N} K_1^{(i)}!) · exp{−α H_N} · ∏_{k=1}^{K_+} (N − m_k)! (m_k − 1)!/N!, where K_1^{(i)} is the number of new dishes sampled by the ith customer and H_N = ∑_{j=1}^{N} 1/j is the Nth harmonic number. [sent-380, score-0.35]
62 It is possible to define a similar sequential process that directly produces a distribution on lof-equivalence classes in which customers are exchangeable, but this requires more effort on the part of the customers. [sent-383, score-0.425]
63 In the exchangeable Indian buffet process, the first customer samples a Poisson(α) number of dishes, moving from left to right. [sent-384, score-0.512]
64 If there are K_h dishes with history h, under which m_h previous customers have sampled each of those dishes, then the ith customer samples a Binomial(m_h/i, K_h) number of those dishes, starting at the left. [sent-386, score-0.685]
65 Attending to the history of the dishes and always sampling from the left guarantees that the resulting matrix is in left-ordered form, and it is easy to show that the matrices produced by this process have the same probability as the corresponding lof-equivalence classes under Equation 14. [sent-388, score-0.599]
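The history-based variant can be sketched as follows: columns are grouped by their history over the first i − 1 customers, customer i takes a Binomial(K_h, m_h/i) number of dishes from each group, always from the left, and then adds Poisson(α/i) new dishes. This is our reading of the construction, offered as an illustrative sketch rather than a reproduction of the paper's pseudocode.

```python
import numpy as np
from collections import defaultdict

def sample_exchangeable_ibp(N, alpha, rng=None):
    """Exchangeable IBP variant that directly produces left-ordered matrices."""
    rng = np.random.default_rng(rng)
    Z = np.zeros((N, 0), dtype=int)
    for i in range(1, N + 1):
        # group existing dishes by their history over the first i-1 customers
        groups = defaultdict(list)
        for k in range(Z.shape[1]):
            groups[tuple(Z[:i - 1, k])].append(k)
        for h, cols in groups.items():
            m_h = sum(h)                       # customers who sampled dishes with history h
            n_take = rng.binomial(len(cols), m_h / i)
            for k in cols[:n_take]:            # always take from the left of the group
                Z[i - 1, k] = 1
        k_new = rng.poisson(alpha / i)         # new dishes appended at the far right
        new_block = np.zeros((N, k_new), dtype=int)
        new_block[i - 1, :] = 1
        Z = np.hstack([Z, new_block])
    return Z
```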
66 As noted above, lof-equivalence classes of binary matrices generated from assignment vectors correspond to partitions. [sent-391, score-0.366]
67 This connection between lof-equivalence classes of feature matrices and collections of feature histories suggests another means of deriving the distribution specified by Equation 14, operating directly on the frequencies of these histories. [sent-400, score-0.453]
68 7 Inference by Gibbs Sampling We have defined a distribution over infinite binary matrices that satisfies one of our desiderata— objects (the rows of the matrix) are exchangeable under this distribution. [sent-444, score-0.469]
69 It remains to be shown that inference in infinite latent feature models is tractable, as was the case for infinite mixture models. [sent-445, score-0.417]
70 We will derive a Gibbs sampler for sampling from the distribution defined by the IBP, which suggests a strategy for inference in latent feature models in which the exchangeable IBP is used as a prior. [sent-446, score-0.603]
71 To sample from the distribution defined by the IBP, we need to compute the conditional distribution P(zik = 1|Z−(ik) ), where Z−(ik) denotes the entries of Z other than zik . [sent-448, score-0.356]
72 Integrating over π_k gives P(z_ik = 1|z_−i,k) = ∫_0^1 P(z_ik|π_k) p(π_k|z_−i,k) dπ_k = (m_−i,k + α/K)/(N + α/K), (17) where z_−i,k is the set of assignments of other objects, not including i, for feature k, and m_−i,k is the number of objects possessing feature k, not including i. [sent-450, score-0.439]
73 Choosing an ordering on objects such that the ith object corresponds to the last customer to visit the buffet, we obtain P(z_ik = 1|z_−i,k) = m_−i,k/N (18) for any k such that m_−i,k > 0. [sent-453, score-0.416]
74 If Z is in left-ordered form, we can write it as [Z_+ Z_0], where Z_+ consists of the K_+ columns with sums m_k > 0, and Z_0 consists of the K_0 columns with sums m_k = 0. [sent-500, score-0.56]
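Putting Equation 18 together with the usual treatment of singleton features, a single Gibbs sweep under the IBP prior alone can be sketched as below; in a full latent feature model each conditional would additionally be weighted by the likelihood p(X|Z). The Poisson(α/N) proposal for new features follows the standard IBP Gibbs scheme, and the function name is ours.

```python
import numpy as np

def gibbs_sweep_ibp_prior(Z, alpha, rng=None):
    """One Gibbs sweep over the rows of Z under the IBP prior (no likelihood term)."""
    rng = np.random.default_rng(rng)
    Z = np.array(Z, dtype=int)                 # work on a copy
    N = Z.shape[0]
    for i in range(N):
        m_minus = Z.sum(axis=0) - Z[i]         # m_{-i,k}: counts excluding object i
        shared = m_minus > 0
        # Equation 18: P(z_ik = 1 | z_{-i,k}) = m_{-i,k} / N for shared features
        Z[i, shared] = (rng.random(int(shared.sum())) < m_minus[shared] / N).astype(int)
        # drop features possessed by no other object (singletons of i, empty columns) ...
        Z = Z[:, shared]
        # ... and, treating object i as the last customer, add Poisson(alpha/N) new features
        k_new = rng.poisson(alpha / N)
        new_block = np.zeros((N, k_new), dtype=int)
        new_block[i] = 1
        Z = np.hstack([Z, new_block])
    return Z
```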
75 The data set consisting of images of objects was constructed in a way that removes many of the basic challenges of computer vision, with objects appearing in fixed orientations and locations. [sent-578, score-0.392]
76 Further Applications and Alternative Inference Algorithms We now outline six applications of the Indian buffet process, each of which uses the same prior over infinite binary matrices, P(Z), but different choices for the likelihood relating such matrices to observed data. [sent-583, score-0.419]
77 The latent binary feature zik indicates whether protein i belongs to complex k. [sent-607, score-0.593]
78 In this model, both movies and viewers are represented by binary latent vectors with an unbounded number of elements, corresponding to the features they might possess (e. [sent-624, score-0.408]
79 The two corresponding infinite binary matrices interact via a real-valued weight matrix which links features of movies to features of viewers, resulting in a binary matrix factorization of the observed ratings. [sent-627, score-0.445]
80 Given a square matrix of judgments of the similarity between N objects, where s_ij is the similarity between objects i and j, the additive clustering model seeks to recover an N × K binary feature matrix F and a vector of K weights associated with those features such that s_ij ≈ ∑_{k=1}^{K} w_k f_ik f_jk. [sent-636, score-0.515]
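In matrix form the additive clustering approximation is S ≈ F diag(w) F^T, as in this toy sketch with made-up values; with an IBP prior over F, the number of features K is inferred rather than fixed in advance.

```python
import numpy as np

# Additive clustering reconstruction: s_ij ~ sum_k w_k f_ik f_jk, i.e. S ~ F diag(w) F^T.
F = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])            # hypothetical N = 3 objects, K = 3 binary features
w = np.array([0.5, 1.2, 0.8])        # hypothetical non-negative feature weights
S_hat = F @ np.diag(w) @ F.T         # predicted similarity judgments
```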
81 Just as allowing objects to have latent features rather than a single latent class makes it possible to go beyond mixture models, this approach allows us to define models for link prediction that are richer than stochastic blockmodels. [sent-658, score-0.71]
82 The binary variable zik indicates whether hidden variable k has a direct causal influence on observed variable i; in other words whether k is a parent of i in the graph. [sent-694, score-0.383]
83 Extensions and Connections to Other Processes The Indian buffet process gives a way to characterize our distribution on infinite binary matrices in terms of a simple stochastic process. [sent-737, score-0.509]
84 This generalization keeps the average number of features per object at α as before, but allows the overall number of represented features to range from α, an extreme where all features are shared between all objects, to Nα, an extreme where no features are shared at all. [sent-748, score-0.393]
85 To return to the language of the Indian buffet, the first customer starts at the left of the buffet and samples Poisson(α) dishes. [sent-752, score-0.388]
86 The ith customer serves himself from any dish previously sampled by mk > 0 customers with probability mk /(β + i − 1), and in addition from Poisson(αβ/(β + i − 1)) new dishes. [sent-753, score-0.726]
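A sketch of this two-parameter generative process, following the description above (β = 1 recovers the one-parameter IBP); function and variable names are ours.

```python
import numpy as np

def sample_two_param_ibp(N, alpha, beta, rng=None):
    """Two-parameter Indian buffet process.

    Customer i takes a previously sampled dish k with probability m_k / (beta + i - 1)
    and then tries Poisson(alpha * beta / (beta + i - 1)) new dishes.
    """
    rng = np.random.default_rng(rng)
    Z = np.zeros((N, 0), dtype=int)
    for i in range(1, N + 1):
        m = Z.sum(axis=0)                                          # dish popularities
        Z[i - 1, :] = (rng.random(Z.shape[1]) < m / (beta + i - 1)).astype(int)
        k_new = rng.poisson(alpha * beta / (beta + i - 1))         # new dishes
        new_block = np.zeros((N, k_new), dtype=int)
        new_block[i - 1] = 1
        Z = np.hstack([Z, new_block])
    return Z
```

Larger β pushes the process toward the Nα extreme (few shared features), while smaller β concentrates mass on a few widely shared features.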
87 Thibaux and Jordan (2007) provided an answer to this question, showing that the exchangeable distribution produced by the IBP corresponds to the use of a latent measure based on the beta process (Hjort, 1990). [sent-799, score-0.472]
88 Sampling each of the zik independently according to the distribution defined by the appropriate parameter results in the same distribution on Z as the IBP. [sent-801, score-0.356]
89 Producing correlations between the columns of Z can be done by supplementing the IBP with a secondary process capturing patterns in the latent features (Doshi-Velez and Ghahramani, 2009b). [sent-809, score-0.345]
90 By assuming that these parameters are generated from a Beta distribution and taking a limit analogous to that used in the derivation of the IBP, it is possible to define a distribution over equivalence classes of binary matrices in which the rows of the matrix reflect a Markov dependency structure. [sent-821, score-0.37]
91 This model can be used to define richer nonparametric models for temporal data, such as an infinite factorial hidden Markov model, and probabilistic inference can be carried out using a slice sampler (see Van Gael et al. [sent-822, score-0.345]
92 Conclusions and Future Work The methods that have been used to define infinite latent class models can be extended to models in which objects are represented in terms of a set of latent features, and used to derive distributions on infinite binary matrices that can be used as priors for such models. [sent-832, score-0.679]
93 We used this method to derive a prior that is the infinite limit of a simple distribution on finite binary matrices, and showed that the same distribution can be specified in terms of a simple stochastic process—the Indian buffet process. [sent-833, score-0.519]
94 This distribution satisfies our two desiderata for a prior for infinite latent feature models: objects are exchangeable, and inference remains tractable. [sent-834, score-0.583]
95 When used as a prior in models that represent objects using latent features, this distribution can be used to automatically infer the number of features required to account for observed data. [sent-835, score-0.514]
96 Our success in transferring the strategy of taking the limit of a finite model from latent classes to latent features suggests that the same strategy might be applied with other representations, broadening the kinds of latent structure that can be recovered through unsupervised learning. [sent-842, score-0.659]
97 The second expression is ∏_{j=1}^{m_k−1} (j + α/K) = (m_k − 1)! ∏_{j=1}^{m_k−1} (1 + α/(jK)). [sent-858, score-0.43]
98 Infinite latent feature models and the Indian buffet process. [sent-1056, score-0.509]
99 Infinite latent feature models and the Indian buffet process. [sent-1062, score-0.509]
100 The phylogenetic Indian buffet process: A nonexchangeable nonparametric prior for latent features. [sent-1141, score-0.51]
wordName wordTfidf (topN-words)
[('ibp', 0.419), ('zik', 0.276), ('buffet', 0.253), ('mk', 0.215), ('indian', 0.185), ('objects', 0.174), ('dishes', 0.171), ('hahramani', 0.167), ('ndian', 0.159), ('uffet', 0.159), ('latent', 0.152), ('riffiths', 0.143), ('customer', 0.135), ('customers', 0.135), ('assignments', 0.127), ('exchangeable', 0.124), ('dirichlet', 0.122), ('poisson', 0.118), ('lo', 0.118), ('mh', 0.1), ('kh', 0.1), ('restaurant', 0.098), ('rocess', 0.093), ('zt', 0.091), ('mixture', 0.084), ('crp', 0.082), ('object', 0.081), ('gibbs', 0.079), ('features', 0.078), ('inference', 0.077), ('ths', 0.073), ('assignment', 0.072), ('beta', 0.07), ('nonparametric', 0.07), ('feature', 0.069), ('grif', 0.069), ('binary', 0.068), ('chinese', 0.066), ('columns', 0.065), ('sampler', 0.063), ('matrices', 0.063), ('teh', 0.059), ('ghahramani', 0.058), ('possessed', 0.057), ('exchangeability', 0.057), ('dyadic', 0.051), ('mzt', 0.05), ('viewers', 0.05), ('process', 0.05), ('histories', 0.049), ('ica', 0.049), ('limit', 0.048), ('ci', 0.046), ('classes', 0.045), ('history', 0.044), ('zi', 0.044), ('images', 0.044), ('neal', 0.044), ('wood', 0.043), ('sampling', 0.043), ('bayesian', 0.042), ('ik', 0.041), ('distribution', 0.04), ('markov', 0.04), ('hidden', 0.039), ('partitions', 0.039), ('complexes', 0.038), ('equivalence', 0.037), ('produced', 0.036), ('desiderata', 0.036), ('judgments', 0.036), ('reconstructions', 0.036), ('stick', 0.035), ('prior', 0.035), ('hn', 0.035), ('stochastic', 0.035), ('models', 0.035), ('nite', 0.035), ('posterior', 0.034), ('carlo', 0.033), ('monte', 0.033), ('za', 0.032), ('movies', 0.032), ('ou', 0.032), ('model', 0.032), ('blei', 0.031), ('mcmc', 0.03), ('entities', 0.029), ('factorial', 0.029), ('matrix', 0.029), ('gael', 0.029), ('knowles', 0.029), ('people', 0.028), ('protein', 0.028), ('unbounded', 0.028), ('indicate', 0.027), ('bernardo', 0.027), ('ith', 0.026), ('equation', 0.026), ('responsible', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
2 0.25519136 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
Author: David M. Blei, Peter I. Frazier
Abstract: We develop the distance dependent Chinese restaurant process, a flexible class of distributions over partitions that allows for dependencies between the elements. This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies arising from time, space, and network connectivity. We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both fully observed and latent mixture settings. We study its empirical performance with three text corpora. We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data and network data. We also show that the distance dependent CRP representation of the traditional CRP mixture leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation. Keywords: Chinese restaurant processes, Bayesian nonparametrics
3 0.14760628 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
4 0.11295732 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
Author: Lauren A. Hannah, David M. Blei, Warren B. Powell
Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency
5 0.092916414 61 jmlr-2011-Logistic Stick-Breaking Process
Author: Lu Ren, Lan Du, Lawrence Carin, David Dunson
Abstract: A logistic stick-breaking process (LSBP) is proposed for non-parametric clustering of general spatially- or temporally-dependent data, imposing the belief that proximate data are more likely to be clustered together. The sticks in the LSBP are realized via multiple logistic regression functions, with shrinkage priors employed to favor contiguous and spatially localized segments. The LSBP is also extended for the simultaneous processing of multiple data sets, yielding a hierarchical logistic stick-breaking process (H-LSBP). The model parameters (atoms) within the H-LSBP are shared across the multiple learning tasks. Efficient variational Bayesian inference is derived, and comparisons are made to related techniques in the literature. Experimental analysis is performed for audio waveforms and images, and it is demonstrated that for segmentation applications the LSBP yields generally homogeneous segments with sharp boundaries. Keywords: Bayesian, nonparametric, dependent, hierarchical models, segmentation
6 0.082256161 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
7 0.080517784 36 jmlr-2011-Generalized TD Learning
8 0.072689131 54 jmlr-2011-Learning Latent Tree Graphical Models
9 0.065687001 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
10 0.06551037 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
11 0.061879702 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
12 0.054831851 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
13 0.047778413 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
14 0.046469286 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
15 0.045709755 55 jmlr-2011-Learning Multi-modal Similarity
16 0.045600954 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
17 0.045041721 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
18 0.039985519 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
19 0.039543178 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
20 0.039081916 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
topicId topicWeight
[(0, 0.245), (1, -0.13), (2, -0.186), (3, 0.016), (4, -0.222), (5, -0.029), (6, 0.092), (7, 0.329), (8, -0.08), (9, -0.017), (10, 0.286), (11, 0.224), (12, -0.031), (13, -0.067), (14, -0.059), (15, 0.001), (16, -0.152), (17, 0.005), (18, -0.042), (19, -0.067), (20, 0.109), (21, -0.077), (22, 0.01), (23, -0.017), (24, 0.011), (25, -0.001), (26, 0.052), (27, -0.091), (28, 0.062), (29, -0.008), (30, -0.052), (31, -0.064), (32, -0.038), (33, 0.074), (34, -0.091), (35, -0.027), (36, 0.001), (37, 0.007), (38, -0.01), (39, 0.012), (40, 0.04), (41, -0.015), (42, -0.035), (43, -0.024), (44, 0.023), (45, -0.052), (46, 0.066), (47, -0.011), (48, 0.053), (49, -0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.94384629 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
2 0.92304468 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
Author: David M. Blei, Peter I. Frazier
Abstract: We develop the distance dependent Chinese restaurant process, a flexible class of distributions over partitions that allows for dependencies between the elements. This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies arising from time, space, and network connectivity. We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both fully observed and latent mixture settings. We study its empirical performance with three text corpora. We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data and network data. We also show that the distance dependent CRP representation of the traditional CRP mixture leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation. Keywords: Chinese restaurant processes, Bayesian nonparametrics
3 0.67061728 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
4 0.43047071 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
Author: Lauren A. Hannah, David M. Blei, Warren B. Powell
Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency
5 0.42638698 61 jmlr-2011-Logistic Stick-Breaking Process
Author: Lu Ren, Lan Du, Lawrence Carin, David Dunson
Abstract: A logistic stick-breaking process (LSBP) is proposed for non-parametric clustering of general spatially- or temporally-dependent data, imposing the belief that proximate data are more likely to be clustered together. The sticks in the LSBP are realized via multiple logistic regression functions, with shrinkage priors employed to favor contiguous and spatially localized segments. The LSBP is also extended for the simultaneous processing of multiple data sets, yielding a hierarchical logistic stick-breaking process (H-LSBP). The model parameters (atoms) within the H-LSBP are shared across the multiple learning tasks. Efficient variational Bayesian inference is derived, and comparisons are made to related techniques in the literature. Experimental analysis is performed for audio waveforms and images, and it is demonstrated that for segmentation applications the LSBP yields generally homogeneous segments with sharp boundaries. Keywords: Bayesian, nonparametric, dependent, hierarchical models, segmentation
6 0.36260629 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
7 0.31131706 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
8 0.31060398 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
9 0.30375621 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
10 0.29258788 54 jmlr-2011-Learning Latent Tree Graphical Models
11 0.23757209 36 jmlr-2011-Generalized TD Learning
12 0.23179339 12 jmlr-2011-Bayesian Co-Training
13 0.23102666 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
14 0.22940631 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
15 0.22741479 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
16 0.22654018 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
17 0.21306653 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
18 0.20894462 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
19 0.20393194 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
20 0.20125966 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
topicId topicWeight
[(4, 0.052), (9, 0.016), (10, 0.026), (13, 0.01), (24, 0.53), (31, 0.111), (32, 0.019), (41, 0.02), (60, 0.014), (67, 0.013), (73, 0.028), (78, 0.039), (90, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.97282475 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
2 0.91034341 58 jmlr-2011-Learning from Partial Labels
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
3 0.87118489 28 jmlr-2011-Double Updating Online Learning
Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification
4 0.69185966 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
5 0.58819443 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar
Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC based models for the problems of grouped feature selection (MICG ROUP) and multi-task feature selection (MIC-M ULTI). MIC-G ROUP assumes that the features are divided into groups and induces two level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-M ULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, T RANS F EAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.1 Keywords: feature selection, minimum description length principle, multi-task learning
6 0.5853762 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
7 0.57648814 61 jmlr-2011-Logistic Stick-Breaking Process
8 0.57491064 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
9 0.51973528 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
10 0.49988487 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
11 0.49446496 48 jmlr-2011-Kernel Analysis of Deep Networks
12 0.48663491 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
13 0.48334351 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
15 0.46825713 16 jmlr-2011-Clustering Algorithms for Chains
16 0.46693793 66 jmlr-2011-Multiple Kernel Learning Algorithms
17 0.46258125 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
18 0.46105596 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
19 0.45998064 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
20 0.45951313 7 jmlr-2011-Adaptive Exact Inference in Graphical Models