nips nips2010 nips2010-120 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jan Gasthaus, Yee W. Teh
Abstract: The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the “mysterious” coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present some experimental results supporting our improvements. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the “mysterious” coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. [sent-10, score-0.409]
2 One innovation of the SM over [3] is its use of coagulation and fragmentation properties [4, 5] that allow for efficient representation of the model using a data structure whose size is linear in the sequence length. [sent-17, score-0.292]
3 In particular, current algorithms use a Chinese restaurant franchise representation for the HPYP, where the seating arrangement of customers in each restaurant is represented by a list, each entry being the number of customers sitting around one table [3]. [sent-23, score-1.631]
4 This is already an improvement over the na¨ve ı Chinese restaurant franchise in [6] which stores pointers from customers to the tables they sit at, but can still lead to huge memory requirements when restaurants contain many tables. [sent-24, score-0.979]
5 Our proposal is to store even less, namely only the minimal statistics about each restaurant required to make predictions: the number of customers and the number of tables occupied by the customers. [sent-26, score-0.741]
6 1 In Section 2 we will give precise definitions of Pitman-Yor processes and Chinese restaurant processes. [sent-28, score-0.26]
7 As a side benefit we will also be able to give an elementary proof of the coagulation and fragmentation properties in Section 4, which was presented as a fait accompli in [1], while the general and rigorous treatment in the original papers [4, 5] is somewhat inaccessible to a wider audience. [sent-30, score-0.269]
8 We can describe a Pitman-Yor process using its associated Chinese restaurant process (CRP). [sent-33, score-0.237]
9 A Chinese restaurant has customers sitting around tables which serve dishes. [sent-34, score-0.751]
10 If there are c customers we index them with [c] = {1, . [sent-35, score-0.254]
11 We define a seating arrangement of the customers as a set of disjoint non-empty subsets partitioning [c]. [sent-39, score-0.69]
12 Each subset is a table and consists of the customers sitting around it, e. [sent-40, score-0.351]
13 {{1, 3}, {2}} means customers 1 and 3 sit at one table and customer 2 sits at another by itself. [sent-42, score-0.578]
14 Let Ac be the set of seating arrangements of c customers, and Act those with exactly t tables. [sent-43, score-0.474]
15 The CRP describes a distribution over seating arrangements as follows: customer 1 sits at a table; for customer c + 1, if A ∈ Ac is the current seating arrangement, then she joins a table a ∈ A with probability |a|−d and starts a new table with α+c probability α+|A|d . [sent-44, score-1.341]
16 Note that conditioning on a fixed t the seating arrangement will not depend on α, only on d. [sent-50, score-0.436]
17 In particular, marginalizing out G, the distribution of z1:c can be described as follows: draw A ∼ CRPc (α, d), on each table serve a dish which is an iid draw from G0 , finally let variable zi take on the value of the dish served at the table that customer i sat at. [sent-59, score-0.634]
18 This is equivalent to conditioning on the dishes that each customer is served. [sent-61, score-0.221]
19 Since customers at the same table are served the same dish, the different values among the zi ’s split the restaurant into multiple sections, with customers and tables in each section being served a distinct dish. [sent-62, score-1.191]
20 There can be more than one table in each section since multiple tables can serve the same dish (if G0 has atoms). [sent-63, score-0.346]
21 If s ∈ Σ is a dish, let cs be the number of zi ’s with value s (number of customers served dish s), ts the number of tables, and As ∈ Acs ts the seating arrangement of customers around the tables serving dish s (we reindex the cs customers to be [cs ]). [sent-64, score-1.907]
22 The joint distribution over seating arrangements and observations is then:1 [α + d]t· −1 d c [α + 1]1· −1 G0 (s)ts P ({cs , ts , As }, z1:c ) = s∈Σ where t· = s∈Σ ts |a|−1 [1 − d]1 , (3) s∈Σ a∈As and similarly for c· . [sent-65, score-0.534]
23 2 3 The Sequence Memoizer and its Chinese Restaurant Representation In this section we review the sequence memoizer (SM) and its representation using Chinese restaurants [3, 11, 1, 2]. [sent-70, score-0.28]
24 As in previous works, the hierarchy over {Gu } is represented using a Chinese restaurant franchise [6]. [sent-80, score-0.339]
25 Each Gu has a corresponding restaurant indexed by u. [sent-81, score-0.237]
26 Customers in the restaurant are draws from Gu , tables are draws from its base distribution Gσ(u) , and dishes are the drawn values from Σ. [sent-82, score-0.52]
27 For each s ∈ Σ and u ∈ Σ∗ , let cus and tus be the numbers of customers and tables in restaurant u served dish s, and let Aus ∈ Acus tus be their seating arrangement. [sent-83, score-2.578]
28 Each observation of xi in context x1:i−1 corresponds to a customer in restaurant x1:i−1 who is served dish xi , and each table in each restaurant u, being a draw from the base distribution Gσ(u) , corresponds to a customer in the parent restaurant σ(u). [sent-84, score-1.365]
29 Thus, the numbers of customers and tables have to satisfy the constraints cus = cx + us tvs , (7) v:σ(v)=u where cx = 1 if s = xi and u = x1:i−1 for some i, and 0 otherwise. [sent-85, score-0.942]
30 us The goal of inference is to compute the posterior over the states {cus , tus , Aus }s∈Σ,u∈Σ∗ of the restaurants (and possibly the concentration and discount parameters). [sent-86, score-0.713]
31 The joint distribution can be obtained by multiplying the probabilities of all seating arrangements (3) in all restaurants: t [αu + du ]du· −1 u H(s)tεs P ({cus , tus , Aus }, x1:T ) = s∈Σ u∈Σ∗ c [αu + 1]1u· −1 |a|−1 [1 − du ]1 . [sent-87, score-1.206]
32 s∈Σ a∈Aus (8) The first parentheses contain the probability of draws from the overall base distribution H, and the second parentheses contain the probability of the seating arrangement in restaurant u. [sent-88, score-0.711]
33 Given a state of the restaurants drawn from the posterior, the predictive probability of symbol s in context v can ∗ then be computed recursively (with Pσ(ε) (s) defined to be H(s)): ∗ Pv (s) = 4 cvs − tvs d αv + tv· d ∗ (s). [sent-89, score-0.225]
34 Though limiting the flexibility of the model, this allowed them to take advantage of coagulation and fragmentation properties of PYPs [4, 5] to marginalize out all but a linear number (in T ) of restaurants from the hierarchy. [sent-91, score-0.381]
35 Our hyperparameter settings also retain the coagulation and fragmentation properties which allow us to marginalize out many PYPs in the hierarchy for efficient inference. [sent-99, score-0.272]
36 Let c ≥ 1 and suppose A2 ∈ Ac and A1 ∈ A|A2 | are two seating arrangements where the number of customers in A1 is the same as that of tables in A2 . [sent-102, score-0.931]
37 Each customer in A1 can be put in one-to-one correspondence to a table in A2 and sits at a table in A1 . [sent-103, score-0.329]
38 Let C ∈ Ac be the seating arrangement obtained by coagulating (merging) tables of A2 corresponding to customers in A1 sitting at the same table. [sent-105, score-0.973]
39 Further, split A2 into sections, one for each table a ∈ C, where each section Fa ∈ A|a| contains the |a| customers and tables merged to make up a. [sent-106, score-0.497]
40 The converse of coagulating tables of A2 into C is of course to fragment each table a ∈ C into the smaller tables in Fa . [sent-107, score-0.501]
41 Note that there is a one-to-one correspondence between tables in C and in A1 , and the number of customers in each table of A1 is that of tables in the corresponding Fa . [sent-108, score-0.7]
42 Statement (I) of the theorem is exactly the Chinese restaurant franchise of the hierarchical model G1 |G0 ∼ PY(α, d1 , G0 ), G2 |G1 ∼ PY(αd2 , d2 , G1 ) with c iid draws from G2 . [sent-123, score-0.359]
43 The theorem shows 4 that the clustering structure of the c customers in the franchise is equivalent to the seating arrangement in a CRP with parameters αd2 , d1 d2 , i. [sent-124, score-0.772]
44 Conversely, the fragmentation operation (II) regains Chinese restaurant representations for both G2 |G1 and G1 |G0 from one for G2 |G0 . [sent-127, score-0.378]
45 In the rest of this paper we will refer to (6) and its Chinese restaurant franchise representation (8) with the understanding that we are operating in this reduced hierarchy. [sent-130, score-0.353]
46 5 Compact Representation Current inference algorithms for the SM and hierarchical Pitman-Yor processes operate in the Chinese restaurant franchise representation, and use either Gibbs sampling [3, 11, 1] or particle filtering [2]. [sent-133, score-0.557]
47 To lower memory requirements, instead of storing the precise seating arrangement of each restaurant, the algorithms only store the numbers of customers, numbers of tables and sizes of all tables in the franchise. [sent-134, score-0.967]
48 However, for large data sets the amount of memory required to store the sizes of the tables can still be very large. [sent-136, score-0.27]
49 We propose algorithms that only store the numbers of customers and tables but not the table sizes. [sent-137, score-0.554]
50 This compact representation needs to store only two integers (cus , tus ) per context/symbol pair, as opposed to tus integers. [sent-138, score-1.09]
51 Our starting point is the joint distribution over the Chinese restaurant franchise (8). [sent-141, score-0.319]
52 Integrating out the seating arrangements {Aus } using (2) gives the joint distribution over {cus , tus }: t [αu + du ]du· −1 u H(s)tεs P ({cus , tus }, x1:T ) = s∈Σ u∈U c [αu + 1]1u· −1 Sdu (cus , tus ) . [sent-142, score-2.06]
53 (10) s∈Σ Note that each cus is in fact determined by (7) so in fact the only unobserved variables in (10) are {tus }. [sent-143, score-0.375]
54 1 Sampling Algorithms Direct Gibbs Sampling of {cus , tus }. [sent-146, score-0.492]
55 Since each cus is determined by cx and the tvs at child restaurants v, it is sufficient to update each tus , us which for tus in the range {1, . [sent-148, score-1.546]
56 , cus } has conditional distribution P (tus |rest) ∝ [αu + du ]tu· −1 du c [ασ(u) + 1]1σ(u)· −1 Sdu (cus , tus )Sdσ(u) (cσ(u)s , tσ(u)s ), (11) where tu· , cσ(u)· and cσ(u)s all depend on tus through the constraints (7). [sent-151, score-1.579]
57 One problem with this sampler is that we need to compute Sdu (c, t) for all 1 ≤ c, t ≤ cus . [sent-152, score-0.463]
58 If du is fixed these can be precomputed and stored, but the resulting memory requirement is again large since each restaurant typically has its own du value. [sent-153, score-0.496]
59 Another strategy is to re-instantiate the seating arrangement by sampling Aus ∼ CRPcus tus (du ) from its conditional distribution given cus , tus (see Section 5. [sent-159, score-1.842]
60 2 below), then performing the original Gibbs sampling of seating arrangements [3, 11]. [sent-160, score-0.541]
61 This produces a new number of tables tus and the seating arrangement can be discarded. [sent-161, score-1.131]
62 Note however that when tus changes this sampler will introduce changes to ancestor restaurants (by adding 2 In both representations one may also want to store the total number of customers and tables in each restaurant for efficiency. [sent-162, score-1.431]
63 In practice, where there is additional overhead due to the data structures involved, storage space for the full representation can be reduced by treating context/symbol pairs with only one customer separately. [sent-163, score-0.26]
64 5 or removing customers), so these will need to have their seating arrangements instantiated as well. [sent-164, score-0.474]
65 To implement this sampler efficiently, we visit restaurants in depth-first order, keeping in memory only the seating arrangements of all restaurants on the path to the current one. [sent-165, score-0.859]
66 The computational cost is O(cus tus ), but with a potentially smaller hidden constant (no log/exp computations are required). [sent-166, score-0.51]
67 A third strategy is to “imagine” having a seating arrangement and running the original Gibbs sampler, incrementing tus if a table would have been created, and decrementing tus if a table would have been deleted. [sent-168, score-1.569]
68 Recall that the original Gibbs sampler operates by iterating over customers, treating each as the last customer in the restaurant, removing it, then adding it back into the restaurant. [sent-169, score-0.306]
69 When removing, if the customer were sitting by himself, a table would need to be deleted too, so the probability of decrementing tus is the probability of a customer sitting by himself. [sent-170, score-1.063]
70 From (2), this can be worked out to be Sd (cus − 1, tus − 1) P (decrement tus ) = u . [sent-171, score-0.984]
71 (12) Sdu (cus , tus ) The numerator is due to a sum over all seating arrangements where the other cus − 1 customers sit at the other tus − 1 tables. [sent-172, score-2.122]
72 This sampler also requires computation of Sdu (c, t), but only for 1 ≤ t ≤ tus which can be significantly smaller than cus . [sent-174, score-0.955]
73 Computation cost is O(cus tus ) (but again with a larger constant due to computing the Stirling numbers in a stable way). [sent-175, score-0.521]
74 We did not find a sampling method taking less time than O(cus tus ). [sent-176, score-0.539]
75 (13) gives the probability of incrementing tus (and adding a customer to the parent restaurant) when a customer is added into a restaurant. [sent-178, score-0.959]
76 This can be used as the basis for a particle filter, which iterates through the sequence x1:T , adding a customer corresponding to s = xi in context u = x1:i−1 at each step. [sent-179, score-0.358]
77 Since no customer deletion is required, the cost is very small: just O(cus ) for the cus customers per s and u (plus the cost of traversing the hierarchy to the current restaurant, which is always necessary). [sent-180, score-0.847]
78 2 Re-instantiating Aus given cus , tus To simplify notation, here we will let d = du , c = cus , t = tus and A = Aus ∈ Act . [sent-185, score-1.844]
79 Label a table a ∈ A using the index of the first customer at the table, i. [sent-194, score-0.238]
80 Let zi be the number of tables occupied by the first i customers, and yi the label of the table that customer i sits at. [sent-197, score-0.629]
81 The variables satisfy the following constraints: z1 = 1, zc = t, and zi = zi−1 in which case yi ∈ [i − 1] or zi = zi−1 + 1 in which case yi = i. [sent-198, score-0.318]
82 This gives a one-to-one correspondence between seating arrangements in Act and settings of the variables satisfying the above constraints. [sent-199, score-0.474]
83 , zc is distributed according to a Markov network with z1 = 1, zc = t, and edge potentials: i − 1 − zi d if zi = zi−1 , f (zi , zi−1 ) = 1 (14) if zi = zi−1 + 1, 0 otherwise. [sent-203, score-0.455]
84 Sd (c, t) Given z1:c , we give each yi the following distribution conditioned on y1:i−1 : P (yi |z1:c , y1:i−1 ) = 1 if yi = i and zi = zi−1 + 1, Pi−1 1(yj =yi )−d i−1−zi d j=1 6 if zi = zi−1 and yi ∈ [i − 1]. [sent-205, score-0.257]
85 (15) (16) 7 Calgary: news x 10 8 context/symbol pairs tables (sampling) tables (particle filter) 2 Brown corpus x 10 Brown corpus 40 context/symbol pairs tables (sampling) tables (particle filter) 6 Seconds per iteration 7 2. [sent-206, score-0.916]
86 Note also that sampling tends to increase the number of tables over the particle filter initialization. [sent-212, score-0.375]
87 (c) Time per iteration (seconds) as a function of input size for the original Gibbs sampler in the compact representation and the re-instantiating sampler (on the Brown corpus). [sent-213, score-0.274]
88 In particle filtering and in prediction, we often need to re-instantiate a restaurant which was previously marginalized out. [sent-219, score-0.383]
89 We can do so by sampling Aus given cus , tus for each s, then fragmenting each Aus using Theorem 1, counting the resulting numbers of customers and tables, then forgetting the seating arrangements. [sent-220, score-1.56]
90 While the difference does not seem dramatic, there is still a significant amount of memory that can be saved by using the compact representation, as there is no additional overhead and memory fragmentation due to variable-size arrays. [sent-224, score-0.291]
91 The comparison between the byte-level model and the word-level model in Figure 2 also demonstrates that the compact representation saves more space when |Σ| is small (which leads to context/symbol pairs having larger cus ’s and tus ’s). [sent-225, score-0.945]
92 Finally, Figure 2 illustrates another interesting effect: the number of tables is generally larger after a few iterations of Gibbs sampling have been performed after the initialization using a single-particle particle filter [2]. [sent-226, score-0.375]
93 The second experiment compares the computational cost of the compact original sampler and the sampler that re-instantiates full seating arrangements. [sent-227, score-0.58]
94 The main computational cost of the original sampler is computing the ratio (12), while sampling the seating arrangements is the main computational cost of the re-instantiating sampler. [sent-228, score-0.629]
95 Inference is performed by either just using the particle filter or using the particle filter followed by 50 burn-in iterations of Gibbs sampling. [sent-289, score-0.25]
96 The final two columns labelled Online show the results obtained by using the particle filter on the test set as well, after training with either just the particle filter or particle filter followed by 50 Gibbs iterations. [sent-292, score-0.375]
97 One also has a choice when making predictions involving contexts that were marginalized out from the model: one can either re-instantiate these contexts by fragmentation or simply predict from the parent (or even the child) of the required node. [sent-300, score-0.269]
98 We found that the algorithm which re-instantiates seating arrangements is significantly more efficient than the other two Gibbs samplers, while particle filtering is most efficient but produces slightly worse predictions. [sent-304, score-0.599]
99 Along the way, we formalized the metaphorical language often used to describe Chinese restaurant processes in the machine learning literature, and were able to provide an elementary proof of the coagulation/fragmentation properties. [sent-305, score-0.309]
100 A parting remark is that the posterior distribution over {cus , tus } in (10) is in the form of a standard Markov network with sum constraints (7). [sent-308, score-0.492]
wordName wordTfidf (topN-words)
[('tus', 0.492), ('cus', 0.375), ('seating', 0.34), ('customers', 0.254), ('restaurant', 0.237), ('tables', 0.203), ('customer', 0.198), ('fragmentation', 0.141), ('arrangements', 0.134), ('restaurants', 0.129), ('particle', 0.125), ('du', 0.11), ('dish', 0.103), ('aus', 0.103), ('chinese', 0.1), ('zi', 0.097), ('arrangement', 0.096), ('fa', 0.089), ('sampler', 0.088), ('gibbs', 0.084), ('franchise', 0.082), ('memoizer', 0.082), ('coagulation', 0.082), ('zc', 0.082), ('ac', 0.071), ('stirling', 0.07), ('crp', 0.066), ('gu', 0.062), ('crpc', 0.059), ('sdu', 0.059), ('sitting', 0.057), ('served', 0.053), ('sm', 0.053), ('sits', 0.051), ('py', 0.05), ('sd', 0.049), ('sampling', 0.047), ('cs', 0.045), ('compact', 0.044), ('corpus', 0.043), ('parent', 0.043), ('concentration', 0.042), ('table', 0.04), ('memory', 0.039), ('sequence', 0.035), ('brown', 0.035), ('calgary', 0.035), ('sit', 0.035), ('tvs', 0.035), ('symbols', 0.034), ('representation', 0.034), ('lter', 0.033), ('contexts', 0.032), ('fragment', 0.032), ('predictive', 0.032), ('gasthaus', 0.031), ('gatsby', 0.031), ('hpyp', 0.031), ('pyp', 0.031), ('pyps', 0.031), ('ts', 0.03), ('symbol', 0.029), ('ltering', 0.029), ('numbers', 0.029), ('tu', 0.029), ('marginalize', 0.029), ('incrementing', 0.028), ('wood', 0.028), ('discount', 0.028), ('overhead', 0.028), ('store', 0.028), ('elementary', 0.026), ('act', 0.025), ('enlarged', 0.024), ('compression', 0.023), ('coagulating', 0.023), ('crpct', 0.023), ('dishes', 0.023), ('fragmenting', 0.023), ('language', 0.023), ('cx', 0.023), ('processes', 0.023), ('inference', 0.022), ('filter', 0.022), ('hierarchical', 0.021), ('yi', 0.021), ('london', 0.021), ('marginalized', 0.021), ('decrementing', 0.021), ('mitigates', 0.021), ('hierarchy', 0.02), ('original', 0.02), ('multiplying', 0.02), ('base', 0.019), ('draws', 0.019), ('occupied', 0.019), ('news', 0.018), ('computations', 0.018), ('discounts', 0.018), ('singapore', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 120 nips-2010-Improvements to the Sequence Memoizer
Author: Jan Gasthaus, Yee W. Teh
Abstract: The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the “mysterious” coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present some experimental results supporting our improvements. 1
2 0.07857164 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
3 0.069449589 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
4 0.068173572 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
Author: Daniel Lowd, Pedro Domingos
Abstract: Arithmetic circuits (ACs) exploit context-specific independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the first ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2 -F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2 -G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines. 1
5 0.050301738 284 nips-2010-Variational bounds for mixed-data factor analysis
Author: Mohammad E. Khan, Guillaume Bouchard, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method. 1
6 0.048569605 155 nips-2010-Learning the context of a category
7 0.046775829 150 nips-2010-Learning concept graphs from text with stick-breaking priors
8 0.046074681 215 nips-2010-Probabilistic Deterministic Infinite Automata
9 0.045059267 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
10 0.041655529 101 nips-2010-Gaussian sampling by local perturbations
11 0.037854966 68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks
12 0.035402361 63 nips-2010-Distributed Dual Averaging In Networks
13 0.033950571 138 nips-2010-Large Margin Multi-Task Metric Learning
14 0.031811103 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
15 0.030525619 151 nips-2010-Learning from Candidate Labeling Sets
16 0.029631432 257 nips-2010-Structured Determinantal Point Processes
17 0.027600184 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
18 0.026860481 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
19 0.026155086 224 nips-2010-Regularized estimation of image statistics by Score Matching
20 0.026002709 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication
topicId topicWeight
[(0, 0.081), (1, 0.018), (2, -0.001), (3, 0.012), (4, -0.087), (5, 0.029), (6, 0.038), (7, 0.014), (8, 0.004), (9, 0.012), (10, -0.006), (11, -0.018), (12, -0.001), (13, 0.006), (14, -0.021), (15, -0.016), (16, -0.002), (17, 0.01), (18, -0.004), (19, -0.005), (20, -0.055), (21, 0.035), (22, 0.035), (23, -0.044), (24, -0.008), (25, 0.097), (26, 0.008), (27, -0.037), (28, -0.014), (29, -0.055), (30, -0.007), (31, 0.122), (32, -0.027), (33, -0.053), (34, -0.052), (35, -0.057), (36, -0.042), (37, 0.067), (38, -0.033), (39, 0.004), (40, -0.029), (41, 0.028), (42, -0.048), (43, -0.032), (44, 0.065), (45, 0.024), (46, -0.052), (47, -0.035), (48, 0.037), (49, 0.105)]
simIndex simValue paperId paperTitle
same-paper 1 0.90352869 120 nips-2010-Improvements to the Sequence Memoizer
Author: Jan Gasthaus, Yee W. Teh
Abstract: The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the “mysterious” coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present some experimental results supporting our improvements. 1
2 0.58051383 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
3 0.53844839 215 nips-2010-Probabilistic Deterministic Infinite Automata
Author: Nicholas Bartlett, Frank Wood, David Tax
Abstract: We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). We define and develop a sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. We test PDIA inference both on PDFA structure learning and on both natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models. 1
4 0.52629012 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
5 0.5139817 237 nips-2010-Shadow Dirichlet for Restricted Probability Modeling
Author: Bela Frigyik, Maya Gupta, Yihua Chen
Abstract: Although the Dirichlet distribution is widely used, the independence structure of its components limits its accuracy as a model. The proposed shadow Dirichlet distribution manipulates the support in order to model probability mass functions (pmfs) with dependencies or constraints that often arise in real world problems, such as regularized pmfs, monotonic pmfs, and pmfs with bounded variation. We describe some properties of this new class of distributions, provide maximum entropy constructions, give an expectation-maximization method for estimating the mean parameter, and illustrate with real data. 1 Modeling Probabilities for Machine Learning Modeling probability mass functions (pmfs) as random is useful in solving many real-world problems. A common random model for pmfs is the Dirichlet distribution [1]. The Dirichlet is conjugate to the multinomial and hence mathematically convenient for Bayesian inference, and the number of parameters is conveniently linear in the size of the sample space. However, the Dirichlet is a distribution over the entire probability simplex, and for many problems this is simply the wrong domain if there is application-specific prior knowledge that the pmfs come from a restricted subset of the simplex. For example, in natural language modeling, it is common to regularize a pmf over n-grams by some generic language model distribution q0 , that is, the pmf to be modeled is assumed to have the form θ = λq + (1 − λ)q0 for some q in the simplex, λ ∈ (0, 1) and a fixed generic model q0 [2]. But once q0 and λ are fixed, the pmf θ can only come from a subset of the simplex. Another natural language processing example is modeling the probability of keywords in a dictionary where some words are related, such as espresso and latte, and evidence for the one is to some extent evidence for the other. This relationship can be captured with a bounded variation model that would constrain the modeled probability of espresso to be within some of the modeled probability of latte. We show that such bounds on the variation between pmf components also restrict the domain of the pmf to a subset of the simplex. As a third example of restricting the domain, the similarity discriminant analysis classifier estimates class-conditional pmfs that are constrained to be monotonically increasing over an ordered sample space of discrete similarity values [3]. In this paper we propose a simple variant of the Dirichlet whose support is a subset of the simplex, explore its properties, and show how to learn the model from data. We first discuss the alternative solution of renormalizing the Dirichlet over the desired subset of the simplex, and other related work. Then we propose the shadow Dirichlet distribution; explain how to construct a shadow Dirichlet for three types of restricted domains: the regularized pmf case, bounded variation between pmf components, and monotonic pmfs; and discuss the most general case. We show how to use the expectation-maximization (EM) algorithm to estimate the shadow Dirichlet parameter α, and present simulation results for the estimation. 1 Dirichlet Shadow Dirichlet Renormalized Dirichlet Figure 1: Dirichlet, shadow Dirichlet, and renormalized Dirichlet for α = [3.94 2.25 2.81]. 2 Related Work One solution to modeling pmfs on only a subset of the simplex is to simply restrict the support of ˜ ˜ the Dirichlet to the desired support S, and renormalize the Dirichlet over S (see Fig. 1 for an example). This renormalized Dirichlet has the advantage that it is still a conjugate distribution for the multinomial. Nallapati et al.considered the renormalized Dirichlet for language modeling, but found it difficult to use because the density requires numerical integration to compute the normalizer [4] . In addition, there is no closed form solution for the mean, covariance, or peak of the renormalized Dirichlet, making it difficult to work with. Table 1 summarizes these properties. Additionally, generating samples from the renormalized Dirichlet is inefficient: one draws samples from the stan˜ dard Dirichlet, then rejects realizations that are outside S. For high-dimensional sample spaces, this could greatly increase the time to generate samples. Although the Dirichlet is a classic and popular distribution on the simplex, Aitchison warns it “is totally inadequate for the description of the variability of compositional data,” because of its “implied independence structure and so the Dirichlet class is unlikely to be of any great use for describing compositions whose components have even weak forms of dependence” [5]. Aitchison instead championed a logistic normal distribution with more parameters to control covariance between components. A number of variants of the Dirichlet that can capture more dependence have been proposed and analyzed. For example, the scaled Dirichlet enables a more flexible shape for the distribution [5], but does not change the support. The original Dirichlet(α1 , α2 , . . . αd ) can be derived as Yj / j Yj where Yj ∼ Γ(αj , β), whereas the scaled Dirichlet is derived from Yj ∼ Γ(αj , βj ), resulting in α density p(θ) = γ j ( α −1 βj j θ j j βi θi )α1 +···+αd i , where β, α ∈ Rd are parameters, and γ is the normalizer. + Another variant is the generalized Dirichlet [6] which also has parameters β, α ∈ Rd , and allows + greater control of the covariance structure, again without changing the support. As perhaps first noted by Karl Pearson [7] and expounded upon by Aitchison [5], correlations of proportional data can be very misleading. Many Dirichlet variants have been generalizations of the Connor-Mossiman variant, Dirichlet process variants, other compound Dirichlet models, and hierarchical Dirichlet models. Ongaro et al. [8] propose the flexible Dirichlet distribution by forming a re-parameterized mixture of Dirichlet distributions. Rayens and Srinivasan [9] considered the dependence structure for the general Dirichlet family called the generalized Liouville distributions. In contrast to prior efforts, the shadow Dirichlet manipulates the support to achieve various kinds of dependence that arise frequently in machine learning problems. 3 Shadow Dirichlet Distribution We introduce a new distribution that we call the shadow Dirichlet distribution. Let S be the prob˜ ability (d − 1)-simplex, and let Θ ∈ S be a random pmf drawn from a Dirichlet distribution with density pD and unnormalized parameter α ∈ Rd . Then we say the random pmf Θ ∈ S is distributed + ˜ according to a shadow Dirichlet distribution if Θ = M Θ for some fixed d × d left-stochastic (that ˜ is, each column of M sums to 1) full-rank (and hence invertible) matrix M , and we call Θ the gen2 erating Dirichlet of Θ, or Θ’s Dirichlet shadow. Because M is a left-stochastic linear map between finite-dimensional spaces, it is a continuous map from the convex and compact S to a convex and compact subset of S that we denote SM . The shadow Dirichlet has two parameters: the generating Dirichlet’s parameter α ∈ Rd , and the + d × d matrix M . Both α and M can be estimated from data. However, as we show in the following subsections, the matrix M can be profitably used as a design parameter that is chosen based on application-specific knowledge or side-information to specify the restricted domain SM , and in that way impose dependency between the components of the random pmfs. The shadow Dirichlet density p(θ) is the normalized pushforward of the Dirichlet density, that is, it is the composition of the Dirichlet density and M −1 with the Jacobian: 1 α −1 p(θ) = (M −1 θ)j j , (1) B(α) |det(M )| j Γ(αj ) d j is the standard Dirichlet normalizer, and α0 = j=1 αj is the standard where B(α) Γ(α0 ) Dirichlet precision factor. Table 1 summarizes the basic properties of the shadow Dirichlet. Fig. 1 shows an example shadow Dirichlet distribution. Generating samples from the shadow Dirichlet is trivial: generate samples from its generating Dirichlet (for example, using stick-breaking or urn-drawing) and multiply each sample by M to create the corresponding shadow Dirichlet sample. Table 1: Table compares and summarizes the Dirichlet, renormalized Dirichlet, and shadow Dirichlet distributions. Dirichlet(α) Density p(θ) Mean 1 B(α) d j=1 α −1 θj j Shadow Dirichlet (α, M ) 1 B(α)|det(M )| α α0 Renormalized ˜ Dirichlet (α, S) d −1 αj −1 θ)j j=1 (M 1 ˜ S M d j=1 α α0 αj −1 qj dq d j=1 α −1 θj j ˜ θp(θ)dθ S ¯ ¯ − θ)(θ − θ)T p(θ)dθ Cov(Θ) M Cov(Θ)M T αj −1 α0 −d j M α0 −d max p(θ) stick-breaking, urn-drawing draw from Dirichlet(α), multiply by M draw from Dirichlet(α), ˜ reject if not in S ML Estimate iterative (simple functions) iterative (simple functions) unknown complexity ML Compound Estimate iterative (simple functions) iterative (numerical integration) unknown complexity Covariance Mode (if α > 1) How to Sample 3.1 α −1 ˜ (θ S ˜ θ∈S Example: Regularized Pmfs The shadow Dirichlet can be designed to specify a distribution over a set of regularized pmfs SM = ˜ ˘ ˜ ˘ ˘ {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for specific values of λ and θ. In general, for a given λ and θ ∈ S, the following d × d matrix M will change the support to the desired subset SM by mapping the extreme points of S to the extreme points of SM : ˘ M = (1 − λ)θ1T + λI, (2) where I is the d × d identity matrix. In Section 4 we show that the M given in (2) is optimal in a maximum entropy sense. 3 3.2 Example: Bounded Variation Pmfs We describe how to use the shadow Dirichlet to model a random pmf that has bounded variation such that |θk − θl | ≤ k,l for any k, ∈ {1, 2, . . . , d} and k,l > 0. To construct specified bounds on the variation, we first analyze the variation for a given M . For any d × d left stochastic matrix T d d ˜ ˜ ˜ M, θ = Mθ = M1j θj . . . Mdj θj , so the difference between any two entries is j=1 j=1 ˜ |Mkj − Mlj | θj . ˜ (Mkj − Mlj )θj ≤ |θk − θl | = (3) j j Thus, to obtain a distribution over pmfs with bounded |θk − θ | ≤ k,l for any k, components, it is sufficient to choose components of the matrix M such that |Mkj − Mlj | ≤ k,l for all j = 1, . . . , d ˜ because θ in (3) sums to 1. One way to create such an M is using the regularization strategy described in Section 3.1. For this ˜ ˜ ˘ case, the jth component of θ is θj = M θ = λθj + (1 − λ)θj , and thus the variation between the j ith and jth component of any pmf in SM is: ˜ ˘ ˜ ˘ ˜ ˜ ˘ ˘ |θi − θj | = λθi + (1 − λ)θi − λθj − (1 − λ)θj ≤ λ θi − θj + (1 − λ) θi − θj ˘ ˘ ≤ λ + (1 − λ) max θi − θj . (4) i,j ˘ Thus by choosing an appropriate λ and regularizing pmf θ, one can impose the bounded variation ˘ to be the uniform pmf, and choose any λ ∈ (0, 1), then the matrix given by (4). For example, set θ M given by (2) will guarantee that the difference between any two entries of any pmf drawn from the shadow Dirichlet (M, α) will be less than or equal to λ. 3.3 Example: Monotonic Pmfs For pmfs over ordered components, it may be desirable to restrict the support of the random pmf distribution to only monotonically increasing pmfs (or to only monotonically decreasing pmfs). A d × d left-stochastic matrix M that will result in a shadow Dirichlet that generates only monotonically increasing d × 1 pmfs has kth column [0 . . . 0 1/(d − k + 1) . . . 1/(d − k + 1)]T , we call this the monotonic M . It is easy to see that with this M only monotonic θ’s can be produced, 1˜ 1˜ 1 ˜ because θ1 = d θ1 which is less than or equal to θ2 = d θ1 + d−1 θ2 and so on. In Section 4 we show that the monotonic M is optimal in a maximum entropy sense. Note that to provide support over both monotonically increasing and decreasing pmfs with one distribution is not achievable with a shadow Dirichlet, but could be achieved by a mixture of two shadow Dirichlets. 3.4 What Restricted Subsets are Possible? Above we have described solutions to construct M for three kinds of dependence that arise in machine learning applications. Here we consider the more general question: What subsets of the simplex can be the support of the shadow Dirichlet, and how to design a shadow Dirichlet for a particular support? For any matrix M , by the Krein-Milman theorem [10], SM = M S is the convex hull of its extreme points. If M is injective, the extreme points of SM are easy to specify, as a d × d matrix M will have d extreme points that occur for the d choices of θ that have only one nonzero component, as the rest of the θ will create a non-trivial convex combination of the columns of M , and therefore cannot result in extreme points of SM by definition. That is, the extreme points of SM are the d columns of M , and one can design any SM with d extreme points by setting the columns of M to be those extreme pmfs. However, if one wants the new support to be a polytope in the probability (d − 1)-simplex with m > d extreme points, then one must use a fat M with d × m entries. Let S m denote the probability 4 (m − 1)-simplex, then the domain of the shadow Dirichlet will be M S m , which is the convex hull of the m columns of M and forms a convex polytope in S with at most m vertices. In this case M cannot be injective, and hence it is not bijective between S m and M S m . However, a density on M S m can be defined as: p(θ) = 1 B(α) ˜ {θ ˜α −1 ˜ θj j dθ. ˜ M θ=θ} j (5) On the other hand, if one wants the support to be a low-dimensional polytope subset of a higherdimensional probability simplex, then a thin d × m matrix M , where m < d, can be used to implement this. If M is injective, then it has a left inverse M ∗ that is a matrix of dimension m × d, and the normalized pushforward of the original density can be used as a density on the image M S m : p(θ) = 1 α −1 1/2 B(α) |det(M T M )| (M ∗ θ)j j , j If M is not injective then one way to determine a density is to use (5). 4 Information-theoretic Properties In this section we note two information-theoretic properties of the shadow Dirichlet. Let Θ be drawn ˜ from shadow Dirichlet density pM , and let its generating Dirichlet Θ be drawn from pD . Then the differential entropy of the shadow Dirichlet is h(pM ) = log |det(M )| + h(pD ), where h(pD ) is the differential entropy of its generating Dirichlet. In fact, the shadow Dirichlet always has less entropy than its Dirichlet shadow because log |det(M )| ≤ 0, which can be shown as a corollary to the following lemma (proof not included due to lack of space): Lemma 4.1. Let {x1 , . . . , xn } and {y1 , . . . , yn } be column vectors in Rn . If each yj is a convex n n combination of the xi ’s, i.e. yj = i=1 γji xi , i=1 γji = 1, γjk ≥ 0, ∀j, k ∈ {1, . . . , n} then |det[y1 , . . . , yn ]| ≤ |det[x1 , . . . , xn ]|. It follows from Lemma 4.1 that the constructive solutions for M given in (2) and the monotonic M are optimal in the sense of maximizing entropy: Corollary 4.1. Let Mreg be the set of left-stochastic matrices M that parameterize shadow Dirichlet ˜ ˘ ˜ ˘ distributions with support in {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for a specific choice of λ and θ. Then the M given in (2) results in the shadow Dirichlet with maximum entropy, that is, (2) solves arg maxM ∈Mreg h(pM ). Corollary 4.2. Let Mmono be the set of left-stochastic matrices M that parameterize shadow Dirichlet distributions that generate only monotonic pmfs. Then the monotonic M given in Section 3.3 results in the shadow Dirichlet with maximum entropy, that is, the monotonic M solves arg maxM ∈Mmono h(pM ). 5 Estimating the Distribution from Data In this section, we discuss the estimation of α for the shadow Dirichlet and compound shadow Dirichlet, and the estimation of M . 5.1 Estimating α for the Shadow Dirichlet Let matrix M be specified (for example, as described in the subsections of Section 3), and let q be a d × N matrix where the ith column qi is the ith sample pmf for i = 1 . . . N , and let (qi )j be the jth component of the ith sample pmf for j = 1, . . . , d. Then finding the maximum likelihood estimate 5 of α for the shadow Dirichlet is straightforward: N 1 arg max log + log p(qi |α) ≡ arg max log B(α) |det(M )| α∈Rk α∈Rk + + i=1 1 αj −1 (˜i )j q , ≡ arg max log B(α)N i j α∈Rk + N α −1 (M −1 qi )j j i j (6) where q = M −1 q. Note (6) is the maximum likelihood estimation problem for the Dirichlet dis˜ tribution given the matrix q , and can be solved using the standard methods for that problem (see ˜ e.g. [11, 12]). 5.2 Estimating α for the Compound Shadow Dirichlet For many machine learning applications the given data are modeled as samples from realizations of a random pmf, and given these samples one must estimate the random pmf model’s parameters. We refer to this case as the compound shadow Dirichlet, analogous to the compound Dirichlet (also called the multivariate P´ lya distribution). Assuming one has already specified M , we first discuss o method of moments estimation, and then describe an expectation-maximization (EM) method for computing the maximum likelihood estimate α. ˘ One can form an estimate of α by the method of moments. For the standard compound Dirichlet, one treats the samples of the realizations as normalized empirical histograms, sets the normalized α parameter equal to the empirical mean of the normalized histograms, and uses the empirical variances to determine the precision α0 . By definition, this estimate will be less likely than the maximum likelihood estimate, but may be a practical short-cut in some cases. For the compound shadow Dirichlet, we believe the method of moments estimator will be a poorer estimate in general. The problem is that if one draws samples from a pmf θ from a restricted subset SM of the simplex, ˘ then the normalized empirical histogram θ of those samples may not be in SM . For example given a monotonic pmf, the histogram of five samples drawn from it may not be monotonic. Then the empirical mean of such normalized empirical histograms may not be in SM , and so setting the shadow Dirichlet mean M α equal to the empirical mean may lead to an infeasible estimate (one that is outside SM ). A heuristic solution is to project the empirical mean into SM first, for example, by finding the nearest pmf in SM in squared error or relative entropy. As with the compound Dirichlet, this may still be a useful approach in practice for some problems. Next we state an EM method to find the maximum likelihood estimate α. Let s be a d × N matrix ˘ of sample histograms from different experiments, such that the ith column si is the ith histogram for i = 1, . . . , N , and (si )j is the number of times we have observed the jth event from the ith pmf vi . Then the maximum log-likelihood estimate of α solves arg max log p(s|α) for α ∈ Rk . + If the random pmfs are drawn from a Dirichlet distribution, then finding this maximum likelihood estimate requires an iterative procedure, and can be done in several ways including a gradient descent (ascent) approach. However, if the random pmfs are drawn from a shadow Dirichlet distribution, then a direct gradient descent approach is highly inconvenient as it requires taking derivatives of numerical integrals. However, it is practical to apply the expectation-maximization (EM) algorithm [13][14], as we describe in the rest of this section. Code to perform the EM estimation of α can be downloaded from idl.ee.washington.edu/publications.php. We assume that the experiments are independent and therefore p(s|α) = p({si }|α) = and hence arg maxα∈Rk log p(s|α) = arg maxα∈Rk i log p(si |α). + + i p(si |α) To apply the EM method, we consider the complete data to be the sample histograms s and the pmfs that generated them (s, v1 , v2 , . . . , vN ), whose expected log-likelihood will be maximized. Specifically, because of the assumed independence of the {vi }, the EM method requires one to repeatedly maximize the Q-function such that the estimate of α at the (m + 1)th iteration is: N α(m+1) = arg max α∈Rk + Evi |si ,α(m) [log p(vi |α)] . i=1 6 (7) Like the compound Dirichlet likelihood, the compound shadow Dirichlet likelihood is not necessarily concave. However, note that the Q-function given in (7) is concave, because log p(vi |α) = − log |det(M )| + log pD,α M −1 vi , where pD,α is the Dirichlet distribution with parameter α, and by a theorem of Ronning [11], log pD,α is a concave function, and adding a constant does not change the concavity. The Q-function is a finite integration of such concave functions and hence also concave [15]. We simplify (7) without destroying the concavity to yield the equivalent problem α(m+1) = d d arg max g(α) for α ∈ Rk , where g(α) = log Γ(α0 ) − j=1 log Γ(αj ) + j=1 βj αj , and + βj = N tij i=1 zi , 1 N where tij and zi are integrals we compute with Monte Carlo integration: d (s ) log(M −1 vi )j γi tij = SM (vi )k i k pM (vi |α(m) )dvi k=1 d zi = (vi )j k(si )k pM (vi |α(m) )dvi , γi SM k=1 where γi is the normalization constant for the multinomial with histogram si . We apply the Newton method [16] to maximize g(α), where the gradient g(α) has kth component ψ0 (α0 ) − ψ0 (α1 ) + β1 , where ψ0 denotes the digamma function. Let ψ1 denote the trigamma function, then the Hessian matrix of g(α) is: H = ψ1 (α0 )11T − diag (ψ1 (α1 ), . . . , ψ1 (αd )) . Note that because H has a very simple structure, the inversion of H required by the Newton step is greatly simplified by using the Woodbury identity [17]: H −1 = − diag(ξ1 , . . . , ξd ) − 1 1 1 [ξi ξj ]d×d , where ξ0 = ψ1 (α0 ) and ξj = ψ1 (αj ) , j = 1, . . . , d. ξ − d ξ 0 5.3 j=1 j Estimating M for the Shadow Dirichlet Thus far we have discussed how to construct M to achieve certain desired properties and how to interpret a given M ’s effect on the support. In some cases it may be useful to estimate M directly from data, for example, finding the maximum likelihood M . In general, this is a non-convex problem because the set of rank d − 1 matrices is not convex. However, we offer two approximations. First, note that as in estimating the support of a uniform distribution, the maximum likelihood M will correspond to a support that is no larger than needed to contain the convex hull of sample pmfs. Second, the mean of the empirical pmfs will be in the support, and thus a heuristic is to set the kth column of M (which corresponds to the kth vertex of the support) to be a convex combination of the kth vertex of the standard probability simplex and the empirical mean pmf. We provide code that finds the d optimal such convex combinations such that a specificed percentage of the sample pmfs are within the support, which reduces the non-convex problem of finding the maximum likelihood d × d matrix M to a d-dimensional convex relaxation. 6 Demonstrations It is reasonable to believe that if the shadow Dirichlet better matches the problem’s statistics, it will perform better in practice, but an open question is how much better? To motivate the reader to investigate this question further in applications, we provide two small demonstrations. 6.1 Verifying the EM Estimation We used a broad suite of simulations to test and verify the EM estimation. Here we include a simple visual confirmation that the EM estimation works: we drew 100 i.i.d. pmfs from a shadow Dirichlet with monotonic M for d = 3 and α = [3.94 2.25 2.81] (used in [18]). From each of the 100 pmfs, we drew 100 i.i.d. samples. Then we applied the EM algorithm to find the α for both the standard compound Dirichlet, and the compound shadow Dirichlet with the correct M . Fig. 2 shows the true distribution and the two estimated distributions. 7 True Distribution (Shadow Dirichlet) Estimated Shadow Dirichlet Estimated Dirichlet Figure 2: Samples were drawn from the true distribution and the given EM method was applied to form the estimated distributions. 6.2 Estimating Proportions from Sales Manufacturers often have constrained manufacturing resources, such as equipment, inventory of raw materials, and employee time, with which to produce multiple products. The manufacturer must decide how to proportionally allocate such constrained resources across their product line based on their estimate of proportional sales. Manufacturer Artifact Puzzles gave us their past retail sales data for the 20 puzzles they sold during July 2009 through Dec 2009, which we used to predict the proportion of sales expected for each puzzle. These estimates were then tested on the next five months of sales data, for January 2010 through April 2010. The company also provided a similarity between puzzles S, where S(A, B) is the proportion of times an order during the six training months included both puzzle A and B if it included puzzle A. We compared treating each of the six training months of sales data as a sample from a compound Dirichlet versus or a compound shadow Dirichlet. For the shadow Dirichlet, we normalized each column of the similarity matrix S to sum to one so that it was left-stochastic, and used that as the M matrix; this forces puzzles that are often bought together to have closer estimated proportions. We estimated each α parameter by EM to maximize the likelihood of the past sales data, and then estimated the future sales proportions to be the mean of the estimated Dirichlet or shadow Dirichlet distribution. We also compared with treating all six months of sales data as coming from one multinomial which we estimated as the maximum likelihood multinomial, and to taking the mean of the six empirical pmfs. Table 2: Squared errors between estimates and actual proportional sales. Jan. Feb. Mar. Apr. 7 Multinomial .0129 .0185 .0231 .0240 Mean Pmf .0106 .0206 .0222 .0260 Dirichlet .0109 .0172 .0227 .0235 Shadow Dirichlet .0093 .0164 .0197 .0222 Summary In this paper we have proposed a variant of the Dirichlet distribution that naturally captures some of the dependent structure that arises often in machine learning applications. We have discussed some of its theoretical properties, and shown how to specify the distribution for regularized pmfs, bounded variation pmfs, monotonic pmfs, and for any desired convex polytopal domain. We have derived the EM method and made available code to estimate both the shadow Dirichlet and compound shadow Dirichlet from data. Experimental results demonstrate that the EM method can estimate the shadow Dirichlet effectively, and that the shadow Dirichlet may provide worthwhile advantages in practice. 8 References [1] B. Frigyik, A. Kapila, and M. R. Gupta, “Introduction to the Dirichlet distribution and related processes,” Tech. Rep., University of Washington, 2010. [2] C. Zhai and J. Lafferty, “A study of smoothing methods for language models applied to information retrieval,” ACM Trans. on Information Systems, vol. 22, no. 2, pp. 179–214, 2004. [3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti, “Similarity-based classification: Concepts and algorithms,” Journal of Machine Learning Research, vol. 10, pp. 747–776, March 2009. [4] R. Nallapati, T. Minka, and S. Robertson, “The smoothed-Dirichlet distribution: a building block for generative topic models,” Tech. Rep., Microsoft Research, Cambridge, 2007. [5] Aitchison, Statistical Analysis of Compositional Data, Chapman Hall, New York, 1986. [6] R. J. Connor and J. E. Mosiman, “Concepts of independence for proportions with a generalization of the Dirichlet distibution,” Journal of the American Statistical Association, vol. 64, pp. 194–206, 1969. [7] K. Pearson, “Mathematical contributions to the theory of evolution–on a form of spurious correlation which may arise when indices are used in the measurement of organs,” Proc. Royal Society of London, vol. 60, pp. 489–498, 1897. [8] A. Ongaro, S. Migliorati, and G. S. Monti, “A new distribution on the simplex containing the Dirichlet family,” Proc. 3rd Compositional Data Analysis Workshop, 2008. [9] W. S. Rayens and C. Srinivasan, “Dependence properties of generalized Liouville distributions on the simplex,” Journal of the American Statistical Association, vol. 89, no. 428, pp. 1465– 1470, 1994. [10] Walter Rudin, Functional Analysis, McGraw-Hill, New York, 1991. [11] G. Ronning, “Maximum likelihood estimation of Dirichlet distributions,” Journal of Statistical Computation and Simulation, vol. 34, no. 4, pp. 215221, 1989. [12] T. Minka, “Estimating a Dirichlet distribution,” Tech. Rep., Microsoft Research, Cambridge, 2009. [13] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977. [14] M. R. Gupta and Y. Chen, Theory and Use of the EM Method, Foundations and Trends in Signal Processing, Hanover, MA, 2010. [15] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970. [16] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004. [17] K. B. Petersen and M. S. Pedersen, Matrix Cookbook, 2009, Available at matrixcookbook.com. [18] R. E. Madsen, D. Kauchak, and C. Elkan, “Modeling word burstiness using the Dirichlet distribution,” in Proc. Intl. Conf. Machine Learning, 2005. 9
6 0.49106595 155 nips-2010-Learning the context of a category
7 0.44884765 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
8 0.42854565 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
9 0.42264211 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
10 0.407354 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
11 0.40703768 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
12 0.40330589 190 nips-2010-On the Convexity of Latent Social Network Inference
13 0.39208969 67 nips-2010-Dynamic Infinite Relational Model for Time-varying Relational Data Analysis
14 0.38589838 139 nips-2010-Latent Variable Models for Predicting File Dependencies in Large-Scale Software Development
15 0.3807075 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
16 0.37552363 267 nips-2010-The Multidimensional Wisdom of Crowds
17 0.37478834 68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks
18 0.37252066 101 nips-2010-Gaussian sampling by local perturbations
19 0.35198906 63 nips-2010-Distributed Dual Averaging In Networks
20 0.34302774 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
topicId topicWeight
[(13, 0.029), (27, 0.038), (30, 0.038), (35, 0.01), (45, 0.128), (50, 0.578), (52, 0.019), (60, 0.016), (77, 0.021), (90, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.85857695 120 nips-2010-Improvements to the Sequence Memoizer
Author: Jan Gasthaus, Yee W. Teh
Abstract: The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the “mysterious” coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present some experimental results supporting our improvements. 1
2 0.85831809 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
Author: Danny Bickson, Carlos Guestrin
Abstract: Heavy-tailed distributions naturally occur in many real life problems. Unfortunately, it is typically not possible to compute inference in closed-form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which is a generalization of Cauchy, L´ vy and Gaussian distrie butions, we show for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions, in fact LCMs are always defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Other potential application is iterative decoding of linear channels with non-Gaussian noise. 1
3 0.83097875 101 nips-2010-Gaussian sampling by local perturbations
Author: George Papandreou, Alan L. Yuille
Abstract: We present a technique for exact simulation of Gaussian Markov random fields (GMRFs), which can be interpreted as locally injecting noise to each Gaussian factor independently, followed by computing the mean/mode of the perturbed GMRF. Coupled with standard iterative techniques for the solution of symmetric positive definite systems, this yields a very efficient sampling algorithm with essentially linear complexity in terms of speed and memory requirements, well suited to extremely large scale probabilistic models. Apart from synthesizing data under a Gaussian model, the proposed technique directly leads to an efficient unbiased estimator of marginal variances. Beyond Gaussian models, the proposed algorithm is also very useful for handling highly non-Gaussian continuously-valued MRFs such as those arising in statistical image modeling or in the first layer of deep belief networks describing real-valued data, where the non-quadratic potentials coupling different sites can be represented as finite or infinite mixtures of Gaussians with the help of local or distributed latent mixture assignment variables. The Bayesian treatment of such models most naturally involves a block Gibbs sampler which alternately draws samples of the conditionally independent latent mixture assignments and the conditionally multivariate Gaussian continuous vector and we show that it can directly benefit from the proposed methods. 1
4 0.8054902 42 nips-2010-Boosting Classifier Cascades
Author: Nuno Vasconcelos, Mohammad J. Saberian
Abstract: The problem of optimal and automatic design of a detector cascade is considered. A novel mathematical model is introduced for a cascaded detector. This model is analytically tractable, leads to recursive computation, and accounts for both classification and complexity. A boosting algorithm, FCBoost, is proposed for fully automated cascade design. It exploits the new cascade model, minimizes a Lagrangian cost that accounts for both classification risk and complexity. It searches the space of cascade configurations to automatically determine the optimal number of stages and their predictors, and is compatible with bootstrapping of negative examples and cost sensitive learning. Experiments show that the resulting cascades have state-of-the-art performance in various computer vision problems. 1
5 0.75933158 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty
Author: Yi Zhang, Jeff G. Schneider
Abstract: In this paper, we propose a matrix-variate normal penalty with sparse inverse covariances to couple multiple tasks. Learning multiple (parametric) models can be viewed as estimating a matrix of parameters, where rows and columns of the matrix correspond to tasks and features, respectively. Following the matrix-variate normal density, we design a penalty that decomposes the full covariance of matrix elements into the Kronecker product of row covariance and column covariance, which characterizes both task relatedness and feature representation. Several recently proposed methods are variants of the special cases of this formulation. To address the overfitting issue and select meaningful task and feature structures, we include sparse covariance selection into our matrix-normal regularization via ℓ1 penalties on task and feature inverse covariances. We empirically study the proposed method and compare with related models in two real-world problems: detecting landmines in multiple fields and recognizing faces between different subjects. Experimental results show that the proposed framework provides an effective and flexible way to model various different structures of multiple tasks.
6 0.69163626 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
7 0.52780867 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
8 0.52399248 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
9 0.52315962 54 nips-2010-Copula Processes
10 0.50998563 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage
11 0.4961082 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
12 0.48351145 96 nips-2010-Fractionally Predictive Spiking Neurons
13 0.47639152 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
14 0.47097149 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
15 0.460722 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
16 0.45604697 217 nips-2010-Probabilistic Multi-Task Feature Selection
17 0.45021737 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
18 0.44794494 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
19 0.44751909 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
20 0.44587994 158 nips-2010-Learning via Gaussian Herding