nips nips2010 nips2010-215 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicholas Bartlett, Frank Wood, David Tax
Abstract: We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). We define and develop a sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. We test PDIA inference both on PDFA structure learning and on both natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). [sent-3, score-0.191]
2 We define and develop a sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). [sent-4, score-0.341]
3 Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. [sent-5, score-0.17]
4 We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. [sent-6, score-0.121]
5 The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models. [sent-8, score-0.124]
6 1 Introduction The focus of this paper is a novel Bayesian framework for learning with probabilistic deterministic finite automata (PDFA) [9]. [sent-9, score-0.191]
7 Intuitively a PDFA is similar to a hidden Markov model (HMM) [10] in that it consists of a set of states, each of which when visited emits a symbol according to an emission probability distribution. [sent-11, score-0.235]
8 It differs from an HMM in how state-to-state transitions occur; transitions are deterministic in a PDFA and nondeterministic in an HMM. [sent-12, score-0.24]
9 The inductive bias introduced by the PDFA prior provides a soft constraint on the number of states used to generate the data. [sent-14, score-0.122]
10 We take the limit as the number of states becomes infinite, yielding a model we call the probabilistic deterministic infinite automata (PDIA). [sent-15, score-0.288]
11 Using such a mixture we can average over our uncertainty about the model parameters (including state cardinality) in a Bayesian way during prediction and other inference tasks. [sent-18, score-0.19]
12 We find that averaging over a finite number of PDFAs trained on naturalistic data leads to better predictive performance than using a single “best” PDFA. [sent-19, score-0.171]
13 We argue that a finite mixture of PDFAs has greater expressivity than that of a single PDFA but is not as expressive as a probabilistic nondeterministic finite automaton (PNFA). [sent-22, score-0.337]
14 We show that prediction in a trained mixture of PDFAs can have lower asymptotic cost than forward prediction in the PNFA/HMM class of models. [sent-26, score-0.13]
15 We also present evidence that averaging over PDFAs gives predictive performance superior to HMMs trained with standard methods on naturalistic data. [sent-27, score-0.171]
16 We find that PDIA predictive performance is competitive with that of fixed-order, smoothed Markov models with the same number of states. [sent-28, score-0.163]
17 While sequence learning approaches such as the HMM and smoothed Markov models are well known and now highly optimized, our PDIA approach to learning is novel and is amenable to future improvement. [sent-29, score-0.104]
18 Section 2 reviews PDFAs, Section 3 introduces Bayesian PDFA inference, Section 4 presents experimental results on DNA and natural language, and Section 5 discusses related work on PDFA induction and the theoretical expressive power of mixtures of PDFAs. [sent-30, score-0.117]
19 For example, δij is shorthand for δ(qi , σj ), where qi ∈ Q and σj ∈ Σ. [sent-34, score-0.234]
20 Given a state qi , the probability that the next symbol takes the value σj is given by π(qi , σj ). [sent-35, score-0.387]
21 We use the shorthand π qi for the state-specific discrete distribution over symbols for state qi . [sent-36, score-0.586]
22 We can also write σ|qi ∼ π qi where σ is a random variable that takes values in Σ. [sent-37, score-0.234]
23 Given a state qi and a symbol σj , however, the next state qi′ is deterministic: qi′ = δ(qi , σj ). [sent-38, score-0.928]
24 Generating from a PDFA involves first generating a symbol stochastically given the state the process is in: xt |ξt ∼ π ξt where ξt ∈ Q is the state at time t. [sent-39, score-0.268]
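To make the generative process above concrete, the following minimal Python sketch samples a sequence from a fixed PDFA; the names delta, pi and q0 are illustrative placeholders, not code from the paper.

import numpy as np

def generate_pdfa(delta, pi, q0, T, seed=None):
    # delta[q][s]: deterministic next state; pi[q]: emission distribution over symbols for state q.
    rng = np.random.default_rng(seed)
    q, xs = q0, []
    for _ in range(T):
        s = int(rng.choice(len(pi[q]), p=pi[q]))  # x_t | xi_t ~ pi_{xi_t}
        xs.append(s)
        q = delta[q][s]                           # deterministic transition xi_{t+1} = delta(xi_t, x_t)
    return xs

# Example: a two-state PDFA over two symbols.
# delta = [[1, 0], [0, 1]]; pi = [[0.9, 0.1], [0.2, 0.8]]; generate_pdfa(delta, pi, 0, 20)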
25 Given a state and a symbol xn+1 , the unique next state is the one corresponding to the suffix x2 . . . xn+1 . [sent-51, score-0.226]
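As a small illustration of this n-gram-as-PDFA view (our own sketch, not taken from the paper), the state can be identified with the tuple of the n most recent symbols, and the transition simply shifts the window:

def ngram_next_state(state, symbol):
    # state holds x1 . . . xn; appending xn+1 and dropping x1 gives the state for the suffix x2 . . . xn+1.
    return state[1:] + (symbol,)

# ngram_next_state(('a', 'b', 'c'), 'd') returns ('b', 'c', 'd')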
26 For an HMM, given data and an initial distribution over states, there is a posterior probability for every path through the state space. [sent-56, score-0.121]
27 As we explain in Section 5, mixtures of PDFAs are strictly more expressive than single PDFAs, but still less expressive than PNFAs. [sent-58, score-0.182]
28 We then show how to analytically marginalize nuisance parameters out of the model and derive a Metropolis-Hastings sampler for posterior inference using the resulting collapsed representation. [sent-60, score-0.172]
29 We call this limit the probabilistic deterministic infinite automaton (PDIA). [sent-62, score-0.142]
30 1 A PDFA Prior We assume that the set of states Q, set of symbols Σ, and initial state q0 of a PDFA are known but that the transition and emission functions are unknown. [sent-65, score-0.355]
31 The PDFA prior then consists of a prior over both the transition function δ and the emission probability function π. [sent-66, score-0.204]
32 For each column j (j co-indexes columns and set elements) of the transition matrix δ, our prior stipulates that the elements of that column are i.i.d. [sent-69, score-0.182]
33 The φj represent transition tendencies given a symbol: if the ith element of φj is large, then state qi is likely to be transitioned to any time the last symbol was σj . [sent-76, score-0.508]
34 If the ith element of µ is large then the ith state is likely to be transitioned to regardless of the emitted symbol. [sent-78, score-0.189]
35 We also place a uniform Dirichlet prior over the per-state emission probabilities π qi with β total mass which smooths emission distribution estimates. [sent-81, score-0.465]
36 µ|γ ∼ Dir(γ/|Q|, . . . , γ/|Q|), φj |α, µ ∼ Dir(αµ), δij |φj ∼ φj , π qi |β, |Σ| ∼ Dir(β/|Σ|, . . . , β/|Σ|) (1) (2) [sent-85, score-0.234]
37 If we instead had a single Dirichlet prior over all elements of δ, transitions to a few states would be highly likely no matter the context and those states would dominate the behavior of the automata. [sent-91, score-0.267]
38 If we tied together rows of δ instead of columns, being in a particular state would tell us more about the sequence of states we came from than the symbols that got us there. [sent-92, score-0.243]
39 Note that this prior stipulates a fully connected PDFA in which all states may transition to all others and all symbols may be emitted from each state. [sent-93, score-0.273]
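For concreteness, a draw of a finite PDFA from the prior in (1)-(2) could be sketched as below; it assumes the hierarchy µ ∼ Dir(γ/|Q|, . . .), φj ∼ Dir(αµ), δij ∼ Discrete(φj ), π qi ∼ Dir(β/|Σ|, . . .) described above, and all function and variable names are our own.

import numpy as np

def sample_finite_pdfa(num_states, num_symbols, alpha, beta, gamma, seed=None):
    rng = np.random.default_rng(seed)
    mu = rng.dirichlet(np.full(num_states, gamma / num_states))        # shared mean over destination states
    phi = [rng.dirichlet(alpha * mu) for _ in range(num_symbols)]      # per-symbol (column) transition tendencies
    delta = np.array([[rng.choice(num_states, p=phi[j])                # delta_ij | phi_j ~ Discrete(phi_j)
                       for j in range(num_symbols)]
                      for _ in range(num_states)])
    pi = np.array([rng.dirichlet(np.full(num_symbols, beta / num_symbols))
                   for _ in range(num_states)])                        # per-state emission distributions
    return delta, pi, mu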
40 Remember that ξt |ξt−1 , xt−1 is deterministic given the transition function δ. [sent-99, score-0.102]
41 We can marginalize π out of this expression and express the likelihood of the data in a form that depends only on the counts of symbols emitted from each state. [sent-100, score-0.119]
42 Define the count matrix c for the sequence x0:T and transition matrix δ as cij = ∑_{t=0}^{T} Iij (ξt , xt ), where Iij (ξt , xt ) is an indicator function for the automaton being in state qi when it generates xt , i.e. ξt = qi and xt = σj . [sent-101, score-0.635]
43 This matrix c = [cij ] gives the number of times each symbol is emitted from each state. [sent-104, score-0.132]
44 Let vij be the number of times state qi is transitioned to given that σj was the last symbol emitted, i.e. [sent-106, score-0.473]
45 vij is the number of times δi′j = qi over all states i′ in column j. [sent-108, score-0.168]
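A hypothetical helper for accumulating the count matrices c and v, written to match the definitions above rather than the authors' implementation:

import numpy as np

def count_matrices(delta, q0, xs, num_states, num_symbols):
    # c[i, j]: number of times symbol sigma_j is emitted from state q_i on the path for x_{0:T}.
    c = np.zeros((num_states, num_symbols), dtype=int)
    q = q0
    for x in xs:
        c[q, x] += 1
        q = delta[q][x]
    # v[i, j]: number of entries in column j of delta equal to q_i.
    v = np.zeros((num_states, num_symbols), dtype=int)
    for i_src in range(num_states):
        for j in range(num_symbols):
            v[delta[i_src][j], j] += 1
    return c, v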
46 The marginal likelihood of δ in terms of µ is then p(δ|µ, α) = ∫ p(δ|φ)p(φ|µ, α)dφ = ∏_{j=1}^{|Σ|} [Γ(α) / ∏_{i=0}^{|Q|−1} Γ(αµi )] [∏_{i=0}^{|Q|−1} Γ(αµi + vij ) / Γ(α + |Q|)] (4). We perform posterior inference in the finite model by sampling elements of δ and the vector µ. [sent-109, score-0.168]
47 The second can be found from (4) and is P(δij = qi′ |δ−ij , α, µ) = (αµi′ + vi′j ) / (α + |Q| − 1) (6), where vi′j is the number of elements in column j equal to qi′ , excluding δij . [sent-111, score-0.521]
48 The values in c depend on the specific sequence of states the automaton used to generate x. [sent-114, score-0.215]
49 We can reduce the computational cost of inference by deleting transitions δij for which the corresponding counts cij become 0. [sent-117, score-0.239]
50 In practical sampler implementations this means that one need not even represent transitions corresponding to zero counts. [sent-118, score-0.146]
51 The likelihood of the data (3) does not depend on the value of δij if symbol σj is never emitted while the machine is in state qi . [sent-119, score-0.439]
52 Thus, if while sampling we change some transition that renders cij = 0 for some i and j, we can delete δij until another transition is changed such that cij becomes nonzero again, at which point we sample δij anew. [sent-121, score-0.284]
53 When all δij for some state qi are marginalized out, we can say the state itself is marginalized out. [sent-123, score-0.522]
54 When we delete an element from a column of δ, we replace the |Q| − 1 in the denominator of (6) with D+j = ∑_{i=0}^{|Q|−1} I(vij ≠ 0), the number of entries in the jth column of δ that are not marginalized out, yielding P(δij = qi′ |δ+−ij , α, µ) = (αµi′ + vi′j ) / (α + D+j ) (7). [sent-124, score-0.385]
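A sketch of the collapsed conditional in (7), under the assumption that only the instantiated entries of column j, excluding the entry being resampled, are passed in; the names are illustrative.

def collapsed_transition_prob(dest, col_j_others, mu, alpha):
    # P(delta_ij = q_dest | remaining instantiated entries of column j, alpha, mu), as in (7).
    v_dest = sum(1 for d in col_j_others if d == dest)   # entries of column j already equal to q_dest
    d_plus = len(col_j_others)                           # D+_j: instantiated entries of column j
    return (alpha * mu[dest] + v_dest) / (alpha + d_plus)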
55 α + Dj If when sampling δij it is assigned it a state qi such that some ci j which was zero is now nonzero, + we simply reinstantiate δi j by drawing from (7) and update Dj . [sent-125, score-0.307]
56 When sampling a single δij there can be many such transitions, as the path through the machine dictated by x0:T may use many transitions in δ that were deleted. [sent-126, score-0.142]
57 While it is possible to construct a Gibbs sampler using (5) in this collapsed representation, such a sampler requires a Monte Carlo integration over a potentially large subset of the marginalized-out transitions in δ, which may be costly. [sent-128, score-0.221]
58 This gives rise to a Metropolis Hastings (MH) sampler for δ where the proposed value for δij is either one of the instantiated states or any one of the equivalent marginalized out states. [sent-130, score-0.221]
59 Any time any marginalized out element of δ is required we can pretend as if we had just sampled its value, and because its value had no effect on the likelihood of the data, we know that it would have been sampled directly from (7). [sent-131, score-0.118]
60 It is in this sense that all marginalized out states are equivalent – we know nothing more about their connectivity structure than that given by the prior in (7). [sent-132, score-0.167]
61 The Hastings correction exactly cancels out the proposal probability in the accept/reject ratio, leaving an MH accept probability for δij being set to qi∗ given that its previous value was qi of α(δij = qi∗ |δij = qi ) = min{1, p(x0:T |δij = qi∗ , δ+−ij ) / p(x0:T |δij = qi , δ+−ij )} (8). [sent-160, score-0.702]
62 Whether qi∗ is marginalized out or not, evaluating p(x0:T |δij = qi∗ , δ+−ij ) may require reinstantiating marginalized out elements of δ. [sent-161, score-0.118]
63 If the new value is accepted, all δij ∈ δ+ for which cij = 0 are removed, and we then move to the next transition in δ to sample. [sent-163, score-0.133]
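The acceptance step of this sampler can be sketched as follows, assuming the log likelihoods of x0:T under the current and proposed values of δij are computed elsewhere; this illustrates (8) and is not the authors' code.

import math
import random

def mh_accept(current_loglik, proposed_loglik):
    # The Hastings correction cancels the proposal probability, so the acceptance
    # probability in (8) reduces to a likelihood ratio; work in log space for stability.
    return random.random() < math.exp(min(0.0, proposed_loglik - current_loglik))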
64 4 Experiments and Results To test our PDIA inference approach we evaluated it on discrete natural sequence prediction and compared its performance to HMMs and smoothed n-gram models. [sent-180, score-0.174]
65 We used two datasets: a character sequence from Alice in Wonderland [2] and a short sequence of mouse DNA. [sent-193, score-0.106]
66 For DNA, the state of the PDIA model at the start of the test data was set to the last state of the model after accepting the training data. [sent-199, score-0.146]
67 We evaluated the performance of the learned models by calculating the average per character predictive perplexity of the test data. [sent-202, score-0.173]
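One common convention for average per-character predictive perplexity is the exponential of the negative mean log predictive probability; a minimal sketch, assuming a list of per-symbol predictive probabilities, is:

import math

def per_character_perplexity(pred_probs):
    # pred_probs: predictive probability the model assigned to each test symbol.
    avg_log = sum(math.log(p) for p in pred_probs) / len(pred_probs)
    return math.exp(-avg_log)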
68 For the smoothed n-gram models, we report thousand-sample average perplexity results for hierarchical Pitman-Yor process (HPYP) [14] models of varying Markov order (1 through 5 notated as bigram through 6-gram) after burning each model in for one hundred samples. [sent-212, score-0.175]
69 We determined the best number of hidden states by cross-validation on the test data (a procedure used here to produce optimistic HMM performance for comparison purposes only). [sent-215, score-0.105]
70 The performance of the PDIA exceeds that of the HMM and is approximately equal to that of a smoothed 4-gram model, though it does not outperform very deep, smoothed Markov models. [sent-216, score-0.15]
71 This is in contrast to [16], which found that PDFAs trained on natural language data were able to predict as well as unsmoothed trigrams, but were significantly worse than smoothed trigrams, even when averaging over multiple learned PDFAs. [sent-217, score-0.157]
72 As can be seen in the second line of each subtable in Table 1, the MAP number of states learned by the PDIA is significantly lower than that of the n-gram model with equal predictive performance. [sent-218, score-0.163]
73 Unlike the HMM, the computational complexity of PDFA prediction does not depend on the number of states in the model because only a single path through the states is followed. [sent-219, score-0.171]
74 This is because all possible paths must be followed to achieve the given HMM predictive performance (although a subset of possible paths could be followed if doing approximate inference). [sent-222, score-0.134]
75 The PNFA in (a) can be represented by a mixture of two PDFAs, one following the right branch from state 0, the other following the left branch. [sent-235, score-0.12]
76 In PDIA inference we too can choose the number of samples used for prediction, but here even a single sample has empirical prediction performance superior to averaging over all paths in an HMM. [sent-238, score-0.126]
77 The computational complexity of smoothing n-gram inference is equivalent to that of PDIA inference; however, the storage cost for the large n-gram models is significantly higher than that of the estimated PDIA for the same predictive performance. [sent-239, score-0.202]
78 In practice, we run a sampler for some number of iterations and approximate the posterior with a finite mixture of PDFAs. [sent-241, score-0.17]
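Prediction with such a finite mixture could be sketched as follows, where each posterior sample carries its own transition function, emission counts, and current state; the data layout and names are assumptions made for illustration.

def mixture_predictive_step(samples, symbol, num_symbols, beta):
    # samples: list of (delta, c, state) triples, one per sampled PDFA.
    # Returns the averaged smoothed predictive probability of `symbol` and then
    # advances each sample's state deterministically.
    total = 0.0
    for k, (delta, c, state) in enumerate(samples):
        row_total = sum(c[state])
        total += (c[state][symbol] + beta / num_symbols) / (row_total + beta)  # posterior-mean emission
        samples[k] = (delta, c, delta[state][symbol])                          # deterministic transition
    return total / len(samples)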
79 We show that they are strictly more expressive than PDFAs, but strictly less expressive than hidden Markov models. [sent-243, score-0.208]
80 Probabilistic non-deterministic finite automata (PNFA) are a strictly larger model class than PDFAs. [sent-244, score-0.132]
81 In general, any PNFA where the nondeterministic transitions can only be visited once can be expressed as a mixture of PDFAs. [sent-248, score-0.191]
82 However, if we replace transitions to q3 with transitions to q0 , as in 2(b), there is no longer any equivalent finite mixture of PDFAs, since the nondeterministic branch from q0 can be visited an arbitrary number of times. [sent-249, score-0.262]
83 State merging algorithms do this by starting with the trivial PDFA that only accepts the training data and merging states that pass a similarity test [1, 17], and have been proven to identify the correct model in the limit of infinite data. [sent-251, score-0.143]
84 To test if we can learn the generative mechanism given our inductive bias, we trained the PDIA on data from three synthetic grammars: the even process [13], the Reber grammar [11] and the Feldman grammar [4], which have up to 7 states and 7 symbols in the alphabet. [sent-254, score-0.283]
85 6 Discussion Our Bayesian approach to PDIA inference can be interpreted as a stochastic search procedure for PDFA structure learning where the number of states is unknown. [sent-258, score-0.124]
86 (d) posterior mean and standard deviation of number of states discovered during PDIA inference for varying amounts of data generated by each of the synthetic PDFAs. [sent-291, score-0.172]
87 PDIA inference discovers PDFAs with the correct number of states. We ourselves are more interested in establishing new ways to produce smoothed predictive conditional distributions. [sent-292, score-0.287]
88 Inference in the PDIA presents a completely new approach to smoothing, smoothing by averaging over PDFA model structure rather than hierarchically smoothing related emission distribution estimates. [sent-293, score-0.287]
89 Our PDIA approach gives us an attractive ability to trade off between model simplicity in terms of number of states, computational complexity in terms of asymptotic cost of prediction, and predictive perplexity. [sent-294, score-0.106]
90 This suggests that a future combination of smoothing over model structure and smoothing over emission distributions could produce excellent results. [sent-296, score-0.235]
91 We indeed believe the most promising approach to improving PDIA predictive performance is to construct a smoothing hierarchy over the state specific emission distributions, as is done in the smoothing n-gram models. [sent-299, score-0.396]
92 For an n-gram, where every state corresponds to a suffix of the sequence, the predictive distribution for a suffix is smoothed by the predictive distribution for a shorter suffix, for which there are more observations. [sent-300, score-0.324]
93 In the PDIA, by contrast, the predictive probabilities for states are not tied together. [sent-302, score-0.184]
94 Since states of the PDIA are not uniquely identified by suffixes, it is no longer clear what the natural smoothing hierarchy is. [sent-303, score-0.14]
95 It is somewhat surprising that PDIA learning works nearly as well as n-gram modeling even without a smoothing hierarchy for its emission distributions. [sent-304, score-0.17]
96 Imposing a hierarchical smoothing of the PDIA emission distributions remains an open problem. [sent-305, score-0.197]
97 Learning stochastic regular grammars by means of a state merging method. [sent-309, score-0.116]
98 Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. [sent-322, score-0.2]
99 The power of amnesia: Learning probabilistic automata with variable memory length. [sent-381, score-0.146]
100 Improving probabilistic grammatical inference core algorithms with post-processing techniques. [sent-409, score-0.104]
wordName wordTfidf (topN-words)
[('pdia', 0.586), ('pdfa', 0.506), ('pdfas', 0.306), ('qi', 0.234), ('ij', 0.178), ('automata', 0.111), ('emission', 0.105), ('hmm', 0.095), ('predictive', 0.088), ('symbol', 0.08), ('cij', 0.076), ('sampler', 0.075), ('states', 0.075), ('smoothed', 0.075), ('state', 0.073), ('transitions', 0.071), ('marginalized', 0.071), ('expressive', 0.068), ('aiw', 0.067), ('smoothing', 0.065), ('transition', 0.057), ('grammar', 0.057), ('dna', 0.055), ('perplexity', 0.055), ('nondeterministic', 0.053), ('pnfa', 0.053), ('reber', 0.053), ('emitted', 0.052), ('hmms', 0.052), ('nite', 0.052), ('feldman', 0.05), ('inference', 0.049), ('markov', 0.049), ('posterior', 0.048), ('mixture', 0.047), ('hpyp', 0.047), ('vij', 0.046), ('symbols', 0.045), ('deterministic', 0.045), ('dj', 0.044), ('xt', 0.042), ('dirichlet', 0.041), ('automaton', 0.04), ('transitioned', 0.04), ('alice', 0.038), ('probabilistic', 0.035), ('alphabet', 0.035), ('averaging', 0.033), ('wood', 0.032), ('character', 0.03), ('hidden', 0.03), ('sequence', 0.029), ('column', 0.028), ('hierarchical', 0.027), ('dupont', 0.027), ('iij', 0.027), ('naturalistic', 0.027), ('pfau', 0.027), ('pnfas', 0.027), ('trigrams', 0.027), ('wonderland', 0.027), ('language', 0.026), ('mh', 0.026), ('inductive', 0.026), ('dir', 0.025), ('characters', 0.025), ('mixtures', 0.025), ('elements', 0.025), ('py', 0.024), ('induction', 0.024), ('element', 0.024), ('paths', 0.023), ('expressivity', 0.023), ('franchise', 0.023), ('gasthaus', 0.023), ('memoizer', 0.023), ('pretend', 0.023), ('stipulates', 0.023), ('verbal', 0.023), ('merging', 0.023), ('bayesian', 0.023), ('trained', 0.023), ('counts', 0.022), ('limit', 0.022), ('deleting', 0.021), ('prior', 0.021), ('tied', 0.021), ('prediction', 0.021), ('strictly', 0.021), ('visited', 0.02), ('grammars', 0.02), ('grammatical', 0.02), ('hierarchically', 0.019), ('hastings', 0.019), ('incrementally', 0.018), ('mouse', 0.018), ('url', 0.018), ('bigram', 0.018), ('nonzero', 0.018), ('asymptotic', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 215 nips-2010-Probabilistic Deterministic Infinite Automata
Author: Nicholas Bartlett, Frank Wood, David Tax
Abstract: We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). We define and develop a sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. We test PDIA inference both on PDFA structure learning and on both natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models. 1
2 0.079158202 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. 1
3 0.074241288 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
4 0.054801766 36 nips-2010-Avoiding False Positive in Multi-Instance Learning
Author: Yanjun Han, Qing Tao, Jue Wang
Abstract: In multi-instance learning, there are two kinds of prediction failure, i.e., false negative and false positive. Current research mainly focus on avoiding the former. We attempt to utilize the geometric distribution of instances inside positive bags to avoid both the former and the latter. Based on kernel principal component analysis, we define a projection constraint for each positive bag to classify its constituent instances far away from the separating hyperplane while place positive instances and negative instances at opposite sides. We apply the Constrained Concave-Convex Procedure to solve the resulted problem. Empirical results demonstrate that our approach offers improved generalization performance.
5 0.051125634 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
6 0.050832726 161 nips-2010-Linear readout from a neural population with partial correlation data
7 0.050658152 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
8 0.050108172 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
9 0.048258118 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
10 0.048159543 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
11 0.047811028 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
12 0.046392761 27 nips-2010-Agnostic Active Learning Without Constraints
13 0.046074681 120 nips-2010-Improvements to the Sequence Memoizer
14 0.042177279 101 nips-2010-Gaussian sampling by local perturbations
15 0.041117787 129 nips-2010-Inter-time segment information sharing for non-homogeneous dynamic Bayesian networks
16 0.040774979 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
17 0.040500846 212 nips-2010-Predictive State Temporal Difference Learning
18 0.040105373 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
19 0.039668735 150 nips-2010-Learning concept graphs from text with stick-breaking priors
20 0.038621742 144 nips-2010-Learning Efficient Markov Networks
topicId topicWeight
[(0, 0.111), (1, 0.003), (2, 0.002), (3, 0.033), (4, -0.104), (5, 0.029), (6, 0.035), (7, -0.002), (8, -0.006), (9, 0.021), (10, -0.02), (11, -0.023), (12, -0.019), (13, -0.007), (14, -0.039), (15, -0.003), (16, -0.046), (17, 0.032), (18, 0.033), (19, 0.007), (20, -0.029), (21, 0.084), (22, 0.024), (23, 0.044), (24, 0.016), (25, 0.036), (26, -0.036), (27, -0.047), (28, -0.079), (29, -0.046), (30, -0.084), (31, 0.03), (32, -0.04), (33, -0.027), (34, 0.008), (35, 0.003), (36, 0.013), (37, 0.068), (38, -0.018), (39, -0.057), (40, -0.012), (41, -0.03), (42, -0.073), (43, -0.036), (44, 0.008), (45, 0.015), (46, 0.017), (47, 0.078), (48, 0.017), (49, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.91018236 215 nips-2010-Probabilistic Deterministic Infinite Automata
Author: Nicholas Bartlett, Frank Wood, David Tax
Abstract: We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). We define and develop a sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. We test PDIA inference both on PDFA structure learning and on both natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models. 1
2 0.6271897 120 nips-2010-Improvements to the Sequence Memoizer
Author: Jan Gasthaus, Yee W. Teh
Abstract: The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the “mysterious” coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present some experimental results supporting our improvements. 1
3 0.57176787 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
Author: Lauren Hannah, Warren Powell, David M. Blei
Abstract: In this paper we study convex stochastic optimization problems where a noisy objective function value is observed after a decision is made. There are many stochastic optimization problems whose behavior depends on an exogenous state variable which affects the shape of the objective function. Currently, there is no general purpose algorithm to solve this class of problems. We use nonparametric density estimation to take observations from the joint state-outcome distribution and use them to infer the optimal decision for a given query state s. We propose two solution methods that depend on the problem characteristics: function-based and gradient-based optimization. We examine two weighting schemes, kernel based weights and Dirichlet process based weights, for use with the solution methods. The weights and solution methods are tested on a synthetic multi-product newsvendor problem and the hour ahead wind commitment problem. Our results show that in some cases Dirichlet process weights offer substantial benefits over kernel based weights and more generally that nonparametric estimation methods provide good solutions to otherwise intractable problems. 1
4 0.57074875 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
5 0.50237399 237 nips-2010-Shadow Dirichlet for Restricted Probability Modeling
Author: Bela Frigyik, Maya Gupta, Yihua Chen
Abstract: Although the Dirichlet distribution is widely used, the independence structure of its components limits its accuracy as a model. The proposed shadow Dirichlet distribution manipulates the support in order to model probability mass functions (pmfs) with dependencies or constraints that often arise in real world problems, such as regularized pmfs, monotonic pmfs, and pmfs with bounded variation. We describe some properties of this new class of distributions, provide maximum entropy constructions, give an expectation-maximization method for estimating the mean parameter, and illustrate with real data. 1 Modeling Probabilities for Machine Learning Modeling probability mass functions (pmfs) as random is useful in solving many real-world problems. A common random model for pmfs is the Dirichlet distribution [1]. The Dirichlet is conjugate to the multinomial and hence mathematically convenient for Bayesian inference, and the number of parameters is conveniently linear in the size of the sample space. However, the Dirichlet is a distribution over the entire probability simplex, and for many problems this is simply the wrong domain if there is application-specific prior knowledge that the pmfs come from a restricted subset of the simplex. For example, in natural language modeling, it is common to regularize a pmf over n-grams by some generic language model distribution q0 , that is, the pmf to be modeled is assumed to have the form θ = λq + (1 − λ)q0 for some q in the simplex, λ ∈ (0, 1) and a fixed generic model q0 [2]. But once q0 and λ are fixed, the pmf θ can only come from a subset of the simplex. Another natural language processing example is modeling the probability of keywords in a dictionary where some words are related, such as espresso and latte, and evidence for the one is to some extent evidence for the other. This relationship can be captured with a bounded variation model that would constrain the modeled probability of espresso to be within some of the modeled probability of latte. We show that such bounds on the variation between pmf components also restrict the domain of the pmf to a subset of the simplex. As a third example of restricting the domain, the similarity discriminant analysis classifier estimates class-conditional pmfs that are constrained to be monotonically increasing over an ordered sample space of discrete similarity values [3]. In this paper we propose a simple variant of the Dirichlet whose support is a subset of the simplex, explore its properties, and show how to learn the model from data. We first discuss the alternative solution of renormalizing the Dirichlet over the desired subset of the simplex, and other related work. Then we propose the shadow Dirichlet distribution; explain how to construct a shadow Dirichlet for three types of restricted domains: the regularized pmf case, bounded variation between pmf components, and monotonic pmfs; and discuss the most general case. We show how to use the expectation-maximization (EM) algorithm to estimate the shadow Dirichlet parameter α, and present simulation results for the estimation. 1 Dirichlet Shadow Dirichlet Renormalized Dirichlet Figure 1: Dirichlet, shadow Dirichlet, and renormalized Dirichlet for α = [3.94 2.25 2.81]. 2 Related Work One solution to modeling pmfs on only a subset of the simplex is to simply restrict the support of ˜ ˜ the Dirichlet to the desired support S, and renormalize the Dirichlet over S (see Fig. 1 for an example). 
This renormalized Dirichlet has the advantage that it is still a conjugate distribution for the multinomial. Nallapati et al.considered the renormalized Dirichlet for language modeling, but found it difficult to use because the density requires numerical integration to compute the normalizer [4] . In addition, there is no closed form solution for the mean, covariance, or peak of the renormalized Dirichlet, making it difficult to work with. Table 1 summarizes these properties. Additionally, generating samples from the renormalized Dirichlet is inefficient: one draws samples from the stan˜ dard Dirichlet, then rejects realizations that are outside S. For high-dimensional sample spaces, this could greatly increase the time to generate samples. Although the Dirichlet is a classic and popular distribution on the simplex, Aitchison warns it “is totally inadequate for the description of the variability of compositional data,” because of its “implied independence structure and so the Dirichlet class is unlikely to be of any great use for describing compositions whose components have even weak forms of dependence” [5]. Aitchison instead championed a logistic normal distribution with more parameters to control covariance between components. A number of variants of the Dirichlet that can capture more dependence have been proposed and analyzed. For example, the scaled Dirichlet enables a more flexible shape for the distribution [5], but does not change the support. The original Dirichlet(α1 , α2 , . . . αd ) can be derived as Yj / j Yj where Yj ∼ Γ(αj , β), whereas the scaled Dirichlet is derived from Yj ∼ Γ(αj , βj ), resulting in α density p(θ) = γ j ( α −1 βj j θ j j βi θi )α1 +···+αd i , where β, α ∈ Rd are parameters, and γ is the normalizer. + Another variant is the generalized Dirichlet [6] which also has parameters β, α ∈ Rd , and allows + greater control of the covariance structure, again without changing the support. As perhaps first noted by Karl Pearson [7] and expounded upon by Aitchison [5], correlations of proportional data can be very misleading. Many Dirichlet variants have been generalizations of the Connor-Mossiman variant, Dirichlet process variants, other compound Dirichlet models, and hierarchical Dirichlet models. Ongaro et al. [8] propose the flexible Dirichlet distribution by forming a re-parameterized mixture of Dirichlet distributions. Rayens and Srinivasan [9] considered the dependence structure for the general Dirichlet family called the generalized Liouville distributions. In contrast to prior efforts, the shadow Dirichlet manipulates the support to achieve various kinds of dependence that arise frequently in machine learning problems. 3 Shadow Dirichlet Distribution We introduce a new distribution that we call the shadow Dirichlet distribution. Let S be the prob˜ ability (d − 1)-simplex, and let Θ ∈ S be a random pmf drawn from a Dirichlet distribution with density pD and unnormalized parameter α ∈ Rd . Then we say the random pmf Θ ∈ S is distributed + ˜ according to a shadow Dirichlet distribution if Θ = M Θ for some fixed d × d left-stochastic (that ˜ is, each column of M sums to 1) full-rank (and hence invertible) matrix M , and we call Θ the gen2 erating Dirichlet of Θ, or Θ’s Dirichlet shadow. Because M is a left-stochastic linear map between finite-dimensional spaces, it is a continuous map from the convex and compact S to a convex and compact subset of S that we denote SM . 
The shadow Dirichlet has two parameters: the generating Dirichlet’s parameter α ∈ Rd , and the + d × d matrix M . Both α and M can be estimated from data. However, as we show in the following subsections, the matrix M can be profitably used as a design parameter that is chosen based on application-specific knowledge or side-information to specify the restricted domain SM , and in that way impose dependency between the components of the random pmfs. The shadow Dirichlet density p(θ) is the normalized pushforward of the Dirichlet density, that is, it is the composition of the Dirichlet density and M −1 with the Jacobian: 1 α −1 p(θ) = (M −1 θ)j j , (1) B(α) |det(M )| j Γ(αj ) d j is the standard Dirichlet normalizer, and α0 = j=1 αj is the standard where B(α) Γ(α0 ) Dirichlet precision factor. Table 1 summarizes the basic properties of the shadow Dirichlet. Fig. 1 shows an example shadow Dirichlet distribution. Generating samples from the shadow Dirichlet is trivial: generate samples from its generating Dirichlet (for example, using stick-breaking or urn-drawing) and multiply each sample by M to create the corresponding shadow Dirichlet sample. Table 1: Table compares and summarizes the Dirichlet, renormalized Dirichlet, and shadow Dirichlet distributions. Dirichlet(α) Density p(θ) Mean 1 B(α) d j=1 α −1 θj j Shadow Dirichlet (α, M ) 1 B(α)|det(M )| α α0 Renormalized ˜ Dirichlet (α, S) d −1 αj −1 θ)j j=1 (M 1 ˜ S M d j=1 α α0 αj −1 qj dq d j=1 α −1 θj j ˜ θp(θ)dθ S ¯ ¯ − θ)(θ − θ)T p(θ)dθ Cov(Θ) M Cov(Θ)M T αj −1 α0 −d j M α0 −d max p(θ) stick-breaking, urn-drawing draw from Dirichlet(α), multiply by M draw from Dirichlet(α), ˜ reject if not in S ML Estimate iterative (simple functions) iterative (simple functions) unknown complexity ML Compound Estimate iterative (simple functions) iterative (numerical integration) unknown complexity Covariance Mode (if α > 1) How to Sample 3.1 α −1 ˜ (θ S ˜ θ∈S Example: Regularized Pmfs The shadow Dirichlet can be designed to specify a distribution over a set of regularized pmfs SM = ˜ ˘ ˜ ˘ ˘ {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for specific values of λ and θ. In general, for a given λ and θ ∈ S, the following d × d matrix M will change the support to the desired subset SM by mapping the extreme points of S to the extreme points of SM : ˘ M = (1 − λ)θ1T + λI, (2) where I is the d × d identity matrix. In Section 4 we show that the M given in (2) is optimal in a maximum entropy sense. 3 3.2 Example: Bounded Variation Pmfs We describe how to use the shadow Dirichlet to model a random pmf that has bounded variation such that |θk − θl | ≤ k,l for any k, ∈ {1, 2, . . . , d} and k,l > 0. To construct specified bounds on the variation, we first analyze the variation for a given M . For any d × d left stochastic matrix T d d ˜ ˜ ˜ M, θ = Mθ = M1j θj . . . Mdj θj , so the difference between any two entries is j=1 j=1 ˜ |Mkj − Mlj | θj . ˜ (Mkj − Mlj )θj ≤ |θk − θl | = (3) j j Thus, to obtain a distribution over pmfs with bounded |θk − θ | ≤ k,l for any k, components, it is sufficient to choose components of the matrix M such that |Mkj − Mlj | ≤ k,l for all j = 1, . . . , d ˜ because θ in (3) sums to 1. One way to create such an M is using the regularization strategy described in Section 3.1. For this ˜ ˜ ˘ case, the jth component of θ is θj = M θ = λθj + (1 − λ)θj , and thus the variation between the j ith and jth component of any pmf in SM is: ˜ ˘ ˜ ˘ ˜ ˜ ˘ ˘ |θi − θj | = λθi + (1 − λ)θi − λθj − (1 − λ)θj ≤ λ θi − θj + (1 − λ) θi − θj ˘ ˘ ≤ λ + (1 − λ) max θi − θj . 
(4) i,j ˘ Thus by choosing an appropriate λ and regularizing pmf θ, one can impose the bounded variation ˘ to be the uniform pmf, and choose any λ ∈ (0, 1), then the matrix given by (4). For example, set θ M given by (2) will guarantee that the difference between any two entries of any pmf drawn from the shadow Dirichlet (M, α) will be less than or equal to λ. 3.3 Example: Monotonic Pmfs For pmfs over ordered components, it may be desirable to restrict the support of the random pmf distribution to only monotonically increasing pmfs (or to only monotonically decreasing pmfs). A d × d left-stochastic matrix M that will result in a shadow Dirichlet that generates only monotonically increasing d × 1 pmfs has kth column [0 . . . 0 1/(d − k + 1) . . . 1/(d − k + 1)]T , we call this the monotonic M . It is easy to see that with this M only monotonic θ’s can be produced, 1˜ 1˜ 1 ˜ because θ1 = d θ1 which is less than or equal to θ2 = d θ1 + d−1 θ2 and so on. In Section 4 we show that the monotonic M is optimal in a maximum entropy sense. Note that to provide support over both monotonically increasing and decreasing pmfs with one distribution is not achievable with a shadow Dirichlet, but could be achieved by a mixture of two shadow Dirichlets. 3.4 What Restricted Subsets are Possible? Above we have described solutions to construct M for three kinds of dependence that arise in machine learning applications. Here we consider the more general question: What subsets of the simplex can be the support of the shadow Dirichlet, and how to design a shadow Dirichlet for a particular support? For any matrix M , by the Krein-Milman theorem [10], SM = M S is the convex hull of its extreme points. If M is injective, the extreme points of SM are easy to specify, as a d × d matrix M will have d extreme points that occur for the d choices of θ that have only one nonzero component, as the rest of the θ will create a non-trivial convex combination of the columns of M , and therefore cannot result in extreme points of SM by definition. That is, the extreme points of SM are the d columns of M , and one can design any SM with d extreme points by setting the columns of M to be those extreme pmfs. However, if one wants the new support to be a polytope in the probability (d − 1)-simplex with m > d extreme points, then one must use a fat M with d × m entries. Let S m denote the probability 4 (m − 1)-simplex, then the domain of the shadow Dirichlet will be M S m , which is the convex hull of the m columns of M and forms a convex polytope in S with at most m vertices. In this case M cannot be injective, and hence it is not bijective between S m and M S m . However, a density on M S m can be defined as: p(θ) = 1 B(α) ˜ {θ ˜α −1 ˜ θj j dθ. ˜ M θ=θ} j (5) On the other hand, if one wants the support to be a low-dimensional polytope subset of a higherdimensional probability simplex, then a thin d × m matrix M , where m < d, can be used to implement this. If M is injective, then it has a left inverse M ∗ that is a matrix of dimension m × d, and the normalized pushforward of the original density can be used as a density on the image M S m : p(θ) = 1 α −1 1/2 B(α) |det(M T M )| (M ∗ θ)j j , j If M is not injective then one way to determine a density is to use (5). 4 Information-theoretic Properties In this section we note two information-theoretic properties of the shadow Dirichlet. Let Θ be drawn ˜ from shadow Dirichlet density pM , and let its generating Dirichlet Θ be drawn from pD . 
Then the differential entropy of the shadow Dirichlet is h(pM ) = log |det(M )| + h(pD ), where h(pD ) is the differential entropy of its generating Dirichlet. In fact, the shadow Dirichlet always has less entropy than its Dirichlet shadow because log |det(M )| ≤ 0, which can be shown as a corollary to the following lemma (proof not included due to lack of space): Lemma 4.1. Let {x1 , . . . , xn } and {y1 , . . . , yn } be column vectors in Rn . If each yj is a convex n n combination of the xi ’s, i.e. yj = i=1 γji xi , i=1 γji = 1, γjk ≥ 0, ∀j, k ∈ {1, . . . , n} then |det[y1 , . . . , yn ]| ≤ |det[x1 , . . . , xn ]|. It follows from Lemma 4.1 that the constructive solutions for M given in (2) and the monotonic M are optimal in the sense of maximizing entropy: Corollary 4.1. Let Mreg be the set of left-stochastic matrices M that parameterize shadow Dirichlet ˜ ˘ ˜ ˘ distributions with support in {θ θ = λθ + (1 − λ)θ, θ ∈ S}, for a specific choice of λ and θ. Then the M given in (2) results in the shadow Dirichlet with maximum entropy, that is, (2) solves arg maxM ∈Mreg h(pM ). Corollary 4.2. Let Mmono be the set of left-stochastic matrices M that parameterize shadow Dirichlet distributions that generate only monotonic pmfs. Then the monotonic M given in Section 3.3 results in the shadow Dirichlet with maximum entropy, that is, the monotonic M solves arg maxM ∈Mmono h(pM ). 5 Estimating the Distribution from Data In this section, we discuss the estimation of α for the shadow Dirichlet and compound shadow Dirichlet, and the estimation of M . 5.1 Estimating α for the Shadow Dirichlet Let matrix M be specified (for example, as described in the subsections of Section 3), and let q be a d × N matrix where the ith column qi is the ith sample pmf for i = 1 . . . N , and let (qi )j be the jth component of the ith sample pmf for j = 1, . . . , d. Then finding the maximum likelihood estimate 5 of α for the shadow Dirichlet is straightforward: N 1 arg max log + log p(qi |α) ≡ arg max log B(α) |det(M )| α∈Rk α∈Rk + + i=1 1 αj −1 (˜i )j q , ≡ arg max log B(α)N i j α∈Rk + N α −1 (M −1 qi )j j i j (6) where q = M −1 q. Note (6) is the maximum likelihood estimation problem for the Dirichlet dis˜ tribution given the matrix q , and can be solved using the standard methods for that problem (see ˜ e.g. [11, 12]). 5.2 Estimating α for the Compound Shadow Dirichlet For many machine learning applications the given data are modeled as samples from realizations of a random pmf, and given these samples one must estimate the random pmf model’s parameters. We refer to this case as the compound shadow Dirichlet, analogous to the compound Dirichlet (also called the multivariate P´ lya distribution). Assuming one has already specified M , we first discuss o method of moments estimation, and then describe an expectation-maximization (EM) method for computing the maximum likelihood estimate α. ˘ One can form an estimate of α by the method of moments. For the standard compound Dirichlet, one treats the samples of the realizations as normalized empirical histograms, sets the normalized α parameter equal to the empirical mean of the normalized histograms, and uses the empirical variances to determine the precision α0 . By definition, this estimate will be less likely than the maximum likelihood estimate, but may be a practical short-cut in some cases. For the compound shadow Dirichlet, we believe the method of moments estimator will be a poorer estimate in general. 
The problem is that if one draws samples from a pmf θ from a restricted subset SM of the simplex, ˘ then the normalized empirical histogram θ of those samples may not be in SM . For example given a monotonic pmf, the histogram of five samples drawn from it may not be monotonic. Then the empirical mean of such normalized empirical histograms may not be in SM , and so setting the shadow Dirichlet mean M α equal to the empirical mean may lead to an infeasible estimate (one that is outside SM ). A heuristic solution is to project the empirical mean into SM first, for example, by finding the nearest pmf in SM in squared error or relative entropy. As with the compound Dirichlet, this may still be a useful approach in practice for some problems. Next we state an EM method to find the maximum likelihood estimate α. Let s be a d × N matrix ˘ of sample histograms from different experiments, such that the ith column si is the ith histogram for i = 1, . . . , N , and (si )j is the number of times we have observed the jth event from the ith pmf vi . Then the maximum log-likelihood estimate of α solves arg max log p(s|α) for α ∈ Rk . + If the random pmfs are drawn from a Dirichlet distribution, then finding this maximum likelihood estimate requires an iterative procedure, and can be done in several ways including a gradient descent (ascent) approach. However, if the random pmfs are drawn from a shadow Dirichlet distribution, then a direct gradient descent approach is highly inconvenient as it requires taking derivatives of numerical integrals. However, it is practical to apply the expectation-maximization (EM) algorithm [13][14], as we describe in the rest of this section. Code to perform the EM estimation of α can be downloaded from idl.ee.washington.edu/publications.php. We assume that the experiments are independent and therefore p(s|α) = p({si }|α) = and hence arg maxα∈Rk log p(s|α) = arg maxα∈Rk i log p(si |α). + + i p(si |α) To apply the EM method, we consider the complete data to be the sample histograms s and the pmfs that generated them (s, v1 , v2 , . . . , vN ), whose expected log-likelihood will be maximized. Specifically, because of the assumed independence of the {vi }, the EM method requires one to repeatedly maximize the Q-function such that the estimate of α at the (m + 1)th iteration is: N α(m+1) = arg max α∈Rk + Evi |si ,α(m) [log p(vi |α)] . i=1 6 (7) Like the compound Dirichlet likelihood, the compound shadow Dirichlet likelihood is not necessarily concave. However, note that the Q-function given in (7) is concave, because log p(vi |α) = − log |det(M )| + log pD,α M −1 vi , where pD,α is the Dirichlet distribution with parameter α, and by a theorem of Ronning [11], log pD,α is a concave function, and adding a constant does not change the concavity. The Q-function is a finite integration of such concave functions and hence also concave [15]. We simplify (7) without destroying the concavity to yield the equivalent problem α(m+1) = d d arg max g(α) for α ∈ Rk , where g(α) = log Γ(α0 ) − j=1 log Γ(αj ) + j=1 βj αj , and + βj = N tij i=1 zi , 1 N where tij and zi are integrals we compute with Monte Carlo integration: d (s ) log(M −1 vi )j γi tij = SM (vi )k i k pM (vi |α(m) )dvi k=1 d zi = (vi )j k(si )k pM (vi |α(m) )dvi , γi SM k=1 where γi is the normalization constant for the multinomial with histogram si . We apply the Newton method [16] to maximize g(α), where the gradient g(α) has kth component ψ0 (α0 ) − ψ0 (α1 ) + β1 , where ψ0 denotes the digamma function. 
Let ψ1 denote the trigamma function, then the Hessian matrix of g(α) is: H = ψ1 (α0 )11T − diag (ψ1 (α1 ), . . . , ψ1 (αd )) . Note that because H has a very simple structure, the inversion of H required by the Newton step is greatly simplified by using the Woodbury identity [17]: H −1 = − diag(ξ1 , . . . , ξd ) − 1 1 1 [ξi ξj ]d×d , where ξ0 = ψ1 (α0 ) and ξj = ψ1 (αj ) , j = 1, . . . , d. ξ − d ξ 0 5.3 j=1 j Estimating M for the Shadow Dirichlet Thus far we have discussed how to construct M to achieve certain desired properties and how to interpret a given M ’s effect on the support. In some cases it may be useful to estimate M directly from data, for example, finding the maximum likelihood M . In general, this is a non-convex problem because the set of rank d − 1 matrices is not convex. However, we offer two approximations. First, note that as in estimating the support of a uniform distribution, the maximum likelihood M will correspond to a support that is no larger than needed to contain the convex hull of sample pmfs. Second, the mean of the empirical pmfs will be in the support, and thus a heuristic is to set the kth column of M (which corresponds to the kth vertex of the support) to be a convex combination of the kth vertex of the standard probability simplex and the empirical mean pmf. We provide code that finds the d optimal such convex combinations such that a specificed percentage of the sample pmfs are within the support, which reduces the non-convex problem of finding the maximum likelihood d × d matrix M to a d-dimensional convex relaxation. 6 Demonstrations It is reasonable to believe that if the shadow Dirichlet better matches the problem’s statistics, it will perform better in practice, but an open question is how much better? To motivate the reader to investigate this question further in applications, we provide two small demonstrations. 6.1 Verifying the EM Estimation We used a broad suite of simulations to test and verify the EM estimation. Here we include a simple visual confirmation that the EM estimation works: we drew 100 i.i.d. pmfs from a shadow Dirichlet with monotonic M for d = 3 and α = [3.94 2.25 2.81] (used in [18]). From each of the 100 pmfs, we drew 100 i.i.d. samples. Then we applied the EM algorithm to find the α for both the standard compound Dirichlet, and the compound shadow Dirichlet with the correct M . Fig. 2 shows the true distribution and the two estimated distributions. 7 True Distribution (Shadow Dirichlet) Estimated Shadow Dirichlet Estimated Dirichlet Figure 2: Samples were drawn from the true distribution and the given EM method was applied to form the estimated distributions. 6.2 Estimating Proportions from Sales Manufacturers often have constrained manufacturing resources, such as equipment, inventory of raw materials, and employee time, with which to produce multiple products. The manufacturer must decide how to proportionally allocate such constrained resources across their product line based on their estimate of proportional sales. Manufacturer Artifact Puzzles gave us their past retail sales data for the 20 puzzles they sold during July 2009 through Dec 2009, which we used to predict the proportion of sales expected for each puzzle. These estimates were then tested on the next five months of sales data, for January 2010 through April 2010. The company also provided a similarity between puzzles S, where S(A, B) is the proportion of times an order during the six training months included both puzzle A and B if it included puzzle A. 
We compared treating each of the six training months of sales data as a sample from a compound Dirichlet versus a compound shadow Dirichlet. For the shadow Dirichlet, we normalized each column of the similarity matrix S to sum to one so that it was left-stochastic, and used that as the M matrix; this forces puzzles that are often bought together to have closer estimated proportions. We estimated each α parameter by EM to maximize the likelihood of the past sales data, and then estimated the future sales proportions to be the mean of the estimated Dirichlet or shadow Dirichlet distribution (a sketch of this pipeline is given after the references below). We also compared with treating all six months of sales data as coming from one multinomial, which we estimated as the maximum likelihood multinomial, and with taking the mean of the six empirical pmfs.

Table 2: Squared errors between estimates and actual proportional sales.

                      Jan.    Feb.    Mar.    Apr.
    Multinomial      .0129   .0185   .0231   .0240
    Mean Pmf         .0106   .0206   .0222   .0260
    Dirichlet        .0109   .0172   .0227   .0235
    Shadow Dirichlet .0093   .0164   .0197   .0222

7 Summary

In this paper we have proposed a variant of the Dirichlet distribution that naturally captures some of the dependent structure that often arises in machine learning applications. We have discussed some of its theoretical properties and shown how to specify the distribution for regularized pmfs, bounded-variation pmfs, monotonic pmfs, and any desired convex polytopal domain. We have derived the EM method and made available code to estimate both the shadow Dirichlet and the compound shadow Dirichlet from data. Experimental results demonstrate that the EM method can estimate the shadow Dirichlet effectively, and that the shadow Dirichlet may provide worthwhile advantages in practice.

8 References

[1] B. Frigyik, A. Kapila, and M. R. Gupta, "Introduction to the Dirichlet distribution and related processes," Tech. Rep., University of Washington, 2010.
[2] C. Zhai and J. Lafferty, "A study of smoothing methods for language models applied to information retrieval," ACM Trans. on Information Systems, vol. 22, no. 2, pp. 179–214, 2004.
[3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti, "Similarity-based classification: Concepts and algorithms," Journal of Machine Learning Research, vol. 10, pp. 747–776, March 2009.
[4] R. Nallapati, T. Minka, and S. Robertson, "The smoothed-Dirichlet distribution: a building block for generative topic models," Tech. Rep., Microsoft Research, Cambridge, 2007.
[5] J. Aitchison, Statistical Analysis of Compositional Data, Chapman & Hall, New York, 1986.
[6] R. J. Connor and J. E. Mosimann, "Concepts of independence for proportions with a generalization of the Dirichlet distribution," Journal of the American Statistical Association, vol. 64, pp. 194–206, 1969.
[7] K. Pearson, "Mathematical contributions to the theory of evolution–on a form of spurious correlation which may arise when indices are used in the measurement of organs," Proc. Royal Society of London, vol. 60, pp. 489–498, 1897.
[8] A. Ongaro, S. Migliorati, and G. S. Monti, "A new distribution on the simplex containing the Dirichlet family," Proc. 3rd Compositional Data Analysis Workshop, 2008.
[9] W. S. Rayens and C. Srinivasan, "Dependence properties of generalized Liouville distributions on the simplex," Journal of the American Statistical Association, vol. 89, no. 428, pp. 1465–1470, 1994.
[10] W. Rudin, Functional Analysis, McGraw-Hill, New York, 1991.
[11] G. Ronning, "Maximum likelihood estimation of Dirichlet distributions," Journal of Statistical Computation and Simulation, vol. 34, no. 4, pp. 215–221, 1989.
[12] T. Minka, "Estimating a Dirichlet distribution," Tech. Rep., Microsoft Research, Cambridge, 2009.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[14] M. R. Gupta and Y. Chen, Theory and Use of the EM Method, Foundations and Trends in Signal Processing, Hanover, MA, 2010.
[15] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[16] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[17] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, 2009. Available at matrixcookbook.com.
[18] R. E. Madsen, D. Kauchak, and C. Elkan, "Modeling word burstiness using the Dirichlet distribution," in Proc. Intl. Conf. Machine Learning, 2005.
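As noted in Section 6.2, the sales comparison reduces to a short pipeline: column-normalize the similarity matrix to obtain a left-stochastic M, fit α by EM (alternating the Monte Carlo E-step and the Newton M-step sketched earlier), and predict with the shadow Dirichlet mean $M\alpha/\alpha_0$. The sketch below assumes the helper functions `estimate_beta` and `newton_max_g` from the earlier examples, a (20 × 6) matrix of monthly sales histograms, and that the reported squared error is the sum of per-puzzle squared differences; all of these are our own assumptions for illustration.

```python
import numpy as np

# Assumes estimate_beta and newton_max_g from the earlier sketches are in scope.

def fit_and_predict(S, sales, n_em_iters=20):
    """Hedged sketch of the Section 6.2 pipeline: build M from the similarity
    matrix S, fit alpha by EM for the compound shadow Dirichlet, and predict
    future sales proportions with the shadow Dirichlet mean M alpha / alpha_0."""
    M = S / S.sum(axis=0, keepdims=True)        # column-normalize S so M is left-stochastic
    d = M.shape[0]
    alpha = np.ones(d)                          # initialization is our own choice
    for _ in range(n_em_iters):
        beta = estimate_beta(sales, M, alpha)   # E-step (Monte Carlo)
        alpha = newton_max_g(beta, alpha)       # M-step (Newton)
    return M @ (alpha / alpha.sum())            # mean of the fitted shadow Dirichlet

def squared_error(pred, actual_histogram):
    """Squared error between predicted proportions and a month's empirical pmf
    (assumed to be the quantity reported in Table 2)."""
    p = actual_histogram / actual_histogram.sum()
    return float(np.sum((pred - p) ** 2))
```

Using $M\alpha/\alpha_0$ as the prediction follows from $E[Mw] = M\,E[w]$ for $w \sim \mathrm{Dirichlet}(\alpha)$; the compound-Dirichlet baseline is the same pipeline with M set to the identity.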
6 0.50151473 67 nips-2010-Dynamic Infinite Relational Model for Time-varying Relational Data Analysis
7 0.49520046 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
8 0.48461819 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
9 0.480185 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
10 0.47647178 101 nips-2010-Gaussian sampling by local perturbations
11 0.47594529 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
12 0.47197539 71 nips-2010-Efficient Relational Learning with Hidden Variable Detection
13 0.44044396 148 nips-2010-Learning Networks of Stochastic Differential Equations
14 0.43067515 2 nips-2010-A Bayesian Approach to Concept Drift
15 0.42831156 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
16 0.42813268 214 nips-2010-Probabilistic Belief Revision with Structural Constraints
17 0.42492759 265 nips-2010-The LASSO risk: asymptotic results and real world examples
18 0.42461348 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
19 0.4154537 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
20 0.41475883 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
topicId topicWeight
[(0, 0.015), (13, 0.032), (27, 0.059), (30, 0.059), (35, 0.013), (45, 0.204), (50, 0.092), (52, 0.023), (56, 0.297), (60, 0.049), (77, 0.032), (90, 0.015)]
simIndex simValue paperId paperTitle
1 0.76902431 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis
Author: Remi Cuingnet, Marie Chupin, Habib Benali, Olivier Colliot
Abstract: Support vector machines (SVM) are increasingly used in brain image analyses since they allow capturing complex multivariate relationships in the data. Moreover, when the kernel is linear, SVMs can be used to localize spatial patterns of discrimination between two groups of subjects. However, the features’ spatial distribution is not taken into account. As a consequence, the optimal margin hyperplane is often scattered and lacks spatial coherence, making its anatomical interpretation difficult. This paper introduces a framework to spatially regularize SVM for brain image analysis. We show that Laplacian regularization provides a flexible framework to integrate various types of constraints and can be applied to both cortical surfaces and 3D brain images. The proposed framework is applied to the classification of MR images based on gray matter concentration maps and cortical thickness measures from 30 patients with Alzheimer’s disease and 30 elderly controls. The results demonstrate that the proposed method enables natural spatial and anatomical regularization of the classifier. 1
same-paper 2 0.74621868 215 nips-2010-Probabilistic Deterministic Infinite Automata
Author: Nicholas Bartlett, Frank Wood, David Tax
Abstract: We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). We define and develop a sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. We test PDIA inference both on PDFA structure learning and on both natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models. 1
3 0.62980497 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
4 0.62841052 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
Author: Matthias Broecheler, Lise Getoor
Abstract: Continuous Markov random fields are a general formalism to model joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficiency. Continuous MRFs are a general purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence. 1
5 0.6254524 287 nips-2010-Worst-Case Linear Discriminant Analysis
Author: Yu Zhang, Dit-Yan Yeung
Abstract: Dimensionality reduction is often needed in many applications due to the high dimensionality of the data involved. In this paper, we first analyze the scatter measures used in the conventional linear discriminant analysis (LDA) model and note that the formulation is based on the average-case view. Based on this analysis, we then propose a new dimensionality reduction method called worst-case linear discriminant analysis (WLDA) by defining new between-class and within-class scatter measures. This new model adopts the worst-case view which arguably is more suitable for applications such as classification. When the number of training data points or the number of features is not very large, we relax the optimization problem involved and formulate it as a metric learning problem. Otherwise, we take a greedy approach by finding one direction of the transformation at a time. Moreover, we also analyze a special case of WLDA to show its relationship with conventional LDA. Experiments conducted on several benchmark datasets demonstrate the effectiveness of WLDA when compared with some related dimensionality reduction methods. 1
6 0.62505591 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
7 0.62471288 63 nips-2010-Distributed Dual Averaging In Networks
8 0.6245898 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
9 0.62440789 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
10 0.62190247 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
11 0.62171996 155 nips-2010-Learning the context of a category
12 0.62155718 158 nips-2010-Learning via Gaussian Herding
13 0.62118274 148 nips-2010-Learning Networks of Stochastic Differential Equations
14 0.62100911 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
15 0.61998385 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
16 0.61960721 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
17 0.61909252 217 nips-2010-Probabilistic Multi-Task Feature Selection
18 0.61865234 257 nips-2010-Structured Determinantal Point Processes
19 0.61862099 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
20 0.61770731 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery