nips nips2007 nips2007-197 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daichi Mochihashi, Eiichiro Sumita
Abstract: We present a nonparametric Bayesian method of estimating variable order Markov processes up to a theoretically infinite order. By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. Experiments on character and word sequences in natural language showed that the model has performance comparable to an exponentially large full-order model, while being much more efficient computationally in both time and space. We expect that this basic model will also extend to the variable order hierarchical clustering of general data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. [sent-8, score-0.85]
2 Experiments on character and word sequences in natural language showed that the model has performance comparable to an exponentially large full-order model, while being much more efficient computationally in both time and space. [sent-9, score-0.414]
3 We expect that this basic model will also extend to the variable order hierarchical clustering of general data. [sent-10, score-0.147]
4 In particular, (n−1)th order Markov models over words are called “n-gram” language models and play a key role in speech recognition and machine translation, where they are used to choose the most natural sentence among candidate transcriptions [4]. [sent-12, score-0.237]
5 Because higher-order Markov models have an exponentially large number of parameters, their orders have been restricted to a small, often fixed number. [sent-14, score-0.152]
6 However, word dependencies will often have a span greater than n for phrasal expressions or compound proper nouns, while a much shorter n will suffice for some grammatical relationships. [sent-16, score-0.16]
7 However, these methods, all stemming from [5] and [7], are based on pruning a huge candidate suffix tree using criteria such as KL divergences. [sent-19, score-0.208]
8 This kind of “post-hoc” approach suffers from several important limitations: First, when we want to consider deeper dependencies, the candidate tree to be pruned will be extremely large. [sent-20, score-0.199]
9 Second, the criteria and threshold for pruning the tree are inherently exogenous and must be set carefully so that they match the desired model and current data. [sent-22, score-0.208]
10 Third, pruning by empirical counts in advance, which is often used to build “arbitrary order” candidate trees in these approaches, has been shown to behave very badly [8] and has no theoretical justification. [sent-23, score-0.177]
11 Figure 1: Hierarchical Chinese restaurant processes over the finite and infinite suffix trees. [sent-32, score-0.202]
12 defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. [sent-33, score-0.85]
13 Furthermore, we extend the variable order model with latent topics to show that we can induce variable length “stochastic phrases” on a topic-by-topic basis. [sent-35, score-0.157]
14 However, now that we have the hierarchical (Poisson-)Dirichlet process, which can be used as a fixed order language model [9][10], it is natural to extend these models to variable orders as well by using a nonparametric approach. [sent-37, score-0.383]
15 For concreteness, below we use a language model example, but the same model can be applied to any discrete sequences, such as characters, DNA, or even binary streams for compression. [sent-39, score-0.118]
16 Consider a trigram language model, which is a second-order Markov model over words often employed in speech recognition. [sent-40, score-0.197]
17 Following [9], this Markov model can be represented by a suffix tree of depth two, as shown in Figure 1(a). [sent-41, score-0.29]
18 When we predict a word “sing” after a context “she will”, we descend this suffix tree from the root (which corresponds to null string context), using the context backwards to follow a branch “will” and then “she will”. [sent-42, score-0.599]
19 1 Now we arrive at the leaf node that represents the context, and we can predict “sing” by using the count distribution at this node. [sent-43, score-0.221]
20 During the learning phase, we begin with a suffix tree that has no counts. [sent-44, score-0.164]
21 Every time a three-word sequence appears in the training data, such as “she will sing” mentioned above, we add a count of the final word (“sing”) given the context (“she will”) to the context node in the suffix tree. [sent-45, score-0.611]
22 In fact this corresponds to a hierarchical Chinese restaurant process, where each context node is a restaurant and each count is a customer associated with a word. [sent-46, score-1.05]
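As a concrete illustration of this training step (not the authors' implementation), the following minimal Python sketch stores the suffix tree as nested nodes with word counts; the Node class and add_count helper are hypothetical names introduced here:

```python
# Minimal sketch of adding a trigram count to a suffix tree (illustration only).
from collections import defaultdict

class Node:
    """A context node, i.e., a "restaurant" in the hierarchical CRP view."""
    def __init__(self):
        self.children = {}              # deeper contexts, keyed by the next older word
        self.counts = defaultdict(int)  # "customers": counts of the predicted word

def add_count(root, context, word):
    """Descend from the root following the context backwards, then add one count."""
    node = root
    for w in reversed(context):         # "she will" -> follow "will", then "she will"
        node = node.children.setdefault(w, Node())
    node.counts[word] += 1              # seat one customer for `word` at this context node

root = Node()
add_count(root, ["she", "will"], "sing")   # the running example from the text
```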
23 A context node, i.e., a restaurant, might not have customers for all the words in the lexicon. [sent-49, score-0.203]
24 Therefore, when a customer arrives at a node and stochastically needs a new table to sit down, a copy of him, namely a proxy customer, is sent to its parent node. [sent-50, score-0.837]
25 When a node has no customer to compute the probability of some word, it uses the distribution of customers at the parent node and appropriately interpolates it to sum to 1. [sent-51, score-0.902]
26 Assume that the node “she will” does not have a customer of “like”. [sent-52, score-0.475]
27 We can nevertheless compute the probability of “like” given “she will” if its sibling “he will” has a customer “like”. [sent-53, score-0.347]
28 Because that sibling has sent a copy of the customer to the common parent “will”, the probability is computed by appropriately interpolating the trigram probability given “she will”, which is zero, with the bigram probability given “will”, which is not zero at the parent node. [sent-54, score-0.637]
29 Figure 2: Probabilistic suffix tree of infinite depth, with penetration probabilities (1 − qi), (1 − qj), (1 − qk) attached to nodes i, j, k. [sent-57, score-0.341]
30 (1 − qi ) is a “penetration probability” of a descending customer at each node i, defining a stick-breaking process over the infinite tree. [sent-58, score-0.716]
31 c(s|h) is the count of s at node h, and c(h) = Σs c(s|h) is the total count at node h. [sent-60, score-0.442]
32 ths is the number of times symbol s is estimated to have been generated from its parent distribution p(s|h′) rather than from p(s|h) in the training data; th· = Σs ths is its total. [sent-61, score-0.321]
33 θ and d are the parameters of the Pitman-Yor process, and can be estimated through the distribution of customers on a suffix tree by Gamma and Beta posterior distributions, respectively. [sent-62, score-0.323]
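For reference, a hedged Python sketch of the Pitman-Yor predictive probability of equation (1) in terms of these quantities; it assumes each node carries dictionaries c (customer counts) and t (table counts) plus a parent pointer, and uses a single θ and d for all depths, which is a simplification rather than the authors' exact setup:

```python
# Sketch of the hierarchical Pitman-Yor predictive probability (equation (1)).
# node.c[s]: customer counts c(s|h); node.t[s]: table counts t_hs; node.parent: h'.
def p_hpy(node, s, theta, d, p_base):
    if node is None:
        return p_base                   # base measure above the root (e.g., 1 / vocabulary size)
    c_s, t_s = node.c.get(s, 0), node.t.get(s, 0)
    c_h, t_h = sum(node.c.values()), sum(node.t.values())
    parent_p = p_hpy(node.parent, s, theta, d, p_base)
    if c_h == 0:
        return parent_p                 # no customers here: fall back entirely on the parent
    return (max(c_s - d * t_s, 0.0) / (theta + c_h)
            + (theta + d * t_h) / (theta + c_h) * parent_p)
```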
34 Although this Bayesian Markov model is very principled and attractive, we can see from Figure 1(a) that all the real customers (i.e., [sent-64, score-0.159]
35 counts) are fixed at depth (n−1) in the suffix tree. [sent-66, score-0.126]
36 Because actual sequences will have heterogeneous Markov dependencies, we want a Markov model that deploys customers at different levels in the suffix tree according to the true Markov order from which each customer originated. [sent-67, score-0.723]
37 3 Infinite-order Hierarchical Chinese Restaurant Processes. Intuitively, we know that suffix trees that are too deep are improbable, and that symbol dependencies decay largely exponentially with context length. [sent-69, score-0.455]
38 However, some customers may reside in a very deep node (for example, “the united states of america”) and some in a shallow node (“shorter than”). [sent-70, score-0.581]
39 Our model for deploying customers must be flexible enough to accommodate all these possibilities. [sent-71, score-0.202]
40 3.1 Introducing Suffix Tree Prior. For this purpose, we assume that each node i in the suffix tree has a hidden probability qi of stopping at node i when following a path from the root of the tree to add a customer. [sent-73, score-0.927]
41 In other words, (1 − qi ) is the “penetration probability” when descending an infinite depth suffix tree from its root (Figure 2). [sent-74, score-0.575]
42 We assume that each qi is generated independently from a prior Beta distribution as: qi ∼ Be(α, β) i.i.d. (2) [sent-75, score-0.354]
43 When we want to generate a symbol st given a context h = s−∞ · · · st−2 st−1, we descend the suffix tree from the root following a path st−1 → st−2 → · · · , according to the probability of stopping at level l given by p(n = l|h) = ql ∏_{i=0}^{l−1} (1 − qi). (l = 0, 1, · · · , ∞) (3) [sent-79, score-0.865]
44 When we stop at level l, we generate a symbol st using the context st−l · · · st−2 st−1. [sent-80, score-0.475]
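A small sketch of this stopping distribution, assuming the qi along the descent path are given and the infinite tree is truncated at a finite depth for illustration:

```python
# Sketch of equation (3): p(n = l | h) = q_l * prod_{i<l} (1 - q_i).
def stop_probabilities(q_path):
    """q_path[l] is the stopping probability q_i of the node reached at level l."""
    probs, survive = [], 1.0
    for q in q_path:
        probs.append(q * survive)       # stop at this level
        survive *= (1.0 - q)            # penetrate one level deeper
    probs.append(survive)               # remaining mass beyond the truncation
    return probs

print(stop_probabilities([0.4, 0.3, 0.8]))   # -> [0.4, 0.18, 0.336, 0.084], summing to 1
```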
45 Since qi differs from node to node, we may reach very deep nodes with high probability if the qi ’s along the path are equally small (the “penetration” of this branch is high); or, we may stop at a very shallow node if the qi ’s are very high (the “penetration” is low). [sent-81, score-1.101]
46 In general, the probability of reaching a node decays exponentially with depth according to (3), but the decay rate differs from branch to branch, which allows for long sequences such as typical phrases. [sent-82, score-0.263]
47 Note that even for the same context h, the context length that was used to generate the next symbol may differ stochastically for each appearance according to (3). [sent-83, score-0.395]
48 3.2 Inference. Of course, we do not know the hidden probability qi possessed by each node. [sent-85, score-0.256]
49 Note that the generative model above amounts to introducing a vector of hidden variables, n = n1 n2 · · · nT , that corresponds to each Markov order (n = 0 · · · ∞) from which each symbol st in s = s1 s2 · · · sT originated. [sent-87, score-0.42]
50 Here, z = z1 z2 · · · zT is a vector that represents the hidden seatings of the proxy customers described in Section 2, where 0 ≤ zt ≤ nt indicates how far up the tree st’s proxy customers are recursively and stochastically sent to parent nodes. [sent-89, score-1.198]
51 Since in the hierarchical (Poisson-)Dirichlet process the customers are exchangeable [9] and each qi is i.i.d. [sent-91, score-0.445]
52 as shown in (2), this process is also exchangeable, and therefore we can always assume, by a suitable permutation, that the customer to resample is the final customer. [sent-94, score-0.35]
53 In our case, we only explicitly resample nt given n−t (n excluding nt), as follows: nt ∼ p(nt | s, z−t, n−t). (5) [sent-95, score-0.547]
54 Notice here that when we sample nt, we already know the depths n−t that the other words have reached in the suffix tree. [sent-96, score-0.254]
55 Therefore, when computing (5) using (3), the expectation of each qi is E[qi] = (ai + α) / (ai + bi + α + β), (6) where ai is the number of times node i was stopped at when generating other words, and bi is the number of times node i was passed by. [sent-97, score-0.694]
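In code, the posterior-mean penetration of equation (6) and the prior over stopping depths it induces might look like the sketch below; storing (ai, bi) per node on the descent path is an assumption made for illustration:

```python
# Sketch of equation (6) and of the depth prior it induces along one descent path.
def expected_q(a_i, b_i, alpha, beta):
    return (a_i + alpha) / (a_i + b_i + alpha + beta)   # E[q_i], equation (6)

def prior_over_depths(path_counts, alpha, beta):
    """path_counts[l] = (a_i, b_i) of the node reached at level l."""
    probs, survive = [], 1.0
    for a_i, b_i in path_counts:
        q = expected_q(a_i, b_i, alpha, beta)
        probs.append(q * survive)                       # stop at level l
        survive *= (1.0 - q)                            # continue deeper
    return probs, survive                               # per-level mass and leftover tail
```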
56 p(nt | s, z−t, n−t) ∝ p(st | s−t, z−t, n−t, nt) · p(nt | s−t, z−t, n−t). (7) The first term is the probability of st under HPYLM when the Markov order is known to be nt, given by (1). [sent-99, score-0.403]
57 The second term is the prior probability of reaching that node at depth nt . [sent-100, score-0.508]
58 Using these probabilities, we can construct a Gibbs sampler, as shown in Figure 3, to iteratively resample n and z in order to estimate the parameter of the variable order hierarchical Pitman-Yor language model (VPYLM)2 . [sent-103, score-0.349]
59 In this sampler, we first remove the t’th customer who resides at a depth of order[t] in the suffix tree, and decrement ai or bi accordingly along the path. [sent-104, score-0.523]
60 After sampling a new depth (i.e., Markov order) according to (7), we put the t’th customer back at the new depth recorded as order[t], and increment ai or bi accordingly along the new path. [sent-107, score-0.523]
61 When we add a customer st , zt is implicitly sampled because st ’s proxy customer is recursively sent to parent nodes in case a new table is needed to sit him down. [sent-108, score-1.429]
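Putting these pieces together, one Gibbs sweep could be sketched as follows; remove_customer, add_customer, hpylm_prob and reach_prior are hypothetical helpers standing in for the bookkeeping described above (they are not names from the paper), and the truncation at nmax follows the practical note given later in this section:

```python
import random

# Sketch of one Gibbs sweep over Markov orders (illustration only).
# remove_customer / add_customer update counts and the a_i / b_i statistics along
# the path; hpylm_prob is the first term of (7) and reach_prior the second term.
def gibbs_sweep(seq, order, model, n_max):
    for t in range(len(seq)):
        context = seq[max(0, t - n_max):t][::-1]        # context, most recent word first
        model.remove_customer(seq[t], context, depth=order[t])
        weights = [model.hpylm_prob(seq[t], context, n) * model.reach_prior(context, n)
                   for n in range(n_max + 1)]
        order[t] = random.choices(range(n_max + 1), weights=weights)[0]
        model.add_customer(seq[t], context, depth=order[t])
    return order
```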
62 Figure 4: Data structure of a suffix tree node (a C struct whose fields include int id; /* symbol id */). [sent-110, score-0.388]
63 This is a specific application of our model to the hierarchical Pitman-Yor processes for discrete data. [sent-114, score-0.144]
64 Figure 5: a character sequence (“. . . said alice ; ‘let us all . . .”) with (b) the Markov orders used to generate each character above. [sent-118, score-0.247]
65 This sampler is an extension of that reported in [9], using stochastically different orders n (n = 0 · · · ∞) for each customer. [sent-120, score-0.273]
66 In practice, we can place some maximum order nmax on n and sample within it 3 , or use a small threshold ǫ to stop the descent when the prior probability (8) of reaching that level is smaller than ǫ. [sent-121, score-0.205]
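The ǫ-threshold variant of the descent can be sketched as below, with reach probabilities computed incrementally from per-node E[qi] values as in (6); the helper name and its inputs are illustrative only:

```python
# Sketch: truncate the descent once the prior probability of reaching the next
# level (equation (8)) drops below a small threshold epsilon.
def truncated_depths(q_path, epsilon):
    """q_path[l] = E[q_i] of the node at level l; returns the candidate depths kept."""
    depths, reach = [], 1.0
    for l, q in enumerate(q_path):
        if reach < epsilon:            # too unlikely to reach this level: stop descending
            break
        depths.append(l)
        reach *= (1.0 - q)             # probability of penetrating past level l
    return depths

print(truncated_depths([0.5, 0.5, 0.5, 0.5, 0.5], epsilon=0.2))   # -> [0, 1, 2]
```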
67 Because each node in the suffix tree may have a huge number of children, we used Splay Trees [11] for the efficient search as in [6]. [sent-123, score-0.335]
68 Figure 4 shows our data structure of a node in a suffix tree. [sent-126, score-0.171]
69 3.3 Prediction. Since we do not usually know the Markov order of a context h = s−∞ · · · s−2 s−1 beforehand, when making predictions we consider n as a latent variable and average over it, as follows: p(s|h) = Σ_{n=0}^{∞} p(s, n|h) (9) = Σ_{n=0}^{∞} p(s|h, n) p(n|h). (10) [sent-128, score-0.159]
70 Here, p(s|h, n) is an HPYLM prediction of order n through (1), and p(n|h) is the probability distribution of the latent Markov order n possessed by the context h, obtained through (8). [sent-129, score-0.24]
71 Since p(n|h) has a product form as in (3), we can also write the above expression recursively by introducing an auxiliary probability p(s|h, n+) as follows: p(s|h, n+) = qn · p(s|h, n) + (1 − qn) · p(s|h, (n+1)+), (11) with p(s|h) ≡ p(s|h, 0+). (12) [sent-131, score-0.134]
72 This formula shows that qn in fact defines the stick-breaking process on an infinite tree, where the breaking proportions differ from branch to branch, as opposed to the single proportion on a unit interval used in ordinary Dirichlet processes. [sent-132, score-0.172]
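Equations (11)–(12) translate directly into a recursive prediction routine; the sketch below assumes a model object exposing qn for a context (via (6)) and the HPYLM prediction (1), with a finite recursion depth added for practicality:

```python
# Sketch of the recursive prediction (11)-(12): p(s|h) = p(s|h, 0+).
def predict(s, context, model, n=0, n_max=10):
    q_n = model.q(context, n)                  # stopping probability at depth n
    p_n = model.hpylm_prob(s, context, n)      # p(s|h, n), equation (1)
    if n == n_max:                             # practical truncation of the infinite recursion
        return p_n
    return q_n * p_n + (1.0 - q_n) * predict(s, context, model, n + 1, n_max)
```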
73 3.4 “Stochastic Phrases” on Suffix Tree. In expression (9) above, p(s, n|h) is the probability that the symbol s is generated by a Markov process of order n on h, that is, using the last n symbols of h as a Markov state. [sent-135, score-0.253]
74 In other words, instead of emitting a single symbol s at the root node of the suffix tree, we can first stochastically descend the tree according to the stopping probability (3). [sent-137, score-0.709]
75 Finally, we emit s given the context s−n · · · s−1 , which yields a phrase s−n · · · s−1 s and its cohesion probability. [sent-138, score-0.212]
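Read generatively, a “stochastic phrase” can be sampled by first drawing a stopping depth and then emitting a symbol; in the sketch below, stop_probabilities, emit and hpylm_prob are hypothetical wrappers around (3) and (1), so this is only one plausible reading of the procedure:

```python
import random

# Sketch: sample a stopping depth n via (3), emit s from the last n symbols of the
# context, and return the phrase together with its cohesion probability p(s, n | h).
def sample_phrase(context, model, n_max):
    probs = model.stop_probabilities(context, n_max)      # equation (3), truncated at n_max
    n = random.choices(range(len(probs)), weights=probs)[0]
    used = context[-n:] if n > 0 else []                  # the last n symbols of h
    s = model.emit(used)                                  # draw s given that context
    cohesion = probs[n] * model.hpylm_prob(s, context, n) # p(n|h) * p(s|h, n) = p(s, n|h)
    return used + [s], cohesion
```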
76 3 Notice that by setting (α, β) = (0, ∞), we always obtain qi = 0: with some maximum order nmax , this is equivalent to always using the maximum depth, and thus to reducing the model to the original HPYLM. [sent-141, score-0.291]
77 Figure 6: Estimated Markov order distributions (over n = 0 … 9) from which each word of the phrase “delegates this week EOS” has been generated. [sent-145, score-0.151]
78 4 Experiments. To investigate the behavior of the infinite Markov model, we conducted experiments on character and word sequences in natural language. [sent-146, score-0.262]
79 4.1 Infinite character Markov model. Character-based Markov models are widely employed in data compression and have important applications in language processing, such as OCR and unknown word recognition. [sent-148, score-0.322]
80 Figure 5(b) shows the actual Markov orders used for generation according to (8). [sent-159, score-0.118]
81 In fact, the model contained many nodes that correspond to valid words as well as the connective fragments between them. [sent-162, score-0.118]
82 4.2 Data. For a word-based “n-gram” model of language, we used a random subset of the standard NAB Wall Street Journal language modeling corpus [12] 5 , totalling 10,007,108 words (409,246 sentences) for training and 10,000 sentences for testing. [sent-166, score-0.202]
83 As HPYLM is shown to converge very fast [9], according to preliminary experiments we used N = 200 Gibbs iterations for burn-in, and a further 50 iterations to evaluate the perplexity of the test data. [sent-169, score-0.173]
84 n means the fixed order for HPYLM, and the maximum order nmax in VPYLM. [sent-172, score-0.152]
85 Figure 7: Global distribution of sampled Markov orders (n = 0 … 12) on the ∞-gram VPYLM over the NAB corpus. [sent-196, score-0.118]
86 VPYLM runs at roughly the same speed as an HPYLM of the same order despite the additional cost of sampling n-gram orders, because it appropriately avoids adding unnecessarily deep nodes to the suffix tree. [sent-198, score-0.157]
87 The perplexity at n = ∞ is the lowest among all fixed truncations, and the model contains only the necessary number of nodes. [sent-199, score-0.247]
88 Note that since we added an infinite number of dummy symbols to the sentence heads as usual, every word context has a maximum possible length of ∞. [sent-201, score-0.299]
89 Because of the tradeoff between using a longer, more predictive context and the penalty incurred when reaching a deeper node, interestingly a peak emerges around n = 3 ∼ 4 as a global phenomenon. [sent-203, score-0.197]
90 In fact, this hyperparameter can be optimized by the empirical Bayes method using each Beta posterior of qi in (6). [sent-205, score-0.177]
91 Figure 9 shows perplexity results for the same data, using (α, β) ∈ (0. [sent-210, score-0.173]
92 3 Variable Order Topic Model While previous approaches to latent topic modeling assumed a fixed order such as unigrams or bigrams, the order is generally not fixed and unknown to us. [sent-217, score-0.192]
93 Therefore, we used a Gibbs sampler for the Markov chain LDA [14] and augmented it by sampling Markov orders at the same time. [sent-218, score-0.19]
94 Stochastic phrases in the suffix tree (table excerpt): “primary new issues”, “at the same time”, “is a unit of”, “from # % in # to # %”, “in a number of”, “new york stock exchange composite trading”, “as a result of”, “tiffany & co.” [sent-232, score-0.24]
95 1 and the number of topics M = 5, nmax = 5, and ran N = 200 Gibbs iterations to obtain a single posterior set of models. [sent-259, score-0.117]
96 Although in predictive perplexity the improvements are slight (VPYLDA=116. [sent-260, score-0.209]
97 Although we used a small number of latent topics in this experiment to avoid data sparseness, in future research we need a more flexible topic model where the number of latent topics can differ from node to node in the suffix tree. [sent-263, score-0.579]
98 By extending a stick-breaking process “vertically” over a suffix tree of hierarchical Chinese restaurant processes, we can make posterior inferences on the Markov orders from which each datum originates. [sent-265, score-0.558]
99 In addition to the apparent application of our approach to hierarchical continuous distributions such as Gaussians, we expect that the basic model can be used for distributions over latent variables. [sent-267, score-0.148]
100 Each datum is assigned to a deeper level just when needed, and resides not only in leaf nodes but also in intermediate nodes, by stochastically descending a clustering hierarchy from the root as described in this paper. [sent-268, score-0.3]
wordName wordTfidf (topN-words)
[('customer', 0.304), ('hpylm', 0.282), ('vpylm', 0.217), ('st', 0.198), ('markov', 0.185), ('qi', 0.177), ('nab', 0.174), ('perplexity', 0.173), ('node', 0.171), ('restaurant', 0.167), ('nt', 0.167), ('tree', 0.164), ('customers', 0.159), ('symbol', 0.148), ('sing', 0.132), ('depth', 0.126), ('chinese', 0.12), ('orders', 0.118), ('language', 0.118), ('word', 0.113), ('hierarchical', 0.109), ('splay', 0.109), ('trees', 0.099), ('parent', 0.097), ('character', 0.091), ('penetration', 0.087), ('phrase', 0.087), ('proxy', 0.083), ('stochastically', 0.083), ('suf', 0.083), ('context', 0.082), ('topic', 0.077), ('int', 0.076), ('phrases', 0.076), ('nmax', 0.076), ('nodes', 0.074), ('sampler', 0.072), ('gibbs', 0.07), ('symbols', 0.067), ('vertically', 0.065), ('descending', 0.064), ('branch', 0.062), ('sent', 0.061), ('beta', 0.058), ('sequences', 0.058), ('ngram', 0.057), ('alice', 0.057), ('polya', 0.057), ('nite', 0.053), ('bi', 0.052), ('descend', 0.052), ('th', 0.051), ('count', 0.05), ('qn', 0.048), ('dirichlet', 0.047), ('stop', 0.047), ('dependencies', 0.047), ('resample', 0.046), ('deep', 0.045), ('reaching', 0.044), ('pruning', 0.044), ('root', 0.044), ('words', 0.044), ('bread', 0.043), ('butter', 0.043), ('cohesion', 0.043), ('cry', 0.043), ('daichi', 0.043), ('deploying', 0.043), ('depths', 0.043), ('hikaridai', 0.043), ('keihanna', 0.043), ('pharases', 0.043), ('possessed', 0.043), ('sibling', 0.043), ('truncations', 0.043), ('characters', 0.042), ('ai', 0.041), ('topics', 0.041), ('sentences', 0.04), ('america', 0.04), ('latent', 0.039), ('ths', 0.038), ('recursively', 0.038), ('order', 0.038), ('kyoto', 0.038), ('sit', 0.038), ('sentence', 0.037), ('predictive', 0.036), ('hidden', 0.036), ('processes', 0.035), ('children', 0.035), ('deeper', 0.035), ('united', 0.035), ('lexicon', 0.035), ('trigram', 0.035), ('yoram', 0.035), ('counts', 0.034), ('zt', 0.034), ('exponentially', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 197 nips-2007-The Infinite Markov Model
Author: Daichi Mochihashi, Eiichiro Sumita
Abstract: We present a nonparametric Bayesian method of estimating variable order Markov processes up to a theoretically infinite order. By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. Experiments on character and word sequences in natural language showed that the model has performance comparable to an exponentially large full-order model, while being much more efficient computationally in both time and space. We expect that this basic model will also extend to the variable order hierarchical clustering of general data. 1
2 0.19786006 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
Author: Max Welling, Ian Porteous, Evgeniy Bart
Abstract: A general modeling framework is proposed that unifies nonparametric-Bayesian models, topic-models and Bayesian networks. This class of infinite state Bayes nets (ISBN) can be viewed as directed networks of ‘hierarchical Dirichlet processes’ (HDPs) where the domain of the variables can be structured (e.g. words in documents or features in images). We show that collapsed Gibbs sampling can be done efficiently in these models by leveraging the structure of the Bayes net and using the forward-filtering-backward-sampling algorithm for junction trees. Existing models, such as nested-DP, Pachinko allocation, mixed membership stochastic block models as well as a number of new models are described as ISBNs. Two experiments have been performed to illustrate these ideas. 1
3 0.14384533 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
4 0.13337462 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of processors only sees of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors. ¢ ¤ ¦¥£ ¢ ¢
5 0.13207538 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
Author: Ozgur Sumer, Umut Acar, Alexander T. Ihler, Ramgopal R. Mettu
Abstract: Motivated by stochastic systems in which observed evidence and conditional dependencies between states of the network change over time, and certain quantities of interest (marginal distributions, likelihood estimates etc.) must be updated, we study the problem of adaptive inference in tree-structured Bayesian networks. We describe an algorithm for adaptive inference that handles a broad range of changes to the network and is able to maintain marginal distributions, MAP estimates, and data likelihoods in all expected logarithmic time. We give an implementation of our algorithm and provide experiments that show that the algorithm can yield up to two orders of magnitude speedups on answering queries and responding to dynamic changes over the sum-product algorithm. 1
6 0.12337144 102 nips-2007-Incremental Natural Actor-Critic Algorithms
7 0.10604264 116 nips-2007-Learning the structure of manifolds using random projections
8 0.10526638 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
9 0.098225713 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
10 0.096306212 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
11 0.092309959 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
12 0.091694266 9 nips-2007-A Probabilistic Approach to Language Change
13 0.0899745 183 nips-2007-Spatial Latent Dirichlet Allocation
14 0.085827246 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
15 0.082088441 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
16 0.08044973 86 nips-2007-Exponential Family Predictive Representations of State
17 0.080168203 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
18 0.080089644 47 nips-2007-Collapsed Variational Inference for HDP
19 0.077558994 189 nips-2007-Supervised Topic Models
20 0.074980587 84 nips-2007-Expectation Maximization and Posterior Constraints
topicId topicWeight
[(0, -0.217), (1, -0.04), (2, -0.04), (3, -0.262), (4, 0.021), (5, -0.127), (6, 0.021), (7, -0.057), (8, 0.051), (9, -0.08), (10, -0.065), (11, 0.103), (12, -0.017), (13, -0.061), (14, 0.021), (15, -0.079), (16, -0.002), (17, -0.062), (18, -0.034), (19, 0.12), (20, -0.07), (21, 0.126), (22, 0.103), (23, -0.007), (24, 0.28), (25, 0.0), (26, -0.112), (27, 0.07), (28, -0.01), (29, -0.003), (30, 0.058), (31, -0.055), (32, -0.002), (33, 0.031), (34, 0.017), (35, -0.069), (36, -0.069), (37, 0.066), (38, -0.014), (39, -0.034), (40, 0.013), (41, 0.041), (42, -0.091), (43, 0.147), (44, -0.046), (45, -0.083), (46, 0.022), (47, -0.001), (48, 0.142), (49, -0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.96488839 197 nips-2007-The Infinite Markov Model
Author: Daichi Mochihashi, Eiichiro Sumita
Abstract: We present a nonparametric Bayesian method of estimating variable order Markov processes up to a theoretically infinite order. By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. Experiments on character and word sequences in natural language showed that the model has performance comparable to an exponentially large full-order model, while being much more efficient computationally in both time and space. We expect that this basic model will also extend to the variable order hierarchical clustering of general data. 1
2 0.6208812 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
Author: Max Welling, Ian Porteous, Evgeniy Bart
Abstract: A general modeling framework is proposed that unifies nonparametric-Bayesian models, topic-models and Bayesian networks. This class of infinite state Bayes nets (ISBN) can be viewed as directed networks of ‘hierarchical Dirichlet processes’ (HDPs) where the domain of the variables can be structured (e.g. words in documents or features in images). We show that collapsed Gibbs sampling can be done efficiently in these models by leveraging the structure of the Bayes net and using the forward-filtering-backward-sampling algorithm for junction trees. Existing models, such as nested-DP, Pachinko allocation, mixed membership stochastic block models as well as a number of new models are described as ISBNs. Two experiments have been performed to illustrate these ideas. 1
3 0.57941979 9 nips-2007-A Probabilistic Approach to Language Change
Author: Alexandre Bouchard-côté, Percy Liang, Dan Klein, Thomas L. Griffiths
Abstract: We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. This framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes by reconstructing ancient word forms of Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies. 1
4 0.56864953 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
Author: Yee W. Teh, Daniel J. Hsu, Hal Daume
Abstract: We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman’s coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over the state-of-the-art, and demonstrate our approach in document clustering and phylolinguistics. 1
5 0.54236561 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
Author: Anton Chechetka, Carlos Guestrin
Abstract: We present the first truly polynomial algorithm for PAC-learning the structure of bounded-treewidth junction trees – an attractive subclass of probabilistic graphical models that permits both the compact representation of probability distributions and efficient exact inference. For a constant treewidth, our algorithm has polynomial time and sample complexity. If a junction tree with sufficiently strong intraclique dependencies exists, we provide strong theoretical guarantees in terms of KL divergence of the result from the true distribution. We also present a lazy extension of our approach that leads to very significant speed ups in practice, and demonstrate the viability of our method empirically, on several real world datasets. One of our key new theoretical insights is a method for bounding the conditional mutual information of arbitrarily large sets of variables with only polynomially many mutual information computations on fixed-size subsets of variables, if the underlying distribution can be approximated by a bounded-treewidth junction tree. 1
6 0.46254718 116 nips-2007-Learning the structure of manifolds using random projections
7 0.45091683 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
8 0.443147 191 nips-2007-Temporal Difference Updating without a Learning Rate
9 0.44094411 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
10 0.42816874 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
11 0.414545 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
12 0.41433555 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
13 0.41021991 47 nips-2007-Collapsed Variational Inference for HDP
14 0.40983656 127 nips-2007-Measuring Neural Synchrony by Message Passing
15 0.40812209 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
16 0.39099044 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
17 0.38287619 27 nips-2007-Anytime Induction of Cost-sensitive Trees
18 0.38102093 86 nips-2007-Exponential Family Predictive Representations of State
19 0.37334052 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
20 0.3634426 97 nips-2007-Hidden Common Cause Relations in Relational Learning
topicId topicWeight
[(5, 0.038), (13, 0.061), (16, 0.013), (21, 0.046), (31, 0.018), (35, 0.02), (47, 0.067), (49, 0.018), (83, 0.095), (85, 0.454), (87, 0.034), (90, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.87789077 197 nips-2007-The Infinite Markov Model
Author: Daichi Mochihashi, Eiichiro Sumita
Abstract: We present a nonparametric Bayesian method of estimating variable order Markov processes up to a theoretically infinite order. By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. Experiments on character and word sequences in natural language showed that the model has performance comparable to an exponentially large full-order model, while being much more efficient computationally in both time and space. We expect that this basic model will also extend to the variable order hierarchical clustering of general data. 1
2 0.7959252 77 nips-2007-Efficient Inference for Distributions on Permutations
Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas
Abstract: Permutations are ubiquitous in many real world problems, such as voting, rankings and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact representations such as graphical models cannot efficiently capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent such distributions compactly. We present Kronecker conditioning, a general and efficient approach for maintaining these distributions directly in the Fourier domain. Low order Fourier-based approximations can lead to functions that do not correspond to valid distributions. To address this problem, we present an efficient quadratic program defined directly in the Fourier domain to project the approximation onto a relaxed form of the marginal polytope. We demonstrate the effectiveness of our approach on a real camera-based multi-people tracking setting. 1
3 0.78041834 135 nips-2007-Multi-task Gaussian Process Prediction
Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams
Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1
4 0.72660333 112 nips-2007-Learning Monotonic Transformations for Classification
Author: Andrew Howard, Tony Jebara
Abstract: A discriminative method is proposed for learning monotonic transformations of the training data while jointly estimating a large-margin classifier. In many domains such as document classification, image histogram classification and gene microarray experiments, fixed monotonic transformations can be useful as a preprocessing step. However, most classifiers only explore these transformations through manual trial and error or via prior domain knowledge. The proposed method learns monotonic transformations automatically while training a large-margin classifier without any prior knowledge of the domain. A monotonic piecewise linear function is learned which transforms data for subsequent processing by a linear hyperplane classifier. Two algorithmic implementations of the method are formalized. The first solves a convergent alternating sequence of quadratic and linear programs until it obtains a locally optimal solution. An improved algorithm is then derived using a convex semidefinite relaxation that overcomes initialization issues in the greedy optimization problem. The effectiveness of these learned transformations on synthetic problems, text data and image data is demonstrated. 1
5 0.49163494 141 nips-2007-New Outer Bounds on the Marginal Polytope
Author: David Sontag, Tommi S. Jaakkola
Abstract: We give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efficiently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints on the marginal polytope are derived through a series of projections onto the cut polytope. As a result, we obtain tighter upper bounds on the log-partition function. We also show empirically that the approximations of the marginals are significantly more accurate when using the tighter outer bounds. Finally, we demonstrate the advantage of the new constraints for finding the MAP assignment in protein structure prediction. 1
6 0.44269687 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
7 0.43726504 23 nips-2007-An Analysis of Convex Relaxations for MAP Estimation
8 0.41871586 63 nips-2007-Convex Relaxations of Latent Variable Training
9 0.4176091 134 nips-2007-Multi-Task Learning via Conic Programming
10 0.41497722 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
11 0.41045657 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
12 0.40944374 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
13 0.40839365 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
14 0.40821871 97 nips-2007-Hidden Common Cause Relations in Relational Learning
15 0.40742627 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
16 0.40285626 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
17 0.39964682 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
18 0.39900866 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
19 0.39635444 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
20 0.39490706 9 nips-2007-A Probabilistic Approach to Language Change