nips nips2007 nips2007-197 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daichi Mochihashi, Eiichiro Sumita
Abstract: We present a nonparametric Bayesian method of estimating variable order Markov processes up to a theoretically infinite order. By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. Experiments on character and word sequences in natural language showed that the model has performance comparable to an exponentially large full-order model, while being much more efficient computationally in both time and space. We expect that this basic model will also extend to the variable order hierarchical clustering of general data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. [sent-8, score-0.85]
2 Experiments on character and word sequences in natural language showed that the model has performance comparable to an exponentially large full-order model, while being much more efficient computationally in both time and space. [sent-9, score-0.414]
3 We expect that this basic model will also extend to the variable order hierarchical clustering of general data. [sent-10, score-0.147]
4 In particular, (n−1)th order Markov models over words are called “n-gram” language models and play a key role in speech recognition and machine translation, where they are used to choose the most natural sentence among candidate transcriptions [4]. [sent-12, score-0.237]
5 Because higher-order Markov models have an exponentially large number of parameters, their orders have been restricted to a small, often fixed number. [sent-14, score-0.152]
6 However, word dependencies will often have a span greater than n for phrasal expressions or compound proper nouns, while a much shorter n will suffice for some grammatical relationships. [sent-16, score-0.16]
7 However, these methods, all stemming from [5] and [7], are based on pruning a huge candidate suffix tree using criteria such as KL divergences. [sent-19, score-0.208]
8 This kind of “post-hoc” approach suffers from several important limitations: First, when we want to consider deeper dependencies, the candidate tree to be pruned will be extremely large. [sent-20, score-0.199]
9 Second, the criteria and threshold for pruning the tree are inherently exogenous and must be set carefully so that they match the desired model and current data. [sent-22, score-0.208]
10 Third, pruning by empirical counts in advance, which is often used to build “arbitrary order” candidate trees in these approaches, has been shown to behave very badly [8] and has no theoretical justification. [sent-23, score-0.177]
11 Figure 1: Hierarchical Chinese restaurant processes over the finite and infinite suffix trees. [sent-32, score-0.202]
12 defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. [sent-33, score-0.85]
13 Furthermore, we extend the variable order model with latent topics to show that we can induce variable length “stochastic phrases” on a topic-by-topic basis. [sent-35, score-0.157]
14 However, now that we have the hierarchical (Poisson-)Dirichlet process, which can be used as a fixed order language model [9][10], it is natural to extend these models to variable orders as well by using a nonparametric approach. [sent-37, score-0.383]
15 For concreteness, below we use a language model example, but the same model can be applied to any discrete sequences, such as characters, DNA, or even binary streams for compression. [sent-39, score-0.118]
16 Consider a trigram language model, which is a second-order Markov model over words often employed in speech recognition. [sent-40, score-0.197]
17 Following [9], this Markov model can be represented by a suffix tree of depth two, as shown in Figure 1(a). [sent-41, score-0.29]
18 When we predict a word “sing” after a context “she will”, we descend this suffix tree from the root (which corresponds to null string context), using the context backwards to follow a branch “will” and then “she will”. [sent-42, score-0.599]
19 1 Now we arrive at the leaf node that represents the context, and we can predict “sing” by using the count distribution at this node. [sent-43, score-0.221]
20 During the learning phase, we begin with a suffix tree that has no counts. [sent-44, score-0.164]
21 Every time a three-word sequence appears in the training data, such as “she will sing” mentioned above, we add a count of the final word (“sing”) given the context (“she will”) to the context node in the suffix tree. [sent-45, score-0.611]
22 In fact this corresponds to a hierarchical Chinese restaurant process, where each context node is a restaurant and each count is a customer associated with a word. [sent-46, score-1.05]
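As a concrete illustration of this training step (not the authors' implementation), the following minimal Python sketch stores the suffix tree as nested nodes with word counts; the Node class and add_count helper are hypothetical names introduced here:

```python
# Minimal sketch of adding a trigram count to a suffix tree (illustration only).
from collections import defaultdict

class Node:
    """A context node, i.e., a "restaurant" in the hierarchical CRP view."""
    def __init__(self):
        self.children = {}              # deeper contexts, keyed by the next older word
        self.counts = defaultdict(int)  # "customers": counts of the predicted word

def add_count(root, context, word):
    """Descend from the root following the context backwards, then add one count."""
    node = root
    for w in reversed(context):         # "she will" -> follow "will", then "she will"
        node = node.children.setdefault(w, Node())
    node.counts[word] += 1              # seat one customer for `word` at this context node

root = Node()
add_count(root, ["she", "will"], "sing")   # the running example from the text
```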
23 A context node, i.e., a restaurant, might not have customers for all the words in the lexicon. [sent-49, score-0.203]
24 Therefore, when a customer arrives at a node and stochastically needs a new table to sit down, a copy of him, namely a proxy customer, is sent to its parent node. [sent-50, score-0.837]
25 When a node has no customer to compute the probability of some word, it uses the distribution of customers at the parent node and appropriately interpolates it to sum to 1. [sent-51, score-0.902]
26 Assume that the node “she will” does not have a customer of “like”. [sent-52, score-0.475]
27 We can nevertheless compute the probability of “like” given “she will” if its sibling “he will” has a customer “like”. [sent-53, score-0.347]
28 Because that sibling has sent a copy of the customer to the common parent “will”, the probability is computed by appropriately interpolating the trigram probability given “she will”, which is zero, with the bigram probability given “will”, which is not zero at the parent node. [sent-54, score-0.637]
29 Figure 2: Probabilistic suffix tree of infinite depth, with penetration probabilities (1 − qi), (1 − qj), (1 − qk) attached to nodes i, j, k. [sent-57, score-0.341]
30 (1 − qi ) is a “penetration probability” of a descending customer at each node i, defining a stick-breaking process over the infinite tree. [sent-58, score-0.716]
31 c(s|h) is the count of s at node h, and c(h) = Σs c(s|h) is the total count at node h. [sent-60, score-0.442]
32 ths is the number of times symbol s is estimated to have been generated from its parent distribution p(s|h′) rather than from p(s|h) in the training data; th· = Σs ths is its total. [sent-61, score-0.321]
33 θ and d are the parameters of the Pitman-Yor process, and can be estimated through the distribution of customers on a suffix tree by Gamma and Beta posterior distributions, respectively. [sent-62, score-0.323]
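For reference, a hedged Python sketch of the Pitman-Yor predictive probability of equation (1) in terms of these quantities; it assumes each node carries dictionaries c (customer counts) and t (table counts) plus a parent pointer, and uses a single θ and d for all depths, which is a simplification rather than the authors' exact setup:

```python
# Sketch of the hierarchical Pitman-Yor predictive probability (equation (1)).
# node.c[s]: customer counts c(s|h); node.t[s]: table counts t_hs; node.parent: h'.
def p_hpy(node, s, theta, d, p_base):
    if node is None:
        return p_base                   # base measure above the root (e.g., 1 / vocabulary size)
    c_s, t_s = node.c.get(s, 0), node.t.get(s, 0)
    c_h, t_h = sum(node.c.values()), sum(node.t.values())
    parent_p = p_hpy(node.parent, s, theta, d, p_base)
    if c_h == 0:
        return parent_p                 # no customers here: fall back entirely on the parent
    return (max(c_s - d * t_s, 0.0) / (theta + c_h)
            + (theta + d * t_h) / (theta + c_h) * parent_p)
```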
34 Although this Bayesian Markov model is very principled and attractive, we can see from Figure 1(a) that all the real customers (i.e., [sent-64, score-0.159]
35 counts) are fixed at depth (n−1) in the suffix tree. [sent-66, score-0.126]
36 Because actual sequences will have heterogeneous Markov dependencies, we want a Markov model that deploys customers at different levels in the suffix tree according to the true Markov order from which each customer originated. [sent-67, score-0.723]
37 3 Infinite-order Hierarchical Chinese Restaurant Processes. Intuitively, we know that suffix trees that are too deep are improbable, and that symbol dependencies decay largely exponentially with context length. [sent-69, score-0.455]
38 However, some customers may reside in a very deep node (for example, “the united states of america”) and some in a shallow node (“shorter than”). [sent-70, score-0.581]
39 Our model for deploying customers must be flexible enough to accommodate all these possibilities. [sent-71, score-0.202]
40 3.1 Introducing Suffix Tree Prior. For this purpose, we assume that each node i in the suffix tree has a hidden probability qi of stopping at node i when following a path from the root of the tree to add a customer. [sent-73, score-0.927]
41 In other words, (1 − qi ) is the “penetration probability” when descending an infinite depth suffix tree from its root (Figure 2). [sent-74, score-0.575]
42 We assume that each qi is generated independently from a prior Beta distribution as: qi ∼ Be(α, β) i.i.d. (2) [sent-75, score-0.354]
43 When we want to generate a symbol st given a context h = s−∞ · · · st−2 st−1, we descend the suffix tree from the root following a path st−1 → st−2 → · · · , according to the probability of stopping at level l given by p(n = l|h) = ql ∏_{i=0}^{l−1} (1 − qi). (l = 0, 1, · · · , ∞) (3) [sent-79, score-0.865]
44 When we stop at level l, we generate a symbol st using the context st−l · · · st−2 st−1. [sent-80, score-0.475]
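A small sketch of this stopping distribution, assuming the qi along the descent path are given and the infinite tree is truncated at a finite depth for illustration:

```python
# Sketch of equation (3): p(n = l | h) = q_l * prod_{i<l} (1 - q_i).
def stop_probabilities(q_path):
    """q_path[l] is the stopping probability q_i of the node reached at level l."""
    probs, survive = [], 1.0
    for q in q_path:
        probs.append(q * survive)       # stop at this level
        survive *= (1.0 - q)            # penetrate one level deeper
    probs.append(survive)               # remaining mass beyond the truncation
    return probs

print(stop_probabilities([0.4, 0.3, 0.8]))   # -> [0.4, 0.18, 0.336, 0.084], summing to 1
```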
45 Since qi differs from node to node, we may reach very deep nodes with high probability if the qi ’s along the path are equally small (the “penetration” of this branch is high); or, we may stop at a very shallow node if the qi ’s are very high (the “penetration” is low). [sent-81, score-1.101]
46 In general, the probability of reaching a node decays exponentially with depth according to (3), but the decay rate differs from branch to branch, which allows for long sequences such as typical phrases. [sent-82, score-0.263]
47 Note that even for the same context h, the context length that was used to generate the next symbol may differ stochastically for each appearance according to (3). [sent-83, score-0.395]
48 3.2 Inference. Of course, we do not know the hidden probability qi possessed by each node. [sent-85, score-0.256]
49 Note that the generative model above amounts to introducing a vector of hidden variables, n = n1 n2 · · · nT , that corresponds to each Markov order (n = 0 · · · ∞) from which each symbol st in s = s1 s2 · · · sT originated. [sent-87, score-0.42]
50 Here, z = z1 z2 · · · zT is a vector that represents the hidden seatings of the proxy customers described in Section 2, where 0 ≤ zt ≤ nt indicates how far up the tree st’s proxy customers are recursively and stochastically sent to parent nodes. [sent-89, score-1.198]
51 Since in the hierarchical (Poisson-)Dirichlet process the customers are exchangeable [9] and each qi is i.i.d. [sent-91, score-0.445]
52 as shown in (2), this process is also exchangeable, and therefore we can always assume, by a suitable permutation, that the customer to resample is the final customer. [sent-94, score-0.35]
53 In our case, we only explicitly resample nt given n−t (n excluding nt), as follows: nt ∼ p(nt | s, z−t, n−t). (5) [sent-95, score-0.547]
54 Notice here that when we sample nt, we already know the depths n−t that the other words have reached in the suffix tree. [sent-96, score-0.254]
55 Therefore, when computing (5) using (3), the expectation of each qi is E[qi] = (ai + α) / (ai + bi + α + β), (6) where ai is the number of times node i was stopped at when generating other words, and bi is the number of times node i was passed by. [sent-97, score-0.694]
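In code, the posterior-mean penetration of equation (6) and the prior over stopping depths it induces might look like the sketch below; storing (ai, bi) per node on the descent path is an assumption made for illustration:

```python
# Sketch of equation (6) and of the depth prior it induces along one descent path.
def expected_q(a_i, b_i, alpha, beta):
    return (a_i + alpha) / (a_i + b_i + alpha + beta)   # E[q_i], equation (6)

def prior_over_depths(path_counts, alpha, beta):
    """path_counts[l] = (a_i, b_i) of the node reached at level l."""
    probs, survive = [], 1.0
    for a_i, b_i in path_counts:
        q = expected_q(a_i, b_i, alpha, beta)
        probs.append(q * survive)                       # stop at level l
        survive *= (1.0 - q)                            # continue deeper
    return probs, survive                               # per-level mass and leftover tail
```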
56 p(nt | s, z−t, n−t) ∝ p(st | s−t, z−t, n−t, nt) · p(nt | s−t, z−t, n−t). (7) The first term is the probability of st under HPYLM when the Markov order is known to be nt, given by (1). [sent-99, score-0.403]
57 The second term is the prior probability of reaching that node at depth nt . [sent-100, score-0.508]
58 Using these probabilities, we can construct a Gibbs sampler, as shown in Figure 3, to iteratively resample n and z in order to estimate the parameter of the variable order hierarchical Pitman-Yor language model (VPYLM)2 . [sent-103, score-0.349]
59 In this sampler, we first remove the t’th customer who resides at a depth of order[t] in the suffix tree, and decrement ai or bi accordingly along the path. [sent-104, score-0.523]
60 After sampling a new depth (i.e., Markov order) according to (7), we put the t’th customer back at the new depth recorded as order[t], and increment ai or bi accordingly along the new path. [sent-107, score-0.523]
61 When we add a customer st , zt is implicitly sampled because st ’s proxy customer is recursively sent to parent nodes in case a new table is needed to sit him down. [sent-108, score-1.429]
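Putting these pieces together, one Gibbs sweep could be sketched as follows; remove_customer, add_customer, hpylm_prob and reach_prior are hypothetical helpers standing in for the bookkeeping described above (they are not names from the paper), and the truncation at nmax follows the practical note given later in this section:

```python
import random

# Sketch of one Gibbs sweep over Markov orders (illustration only).
# remove_customer / add_customer update counts and the a_i / b_i statistics along
# the path; hpylm_prob is the first term of (7) and reach_prior the second term.
def gibbs_sweep(seq, order, model, n_max):
    for t in range(len(seq)):
        context = seq[max(0, t - n_max):t][::-1]        # context, most recent word first
        model.remove_customer(seq[t], context, depth=order[t])
        weights = [model.hpylm_prob(seq[t], context, n) * model.reach_prior(context, n)
                   for n in range(n_max + 1)]
        order[t] = random.choices(range(n_max + 1), weights=weights)[0]
        model.add_customer(seq[t], context, depth=order[t])
    return order
```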
62 Figure 4: Data structure of a suffix tree node (a C struct whose fields include int id; /* symbol id */). [sent-110, score-0.388]
63 This is a specific application of our model to the hierarchical Pitman-Yor processes for discrete data. [sent-114, score-0.144]
64 Figure 5: a character sequence (“. . . said alice ; ‘let us all . . .”) with (b) the Markov orders used to generate each character above. [sent-118, score-0.247]
65 This sampler is an extension of that reported in [9], using stochastically different orders n (n = 0 · · · ∞) for each customer. [sent-120, score-0.273]
66 In practice, we can place some maximum order nmax on n and sample within it 3 , or use a small threshold ǫ to stop the descent when the prior probability (8) of reaching that level is smaller than ǫ. [sent-121, score-0.205]
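The ǫ-threshold variant of the descent can be sketched as below, with reach probabilities computed incrementally from per-node E[qi] values as in (6); the helper name and its inputs are illustrative only:

```python
# Sketch: truncate the descent once the prior probability of reaching the next
# level (equation (8)) drops below a small threshold epsilon.
def truncated_depths(q_path, epsilon):
    """q_path[l] = E[q_i] of the node at level l; returns the candidate depths kept."""
    depths, reach = [], 1.0
    for l, q in enumerate(q_path):
        if reach < epsilon:            # too unlikely to reach this level: stop descending
            break
        depths.append(l)
        reach *= (1.0 - q)             # probability of penetrating past level l
    return depths

print(truncated_depths([0.5, 0.5, 0.5, 0.5, 0.5], epsilon=0.2))   # -> [0, 1, 2]
```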
67 Because each node in the suffix tree may have a huge number of children, we used Splay Trees [11] for the efficient search as in [6]. [sent-123, score-0.335]
68 Figure 4 shows our data structure of a node in a suffix tree. [sent-126, score-0.171]
69 3.3 Prediction. Since we do not usually know the Markov order of a context h = s−∞ · · · s−2 s−1 beforehand, when making predictions we consider n as a latent variable and average over it, as follows: p(s|h) = Σ_{n=0}^{∞} p(s, n|h) (9) = Σ_{n=0}^{∞} p(s|h, n) p(n|h). (10) [sent-128, score-0.159]
70 Here, p(s|h, n) is an HPYLM prediction of order n through (1), and p(n|h) is the probability distribution of the latent Markov order n possessed by the context h, obtained through (8). [sent-129, score-0.24]
71 Since p(n|h) has a product form as in (3), we can also write the above expression recursively by introducing an auxiliary probability p(s|h, n+) as follows: p(s|h, n+) = qn · p(s|h, n) + (1 − qn) · p(s|h, (n+1)+), (11) with p(s|h) ≡ p(s|h, 0+). (12) [sent-131, score-0.134]
72 This formula shows that qn in fact defines the stick-breaking process on an infinite tree, where the breaking proportions differ from branch to branch, as opposed to the single proportion on a unit interval used in ordinary Dirichlet processes. [sent-132, score-0.172]
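Equations (11)–(12) translate directly into a recursive prediction routine; the sketch below assumes a model object exposing qn for a context (via (6)) and the HPYLM prediction (1), with a finite recursion depth added for practicality:

```python
# Sketch of the recursive prediction (11)-(12): p(s|h) = p(s|h, 0+).
def predict(s, context, model, n=0, n_max=10):
    q_n = model.q(context, n)                  # stopping probability at depth n
    p_n = model.hpylm_prob(s, context, n)      # p(s|h, n), equation (1)
    if n == n_max:                             # practical truncation of the infinite recursion
        return p_n
    return q_n * p_n + (1.0 - q_n) * predict(s, context, model, n + 1, n_max)
```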
73 3.4 “Stochastic Phrases” on Suffix Tree. In expression (9) above, p(s, n|h) is the probability that the symbol s is generated by a Markov process of order n on h, that is, using the last n symbols of h as a Markov state. [sent-135, score-0.253]
74 In other words, instead of emitting a single symbol s at the root node of the suffix tree, we can first stochastically descend the tree according to the stopping probability (3). [sent-137, score-0.709]
75 Finally, we emit s given the context s−n · · · s−1 , which yields a phrase s−n · · · s−1 s and its cohesion probability. [sent-138, score-0.212]
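Read generatively, a “stochastic phrase” can be sampled by first drawing a stopping depth and then emitting a symbol; in the sketch below, stop_probabilities, emit and hpylm_prob are hypothetical wrappers around (3) and (1), so this is only one plausible reading of the procedure:

```python
import random

# Sketch: sample a stopping depth n via (3), emit s from the last n symbols of the
# context, and return the phrase together with its cohesion probability p(s, n | h).
def sample_phrase(context, model, n_max):
    probs = model.stop_probabilities(context, n_max)      # equation (3), truncated at n_max
    n = random.choices(range(len(probs)), weights=probs)[0]
    used = context[-n:] if n > 0 else []                  # the last n symbols of h
    s = model.emit(used)                                  # draw s given that context
    cohesion = probs[n] * model.hpylm_prob(s, context, n) # p(n|h) * p(s|h, n) = p(s, n|h)
    return used + [s], cohesion
```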
76 3 Notice that by setting (α, β) = (0, ∞), we always obtain qi = 0: with some maximum order nmax , this is equivalent to always using the maximum depth, and thus to reducing the model to the original HPYLM. [sent-141, score-0.291]
77 Figure 6: Estimated Markov order distributions (over n = 0 … 9) from which each word of the phrase “delegates this week EOS” has been generated. [sent-145, score-0.151]
78 4 Experiments. To investigate the behavior of the infinite Markov model, we conducted experiments on character and word sequences in natural language. [sent-146, score-0.262]
79 4.1 Infinite character Markov model. Character-based Markov models are widely employed in data compression and have important applications in language processing, such as OCR and unknown word recognition. [sent-148, score-0.322]
80 Figure 5(b) shows the actual Markov orders used for generation according to (8). [sent-159, score-0.118]
81 In fact, the model contained many nodes that correspond to valid words as well as the connective fragments between them. [sent-162, score-0.118]
82 4.2 Data. For a word-based “n-gram” model of language, we used a random subset of the standard NAB Wall Street Journal language modeling corpus [12] 5 , totalling 10,007,108 words (409,246 sentences) for training and 10,000 sentences for testing. [sent-166, score-0.202]
83 As HPYLM is shown to converge very fast [9], according to preliminary experiments we used N = 200 Gibbs iterations for burn-in, and a further 50 iterations to evaluate the perplexity of the test data. [sent-169, score-0.173]
84 n means the fixed order for HPYLM, and the maximum order nmax in VPYLM. [sent-172, score-0.152]
85 Figure 7: Global distribution of sampled Markov orders (n = 0 … 12) on the ∞-gram VPYLM over the NAB corpus. [sent-196, score-0.118]
86 VPYLM runs at roughly the same speed as an HPYLM of the same order despite the additional cost of sampling n-gram orders, because it appropriately avoids adding unnecessarily deep nodes to the suffix tree. [sent-198, score-0.157]
87 The perplexity at n = ∞ is the lowest among all fixed truncations, and the model contains only the necessary number of nodes. [sent-199, score-0.247]
88 Note that since we added an infinite number of dummy symbols to the sentence heads as usual, every word context has a maximum possible length of ∞. [sent-201, score-0.299]
89 Because of the tradeoff between using a longer, more predictive context and the penalty incurred when reaching a deeper node, interestingly a peak emerges around n = 3 ∼ 4 as a global phenomenon. [sent-203, score-0.197]
90 In fact, this hyperparameter can be optimized by the empirical Bayes method using each Beta posterior of qi in (6). [sent-205, score-0.177]
91 Figure 9 shows perplexity results for the same data, using (α, β) ∈ (0. [sent-210, score-0.173]
92 3 Variable Order Topic Model While previous approaches to latent topic modeling assumed a fixed order such as unigrams or bigrams, the order is generally not fixed and unknown to us. [sent-217, score-0.192]
93 Therefore, we used a Gibbs sampler for the Markov chain LDA [14] and augmented it by sampling Markov orders at the same time. [sent-218, score-0.19]
94 Stochastic phrases in the suffix tree (table excerpt): “primary new issues”, “at the same time”, “is a unit of”, “from # % in # to # %”, “in a number of”, “new york stock exchange composite trading”, “as a result of”, “tiffany & co.” [sent-232, score-0.24]
95 1 and the number of topics M = 5, nmax = 5, and ran N = 200 Gibbs iterations to obtain a single posterior set of models. [sent-259, score-0.117]
96 Although in predictive perplexity the improvements are slight (VPYLDA=116. [sent-260, score-0.209]
97 Although we used a small number of latent topics in this experiment to avoid data sparseness, in future research we need a more flexible topic model where the number of latent topics can differ from node to node in the suffix tree. [sent-263, score-0.579]
98 By extending a stick-breaking process “vertically” over a suffix tree of hierarchical Chinese restaurant processes, we can make posterior inferences on the Markov orders from which each datum originates. [sent-265, score-0.558]
99 In addition to the apparent application of our approach to hierarchical continuous distributions such as Gaussians, we expect that the basic model can be used for distributions over latent variables. [sent-267, score-0.148]
100 Each datum is assigned to a deeper level just when needed, and resides not only in leaf nodes but also in intermediate nodes, by stochastically descending a clustering hierarchy from the root as described in this paper. [sent-268, score-0.3]
wordName wordTfidf (topN-words)
[('customer', 0.304), ('hpylm', 0.282), ('vpylm', 0.217), ('st', 0.198), ('markov', 0.185), ('qi', 0.177), ('nab', 0.174), ('perplexity', 0.173), ('node', 0.171), ('restaurant', 0.167), ('nt', 0.167), ('tree', 0.164), ('customers', 0.159), ('symbol', 0.148), ('sing', 0.132), ('depth', 0.126), ('chinese', 0.12), ('orders', 0.118), ('language', 0.118), ('word', 0.113), ('hierarchical', 0.109), ('splay', 0.109), ('trees', 0.099), ('parent', 0.097), ('character', 0.091), ('penetration', 0.087), ('phrase', 0.087), ('proxy', 0.083), ('stochastically', 0.083), ('suf', 0.083), ('context', 0.082), ('topic', 0.077), ('int', 0.076), ('phrases', 0.076), ('nmax', 0.076), ('nodes', 0.074), ('sampler', 0.072), ('gibbs', 0.07), ('symbols', 0.067), ('vertically', 0.065), ('descending', 0.064), ('branch', 0.062), ('sent', 0.061), ('beta', 0.058), ('sequences', 0.058), ('ngram', 0.057), ('alice', 0.057), ('polya', 0.057), ('nite', 0.053), ('bi', 0.052), ('descend', 0.052), ('th', 0.051), ('count', 0.05), ('qn', 0.048), ('dirichlet', 0.047), ('stop', 0.047), ('dependencies', 0.047), ('resample', 0.046), ('deep', 0.045), ('reaching', 0.044), ('pruning', 0.044), ('root', 0.044), ('words', 0.044), ('bread', 0.043), ('butter', 0.043), ('cohesion', 0.043), ('cry', 0.043), ('daichi', 0.043), ('deploying', 0.043), ('depths', 0.043), ('hikaridai', 0.043), ('keihanna', 0.043), ('pharases', 0.043), ('possessed', 0.043), ('sibling', 0.043), ('truncations', 0.043), ('characters', 0.042), ('ai', 0.041), ('topics', 0.041), ('sentences', 0.04), ('america', 0.04), ('latent', 0.039), ('ths', 0.038), ('recursively', 0.038), ('order', 0.038), ('kyoto', 0.038), ('sit', 0.038), ('sentence', 0.037), ('predictive', 0.036), ('hidden', 0.036), ('processes', 0.035), ('children', 0.035), ('deeper', 0.035), ('united', 0.035), ('lexicon', 0.035), ('trigram', 0.035), ('yoram', 0.035), ('counts', 0.034), ('zt', 0.034), ('exponentially', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 197 nips-2007-The Infinite Markov Model
Author: Daichi Mochihashi, Eiichiro Sumita
Abstract: We present a nonparametric Bayesian method of estimating variable order Markov processes up to a theoretically infinite order. By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. Experiments on character and word sequences in natural language showed that the model has performance comparable to an exponentially large full-order model, while being much more efficient computationally in both time and space. We expect that this basic model will also extend to the variable order hierarchical clustering of general data. 1
2 0.19786006 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
Author: Max Welling, Ian Porteous, Evgeniy Bart
Abstract: A general modeling framework is proposed that unifies nonparametric-Bayesian models, topic-models and Bayesian networks. This class of infinite state Bayes nets (ISBN) can be viewed as directed networks of ‘hierarchical Dirichlet processes’ (HDPs) where the domain of the variables can be structured (e.g. words in documents or features in images). We show that collapsed Gibbs sampling can be done efficiently in these models by leveraging the structure of the Bayes net and using the forward-filtering-backward-sampling algorithm for junction trees. Existing models, such as nested-DP, Pachinko allocation, mixed membership stochastic block models as well as a number of new models are described as ISBNs. Two experiments have been performed to illustrate these ideas. 1
3 0.14384533 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
4 0.13337462 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
Author: David Newman, Padhraic Smyth, Max Welling, Arthur U. Asuncion
Abstract: We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of processors only sees of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors. ¢ ¤ ¦¥£ ¢ ¢
5 0.13207538 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
Author: Ozgur Sumer, Umut Acar, Alexander T. Ihler, Ramgopal R. Mettu
Abstract: Motivated by stochastic systems in which observed evidence and conditional dependencies between states of the network change over time, and certain quantities of interest (marginal distributions, likelihood estimates etc.) must be updated, we study the problem of adaptive inference in tree-structured Bayesian networks. We describe an algorithm for adaptive inference that handles a broad range of changes to the network and is able to maintain marginal distributions, MAP estimates, and data likelihoods in all expected logarithmic time. We give an implementation of our algorithm and provide experiments that show that the algorithm can yield up to two orders of magnitude speedups on answering queries and responding to dynamic changes over the sum-product algorithm. 1
6 0.12337144 102 nips-2007-Incremental Natural Actor-Critic Algorithms
7 0.10604264 116 nips-2007-Learning the structure of manifolds using random projections
8 0.10526638 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
9 0.098225713 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
10 0.096306212 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
11 0.092309959 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
12 0.091694266 9 nips-2007-A Probabilistic Approach to Language Change
13 0.0899745 183 nips-2007-Spatial Latent Dirichlet Allocation
14 0.085827246 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
15 0.082088441 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
16 0.08044973 86 nips-2007-Exponential Family Predictive Representations of State
17 0.080168203 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
18 0.080089644 47 nips-2007-Collapsed Variational Inference for HDP
19 0.077558994 189 nips-2007-Supervised Topic Models
20 0.074980587 84 nips-2007-Expectation Maximization and Posterior Constraints
topicId topicWeight
[(0, -0.217), (1, -0.04), (2, -0.04), (3, -0.262), (4, 0.021), (5, -0.127), (6, 0.021), (7, -0.057), (8, 0.051), (9, -0.08), (10, -0.065), (11, 0.103), (12, -0.017), (13, -0.061), (14, 0.021), (15, -0.079), (16, -0.002), (17, -0.062), (18, -0.034), (19, 0.12), (20, -0.07), (21, 0.126), (22, 0.103), (23, -0.007), (24, 0.28), (25, 0.0), (26, -0.112), (27, 0.07), (28, -0.01), (29, -0.003), (30, 0.058), (31, -0.055), (32, -0.002), (33, 0.031), (34, 0.017), (35, -0.069), (36, -0.069), (37, 0.066), (38, -0.014), (39, -0.034), (40, 0.013), (41, 0.041), (42, -0.091), (43, 0.147), (44, -0.046), (45, -0.083), (46, 0.022), (47, -0.001), (48, 0.142), (49, -0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.96488839 197 nips-2007-The Infinite Markov Model
Author: Daichi Mochihashi, Eiichiro Sumita
Abstract: We present a nonparametric Bayesian method of estimating variable order Markov processes up to a theoretically infinite order. By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. Experiments on character and word sequences in natural language showed that the model has performance comparable to an exponentially large full-order model, while being much more efficient computationally in both time and space. We expect that this basic model will also extend to the variable order hierarchical clustering of general data. 1
2 0.6208812 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
Author: Max Welling, Ian Porteous, Evgeniy Bart
Abstract: A general modeling framework is proposed that unifies nonparametric-Bayesian models, topic-models and Bayesian networks. This class of infinite state Bayes nets (ISBN) can be viewed as directed networks of ‘hierarchical Dirichlet processes’ (HDPs) where the domain of the variables can be structured (e.g. words in documents or features in images). We show that collapsed Gibbs sampling can be done efficiently in these models by leveraging the structure of the Bayes net and using the forward-filtering-backward-sampling algorithm for junction trees. Existing models, such as nested-DP, Pachinko allocation, mixed membership stochastic block models as well as a number of new models are described as ISBNs. Two experiments have been performed to illustrate these ideas. 1
3 0.57941979 9 nips-2007-A Probabilistic Approach to Language Change
Author: Alexandre Bouchard-côté, Percy Liang, Dan Klein, Thomas L. Griffiths
Abstract: We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. This framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes by reconstructing ancient word forms of Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies. 1
4 0.56864953 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
Author: Yee W. Teh, Daniel J. Hsu, Hal Daume
Abstract: We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman’s coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over the state-of-the-art, and demonstrate our approach in document clustering and phylolinguistics. 1
5 0.54236561 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
Author: Anton Chechetka, Carlos Guestrin
Abstract: We present the first truly polynomial algorithm for PAC-learning the structure of bounded-treewidth junction trees – an attractive subclass of probabilistic graphical models that permits both the compact representation of probability distributions and efficient exact inference. For a constant treewidth, our algorithm has polynomial time and sample complexity. If a junction tree with sufficiently strong intraclique dependencies exists, we provide strong theoretical guarantees in terms of KL divergence of the result from the true distribution. We also present a lazy extension of our approach that leads to very significant speed ups in practice, and demonstrate the viability of our method empirically, on several real world datasets. One of our key new theoretical insights is a method for bounding the conditional mutual information of arbitrarily large sets of variables with only polynomially many mutual information computations on fixed-size subsets of variables, if the underlying distribution can be approximated by a bounded-treewidth junction tree. 1
6 0.46254718 116 nips-2007-Learning the structure of manifolds using random projections
7 0.45091683 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
8 0.443147 191 nips-2007-Temporal Difference Updating without a Learning Rate
9 0.44094411 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
10 0.42816874 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
11 0.414545 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
12 0.41433555 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
13 0.41021991 47 nips-2007-Collapsed Variational Inference for HDP
14 0.40983656 127 nips-2007-Measuring Neural Synchrony by Message Passing
15 0.40812209 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
16 0.39099044 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
17 0.38287619 27 nips-2007-Anytime Induction of Cost-sensitive Trees
18 0.38102093 86 nips-2007-Exponential Family Predictive Representations of State
19 0.37334052 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
20 0.3634426 97 nips-2007-Hidden Common Cause Relations in Relational Learning
topicId topicWeight
[(5, 0.038), (13, 0.061), (16, 0.013), (21, 0.046), (31, 0.018), (35, 0.02), (47, 0.067), (49, 0.018), (83, 0.095), (85, 0.454), (87, 0.034), (90, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.87789077 197 nips-2007-The Infinite Markov Model
Author: Daichi Mochihashi, Eiichiro Sumita
Abstract: We present a nonparametric Bayesian method of estimating variable order Markov processes up to a theoretically infinite order. By extending a stick-breaking prior, which is usually defined on a unit interval, “vertically” to the trees of infinite depth associated with a hierarchical Chinese restaurant process, our model directly infers the hidden orders of Markov dependencies from which each symbol originated. Experiments on character and word sequences in natural language showed that the model has performance comparable to an exponentially large full-order model, while being much more efficient computationally in both time and space. We expect that this basic model will also extend to the variable order hierarchical clustering of general data. 1
2 0.7959252 77 nips-2007-Efficient Inference for Distributions on Permutations
Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas
Abstract: Permutations are ubiquitous in many real world problems, such as voting, rankings and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact representations such as graphical models cannot efficiently capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent such distributions compactly. We present Kronecker conditioning, a general and efficient approach for maintaining these distributions directly in the Fourier domain. Low order Fourier-based approximations can lead to functions that do not correspond to valid distributions. To address this problem, we present an efficient quadratic program defined directly in the Fourier domain to project the approximation onto a relaxed form of the marginal polytope. We demonstrate the effectiveness of our approach on a real camera-based multi-people tracking setting. 1
3 0.78041834 135 nips-2007-Multi-task Gaussian Process Prediction
Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams
Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1
4 0.72660333 112 nips-2007-Learning Monotonic Transformations for Classification
Author: Andrew Howard, Tony Jebara
Abstract: A discriminative method is proposed for learning monotonic transformations of the training data while jointly estimating a large-margin classifier. In many domains such as document classification, image histogram classification and gene microarray experiments, fixed monotonic transformations can be useful as a preprocessing step. However, most classifiers only explore these transformations through manual trial and error or via prior domain knowledge. The proposed method learns monotonic transformations automatically while training a large-margin classifier without any prior knowledge of the domain. A monotonic piecewise linear function is learned which transforms data for subsequent processing by a linear hyperplane classifier. Two algorithmic implementations of the method are formalized. The first solves a convergent alternating sequence of quadratic and linear programs until it obtains a locally optimal solution. An improved algorithm is then derived using a convex semidefinite relaxation that overcomes initialization issues in the greedy optimization problem. The effectiveness of these learned transformations on synthetic problems, text data and image data is demonstrated. 1
5 0.49163494 141 nips-2007-New Outer Bounds on the Marginal Polytope
Author: David Sontag, Tommi S. Jaakkola
Abstract: We give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efficiently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints on the marginal polytope are derived through a series of projections onto the cut polytope. As a result, we obtain tighter upper bounds on the log-partition function. We also show empirically that the approximations of the marginals are significantly more accurate when using the tighter outer bounds. Finally, we demonstrate the advantage of the new constraints for finding the MAP assignment in protein structure prediction. 1
6 0.44269687 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
7 0.43726504 23 nips-2007-An Analysis of Convex Relaxations for MAP Estimation
8 0.41871586 63 nips-2007-Convex Relaxations of Latent Variable Training
9 0.4176091 134 nips-2007-Multi-Task Learning via Conic Programming
10 0.41497722 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
11 0.41045657 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
12 0.40944374 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
13 0.40839365 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
14 0.40821871 97 nips-2007-Hidden Common Cause Relations in Relational Learning
15 0.40742627 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
16 0.40285626 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
17 0.39964682 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
18 0.39900866 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
19 0.39635444 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
20 0.39490706 9 nips-2007-A Probabilistic Approach to Language Change