nips nips2001 nips2001-107 knowledge-graph by maker-knowledge-mining

107 nips-2001-Latent Dirichlet Allocation


Source: pdf

Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan

Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. [sent-5, score-0.804]

2 Inference and learning are carried out efficiently via variational algorithms. [sent-6, score-0.137]

3 We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification. [sent-7, score-0.32]

4 1 Introduction Recent years have seen the development and successful application of several latent factor models for discrete data. [sent-8, score-0.159]

5 One notable example, Hofmann's pLSI/aspect model [3], has received the attention of many researchers, and applications have emerged in text modeling [3], collaborative filtering [7], and link analysis [1]. [sent-9, score-0.332]

6 In the context of text modeling, pLSI is a "bag-of-words" model in that it ignores the ordering of the words in a document. [sent-10, score-0.236]

7 It performs dimensionality reduction, relating each document to a position in low-dimensional "topic" space. [sent-11, score-0.288]

8 A sometimes poorly-understood subtlety of pLSI is that, even though it is typically described as a generative model, its documents have no generative probabilistic semantics and are treated simply as a set of labels for the specific documents seen in the training set. [sent-13, score-0.731]

9 Moreover, since each training document is treated as a separate entity, the pLSI model has a large number of parameters and heuristic "tempering" methods are needed to prevent overfitting. [sent-16, score-0.374]

10 In this paper we describe a new model for collections of discrete data that provides full generative probabilistic semantics for documents. [sent-17, score-0.285]

11 Documents are modeled via a hidden Dirichlet random variable that specifies a probability distribution on a latent, low-dimensional topic space. [sent-18, score-0.323]

12 The distribution over words of an unseen document is a continuous mixture over document space and a discrete mixture over all possible topics. [sent-19, score-1.039]

13 1 Generative models for text Latent Dirichlet Allocation (LDA) model To simplify our discussion, we will use text modeling as a running example throughout this section, though it should be clear that the model is broadly applicable to general collections of discrete data. [sent-21, score-0.425]

14 In LDA, we assume that there are k underlying latent topics according to which documents are generated, and that each topic is represented as a multinomial distribution over the $|V|$ words in the vocabulary. [sent-22, score-1.146]

15 A document is generated by sampling a mixture of these topics and then sampling words from that mixture. [sent-23, score-0.801]

16 More precisely, a document of N words $\mathbf{w} = (w_1, \ldots, w_N)$ is generated by the following process. [sent-24, score-0.4]

17 First, $\theta$ is sampled from a Dirichlet($\alpha_1, \ldots, \alpha_k$) distribution. [sent-25, score-0.061]

18 Then, for each of the N words, a topic $z_n \in \{1, \ldots$ [sent-27, score-0.271]

19 $\ldots, k\}$ is sampled from a Mult($\theta$) distribution, $p(z_n = i \mid \theta) = \theta_i$. [sent-30, score-0.087]

20 Finally, each word $w_n$ is sampled, conditioned on the $z_n$th topic, from the multinomial distribution $p(w \mid z_n)$. [sent-31, score-0.2]

21 Intuitively, $\theta_i$ can be thought of as the degree to which topic i is referred to in the document. [sent-32, score-0.559]
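
The generative process described in the preceding sentences can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation; the parameter values `alpha` and `beta`, the helper names, and the use of integer word ids are all assumptions made for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, beta, n_words, rng):
    """Sample theta once per document, then a topic z_n and a word w_n for each position."""
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)    # z_n ~ Mult(theta)
        w = sample_categorical(beta[z], rng)  # w_n ~ p(w | z_n)
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Toy, made-up parameters: k = 2 topics over a vocabulary of 3 word ids.
alpha = [0.5, 0.5]
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8]]
rng = random.Random(0)
theta, topics, words = generate_document(alpha, beta, n_words=10, rng=rng)
print(theta, topics, words)
```

Because theta is redrawn for every document while the topic is redrawn for every word, two documents generated with the same `alpha` and `beta` can still have very different topic proportions.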

22 Written out in full, the probability of a document is therefore the following mixture: $p(\mathbf{w}) = \int_{\theta} \left( \prod_{n=1}^{N} \sum_{z_n=1}^{k} p(w_n \mid z_n; \beta)\, p(z_n \mid \theta) \right) p(\theta; \alpha)\, d\theta$ (1), where $p(\theta; \alpha)$ is Dirichlet, $p(z_n \mid \theta)$ is a multinomial parameterized by $\theta$, and $p(w_n \mid z_n; \beta)$ is a multinomial over the words. [sent-33, score-0.578]

23 ..., $\alpha_k$) and a $k \times |V|$ matrix $\beta$, which are parameters controlling the k multinomial distributions over words. [sent-36, score-0.169]
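
To make the document probability concrete, the integral over the Dirichlet variable can be approximated by simple Monte Carlo averaging. Note that this is a swapped-in illustration and not the variational method the paper uses; the function names and toy parameters are assumptions, and for long documents one would work in log space to avoid underflow.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def doc_likelihood_given_theta(words, theta, beta):
    """Inner term of the mixture: prod_n sum_z p(w_n | z; beta) p(z | theta)."""
    prob = 1.0
    for w in words:
        prob *= sum(theta[z] * beta[z][w] for z in range(len(theta)))
    return prob

def monte_carlo_doc_probability(words, alpha, beta, n_samples, rng):
    """Estimate p(w) by averaging the inner term over theta ~ Dirichlet(alpha)."""
    total = 0.0
    for _ in range(n_samples):
        theta = sample_dirichlet(alpha, rng)
        total += doc_likelihood_given_theta(words, theta, beta)
    return total / n_samples

# Toy, made-up parameters: 2 topics, 3 word ids.
alpha = [1.0, 1.0]
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8]]
rng = random.Random(0)
# For a one-word document the exact value is sum_z E[theta_z] * beta[z][w],
# which is 0.5 * 0.8 + 0.5 * 0.1 = 0.45 for these parameters.
estimate = monte_carlo_doc_probability([0], alpha, beta, n_samples=20000, rng=rng)
print(estimate)
```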

24 The graphical model representation of LDA is shown in Figure 1. [sent-37, score-0.062]

25 In such a model the innermost plate would contain only $w_n$; the topic node would be sampled only once for each document; and the Dirichlet would be sampled only once for the whole collection. [sent-39, score-0.479]

26 In LDA, the Dirichlet is sampled for each document, and the multinomial topic node is sampled repeatedly within the document. [sent-40, score-0.538]

27 The Dirichlet is thus a component in the probability model rather than a prior distribution over the model parameters. [sent-41, score-0.088]

28 Having sampled $\theta$, words are drawn iid from the multinomial/unigram model given by $p(w \mid \theta) = \sum_{z=1}^{k} p(w \mid z)\, p(z \mid \theta)$. [sent-44, score-0.204]

29 Thus, LDA is a mixture model where the unigram models $p(w \mid \theta)$ are the mixture components, and $p(\theta; \alpha)$ gives the mixture weights. [sent-45, score-0.613]
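
As a small worked example of the unigram model induced by a particular draw of the topic proportions, the sketch below mixes the rows of a topic-word matrix with weights theta. The values of `theta` and `beta` are invented for illustration.

```python
def unigram_given_theta(theta, beta):
    """p(w | theta) = sum_z p(w | z) p(z | theta), as a list over the vocabulary."""
    vocab_size = len(beta[0])
    return [sum(theta[z] * beta[z][w] for z in range(len(theta)))
            for w in range(vocab_size)]

# Toy, made-up parameters: 2 topics, 3 word ids.
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8]]
p_w = unigram_given_theta([0.25, 0.75], beta)
print(p_w)  # approximately [0.275, 0.1, 0.625]
```

Each distinct theta yields a different point on the word simplex, which is the sense in which the Dirichlet places a distribution over unigram models.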

30 Note that unlike a traditional mixture of unigrams model, this distribution has an infinite ... [sent-46, score-0.275]

31 The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. [sent-49, score-0.501]

32 Figure 2: An example distribution on unigram models $p(w \mid \theta)$ under LDA for three words and four topics. [sent-50, score-0.354]

33 The triangle embedded in the x-y plane is the 2-D simplex over all possible multinomial distributions over three words. [sent-51, score-0.219]

34 , each of the vertices of the triangle corresponds to a deterministic distribution that assigns one of the words probability 1; the midpoint of an edge gives two of the words 0. [sent-54, score-0.286]

35 5 probability each; and the centroid of the triangle is the uniform distribution over all 3 words). [sent-55, score-0.062]

36 The four points marked with an x are the locations of the multinomial distributions $p(w \mid z)$ for each of the four topics, and the surface shown on top of the simplex is an example of a resulting density over multinomial distributions given by LDA. [sent-56, score-0.607]

37 The example in Figure 2 illustrates this interpretation of LDA as defining a random distribution over unigram models $p(w \mid \theta)$. [sent-58, score-0.242]

38 2 Related models The mixture of unigrams model [6] posits that every document is generated by a single randomly chosen topic: $p(\mathbf{w}) = \sum_{z=1}^{k} \left( \prod_{n=1}^{N} p(w_n \mid z) \right) p(z)$ (2). This model allows for different documents to come from different topics, but fails to capture the possibility that a document may express multiple topics. [sent-60, score-1.198]

39 LDA captures this possibility, and does so with an increase in the parameter count of only one parameter: rather than having k - 1 free parameters for the multinomial p(z) over the k topics, we have k free parameters for the Dirichlet. [sent-61, score-0.193]
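
Under the mixture of unigrams, one topic is drawn per document and every word is then drawn from that topic. The sketch below computes the resulting document probability for toy parameters; `topic_prior` and `beta` are assumed values, and the function name is hypothetical.

```python
def mixture_of_unigrams_probability(words, topic_prior, beta):
    """p(w) = sum_z p(z) * prod_n p(w_n | z): a single topic explains the whole document."""
    total = 0.0
    for z, p_z in enumerate(topic_prior):
        doc_prob = 1.0
        for w in words:
            doc_prob *= beta[z][w]
        total += p_z * doc_prob
    return total

# Toy, made-up parameters: 2 topics, 3 word ids.
topic_prior = [0.5, 0.5]
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8]]
# A document that mixes the signature words of both topics.
mixed_doc_prob = mixture_of_unigrams_probability([0, 2], topic_prior, beta)
print(mixed_doc_prob)  # approximately 0.08
```

The mixed document is penalized because whichever single topic is chosen must also explain the other topic's signature word; a model that lets topics mix within a document does not pay that cost.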

40 A second related model is Hofmann's probabilistic latent semantic indexing (pLSI) [3], which posits that a document label d and a word w are conditionally independent given the hidden topic z: $p(d, w) = \sum_{z=1}^{k} p(w \mid z)\, p(z \mid d)\, p(d)$ (3). [sent-62, score-0.864]

41 This model does capture the possibility that a document may contain multiple topics, since the $p(z \mid d)$ serve as the mixture weights of the topics. [sent-63, score-0.746]

42 However, a subtlety of pLSI, and the crucial difference between it and LDA, is that d is a dummy index into the list of documents in the training set. [sent-64, score-0.28]

43 Thus, d is a multinomial random variable with as many possible values as there are training documents, and the model learns the topic mixtures $p(z \mid d)$ only for those documents on which it is trained. [sent-65, score-0.685]

44 For this reason, pLSI is not a fully generative model and there is no clean way to use it to assign probability to a previously unseen document. [sent-66, score-0.168]

45 Furthermore, the number of parameters in pLSI is on the order of $k|V| + k|D|$, where $|D|$ is the number of documents in the training set. [sent-67, score-0.236]
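
The parameter counts quoted for the three models can be compared with a short calculation. The corpus sizes below are invented, and the counts follow the text's own accounting (a full k by |V| topic-word matrix, without subtracting normalization constraints on its rows).

```python
def mixture_of_unigrams_parameter_count(k, vocab_size):
    """k - 1 free parameters for p(z) plus the k x |V| topic-word matrix."""
    return (k - 1) + k * vocab_size

def lda_parameter_count(k, vocab_size):
    """k Dirichlet parameters plus the k x |V| topic-word matrix."""
    return k + k * vocab_size

def plsi_parameter_count(k, vocab_size, n_docs):
    """Order k|V| + k|D|: topic-word matrix plus one topic mixture per training document."""
    return k * vocab_size + k * n_docs

# Hypothetical sizes: 100 topics, 10,000 vocabulary words.
k, vocab_size = 100, 10000
for n_docs in (1000, 100000):
    print(n_docs,
          mixture_of_unigrams_parameter_count(k, vocab_size),
          lda_parameter_count(k, vocab_size),
          plsi_parameter_count(k, vocab_size, n_docs))
```

Only the pLSI count depends on the number of training documents, which is the growth the text ties to overfitting.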

46 3 Inference and learning Let us begin our description of inference and learning problems for LDA by examining the contribution to the likelihood made by a single document. [sent-69, score-0.101]

47 To simplify our notation, let $w_n^j = 1$ iff $w_n$ is the jth word in the vocabulary and $z_n^i = 1$ iff $z_n$ is the ith topic. [sent-70, score-0.113]

48 Large text collections require fast inference and learning algorithms and thus we have utilized a variational approach [5] to approximate the likelihood in Eq. [sent-80, score-0.402]

49 Under this distribution, the terms in the variational lower bound are computable and differentiable, and we can maximize the bound with respect to $\gamma$ and $\phi$ to obtain the best approximation to $p(\mathbf{w}; \alpha, \beta)$. [sent-83, score-0.137]

50 Note that the third and fourth terms in the variational bound are not straightforward to compute since they involve the entropy of a Dirichlet distribution, a $(k - 1)$-dimensional integral over $\theta$ which is expensive to compute numerically. [sent-84, score-0.137]

51 ... (6) where $\Psi$ is the first ... Note that the resulting variational parameters can also be used and interpreted as an approximation of the parameters of the true posterior. [sent-89, score-0.185]
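
The update equations for the variational parameters did not survive in this extract, so the sketch below should be read as an assumption: it uses the coordinate-ascent updates commonly given for LDA, where phi[n][i] is proportional to beta[i][w_n] * exp(Psi(gamma[i])) and gamma[i] = alpha[i] + sum_n phi[n][i], with Psi taken to be the digamma function. The `digamma` helper is a hand-written approximation so that the example needs only the standard library.

```python
import math

def digamma(x):
    """Approximate the digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:          # shift x upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def variational_inference(words, alpha, beta, n_iters=50):
    """Alternate the assumed phi and gamma updates for one document."""
    k = len(alpha)
    gamma = [a + len(words) / k for a in alpha]
    phi = [[1.0 / k] * k for _ in words]
    for _ in range(n_iters):
        exp_psi = [math.exp(digamma(g)) for g in gamma]
        for idx, w in enumerate(words):
            weights = [beta[i][w] * exp_psi[i] for i in range(k)]
            norm = sum(weights)
            phi[idx] = [v / norm for v in weights]
        gamma = [alpha[i] + sum(row[i] for row in phi) for i in range(k)]
    return gamma, phi

# Toy, made-up parameters: 2 topics, 3 word ids; the document leans toward topic 0.
alpha = [0.5, 0.5]
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8]]
words = [0, 0, 0, 2]
gamma, phi = variational_inference(words, alpha, beta)
print(gamma)
```

With these updates the entries of gamma always sum to the sum of alpha plus the document length, so gamma can be read as alpha plus a soft count of words assigned to each topic.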

52 ..., $\mathbf{w}_M$}, we utilize the EM algorithm with a variational E step, maximizing a lower bound on the log likelihood: $\log p(\mathcal{D}) \ge \sum_{m=1}^{M} E_{q_m}[\log p(\theta, \mathbf{z}, \mathbf{w}_m)] - E_{q_m}[\log q_m(\theta, \mathbf{z})]$ (7). [sent-94, score-0.137]

53 The E step refits $q_m$ for each document by running the inference step described above. [sent-95, score-0.377]

54 4 Experiments and Examples We first tested LDA on two text corpora. [sent-100, score-0.093]

55 By examining the (variational) posterior distribution on the topic mixture $q(\theta; \gamma)$, we can identify the topics which were most likely to have contributed to many words in a given document; specifically, these are the topics i with the largest $\gamma_i$. [sent-104, score-1.123]

56 Examining the most likely words in the corresponding multinomials can then further tell us what these topics might be about. [sent-105, score-0.428]
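
The inspection procedure described here amounts to two sorts: rank topics by their variational Dirichlet parameter, then rank words within each selected topic. The sketch below assumes hypothetical values for `gamma`, `beta`, and `vocab`; none of them come from the paper's corpora.

```python
def top_indices(values, n):
    """Indices of the n largest entries, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:n]

# Hypothetical vocabulary, per-document gamma, and topic-word matrix (3 topics).
vocab = ["school", "music", "budget", "opera", "health"]
gamma = [0.6, 4.1, 2.3]
beta = [[0.10, 0.05, 0.70, 0.05, 0.10],
        [0.05, 0.50, 0.05, 0.35, 0.05],
        [0.60, 0.10, 0.10, 0.05, 0.15]]

summary = {}
for topic in top_indices(gamma, 2):
    summary[topic] = [vocab[w] for w in top_indices(beta[topic], 2)]
print(summary)
```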

57 "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants, an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. [sent-110, score-0.156]

58 The Juilliard School, where music and the performing arts are taught, will get $250,000. [sent-115, score-0.082]

59 The (bottleneck) E step is distributed across nodes so that the $q_m$ for different documents are calculated in parallel. [sent-118, score-0.262]

60 Figure 4: Perplexity results on the CRAN and AP corpora for LDA, pLSI, mixture of unigrams, and the unigram model (x-axis: k, number of topics). [sent-127, score-0.367]

61 This document is mostly a combination of words about school policy (topic 4) and music (topic 5). [sent-129, score-0.475]

62 The less prominent topics reflect other words about education (topic 1), finance (topic 2), and health (topic 3). [sent-130, score-0.472]

63 1 Formal evaluation: Perplexity To compare the generalization performance of LDA with other models, we computed the perplexity of a test set for the AP and CRAN corpora. [sent-132, score-0.169]
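
The extract does not spell out the perplexity formula, so the sketch below assumes the conventional definition: the exponential of the negative total log-probability of the test documents divided by the total number of test words. The function name and the toy inputs are illustrative only.

```python
import math

def perplexity(doc_log_probs, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d); lower values mean better generalization."""
    return math.exp(-sum(doc_log_probs) / sum(doc_lengths))

# Sanity check: a uniform unigram model over 1000 words assigns each word
# probability 1/1000, so its perplexity should equal the vocabulary size.
doc_lengths = [5, 7]
doc_log_probs = [n * math.log(1.0 / 1000) for n in doc_lengths]
uniform_perplexity = perplexity(doc_log_probs, doc_lengths)
print(uniform_perplexity)
```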

64 We compared LDA to both the mixture of unigrams and pLSI described in Section 2. [sent-135, score-0.249]

65 We trained the pLSI model with and without tempering to reduce overfitting. [sent-137, score-0.115]

66 As mentioned previously, pLSI does not readily generate or assign probabilities to previously unseen documents; in our experiments, we assigned probability to a new document d by marginalizing out the dummy training set indices (footnote 2): $p(\mathbf{w}) = \sum_{d} \left( \prod_{n=1}^{N} \sum_{z} p(w_n \mid z)\, p(z \mid d) \right) p(d)$. [sent-139, score-0.424]
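
Marginalizing over the training-set indices can be sketched as an average, weighted by p(d), of the probability the new document would receive under each training document's topic mixture. The parameter values and the function name below are assumptions for illustration.

```python
def plsi_marginal_probability(words, beta, topic_given_doc, doc_prior):
    """p(w) = sum_d p(d) * prod_n sum_z p(w_n | z) p(z | d)."""
    total = 0.0
    for d, p_d in enumerate(doc_prior):
        doc_prob = 1.0
        for w in words:
            doc_prob *= sum(topic_given_doc[d][z] * beta[z][w]
                            for z in range(len(beta)))
        total += p_d * doc_prob
    return total

# Toy, made-up pLSI parameters: 2 topics, 3 word ids, 2 training documents,
# each training document devoted entirely to one topic.
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8]]
topic_given_doc = [[1.0, 0.0],
                   [0.0, 1.0]]
doc_prior = [0.5, 0.5]
new_doc_prob = plsi_marginal_probability([0, 2], beta, topic_given_doc, doc_prior)
print(new_doc_prob)  # approximately 0.08
```

A new document can only be explained by topic mixtures that already appeared in training, which illustrates why this workaround generalizes poorly.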

67 Footnote 2: A second natural method, marginalizing out d and z to form a unigram model using the resulting p(w)'s, did not perform well (its performance was similar to the standard unigram model). [sent-140, score-0.442]

68 Figure 5: Results for classification (left) and collaborative filtering (right); x-axis: k (number of topics). Figure 4 shows the perplexity for each model and both corpora for different values of k. [sent-142, score-0.479]

69 The latent variable models generally do better than the simple unigram model. [sent-143, score-0.343]

70 The pLSI model severely overfits when not tempered (the values beyond k = 10 are off the graph) but manages to outperform mixture of unigrams when tempered. [sent-144, score-0.28]

71 To our knowledge, these are by far the best text perplexity results obtained by a bag-of-words model. [sent-146, score-0.262]

72 2 Classification We also tested LDA on a text classification task. [sent-148, score-0.151]

73 For each class c, we learn a separate model $p(\mathbf{w} \mid c)$ of the documents in that class. [sent-149, score-0.243]

74 An unseen document is classified by picking $\arg\max_c p(c \mid \mathbf{w}) = \arg\max_c p(\mathbf{w} \mid c)\, p(c)$. [sent-150, score-0.337]

75 Note that using a simple unigram distribution for $p(\mathbf{w} \mid c)$ recovers the traditional naive Bayes classification model. [sent-151, score-0.274]
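
The classification rule can be illustrated with the simplest choice of class-conditional model, a unigram distribution, which gives naive Bayes. In the sketch below the class names, the tiny training set, and the add-one smoothing are all assumptions; replacing `train_unigram` with a per-class LDA likelihood would give the LDA-based classifier the text refers to.

```python
import math

def train_unigram(docs, vocab_size, smoothing=1.0):
    """Log word probabilities from counts, with additive smoothing (an assumption here)."""
    counts = [smoothing] * vocab_size
    for doc in docs:
        for w in doc:
            counts[w] += 1
    total = sum(counts)
    return [math.log(c / total) for c in counts]

def classify(words, log_priors, log_word_probs):
    """Pick argmax_c log p(c) + sum_n log p(w_n | c)."""
    scores = {c: log_priors[c] + sum(log_word_probs[c][w] for w in words)
              for c in log_priors}
    return max(scores, key=scores.get)

# Hypothetical two-class training set over 4 word ids.
training = {"course": [[0, 0, 1], [0, 1, 1]],
            "faculty": [[2, 3, 3], [2, 2, 3]]}
n_docs = sum(len(docs) for docs in training.values())
log_priors = {c: math.log(len(docs) / n_docs) for c, docs in training.items()}
log_word_probs = {c: train_unigram(docs, vocab_size=4) for c, docs in training.items()}

predicted = classify([0, 1, 0], log_priors, log_word_probs)
print(predicted)
```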

76 Using the same (standard) subset of the WebKB dataset as used in [6], we obtained classification error rates illustrated in Figure 5 (left). [sent-152, score-0.058]

77 3 Collaborative filtering Our final experiment utilized the EachMovie collaborative filtering dataset. [sent-156, score-0.257]

78 In this dataset a collection of users indicates their preferred movie choices. [sent-157, score-0.077]

79 A user and the movies he chose are analogous to a document and the words in the document (respectively). [sent-158, score-0.743]

80 Then, for each test user, we are shown all but one of the movies that she liked and are asked to predict what the held-out movie is. [sent-161, score-0.086]

81 More precisely, define the predictive perplexity on M test users to be $\exp(-\sum_{m=1}^{M} \log p(w_{m,N_d} \mid w_{m,1}, \ldots$ [sent-163, score-0.215]

82 5 Conclusions We have presented a generative probabilistic framework for modeling the topical structure of documents and other collections of discrete data. [sent-168, score-0.466]

83 Topics are represented explicitly via a multinomial variable $z_n$ that is repeatedly selected, once for each word, in a given document. [sent-169, score-0.171]

84 In this sense, the model generates an allocation of the words in a document to topics. [sent-170, score-0.498]

85 When computing the probability of a new document, this unknown allocation induces a mixture distribution across the words in the vocabulary. [sent-171, score-0.327]

86 There is a many-to-many relationship between topics and words as well as a many-to-many relationship between documents and topics. [sent-172, score-0.603]

87 While Dirichlet distributions are often used as conjugate priors for multinomials in Bayesian modeling, it is preferable to instead think of the Dirichlet in our model as a component of the likelihood. [sent-173, score-0.068]

88 The Dirichlet random variable $\theta$ is a latent variable that gives generative probabilistic semantics to the notion of a "document" in the sense that it allows us to put a distribution on the space of possible documents. [sent-174, score-0.324]

89 The words that are actually obtained are viewed as a continuous mixture over this space, as well as being a discrete mixture over topics (footnote 3). [sent-175, score-0.388]

90 The generative nature of LDA makes it easy to use as a module in more complex architectures and to extend it in various directions. [sent-176, score-0.063]

91 We have already seen that collections of LDA can be used in a classification setting. [sent-177, score-0.135]

92 If the classification variable is treated as a latent variable we obtain a mixture of LDA models, a useful model for situations in which documents cluster not only according to their topic overlap, but along other dimensions as well. [sent-178, score-0.878]

93 Another extension arises from generalizing LDA to consider Dirichlet/multinomial mixtures of bigram or trigram models, rather than the simple unigram models that we have considered here. [sent-179, score-0.216]

94 Finally, we can readily fuse LDA models which have different vocabularies (e. [sent-180, score-0.057]

95 , words and images); these models interact via a common abstract topic variable and can elegantly use both vocabularies in determining the topic mixture of a given document. [sent-182, score-0.859]

96 The missing link - A probabilistic model of document content and hypertext connectivity. [sent-189, score-0.359]

97 Text classification from labeled and unlabeled documents using EM. [sent-225, score-0.27]

98 Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. [sent-234, score-0.129]

99 Footnote 3: These remarks also distinguish our model from the Bayesian Dirichlet/Multinomial allocation model (DMA) of [2], which is a finite alternative to the Dirichlet process. [sent-236, score-0.129]

100 The DMA places a mixture of Dirichlet priors on $p(w \mid z)$ and sets $\alpha_i = \alpha_0$ for all i. [sent-237, score-0.122]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lda', 0.358), ('plsi', 0.295), ('document', 0.288), ('topics', 0.279), ('topic', 0.271), ('dirichlet', 0.242), ('documents', 0.212), ('unigram', 0.19), ('perplexity', 0.169), ('zn', 0.15), ('multinomial', 0.145), ('variational', 0.137), ('unigrams', 0.127), ('mixture', 0.122), ('words', 0.112), ('collaborative', 0.103), ('latent', 0.101), ('text', 0.093), ('logp', 0.086), ('hearst', 0.084), ('tempering', 0.084), ('wib', 0.084), ('collections', 0.077), ('said', 0.074), ('wlz', 0.073), ('allocation', 0.067), ('filtering', 0.063), ('cran', 0.063), ('opera', 0.063), ('wlc', 0.063), ('zld', 0.063), ('generative', 0.063), ('wn', 0.062), ('ivi', 0.062), ('sampled', 0.061), ('classification', 0.058), ('corpora', 0.055), ('lincoln', 0.055), ('plate', 0.055), ('movies', 0.055), ('qm', 0.05), ('wl', 0.05), ('unseen', 0.049), ('posits', 0.047), ('users', 0.046), ('health', 0.044), ('bi', 0.043), ('argmaxcp', 0.042), ('arts', 0.042), ('dma', 0.042), ('eqm', 0.042), ('iwml', 0.042), ('juilliard', 0.042), ('mult', 0.042), ('philharmonic', 0.042), ('zib', 0.042), ('modeling', 0.042), ('ap', 0.042), ('semantics', 0.042), ('probabilistic', 0.04), ('music', 0.04), ('inference', 0.039), ('hofmann', 0.038), ('simplex', 0.038), ('education', 0.037), ('multinomials', 0.037), ('year', 0.037), ('hypergeometric', 0.037), ('subtlety', 0.037), ('trec', 0.037), ('triangle', 0.036), ('vocabulary', 0.036), ('school', 0.035), ('examining', 0.034), ('wm', 0.033), ('board', 0.033), ('metropolitan', 0.033), ('president', 0.033), ('discrete', 0.032), ('movie', 0.031), ('dummy', 0.031), ('marginalizing', 0.031), ('vocabularies', 0.031), ('model', 0.031), ('treated', 0.031), ('graphical', 0.031), ('semantic', 0.03), ('word', 0.029), ('likelihood', 0.028), ('utilized', 0.028), ('foundation', 0.027), ('indexing', 0.027), ('possibility', 0.026), ('variable', 0.026), ('models', 0.026), ('distribution', 0.026), ('assign', 0.025), ('million', 0.025), ('parameters', 0.024), ('iff', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 107 nips-2001-Latent Dirichlet Allocation

Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan

Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

2 0.12771149 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

Author: Thomas L. Griffiths, Joshua B. Tenenbaum

Abstract: Estimating the parameters of sparse multinomial distributions is an important component of many statistical learning tasks. Recent approaches have used uncertainty over the vocabulary of symbols in a multinomial distribution as a means of accounting for sparsity. We present a Bayesian approach that allows weak prior knowledge, in the form of a small set of approximate candidate vocabularies, to be used to dramatically improve the resulting estimates. We demonstrate these improvements in applications to text compression and estimating distributions over words in newsgroup data.

3 0.1229295 24 nips-2001-Active Information Retrieval

Author: Tommi Jaakkola, Hava T. Siegelmann

Abstract: In classical large information retrieval systems, the system responds to a user initiated query with a list of results ranked by relevance. The users may further refine their query as needed. This process may result in a lengthy correspondence without conclusion. We propose an alternative active learning approach, where the system responds to the initial user's query by successively probing the user for distinctions at multiple levels of abstraction. The system's initiated queries are optimized for speedy recovery and the user is permitted to respond with multiple selections or may reject the query. The information is in each case unambiguously incorporated by the system and the subsequent queries are adjusted to minimize the need for further exchange. The system's initiated queries are subject to resource constraints pertaining to the amount of information that can be presented to the user per iteration.

4 0.11700086 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation

Author: Jorg Ontrup, Helge Ritter

Abstract: We introduce a new type of Self-Organizing Map (SOM) to navigate in the Semantic Space of large text collections. We propose a “hyperbolic SOM” (HSOM) based on a regular tessellation of the hyperbolic plane, which is a non-Euclidean space characterized by constant negative Gaussian curvature. The exponentially increasing size of a neighborhood around a point in hyperbolic space provides more freedom to map the complex information space arising from language into spatial relations. We describe experiments, showing that the HSOM can successfully be applied to text categorization tasks and yields results comparable to other state-of-the-art methods.

5 0.1011268 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data

Author: Igor V. Cadez, Padhraic Smyth

Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data.

6 0.095446281 43 nips-2001-Bayesian time series classification

7 0.087674327 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes

8 0.086137757 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

9 0.079195179 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

10 0.077384911 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

11 0.076954655 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

12 0.076136149 84 nips-2001-Global Coordination of Local Linear Models

13 0.071925983 21 nips-2001-A Variational Approach to Learning Curves

14 0.069577523 58 nips-2001-Covariance Kernels from Bayesian Generative Models

15 0.069218397 183 nips-2001-The Infinite Hidden Markov Model

16 0.067268856 30 nips-2001-Agglomerative Multivariate Information Bottleneck

17 0.065763712 61 nips-2001-Distribution of Mutual Information

18 0.064072601 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering

19 0.059868798 171 nips-2001-Spectral Relaxation for K-means Clustering

20 0.059722949 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.156), (1, 0.021), (2, -0.012), (3, -0.1), (4, -0.109), (5, -0.103), (6, 0.008), (7, -0.003), (8, -0.12), (9, -0.071), (10, 0.194), (11, -0.018), (12, -0.093), (13, -0.178), (14, -0.034), (15, -0.018), (16, 0.073), (17, 0.009), (18, -0.075), (19, 0.016), (20, 0.012), (21, 0.175), (22, 0.017), (23, -0.058), (24, -0.055), (25, -0.077), (26, 0.001), (27, 0.032), (28, 0.116), (29, 0.034), (30, -0.053), (31, -0.113), (32, -0.061), (33, -0.032), (34, -0.091), (35, -0.056), (36, -0.094), (37, -0.048), (38, 0.14), (39, -0.155), (40, 0.166), (41, -0.071), (42, -0.147), (43, 0.115), (44, -0.055), (45, -0.059), (46, -0.126), (47, 0.001), (48, -0.092), (49, 0.136)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95562148 107 nips-2001-Latent Dirichlet Allocation

Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan

Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

2 0.70779002 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

Author: Thomas L. Griffiths, Joshua B. Tenenbaum

Abstract: Estimating the parameters of sparse multinomial distributions is an important component of many statistical learning tasks. Recent approaches have used uncertainty over the vocabulary of symbols in a multinomial distribution as a means of accounting for sparsity. We present a Bayesian approach that allows weak prior knowledge, in the form of a small set of approximate candidate vocabularies, to be used to dramatically improve the resulting estimates. We demonstrate these improvements in applications to text compression and estimating distributions over words in newsgroup data.

3 0.63012576 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation

Author: Jorg Ontrup, Helge Ritter

Abstract: We introduce a new type of Self-Organizing Map (SOM) to navigate in the Semantic Space of large text collections. We propose a “hyperbolic SOM” (HSOM) based on a regular tessellation of the hyperbolic plane, which is a non-Euclidean space characterized by constant negative Gaussian curvature. The exponentially increasing size of a neighborhood around a point in hyperbolic space provides more freedom to map the complex information space arising from language into spatial relations. We describe experiments, showing that the HSOM can successfully be applied to text categorization tasks and yields results comparable to other state-of-the-art methods.

4 0.51622778 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data

Author: Igor V. Cadez, Padhraic Smyth

Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data.

5 0.409374 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

Author: Ran El-Yaniv, Oren Souroujon

Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples.

6 0.40632299 30 nips-2001-Agglomerative Multivariate Information Bottleneck

7 0.3622739 43 nips-2001-Bayesian time series classification

8 0.34534794 24 nips-2001-Active Information Retrieval

9 0.34499156 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes

10 0.33035323 84 nips-2001-Global Coordination of Local Linear Models

11 0.32250091 68 nips-2001-Entropy and Inference, Revisited

12 0.31929746 184 nips-2001-The Intelligent surfer: Probabilistic Combination of Link and Content Information in PageRank

13 0.31798977 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

14 0.31557372 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

15 0.30735764 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

16 0.30211332 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

17 0.29007056 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

18 0.28876123 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks

19 0.28273949 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering

20 0.27412474 140 nips-2001-Optimising Synchronisation Times for Mobile Devices


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.012), (17, 0.021), (19, 0.012), (27, 0.086), (30, 0.037), (38, 0.015), (59, 0.018), (72, 0.048), (79, 0.038), (83, 0.014), (91, 0.61)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99605525 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images

Author: James M. Coughlan, Alan L. Yuille

Abstract: We describe the g-factor, which relates probability distributions on image features to distributions on the images themselves. The g-factor depends only on our choice of features and lattice quantization and is independent of the training image data. We illustrate the importance of the g-factor by analyzing how the parameters of Markov Random Field (i.e. Gibbs or log-linear) probability models of images are learned from data by maximum likelihood estimation. In particular, we study homogeneous MRF models which learn image distributions in terms of clique potentials corresponding to feature histogram statistics (cf. Minimax Entropy Learning (MEL) by Zhu, Wu and Mumford 1997 [11]). We first use our analysis of the g-factor to determine when the clique potentials decouple for different features. Second, we show that clique potentials can be computed analytically by approximating the g-factor. Third, we demonstrate a connection between this approximation and the Generalized Iterative Scaling algorithm (GIS), due to Darroch and Ratcliff 1972 [2], for calculating potentials. This connection enables us to use GIS to improve our multinomial approximation, using Bethe-Kikuchi [8] approximations to simplify the GIS procedure. We support our analysis by computer simulations.

2 0.99441922 87 nips-2001-Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway

Author: Gal Chechik, Amir Globerson, M. J. Anderson, E. D. Young, Israel Nelken, Naftali Tishby

Abstract: The way groups of auditory neurons interact to code acoustic information is investigated using an information theoretic approach. We develop measures of redundancy among groups of neurons, and apply them to the study of collaborative coding efficiency in two processing stations in the auditory pathway: the inferior colliculus (IC) and the primary auditory cortex (AI). Under two schemes for the coding of the acoustic content, acoustic segments coding and stimulus identity coding, we show differences both in information content and group redundancies between IC and AI neurons. These results provide for the first time direct evidence for redundancy reduction along the ascending auditory pathway, as has been hypothesized from theoretical considerations [Barlow 1959, 2001]. The redundancy effects under the single-spikes coding scheme are significant only for groups larger than ten cells, and cannot be revealed with redundancy measures that use only pairs of cells. The results suggest that the auditory system transforms low-level representations that contain redundancies due to the statistical structure of natural stimuli into a representation in which cortical neurons extract rare and independent components of complex acoustic signals that are useful for auditory scene analysis.

3 0.9924041 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task

Author: Michael C. Mozer, Michael D. Colagrosso, David E. Huber

Abstract: We are interested in the mechanisms by which individuals monitor and adjust their performance of simple cognitive tasks. We model a speeded discrimination task in which individuals are asked to classify a sequence of stimuli (Jones & Braver, 2001). Response conflict arises when one stimulus class is infrequent relative to another, resulting in more errors and slower reaction times for the infrequent class. How do control processes modulate behavior based on the relative class frequencies? We explain performance from a rational perspective that casts the goal of individuals as minimizing a cost that depends both on error rate and reaction time. With two additional assumptions of rationality—that class prior probabilities are accurately estimated and that inference is optimal subject to limitations on rate of information transmission—we obtain a good fit to overall RT and error data, as well as trial-by-trial variations in performance. Consider the following scenario: While driving, you approach an intersection at which the traffic light has already turned yellow, signaling that it is about to turn red. You also notice that a car is approaching you rapidly from behind, with no indication of slowing. Should you stop or speed through the intersection? The decision is difficult due to the presence of two conflicting signals. Such response conflict can be produced in a psychological laboratory as well. For example, Stroop (1935) asked individuals to name the color of ink on which a word is printed. When the words are color names incongruous with the ink color— e.g., “blue” printed in red—reaction times are slower and error rates are higher. We are interested in the control mechanisms underlying performance of high-conflict tasks. Conflict requires individuals to monitor and adjust their behavior, possibly responding more slowly if errors are too frequent. 
In this paper, we model a speeded discrimination paradigm in which individuals are asked to classify a sequence of stimuli (Jones & Braver, 2001). The stimuli are letters of the alphabet, A–Z, presented in rapid succession. In a choice task, individuals are asked to press one response key if the letter is an X or another response key for any letter other than X (as a shorthand, we will refer to non-X stimuli as Y). In a go/no-go task, individuals are asked to press a response key when X is presented and to make no response otherwise. We address both tasks because they elicit slightly different decision-making behavior. In both tasks, Jones and Braver (2001) manipulated the relative frequency of the X and Y stimuli; the ratio of presentation frequency was either 17:83, 50:50, or 83:17. Response conflict arises when the two stimulus classes are unbalanced in frequency, resulting in more errors and slower reaction times. For example, when X’s are frequent but Y is presented, individuals are predisposed toward producing the X response, and this predisposition must be overcome by the perceptual evidence from the Y. Jones and Braver (2001) also performed an fMRI study of this task and found that anterior cingulate cortex (ACC) becomes activated in situations involving response conflict. Specifically, when one stimulus occurs infrequently relative to the other, event-related fMRI response in the ACC is greater for the low frequency stimulus. Jones and Braver also extended a neural network model of Botvinick, Braver, Barch, Carter, and Cohen (2001) to account for human performance in the two discrimination tasks. The heart of the model is a mechanism that monitors conflict—the posited role of the ACC—and adjusts response biases accordingly. In this paper, we develop a parsimonious alternative account of the role of the ACC and of how control processes modulate behavior when response conflict arises. 
1 A RATIONAL ANALYSIS Our account is based on a rational analysis of human cognition, which views cognitive processes as being optimized with respect to certain task-related goals, and being adaptive to the structure of the environment (Anderson, 1990). We make three assumptions of rationality: (1) perceptual inference is optimal but is subject to rate limitations on information transmission, (2) response class prior probabilities are accurately estimated, and (3) the goal of individuals is to minimize a cost that depends both on error rate and reaction time. The heart of our account is an existing probabilistic model that explains a variety of facilitation effects that arise from long-term repetition priming (Colagrosso, in preparation; Mozer, Colagrosso, & Huber, 2000), and more broadly, that addresses changes in the nature of information transmission in neocortex due to experience. We give a brief overview of this model; the details are not essential for the present work. The model posits that neocortex can be characterized by a collection of information-processing pathways, and any act of cognition involves coordination among pathways. To model a simple discrimination task, we might suppose a perceptual pathway to map the visual input to a semantic representation, and a response pathway to map the semantic representation to a response. The choice and go/no-go tasks described earlier share a perceptual pathway, but require different response pathways. The model is framed in terms of probability theory: pathway inputs and outputs are random variables and microinference in a pathway is carried out by Bayesian belief revision. To elaborate, consider a pathway whose input at time is a discrete random variable, denoted , which can assume values corresponding to alternative input states. Similarly, the output of the pathway at time is a discrete random variable, denoted , which can assume values .
For example, the input to the perceptual pathway in the discrimination task is one of visual patterns corresponding to the letters of the alphabet, and the output is one of letter identities. (This model is highly abstract: the visual patterns are enumerated, but the actual pixel patterns are not explicitly represented in the model. Nonetheless, the similarity structure among inputs can be captured, but we skip a discussion of this issue because it is irrelevant for the current work.) To present a particular input alternative, , to the model for time steps, we clamp for . The model computes a probability distribution over given , i.e., P .

4 0.99173439 148 nips-2001-Predictive Representations of State

Author: Michael L. Littman, Richard S. Sutton

Abstract: We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model. In predicting or controlling a sequence of observations, the concepts of state and state estimation inevitably arise. There have been two dominant approaches. The generative-model approach, typified by research on partially observable Markov decision processes (POMDPs), hypothesizes a structure for generating observations and estimates its state and state dynamics. The history-based approach, typified by k-order Markov methods, uses simple functions of past observations as state, that is, as the immediate basis for prediction and control. (The data flow in these two approaches is diagrammed in Figure 1.) Of the two, the generative-model approach is more general. The model's internal state gives it temporally unlimited memory (the ability to remember an event that happened arbitrarily long ago), whereas a history-based approach can only remember as far back as its history extends. The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. Most uses of POMDPs, for example, assume a perfect dynamics model and attempt only to estimate state.
There are algorithms for simultaneously estimating state and dynamics (e.g., Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al., 1970), but these are only effective at tuning parameters that are already approximately correct (e.g., Shatkay & Kaelbling, 1997). [Figure 1: Data flow in (a) POMDP and other recursive updating of state representation, and (b) history-based state representation.] In practice, history-based approaches are often much more effective. Here, the state representation is a relatively simple record of the stream of past actions and observations. It might record the occurrence of a specific subsequence or that one event has occurred more recently than another. Such representations are far more closely linked to the data than are POMDP representations. One way of saying this is that POMDP learning algorithms encounter many local minima and saddle points because all their states are equipotential. History-based systems immediately break symmetry, and their direct learning procedure makes them comparatively simple. McCallum (1995) has shown in a number of examples that sophisticated history-based methods can be effective in large problems, and are often more practical than POMDP methods even in small ones. The predictive state representation (PSR) approach, which we develop in this paper, is like the generative-model approach in that it updates the state representation recursively, as in Figure 1(a), rather than directly computing it from data. We show that this enables it to attain generality and compactness at least equal to that of the generative-model approach. However, the PSR approach is also like the history-based approach in that its representations are grounded in data.
Whereas a history-based representation looks to the past and records what did happen, a PSR looks to the future and represents what will happen. In particular, a PSR is a vector of predictions for a specially selected set of action-observation sequences, called tests (after Rivest & Schapire, 1994). For example, consider the test $a_1 o_1 a_2 o_2$, where $a_1$ and $a_2$ are specific actions and $o_1$ and $o_2$ are specific observations. The correct prediction for this test given the data stream up to time $k$ is the probability of its observations occurring (in order) given that its actions are taken (in order), i.e., $\Pr\{O_k = o_1, O_{k+1} = o_2 \mid A_k = a_1, A_{k+1} = a_2\}$. Each test is a kind of experiment that could be performed to tell us something about the system. If we knew the outcome of all possible tests, then we would know everything there is to know about the system. A PSR is a set of tests that is sufficient information to determine the prediction for all possible tests (a sufficient statistic). As an example of these points, consider the float/reset problem (Figure 2) consisting of a linear string of 5 states with a distinguished reset state on the far right. One action, f (float), causes the system to move uniformly at random to the right or left by one state, bounded at the two ends. The other action, r (reset), causes a jump to the reset state irrespective of the current state. The observation is always 0 unless the r action is taken when the system is already in the reset state, in which case the observation is 1. Thus, on an f action, the correct prediction is always 0, whereas on an r action, the correct prediction depends on how many f's there have been since the last r: for zero f's, it is 1; for one or two f's, it is 0.5; for three or four f's, it is 0.375; for five or six f's, it is 0.3125, and so on, decreasing after every second f and asymptotically bottoming out at 0.2. No k-order Markov method can model this system exactly, because no limited-length history is a sufficient statistic.
[Figure 2: Underlying dynamics of the float/reset problem for (a) the float action and (b) the reset action. The numbers on the arcs indicate transition probabilities. The observation is always 0 except on the reset action from the rightmost state, which produces an observation of 1.] A POMDP approach can model it exactly by maintaining a belief-state representation over five or so states. A PSR, on the other hand, can exactly model the float/reset system using just two tests: r1 and f0r1. Starting from the rightmost state, the correct predictions for these two tests are always two successive probabilities in the sequence given above (1, 0.5, 0.5, 0.375, ...), which is always a sufficient statistic to predict the next pair in the sequence. Although this informational analysis indicates a solution is possible in principle, it would require a nonlinear updating process for the PSR. In this paper we restrict consideration to a linear special case of PSRs, for which we can guarantee that the number of tests needed does not exceed the number of states in the minimal POMDP representation (although we have not ruled out the possibility it can be considerably smaller). Of greater ultimate interest are the prospects for learning PSRs and their update functions, about which we can only speculate at this time. The difficulty of learning POMDP structures without good prior models is well known. To the extent that this difficulty is due to the indirect link between the POMDP states and the data, predictive representations may be able to do better. Jaeger (2000) introduced the idea of predictive representations as an alternative to belief states in hidden Markov models and provided a learning procedure for these models. We build on his work by treating the control case (with actions), which he did not significantly analyze.
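The reset predictions quoted for the float/reset problem (1, 0.5, 0.5, 0.375, ...) can be checked by propagating a belief state through repeated float actions. The following is a minimal sketch, not the authors' code. It assumes that "bounded at the two ends" means a move off the chain leaves the state unchanged, and the names `float_step` and `reset_predictions` are hypothetical.

```python
from fractions import Fraction

N_STATES = 5            # linear chain; the reset state is the rightmost one
RESET = N_STATES - 1

def float_step(belief):
    """Apply one f action: move left or right with probability 1/2,
    staying put when the move would leave the chain (assumed reading)."""
    half = Fraction(1, 2)
    new = [Fraction(0)] * N_STATES
    for i, p in enumerate(belief):
        new[max(i - 1, 0)] += half * p
        new[min(i + 1, RESET)] += half * p
    return new

def reset_predictions(num_floats):
    """P(observe 1 | r) after k = 0..num_floats float actions since the last r."""
    belief = [Fraction(0)] * N_STATES
    belief[RESET] = Fraction(1)       # right after a reset the state is known
    preds = []
    for _ in range(num_floats + 1):
        preds.append(belief[RESET])   # r yields 1 only from the reset state
        belief = float_step(belief)
    return preds

print([float(p) for p in reset_predictions(6)])
# [1.0, 0.5, 0.5, 0.375, 0.375, 0.3125, 0.3125]
```

Under these assumptions the values drop after every second f and approach 0.2, the stationary probability of the reset state, in agreement with the text.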
We have also been strongly influenced by the work of Rivest and Schapire (1994), who did consider tests including actions, but treated only the deterministic case, which is significantly different. They also explored construction and learning algorithms for discovering system structure. 1 Predictive State Representations We consider dynamical systems that accept actions from a discrete set $A$ and generate observations from a discrete set $O$. We consider only predicting the system, not controlling it, so we do not designate an explicit reward observation. We refer to such a system as an environment. We use the term history to denote a test forming an initial stream of experience and characterize an environment by a probability distribution over all possible histories, $P : \{O \mid A\}^* \to [0,1]$, where $P(o_1 \cdots o_\ell \mid a_1 \cdots a_\ell)$ is the probability of observations $o_1, \ldots, o_\ell$ being generated, in that order, given that actions $a_1, \ldots, a_\ell$ are taken, in that order. The probability of a test $t$ conditional on a history $h$ is defined as $P(t \mid h) = P(ht)/P(h)$. Given a set of $q$ tests $Q = \{t_i\}$, we define their $(1 \times q)$ prediction vector, $p(h) = [P(t_1 \mid h), P(t_2 \mid h), \ldots, P(t_q \mid h)]$, as a predictive state representation (PSR) if and only if it forms a sufficient statistic for the environment, i.e., if and only if $P(t \mid h) = f_t(p(h))$ (1) for any test $t$ and history $h$, and for some projection function $f_t : [0,1]^q \to [0,1]$. In this paper we focus on linear PSRs, for which the projection functions are linear, that is, for which there exists a $(1 \times q)$ projection vector $m_t$, for every test $t$, such that $P(t \mid h) = f_t(p(h)) = p(h)\, m_t^T$ (2) for all histories $h$. Let $p_i(h)$ denote the $i$th component of the prediction vector for some PSR. This can be updated recursively, given a new action-observation pair $a, o$, by $p_i(hao) = P(t_i \mid hao) = \frac{P(o t_i \mid ha)}{P(o \mid ha)} = \frac{f_{aot_i}(p(h))}{f_{ao}(p(h))} = \frac{p(h)\, m_{aot_i}^T}{p(h)\, m_{ao}^T}$ (3), where the last step is specific to linear PSRs.
We can now state our main result: Theorem 1 For any environment that can be represented by a finite POMDP model, there exists a linear PSR with number of tests no larger than the number of states in the minimal POMDP model. 2 Proof of Theorem 1: Constructing a PSR from a POMDP We prove Theorem 1 by showing that for any POMDP model of the environment, we can construct in polynomial time a linear PSR for that POMDP of lesser or equal complexity that produces the same probability distribution over histories as the POMDP model. We proceed in three steps. First, we review POMDP models and how they assign probabilities to tests. Next, we define an algorithm that takes an n-state POMDP model and produces a set of n or fewer tests, each of length less than or equal to n. Finally, we show that the set of tests constitutes a PSR for the POMDP, that is, that there are projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. A POMDP (Lovejoy, 1991; Kaelbling et al., 1998) is defined by a sextuple $\langle S, A, O, b_0, T, \mathcal{O} \rangle$. Here, $S$ is a set of $n$ underlying (hidden) states, $A$ is a discrete set of actions, and $O$ is a discrete set of observations. The $(1 \times n)$ vector $b_0$ is an initial state distribution. The set $T$ consists of $(n \times n)$ transition matrices $T^a$, one for each action $a$, where $T^a_{ij}$ is the probability of a transition from state $i$ to $j$ when action $a$ is chosen. The set $\mathcal{O}$ consists of diagonal $(n \times n)$ observation matrices $O^{a,o}$, one for each pair of observation $o$ and action $a$, where $O^{a,o}_{ii}$ is the probability of observation $o$ when action $a$ is selected and state $i$ is reached (footnote 1). The state representation in a POMDP (Figure 1(a)) is the belief state, the $(1 \times n)$ vector of the state-occupation probabilities given the history $h$. It can be computed recursively given a new action $a$ and observation $o$ by $b(hao) = \frac{b(h)\, T^a O^{a,o}}{b(h)\, T^a O^{a,o} e_n^T}$, where $e_n$ is the $(1 \times n)$ vector of all 1s.
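The recursive belief-state update can be made concrete with a small numerical sketch. This is an illustration under stated assumptions, not code from the paper. The function name `belief_update` and the two-state transition and observation values are hypothetical, chosen only so the arithmetic can be followed by hand.

```python
from fractions import Fraction as F

def belief_update(belief, T_a, O_ao):
    """Return (b(hao), P(o | h, a)) where b(hao) = b(h) T^a O^{a,o} / (b(h) T^a O^{a,o} e_n^T).

    belief: list of n state probabilities, b(h)
    T_a:    n x n transition matrix for action a (rows sum to 1)
    O_ao:   list of the n diagonal entries of O^{a,o}
    """
    n = len(belief)
    # Unnormalized new belief: (b(h) T^a)_j * O^{a,o}_jj
    unnorm = [sum(belief[i] * T_a[i][j] for i in range(n)) * O_ao[j] for j in range(n)]
    prob_obs = sum(unnorm)            # the normalizer b(h) T^a O^{a,o} e_n^T
    if prob_obs == 0:
        raise ValueError("observation has zero probability under this belief")
    return [x / prob_obs for x in unnorm], prob_obs

# Hypothetical two-state example (values are made up for illustration).
T_a = [[F(9, 10), F(1, 10)],
       [F(2, 10), F(8, 10)]]
O_ao = [F(3, 4), F(1, 4)]
new_belief, prob_obs = belief_update([F(1, 2), F(1, 2)], T_a, O_ao)
print(new_belief, prob_obs)
```

The normalizer returned alongside the new belief is the one-step prediction $P(o \mid ha)$, which is the same quantity that appears in the denominator of the PSR update.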
Finally, a POMDP defines a probability distribution over tests (and thus histories) by $P(o_1 \cdots o_\ell \mid h a_1 \cdots a_\ell) = b(h)\, T^{a_1} O^{a_1,o_1} \cdots T^{a_\ell} O^{a_\ell,o_\ell} e_n^T$ (4). (Footnote 1: There are many equivalent formulations and the conversion procedure described here can be easily modified to accommodate other POMDP definitions.) We now present our algorithm for constructing a PSR for a given POMDP. It uses a function $u$ mapping tests to $(1 \times n)$ vectors defined recursively by $u(\varepsilon) = e_n$ and $u(aot) = (T^a O^{a,o} u(t)^T)^T$, where $\varepsilon$ represents the null test. Conceptually, the components of $u(t)$ are the probabilities of the test $t$ when applied from each underlying state of the POMDP; we call $u(t)$ the outcome vector for test $t$. We say a test $t$ is linearly independent of a set of tests $S$ if its outcome vector is linearly independent of the set of outcome vectors of the tests in $S$. Our algorithm, search, is invoked as $Q \leftarrow \mathrm{search}(\varepsilon, \{\})$ and defined as:
    search(t, S):
        for each a ∈ A, o ∈ O:
            if aot is linearly independent of S then S ← search(aot, S ∪ {aot})
        return S
The algorithm maintains a set of tests and searches for new tests that are linearly independent of those already found. It is a form of depth-first search. The algorithm halts when it checks all the one-step extensions of its tests and finds none that are linearly independent. Because the set of tests $Q$ returned by search have linearly independent outcome vectors, the cardinality of $Q$ is bounded by $n$, ensuring that the algorithm halts after a polynomial number of iterations. Because each test in $Q$ is formed by a one-step extension to some other test in $Q$, no test is longer than $n$ action-observation pairs. The check for linear independence can be performed in many ways, including Gaussian elimination, implying that search terminates in polynomial time. By construction, all one-step extensions to the set of tests $Q$ returned by search are linearly dependent on those in $Q$. We now show that this is true for any test.
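The search procedure described above can be sketched on the float/reset problem. This is a minimal illustration, not the authors' implementation. It assumes a joint matrix `M[a, o]` standing in for the product $T^a O^{a,o}$, so that the reset observation can depend on the state before the jump; this is one of the equivalent formulations mentioned in footnote 1. The names `joint_matrices`, `outcome`, `rank`, and `search` are hypothetical, and exact rational arithmetic is used so the independence check is not affected by rounding.

```python
from fractions import Fraction

N = 5                      # states 0..4; state 4 is the reset state
ACTIONS, OBS = "fr", "01"

def joint_matrices():
    """M[a, o][i][j] = P(next state j and observation o | state i, action a)."""
    M = {(a, o): [[Fraction(0)] * N for _ in range(N)] for a in ACTIONS for o in OBS}
    for i in range(N):
        # float: step left or right with probability 1/2 (clipped at the ends), observe 0
        M["f", "0"][i][max(i - 1, 0)] += Fraction(1, 2)
        M["f", "0"][i][min(i + 1, N - 1)] += Fraction(1, 2)
        # reset: jump to state 4, observing 1 only if the system was already there
        M["r", "1" if i == N - 1 else "0"][i][N - 1] = Fraction(1)
    return M

def outcome(M, test):
    """Outcome vector u(test); a test is a string such as 'f0r0'."""
    u = [Fraction(1)] * N                        # u(null test) = e_n
    for k in range(len(test) - 2, -1, -2):       # apply pairs from last to first
        m = M[test[k], test[k + 1]]
        u = [sum(m[i][j] * u[j] for j in range(N)) for i in range(N)]
    return u

def rank(rows):
    """Rank of a list of length-N vectors by Gaussian elimination."""
    rows, rk = [r[:] for r in rows], 0
    for col in range(N):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(rk + 1, len(rows)):
            factor = rows[r][col] / rows[rk][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rk])]
        rk += 1
    return rk

def search(M, t, S):
    """Depth-first search for tests whose outcome vectors are linearly independent."""
    for a in ACTIONS:
        for o in OBS:
            ext = a + o + t
            if rank([outcome(M, s) for s in S] + [outcome(M, ext)]) > len(S):
                S = search(M, ext, S + [ext])
    return S

M = joint_matrices()
Q = search(M, "", [])
print(len(Q), Q)
```

The particular tests returned depend on the order in which actions and observations are enumerated, so they need not coincide with any specific list of tests; only the number of tests, which equals the dimension spanned by all outcome vectors, is independent of that order.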
Lemma 1 The outcome vectors of the tests in $Q$ can be linearly combined to produce the outcome vector for any test. Proof: Let $U$ be the $(n \times q)$ matrix formed by concatenating the outcome vectors for all tests in $Q$. Since, for all combinations of $a$ and $o$, the columns of $T^a O^{a,o} U$ are linearly dependent on the columns of $U$, we can write $T^a O^{a,o} U = U W^T$ for some $q \times q$ matrix of weights $W$. If $t$ is a test that is linearly dependent on $Q$, then any one-step extension of $t$, $aot$, is linearly dependent on $Q$. This is because we can write the outcome vector for $t$ as $u(t) = (U w^T)^T$ for some $(1 \times q)$ weight vector $w$ and the outcome vector for $aot$ as $u(aot) = (T^a O^{a,o} u(t)^T)^T = (T^a O^{a,o} U w^T)^T = (U W^T w^T)^T$. Thus, $aot$ is linearly dependent on $Q$. Now, note that all one-step tests are linearly dependent on $Q$ by the structure of the search algorithm. Using the previous paragraph as an inductive argument, this implies that all tests are linearly dependent on $Q$. □ Returning to the float/reset example POMDP, search begins by enumerating the 4 extensions to the null test (f0, f1, r0, and r1). Of these, only f0 and r0 are linearly independent. Of the extensions of these, f0r0 is the only one that is linearly independent of the other two. The remaining two tests added to $Q$ by search are f0f0r0 and f0f0f0r0. No extensions of the 5 tests in $Q$ are linearly independent of the 5 tests in $Q$, so the procedure halts. We now show that the set of tests $Q$ constitutes a PSR for the POMDP by constructing projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. For each combination of $a$ and $o$, define a $q \times q$ matrix $M_{ao} = (U^+ T^a O^{a,o} U)^T$ and a $1 \times q$ vector $m_{ao} = (U^+ T^a O^{a,o} e_n^T)^T$, where $U$ is the matrix of outcome vectors defined in the previous section and $U^+$ is its pseudoinverse (footnote 2). The $i$th row of $M_{ao}$ is $m_{aot_i}$.
The probability distribution on histories implied by these projection vectors is, using $p(h) = b(h) U$, $p(h)\, m_{a_1 o_1 \cdots a_\ell o_\ell}^T = p(h)\, M_{a_1 o_1}^T \cdots M_{a_{\ell-1} o_{\ell-1}}^T m_{a_\ell o_\ell}^T = b(h)\, U U^+ T^{a_1} O^{a_1,o_1} U \cdots U^+ T^{a_{\ell-1}} O^{a_{\ell-1},o_{\ell-1}} U U^+ T^{a_\ell} O^{a_\ell,o_\ell} e_n^T = b(h)\, T^{a_1} O^{a_1,o_1} \cdots T^{a_{\ell-1}} O^{a_{\ell-1},o_{\ell-1}} T^{a_\ell} O^{a_\ell,o_\ell} e_n^T$, i.e., it is the same as that of the POMDP, as in Equation 4. Here, the last step uses the fact that $U U^+ v^T = v^T$ for $v^T$ linearly dependent on the columns of $U$. This holds by construction of $U$ in the previous section. This completes the proof of Theorem 1. Completing the float/reset example, consider the $M_{f0}$ matrix found by the process defined in this section. It derives predictions for each test in $Q$ after taking action f. Most of these are quite simple because the tests are so similar: the new prediction for r0 is exactly the old prediction for f0r0, for example. The only nontrivial test is f0f0f0r0. Its outcome can be computed from $0.250\, p(\mathrm{r0} \mid h) - 0.0625\, p(\mathrm{f0r0} \mid h) + 0.750\, p(\mathrm{f0f0r0} \mid h)$. This example illustrates that the projection vectors need not contain only positive entries. 3 Conclusion We have introduced a predictive state representation for dynamical systems that is grounded in actions and observations and shown that, even in its linear form, it is at least as general and compact as POMDPs. In essence, we have established PSRs as a non-inferior alternative to POMDPs, and suggested that they might have important advantages, while leaving demonstration of those advantages to future work. We conclude by summarizing the potential advantages (to be explored in future work): Learnability. The k-order Markov model is similar to PSRs in that it is entirely based on actions and observations. Such models can be learned trivially from data by counting; it is an open question whether something similar can be done with a PSR.
Jaeger (2000) showed how to learn such a model in the uncontrolled setting, but the situation is more complex in the multiple action case since outcomes are conditioned on behavior, violating some required independence assumptions. Compactness. We have shown that there exist linear PSRs no more complex than the minimal POMDP for an environment, but in some cases the minimal linear PSR seems to be much smaller. For example, a POMDP extension of factored MDPs explored by Singh and Cohn (1998) would be cross-products of separate POMDPs and have linear PSRs that increase linearly with the number and size of the component POMDPs, whereas their minimal POMDP representation would grow as the size of the state space, i.e., exponential in the number of component POMDPs. This (apparent) advantage stems from the PSR's combinatorial or factored structure. As a vector of state variables, capable of taking on diverse values, a PSR may be inherently more powerful than the distribution over discrete states (the belief state) of a POMDP. We have already seen that general PSRs can be more compact than POMDPs; they are also capable of efficiently capturing environments in the diversity representation used by Rivest and Schapire (1994), which is known to provide an extremely compact representation for some environments. Generalization. There are reasons to think that state variables that are themselves predictions may be particularly useful in learning to make other predictions. With so many things to predict, we have in effect a set or sequence of learning problems, all due to the same environment. In many such cases the solutions to earlier problems have been shown to provide features that generalize particularly well to subsequent problems (e.g., Baxter, 2000; Thrun & Pratt, 1998). (Footnote 2: If $U = A \Sigma B^T$ is the singular value decomposition of $U$, then $B \Sigma^+ A^T$ is the pseudoinverse. The pseudoinverse of the diagonal matrix $\Sigma$ replaces each non-zero element with its reciprocal.)
Powerful, extensible representations. PSRs that predict tests could be generalized to predict the outcomes of multi-step options (e.g., Sutton et al., 1999). In this case, particularly, they would constitute a powerful language for representing the state of complex environments. Acknowledgments: We thank Peter Dayan, Lawrence Saul, Fernando Pereira and Rob Schapire for many helpful discussions of these and related ideas.
References Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164-171. Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149-198. Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 183-188). San Jose, California: AAAI Press. Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12, 1371-1398. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99-134. Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47-65. McCallum, A. K. (1995). Reinforcement learning with selective perception and hidden state. Doctoral dissertation, Department of Computer Science, University of Rochester. Rivest, R. L., & Schapire, R. E. (1994). Diversity-based inference of finite automata. Journal of the ACM, 41, 555-589. Shatkay, H., & Kaelbling, L. P. (1997). Learning topological maps with weak local odometric information. Proceedings of Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97) (pp. 920-929). Singh, S., & Cohn, D. (1998). How to dynamically merge Markov decision processes. Advances in Neural Information Processing Systems 10 (pp. 1057-1063). Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 181-211. Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.

same-paper 5 0.98573685 107 nips-2001-Latent Dirichlet Allocation

Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan

Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

6 0.97401035 111 nips-2001-Learning Lateral Interactions for Feature Binding and Sensory Segmentation

7 0.95829141 144 nips-2001-Partially labeled classification with Markov random walks

8 0.93529415 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

9 0.90264678 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds

10 0.87023467 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

11 0.87001401 30 nips-2001-Agglomerative Multivariate Information Bottleneck

12 0.84954304 24 nips-2001-Active Information Retrieval

13 0.83805978 183 nips-2001-The Infinite Hidden Markov Model

14 0.83246231 68 nips-2001-Entropy and Inference, Revisited

15 0.82915813 96 nips-2001-Information-Geometric Decomposition in Spike Analysis

16 0.82673168 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

17 0.82572675 11 nips-2001-A Maximum-Likelihood Approach to Modeling Multisensory Enhancement

18 0.82550621 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

19 0.82170832 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

20 0.81139952 51 nips-2001-Cobot: A Social Reinforcement Learning Agent