nips nips2001 nips2001-194 knowledge-graph by maker-knowledge-mining

194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation


Source: pdf

Author: Thomas L. Griffiths, Joshua B. Tenenbaum

Abstract: Estimating the parameters of sparse multinomial distributions is an important component of many statistical learning tasks. Recent approaches have used uncertainty over the vocabulary of symbols in a multinomial distribution as a means of accounting for sparsity. We present a Bayesian approach that allows weak prior knowledge, in the form of a small set of approximate candidate vocabularies, to be used to dramatically improve the resulting estimates. We demonstrate these improvements in applications to text compression and estimating distributions over words in newsgroup data. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Estimating the parameters of sparse multinomial distributions is an important component of many statistical learning tasks. [sent-5, score-0.399]

2 Recent approaches have used uncertainty over the vocabulary of symbols in a multinomial distribution as a means of accounting for sparsity. [sent-6, score-0.765]

3 We present a Bayesian approach that allows weak prior knowledge, in the form of a small set of approximate candidate vocabularies, to be used to dramatically improve the resulting estimates. [sent-7, score-0.149]

4 We demonstrate these improvements in applications to text compression and estimating distributions over words in newsgroup data. [sent-8, score-0.542]

5 1 Introduction Sparse multinomial distributions arise in many statistical domains, including natural language processing and graphical models. [sent-9, score-0.396]

6 Consequently, a number of approaches to parameter estimation for sparse multinomial distributions have been suggested [3]. [sent-10, score-0.558]

7 These approaches tend to be domain-independent: they make little use of prior knowledge about a specific domain. [sent-11, score-0.18]

8 In many domains where multinomial distributions are estimated there is often at least weak prior knowledge about the potential structure of distributions, such as a set of hypotheses about restricted vocabularies from which the symbols might be generated. [sent-12, score-1.261]

9 Such knowledge can be solicited from experts or obtained from unlabeled data. [sent-13, score-0.236]

10 We present a method for Bayesian parameter estimation in sparse discrete domains that exploits this weak form of prior knowledge to improve estimates over knowledge-free approaches. [sent-14, score-0.388]

11 1 Bayesian parameter estimation for multinomial distributions Following the presentation in [4], we consider a language Σ containing L distinct symbols. [sent-16, score-0.551]

12 A multinomial distribution is specified by a parameter vector θ = (θ_1, …, θ_L). [sent-17, score-0.304]

13 The task of multinomial estimation is to take a data set D and produce a vector θ that results in a good approximation to the distribution that produced D. [sent-25, score-0.359]

14 The data D consist of observations x_1, …, x_N drawn from the distribution to be estimated, which can be summarized by the statistics N_i specifying the number of times the ith symbol occurs in the data. [sent-29, score-0.052]

15 D also determines the set Σ_0 of symbols that occur in the data. [sent-30, score-0.088]
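
As a concrete illustration (our own sketch, not from the paper; the function name and the symbol-index encoding are assumptions), the statistics N_i and the observed symbol set Σ_0 can be computed from a symbol sequence as follows:

    from collections import Counter

    def sufficient_statistics(data, L):
        # data: sequence of symbol indices in {0, ..., L-1}
        counts = Counter(data)
        N_i = [counts.get(i, 0) for i in range(L)]   # occurrence counts N_i
        sigma_0 = set(counts)                        # symbols observed in D (Sigma_0)
        return N_i, sigma_0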

16 Stated in this way, multinomial estimation involves predicting the next observation based on the data. [sent-31, score-0.359]

17 The Bayesian estimate for this probability is given by P(x_{N+1} | D) = ∫ P(x_{N+1} | θ) P(θ | D) dθ, where P(x_{N+1} | θ) follows from the multinomial distribution corresponding to θ. [sent-33, score-0.295]

18 The posterior probability P(θ | D) can be obtained via Bayes' rule, P(θ | D) ∝ P(D | θ) P(θ) = P(θ) ∏_{i=1}^{L} θ_i^{N_i}, where P(θ) is the prior probability of a given θ. [sent-34, score-0.162]

19 Laplace used this method with a uniform prior over θ to give the famous "law of succession" [6]. [sent-35, score-0.071]

20 A more general approach is to assume a Dirichlet prior over θ, which is conjugate to the multinomial distribution and gives P(x_{N+1} = i | D) = (N_i + α) / (N + Lα). [sent-36, score-0.338]

21 Taking α = 0.5 gives the Jeffreys-Perks law or Expected Likelihood Estimation [2][5][9], while arbitrary α gives a more general family of estimators. [sent-45, score-0.074]
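
A minimal sketch of this family of estimators (our own code, not the paper's; the function name and toy counts are assumptions): the posterior predictive under a symmetric Dirichlet(α) prior, with α = 1 giving Laplace's law and α = 0.5 the Jeffreys-Perks law.

    import numpy as np

    def dirichlet_predictive(counts, alpha):
        # P(next symbol = i | D) = (N_i + alpha) / (N + L * alpha)
        counts = np.asarray(counts, dtype=float)
        L, N = counts.size, counts.sum()
        return (counts + alpha) / (N + L * alpha)

    counts = np.array([5, 0, 2, 0])                   # toy symbol counts N_i
    p_laplace = dirichlet_predictive(counts, 1.0)     # Laplace's law of succession
    p_jp = dirichlet_predictive(counts, 0.5)          # Jeffreys-Perks / ELE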

22 2 Estimating sparse multinomial distributions Several authors have extended the Bayesian approach to sparse multinomial distributions, in which only a restricted vocabulary of symbols is used, by maintaining uncertainty over these vocabularies. [sent-49, score-0.971]

23 Friedman and Singer consider the vocabulary V ⊆ Σ to be a random variable, allowing them to write the predictive probability P(x_{N+1} | D) as an average over possible vocabularies. [sent-52, score-0.325]

24 Making use of weak prior knowledge Friedman and Singer assume a prior that gives equal probability to all vocabularies of a given cardinality. [sent-61, score-0.709]

25 However, many real-world tasks provide limited knowledge about the structure of distributions that we can build into our methods for parameter estimation. [sent-62, score-0.214]

26 In the context of sparse multinomial estimation, one instance of such knowledge is the importance of specific vocabularies. [sent-63, score-0.407]

27 For example, in predicting the next character in a file, our predictions could be facilitated by considering the fact that most files either use a vocabulary consisting of ASCII printing characters (such as text files), or all possible characters (such as object files). [sent-64, score-1.001]

28 This kind of structural knowledge about a domain is typically easier to solicit from experts than accurate distributional information, and forms a valuable informational resource. [sent-65, score-0.225]

29 If we have this kind of prior knowledge, we can restrict our attention to a subset of the 2^L possible vocabularies. [sent-66, score-0.099]

30 In particular, we can specify a set of vocabularies V which we consider as hypotheses for the vocabulary used in producing D, where the elements of V are specified by our knowledge of the domain. [sent-67, score-0.932]

31 This stands as a compromise between Friedman and Singer's approach, in which V consists of all vocabularies, and traditional Bayesian parameter estimation as represented by Equation 1, in which V consists of only the vocabulary containing all words. [sent-68, score-0.48]

32 The intuition behind this approach is that it attempts to classify the target distribution as using one of a known set of vocabularies, where the vocabularies are obtained either from experts or from unlabeled data. [sent-71, score-0.604]

33 Applying standard Bayesian multinomial estimation within this vocabulary gives enough flexibility for the method to capture a range of distributions, while making use of our weak prior knowledge. [sent-72, score-0.799]
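
The following sketch shows the general shape of such a restricted-vocabulary estimator. It is our own simplification (uniform prior over the candidate vocabularies, symmetric Dirichlet within each vocabulary, hard vocabulary membership), not the paper's exact equations, and the function names are ours.

    import numpy as np
    from scipy.special import gammaln

    def log_evidence(counts, vocab, alpha=0.5):
        # log P(D | V): Dirichlet-multinomial marginal likelihood restricted to V,
        # and -inf if the data contain a symbol outside V.
        counts = np.asarray(counts, dtype=float)
        idx = np.array(sorted(vocab))
        if counts.sum() > counts[idx].sum():
            return -np.inf
        c, k = counts[idx], len(idx)
        return (gammaln(k * alpha) - gammaln(c.sum() + k * alpha)
                + np.sum(gammaln(c + alpha) - gammaln(alpha)))

    def restricted_predictive(counts, vocabularies, alpha=0.5):
        # Posterior over candidate vocabularies (uniform prior assumed), then a
        # mixture of the within-vocabulary Dirichlet predictive estimates.
        # Assumes at least one candidate (e.g. the full vocabulary) covers the data.
        counts = np.asarray(counts, dtype=float)
        logw = np.array([log_evidence(counts, V, alpha) for V in vocabularies])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        p = np.zeros_like(counts)
        for weight, V in zip(w, vocabularies):
            idx = np.array(sorted(V))
            p[idx] += weight * (counts[idx] + alpha) / (counts[idx].sum() + len(idx) * alpha)
        return p

Including the full vocabulary Σ as one of the candidates guarantees that at least one hypothesis has nonzero evidence for any data set.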

34 1 An illustration: Text compression Text compression is an effective test of methods for multinomial estimation. [sent-74, score-0.533]

35 Adaptive coding can be performed by specifying a method for calculating the probability distribution of the next byte in a file based upon the preceding bytes [1]. [sent-75, score-0.394]

36 The extent to which the file is compressed depends upon the quality of these predictions. [sent-76, score-0.252]

37 To illustrate the utility of including prior knowledge, we follow Ristad in using the Calgary text compression corpus [1]. [sent-77, score-0.398]

38 The files include BibTeX source (bib), formatted English text (book*, paper*), geological data (geo), newsgroup articles (news), object files (obj*), a bit-mapped picture (pic), programs in three different languages (prog*) and a terminal transcript (trans). [sent-79, score-0.785]

39 The task was to estimate the distribution from which characters in the file were drawn based upon the first N characters and thus predict the N + 1st character. [sent-80, score-0.534]

40 Performance was measured in terms of the length of the resulting file, where the contribution of the N + 1st character to the length is −log₂ P(x_{N+1} | D). [sent-81, score-0.044]

41 The results are expressed as the number of bytes required to encode the file relative to the empirical entropy N·H(N_i/N) as assessed by Ristad [10]. [sent-82, score-0.27]
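
A sketch of how such an adaptive code length can be scored (our own code; the function name is an assumption, and the Jeffreys-Perks rule used here can be swapped for any of the estimators discussed above):

    import numpy as np

    def adaptive_code_length_bytes(data, num_symbols=256, alpha=0.5):
        # Sum of -log2 P(x_{N+1} | first N bytes), reported in bytes.
        counts = np.zeros(num_symbols)
        bits = 0.0
        for symbol in data:        # e.g. iterate over a Python bytes object
            p = (counts[symbol] + alpha) / (counts.sum() + num_symbols * alpha)
            bits -= np.log2(p)
            counts[symbol] += 1
        return bits / 8.0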

42 P_V is the restricted vocabulary model outlined above, with V consisting of just two hypotheses: one corresponding to binary files, containing all 256 characters, and one consisting of a 107-character vocabulary representing formatted English. [sent-84, score-0.844]

43 The latter vocabulary was estimated from 5 MB of English text, C code, BibTeX source, and newsgroup data from outside the Calgary corpus. [sent-85, score-0.46]

44 P_V outperformed the other methods on all files based upon English text, bar book1, and all files using all 256 symbols¹. [sent-89, score-0.55]

45 The high performance followed from rapid classification of these files as using the appropriate vocabulary in V. [sent-90, score-0.556]

46 When the vocabulary included all symbols, P_V performed as P_J, which gave the best predictions for these files. [sent-91, score-0.443]

47 1 A number of excellent techniques for text compression exist that outperform all of those presented here. [sent-92, score-0.263]

48 We have not included these techniques for comparison because our interest is in using text compression as a means of assessing estimation procedures, rather than as an end in itself. [sent-93, score-0.355]

49 We thus consider only methods for multinomial estimation as our comparison. [sent-94, score-0.359]

50 2 Maintaining uncertainty in vocabularies The results for book1 illustrate a weakness of the approach outlined above. [sent-97, score-0.61]

51 The file length for P_V is higher than those for P_F and P_R, despite the fact that the file uses a text-based vocabulary. [sent-98, score-0.416]

52 This file contains two characters that were not encountered in the data used to construct V. [sent-99, score-0.377]

53 These characters caused P_V to default to the unrestricted vocabulary of all 256 characters. [sent-100, score-0.466]

54 This behavior results from the assumption that the candidate vocabularies in V are completely accurate. [sent-102, score-0.529]

55 Since in many cases the knowledge that informs the vocabularies in V may be imperfect, it is desirable to allow for uncertainty in vocabularies. [sent-103, score-0.629]

56 When V is determined by the judgments of domain experts, t_V is the probability that an unmentioned word actually belongs to a particular vocabulary. [sent-105, score-0.174]

57 While it may not be the most efficient use of such data, the V ∈ V can also be estimated from some form of unlabeled data. [sent-106, score-0.075]

58 Friedman and Singer explicitly calculate the probability that an unseen word is in V based upon a dataset: from the second condition of Equation 5, we find that we should set t_V = (1 − C(D, L)) / (L − |V|). [sent-108, score-0.225]
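
One way to realize this kind of softening (a sketch of the general idea only, not the paper's formulation; here t is treated as a fixed per-token leakage probability, and the function name is ours) is to let each out-of-vocabulary token contribute log(t) to the vocabulary's evidence instead of forcing it to zero:

    import numpy as np
    from scipy.special import gammaln

    def log_evidence_soft(counts, vocab, t, alpha=0.5):
        # Like log_evidence above, but a token outside V contributes log(t)
        # rather than -inf, so a handful of unseen words does not force the
        # model to abandon an otherwise well-fitting vocabulary. Requires t > 0.
        counts = np.asarray(counts, dtype=float)
        idx = np.array(sorted(vocab))
        n_out = counts.sum() - counts[idx].sum()     # out-of-vocabulary tokens
        c, k = counts[idx], len(idx)
        inside = (gammaln(k * alpha) - gammaln(c.sum() + k * alpha)
                  + np.sum(gammaln(c + alpha) - gammaln(alpha)))
        return inside + n_out * np.log(t)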

59 3 Bayesian parameter estimation in natural language Statistical natural language processing often uses sparse multinomial distributions over large vocabularies of words. [sent-110, score-1.139]

60 By specifying a basis set of vocabularies, we can perform parameter estimation by classifying distributions according to their vocabulary. [sent-112, score-0.252]

61 This dataset is commonly used in testing text classification algorithms (e.g. [sent-114, score-0.214]

62 Ten newsgroups were used to estimate a set of vocabularies V with corresponding t_V. [sent-116, score-0.725]

63 These vocabularies were used in estimating multinomial distributions on these newsgroups and ten others. [sent-117, score-1.18]

64 The articles in each of the 20 newsgroups were then divided into three sets. [sent-119, score-0.23]

65 The first 500 articles from ten newsgroups were used to estimate the candidate vocabularies V and uncertainty parameters t_V. [sent-120, score-1.019]

66 Articles 501–700 for all 20 newsgroups were used as training data for multinomial estimation. [sent-121, score-0.497]

67 Following [8], a dictionary was built up by running over the 13,000 articles resulting from this division, and all words that occurred only once were mapped to an "unknown" word. [sent-123, score-0.24]
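
A rough sketch of this preprocessing (assuming whitespace tokenization, which the text does not specify; function names and the '<unknown>' token spelling are ours):

    from collections import Counter

    def build_dictionary(articles):
        # Words occurring only once across the corpus are mapped to '<unknown>'.
        freq = Counter(w for doc in articles for w in doc.split())
        return {w for w, c in freq.items() if c > 1} | {'<unknown>'}

    def candidate_vocabulary(first_articles, dictionary):
        # Candidate vocabulary for one newsgroup: dictionary words seen in its
        # first 500 articles, plus the '<unknown>' token.
        seen = {w for doc in first_articles for w in doc.split()}
        return (seen & dictionary) | {'<unknown>'}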

68 As before, the restricted vocabulary method (P_V), Friedman and Singer's method (P_F), and Ristad's (P_R), Laplace's (P_L) and the Jeffreys-Perks (P_J) laws were applied. [sent-125, score-0.403]

70 Figure 1: Cross-entropy of predictions on newsgroup data as a function of the logarithm of the number of words. [sent-164, score-0.136]

71 The top ten (talk.politics.mideast and those to its right) are for the newsgroups with unknown vocabularies. [sent-168, score-0.23]

72 The bottom ten are for those that contributed vocabularies to V, trained and tested on novel data. [sent-169, score-0.649]

73 The marker on talk.politics.mideast indicates the point at which P_V defaults to the full vocabulary, as the number of unseen words makes this vocabulary more likely. [sent-173, score-0.469]

74 V featured one vocabulary that contained all words in the dictionary, and ten vocabularies each corresponding to the words used in the first 500 articles of one of the newsgroups designated for this purpose. [sent-178, score-1.393]

75 Testing for each newsgroup consisted of taking words from the 200 articles assigned for training purposes, estimating a distribution, and evaluating the estimate by cross-entropy. [sent-180, score-0.329]

76 The cross-entropy is H(Q; P) = −∑_i Q_i log₂ P_i, where Q is the true distribution and P is the distribution produced by the estimation method. [sent-182, score-0.092]

77 Q was given by the maximum likelihood estimate formed from the word frequencies in all 200 articles assigned for testing purposes. [sent-183, score-0.205]
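
For reference, the evaluation measure can be computed as follows (a sketch with our own function name; P must assign nonzero probability to every word type with Q_i > 0):

    import numpy as np

    def cross_entropy_bits(Q, P):
        # H(Q; P) = -sum_i Q_i log2 P_i, in bits
        Q, P = np.asarray(Q, dtype=float), np.asarray(P, dtype=float)
        mask = Q > 0                  # terms with Q_i = 0 contribute nothing
        return -np.sum(Q[mask] * np.log2(P[mask]))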

78 As expected, P_V consistently outperformed the other methods on the newsgroups that contributed to V. [sent-191, score-0.301]

79 However, performance on novel newsgroups was also greatly improved. [sent-192, score-0.267]

80 As can be seen in Figure 2, the novel newsgroups were classified to appropriate vocabularies - for example all words rec. [sent-193, score-0.858]

81 Figure 2: Classification of newsgroup vocabularies. [sent-221, score-0.175]

82 The lines illustrate the vocabulary which had maximum posterior probability for each of the ten test newsgroups after exposure to differing numbers of words. [sent-222, score-0.722]

83 The vocabularies in V are listed along the left hand side of the axis, and the lines are identified with newsgroups by the labels on the right hand side. [sent-223, score-0.753]

84 The proportion of word types occurring in the test data but not the vocabulary to which the novel newsgroups were classified ranged between 30. [sent-233, score-0.66]

85 This illustrates that even approximate knowledge can facilitate predictions: the basis set of vocabularies allowed the high frequency words in the data to be modelled effectively, without excess mass being attributed to the low frequency novel word tokens. [sent-237, score-0.762]

86 talk.politics.mideast illustrates the same defaulting behavior that was shown for text compression: when the data become more probable under the vocabulary containing all words than under a restricted vocabulary, the method defaults to the Jeffreys-Perks law. [sent-240, score-0.96]

87 This guarantees that the method will tend to perform no worse than P_J when unseen words are encountered in sufficient proportions. [sent-241, score-0.13]

88 4 Discussion Bayesian approaches to parameter estimation for sparse multinomial distributions have employed the notion of a restricted vocabulary from which symbols are produced. [sent-243, score-1.014]

89 In many domains where such distributions are estimated; there is often at least limited knowledge about the structure of these vocabularies. [sent-244, score-0.218]

90 By considering just the vocabularies suggested by such knowledge, together with some uncertainty concerning those vocabularies, we can achieve very good estimates of distributions in these domains. [sent-245, score-0.621]

91 We have presented a Bayesian approach that employs limited prior knowledge, and shown that it outperforms a range of approaches to multinomial estimation for both text compression and a task involving natural language. [sent-246, score-0.75]

92 While our applications in this paper estimated approximate vocabularies from data, the real promise of this approach lies with domain knowledge solicited from experts. [sent-247, score-0.677]

93 Experts are typically better at providing qualitative structural information than quantitative distributional information, and our approach provides a way of using this information in parameter estimation. [sent-248, score-0.066]

94 For example, in the context of parameter estimation for graphical models to be used in medical diagnosis, distinguishing classes of symptoms might be informative in determining the parameters governing their relationship to diseases. [sent-249, score-0.129]

95 This form of knowledge is naturally translated into a set of vocabularies to be considered for each such distribution. [sent-250, score-0.574]

96 More complex applications to natural language may also be possible, such as using syntactic information in estimating probabilities for n-gram models. [sent-251, score-0.091]

97 The approach we have presented in this paper provides a simple way to allow this kind of limited domain knowledge to be useful in Bayesian parameter estimation. [sent-252, score-0.197]

98 An empirical study of smoothing techniques for language modeling. [sent-274, score-0.058]

99 An invariant form for the prior probability in estimation problems. [sent-283, score-0.191]

100 Text classification from labeled and unlabeled documents using EM. [sent-301, score-0.087]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vocabularies', 0.495), ('vocabulary', 0.325), ('multinomial', 0.267), ('newsgroups', 0.23), ('file', 0.208), ('files', 0.19), ('ristad', 0.144), ('ko', 0.141), ('characters', 0.141), ('compression', 0.133), ('text', 0.13), ('articles', 0.121), ('friedman', 0.117), ('singer', 0.114), ('newsgroup', 0.106), ('ild', 0.104), ('estimation', 0.092), ('symbols', 0.088), ('pf', 0.087), ('ten', 0.084), ('ilv', 0.083), ('ty', 0.079), ('knowledge', 0.079), ('pj', 0.076), ('pl', 0.076), ('xn', 0.074), ('law', 0.074), ('calgary', 0.072), ('distributions', 0.071), ('prior', 0.071), ('words', 0.069), ('experts', 0.063), ('bayesian', 0.063), ('bytes', 0.062), ('div', 0.062), ('sparse', 0.061), ('language', 0.058), ('laplace', 0.058), ('pv', 0.057), ('uncertainty', 0.055), ('specifying', 0.052), ('dirichlet', 0.052), ('pr', 0.051), ('dictionary', 0.05), ('bib', 0.048), ('formatted', 0.048), ('geo', 0.048), ('misc', 0.048), ('sclcrypt', 0.048), ('sclmed', 0.048), ('sclspace', 0.048), ('solicited', 0.048), ('vid', 0.048), ('unlabeled', 0.046), ('weak', 0.044), ('upon', 0.044), ('character', 0.044), ('english', 0.044), ('restricted', 0.043), ('testing', 0.043), ('lid', 0.042), ('trans', 0.042), ('oid', 0.042), ('actuaries', 0.042), ('nh', 0.042), ('pic', 0.042), ('defaults', 0.042), ('word', 0.041), ('classification', 0.041), ('domains', 0.041), ('facilitate', 0.041), ('ka', 0.038), ('outperformed', 0.038), ('py', 0.038), ('corpus', 0.037), ('parameter', 0.037), ('novel', 0.037), ('ni', 0.036), ('oc', 0.035), ('laws', 0.035), ('candidate', 0.034), ('estimating', 0.033), ('hypotheses', 0.033), ('outlined', 0.033), ('unseen', 0.033), ('contributed', 0.033), ('predictions', 0.03), ('approaches', 0.03), ('distributional', 0.029), ('estimated', 0.029), ('otherwise', 0.029), ('kind', 0.028), ('probability', 0.028), ('encountered', 0.028), ('lines', 0.028), ('classified', 0.027), ('limited', 0.027), ('illustrate', 0.027), ('containing', 0.026), ('domain', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

Author: Thomas L. Griffiths, Joshua B. Tenenbaum

Abstract: Estimating the parameters of sparse multinomial distributions is an important component of many statistical learning tasks. Recent approaches have used uncertainty over the vocabulary of symbols in a multinomial distribution as a means of accounting for sparsity. We present a Bayesian approach that allows weak prior knowledge, in the form of a small set of approximate candidate vocabularies, to be used to dramatically improve the resulting estimates. We demonstrate these improvements in applications to text compression and estimating distributions over words in newsgroup data. 1

2 0.12771149 107 nips-2001-Latent Dirichlet Allocation

Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan

Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model , also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification. 1

3 0.082752548 68 nips-2001-Entropy and Inference, Revisited

Author: Ilya Nemenman, F. Shafee, William Bialek

Abstract: We study properties of popular near–uniform (Dirichlet) priors for learning undersampled probability distributions on discrete nonmetric spaces and show that they lead to disastrous results. However, an Occam–style phase space argument expands the priors into their infinite mixture and resolves most of the observed problems. This leads to a surprisingly good estimator of entropies of discrete distributions. Learning a probability distribution from examples is one of the basic problems in data analysis. Common practical approaches introduce a family of parametric models, leading to questions about model selection. In Bayesian inference, computing the total probability of the data arising from a model involves an integration over parameter space, and the resulting “phase space volume” automatically discriminates against models with larger numbers of parameters—hence the description of these volume terms as Occam factors [1, 2]. As we move from finite parameterizations to models that are described by smooth functions, the integrals over parameter space become functional integrals and methods from quantum field theory allow us to do these integrals asymptotically; again the volume in model space consistent with the data is larger for models that are smoother and hence less complex [3]. Further, at least under some conditions the relevant degree of smoothness can be determined self–consistently from the data, so that we approach something like a model independent method for learning a distribution [4]. The results emphasizing the importance of phase space factors in learning prompt us to look back at a seemingly much simpler problem, namely learning a distribution on a discrete, nonmetric space. Here the probability distribution is just a list of numbers {q i }, i = 1, 2, · · · , K, where K is the number of bins or possibilities. We do not assume any metric on the space, so that a priori there is no reason to believe that any q i and qj should be similar. The task is to learn this distribution from a set of examples, which we can describe as the number of times ni each possibility is observed in a set of N = K ni i=1 samples. This problem arises in the context of language, where the index i might label words or phrases, so that there is no natural way to place a metric on the space, nor is it even clear that our intuitions about similarity are consistent with the constraints of a metric space. Similarly, in bioinformatics the index i might label n–mers of the the DNA or amino acid sequence, and although most work in the field is based on metrics for sequence comparison one might like an alternative approach that does not rest on such assumptions. In the analysis of neural responses, once we fix our time resolution the response becomes a set of discrete “words,” and estimates of the information content in the response are de- termined by the probability distribution on this discrete space. What all of these examples have in common is that we often need to draw some conclusions with data sets that are not in the asymptotic limit N K. Thus, while we might use a large corpus to sample the distribution of words in English by brute force (reaching N K with K the size of the vocabulary), we can hardly do the same for three or four word phrases. In models described by continuous functions, the infinite number of “possibilities” can never be overwhelmed by examples; one is saved by the notion of smoothness. Is there some nonmetric analog of this notion that we can apply in the discrete case? 
Our intuition is that information theoretic quantities may play this role. If we have a joint distribution of two variables, the analog of a smooth distribution would be one which does not have too much mutual information between these variables. Even more simply, we might say that smooth distributions have large entropy. While the idea of “maximum entropy inference” is common [5], the interplay between constraints on the entropy and the volume in the space of models seems not to have been considered. As we shall explain, phase space factors alone imply that seemingly sensible, more or less uniform priors on the space of discrete probability distributions correspond to disastrously singular prior hypotheses about the entropy of the underlying distribution. We argue that reliable inference outside the asymptotic regime N K requires a more uniform prior on the entropy, and we offer one way of doing this. While many distributions are consistent with the data when N ≤ K, we provide empirical evidence that this flattening of the entropic prior allows us to make surprisingly reliable statements about the entropy itself in this regime. At the risk of being pedantic, we state very explicitly what we mean by uniform or nearly uniform priors on the space of distributions. The natural “uniform” prior is given by Pu ({qi }) = 1 δ Zu K 1− K qi i=1 , Zu = A dq1 dq2 · · · dqK δ 1− qi (1) i=1 where the delta function imposes the normalization, Zu is the total volume in the space of models, and the integration domain A is such that each qi varies in the range [0, 1]. Note that, because of the normalization constraint, an individual q i chosen from this distribution in fact is not uniformly distributed—this is also an example of phase space effects, since in choosing one qi we constrain all the other {qj=i }. What we mean by uniformity is that all distributions that obey the normalization constraint are equally likely a priori. Inference with this uniform prior is straightforward. If our examples come independently from {qi }, then we calculate the probability of the model {qi } with the usual Bayes rule: 1 K P ({qi }|{ni }) = P ({ni }|{qi })Pu ({qi }) , P ({ni }|{qi }) = (qi )ni . Pu ({ni }) i=1 (2) If we want the best estimate of the probability qi in the least squares sense, then we should compute the conditional mean, and this can be done exactly, so that [6, 7] qi = ni + 1 . N +K (3) Thus we can think of inference with this uniform prior as setting probabilities equal to the observed frequencies, but with an “extra count” in every bin. This sensible procedure was first introduced by Laplace [8]. It has the desirable property that events which have not been observed are not automatically assigned probability zero. 1 If the data are unordered, extra combinatorial factors have to be included in P ({ni }|{qi }). However, these cancel immediately in later expressions. A natural generalization of these ideas is to consider priors that have a power–law dependence on the probabilities, the so called Dirichlet family of priors: K Pβ ({qi }) = 1 δ 1− qi Z(β) i=1 K β−1 qi , (4) i=1 It is interesting to see what typical distributions from these priors look like. Even though different qi ’s are not independent random variables due to the normalizing δ–function, generation of random distributions is still easy: one can show that if q i ’s are generated successively (starting from i = 1 and proceeding up to i = K) from the Beta–distribution j

4 0.073450893 171 nips-2001-Spectral Relaxation for K-means Clustering

Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon

Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1

5 0.068426199 85 nips-2001-Grammar Transfer in a Second Order Recurrent Neural Network

Author: Michiro Negishi, Stephen J. Hanson

Abstract: It has been known that people, after being exposed to sentences generated by an artificial grammar, acquire implicit grammatical knowledge and are able to transfer the knowledge to inputs that are generated by a modified grammar. We show that a second order recurrent neural network is able to transfer grammatical knowledge from one language (generated by a Finite State Machine) to another language which differ both in vocabularies and syntax. Representation of the grammatical knowledge in the network is analyzed using linear discriminant analysis. 1

6 0.066753112 43 nips-2001-Bayesian time series classification

7 0.06553109 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

8 0.063430078 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

9 0.056721389 35 nips-2001-Analysis of Sparse Bayesian Learning

10 0.052941907 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data

11 0.051128369 86 nips-2001-Grammatical Bigrams

12 0.050987739 190 nips-2001-Thin Junction Trees

13 0.049934179 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation

14 0.049634121 151 nips-2001-Probabilistic principles in unsupervised learning of visual structure: human data and a model

15 0.048630066 143 nips-2001-PAC Generalization Bounds for Co-training

16 0.047907751 58 nips-2001-Covariance Kernels from Bayesian Generative Models

17 0.04790296 144 nips-2001-Partially labeled classification with Markov random walks

18 0.047621313 183 nips-2001-The Infinite Hidden Markov Model

19 0.047502387 31 nips-2001-Algorithmic Luckiness

20 0.045683265 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.142), (1, 0.003), (2, -0.013), (3, -0.05), (4, -0.065), (5, -0.105), (6, -0.022), (7, 0.024), (8, -0.098), (9, -0.032), (10, 0.071), (11, -0.008), (12, -0.085), (13, -0.1), (14, -0.01), (15, -0.072), (16, 0.053), (17, -0.042), (18, -0.005), (19, -0.009), (20, -0.052), (21, 0.079), (22, -0.035), (23, -0.005), (24, 0.093), (25, -0.066), (26, -0.105), (27, 0.012), (28, 0.131), (29, 0.06), (30, 0.02), (31, -0.009), (32, -0.015), (33, -0.047), (34, -0.022), (35, 0.079), (36, -0.03), (37, -0.003), (38, 0.209), (39, 0.026), (40, 0.128), (41, -0.079), (42, -0.074), (43, 0.172), (44, -0.016), (45, 0.011), (46, -0.161), (47, -0.109), (48, -0.03), (49, 0.098)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95355213 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

Author: Thomas L. Griffiths, Joshua B. Tenenbaum

Abstract: Estimating the parameters of sparse multinomial distributions is an important component of many statistical learning tasks. Recent approaches have used uncertainty over the vocabulary of symbols in a multinomial distribution as a means of accounting for sparsity. We present a Bayesian approach that allows weak prior knowledge, in the form of a small set of approximate candidate vocabularies, to be used to dramatically improve the resulting estimates. We demonstrate these improvements in applications to text compression and estimating distributions over words in newsgroup data. 1

2 0.70047593 107 nips-2001-Latent Dirichlet Allocation

Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan

Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model , also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification. 1

3 0.47969863 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

Author: Jeff Bilmes, Gang Ji, Marina Meila

Abstract: In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.

4 0.45094746 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation

Author: Jorg Ontrup, Helge Ritter

Abstract: We introduce a new type of Self-Organizing Map (SOM) to navigate in the Semantic Space of large text collections. We propose a “hyperbolic SOM” (HSOM) based on a regular tesselation of the hyperbolic plane, which is a non-euclidean space characterized by constant negative gaussian curvature. The exponentially increasing size of a neighborhood around a point in hyperbolic space provides more freedom to map the complex information space arising from language into spatial relations. We describe experiments, showing that the HSOM can successfully be applied to text categorization tasks and yields results comparable to other state-of-the-art methods.

5 0.4283821 68 nips-2001-Entropy and Inference, Revisited

Author: Ilya Nemenman, F. Shafee, William Bialek

Abstract: We study properties of popular near–uniform (Dirichlet) priors for learning undersampled probability distributions on discrete nonmetric spaces and show that they lead to disastrous results. However, an Occam–style phase space argument expands the priors into their infinite mixture and resolves most of the observed problems. This leads to a surprisingly good estimator of entropies of discrete distributions. Learning a probability distribution from examples is one of the basic problems in data analysis. Common practical approaches introduce a family of parametric models, leading to questions about model selection. In Bayesian inference, computing the total probability of the data arising from a model involves an integration over parameter space, and the resulting “phase space volume” automatically discriminates against models with larger numbers of parameters—hence the description of these volume terms as Occam factors [1, 2]. As we move from finite parameterizations to models that are described by smooth functions, the integrals over parameter space become functional integrals and methods from quantum field theory allow us to do these integrals asymptotically; again the volume in model space consistent with the data is larger for models that are smoother and hence less complex [3]. Further, at least under some conditions the relevant degree of smoothness can be determined self–consistently from the data, so that we approach something like a model independent method for learning a distribution [4]. The results emphasizing the importance of phase space factors in learning prompt us to look back at a seemingly much simpler problem, namely learning a distribution on a discrete, nonmetric space. Here the probability distribution is just a list of numbers {q i }, i = 1, 2, · · · , K, where K is the number of bins or possibilities. We do not assume any metric on the space, so that a priori there is no reason to believe that any q i and qj should be similar. The task is to learn this distribution from a set of examples, which we can describe as the number of times ni each possibility is observed in a set of N = K ni i=1 samples. This problem arises in the context of language, where the index i might label words or phrases, so that there is no natural way to place a metric on the space, nor is it even clear that our intuitions about similarity are consistent with the constraints of a metric space. Similarly, in bioinformatics the index i might label n–mers of the the DNA or amino acid sequence, and although most work in the field is based on metrics for sequence comparison one might like an alternative approach that does not rest on such assumptions. In the analysis of neural responses, once we fix our time resolution the response becomes a set of discrete “words,” and estimates of the information content in the response are de- termined by the probability distribution on this discrete space. What all of these examples have in common is that we often need to draw some conclusions with data sets that are not in the asymptotic limit N K. Thus, while we might use a large corpus to sample the distribution of words in English by brute force (reaching N K with K the size of the vocabulary), we can hardly do the same for three or four word phrases. In models described by continuous functions, the infinite number of “possibilities” can never be overwhelmed by examples; one is saved by the notion of smoothness. Is there some nonmetric analog of this notion that we can apply in the discrete case? 
Our intuition is that information theoretic quantities may play this role. If we have a joint distribution of two variables, the analog of a smooth distribution would be one which does not have too much mutual information between these variables. Even more simply, we might say that smooth distributions have large entropy. While the idea of “maximum entropy inference” is common [5], the interplay between constraints on the entropy and the volume in the space of models seems not to have been considered. As we shall explain, phase space factors alone imply that seemingly sensible, more or less uniform priors on the space of discrete probability distributions correspond to disastrously singular prior hypotheses about the entropy of the underlying distribution. We argue that reliable inference outside the asymptotic regime N K requires a more uniform prior on the entropy, and we offer one way of doing this. While many distributions are consistent with the data when N ≤ K, we provide empirical evidence that this flattening of the entropic prior allows us to make surprisingly reliable statements about the entropy itself in this regime. At the risk of being pedantic, we state very explicitly what we mean by uniform or nearly uniform priors on the space of distributions. The natural “uniform” prior is given by Pu ({qi }) = 1 δ Zu K 1− K qi i=1 , Zu = A dq1 dq2 · · · dqK δ 1− qi (1) i=1 where the delta function imposes the normalization, Zu is the total volume in the space of models, and the integration domain A is such that each qi varies in the range [0, 1]. Note that, because of the normalization constraint, an individual q i chosen from this distribution in fact is not uniformly distributed—this is also an example of phase space effects, since in choosing one qi we constrain all the other {qj=i }. What we mean by uniformity is that all distributions that obey the normalization constraint are equally likely a priori. Inference with this uniform prior is straightforward. If our examples come independently from {qi }, then we calculate the probability of the model {qi } with the usual Bayes rule: 1 K P ({qi }|{ni }) = P ({ni }|{qi })Pu ({qi }) , P ({ni }|{qi }) = (qi )ni . Pu ({ni }) i=1 (2) If we want the best estimate of the probability qi in the least squares sense, then we should compute the conditional mean, and this can be done exactly, so that [6, 7] qi = ni + 1 . N +K (3) Thus we can think of inference with this uniform prior as setting probabilities equal to the observed frequencies, but with an “extra count” in every bin. This sensible procedure was first introduced by Laplace [8]. It has the desirable property that events which have not been observed are not automatically assigned probability zero. 1 If the data are unordered, extra combinatorial factors have to be included in P ({ni }|{qi }). However, these cancel immediately in later expressions. A natural generalization of these ideas is to consider priors that have a power–law dependence on the probabilities, the so called Dirichlet family of priors: K Pβ ({qi }) = 1 δ 1− qi Z(β) i=1 K β−1 qi , (4) i=1 It is interesting to see what typical distributions from these priors look like. Even though different qi ’s are not independent random variables due to the normalizing δ–function, generation of random distributions is still easy: one can show that if q i ’s are generated successively (starting from i = 1 and proceeding up to i = K) from the Beta–distribution j

6 0.42114165 86 nips-2001-Grammatical Bigrams

7 0.41338688 30 nips-2001-Agglomerative Multivariate Information Bottleneck

8 0.39538541 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

9 0.36740568 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images

10 0.36196429 43 nips-2001-Bayesian time series classification

11 0.33052427 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

12 0.32800058 61 nips-2001-Distribution of Mutual Information

13 0.32149443 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data

14 0.31239456 35 nips-2001-Analysis of Sparse Bayesian Learning

15 0.30896708 85 nips-2001-Grammar Transfer in a Second Order Recurrent Neural Network

16 0.30178693 120 nips-2001-Minimax Probability Machine

17 0.29085097 144 nips-2001-Partially labeled classification with Markov random walks

18 0.2863645 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning

19 0.28254417 183 nips-2001-The Infinite Hidden Markov Model

20 0.28206784 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(9, 0.286), (14, 0.017), (17, 0.043), (19, 0.034), (27, 0.097), (30, 0.069), (36, 0.019), (38, 0.016), (59, 0.043), (72, 0.065), (79, 0.067), (83, 0.011), (91, 0.153)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81933743 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

Author: Thomas L. Griffiths, Joshua B. Tenenbaum

Abstract: Estimating the parameters of sparse multinomial distributions is an important component of many statistical learning tasks. Recent approaches have used uncertainty over the vocabulary of symbols in a multinomial distribution as a means of accounting for sparsity. We present a Bayesian approach that allows weak prior knowledge, in the form of a small set of approximate candidate vocabularies, to be used to dramatically improve the resulting estimates. We demonstrate these improvements in applications to text compression and estimating distributions over words in newsgroup data. 1

2 0.59786642 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

Author: Jens Kohlmorgen, Steven Lemm

Abstract: We propose a novel method for the analysis of sequential data that exhibits an inherent mode switching. In particular, the data might be a non-stationary time series from a dynamical system that switches between multiple operating modes. Unlike other approaches, our method processes the data incrementally and without any training of internal parameters. We use an HMM with a dynamically changing number of states and an on-line variant of the Viterbi algorithm that performs an unsupervised segmentation and classification of the data on-the-fly, i.e. the method is able to process incoming data in real-time. The main idea of the approach is to track and segment changes of the probability density of the data in a sliding window on the incoming data stream. The usefulness of the algorithm is demonstrated by an application to a switching dynamical system. 1

3 0.59729648 183 nips-2001-The Infinite Hidden Markov Model

Author: Matthew J. Beal, Zoubin Ghahramani, Carl E. Rasmussen

Abstract: We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite— consider, for example, symbols being possible words appearing in English text.

4 0.59547883 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

5 0.59473181 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

Author: Hilbert J. Kappen, Wim Wiegerinck

Abstract: The Cluster Variation method is a class of approximation methods containing the Bethe and Kikuchi approximations as special cases. We derive two novel iteration schemes for the Cluster Variation Method. One is a fixed point iteration scheme which gives a significant improvement over loopy BP, mean field and TAP methods on directed graphical models. The other is a gradient based method, that is guaranteed to converge and is shown to give useful results on random graphs with mild frustration. We conclude that the methods are of significant practical value for large inference problems. 1

6 0.59119892 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

7 0.59114635 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

8 0.58990479 89 nips-2001-Grouping with Bias

9 0.58869684 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

10 0.58865905 36 nips-2001-Approximate Dynamic Programming via Linear Programming

11 0.58818233 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

12 0.58712196 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

13 0.58655709 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

14 0.58518922 169 nips-2001-Small-World Phenomena and the Dynamics of Information

15 0.58471513 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

16 0.58468115 121 nips-2001-Model-Free Least-Squares Policy Iteration

17 0.58406544 3 nips-2001-ACh, Uncertainty, and Cortical Inference

18 0.58343279 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

19 0.5817672 102 nips-2001-KLD-Sampling: Adaptive Particle Filters

20 0.58142245 59 nips-2001-Direct value-approximation for factored MDPs